I. Background
pandas.isnull(series) returns a boolean Series that marks which entries are missing
It is equivalent to series.isnull()
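As a minimal sketch of the two equivalent calls, assuming a small hypothetical Series named ages (it is not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; NaN stands in for a missing age.
ages = pd.Series([22.0, np.nan, 35.0, np.nan])

mask_func = pd.isnull(ages)    # module-level function
mask_method = ages.isnull()    # Series method, same result

print(mask_func.tolist())
print(mask_func.equals(mask_method))
```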
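Both DataFrame and Series objects can be indexed with a boolean Series
Comparing a boolean Series with True/False yields another boolean Series, which can be used to filter data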
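The filtering described above might look like the following sketch. The frame df and its column values are made up for illustration; only the indexing pattern comes from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame with one missing age.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, np.nan, 35.0],
})

age_is_null = df["Age"].isnull()

# Count the missing values by selecting them with the boolean Series.
missing_count = len(df["Age"][age_is_null])

# Keep only the non-missing entries; `~age_is_null` is an equivalent mask.
known_ages = df["Age"][age_is_null == False]
known_rows = df[age_is_null == False]

print(missing_count)
print(known_rows)
```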
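Any arithmetic involving NaN yields NaN, so when summing or averaging a column by hand (for example with Python's built-in sum()), filter out the missing values first
The pandas .mean() method skips missing values automatically before averaging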
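A small sketch of the difference, assuming a hypothetical three-element Series named ages:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
ages = pd.Series([20.0, np.nan, 40.0])

# Built-in sum() propagates NaN, so the hand-computed mean is NaN.
naive_mean = sum(ages) / len(ages)

# Filter the missing values first, then average by hand.
known = ages[ages.isnull() == False]
manual_mean = sum(known) / len(known)

# Series.mean() skips NaN by default (skipna=True).
auto_mean = ages.mean()

print(naive_mean, manual_mean, auto_mean)
```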
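Use DataFrame.pivot_table() to group and aggregate:
my_dataframe.pivot_table(index="column1", values="column2" / ["column2", "column3", ...], aggfunc=np.<aggregation function>)
Note: index names the column to group by; values names the column or columns to aggregate; aggfunc is the NumPy function applied to each group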
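The call pattern above can be sketched on a tiny made-up frame. The column names mirror the Titanic data, but the values are hypothetical. Recent pandas releases may emit a FutureWarning for NumPy callables; the string names "mean" and "sum" are accepted as well.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the real Titanic records.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3],
    "Fare": [80.0, 100.0, 20.0, 30.0, 8.0],
    "Survived": [1, 0, 1, 1, 0],
})

# Group by Pclass and average one column.
fares_by_class = df.pivot_table(index="Pclass", values="Fare", aggfunc=np.mean)

# Group by Pclass and sum several columns at once.
stats_by_class = df.pivot_table(
    index="Pclass", values=["Fare", "Survived"], aggfunc=np.sum
)

print(fares_by_class)
print(stats_by_class)
```

Both results are DataFrames indexed by the grouping column.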
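my_dataframe.dropna(axis=1) or my_dataframe.dropna(axis='columns') drops every column that contains a missing value
my_dataframe.dropna(axis=0, subset=['column1','column2',...]) drops every row that has a missing value in any of the listed columns
Note: both calls return a new DataFrame; the original DataFrame is left unchanged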
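A brief sketch of both dropna() forms, using a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: Age and Sex each have one missing value.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Sex": ["male", "female", None],
    "Fare": [7.25, 71.28, 8.05],
})

# Drop every column that contains at least one missing value.
no_null_columns = df.dropna(axis=1)

# Drop only the rows where Age is missing; other columns are not checked.
complete_age = df.dropna(axis=0, subset=["Age"])

print(no_null_columns.shape, complete_age.shape, df.shape)
```

The original df keeps its shape, since dropna() returns a new object by default.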
my_dataframe.loc[row_label, column_label] returns the value at that row and column
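For instance, on a hypothetical frame with the default integer index:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Pclass": [3, 1, 3]})

# A single row label and column label give one scalar value.
first_age = df.loc[0, "Age"]

# Label slices in .loc include both endpoints, so 0:1 selects two rows.
first_two = df.loc[0:1, ["Age", "Pclass"]]

print(first_age)
print(first_two)
```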
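my_dataframe.sort_values('column_name', inplace=False, ascending=True) # with these defaults, returns a new DataFrame sorted in ascending order
my_dataframe.reset_index(drop=True) # resets the index after sorting; drop=True discards the old index instead of keeping it as a column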
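Sorting and then resetting the index could be sketched as follows, again on made-up data:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, 38.0, 26.0]})

# Sort by Age, oldest first; the rows keep their original index labels.
by_age_desc = df.sort_values("Age", ascending=False)

# Renumber the rows 0..n-1 and discard the old labels.
reindexed = by_age_desc.reset_index(drop=True)

print(by_age_desc)
print(reindexed)
```

Without the reset, positional reads such as by_age_desc.loc[0] would still return the row originally labeled 0, not the first row of the sorted result.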
II. Code example (environment: Python 2.7)
import pandas as pd
import numpy as np

titanic_survival = pd.read_csv('titanic_train.csv')

# =============================================================================
# age = titanic_survival["Age"]
# # age_is_null = pd.isnull(age)
# age_is_null = age.isnull()
#
# # Count the missing values with the boolean Series
# age_null_true = age[age_is_null]
# print(len(age_null_true))
#
# # Filter out the missing values
# age_null_false = age[age_is_null == False]
# print(age_null_false)
# print(titanic_survival.loc[age_is_null == False])
# =============================================================================

# =============================================================================
# # Compute the mean age
# age_null = titanic_survival["Age"].isnull()
# age = titanic_survival["Age"][age_null == False]
# mean_age = sum(age) / len(age)
# print(mean_age)
#
# mean_age = titanic_survival["Age"].mean()
# print(mean_age)
# =============================================================================

# =============================================================================
# # Passenger classes
# passenger_classes = [1, 2, 3]
# # Average fare per class: group by passenger class, then average each group
# fares_by_class = {}
# for this_class in passenger_classes:
#     pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
#     pclass_fares = pclass_rows["Fare"]
#     fare_for_class = pclass_fares.mean()
#     fares_by_class[this_class] = fare_for_class
# print(fares_by_class)
#
# # Build the same result as a pivot table with DataFrame.pivot_table()
# fares_by_class = titanic_survival.pivot_table(index="Pclass", values="Fare", aggfunc=np.mean)
# # The result is a DataFrame
# print(fares_by_class)
#
# # Apply one aggregation to several columns
# port_stats = titanic_survival.pivot_table(index="Embarked", values=["Survived","Fare"], aggfunc=np.sum)
# print(port_stats)
# =============================================================================

# =============================================================================
# drop_na_columns = titanic_survival.dropna(axis=1)
# print(drop_na_columns.shape)
# print(titanic_survival.shape)
# # Returns a new DataFrame; the original DataFrame is unchanged
# new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age","Sex"])
# print(new_titanic_survival.shape)
# print(titanic_survival.shape)
# =============================================================================

# =============================================================================
# print(titanic_survival.loc[0,"Age"])
# print(titanic_survival.loc[100,"Pclass"])
# =============================================================================

# =============================================================================
# # sort_values() returns a new DataFrame by default; inplace=True sorts in place
# new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)
# print(new_titanic_survival.iloc[0:3])
# reindex_titanic = new_titanic_survival.reset_index(drop=True)
# print(reindex_titanic[0:3])
# =============================================================================

def hundredth_row(column):
    # Return the 100th item (position 99) of the given column
    hundredth_item = column.iloc[99]
    return hundredth_item

# Apply hundredth_row() to every column of the DataFrame.
# The result gets its own name so it does not shadow the function.
hundredth_items = titanic_survival.apply(hundredth_row)
print(hundredth_items)
CSV file download link (Baidu Netdisk): https://pan.baidu.com/s/1jAOOXCobDSeBZc3h3qGybQ