概述
Pandas的数据清洗-apply函数
之前的replace、dropna、fillna函数要么针对NaN的某行或某列或某个,这些函数的作用有限,本章介绍的apply等函数可以针对整个Series或DataFrame的各个值进行相应的数据的处理。
- 对Series应用apply函数。
import pandas as pd
import numpy as np
s = pd.Series(np.arange(2,6))
print s
print s.apply(lambda x : 2 * x)
程序的执行结果;
0 2
1 3
2 4
3 5
dtype: int64
0 4
1 6
2 8
3 10
dtype: int64
从程序的执行结果可以看出,应用了apply函数对Series的每个元素值都乘上了2。
- 对DataFrame应用apply函数。
import pandas as pd
import numpy as np
df = pd.DataFrame(val, index = idx, columns = col)
print df
print df.apply(lambda col : col.sum(), axis = 0)
print df.apply(lambda row : row.sum(), axis = 1)
df["hello x the"] = df.apply(lambda row : row.hello * row.the, axis = 1)
print df
- apply函数一般针对整行或者整列而applymap函数会针对单独的元素值来处理。
import pandas as pd
import numpy as np
df = pd.DataFrame(val, index = idx, columns = col)
print df
print df.applymap(lambda x : x + 3)
程序执行结果:
hello the cruel world
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
hello the cruel world
a 3 4 5 6
b 7 8 9 10
c 11 12 13 14
d 15 16 17 18
Pandas的数据拼接-concat函数
在pandas里提供concat函数可以将形参给出的列表里的各个pandas的数据拼接成一个大的数据。
- 两个Series的拼接
import pandas as pd
import numpy as np
s1 = pd.Series(np.arange(2,6))
s2 = pd.Series(np.arange(8,12))
ss = pd.concat([s1, s2])
print ss
程序的执行结果:
0 2
1 3
2 4
3 5
0 8
1 9
2 10
3 11
dtype: int64
- 两个DataFrame的拼接 1). label和columns均相同的情况下:
import pandas as pd
import numpy as np
col = "hello the cruel world".split()
idx = ["a", "b", "c", "d"]
val1 = np.arange(16).reshape(4, 4)
val2 = np.arange(20, 36).reshape(4, 4)
df1 = pd.DataFrame(val1, index = idx, columns = col)
print df1
df2 = pd.DataFrame(val2, index = idx, columns = col)
print df2
df12 = pd.concat([df1, df2])
print df12
程序的执行结果:
hello the cruel world # prinf df1
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
hello the cruel world # print df2
a 20 21 22 23
b 24 25 26 27
c 28 29 30 31
d 32 33 34 35
hello the cruel world # print df12
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
a 20 21 22 23
b 24 25 26 27
c 28 29 30 31
d 32 33 34 35
2). 对于DataFrame的拼接比较复杂,原因是label和columns有可能不是一一对应的,这个时候两DataFrame未匹配上的label或columns下的值为NaN。
import pandas as pd
import numpy as np
col1 = "hello the cruel world".split()
col2 = "hello the nice world".split()
idx1 = ["a", "b", "c", "d"]
idx2 = ["a", "b", "d", "e"]
val1 = np.arange(16).reshape(4, 4)
val2 = np.arange(20, 36).reshape(4, 4)
df1 = pd.DataFrame(val1, index = idx1, columns = col1)
print df1
df2 = pd.DataFrame(val2, index = idx2, columns = col2)
print df2
df12 = pd.concat([df1, df2])
print df12
程序执行结果:
hello the cruel world # print df1
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
hello the nice world # print df2
a 20 21 22 23
b 24 25 26 27
d 28 29 30 31
e 32 33 34 35
cruel hello nice the world # print df12
a 2 0 NaN 1 3
b 6 4 NaN 5 7
c 10 8 NaN 9 11
d 14 12 NaN 13 15
a NaN 20 22 21 23
b NaN 24 26 25 27
d NaN 28 30 29 31
e NaN 32 34 33 35
- 指定拼接的轴,默认是列方向的拼接数据,可以指定concat 的形参axis为行上的拼接数据。
import pandas as pd
import numpy as np
col1 = "hello the cruel world".split()
col2 = "hello the nice world".split()
idx1 = ["a", "b", "c", "d"]
idx2 = ["a", "b", "d", "e"]
val1 = np.arange(16).reshape(4, 4)
val2 = np.arange(20, 36).reshape(4, 4)
df1 = pd.DataFrame(val1, index = idx1, columns = col1)
print df1
df2 = pd.DataFrame(val2, index = idx2, columns = col2)
print df2
df12 = pd.concat([df1, df2], axis = 1)
print df12
程序的执行结果:
hello the cruel world # print df1
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
hello the nice world# print df2
a 20 21 22 23
b 24 25 26 27
d 28 29 30 31
e 32 33 34 35
hello the cruel world hello the nice world # print df12
a 0 1 2 3 20 21 22 23
b 4 5 6 7 24 25 26 27
c 8 9 10 11 NaN NaN NaN NaN
d 12 13 14 15 28 29 30 31
e NaN NaN NaN NaN 32 33 34 35
Pandas的数据拼接-内外连接
Pandas在做数据拼接的时候提供类似于数据库的内连接、外连接的操作。默认是outer join即外连接,可以使用参数指定连接的类型为内连接inner join。
import pandas as pd
import numpy as np
col1 = "hello the cruel world".split()
col2 = "hello the nice world".split()
idx1 = ["a", "b", "c", "d"]
idx2 = ["a", "b", "d", "e"]
val1 = np.arange(16).reshape(4, 4)
val2 = np.arange(20, 36).reshape(4, 4)
df1 = pd.DataFrame(val1, index = idx1, columns = col1)
print df1
df2 = pd.DataFrame(val2, index = idx2, columns = col2)
print df2
print "***outer join", "*" * 20
print pd.concat([df1, df2], join = "outer")
print "***inner join", "*" * 20
print pd.concat([df1, df2], join = "inner")
程序的执行结果:
hello the cruel world
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
hello the nice world
a 20 21 22 23
b 24 25 26 27
d 28 29 30 31
e 32 33 34 35
***outer join ********************
cruel hello nice the world
a 2 0 NaN 1 3
b 6 4 NaN 5 7
c 10 8 NaN 9 11
d 14 12 NaN 13 15
a NaN 20 22 21 23
b NaN 24 26 25 27
d NaN 28 30 29 31
e NaN 32 34 33 35
***inner join ********************
hello the world
a 0 1 3
b 4 5 7
c 8 9 11
d 12 13 15
a 20 21 23
b 24 25 27
d 28 29 31
e 32 33 35
inner join类型的拼接实际是求两个集的交集。
Pandas的数据拼接-merge函数
concat函数可以实现内外连接,而pandas的merge函数可以真正实现数据库的内外连接,且外连接还可以有左右连接的特性。
- merge函数默认拼接数据是inner join即内连接。下面以学生选课为例,设计两个DataFrame通过merge函数来拼接合并。
import pandas as pd
import numpy as np
col1 = "class_name class_id class_lecturer".split()
col2 = "class_id stu_id".split()
val1 = [["IT", 100, "Wangli"],["CS", 101, "WangMa"],["CAD", 102, "Liping"]]
val2 = [[100, 20181115],[100, 20181116],[101, 20181117]]
course = pd.DataFrame(val1, columns = col1)
print "***course", "*" * 38
print course
choose = pd.DataFrame(val2, columns = col2)
print "***choose", "*" * 38
print choose
print "***course merge choose", "*" * 25
print course.merge(choose)
print "***choose merge course", "*" * 25
print choose.merge(course)
程序的执行结果:
***course **************************************
class_name class_id class_lecturer
0 IT 100 Wangli
1 CS 101 WangMa
2 CAD 102 Liping
***choose **************************************
class_id stu_id
0 100 20181115
1 100 20181116
2 101 20181117
***course merge choose *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 CS 101 WangMa 20181117
***choose merge course *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 101 20181117 CS WangMa
- merge的outer连接方式。结果是两个DataFrame均输出,未匹配上的用NaN填充。
import pandas as pd
import numpy as np
col1 = "class_name class_id class_lecturer".split()
col2 = "class_id stu_id".split()
val1 = [["IT", 100, "Wangli"],["CS", 101, "WangMa"],["CAD", 102, "Liping"], ["ME", 103, "Wufang"],["IT", 104, "Xiaomin"]]
val2 = [[100, 20181115],[100, 20181116],[101, 20181117]]
course = pd.DataFrame(val1, columns = col1)
print "***course", "*" * 38
print course
choose = pd.DataFrame(val2, columns = col2)
print "***choose", "*" * 38
print choose
print "***course merge choose in inner", "*" * 25
print course.merge(choose, how = "inner")
print "***course merge choose in outer", "*" * 25
print course.merge(choose, how = "outer")
print "***choose merge course in inner", "*" * 25
print choose.merge(course, how = "inner")
print "***choose merge course in outer", "*" * 25
print choose.merge(course, how = "outer")
程序执行结果:
***course **************************************
class_name class_id class_lecturer
0 IT 100 Wangli
1 CS 101 WangMa
2 CAD 102 Liping
3 ME 103 Wufang
4 IT 104 Xiaomin
***choose **************************************
class_id stu_id
0 100 20181115
1 100 20181116
2 101 20181117
***course merge choose in inner *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 CS 101 WangMa 20181117
***course merge choose in outer *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 CS 101 WangMa 20181117
3 CAD 102 Liping NaN
4 ME 103 Wufang NaN
5 IT 104 Xiaomin NaN
***choose merge course in inner *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 101 20181117 CS WangMa
***choose merge course in outer *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 101 20181117 CS WangMa
3 102 NaN CAD Liping
4 103 NaN ME Wufang
5 104 NaN IT Xiaomin
- merge的左右连接,这里调用merge的Dataframe是“左表”而连接即作为形参的是DataFrame是“右表”。左连接左表全输出而右表能匹配的输出,匹配不上的填充NaN,同理右连接时“右表”全输出,而左表匹配上输出,匹配不上填充NaN。
import pandas as pd
import numpy as np
col1 = "class_name class_id class_lecturer".split()
col2 = "class_id stu_id".split()
val1 = [["IT", 100, "Wangli"],["CS", 101, "WangMa"],["CAD", 102, "Liping"], ["ME", 103, "Wufang"],["IT", 104, "Xiaomin"]]
val2 = [[100, 20181115],[100, 20181116],[101, 20181117],[100, 20181118],[101, 20181119], [200, 20181120]]
course = pd.DataFrame(val1, columns = col1)
print "***course", "*" * 38
print course
choose = pd.DataFrame(val2, columns = col2)
print "***choose", "*" * 38
print choose
print "***course merge choose in left", "*" * 25
print course.merge(choose, how = "left")
print "***course merge choose in right", "*" * 25
print course.merge(choose, how = "right")
print "***choose merge course in left", "*" * 25
print choose.merge(course, how = "left")
print "***choose merge course in right", "*" * 25
print choose.merge(course, how = "right")
程序执行结果:
***course **************************************
class_name class_id class_lecturer
0 IT 100 Wangli
1 CS 101 WangMa
2 CAD 102 Liping
3 ME 103 Wufang
4 IT 104 Xiaomin
***choose **************************************
class_id stu_id
0 100 20181115
1 100 20181116
2 101 20181117
3 100 20181118
4 101 20181119
5 200 20181120
***course merge choose in left *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 IT 100 Wangli 20181118
3 CS 101 WangMa 20181117
4 CS 101 WangMa 20181119
5 CAD 102 Liping NaN
6 ME 103 Wufang NaN
7 IT 104 Xiaomin NaN
***course merge choose in right *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 IT 100 Wangli 20181118
3 CS 101 WangMa 20181117
4 CS 101 WangMa 20181119
5 NaN 200 NaN 20181120
***choose merge course in left *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 101 20181117 CS WangMa
3 100 20181118 IT Wangli
4 101 20181119 CS WangMa
5 200 20181120 NaN NaN
***choose merge course in right *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 100 20181118 IT Wangli
3 101 20181117 CS WangMa
4 101 20181119 CS WangMa
5 102 NaN CAD Liping
6 103 NaN ME Wufang
7 104 NaN IT Xiaomin
请注意[200, 20181120]
这条选课数据,课程id为200在course里并不存在。而["CAD", 102, "Liping"], ["ME", 103, "Wufang"],["IT", 104, "Xiaomin"]
这三门课没有学生选。 由此可见,merge函数的left join、right join和数据库的表的left join、right join的概念完全匹配。
Pandas的数据重塑-pivot与pivot_table函数
pandas.pivot的重点在于reshape, 通俗理解就是合并同类项,所以在行与列的交叉点值的索引应该是唯一值,如果不是唯一值,则会报,即原始数据集中存在重复条目,此时pivot函数无法确定数据透视表中的数值即会报错ValueError: Index contains duplicate entries, cannot reshape
。尽管如此,pivot()方法可以对数据进行行列互换,或者进行透视转换,在有些场合下分析数据时非常方便。
- pivot函数,DataFrame对象的pivot()方法可以接收三个参数,分别是index、columns和values,其中index用来指定转换后DataFrame对象的纵向索引,columns用来指定转换后DataFrame对象的横向索引或者列名,values用来指定转换后DataFrame对象的值。
import pandas as pd
idx = [101,101,101,102,102,102,103,103,103]
idx += [104, 104, 105, 106]
name = ["apple","pearl","orange", "apple","orange","pearl","apple","pearl","orange"]
name += ["apple","pearl", "apple", "orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
price += [5.30, 4.00, 5.25, 7.50]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print df0
print ""
df1 = df0.pivot(index = "supplier", columns = "fruit", values = "price")
print df1
程序执行结果:
fruit price supplier
0 apple 5.20 101
1 pearl 3.50 101
2 orange 7.30 101
3 apple 5.00 102
4 orange 7.50 102
5 pearl 7.30 102
6 apple 5.20 103
7 pearl 3.70 103
8 orange 7.30 103
9 apple 5.30 104
10 pearl 4.00 104
11 apple 5.25 105
12 orange 7.50 106
fruit apple orange pearl
supplier
101 5.20 7.3 3.5
102 5.00 7.5 7.3
103 5.20 7.3 3.7
104 5.30 NaN 4.0
105 5.25 NaN NaN
106 NaN 7.5 NaN
- pivot_table函数,在调用pivot方法前需要保证数据集中不存在重复条目,否则我们需要调用另外一个方法:pivot_table,函数会对重复条目进行column to aggregate列聚合即求均值。
import pandas as pd
idx = [101,101,101,102,103,103,103,103,103]
name = ["apple","pearl","orange", "apple","pearl","orange","apple","pearl","orange"]
price = [1.0,2.0,3.0,4.00,5.0,6.0,7.0,8.0,9.0]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print df0
#df1 = df0.pivot(index = "supplier", columns = "fruit", values = "price")
#print df1
print ""
df2 = df0.pivot_table(index = "supplier", columns = "fruit", values = "price")
print df2
程序执行结果:
fruit price supplier
0 apple 1 101
1 pearl 2 101
2 orange 3 101
3 apple 4 102
4 pearl 5 103
5 orange 6 103
6 apple 7 103
7 pearl 8 103
8 orange 9 103
fruit apple orange pearl
supplier
101 1 3.0 2.0
102 4 NaN NaN
103 7 7.5 6.5
由于103 orange
和103 pearl
存在2项,所以pivot函数不能用,用pivot_table函数会对103 orange
和103 pearl
聚合处理用均值作为其值。例如103 orange
的值应为7.5 = (9.0 + 6.0 ) / 2
。
最后
以上就是等待鸡为你收集整理的pandas中的函数-apply,concat,merge,pivot与pivot_table 解析 (24)Pandas的数据清洗-apply函数Pandas的数据拼接-concat函数Pandas的数据拼接-内外连接Pandas的数据拼接-merge函数Pandas的数据重塑-pivot与pivot_table函数的全部内容,希望文章能够帮你解决pandas中的函数-apply,concat,merge,pivot与pivot_table 解析 (24)Pandas的数据清洗-apply函数Pandas的数据拼接-concat函数Pandas的数据拼接-内外连接Pandas的数据拼接-merge函数Pandas的数据重塑-pivot与pivot_table函数所遇到的程序开发问题。
如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。
发表评论 取消回复