pandas (Part 2): Function Application and Mapping, Sorting and Ranking, Axis Indexes with Duplicate Values


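I'm 可靠哑铃, a blogger on 靠谱客. This post covers part 2 of pandas: function application and mapping, sorting and ranking, and axis indexes with duplicate values. I'm sharing it here in the hope that it serves as a useful reference.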

NumPy ufuncs (element-wise array functions) also work on pandas objects.

>>> import numpy as np
>>> from pandas import Series, DataFrame
>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> np.square(frame)  # square every element
   one  two  three  four
a    0    1      4     9
b   16   25     36    49
c   64   81    100   121
d  144  169    196   225
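The transcript above never shows how frame was created. The following is a minimal sketch that assumes it is a 4x4 integer DataFrame holding 0 through 15; the construction is inferred from the printed output and is not taken from the original post.

```python
import numpy as np
import pandas as pd

# Assumed construction: inferred from the printed values, not shown in the post.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# A NumPy ufunc is applied element by element and keeps the row/column labels.
squared = np.square(frame)
print(type(squared).__name__)
print(squared)
```

The result is still a DataFrame with the same index and columns, so it can be passed on to any other pandas operation.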

DataFrame's apply method applies a function to the one-dimensional arrays formed by each column (the default) or each row (axis=1).

>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> func = lambda x: x.max() - x.min()
>>> frame.apply(func)
one      12
two      12
three    12
four     12
dtype: int64
>>> frame.apply(func, axis=1)
a    3
b    3
c    3
d    3
dtype: int64
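The function passed to apply does not have to return a scalar. As a sketch (the helper name min_max and the construction of frame are assumptions made for illustration), returning a Series produces one output row per label:

```python
import numpy as np
import pandas as pd

# Assumed construction of `frame`, inferred from the printed output above.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

def min_max(x):
    # `x` is one column as a Series; the returned Series becomes one result column.
    return pd.Series([x.min(), x.max()], index=["min", "max"])

summary = frame.apply(min_max)                               # one result column per input column
col_range = frame.apply(lambda x: x.max() - x.min())         # reduces each column to a scalar
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1) # reduces each row to a scalar
print(summary)
```

Here summary has the rows min and max and keeps the original four columns.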

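DataFrame's applymap method applies a function to each individual element. (In pandas 2.1 and later this method is named DataFrame.map, and applymap is deprecated.)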

>>> f = lambda x: x + 1
>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> frame.applymap(f)
   one  two  three  four
a    1    2      3     4
b    5    6      7     8
c    9   10     11    12
d   13   14     15    16
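Element-wise application is most useful when the function cannot be written as a vectorized expression. The sketch below labels every cell as even or odd; the construction of frame is an assumption, and the hasattr check is only there so the same code runs on pandas releases before and after the applymap to DataFrame.map rename.

```python
import numpy as np
import pandas as pd

# Assumed construction of `frame`, inferred from the printed output above.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# DataFrame.map exists in pandas 2.1+; older releases only provide applymap.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap

# The function receives one scalar at a time and returns one scalar.
labels = elementwise(lambda v: "even" if v % 2 == 0 else "odd")
print(labels)
```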

Series has its own method for element-wise function application: map.

>>> frame['one']  # selecting a DataFrame column returns a Series
a     0
b     4
c     8
d    12
Name: one, dtype: int32
>>> frame['one'].map(f)
a     1
b     5
c     9
d    13
Name: one, dtype: int64
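Besides a function, Series.map also accepts a dict that is used as a lookup table. A small sketch, with a hypothetical Series standing in for frame['one']:

```python
import pandas as pd

# Hypothetical stand-in for frame['one'] from the transcript above.
s = pd.Series([0, 4, 8, 12], index=["a", "b", "c", "d"], name="one")

plus_one = s.map(lambda x: x + 1)        # function applied to each element
names = s.map({0: "zero", 4: "four"})    # dict lookup; unmatched values become NaN
print(plus_one)
print(names)
```

Values that have no key in the dict come back as NaN, which is worth checking for after a lookup-style map.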

Sorting and Ranking

Use sort_index to sort by row or column labels. It returns a new, sorted object and leaves the original unchanged.

>>> obj = Series(range(4), index=['d', 'b', 'a', 'c'])
>>> new_obj = obj.sort_index()
>>> new_obj
a    2
b    1
c    3
d    0
dtype: int64
>>> obj
d    0
b    1
a    2
c    3
dtype: int64

>>> new_obj = obj.sort_index(ascending=False)  # ascending by default; pass ascending=False for descending order
>>> new_obj
d    0
c    3
b    1
a    2
dtype: int64

A DataFrame can be sorted by the index on either axis.

>>> frame = DataFrame(np.random.randn(4, 4), columns=['c', 'a', 'd', 'b'], index=[3, 1, 4, 2])
>>> frame
          c         a         d         b
3  0.004950 -1.272352  1.050491  0.823530
1  1.198348  0.647114  0.154131 -0.636497
4 -0.358309  0.525307 -1.868459  0.867197
2 -0.021764  0.140501  1.459700 -0.090884
>>> frame.sort_index()
          c         a         d         b
1  1.198348  0.647114  0.154131 -0.636497
2 -0.021764  0.140501  1.459700 -0.090884
3  0.004950 -1.272352  1.050491  0.823530
4 -0.358309  0.525307 -1.868459  0.867197
>>> frame.sort_index(axis=1)
          a         b         c         d
3 -1.272352  0.823530  0.004950  1.050491
1  0.647114 -0.636497  1.198348  0.154131
4  0.525307  0.867197 -0.358309 -1.868459
2  0.140501 -0.090884 -0.021764  1.459700
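Because the transcript uses random numbers, its output cannot be reproduced exactly. The sketch below uses fixed integers instead (an assumption made purely for reproducibility) to show sorting on each axis, chaining the two, and the fact that the original object is not modified:

```python
import numpy as np
import pandas as pd

# Fixed values instead of np.random.randn so the result is reproducible.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    columns=["c", "a", "d", "b"],
    index=[3, 1, 4, 2],
)

by_rows = frame.sort_index()            # sort the row labels (axis=0)
by_cols = frame.sort_index(axis=1)      # sort the column labels
# Chain the calls to sort both axes; here the columns are sorted in descending order.
both = frame.sort_index().sort_index(axis=1, ascending=False)

print(by_rows)
print(by_cols)
print(both)
```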

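Besides sorting by index, you can also sort by value.

To sort a Series by its values, use the sort_values method. In older versions of pandas this was the order method.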

>>> obj = Series([3, 4, 1, 6])
>>> obj
0    3
1    4
2    1
3    6
dtype: int64
>>> obj.sort_values()
2    1
0    3
1    4
3    6
dtype: int64

When sorting, missing values (NaN) are placed at the end by default.
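The post gives no example of this, so here is a small sketch with made-up values. It shows that NaN stays at the end in both ascending and descending order, and that the na_position argument of sort_values can move it to the front:

```python
import numpy as np
import pandas as pd

# Made-up data containing two missing values.
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

asc = obj.sort_values()                        # NaN last
desc = obj.sort_values(ascending=False)        # NaN still last
first = obj.sort_values(na_position="first")   # NaN first

print(asc)
print(desc)
print(first)
```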

With a DataFrame, you may want to sort by the values in one or more columns.

>>> frame = DataFrame({'a': [4, 7, -3, 2], 'b': [1, 0, 0, 1]})
>>> frame
   a  b
0  4  1
1  7  0
2 -3  0
3  2  1
>>> frame.sort_index(by='a')  # the by argument is deprecated (and removed in later pandas releases); use sort_values instead
__main__:1: FutureWarning: by argument to sort_index is deprecated, please use .sort_values(by=...)
   a  b
2 -3  0
3  2  1
0  4  1
1  7  0
>>> frame.sort_values(by='a')
   a  b
2 -3  0
3  2  1
0  4  1
1  7  0

To sort by multiple columns, pass a list of column names:

>>> frame.sort_values(by=['b', 'a'])
   a  b
2 -3  0
1  7  0
3  2  1
0  4  1
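When sorting by several columns, the ascending argument can also be a list with one flag per column. The sketch below reuses the same data and sorts b ascending while sorting a descending within each value of b:

```python
import pandas as pd

frame = pd.DataFrame({"a": [4, 7, -3, 2], "b": [1, 0, 0, 1]})

# Both keys ascending: b first, then a breaks the ties.
default_order = frame.sort_values(by=["b", "a"])

# One ascending flag per key: b ascending, a descending within each b.
mixed_order = frame.sort_values(by=["b", "a"], ascending=[True, False])

print(default_order)
print(mixed_order)
```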

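Ranking is closely related to sorting: the values are sorted, and each one is then assigned a rank from 1 up to the number of valid values. When several values are tied, each of them receives the mean of the ranks that the tied group occupies.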

>>> obj = Series([7, -5, 7, 4, 2, 0, 4])
>>> obj
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
>>> obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

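Ties can also be broken by the order in which the values appear in the original data: among equal values, the one that appears first gets the lower rank. This is done with the method option.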

>>> obj.rank(method='first')
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

Ranking in descending order is also supported: pass ascending=False.

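For a DataFrame, rank works along axis=0 by default, so the values are ranked within each column (down the rows). With axis=1, the values are ranked within each row (across the columns).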
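The post has no DataFrame ranking example, so the following sketch uses made-up data to show the difference between the two axes:

```python
import pandas as pd

# Made-up data for illustration.
frame = pd.DataFrame({
    "b": [4.3, 7, -3, 2],
    "a": [0, 1, 0, 1],
    "c": [-2, 5, 8, -2.5],
})

down = frame.rank()          # axis=0: rank the values within each column
across = frame.rank(axis=1)  # axis=1: rank the values within each row

print(down)
print(across)
```

In down, column a contains ties, so its ranks are the averages 1.5 and 3.5. In across, every row is ranked on its own, so each row holds the ranks 1 to 3.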

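The method option accepts the following values:

method	Description
average	Default: assign the average rank to each value in a tied group
max	Use the highest rank in the whole group
min	Use the lowest rank in the whole group
first	Assign ranks in the order the values appear in the original data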
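To see how the tie-breaking methods differ, the sketch below ranks the same Series used earlier with each of average, min, max, and first, and puts the results side by side. It also shows a descending ranking combined with method='max'.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method.
methods = ["average", "min", "max", "first"]
comparison = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(comparison)

# Descending ranks: the largest value gets rank 1; ties take the highest rank in the group.
descending = obj.rank(ascending=False, method="max")
print(descending)
```

Only the tied values (the two 7s and the two 4s) differ between the columns; every other value gets the same rank under all four methods.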

Axis Indexes with Duplicate Values

Many pandas functions (reindex, for example) require the labels to be unique, but uniqueness is not mandatory.

The index's is_unique property tells you whether its labels are unique.

>>> obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
>>> obj
a    0
a    1
b    2
b    3
c    4
dtype: int64
>>> obj.index.is_unique
False

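When selecting data from an index with duplicates, a label that matches multiple entries returns a Series, while a label that matches a single entry returns a scalar.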

>>> obj['a']
a    0
a    1
dtype: int64
>>> obj['c']
4
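Since the return type depends on whether a label is duplicated, code that handles arbitrary labels may need to check for duplicates first. A minimal sketch using Index.duplicated (the variable names are illustrative):

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

print(obj.index.is_unique)   # False: 'a' and 'b' each appear twice

# duplicated() marks the second and later occurrences of each label.
dup_labels = obj.index[obj.index.duplicated()].unique().tolist()
print(dup_labels)

many = obj["a"]   # duplicated label: a Series with both entries
one = obj["c"]    # unique label: a scalar
print(type(many).__name__, type(one).__name__)
```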

The same holds for a DataFrame.

If a label matches multiple rows, the result is still a DataFrame; otherwise it is a Series.

>>> df = DataFrame(np.random.randn(5, 3), index=['a', 'a', 'b', 'b', 'c'])
>>> df.loc['a']  # label-based row selection (older pandas also offered .ix, which has since been removed)
          0         1         2
a -0.757846  0.713964 -0.674956
a  0.198044  1.093223 -0.342281
>>> df.loc['c']
0   -2.647372
1   -0.526367
2   -0.296859
Name: c, dtype: float64
>>> type(df.loc['a'])
<class 'pandas.core.frame.DataFrame'>
>>> type(df.loc['c'])
<class 'pandas.core.series.Series'>
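The changing return type can be avoided by passing a list of labels to .loc, which returns a DataFrame even when only one row matches. The sketch below uses fixed integers rather than random numbers (an assumption made for reproducibility):

```python
import numpy as np
import pandas as pd

# Fixed values instead of np.random.randn so the result is reproducible.
df = pd.DataFrame(np.arange(15).reshape(5, 3), index=["a", "a", "b", "b", "c"])

rows_a = df.loc["a"]          # duplicated label: DataFrame with two rows
row_c = df.loc["c"]           # unique label: Series
always_frame = df.loc[["c"]]  # list of labels: DataFrame, even for one row

print(type(rows_a).__name__, rows_a.shape)
print(type(row_c).__name__, row_c.tolist())
print(type(always_frame).__name__, always_frame.shape)
```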