Overview
Pandas provides two primary data structures: Series and DataFrame.
1. Import the packages
In [1]: import numpy as np
In [2]: import pandas as pd
2. Create a DataFrame object
2.1 From a NumPy array, with a DatetimeIndex and column labels
In [4]: dates = pd.date_range('20190601', periods=4)
In [5]: dates
Out[5]:
DatetimeIndex(['2019-06-01', '2019-06-02', '2019-06-03', '2019-06-04'], dtype='datetime64[ns]', freq='D')
In [10]: np.random.seed(1)
# Seed the RNG so the df1 below is reproducible (any integer works as the seed)
In [11]: df1 = pd.DataFrame(np.random.randint(-10,20,size=(4,5)), index=dates, columns=list('ABCDE'))
In [12]: df1
Out[12]:
            A    B  C   D   E
2019-06-01 -5    1  2  -2  -1
2019-06-02  1   -5  5 -10   6
2019-06-03 -9    2 -3   3  18
2019-06-04 -4   15  8  10  -5
2.2 From a dict of objects convertible to Series
In [13]: df2 = pd.DataFrame({'A': 3.,
    ...:                     'B': pd.Timestamp('20190601'),
    ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    ...:                     'D': np.array([3] * 4, dtype='int32'),
    ...:                     'E': pd.Categorical(["tiger", "pig", "duck", "cat"]),
    ...:                     'F': 'str'})
In [14]: df2
Out[14]:
     A          B    C  D      E    F
0  3.0 2019-06-01  1.0  3  tiger  str
1  3.0 2019-06-01  1.0  3    pig  str
2  3.0 2019-06-01  1.0  3   duck  str
3  3.0 2019-06-01  1.0  3    cat  str
In [96]: df3 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    ...:                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    ...:                     'C': np.random.randn(8),
    ...:                     'D': np.random.randn(8)})
In [97]: df3
Out[97]:
     A      B         C         D
0  foo    one -0.528172  1.462108
1  bar    one -1.072969 -2.060141
2  foo    two  0.865408 -0.322417
3  bar  three -2.301539 -0.384054
4  foo    two  1.744812  1.133769
5  bar    two -0.761207 -1.099891
6  foo    one  0.319039 -0.172428
7  foo  three -0.249370 -0.877858
3. Transposing data
In [53]: df1.T
Out[53]:
   2019-06-01  2019-06-02  2019-06-03  2019-06-04
A          -5           1          -9          -4
B           1          -5           2          15
C           2           5          -3           8
D          -2         -10           3          10
E          -1           6          18          -5
4. Sorting
- Sort by the index along axis=0, descending (drop ascending=False for ascending order):
In [57]: df1.sort_index(axis=0, ascending=False)
Out[57]:
            A   B  C   D   E
2019-06-04 -4  15  8  10  -5
2019-06-03 -9   2 -3   3  18
2019-06-02  1  -5  5 -10   6
2019-06-01 -5   1  2  -2  -1
- Sort by the column labels along axis=1, descending (drop ascending=False for ascending order):
In [54]: df1.sort_index(axis=1, ascending=False)
Out[54]:
             E   D  C   B  A
2019-06-01  -1  -2  2   1 -5
2019-06-02   6 -10  5  -5  1
2019-06-03  18   3 -3   2 -9
2019-06-04  -5  10  8  15 -4
- Sort by the values of column 'B', ascending (add ascending=False for descending order):
In [59]: df1.sort_values(by='B')
Out[59]:
            A   B  C   D   E
2019-06-02  1  -5  5 -10   6
2019-06-01 -5   1  2  -2  -1
2019-06-03 -9   2 -3   3  18
2019-06-04 -4  15  8  10  -5
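sort_values also accepts a list of keys, with a matching list for ascending, to break ties column by column. A minimal sketch, using a small illustrative frame rather than the df1 from the session:

```python
import pandas as pd

# Illustrative frame (not the df1 above)
df = pd.DataFrame({'grade': ['B', 'A', 'B', 'A'],
                   'score': [70, 90, 80, 60]})

# Sort by 'grade' ascending, then by 'score' descending within each grade
out = df.sort_values(by=['grade', 'score'], ascending=[True, False])
print(out)
```

Each entry of ascending applies to the key in the same position of by.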
5. Querying data
5.1 First n rows: DataFrame.head(); last n rows: DataFrame.tail()
In [15]: df1.head(2)
Out[15]:
            A   B  C   D  E
2019-06-01 -5   1  2  -2 -1
2019-06-02  1  -5  5 -10  6
In [16]: df1.tail(3)
Out[16]:
            A   B  C   D   E
2019-06-02  1  -5  5 -10   6
2019-06-03 -9   2 -3   3  18
2019-06-04 -4  15  8  10  -5
5.2 Inspecting the index, column labels, and values
In [17]: df1.index
Out[17]: DatetimeIndex(['2019-06-01', '2019-06-02', '2019-06-03', '2019-06-04'], dtype='datetime64[ns]', freq='D')
In [18]: df1.columns
Out[18]: Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
In [19]: df1.values
Out[19]:
array([[ -5,   1,   2,  -2,  -1],
       [  1,  -5,   5, -10,   6],
       [ -9,   2,  -3,   3,  18],
       [ -4,  15,   8,  10,  -5]])
5.3 Summary statistics
Notes: count: number of non-missing values per column; mean: column mean; std: column standard deviation; min: column minimum; max: column maximum.
In [52]: df1.describe()
Out[52]:
              A          B         C          D          E
count  4.000000   4.000000  4.000000   4.000000   4.000000
mean  -4.250000   3.250000  3.000000   0.250000   4.500000
std    4.112988   8.421203  4.690416   8.421203  10.082989
min   -9.000000  -5.000000 -3.000000 -10.000000  -5.000000
25%   -6.000000  -0.500000  0.750000  -4.000000  -2.000000
50%   -4.500000   1.500000  3.500000   0.500000   2.500000
75%   -2.750000   5.250000  5.750000   4.750000   9.000000
max    1.000000  15.000000  8.000000  10.000000  18.000000
5.4 Selecting with DataFrame[]
a. Select a single column (dimensionality drops by one); df1['A'] is equivalent to df1.A
In [61]: df1['A']  # the result is a Series, one dimension lower
Out[61]:
2019-06-01   -5
2019-06-02    1
2019-06-03   -9
2019-06-04   -4
Freq: D, Name: A, dtype: int64
In [62]: type(df1['A'])
Out[62]: pandas.core.series.Series
b. Slicing rows
In [31]: df1[0:3]  # positional slice: start inclusive, stop exclusive
Out[31]:
            A   B  C   D   E
2019-06-01 -5   1  2  -2  -1
2019-06-02  1  -5  5 -10   6
2019-06-03 -9   2 -3   3  18
In [33]: df1['2019-06-01':'2019-06-03']  # label slice: both endpoints inclusive
Out[33]:
            A   B  C   D   E
2019-06-01 -5   1  2  -2  -1
2019-06-02  1  -5  5 -10   6
2019-06-03 -9   2 -3   3  18
5.5 Selecting with DataFrame.loc[]
a. Select a single row (dimensionality drops by one)
In [35]: df1.loc[dates[0]]  # the result is a Series, one dimension lower
Out[35]:
A   -5
B    1
C    2
D   -2
E   -1
Name: 2019-06-01 00:00:00, dtype: int64
In [36]: type(df1.loc[dates[0]])
Out[36]: pandas.core.series.Series
b. Select by row slice plus column labels
In [37]: df1.loc[:, ['A','B']]
Out[37]:
            A   B
2019-06-01 -5   1
2019-06-02  1  -5
2019-06-03 -9   2
2019-06-04 -4  15
In [40]: df1.loc['2019-06-02':'2019-06-03', ['A','B']]
Out[40]:
            A  B
2019-06-02  1 -5
2019-06-03 -9  2
In [41]: df1.loc['2019-06-02':'2019-06-03']
Out[41]:
            A  B  C   D   E
2019-06-02  1 -5  5 -10   6
2019-06-03 -9  2 -3   3  18
c. Select by a single row label plus column labels (dimensionality drops by one)
In [42]: df1.loc['2019-06-02', ['A','B']]
Out[42]:
A    1
B   -5
Name: 2019-06-02 00:00:00, dtype: int64
d. Select a single cell (a scalar)
In [44]: df1.loc[dates[1], 'A']
Out[44]: 1
In [45]: df1.at[dates[1], 'A']  # fast access to a single scalar value
Out[45]: 1
5.6 Selecting with DataFrame.iloc[]
a. Select a single row (dimensionality drops by one)
In [46]: df1.iloc[3]
Out[46]:
A    -4
B    15
C     8
D    10
E    -5
Name: 2019-06-04 00:00:00, dtype: int64
b. Slicing by position
In [51]: df1.iloc[2:, 0:3]
Out[51]:
            A   B  C
2019-06-03 -9   2 -3
2019-06-04 -4  15  8
In [54]: df1.iloc[:, 1:3]  # all rows, second and third columns
Out[54]:
             B  C
2019-06-01   1  2
2019-06-02  -5  5
2019-06-03   2 -3
2019-06-04  15  8
In [55]: df1.iloc[2:4, :]  # third and fourth rows, all columns
Out[55]:
            A   B  C   D   E
2019-06-03 -9   2 -3   3  18
2019-06-04 -4  15  8  10  -5
c. Select by explicit lists of positions
In [52]: df1.iloc[[0,2], [1,2,4]]
Out[52]:
            B  C   E
2019-06-01  1  2  -1
2019-06-03  2 -3  18
d. Select a single cell (a scalar): DataFrame.iat[1,1]
In [57]: df1.iat[1,1]  # value in the second row, second column
Out[57]: -5
5.7 Selecting with a boolean condition
In [39]: df1[df1.B > 0]
Out[39]:
            A   B  C   D   E
2019-06-01 -5   1  2  -2  -1
2019-06-03 -9   2 -3   3  18
2019-06-04 -4  15  8  10  -5
In [40]: df1[df1 > 0]  # pandas marks missing data with np.nan
Out[40]:
              A     B    C     D     E
2019-06-01  NaN   1.0  2.0   NaN   NaN
2019-06-02  1.0   NaN  5.0   NaN   6.0
2019-06-03  NaN   2.0  NaN   3.0  18.0
2019-06-04  NaN  15.0  8.0  10.0   NaN
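Several boolean conditions can be combined, but they must use the element-wise operators & and | (not Python's and / or), with each condition parenthesized. A minimal sketch on an illustrative frame:

```python
import pandas as pd

# Illustrative frame; element-wise conditions must be combined with & / |
# (not `and` / `or`), and each condition needs its own parentheses
df = pd.DataFrame({'A': [-5, 1, -9, -4], 'B': [1, -5, 2, 15]})

both = df[(df['A'] > 0) & (df['B'] < 0)]    # rows where A > 0 AND B < 0
either = df[(df['A'] > 0) | (df['B'] > 0)]  # rows where A > 0 OR B > 0
print(both)
print(either)
```

Omitting the parentheses raises an error, because & binds more tightly than the comparison operators.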
5.8 Selecting with isin()
In [46]: df2[df2['E'].isin(['cat','tiger'])]  # all rows whose 'E' value is 'cat' or 'tiger'
Out[46]:
     A          B    C  D      E    F
0  3.0 2019-06-01  1.0  3  tiger  str
3  3.0 2019-06-01  1.0  3    cat  str
6. Modifying data
In [48]: df1.loc[dates[0]:dates[1], 'B'] = 999  # overwrite the first two values of column 'B'
In [49]: df1
Out[49]:
            A    B  C   D   E
2019-06-01 -5  999  2  -2  -1
2019-06-02  1  999  5 -10   6
2019-06-03 -9    2 -3   3  18
2019-06-04 -4   15  8  10  -5
In [50]: df1.loc[dates[2]:dates[3], 'G'] = 888  # add a new column 'G'; unassigned rows become NaN
In [51]: df1
Out[51]:
            A    B  C   D   E      G
2019-06-01 -5  999  2  -2  -1    NaN
2019-06-02  1  999  5 -10   6    NaN
2019-06-03 -9    2 -3   3  18  888.0
2019-06-04 -4   15  8  10  -5  888.0
7. Missing data
7.1 Drop every row that contains missing data
In [52]: df1.dropna(how='any')
Out[52]:
            A   B  C   D   E      G
2019-06-03 -9   2 -3   3  18  888.0
2019-06-04 -4  15  8  10  -5  888.0
7.2 Fill missing data
In [54]: df1.fillna(value=666)
Out[54]:
            A    B  C   D   E      G
2019-06-01 -5  999  2  -2  -1  666.0
2019-06-02  1  999  5 -10   6  666.0
2019-06-03 -9    2 -3   3  18  888.0
2019-06-04 -4   15  8  10  -5  888.0
7.3 Get a boolean mask of the missing data (dropna and fillna above returned new objects, so df1 itself still contains its NaNs)
In [55]: pd.isnull(df1)
Out[55]:
                A      B      C      D      E      G
2019-06-01  False  False  False  False  False   True
2019-06-02  False  False  False  False  False   True
2019-06-03  False  False  False  False  False  False
2019-06-04  False  False  False  False  False  False
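Summing that boolean mask column-wise gives a quick count of missing values per column. A minimal sketch on an illustrative frame:

```python
import numpy as np
import pandas as pd

# Illustrative frame with some missing values
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 5.0]})

# isnull() yields a boolean mask; summing it counts NaNs per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```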
8. Applying functions: df1.apply()
In [60]: df1.apply(np.cumsum)  # cumulative sum down each column
Out[60]:
              A     B   C   D   E       G
2019-06-01   -5   999   2  -2  -1     NaN
2019-06-02   -4  1998   7 -12   5     NaN
2019-06-03  -13  2000   4  -9  23   888.0
2019-06-04  -17  2015  12   1  18  1776.0
In [61]: df1.apply(lambda x: x.max() - x.min())  # apply a custom lambda to every column
Out[61]:
A      10.0
B     997.0
C      11.0
D      20.0
E      23.0
G       0.0
dtype: float64
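By default apply runs column-wise (axis=0); passing axis=1 runs the function across each row instead. A minimal sketch on an illustrative frame:

```python
import pandas as pd

# Illustrative frame; apply is column-wise by default, row-wise with axis=1
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

col_sums = df.apply(sum)          # one value per column
row_sums = df.apply(sum, axis=1)  # one value per row
print(col_sums)
print(row_sums)
```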
9. Combining data
9.1 pd.concat() concatenates pandas objects
In [72]: pieces = [df1[0:2], df1[2:], df1[:1]]  # a list of three pandas objects
In [73]: pd.concat(pieces)
Out[73]:
            A    B  C   D   E      G
2019-06-01 -5  999  2  -2  -1    NaN
2019-06-02  1  999  5 -10   6    NaN
2019-06-03 -9    2 -3   3  18  888.0
2019-06-04 -4   15  8  10  -5  888.0
2019-06-01 -5  999  2  -2  -1    NaN
9.2 pd.merge() performs SQL-style joins (note how this differs from concat)
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
   key  lval
0  foo     1
1  foo     2
In [80]: right
Out[80]:
   key  rval
0  foo     4
1  foo     5
In [81]: pd.merge(left, right, on='key')  # every matching pair of keys is joined
Out[81]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
In [84]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [85]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [86]: left
Out[86]:
   key  lval
0  foo     1
1  bar     2
In [87]: right
Out[87]:
   key  rval
0  foo     4
1  bar     5
In [88]: pd.merge(left, right, on='key')
Out[88]:
   key  lval  rval
0  foo     1     4
1  bar     2     5
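merge defaults to an inner join (only keys present in both frames survive); the how parameter switches to 'left', 'right', or 'outer' joins. A minimal sketch on illustrative frames:

```python
import pandas as pd

# Illustrative frames; only 'foo' appears in both
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')               # default: keep matching keys only
outer = pd.merge(left, right, on='key', how='outer')  # keep all keys, fill gaps with NaN
print(inner)
print(outer)
```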
10. Appending rows: df1.append() (appends one row at the end)
In [91]: s = df1.iloc[3]
In [92]: s
Out[92]:
A     -4.0
B     15.0
C      8.0
D     10.0
E     -5.0
G    888.0
Name: 2019-06-04 00:00:00, dtype: float64
In [93]: df1.append(s, ignore_index=True)  # returns a new DataFrame; note the index is rebuilt
Out[93]:
     A      B    C     D     E      G
0 -5.0  999.0  2.0  -2.0  -1.0    NaN
1  1.0  999.0  5.0 -10.0   6.0    NaN
2 -9.0    2.0 -3.0   3.0  18.0  888.0
3 -4.0   15.0  8.0  10.0  -5.0  888.0
4 -4.0   15.0  8.0  10.0  -5.0  888.0
In [94]: df1  # df1 itself is unchanged
Out[94]:
            A    B  C   D   E      G
2019-06-01 -5  999  2  -2  -1    NaN
2019-06-02  1  999  5 -10   6    NaN
2019-06-03 -9    2 -3   3  18  888.0
2019-06-04 -4   15  8  10  -5  888.0
In [95]: df1.append(s)
Out[95]:
              A      B    C     D     E      G
2019-06-01 -5.0  999.0  2.0  -2.0  -1.0    NaN
2019-06-02  1.0  999.0  5.0 -10.0   6.0    NaN
2019-06-03 -9.0    2.0 -3.0   3.0  18.0  888.0
2019-06-04 -4.0   15.0  8.0  10.0  -5.0  888.0
2019-06-04 -4.0   15.0  8.0  10.0  -5.0  888.0
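Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions the same row-append is done with pd.concat. A minimal sketch on an illustrative frame:

```python
import pandas as pd

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
s = df.iloc[1]  # a single row as a Series

# s.to_frame().T turns the Series back into a one-row DataFrame
out = pd.concat([df, s.to_frame().T], ignore_index=True)
print(out)
```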
11. Grouping
In [98]: df1
Out[98]:
            A    B  C   D   E      G
2019-06-01 -5  999  2  -2  -1    NaN
2019-06-02  1  999  5 -10   6    NaN
2019-06-03 -9    2 -3   3  18  888.0
2019-06-04 -4   15  8  10  -5  888.0
In [101]: df1.groupby('B').sum()  # group by one column, then sum within each group
Out[101]:
      A  C   D   E      G
B
2    -9 -3   3  18  888.0
15   -4  8  10  -5  888.0
999  -4  7 -12   5    0.0
In [97]: df3
Out[97]:
     A      B         C         D
0  foo    one -0.528172  1.462108
1  bar    one -1.072969 -2.060141
2  foo    two  0.865408 -0.322417
3  bar  three -2.301539 -0.384054
4  foo    two  1.744812  1.133769
5  bar    two -0.761207 -1.099891
6  foo    one  0.319039 -0.172428
7  foo  three -0.249370 -0.877858
In [103]: df3.groupby('A').sum()  # group by one column, then sum within each group
Out[103]:
            C         D
A
bar -4.135714 -3.544086
foo  2.151716  1.223174
In [104]: df3.groupby(['A','B']).sum()  # group by two columns, then sum within each group
Out[104]:
                    C         D
A   B
bar one     -1.072969 -2.060141
    three   -2.301539 -0.384054
    two     -0.761207 -1.099891
foo one     -0.209133  1.289680
    three   -0.249370 -0.877858
    two      2.610219  0.811352
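Beyond sum(), a groupby can apply several reductions at once via agg. A minimal sketch on an illustrative frame:

```python
import pandas as pd

# Illustrative frame; agg computes several statistics per group in one pass
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

out = df.groupby('A')['C'].agg(['sum', 'mean'])
print(out)
```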
12. Reshaping data with stack() / unstack()
12.1 Setup
In [105]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
     ...:                     ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
In [106]: tuples
Out[106]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
In [108]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [109]: index
Out[109]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
In [118]: type(index)
Out[118]: pandas.core.indexes.multi.MultiIndex
In [110]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [111]: df
Out[111]:
                     A         B
first second
bar   one     0.042214  0.582815
      two    -1.100619  1.144724
baz   one     0.901591  0.502494
      two     0.900856 -0.683728
foo   one    -0.122890 -0.935769
      two    -0.267888  0.530355
qux   one    -0.691661 -0.396754
      two    -0.687173 -0.845206
In [119]: df4 = df[:4]
In [120]: df4
Out[120]:
                     A         B
first second
bar   one     0.042214  0.582815
      two    -1.100619  1.144724
baz   one     0.901591  0.502494
      two     0.900856 -0.683728
12.2 stack() compresses the column labels into the row index
In [121]: stacked = df4.stack()
In [122]: stacked
Out[122]:
first  second
bar    one     A    0.042214
               B    0.582815
       two     A   -1.100619
               B    1.144724
baz    one     A    0.901591
               B    0.502494
       two     A    0.900856
               B   -0.683728
dtype: float64
12.3 unstack() is the inverse of stack()
In [123]: stacked.unstack()
Out[123]:
                     A         B
first second
bar   one     0.042214  0.582815
      two    -1.100619  1.144724
baz   one     0.901591  0.502494
      two     0.900856 -0.683728
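unstack() pivots the innermost index level by default; passing a level number (or name) pivots a different level instead. A minimal sketch on an illustrative two-level Series:

```python
import pandas as pd

# Illustrative two-level index; unstack() pivots the innermost level by
# default, while unstack(0) pivots the outer level instead
index = pd.MultiIndex.from_tuples([('bar', 'one'), ('bar', 'two'),
                                   ('baz', 'one'), ('baz', 'two')],
                                  names=['first', 'second'])
s = pd.Series([1, 2, 3, 4], index=index)

inner = s.unstack()   # columns: 'one', 'two'; index: 'bar', 'baz'
outer = s.unstack(0)  # columns: 'bar', 'baz'; index: 'one', 'two'
print(inner)
print(outer)
```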