An Introduction to the DataFrame Data Structure in pandas, the Python Data-Processing Library

I'm 幽默指甲油, a blogger on 靠谱客. This article introduces the DataFrame data structure in pandas, and I'm sharing it here in the hope that it serves as a useful reference.

pandas has two main data structures, Series and DataFrame. This article focuses on how to use DataFrame, which is designed for tabular data.

Series is covered in the companion article "An Introduction to the Series Data Structure in pandas, the Python Data-Processing Library".

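There are many ways to create a DataFrame. One is to pass a dictionary whose keys become the column names and whose values are equal-length lists: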

In [1]: import pandas as pd

In [8]: data = {'name': ['张三', '张三', '张三', '李四', '李四', '李四'],
   ...:         'year': [2016, 2017, 2018, 2016, 2017, 2018],
   ...:         'income': [6000, 6500, 7000, 25000, 26000, 29000]}

In [9]: frame = pd.DataFrame(data)

In [10]: frame
Out[10]:
  name  year  income
0   张三  2016    6000
1   张三  2017    6500
2   张三  2018    7000
3   李四  2016   25000
4   李四  2017   26000
5   李四  2018   29000
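
As a small follow-up sketch (the variable names `n_rows`, `column_names`, and `row_labels` are only illustrative), the same dictionary can be built in a plain script and the resulting frame inspected. When no index is given, pandas assigns the default integer labels 0 to 5 seen above.

```python
import pandas as pd

# Same data as in the session above: one key per column, equal-length lists.
data = {
    "name": ["张三", "张三", "张三", "李四", "李四", "李四"],
    "year": [2016, 2017, 2018, 2016, 2017, 2018],
    "income": [6000, 6500, 7000, 25000, 26000, 29000],
}
frame = pd.DataFrame(data)

n_rows, n_cols = frame.shape         # number of rows and columns
column_names = list(frame.columns)   # column labels, in dictionary order
row_labels = list(frame.index)       # default integer row labels
```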

The head method returns the first five rows:

In [11]: frame.head()
Out[11]:
  name  year  income
0   张三  2016    6000
1   张三  2017    6500
2   张三  2018    7000
3   李四  2016   25000
4   李四  2017   26000
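
Five is only the default. A minimal sketch, assuming a trimmed two-column frame like the one above: `head` accepts a row count, and `tail` is its counterpart for the last rows.

```python
import pandas as pd

frame = pd.DataFrame({
    "year": [2016, 2017, 2018, 2016, 2017, 2018],
    "income": [6000, 6500, 7000, 25000, 26000, 29000],
})

default_head = frame.head()   # no argument: first five rows
first_two = frame.head(2)     # explicit count: first two rows
last_two = frame.tail(2)      # last two rows
```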

Passing the columns argument sets the order of the columns:

In [13]: pd.DataFrame(data, columns=['year', 'income', 'name'])
Out[13]:
   year  income name
0  2016    6000   张三
1  2017    6500   张三
2  2018    7000   张三
3  2016   25000   李四
4  2017   26000   李四
5  2018   29000   李四
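
The `columns` argument applies at construction time. For a frame that already exists, one common alternative (shown here as a sketch, not as the author's method) is to index with a list of column names, which returns a new frame in that order. Listing fewer names in `columns` also keeps only that subset of the dictionary.

```python
import pandas as pd

data = {
    "name": ["张三", "李四"],
    "year": [2016, 2016],
    "income": [6000, 25000],
}
frame = pd.DataFrame(data)

# Reorder an existing frame by selecting columns in the desired order.
reordered = frame[["year", "income", "name"]]

# At construction time, `columns` can also select a subset of the keys.
subset = pd.DataFrame(data, columns=["name", "income"])
```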

If columns names a column that does not exist in the data, that column is filled with NaN. The index argument sets the row labels:

In [14]: frame2 = pd.DataFrame(data, columns=['income', 'year', 'name', 'gender'],
    ...:                       index=['one', 'two', 'three', 'four', 'five', 'six'])

In [15]: frame2
Out[15]:
       income  year name gender
one      6000  2016   张三    NaN
two      6500  2017   张三    NaN
three    7000  2018   张三    NaN
four    25000  2016   李四    NaN
five    26000  2017   李四    NaN
six     29000  2018   李四    NaN

In [17]: frame2.columns
Out[17]: Index(['income', 'year', 'name', 'gender'], dtype='object')
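
To confirm programmatically that such a column is entirely missing, `isna` can be combined with `all` or `sum`. The sketch below uses a shortened two-row version of the data; `all_missing` and `missing_per_column` are illustrative names.

```python
import pandas as pd

data = {"name": ["张三", "李四"], "income": [6000, 25000]}

# 'gender' is not a key of `data`, so pandas creates it filled with NaN.
frame2 = pd.DataFrame(data, columns=["income", "name", "gender"],
                      index=["one", "two"])

all_missing = frame2["gender"].isna().all()   # True when every value is NaN
missing_per_column = frame2.isna().sum()      # NaN count for each column
```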

A column can be selected by name, either with bracket indexing or as an attribute. The result is a Series:

In [18]: frame2['name']
Out[18]:
one      张三
two      张三
three    张三
four     李四
five     李四
six      李四
Name: name, dtype: object

In [20]: frame2.income
Out[20]:
one       6000
two       6500
three     7000
four     25000
five     26000
six      29000
Name: income, dtype: int64
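
The two forms are not fully interchangeable. The sketch below uses hypothetical column names (`count`, `net income`) to show where attribute access falls short: a name that collides with a DataFrame method resolves to the method, and a name that is not a valid Python identifier cannot be written as an attribute at all. Bracket indexing works in every case.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [6000, 6500],
    "count": [1, 2],              # same name as the DataFrame.count method
    "net income": [5000, 5400],   # contains a space, not a valid identifier
})

same = df["income"].equals(df.income)       # both forms give the same Series
count_attr_is_method = callable(df.count)   # attribute resolves to the method
count_values = df["count"].tolist()         # brackets reach the column
net_values = df["net income"].tolist()      # brackets handle any label
```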

A row can be selected by its label with loc:

In [21]: frame2.loc['six']
Out[21]:
income    29000
year       2018
name         李四
gender      NaN
Name: six, dtype: object
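
`loc` selects by label; its positional counterpart is `iloc`. A brief sketch with a reduced three-row frame (the variable names are illustrative) also shows that `loc` accepts a row label together with a column label, or lists of each.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"income": [6000, 25000, 29000], "year": [2016, 2016, 2018]},
    index=["one", "four", "six"],
)

by_label = frame2.loc["six"]                          # row by label
by_position = frame2.iloc[2]                          # same row by position
single_value = frame2.loc["six", "income"]            # one cell
several_rows = frame2.loc[["one", "six"], ["income"]] # sub-frame
```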

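A column can be assigned either a single value, which is broadcast to every row, or a list with one value per row: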

In [22]: frame2['gender'] = 'male'

In [23]: frame2
Out[23]:
       income  year name gender
one      6000  2016   张三   male
two      6500  2017   张三   male
three    7000  2018   张三   male
four    25000  2016   李四   male
five    26000  2017   李四   male
six     29000  2018   李四   male

In [24]: frame2['gender'] = ['male', 'male', 'male', 'female', 'female', 'female']

In [25]: frame2
Out[25]:
       income  year name  gender
one      6000  2016   张三    male
two      6500  2017   张三    male
three    7000  2018   张三    male
four    25000  2016   李四  female
five    26000  2017   李四  female
six     29000  2018   李四  female
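
When a list is assigned, its length has to match the number of rows. The sketch below, on a shortened three-row frame, captures the `ValueError` that pandas raises for a list that is too short; `length_error` is just an illustrative name.

```python
import pandas as pd

frame2 = pd.DataFrame({"income": [6000, 6500, 7000]},
                      index=["one", "two", "three"])

frame2["gender"] = "male"                        # scalar: broadcast to all rows
frame2["gender"] = ["male", "female", "male"]    # list: one value per row

try:
    frame2["gender"] = ["male", "female"]        # two values for three rows
    length_error = None
except ValueError as exc:
    length_error = str(exc)                      # length mismatch is rejected
```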

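Assigning a Series aligns its values with the DataFrame's index. Rows whose labels do not appear in the Series receive NaN: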

In [26]: gender = pd.Series(['male', 'female'], index=['one', 'four'])

In [27]: frame2['gender'] = gender

In [28]: frame2
Out[28]:
       income  year name  gender
one      6000  2016   张三    male
two      6500  2017   张三     NaN
three    7000  2018   张三     NaN
four    25000  2016   李四  female
five    26000  2017   李四     NaN
six     29000  2018   李四     NaN
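
Alignment works in the other direction as well. In this sketch the Series carries an extra label, `'ten'`, that is hypothetical and absent from the frame's index; its value is simply not used, while the unmatched row `'two'` gets NaN.

```python
import pandas as pd

frame2 = pd.DataFrame({"income": [6000, 6500, 25000]},
                      index=["one", "two", "four"])

# 'ten' does not exist in frame2.index, so its value is ignored on assignment.
gender = pd.Series(["male", "female", "male"], index=["one", "four", "ten"])
frame2["gender"] = gender

missing_count = int(frame2["gender"].isna().sum())   # rows with no match
```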

A column can be deleted with del:

In [29]: del frame2['gender']
In [30]: frame2.columns
Out[30]: Index(['income', 'year', 'name'], dtype='object')
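
`del` modifies the frame in place. Two related calls, sketched here on a reduced two-row frame, behave differently: `drop(columns=...)` returns a new frame and leaves the original untouched, while `pop` removes the column in place and returns it as a Series.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"income": [6000, 25000], "year": [2016, 2016], "gender": ["male", "female"]},
    index=["one", "four"],
)

without_gender = frame2.drop(columns=["gender"])   # new frame, original kept
popped = frame2.pop("year")                        # removed in place, returned
```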

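Another way to create a DataFrame is from a nested dictionary. The outer keys become the columns and the inner keys become the row index; any missing combination is filled with NaN: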

In [31]: income = {'张三': {2016: 6000, 2017: 6500, 2018: 7000},
    ...:           '李四': {2016: 25000, 2017: 26000}}

In [32]: frame3 = pd.DataFrame(income)

In [33]: frame3
Out[33]:
        张三       李四
2016  6000  25000.0
2017  6500  26000.0
2018  7000      NaN
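
The `.0` suffix in the 李四 column above is a side effect of the missing 2018 entry: NaN is a floating-point value, so the whole column is stored as floats. A short sketch that checks this (the variable names are illustrative):

```python
import pandas as pd

income = {
    "张三": {2016: 6000, 2017: 6500, 2018: 7000},
    "李四": {2016: 25000, 2017: 26000},   # no entry for 2018
}
frame3 = pd.DataFrame(income)

row_labels = list(frame3.index)                    # inner keys become the index
gap_is_nan = pd.isna(frame3.loc[2018, "李四"])      # missing cell is NaN
lisi_kind = frame3["李四"].dtype.kind               # 'f' means floating point
```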

A DataFrame can be transposed with the T attribute, which swaps rows and columns:

In [34]: frame3.T
Out[34]:
       2016     2017    2018
张三   6000.0   6500.0  7000.0
李四  25000.0  26000.0     NaN
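
Note that 张三's values are displayed as floats after transposing. Each new column now mixes values from an integer column and a float column, so pandas stores them with a common floating-point type. A minimal sketch that checks this:

```python
import pandas as pd

income = {
    "张三": {2016: 6000, 2017: 6500, 2018: 7000},
    "李四": {2016: 25000, 2017: 26000},
}
frame3 = pd.DataFrame(income)
transposed = frame3.T

# After transposing, names are the row labels and years are the columns.
new_rows = list(transposed.index)
new_columns = list(transposed.columns)
all_float = all(dtype.kind == "f" for dtype in transposed.dtypes)
```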

The row index and the columns can each be given a name:

In [35]: frame3.index.name = 'year'

In [36]: frame3.columns.name = 'name'

In [37]: frame3
Out[37]:
name    张三       李四
year
2016  6000  25000.0
2017  6500  26000.0
2018  7000      NaN
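
Setting the `name` attributes changes the frame in place. As an alternative sketch, `rename_axis` returns a new frame with both names set and leaves the original unnamed:

```python
import pandas as pd

frame3 = pd.DataFrame({
    "张三": {2016: 6000, 2017: 6500},
    "李四": {2016: 25000, 2017: 26000},
})

# Returns a copy with named axes; frame3 itself is not modified.
named = frame3.rename_axis(index="year", columns="name")
```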

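The to_numpy method converts a DataFrame to a two-dimensional NumPy array. When the columns hold different kinds of data, as in frame2, the array's dtype is object: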

In [38]: frame3.to_numpy()
Out[38]:
array([[ 6000., 25000.],
       [ 6500., 26000.],
       [ 7000.,    nan]])

In [39]: frame2.to_numpy()
Out[39]:
array([[6000, 2016, '张三'],
       [6500, 2017, '张三'],
       [7000, 2018, '张三'],
       [25000, 2016, '李四'],
       [26000, 2017, '李四'],
       [29000, 2018, '李四']], dtype=object)
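
The difference between the two results comes down to the array's single dtype. The sketch below uses two small hypothetical frames, `numeric` and `mixed`, to check it: all-numeric columns share a floating-point dtype, while numbers combined with strings fall back to `object`.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({"a": [6000, 6500], "b": [25000.0, np.nan]})
mixed = pd.DataFrame({"income": [6000, 6500], "name": ["张三", "李四"]})

numeric_array = numeric.to_numpy()   # int and float columns -> float64 array
mixed_array = mixed.to_numpy()       # numbers and strings -> object array
```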