Python data processing workflow


I'm 成就石头, a blogger on 靠谱客. This article mainly covers data processing in Python; I'm sharing it here in the hope that it can serve as a reference.

Python data processing: combining data

Code and results

# Mainly uses the append, concat, join, and merge functions
import pandas as pd
df1 = pd.read_csv('D:/python_learning/python_pandas_book/pandas_for_everyone-master/data/concat_1.csv')
df2 = pd.read_csv('D:/python_learning/python_pandas_book/pandas_for_everyone-master/data/concat_2.csv')
df3 = pd.read_csv('D:/python_learning/python_pandas_book/pandas_for_everyone-master/data/concat_3.csv')
df1
df2

    A   B   C   D
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7

row_concat = pd.concat([df1,df2,df3])  # concatenate row-wise; the frames are simply stacked, each keeping its original index
print(row_concat)

     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11
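Because the stacked result keeps each frame's original index, the labels 0 to 3 now repeat. The sketch below illustrates what that means for row selection. It uses two small hand-built frames as stand-ins for the CSV files (the values are illustrative, not the real data).

```python
import pandas as pd

# Small stand-ins for concat_1.csv and concat_2.csv (illustrative values)
df1 = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
df2 = pd.DataFrame({"A": ["a4", "a5"], "B": ["b4", "b5"]})

row_concat = pd.concat([df1, df2])

# Label-based lookup: the label 0 occurs once per source frame,
# so this returns a DataFrame with two rows
by_label = row_concat.loc[0]

# Position-based lookup: always exactly one row, returned as a Series
by_position = row_concat.iloc[0]

print(by_label)
print(by_position)
```

When a single row is wanted from a concatenated frame, `iloc` is usually the safer choice unless the index has been rebuilt.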

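new_row_series = pd.Series(['n1','n2','n3','n4'])
print(new_row_series)  # the Series prints as a single column
row_concat_new = pd.concat([row_concat,new_row_series])  # the objects to concatenate must be wrapped in a list
print(row_concat_new)  # done this way, the Series is added to the original DataFrame as a new column rather than a new row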


     A    B    C    D    0
0   a0   b0   c0   d0  NaN
1   a1   b1   c1   d1  NaN
2   a2   b2   c2   d2  NaN
3   a3   b3   c3   d3  NaN
0   a4   b4   c4   d4  NaN
1   a5   b5   c5   d5  NaN
2   a6   b6   c6   d6  NaN
3   a7   b7   c7   d7  NaN
0   a8   b8   c8   d8  NaN
1   a9   b9   c9   d9  NaN
2  a10  b10  c10  d10  NaN
3  a11  b11  c11  d11  NaN
0  NaN  NaN  NaN  NaN   n1
1  NaN  NaN  NaN  NaN   n2
2  NaN  NaN  NaN  NaN   n3
3  NaN  NaN  NaN  NaN   n4

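# So how can it be added to the original DataFrame as a row? Try converting it to a DataFrame first
new_row_df = pd.DataFrame([['n1','n2','n3','n4']],columns=['A','B','C','D'])  # two pairs of brackets: the outer list holds the rows, the inner list is one row
print(new_row_df)
row_concat_df = pd.concat([row_concat,new_row_df])  # now the row is added correctly
print(row_concat_df)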

    A   B   C   D
0  n1  n2  n3  n4

     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11
0   n1   n2   n3   n4
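The Series ended up as a new column because `concat` aligns on column labels, and an unnamed Series has none that match A to D. One possible alternative to building the one-row DataFrame by hand is to label the Series with the column names and transpose it. This is only a sketch on a small stand-in frame with illustrative values.

```python
import pandas as pd

# Stand-in for the concatenated frame (illustrative values)
df = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})

# Give the Series the frame's column names as its index,
# then turn it into a one-row DataFrame with to_frame().T
new_row = pd.Series(["n1", "n2"], index=df.columns)
new_row_df = new_row.to_frame().T

# The column labels now match, so the values line up as a row
result = pd.concat([df, new_row_df], ignore_index=True)
print(result)
```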

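# Building a DataFrame first is a bit cumbersome
# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions use pd.concat([df1,df2]) instead
print(df1.append(df2))  # append adds the rows directly, which is convenient, but the index values repeat; can we drop the old index and generate a new one?
print(df1.append(df2,ignore_index=True))  # passing the ignore_index=True parameter does exactly that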

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
4  a4  b4  c4  d4
5  a5  b5  c5  d5
6  a6  b6  c6  d6
7  a7  b7  c7  d7
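On pandas versions where `DataFrame.append` is no longer available, the same fresh 0..n-1 index can be obtained with `pd.concat`. The following is a minimal sketch on stand-in frames with illustrative values.

```python
import pandas as pd

# Stand-ins for df1 and df2 (illustrative values)
df1 = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
df2 = pd.DataFrame({"A": ["a4", "a5"], "B": ["b4", "b5"]})

# Stack the rows and discard the original index labels
stacked = pd.concat([df1, df2], ignore_index=True)

# reset_index(drop=True) renumbers a frame that was already concatenated
renumbered = pd.concat([df1, df2]).reset_index(drop=True)

print(stacked)
```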

# Use concat to join columns
pd.concat([df1,df2,df3],axis=1)  # axis=1 concatenates along the columns
pd.concat([df1,df2,df3],axis=1,ignore_index=True)  # with axis=1, ignore_index renumbers the column labels


    0   1   2   3   4   5   6   7    8    9   10   11
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11

concat_df = pd.concat([df1,df2],axis=1,ignore_index=True)
print(concat_df)


    0   1   2   3   4   5   6   7
0  a0  b0  c0  d0  a4  b4  c4  d4
1  a1  b1  c1  d1  a5  b5  c5  d5
2  a2  b2  c2  d2  a6  b6  c6  d6
3  a3  b3  c3  d3  a7  b7  c7  d7

# Name the columns by assigning to the columns attribute
concat_df.columns = ['A','B','C','D','E','F','G','H']
print(concat_df)

    A   B   C   D   E   F   G   H
0  a0  b0  c0  d0  a4  b4  c4  d4
1  a1  b1  c1  d1  a5  b5  c5  d5
2  a2  b2  c2  d2  a6  b6  c6  d6
3  a3  b3  c3  d3  a7  b7  c7  d7
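The column-wise results above line up neatly because every frame shares the index 0 to 3. With `axis=1`, `concat` matches rows by index label, not by position. A small sketch with deliberately mismatched indexes (hypothetical frames, illustrative values) shows the difference between the default outer join and `join="inner"`.

```python
import pandas as pd

# Two frames whose indexes only partly overlap (illustrative values)
left = pd.DataFrame({"A": ["a0", "a1", "a2"]}, index=[0, 1, 2])
right = pd.DataFrame({"B": ["b1", "b2", "b3"]}, index=[1, 2, 3])

# Default (outer): keep the union of index labels, filling gaps with NaN
outer = pd.concat([left, right], axis=1)

# join="inner": keep only the labels present in both frames
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)
print(inner)
```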

# Use merge to combine multiple datasets; it works much like a SQL join
# Load the datasets
person = pd.read_csv('D:/python_learning/python_pandas_book/pandas_for_everyone-master/data/survey_person.csv')
site = pd.read_csv('D:/python_learning/python_pandas_book/pandas_for_everyone-master/data/survey_site.csv')
survey = pd.read_csv('D:/python_learning/python_pandas_book/pandas_for_everyone-master/data/survey_survey.csv')
visited = pd.read_csv('D:/python_learning/python_pandas_book/pandas_for_everyone-master/data/survey_visited.csv')
print(person.head())
print(site.head())
print(survey.head())
print(visited.head())

print('Fields in the person table')
print('ident:',person['ident'].unique())
print('personal:',person['personal'].unique())
print('family:',person['family'].unique())
print('')
print('Fields in the site table')
print('name:',site['name'].unique())
print('lat:',site['lat'].unique())
print('long:',site['long'].unique())
print('')
print('Fields in the survey table')
print('taken:',survey['taken'].unique())
print('person:',survey['person'].unique())
print('quant:',survey['quant'].unique())
print('reading:',survey['reading'].unique())
print('')
print('Fields in the visited table')
print('ident:',visited['ident'].unique())
print('site:',visited['site'].unique())
print('dated:',visited['dated'].unique())

Fields in the person table
ident: ['dyer' 'pb' 'lake' 'roe' 'danforth']
personal: ['William' 'Frank' 'Anderson' 'Valentina']
family: ['Dyer' 'Pabodie' 'Lake' 'Roerich' 'Danforth']

Fields in the site table
name: ['DR-1' 'DR-3' 'MSK-4']
lat: [-49.85 -47.15 -48.87]
long: [-128.57 -126.72 -123.4 ]

Fields in the survey table
taken: [619 622 734 735 751 752 837 844]
person: ['dyer' 'pb' 'lake' nan 'roe']
quant: ['rad' 'sal' 'temp']
reading: [ 9.82 0.13 7.8 0.09 8.41 0.05 -21.5 7.22 0.06 -26.
4.35 -18.5 0.1 2.19 -16. 41.6 1.46 0.21 22.5 11.25]

Fields in the visited table
ident: [619 622 734 735 751 752 837 844]
site: ['DR-1' 'DR-3' 'MSK-4']
dated: ['1927-02-08' '1927-02-10' '1939-01-07' '1930-01-12' '1930-02-26' nan
'1932-01-14' '1932-03-22']

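# Join two tables with a one-to-many relationship
person_survey = person.merge(survey,left_on='ident',right_on='person')  # merge performs an inner join by default (how='inner')
print(person)
print('Sorted by index')
print(survey.sort_index())  # sort a DataFrame by its index
print('Sorted by the person column')
print(survey.sort_values(by = "person",ascending=False))  # sort a DataFrame by a column; ascending controls ascending vs. descending order
print('')
print(person_survey.sort_index())  # 19 rows after merging: ident is unique in person but repeats in survey; the two survey rows with a missing person have no match in person, and danforth never appears in survey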


      ident   personal    family
0      dyer    William      Dyer
1        pb      Frank   Pabodie
2      lake   Anderson      Lake
3       roe  Valentina   Roerich
4  danforth      Frank  Danforth

Sorted by index
taken person quant reading
0 619 dyer rad 9.82
1 619 dyer sal 0.13
2 622 dyer rad 7.80
3 622 dyer sal 0.09
4 734 pb rad 8.41
5 734 lake sal 0.05
6 734 pb temp -21.50
7 735 pb rad 7.22
8 735 NaN sal 0.06
9 735 NaN temp -26.00
10 751 pb rad 4.35
11 751 pb temp -18.50
12 751 lake sal 0.10
13 752 lake rad 2.19
14 752 lake sal 0.09
15 752 lake temp -16.00
16 752 roe sal 41.60
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.50
20 844 roe rad 11.25

Sorted by the person column
taken person quant reading
20 844 roe rad 11.25
19 837 roe sal 22.50
16 752 roe sal 41.60
6 734 pb temp -21.50
10 751 pb rad 4.35
7 735 pb rad 7.22
11 751 pb temp -18.50
4 734 pb rad 8.41
5 734 lake sal 0.05
12 751 lake sal 0.10
13 752 lake rad 2.19
14 752 lake sal 0.09
15 752 lake temp -16.00
17 837 lake rad 1.46
18 837 lake sal 0.21
1 619 dyer sal 0.13
3 622 dyer sal 0.09
2 622 dyer rad 7.80
0 619 dyer rad 9.82
8 735 NaN sal 0.06
9 735 NaN temp -26.00

ident personal family taken person quant reading
0 dyer William Dyer 619 dyer rad 9.82
1 dyer William Dyer 619 dyer sal 0.13
2 dyer William Dyer 622 dyer rad 7.80
3 dyer William Dyer 622 dyer sal 0.09
4 pb Frank Pabodie 734 pb rad 8.41
5 pb Frank Pabodie 734 pb temp -21.50
6 pb Frank Pabodie 735 pb rad 7.22
7 pb Frank Pabodie 751 pb rad 4.35
8 pb Frank Pabodie 751 pb temp -18.50
9 lake Anderson Lake 734 lake sal 0.05
10 lake Anderson Lake 751 lake sal 0.10
11 lake Anderson Lake 752 lake rad 2.19
12 lake Anderson Lake 752 lake sal 0.09
13 lake Anderson Lake 752 lake temp -16.00
14 lake Anderson Lake 837 lake rad 1.46
15 lake Anderson Lake 837 lake sal 0.21
16 roe Valentina Roerich 752 roe sal 41.60
17 roe Valentina Roerich 837 roe sal 22.50
18 roe Valentina Roerich 844 roe rad 11.25
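The merged result has fewer rows than `survey` because rows without a matching key on the other side are dropped. To see which rows fail to match, one option is an outer join with `indicator=True`, which adds a `_merge` column. The sketch below uses trimmed-down stand-ins for the two tables (a few illustrative rows, not the full CSV files).

```python
import pandas as pd

# Trimmed-down stand-ins for survey_person.csv and survey_survey.csv
person = pd.DataFrame({
    "ident": ["dyer", "pb", "danforth"],
    "family": ["Dyer", "Pabodie", "Danforth"],
})
survey = pd.DataFrame({
    "taken": [619, 734, 735],
    "person": ["dyer", "pb", None],
    "reading": [9.82, 8.41, 0.06],
})

# Default inner join: only keys present on both sides survive
inner = person.merge(survey, left_on="ident", right_on="person")

# Outer join with indicator=True: keep everything and record where each row came from
outer = person.merge(
    survey, left_on="ident", right_on="person", how="outer", indicator=True
)
print(outer[["ident", "person", "_merge"]])
```

Rows marked `left_only` are people with no survey entries, and rows marked `right_only` are survey entries with no matching person.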

print(site)
print('')
print(visited)
site_visited = visited.merge(site,left_on='site',right_on='name')
print('')
print(site_visited)

    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40

ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22

ident site dated name lat long
0 619 DR-1 1927-02-08 DR-1 -49.85 -128.57
1 622 DR-1 1927-02-10 DR-1 -49.85 -128.57
2 844 DR-1 1932-03-22 DR-1 -49.85 -128.57
3 734 DR-3 1939-01-07 DR-3 -47.15 -126.72
4 735 DR-3 1930-01-12 DR-3 -47.15 -126.72
5 751 DR-3 1930-02-26 DR-3 -47.15 -126.72
6 752 DR-3 NaN DR-3 -47.15 -126.72
7 837 MSK-4 1932-01-14 MSK-4 -48.87 -123.40
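This merge is many-to-one: several visits can point at the same site, while each site name appears only once. If that assumption matters, `merge` can check it through its `validate` parameter. The following is a sketch on small stand-in tables; the duplicated site row is invented purely to trigger the check.

```python
import pandas as pd

# Trimmed-down stand-ins for survey_site.csv and survey_visited.csv
site = pd.DataFrame({"name": ["DR-1", "DR-3"], "lat": [-49.85, -47.15]})
visited = pd.DataFrame({"ident": [619, 622, 734], "site": ["DR-1", "DR-1", "DR-3"]})

# validate="many_to_one" asserts that the right-hand keys are unique
site_visited = visited.merge(
    site, left_on="site", right_on="name", validate="many_to_one"
)
print(site_visited)

# Hypothetical bad input: the DR-1 row is duplicated on the right-hand side
bad_site = pd.concat([site, site.iloc[[0]]], ignore_index=True)
try:
    visited.merge(bad_site, left_on="site", right_on="name", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("duplicate key detected:", raised)
```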