Overview
- Data cleaning is a key step in data analysis; it directly affects all the processing that follows.
- Does the data need to be modified? Is there anything that has to be fixed? How should the data be adjusted so that it suits the analysis and mining to come?
- Cleaning is an iterative process; in real projects these operations may need to be run more than once.
- Handling missing data: pd.fillna(), pd.dropna() — see the sketch below.
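fillna and dropna are in fact Series/DataFrame methods; a minimal sketch of their typical use (the frame and the fill value 0 are illustrative assumptions, not from the original):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, 6.0]})
print(df.dropna())   # drop every row that contains NaN
print(df.fillna(0))  # fill NaN with a chosen value (0 is arbitrary here)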
1. Joining data (pd.merge)
- pd.merge
- Joins the rows of different DataFrames on one or more keys
- Analogous to a database join
Sample code:
import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2': np.random.randint(0, 10, 3)})
print(df_obj1)
print(df_obj2)
Output:
# print(df_obj1)
   data1 key
0      8   b
1      8   b
2      3   a
3      5   c
4      4   a
5      9   a
6      6   b
# print(df_obj2)
   data2 key
0      9   a
1      0   b
2      3   d
1. By default, the name of the overlapping column is used as the "foreign key" for the join
Sample code:
# by default, the overlapping column name is used as the "foreign key"
print(pd.merge(df_obj1, df_obj2))
Output:
   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9
2. on explicitly specifies the "foreign key"
Sample code:
# on explicitly specifies the "foreign key"
print(pd.merge(df_obj1, df_obj2, on='key'))
Output:
   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9
3. left_on specifies the "foreign key" of the left DataFrame; right_on specifies the "foreign key" of the right DataFrame
Sample code:
# left_on and right_on specify the "foreign keys" of the left and right DataFrames
# rename the key columns first
df_obj1 = df_obj1.rename(columns={'key': 'key1'})
df_obj2 = df_obj2.rename(columns={'key': 'key2'})
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))
Output:
   data1 key1  data2 key2
0      8    b      0    b
1      8    b      0    b
2      6    b      0    b
3      3    a      9    a
4      4    a      9    a
5      9    a      9    a
The default is an inner join (inner): the keys in the result are the intersection of the two key sets.
how specifies the join type.
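Passing how='inner' explicitly is equivalent to the default; a minimal sketch reusing the renamed frames from the example above:
# same result as the default merge shown earlier
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='inner'))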
4. Outer join (outer): the keys in the result are the union
Sample code:
# outer join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))
Output:
   data1 key1  data2 key2
0    8.0    b    0.0    b
1    8.0    b    0.0    b
2    6.0    b    0.0    b
3    3.0    a    9.0    a
4    4.0    a    9.0    a
5    9.0    a    9.0    a
6    5.0    c    NaN  NaN
7    NaN  NaN    3.0    d
5. Left join (left)
Sample code:
# left join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))
Output:
   data1 key1  data2 key2
0      8    b    0.0    b
1      8    b    0.0    b
2      3    a    9.0    a
3      5    c    NaN  NaN
4      4    a    9.0    a
5      9    a    9.0    a
6      6    b    0.0    b
6. Right join (right)
Sample code:
# right join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))
Output:
   data1 key1  data2 key2
0    8.0    b      0    b
1    8.0    b      0    b
2    6.0    b      0    b
3    3.0    a      9    a
4    4.0    a      9    a
5    9.0    a      9    a
6    NaN  NaN      3    d
7. Handling duplicate column names
The suffixes parameter; defaults to ('_x', '_y')
Sample code:
# handling duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data': np.random.randint(0, 10, 3)})
print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
Output:
   data_left key  data_right
0          9   b           1
1          5   b           1
2          1   b           1
3          2   a           8
4          2   a           8
5          5   a           8
8. Joining on the index
left_index=True or right_index=True (a sketch combining the two follows the output below)
Sample code:
# joining on the index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'data2': np.random.randint(0, 10, 3)}, index=['a', 'b', 'd'])
print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))
Output:
   data1 key  data2
0      3   b      6
1      4   b      6
6      8   b      6
2      6   a      0
4      3   a      0
5      0   a      0
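When the key lives in the index on both sides, left_index=True and right_index=True can be combined; a minimal sketch with hypothetical frames (not from the original):
left = pd.DataFrame({'data1': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'data2': [4, 5]}, index=['a', 'b'])
print(pd.merge(left, right, left_index=True, right_index=True))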
2. Concatenating data (pd.concat)
- Combines multiple objects along an axis
1. NumPy's concatenate
np.concatenate
Sample code:
import numpy as np
import pandas as pd
arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))
print(arr1)
print(arr2)
print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))
Output:
# print(arr1)
[[3 3 0 8]
[2 0 3 1]
[4 8 8 2]]
# print(arr2)
[[6 8 7 3]
[1 6 8 7]
[1 4 7 1]]
# print(np.concatenate([arr1, arr2]))
[[3 3 0 8]
[2 0 3 1]
[4 8 8 2]
[6 8 7 3]
[1 6 8 7]
[1 4 7 1]]
# print(np.concatenate([arr1, arr2], axis=1))
[[3 3 0 8 6 8 7 3]
[2 0 3 1 1 6 8 7]
[4 8 8 2 1 4 7 1]]
2. pd.concat
- Mind the axis direction; the default is axis=0
- join specifies how indexes are combined; the default is 'outer'
- When concatenating Series, check whether the row indexes overlap
1) No overlapping index values
Sample code:
# no overlapping index values
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))
print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
Output:
# print(ser_obj1)
0    1
1    8
2    4
3    9
4    4
dtype: int64
# print(ser_obj2)
5    2
6    6
7    4
8    2
dtype: int64
# print(ser_obj3)
9     6
10    2
11    7
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0     1
1     8
2     4
3     9
4     4
5     2
6     6
7     4
8     2
9     6
10    2
11    7
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
# (values differ from the run above because the data are randomly generated)
      0    1    2
0   1.0  NaN  NaN
1   5.0  NaN  NaN
2   3.0  NaN  NaN
3   2.0  NaN  NaN
4   4.0  NaN  NaN
5   NaN  9.0  NaN
6   NaN  8.0  NaN
7   NaN  3.0  NaN
8   NaN  6.0  NaN
9   NaN  NaN  2.0
10  NaN  NaN  3.0
11  NaN  NaN  3.0
2) Overlapping index values
Sample code:
# overlapping index values
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))
print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
Output:
# print(ser_obj1)
0    0
1    3
2    7
3    2
4    5
dtype: int64
# print(ser_obj2)
0    5
1    1
2    9
3    9
dtype: int64
# print(ser_obj3)
0    8
1    7
2    9
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0    0
1    3
2    7
3    2
4    5
0    5
1    1
2    9
3    9
0    8
1    7
2    9
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
# join='inner' keeps only index labels common to all inputs, dropping rows/columns that would contain NaN
   0  1  2
0  0  5  8
1  3  1  7
2  7  9  9
3) When concatenating DataFrames, check both the row and column indexes for overlap
Sample code:
df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
                       columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
                       columns=['C', 'D'])
print(df_obj1)
print(df_obj2)
print(pd.concat([df_obj1, df_obj2]))
print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
Output:
# print(df_obj1)
   A  B
a  3  3
b  5  4
c  8  6
# print(df_obj2)
   C  D
a  1  9
b  6  8
# print(pd.concat([df_obj1, df_obj2]))
     A    B    C    D
a  3.0  3.0  NaN  NaN
b  5.0  4.0  NaN  NaN
c  8.0  6.0  NaN  NaN
a  NaN  NaN  1.0  9.0
b  NaN  NaN  6.0  8.0
# print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
   A  B  C  D
a  3  3  1  9
b  5  4  6  8
3. Reshaping data
1. stack
- Pivots the column index into the row index, producing a hierarchical index
- DataFrame -> Series
Sample code:
import numpy as np
import pandas as pd
df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
print(df_obj)
stacked = df_obj.stack()
print(stacked)
Output:
# print(df_obj)
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2
# print(stacked)
0  data1    7
   data2    9
1  data1    7
   data2    8
2  data1    8
   data2    9
3  data1    4
   data2    1
4  data1    1
   data2    2
dtype: int64
2. unstack
- Expands a level of the hierarchical index back into columns
- Series -> DataFrame
- By default operates on the innermost level, i.e. level=-1
Sample code:
# by default operates on the innermost level
print(stacked.unstack())
# level selects which level of the index to operate on
print(stacked.unstack(level=0))
Output:
# print(stacked.unstack())
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2
# print(stacked.unstack(level=0))
       0  1  2  3  4
data1  7  7  8  4  1
data2  9  8  9  1  2
4. Transforming data
4.1. Handling duplicate data
1. duplicated() returns a boolean Series indicating whether each row is a duplicate
Sample code:
import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
                       'data2': np.random.randint(0, 4, 8)})
print(df_obj)
print(df_obj.duplicated())
Output:
# print(df_obj)
  data1  data2
0     a      3
1     a      2
2     a      3
3     a      3
4     b      1
5     b      0
6     b      3
7     b      0
# print(df_obj.duplicated())
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7     True
dtype: bool
2. drop_duplicates() filters out duplicate rows
- By default all columns are compared
- A subset of columns to compare can be specified (a sketch of the related keep parameter follows the output below)
Sample code:
print(df_obj.drop_duplicates())
print(df_obj.drop_duplicates('data2'))
Output:
# print(df_obj.drop_duplicates())
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
6     b      3
# print(df_obj.drop_duplicates('data2'))
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
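Both duplicated() and drop_duplicates() also accept a keep parameter ('first' by default, 'last', or False to drop every copy); a minimal sketch reusing df_obj from above:
# keep the last occurrence of each data2 value instead of the first
print(df_obj.drop_duplicates('data2', keep='last'))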
3. map: transform data with a passed-in function
- Series.map applies the passed-in function to each element (a DataFrame counterpart is sketched after the output below)
Sample code:
ser_obj = pd.Series(np.random.randint(0,10,10))
print(ser_obj)
print(ser_obj.map(lambda x : x ** 2))
Output:
# print(ser_obj)
0    1
1    4
2    8
3    6
4    8
5    6
6    6
7    4
8    7
9    3
dtype: int64
# print(ser_obj.map(lambda x : x ** 2))
0     1
1    16
2    64
3    36
4    64
5    36
6    36
7    16
8    49
9     9
dtype: int64
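For row- or column-wise transformations on a DataFrame, the analogous tools are apply and applymap; a minimal sketch under assumed random data, not part of the original example (note that in pandas >= 2.1 applymap is renamed DataFrame.map):
df = pd.DataFrame(np.random.randint(0, 10, (3, 2)), columns=['data1', 'data2'])
# apply passes a whole column (default, axis=0) or row (axis=1) as a Series
print(df.apply(lambda col: col.max() - col.min()))
print(df.apply(lambda row: row.sum(), axis=1))
# applymap passes each element
print(df.applymap(lambda x: x ** 2))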
4.2. Replacing values
replace substitutes entries according to their values
Sample code:
# replace a single value with a single value
print(ser_obj.replace(1, -100))
# replace multiple values with one value
print(ser_obj.replace([6, 8], -100))
# replace multiple values with multiple values
print(ser_obj.replace([4, 7], [-100, -200]))
Output:
# print(ser_obj.replace(1, -100))
0   -100
1      4
2      8
3      6
4      8
5      6
6      6
7      4
8      7
9      3
dtype: int64
# print(ser_obj.replace([6, 8], -100))
0      1
1      4
2   -100
3   -100
4   -100
5   -100
6   -100
7      4
8      7
9      3
dtype: int64
# print(ser_obj.replace([4, 7], [-100, -200]))
0      1
1   -100
2      8
3      6
4      8
5      6
6      6
7   -100
8   -200
9      3
dtype: int64
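replace also accepts a dict mapping old values to new ones, equivalent to the two-list form above; a brief sketch reusing ser_obj:
print(ser_obj.replace({4: -100, 7: -200}))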
III. Global food data analysis
Project reference: https://www.kaggle.com/bhouwens/d/openfoodfacts/world-food-facts/how-much-sugar-do-we-eat/discussion
# -*- coding: utf-8 -*-
# working with zip archives
import zipfile
import os
import pandas as pd
import matplotlib.pyplot as plt


def unzip(zip_filepath, dest_path):
    """Extract a zip file."""
    with zipfile.ZipFile(zip_filepath) as zf:
        zf.extractall(path=dest_path)


def get_dataset_filename(zip_filepath):
    """Get the name of the dataset file inside the zip."""
    with zipfile.ZipFile(zip_filepath) as zf:
        return zf.namelist()[0]


def main():
    """Main entry point."""
    # declare variables
    dataset_path = './data'               # dataset directory
    zip_filename = 'open-food-facts.zip'  # zip file name
    zip_filepath = os.path.join(dataset_path, zip_filename)          # zip file path
    dataset_filename = get_dataset_filename(zip_filepath)            # dataset file name (inside the zip)
    dataset_filepath = os.path.join(dataset_path, dataset_filename)  # dataset file path

    print('Extracting zip...', end='')
    unzip(zip_filepath, dataset_path)
    print('done.')

    # read the data
    data = pd.read_csv(dataset_filepath, usecols=['countries_en', 'additives_n'])

    # analyze the number of food additives per country
    # 1. clean the data
    # drop rows with missing values
    data = data.dropna()  # or data.dropna(inplace=True)
    # lowercase the country names
    data['countries_en'] = data['countries_en'].str.lower()

    # 2. group and aggregate
    country_additives = data['additives_n'].groupby(data['countries_en']).mean()

    # 3. sort by value, descending
    result = country_additives.sort_values(ascending=False)

    # 4. plot the top 10 with pandas
    result.iloc[:10].plot.bar()
    plt.show()

    # 5. save the result
    result.to_csv('./country_additives.csv')

    # delete the extracted file to free space (optional)
    if os.path.exists(dataset_filepath):
        os.remove(dataset_filepath)


if __name__ == '__main__':
    main()