Data Cleaning for Python Data Analysis

This post, by the Kaopuke (靠谱客) blogger 文艺身影, walks through data cleaning for data analysis in Python and is shared here as a reference.

# Missing values

import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

Checking for missing values:

string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

# None is also treated as NA
string_data[0] = None

string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

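df = pd.DataFrame({'dropna': 'drop missing values', 'fillna': 'fill missing values', 'isnull': 'test for missing', 'notnull': 'test for non-missing'}, index=['1', '2', '3', '4'])

df.iloc[1]  # the main NA-handling methods

dropna      drop missing values
fillna      fill missing values
isnull         test for missing
notnull    test for non-missing
Name: 2, dtype: object

## Filtering out missing values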

from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])

data.dropna()  # drop missing values

0    1.0
2    3.5
4    7.0
dtype: float64

data[data.notnull()]  # equivalent form using boolean indexing

0    1.0
2    3.5
4    7.0
dtype: float64

### DataFrame operations

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

data

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0

cleaned = data.dropna()  # returns a new object; data is left unchanged
data

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0

cleaned

	0	1	2
0	1.0	6.5	3.0

#### Dropping rows that are all NA

data.dropna(how='all')

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
3	NaN	6.5	3.0

#### Dropping columns

data[4] = NA
data

	0	1	2	4
0	1.0	6.5	3.0	NaN
1	1.0	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN
3	NaN	6.5	3.0	NaN

data.dropna(axis=1, how='all')  # drop columns that are all NA

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0

#### Dropping rows by the number of NA values

df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

	0	1	2
0	-0.088513	NaN	NaN
1	0.637624	NaN	NaN
2	0.054991	NaN	0.795304
3	-1.069859	NaN	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451

df.dropna()  # drop every row that contains any NA value

	0	1	2
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451

df.dropna(thresh=2)  # keep only rows with at least 2 non-NA values

	0	1	2
2	0.054991	NaN	0.795304
3	-1.069859	NaN	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451
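
The `thresh` argument is easy to misread, so here is a minimal sketch of its semantics on a small hand-made frame. The frame and its values are hypothetical, chosen only so that the result is reproducible. `thresh=2` keeps a row when it has at least two non-NA values; it does not count NA values.

```python
import numpy as np
import pandas as pd

# Small hand-made frame (hypothetical values) so the result is reproducible.
frame = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Number of non-NA values in each row.
non_na_per_row = frame.notna().sum(axis=1)

# thresh=2 keeps rows that have at least two non-NA values.
kept = frame.dropna(thresh=2)
```

Here rows 0 and 2 have fewer than two non-NA values, so only rows 1 and 3 survive.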

### Filling in missing values

df.fillna(0)  # fill with 0

	0	1	2
0	-0.088513	0.000000	0.000000
1	0.637624	0.000000	0.000000
2	0.054991	0.000000	0.795304
3	-1.069859	0.000000	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451

df.fillna({1: 0.5, 2: 0})  # a different fill value for each column

	0	1	2
0	-0.088513	0.500000	0.000000
1	0.637624	0.500000	0.000000
2	0.054991	0.500000	0.795304
3	-1.069859	0.500000	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451
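
Besides a dict, `fillna` also accepts a Series keyed by column label, which makes a per-column statistic a convenient fill value. The sketch below uses a hypothetical two-column frame and fills each column with its own mean; it illustrates the idea and is not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column.
frame = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 10.0, 20.0]})

# frame.mean() is a Series indexed by column name (NA values are skipped),
# so fillna uses each column's own mean as its fill value.
filled = frame.fillna(frame.mean())
```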

_ = df.fillna(0, inplace=True)  # modifies df in place (and returns None)
df

	0	1	2
0	-0.088513	0.000000	0.000000
1	0.637624	0.000000	0.000000
2	0.054991	0.000000	0.795304
3	-1.069859	0.000000	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451

#### Forward fill

df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

	0	1	2
0	-0.251059	-1.375272	0.538068
1	0.329612	1.157921	0.508928
2	-0.118987	NaN	0.694185
3	0.724308	NaN	-0.620832
4	0.045673	NaN	NaN
5	-0.317643	NaN	NaN

df.ffill()  # forward fill; older pandas code spells this df.fillna(method='ffill')

	0	1	2
0	-0.251059	-1.375272	0.538068
1	0.329612	1.157921	0.508928
2	-0.118987	1.157921	0.694185
3	0.724308	1.157921	-0.620832
4	0.045673	1.157921	-0.620832
5	-0.317643	1.157921	-0.620832

df.ffill(limit=2)  # fill at most 2 consecutive NA values

	0	1	2
0	-0.251059	-1.375272	0.538068
1	0.329612	1.157921	0.508928
2	-0.118987	1.157921	0.694185
3	0.724308	1.157921	-0.620832
4	0.045673	NaN	-0.620832
5	-0.317643	NaN	-0.620832
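
Forward filling has a mirror image, backward filling, and `limit` applies to both. A minimal sketch on a hypothetical Series, showing which gaps each direction reaches when `limit=1`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill(limit=1)   # fills only the first NA after 1.0
backward = s.bfill(limit=1)  # fills only the last NA before 5.0
```

In both cases two of the three gaps remain, because each run of NA values is longer than the limit.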

#### Filling with a computed value

data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())  # fill with the mean

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

fillna arguments:

| Argument | Description |
| --- | --- |
| value | Scalar value or dict-like object to use to fill missing values |
| method | Interpolation; by default 'ffill' if function called with no other arguments |
| axis | Axis to fill on; default axis=0 |
| inplace | Modify the calling object without producing a copy |
| limit | For forward and backward filling, maximum number of consecutive periods to fill |

# Data transformation

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

data

	k1	k2
0	one	1
1	two	1
2	one	2
3	two	3
4	one	3
5	two	4
6	two	4

### Checking whether a row duplicates an earlier one

data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

#### Dropping duplicates

data.drop_duplicates()

	k1	k2
0	one	1
1	two	1
2	one	2
3	two	3
4	one	3
5	two	4

#### Deduplicating on specific columns

data['v1'] = range(7)
data.drop_duplicates(['k1'])

	k1	k2	v1
0	one	1	0
1	two	1	1

#### Keeping the first or the last duplicate

data.drop_duplicates(['k1', 'k2'])  # keeps the first occurrence by default

	k1	k2	v1
0	one	1	0
1	two	1	1
2	one	2	2
3	two	3	3
4	one	3	4
5	two	4	5

data.drop_duplicates(['k1', 'k2'], keep='last')  # keep the last occurrence instead

	k1	k2	v1
0	one	1	0
1	two	1	1
2	one	2	2
3	two	3	3
4	one	3	4
6	two	4	6
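
When inspecting duplicates rather than dropping them, it can help to see every member of a duplicate group. The sketch below rebuilds the `k1`/`k2` frame from this section and uses `keep=False`, which flags all copies instead of only the repeats.

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

# keep=False flags every member of a duplicate group, not just the repeats.
all_dupes = data[data.duplicated(keep=False)]
```

For this frame the result contains rows 5 and 6, the two `('two', 4)` entries.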

## Transforming data using a function or mapping

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data

	food	ounces
0	bacon	4.0
1	pulled pork	3.0
2	bacon	12.0
3	Pastrami	6.0
4	corned beef	7.5
5	Bacon	8.0
6	pastrami	3.0
7	honey ham	5.0
8	nova lox	6.0

### Mapping

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

#### Normalizing case

lowercased = data.food.str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

#### Adding a new column by applying the mapping

data['animal'] = lowercased.map(meat_to_animal)

data

	food	ounces	animal
0	bacon	4.0	pig
1	pulled pork	3.0	pig
2	bacon	12.0	pig
3	Pastrami	6.0	cow
4	corned beef	7.5	cow
5	Bacon	8.0	pig
6	pastrami	3.0	cow
7	honey ham	5.0	pig
8	nova lox	6.0	salmon
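
One detail worth knowing about `Series.map` with a dict: keys that are absent from the dict become NaN instead of raising an error. The following sketch uses a deliberately incomplete, hypothetical mapping to show this, and then fills the gaps with a placeholder label (the label `"unknown"` is an arbitrary choice).

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
food = pd.Series(["Bacon", "pastrami", "tofu"])  # "tofu" has no mapping

# Series.map with a dict leaves unmapped keys as NaN instead of raising.
animal = food.str.lower().map(meat_to_animal)

# Replace the gaps with an explicit placeholder label (a hypothetical choice).
animal_filled = animal.fillna("unknown")
```

By contrast, indexing the dict directly inside a lambda raises `KeyError` for an unmapped value.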

#### Passing a function instead

data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

### Replacing values

data = pd.Series([1., -999., 2., -999., -1000., 3.])

data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

data.replace([-999, -1000], [np.nan, 0])  # replace several values at once

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

data.replace({-999: np.nan, -1000: 0})  # equivalent dict form

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### Renaming axis indexes

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

#### Mapping

transform = lambda x: x[:4].upper()

data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
New York	8	9	10	11

data.index = data.index.map(transform)  # assign back to modify data itself

data

	one	two	three	four
OHIO	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11

### rename

data.rename(index=str.title, columns=str.upper)

	ONE	TWO	THREE	FOUR
Ohio	0	1	2	3
Colo	4	5	6	7
New	8	9	10	11

data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})  # rename specific labels

	one	two	peekaboo	four
INDIANA	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11

data.rename(index={'OHIO': 'INDIANA'}, inplace=True)  # modifies data in place

data

	one	two	three	four
INDIANA	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11

## Binning and quantiles

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')
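
The integer codes are often all that is needed downstream. As a small sketch under that assumption, `pd.cut(..., labels=False)` returns the bin numbers directly, and counting the codes gives a tally in bin order rather than in order of frequency.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)

# labels=False skips the interval labels and returns the integer bin numbers.
codes = pd.cut(ages, bins, labels=False)

# Count members per bin, sorted by bin number instead of by count.
counts = pd.Series(cats.codes).value_counts().sort_index()
```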

pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

##### Choosing which side of the interval is closed

pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

##### Naming the bins

group_names = ['Y', 'YA', 'WA', 'S']

pd.cut(ages, bins, labels=group_names)

[Y, Y, Y, YA, Y, ..., YA, S, WA, WA, YA]
Length: 12
Categories (4, object): [Y < YA < WA < S]

#### Equal-width bins

data = np.random.rand(20)
pd.cut(data, 4, precision=2)  # precision limits the decimal places shown in the bin edges

[(0.49, 0.74], (0.25, 0.49], (0.49, 0.74], (0.25, 0.49], (-3.6e-05, 0.25], ..., (0.49, 0.74], (0.25, 0.49], (0.74, 0.99], (0.74, 0.99], (0.74, 0.99]]
Length: 20
Categories (4, interval[float64]): [(-3.6e-05, 0.25] < (0.25, 0.49] < (0.49, 0.74] < (0.74, 0.99]]

#### qcut

data = np.random.randn(1000)
cats = pd.qcut(data, 4)  # cut into quartiles
cats

[(-2.9539999999999997, -0.658], (0.669, 3.876], (-2.9539999999999997, -0.658], (-0.0173, 0.669], (-0.658, -0.0173], ..., (0.669, 3.876], (-0.0173, 0.669], (-0.658, -0.0173], (-0.0173, 0.669], (0.669, 3.876]]
Length: 1000
Categories (4, interval[float64]): [(-2.9539999999999997, -0.658] < (-0.658, -0.0173] < (-0.0173, 0.669] < (0.669, 3.876]]

pd.value_counts(cats)

(0.669, 3.876]                   250
(-0.0173, 0.669]                 250
(-0.658, -0.0173]                250
(-2.9539999999999997, -0.658]    250
dtype: int64

#### Custom quantiles

pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], ..., (-0.0173, 1.233], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-0.0173, 1.233]]
Length: 1000
Categories (4, interval[float64]): [(-2.9539999999999997, -1.236] < (-1.236, -0.0173] < (-0.0173, 1.233] < (1.233, 3.876]]

### Detecting and filtering outliers

data = pd.DataFrame(np.random.randn(1000, 4))

data.describe()

	0	1	2	3
count	1000.000000	1000.000000	1000.000000	1000.000000
mean	-0.075999	0.020124	-0.002652	-0.024497
std	0.976555	0.955685	0.986620	1.010481
min	-3.955131	-3.403433	-3.214173	-3.405990
25%	-0.752617	-0.605502	-0.651444	-0.677522
50%	-0.077978	0.009801	-0.037225	-0.004654
75%	0.587452	0.686600	0.613769	0.641942
max	3.054668	3.369481	3.081614	2.983911

#### Filtering by condition

col = data[2]

col[np.abs(col) > 3]

824    3.081614
938   -3.214173
Name: 2, dtype: float64

#### Finding the rows that contain outliers

data[(np.abs(data) > 3).any(axis=1)]

	0	1	2	3
233	-3.955131	0.644402	0.906170	0.871523
268	3.054668	1.215981	0.069664	-0.956292
291	-0.989445	-0.333444	-0.302554	-3.118577
419	-0.334358	-3.403433	-1.207032	-0.812907
528	0.716480	3.369481	-0.067002	-0.062219
682	0.984525	-0.013239	0.127607	-3.405990
769	0.518519	-0.263145	-0.192145	-3.161299
822	-1.235697	-3.032942	0.594252	1.165504
824	0.945307	-2.096176	3.081614	1.307574
832	-3.816917	0.758124	-0.424847	0.016269
938	1.172387	0.353721	-3.214173	-0.088637

#### Capping the values

data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

np.sign(data).head()  # 1 or -1, depending on the sign of each value

	0	1	2	3
0	1.0	-1.0	1.0	-1.0
1	-1.0	1.0	1.0	1.0
2	-1.0	1.0	-1.0	1.0
3	-1.0	-1.0	1.0	-1.0
4	-1.0	1.0	-1.0	1.0
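
The sign-based assignment caps values to the interval [-3, 3]. As an alternative sketch, `DataFrame.clip` states the same bounds directly. The seed and the use of `np.random.default_rng` below are assumptions made only to keep the example reproducible; they do not reproduce the numbers shown in this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Sign-based capping, as in the text, applied to a copy.
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# DataFrame.clip expresses the same bounds directly.
clipped = data.clip(lower=-3, upper=3)
```

Both results agree element by element; `clip` simply avoids the intermediate boolean mask.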

data.iloc[938]  # verify that the outlier was capped

0    1.172387
1    0.353721
2   -3.000000
3   -0.088637
Name: 938, dtype: float64

## Permutation and random sampling

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

sampler = np.random.permutation(5)

sampler

array([3, 2, 1, 0, 4])

df

	0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19

df.take(sampler)

	0	1	2	3
3	12	13	14	15
2	8	9	10	11
1	4	5	6	7
0	0	1	2	3
4	16	17	18	19

#### Random sampling

df.sample(n=3)

	0	1	2	3
1	4	5	6	7
3	12	13	14	15
0	0	1	2	3
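
Because `sample` draws at random, its output changes from run to run. A minimal sketch of making a draw repeatable with the `random_state` parameter; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# random_state fixes the draw; 42 is an arbitrary illustrative seed.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
```

With the same seed, `first` and `second` contain the same rows in the same order.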

#### Sampling with replacement

choicesv = pd.Series([5, 7, -1, 6, 4])

draws = choicesv.sample(n=10, replace=True)

draws

4    4
0    5
3    6
1    7
1    7
3    6
4    4
1    7
2   -1
1    7
dtype: int64

### Computing Indicator/Dummy Variables

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

df

	data1	key
0	0	b
1	1	b
2	2	a
3	3	c
4	4	a
5	5	b

pd.get_dummies(df.key)

	a	b	c
0	0	1	0
1	0	1	0
2	1	0	0
3	0	0	1
4	1	0	0
5	0	1	0

dummies = pd.get_dummies(df.key, prefix='key')

df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

	data1	key_a	key_b	key_c
0	0	0	1	0
1	1	0	1	0
2	2	1	0	0
3	3	0	0	1
4	4	1	0	0
5	5	0	1	0

#### Rows that belong to multiple categories

mnames = ['movie_id', 'title', 'genres']

# engine='python' is needed because the '::' separator is longer than one character
movies = pd.read_table('/Users/meininghang/Downloads/pydata-book-2nd-edition/datasets/movielens/movies.dat',
                       sep='::', header=None, names=mnames, engine='python')

movies[:10]

	movie_id	title	genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy
5	6	Heat (1995)	Action|Crime|Thriller
6	7	Sabrina (1995)	Comedy|Romance
7	8	Tom and Huck (1995)	Adventure|Children's
8	9	Sudden Death (1995)	Action
9	10	GoldenEye (1995)	Action|Adventure|Thriller

#### Genres

all_genres = []

for x in movies.genres:
    all_genres.extend(x.split('|'))

genres = pd.unique(all_genres)
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

zero_matrix = np.zeros((len(movies), len(genres)))

dummies = pd.DataFrame(zero_matrix, columns=genres)

gen = movies.genres[0]

gen.split('|')

['Animation', "Children's", 'Comedy']

dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2])

for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

movies_windic = movies.join(dummies.add_prefix('Genre_'))

movies_windic.iloc[0]

movie_id           1
title              Toy Story (1995)
genres             Animation|Children's|Comedy
Genre_Animation    1
Genre_Children's   1
Genre_Comedy       1
Genre_Adventure    0
Genre_Fantasy      0
Genre_Romance      0
Genre_Drama        0
Genre_Action       0
Genre_Crime        0
Genre_Thriller     0
Genre_Horror       0
Genre_Sci-Fi       0
Genre_Documentary  0
Genre_War          0
Genre_Musical      0
Genre_Mystery      0
Genre_Film-Noir    0
Genre_Western      0
Name: 0, dtype: object

#### Combining get_dummies with cut

np.random.seed(12345)

values = np.random.rand(10)

values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

pd.get_dummies(pd.cut(values, bins))

	(0.0, 0.2]	(0.2, 0.4]	(0.4, 0.6]	(0.6, 0.8]	(0.8, 1.0]
0	0	0	0	0	1
1	0	1	0	0	0
2	1	0	0	0	0
3	0	1	0	0	0
4	0	0	1	0	0
5	0	0	1	0	0
6	0	0	0	0	1
7	0	0	0	1	0
8	0	0	0	1	0
9	0	0	0	1	0
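
For the multi-membership case in this section (pipe-separated genres), pandas also offers `Series.str.get_dummies`, which splits and encodes in one step. The sketch below runs on three made-up rows in the same layout rather than on the MovieLens file, so the data are hypothetical.

```python
import pandas as pd

# Three rows in the same pipe-separated layout as the genres column.
genres = pd.Series(["Animation|Comedy", "Comedy|Romance", "Action"])

# One indicator column per distinct token, split on '|'.
indicators = genres.str.get_dummies(sep="|").add_prefix("Genre_")
```

The resulting columns are sorted alphabetically, which differs from the order-of-first-appearance layout produced by the explicit loop.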

# String manipulation

## Built-in string methods

val = 'a,b,\nguido'

val.split(',')  # split on the delimiter

['a', 'b', '\nguido']

### Trimming

pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

### Joining

f, s, t = pieces

f + '::' + s + '::' + t

'a::b::guido'
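
Concatenating with `+` works for three known pieces, but it does not scale to a list of arbitrary length. The usual idiom is `str.join`, sketched here with the same pieces:

```python
pieces = ["a", "b", "guido"]

# The delimiter string joins any iterable of strings.
joined = "::".join(pieces)
```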

### Locating substrings

'guido' in val

True

val.index(',')

1

val.find(':')

-1
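
The difference between `find` and `index` only shows up when the substring is missing. A short sketch of both behaviours on the same string:

```python
val = "a,b,\nguido"

# find reports a missing substring with -1.
position = val.find(":")

# index raises ValueError for the same situation.
try:
    val.index(":")
    raised = False
except ValueError:
    raised = True
```

Use `find` when absence is an expected outcome, and `index` when absence should be treated as an error.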

### Counting

val.count(',')

2

### Replacing

val.replace(',', '::')

'a::b::\nguido'

val.replace(',', ' ')

'a b \nguido'

Built-in string methods:

| Method | Description |
| --- | --- |
| count | Return the number of non-overlapping occurrences of substring in the string. |
| endswith | Returns True if string ends with suffix. |
| startswith | Returns True if string starts with prefix. |
| join | Use string as delimiter for concatenating a sequence of other strings. |
| index | Return position of first character in substring if found in the string; raises ValueError if not found. |
| find | Return position of first character of first occurrence of substring in the string; like index, but returns -1 if not found. |
| rfind | Return position of first character of last occurrence of substring in the string; returns -1 if not found. |
| replace | Replace occurrences of string with another string. |
| strip, rstrip, lstrip | Trim whitespace, including newlines, from both sides, the right side, or the left side, respectively. |
| split | Break string into list of substrings using passed delimiter. |
| lower | Convert alphabet characters to lowercase. |
| upper | Convert alphabet characters to uppercase. |
| casefold | Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form. |
| ljust, rjust | Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width. |
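
A few of the methods in the table are easier to tell apart with a concrete string. The sketch below uses the German word "Straße" as an illustrative input, because it shows where `lower` and `casefold` differ.

```python
name = "  Straße \n"

trimmed = name.strip()           # surrounding whitespace and newline removed
lowered = trimmed.lower()        # ß is left unchanged by lower()
folded = trimmed.casefold()      # ß folds to 'ss' for caseless comparison
padded = trimmed.ljust(10, ".")  # pad on the right to a minimum width of 10
is_suffix = trimmed.endswith("ße")
```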

## Regular expressions

import re

text = 'foo\n   bar\t baz\n\tqux'

re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

### Compiling a pattern

regex = re.compile(r'\s+')

regex.split(text)

['foo', 'bar', 'baz', 'qux']

### Ways to search

regex.findall(text)

['\n   ', '\t ', '\n\t']

regex.search(text)  # returns the first match only

<_sre.SRE_Match object; span=(3, 7), match='\n   '>
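
A match object carries the position of the match, which can be used to slice the original string. The sketch below uses its own sample string (an assumption, not the exact string from this post) and also shows `finditer`, which yields one match object per occurrence.

```python
import re

# Hypothetical sample: runs of spaces and tabs between four words.
text = "foo    bar\t baz  \tqux"
regex = re.compile(r"\s+")

# search returns a match object for the first occurrence only.
m = regex.search(text)
start, end = m.span()
before, matched, after = text[:start], m.group(), text[end:]

# finditer yields a match object for every occurrence.
spans = [mm.span() for mm in regex.finditer(text)]
```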

### Email addresses

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

### Substitution

regex.sub('REDACTED', text)

'Dave REDACTED\nSteve REDACTED\nRob REDACTED\nRyan REDACTED\n'

### Groups

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')

m.groups()

('wesm', 'bright', 'net')

regex.findall(text)  # with groups, findall returns a list of tuples

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
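
Positional groups are easy to mix up once a pattern grows. A sketch of the same email pattern with named groups, using the `(?P<name>...)` syntax; the group names `username`, `domain`, and `suffix` are chosen here for illustration.

```python
import re

# Same pattern as above, with each group given a name.
pattern = (r"(?P<username>[A-Z0-9._%+-]+)@"
           r"(?P<domain>[A-Z0-9.-]+)\."
           r"(?P<suffix>[A-Z]{2,4})")
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match("wesm@bright.net")

# groupdict maps each group name to the text it matched.
parts = m.groupdict()
```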

### Using groups in sub

regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)

'Dave Username: dave, Domain: google, Suffix: com\nSteve Username: steve, Domain: gmail, Suffix: com\nRob Username: rob, Domain: gmail, Suffix: com\nRyan Username: ryan, Domain: yahoo, Suffix: com\n'

Regular expression methods:

| Method | Description |
| --- | --- |
| findall | Return all non-overlapping matching patterns in a string as a list |
| finditer | Like findall, but returns an iterator |
| match | Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None |
| search | Scan string for match to pattern, returning a match object if found; unlike match, the match can be anywhere in the string as opposed to only at the beginning |
| split | Break string into pieces at each occurrence of pattern |
| sub, subn | Replace occurrences of pattern in string with replacement expression (all by default, or at most count if given); subn also returns the number of substitutions made; use \1, \2, ... to refer to match group elements in the replacement string |
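
To make the `sub` and `subn` row concrete, here is a small sketch on a shortened, two-line version of the email text. `subn` returns the new string together with the number of substitutions, and the `count` argument of `sub` limits how many matches are replaced.

```python
import re

text = "Dave dave@google.com\nSteve steve@gmail.com\n"
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}",
                   flags=re.IGNORECASE)

# subn returns a (new_string, number_of_substitutions) tuple.
redacted, n_subs = regex.subn("REDACTED", text)

# count=1 replaces only the first match.
only_first = regex.sub("REDACTED", text, count=1)
```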

## Vectorized string functions

data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}

data = pd.Series(data)

data

Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

data.isnull()

Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

### Contains

data.str.contains('gmail')

Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
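
The NaN in that result means the output cannot be used directly as a boolean mask. One way around it, sketched below, is the `na` parameter of `str.contains`, which sets the value reported for missing entries.

```python
import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                  "Rob": "rob@gmail.com", "Wes": np.nan})

# na=False treats missing entries as "no match", so the mask has no NaN.
is_gmail = data.str.contains("gmail", na=False)
gmail_users = data[is_gmail]
```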

### Slicing

data.str[:5]

Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object

Vectorized string methods:

| Method | Description |
| --- | --- |
| cat | Concatenate strings element-wise with optional delimiter |
| contains | Return boolean array if each string contains pattern/regex |
| count | Count occurrences of pattern |
| extract | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group |
| endswith | Equivalent to x.endswith(pattern) for each element |
| startswith | Equivalent to x.startswith(pattern) for each element |
| findall | Compute list of all occurrences of pattern/regex for each string |
| get | Index into each element (retrieve i-th element) |
| isalnum | Equivalent to built-in str.isalnum |
| isalpha | Equivalent to built-in str.isalpha |
| isdecimal | Equivalent to built-in str.isdecimal |
| isdigit | Equivalent to built-in str.isdigit |
| islower | Equivalent to built-in str.islower |
| isnumeric | Equivalent to built-in str.isnumeric |
| isupper | Equivalent to built-in str.isupper |
| join | Join strings in each element of the Series with passed separator |
| len | Compute length of each string |
| lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element |
| match | Use re.match with the passed regular expression on each element, returning True or False for each one |
| pad | Add whitespace to left, right, or both sides of strings |
| center | Equivalent to pad(side='both') |
| repeat | Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string) |
| replace | Replace occurrences of pattern/regex with some other string |
| slice | Slice each string in the Series |
| split | Split strings on delimiter or regular expression |
| strip | Trim whitespace from both sides, including newlines |
| rstrip | Trim whitespace on right side |
| lstrip | Trim whitespace on left side |

Category: python
Published: 2024-01-23 22:01:18
Link: https://www.kaopuke.com/article/k-p-k_13_u_23_okf1_14__23__10_1.html
