I'm 文艺身影, a blogger on 靠谱客. This article walks through data cleaning for data analysis in Python; I'm sharing it here as a reference.

# Missing values

    import pandas as pd
    import numpy as np

    string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
    string_data

    0     aardvark
    1    artichoke
    2          NaN
    3      avocado
    dtype: object

Check for missing values:

    string_data.isnull()

    0    False
    1    False
    2     True
    3    False
    dtype: bool

    # None is also treated as NA
    string_data[0] = None
    string_data.isnull()

    0     True
    1    False
    2     True
    3    False
    dtype: bool
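The same check extends to a whole DataFrame. The sketch below is illustrative only: the `inventory` table is made up and does not appear in the original post. It relies on `isna`, which is an alias of `isnull`.

```python
import numpy as np
import pandas as pd

# Hypothetical table, for illustration only
inventory = pd.DataFrame({
    "name": ["aardvark", None, "avocado"],
    "weight": [1.5, np.nan, np.nan],
})

# Summing the boolean mask counts the missing values in each column
missing_per_column = inventory.isna().sum()

# True for every row that has at least one missing value
rows_with_missing = inventory.isna().any(axis=1)

print(missing_per_column)
print(rows_with_missing)
```

Counting per column first is a cheap way to decide whether dropping or filling is the more sensible next step.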
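
The main methods for handling missing data, collected in a small DataFrame:

    df = pd.DataFrame({'dropna': 'drop missing values',
                       'fillna': 'fill missing values',
                       'isnull': 'test for NA',
                       'notnull': 'test for non-NA'},
                      index=['1', '2', '3', '4'])
    df.iloc[1]  # a summary of the methods

    dropna     drop missing values
    fillna     fill missing values
    isnull             test for NA
    notnull        test for non-NA
    Name: 2, dtype: object

## Filtering out missing values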

    from numpy import nan as NA

    data = pd.Series([1, NA, 3.5, NA, 7])
    data.dropna()  # drop missing values

    0    1.0
    2    3.5
    4    7.0
    dtype: float64

    data[data.notnull()]  # equivalent form

    0    1.0
    2    3.5
    4    7.0
    dtype: float64

### DataFrame operations

    data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                         [NA, NA, NA], [NA, 6.5, 3.]])
    data

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |

By default `dropna` removes every row that contains at least one NA. It returns a new object, so `data` itself is unchanged:

    cleaned = data.dropna()
    data

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |

    cleaned

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
#### Dropping rows that are entirely NA

    data.dropna(how='all')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
#### Dropping columns

    data[4] = NA
    data

|   | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |

    # pass axis by keyword; recent pandas versions no longer accept it positionally
    data.dropna(axis=1, how='all')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
#### Dropping rows by number of non-NA values

    df = pd.DataFrame(np.random.randn(7, 3))
    df.iloc[:4, 1] = NA
    df.iloc[:2, 2] = NA
    df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.088513 | NaN | NaN |
| 1 | 0.637624 | NaN | NaN |
| 2 | 0.054991 | NaN | 0.795304 |
| 3 | -1.069859 | NaN | -1.572700 |
| 4 | 0.230697 | -1.893626 | 1.205645 |
| 5 | -0.918601 | -0.827842 | -1.712528 |
| 6 | 0.916691 | -0.238712 | 0.724451 |

    df.dropna()  # drop every row that contains an NA

|   | 0 | 1 | 2 |
|---|---|---|---|
| 4 | 0.230697 | -1.893626 | 1.205645 |
| 5 | -0.918601 | -0.827842 | -1.712528 |
| 6 | 0.916691 | -0.238712 | 0.724451 |

    df.dropna(thresh=2)  # keep only rows with at least 2 non-NA values

|   | 0 | 1 | 2 |
|---|---|---|---|
| 2 | 0.054991 | NaN | 0.795304 |
| 3 | -1.069859 | NaN | -1.572700 |
| 4 | 0.230697 | -1.893626 | 1.205645 |
| 5 | -0.918601 | -0.827842 | -1.712528 |
| 6 | 0.916691 | -0.238712 | 0.724451 |
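Because `thresh` counts non-NA values rather than NA values, it is easy to read backwards. The sketch below uses a small made-up `frame` (not from the original post) where the result can be checked by eye, and also shows `subset`, which restricts the NA check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has one non-NA value, row 1 has two, row 2 has three
frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 3.0],
                      [4.0, 5.0, 6.0]])

# thresh=2 keeps rows that have at least two non-NA values
kept_by_thresh = frame.dropna(thresh=2)

# subset limits the NA check to the listed columns (here column 1 only)
kept_by_subset = frame.dropna(subset=[1])

print(kept_by_thresh)
print(kept_by_subset)
```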
### Filling in missing values

    df.fillna(0)  # fill with 0

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.088513 | 0.000000 | 0.000000 |
| 1 | 0.637624 | 0.000000 | 0.000000 |
| 2 | 0.054991 | 0.000000 | 0.795304 |
| 3 | -1.069859 | 0.000000 | -1.572700 |
| 4 | 0.230697 | -1.893626 | 1.205645 |
| 5 | -0.918601 | -0.827842 | -1.712528 |
| 6 | 0.916691 | -0.238712 | 0.724451 |

    df.fillna({1: 0.5, 2: 0})  # a different fill value for each column

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.088513 | 0.500000 | 0.000000 |
| 1 | 0.637624 | 0.500000 | 0.000000 |
| 2 | 0.054991 | 0.500000 | 0.795304 |
| 3 | -1.069859 | 0.500000 | -1.572700 |
| 4 | 0.230697 | -1.893626 | 1.205645 |
| 5 | -0.918601 | -0.827842 | -1.712528 |
| 6 | 0.916691 | -0.238712 | 0.724451 |

    _ = df.fillna(0, inplace=True)  # modifies the original object
    df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.088513 | 0.000000 | 0.000000 |
| 1 | 0.637624 | 0.000000 | 0.000000 |
| 2 | 0.054991 | 0.000000 | 0.795304 |
| 3 | -1.069859 | 0.000000 | -1.572700 |
| 4 | 0.230697 | -1.893626 | 1.205645 |
| 5 | -0.918601 | -0.827842 | -1.712528 |
| 6 | 0.916691 | -0.238712 | 0.724451 |
#### Forward fill

    df = pd.DataFrame(np.random.randn(6, 3))
    df.iloc[2:, 1] = NA
    df.iloc[4:, 2] = NA
    df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.251059 | -1.375272 | 0.538068 |
| 1 | 0.329612 | 1.157921 | 0.508928 |
| 2 | -0.118987 | NaN | 0.694185 |
| 3 | 0.724308 | NaN | -0.620832 |
| 4 | 0.045673 | NaN | NaN |
| 5 | -0.317643 | NaN | NaN |

    # forward fill; recent pandas versions prefer this to fillna(method='ffill')
    df.ffill()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.251059 | -1.375272 | 0.538068 |
| 1 | 0.329612 | 1.157921 | 0.508928 |
| 2 | -0.118987 | 1.157921 | 0.694185 |
| 3 | 0.724308 | 1.157921 | -0.620832 |
| 4 | 0.045673 | 1.157921 | -0.620832 |
| 5 | -0.317643 | 1.157921 | -0.620832 |

    df.ffill(limit=2)  # fill at most 2 consecutive NA values

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.251059 | -1.375272 | 0.538068 |
| 1 | 0.329612 | 1.157921 | 0.508928 |
| 2 | -0.118987 | 1.157921 | 0.694185 |
| 3 | 0.724308 | 1.157921 | -0.620832 |
| 4 | 0.045673 | NaN | -0.620832 |
| 5 | -0.317643 | NaN | -0.620832 |
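Forward fill has a mirror image, backward fill, and a DataFrame can also be filled column by column with each column's own mean. The following is a rough sketch on a made-up `gaps` frame (the name and values are assumptions for illustration, not from the original post).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps at the start and in the middle
gaps = pd.DataFrame({"a": [np.nan, 2.0, np.nan, 4.0],
                     "b": [1.0, np.nan, np.nan, 7.0]})

# Backward fill: each NA takes the next valid value below it
back_filled = gaps.bfill()

# Fill each column with its own mean (NA values are skipped when averaging)
mean_filled = gaps.fillna(gaps.mean())

print(back_filled)
print(mean_filled)
```

Passing a Series to `fillna` aligns it on column labels, which is why each column receives its own mean here.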
#### Filling with a computed value

    data = pd.Series([1., NA, 3.5, NA, 7])
    data.fillna(data.mean())  # fill with the mean

    0    1.000000
    1    3.833333
    2    3.500000
    3    3.833333
    4    7.000000
    dtype: float64

Arguments of `fillna`:

| Argument | Description |
|---|---|
| value | Scalar value or dict-like object to use to fill missing values |
| method | Interpolation; by default 'ffill' if function called with no other arguments |
| axis | Axis to fill on; default axis=0 |
| inplace | Modify the calling object without producing a copy |
| limit | For forward and backward filling, maximum number of consecutive periods to fill |

# Data transformation

    data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                         'k2': [1, 1, 2, 3, 3, 4, 4]})
    data

|   | k1 | k2 |
|---|---|---|
| 0 | one | 1 |
| 1 | two | 1 |
| 2 | one | 2 |
| 3 | two | 3 |
| 4 | one | 3 |
| 5 | two | 4 |
| 6 | two | 4 |

### Checking whether a row duplicates an earlier one

    data.duplicated()

    0    False
    1    False
    2    False
    3    False
    4    False
    5    False
    6     True
    dtype: bool

#### Removing duplicates

    data.drop_duplicates()

|   | k1 | k2 |
|---|---|---|
| 0 | one | 1 |
| 1 | two | 1 |
| 2 | one | 2 |
| 3 | two | 3 |
| 4 | one | 3 |
| 5 | two | 4 |

#### Deduplicating on specific columns

    data['v1'] = range(7)
    data.drop_duplicates(['k1'])

|   | k1 | k2 | v1 |
|---|---|---|---|
| 0 | one | 1 | 0 |
| 1 | two | 1 | 1 |

#### Keeping the first or the last duplicate

By default the first occurrence is kept:

    data.drop_duplicates(['k1', 'k2'])

|   | k1 | k2 | v1 |
|---|---|---|---|
| 0 | one | 1 | 0 |
| 1 | two | 1 | 1 |
| 2 | one | 2 | 2 |
| 3 | two | 3 | 3 |
| 4 | one | 3 | 4 |
| 5 | two | 4 | 5 |

    data.drop_duplicates(['k1', 'k2'], keep='last')

|   | k1 | k2 | v1 |
|---|---|---|---|
| 0 | one | 1 | 0 |
| 1 | two | 1 | 1 |
| 2 | one | 2 | 2 |
| 3 | two | 3 | 3 |
| 4 | one | 3 | 4 |
| 6 | two | 4 | 6 |
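Besides `'first'` and `'last'`, `keep` also accepts `False`, which treats every member of a duplicated group as a duplicate. A small sketch, using a made-up `pairs` frame rather than the data above:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 are identical, rows 1 and 3 are unique
pairs = pd.DataFrame({"k1": ["one", "two", "one", "two"],
                      "k2": [1, 1, 1, 2]})

# keep=False marks every member of a duplicated group, not just the repeats
all_repeats = pairs.duplicated(["k1", "k2"], keep=False)

# Dropping with keep=False leaves only rows that were unique to begin with
unique_only = pairs.drop_duplicates(["k1", "k2"], keep=False)

print(all_repeats)
print(unique_only)
```

This is handy for inspecting all conflicting records side by side before deciding which one to keep.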
## Transforming data with a function or mapping

    data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                                  'Pastrami', 'corned beef', 'Bacon',
                                  'pastrami', 'honey ham', 'nova lox'],
                         'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
    data

|   | food | ounces |
|---|---|---|
| 0 | bacon | 4.0 |
| 1 | pulled pork | 3.0 |
| 2 | bacon | 12.0 |
| 3 | Pastrami | 6.0 |
| 4 | corned beef | 7.5 |
| 5 | Bacon | 8.0 |
| 6 | pastrami | 3.0 |
| 7 | honey ham | 5.0 |
| 8 | nova lox | 6.0 |
### Mapping

    meat_to_animal = {
        'bacon': 'pig',
        'pulled pork': 'pig',
        'pastrami': 'cow',
        'corned beef': 'cow',
        'honey ham': 'pig',
        'nova lox': 'salmon'
    }

#### Normalizing case

Some of the food names are capitalized, so convert them to lowercase before looking them up:

    lowercased = data.food.str.lower()
    lowercased

    0          bacon
    1    pulled pork
    2          bacon
    3       pastrami
    4    corned beef
    5          bacon
    6       pastrami
    7      honey ham
    8       nova lox
    Name: food, dtype: object

#### Adding a new column by applying the mapping

    data['animal'] = lowercased.map(meat_to_animal)
    data

|   | food | ounces | animal |
|---|---|---|---|
| 0 | bacon | 4.0 | pig |
| 1 | pulled pork | 3.0 | pig |
| 2 | bacon | 12.0 | pig |
| 3 | Pastrami | 6.0 | cow |
| 4 | corned beef | 7.5 | cow |
| 5 | Bacon | 8.0 | pig |
| 6 | pastrami | 3.0 | cow |
| 7 | honey ham | 5.0 | pig |
| 8 | nova lox | 6.0 | salmon |
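When `map` is given a dict, values that have no key in the dict become NaN rather than raising an error. The sketch below illustrates this with a deliberately incomplete mapping; `small_mapping` and `foods` are made-up names, not part of the original post.

```python
import pandas as pd

# Hypothetical, deliberately incomplete mapping
small_mapping = {"bacon": "pig", "pastrami": "cow"}
foods = pd.Series(["Bacon", "pastrami", "tofu"])  # "tofu" has no entry

# Keys missing from the dict map to NaN instead of raising
animals = foods.str.lower().map(small_mapping)

# Replace the NaN with an explicit placeholder
animals_filled = animals.fillna("unknown")

print(animals)
print(animals_filled)
```

Checking the result with `isnull()` after a mapping step is a quick way to spot categories the mapping does not cover.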
#### Using a function

    data['food'].map(lambda x: meat_to_animal[x.lower()])

    0       pig
    1       pig
    2       pig
    3       cow
    4       cow
    5       pig
    6       cow
    7       pig
    8    salmon
    Name: food, dtype: object

### Replacing values

    data = pd.Series([1., -999., 2., -999., -1000., 3.])
    data

    0       1.0
    1    -999.0
    2       2.0
    3    -999.0
    4   -1000.0
    5       3.0
    dtype: float64

    data.replace(-999, np.nan)

    0       1.0
    1       NaN
    2       2.0
    3       NaN
    4   -1000.0
    5       3.0
    dtype: float64

    data.replace([-999, -1000], [np.nan, 0])  # replace several values at once

    0    1.0
    1    NaN
    2    2.0
    3    NaN
    4    0.0
    5    3.0
    dtype: float64

    data.replace({-999: np.nan, -1000: 0})  # equivalent form using a dict

    0    1.0
    1    NaN
    2    2.0
    3    NaN
    4    0.0
    5    3.0
    dtype: float64

### Renaming axis indexes

    data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                        index=['Ohio', 'Colorado', 'New York'],
                        columns=['one', 'two', 'three', 'four'])

#### Mapping

    transform = lambda x: x[:4].upper()
    data.index.map(transform)

    Index(['OHIO', 'COLO', 'NEW '], dtype='object')

`map` returns a new index, so `data` is unchanged at this point:

    data

|   | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| New York | 8 | 9 | 10 | 11 |

    data.index = data.index.map(transform)  # assign back to modify the original object
    data

|   | one | two | three | four |
|---|---|---|---|---|
| OHIO | 0 | 1 | 2 | 3 |
| COLO | 4 | 5 | 6 | 7 |
| NEW | 8 | 9 | 10 | 11 |
### rename

    data.rename(index=str.title, columns=str.upper)

|   | ONE | TWO | THREE | FOUR |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colo | 4 | 5 | 6 | 7 |
| New | 8 | 9 | 10 | 11 |

    data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})  # rename specific labels

|   | one | two | peekaboo | four |
|---|---|---|---|---|
| INDIANA | 0 | 1 | 2 | 3 |
| COLO | 4 | 5 | 6 | 7 |
| NEW | 8 | 9 | 10 | 11 |

    data.rename(index={'OHIO': 'INDIANA'}, inplace=True)  # modifies the original object
    data

|   | one | two | three | four |
|---|---|---|---|---|
| INDIANA | 0 | 1 | 2 | 3 |
| COLO | 4 | 5 | 6 | 7 |
| NEW | 8 | 9 | 10 | 11 |
## Binning and quantiles

    ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
    bins = [18, 25, 35, 60, 100]
    cats = pd.cut(ages, bins)
    cats

    [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
    Length: 12
    Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

    cats.codes

    array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

    cats.categories

    IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
                  closed='right',
                  dtype='interval[int64]')
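Since the intervals are open on the left by default, a value equal to the lowest bin edge does not fall into any bin, and neither does a value beyond the last edge. The sketch below shows this with a few made-up edge cases (`edge_ages` is a hypothetical name); a code of -1 marks a value that received no bin.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
edge_ages = [18, 25, 26, 101]  # hypothetical edge cases

# Default right=True gives (18, 25], so 18 itself falls outside every bin
default_codes = pd.cut(edge_ages, bins).codes

# include_lowest=True closes the first interval on the left as well
inclusive_codes = pd.cut(edge_ages, bins, include_lowest=True).codes

print(default_codes)
print(inclusive_codes)
```

Unbinned values show up as NaN in the resulting categorical, so a quick null check after `cut` catches bin edges that do not cover the data.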

    pd.value_counts(cats)

    (18, 25]     5
    (35, 60]     3
    (25, 35]     3
    (60, 100]    1
    dtype: int64

##### Choosing the closed side

With `right=False` the intervals are closed on the left and open on the right:

    pd.cut(ages, [18, 26, 36, 61, 100], right=False)

    [[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
    Length: 12
    Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

##### Naming the bins

    group_names = ['Y', 'YA', 'WA', 'S']
    pd.cut(ages, bins, labels=group_names)

    [Y, Y, Y, YA, Y, ..., YA, S, WA, WA, YA]
    Length: 12
    Categories (4, object): [Y < YA < WA < S]

#### Equal-width bins

Passing an integer instead of explicit edges splits the range of the data into that many equal-width bins:

    data = np.random.rand(20)
    pd.cut(data, 4, precision=2)  # precision limits the number of decimal digits

    [(0.49, 0.74], (0.25, 0.49], (0.49, 0.74], (0.25, 0.49], (-3.6e-05, 0.25], ..., (0.49, 0.74], (0.25, 0.49], (0.74, 0.99], (0.74, 0.99], (0.74, 0.99]]
    Length: 20
    Categories (4, interval[float64]): [(-3.6e-05, 0.25] < (0.25, 0.49] < (0.49, 0.74] < (0.74, 0.99]]

#### qcut

`qcut` bins by sample quantiles, so each bin receives roughly the same number of points:

    data = np.random.randn(1000)
    cats = pd.qcut(data, 4)
    cats

    [(-2.9539999999999997, -0.658], (0.669, 3.876], (-2.9539999999999997, -0.658], (-0.0173, 0.669], (-0.658, -0.0173], ..., (0.669, 3.876], (-0.0173, 0.669], (-0.658, -0.0173], (-0.0173, 0.669], (0.669, 3.876]]
    Length: 1000
    Categories (4, interval[float64]): [(-2.9539999999999997, -0.658] < (-0.658, -0.0173] < (-0.0173, 0.669] < (0.669, 3.876]]

    pd.value_counts(cats)

    (0.669, 3.876]                   250
    (-0.0173, 0.669]                 250
    (-0.658, -0.0173]                250
    (-2.9539999999999997, -0.658]    250
    dtype: int64

#### Custom quantiles

    pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

    [(-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], ..., (-0.0173, 1.233], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-0.0173, 1.233]]
    Length: 1000
    Categories (4, interval[float64]): [(-2.9539999999999997, -1.236] < (-1.236, -0.0173] < (-0.0173, 1.233] < (1.233, 3.876]]

### Detecting and filtering outliers

    data = pd.DataFrame(np.random.randn(1000, 4))
    data.describe()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | -0.075999 | 0.020124 | -0.002652 | -0.024497 |
| std | 0.976555 | 0.955685 | 0.986620 | 1.010481 |
| min | -3.955131 | -3.403433 | -3.214173 | -3.405990 |
| 25% | -0.752617 | -0.605502 | -0.651444 | -0.677522 |
| 50% | -0.077978 | 0.009801 | -0.037225 | -0.004654 |
| 75% | 0.587452 | 0.686600 | 0.613769 | 0.641942 |
| max | 3.054668 | 3.369481 | 3.081614 | 2.983911 |

#### Filtering by condition

Find the values in one column whose absolute value exceeds 3:

    col = data[2]
    col[np.abs(col) > 3]

    824    3.081614
    938   -3.214173
    Name: 2, dtype: float64

#### Selecting the rows that contain an outlier

    # pass axis by keyword; recent pandas versions no longer accept it positionally
    data[(np.abs(data) > 3).any(axis=1)]

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 233 | -3.955131 | 0.644402 | 0.906170 | 0.871523 |
| 268 | 3.054668 | 1.215981 | 0.069664 | -0.956292 |
| 291 | -0.989445 | -0.333444 | -0.302554 | -3.118577 |
| 419 | -0.334358 | -3.403433 | -1.207032 | -0.812907 |
| 528 | 0.716480 | 3.369481 | -0.067002 | -0.062219 |
| 682 | 0.984525 | -0.013239 | 0.127607 | -3.405990 |
| 769 | 0.518519 | -0.263145 | -0.192145 | -3.161299 |
| 822 | -1.235697 | -3.032942 | 0.594252 | 1.165504 |
| 824 | 0.945307 | -2.096176 | 3.081614 | 1.307574 |
| 832 | -3.816917 | 0.758124 | -0.424847 | 0.016269 |
| 938 | 1.172387 | 0.353721 | -3.214173 | -0.088637 |
#### Capping the outliers

Values outside the interval from -3 to 3 are set to -3 or 3, depending on their sign:

    data[np.abs(data) > 3] = np.sign(data) * 3
    data.describe()

    np.sign(data).head()  # yields 1 or -1 depending on the sign of each value

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | -1.0 | 1.0 | -1.0 |
| 1 | -1.0 | 1.0 | 1.0 | 1.0 |
| 2 | -1.0 | 1.0 | -1.0 | 1.0 |
| 3 | -1.0 | -1.0 | 1.0 | -1.0 |
| 4 | -1.0 | 1.0 | -1.0 | 1.0 |
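The capping step above can also be written with `DataFrame.clip`, which bounds every value to a given interval in one call. The sketch below compares the two approaches on freshly generated data; `samples`, `capped` and `clipped` are illustrative names, and the random values differ from the ones shown in this post.

```python
import numpy as np
import pandas as pd

# Illustrative data; seeded so the comparison is repeatable
rng = np.random.default_rng(0)
samples = pd.DataFrame(rng.standard_normal((1000, 4)))

# Cap with the sign trick used in this section, working on a copy
capped = samples.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# clip bounds every value to the interval [-3, 3]
clipped = samples.clip(lower=-3, upper=3)

print(clipped.equals(capped))
```

Unlike the in-place assignment, `clip` returns a new object and leaves `samples` untouched.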

    data.iloc[938]  # verify that the value in column 2 was capped

    0    1.172387
    1    0.353721
    2   -3.000000
    3   -0.088637
    Name: 938, dtype: float64

## Permutation and random sampling

    df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
    sampler = np.random.permutation(5)
    sampler

    array([3, 2, 1, 0, 4])

    df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |
| 4 | 16 | 17 | 18 | 19 |

    df.take(sampler)  # reorder the rows according to the permutation

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 3 | 12 | 13 | 14 | 15 |
| 2 | 8 | 9 | 10 | 11 |
| 1 | 4 | 5 | 6 | 7 |
| 0 | 0 | 1 | 2 | 3 |
| 4 | 16 | 17 | 18 | 19 |

#### Random sampling

    df.sample(n=3)

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 1 | 4 | 5 | 6 | 7 |
| 3 | 12 | 13 | 14 | 15 |
| 0 | 0 | 1 | 2 | 3 |
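The rows returned by `sample` change on every run. When a repeatable result is needed, `random_state` fixes the seed. A small sketch (the `table` name and the seed value 42 are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# random_state fixes the seed, so the same rows come back on every run
first_draw = table.sample(n=3, random_state=42)
second_draw = table.sample(n=3, random_state=42)

# frac=1 returns all rows in shuffled order, i.e. a full permutation
shuffled = table.sample(frac=1, random_state=42)

print(first_draw.equals(second_draw))
print(shuffled.index.tolist())
```

`sample(frac=1)` is a compact alternative to combining `np.random.permutation` with `take`.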
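#### Sampling with replacement

With `replace=True` the same element can be drawn more than once, so the sample may be larger than the source:

    choicesv = pd.Series([5, 7, -1, 6, 4])
    draws = choicesv.sample(n=10, replace=True)
    draws

    4    4
    0    5
    3    6
    1    7
    1    7
    3    6
    4    4
    1    7
    2   -1
    1    7
    dtype: int64

### Computing Indicator/Dummy Variables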

    df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                       'data1': range(6)})
    df

|   | data1 | key |
|---|---|---|
| 0 | 0 | b |
| 1 | 1 | b |
| 2 | 2 | a |
| 3 | 3 | c |
| 4 | 4 | a |
| 5 | 5 | b |

    pd.get_dummies(df.key)

|   | a | b | c |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 |

    dummies = pd.get_dummies(df.key, prefix='key')
    df_with_dummy = df[['data1']].join(dummies)
    df_with_dummy

|   | data1 | key_a | key_b | key_c |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 |
| 2 | 2 | 1 | 0 | 0 |
| 3 | 3 | 0 | 0 | 1 |
| 4 | 4 | 1 | 0 | 0 |
| 5 | 5 | 0 | 1 | 0 |
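The get-dummies-then-join step can be done in a single call by passing the whole DataFrame together with `columns`. The sketch below uses a copy of the same data under the made-up name `records`. Newer pandas versions return booleans from `get_dummies` by default, so `dtype=int` is passed here to get the 0/1 integers shown above.

```python
import pandas as pd

records = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                        "data1": range(6)})

# columns= encodes only the listed columns and keeps the others as they are;
# dtype=int asks for 0/1 integers regardless of the pandas default
encoded = pd.get_dummies(records, columns=["key"], prefix="key", dtype=int)

print(encoded)
```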
#### Rows that belong to several categories

    mnames = ['movie_id', 'title', 'genres']
    # engine='python' handles the multi-character separator and avoids a ParserWarning
    movies = pd.read_table('/Users/meininghang/Downloads/pydata-book-2nd-edition/datasets/movielens/movies.dat',
                           sep='::', header=None, names=mnames, engine='python')
    movies[:10]

       movie_id  title                               genres
    0  1         Toy Story (1995)                    Animation|Children's|Comedy
    1  2         Jumanji (1995)                      Adventure|Children's|Fantasy
    2  3         Grumpier Old Men (1995)             Comedy|Romance
    3  4         Waiting to Exhale (1995)            Comedy|Drama
    4  5         Father of the Bride Part II (1995)  Comedy
    5  6         Heat (1995)                         Action|Crime|Thriller
    6  7         Sabrina (1995)                      Comedy|Romance
    7  8         Tom and Huck (1995)                 Adventure|Children's
    8  9         Sudden Death (1995)                 Action
    9  10        GoldenEye (1995)                    Action|Adventure|Thriller
#### Genres

Collect the distinct genres:

    all_genres = []
    for x in movies.genres:
        all_genres.extend(x.split('|'))

    genres = pd.unique(all_genres)
    genres

    array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
           'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
           'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
           'Western'], dtype=object)

Start from a DataFrame of zeros with one column per genre:

    zero_matrix = np.zeros((len(movies), len(genres)))
    dummies = pd.DataFrame(zero_matrix, columns=genres)

For a single movie, `get_indexer` gives the column positions of its genres:

    gen = movies.genres[0]
    gen.split('|')

    ['Animation', "Children's", 'Comedy']

    dummies.columns.get_indexer(gen.split('|'))

    array([0, 1, 2])

Set those positions to 1 for every row, then join the result to the original table:

    for i, gen in enumerate(movies.genres):
        indices = dummies.columns.get_indexer(gen.split('|'))
        dummies.iloc[i, indices] = 1

    movies_windic = movies.join(dummies.add_prefix('Genre_'))
    movies_windic.iloc[0]

    movie_id           1
    title              Toy Story (1995)
    genres             Animation|Children's|Comedy
    Genre_Animation    1
    Genre_Children's   1
    Genre_Comedy       1
    Genre_Adventure    0
    Genre_Fantasy      0
    Genre_Romance      0
    Genre_Drama        0
    Genre_Action       0
    Genre_Crime        0
    Genre_Thriller     0
    Genre_Horror       0
    Genre_Sci-Fi       0
    Genre_Documentary  0
    Genre_War          0
    Genre_Musical      0
    Genre_Mystery      0
    Genre_Film-Noir    0
    Genre_Western      0
    Name: 0, dtype: object

#### Combining get_dummies with cut

    np.random.seed(12345)
    values = np.random.rand(10)
    values

    array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
           0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

    bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
    pd.get_dummies(pd.cut(values, bins))

|   | (0.0, 0.2] | (0.2, 0.4] | (0.4, 0.6] | (0.6, 0.8] | (0.8, 1.0] |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 1 | 0 |
| 8 | 0 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 |
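For delimiter-separated categories such as the pipe-separated genres column used earlier, pandas also offers `Series.str.get_dummies`, which builds the indicator matrix in one step. The sketch below is an alternative to the manual loop, not the approach taken in this post; `sample_movies` holds three rows copied from the movies table so the example runs without the data file.

```python
import pandas as pd

# Three rows copied from the movies table shown earlier
sample_movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               "Action|Crime|Thriller"],
})

# Split each string on "|" and build one indicator column per distinct genre
genre_dummies = sample_movies["genres"].str.get_dummies("|")
with_genres = sample_movies.join(genre_dummies.add_prefix("Genre_"))

print(with_genres.iloc[0])
```

The columns come out in sorted order rather than in order of first appearance, which is the main visible difference from the loop-based version.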

# String manipulation

## Built-in string methods

    val = 'a,b, guido'
    val.split(',')  # split on commas

    ['a', 'b', ' guido']

### Trimming

    pieces = [x.strip() for x in val.split(',')]
    pieces

    ['a', 'b', 'guido']

### Concatenating

    f, s, t = pieces
    f + '::' + s + '::' + t

    'a::b::guido'
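Chaining `+` works for three pieces but becomes awkward for longer lists. A more general form, shown here as a small sketch, passes the list to the delimiter's `join` method:

```python
val = 'a,b, guido'
pieces = [x.strip() for x in val.split(',')]

# str.join inserts the delimiter between every pair of items
joined = '::'.join(pieces)
print(joined)
```

`join` works for any number of pieces, including a list whose length is not known in advance.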

### Locating substrings

    'guido' in val

    True

    val.index(',')

    1

    val.find(':')

    -1

### Counting

    val.count(',')

    2

### Replacing

    val.replace(',', '::')

    'a::b:: guido'

    val.replace(',', ' ')

    'a b  guido'

Built-in string methods:

| Method | Description |
|---|---|
| count | Return the number of non-overlapping occurrences of substring in the string. |
| endswith | Returns True if string ends with suffix. |
| startswith | Returns True if string starts with prefix. |
| join | Use string as delimiter for concatenating a sequence of other strings. |
| index | Return position of first character in substring if found in the string; raises ValueError if not found. |
| find | Return position of first character of first occurrence of substring in the string; like index, but returns -1 if not found. |
| rfind | Return position of first character of last occurrence of substring in the string; returns -1 if not found. |
| replace | Replace occurrences of string with another string. |
| strip, rstrip, lstrip | Trim whitespace, including newlines, from both sides (strip), the right side (rstrip), or the left side (lstrip). |
| split | Break string into list of substrings using passed delimiter. |
| lower | Convert alphabet characters to lowercase. |
| upper | Convert alphabet characters to uppercase. |
| casefold | Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form. |
| ljust, rjust | Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width. |

## Regular expressions

    import re

    text = 'foo    bar\t baz  \tqux'
    re.split(r'\s+', text)

    ['foo', 'bar', 'baz', 'qux']

### Compiling

    regex = re.compile(r'\s+')
    regex.split(text)

    ['foo', 'bar', 'baz', 'qux']

### Finding matches

    regex.findall(text)

    ['    ', '\t ', '  \t']
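`findall` returns only the matched strings. When the positions are needed as well, `finditer` yields one match object per occurrence. A short sketch on the same text (the names `spans` and `at_start` are illustrative):

```python
import re

text = 'foo    bar\t baz  \tqux'
regex = re.compile(r'\s+')

# finditer yields one match object per occurrence; span() gives (start, end)
spans = [found.span() for found in regex.finditer(text)]

# match only succeeds at the very start of the string, so it returns None here
at_start = regex.match(text)

print(spans)
print(at_start)
```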

    regex.search(text)  # returns the first match

    <_sre.SRE_Match object; span=(3, 7), match='    '>

### Email addresses

    text = """Dave dave@google.com
    Steve steve@gmail.com
    Rob rob@gmail.com
    Ryan ryan@yahoo.com
    """
    pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

    # re.IGNORECASE makes the regex case-insensitive
    regex = re.compile(pattern, flags=re.IGNORECASE)
    regex.findall(text)

    ['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

### Substitution

    regex.sub('REDACTED', text)

    'Dave REDACTED\nSteve REDACTED\nRob REDACTED\nRyan REDACTED\n'

### Groups

Wrapping parts of the pattern in parentheses splits each match into groups:

    pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
    regex = re.compile(pattern, flags=re.IGNORECASE)
    m = regex.match('wesm@bright.net')
    m.groups()

    ('wesm', 'bright', 'net')

    regex.findall(text)  # returns a tuple of groups for each match

    [('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan', 'yahoo', 'com')]

### Using groups in sub

    regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)

    'Dave Username: dave, Domain: google, Suffix: com\nSteve Username: steve, Domain: gmail, Suffix: com\nRob Username: rob, Domain: gmail, Suffix: com\nRyan Username: ryan, Domain: yahoo, Suffix: com\n'
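Numbered references become hard to follow as a pattern grows. Python's `re` module also supports named groups, sketched below with the same email pattern; the group names `username`, `domain` and `suffix` are chosen here for illustration and do not come from the original post.

```python
import re

# Same email pattern as above, with a name attached to each group
named = re.compile(
    r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})',
    flags=re.IGNORECASE,
)

m = named.match('wesm@bright.net')
parts = m.groupdict()  # dict keyed by group name

# \g<name> refers to a named group inside the replacement string
rewritten = named.sub(r'\g<username> at \g<domain>', 'Rob rob@gmail.com')

print(parts)
print(rewritten)
```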

Regular expression methods:

| Method | Description |
|---|---|
| findall | Return all non-overlapping matching patterns in a string as a list |
| finditer | Like findall, but returns an iterator |
| match | Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None |
| search | Scan string for match to pattern, returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning |
| split | Break string into pieces at each occurrence of pattern |
| sub, subn | Replace occurrences of pattern in string with a replacement expression; subn additionally returns the number of substitutions made; use \1, \2, ... to refer to match group elements in the replacement string |

## Vectorized string functions in pandas

    data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
            'Rob': 'rob@gmail.com', 'Wes': np.nan}
    data = pd.Series(data)
    data

    Dave     dave@google.com
    Rob        rob@gmail.com
    Steve    steve@gmail.com
    Wes                  NaN
    dtype: object

    data.isnull()

    Dave     False
    Rob      False
    Steve    False
    Wes       True
    dtype: bool

### Contains

The `str` methods skip missing values, so the NaN entry stays NaN:

    data.str.contains('gmail')

    Dave     False
    Rob       True
    Steve     True
    Wes        NaN
    dtype: object

### Slicing

    data.str[:5]

    Dave     dave@
    Rob      rob@g
    Steve    steve
    Wes        NaN
    dtype: object
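Regular expressions with groups can be applied to a whole Series through `str.extract`, which returns one column per group. The sketch below reuses the email pattern from the regular expression section on a copy of the data named `emails` (an illustrative name), and also shows the `na` argument of `str.contains`.

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rob@gmail.com', 'Wes': np.nan})

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; rows with missing input stay NaN
parts = emails.str.extract(pattern, flags=re.IGNORECASE)

# na=False turns the missing entry into False instead of NaN
is_gmail = emails.str.contains('gmail', na=False)

print(parts)
print(is_gmail)
```

Setting `na=False` gives a purely boolean result, which can then be used directly as a filter mask.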

Vectorized string methods:

| Method | Description |
|---|---|
| cat | Concatenate strings element-wise with optional delimiter |
| contains | Return boolean array if each string contains pattern/regex |
| count | Count occurrences of pattern |
| extract | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group |
| endswith | Equivalent to x.endswith(pattern) for each element |
| startswith | Equivalent to x.startswith(pattern) for each element |
| findall | Compute list of all occurrences of pattern/regex for each string |
| get | Index into each element (retrieve i-th element) |
| isalnum | Equivalent to built-in str.isalnum |
| isalpha | Equivalent to built-in str.isalpha |
| isdecimal | Equivalent to built-in str.isdecimal |
| isdigit | Equivalent to built-in str.isdigit |
| islower | Equivalent to built-in str.islower |
| isnumeric | Equivalent to built-in str.isnumeric |
| isupper | Equivalent to built-in str.isupper |
| join | Join strings in each element of the Series with passed separator |
| len | Compute length of each string |
| lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element |
| match | Use re.match with the passed regular expression on each element, returning whether each element matches |
| pad | Add whitespace to left, right, or both sides of strings |
| center | Equivalent to pad(side='both') |
| repeat | Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string) |
| replace | Replace occurrences of pattern/regex with some other string |
| slice | Slice each string in the Series |
| split | Split strings on delimiter or regular expression |
| strip | Trim whitespace from both sides, including newlines |
| rstrip | Trim whitespace on right side |
| lstrip | Trim whitespace on left side |
