Python Data Analysis: Data Cleaning
# Missing Values
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
## Checking for Missing Values
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
df = pd.DataFrame({'dropna': 'drop missing values', 'fillna': 'fill missing values',
                   'isnull': 'check for null', 'notnull': 'check for not null'},
                  index=['1', '2', '3', '4'])
df.iloc[1]
dropna     drop missing values
fillna     fill missing values
isnull          check for null
notnull     check for not null
Name: 2, dtype: object
## Filtering Out Missing Values
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
### DataFrame Operations
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned = data.dropna()
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
#### Dropping Rows That Are All NA
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
#### Dropping Columns
data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
#### Dropping Rows by Number of Missing Values
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
          0         1         2
0 -0.088513       NaN       NaN
1  0.637624       NaN       NaN
2  0.054991       NaN  0.795304
3 -1.069859       NaN -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
df.dropna()
          0         1         2
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
df.dropna(thresh=2)
          0         1         2
2  0.054991       NaN  0.795304
3 -1.069859       NaN -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
### Filling Missing Values
df.fillna(0)
          0         1         2
0 -0.088513  0.000000  0.000000
1  0.637624  0.000000  0.000000
2  0.054991  0.000000  0.795304
3 -1.069859  0.000000 -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
df.fillna({1: 0.5, 2: 0})
          0         1         2
0 -0.088513  0.500000  0.000000
1  0.637624  0.500000  0.000000
2  0.054991  0.500000  0.795304
3 -1.069859  0.500000 -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
_ = df.fillna(0, inplace=True)
df
          0         1         2
0 -0.088513  0.000000  0.000000
1  0.637624  0.000000  0.000000
2  0.054991  0.000000  0.795304
3 -1.069859  0.000000 -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
#### Forward Fill
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
          0         1         2
0 -0.251059 -1.375272  0.538068
1  0.329612  1.157921  0.508928
2 -0.118987       NaN  0.694185
3  0.724308       NaN -0.620832
4  0.045673       NaN       NaN
5 -0.317643       NaN       NaN
df.fillna(method='ffill')
          0         1         2
0 -0.251059 -1.375272  0.538068
1  0.329612  1.157921  0.508928
2 -0.118987  1.157921  0.694185
3  0.724308  1.157921 -0.620832
4  0.045673  1.157921 -0.620832
5 -0.317643  1.157921 -0.620832
df.fillna(method='ffill', limit=2)
          0         1         2
0 -0.251059 -1.375272  0.538068
1  0.329612  1.157921  0.508928
2 -0.118987  1.157921  0.694185
3  0.724308  1.157921 -0.620832
4  0.045673       NaN -0.620832
5 -0.317643       NaN -0.620832
#### Filling with a Computed Value
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
fillna arguments:
value    Scalar value or dict-like object to use to fill missing values
method   Interpolation method; 'ffill' by default if the function is called with no other arguments
axis     Axis to fill on; default axis=0
inplace  Modify the calling object without producing a copy
limit    For forward and backward filling, maximum number of consecutive periods to fill
# Data Transformation
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
### Checking for Duplicate Rows
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
#### Dropping Duplicates
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
#### Dropping Duplicates Based on Specified Columns
data['v1'] = range(7)
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
#### Keeping the First or Last Occurrence
data.drop_duplicates(['k1', 'k2'])
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
data.drop_duplicates(['k1', 'k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
## Transforming Data Using a Function or Mapping
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
### Mapping
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
#### Converting to Lowercase
lowercased = data.food.str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
#### Adding a New Column by Applying the Mapping
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
#### Doing It with a Single Function
data['food'].map(lambda x: meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
### Replacing Values
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
data.replace(-999, np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
### Renaming Axis Indexes
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
#### Mapping
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
### rename
data.rename(index=str.title, columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11
## Binning and Quantiles
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
pd.value_counts(cats)
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
##### Changing Which Side Is Closed
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
##### Naming the Bins
group_names = ['Y', 'YA', 'WA', 'S']
pd.cut(ages, bins, labels=group_names)
[Y, Y, Y, YA, Y, ..., YA, S, WA, WA, YA]
Length: 12
Categories (4, object): [Y < YA < WA < S]
#### Equal-Width Bins
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
[(0.49, 0.74], (0.25, 0.49], (0.49, 0.74], (0.25, 0.49], (-3.6e-05, 0.25], ..., (0.49, 0.74], (0.25, 0.49], (0.74, 0.99], (0.74, 0.99], (0.74, 0.99]]
Length: 20
Categories (4, interval[float64]): [(-3.6e-05, 0.25] < (0.25, 0.49] < (0.49, 0.74] < (0.74, 0.99]]
#### qcut
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
[(-2.9539999999999997, -0.658], (0.669, 3.876], (-2.9539999999999997, -0.658], (-0.0173, 0.669], (-0.658, -0.0173], ..., (0.669, 3.876], (-0.0173, 0.669], (-0.658, -0.0173], (-0.0173, 0.669], (0.669, 3.876]]
Length: 1000
Categories (4, interval[float64]): [(-2.9539999999999997, -0.658] < (-0.658, -0.0173] < (-0.0173, 0.669] < (0.669, 3.876]]
pd.value_counts(cats)
(0.669, 3.876]                   250
(-0.0173, 0.669]                 250
(-0.658, -0.0173]                250
(-2.9539999999999997, -0.658]    250
dtype: int64
#### Custom Quantiles
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
[(-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], ..., (-0.0173, 1.233], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-0.0173, 1.233]]
Length: 1000
Categories (4, interval[float64]): [(-2.9539999999999997, -1.236] < (-1.236, -0.0173] < (-0.0173, 1.233] < (1.233, 3.876]]
### Detecting and Filtering Outliers
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.075999     0.020124    -0.002652    -0.024497
std       0.976555     0.955685     0.986620     1.010481
min      -3.955131    -3.403433    -3.214173    -3.405990
25%      -0.752617    -0.605502    -0.651444    -0.677522
50%      -0.077978     0.009801    -0.037225    -0.004654
75%       0.587452     0.686600     0.613769     0.641942
max       3.054668     3.369481     3.081614     2.983911
#### Filtering One Column by Condition
col = data[2]
col[np.abs(col) > 3]
824    3.081614
938   -3.214173
Name: 2, dtype: float64
#### Selecting Rows Containing Such Values
data[(np.abs(data) > 3).any(axis=1)]
            0         1         2         3
233 -3.955131  0.644402  0.906170  0.871523
268  3.054668  1.215981  0.069664 -0.956292
291 -0.989445 -0.333444 -0.302554 -3.118577
419 -0.334358 -3.403433 -1.207032 -0.812907
528  0.716480  3.369481 -0.067002 -0.062219
682  0.984525 -0.013239  0.127607 -3.405990
769  0.518519 -0.263145 -0.192145 -3.161299
822 -1.235697 -3.032942  0.594252  1.165504
824  0.945307 -2.096176  3.081614  1.307574
832 -3.816917  0.758124 -0.424847  0.016269
938  1.172387  0.353721 -3.214173 -0.088637
#### Capping the Values
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
np.sign(data).head()
     0    1    2    3
0  1.0 -1.0  1.0 -1.0
1 -1.0  1.0  1.0  1.0
2 -1.0  1.0 -1.0  1.0
3 -1.0 -1.0  1.0 -1.0
4 -1.0  1.0 -1.0  1.0
data.iloc[938]
0    1.172387
1    0.353721
2   -3.000000
3   -0.088637
Name: 938, dtype: float64
## Permutation and Random Sampling
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
array([3, 2, 1, 0, 4])
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
df.take(sampler)
    0   1   2   3
3  12  13  14  15
2   8   9  10  11
1   4   5   6   7
0   0   1   2   3
4  16  17  18  19
#### Random Sampling
df.sample(n=3)
#### Sampling with Replacement
choicesv = pd.Series([5, 7, -1, 6, 4])
draws = choicesv.sample(n=10, replace=True)
draws
4    4
0    5
3    6
1    7
1    7
3    6
4    4
1    7
2   -1
1    7
dtype: int64
### Computing Indicator/Dummy Variables
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b
pd.get_dummies(df.key)
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
dummies = pd.get_dummies(df.key, prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
#### Rows That Belong to Multiple Categories
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('/Users/meininghang/Downloads/pydata-book-2nd-edition/datasets/movielens/movies.dat',
                       sep='::', header=None, names=mnames)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
movies[:10]
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
#### Genres
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2])
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, dtype: object
#### get_dummies
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503, 0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
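Looking back at the MovieLens genres, a shorter alternative sketch (not part of the walkthrough above) uses pandas' built-in Series.str.get_dummies, which splits each string on a separator and builds the indicator matrix in one step; the names genre_dummies and movies_windic_alt are just illustrative:
# Alternative to the manual zero-matrix loop above: split on '|' and one-hot encode directly
genre_dummies = movies.genres.str.get_dummies(sep='|')
movies_windic_alt = movies.join(genre_dummies.add_prefix('Genre_'))
movies_windic_alt.iloc[0]   # should agree with movies_windic.iloc[0] above (column order and dtypes aside)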
# String Manipulation
## Built-in String Methods
val = 'a,b,  guido'
val.split(',')
['a', 'b', '  guido']
### Stripping Whitespace
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
### Joining
f, s, t = pieces
f + '::' + s + '::' + t
'a::b::guido'
### Locating Substrings
'guido' in val
True
val.index(',')
1
val.find(':')
-1
### Counting
val.count(',')
2
### Replacing
val.replace(',', '::')
'a::b::  guido'
val.replace(',', ' ')
'a b   guido'
Built-in string methods:
count                  Return the number of non-overlapping occurrences of substring in the string
endswith               Returns True if string ends with suffix
startswith             Returns True if string starts with prefix
join                   Use string as delimiter for concatenating a sequence of other strings
index                  Return position of first character in substring if found in the string; raises ValueError if not found
find                   Return position of first character of first occurrence of substring in the string; like index, but returns -1 if not found
rfind                  Return position of first character of last occurrence of substring in the string; returns -1 if not found
replace                Replace occurrences of string with another string
strip, rstrip, lstrip  Trim whitespace, including newlines, from both sides, the right side, or the left side, respectively
split                  Break string into list of substrings using passed delimiter
lower                  Convert alphabet characters to lowercase
upper                  Convert alphabet characters to uppercase
casefold               Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form
ljust, rjust           Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width
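A minimal sketch exercising a few table entries that were not demonstrated above, reusing the val and pieces objects defined earlier (expected results shown as comments):
val.startswith('a')     # True
val.endswith('guido')   # True
'::'.join(pieces)       # 'a::b::guido'
'a'.ljust(5, '*')       # 'a****'
'Straße'.casefold()     # 'strasse' -- casefold also normalizes region-specific forms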
## Regular Expressions
import re
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
### Compiling a Regex
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
### Finding Matches
regex.findall(text)
['    ', '\t ', '  \t']
regex.search(text)
<_sre.SRE_Match object; span=(3, 7), match='    '>
### Example: Email Addresses
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
### Substituting with sub
regex.sub('REDACTED', text)
'Dave REDACTED\nSteve REDACTED\nRob REDACTED\nRyan REDACTED\n'
### Groups
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
### sub with Group References
regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)
'Dave Username: dave, Domain: google, Suffix: com\nSteve Username: steve, Domain: gmail, Suffix: com\nRob Username: rob, Domain: gmail, Suffix: com\nRyan Username: ryan, Domain: yahoo, Suffix: com\n'
Regular expression methods:
findall    Return all non-overlapping matching patterns in a string as a list
finditer   Like findall, but returns an iterator
match      Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, otherwise None
search     Scan string for a match to the pattern, returning a match object if found; unlike match, the match can be anywhere in the string, not only at the beginning
split      Break string into pieces at each occurrence of pattern
sub, subn  Replace all (sub) or first n occurrences (subn) of pattern in string with replacement expression; use symbols \1, \2, ... to refer to match group elements in the replacement string
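finditer is the only entry above not shown yet; a small sketch using the compiled grouped email regex and the text variable from the previous examples:
# finditer yields one match object per non-overlapping match, instead of a list
for m in regex.finditer(text):
    print(m.group(), m.groups())   # full email, then its (username, domain, suffix) groups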
## Vectorized String Functions in pandas
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object
data.isnull()
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool
### contains
data.str.contains('gmail')
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
### Slicing
data.str[:5]
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object
Vectorized string methods:
cat           Concatenate strings element-wise with optional delimiter
contains      Return boolean array of whether each string contains pattern/regex
count         Count occurrences of pattern
extract       Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
endswith      Equivalent to x.endswith(pattern) for each element
startswith    Equivalent to x.startswith(pattern) for each element
findall       Compute list of all occurrences of pattern/regex for each string
get           Index into each element (retrieve i-th element)
isalnum       Equivalent to built-in str.isalnum
isalpha       Equivalent to built-in str.isalpha
isdecimal     Equivalent to built-in str.isdecimal
isdigit       Equivalent to built-in str.isdigit
islower       Equivalent to built-in str.islower
isnumeric     Equivalent to built-in str.isnumeric
isupper       Equivalent to built-in str.isupper
join          Join strings in each element of the Series with passed separator
len           Compute length of each string
lower, upper  Convert cases; equivalent to x.lower() or x.upper() for each element
match         Use re.match with the passed regular expression on each element, returning matched groups as list
pad           Add whitespace to left, right, or both sides of strings
center        Equivalent to pad(side='both')
repeat        Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
replace       Replace occurrences of pattern/regex with some other string
slice         Slice each string in the Series
split         Split strings on delimiter or regular expression
strip         Trim whitespace from both sides, including newlines
rstrip        Trim whitespace on right side
lstrip        Trim whitespace on left side
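As a sketch of the extract and findall entries (reusing the grouped email pattern compiled earlier), str.extract returns a DataFrame with one column per capture group, and missing entries such as Wes stay NaN:
data.str.extract(pattern, flags=re.IGNORECASE)   # columns 0, 1, 2 = username, domain, suffix
data.str.findall(pattern, flags=re.IGNORECASE)   # a list of (username, domain, suffix) tuples per element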