Python Data Analysis: Data Cleaning

Overview

# Missing Values

import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
Check for missing values:
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
# None is also treated as NA
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
df = pd.DataFrame({'dropna': 'drop missing values', 'fillna': 'fill missing values', 'isnull': 'check for NA', 'notnull': 'check for non-NA'}, index=['1', '2', '3', '4'])
df.iloc[1]  # a few of the relevant methods
dropna     drop missing values
fillna     fill missing values
isnull            check for NA
notnull       check for non-NA
Name: 2, dtype: object

## Filtering Missing Values
from numpy import nan as NA
data = pd.Series([1,NA,3.5,NA,7])
data.dropna()  # drop missing values
0    1.0
2    3.5
4    7.0
dtype: float64
data[data.notnull()]  # equivalent
0    1.0
2    3.5
4    7.0
dtype: float64

### DataFrame Operations
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned = data.dropna()
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
#### Dropping Rows That Are All NA
data.dropna(how = 'all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
#### Dropping Columns
data[4] = NA
data
     0    1    2    4
0  1.0  6.5  3.0  NaN
1  1.0  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  6.5  3.0  NaN
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
#### Dropping Rows Based on How Many Values Are Missing
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
          0         1         2
0 -0.088513       NaN       NaN
1  0.637624       NaN       NaN
2  0.054991       NaN  0.795304
3 -1.069859       NaN -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
df.dropna()  # drop any row that contains an NA value
          0         1         2
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
df.dropna(thresh=2)  # keep only rows with at least 2 non-NA values
          0         1         2
2  0.054991       NaN  0.795304
3 -1.069859       NaN -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
### Filling Missing Values
df.fillna(0)  # fill with 0
          0         1         2
0 -0.088513  0.000000  0.000000
1  0.637624  0.000000  0.000000
2  0.054991  0.000000  0.795304
3 -1.069859  0.000000 -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
df.fillna({1: 0.5, 2: 0})  # use a different fill value for each column
          0         1         2
0 -0.088513  0.500000  0.000000
1  0.637624  0.500000  0.000000
2  0.054991  0.500000  0.795304
3 -1.069859  0.500000 -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
_ = df.fillna(0, inplace=True)
# modifies the original object in place
df
          0         1         2
0 -0.088513  0.000000  0.000000
1  0.637624  0.000000  0.000000
2  0.054991  0.000000  0.795304
3 -1.069859  0.000000 -1.572700
4  0.230697 -1.893626  1.205645
5 -0.918601 -0.827842 -1.712528
6  0.916691 -0.238712  0.724451
#### Forward Fill
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:,1] = NA
df.iloc[4:,2] = NA
df
          0         1         2
0 -0.251059 -1.375272  0.538068
1  0.329612  1.157921  0.508928
2 -0.118987       NaN  0.694185
3  0.724308       NaN -0.620832
4  0.045673       NaN       NaN
5 -0.317643       NaN       NaN
df.fillna(method='ffill')  # forward fill
          0         1         2
0 -0.251059 -1.375272  0.538068
1  0.329612  1.157921  0.508928
2 -0.118987  1.157921  0.694185
3  0.724308  1.157921 -0.620832
4  0.045673  1.157921 -0.620832
5 -0.317643  1.157921 -0.620832
df.fillna(method='ffill', limit=2)  # limit the number of consecutive fills
          0         1         2
0 -0.251059 -1.375272  0.538068
1  0.329612  1.157921  0.508928
2 -0.118987  1.157921  0.694185
3  0.724308  1.157921 -0.620832
4  0.045673       NaN -0.620832
5 -0.317643       NaN -0.620832
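The same mechanism works in the other direction. A minimal sketch using method='bfill' (the small demo DataFrame here is invented purely for illustration):

import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [np.nan, 1.0, np.nan, 3.0],
                     'b': [4.0, np.nan, np.nan, np.nan]})
demo.fillna(method='bfill')           # propagate the next valid value backwards
demo.fillna(method='bfill', limit=1)  # fill at most one consecutive NA in each gap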
#### Filling with a Statistic
data = pd.Series([1.,NA,3.5,NA,7])
data.fillna(data.mean())  # fill with the mean
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

fillna arguments:
Argument Description
value Scalar value or dict-like object to use to fill missing values
method Interpolation; by default 'ffill' if function called with no other arguments
axis Axis to fill on; default axis=0
inplace Modify the calling object without producing a copy
limit For forward and backward filling, maximum number of consecutive periods to fill

# Data Transformation
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
### Checking Whether a Row Duplicates an Earlier One
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

#### Dropping Duplicates
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
#### Dropping Duplicates Based on a Specific Column
data['v1'] = range(7)
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
#### Keeping the First or Last Occurrence
data.drop_duplicates(['k1','k2'])
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
data.drop_duplicates(['k1','k2'],keep = 'last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
## Transforming Data with a Function or Mapping
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
### Mapping
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
#### Converting to Lowercase
lowercased = data.food.str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

#### Adding a New Column via the Mapping
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
#### The Same Transformation with a Function
data['food'].map(lambda x: meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

### Replacing Values
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
data.replace(-999,np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])  # replace multiple values
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})  # equivalent
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### Renaming Axis Indexes
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
#### Mapping
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
data.index = data.index.map(transform)  # modifies the original object
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
### rename
data.rename(index = str.title,columns = str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})  # rename specific labels
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)  # modifies the original object
data
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11
## Binning and Quantiles
ages = [20,22,25,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], closed='right', dtype='interval[int64]')
pd.value_counts(cats)
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

##### Which Side of the Interval Is Closed
pd.cut(ages,[18,26,36,61,100],right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

##### Naming the Bins
group_names = ['Y','YA','WA','S']
pd.cut(ages,bins,labels = group_names)
[Y, Y, Y, YA, Y, ..., YA, S, WA, WA, YA]
Length: 12
Categories (4, object): [Y < YA < WA < S]

#### Equal-Width Bins
data = np.random.rand(20)
pd.cut(data, 4, precision=2)  # precision controls the number of decimal places
[(0.49, 0.74], (0.25, 0.49], (0.49, 0.74], (0.25, 0.49], (-3.6e-05, 0.25], ..., (0.49, 0.74], (0.25, 0.49], (0.74, 0.99], (0.74, 0.99], (0.74, 0.99]]
Length: 20
Categories (4, interval[float64]): [(-3.6e-05, 0.25] < (0.25, 0.49] < (0.49, 0.74] < (0.74, 0.99]]

#### qcut
data = np.random.randn(1000)
cats = pd.qcut(data,4)
cats
[(-2.9539999999999997, -0.658], (0.669, 3.876], (-2.9539999999999997, -0.658], (-0.0173, 0.669], (-0.658, -0.0173], ..., (0.669, 3.876], (-0.0173, 0.669], (-0.658, -0.0173], (-0.0173, 0.669], (0.669, 3.876]]
Length: 1000
Categories (4, interval[float64]): [(-2.9539999999999997, -0.658] < (-0.658, -0.0173] < (-0.0173, 0.669] < (0.669, 3.876]]
pd.value_counts(cats)
(0.669, 3.876]                   250
(-0.0173, 0.669]                 250
(-0.658, -0.0173]                250
(-2.9539999999999997, -0.658]    250
dtype: int64

#### Custom Quantiles
pd.qcut(data,[0,0.1,0.5,0.9,1.])
[(-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], ..., (-0.0173, 1.233], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-0.0173, 1.233]]
Length: 1000
Categories (4, interval[float64]): [(-2.9539999999999997, -1.236] < (-1.236, -0.0173] < (-0.0173, 1.233] < (1.233, 3.876]]

### Detecting and Filtering Outliers
data = pd.DataFrame(np.random.randn(1000,4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.075999     0.020124    -0.002652    -0.024497
std       0.976555     0.955685     0.986620     1.010481
min      -3.955131    -3.403433    -3.214173    -3.405990
25%      -0.752617    -0.605502    -0.651444    -0.677522
50%      -0.077978     0.009801    -0.037225    -0.004654
75%       0.587452     0.686600     0.613769     0.641942
max       3.054668     3.369481     3.081614     2.983911
#### Filtering a Single Column by Condition
col = data[2]
col[np.abs(col) > 3]
824    3.081614
938   -3.214173
Name: 2, dtype: float64

#### Finding the Rows That Contain Such Values
data[(np.abs(data) > 3).any(axis=1)]
            0         1         2         3
233 -3.955131  0.644402  0.906170  0.871523
268  3.054668  1.215981  0.069664 -0.956292
291 -0.989445 -0.333444 -0.302554 -3.118577
419 -0.334358 -3.403433 -1.207032 -0.812907
528  0.716480  3.369481 -0.067002 -0.062219
682  0.984525 -0.013239  0.127607 -3.405990
769  0.518519 -0.263145 -0.192145 -3.161299
822 -1.235697 -3.032942  0.594252  1.165504
824  0.945307 -2.096176  3.081614  1.307574
832 -3.816917  0.758124 -0.424847  0.016269
938  1.172387  0.353721 -3.214173 -0.088637
#### Capping the Values
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
np.sign(data).head()  # produces 1 and -1 according to sign
     0    1    2    3
0  1.0 -1.0  1.0 -1.0
1 -1.0  1.0  1.0  1.0
2 -1.0  1.0 -1.0  1.0
3 -1.0 -1.0  1.0 -1.0
4 -1.0  1.0 -1.0  1.0
data.iloc[938]  # verify the capped value
0    1.172387
1    0.353721
2   -3.000000
3   -0.088637
Name: 938, dtype: float64

## Permutation and Sampling
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
sampler = np.random.permutation(5)
sampler
array([3, 2, 1, 0, 4])
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
df.take(sampler)
    0   1   2   3
3  12  13  14  15
2   8   9  10  11
1   4   5   6   7
0   0   1   2   3
4  16  17  18  19
#### Random Sampling
df.sample(n=3)
    0   1   2   3
1   4   5   6   7
3  12  13  14  15
0   0   1   2   3
#### Sampling with Replacement
choicesv = pd.Series([5, 7, -1, 6, 4])
draws = choicesv.sample(n = 10,replace = True)
draws
4    4
0    5
3    6
1    7
1    7
3    6
4    4
1    7
2   -1
1    7
dtype: int64

### Computing Indicator/Dummy Variables
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b
pd.get_dummies(df.key)
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
dummies = pd.get_dummies(df.key,prefix = 'key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
#### Rows Belonging to Multiple Categories
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('/Users/meininghang/Downloads/pydata-book-2nd-edition/datasets/movielens/movies.dat',
sep = '::',header = None,names = mnames)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
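As the warning suggests, passing engine='python' explicitly avoids it; a minimal sketch of the same call with that option added (same path and mnames as above):

movies = pd.read_table('/Users/meininghang/Downloads/pydata-book-2nd-edition/datasets/movielens/movies.dat',
                       sep='::', header=None, names=mnames, engine='python')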
movies[:10]
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
#### Genres
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy', 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir', 'Western'], dtype=object)
zero_matrix = np.zeros((len(movies),len(genres)))
dummies = pd.DataFrame(zero_matrix,columns = genres)
gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2])
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, dtype: object

#### get_dummies
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503, 0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
bins = [0,0.2,0.4,0.6,0.8,1]
pd.get_dummies(pd.cut(values,bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0

String Manipulation

Built-in String Methods

val = 'a,b,  guido'
val.split(',')  # split
['a', 'b', '  guido']

Trimming

pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']

Concatenation

f,s,t = pieces
f + '::' + s + '::' + t 
'a::b::guido'

Locating Substrings

'guido' in val
True
val.index(',')
1
val.find(':')
-1

Counting

val.count(',')
2

Replacing

val.replace(',', '::')
'a::b::  guido'
val.replace(',', ' ')
'a b   guido'

Method reference:
Argument Description
count Return the number of non-overlapping occurrences of substring in the string.
endswith Returns True if string ends with suffix.
startswith Returns True if string starts with prefix.
join Use string as delimiter for concatenating a sequence of other strings.
index Return position of first character in substring if found in the string; raises ValueError if not found.
find Return position of first character of first occurrence of substring in the string; like index, but returns -1 if not found.
rfind Return position of first character of last occurrence of substring in the string; returns -1 if not found.
replace Replace occurrences of string with another string.
strip, rstrip, lstrip Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively) for each element.
split Break string into list of substrings using passed delimiter.
lower Convert alphabet characters to lowercase.
upper Convert alphabet characters to uppercase.
casefold Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.
ljust, rjust Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.
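A quick sketch (mine, not from the original notebook) exercising a few of the methods listed above on the same val string:

val = 'a,b,  guido'
val.count(',')                    # 2
val.endswith('guido')             # True
val.startswith('a,')              # True
'::'.join(['a', 'b', 'guido'])    # 'a::b::guido'
val.rfind(',')                    # 3
val.upper()                       # 'A,B,  GUIDO'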

Regular Expressions

import re
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']

Compiling a Pattern

regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
Finding Matches
regex.findall(text)
['    ', '\t ', '  \t']
regex.search(text)  # returns the first match
<_sre.SRE_Match object; span=(3, 7), match='    '>

Email Addresses

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

Substitution

regex.sub('REDACTED', text)
'Dave REDACTED\nSteve REDACTED\nRob REDACTED\nRyan REDACTED\n'

Groups

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)  # returns the groups for each match
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]

sub with Group References

regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)
'Dave Username: dave, Domain: google, Suffix: com\nSteve Username: steve, Domain: gmail, Suffix: com\nRob Username: rob, Domain: gmail, Suffix: com\nRyan Username: ryan, Domain: yahoo, Suffix: com\n'
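An optional sketch (mine, not in the original): the same pattern written with named groups, which also makes groupdict() available; the group names username/domain/suffix are my own choice:

named = re.compile(r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})',
                   flags=re.IGNORECASE)
m = named.match('wesm@bright.net')
m.groupdict()   # {'username': 'wesm', 'domain': 'bright', 'suffix': 'net'}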

Method reference:
Argument Description
findall Return all non-overlapping matching patterns in a string as a list
finditer Like findall, but returns an iterator
match Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None
search Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning
split Break string into pieces at each occurrence of pattern
sub, subn Replace all (sub) or first n occurrences (subn) of pattern in string with replacement expression; use symbols \1, \2, ... to refer to match group elements in the replacement string
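A brief sketch (mine) contrasting three of these methods on the regex and text defined above:

regex.match(text)     # None: the text does not start with an email address
regex.search(text)    # match object for the first address, 'dave@google.com'
[m.group() for m in regex.finditer(text)]
# ['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']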

Vectorized String Operations

data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object
data.isnull()
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

Contains

data.str.contains('gmail')
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object

Slicing

data.str[:5]
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object

Method reference:
Method Description
cat Concatenate strings element-wise with optional delimiter
contains Return boolean array if each string contains pattern/regex
count Count occurrences of pattern
extract Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
endswith Equivalent to x.endswith(pattern) for each element
startswith Equivalent to x.startswith(pattern) for each element
findall Compute list of all occurrences of pattern/regex for each string
get Index into each element (retrieve i-th element)
isalnum Equivalent to built-in str.isalnum
isalpha Equivalent to built-in str.isalpha
isdecimal Equivalent to built-in str.isdecimal
isdigit Equivalent to built-in str.isdigit
islower Equivalent to built-in str.islower
isnumeric Equivalent to built-in str.isnumeric
isupper Equivalent to built-in str.isupper
join Join strings in each element of the Series with passed separator
len Compute length of each string
lower, upper Convert cases; equivalent to x.lower() or x.upper() for each element
match Use re.match with the passed regular expression on each element, returning matched groups as list
pad Add whitespace to left, right, or both sides of strings
center Equivalent to pad(side='both')
repeat Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
replace Replace occurrences of pattern/regex with some other string
slice Slice each string in the Series
split Split strings on delimiter or regular expression
strip Trim whitespace from both sides, including newlines
rstrip Trim whitespace on right side
lstrip Trim whitespace on left side
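A short sketch (mine) applying a few of these vectorized methods to the data Series of email addresses, reusing the grouped pattern from the regex section:

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)   # list of (user, domain, suffix) tuples per element
data.str.extract(pattern, flags=re.IGNORECASE)   # DataFrame with one column per group
data.str.len()                                   # length of each string (NaN for the missing entry)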
