I am 勤恳酸奶, a blogger at 靠谱客. This article on 4.2 Data Transformation (complete) collects notes from my recent development work; I found it quite useful and am sharing it here as a reference.

Overview

4.2 Data Transformation

import numpy as np
import pandas as pd

Removing Duplicate Data

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

The duplicated method

data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

The duplicated method returns a boolean Series indicating, for each row, whether it duplicates an earlier row. In this example the rows at index 5 and index 6 are identical, so the value at index 6 is True.

The drop_duplicates method

As the name suggests, this method drops duplicate rows: every occurrence after the first is removed.

data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
Parameter 1: a subset of columns

Both methods can be told which columns to use when detecting duplicates: simply pass a list of column names. For example:

data['v1'] = np.arange(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6

Example 1: passing a single column

data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
data.duplicated(['k1'])
0    False
1    False
2     True
3     True
4     True
5     True
6     True
dtype: bool

Example 2: passing two columns

data.drop_duplicates(['k1', 'v1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.duplicated(['k1', 'k2'])
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
Parameter 2: keep

By default both methods treat the first occurrence as not being a duplicate. Passing keep='last' keeps the last occurrence instead:

data.drop_duplicates(['k1', 'k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
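Besides 'first' and 'last', keep also accepts False, which is not shown in the original text: it drops every occurrence of a duplicated row. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False removes all rows that appear more than once on (k1, k2)
deduped = data.drop_duplicates(['k1', 'k2'], keep=False)
```

Here the pair ('two', 4) appears at indices 5 and 6, so both rows are removed and only indices 0 through 4 remain.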

Transforming Data Using a Function or Mapping

For example:

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef',
                              'Bacon', 'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

Task: add a column indicating the type of animal each meat comes from.

meat_to_animal = {'bacon': 'pig',
                  'pulled pork': 'pig',
                  'pastrami': 'cow',
                  'corned beef': 'cow',
                  'honey ham': 'pig',
                  'nova lox': 'salmon'}
Using a mapping
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
data = data.drop('animal', axis=1)
Using a function (a lambda)
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

Replacing Values

fillna can be viewed as a special case of value replacement (where the special value is NA); replace is simpler and more flexible.
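To make that relationship concrete, here is a minimal sketch (the variable names are illustrative) showing that fillna corresponds to the NA-replacing special case of replace:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# replace's NA special case is exactly what fillna does
via_fillna = s.fillna(0)
via_replace = s.replace(np.nan, 0)

print(via_fillna.equals(via_replace))
```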

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
data.replace(-999, np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

Want to replace -1000 as well? Just pass a list!

data.replace([-999, -1000], np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

Want -999 and -1000 replaced with different values? Again, pass a list!

data.replace([-999, -1000], [np.nan, 0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Passing a dict also works

data.replace({-999: np.nan, -1000: 0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Renaming Axis Indexes

  • Axis indexes can also be renamed with a map method, just like the one introduced earlier
  • The rename method

df.index.map

data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x: x[:4].upper()
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

rename

Modify the index and columns directly via the index and columns parameters
data.rename(index=str.title, columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
Partially modify the index and columns with a dict
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
The inplace parameter

Modifies the dataset in place instead of returning a copy.

data.rename(index=str.upper, columns=str.title, inplace=True)
data
      One  Two  Three  Four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

Discretization and Binning

The pd.cut method

  • Splits the data according to bins
  • Returns a Categorical object
ages = np.random.randint(120, size=100)
ages
array([ 46,  21, 116, 119, 118, 119,  90,  88,  16,  83, 103,   8,  17,
        92,  75,   6,  19,  63,  19,  92,  91,  74,  10,  87,  25,  17,
       116, 118,  60,  98,  52,  54, 101,  26,  66,  58,  37,  68,  19,
        91,  65,  98,  57,  84,  85, 100,  32,  87, 105,  21,  29, 104,
       109,  15, 101,  81,  10,  34,  27, 117,  85,  42,  54,  16, 112,
       102,  51,  13,  93,  43,   8,  55,  33,  34,  72, 103,  79,  61,
        41,  36,  47,  72, 116, 100,   0,  38,  41,  84,  54,  64,  46,
        56,  10, 116,  78,  76, 116,  98,  20,  48])
bins = [18, 25, 35, 60, 100]  # splits the data into the intervals (18, 25], (25, 35], (35, 60], (60, 100]
cats = pd.cut(ages,bins)
cats
[(35.0, 60.0], (18.0, 25.0], NaN, NaN, NaN, ..., (60.0, 100.0], NaN, (60.0, 100.0], (18.0, 25.0], (35.0, 60.0]]
Length: 100
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The codes attribute of Categorical objects

  • Each number is the index of the bin that the value falls into
  • -1 means the value does not belong to any bin
cats.codes
array([ 2,  0, -1, -1, -1, -1,  3,  3, -1,  3, -1, -1, -1,  3,  3, -1,  0,  3,  0,  3,
        3,  3, -1,  3,  0, -1, -1, -1,  2,  3,  2,  2, -1,  1,  3,  2,  2,  3,  0,  3,
        3,  3,  2,  3,  3,  3,  1,  3, -1,  0,  1, -1, -1, -1, -1,  3, -1,  1,  1, -1,
        3,  2,  2, -1, -1, -1,  2, -1,  3,  2, -1,  2,  1,  1,  3, -1,  3,  3,  2,  2,
        2,  3, -1,  3, -1,  2,  2,  3,  2,  3,  2,  2, -1, -1,  3,  3, -1,  3,  0,  2],
      dtype=int8)

The categories attribute of Categorical objects

  • Contains the intervals, along with some other information
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

The pd.value_counts method

  • Here it can be used to count how many values fall into each category
pd.value_counts(cats)
(60, 100]    33
(35, 60]     21
(25, 35]      7
(18, 25]      7
dtype: int64

The right parameter of pd.cut

  • Controls which side of each interval is closed
  • The default is open on the left and closed on the right
pd.cut(ages,bins,right=False)
[[35.0, 60.0), [18.0, 25.0), NaN, NaN, NaN, ..., [60.0, 100.0), NaN, [60.0, 100.0), [18.0, 25.0), [35.0, 60.0)]
Length: 100
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

The labels parameter of pd.cut

  • Assigns custom names to the bins
pd.cut(ages,bins,labels=['Youth','Young Adult','Middle Age','Senior'])
[Middle Age, Youth, NaN, NaN, NaN, ..., Senior, NaN, Senior, Youth, Middle Age]
Length: 100
Categories (4, object): [Youth < Young Adult < Middle Age < Senior]

What if you pass a number of bins instead of explicit edges?

  • Equal-length bins are computed from the minimum and maximum values and used as the bin edges
  • The optional precision parameter sets the number of decimal places
data = np.random.rand(20)
pd.cut(data,4)
[(0.338, 0.524], (0.152, 0.338], (0.524, 0.71], (0.524, 0.71], (0.524, 0.71], ..., (0.71, 0.895], (0.338, 0.524], (0.338, 0.524], (0.524, 0.71], (0.524, 0.71]]
Length: 20
Categories (4, interval[float64]): [(0.152, 0.338] < (0.338, 0.524] < (0.524, 0.71] < (0.71, 0.895]]
pd.cut(data,4,precision=2)
[(0.34, 0.52], (0.15, 0.34], (0.52, 0.71], (0.52, 0.71], (0.52, 0.71], ..., (0.71, 0.9], (0.34, 0.52], (0.34, 0.52], (0.52, 0.71], (0.52, 0.71]]
Length: 20
Categories (4, interval[float64]): [(0.15, 0.34] < (0.34, 0.52] < (0.52, 0.71] < (0.71, 0.9]]
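The equal-length claim can be sanity-checked with a small sketch (one assumed detail not stated in the original: pd.cut also widens the range slightly, by about 0.1%, so the minimum and maximum values are included):

```python
import numpy as np
import pandas as pd

data = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
cats = pd.cut(data, 4)

# each bin spans roughly (max - min) / 4 = 0.25
widths = [iv.right - iv.left for iv in cats.categories]
assert all(abs(w - 0.25) < 0.01 for w in widths)
```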

The qcut method

  • Bins the data based on sample quantiles → yields bins with roughly equal numbers of values
data = np.random.rand(1000)
cats = pd.qcut(data,4)
cats
[(0.481, 0.745], (0.256, 0.481], (-0.00040400000000000006, 0.256], (-0.00040400000000000006, 0.256], (0.481, 0.745], ..., (0.256, 0.481], (-0.00040400000000000006, 0.256], (0.256, 0.481], (0.745, 0.999], (-0.00040400000000000006, 0.256]]
Length: 1000
Categories (4, interval[float64]): [(-0.00040400000000000006, 0.256] < (0.256, 0.481] < (0.481, 0.745] < (0.745, 0.999]]
cats.value_counts()
(-0.00040400000000000006, 0.256]    250
(0.256, 0.481]                      250
(0.481, 0.745]                      250
(0.745, 0.999]                      250
dtype: int64

Notice that every bin holds the same number of values (when the length is divisible by the number of categories).
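A deterministic sketch of that equal-size property, using 8 evenly spaced values and 4 quartiles:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(8.0))
cats = pd.qcut(data, 4)

# 8 values / 4 quantile bins -> exactly 2 values per bin
counts = cats.value_counts()
assert (counts == 2).all()
```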

qcut: custom quantiles

cats = pd.qcut(data,[0,0.1,0.5,0.9,1.])
cats
[(0.481, 0.882], (0.0954, 0.481], (0.0954, 0.481], (-0.00040400000000000006, 0.0954], (0.481, 0.882], ..., (0.0954, 0.481], (0.0954, 0.481], (0.0954, 0.481], (0.882, 0.999], (0.0954, 0.481]]
Length: 1000
Categories (4, interval[float64]): [(-0.00040400000000000006, 0.0954] < (0.0954, 0.481] < (0.481, 0.882] < (0.882, 0.999]]
cats.value_counts()
(-0.00040400000000000006, 0.0954]    100
(0.0954, 0.481]                      400
(0.481, 0.882]                       400
(0.882, 0.999]                       100
dtype: int64

Notice the counts follow the custom quantiles: [(0.1-0)*1000, (0.5-0.1)*1000, (0.9-0.5)*1000, (1-0.9)*1000].
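The same arithmetic can be checked deterministically with 1000 evenly spaced values:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(1000.0))
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])

# bin sizes match the quantile widths: 0.1, 0.4, 0.4, 0.1 of 1000
counts = cats.value_counts().sort_index()
assert counts.tolist() == [100, 400, 400, 100]
```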

Detecting and Filtering Outliers

data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.072513    -0.035990     0.000788    -0.057542
std       0.998585     1.011963     1.022648     0.985153
min      -2.869199    -3.294376    -3.174468    -3.809258
25%      -0.576538    -0.749310    -0.680436    -0.694334
50%       0.107212    -0.080716     0.013427    -0.045903
75%       0.717562     0.673483     0.699950     0.570491
max       3.525382     3.343722     3.196883     2.934261

Example 1: find the values in a single column whose absolute value exceeds 3

col = data[2]
col[np.abs(col) > 3]
183    3.015125
437    3.196883
509   -3.174468
Name: 2, dtype: float64

Example 2: find all rows containing a value whose absolute value exceeds 3
A quick refresher on the DataFrame.any method:
DataFrame.any(self, axis=0, bool_only=None, skipna=True, level=None, **kwargs)

np.abs(data) > 3 returns a boolean DataFrame; .any(1) (i.e. axis=1) yields True for each row that contains at least one value whose absolute value exceeds 3, and the outer data[...] selects those rows from the original data.

data[(np.abs(data) > 3).any(1)]  # axis=1
            0         1         2         3
19   3.525382  0.857763 -0.464948 -1.067385
76   0.520871 -3.229111  0.392370  0.652889
183 -0.132046  1.198331  3.015125 -0.259849
368  3.289928  0.797813 -1.672622 -0.315966
437 -0.877511 -0.240920  3.196883 -0.595124
438  1.555436  3.343722 -0.838151 -0.353502
501  0.898879 -3.294376 -0.521216  0.363553
509 -1.798896  2.085908 -3.174468 -0.632310
679 -1.043170 -2.108286  0.940602 -3.809258
731  0.680952  3.000503  0.885260 -1.007260
881 -0.103924 -3.025645 -0.708571 -1.093002

Setting Values

data[np.abs(data) > 3] = np.sign(data) * 3  # caps every value to the interval [-3, 3]

A quick refresher on np.sign():
$\operatorname{sign}(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$

data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.071697    -0.035785     0.000750    -0.056733
std       0.996008     1.009188     1.021479     0.982397
min      -2.869199    -3.000000    -3.000000    -3.000000
25%      -0.576538    -0.749310    -0.680436    -0.694334
50%       0.107212    -0.080716     0.013427    -0.045903
75%       0.717562     0.673483     0.699950     0.570491
max       3.000000     3.000000     3.000000     2.934261

Depending on the sign of each value, np.sign(data) produces 1 and -1:

np.sign(data)
       0    1    2    3
0    1.0  1.0  1.0  1.0
1   -1.0 -1.0 -1.0  1.0
2    1.0 -1.0 -1.0  1.0
3    1.0  1.0 -1.0 -1.0
4    1.0 -1.0  1.0  1.0
..   ...  ...  ...  ...
995  1.0  1.0  1.0 -1.0
996  1.0  1.0  1.0  1.0
997 -1.0 -1.0  1.0 -1.0
998 -1.0  1.0 -1.0  1.0
999 -1.0  1.0 -1.0 -1.0

1000 rows × 4 columns
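As an aside (clip is not used in the original text), the same capping can be expressed in a single call with DataFrame.clip; a small sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 4.0, -5.0], 'b': [2.5, -3.5, 0.0]})

# manual capping, as in the text
manual = df.copy()
manual[manual.abs() > 3] = np.sign(manual) * 3

# clip does the same in one call
clipped = df.clip(-3, 3)
assert manual.equals(clipped)
```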

Permutation and Random Sampling

numpy.random.permutation

A permutation function → used to permute (randomly reorder) the rows of a Series or DataFrame

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
sampler = np.random.permutation(5)
sampler
array([3, 2, 1, 0, 4])
Permuting the rows:
  • iloc
  • take (DataFrame.take(indices, axis=0, convert=True, is_copy=True, **kwargs); the convert parameter translates negative indices to positive ones (the default))
df.iloc[sampler]
    0   1   2   3
3  12  13  14  15
2   8   9  10  11
1   4   5   6   7
0   0   1   2   3
4  16  17  18  19
df.take(sampler)
    0   1   2   3
3  12  13  14  15
2   8   9  10  11
1   4   5   6   7
0   0   1   2   3
4  16  17  18  19

The sample method

  • Randomly selects a number of rows
df.sample(n=3)
    0   1   2   3
1   4   5   6   7
3  12  13  14  15
2   8   9  10  11
The replace parameter
choices = pd.Series([5, 7, -1, 6, 4])
choices.sample(n=10, replace=True)  # if replace is False (the default), n cannot exceed the length of the Series
0    5
0    5
3    6
0    5
1    7
3    6
2   -1
1    7
0    5
2   -1
dtype: int64
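A small sketch of the constraint mentioned in the comment: without replacement, asking for more rows than exist raises a ValueError.

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# with replacement, n may exceed the length of the Series
assert len(choices.sample(n=10, replace=True)) == 10

# without replacement (the default), it cannot
try:
    choices.sample(n=10)
    oversample_failed = False
except ValueError:
    oversample_failed = True
assert oversample_failed
```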

Computing Indicator/Dummy Variables

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': np.arange(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

The prefix parameter

  • Adds a prefix to the dummy column names
dummies = pd.get_dummies(df['key'], prefix='key')
dummies
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0
df_with_dummy = df[['data1']].join(dummies)  # Series has no join method, so [[...]] is used to get a one-column DataFrame
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0

Rows that belong to multiple categories at once

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('PythonForDataAnalysis-master/ch07/ml-1m/movies.dat',
                       sep='::', header=None, names=mnames, engine='python')
# the default engine is 'c'; the 'python' engine can parse more, e.g. multi-character separators like '::'
movies.head()
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
  1. Extract the distinct genre values
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
  2. Build the indicator DataFrame starting from a DataFrame of all zeros
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1  # i is the row position of the current movie, indices are the genre column positions
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic
   movie_id                               title                        genres  Genre_Animation  Genre_Children's  Genre_Comedy  Genre_Adventure  Genre_Fantasy  Genre_Romance  Genre_Drama  ...  Genre_Crime  Genre_Thriller  Genre_Horror  Genre_Sci-Fi  Genre_Documentary  Genre_War  Genre_Musical  Genre_Mystery  Genre_Film-Noir  Genre_Western
0         1                    Toy Story (1995)   Animation|Children's|Comedy              1.0               1.0           1.0              0.0            0.0            0.0          0.0  ...          0.0             0.0           0.0           0.0                0.0        0.0            0.0            0.0              0.0            0.0
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy              0.0               1.0           0.0              1.0            1.0            0.0          0.0  ...          0.0             0.0           0.0           0.0                0.0        0.0            0.0            0.0              0.0            0.0
2         3             Grumpier Old Men (1995)                Comedy|Romance              0.0               0.0           1.0              0.0            0.0            1.0          0.0  ...          0.0             0.0           0.0           0.0                0.0        0.0            0.0            0.0              0.0            0.0
3         4            Waiting to Exhale (1995)                  Comedy|Drama              0.0               0.0           1.0              0.0            0.0            0.0          1.0  ...          0.0             0.0           0.0           0.0                0.0        0.0            0.0            0.0              0.0            0.0
4         5  Father of the Bride Part II (1995)                        Comedy              0.0               0.0           1.0              0.0            0.0            0.0          0.0  ...          0.0             0.0           0.0           0.0                0.0        0.0            0.0            0.0              0.0            0.0

5 rows × 21 columns

Note: for very large data, building a multi-membership indicator this way is slow. It is better to use lower-level functions that write into a NumPy array directly, then wrap the result in a DataFrame.
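One way that lower-level approach might look (an illustrative sketch with made-up sample data, not code from the original): fill a raw NumPy array first, and wrap it in a DataFrame only once at the end.

```python
import numpy as np
import pandas as pd

# stand-in for the movies.genres column
genre_strings = pd.Series(['Animation|Comedy', 'Adventure|Fantasy', 'Comedy'])

# distinct genres and a genre -> column position lookup
genres = pd.unique([g for s in genre_strings for g in s.split('|')])
col_index = {g: j for j, g in enumerate(genres)}

# write memberships into a plain NumPy array, then wrap once
matrix = np.zeros((len(genre_strings), len(genres)), dtype=np.int8)
for i, s in enumerate(genre_strings):
    for g in s.split('|'):
        matrix[i, col_index[g]] = 1

dummies = pd.DataFrame(matrix, columns=genres)
```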

values = np.random.rand(10)
values
array([0.72757738, 0.11463144, 0.98715221, 0.67817538, 0.76816537,
       0.34697118, 0.5721833 , 0.24276292, 0.88097025, 0.30371988])
bins = np.arange(0, 1, 0.2)
bins
array([0. , 0.2, 0.4, 0.6, 0.8])
pd.get_dummies(pd.cut(values, bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]
0           0           0           0           1
1           1           0           0           0
2           0           0           0           0
3           0           0           0           1
4           0           0           0           1
5           0           1           0           0
6           0           0           1           0
7           0           1           0           0
8           0           0           0           0
9           0           1           0           0

Rows 2 and 8 are all zeros because 0.987 and 0.881 fall above the last bin edge (0.8), so they belong to no bin.
