Overview

19. Data Wrangling (Part 1)

Author: Chris Albon

Translator: 飞龙

License: CC BY-NC-SA 4.0
Apply Functions By Group In Pandas
import pandas as pd

data = {'Platoon': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
        'Casualties': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6, 7, 4, 6, 4, 6]}

df = pd.DataFrame(data)
df
    Casualties Platoon
0            1       A
1            4       A
2            5       A
3            7       A
4            5       A
5            5       A
6            6       B
7            1       B
8            4       B
9            5       B
10           6       B
11           7       C
12           4       C
13           6       C
14           4       C
15           6       C
df.groupby('Platoon')['Casualties'].apply(lambda x: x.rolling(center=False, window=2).mean())
'''
0 NaN
1 2.5
2 4.5
3 6.0
4 6.0
5 5.0
6 NaN
7 3.5
8 2.5
9 4.5
10 5.5
11 NaN
12 5.5
13 5.0
14 5.0
15 5.0
dtype: float64
'''
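The same grouped rolling mean can also be written with groupby().rolling() directly; a minimal sketch on a smaller, hypothetical frame (note that the result carries a (Platoon, original index) MultiIndex):

```python
import pandas as pd

df = pd.DataFrame({'Platoon': ['A', 'A', 'A', 'B', 'B'],
                   'Casualties': [1, 4, 5, 6, 1]})

# Rolling window of 2 within each platoon; the first row of each
# group has no preceding value, so it comes out as NaN.
rolled = df.groupby('Platoon')['Casualties'].rolling(window=2).mean()
print(rolled)
```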
Apply Operations To Groups In Pandas
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
      regiment company      name  preTestScore  postTestScore
0   Nighthawks     1st    Miller             4             25
1   Nighthawks     1st  Jacobson            24             94
2   Nighthawks     2nd       Ali            31             57
3   Nighthawks     2nd    Milner             2             62
4     Dragoons     1st     Cooze             3             70
5     Dragoons     1st     Jacon             4             25
6     Dragoons     2nd    Ryaner            24             94
7     Dragoons     2nd      Sone            31             57
8       Scouts     1st     Sloan             2             62
9       Scouts     1st     Piger             3             70
10      Scouts     2nd     Riani             2             62
11      Scouts     2nd       Ali             3             70
groupby_regiment = df['preTestScore'].groupby(df['regiment'])
groupby_regiment
"This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups." -- PyDA
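The laziness described in the quote can be checked directly; a small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'], 'data1': [1, 2, 3]})

# Building the GroupBy object computes nothing yet
grouped = df['data1'].groupby(df['key1'])
print(type(grouped).__name__)  # SeriesGroupBy

# The actual work happens only when an aggregation is requested
print(grouped.mean())
```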
Use list() to show what the grouping looks like.
list(df['preTestScore'].groupby(df['regiment']))
'''
[('Dragoons', 4 3
5 4
6 24
7 31
Name: preTestScore, dtype: int64), ('Nighthawks', 0 4
1 24
2 31
3 2
Name: preTestScore, dtype: int64), ('Scouts', 8 2
9 3
10 2
11 3
Name: preTestScore, dtype: int64)]
'''
df['preTestScore'].groupby(df['regiment']).describe()

            count   mean        std  min   25%   50%    75%   max
regiment
Dragoons      4.0  15.50  14.153916  3.0  3.75  14.0  25.75  31.0
Nighthawks    4.0  15.25  14.453950  2.0  3.50  14.0  25.75  31.0
Scouts        4.0   2.50   0.577350  2.0  2.00   2.5   3.00   3.0

groupby_regiment.mean()
'''
regiment
Dragoons 15.50
Nighthawks 15.25
Scouts 2.50
Name: preTestScore, dtype: float64
'''
df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
'''
regiment company
Dragoons 1st 3.5
2nd 27.5
Nighthawks 1st 14.0
2nd 16.5
Scouts 1st 2.5
2nd 2.5
Name: preTestScore, dtype: float64
'''
df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()

company      1st   2nd
regiment
Dragoons     3.5  27.5
Nighthawks  14.0  16.5
Scouts       2.5   2.5
df.groupby(['regiment', 'company']).mean()

                    preTestScore  postTestScore
regiment   company
Dragoons   1st               3.5           47.5
           2nd              27.5           75.5
Nighthawks 1st              14.0           59.5
           2nd              16.5           59.5
Scouts     1st               2.5           66.0
           2nd               2.5           66.0
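Note that in recent pandas releases (2.0+), mean() on groups that contain non-numeric columns such as name raises a TypeError unless numeric_only=True is passed; a minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'regiment': ['A', 'A', 'B', 'B'],
                   'company': ['1st', '1st', '1st', '1st'],
                   'name': ['w', 'x', 'y', 'z'],
                   'preTestScore': [1, 3, 5, 7]})

# numeric_only=True skips non-numeric columns such as 'name'
means = df.groupby(['regiment', 'company']).mean(numeric_only=True)
print(means)
```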
df.groupby(['regiment', 'company']).size()
'''
regiment company
Dragoons 1st 2
2nd 2
Nighthawks 1st 2
2nd 2
Scouts 1st 2
2nd 2
dtype: int64
'''
for name, group in df.groupby('regiment'):
    print(name)
    print(group)
'''
Dragoons
regiment company name preTestScore postTestScore
4 Dragoons 1st Cooze 3 70
5 Dragoons 1st Jacon 4 25
6 Dragoons 2nd Ryaner 24 94
7 Dragoons 2nd Sone 31 57
Nighthawks
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
Scouts
regiment company name preTestScore postTestScore
8 Scouts 1st Sloan 2 62
9 Scouts 1st Piger 3 70
10 Scouts 2nd Riani 2 62
11 Scouts 2nd Ali 3 70
'''
Group by columns: in this particular case, group the columns by data type (that is, axis=1), then use list() to see what that grouping looks like.
list(df.groupby(df.dtypes, axis=1))
'''
[(dtype('int64'),
      preTestScore  postTestScore
  0              4             25
  1             24             94
  2             31             57
  3              2             62
  4              3             70
  5              4             25
  6             24             94
  7             31             57
  8              2             62
  9              3             70
  10             2             62
  11             3             70),
 (dtype('O'),
        regiment company      name
  0   Nighthawks     1st    Miller
  1   Nighthawks     1st  Jacobson
  2   Nighthawks     2nd       Ali
  3   Nighthawks     2nd    Milner
  4     Dragoons     1st     Cooze
  5     Dragoons     1st     Jacon
  6     Dragoons     2nd    Ryaner
  7     Dragoons     2nd      Sone
  8       Scouts     1st     Sloan
  9       Scouts     1st     Piger
  10      Scouts     2nd     Riani
  11      Scouts     2nd       Ali)]
'''
df.groupby('regiment').mean().add_prefix('mean_')

            mean_preTestScore  mean_postTestScore
regiment
Dragoons                15.50                61.5
Nighthawks              15.25                59.5
Scouts                   2.50                66.0
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']

df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)

df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()

            count   max   mean   min
categories
Good          8.0  70.0  63.75  57.0
Great         2.0  94.0  94.00  94.0
Low           2.0  25.0  25.00  25.0
Okay          0.0   NaN    NaN   NaN
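The same statistics can come from agg() with a list of reducers rather than an apply() that returns a dict; a sketch (observed=False keeps empty categories such as Okay in the result):

```python
import pandas as pd

df = pd.DataFrame({'postTestScore': [25, 94, 57, 62, 70, 25,
                                     94, 57, 62, 70, 62, 70]})

bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)

# One column per statistic, one row per category
stats = df['postTestScore'].groupby(df['categories'], observed=False) \
                           .agg(['min', 'max', 'count', 'mean'])
print(stats)
```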
Apply Operations Over A Pandas DataFrame
import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}

df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
            coverage   name  reports  year
Cochice           25  Jason        4  2012
Pima              94  Molly       24  2012
Santa Cruz        57   Tina       31  2013
Maricopa          62   Jake        2  2014
Yuma              70    Amy        3  2014
capitalizer = lambda x: x.upper()
Apply the capitalizer function to the name column. apply() can apply a function along any axis of the dataframe.
df['name'].apply(capitalizer)
'''
Cochice JASON
Pima MOLLY
Santa Cruz TINA
Maricopa JAKE
Yuma AMY
Name: name, dtype: object
'''
Map the capitalizer lambda function over each element in the series name. map() applies an operation over each element of a series.
df['name'].map(capitalizer)
'''
Cochice JASON
Pima MOLLY
Santa Cruz TINA
Maricopa JAKE
Yuma AMY
Name: name, dtype: object
'''
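As noted above, apply() can also work row-wise on a DataFrame by passing axis=1; a minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# axis=0 (the default): the function sees one column (a Series) at a time
col_max = df.apply(max)

# axis=1: the function sees one row (a Series) at a time
row_sum = df.apply(sum, axis=1)

print(col_max)
print(row_sum)
```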
Apply a square root function to every single cell in the whole dataframe. applymap() applies a function to every single element in the entire dataframe.
df = df.drop('name', axis=1)

df.applymap(np.sqrt)
            coverage   reports       year
Cochice     5.000000  2.000000  44.855323
Pima        9.695360  4.898979  44.855323
Santa Cruz  7.549834  5.567764  44.866469
Maricopa    7.874008  1.414214  44.877611
Yuma        8.366600  1.732051  44.877611
Apply a function over the dataframe.
def times100(x):
    # if x is a string, return it unchanged
    if type(x) is str:
        return x
    # else, if x is truthy, multiply it by 100
    elif x:
        return 100 * x
    # otherwise (0, None, ...) return nothing
    else:
        return

df.applymap(times100)
            coverage  reports    year
Cochice         2500      400  201200
Pima            9400     2400  201200
Santa Cruz      5700     3100  201300
Maricopa        6200      200  201400
Yuma            7000      300  201400
Assign A New Column To A Pandas DataFrame
import pandas as pd

df = pd.DataFrame()
df['name'] = ['John', 'Steve', 'Sarah']
df

df.assign(age=[31, 32, 19])

    name  age
0   John   31
1  Steve   32
2  Sarah   19
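One detail worth knowing: assign() returns a new DataFrame and leaves the original untouched; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Steve', 'Sarah']})

# assign() builds and returns a new frame; df itself is unchanged
df2 = df.assign(age=[31, 32, 19])

print('age' in df.columns)   # False
print('age' in df2.columns)  # True
```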
Break A List Into Chunks Of Size N
In this snippet, we take a list and break it up into chunks of size n. This is a really common practice when working with APIs that have a maximum request size.

This nifty function comes courtesy of Ned Batchelder and was posted on StackOverflow.
first_names = ['Steve', 'Jane', 'Sara', 'Mary', 'Jack', 'Bob', 'Bily', 'Boni', 'Chris', 'Sori', 'Will', 'Won', 'Li']

# Create a function called "chunks" with two arguments, l and n
def chunks(l, n):
    # step through l in strides of n
    for i in range(0, len(l), n):
        # yield the chunk from i up to (but not including) i + n
        yield l[i:i + n]

list(chunks(first_names, 5))
'''
[['Steve', 'Jane', 'Sara', 'Mary', 'Jack'],
['Bob', 'Bily', 'Boni', 'Chris', 'Sori'],
['Will', 'Won', 'Li']]
'''
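A sketch of the API use case mentioned above, where send_batch is a hypothetical stand-in for a size-limited request:

```python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Hypothetical stand-in for an API call with a maximum request size;
# here it just reports how many items each request would carry.
def send_batch(batch):
    return len(batch)

sizes = [send_batch(batch) for batch in chunks(list(range(13)), 5)]
print(sizes)  # [5, 5, 3]
```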
Breaking Up A String Into Columns Using Regex In Pandas
import re
import pandas as pd

data = {'raw': ['Arizona 1 2014-12-23 3242.0',
                'Iowa 1 2010-02-23 3453.7',
                'Oregon 0 2014-06-20 2123.0',
                'Maryland 0 2014-03-14 1123.6',
                'Florida 1 2013-01-15 2134.0',
                'Georgia 0 2012-07-14 2345.6']}

df = pd.DataFrame(data, columns=['raw'])
df
                            raw
0   Arizona 1 2014-12-23 3242.0
1      Iowa 1 2010-02-23 3453.7
2    Oregon 0 2014-06-20 2123.0
3  Maryland 0 2014-03-14 1123.6
4   Florida 1 2013-01-15 2134.0
5   Georgia 0 2012-07-14 2345.6
df['raw'].str.contains('....-..-..', regex=True)
'''
0 True
1 True
2 True
3 True
4 True
5 True
Name: raw, dtype: bool
'''
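Because '.' matches any character, the pattern above is looser than it appears; a stricter sketch using digit classes on toy data:

```python
import pandas as pd

s = pd.Series(['Arizona 1 2014-12-23 3242.0', 'no date here'])

# \d{4}-\d{2}-\d{2} only matches actual digits in the date positions,
# whereas '....-..-..' would accept any characters there.
mask = s.str.contains(r'\d{4}-\d{2}-\d{2}', regex=True)
print(mask.tolist())  # [True, False]
```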
df['female'] = df['raw'].str.extract(r'(\d)', expand=True)
df['female']
'''
0 1
1 1
2 0
3 0
4 1
5 0
Name: female, dtype: object
'''
df['date'] = df['raw'].str.extract('(....-..-..)', expand=True)
df['date']
'''
0 2014-12-23
1 2010-02-23
2 2014-06-20
3 2014-03-14
4 2013-01-15
5 2012-07-14
Name: date, dtype: object
'''
df['score'] = df['raw'].str.extract(r'(\d\d\d\d\.\d)', expand=True)
df['score']
'''
0 3242.0
1 3453.7
2 2123.0
3 1123.6
4 2134.0
5 2345.6
Name: score, dtype: object
'''
df['state'] = df['raw'].str.extract(r'([A-Z]\w{0,})', expand=True)
df['state']
'''
0 Arizona
1 Iowa
2 Oregon
3 Maryland
4 Florida
5 Georgia
Name: state, dtype: object
'''
df
                            raw female        date   score     state
0   Arizona 1 2014-12-23 3242.0      1  2014-12-23  3242.0   Arizona
1      Iowa 1 2010-02-23 3453.7      1  2010-02-23  3453.7      Iowa
2    Oregon 0 2014-06-20 2123.0      0  2014-06-20  2123.0    Oregon
3  Maryland 0 2014-03-14 1123.6      0  2014-03-14  1123.6  Maryland
4   Florida 1 2013-01-15 2134.0      1  2013-01-15  2134.0   Florida
5   Georgia 0 2012-07-14 2345.6      0  2012-07-14  2345.6   Georgia
Columns Shared By Two DataFrames
import pandas as pd

dataframe_one = pd.DataFrame()
dataframe_one['1'] = ['1', '1', '1']
dataframe_one['B'] = ['b', 'b', 'b']

dataframe_two = pd.DataFrame()
dataframe_two['2'] = ['2', '2', '2']
dataframe_two['B'] = ['b', 'b', 'b']

set.intersection(set(dataframe_one), set(dataframe_two))
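Running the intersection idea on a couple of toy frames shows the shared column label; iterating over a DataFrame yields its column names:

```python
import pandas as pd

df_one = pd.DataFrame({'1': ['1'], 'B': ['b']})
df_two = pd.DataFrame({'2': ['2'], 'B': ['b']})

# set(DataFrame) collects the column labels, so the intersection
# is the set of columns the two frames have in common.
shared = set(df_one) & set(df_two)
print(shared)  # {'B'}
```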
Construct A Dictionary From Multiple Lists
officer_names = ['Sodoni Dogla', 'Chris Jefferson', 'Jessica Billars', 'Michael Mulligan', 'Steven Johnson']
officer_armies = ['Purple Army', 'Orange Army', 'Green Army', 'Red Army', 'Blue Army']

dict(zip(officer_names, officer_armies))
'''
{'Chris Jefferson': 'Orange Army',
'Jessica Billars': 'Green Army',
'Michael Mulligan': 'Red Army',
'Sodoni Dogla': 'Purple Army',
'Steven Johnson': 'Blue Army'}
'''
Convert A CSV Into Python Code To Recreate It
import pandas as pd

df_original = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')

print('==============================')
print('RUN THE CODE BELOW THIS LINE')
print('==============================')
print('raw_data =', df.to_dict(orient='list'))
print('df = pd.DataFrame(raw_data, columns = ' + str(list(df_original)) + ')')
'''
==============================
RUN THE CODE BELOW THIS LINE
==============================
raw_data = {'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 
7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 
1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 
3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 
'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 
2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]}
'''
df = pd.DataFrame(raw_data, columns=['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])
raw_data = { 'Petal.Width' : [ 0.20000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.40000000000000002 , 0.29999999999999999 , 0.20000000000000001 , 0.20000000000000001 , 0.10000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.10000000000000001 , 0.10000000000000001 , 0.20000000000000001 , 0.40000000000000002 , 0.40000000000000002 , 0.29999999999999999 , 0.29999999999999999 , 0.29999999999999999 , 0.20000000000000001 , 0.40000000000000002 , 0.20000000000000001 , 0.5 , 0.20000000000000001 , 0.20000000000000001 , 0.40000000000000002 , 0.20000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.40000000000000002 , 0.10000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.10000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.29999999999999999 , 0.29999999999999999 , 0.20000000000000001 , 0.59999999999999998 , 0.40000000000000002 , 0.29999999999999999 , 0.20000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 0.20000000000000001 , 1.3999999999999999 , 1.5 , 1.5 , 1.3 , 1.5 , 1.3 , 1.6000000000000001 , 1.0 , 1.3 , 1.3999999999999999 , 1.0 , 1.5 , 1.0 , 1.3999999999999999 , 1.3 , 1.3999999999999999 , 1.5 , 1.0 , 1.5 , 1.1000000000000001 , 1.8 , 1.3 , 1.5 , 1.2 , 1.3 , 1.3999999999999999 , 1.3999999999999999 , 1.7 , 1.5 , 1.0 , 1.1000000000000001 , 1.0 , 1.2 , 1.6000000000000001 , 1.5 , 1.6000000000000001 , 1.5 , 1.3 , 1.3 , 1.3 , 1.2 , 1.3999999999999999 , 1.2 , 1.0 , 1.3 , 1.2 , 1.3 , 1.3 , 1.1000000000000001 , 1.3 , 2.5 , 1.8999999999999999 , 2.1000000000000001 , 1.8 , 2.2000000000000002 , 2.1000000000000001 , 1.7 , 1.8 , 1.8 , 2.5 , 2.0 , 1.8999999999999999 , 2.1000000000000001 , 2.0 , 2.3999999999999999 , 2.2999999999999998 , 1.8 , 2.2000000000000002 , 2.2999999999999998 , 1.5 , 2.2999999999999998 , 2.0 , 2.0 , 1.8 , 2.1000000000000001 , 1.8 , 1.8 , 1.8 , 2.1000000000000001 , 1.6000000000000001 , 
1.8999999999999999 , 2.0 , 2.2000000000000002 , 1.5 , 1.3999999999999999 , 2.2999999999999998 , 2.3999999999999999 , 1.8 , 1.8 , 2.1000000000000001 , 2.3999999999999999 , 2.2999999999999998 , 1.8999999999999999 , 2.2999999999999998 , 2.5 , 2.2999999999999998 , 1.8999999999999999 , 2.0 , 2.2999999999999998 , 1.8 ] , 'Sepal.Width' : [ 3.5 , 3.0 , 3.2000000000000002 , 3.1000000000000001 , 3.6000000000000001 , 3.8999999999999999 , 3.3999999999999999 , 3.3999999999999999 , 2.8999999999999999 , 3.1000000000000001 , 3.7000000000000002 , 3.3999999999999999 , 3.0 , 3.0 , 4.0 , 4.4000000000000004 , 3.8999999999999999 , 3.5 , 3.7999999999999998 , 3.7999999999999998 , 3.3999999999999999 , 3.7000000000000002 , 3.6000000000000001 , 3.2999999999999998 , 3.3999999999999999 , 3.0 , 3.3999999999999999 , 3.5 , 3.3999999999999999 , 3.2000000000000002 , 3.1000000000000001 , 3.3999999999999999 , 4.0999999999999996 , 4.2000000000000002 , 3.1000000000000001 , 3.2000000000000002 , 3.5 , 3.6000000000000001 , 3.0 , 3.3999999999999999 , 3.5 , 2.2999999999999998 , 3.2000000000000002 , 3.5 , 3.7999999999999998 , 3.0 , 3.7999999999999998 , 3.2000000000000002 , 3.7000000000000002 , 3.2999999999999998 , 3.2000000000000002 , 3.2000000000000002 , 3.1000000000000001 , 2.2999999999999998 , 2.7999999999999998 , 2.7999999999999998 , 3.2999999999999998 , 2.3999999999999999 , 2.8999999999999999 , 2.7000000000000002 , 2.0 , 3.0 , 2.2000000000000002 , 2.8999999999999999 , 2.8999999999999999 , 3.1000000000000001 , 3.0 , 2.7000000000000002 , 2.2000000000000002 , 2.5 , 3.2000000000000002 , 2.7999999999999998 , 2.5 , 2.7999999999999998 , 2.8999999999999999 , 3.0 , 2.7999999999999998 , 3.0 , 2.8999999999999999 , 2.6000000000000001 , 2.3999999999999999 , 2.3999999999999999 , 2.7000000000000002 , 2.7000000000000002 , 3.0 , 3.3999999999999999 , 3.1000000000000001 , 2.2999999999999998 , 3.0 , 2.5 , 2.6000000000000001 , 3.0 , 2.6000000000000001 , 2.2999999999999998 , 2.7000000000000002 , 3.0 , 2.8999999999999999 , 
2.8999999999999999 , 2.5 , 2.7999999999999998 , 3.2999999999999998 , 2.7000000000000002 , 3.0 , 2.8999999999999999 , 3.0 , 3.0 , 2.5 , 2.8999999999999999 , 2.5 , 3.6000000000000001 , 3.2000000000000002 , 2.7000000000000002 , 3.0 , 2.5 , 2.7999999999999998 , 3.2000000000000002 , 3.0 , 3.7999999999999998 , 2.6000000000000001 , 2.2000000000000002 , 3.2000000000000002 , 2.7999999999999998 , 2.7999999999999998 , 2.7000000000000002 , 3.2999999999999998 , 3.2000000000000002 , 2.7999999999999998 , 3.0 , 2.7999999999999998 , 3.0 , 2.7999999999999998 , 3.7999999999999998 , 2.7999999999999998 , 2.7999999999999998 , 2.6000000000000001 , 3.0 , 3.3999999999999999 , 3.1000000000000001 , 3.0 , 3.1000000000000001 , 3.1000000000000001 , 3.1000000000000001 , 2.7000000000000002 , 3.2000000000000002 , 3.2999999999999998 , 3.0 , 2.5 , 3.0 , 3.3999999999999999 , 3.0 ] , 'Species' : [ 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'setosa' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 
'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'versicolor' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' , 'virginica' ] , 'Unnamed: 0' : [ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 , 58 , 59 , 60 , 61 , 62 , 63 , 64 , 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 74 , 75 , 76 , 77 , 78 , 79 , 80 , 81 , 82 , 83 , 84 , 85 , 86 , 87 , 88 , 89 , 90 , 91 , 92 , 93 , 94 , 95 , 96 , 97 , 98 , 99 , 100 , 101 , 102 , 103 , 104 , 105 , 106 , 107 , 108 , 109 , 110 , 111 , 112 , 113 , 114 , 115 , 116 , 117 , 118 , 119 , 120 , 121 , 122 , 123 , 124 , 125 , 126 , 127 , 128 , 129 , 130 , 131 , 132 , 133 , 134 , 135 , 136 , 137 , 138 , 139 , 140 , 141 , 142 , 143 , 144 , 145 , 146 , 147 , 148 , 149 , 150 ] , 'Sepal.Length' : [ 5.0999999999999996 , 4.9000000000000004 , 4.7000000000000002 , 4.5999999999999996 , 5.0 , 5.4000000000000004 , 4.5999999999999996 , 5.0 , 4.4000000000000004 , 4.9000000000000004 , 5.4000000000000004 , 4.7999999999999998 , 4.7999999999999998 , 4.2999999999999998 , 5.7999999999999998 , 
5.7000000000000002 , 5.4000000000000004 , 5.0999999999999996 , 5.7000000000000002 , 5.0999999999999996 , 5.4000000000000004 , 5.0999999999999996 , 4.5999999999999996 , 5.0999999999999996 , 4.7999999999999998 , 5.0 , 5.0 , 5.2000000000000002 , 5.2000000000000002 , 4.7000000000000002 , 4.7999999999999998 , 5.4000000000000004 , 5.2000000000000002 , 5.5 , 4.9000000000000004 , 5.0 , 5.5 , 4.9000000000000004 , 4.4000000000000004 , 5.0999999999999996 , 5.0 , 4.5 , 4.4000000000000004 , 5.0 , 5.0999999999999996 , 4.7999999999999998 , 5.0999999999999996 , 4.5999999999999996 , 5.2999999999999998 , 5.0 , 7.0 , 6.4000000000000004 , 6.9000000000000004 , 5.5 , 6.5 , 5.7000000000000002 , 6.2999999999999998 , 4.9000000000000004 , 6.5999999999999996 , 5.2000000000000002 , 5.0 , 5.9000000000000004 , 6.0 , 6.0999999999999996 , 5.5999999999999996 , 6.7000000000000002 , 5.5999999999999996 , 5.7999999999999998 , 6.2000000000000002 , 5.5999999999999996 , 5.9000000000000004 , 6.0999999999999996 , 6.2999999999999998 , 6.0999999999999996 , 6.4000000000000004 , 6.5999999999999996 , 6.7999999999999998 , 6.7000000000000002 , 6.0 , 5.7000000000000002 , 5.5 , 5.5 , 5.7999999999999998 , 6.0 , 5.4000000000000004 , 6.0 , 6.7000000000000002 , 6.2999999999999998 , 5.5999999999999996 , 5.5 , 5.5 , 6.0999999999999996 , 5.7999999999999998 , 5.0 , 5.5999999999999996 , 5.7000000000000002 , 5.7000000000000002 , 6.2000000000000002 , 5.0999999999999996 , 5.7000000000000002 , 6.2999999999999998 , 5.7999999999999998 , 7.0999999999999996 , 6.2999999999999998 , 6.5 , 7.5999999999999996 , 4.9000000000000004 , 7.2999999999999998 , 6.7000000000000002 , 7.2000000000000002 , 6.5 , 6.4000000000000004 , 6.7999999999999998 , 5.7000000000000002 , 5.7999999999999998 , 6.4000000000000004 , 6.5 , 7.7000000000000002 , 7.7000000000000002 , 6.0 , 6.9000000000000004 , 5.5999999999999996 , 7.7000000000000002 , 6.2999999999999998 , 6.7000000000000002 , 7.2000000000000002 , 6.2000000000000002 , 6.0999999999999996 , 
6.4000000000000004 , 7.2000000000000002 , 7.4000000000000004 , 7.9000000000000004 , 6.4000000000000004 , 6.2999999999999998 , 6.0999999999999996 , 7.7000000000000002 , 6.2999999999999998 , 6.4000000000000004 , 6.0 , 6.9000000000000004 , 6.7000000000000002 , 6.9000000000000004 , 5.7999999999999998 , 6.7999999999999998 , 6.7000000000000002 , 6.7000000000000002 , 6.2999999999999998 , 6.5 , 6.2000000000000002 , 5.9000000000000004 ] , 'Petal.Length' : [ 1.3999999999999999 , 1.3999999999999999 , 1.3 , 1.5 , 1.3999999999999999 , 1.7 , 1.3999999999999999 , 1.5 , 1.3999999999999999 , 1.5 , 1.5 , 1.6000000000000001 , 1.3999999999999999 , 1.1000000000000001 , 1.2 , 1.5 , 1.3 , 1.3999999999999999 , 1.7 , 1.5 , 1.7 , 1.5 , 1.0 , 1.7 , 1.8999999999999999 , 1.6000000000000001 , 1.6000000000000001 , 1.5 , 1.3999999999999999 , 1.6000000000000001 , 1.6000000000000001 , 1.5 , 1.5 , 1.3999999999999999 , 1.5 , 1.2 , 1.3 , 1.3999999999999999 , 1.3 , 1.5 , 1.3 , 1.3 , 1.3 , 1.6000000000000001 , 1.8999999999999999 , 1.3999999999999999 , 1.6000000000000001 , 1.3999999999999999 , 1.5 , 1.3999999999999999 , 4.7000000000000002 , 4.5 , 4.9000000000000004 , 4.0 , 4.5999999999999996 , 4.5 , 4.7000000000000002 , 3.2999999999999998 , 4.5999999999999996 , 3.8999999999999999 , 3.5 , 4.2000000000000002 , 4.0 , 4.7000000000000002 , 3.6000000000000001 , 4.4000000000000004 , 4.5 , 4.0999999999999996 , 4.5 , 3.8999999999999999 , 4.7999999999999998 , 4.0 , 4.9000000000000004 , 4.7000000000000002 , 4.2999999999999998 , 4.4000000000000004 , 4.7999999999999998 , 5.0 , 4.5 , 3.5 , 3.7999999999999998 , 3.7000000000000002 , 3.8999999999999999 , 5.0999999999999996 , 4.5 , 4.5 , 4.7000000000000002 , 4.4000000000000004 , 4.0999999999999996 , 4.0 , 4.4000000000000004 , 4.5999999999999996 , 4.0 , 3.2999999999999998 , 4.2000000000000002 , 4.2000000000000002 , 4.2000000000000002 , 4.2999999999999998 , 3.0 , 4.0999999999999996 , 6.0 , 5.0999999999999996 , 5.9000000000000004 , 5.5999999999999996 , 5.7999999999999998 , 
6.5999999999999996 , 4.5 , 6.2999999999999998 , 5.7999999999999998 , 6.0999999999999996 , 5.0999999999999996 , 5.2999999999999998 , 5.5 , 5.0 , 5.0999999999999996 , 5.2999999999999998 , 5.5 , 6.7000000000000002 , 6.9000000000000004 , 5.0 , 5.7000000000000002 , 4.9000000000000004 , 6.7000000000000002 , 4.9000000000000004 , 5.7000000000000002 , 6.0 , 4.7999999999999998 , 4.9000000000000004 , 5.5999999999999996 , 5.7999999999999998 , 6.0999999999999996 , 6.4000000000000004 , 5.5999999999999996 , 5.0999999999999996 , 5.5999999999999996 , 6.0999999999999996 , 5.5999999999999996 , 5.5 , 4.7999999999999998 , 5.4000000000000004 , 5.5999999999999996 , 5.0999999999999996 , 5.0999999999999996 , 5.9000000000000004 , 5.7000000000000002 , 5.2000000000000002 , 5.0 , 5.2000000000000002 , 5.4000000000000004 , 5.0999999999999996 ] }
df = pd.DataFrame(raw_data, columns=['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])
df.head()
   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
3           4           4.6          3.1           1.5          0.2  setosa
4           5           5.0          3.6           1.4          0.2  setosa
df_original.head()
   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
3           4           4.6          3.1           1.5          0.2  setosa
4           5           5.0          3.6           1.4          0.2  setosa
将分类变量转换为虚拟变量
import pandas as pd
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'sex': ['male', 'female', 'male', 'female', 'female']}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'sex'])
df
  first_name last_name     sex
0      Jason    Miller    male
1      Molly  Jacobson  female
2       Tina       Ali    male
3       Jake    Milner  female
4        Amy     Cooze  female
df_sex = pd.get_dummies(df['sex'])
df_new = pd.concat([df, df_sex], axis=1)
df_new
  first_name last_name     sex  female  male
0      Jason    Miller    male     0.0   1.0
1      Molly  Jacobson  female     1.0   0.0
2       Tina       Ali    male     0.0   1.0
3       Jake    Milner  female     1.0   0.0
4        Amy     Cooze  female     1.0   0.0
df_new = df.join(df_sex)
df_new
  first_name last_name     sex  female  male
0      Jason    Miller    male     0.0   1.0
1      Molly  Jacobson  female     1.0   0.0
2       Tina       Ali    male     0.0   1.0
3       Jake    Milner  female     1.0   0.0
4        Amy     Cooze  female     1.0   0.0
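除了先 `get_dummies` 再拼接,也可以直接把整个数据帧传给 `pd.get_dummies`,用 `columns` 指定要转换的列,用 `drop_first=True` 丢弃第一个水平以避免虚拟变量陷阱。下面是一个简要示意(数据取自上例):

```python
import pandas as pd

df = pd.DataFrame({'sex': ['male', 'female', 'male', 'female', 'female']})

# 一步生成虚拟变量;drop_first=True 丢弃按字母序的第一个水平(这里是 female),
# 避免各虚拟列完全共线
dummies = pd.get_dummies(df, columns=['sex'], drop_first=True)
```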
将分类变量转换为虚拟变量
import pandas as pd
import patsy
raw_data = {'countrycode': [1, 2, 3, 2, 1]}
df = pd.DataFrame(raw_data, columns=['countrycode'])
df
patsy.dmatrix('C(countrycode)-1', df, return_type='dataframe')
   C(countrycode)[1]  C(countrycode)[2]  C(countrycode)[3]
0                1.0                0.0                0.0
1                0.0                1.0                0.0
2                0.0                0.0                1.0
3                0.0                1.0                0.0
4                1.0                0.0                0.0
将字符串分类变量转换为数字变量
import pandas as pd
raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong', 'weak', 'normal', 'weak', 'strong']}
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
df
   patient  obs  treatment   score
0        1    1          0  strong
1        1    2          1    weak
2        1    3          0  normal
3        2    1          1    weak
4        2    2          0  strong
def score_to_numeric(x):
    if x == 'strong':
        return 3
    if x == 'normal':
        return 2
    if x == 'weak':
        return 1
df['score_num'] = df['score'].apply(score_to_numeric)
df
   patient  obs  treatment   score  score_num
0        1    1          0  strong          3
1        1    2          1    weak          1
2        1    3          0  normal          2
3        2    1          1    weak          1
4        2    2          0  strong          3
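除了 `apply` 一个逐值判断的函数,也可以用 `Series.map` 传入字典做向量化映射。下面是一个简要示意(数据取自上例):

```python
import pandas as pd

df = pd.DataFrame({'score': ['strong', 'weak', 'normal', 'weak', 'strong']})

# 用字典映射代替逐值判断的函数;字典中没有的取值会变成 NaN
df['score_num'] = df['score'].map({'strong': 3, 'normal': 2, 'weak': 1})
```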
将变量转换为时间序列
import pandas as pd
raw_data = {'date': ['2014-06-01T01:21:38.004053', '2014-06-02T01:21:38.004053', '2014-06-03T01:21:38.004053'],
            'score': [25, 94, 57]}
df = pd.DataFrame(raw_data, columns=['date', 'score'])
df
                         date  score
0  2014-06-01T01:21:38.004053     25
1  2014-06-02T01:21:38.004053     94
2  2014-06-03T01:21:38.004053     57
df["date"] = pd.to_datetime(df["date"])
df = df.set_index(df["date"])
df
                                                  date  score
date
2014-06-01 01:21:38.004053  2014-06-01 01:21:38.004053     25
2014-06-02 01:21:38.004053  2014-06-02 01:21:38.004053     94
2014-06-03 01:21:38.004053  2014-06-03 01:21:38.004053     57
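转换为 DatetimeIndex 之后,就可以按日期字符串做部分切片。下面是一个简要示意(这里用 `set_index('date')` 按列名设置索引,顺便去掉重复的 date 列;时间戳为假设):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2014-06-01T01:21:38', '2014-06-02T01:21:38', '2014-06-03T01:21:38'],
                   'score': [25, 94, 57]})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# DatetimeIndex 支持按日期字符串做部分匹配,选出某一天的全部行
june_2 = df.loc['2014-06-02']
```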
在 Pandas 数据帧中计数
import pandas as pd
year = pd.Series([1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884,
                  1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894])
guardCorps = pd.Series([0, 2, 2, 1, 0, 0, 1, 1, 0, 3, 0, 2, 1, 0, 0, 1, 0, 1, 0, 1])
corps1 = pd.Series([0, 0, 0, 2, 0, 3, 0, 2, 0, 0, 0, 1, 1, 1, 0, 2, 0, 3, 1, 0])
corps2 = pd.Series([0, 0, 0, 2, 0, 2, 0, 0, 1, 1, 0, 0, 2, 1, 1, 0, 0, 2, 0, 0])
corps3 = pd.Series([0, 0, 0, 1, 1, 1, 2, 0, 2, 0, 0, 0, 1, 0, 1, 2, 1, 0, 0, 0])
corps4 = pd.Series([0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0])
corps5 = pd.Series([0, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0])
corps6 = pd.Series([0, 0, 1, 0, 2, 0, 0, 1, 2, 0, 1, 1, 3, 1, 1, 1, 0, 3, 0, 0])
corps7 = pd.Series([1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 2, 0, 0, 2, 1, 0, 2, 0])
corps8 = pd.Series([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1])
corps9 = pd.Series([0, 0, 0, 0, 0, 2, 1, 1, 1, 0, 2, 1, 1, 0, 1, 2, 0, 1, 0, 0])
corps10 = pd.Series([0, 0, 1, 1, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 2, 1, 3, 0, 1, 1])
corps11 = pd.Series([0, 0, 0, 0, 2, 4, 0, 1, 3, 0, 1, 1, 1, 1, 2, 1, 3, 1, 3, 1])
corps14 = pd.Series([1, 1, 2, 1, 1, 3, 0, 4, 0, 1, 0, 3, 2, 1, 0, 2, 1, 1, 0, 0])
corps15 = pd.Series([0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0])
variables = dict(guardCorps=guardCorps, corps1=corps1,
                 corps2=corps2, corps3=corps3, corps4=corps4,
                 corps5=corps5, corps6=corps6, corps7=corps7,
                 corps8=corps8, corps9=corps9, corps10=corps10,
                 corps11=corps11, corps14=corps14, corps15=corps15)
horsekick = pd.DataFrame(variables, columns=['guardCorps',
                                             'corps1', 'corps2',
                                             'corps3', 'corps4',
                                             'corps5', 'corps6',
                                             'corps7', 'corps8',
                                             'corps9', 'corps10',
                                             'corps11', 'corps14',
                                             'corps15'])
horsekick.index = [1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884,
                   1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894]
horsekick
guardCorps corps1 corps2 corps3 corps4 corps5 corps6 corps7 corps8 corps9 corps10 corps11 corps14 corps15 1875 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1876 2 0 0 0 1 0 0 0 0 0 0 0 1 1 1877 2 0 0 0 0 0 1 1 0 0 1 0 2 0 1878 1 2 2 1 1 0 0 0 0 0 1 0 1 0 1879 0 0 0 1 1 2 2 0 1 0 0 2 1 0 1880 0 3 2 1 1 1 0 0 0 2 1 4 3 0 1881 1 0 0 2 1 0 0 1 0 1 0 0 0 0 1882 1 2 0 0 0 0 1 0 1 1 2 1 4 1 1883 0 0 1 2 0 1 2 1 0 1 0 3 0 0 1884 3 0 1 0 0 0 0 1 0 0 2 0 1 1 1885 0 0 0 0 0 0 1 0 0 2 0 1 0 1 1886 2 1 0 0 1 1 1 0 0 1 0 1 3 0 1887 1 1 2 1 0 0 3 2 1 1 0 1 2 0 1888 0 1 1 0 0 1 1 0 0 0 0 1 1 0 1889 0 0 1 1 0 1 1 0 0 1 2 2 0 2 1890 1 2 0 2 0 1 1 2 0 2 1 1 2 2 1891 0 0 0 1 1 1 0 1 1 0 3 3 1 0 1892 1 3 2 0 1 1 3 0 1 1 0 1 1 0 1893 0 1 0 0 0 1 0 2 0 0 1 3 0 0 1894 1 0 0 0 0 0 0 0 1 0 1 1 0 0
result = horsekick.apply(pd.value_counts).fillna(0); result
   guardCorps  corps1  corps2  corps3  corps4  corps5  corps6  corps7  corps8  corps9  corps10  corps11  corps14  corps15
0         9.0    11.0    12.0    11.0    12.0    10.0     9.0    11.0    13.0    10.0     10.0        6        6     14.0
1         7.0     4.0     4.0     6.0     8.0     9.0     7.0     6.0     7.0     7.0      6.0        8        8      4.0
2         3.0     3.0     4.0     3.0     0.0     1.0     2.0     3.0     0.0     3.0      3.0        2        3      2.0
3         1.0     2.0     0.0     0.0     0.0     0.0     2.0     0.0     0.0     0.0      1.0        3        2      0.0
4         0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0      0.0        1        1      0.0
pd.value_counts(horsekick['guardCorps'].values, sort=False)
'''
0 9
1 7
2 3
3 1
dtype: int64
'''
horsekick['guardCorps'].unique()
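`value_counts` 和 `unique` 的行为可以用一小段确定的数据来验证。下面是一个简要示意(数据为假设):

```python
import pandas as pd

# 一个小的示例序列
s = pd.Series([0, 1, 1, 2, 0, 0])

# value_counts 统计每个取值出现的次数(默认按频数降序排列)
counts = s.value_counts()

# unique 按首次出现的顺序返回不重复的取值
uniques = s.unique()
```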
在 Pandas 中创建流水线
Pandas 的流水线功能允许你将 Python 函数串联在一起,来构建数据处理流水线。
import pandas as pd
df = pd.DataFrame()
df['name'] = ['John', 'Steve', 'Sarah']
df['gender'] = ['Male', 'Male', 'Female']
df['age'] = [31, 32, 19]
df
    name  gender  age
0   John    Male   31
1  Steve    Male   32
2  Sarah  Female   19
def mean_age_by_group(dataframe, col):
    return dataframe.groupby(col).mean()
def uppercase_column_name(dataframe):
    dataframe.columns = dataframe.columns.str.upper()
    return dataframe
(df.pipe(mean_age_by_group, col='gender')
   .pipe(uppercase_column_name)
)
         AGE
gender
Female  19.0
Male    31.5
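`df.pipe(f, col='gender')` 等价于 `f(df, col='gender')`:`pipe` 只是把数据帧作为第一个参数传给函数,便于把多个步骤串成可读的链条。下面用一个简要示意验证这一点(数据为假设):

```python
import pandas as pd

df = pd.DataFrame({'gender': ['Male', 'Male', 'Female'],
                   'age': [31, 32, 19]})

def mean_age_by_group(dataframe, col):
    # 按指定列分组并求均值
    return dataframe.groupby(col).mean()

# 两种写法得到完全相同的结果
piped = df.pipe(mean_age_by_group, col='gender')
nested = mean_age_by_group(df, col='gender')
```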
使用 for 循环创建 Pandas 列
import pandas as pd
import numpy as np
raw_data = {'student_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'test_score': [76, 88, 84, 67, 53, 96, 64, 91, 77, 73, 52, np.nan]}
df = pd.DataFrame(raw_data, columns=['student_name', 'test_score'])
grades = []
for row in df['test_score']:
    if row > 95:
        grades.append('A')
    elif row > 90:
        grades.append('A-')
    elif row > 85:
        grades.append('B')
    elif row > 80:
        grades.append('B-')
    elif row > 75:
        grades.append('C')
    elif row > 70:
        grades.append('C-')
    elif row > 65:
        grades.append('D')
    elif row > 60:
        grades.append('D-')
    else:
        grades.append('Failed')
df['grades'] = grades
df
   student_name  test_score  grades
0        Miller        76.0       C
1      Jacobson        88.0       B
2           Ali        84.0      B-
3        Milner        67.0       D
4         Cooze        53.0  Failed
5         Jacon        96.0       A
6        Ryaner        64.0      D-
7          Sone        91.0      A-
8         Sloan        77.0       C
9         Piger        73.0      C-
10        Riani        52.0  Failed
11          Ali         NaN  Failed
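除了逐行追加到列表,也可以用 `pd.cut` 做向量化分箱。下面是一个简要示意(分界点与上面的 if/elif 并不完全一致,仅演示用法;数据为假设):

```python
import pandas as pd
import numpy as np

scores = pd.Series([76, 88, 96, 53, np.nan])

# bins 的区间默认左开右闭;NaN 不落入任何区间,结果仍为 NaN
grades = pd.cut(scores,
                bins=[-np.inf, 60, 70, 80, 90, np.inf],
                labels=['Failed', 'D', 'C', 'B', 'A'])
```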
创建项目计数
from collections import Counter
fruit_eaten = Counter(['Apple', 'Apple', 'Apple', 'Banana', 'Pear', 'Pineapple'])
fruit_eaten
fruit_eaten.update(['Pineapple'])
fruit_eaten
fruit_eaten.most_common(3)
基于条件创建一列
import pandas as pd
import numpy as np
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore', 'postTestScore'])
df
    name  age  preTestScore  postTestScore
0  Jason   42             4             25
1  Molly   52            24             94
2   Tina   36            31             57
3   Jake   24             2             62
4    Amy   73             3             70
df['elderly'] = np.where(df['age'] >= 50, 'yes', 'no')
df
    name  age  preTestScore  postTestScore elderly
0  Jason   42             4             25      no
1  Molly   52            24             94     yes
2   Tina   36            31             57      no
3   Jake   24             2             62      no
4    Amy   73             3             70     yes
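`np.where` 适合二元条件;条件更多时可以用 `np.select`。下面是一个简要示意(分组标准为假设):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [42, 52, 36, 24, 73]})

# 条件按顺序匹配,取第一个成立的分支;都不成立时用 default
conditions = [df['age'] >= 65, df['age'] >= 50]
choices = ['senior', 'elderly']
df['bracket'] = np.select(conditions, choices, default='adult')
```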
从词典键和值创建列表
data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'fireReports': [4, 24, 31, 2, 3]}
list(data.keys())
list(data.values())
'''
[['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
 [2012, 2012, 2013, 2014, 2014],
 [4, 24, 31, 2, 3]]
'''
Pandas 中的交叉表
import pandas as pd
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['infantry', 'infantry', 'cavalry', 'cavalry', 'infantry', 'infantry', 'cavalry', 'cavalry', 'infantry', 'infantry', 'cavalry', 'cavalry'],
            'experience': ['veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'experience', 'name', 'preTestScore', 'postTestScore'])
df
regiment company experience name preTestScore postTestScore 0 Nighthawks infantry veteran Miller 4 25 1 Nighthawks infantry rookie Jacobson 24 94 2 Nighthawks cavalry veteran Ali 31 57 3 Nighthawks cavalry rookie Milner 2 62 4 Dragoons infantry veteran Cooze 3 70 5 Dragoons infantry rookie Jacon 4 25 6 Dragoons cavalry veteran Ryaner 24 94 7 Dragoons cavalry rookie Sone 31 57 8 Scouts infantry veteran Sloan 2 62 9 Scouts infantry rookie Piger 3 70 10 Scouts cavalry veteran Riani 2 62 11 Scouts cavalry rookie Ali 3 70
按公司(company)和团(regiment)创建交叉表,统计每个组合中的观测数量。
pd.crosstab(df.regiment, df.company, margins=True)
company     cavalry  infantry  All
regiment
Dragoons          2         2    4
Nighthawks        2         2    4
Scouts            2         2    4
All               6         6   12
pd.crosstab([df.company, df.experience], df.regiment, margins=True)
regiment             Dragoons  Nighthawks  Scouts  All
company  experience
cavalry  rookie             1           1       1    3
         veteran            1           1       1    3
infantry rookie             1           1       1    3
         veteran            1           1       1    3
All                         4           4       4   12
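`crosstab` 还有一个 `normalize` 参数(pandas 0.18.1 及以上),可以把计数换成比例。下面是一个简要示意(数据为上例的子集):

```python
import pandas as pd

df = pd.DataFrame({'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons'],
                   'company': ['infantry', 'cavalry', 'infantry', 'cavalry']})

# normalize='index' 使每一行的比例之和为 1
tab = pd.crosstab(df.regiment, df.company, normalize='index')
```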
删除重复
import pandas as pd
raw_data = {'first_name': ['Jason', 'Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 42, 1111111, 36, 24, 73],
            'preTestScore': [4, 4, 4, 31, 2, 3],
            'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
  first_name last_name      age  preTestScore  postTestScore
0      Jason    Miller       42             4             25
1      Jason    Miller       42             4             25
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70
df.duplicated()
'''
0 False
1 True
2 False
3 False
4 False
5 False
dtype: bool
'''
df.drop_duplicates()
  first_name last_name      age  preTestScore  postTestScore
0      Jason    Miller       42             4             25
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70
df.drop_duplicates(['first_name'], keep='last')
  first_name last_name      age  preTestScore  postTestScore
2      Jason    Miller  1111111             4             25
3       Tina       Ali       36            31             57
4       Jake    Milner       24             2             62
5        Amy     Cooze       73             3             70
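`drop_duplicates` 的 `subset` 参数可以只按部分列判断重复,`keep=False` 则把重复组的所有成员全部丢弃。下面是一个简要示意(数据为上例的子集):

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Jason', 'Jason', 'Tina'],
                   'age': [42, 1111111, 36]})

# 只按 first_name 判断重复,重复组的所有行都被丢弃
unique_only = df.drop_duplicates(subset=['first_name'], keep=False)
```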
Pandas 数据帧的描述性统计
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore', 'postTestScore'])
df
    name  age  preTestScore  postTestScore
0  Jason   42             4             25
1  Molly   52            24             94
2   Tina   36            31             57
3   Jake   24             2             62
4    Amy   73             3             70
5 rows × 4 columns
df['age'].sum()
df['preTestScore'].mean()
df['preTestScore'].cumsum()
'''
0 4
1 28
2 59
3 61
4 64
Name: preTestScore, dtype: int64
'''
df['preTestScore'].describe()
'''
count 5.000000
mean 12.800000
std 13.663821
min 2.000000
25% 3.000000
50% 4.000000
75% 24.000000
max 31.000000
Name: preTestScore, dtype: float64
'''
df['preTestScore'].count()
df['preTestScore'].min()
df['preTestScore'].max()
df['preTestScore'].median()
df['preTestScore'].var()
df['preTestScore'].std()
df['preTestScore'].skew()
df['preTestScore'].kurt()
df.corr()
                    age  preTestScore  postTestScore
age            1.000000     -0.105651       0.328852
preTestScore  -0.105651      1.000000       0.378039
postTestScore  0.328852      0.378039       1.000000
3 rows × 3 columns
df.cov()
                  age  preTestScore  postTestScore
age            340.80        -26.65         151.20
preTestScore   -26.65        186.70         128.65
postTestScore  151.20        128.65         620.30
3 rows × 3 columns
丢弃行或者列
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
             name  reports  year
Cochice     Jason        4  2012
Pima        Molly       24  2012
Santa Cruz   Tina       31  2013
Maricopa     Jake        2  2014
Yuma          Amy        3  2014
df.drop(['Cochice', 'Pima'])
             name  reports  year
Santa Cruz   Tina       31  2013
Maricopa     Jake        2  2014
Yuma          Amy        3  2014
df.drop('reports', axis=1)
             name  year
Cochice     Jason  2012
Pima        Molly  2012
Santa Cruz   Tina  2013
Maricopa     Jake  2014
Yuma          Amy  2014
如果某一行包含特定值(这里是 Tina),就丢弃该行。具体来说:选出 name 列的值不等于 Tina 的所有行。
df[df.name != 'Tina']
            name  reports  year
Cochice    Jason        4  2012
Pima       Molly       24  2012
Maricopa    Jake        2  2014
Yuma         Amy        3  2014
按照行号丢弃一行(在本例中为第 3 行)。
请注意,Pandas 使用从零开始的编号,因此 0 是第一行,1 是第二行,以此类推。
df.drop(df.index[2])
            name  reports  year
Cochice    Jason        4  2012
Pima       Molly       24  2012
Maricopa    Jake        2  2014
Yuma         Amy        3  2014
可以扩展到范围。
df.drop(df.index[[2, 3]])
          name  reports  year
Cochice  Jason        4  2012
Pima     Molly       24  2012
Yuma       Amy        3  2014
或相对于 DF 的末尾来丢弃。
df.drop(df.index[-2])
             name  reports  year
Cochice     Jason        4  2012
Pima        Molly       24  2012
Santa Cruz   Tina       31  2013
Yuma          Amy        3  2014
你也可以选择相对于起始或末尾的范围。
df[:3]
             name  reports  year
Cochice     Jason        4  2012
Pima        Molly       24  2012
Santa Cruz   Tina       31  2013
df[:-3]
          name  reports  year
Cochice  Jason        4  2012
Pima     Molly       24  2012
枚举列表
data = ['One', 'Two', 'Three', 'Four', 'Five']
for item in enumerate(data):
    print(item)
    print(item[1])
'''
(0, 'One')
One
(1, 'Two')
Two
(2, 'Three')
Three
(3, 'Four')
Four
(4, 'Five')
Five
'''
在 Pandas 中将包含列表的单元扩展为自己的变量
import pandas as pd
raw_data = {'score': [1, 2, 3],
            'tags': [['apple', 'pear', 'guava'], ['truck', 'car', 'plane'], ['cat', 'dog', 'mouse']]}
df = pd.DataFrame(raw_data, columns=['score', 'tags'])
df
   score                  tags
0      1  [apple, pear, guava]
1      2   [truck, car, plane]
2      3     [cat, dog, mouse]
tags = df['tags'].apply(pd.Series)
tags = tags.rename(columns=lambda x: 'tag_' + str(x))
tags
   tag_0  tag_1  tag_2
0  apple   pear  guava
1  truck    car  plane
2    cat    dog  mouse
pd.concat([df[:], tags[:]], axis=1)
   score                  tags  tag_0  tag_1  tag_2
0      1  [apple, pear, guava]  apple   pear  guava
1      2   [truck, car, plane]  truck    car  plane
2      3     [cat, dog, mouse]    cat    dog  mouse
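在较新版本的 pandas(0.25 及以上)中,也可以用 `df.explode` 把列表单元展开成多行,而不是展开成多列。下面是一个简要示意(数据为假设):

```python
import pandas as pd

df = pd.DataFrame({'score': [1, 2],
                   'tags': [['apple', 'pear'], ['cat', 'dog']]})

# 每个列表元素变为单独一行,其余列的值随之重复
long_df = df.explode('tags')
```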
过滤 pandas 数据帧
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
            coverage   name  reports  year
Cochice           25  Jason        4  2012
Pima              94  Molly       24  2012
Santa Cruz        57   Tina       31  2013
Maricopa          62   Jake        2  2014
Yuma              70    Amy        3  2014
df['name']
'''
Cochice Jason
Pima Molly
Santa Cruz Tina
Maricopa Jake
Yuma Amy
Name: name, dtype: object
'''
df[['name', 'reports']]
             name  reports
Cochice     Jason        4
Pima        Molly       24
Santa Cruz   Tina       31
Maricopa     Jake        2
Yuma          Amy        3
df[:2]
         coverage   name  reports  year
Cochice        25  Jason        4  2012
Pima           94  Molly       24  2012
df[df['coverage'] > 50]
            coverage   name  reports  year
Pima              94  Molly       24  2012
Santa Cruz        57   Tina       31  2013
Maricopa          62   Jake        2  2014
Yuma              70    Amy        3  2014
df[(df['coverage'] > 50) & (df['reports'] < 4)]
          coverage  name  reports  year
Maricopa        62  Jake        2  2014
Yuma            70   Amy        3  2014
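除了布尔索引,`df.query` 也允许用字符串表达式过滤。下面是一个简要示意(数据取自上例的两列):

```python
import pandas as pd

df = pd.DataFrame({'coverage': [25, 94, 57, 62, 70],
                   'reports': [4, 24, 31, 2, 3]})

# 与 df[(df['coverage'] > 50) & (df['reports'] < 4)] 等价
subset = df.query('coverage > 50 and reports < 4')
```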
寻找数据帧的列中的最大值
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Molly  Jacobson   52            24             94
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70
df['preTestScore'].idxmax()
寻找数据帧中的唯一值
import pandas as pd
import numpy as np
raw_data = {'regiment': ['51st', '29th', '2nd', '19th', '12th', '101st', '90th', '30th', '193th', '1st', '94th', '91th'],
            'trucks': ['MAZ-7310', np.nan, 'MAZ-7310', 'MAZ-7310', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'ZIS-150', 'Tatra 810', 'ZIS-150', 'ZIS-150'],
            'tanks': ['Merkava Mark 4', 'Merkava Mark 4', 'Merkava Mark 4', 'Leopard 2A6M', 'Leopard 2A6M', 'Leopard 2A6M', 'Arjun MBT', 'Leopard 2A6M', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT'],
            'aircraft': ['none', 'none', 'none', 'Harbin Z-9', 'Harbin Z-9', 'none', 'Harbin Z-9', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk']}
df = pd.DataFrame(raw_data, columns=['regiment', 'trucks', 'tanks', 'aircraft'])
df.head()
  regiment     trucks           tanks    aircraft
0     51st   MAZ-7310  Merkava Mark 4        none
1     29th        NaN  Merkava Mark 4        none
2      2nd   MAZ-7310  Merkava Mark 4        none
3     19th   MAZ-7310    Leopard 2A6M  Harbin Z-9
4     12th  Tatra 810    Leopard 2A6M  Harbin Z-9
list(set(df.trucks))
list(df['trucks'].unique())
地理编码和反向地理编码
在使用地理数据时,地理编码(将物理地址或位置转换为经纬度)和反向地理编码(将经纬度转换为物理地址或位置)是常见任务。
Python 提供了许多软件包,使任务变得异常简单。 在下面的教程中,我使用 pygeocoder(Google 的 geo-API 的包装器)来进行地理编码和反向地理编码。
首先,我们要加载我们想要在脚本中使用的包。具体来说,我为地理函数加载 pygeocoder,为数据帧结构加载 pandas,为缺失值(np.nan)加载 numpy。
from pygeocoder import Geocoder
import pandas as pd
import numpy as np
地理数据有多种形式,在这种情况下,我们有一个 Python 字典,包含五个经纬度的字符串,每个坐标在逗号分隔的坐标对中。
data = {'Site 1': '31.336968, -109.560959',
        'Site 2': '31.347745, -108.229963',
        'Site 3': '32.277621, -107.734724',
        'Site 4': '31.655494, -106.420484',
        'Site 5': '30.295053, -104.014528'}
虽然技术上没必要,因为我最初使用 R,我是数据帧的忠实粉丝,所以让我们把模拟的数据字典变成数据帧。
df = pd.DataFrame.from_dict(data, orient='index')
df
                             0
Site 1  31.336968, -109.560959
Site 2  31.347745, -108.229963
Site 3  32.277621, -107.734724
Site 4  31.655494, -106.420484
Site 5  30.295053, -104.014528
你现在可以看到,我们有了包含五行的数据帧,每行是一个经纬度字符串。在处理这些数据之前,我们需要先把字符串拆分成纬度和经度两部分,再把它们转换成浮点数。以下代码就是这样做的。
lat = []
lon = []
for row in df[0]:
    try:
        lat.append(float(row.split(',')[0]))
        lon.append(float(row.split(',')[1]))
    except:
        lat.append(np.nan)
        lon.append(np.nan)
df['latitude'] = lat
df['longitude'] = lon
让我们看看现在有了什么。
df
                             0   latitude   longitude
Site 1  31.336968, -109.560959  31.336968 -109.560959
Site 2  31.347745, -108.229963  31.347745 -108.229963
Site 3  32.277621, -107.734724  32.277621 -107.734724
Site 4  31.655494, -106.420484  31.655494 -106.420484
Site 5  30.295053, -104.014528  30.295053 -104.014528
真棒。这正是我们想要看到的:一列纬度浮点数和一列经度浮点数。
为了反向地理编码,我们将特定的经纬度对(这里为第一行,索引为 0)传给 pygeocoder 的 reverse_geocode 函数。
results = Geocoder.reverse_geocode(df['latitude'][0], df['longitude'][0])
现在我们可以开始提取我们想要的数据了。
results.coordinates
results.city
results.country
results.street_address
results.administrative_area_level_1
对于地理编码,我们需要把包含地址或位置(例如城市)的字符串传给地理编码函数。但是,并非所有字符串的格式都是 Google 的 geo-API 可以理解的。我们可以用 .geocode(...).valid_address 验证某个地址是否有效。
Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607").valid_address
因为输出是 True,我们现在知道这是一个有效的地址,因此可以对它做地理编码,再打印纬度和经度坐标。
results = Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607")
results.coordinates
但更有趣的是,一旦地址由 Google 地理 API 处理,我们就可以解析它并轻松地分隔街道号码,街道名称等。
result = Geocoder.geocode("7250 South Tucson Boulevard, Tucson, AZ 85756")
result.street_number
result.route
你就实现了它。Python 使整个过程变得简单,只需几分钟即可完成分析。祝好运!
地理定位城市和国家
本教程创建一个函数,尝试获取城市和国家并返回其经纬度。 但是当城市不可用时(通常是这种情况),则返回该国中心的经纬度。
from geopy.geocoders import Nominatim
geolocator = Nominatim()  # 较新版本的 geopy 要求传入 user_agent 参数
import numpy as np
def geolocate(city=None, country=None):
    '''
    输入城市和国家,或仅输入国家。如果可以的话,返回城市的经纬度坐标,否则返回该国家中心的经纬度。
    '''
    if city != None:
        try:
            loc = geolocator.geocode(str(city + ',' + country))
            return (loc.latitude, loc.longitude)
        except:
            return np.nan
    else:
        try:
            loc = geolocator.geocode(country)
            return (loc.latitude, loc.longitude)
        except:
            return np.nan
geolocate(city='Austin', country='USA')
geolocate(country='USA')
使用 pandas 分组时间序列
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['german_army'] = np.random.randint(low=20000, high=30000, size=100)
df['allied_army'] = np.random.randint(low=20000, high=40000, size=100)
df.index = pd.date_range('1/1/2014', periods=100, freq='H')
df.head()
                     german_army  allied_army
2014-01-01 00:00:00        28755        33938
2014-01-01 01:00:00        25176        28631
2014-01-01 02:00:00        23261        39685
2014-01-01 03:00:00        28686        27756
2014-01-01 04:00:00        24588        25681
截断(truncate)数据帧。
df.truncate(before='1/2/2014', after='1/3/2014')
german_army allied_army 2014-01-02 00:00:00 26401 20189 2014-01-02 01:00:00 29958 23934 2014-01-02 02:00:00 24492 39075 2014-01-02 03:00:00 25707 39262 2014-01-02 04:00:00 27129 35961 2014-01-02 05:00:00 27903 25418 2014-01-02 06:00:00 20409 25163 2014-01-02 07:00:00 25736 34794 2014-01-02 08:00:00 24057 27209 2014-01-02 09:00:00 26875 33402 2014-01-02 10:00:00 23963 38575 2014-01-02 11:00:00 27506 31859 2014-01-02 12:00:00 23564 25750 2014-01-02 13:00:00 27958 24365 2014-01-02 14:00:00 24915 38866 2014-01-02 15:00:00 23538 33820 2014-01-02 16:00:00 23361 30080 2014-01-02 17:00:00 27284 22922 2014-01-02 18:00:00 24176 32155 2014-01-02 19:00:00 23924 27763 2014-01-02 20:00:00 23111 32343 2014-01-02 21:00:00 20348 28907 2014-01-02 22:00:00 27136 38634 2014-01-02 23:00:00 28649 29950 2014-01-03 00:00:00 21292 26395
df.index = df.index + pd.DateOffset(months=4, days=5)
df.head()
                     german_army  allied_army
2014-05-06 00:00:00        28755        33938
2014-05-06 01:00:00        25176        28631
2014-05-06 02:00:00        23261        39685
2014-05-06 03:00:00        28686        27756
2014-05-06 04:00:00        24588        25681
df.shift(1).head()
                     german_army  allied_army
2014-05-06 00:00:00          NaN          NaN
2014-05-06 01:00:00      28755.0      33938.0
2014-05-06 02:00:00      25176.0      28631.0
2014-05-06 03:00:00      23261.0      39685.0
2014-05-06 04:00:00      28686.0      27756.0
df.shift(-1).tail()
                     german_army  allied_army
2014-05-09 23:00:00      26903.0      39144.0
2014-05-10 00:00:00      27576.0      39759.0
2014-05-10 01:00:00      25232.0      35246.0
2014-05-10 02:00:00      23391.0      21044.0
2014-05-10 03:00:00          NaN          NaN
df.resample('D').sum()
            german_army  allied_army
2014-05-06       605161       755962
2014-05-07       608100       740396
2014-05-08       589744       700297
2014-05-09       607092       719283
2014-05-10       103102       135193
df.resample('D').mean()
             german_army   allied_army
2014-05-06  25215.041667  31498.416667
2014-05-07  25337.500000  30849.833333
2014-05-08  24572.666667  29179.041667
2014-05-09  25295.500000  29970.125000
2014-05-10  25775.500000  33798.250000
df.resample('D').min()
            german_army  allied_army
2014-05-06      24882.0      31310.0
2014-05-07      25311.0      30969.5
2014-05-08      24422.5      28318.0
2014-05-09      24941.5      32082.5
2014-05-10      26067.5      37195.0
df.resample('D').median()
            german_army  allied_army
2014-05-06      24882.0      31310.0
2014-05-07      25311.0      30969.5
2014-05-08      24422.5      28318.0
2014-05-09      24941.5      32082.5
2014-05-10      26067.5      37195.0
df.resample('D').first()
            german_army  allied_army
2014-05-06        28755        33938
2014-05-07        26401        20189
2014-05-08        21292        26395
2014-05-09        25764        22613
2014-05-10        26903        39144
df.resample('D').last()
            german_army  allied_army
2014-05-06        28214        32110
2014-05-07        28649        29950
2014-05-08        28379        32600
2014-05-09        26752        22379
2014-05-10        23391        21044
df.resample('D').ohlc()
           german_army         ...
                  open   high  ...
2014-05-06       28755  29206  ...
2014-05-07       26401  29958  ...
2014-05-08       21292  29786  ...
2014-05-09       25764  29952  ...
2014-05-10       26903  27576  ...
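`resample` 的行为可以用一小段确定的数据来验证。下面是一个简要示意(数据为假设):

```python
import pandas as pd

# 6 个逐小时的时间戳
idx = pd.date_range('2014-01-01', periods=6, freq='H')
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# 按 2 小时分箱求和:(1+2), (3+4), (5+6)
two_hourly = s.resample('2H').sum()
```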
按时间分组数据
2016 年 3 月 13 日,Pandas 版本 0.18.0 发布,重采样功能的运行方式发生了重大变化。 本教程遵循 v0.18.0,不适用于以前版本的 pandas。
首先让我们加载我们关心的模块。
import pandas as pd
import datetime
import numpy as np
接下来,让我们创建一些样例数据,我们可以将它们按时间分组作为样本。 在这个例子中,我创建了一个包含两列 365 行的数据帧。一列是日期,第二列是数值。
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
score_list = list(np.random.randint(low=1, high=1000, size=365))
df = pd.DataFrame()
df['datetime'] = date_list
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df['score'] = score_list
df.head()
                                             datetime  score
datetime
2016-06-02 09:57:54.793972 2016-06-02 09:57:54.793972    900
2016-06-01 09:57:54.793972 2016-06-01 09:57:54.793972    121
2016-05-31 09:57:54.793972 2016-05-31 09:57:54.793972    547
2016-05-30 09:57:54.793972 2016-05-30 09:57:54.793972    504
2016-05-29 09:57:54.793972 2016-05-29 09:57:54.793972    304
在 pandas 中,按时间分组的最常用方法是使用 .resample() 函数。在 v0.18.0 中,此函数分为两个阶段。这意味着 df.resample('M') 创建了一个对象,我们可以对其应用其他函数(mean、count、sum 等)。
df.resample('M').mean()
                 score
datetime
2015-06-30  513.629630
2015-07-31  561.516129
2015-08-31  448.032258
2015-09-30  548.000000
2015-10-31  480.419355
2015-11-30  487.033333
2015-12-31  499.935484
2016-01-31  429.193548
2016-02-29  520.413793
2016-03-31  349.806452
2016-04-30  395.500000
2016-05-31  503.451613
2016-06-30  510.500000
df.resample('M').sum()
            score
datetime
2015-06-30  13868
2015-07-31  17407
2015-08-31  13889
2015-09-30  16440
2015-10-31  14893
2015-11-30  14611
2015-12-31  15498
2016-01-31  13305
2016-02-29  15092
2016-03-31  10844
2016-04-30  11865
2016-05-31  15607
2016-06-30   1021
分组有很多选项。 你可以在 Pandas 的时间序列文档中了解它们的更多信息,但是,为了你的方便,我也在下面列出了它们。
值     描述
B      business day frequency
C      custom business day frequency (experimental)
D      calendar day frequency
W      weekly frequency
M      month end frequency
BM     business month end frequency
CBM    custom business month end frequency
MS     month start frequency
BMS    business month start frequency
Q      quarter end frequency
BQ     business quarter end frequency
QS     quarter start frequency
BQS    business quarter start frequency
A      year end frequency
BA     business year end frequency
AS     year start frequency
BAS    business year start frequency
BH     business hour frequency
H      hourly frequency
T      minutely frequency
S      secondly frequency
L      milliseconds
U      microseconds
N      nanoseconds
按小时分组数据
import pandas as pd
import numpy as np
time = pd.date_range('1/1/2000', periods=2000, freq='5min')
series = pd.Series(np.random.randint(100, size=2000), index=time)
series[0:10]
'''
2000-01-01 00:00:00 40
2000-01-01 00:05:00 13
2000-01-01 00:10:00 99
2000-01-01 00:15:00 72
2000-01-01 00:20:00 4
2000-01-01 00:25:00 36
2000-01-01 00:30:00 24
2000-01-01 00:35:00 20
2000-01-01 00:40:00 83
2000-01-01 00:45:00 44
Freq: 5T, dtype: int64
'''
series.groupby(series.index.hour).mean()
'''
0 50.380952
1 49.380952
2 49.904762
3 53.273810
4 47.178571
5 46.095238
6 49.047619
7 44.297619
8 53.119048
9 48.261905
10 45.166667
11 54.214286
12 50.714286
13 56.130952
14 50.916667
15 42.428571
16 46.880952
17 56.892857
18 54.071429
19 47.607143
20 50.940476
21 50.511905
22 44.550000
23 50.250000
dtype: float64
'''
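按 index.hour 分组会把不同日期的同一小时合并到一起,这一点可以用一小段确定的数据来验证。下面是一个简要示意(数据为假设):

```python
import pandas as pd

# 三个时间戳落在 0 点(其中一个在另一天),一个落在 1 点
idx = pd.to_datetime(['2000-01-01 00:10', '2000-01-01 00:50',
                      '2000-01-02 00:30', '2000-01-01 01:20'])
s = pd.Series([10, 20, 30, 40], index=idx)

# 按小时数(而非完整时间戳)分组求均值
hourly_mean = s.groupby(s.index.hour).mean()
```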
对行分组
import pandas as pd
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
      regiment company      name  preTestScore  postTestScore
0   Nighthawks     1st    Miller             4             25
1   Nighthawks     1st  Jacobson            24             94
2   Nighthawks     2nd       Ali            31             57
3   Nighthawks     2nd    Milner             2             62
4     Dragoons     1st     Cooze             3             70
5     Dragoons     1st     Jacon             4             25
6     Dragoons     2nd    Ryaner            24             94
7     Dragoons     2nd      Sone            31             57
8       Scouts     1st     Sloan             2             62
9       Scouts     1st     Piger             3             70
10      Scouts     2nd     Riani             2             62
11      Scouts     2nd       Ali             3             70
regiment_preScore = df['preTestScore'].groupby(df['regiment'])
regiment_preScore.mean()
'''
regiment
Dragoons 15.50
Nighthawks 15.25
Scouts 2.50
Name: preTestScore, dtype: float64
'''
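`.mean()` is only one of the available aggregations; with `.agg()` several can be applied in one pass. A small sketch using the same data (the `stats` name is mine):

```python
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks',
                         'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons',
                         'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]}
df = pd.DataFrame(raw_data)

# Apply several aggregations at once; the result has one column per function
stats = df['preTestScore'].groupby(df['regiment']).agg(['mean', 'min', 'max'])
```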
Hierarchical data in Pandas
import pandas as pd
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
df
      regiment company      name  preTestScore  postTestScore
0   Nighthawks     1st    Miller             4             25
1   Nighthawks     1st  Jacobson            24             94
2   Nighthawks     2nd       Ali            31             57
3   Nighthawks     2nd    Milner             2             62
4     Dragoons     1st     Cooze             3             70
5     Dragoons     1st     Jacon             4             25
6     Dragoons     2nd    Ryaner            24             94
7     Dragoons     2nd      Sone            31             57
8       Scouts     1st     Sloan             2             62
9       Scouts     1st     Piger             3             70
10      Scouts     2nd     Riani             2             62
11      Scouts     2nd       Ali             3             70
df = df.set_index(['regiment', 'company'], drop=False)
df
                      regiment company      name  preTestScore  postTestScore
regiment   company
Nighthawks 1st      Nighthawks     1st    Miller             4             25
           1st      Nighthawks     1st  Jacobson            24             94
           2nd      Nighthawks     2nd       Ali            31             57
           2nd      Nighthawks     2nd    Milner             2             62
Dragoons   1st        Dragoons     1st     Cooze             3             70
           1st        Dragoons     1st     Jacon             4             25
           2nd        Dragoons     2nd    Ryaner            24             94
           2nd        Dragoons     2nd      Sone            31             57
Scouts     1st          Scouts     1st     Sloan             2             62
           1st          Scouts     1st     Piger             3             70
           2nd          Scouts     2nd     Riani             2             62
           2nd          Scouts     2nd       Ali             3             70
df = df.set_index(['regiment', 'company'])
df
                        name  preTestScore  postTestScore
regiment   company
Nighthawks 1st        Miller             4             25
           1st      Jacobson            24             94
           2nd           Ali            31             57
           2nd        Milner             2             62
Dragoons   1st         Cooze             3             70
           1st         Jacon             4             25
           2nd        Ryaner            24             94
           2nd          Sone            31             57
Scouts     1st         Sloan             2             62
           1st         Piger             3             70
           2nd         Riani             2             62
           2nd           Ali             3             70
df.index
MultiIndex(levels=[['Dragoons', 'Nighthawks', 'Scouts'], ['1st', '2nd']],
           labels=[[1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]],
           names=['regiment', 'company'])
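With the MultiIndex in place, rows can be selected by level. A small sketch, assuming the same `df` indexed by `['regiment', 'company']` (the `nighthawks` and `first_companies` names are mine):

```python
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks',
                         'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons',
                         'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd',
                        '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
                     'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data).set_index(['regiment', 'company'])

# .loc with an outer-level key returns all rows of that regiment
nighthawks = df.loc['Nighthawks']

# .xs takes a cross-section on an inner level
first_companies = df.xs('1st', level='company')
```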
df.swaplevel('regiment', 'company')
                        name  preTestScore  postTestScore
company regiment
1st     Nighthawks    Miller             4             25
        Nighthawks  Jacobson            24             94
2nd     Nighthawks       Ali            31             57
        Nighthawks    Milner             2             62
1st     Dragoons       Cooze             3             70
        Dragoons       Jacon             4             25
2nd     Dragoons      Ryaner            24             94
        Dragoons        Sone            31             57
1st     Scouts         Sloan             2             62
        Scouts         Piger             3             70
2nd     Scouts         Riani             2             62
        Scouts           Ali             3             70
df.sum(level='regiment')
            preTestScore  postTestScore
regiment
Nighthawks            61            238
Dragoons              62            246
Scouts                10            264
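In newer pandas releases the `level=` argument of aggregation methods like `sum()` was deprecated and later removed; the equivalent spelling uses `groupby(level=...)`. A minimal sketch with the same numeric data:

```python
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks',
                         'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons',
                         'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd',
                        '1st', '1st', '2nd', '2nd'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data).set_index(['regiment', 'company'])

# Modern equivalent of df.sum(level='regiment'):
# group on one index level and sum the remaining columns
totals = df.groupby(level='regiment').sum()
```

`groupby(level=...)` sorts the groups alphabetically by default; pass `sort=False` to keep the first-seen order.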