
十九、数据整理(上)

作者:Chris Albon

译者:飞龙

协议:CC BY-NC-SA 4.0

在 Pandas 中通过分组应用函数

import pandas as pd

# 创建示例数据帧
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
       'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
df
| | Casualties | Platoon |
|---|---|---|
| 0 | 1 | A |
| 1 | 4 | A |
| 2 | 5 | A |
| 3 | 7 | A |
| 4 | 5 | A |
| 5 | 5 | A |
| 6 | 6 | B |
| 7 | 1 | B |
| 8 | 4 | B |
| 9 | 5 | B |
| 10 | 6 | B |
| 11 | 7 | C |
| 12 | 4 | C |
| 13 | 6 | C |
| 14 | 4 | C |
| 15 | 6 | C |
# 按照 df.platoon 对 df 分组
# 然后将滚动平均 lambda 函数应用于 df.casualties
df.groupby('Platoon')['Casualties'].apply(lambda x:x.rolling(center=False,window=2).mean())

'''
0     NaN
1     2.5
2     4.5
3     6.0
4     6.0
5     5.0
6     NaN
7     3.5
8     2.5
9     4.5
10    5.5
11    NaN
12    5.5
13    5.0
14    5.0
15    5.0
dtype: float64
''' 
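
如果想把分组滚动平均作为新列写回原数据帧,下面是一个假设性的小示例(列名 RollingMean 为示例用名;groupby(...).rolling(...) 的索引行为在不同 Pandas 版本中略有差异):

# 假设性示例:将每个排内的滚动平均写回新列
# groupby().rolling() 的结果带有 (Platoon, 原索引) 的多级索引,
# 用 reset_index 丢弃分组级别后即可按原索引对齐赋值
df['RollingMean'] = (df.groupby('Platoon')['Casualties']
                       .rolling(window=2)
                       .mean()
                       .reset_index(level=0, drop=True))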

在 Pandas 中向分组应用操作

# 导入模块
import pandas as pd

# 创建数据帧
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
| | regiment | company | name | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
# 创建一个 groupby 变量,按团队(regiment)对 preTestScores 分组
groupby_regiment = df['preTestScore'].groupby(df['regiment'])
groupby_regiment

# <pandas.core.groupby.SeriesGroupBy object at 0x113ddb550> 

“这个分组变量现在是 GroupBy 对象。除了分组键 df['key1'] 的一些中间数据之外,它实际上还没有计算任何东西。我们的想法是,该对象具有将所有操作应用于每个分组所需的全部信息。” —— PyDA

使用list()显示分组的样子。

list(df['preTestScore'].groupby(df['regiment']))

'''
[('Dragoons', 4     3
  5     4
  6    24
  7    31
  Name: preTestScore, dtype: int64), ('Nighthawks', 0     4
  1    24
  2    31
  3     2
  Name: preTestScore, dtype: int64), ('Scouts', 8     2
  9     3
  10    2
  11    3
  Name: preTestScore, dtype: int64)] 
'''

df['preTestScore'].groupby(df['regiment']).describe()
| regiment | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Dragoons | 4.0 | 15.50 | 14.153916 | 3.0 | 3.75 | 14.0 | 25.75 | 31.0 |
| Nighthawks | 4.0 | 15.25 | 14.453950 | 2.0 | 3.50 | 14.0 | 25.75 | 31.0 |
| Scouts | 4.0 | 2.50 | 0.577350 | 2.0 | 2.00 | 2.5 | 3.00 | 3.0 |
# 每个团队的 preTestScore 均值
groupby_regiment.mean()

'''
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64 
'''

df['preTestScore'].groupby([df['regiment'], df['company']]).mean()

'''
regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64 
'''

df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
| regiment | 1st | 2nd |
|---|---|---|
| Dragoons | 3.5 | 27.5 |
| Nighthawks | 14.0 | 16.5 |
| Scouts | 2.5 | 2.5 |
# 按团队和公司(company)对整个数据帧分组
df.groupby(['regiment', 'company']).mean()
| regiment | company | preTestScore | postTestScore |
|---|---|---|---|
| Dragoons | 1st | 3.5 | 47.5 |
| Dragoons | 2nd | 27.5 | 75.5 |
| Nighthawks | 1st | 14.0 | 59.5 |
| Nighthawks | 2nd | 16.5 | 59.5 |
| Scouts | 1st | 2.5 | 66.0 |
| Scouts | 2nd | 2.5 | 66.0 |
# 每个团队和公司的观测数量
df.groupby(['regiment', 'company']).size()

'''
regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64 
'''

# 按团队对数据帧分组,对于每个团队,
for name, group in df.groupby('regiment'): 
    # 打印团队名称
    print(name)
    # 打印它的数据
    print(group)


'''
Dragoons
   regiment company    name  preTestScore  postTestScore
4  Dragoons     1st   Cooze             3             70
5  Dragoons     1st   Jacon             4             25
6  Dragoons     2nd  Ryaner            24             94
7  Dragoons     2nd    Sone            31             57
Nighthawks
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
Scouts
   regiment company   name  preTestScore  postTestScore
8    Scouts     1st  Sloan             2             62
9    Scouts     1st  Piger             3             70
10   Scouts     2nd  Riani             2             62
11   Scouts     2nd    Ali             3             70 
'''

按列分组:

特别是在这种情况下:按数据类型对列分组(即 axis=1),然后使用list()查看该分组的样子。

list(df.groupby(df.dtypes, axis=1))

'''
[(dtype('int64'),     preTestScore  postTestScore
  0              4             25
  1             24             94
  2             31             57
  3              2             62
  4              3             70
  5              4             25
  6             24             94
  7             31             57
  8              2             62
  9              3             70
  10             2             62
  11             3             70),
 (dtype('O'),       regiment company      name
  0   Nighthawks     1st    Miller
  1   Nighthawks     1st  Jacobson
  2   Nighthawks     2nd       Ali
  3   Nighthawks     2nd    Milner
  4     Dragoons     1st     Cooze
  5     Dragoons     1st     Jacon
  6     Dragoons     2nd    Ryaner
  7     Dragoons     2nd      Sone
  8       Scouts     1st     Sloan
  9       Scouts     1st     Piger
  10      Scouts     2nd     Riani
  11      Scouts     2nd       Ali)] 
'''

df.groupby('regiment').mean().add_prefix('mean_')
| regiment | mean_preTestScore | mean_postTestScore |
|---|---|---|
| Dragoons | 15.50 | 61.5 |
| Nighthawks | 15.25 | 59.5 |
| Scouts | 2.50 | 66.0 |
# 创建获取分组状态的函数
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)

df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()
| categories | count | max | mean | min |
|---|---|---|---|---|
| Good | 8.0 | 70.0 | 63.75 | 57.0 |
| Great | 2.0 | 94.0 | 94.00 | 94.0 |
| Low | 2.0 | 25.0 | 25.00 | 25.0 |
| Okay | 0.0 | NaN | NaN | NaN |

在 Pandas 数据帧上应用操作

# 导入模块
import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
| | coverage | name | reports | year |
|---|---|---|---|---|
| Cochice | 25 | Jason | 4 | 2012 |
| Pima | 94 | Molly | 24 | 2012 |
| Santa Cruz | 57 | Tina | 31 | 2013 |
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
# 创建大写转换的 lambda 函数
capitalizer = lambda x: x.upper()

capitalizer函数应用于name列。

apply()可以沿数据帧的任意轴应用函数。

df['name'].apply(capitalizer)

'''
Cochice       JASON
Pima          MOLLY
Santa Cruz     TINA
Maricopa       JAKE
Yuma            AMY
Name: name, dtype: object 
'''
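
上面提到 apply() 可以沿数据帧的任意轴应用函数。作为补充,下面是一个假设性示例,演示按行(axis=1)应用,只取数值列参与计算:

# 假设性示例:对每一行的数值列应用函数,计算最大值与最小值之差
df[['coverage', 'reports', 'year']].apply(lambda row: row.max() - row.min(), axis=1)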

capitalizer lambda 函数映射到序列name中的每个元素。

map()对序列的每个元素应用操作。

df['name'].map(capitalizer)

'''
Cochice       JASON
Pima          MOLLY
Santa Cruz     TINA
Maricopa       JAKE
Yuma            AMY
Name: name, dtype: object 
'''

将平方根函数应用于整个数据帧中的每个单元格。

applymap()将函数应用于整个数据帧中的每个元素。

# 删除字符串变量,以便 applymap() 可以运行
df = df.drop('name', axis=1)

# 返回数据帧每个单元格的平方根
df.applymap(np.sqrt)
| | coverage | reports | year |
|---|---|---|---|
| Cochice | 5.000000 | 2.000000 | 44.855323 |
| Pima | 9.695360 | 4.898979 | 44.855323 |
| Santa Cruz | 7.549834 | 5.567764 | 44.866469 |
| Maricopa | 7.874008 | 1.414214 | 44.877611 |
| Yuma | 8.366600 | 1.732051 | 44.877611 |

在数据帧上应用函数。

# 创建叫做 times100 的函数
def times100(x):
    # 如果 x 是字符串,
    if type(x) is str:
        # 原样返回它
        return x
    # 如果不是,返回它乘上 100
    elif x:
        return 100 * x
    # 否则(例如 0 或 None)不返回任何东西
    else:
        return

df.applymap(times100)
| | coverage | reports | year |
|---|---|---|---|
| Cochice | 2500 | 400 | 201200 |
| Pima | 9400 | 2400 | 201200 |
| Santa Cruz | 5700 | 3100 | 201300 |
| Maricopa | 6200 | 200 | 201400 |
| Yuma | 7000 | 300 | 201400 |

向 Pandas 数据帧赋予新列

import pandas as pd

# 创建空数据帧
df = pd.DataFrame()

# 创建一列
df['name'] = ['John', 'Steve', 'Sarah']

# 查看数据帧
df
| | name |
|---|---|
| 0 | John |
| 1 | Steve |
| 2 | Sarah |
# 将一个新列赋予名为 age 的 df,它包含年龄列表
df.assign(age = [31, 32, 19])
| | name | age |
|---|---|---|
| 0 | John | 31 |
| 1 | Steve | 32 |
| 2 | Sarah | 19 |

将列表拆分为大小为 N 的分块

在这个片段中,我们接受一个列表并将其分解为大小为 n 的块。 在处理具有最大请求大小的 API 时,这是一种非常常见的做法。

这个漂亮的函数由 Ned Batchelder 贡献,发布于 StackOverflow。

# 创建名称列表
first_names = ['Steve', 'Jane', 'Sara', 'Mary','Jack','Bob', 'Bily', 'Boni', 'Chris','Sori', 'Will', 'Won','Li']

# 创建叫做 chunks 的函数,有两个参数 l 和 n
def chunks(l, n):
    # 对于 0 到 len(l) 之间、步长为 n 的每个索引 i
    for i in range(0, len(l), n):
        # 产生 l 中从 i 开始、长度至多为 n 的一个分块
        yield l[i:i+n]

# 从函数 chunks 的结果创建一个列表
list(chunks(first_names, 5))

'''
[['Steve', 'Jane', 'Sara', 'Mary', 'Jack'],
 ['Bob', 'Bily', 'Boni', 'Chris', 'Sori'],
 ['Will', 'Won', 'Li']] 
'''
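
正如前面所说,这种分块常用于有请求大小上限的 API。下面是一个假设性的用法示例(send_batch 是虚构的函数,仅作演示):

# 假设性示例:按块发送,每批最多 5 个名称
def send_batch(batch):
    # 这里用 print 代替真实的 API 调用
    print('sending:', batch)

for batch in chunks(first_names, 5):
    send_batch(batch)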

在 Pandas 中使用正则表达式将字符串分解为列

# 导入模块
import re
import pandas as pd

# 创建带有一列字符串的数据帧
data = {'raw': ['Arizona 1 2014-12-23       3242.0',
                'Iowa 1 2010-02-23       3453.7',
                'Oregon 0 2014-06-20       2123.0',
                'Maryland 0 2014-03-14       1123.6',
                'Florida 1 2013-01-15       2134.0',
                'Georgia 0 2012-07-14       2345.6']}
df = pd.DataFrame(data, columns = ['raw'])
df
| | raw |
|---|---|
| 0 | Arizona 1 2014-12-23 3242.0 |
| 1 | Iowa 1 2010-02-23 3453.7 |
| 2 | Oregon 0 2014-06-20 2123.0 |
| 3 | Maryland 0 2014-03-14 1123.6 |
| 4 | Florida 1 2013-01-15 2134.0 |
| 5 | Georgia 0 2012-07-14 2345.6 |
# df['raw'] 的哪些行包含 'xxxx-xx-xx'?
df['raw'].str.contains('....-..-..', regex=True)

'''
0    True
1    True
2    True
3    True
4    True
5    True
Name: raw, dtype: bool 
'''
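
注意上面模式中的点号会匹配任意字符,而不仅仅是数字。如果想更严格地只匹配数字形式的日期,可以改用 \d(下面是一个假设性的等价写法):

# 假设性示例:用 \d 只匹配数字,避免点号匹配任意字符
df['raw'].str.contains(r'\d{4}-\d{2}-\d{2}', regex=True)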

# 在 raw 列中,提取字符串中的单个数字
df['female'] = df['raw'].str.extract(r'(\d)', expand=True)
df['female']

'''
0    1
1    1
2    0
3    0
4    1
5    0
Name: female, dtype: object 
'''

# 在 raw 列中,提取字符串中的 xxxx-xx-xx
df['date'] = df['raw'].str.extract('(....-..-..)', expand=True)
df['date']

'''
0    2014-12-23
1    2010-02-23
2    2014-06-20
3    2014-03-14
4    2013-01-15
5    2012-07-14
Name: date, dtype: object 
'''

# 在 raw 列中,提取字符串中的 ####.##
df['score'] = df['raw'].str.extract(r'(\d\d\d\d\.\d)', expand=True)
df['score']

'''
0    3242.0
1    3453.7
2    2123.0
3    1123.6
4    2134.0
5    2345.6
Name: score, dtype: object 
'''

# 在 raw 列中,提取字符串中的单词
df['state'] = df['raw'].str.extract(r'([A-Z]\w{0,})', expand=True)
df['state']

'''
0     Arizona
1        Iowa
2      Oregon
3    Maryland
4     Florida
5     Georgia
Name: state, dtype: object 
'''

df
| | raw | female | date | score | state |
|---|---|---|---|---|---|
| 0 | Arizona 1 2014-12-23 3242.0 | 1 | 2014-12-23 | 3242.0 | Arizona |
| 1 | Iowa 1 2010-02-23 3453.7 | 1 | 2010-02-23 | 3453.7 | Iowa |
| 2 | Oregon 0 2014-06-20 2123.0 | 0 | 2014-06-20 | 2123.0 | Oregon |
| 3 | Maryland 0 2014-03-14 1123.6 | 0 | 2014-03-14 | 1123.6 | Maryland |
| 4 | Florida 1 2013-01-15 2134.0 | 1 | 2013-01-15 | 2134.0 | Florida |
| 5 | Georgia 0 2012-07-14 2345.6 | 0 | 2012-07-14 | 2345.6 | Georgia |

两个数据帧共有的列

# 导入库
import pandas as pd

# 创建数据帧
dataframe_one = pd.DataFrame()
dataframe_one['1'] = ['1', '1', '1']
dataframe_one['B'] = ['b', 'b', 'b']

# 创建第二个数据帧
dataframe_two = pd.DataFrame()
dataframe_two['2'] = ['2', '2', '2']
dataframe_two['B'] = ['b', 'b', 'b']

# 将每个数据帧的列转换为集合,
# 然后找到这两个集合的交集。
# 这将是两个数据帧共享的列的集合。
set.intersection(set(dataframe_one), set(dataframe_two))

# {'B'} 
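
一个假设性的等价写法是直接取两个列索引的交集,而不必先转换为集合:

# 假设性示例:用列索引的交集得到两个数据帧共有的列
dataframe_one.columns.intersection(dataframe_two.columns)

# Index(['B'], dtype='object') 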

从多个列表构建字典

# 创建官员名称的列表
officer_names = ['Sodoni Dogla', 'Chris Jefferson', 'Jessica Billars', 'Michael Mulligan', 'Steven Johnson']

# 创建官员军队的列表
officer_armies = ['Purple Army', 'Orange Army', 'Green Army', 'Red Army', 'Blue Army']

# 创建字典,它是两个列表的 zip
dict(zip(officer_names, officer_armies))

'''
{'Chris Jefferson': 'Orange Army',
 'Jessica Billars': 'Green Army',
 'Michael Mulligan': 'Red Army',
 'Sodoni Dogla': 'Purple Army',
 'Steven Johnson': 'Blue Army'} 
'''

将 CSV 转换为 Python 代码来重建它

# 导入 pandas 包
import pandas as pd

# 将 csv 文件加载为数据帧
df_original = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')

# 打印创建数据帧的代码
print('==============================')
print('RUN THE CODE BELOW THIS LINE')
print('==============================')
print('raw_data =', df.to_dict(orient='list'))
print('df = pd.DataFrame(raw_data, columns = ' + str(list(df_original)) + ')')

'''
==============================
RUN THE CODE BELOW THIS LINE
==============================
raw_data = {'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 
0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 
3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]}
'''

df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']) 

# 如果你打算检查结果
# 1. 输入此单元格中上面单元格生成的代码
raw_data = {'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 
2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150], 'Sepal.Length': 
[5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 
4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996]}
df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])

# 查看使用我们的代码重建的数据帧的前几行
df.head()
| | Unnamed: 0 | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
# 查看原始数据帧的前几行
df_original.head()
| | Unnamed: 0 | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |

将分类变量转换为虚拟变量

# 导入模块
import pandas as pd

# 创建数据帧
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 
        'sex': ['male', 'female', 'male', 'female', 'female']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'sex'])
df
| | first_name | last_name | sex |
|---|---|---|---|
| 0 | Jason | Miller | male |
| 1 | Molly | Jacobson | female |
| 2 | Tina | Ali | male |
| 3 | Jake | Milner | female |
| 4 | Amy | Cooze | female |
# 从 sex 变量创建一组虚拟变量
df_sex = pd.get_dummies(df['sex'])

# 将虚拟变量连接到主数据帧
df_new = pd.concat([df, df_sex], axis=1)
df_new
| | first_name | last_name | sex | female | male |
|---|---|---|---|---|---|
| 0 | Jason | Miller | male | 0.0 | 1.0 |
| 1 | Molly | Jacobson | female | 1.0 | 0.0 |
| 2 | Tina | Ali | male | 0.0 | 1.0 |
| 3 | Jake | Milner | female | 1.0 | 0.0 |
| 4 | Amy | Cooze | female | 1.0 | 0.0 |
# 连接新列的替代方案
df_new = df.join(df_sex)
df_new
| | first_name | last_name | sex | female | male |
|---|---|---|---|---|---|
| 0 | Jason | Miller | male | 0.0 | 1.0 |
| 1 | Molly | Jacobson | female | 1.0 | 0.0 |
| 2 | Tina | Ali | male | 0.0 | 1.0 |
| 3 | Jake | Milner | female | 1.0 | 0.0 |
| 4 | Amy | Cooze | female | 1.0 | 0.0 |
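
另一个假设性的简写:pd.get_dummies 可以直接接收整个数据帧和 columns 参数,一步完成创建虚拟变量并拼接(生成的列会带 sex_ 前缀,并且原 sex 列会被移除):

# 假设性示例:一步创建并合并虚拟变量
pd.get_dummies(df, columns=['sex'])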

将分类变量转换为虚拟变量

# 导入模块
import pandas as pd
import patsy

# 创建数据帧
raw_data = {'countrycode': [1, 2, 3, 2, 1]} 
df = pd.DataFrame(raw_data, columns = ['countrycode'])
df
| | countrycode |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 2 |
| 4 | 1 |
# 将 countrycode 变量转换为三个二元变量
patsy.dmatrix('C(countrycode)-1', df, return_type='dataframe')
| | C(countrycode)[1] | C(countrycode)[2] | C(countrycode)[3] |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 |

将字符串分类变量转换为数字变量

# 导入模块
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2], 
        'obs': [1, 2, 3, 1, 2], 
        'treatment': [0, 1, 0, 1, 0],
        'score': ['strong', 'weak', 'normal', 'weak', 'strong']} 
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
df
| | patient | obs | treatment | score |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | strong |
| 1 | 1 | 2 | 1 | weak |
| 2 | 1 | 3 | 0 | normal |
| 3 | 2 | 1 | 1 | weak |
| 4 | 2 | 2 | 0 | strong |
# 创建一个函数,将 df['score'] 的所有值转换为数字
def score_to_numeric(x):
    if x=='strong':
        return 3
    if x=='normal':
        return 2
    if x=='weak':
        return 1

df['score_num'] = df['score'].apply(score_to_numeric)
df
| | patient | obs | treatment | score | score_num |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | strong | 3 |
| 1 | 1 | 2 | 1 | weak | 1 |
| 2 | 1 | 3 | 0 | normal | 2 |
| 3 | 2 | 1 | 1 | weak | 1 |
| 4 | 2 | 2 | 0 | strong | 3 |
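
一个假设性的等价替代方案是用字典配合 map(),不必单独定义函数:

# 假设性示例:用字典映射代替 score_to_numeric 函数
score_map = {'strong': 3, 'normal': 2, 'weak': 1}
df['score_num'] = df['score'].map(score_map)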

将变量转换为时间序列

# 导入库
import pandas as pd

# 创建带有日期字符串和分数的数据集
raw_data = {'date': ['2014-06-01T01:21:38.004053', '2014-06-02T01:21:38.004053', '2014-06-03T01:21:38.004053'],
        'score': [25, 94, 57]}
df = pd.DataFrame(raw_data, columns = ['date', 'score'])
df
| | date | score |
|---|---|---|
| 0 | 2014-06-01T01:21:38.004053 | 25 |
| 1 | 2014-06-02T01:21:38.004053 | 94 |
| 2 | 2014-06-03T01:21:38.004053 | 57 |
# 将 date 列转换为 datetime 类型,之后将其设为索引
df["date"] = pd.to_datetime(df["date"])

df = df.set_index(df["date"])

df
| date(索引) | date | score |
|---|---|---|
| 2014-06-01 01:21:38.004053 | 2014-06-01 01:21:38.004053 | 25 |
| 2014-06-02 01:21:38.004053 | 2014-06-02 01:21:38.004053 | 94 |
| 2014-06-03 01:21:38.004053 | 2014-06-03 01:21:38.004053 | 57 |
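
索引变成 DatetimeIndex 之后,就可以用日期字符串直接选取或切片。下面是一个假设性的小示例:

# 假设性示例:选取 2014-06-02 及之后的行
df.loc['2014-06-02':]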

在 Pandas 数据帧中计数

# 导入库
import pandas as pd

year = pd.Series([1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 
                  1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894])
guardCorps = pd.Series([0,2,2,1,0,0,1,1,0,3,0,2,1,0,0,1,0,1,0,1])
corps1 = pd.Series([0,0,0,2,0,3,0,2,0,0,0,1,1,1,0,2,0,3,1,0])
corps2 = pd.Series([0,0,0,2,0,2,0,0,1,1,0,0,2,1,1,0,0,2,0,0])
corps3 = pd.Series([0,0,0,1,1,1,2,0,2,0,0,0,1,0,1,2,1,0,0,0])
corps4 = pd.Series([0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,0,1,1,0,0])
corps5 = pd.Series([0,0,0,0,2,1,0,0,1,0,0,1,0,1,1,1,1,1,1,0])
corps6 = pd.Series([0,0,1,0,2,0,0,1,2,0,1,1,3,1,1,1,0,3,0,0])
corps7 = pd.Series([1,0,1,0,0,0,1,0,1,1,0,0,2,0,0,2,1,0,2,0])
corps8 = pd.Series([1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,1])
corps9 = pd.Series([0,0,0,0,0,2,1,1,1,0,2,1,1,0,1,2,0,1,0,0])
corps10 = pd.Series([0,0,1,1,0,1,0,2,0,2,0,0,0,0,2,1,3,0,1,1])
corps11 = pd.Series([0,0,0,0,2,4,0,1,3,0,1,1,1,1,2,1,3,1,3,1])
corps14 = pd.Series([ 1,1,2,1,1,3,0,4,0,1,0,3,2,1,0,2,1,1,0,0])
corps15 = pd.Series([0,1,0,0,0,0,0,1,0,1,1,0,0,0,2,2,0,0,0,0])

variables = dict(guardCorps = guardCorps, corps1 = corps1, 
                 corps2 = corps2, corps3 = corps3, corps4 = corps4, 
                 corps5 = corps5, corps6 = corps6, corps7 = corps7, 
                 corps8 = corps8, corps9 = corps9, corps10 = corps10, 
                 corps11 = corps11 , corps14 = corps14, corps15 = corps15)

horsekick = pd.DataFrame(variables, columns = ['guardCorps', 
                                                    'corps1', 'corps2', 
                                                    'corps3', 'corps4', 
                                                    'corps5', 'corps6', 
                                                    'corps7', 'corps8', 
                                                    'corps9', 'corps10', 
                                                    'corps11', 'corps14', 
                                                    'corps15'])

horsekick.index = [1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 
                  1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894]

horsekick
| | guardCorps | corps1 | corps2 | corps3 | corps4 | corps5 | corps6 | corps7 | corps8 | corps9 | corps10 | corps11 | corps14 | corps15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1875 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1876 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1877 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 0 |
| 1878 | 1 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1879 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 0 | 1 | 0 | 0 | 2 | 1 | 0 |
| 1880 | 0 | 3 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 1 | 4 | 3 | 0 |
| 1881 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1882 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 2 | 1 | 4 | 1 |
| 1883 | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 3 | 0 | 0 |
| 1884 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 1 |
| 1885 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 0 | 1 |
| 1886 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 0 |
| 1887 | 1 | 1 | 2 | 1 | 0 | 0 | 3 | 2 | 1 | 1 | 0 | 1 | 2 | 0 |
| 1888 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1889 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 2 |
| 1890 | 1 | 2 | 0 | 2 | 0 | 1 | 1 | 2 | 0 | 2 | 1 | 1 | 2 | 2 |
| 1891 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 3 | 1 | 0 |
| 1892 | 1 | 3 | 2 | 0 | 1 | 1 | 3 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1893 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 1 | 3 | 0 | 0 |
| 1894 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
# 计算每个团队中每个死亡人数的次数
result = horsekick.apply(pd.value_counts).fillna(0); result

| | guardCorps | corps1 | corps2 | corps3 | corps4 | corps5 | corps6 | corps7 | corps8 | corps9 | corps10 | corps11 | corps14 | corps15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.0 | 11.0 | 12.0 | 11.0 | 12.0 | 10.0 | 9.0 | 11.0 | 13.0 | 10.0 | 10.0 | 6 | 6 | 14.0 |
| 1 | 7.0 | 4.0 | 4.0 | 6.0 | 8.0 | 9.0 | 7.0 | 6.0 | 7.0 | 7.0 | 6.0 | 8 | 8 | 4.0 |
| 2 | 3.0 | 3.0 | 4.0 | 3.0 | 0.0 | 1.0 | 2.0 | 3.0 | 0.0 | 3.0 | 3.0 | 2 | 3 | 2.0 |
| 3 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 2 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.0 |

# 计算每个死亡总数在 guardCorps 中出现的次数
pd.value_counts(horsekick['guardCorps'].values, sort=False)

'''
0    9
1    7
2    3
3    1
dtype: int64 
'''

horsekick['guardCorps'].unique()

# array([0, 2, 1, 3]) 

在 Pandas 中创建流水线

Pandas 的流水线功能允许你将 Python 函数串联在一起,来构建数据处理流水线。

import pandas as pd

# 创建空数据帧
df = pd.DataFrame()

# Create a column
df['name'] = ['John', 'Steve', 'Sarah']
df['gender'] = ['Male', 'Male', 'Female']
df['age'] = [31, 32, 19]

# 查看数据帧
df
| | name | gender | age |
|---|---|---|---|
| 0 | John | Male | 31 |
| 1 | Steve | Male | 32 |
| 2 | Sarah | Female | 19 |
# 创建函数,
def mean_age_by_group(dataframe, col):
    # 它按列分组数据,并返回每组的均值
    return dataframe.groupby(col).mean()

# 创建函数,
def uppercase_column_name(dataframe):
    # 它大写所有列标题
    dataframe.columns = dataframe.columns.str.upper()
    # 并返回它
    return dataframe

# 创建流水线,它应用 mean_age_by_group 函数
(df.pipe(mean_age_by_group, col='gender')
   # 之后应用 uppercase_column_name 函数
   .pipe(uppercase_column_name)
)
| gender | AGE |
|---|---|
| Female | 19.0 |
| Male | 31.5 |

使用for循环创建 Pandas 列

import pandas as pd
import numpy as np

raw_data = {'student_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'test_score': [76, 88, 84, 67, 53, 96, 64, 91, 77, 73, 52, np.NaN]}
df = pd.DataFrame(raw_data, columns = ['student_name', 'test_score'])

# 创建列表来储存数据
grades = []

# 对于列中的每一行
for row in df['test_score']:
    # 如果大于某个值
    if row > 95:
        # 添加字母分数
        grades.append('A')
    # 或者,如果大于某个值
    elif row > 90:
        # 添加字母分数
        grades.append('A-')
    # 或者,如果大于某个值
    elif row > 85:
        # 添加字母分数
        grades.append('B')
    # 或者,如果大于某个值
    elif row > 80:
        # 添加字母分数
        grades.append('B-')
    # 或者,如果大于某个值
    elif row > 75:
        # 添加字母分数
        grades.append('C')
    # 或者,如果大于某个值
    elif row > 70:
        # 添加字母分数
        grades.append('C-')
    # 或者,如果大于某个值
    elif row > 65:
        # 添加字母分数
        grades.append('D')
    # 或者,如果大于某个值
    elif row > 60:
        # 添加字母分数
        grades.append('D-')
    # 否则
    else:
        # 添加不及格分数
        grades.append('Failed')

# 从列表创建一列
df['grades'] = grades

# 查看新数据帧
df
| | student_name | test_score | grades |
|---|---|---|---|
| 0 | Miller | 76.0 | C |
| 1 | Jacobson | 88.0 | B |
| 2 | Ali | 84.0 | B- |
| 3 | Milner | 67.0 | D |
| 4 | Cooze | 53.0 | Failed |
| 5 | Jacon | 96.0 | A |
| 6 | Ryaner | 64.0 | D- |
| 7 | Sone | 91.0 | A- |
| 8 | Sloan | 77.0 | C |
| 9 | Piger | 73.0 | C- |
| 10 | Riani | 52.0 | Failed |
| 11 | Ali | NaN | Failed |
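
一个假设性的替代方案是用 pd.cut 一次性完成分箱,省去 for 循环(分箱边界与上面的 if/elif 对应;注意缺失值会得到 NaN 而不是 'Failed'):

# 假设性示例:用 pd.cut 按分数区间直接生成等级
bins = [-np.inf, 60, 65, 70, 75, 80, 85, 90, 95, np.inf]
labels = ['Failed', 'D-', 'D', 'C-', 'C', 'B-', 'B', 'A-', 'A']
df['grades_cut'] = pd.cut(df['test_score'], bins=bins, labels=labels)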

创建项目计数

from collections import Counter

# 创建一个今天吃的水果的计数器
fruit_eaten = Counter(['Apple', 'Apple', 'Apple', 'Banana', 'Pear', 'Pineapple'])

# 查看计数器
fruit_eaten

# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 1}) 

# 更新菠萝的计数(因为你刚刚又吃了一个菠萝)
fruit_eaten.update(['Pineapple'])

# 查看计数器
fruit_eaten

# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 2}) 

# 查看计数最大的三个项目
fruit_eaten.most_common(3)

# [('Apple', 3), ('Pineapple', 2), ('Banana', 1)] 

基于条件创建一列

# 导入所需模块
import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
| | name | age | preTestScore | postTestScore |
|---|---|---|---|---|
| 0 | Jason | 42 | 4 | 25 |
| 1 | Molly | 52 | 24 | 94 |
| 2 | Tina | 36 | 31 | 57 |
| 3 | Jake | 24 | 2 | 62 |
| 4 | Amy | 73 | 3 | 70 |
# 创建一个名为 df.elderly 的新列
# 如果 df.age 大于等于 50 则值为 yes,否则为 no
df['elderly'] = np.where(df['age']>=50, 'yes', 'no')

# 查看数据帧
df
| | name | age | preTestScore | postTestScore | elderly |
|---|---|---|---|---|---|
| 0 | Jason | 42 | 4 | 25 | no |
| 1 | Molly | 52 | 24 | 94 | yes |
| 2 | Tina | 36 | 31 | 57 | no |
| 3 | Jake | 24 | 2 | 62 | no |
| 4 | Amy | 73 | 3 | 70 | yes |
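
如果条件多于两个,一个假设性的做法是使用 np.select(这里的年龄分组阈值仅作演示):

# 假设性示例:用 np.select 处理多个条件
conditions = [df['age'] >= 65, df['age'] >= 50]
choices = ['senior', 'elderly']
df['age_group'] = np.select(conditions, choices, default='adult')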

从字典的键和值创建列表

# 创建字典
dict = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'fireReports': [4, 24, 31, 2, 3]}

# 创建键的列表
list(dict.keys())

# ['fireReports', 'year', 'county'] 

# 创建值的列表
list(dict.values())

'''
[[4, 24, 31, 2, 3],
 [2012, 2012, 2013, 2014, 2014],
 ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']] 
'''

Pandas 中的交叉表

# 导入库
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['infantry', 'infantry', 'cavalry', 'cavalry', 'infantry', 'infantry', 'cavalry', 'cavalry','infantry', 'infantry', 'cavalry', 'cavalry'], 
        'experience': ['veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie','veteran', 'rookie', 'veteran', 'rookie'],
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'experience', 'name', 'preTestScore', 'postTestScore'])
df
| | regiment | company | experience | name | preTestScore | postTestScore |
|---|---|---|---|---|---|---|
| 0 | Nighthawks | infantry | veteran | Miller | 4 | 25 |
| 1 | Nighthawks | infantry | rookie | Jacobson | 24 | 94 |
| 2 | Nighthawks | cavalry | veteran | Ali | 31 | 57 |
| 3 | Nighthawks | cavalry | rookie | Milner | 2 | 62 |
| 4 | Dragoons | infantry | veteran | Cooze | 3 | 70 |
| 5 | Dragoons | infantry | rookie | Jacon | 4 | 25 |
| 6 | Dragoons | cavalry | veteran | Ryaner | 24 | 94 |
| 7 | Dragoons | cavalry | rookie | Sone | 31 | 57 |
| 8 | Scouts | infantry | veteran | Sloan | 2 | 62 |
| 9 | Scouts | infantry | rookie | Piger | 3 | 70 |
| 10 | Scouts | cavalry | veteran | Riani | 2 | 62 |
| 11 | Scouts | cavalry | rookie | Ali | 3 | 70 |

按公司和团队创建交叉表。按公司和团队计算观测数量。

pd.crosstab(df.regiment, df.company, margins=True)
| regiment | cavalry | infantry | All |
|---|---|---|---|
| Dragoons | 2 | 2 | 4 |
| Nighthawks | 2 | 2 | 4 |
| Scouts | 2 | 2 | 4 |
| All | 6 | 6 | 12 |
# 为每个团队创建公司和经验的交叉表
pd.crosstab([df.company, df.experience], df.regiment,  margins=True)
| company | experience | Dragoons | Nighthawks | Scouts | All |
|---|---|---|---|---|---|
| cavalry | rookie | 1 | 1 | 1 | 3 |
| cavalry | veteran | 1 | 1 | 1 | 3 |
| infantry | rookie | 1 | 1 | 1 | 3 |
| infantry | veteran | 1 | 1 | 1 | 3 |
| All | | 4 | 4 | 4 | 12 |
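
在较新版本的 Pandas 中,pd.crosstab 还支持 normalize 参数,可以把计数转换为比例(下面是一个假设性示例):

# 假设性示例:按行归一化,得到每个团队中各兵种的占比
pd.crosstab(df.regiment, df.company, normalize='index')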

删除重复

# 导入模块
import pandas as pd

raw_data = {'first_name': ['Jason', 'Jason', 'Jason','Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Miller', 'Miller','Ali', 'Milner', 'Cooze'], 
        'age': [42, 42, 1111111, 36, 24, 73], 
        'preTestScore': [4, 4, 4, 31, 2, 3],
        'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Jason | Miller | 42 | 4 | 25 |
| 2 | Jason | Miller | 1111111 | 4 | 25 |
| 3 | Tina | Ali | 36 | 31 | 57 |
| 4 | Jake | Milner | 24 | 2 | 62 |
| 5 | Amy | Cooze | 73 | 3 | 70 |
# 确定哪些观测是重复的
df.duplicated()

'''
0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool 
'''

df.drop_duplicates()
| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 2 | Jason | Miller | 1111111 | 4 | 25 |
| 3 | Tina | Ali | 36 | 31 | 57 |
| 4 | Jake | Milner | 24 | 2 | 62 |
| 5 | Amy | Cooze | 73 | 3 | 70 |
# 删除 first_name 列中的重复项
# 但保留重复集中的最后一个观测
df.drop_duplicates(['first_name'], keep='last')
| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 2 | Jason | Miller | 1111111 | 4 | 25 |
| 3 | Tina | Ali | 36 | 31 | 57 |
| 4 | Jake | Milner | 24 | 2 | 62 |
| 5 | Amy | Cooze | 73 | 3 | 70 |

Pandas 数据帧的描述性统计

# 导入模块
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
| | name | age | preTestScore | postTestScore |
|---|---|---|---|---|
| 0 | Jason | 42 | 4 | 25 |
| 1 | Molly | 52 | 24 | 94 |
| 2 | Tina | 36 | 31 | 57 |
| 3 | Jake | 24 | 2 | 62 |
| 4 | Amy | 73 | 3 | 70 |

5 rows × 4 columns

# 所有年龄之和
df['age'].sum()

# 227 

df['preTestScore'].mean()

# 12.800000000000001 

df['preTestScore'].cumsum()

'''
0     4
1    28
2    59
3    61
4    64
Name: preTestScore, dtype: int64 
'''

df['preTestScore'].describe()

'''
count     5.000000
mean     12.800000
std      13.663821
min       2.000000
25%       3.000000
50%       4.000000
75%      24.000000
max      31.000000
Name: preTestScore, dtype: float64 
'''

df['preTestScore'].count()

# 5 

df['preTestScore'].min()

# 2 

df['preTestScore'].max()

# 31 

df['preTestScore'].median()

# 4.0 

df['preTestScore'].var()

# 186.69999999999999 

df['preTestScore'].std()

# 13.663820841916802 

df['preTestScore'].skew()

# 0.74334524573267591 

df['preTestScore'].kurt()

# -2.4673543738411525 

df.corr()
| | age | preTestScore | postTestScore |
|---|---|---|---|
| age | 1.000000 | -0.105651 | 0.328852 |
| preTestScore | -0.105651 | 1.000000 | 0.378039 |
| postTestScore | 0.328852 | 0.378039 | 1.000000 |

3 rows × 3 columns

# 协方差矩阵
df.cov()
| | age | preTestScore | postTestScore |
|---|---|---|---|
| age | 340.80 | -26.65 | 151.20 |
| preTestScore | -26.65 | 186.70 | 128.65 |
| postTestScore | 151.20 | 128.65 | 620.30 |

3 rows × 3 columns

丢弃行或者列

# 导入模块
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
| | name | reports | year |
|---|---|---|---|
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
# 丢弃观测(行)
df.drop(['Cochice', 'Pima'])
| | name | reports | year |
|---|---|---|---|
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
# 丢弃变量(列)
# 注意:`axis = 1`表示我们指的是列,而不是行
df.drop('reports', axis=1)
| | name | year |
|---|---|---|
| Cochice | Jason | 2012 |
| Pima | Molly | 2012 |
| Santa Cruz | Tina | 2013 |
| Maricopa | Jake | 2014 |
| Yuma | Amy | 2014 |

如果某一行包含特定值(这里是Tina),就丢弃它。

具体来说:选取 name 列的值不等于Tina的所有行。

df[df.name != 'Tina']
| | name | reports | year |
|---|---|---|---|
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |

按照行号丢弃一行(在本例中为第 3 行)。

请注意,Pandas使用从零开始的编号,因此 0 是第一行,1 是第二行,等等。

df.drop(df.index[2])
| | name | reports | year |
|---|---|---|---|
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |

可以扩展到范围。

df.drop(df.index[[2,3]])
| | name | reports | year |
|---|---|---|---|
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Yuma | Amy | 3 | 2014 |

或相对于 DF 的末尾来丢弃。

df.drop(df.index[-2])
| | name | reports | year |
|---|---|---|---|
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Yuma | Amy | 3 | 2014 |

你也可以选择相对于起始或末尾的范围。

df[:3] # 保留前三个
| | name | reports | year |
|---|---|---|---|
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
df[:-3] # 丢掉后三个
| | name | reports | year |
|---|---|---|---|
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |

枚举列表

# 创建字符串列表
data = ['One','Two','Three','Four','Five']

# 对于 enumerate(data) 中的每个项目
for item in enumerate(data):
    # 打印整个枚举的元素
    print(item)
    # 只打印值(没有索引)
    print(item[1])

'''
(0, 'One')
One
(1, 'Two')
Two
(2, 'Three')
Three
(3, 'Four')
Four
(4, 'Five')
Five 
'''

在 Pandas 中将包含列表的单元扩展为自己的变量

# 导入 pandas
import pandas as pd

# 创建数据集
raw_data = {'score': [1,2,3], 
        'tags': [['apple','pear','guava'],['truck','car','plane'],['cat','dog','mouse']]}
df = pd.DataFrame(raw_data, columns = ['score', 'tags'])

# 查看数据集
df
| | score | tags |
|---|---|---|
| 0 | 1 | [apple, pear, guava] |
| 1 | 2 | [truck, car, plane] |
| 2 | 3 | [cat, dog, mouse] |
# 将 df.tags 扩展为自己的数据帧
tags = df['tags'].apply(pd.Series)

# 将每个变量重命名为标签
tags = tags.rename(columns = lambda x : 'tag_' + str(x))

# 查看 tags 数据帧
tags
| | tag_0 | tag_1 | tag_2 |
|---|---|---|---|
| 0 | apple | pear | guava |
| 1 | truck | car | plane |
| 2 | cat | dog | mouse |
# 将 tags 数据帧添加回原始数据帧
pd.concat([df[:], tags[:]], axis=1)
| | score | tags | tag_0 | tag_1 | tag_2 |
|---|---|---|---|---|---|
| 0 | 1 | [apple, pear, guava] | apple | pear | guava |
| 1 | 2 | [truck, car, plane] | truck | car | plane |
| 2 | 3 | [cat, dog, mouse] | cat | dog | mouse |

过滤 pandas 数据帧

# 导入模块
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
| | coverage | name | reports | year |
|---|---|---|---|---|
| Cochice | 25 | Jason | 4 | 2012 |
| Pima | 94 | Molly | 24 | 2012 |
| Santa Cruz | 57 | Tina | 31 | 2013 |
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
# 查看列
df['name']

'''
Cochice       Jason
Pima          Molly
Santa Cruz     Tina
Maricopa       Jake
Yuma            Amy
Name: name, dtype: object 
'''

df[['name', 'reports']]
| | name | reports |
|---|---|---|
| Cochice | Jason | 4 |
| Pima | Molly | 24 |
| Santa Cruz | Tina | 31 |
| Maricopa | Jake | 2 |
| Yuma | Amy | 3 |
# 查看前两行
df[:2]
| | coverage | name | reports | year |
|---|---|---|---|---|
| Cochice | 25 | Jason | 4 | 2012 |
| Pima | 94 | Molly | 24 | 2012 |
# 查看 Coverage 大于 50 的行
df[df['coverage'] > 50]
| | coverage | name | reports | year |
|---|---|---|---|---|
| Pima | 94 | Molly | 24 | 2012 |
| Santa Cruz | 57 | Tina | 31 | 2013 |
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
# 查看 Coverage 大于 50 并且 Reports 小于 4 的行
df[(df['coverage']  > 50) & (df['reports'] < 4)]
| | coverage | name | reports | year |
|---|---|---|---|---|
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
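
同样的过滤条件也可以用 query() 表达,下面是一个假设性的等价写法:

# 假设性示例:用 query 表达 coverage 大于 50 且 reports 小于 4
df.query('coverage > 50 and reports < 4')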

寻找数据帧的列中的最大值

# 导入模块
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 创建数据帧
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
# 获取 preTestScore 列中的最大值的索引
df['preTestScore'].idxmax()

# 2 

寻找数据帧中的唯一值

import pandas as pd
import numpy as np

raw_data = {'regiment': ['51st', '29th', '2nd', '19th', '12th', '101st', '90th', '30th', '193th', '1st', '94th', '91th'], 
            'trucks': ['MAZ-7310', np.nan, 'MAZ-7310', 'MAZ-7310', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'ZIS-150', 'Tatra 810', 'ZIS-150', 'ZIS-150'],
            'tanks': ['Merkava Mark 4', 'Merkava Mark 4', 'Merkava Mark 4', 'Leopard 2A6M', 'Leopard 2A6M', 'Leopard 2A6M', 'Arjun MBT', 'Leopard 2A6M', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT'],
            'aircraft': ['none', 'none', 'none', 'Harbin Z-9', 'Harbin Z-9', 'none', 'Harbin Z-9', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk']}

df = pd.DataFrame(raw_data, columns = ['regiment', 'trucks', 'tanks', 'aircraft'])

# 查看前几行
df.head()
| | regiment | trucks | tanks | aircraft |
|---|---|---|---|---|
| 0 | 51st | MAZ-7310 | Merkava Mark 4 | none |
| 1 | 29th | NaN | Merkava Mark 4 | none |
| 2 | 2nd | MAZ-7310 | Merkava Mark 4 | none |
| 3 | 19th | MAZ-7310 | Leopard 2A6M | Harbin Z-9 |
| 4 | 12th | Tatra 810 | Leopard 2A6M | Harbin Z-9 |
# 通过将 pandas 列转换为集合
# 创建唯一值的列表
list(set(df.trucks))

# [nan, 'Tatra 810', 'MAZ-7310', 'ZIS-150'] 

# 创建 df.trucks 中的唯一值的列表
list(df['trucks'].unique())

# ['MAZ-7310', nan, 'Tatra 810', 'ZIS-150'] 

地理编码和反向地理编码

在使用地理数据时,地理编码(将物理地址或位置转换为经纬度)和反向地理编码(将经纬度转换为物理地址或位置)是常见任务。

Python 提供了许多软件包,使任务变得异常简单。 在下面的教程中,我使用 pygeocoder(Google 的 geo-API 的包装器)来进行地理编码和反向地理编码。

首先,我们要加载我们想要在脚本中使用的包。 具体来说,我正在为地理函数加载 pygeocoder,为数据帧结构加载 pandas,为缺失值(np.nan)函数加载 numpy。

# 加载包
from pygeocoder import Geocoder
import pandas as pd
import numpy as np

地理数据有多种形式,在这种情况下,我们有一个 Python 字典,包含五个经纬度的字符串,每个坐标在逗号分隔的坐标对中。

# 创建原始数据的字典
data = {'Site 1': '31.336968, -109.560959',
        'Site 2': '31.347745, -108.229963',
        'Site 3': '32.277621, -107.734724',
        'Site 4': '31.655494, -106.420484',
        'Site 5': '30.295053, -104.014528'}

虽然从技术上讲没有必要,但由于我最初是用 R 的,是数据帧的忠实粉丝,所以让我们把上面模拟的数据字典变成数据帧。

# Convert the dictionary into a pandas dataframe
df = pd.DataFrame.from_dict(data, orient='index')

# View the dataframe
df
|  | 0 |
| --- | --- |
| Site 1 | 31.336968, -109.560959 |
| Site 2 | 31.347745, -108.229963 |
| Site 3 | 32.277621, -107.734724 |
| Site 4 | 31.655494, -106.420484 |
| Site 5 | 30.295053, -104.014528 |

You can now see that we have a dataframe with five rows, each containing one latitude/longitude string. Before we can work with the data we need to 1) split each string into latitude and longitude and 2) convert them to floats. The code below does exactly that.

# Create two empty lists for the loop to fill
lat = []
lon = []

# For each row in the column
for row in df[0]:
    # Try to...
    try:
        # split the row on the comma, convert the part
        # before the comma to a float, and append it to lat
        lat.append(float(row.split(',')[0]))
        # split the row on the comma, convert the part
        # after the comma to a float, and append it to lon
        lon.append(float(row.split(',')[1]))
    # But if an error occurs
    except:
        # append a missing value to lat
        lat.append(np.NaN)
        # append a missing value to lon
        lon.append(np.NaN)

# Create two new columns from lat and lon
df['latitude'] = lat
df['longitude'] = lon

Let's see what we have now.

# View the dataframe
df
|  | 0 | latitude | longitude |
| --- | --- | --- | --- |
| Site 1 | 31.336968, -109.560959 | 31.336968 | -109.560959 |
| Site 2 | 31.347745, -108.229963 | 31.347745 | -108.229963 |
| Site 3 | 32.277621, -107.734724 | 32.277621 | -107.734724 |
| Site 4 | 31.655494, -106.420484 | 31.655494 | -106.420484 |
| Site 5 | 30.295053, -104.014528 | 30.295053 | -104.014528 |

Awesome. This is exactly what we want to see: one column of latitude floats and one column of longitude floats.
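
For what it's worth, the same split can be done without an explicit loop. This is a minimal vectorized sketch, assuming the dataframe above with the raw coordinate strings in column 0.

# Split each string on the comma, expand into two columns, and cast to float
latlon = df[0].str.split(',', expand=True).astype(float)
df['latitude'] = latlon[0]
df['longitude'] = latlon[1]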

To reverse geocode, we feed a specific latitude/longitude pair (here the first row, index 0) into pygeocoder's reverse_geocode function.

# Convert a latitude/longitude pair into a location
results = Geocoder.reverse_geocode(df['latitude'][0], df['longitude'][0])

Now we can start pulling out the data we want.

# Print the latitude and longitude
results.coordinates

# (31.3372728, -109.5609559) 

# Print the city
results.city

# 'Douglas' 

# Print the country
results.country

# 'United States' 

# Print the street address (if available)
results.street_address

# Print the state / administrative area
results.administrative_area_level_1

# 'Arizona' 

For geocoding, we need to pass a string containing the address or location (such as a city) into the geocoding function. However, not every string is formatted in a way that Google's geo-API can make sense of. We can check whether an address is valid using .geocode().valid_address.

# Verify that the address is valid (i.e. in Google's system)
Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607").valid_address

# True 

Because the output is True, we now know that this is a valid address, and we can print its latitude and longitude coordinates.

# Print the latitude and longitude
results.coordinates

# (31.3372728, -109.5609559) 
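
Note that results above still refers to the reverse-geocoding result from earlier. To get the coordinates of the validated address itself, you would geocode it first; a small sketch using the same pygeocoder calls shown elsewhere in this tutorial.

# Geocode the address, then read its coordinates
address_result = Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607")
address_result.coordinates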

More interestingly, once an address has been processed by the Google geo-API, we can parse it and easily pull out the street number, street name, and so on.

# Find the latitude and longitude of a specific address
result = Geocoder.geocode("7250 South Tucson Boulevard, Tucson, AZ 85756")

# Print the street number
result.street_number

# '7250' 

# Print the street name
result.route

# 'South Tucson Boulevard' 

And there you have it. Python makes the whole process simple, taking only a few minutes to complete the analysis. Good luck!

Geolocate cities and countries

This tutorial creates a function that takes a city and a country and tries to return their latitude and longitude; when the city is not available (which is often the case), it returns the latitude and longitude of the center of the country instead.

from geopy.geocoders import Nominatim
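# Note: recent geopy releases require an explicit user agent, e.g. Nominatim(user_agent="geo-tutorial")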
geolocator = Nominatim()
import numpy as np

def geolocate(city=None, country=None):
    '''
    Takes a city and a country, or just a country. Returns the latitude/longitude
    coordinates of the city if it can be found, otherwise the coordinates of the
    center of the country.
    '''

    # If a city was supplied
    if city != None:
        try:
            # Geolocate the city and country
            loc = geolocator.geocode(str(city + ',' + country))
            # And return the latitude and longitude
            return (loc.latitude, loc.longitude)
        # If anything goes wrong
        except:
            # Return a missing value
            return np.nan
    # If no city was supplied
    else:
        try:
            # Geolocate the center of the country
            loc = geolocator.geocode(country)
            # And return the latitude and longitude
            return (loc.latitude, loc.longitude)
        # If anything goes wrong
        except:
            # Return a missing value
            return np.nan

# Geolocate a city and country
geolocate(city='Austin', country='USA')

# (30.2711286, -97.7436995) 

# Geolocate a country only
geolocate(country='USA')

# (39.7837304, -100.4458824) 
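
As a usage sketch (not part of the original), the function can be applied across a small table of places. The places dataframe below is hypothetical, and the call hits the live Nominatim service, so results may vary.

import pandas as pd

# Build a tiny table of places and geolocate each row
places = pd.DataFrame({'city': ['Austin', None], 'country': ['USA', 'USA']})
places['coords'] = places.apply(
    lambda row: geolocate(city=row['city'], country=row['country']), axis=1)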

Group a time series with pandas

# Import required modules
import pandas as pd
import numpy as np

df = pd.DataFrame()

df['german_army'] = np.random.randint(low=20000, high=30000, size=100)
df['allied_army'] = np.random.randint(low=20000, high=40000, size=100)
df.index = pd.date_range('1/1/2014', periods=100, freq='H')

df.head()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-01-01 00:00:00 | 28755 | 33938 |
| 2014-01-01 01:00:00 | 25176 | 28631 |
| 2014-01-01 02:00:00 | 23261 | 39685 |
| 2014-01-01 03:00:00 | 28686 | 27756 |
| 2014-01-01 04:00:00 | 24588 | 25681 |

Truncate the dataframe

df.truncate(before='1/2/2014', after='1/3/2014')
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-01-02 00:00:00 | 26401 | 20189 |
| 2014-01-02 01:00:00 | 29958 | 23934 |
| 2014-01-02 02:00:00 | 24492 | 39075 |
| 2014-01-02 03:00:00 | 25707 | 39262 |
| 2014-01-02 04:00:00 | 27129 | 35961 |
| 2014-01-02 05:00:00 | 27903 | 25418 |
| 2014-01-02 06:00:00 | 20409 | 25163 |
| 2014-01-02 07:00:00 | 25736 | 34794 |
| 2014-01-02 08:00:00 | 24057 | 27209 |
| 2014-01-02 09:00:00 | 26875 | 33402 |
| 2014-01-02 10:00:00 | 23963 | 38575 |
| 2014-01-02 11:00:00 | 27506 | 31859 |
| 2014-01-02 12:00:00 | 23564 | 25750 |
| 2014-01-02 13:00:00 | 27958 | 24365 |
| 2014-01-02 14:00:00 | 24915 | 38866 |
| 2014-01-02 15:00:00 | 23538 | 33820 |
| 2014-01-02 16:00:00 | 23361 | 30080 |
| 2014-01-02 17:00:00 | 27284 | 22922 |
| 2014-01-02 18:00:00 | 24176 | 32155 |
| 2014-01-02 19:00:00 | 23924 | 27763 |
| 2014-01-02 20:00:00 | 23111 | 32343 |
| 2014-01-02 21:00:00 | 20348 | 28907 |
| 2014-01-02 22:00:00 | 27136 | 38634 |
| 2014-01-02 23:00:00 | 28649 | 29950 |
| 2014-01-03 00:00:00 | 21292 | 26395 |
# Offset the index of the dataframe forward by four months and five days
df.index = df.index + pd.DateOffset(months=4, days=5)

df.head()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 00:00:00 | 28755 | 33938 |
| 2014-05-06 01:00:00 | 25176 | 28631 |
| 2014-05-06 02:00:00 | 23261 | 39685 |
| 2014-05-06 03:00:00 | 28686 | 27756 |
| 2014-05-06 04:00:00 | 24588 | 25681 |
# Shift the values forward one hour (each row now shows the previous hour's value)
df.shift(1).head()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 00:00:00 | NaN | NaN |
| 2014-05-06 01:00:00 | 28755.0 | 33938.0 |
| 2014-05-06 02:00:00 | 25176.0 | 28631.0 |
| 2014-05-06 03:00:00 | 23261.0 | 39685.0 |
| 2014-05-06 04:00:00 | 28686.0 | 27756.0 |
# Shift the values back one hour (each row now shows the next hour's value)
df.shift(-1).tail()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-05-09 23:00:00 | 26903.0 | 39144.0 |
| 2014-05-10 00:00:00 | 27576.0 | 39759.0 |
| 2014-05-10 01:00:00 | 25232.0 | 35246.0 |
| 2014-05-10 02:00:00 | 23391.0 | 21044.0 |
| 2014-05-10 03:00:00 | NaN | NaN |
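
A common use of shift() is to compute period-over-period changes; for example, the hour-over-hour change in each column (a small addition using the same df).

# Hour-over-hour change in each army's strength
(df - df.shift(1)).head()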
# Aggregate to daily frequency by summing the hourly observations
df.resample('D').sum()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 605161 | 755962 |
| 2014-05-07 | 608100 | 740396 |
| 2014-05-08 | 589744 | 700297 |
| 2014-05-09 | 607092 | 719283 |
| 2014-05-10 | 103102 | 135193 |
# Aggregate to daily frequency by averaging the hourly observations
df.resample('D').mean()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 25215.041667 | 31498.416667 |
| 2014-05-07 | 25337.500000 | 30849.833333 |
| 2014-05-08 | 24572.666667 | 29179.041667 |
| 2014-05-09 | 25295.500000 | 29970.125000 |
| 2014-05-10 | 25775.500000 | 33798.250000 |
# Aggregate to daily frequency by taking the minimum of the hourly observations
df.resample('D').min()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 24882.0 | 31310.0 |
| 2014-05-07 | 25311.0 | 30969.5 |
| 2014-05-08 | 24422.5 | 28318.0 |
| 2014-05-09 | 24941.5 | 32082.5 |
| 2014-05-10 | 26067.5 | 37195.0 |
# Aggregate to daily frequency by taking the median of the hourly observations
df.resample('D').median()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 24882.0 | 31310.0 |
| 2014-05-07 | 25311.0 | 30969.5 |
| 2014-05-08 | 24422.5 | 28318.0 |
| 2014-05-09 | 24941.5 | 32082.5 |
| 2014-05-10 | 26067.5 | 37195.0 |
# Aggregate to daily frequency by taking the first hourly observation of each day
df.resample('D').first()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 28755 | 33938 |
| 2014-05-07 | 26401 | 20189 |
| 2014-05-08 | 21292 | 26395 |
| 2014-05-09 | 25764 | 22613 |
| 2014-05-10 | 26903 | 39144 |
# Aggregate to daily frequency by taking the last hourly observation of each day
df.resample('D').last()
|  | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 28214 | 32110 |
| 2014-05-07 | 28649 | 29950 |
| 2014-05-08 | 28379 | 32600 |
| 2014-05-09 | 26752 | 22379 |
| 2014-05-10 | 23391 | 21044 |
# Aggregate to daily frequency with the open, high, low, and close of the hourly observations
df.resample('D').ohlc()
|  | german_army open | german_army high | … |
| --- | --- | --- | --- |
| 2014-05-06 | 28755 | 29206 | … |
| 2014-05-07 | 26401 | 29958 | … |
| 2014-05-08 | 21292 | 29786 | … |
| 2014-05-09 | 25764 | 29952 | … |
| 2014-05-10 | 26903 | 27576 | … |

Group data by time

On March 13, 2016, pandas version 0.18.0 was released, and with it the way resampling works changed significantly. This tutorial follows v0.18.0 and will not work with earlier versions of pandas.

First, let's load the modules we care about.

# Import required modules
import pandas as pd
import datetime
import numpy as np

Next, let's create some sample data that we can group by time. In this example I create a dataframe of 365 rows and two columns: one column of dates and one column of numeric values.

# Create a datetime variable for today
base = datetime.datetime.today()
# Create a list of 365 datetime values,
# one for each of the past 365 days
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]

# Create a list of 365 numeric scores
score_list = list(np.random.randint(low=1, high=1000, size=365))

# Create an empty dataframe
df = pd.DataFrame()

# Create a column from the datetime list
df['datetime'] = date_list
# Convert the column to datetime dtype
df['datetime'] = pd.to_datetime(df['datetime'])
# Set the datetime column as the index
df.index = df['datetime'] 
# Create a column for the numeric score
df['score'] = score_list

# Let's look at the data
df.head()
| datetime (index) | datetime | score |
| --- | --- | --- |
| 2016-06-02 09:57:54.793972 | 2016-06-02 09:57:54.793972 | 900 |
| 2016-06-01 09:57:54.793972 | 2016-06-01 09:57:54.793972 | 121 |
| 2016-05-31 09:57:54.793972 | 2016-05-31 09:57:54.793972 | 547 |
| 2016-05-30 09:57:54.793972 | 2016-05-30 09:57:54.793972 | 504 |
| 2016-05-29 09:57:54.793972 | 2016-05-29 09:57:54.793972 | 304 |

In pandas, the most common way to group by time is with the .resample() function. In v0.18.0 this function is two-stage: df.resample('M') creates an object to which we can then apply other functions (mean, count, sum, etc.).

# Group the data by month and take the mean of each group (i.e. each month)
df.resample('M').mean()
| datetime | score |
| --- | --- |
| 2015-06-30 | 513.629630 |
| 2015-07-31 | 561.516129 |
| 2015-08-31 | 448.032258 |
| 2015-09-30 | 548.000000 |
| 2015-10-31 | 480.419355 |
| 2015-11-30 | 487.033333 |
| 2015-12-31 | 499.935484 |
| 2016-01-31 | 429.193548 |
| 2016-02-29 | 520.413793 |
| 2016-03-31 | 349.806452 |
| 2016-04-30 | 395.500000 |
| 2016-05-31 | 503.451613 |
| 2016-06-30 | 510.500000 |
# Group the data by month and take the sum of each group (i.e. each month)
df.resample('M').sum()
| datetime | score |
| --- | --- |
| 2015-06-30 | 13868 |
| 2015-07-31 | 17407 |
| 2015-08-31 | 13889 |
| 2015-09-30 | 16440 |
| 2015-10-31 | 14893 |
| 2015-11-30 | 14611 |
| 2015-12-31 | 15498 |
| 2016-01-31 | 13305 |
| 2016-02-29 | 15092 |
| 2016-03-31 | 10844 |
| 2016-04-30 | 11865 |
| 2016-05-31 | 15607 |
| 2016-06-30 | 1021 |
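
Because the resampler is just an object, you can also ask it for several statistics at once. A minimal sketch with .agg(), assuming the df built above.

# Mean, sum, and count of the score per month in one pass
df.resample('M')['score'].agg(['mean', 'sum', 'count'])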

There are many options for grouping. You can read more about them in pandas' time series documentation, but for your convenience they are also listed below.

| Alias | Description |
| --- | --- |
| B | business day frequency |
| C | custom business day frequency (experimental) |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| BM | business month end frequency |
| CBM | custom business month end frequency |
| MS | month start frequency |
| BMS | business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter end frequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A | year end frequency |
| BA | business year end frequency |
| AS | year start frequency |
| BAS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T | minutely frequency |
| S | secondly frequency |
| L | milliseconds |
| U | microseconds |
| N | nanoseconds |
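
Any of these aliases can be passed to .resample(). For example, a weekly mean (a small addition using the same df as above).

# Mean score per calendar week
df.resample('W')['score'].mean().head()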

Group data by hour

# Import libraries
import pandas as pd
import numpy as np

# Create a time series of 2,000 elements,
# one every five minutes, starting on 1/1/2000
time = pd.date_range('1/1/2000', periods=2000, freq='5min')

# Create a pandas Series of random values between 0 and 100,
# using time as the index
series = pd.Series(np.random.randint(100, size=2000), index=time)

# View the first few rows
series[0:10]

'''
2000-01-01 00:00:00    40
2000-01-01 00:05:00    13
2000-01-01 00:10:00    99
2000-01-01 00:15:00    72
2000-01-01 00:20:00     4
2000-01-01 00:25:00    36
2000-01-01 00:30:00    24
2000-01-01 00:35:00    20
2000-01-01 00:40:00    83
2000-01-01 00:45:00    44
Freq: 5T, dtype: int64 
'''

# Group the data by the hour of the index, then aggregate with the mean
series.groupby(series.index.hour).mean()

'''
0     50.380952
1     49.380952
2     49.904762
3     53.273810
4     47.178571
5     46.095238
6     49.047619
7     44.297619
8     53.119048
9     48.261905
10    45.166667
11    54.214286
12    50.714286
13    56.130952
14    50.916667
15    42.428571
16    46.880952
17    56.892857
18    54.071429
19    47.607143
20    50.940476
21    50.511905
22    44.550000
23    50.250000
dtype: float64 
'''
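
The same pattern works for any attribute of the DatetimeIndex. For example, grouping by day of week instead of hour (a small sketch using the series above).

# Mean value for each day of the week (0 = Monday)
series.groupby(series.index.dayofweek).mean()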

Group rows

# Import modules
import pandas as pd

# Example dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
|  | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
# Create a groupby object. In other words,
# create an object that represents that particular grouping.
# Here, we group the pre-test scores by regiment.
regiment_preScore = df['preTestScore'].groupby(df['regiment'])

# Display the mean pre-test score for each regiment
regiment_preScore.mean()

'''
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64 
'''

Hierarchical data in pandas

# Import modules
import pandas as pd

# Create a dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
|  | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
# Set a hierarchical index but keep the columns in place
df = df.set_index(['regiment', 'company'], drop=False)
df
| regiment (index) | company (index) | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| Nighthawks | 1st | Nighthawks | 1st | Miller | 4 | 25 |
| Nighthawks | 1st | Nighthawks | 1st | Jacobson | 24 | 94 |
| Nighthawks | 2nd | Nighthawks | 2nd | Ali | 31 | 57 |
| Nighthawks | 2nd | Nighthawks | 2nd | Milner | 2 | 62 |
| Dragoons | 1st | Dragoons | 1st | Cooze | 3 | 70 |
| Dragoons | 1st | Dragoons | 1st | Jacon | 4 | 25 |
| Dragoons | 2nd | Dragoons | 2nd | Ryaner | 24 | 94 |
| Dragoons | 2nd | Dragoons | 2nd | Sone | 31 | 57 |
| Scouts | 1st | Scouts | 1st | Sloan | 2 | 62 |
| Scouts | 1st | Scouts | 1st | Piger | 3 | 70 |
| Scouts | 2nd | Scouts | 2nd | Riani | 2 | 62 |
| Scouts | 2nd | Scouts | 2nd | Ali | 3 | 70 |
# Set the hierarchical index to regiment and then company
df = df.set_index(['regiment', 'company'])
df
| regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| Nighthawks | 1st | Miller | 4 | 25 |
| Nighthawks | 1st | Jacobson | 24 | 94 |
| Nighthawks | 2nd | Ali | 31 | 57 |
| Nighthawks | 2nd | Milner | 2 | 62 |
| Dragoons | 1st | Cooze | 3 | 70 |
| Dragoons | 1st | Jacon | 4 | 25 |
| Dragoons | 2nd | Ryaner | 24 | 94 |
| Dragoons | 2nd | Sone | 31 | 57 |
| Scouts | 1st | Sloan | 2 | 62 |
| Scouts | 1st | Piger | 3 | 70 |
| Scouts | 2nd | Riani | 2 | 62 |
| Scouts | 2nd | Ali | 3 | 70 |
# View the index
df.index

MultiIndex(levels=[['Dragoons', 'Nighthawks', 'Scouts'], ['1st', '2nd']],
           labels=[[1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]],
           names=['regiment', 'company']) 
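
With the hierarchical index in place, rows can be selected by level. A minimal sketch, assuming the indexed df above.

# All Nighthawks rows
df.loc['Nighthawks']

# Just the Nighthawks, 1st company rows
df.loc[('Nighthawks', '1st')]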

# Swap the levels in the index
df.swaplevel('regiment', 'company')
| company | regiment | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| 1st | Nighthawks | Miller | 4 | 25 |
| 1st | Nighthawks | Jacobson | 24 | 94 |
| 2nd | Nighthawks | Ali | 31 | 57 |
| 2nd | Nighthawks | Milner | 2 | 62 |
| 1st | Dragoons | Cooze | 3 | 70 |
| 1st | Dragoons | Jacon | 4 | 25 |
| 2nd | Dragoons | Ryaner | 24 | 94 |
| 2nd | Dragoons | Sone | 31 | 57 |
| 1st | Scouts | Sloan | 2 | 62 |
| 1st | Scouts | Piger | 3 | 70 |
| 2nd | Scouts | Riani | 2 | 62 |
| 2nd | Scouts | Ali | 3 | 70 |
# Sum the data by the regiment index level
df.sum(level='regiment')
| regiment | preTestScore | postTestScore |
| --- | --- | --- |
| Nighthawks | 61 | 238 |
| Dragoons | 62 | 246 |
| Scouts | 10 | 264 |
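
In more recent pandas versions the level argument of sum() has been deprecated; the equivalent, which also works here, is to group on the index level explicitly.

# The same per-regiment totals via an explicit groupby on the index level
df.groupby(level='regiment')[['preTestScore', 'postTestScore']].sum()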
