Python数据分析与机器学习实战-07.pandas索引与计算

348 阅读 0 评论 230 点赞

我是靠谱客的博主完美纸飞机，这篇文章主要介绍Python数据分析与机器学习实战-07.pandas索引与计算，现在分享给大家，希望可以做个参考。

通过pandas获取筛选数据的方法：可以选择索引切片，也可以通过字段名筛数据：

筛选数据：

1.索引的形式：

（1）一个索引：food_info.loc[0]

（2）多个索引：food_info.loc[3:6]；food_info.loc[[2,5,10]

2.通过字段名：food_info["NDB_No"]；food_info[["Zinc_(mg)", "Copper_(mg)"]]#真实奇怪，获取多个列时，需要2个【】【】

3.将矩阵转化为python列表list：food_info.columns.tolist()

4.运算：获取最大值，food_info["Energ_Kcal"].max()

5.排序：sort_values(by,axis=0,ascending=True,inplace=False,kind="quicksort",na_position="last")

参数	说明
by	指定列名(axis=0或’index’)或索引值(axis=1或’columns’)
axis	若axis=0或’index’，则按照指定列中数据大小排序；若axis=1或’columns’，则按照指定索引中数据大小排序，默认axis=0
ascending	是否按指定列的数组升序排列，默认为True，即升序排列
inplace	是否用排序后的数据集替换原来的数据，默认为False，即不替换
kind
na_position	{‘first’,‘last’}，设定缺失值的显示位置

通过pandas获取筛选数据的方法：可以选择索引切片，也可以通过字段名筛数据：

方法一：通过索引取值，操作类似切片，按行索引取值，类似SQL

1）通过1个索引获取数据：

# -*- coding: utf-8 -*-
import pandas
#读取文件
food_info = pandas.read_csv("/Users/liyili2/Downloads/tang/pandas_data/food_info.csv")
#pandas.loc[i]:按照行索引号取数，如果索引号不存在，会报错提示。
result1=food_info.loc[0]
result2=food_info.loc[6]
result3=food_info.loc[8000]

2）通过多个索引获取数据：

# -*- coding: utf-8 -*-
import pandas
#读取文件
food_info = pandas.read_csv("/Users/liyili2/Downloads/tang/pandas_data/food_info.csv")
result4=food_info.loc[3:6]
print(result4)#返回索引号3-6的数据
two_five_ten = [2,5,10]#返回索引2，5，10的值，#方法一
print(food_info.loc[two_five_ten])
print(food_info.loc[[2,5,10]])#返回索引2，5，10的值，#方法二

总的：

# -*- coding: utf-8 -*-
import pandas
#读取文件
food_info = pandas.read_csv("/Users/liyili2/Downloads/tang/pandas_data/food_info.csv")
#pandas.loc[i]:按照行索引号取数，如果索引号不存在，会报错提示。
result1=food_info.loc[0]
print("result1的结果",result1.dtype)
print(result1)
print("
")
result2=food_info.loc[6]
print("result2的结果")
print(result2)
print("
")
result3=food_info.loc[8000]
print("result3的结果")
print(result3)
result4=food_info.loc[3:6]
print("result4的结果:返回索引号3-6的数据")
print(result4)
#返回索引2，5，10的值
#方法一
two_five_ten = [2,5,10]
print("方法一，result5的结果:返回索引号2，5，10的数据")
print(food_info.loc[two_five_ten])
#方法二
print("方法二，result5的结果:返回索引号2，5，10的数据")
print(food_info.loc[[2,5,10]])

方法二：通过字段名获取筛选数据

单独获取一列数据：

# -*- coding: utf-8 -*-
import pandas
#读取文件
food_info = pandas.read_csv("/Users/liyili2/Downloads/tang/pandas_data/food_info.csv")
ndb_col = food_info["NDB_No"]
print("获取列名为：NDB_No的数据")
print(ndb_col)
col_name = "NDB_No"
ndb_col = food_info[col_name]

获取多列数据：

# -*- coding: utf-8 -*-
import pandas
#读取文件
food_info = pandas.read_csv("/Users/liyili2/Downloads/tang/pandas_data/food_info.csv")
columns = ["Zinc_(mg)", "Copper_(mg)"]
zinc_copper = food_info[columns]
print(zinc_copper)
zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]
print(zinc_copper)

# -*- coding: utf-8 -*-
import pandas
#读取文件
food_info = pandas.read_csv("/Users/liyili2/Downloads/tang/pandas_data/food_info.csv")
print(food_info.columns)
print(type(food_info.columns))
print(food_info.head(2))
col_names = food_info.columns.tolist()#可以使用tolist()函数转化为list
print(type(col_names))
print(col_names)
gram_columns = []
for c in col_names:
if c.endswith("(g)"):#记住了，新知识点
gram_columns.append(c)
gram_df = food_info[gram_columns]
print("
")
print(gram_df.head(3))

结果：
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
'Cholestrl_(mg)'],
dtype='object')
<class 'pandas.core.indexes.base.Index'>
NDB_No
Shrt_Desc
...
FA_Poly_(g)
Cholestrl_(mg)
0
1001
BUTTER WITH SALT
...
3.043
215.0
1
1002
BUTTER WHIPPED WITH SALT
...
3.012
219.0
[2 rows x 36 columns]
<class 'list'>
['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)', 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)']
Water_(g)
Protein_(g)
Lipid_Tot_(g)
...
FA_Sat_(g)
FA_Mono_(g)
FA_Poly_(g)
0
15.87
0.85
81.11
...
51.368
21.021
3.043
1
15.87
0.85
81.11
...
50.489
23.426
3.012
2
0.24
0.28
99.48
...
61.924
28.732
3.694
[3 rows x 10 columns]

3.pandas中数据类型

#object - For string values
#int - For integer values
#float - For float values
#datetime - For time values
#bool - For Boolean values
#print(food_info.dtypes)

4.pandas运算

# -*- coding: utf-8 -*-
import pandas
#读取文件
food_info = pandas.read_csv("/Users/liyili2/Downloads/tang/pandas_data/food_info.csv")
print(food_info.head(3))
col_names = food_info.columns.tolist()#可以使用tolist()函数转化为list
print(food_info["Iron_(mg)"])
div_1000 = food_info["Iron_(mg)"] / 1000#"字段Iron_(mg)的所有数据都要除以1000"
print(div_1000)
add_100 = food_info["Iron_(mg)"] + 100#"字段Iron_(mg)的所有数据都要加上100"
print(add_100)
sub_100 = food_info["Iron_(mg)"] - 100#"字段Iron_(mg)的所有数据都要减去100"
print(sub_100)
mult_2 = food_info["Iron_(mg)"]*2#"字段Iron_(mg)的所有数据都要乘以2"
print(mult_2)
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
iron_grams = food_info["Iron_(mg)"] / 1000
result=food_info["Iron_(g)"] = iron_grams
print(result)
weighted_protein = food_info["Protein_(g)"] * 2
weighted_fat = -0.75 * food_info["Lipid_Tot_(g)"]
initial_rating = weighted_protein + weighted_fat
print(initial_rating)
max_calories = food_info["Energ_Kcal"].max()#获取所在行的最大值
normalized_calories = food_info["Energ_Kcal"] / max_calories
normalized_protein = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
food_info["Normalized_Protein"] = normalized_protein
food_info["Normalized_Fat"] = normalized_fat

5.排序：

sort_values(by,axis=0,ascending=True,inplace=False,kind="quicksort",na_position="last")

参数	说明
by	指定列名(axis=0或’index’)或索引值(axis=1或’columns’)
axis	若axis=0或’index’，则按照指定列中数据大小排序；若axis=1或’columns’，则按照指定索引中数据大小排序，默认axis=0
ascending	是否按指定列的数组升序排列，默认为True，即升序排列
inplace	是否用排序后的数据集替换原来的数据，默认为False，即不替换
kind
na_position	{‘first’,‘last’}，设定缺失值的显示位置

参考：https://blog.csdn.net/MsSpark/article/details/83154128

例子：

利用字典dict创建数据框
import numpy as np
import pandas as pd
print("原始数据")
df=pd.DataFrame({'col1':['A','A','B',np.nan,'D','C'],
'col2':[2,1,9,8,7,7],
'col3':[0,1,9,4,2,8]
})
print(df)
print("
")
print("依据第一列排序，并将该列空值放在首位")
print(df.sort_values(by=['col1'],na_position='first'))
print("
")
print("依据第二、三列，数值降序排序")
print(df.sort_values(by=['col2','col3'],ascending=False))
print("
")
print("根据第一列中数值排序，按降序排列，并替换原数据")
df.sort_values(by=['col1'],ascending=False,inplace=True,
na_position='first')
print(df)
x = pd.DataFrame({'x1':[1,2,2,3],'x2':[4,3,2,1],'x3':[3,2,4,1]})
print("
")
print(x)
print("
")
print("按照索引值为0的行，即第一行的值来降序排序")
print(x.sort_values(by =0,ascending=False,axis=1))

结果展示：

原始数据	依据第一列排序，并将该列空值放在首位
col1 col2 col3 0 A 2 0 1 A 1 1 2 B 9 9 3 NaN 8 4 4 D 7 2 5 C 7 8	col1 col2 col3 3 NaN 8 4 0 A 2 0 1 A 1 1 2 B 9 9 5 C 7 8 4 D 7 2

原始数据	依据第二、三列，数值降序排序
col1 col2 col3 0 A 2 0 1 A 1 1 2 B 9 9 3 NaN 8 4 4 D 7 2 5 C 7 8	col1 col2 col3 2 B 9 9 3 NaN 8 4 5 C 7 8 4 D 7 2 0 A 2 0 1 A 1 1

原始数据	根据第一列中数值排序，按降序排列，并替换原数据
col1 col2 col3 0 A 2 0 1 A 1 1 2 B 9 9 3 NaN 8 4 4 D 7 2 5 C 7 8	col1 col2 col3 3 NaN 8 4 4 D 7 2 5 C 7 8 2 B 9 9 0 A 2 0 1 A 1 1

原始数据	按照索引值为0的行，即第一行的值来降序排序
x1 x2 x3 0 1 4 3 1 2 3 2 2 2 2 4 3 3 1 1	x2 x3 x1 0 4 3 1 1 3 2 2 2 2 4 2 3 1 1 3