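Overview
Pandas is a Python library for data processing that is widely used in machine learning. When you face a large amount of data, the usual first step is to preprocess it with Pandas to extract the information you need. Let's look at some of the operations Pandas provides.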
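import pandas
food_info = pandas.read_csv("food_info.csv")  # read the data from the CSV file
print(type(food_info))  # the core structure in pandas is the DataFrame
print(food_info.head(3))  # print the first 3 rows as a table; with no argument, head() shows the first 5
print(food_info.tail(4))  # print the last 4 rows
Output: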
<class 'pandas.core.frame.DataFrame'>
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g)
0 1001 BUTTER WITH SALT 15.87 717 0.85
1 1002 BUTTER WHIPPED WITH SALT 15.87 717 0.85
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ...
0 81.11 2.11 0.06 0.0 0.06 ...
1 81.11 2.11 0.06 0.0 0.06 ...
2 99.48 0.00 0.00 0.0 0.00 ...
Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg)
0 2499.0 684.0 2.32 1.5 60.0 7.0
1 2499.0 684.0 2.32 1.5 60.0 7.0
2 3069.0 840.0 2.80 1.8 73.0 8.6
FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
0 51.368 21.021 3.043 215.0
1 50.489 23.426 3.012 219.0
2 61.924 28.732 3.694 256.0
[3 rows x 36 columns]
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g)
8614 90240 SCALLOP (BAY&SEA) CKD STMD 70.25 111 20.54
8615 90480 SYRUP CANE 26.00 269 0.00
8616 90560 SNAIL RAW 79.20 90 16.10
8617 93600 TURTLE GREEN RAW 78.50 89 19.80
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g)
8614 0.84 2.97 5.41 0.0 0.0
8615 0.00 0.86 73.14 0.0 73.2
8616 1.40 1.30 2.00 0.0 0.0
8617 0.50 1.20 0.00 0.0 0.0
... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg)
8614 ... 5.0 2.0 0.0 0.0 2.0 0.0
8615 ... 0.0 0.0 0.0 0.0 0.0 0.0
8616 ... 100.0 30.0 5.0 0.0 0.0 0.1
8617 ... 100.0 30.0 0.5 0.0 0.0 0.1
FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
8614 0.218 0.082 0.222 41.0
8615 0.000 0.000 0.000 0.0
8616 0.361 0.259 0.252 50.0
8617 0.127 0.088 0.170 50.0
[4 rows x 36 columns]
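To try these calls without the food_info.csv file, the following is a minimal sketch that reads a tiny CSV from an in-memory string. The three rows and four columns are copied from the output above; the names `csv_text`, `df`, `first_two`, and `last_one` are assumptions made for illustration only.

```python
import io

import pandas as pd

# A tiny stand-in for food_info.csv (values copied from the output above).
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1001,BUTTER WITH SALT,15.87,717
1002,BUTTER WHIPPED WITH SALT,15.87,717
1003,BUTTER OIL ANHYDROUS,0.24,876
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)  # first 2 rows; head() with no argument returns up to 5
last_one = df.tail(1)   # last row only

print(type(df))
print(first_two)
print(last_one)
```

Both `head` and `tail` return new DataFrame objects, so the original `df` is left unchanged.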
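print(food_info.columns)  # show the column names
print(food_info.shape)  # the data has 8618 samples (rows), each with 36 features (columns)
print(food_info.loc[0, "NDB_No"])  # food_info[0] does not select a row; use the .loc indexer to select by row label, and add a column name to locate a single element
print(food_info["NDB_No"])  # select and print a whole column by its name
Output: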
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
'Cholestrl_(mg)'],
dtype='object')
(8618, 36)
1001
0 1001
1 1002
2 1003
3 1004
4 1005
...
8613 83110
8614 90240
8615 90480
8616 90560
8617 93600
Name: NDB_No, Length: 8618, dtype: int64
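The difference between selecting by label and selecting by position is easy to miss when the index happens to be 0, 1, 2, and so on. The sketch below uses a made-up three-row frame with a deliberately non-default index (an assumption for illustration) to show `.loc`, `.iloc`, and column selection side by side.

```python
import pandas as pd

df = pd.DataFrame(
    {"NDB_No": [1001, 1002, 1003], "Energ_Kcal": [717, 717, 876]},
    index=[10, 20, 30],  # deliberately not 0..2, to separate label from position
)

n_rows, n_cols = df.shape          # shape is a (rows, columns) tuple

by_label = df.loc[20, "NDB_No"]    # row with index label 20
by_position = df.iloc[1, 0]        # second row, first column, counted from 0

one_column = df["Energ_Kcal"]                 # a single name gives a Series
two_columns = df[["NDB_No", "Energ_Kcal"]]    # a list of names gives a DataFrame

print(n_rows, n_cols, by_label, by_position)
print(type(one_column).__name__, type(two_columns).__name__)
```

In the article's data the index is the default 0 to 8617, which is why `food_info.loc[0, "NDB_No"]` returns the first row's value there.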
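# Arithmetic in pandas (+, -, *, /) is applied element by element to every value in the column
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
# Note: despite its name, iron_grams holds Water_(g) * Energ_Kcal / 1000, not an iron content in grams
iron_grams = water_energy / 1000
print(food_info.shape)
food_info["Iron_(g)"] = iron_grams  # assigning to a new column name creates that column
print(food_info.shape)
Output: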
(8618, 36)
(8618, 37)
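As a small self-contained illustration of the element-wise arithmetic above, the sketch below repeats the same computation on two rows copied from the earlier output. The column name `Water_x_Energy_per_1000` is a hypothetical, more descriptive alternative to `Iron_(g)`.

```python
import pandas as pd

df = pd.DataFrame({"Water_(g)": [15.87, 0.24], "Energ_Kcal": [717, 876]})

# Multiplying two columns pairs the values up row by row.
product = df["Water_(g)"] * df["Energ_Kcal"]

# Dividing by a scalar applies the division to every element.
df["Water_x_Energy_per_1000"] = product / 1000

print(df.shape)
print(df)
```

Assigning to a column name that does not exist yet appends the column, which is why the shape grows from two columns to three here, just as it grows from 36 to 37 in the article's data.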
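food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)  # sort by the given column; ascending=False sorts in descending order, inplace=True modifies food_info itself, and NaN values are placed last
print(food_info["Sodium_(mg)"])
food_info_reindex = food_info.reset_index(drop=True)  # after sorting, the index labels are out of order; reset_index(drop=True) replaces them with a fresh 0..n-1 index and discards the old one
print(food_info_reindex)
# print(help(food_info.sort_values))
# print(help(food_info.reset_index))
Output: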
276 38758.0
5814 27360.0
6192 26050.0
1242 26000.0
1245 24000.0
...
8184 NaN
8185 NaN
8195 NaN
8251 NaN
8267 NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
NDB_No Shrt_Desc Water_(g)
0 2047 SALT TABLE 0.20
1 18372 LEAVENING AGENTS BAKING SODA 0.20
2 19225 DESSERTS RENNIN TABLETS UNSWTND 6.50
3 6075 SOUP BF BROTH OR BOUILLON PDR DRY 3.27
4 6081 SOUP CHICK BROTH CUBES DRY 2.50
... ... ... ...
8613 35092 WILLOW LEAVES IN OIL (ALASKA NATIVE) 28.00
8614 35093 WILLOW YOUNG LEAVES CHOPD (ALASKA NATIVE) 68.70
8615 35139 SQUASH INDIAN CKD BLD (NAVAJO) 96.21
8616 35199 PRICKLY PEARS BRLD (NORTHERN PLAINS INDIANS) 75.83
8617 35231 SEA LION STELLER FAT (ALASKA NATIVE) 4.70
Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g)
0 0 0.00 0.00 99.8 0.00
1 0 0.00 0.00 36.9 0.00
2 84 1.00 0.10 72.5 19.80
3 213 15.97 8.89 54.5 17.40
4 198 14.60 4.70 54.7 23.50
... ... ... ... ... ...
8613 592 2.60 61.00 0.3 8.10
8614 122 6.10 1.60 2.9 20.70
8615 16 0.31 0.15 0.1 3.22
8616 91 0.39 0.31 1.9 21.57
8617 850 0.90 94.00 0.2 0.00
Fiber_TD_(g) Sugar_Tot_(g) ... Vit_A_RAE Vit_E_(mg) Vit_D_mcg
0 0.0 0.00 ... 0.0 0.00 0.0
1 0.0 0.00 ... 0.0 0.00 0.0
2 0.0 NaN ... 0.0 NaN NaN
3 0.0 16.71 ... 0.0 2.17 0.0
4 0.0 0.00 ... 0.0 0.09 NaN
... ... ... ... ... ... ...
8613 NaN NaN ... NaN NaN NaN
8614 NaN NaN ... NaN NaN NaN
8615 1.5 2.02 ... NaN NaN NaN
8616 NaN NaN ... NaN NaN NaN
8617 NaN NaN ... 97.0 NaN 0.0
Vit_D_IU Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g)
0 0.0 0.0 0.000 0.000 0.000
1 0.0 0.0 0.000 0.000 0.000
2 NaN NaN 0.041 0.038 0.007
3 0.0 3.2 4.320 3.616 0.332
4 NaN 0.0 1.200 1.920 1.620
... ... ... ... ... ...
8613 NaN NaN NaN NaN NaN
8614 NaN NaN NaN NaN NaN
8615 NaN NaN NaN NaN NaN
8616 NaN NaN NaN NaN NaN
8617 0.0 NaN NaN NaN NaN
Cholestrl_(mg) Iron_(g)
0 0.0 0.00000
1 0.0 0.00000
2 0.0 0.54600
3 10.0 0.69651
4 13.0 0.49500
... ... ...
8613 NaN 16.57600
8614 NaN 8.38140
8615 NaN 1.53936
8616 NaN 6.90053
8617 95.0 3.99500
[8618 rows x 37 columns]
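The interaction between sorting, the index, and missing values can be seen on a four-row example. This is a sketch on made-up data (the `name` column and its values are assumptions); unlike the article's code it does not use `inplace=True`, so the original frame stays untouched and can be compared with the sorted one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "Sodium_(mg)": [5.0, np.nan, 38758.0, 12.0]}
)

# Without inplace=True, sort_values returns a new, sorted DataFrame.
sorted_df = df.sort_values("Sodium_(mg)", ascending=False)

# Rows keep their original index labels, and NaN is placed last by default.
print(sorted_df.index.tolist())

# drop=True discards the old labels instead of keeping them as a column.
reindexed = sorted_df.reset_index(drop=True)
print(reindexed)
```

Keeping the old labels (the default, `drop=False`) can be useful when you need to trace a sorted row back to its position in the original file.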
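import pandas as pd
import numpy as np
titanic_survivral = pd.read_csv("titanic_train.csv")
titanic_survivral.head(4)
age = titanic_survivral["Age"]  # select the column to check
age_is_null = pd.isnull(age)  # isnull returns True wherever a value is missing
print(age_is_null)
print(titanic_survivral["Age"].mean())  # mean() skips NaN values by default
passenger_survivor = titanic_survivral.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
# pivot_table is an important function: index is the column to group by, values is the column to aggregate within each group, and aggfunc is the aggregation applied to it (here the mean, i.e. the survival rate per class)
# Recent pandas versions may emit a FutureWarning for aggfunc=np.mean; the string "mean" is an equivalent spelling
print(passenger_survivor)
Output: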
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 True
889 False
890 False
Name: Age, Length: 891, dtype: bool
29.69911764705882
Survived
Pclass
1 0.629630
2 0.472826
3 0.242363
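To make the three steps above concrete (detecting missing values, averaging while skipping them, and aggregating per group), here is a sketch on six made-up passengers. The data values are assumptions; the `groupby` line is included only to show that, for a single index column and a single value column, `pivot_table` computes the same numbers as a group-wise mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 3, 3, 3],
        "Survived": [1, 0, 1, 0, 0, 1],
        "Age": [38.0, np.nan, 26.0, 22.0, np.nan, 35.0],
    }
)

age_is_null = df["Age"].isnull()     # boolean Series, True where Age is missing
n_missing = int(age_is_null.sum())   # True counts as 1 when summed

mean_age = df["Age"].mean()          # NaN values are skipped by default
manual_mean = df.loc[~age_is_null, "Age"].sum() / (~age_is_null).sum()

# Survival rate per passenger class.
rate = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
by_group = df.groupby("Pclass")["Survived"].mean()

print(n_missing, mean_age, manual_mean)
print(rate)
```

The boolean mask is also handy on its own: `df[age_is_null]` selects exactly the rows whose age is missing.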
new_titanic_survivor = titanic_survivral.dropna(axis=0, subset=["Age", "Sex"])  # drop every row that has NaN in any of the selected columns
print(new_titanic_survivor)
Output:
PassengerId Survived Pclass
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
885 886 0 3
886 887 0 2
887 888 1 1
889 890 1 1
890 891 0 3
Name Sex Age SibSp
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. ... ... ... ...
885 Rice, Mrs. William (Margaret Norton) female 39.0 0
886 Montvila, Rev. Juozas male 27.0 0
887 Graham, Miss. Margaret Edith female 19.0 0
889 Behr, Mr. Karl Howell male 26.0 0
890 Dooley, Mr. Patrick male 32.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. ... ... ... ... ...
885 5 382652 29.1250 NaN Q
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q
[714 rows x 12 columns]
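Note that the output above still contains NaN in the Cabin column, because `subset` restricts the check to Age and Sex. The sketch below, on three made-up rows, contrasts `dropna` with and without `subset`; the data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0],
        "Sex": ["male", "female", "female"],
        "Cabin": [np.nan, "C85", np.nan],
    }
)

# Only Age and Sex are checked, so a missing Cabin does not remove a row.
cleaned = df.dropna(axis=0, subset=["Age", "Sex"])

# Without subset, a NaN in any column removes the row.
all_clean = df.dropna()

print(cleaned)
print(len(all_clean))
```

As in the article's output, where index 888 is absent, the surviving rows keep their original index labels, so the index has gaps unless you call `reset_index(drop=True)` afterwards.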
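One more point worth noting: Pandas lets you write your own function with def and run it through .apply(functionName), which covers cases the library has no built-in method for.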
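The article does not show an `.apply` call, so the following is only a sketch of how it might look. The `age_group` function, its thresholds, and the sample data are all hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Age": [8.0, 35.0, 70.0], "Fare": [7.25, 71.28, 30.0]})


def age_group(age):
    """Hypothetical helper: map an age to a coarse label."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


# Series.apply calls the function once per element.
df["AgeGroup"] = df["Age"].apply(age_group)

# DataFrame.apply passes whole columns by default (axis=0) ...
col_max = df[["Age", "Fare"]].apply(lambda col: col.max())

# ... and whole rows when axis=1.
row_sum = df[["Age", "Fare"]].apply(lambda row: row["Age"] + row["Fare"], axis=1)

print(df)
print(col_max)
print(row_sum)
```

Because `.apply` runs Python code for each element, column, or row, built-in vectorized operations are usually faster when one exists for the task.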
CSV datasets used in this program (extraction code: twwi)