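I'm 眯眯眼口红, a blogger on 靠谱客. This article, collected during recent development work, introduces Pandas, one of the basic libraries for machine learning. I found it useful and am sharing it here for reference.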

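Overview

Pandas is the library used for data processing in machine learning. When faced with a large amount of data, the first step is usually to preprocess it with Pandas and extract the information we need. Let's look at the operations Pandas provides.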


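import pandas
food_info = pandas.read_csv("food_info.csv")  # read the data from the CSV file
print(type(food_info))  # the core pandas structure is the DataFrame
print(food_info.head(3))  # print the first 3 rows as a table; head() with no argument shows the first 5
print(food_info.tail(4))  # print the last 4 rows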

Output:

<class 'pandas.core.frame.DataFrame'>
   NDB_No                 Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  
0    1001          BUTTER WITH SALT      15.87         717         0.85   
1    1002  BUTTER WHIPPED WITH SALT      15.87         717         0.85   
2    1003      BUTTER OIL ANHYDROUS       0.24         876         0.28   

   Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  ...  
0          81.11     2.11            0.06           0.0           0.06  ...   
1          81.11     2.11            0.06           0.0           0.06  ...   
2          99.48     0.00            0.00           0.0           0.00  ...   

   Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  Vit_K_(mcg)  
0    2499.0      684.0        2.32        1.5      60.0          7.0   
1    2499.0      684.0        2.32        1.5      60.0          7.0   
2    3069.0      840.0        2.80        1.8      73.0          8.6   

   FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)  
0      51.368       21.021        3.043           215.0  
1      50.489       23.426        3.012           219.0  
2      61.924       28.732        3.694           256.0  

[3 rows x 36 columns]
      NDB_No                   Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  
8614   90240  SCALLOP (BAY&SEA) CKD STMD      70.25         111        20.54   
8615   90480                  SYRUP CANE      26.00         269         0.00   
8616   90560                   SNAIL RAW      79.20          90        16.10   
8617   93600            TURTLE GREEN RAW      78.50          89        19.80   

      Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  
8614           0.84     2.97            5.41           0.0            0.0   
8615           0.00     0.86           73.14           0.0           73.2   
8616           1.40     1.30            2.00           0.0            0.0   
8617           0.50     1.20            0.00           0.0            0.0   

      ...  Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  Vit_K_(mcg)  
8614  ...       5.0        2.0         0.0        0.0       2.0          0.0   
8615  ...       0.0        0.0         0.0        0.0       0.0          0.0   
8616  ...     100.0       30.0         5.0        0.0       0.0          0.1   
8617  ...     100.0       30.0         0.5        0.0       0.0          0.1   

      FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)  
8614       0.218        0.082        0.222            41.0  
8615       0.000        0.000        0.000             0.0  
8616       0.361        0.259        0.252            50.0  
8617       0.127        0.088        0.170            50.0  

[4 rows x 36 columns] 
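For readers without food_info.csv at hand, the same calls can be tried on a small in-memory table. This is only a sketch: `csv_text` is a hypothetical six-row stand-in whose values are copied from the output above, and `io.StringIO` is used so that `read_csv` has something file-like to read.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for food_info.csv (values taken from the output above).
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1001,BUTTER WITH SALT,15.87,717
1002,BUTTER WHIPPED WITH SALT,15.87,717
1003,BUTTER OIL ANHYDROUS,0.24,876
90480,SYRUP CANE,26.00,269
90560,SNAIL RAW,79.20,90
93600,TURTLE GREEN RAW,78.50,89
"""

# read_csv accepts a file-like object as well as a path.
sample = pd.read_csv(io.StringIO(csv_text))

first_default = sample.head()  # no argument: the first 5 rows
first_two = sample.head(2)     # explicit row count
last_two = sample.tail(2)      # the last 2 rows

print(type(sample))
print(first_two)
print(last_two)
```

The point of the sketch is the default: `head()` and `tail()` return 5 rows unless a count is passed.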

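print(food_info.columns)  # show the column names
print(food_info.shape)  # (rows, columns): 8618 samples, each with 36 features
print(food_info.loc[0, "NDB_No"])  # food_info[0] would look for a column named 0, so rows are selected by label with .loc; adding a column name pins down a single element
print(food_info["NDB_No"])  # select and print a whole column by name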

Output:

Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')
(8618, 36)
1001
0        1001
1        1002
2        1003
3        1004
4        1005
        ...  
8613    83110
8614    90240
8615    90480
8616    90560
8617    93600
Name: NDB_No, Length: 8618, dtype: int64
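The difference between selecting by label and selecting by position is easiest to see when the index is not 0, 1, 2, .... The following is a minimal sketch on a hypothetical three-row frame; the index labels 10, 20, 30 are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a deliberately non-default index.
df = pd.DataFrame(
    {"NDB_No": [1001, 1002, 1003], "Energ_Kcal": [717, 717, 876]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "NDB_No"]   # row with index label 20
by_position = df.iloc[0, 0]       # first row, first column, regardless of labels

one_col = df["NDB_No"]                   # a single column name gives a Series
two_cols = df[["NDB_No", "Energ_Kcal"]]  # a list of names gives a DataFrame

print(by_label, by_position)
print(type(one_col).__name__, type(two_cols).__name__)
```

With the default index used in the article, labels and positions coincide, which is why `loc[0, ...]` returns the first row there.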

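# Arithmetic in pandas (+, -, *, /) is applied element by element to the columns involved
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
iron_grams = water_energy / 1000  # note: despite the name, this is Water_(g) * Energ_Kcal / 1000, not an iron measurement
print(food_info.shape)
food_info["Iron_(g)"] = iron_grams  # assigning to a new column name creates that column
print(food_info.shape)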

Output:

(8618, 36)
(8618, 37)
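The element-wise behaviour can be checked by hand on a few rows. This sketch uses a hypothetical three-row frame and a neutral column name, `Water_x_Energy_per_1000`, chosen here only to describe what is computed.

```python
import pandas as pd

# Hypothetical three-row sample; values borrowed from the food table above.
df = pd.DataFrame(
    {"Water_(g)": [15.87, 0.24, 79.20], "Energ_Kcal": [717, 876, 90]}
)

product = df["Water_(g)"] * df["Energ_Kcal"]  # row-by-row multiplication of two columns
scaled = product / 1000                       # a scalar is applied to every element

shape_before = df.shape
df["Water_x_Energy_per_1000"] = scaled        # new column name, so a column is added
shape_after = df.shape

print(shape_before, shape_after)
print(df)
```

No explicit loop is needed: the operators align the two columns on the index and compute each row independently.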

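food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)  # sort by one column; ascending=False gives descending order, and NaN values are placed last by default
print(food_info["Sodium_(mg)"])
food_info_reindex = food_info.reset_index(drop=True)  # replace the shuffled index left by sorting with a fresh 0..n-1 index; drop=True discards the old one
print(food_info_reindex)
# help(food_info.sort_values)
# help(food_info.reset_index)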

Output:
 

276     38758.0
5814    27360.0
6192    26050.0
1242    26000.0
1245    24000.0
         ...
8184        NaN
8185        NaN
8195        NaN
8251        NaN
8267        NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
      NDB_No                                     Shrt_Desc  Water_(g)
0       2047                                    SALT TABLE       0.20
1      18372                  LEAVENING AGENTS BAKING SODA       0.20
2      19225               DESSERTS RENNIN TABLETS UNSWTND       6.50
3       6075             SOUP BF BROTH OR BOUILLON PDR DRY       3.27
4       6081                    SOUP CHICK BROTH CUBES DRY       2.50
...      ...                                           ...        ...
8613   35092          WILLOW LEAVES IN OIL (ALASKA NATIVE)      28.00
8614   35093     WILLOW YOUNG LEAVES CHOPD (ALASKA NATIVE)      68.70
8615   35139                SQUASH INDIAN CKD BLD (NAVAJO)      96.21
8616   35199  PRICKLY PEARS BRLD (NORTHERN PLAINS INDIANS)      75.83
8617   35231          SEA LION STELLER FAT (ALASKA NATIVE)       4.70

      Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)
0              0         0.00           0.00     99.8            0.00
1              0         0.00           0.00     36.9            0.00
2             84         1.00           0.10     72.5           19.80
3            213        15.97           8.89     54.5           17.40
4            198        14.60           4.70     54.7           23.50
...          ...          ...            ...      ...             ...
8613         592         2.60          61.00      0.3            8.10
8614         122         6.10           1.60      2.9           20.70
8615          16         0.31           0.15      0.1            3.22
8616          91         0.39           0.31      1.9           21.57
8617         850         0.90          94.00      0.2            0.00

      Fiber_TD_(g)  Sugar_Tot_(g)  ...  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg
0              0.0           0.00  ...        0.0        0.00        0.0
1              0.0           0.00  ...        0.0        0.00        0.0
2              0.0            NaN  ...        0.0         NaN        NaN
3              0.0          16.71  ...        0.0        2.17        0.0
4              0.0           0.00  ...        0.0        0.09        NaN
...            ...            ...  ...        ...         ...        ...
8613           NaN            NaN  ...        NaN         NaN        NaN
8614           NaN            NaN  ...        NaN         NaN        NaN
8615           1.5           2.02  ...        NaN         NaN        NaN
8616           NaN            NaN  ...        NaN         NaN        NaN
8617           NaN            NaN  ...       97.0         NaN        0.0

      Vit_D_IU  Vit_K_(mcg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)
0          0.0          0.0       0.000        0.000        0.000
1          0.0          0.0       0.000        0.000        0.000
2          NaN          NaN       0.041        0.038        0.007
3          0.0          3.2       4.320        3.616        0.332
4          NaN          0.0       1.200        1.920        1.620
...        ...          ...         ...          ...          ...
8613       NaN          NaN         NaN          NaN          NaN
8614       NaN          NaN         NaN          NaN          NaN
8615       NaN          NaN         NaN          NaN          NaN
8616       NaN          NaN         NaN          NaN          NaN
8617       0.0          NaN         NaN          NaN          NaN

      Cholestrl_(mg)  Iron_(g)
0                0.0   0.00000
1                0.0   0.00000
2                0.0   0.54600
3               10.0   0.69651
4               13.0   0.49500
...              ...       ...
8613             NaN  16.57600
8614             NaN   8.38140
8615             NaN   1.53936
8616             NaN   6.90053
8617            95.0   3.99500

[8618 rows x 37 columns]
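The way sorting carries the old index labels along, and how `reset_index(drop=True)` renumbers them, can be seen on four rows. This is a sketch on hypothetical data; unlike the code above it omits `inplace=True`, so `sort_values` returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with one missing sodium value.
df = pd.DataFrame(
    {
        "Shrt_Desc": ["BUTTER WITH SALT", "SNAIL RAW", "SALT TABLE", "SYRUP CANE"],
        "Sodium_(mg)": [643.0, np.nan, 38758.0, 58.0],
    }
)

# Descending sort; NaN goes last by default (na_position="last").
sorted_df = df.sort_values("Sodium_(mg)", ascending=False)
labels_after_sort = list(sorted_df.index)  # the old labels travel with their rows

# drop=True throws the old labels away instead of keeping them as a column.
reindexed = sorted_df.reset_index(drop=True)

print(sorted_df)
print(reindexed)
```

Passing `na_position="first"` to `sort_values` would put the missing values at the top instead.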

 

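import pandas as pd
titanic_survivral = pd.read_csv("titanic_train.csv")
titanic_survivral.head(4)  # displayed only in a notebook; wrap in print() when running as a script
age = titanic_survivral["Age"]  # select the column to check
age_is_null = pd.isnull(age)  # isnull returns True wherever the value is missing (NaN)
print(age_is_null)
print(titanic_survivral["Age"].mean())  # mean() skips NaN values by default
passenger_survivor = titanic_survivral.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
# pivot_table is an important function: index is the column to group by, values is the column to aggregate
# within each group, and aggfunc is the aggregation to apply (the string "mean" is used here in place of
# np.mean, which recent pandas versions warn about)
print(passenger_survivor)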

Output:

0      False
1      False
2      False
3      False
4      False
       ...
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
29.69911764705882
        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363
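The three steps above (flagging missing values, averaging while skipping them, and grouping with `pivot_table`) can be verified by hand on a tiny table. This is a sketch on six hypothetical passengers, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Six hypothetical passengers; two ages are missing.
df = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 3, 3, 3],
        "Survived": [1, 0, 1, 0, 0, 1],
        "Age": [38.0, np.nan, 26.0, 22.0, np.nan, 35.0],
    }
)

age_is_null = pd.isnull(df["Age"])   # boolean Series, True where Age is NaN
n_missing = int(age_is_null.sum())   # True counts as 1

mean_age = df["Age"].mean()          # NaN is skipped: (38 + 26 + 22 + 35) / 4

# The same mean computed by hand with a boolean mask.
known_ages = df["Age"][~age_is_null]
manual_mean = known_ages.sum() / len(known_ages)

# Survival rate per class: group by Pclass, average Survived within each group.
rate = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

print(n_missing, mean_age, manual_mean)
print(rate)
```

Because `Survived` is 0 or 1, its mean within a class is the fraction of survivors in that class, which is how the per-class rates in the output above should be read.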

new_titanic_survivor = titanic_survivral.dropna(axis=0, subset=["Age", "Sex"])  # drop every row that has NaN in any of the listed columns
print(new_titanic_survivor)

Output:

     PassengerId  Survived  Pclass
0              1         0       3
1              2         1       1
2              3         1       3
3              4         1       1
4              5         0       3
..           ...       ...     ...
885          886         0       3
886          887         0       2
887          888         1       1
889          890         1       1
890          891         0       3

                                                  Name     Sex   Age  SibSp
0                              Braund, Mr. Owen Harris    male  22.0      1
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                               Heikkinen, Miss. Laina  female  26.0      0
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                             Allen, Mr. William Henry    male  35.0      0
..                                                 ...     ...   ...    ...
885               Rice, Mrs. William (Margaret Norton)  female  39.0      0
886                              Montvila, Rev. Juozas    male  27.0      0
887                       Graham, Miss. Margaret Edith  female  19.0      0
889                              Behr, Mr. Karl Howell    male  26.0      0
890                                Dooley, Mr. Patrick    male  32.0      0

     Parch            Ticket     Fare Cabin Embarked
0        0         A/5 21171   7.2500   NaN        S
1        0          PC 17599  71.2833   C85        C
2        0  STON/O2. 3101282   7.9250   NaN        S
3        0            113803  53.1000  C123        S
4        0            373450   8.0500   NaN        S
..     ...               ...      ...   ...      ...
885      5            382652  29.1250   NaN        Q
886      0            211536  13.0000   NaN        S
887      0            112053  30.0000   B42        S
889      0            111369  30.0000  C148        C
890      0            370376   7.7500   NaN        Q

[714 rows x 12 columns]
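Note that the output above still contains NaN in the Cabin column: `subset` limits the check to the listed columns. A small sketch on four hypothetical rows makes the difference from a plain `dropna()` visible.

```python
import numpy as np
import pandas as pd

# Four hypothetical passengers with gaps in different columns.
df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Sex": ["male", "female", None, "female"],
        "Cabin": [np.nan, "C85", np.nan, "C123"],
    }
)

# Only Age and Sex are checked; a missing Cabin does not remove the row.
cleaned = df.dropna(axis=0, subset=["Age", "Sex"])

# Without subset, a missing value in any column removes the row.
strict = df.dropna()

print(cleaned)
print(strict)
```

The surviving rows keep their original index labels, which matches the gap at label 888 in the output above; `reset_index(drop=True)` can renumber them if needed.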

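One more thing to note: Pandas lets you write your own functions with def and run them through .apply(functionName), which covers cases the library does not provide for directly.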
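As a minimal sketch of `.apply`, the example below uses two hypothetical helper functions, `age_label` and `count_missing`, on a made-up three-row frame. `Series.apply` calls the function once per element, while `DataFrame.apply` (with the default `axis=0`) calls it once per column.

```python
import pandas as pd

# Hypothetical three-row sample with one missing age.
df = pd.DataFrame({"Pclass": [1, 2, 3], "Age": [38.0, 15.0, float("nan")]})


def age_label(age):
    """Hypothetical helper: map one age value to a text label."""
    if pd.isnull(age):
        return "unknown"
    return "minor" if age < 18 else "adult"


def count_missing(column):
    """Hypothetical helper: count the NaN entries in one column."""
    return int(pd.isnull(column).sum())


# Series.apply: the function receives each element in turn.
df["AgeLabel"] = df["Age"].apply(age_label)

# DataFrame.apply with the default axis=0: the function receives each column as a Series.
missing_per_column = df.apply(count_missing)

print(df)
print(missing_per_column)
```

Passing `axis=1` to `DataFrame.apply` would hand the function one row at a time instead, which is useful when the result depends on several columns.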

CSV datasets used in this program (extraction code: twwi)
