Overview
Version information
Python 3.6.2
pandas 0.23.4
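Data files used in this article: https://download.csdn.net/download/plsong_csdn/10764054
Basic operations
Reading a file
import pandas as pd
food_info = pd.read_csv('food_info.csv')  # read the data
print(type(food_info))  # the result is a DataFrame
print(food_info.dtypes)  # data type of each column
food_info.head()  # show the first few rows (5 by default)
food_info.tail()  # show the last few rows (5 by default)
food_info.columns  # the column names
food_info.shape  # (number of rows, number of columns)
Output: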
<class 'pandas.core.frame.DataFrame'>
NDB_No int64
Shrt_Desc object
Water_(g) float64
Energ_Kcal int64
......
Cholestrl_(mg) float64
dtype: object
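Notes:
(1) DataFrame is the type pandas uses to store the data it reads from a file. It can be thought of roughly as a matrix with labelled rows and columns.
(2) The object dtype is what pandas reports for columns that hold strings, such as Shrt_Desc here.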
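The following is a minimal sketch of the same inspection steps on a tiny in-memory CSV, so it runs without food_info.csv. The two rows are illustrative stand-ins (the second description, SAMPLE FOOD, is made up), and io.StringIO is only used here to imitate a file. Note that pandas 0.23.4 reports text columns as object, while newer releases may report a dedicated string dtype, so the sketch checks the cell type instead.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt in the style of food_info.csv (illustrative values).
csv_text = """NDB_No,Shrt_Desc,Water_(g)
1001,BUTTER WITH SALT,15.87
1002,SAMPLE FOOD,15.87
"""

sample = pd.read_csv(io.StringIO(csv_text))  # read_csv accepts a file-like object

print(sample.shape)                        # (rows, columns)
print(sample.columns.tolist())             # column names as a plain list
print(sample.dtypes)                       # one dtype per column
print(type(sample["Shrt_Desc"].iloc[0]))   # each cell of the text column is a Python str
```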
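Indexing in pandas
Reading by row and by column
import pandas as pd
food_info = pd.read_csv('food_info.csv')  # read the data
print(food_info.loc[0])  # the first row (label 0)
print(food_info.loc[3:6])  # rows labelled 3 through 6 (4th to 7th rows); .loc includes the end label
ndb_col = food_info['NDB_No']  # the 'NDB_No' column
print(ndb_col)
columns = ['NDB_No', 'Water_(g)']  # select the 'NDB_No' and 'Water_(g)' columns together
print(food_info[columns])
Output: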
NDB_No 1001
Shrt_Desc BUTTER WITH SALT
......
FA_Poly_(g) 3.043
Cholestrl_(mg) 215
Name: 0, dtype: object
NDB_No Shrt_Desc ... FA_Poly_(g) Cholestrl_(mg)
3 1004 CHEESE BLUE ... 0.800 75.0
4 1005 CHEESE BRICK ... 0.784 94.0
5 1006 CHEESE BRIE ... 0.826 100.0
6 1007 CHEESE CAMEMBERT ... 0.724 72.0
[4 rows x 36 columns]
0 1001
1 1002
2 1003
...
8617 93600
Name: NDB_No, Length: 8618, dtype: int64
NDB_No Water_(g)
0 1001 15.87
1 1002 15.87
......
8617 93600 78.50
[8618 rows x 2 columns]
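The slice food_info.loc[3:6] returns four rows because .loc works on labels and includes the end label. As a small sketch on a made-up eight-row frame (the NDB_No values are only placeholders), the position-based .iloc behaves like ordinary Python slicing and excludes the end:

```python
import pandas as pd

# Hypothetical frame with the default integer labels 0..7.
df = pd.DataFrame({"NDB_No": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008]})

by_label = df.loc[3:6]      # labels 3, 4, 5, 6 (end label included)
by_position = df.iloc[3:6]  # positions 3, 4, 5 (end position excluded)

print(len(by_label), len(by_position))
```

With the default integer index the two look similar, but after sorting or filtering the labels no longer match the positions, so the choice between them matters.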
Selecting specific information from the file
import pandas as pd
food_info = pd.read_csv('food_info.csv')  # read the data
col_name = food_info.columns.tolist()  # the column names as a list
print(col_name)
print('---------------------')
gram_columns = []
for c in col_name:
    if c.endswith('(g)'):  # keep only the columns measured in grams
        gram_columns.append(c)
print(gram_columns)
print('---------------------')
gram_df = food_info[gram_columns]  # still a DataFrame
print(gram_df.head(3))
Output:
['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)', 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)']
-----------------------
['Water_(g)', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)']
-----------------------
Water_(g) Protein_(g) ... FA_Mono_(g) FA_Poly_(g)
0 15.87 0.85 ... 21.021 3.043
1 15.87 0.85 ... 23.426 3.012
2 0.24 0.28 ... 28.732 3.694
[3 rows x 10 columns]
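The loop that collects the gram columns can also be written as a list comprehension. This is a sketch on a one-row frame whose values are illustrative, not a replacement for the version above:

```python
import pandas as pd

# Hypothetical one-row frame with a mix of units in the column names.
food = pd.DataFrame({
    "NDB_No": [1001],
    "Water_(g)": [15.87],
    "Iron_(mg)": [0.02],
    "Protein_(g)": [0.85],
})

# Keep only the columns whose name ends with '(g)'.
gram_columns = [c for c in food.columns if c.endswith("(g)")]
gram_df = food[gram_columns]  # selecting with a list returns a DataFrame

print(gram_columns)
print(gram_df.shape)
```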
Adding a new column
import pandas as pd
food_info = pd.read_csv('food_info.csv')  # read the data
iron_grams = food_info['Iron_(mg)'] / 1000  # convert mg to g
print(food_info.shape)
food_info['Iron_(g)'] = iron_grams  # add the 'Iron_(g)' column to the food_info DataFrame
print(food_info.shape)
food_info.head(4)
Output:
(8618, 36)
(8618, 37)
Notes:
food_info['Iron_(g)'] = iron_grams
This single statement does two things:
(1) It creates a new column named 'Iron_(g)' in food_info.
(2) It fills that column with the values of iron_grams.
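As a small sketch of the same idea, the code below adds a converted column to a made-up three-row frame. It also shows DataFrame.assign, which returns a new frame and leaves the original untouched; the frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical iron contents in milligrams.
food = pd.DataFrame({"Iron_(mg)": [20.0, 500.0, 1000.0]})

# assign() returns a copy with the extra column; food itself is unchanged.
converted = food.assign(**{"Iron_(g)": food["Iron_(mg)"] / 1000})
shape_before = food.shape

# Item assignment adds the column to food in place.
food["Iron_(g)"] = food["Iron_(mg)"] / 1000
shape_after = food.shape

print(shape_before, shape_after)
```

The keyword-dictionary form of assign is needed here only because the column name contains parentheses and cannot be written as a plain keyword argument.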
Sorting by a column
import pandas as pd
food_info = pd.read_csv('food_info.csv')  # read the data
food_info.sort_values('Water_(g)', inplace=True)  # ascending order by default
print(food_info['Water_(g)'])
print('--------------------------------------')
food_info.sort_values('Water_(g)', inplace=True, ascending=False)  # descending order
print(food_info['Water_(g)'])
Output:
676 0.00
664 0.00
...
4404 99.98
4372 99.98
4378 100.00
4377 100.00
4348 100.00
4376 100.00
4209 100.00
1983 NaN
6067 NaN
6095 NaN
6113 NaN
6150 NaN
7776 NaN
Name: Water_(g), Length: 8618, dtype: float64
--------------------------------------
4209 100.00
4376 100.00
4348 100.00
4377 100.00
...
743 0.00
676 0.00
1983 NaN
6067 NaN
6095 NaN
6113 NaN
6150 NaN
7776 NaN
Name: Water_(g), Length: 8618, dtype: float64
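Note: NaN means the value in that column is missing for the row. sort_values places NaN rows last in both ascending and descending order.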
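The placement of missing values can be checked on a small example. The sketch below uses a made-up four-row frame and the na_position parameter of sort_values, which defaults to 'last':

```python
import numpy as np
import pandas as pd

# Hypothetical water contents with one missing value (row label 1).
water = pd.DataFrame({"Water_(g)": [15.87, np.nan, 0.0, 100.0]})

ascending = water.sort_values("Water_(g)")                       # NaN row goes last
descending = water.sort_values("Water_(g)", ascending=False)     # NaN row still goes last
nan_first = water.sort_values("Water_(g)", na_position="first")  # NaN row goes first

# The original row labels travel with the rows.
print(ascending.index.tolist())
print(descending.index.tolist())
print(nan_first.index.tolist())
```

Without inplace=True, sort_values returns a new sorted frame and leaves the original order intact.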
Computing the average age of the Titanic passengers
Method 1
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv('titanic_train.csv')
age = titanic_survival['Age']  # the age column
print('age[0:10]:\n', age.loc[0:10])  # rows labelled 0 through 10
age_is_null = pd.isnull(age)  # True where the age is missing
good_ages = titanic_survival["Age"][~age_is_null]  # keep only the known ages
age_mean = sum(good_ages) / len(good_ages)
print("age_mean:", age_mean)
Output:
age[0:10]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
10 4.0
Name: Age, dtype: float64
age_mean: 29.69911764705882
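Note: missing values distort the result (a single NaN makes the whole sum NaN), so they must be handled before averaging. Method 1 drops the passengers whose age is missing.
Method 2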
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv('titanic_train.csv')
age_mean = titanic_survival['Age'].mean()
print('age_mean:',age_mean)
Output:
age_mean: 29.69911764705882
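Notes:
(1) mean() skips missing values by default, so it gives the same result as Method 1.
(2) To remove the rows with missing values explicitly, use dropna().
Relating two columns with pivot_table
Example: average fare for each passenger class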
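The two methods can be compared on a short made-up Series. The sketch below also shows mean(skipna=False), which keeps the missing value in the calculation, and dropna(); the ages are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([22.0, 38.0, np.nan, 35.0])

known = ages[ages.notnull()]            # Method 1: filter out the missing values
manual_mean = known.sum() / len(known)

auto_mean = ages.mean()                 # Method 2: NaN is skipped by default
strict_mean = ages.mean(skipna=False)   # NaN is kept, so the result is NaN
dropped = ages.dropna()                 # a new Series without the missing entry

print(manual_mean, auto_mean, strict_mean, len(dropped))
```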
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv('titanic_train.csv')
fares_by_class = titanic_survival.pivot_table(index='Pclass', values="Fare", aggfunc=np.mean)
print(fares_by_class)
Output:
Fare
Pclass
1 84.154687
2 20.662183
3 13.675550
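A pivot table with one index column and one value column computes the same numbers as a groupby. The sketch below checks this on a made-up five-row frame. It passes the aggregation as the string "mean" rather than np.mean; both forms are accepted, and the string form is the assumption made here because some newer pandas releases warn about the NumPy function.

```python
import pandas as pd

# Hypothetical passengers: two in class 1, one in class 2, two in class 3.
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 90.0, 20.0, 10.0, 14.0],
})

# pivot_table returns a DataFrame indexed by Pclass with a Fare column.
by_pivot = titanic.pivot_table(index="Pclass", values="Fare", aggfunc="mean")

# groupby returns a Series indexed by Pclass with the same values.
by_group = titanic.groupby("Pclass")["Fare"].mean()

print(by_pivot)
print(by_group)
```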
Example: average passenger age for each passenger class
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv('titanic_train.csv')
age_by_class = titanic_survival.pivot_table(index='Pclass', values="Age", aggfunc=np.mean)
print(age_by_class)
Output:
Age
Pclass
1 38.233441
2 29.877630
3 25.140620
Example: relating one column to two others. Compute the sum of ages and the number of survivors for each port of embarkation.
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv('titanic_train.csv')
port_stats = titanic_survival.pivot_table(index='Embarked', values=["Age", "Survived"], aggfunc=np.sum)
print(port_stats)
Output:
Age Survived
Embarked
C 4005.92 93
Q 786.50 30
S 16312.75 217
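A single aggfunc is applied to every value column, which is why the ages above are summed. When each column needs its own aggregation, pivot_table also accepts a dictionary that maps column names to functions. The following is a sketch on a made-up five-row frame, taking the mean age and the survivor count:

```python
import pandas as pd

# Hypothetical passengers from three ports of embarkation.
titanic = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S"],
    "Age": [30.0, 40.0, 20.0, 25.0, 35.0],
    "Survived": [1, 0, 1, 1, 1],
})

# One aggregation per value column: average the ages, count the survivors.
summary = titanic.pivot_table(
    index="Embarked",
    values=["Age", "Survived"],
    aggfunc={"Age": "mean", "Survived": "sum"},
)

print(summary)
```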
The Series structure in pandas
A first look at Series
import pandas as pd
import numpy as np
fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']
print(type(series_film))
print(series_film[0:5])
film_name = series_film.values
print(type(film_name))
Output:
<class 'pandas.core.series.Series'>
0 Avengers: Age of Ultron (2015)
1 Cinderella (2015)
2 Ant-Man (2015)
3 Do You Believe? (2015)
4 Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
<class 'numpy.ndarray'>
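Notes:
(1) A Series can be seen as a building block of a DataFrame: each column of a DataFrame is a Series.
(2) The data in a Series is in turn held in a NumPy ndarray, which is what .values returns.
Series indexing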
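This layering can be checked directly with isinstance. The sketch below uses a made-up two-row frame with placeholder film titles and scores, and inspects the numeric column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame in the style of fandango_score_comparison.csv.
films = pd.DataFrame({
    "FILM": ["Film A (2015)", "Film B (2015)"],
    "RottenTomatoes": [74, 85],
})

column = films["RottenTomatoes"]  # one column of a DataFrame is a Series
raw = column.values               # the underlying data of a numeric Series

print(type(films), type(column), type(raw))
```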
import pandas as pd
from pandas import Series
fandango = pd.read_csv('fandango_score_comparison.csv')  # read the file into a DataFrame
series_film = fandango['FILM']  # Series
series_rt = fandango['RottenTomatoes']  # Series
film_name = series_film.values  # ndarray
rt_scores = series_rt.values  # ndarray
series_custom = Series(rt_scores, index=film_name)
print(series_custom[['Minions (2015)', 'Leviathan (2014)']])
Output:
Minions (2015) 54
Leviathan (2014) 99
dtype: int64
Notes:
series_custom = Series(rt_scores, index=film_name)
A Series can use names (here, the film titles) as its index, so values can be looked up by label.
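As a small sketch of label-based lookup, the code below builds a Series with made-up titles and scores as placeholders. It selects entries by a list of labels, by a single label, and by position with .iloc:

```python
import pandas as pd

# Hypothetical scores indexed by film title.
scores = pd.Series(
    [74, 85, 80],
    index=["Film A (2015)", "Film B (2015)", "Film C (2015)"],
)

picked = scores[["Film C (2015)", "Film A (2015)"]]  # list of labels, in the order given
single = scores["Film B (2015)"]                     # one label returns a scalar
first_two = scores.iloc[0:2]                         # positions still work through .iloc

print(picked)
print(single)
print(first_two)
```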
★finished 2018.11.4night
★by songpl