data = pd.read_table(path6, sep=r"\s+", parse_dates=[[0, 1, 2]])
data.head()
     Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL
0  2061-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04
1  2061-01-02  14.71    NaN  10.83   6.50  12.62   7.67  11.50  10.04   9.79   9.67  17.54  13.83
2  2061-01-03  18.50  16.88  12.33  10.13  11.17   6.17  11.25    NaN   8.50   7.67  12.75  12.71
3  2061-01-04  10.58   6.63  11.75   4.58   4.54   2.88   8.63   1.79   5.83   5.88   5.46  10.88
4  2061-01-05  13.33  13.25  11.42   6.17  10.71   8.21  11.92   6.54  10.92  10.34  12.92  11.83
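The `sep` argument is interpreted as a regular expression, and `\s+` matches any run of whitespace, which is why the backslash matters. A minimal sketch on an in-memory sample; the raw layout of the file (a header row followed by two-digit years) is an assumption made for illustration, not taken from the original data file:

```python
import io

import pandas as pd

# Hypothetical excerpt: columns separated by an irregular number of spaces
raw = io.StringIO(
    "Yr Mo Dy   RPT   VAL\n"
    "61  1  1 15.04 14.96\n"
    "61  1  2 14.71   NaN\n"
)

# r"\s+" is a regex that matches one or more whitespace characters
sample = pd.read_csv(raw, sep=r"\s+")
print(sample)
```

Writing the pattern as a raw string (`r"\s+"`) keeps Python from treating `\s` as an escape sequence.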
Step 4. The year 2061? Do we really have data from that year? Write a function and use it to fix this bug.
import datetime

def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply fix_century to the column and replace the values with the corrected ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
# data.info()
data.head()
     Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL
0  1961-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04
1  1961-01-02  14.71    NaN  10.83   6.50  12.62   7.67  11.50  10.04   9.79   9.67  17.54  13.83
2  1961-01-03  18.50  16.88  12.33  10.13  11.17   6.17  11.25    NaN   8.50   7.67  12.75  12.71
3  1961-01-04  10.58   6.63  11.75   4.58   4.54   2.88   8.63   1.79   5.83   5.88   5.46  10.88
4  1961-01-05  13.33  13.25  11.42   6.17  10.71   8.21  11.92   6.54  10.92  10.34  12.92  11.83
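Why 2061 appears at all: under the POSIX convention that Python's `%y` directive follows, two-digit years 69 to 99 map to 1969 to 1999, while 00 to 68 map to 2000 to 2068. The sketch below reproduces the effect with the standard library only; that the source file stores two-digit years is an assumption inferred from the output, and `fix_century` is restated here so the block runs on its own:

```python
import datetime


def fix_century(x):
    # Years that landed after 1989 cannot belong to this dataset: shift them back 100 years
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)


# "61" is read as 2061, "78" as 1978
parsed_61 = datetime.datetime.strptime("61 1 1", "%y %m %d")
parsed_78 = datetime.datetime.strptime("78 12 31", "%y %m %d")

print(parsed_61.date(), "->", fix_century(parsed_61))
print(parsed_78.date(), "->", fix_century(parsed_78))
```

Only dates that were pushed into the 2000s are shifted; dates already in the 1900s pass through unchanged.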
Step 5. Set the date as the index. Mind the data type: it should be datetime64[ns].
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])
# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')
data.head()
# data.info()
              RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04
1961-01-02  14.71    NaN  10.83   6.50  12.62   7.67  11.50  10.04   9.79   9.67  17.54  13.83
1961-01-03  18.50  16.88  12.33  10.13  11.17   6.17  11.25    NaN   8.50   7.67  12.75  12.71
1961-01-04  10.58   6.63  11.75   4.58   4.54   2.88   8.63   1.79   5.83   5.88   5.46  10.88
1961-01-05  13.33  13.25  11.42   6.17  10.71   8.21  11.92   6.54  10.92  10.34  12.92  11.83
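The conversion matters because `fix_century` returns plain `datetime.date` objects, which pandas stores with the generic `object` dtype. A small sketch of the before and after on a two-row stand-in frame (the name `frame` and its contents are illustrative only):

```python
import datetime

import pandas as pd

frame = pd.DataFrame({
    "Yr_Mo_Dy": [datetime.date(1961, 1, 1), datetime.date(1961, 1, 2)],
    "RPT": [15.04, 14.71],
})
print(frame["Yr_Mo_Dy"].dtype)  # object: plain datetime.date values

# Convert to pandas datetimes, then promote the column to the index
frame["Yr_Mo_Dy"] = pd.to_datetime(frame["Yr_Mo_Dy"])
frame = frame.set_index("Yr_Mo_Dy")
print(type(frame.index).__name__, frame.index.dtype)
```

With a `DatetimeIndex` in place, attributes such as `frame.index.month` and methods such as `asfreq` become available.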
Step 6. How many values are missing for each location?
data.isnull().sum()
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
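How the count works: `isnull()` returns a Boolean frame of the same shape, and summing Booleans counts the `True` entries column by column. A tiny sketch, using three rows and three columns copied from the head of the data above:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "RPT": [15.04, 14.71, 18.50],
    "VAL": [14.96, np.nan, 16.88],
    "SHA": [np.nan, 12.62, 11.17],
})

mask = sample.isnull()   # same shape, True where a value is missing
missing = mask.sum()     # True counts as 1, so each column sum is its number of NaNs
print(missing)
```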
Step 7. How many non-missing values are there for each location?
data.shape[0]- data.isnull().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
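The subtraction above is the total row count minus the missing count (for RPT, 6574 - 6 = 6568). As a sketch, pandas offers two equivalent spellings, `notnull().sum()` and `count()`; the three-row frame below is illustrative only:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "RPT": [15.04, 14.71, 18.50],
    "VAL": [14.96, np.nan, 16.88],
})

complete_a = sample.shape[0] - sample.isnull().sum()  # rows minus missing values
complete_b = sample.notnull().sum()                   # count the non-missing values directly
complete_c = sample.count()                           # count() skips NaN by definition
print(complete_c)
```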
loc_stats = pd.DataFrame()
loc_stats['min'] = data.min()    # min
loc_stats['max'] = data.max()    # max
loc_stats['mean'] = data.mean()  # mean
loc_stats['std'] = data.std()    # standard deviation
loc_stats

day_stats = pd.DataFrame()
# this time axis=1, so each statistic is computed across a row (one day, all locations)
day_stats['min'] = data.min(axis=1)    # min
day_stats['max'] = data.max(axis=1)    # max
day_stats['mean'] = data.mean(axis=1)  # mean
day_stats['std'] = data.std(axis=1)    # standard deviation
day_stats.head()
              min    max       mean       std
Yr_Mo_Dy
1961-01-01   9.29  18.50  13.018182  2.808875
1961-01-02   6.50  17.54  11.336364  3.188994
1961-01-03   6.17  18.50  11.641818  3.681912
1961-01-04   1.79  11.75   6.619167  3.198126
1961-01-05   6.17  13.33  10.630000  2.445356
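The per-day figures can be checked by hand. The sketch below rebuilds the 1961-01-01 row from the head of the data and applies the same `axis=1` reductions; the missing SHA value is skipped by default, so the mean is taken over 11 values rather than 12:

```python
import numpy as np
import pandas as pd

columns = ["RPT", "VAL", "ROS", "KIL", "SHA", "BIR",
           "DUB", "CLA", "MUL", "CLO", "BEL", "MAL"]

# First row of the data as shown earlier (1961-01-01); SHA is missing
row = pd.DataFrame(
    [[15.04, 14.96, 13.17, 9.29, np.nan, 9.87,
      13.67, 10.25, 10.83, 12.58, 18.50, 15.04]],
    columns=columns,
    index=pd.to_datetime(["1961-01-01"]),
)

day_min = row.min(axis=1)    # smallest reading across the locations
day_max = row.max(axis=1)    # largest reading across the locations
day_mean = row.mean(axis=1)  # NaN is skipped, so this averages 11 values
day_std = row.std(axis=1)    # sample standard deviation (ddof=1)
print(day_min.iloc[0], day_max.iloc[0], day_mean.iloc[0], day_std.iloc[0])
```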
Step 11. For each location, compute the average wind speed in January.
Note: the code below pools every January together, so January 1961 and January 1962 both count as January.
data['date'] = data.index
# create a column for each component of the date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
# select all rows from month 1 and assign them to january_winds
january_winds = data.query('month == 1')
# take the mean of january_winds, using .loc so the means of month, year and day are not printed
january_winds.loc[:, 'RPT':"MAL"].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
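As an alternative sketch (not the method used above), the helper columns can be skipped, because a `DatetimeIndex` exposes `.month` directly. The three RPT readings are taken from the outputs in this post, and the variable names are illustrative:

```python
import pandas as pd

idx = pd.to_datetime(["1961-01-01", "1961-02-01", "1962-01-01"])
winds = pd.DataFrame({"RPT": [15.04, 14.25, 9.29]}, index=idx)

# Boolean mask built straight from the index; Januaries of all years are pooled
january = winds[winds.index.month == 1]
january_mean = january.mean()
print(january_mean)
```

This leaves the frame free of the extra `date`, `month`, `year` and `day` columns, so no `.loc` slice is needed before taking the mean.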
Step 12. Downsample the records to a yearly frequency.
data.query('month == 1 and day == 1')
              RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL        date  month  year  day
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04  1961-01-01      1  1961    1
1962-01-01   9.29   3.42  11.54   3.50   2.21   1.96  10.41   2.79   3.54   5.17   4.38   7.92  1962-01-01      1  1962    1
1963-01-01  15.59  13.62  19.79   8.38  12.25  10.00  23.45  15.71  13.59  14.37  17.58  34.13  1963-01-01      1  1963    1
1964-01-01  25.80  22.13  18.21  13.25  21.29  14.79  14.12  19.58  13.25  16.75  28.96  21.00  1964-01-01      1  1964    1
1965-01-01   9.54  11.92   9.00   4.38   6.08   5.21  10.25   6.08   5.71   8.63  12.04  17.41  1965-01-01      1  1965    1
1966-01-01  22.04  21.50  17.08  12.75  22.17  15.59  21.79  18.12  16.66  17.83  28.33  23.79  1966-01-01      1  1966    1
1967-01-01   6.46   4.46   6.50   3.21   6.67   3.79  11.38   3.83   7.71   9.08  10.67  20.91  1967-01-01      1  1967    1
1968-01-01  30.04  17.88  16.25  16.25  21.79  12.54  18.16  16.62  18.75  17.62  22.25  27.29  1968-01-01      1  1968    1
1969-01-01   6.13   1.63   5.41   1.08   2.54   1.00   8.50   2.42   4.58   6.34   9.17  16.71  1969-01-01      1  1969    1
1970-01-01   9.59   2.96  11.79   3.42   6.13   4.08   9.00   4.46   7.29   3.50   7.33  13.00  1970-01-01      1  1970    1
1971-01-01   3.71   0.79   4.71   0.17   1.42   1.04   4.63   0.75   1.54   1.08   4.21   9.54  1971-01-01      1  1971    1
1972-01-01   9.29   3.63  14.54   4.25   6.75   4.42  13.00   5.33  10.04   8.54   8.71  19.17  1972-01-01      1  1972    1
1973-01-01  16.50  15.92  14.62   7.41   8.29  11.21  13.54   7.79  10.46  10.79  13.37   9.71  1973-01-01      1  1973    1
1974-01-01  23.21  16.54  16.08   9.75  15.83  11.46   9.54  13.54  13.83  16.66  17.21  25.29  1974-01-01      1  1974    1
1975-01-01  14.04  13.54  11.29   5.46  12.58   5.58   8.12   8.96   9.29   5.17   7.71  11.63  1975-01-01      1  1975    1
1976-01-01  18.34  17.67  14.83   8.00  16.62  10.13  13.17   9.04  13.13   5.75  11.38  14.96  1976-01-01      1  1976    1
1977-01-01  20.04  11.92  20.25   9.13   9.29   8.04  10.75   5.88   9.00   9.00  14.88  25.70  1977-01-01      1  1977    1
1978-01-01   8.33   7.12   7.71   3.54   8.50   7.50  14.71  10.00  11.83  10.00  15.09  20.46  1978-01-01      1  1978    1
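The query above keeps the row for 1 January of each year. On a regular daily `DatetimeIndex`, `asfreq("YS")` (year start) selects the same rows without helper columns. This is a sketch on synthetic data, not the approach used in the post; `daily` and its `value` column are made up for the demonstration:

```python
import numpy as np
import pandas as pd

# Synthetic daily series covering three full years
days = pd.date_range("1961-01-01", "1963-12-31", freq="D")
daily = pd.DataFrame({"value": np.arange(len(days))}, index=days)

# Selection by index attributes, mirroring query('month == 1 and day == 1')
by_mask = daily[(daily.index.month == 1) & (daily.index.day == 1)]

# Reindex onto a year-start frequency; "YS" means the first day of each year
by_asfreq = daily.asfreq("YS")
print(by_asfreq)
```

Both approaches pick existing rows rather than aggregating, so the values are those recorded on the first day of the year, not yearly averages.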
Step 13. Downsample the records to a monthly frequency.
data.query('day == 1')
              RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL        date  month  year  day
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04  1961-01-01      1  1961    1
1961-02-01  14.25  15.12   9.04   5.88  12.08   7.17  10.17   3.63   6.50   5.50   9.17   8.00  1961-02-01      2  1961    1
1961-03-01  12.67  13.13  11.79   6.42   9.79   8.54  10.25  13.29    NaN  12.21  20.62    NaN  1961-03-01      3  1961    1
1961-04-01   8.38   6.34   8.33   6.75   9.33   9.54  11.67   8.21  11.21   6.46  11.96   7.17  1961-04-01      4  1961    1
1961-05-01  15.87  13.88  15.37   9.79  13.46  10.17   9.96  14.04   9.75   9.92  18.63  11.12  1961-05-01      5  1961    1
1961-06-01  15.92   9.59  12.04   8.79  11.54   6.04   9.75   8.29   9.33  10.34  10.67  12.12  1961-06-01      6  1961    1
1961-07-01   7.21   6.83   7.71   4.42   8.46   4.79   6.71   6.00   5.79   7.96   6.96   8.71  1961-07-01      7  1961    1
1961-08-01   9.59   5.09   5.54   4.63   8.29   5.25   4.21   5.25   5.37   5.41   8.38   9.08  1961-08-01      8  1961    1
1961-09-01   5.58   1.13   4.96   3.04   4.25   2.25   4.63   2.71   3.67   6.00   4.79   5.41  1961-09-01      9  1961    1
1961-10-01  14.25  12.87   7.87   8.00  13.00   7.75   5.83   9.00   7.08   5.29  11.79   4.04  1961-10-01     10  1961    1
1961-11-01  13.21  13.13  14.33   8.54  12.17  10.21  13.08  12.17  10.92  13.54  20.17  20.04  1961-11-01     11  1961    1
1961-12-01   9.67   7.75   8.00   3.96   6.00   2.75   7.25   2.50   5.58   5.58   7.79  11.17  1961-12-01     12  1961    1
1962-01-01   9.29   3.42  11.54   3.50   2.21   1.96  10.41   2.79   3.54   5.17   4.38   7.92  1962-01-01      1  1962    1
1962-02-01  19.12  13.96  12.21  10.58  15.71  10.63  15.71  11.08  13.17  12.62  17.67  22.71  1962-02-01      2  1962    1
1962-03-01   8.21   4.83   9.00   4.83   6.00   2.21   7.96   1.87   4.08   3.92   4.08   5.41  1962-03-01      3  1962    1
1962-04-01  14.33  12.25  11.87  10.37  14.92  11.00  19.79  11.67  14.09  15.46  16.62  23.58  1962-04-01      4  1962    1
1962-05-01   9.62   9.54   3.58   3.33   8.75   3.75   2.25   2.58   1.67   2.37   7.29   3.25  1962-05-01      5  1962    1
1962-06-01   5.88   6.29   8.67   5.21   5.00   4.25   5.91   5.41   4.79   9.25   5.25  10.71  1962-06-01      6  1962    1
1962-07-01   8.67   4.17   6.92   6.71   8.17   5.66  11.17   9.38   8.75  11.12  10.25  17.08  1962-07-01      7  1962    1
1962-08-01   4.58   5.37   6.04   2.29   7.87   3.71   4.46   2.58   4.00   4.79   7.21   7.46  1962-08-01      8  1962    1
1962-09-01  10.00  12.08  10.96   9.25   9.29   7.62   7.41   8.75   7.67   9.62  14.58  11.92  1962-09-01      9  1962    1
1962-10-01  14.58   7.83  19.21  10.08  11.54   8.38  13.29  10.63   8.21  12.92  18.05  18.12  1962-10-01     10  1962    1
1962-11-01  16.88  13.25  16.00   8.96  13.46  11.46  10.46  10.17  10.37  13.21  14.83  15.16  1962-11-01     11  1962    1
1962-12-01  18.38  15.41  11.75   6.79  12.21   8.04   8.42  10.83   5.66   9.08  11.50  11.50  1962-12-01     12  1962    1
1963-01-01  15.59  13.62  19.79   8.38  12.25  10.00  23.45  15.71  13.59  14.37  17.58  34.13  1963-01-01      1  1963    1
1963-02-01  15.41   7.62  24.67  11.42   9.21   8.17  14.04   7.54   7.54  10.08  10.17  17.67  1963-02-01      2  1963    1
1963-03-01  16.75  19.67  17.67   8.87  19.08  15.37  16.21  14.29  11.29   9.21  19.92  19.79  1963-03-01      3  1963    1
1963-04-01  10.54   9.59  12.46   7.33   9.46   9.59  11.79  11.87   9.79  10.71  13.37  18.21  1963-04-01      4  1963    1
1963-05-01  18.79  14.17  13.59  11.63  14.17  11.96  14.46  12.46  12.87  13.96  15.29  21.62  1963-05-01      5  1963    1
1963-06-01  13.37   6.87  12.00   8.50  10.04   9.42  10.92  12.96  11.79  11.04  10.92  13.67  1963-06-01      6  1963    1
...           ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...         ...    ...   ...  ...
1976-07-01   8.50   1.75   6.58   2.13   2.75   2.21   5.37   2.04   5.88   4.50   4.96  10.63  1976-07-01      7  1976    1
1976-08-01  13.00   8.38   8.63   5.83  12.92   8.25  13.00   9.42  10.58  11.34  14.21  20.25  1976-08-01      8  1976    1
1976-09-01  11.87  11.00   7.38   6.87   7.75   8.33  10.34   6.46  10.17   9.29  12.75  19.55  1976-09-01      9  1976    1
1976-10-01  10.96   6.71  10.41   4.63   7.58   5.04   5.04   5.54   6.50   3.92   6.79   5.00  1976-10-01     10  1976    1
1976-11-01  13.96  15.67  10.29   6.46  12.79   9.08  10.00   9.67  10.21  11.63  23.09  21.96  1976-11-01     11  1976    1
1976-12-01  13.46  16.42   9.21   4.54  10.75   8.67  10.88   4.83   8.79   5.91   8.83  13.67  1976-12-01     12  1976    1
1977-01-01  20.04  11.92  20.25   9.13   9.29   8.04  10.75   5.88   9.00   9.00  14.88  25.70  1977-01-01      1  1977    1
1977-02-01  11.83   9.71  11.00   4.25   8.58   8.71   6.17   5.66   8.29   7.58  11.71  16.50  1977-02-01      2  1977    1
1977-03-01   8.63  14.83  10.29   3.75   6.63   8.79   5.00   8.12   7.87   6.42  13.54  13.67  1977-03-01      3  1977    1
1977-04-01  21.67  16.00  17.33  13.59  20.83  15.96  25.62  17.62  19.41  20.67  24.37  30.09  1977-04-01      4  1977    1
1977-05-01   6.42   7.12   8.67   3.58   4.58   4.00   6.75   6.13   3.33   4.50  19.21  12.38  1977-05-01      5  1977    1
1977-06-01   7.08   5.25   9.71   2.83   2.21   3.50   5.29   1.42   2.00   0.92   5.21   5.63  1977-06-01      6  1977    1
1977-07-01  15.41  16.29  17.08   6.25  11.83  11.83  12.29  10.58  10.41   7.21  17.37   7.83  1977-07-01      7  1977    1
1977-08-01   4.33   2.96   4.42   2.33   0.96   1.08   4.96   1.87   2.33   2.04  10.50   9.83  1977-08-01      8  1977    1
1977-09-01  17.37  16.33  16.83   8.58  14.46  11.83  15.09  13.92  13.29  13.88  23.29  25.17  1977-09-01      9  1977    1
1977-10-01  16.75  15.34  12.25   9.42  16.38  11.38  18.50  13.92  14.09  14.46  22.34  29.67  1977-10-01     10  1977    1
1977-11-01  16.71  11.54  12.17   4.17   8.54   7.17  11.12   6.46   8.25   6.21  11.04  15.63  1977-11-01     11  1977    1
1977-12-01  13.37  10.92  12.42   2.37   5.79   6.13   8.96   7.38   6.29   5.71   8.54  12.42  1977-12-01     12  1977    1
1978-01-01   8.33   7.12   7.71   3.54   8.50   7.50  14.71  10.00  11.83  10.00  15.09  20.46  1978-01-01      1  1978    1
1978-02-01  27.25  24.21  18.16  17.46  27.54  18.05  20.96  25.04  20.04  17.50  27.71  21.12  1978-02-01      2  1978    1
1978-03-01  15.04   6.21  16.04   7.87   6.42   6.67  12.29   8.00  10.58   9.33   5.41  17.00  1978-03-01      3  1978    1
1978-04-01   3.42   7.58   2.71   1.38   3.46   2.08   2.67   4.75   4.83   1.67   7.33  13.67  1978-04-01      4  1978    1
1978-05-01  10.54  12.21   9.08   5.29  11.00  10.08  11.17  13.75  11.87  11.79  12.87  27.16  1978-05-01      5  1978    1
1978-06-01  10.37  11.42   6.46   6.04  11.25   7.50   6.46   5.96   7.79   5.46   5.50  10.41  1978-06-01      6  1978    1
1978-07-01  12.46  10.63  11.17   6.75  12.92   9.04  12.42   9.62  12.08   8.04  14.04  16.17  1978-07-01      7  1978    1
1978-08-01  19.33  15.09  20.17   8.83  12.62  10.41   9.33  12.33   9.50   9.92  15.75  18.00  1978-08-01      8  1978    1
1978-09-01   8.42   6.13   9.87   5.25   3.21   5.71   7.25   3.50   7.33   6.50   7.62  15.96  1978-09-01      9  1978    1
1978-10-01   9.50   6.83  10.50   3.88   6.13   4.58   4.21   6.50   6.38   6.54  10.63  14.09  1978-10-01     10  1978    1
1978-11-01  13.59  16.75  11.25   7.08  11.04   8.33   8.17  11.29  10.75  11.25  23.13  25.00  1978-11-01     11  1978    1
1978-12-01  21.29  16.29  24.04  12.79  18.21  19.29  21.54  17.21  16.71  17.83  17.75  25.70  1978-12-01     12  1978    1

216 rows × 16 columns
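The row count is easy to sanity-check: 18 years (1961 to 1978) times 12 months gives 216 first-of-month rows. The sketch below reproduces that number on a synthetic daily index spanning the same period, using `asfreq("MS")` (month start) in place of the `day` helper column; the frame and its column name are made up for illustration:

```python
import pandas as pd

# Synthetic daily index covering 1961-01-01 through 1978-12-31
days = pd.date_range("1961-01-01", "1978-12-31", freq="D")
daily = pd.DataFrame({"day_of_month": days.day.to_numpy()}, index=days)

# Keep only the first day of each month; "MS" means month start
monthly = daily.asfreq("MS")
print(len(daily), len(monthly))
```

The daily length of 6574 agrees with the per-location counts in Step 7, and the monthly length agrees with the 216 rows reported above.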
Exercise 7: Visualization
Exploring the Titanic disaster data
Step 1. Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
Step 2. Import the data from the following path
path7 = './exercise_data/train.csv'
Step 3. Name the DataFrame titanic
titanic = pd.read_csv(path7)
titanic.head()
   PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S
4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S
Step 4. Set PassengerId as the index
titanic.set_index('PassengerId').head()
             Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
PassengerId
1                   0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S
2                   1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
3                   1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
4                   1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S
5                   0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S
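One detail worth noting: `set_index` returns a new DataFrame and leaves the original untouched, so the call above only changes what is displayed. A short sketch on a three-row stand-in frame (the name `passengers` is illustrative):

```python
import pandas as pd

passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
})

# set_index builds a new frame; `passengers` itself keeps its default integer index
indexed = passengers.set_index("PassengerId")

print(passengers.columns.tolist())
print(indexed.columns.tolist(), indexed.index.name)
```

To keep the new index, assign the result back, for example `titanic = titanic.set_index('PassengerId')`.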
Step 5. Draw a pie chart showing the proportion of male and female passengers
# count the male and female passengers
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
# put them into a list called proportions
proportions = [males, females]
# create a pie chart
plt.pie(
    proportions,                  # the values to plot
    labels=['Males', 'Females'],  # slice labels
    shadow=False,                 # no shadow
    colors=['blue', 'red'],       # slice colors
    explode=(0.15, 0),            # pull the first slice out
    startangle=90,                # start drawing at 90 degrees
    autopct='%1.1f%%')            # print each share as a percentage with one decimal
# equal aspect ratio so the pie is drawn as a circle
plt.axis('equal')
# set the title
plt.title("Sex Proportion")
# display the plot
plt.tight_layout()
plt.show()
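The percentages that `autopct` prints are simply each count divided by the total. As a sketch of the same calculation without plotting, `value_counts` can tally both categories in one call; the five `Sex` values below are the first five rows shown earlier, not the full dataset:

```python
import pandas as pd

# Sex column of the first five passengers, as displayed by titanic.head()
sex = pd.Series(["male", "female", "female", "female", "male"], name="Sex")

counts = sex.value_counts()           # one count per category
shares = counts / counts.sum() * 100  # percentage of the total, as autopct would show
labels = [f"{name}: {share:.1f}%" for name, share in shares.items()]
print(labels)
```

`counts` could be passed to `plt.pie` directly, with `counts.index` as the labels, which avoids hard-coding the two categories.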
Step 6. Draw a scatter plot of Fare against passenger age, colored by sex
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)
# set the title
lm.set(title='Fare x Age')
# get the axes object and adjust the axis limits
axes = lm.axes
axes[0, 0].set_ylim(-5,)
axes[0, 0].set_xlim(-5, 85)
Step 7. How many passengers survived?
titanic.Survived.sum()
342
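Summing works because `Survived` is coded as 0 or 1, so the sum counts the ones. By the same logic the mean gives the survival rate. A sketch on the five `Survived` values shown in the head of the table (not the full dataset):

```python
import pandas as pd

# Survived column of the first five passengers: 0 = died, 1 = survived
survived = pd.Series([0, 1, 1, 1, 0], name="Survived")

n_survived = int(survived.sum())  # number of ones
rate = survived.mean()            # share of ones among all rows
print(n_survived, rate)
```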
Step 8. Draw a histogram of the ticket fares
df = titanic.Fare.sort_values(ascending=False)
df
# create the bin edges using numpy
binsVal = np.arange(0, 600, 10)
binsVal
# create the plot
plt.hist(df, bins=binsVal)
# set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')
# show the plot
plt.show()
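What the bins mean: `np.arange(0, 600, 10)` produces 60 edges (0, 10, ..., 590), which define 59 bins of width 10, and any value above the last edge is left out of the histogram. The sketch below applies the same edges with `np.histogram`, which performs the same binning as `plt.hist` without drawing anything; the five fares are the ones shown in the head of the table:

```python
import numpy as np

# Fares of the first five passengers, as displayed by titanic.head()
fares = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

bins_val = np.arange(0, 600, 10)  # edges 0, 10, ..., 590 (the stop value 600 is excluded)
counts, edges = np.histogram(fares, bins=bins_val)

print(len(bins_val), len(counts))
print(counts[:8])
```

Three of the five fares fall in the first bin (0 to 10), one in the 50 to 60 bin and one in the 70 to 80 bin.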