【Machine Learning Example】



Overview

https://blog.csdn.net/power1_power2/article/details/79664830

Source data: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

Download the data

from urllib.request import urlretrieve

def load_data(download=True):
    # Download the raw Adult data set and save it locally as a CSV file
    if download:
        data_path, _ = urlretrieve("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "D://pic//adult.csv")
        print('Data downloaded')

load_data()
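
Re-running the script downloads the file again each time. The following is a minimal sketch of one way to skip the download when the file is already on disk. The names `ADULT_URL` and `ensure_downloaded`, and the `fetch` parameter, are assumptions made for illustration and are not part of the original code.

```python
from pathlib import Path
from urllib.request import urlretrieve

ADULT_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"


def ensure_downloaded(url, target, fetch=urlretrieve):
    """Download url to target unless the file already exists; return the path."""
    target = Path(target)
    if not target.exists():
        # Create the parent directory first so the download cannot fail on it.
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(url, str(target))
    return target


# Example call (commented out so that nothing is downloaded when this runs):
# ensure_downloaded(ADULT_URL, "D://pic//adult.csv")
```

The `fetch` argument defaults to `urlretrieve`; passing a different callable makes the function easy to try out without network access.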

Assign column names and read the data

import pandas as pd
col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
"relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
data = pd.read_csv("D://pic//adult.csv", names=col_names)
print(data[:10])
   age         workclass  fnlwgt  education  education-num
0   39         State-gov   77516  Bachelors             13
1   50  Self-emp-not-inc   83311  Bachelors             13
2   38           Private  215646    HS-grad              9
3   53           Private  234721       11th              7
4   28           Private  338409  Bachelors             13
5   37           Private  284582    Masters             14
6   49           Private  160187        9th              5
7   52  Self-emp-not-inc  209642    HS-grad              9
8   31           Private   45781    Masters             14
9   42           Private  159449  Bachelors             13

          marital-status         occupation   relationship   race
0          Never-married       Adm-clerical  Not-in-family  White
1     Married-civ-spouse    Exec-managerial        Husband  White
2               Divorced  Handlers-cleaners  Not-in-family  White
3     Married-civ-spouse  Handlers-cleaners        Husband  Black
4     Married-civ-spouse     Prof-specialty           Wife  Black
5     Married-civ-spouse    Exec-managerial           Wife  White
6  Married-spouse-absent      Other-service  Not-in-family  Black
7     Married-civ-spouse    Exec-managerial        Husband  White
8          Never-married     Prof-specialty  Not-in-family  White
9     Married-civ-spouse    Exec-managerial        Husband  White

      sex  capital-gain  capital-loss  hours-per-week  native-country  result
0    Male          2174             0              40   United-States   <=50K
1    Male             0             0              13   United-States   <=50K
2    Male             0             0              40   United-States   <=50K
3    Male             0             0              40   United-States   <=50K
4  Female             0             0              40            Cuba   <=50K
5  Female             0             0              40   United-States   <=50K
6  Female             0             0              16         Jamaica   <=50K
7    Male             0             0              45   United-States    >50K
8  Female         14084             0              50   United-States    >50K
9    Male          5178             0              40   United-States    >50K

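Description of the columns:

age: age, continuous numeric variable;

workclass: type of employer, multi-category variable;

fnlwgt: the number of people the census taker believes the observation represents, continuous variable;

education: education level, multi-category variable;

education-num: years of education, continuous variable;

marital-status: marital status, multi-category variable;

occupation: occupation, multi-category variable;

relationship: family relationship, multi-category variable;

race: race, multi-category variable;

sex: sex, binary variable;

capital-gain: capital gains, continuous variable;

capital-loss: capital losses, continuous variable;

hours-per-week: hours worked per week, continuous variable;

native-country: country of origin, multi-category variable;

result: the outcome (<=50K or >50K), binary variable;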
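
The split into continuous and categorical columns described above can be cross-checked against the dtypes pandas infers. This is a minimal sketch, assuming that a numeric dtype corresponds to a continuous variable and anything else to a categorical one. `sample` is a hypothetical excerpt built from the first three records shown earlier, reduced to six columns.

```python
import pandas as pd

# First three records shown earlier, reduced to a few columns.
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "education-num": [13, 13, 9],
    "sex": ["Male", "Male", "Male"],
    "hours-per-week": [40, 13, 40],
    "result": ["<=50K", "<=50K", "<=50K"],
})

continuous_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)   # numeric (continuous) columns
print(categorical_cols)  # string-valued (categorical) columns
```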

Feature processing

Check for missing values:

# Method 1
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
result            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
# Method 2
print(data.isnull().any())
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
result            False
dtype: bool
print(data.shape)
(32561, 15)
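
`isnull()` only recognises genuine NaN/None values, so a textual placeholder in a string column does not count as missing. A minimal sketch on a hypothetical toy frame (the values in `toy` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one real NaN and one "?" placeholder.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "age": [39.0, np.nan, 50.0],
})

nan_counts = toy.isnull().sum()                   # NaN cells per column
placeholder_count = toy["workclass"].eq("?").sum()  # "?" cells in workclass

print(nan_counts)
print(placeholder_count)
```

Using `.sum()` instead of `.any()` also reports how many cells are affected in each column, not just whether any are.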

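The functions above report no missing values, but the raw data actually contains many invalid characters such as ?, . and $, so these invalid entries need to be handled: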

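import numpy as np
# Replace any cell containing ?, . or $ with NaN; the backslashes make the regex match these characters literally
data_clean = data.replace(regex=[r'\?|\.|\$'], value=np.nan)
print(data_clean.isnull().any())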
age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
result            False
dtype: bool
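
An alternative to replacing placeholders after loading is to declare them as missing while reading the file. This is a sketch under two assumptions: that the raw file separates fields with a comma followed by a space, and that `?` is the placeholder of interest. `SAMPLE` is a hypothetical three-column excerpt, not real rows from the data set.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt; the second row uses "?" as a placeholder.
SAMPLE = "39, State-gov, <=50K\n54, ?, >50K\n"

df = pd.read_csv(
    io.StringIO(SAMPLE),
    names=["age", "workclass", "result"],
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat a bare "?" as missing
)

print(df.isnull().sum())
print(df.dropna(how="any").shape)
```

Unlike the regex replacement, this approach only marks cells that are exactly `?`, so values that merely contain a dot or a dollar sign are left untouched.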

Drop every row that contains a missing value

adult = data_clean.dropna(how='any')
print(adult.shape)
(30162, 15)
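
To illustrate what `how='any'` does, the sketch below compares it with `how='all'` on a hypothetical toy frame, then works out how many rows the step removed using the two shapes reported in this article. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one complete row, one partial row, one empty row.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, np.nan],
    "occupation": ["Sales", "Sales", np.nan],
})

any_dropped = toy.dropna(how="any")  # drop rows with at least one NaN
all_dropped = toy.dropna(how="all")  # drop rows only if every cell is NaN
print(len(any_dropped), len(all_dropped))

# Row counts reported in this article before and after dropna(how='any').
rows_before, rows_after = 32561, 30162
removed = rows_before - rows_after
share_removed = round(100 * removed / rows_before, 2)
print(removed, share_removed)
```

With these figures, 2399 rows are removed, a little over 7% of the original data.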

Drop features that are not useful

adult = adult.drop(['fnlwgt'],axis=1)
adult.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 14 columns):
age               30162 non-null int64
workclass         30162 non-null object
education         30162 non-null object
education-num     30162 non-null int64
marital-status    30162 non-null object
occupation        30162 non-null object
relationship      30162 non-null object
race              30162 non-null object
sex               30162 non-null object
capital-gain      30162 non-null int64
capital-loss      30162 non-null int64
hours-per-week    30162 non-null int64
native-country    30162 non-null object
result            30162 non-null object
dtypes: int64(5), object(9)
memory usage: 3.5+ MB
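
As a small aside on the `drop` call: the `columns=` keyword is an equivalent spelling of `axis=1`, and `drop` returns a new frame rather than changing the original. A minimal sketch on a hypothetical two-row excerpt:

```python
import pandas as pd

# Hypothetical two-row excerpt with three of the columns.
toy = pd.DataFrame({
    "age": [39, 50],
    "fnlwgt": [77516, 83311],
    "result": ["<=50K", "<=50K"],
})

by_axis = toy.drop(["fnlwgt"], axis=1)  # form used in the article
by_name = toy.drop(columns=["fnlwgt"])  # equivalent keyword form

print(by_name.columns.tolist())
print(by_axis.equals(by_name))
print("fnlwgt" in toy.columns)  # drop returns a copy; toy is unchanged
```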

Split into training and test sets

# Supervised machine learning
from sklearn.model_selection import train_test_split
# Split the data into features and label, then into training and test sets
col_names = ["age", "workclass", "education", "education-num", "marital-status", "occupation",
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
# Note: the slice col_names[1:13] starts at "workclass", so "age" (index 0) is not among the 12 features
X_train, X_test, y_train, y_test = train_test_split(adult[col_names[1:13]], adult[col_names[13]], test_size=0.25, random_state=33)
print(X_train.shape)
print(X_test.shape)
print(X_train.head())
print(y_train.head())
(22621, 12)
(7541, 12)
       workclass     education  education-num      marital-status
20607    Private  Some-college             10  Married-civ-spouse
31257    Private       HS-grad              9  Married-civ-spouse
31892    Private       HS-grad              9       Never-married
20220    Private       HS-grad              9            Divorced
24044    Private  Some-college             10            Divorced

              occupation   relationship   race     sex  capital-gain
20607       Craft-repair        Husband  White    Male             0
31257      Other-service        Husband  Black    Male             0
31892       Adm-clerical  Not-in-family  White  Female             0
20220  Machine-op-inspct      Unmarried  Black  Female             0
24044              Sales  Not-in-family  White  Female             0

       capital-loss  hours-per-week  native-country
20607             0              50   United-States
31257             0              50   United-States
31892             0              45   United-States
20220             0              40   United-States
24044             0              45   United-States

20607      >50K
31257     <=50K
31892     <=50K
20220     <=50K
24044      >50K
Name: result, dtype: object
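
The reported shapes can be checked against the row count. This is a minimal sketch, assuming (as I understand scikit-learn's behaviour) that a fractional `test_size` is rounded up and the training set receives the remaining rows. `rows` is stand-in data, since only the number of rows matters for this check.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 30162      # rows left after dropna, as reported in this article
test_size = 0.25

# Assumed rounding rule: the test share is rounded up,
# and the training set receives the remaining rows.
expected_test = math.ceil(n_rows * test_size)
expected_train = n_rows - expected_test

# Stand-in data: only the number of rows matters here.
rows = np.arange(n_rows)
train_rows, test_rows = train_test_split(rows, test_size=test_size, random_state=33)

print(expected_train, expected_test)
print(len(train_rows), len(test_rows))
```

Both lines should agree with the (22621, 12) and (7541, 12) shapes printed by the article's script.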