[Machine Learning Example]: an article collected by the 靠谱客 blogger 激动发夹 during recent development work and shared here as a reference.

Overview

https://blog.csdn.net/power1_power2/article/details/79664830

Source data: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

Download the data

from urllib.request import urlretrieve

def load_data(download=True):
    # Fetch adult.data from the UCI repository and save it locally as a CSV file
    if download:
        data_path, _ = urlretrieve("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "D://pic//adult.csv")
        print('Data downloaded')

load_data()
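The download step can be made safe to re-run by skipping the request when the file already exists. The following is only a sketch: the helper name `download_if_missing` and its `fetch` parameter are hypothetical and not part of the original post. `fetch` defaults to `urlretrieve` and exists only so the function can be tried without network access.

```python
import os
from urllib.request import urlretrieve

ADULT_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

def download_if_missing(url, dest, fetch=urlretrieve):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the file was already there.
    `fetch` is a hypothetical hook (defaults to urlretrieve) used for testing.
    """
    if os.path.exists(dest):
        return False
    # Create the target folder if needed before writing the file
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    fetch(url, dest)
    return True

# Example call (commented out so the sketch does not hit the network on import):
# download_if_missing(ADULT_URL, "D://pic//adult.csv")
```

Calling the function a second time with the same destination returns `False` and leaves the existing file untouched.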

Assign the column names and read the data

import pandas as pd

# The raw file has no header row, so the column names are supplied explicitly
col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
data = pd.read_csv("D://pic//adult.csv", names=col_names)
print(data[:10])
   age         workclass  fnlwgt  education  education-num
0   39         State-gov   77516  Bachelors             13
1   50  Self-emp-not-inc   83311  Bachelors             13
2   38           Private  215646    HS-grad              9
3   53           Private  234721       11th              7
4   28           Private  338409  Bachelors             13
5   37           Private  284582    Masters             14
6   49           Private  160187        9th              5
7   52  Self-emp-not-inc  209642    HS-grad              9
8   31           Private   45781    Masters             14
9   42           Private  159449  Bachelors             13

          marital-status         occupation   relationship   race
0          Never-married       Adm-clerical  Not-in-family  White
1     Married-civ-spouse    Exec-managerial        Husband  White
2               Divorced  Handlers-cleaners  Not-in-family  White
3     Married-civ-spouse  Handlers-cleaners        Husband  Black
4     Married-civ-spouse     Prof-specialty           Wife  Black
5     Married-civ-spouse    Exec-managerial           Wife  White
6  Married-spouse-absent      Other-service  Not-in-family  Black
7     Married-civ-spouse    Exec-managerial        Husband  White
8          Never-married     Prof-specialty  Not-in-family  White
9     Married-civ-spouse    Exec-managerial        Husband  White

      sex  capital-gain  capital-loss  hours-per-week  native-country  result
0    Male          2174             0              40   United-States   <=50K
1    Male             0             0              13   United-States   <=50K
2    Male             0             0              40   United-States   <=50K
3    Male             0             0              40   United-States   <=50K
4  Female             0             0              40            Cuba   <=50K
5  Female             0             0              40   United-States   <=50K
6  Female             0             0              16         Jamaica   <=50K
7    Male             0             0              45   United-States    >50K
8  Female         14084             0              50   United-States    >50K
9    Male          5178             0              40   United-States    >50K
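One detail worth checking when reading this file: the raw adult.data records separate fields with a comma followed by a space, so string values read with the default settings keep a leading space. The sketch below parses two records (rows 0 and 1 from the output above) from an in-memory string and compares the default behaviour with `skipinitialspace=True`. Using `skipinitialspace` is a suggestion of this sketch, not something the original code does.

```python
import io
import pandas as pd

col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]

# Two records in the raw layout: a comma followed by a space between fields
raw_text = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# Default parsing keeps the space that follows each comma inside string values
raw = pd.read_csv(io.StringIO(raw_text), names=col_names)

# skipinitialspace=True drops whitespace that directly follows the delimiter
stripped = pd.read_csv(io.StringIO(raw_text), names=col_names, skipinitialspace=True)

print(repr(raw.loc[0, "workclass"]))
print(repr(stripped.loc[0, "workclass"]))
```

With stripped values, exact comparisons such as `data["sex"] == "Male"` behave as expected. With the default parsing they would need `" Male"` instead.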

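Description of each column:

age: age, continuous numeric variable;

workclass: type of employer, multi-class categorical variable;

fnlwgt: the number of people the census takers believe the observation represents, continuous variable;

education: education level, multi-class categorical variable;

education-num: years of education, continuous variable;

marital-status: marital status, multi-class categorical variable;

occupation: occupation, multi-class categorical variable;

relationship: family relationship (for example Husband, Wife, Not-in-family), multi-class categorical variable;

race: race, multi-class categorical variable;

sex: sex, binary variable;

capital-gain: capital gains, continuous variable;

capital-loss: capital losses, continuous variable;

hours-per-week: hours worked per week, continuous variable;

native-country: country of origin, multi-class categorical variable;

result: the outcome (income <=50K or >50K), binary variable;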
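The continuous/categorical distinction described above can also be recovered from the column dtypes. This is a minimal sketch on a small hand-built frame: the three rows reuse values from the printed sample, and the names `continuous_cols` and `categorical_cols` are illustrative only.

```python
import pandas as pd

# A small stand-in for the full dataset, using a few of its columns
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "education-num": [13, 13, 9],
    "sex": ["Male", "Male", "Male"],
    "result": ["<=50K", "<=50K", "<=50K"],
})

# Numeric dtypes correspond to the continuous variables
continuous_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else (string/object columns) is categorical or binary
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)
print(categorical_cols)
```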

Feature processing

Check for missing values:

# Method 1
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
result            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

# Method 2
print(data.isnull().any())
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
result            False
dtype: bool

print(data.shape)
(32561, 15)
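`isnull()` only detects real NaN values, so a placeholder stored as text goes unnoticed. As a rough sketch, the placeholder can be counted per column directly. The toy frame below uses made-up rows (not taken from the dataset) and assumes the placeholder is a `?`, possibly surrounded by whitespace.

```python
import pandas as pd

# Hypothetical rows for illustration; " ?" mimics a placeholder stored as text
df = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " ?"],
    "age": [39, 50, 38],
})

# isnull() sees nothing, because "?" is an ordinary string
has_nan = bool(df.isnull().any().any())

# Count cells that equal "?" after stripping whitespace, per text column
text_cols = df.select_dtypes(exclude="number").columns
placeholder_counts = df[text_cols].apply(lambda col: (col.str.strip() == "?").sum())

print(has_nan)
print(placeholder_counts)
```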

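Both checks report no missing values, but the data actually contains many invalid characters such as ?, . and $. Replace these invalid entries with NaN so that they are treated as missing: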

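import numpy as np
# ?, . and $ are regex metacharacters and must be escaped to match literally;
# any string cell containing one of them is replaced with NaN
data_clean = data.replace(regex=[r'\?|\.|\$'], value=np.nan)
print(data_clean.isnull().any())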
age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
result            False
dtype: bool
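The characters `?`, `.` and `$` all have special meaning in regular expressions, so a pattern meant to match them literally has to escape each one. The sketch below uses only the standard `re` module. The helper name `is_invalid` is illustrative and does not appear in the original post.

```python
import re

# Escaped pattern: matches a literal "?", "." or "$" anywhere in the string
invalid_pattern = re.compile(r"\?|\.|\$")

def is_invalid(value):
    """Return True if the string contains one of the placeholder characters."""
    return bool(invalid_pattern.search(value))

# Without the backslashes the pattern does not even compile,
# because "?" is a quantifier with nothing in front of it to repeat
try:
    re.compile(r"?|.|$")
    unescaped_compiles = True
except re.error:
    unescaped_compiles = False

print(is_invalid(" ?"), is_invalid("Private"), is_invalid("<=50K"))
print(unescaped_compiles)
```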

Drop every row that contains a missing value

adult = data_clean.dropna(how='any')
print(adult.shape)
(30162, 15)
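To make the effect of `how='any'` concrete, here is a small sketch comparing it with `how='all'` on a toy frame. The rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 has one NaN, row 2 is entirely NaN
df = pd.DataFrame({
    "workclass": ["Private", np.nan, np.nan],
    "occupation": ["Sales", "Adm-clerical", np.nan],
})

# how="any": drop a row as soon as a single value is missing
any_dropped = df.dropna(how="any")

# how="all": drop a row only when every value is missing
all_dropped = df.dropna(how="all")

print(any_dropped.shape, all_dropped.shape)
```

With `how="any"` only the fully populated row survives, which is the stricter behaviour used in this post.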

Drop features that are not useful

adult = adult.drop(['fnlwgt'],axis=1)
adult.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 14 columns):
age               30162 non-null int64
workclass         30162 non-null object
education         30162 non-null object
education-num     30162 non-null int64
marital-status    30162 non-null object
occupation        30162 non-null object
relationship      30162 non-null object
race              30162 non-null object
sex               30162 non-null object
capital-gain      30162 non-null int64
capital-loss      30162 non-null int64
hours-per-week    30162 non-null int64
native-country    30162 non-null object
result            30162 non-null object
dtypes: int64(5), object(9)
memory usage: 3.5+ MB
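The index line in this output ("30162 entries, 0 to 32560") shows that the original row labels are kept after rows are dropped, so the index now has gaps. If contiguous labels are preferred, `reset_index(drop=True)` renumbers the rows. This is an optional step sketched on a toy frame. The original post does not do it.

```python
import pandas as pd

# Four rows labelled 0..3, standing in for the full dataset
df = pd.DataFrame({"age": [39, 50, 38, 53]}, index=[0, 1, 2, 3])

# Removing a row leaves a gap in the labels: 0, 2, 3
filtered = df.drop(index=[1])

# reset_index(drop=True) renumbers from 0 and discards the old labels
renumbered = filtered.reset_index(drop=True)

print(list(filtered.index))
print(list(renumbered.index))
```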

Split into training and test sets

# Supervised machine learning
from sklearn.model_selection import train_test_split
# Separate the features from the label
col_names = ["age", "workclass", "education", "education-num", "marital-status", "occupation",
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
# Note: col_names[1:13] starts at "workclass", so "age" is not among the 12 feature columns;
# col_names[13] ("result") is the label
X_train, X_test, y_train, y_test = train_test_split(adult[col_names[1:13]], adult[col_names[13]], test_size=0.25, random_state=33)
print(X_train.shape)
print(X_test.shape)
print(X_train.head())
print(y_train.head())
D:developAnaconda3python.exe D:/thislove/pythonworkspace/blogspark/Adult.py
(22621, 12)
(7541, 12)
       workclass     education  education-num      marital-status
20607    Private  Some-college             10  Married-civ-spouse
31257    Private       HS-grad              9  Married-civ-spouse
31892    Private       HS-grad              9       Never-married
20220    Private       HS-grad              9            Divorced
24044    Private  Some-college             10            Divorced

              occupation   relationship   race     sex  capital-gain
20607       Craft-repair        Husband  White    Male             0
31257      Other-service        Husband  Black    Male             0
31892       Adm-clerical  Not-in-family  White  Female             0
20220  Machine-op-inspct      Unmarried  Black  Female             0
24044              Sales  Not-in-family  White  Female             0

       capital-loss  hours-per-week  native-country
20607             0              50   United-States
31257             0              50   United-States
31892             0              45   United-States
20220             0              40   United-States
24044             0              45   United-States
20607     >50K
31257    <=50K
31892    <=50K
20220    <=50K
24044     >50K
Name: result, dtype: object
Process finished with exit code 0
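The two printed shapes can be checked by hand. With 30162 rows and `test_size=0.25`, the test fraction works out to 7540.5 rows. The sketch below assumes that scikit-learn rounds the test count up for a fractional `test_size` and gives the remainder to the training set, which reproduces the shapes above. The function name `split_sizes` is illustrative only.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption of this sketch: the test count is rounded up,
    and the training set receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(30162, 0.25)
print(n_train, n_test)  # 22621 7541
```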
