Overview
https://blog.csdn.net/power1_power2/article/details/79664830
Source data: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Download the data
from urllib.request import urlretrieve

def load_data(download=True):
    # Download the raw Adult data set and save it as a local CSV file
    if download:
        data_path, _ = urlretrieve("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "D://pic//adult.csv")
        print('Data downloaded')

load_data()
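As written, the script downloads the file again on every run. The following is a minimal sketch of a cached download, not the author's code. The helper name `download_if_missing` and its `fetch` parameter are hypothetical; `fetch` is there only so the function can be tried without network access, and it defaults to `urlretrieve`.

```python
import os
from urllib.request import urlretrieve

ADULT_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"


def download_if_missing(url, dest, fetch=urlretrieve):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the local file was reused.
    `fetch` is a hypothetical injection point and defaults to urlretrieve.
    """
    if os.path.exists(dest):
        return False
    parent = os.path.dirname(dest)
    if parent:
        # Create the target directory if it is not there yet
        os.makedirs(parent, exist_ok=True)
    fetch(url, dest)
    return True


# Example call, commented out so the sketch does not touch the network:
# download_if_missing(ADULT_URL, "D://pic//adult.csv")
```

Returning a boolean lets the caller decide whether to print a "downloaded" message or silently reuse the local copy.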
Assign column names and read the data
import pandas as pd
col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
data = pd.read_csv("D://pic//adult.csv", names=col_names)
print(data[:10])
   age         workclass  fnlwgt  education  education-num
0   39         State-gov   77516  Bachelors             13
1   50  Self-emp-not-inc   83311  Bachelors             13
2   38           Private  215646    HS-grad              9
3   53           Private  234721       11th              7
4   28           Private  338409  Bachelors             13
5   37           Private  284582    Masters             14
6   49           Private  160187        9th              5
7   52  Self-emp-not-inc  209642    HS-grad              9
8   31           Private   45781    Masters             14
9   42           Private  159449  Bachelors             13

          marital-status         occupation   relationship   race
0          Never-married       Adm-clerical  Not-in-family  White
1     Married-civ-spouse    Exec-managerial        Husband  White
2               Divorced  Handlers-cleaners  Not-in-family  White
3     Married-civ-spouse  Handlers-cleaners        Husband  Black
4     Married-civ-spouse     Prof-specialty           Wife  Black
5     Married-civ-spouse    Exec-managerial           Wife  White
6  Married-spouse-absent      Other-service  Not-in-family  Black
7     Married-civ-spouse    Exec-managerial        Husband  White
8          Never-married     Prof-specialty  Not-in-family  White
9     Married-civ-spouse    Exec-managerial        Husband  White

      sex  capital-gain  capital-loss  hours-per-week native-country result
0    Male          2174             0              40  United-States  <=50K
1    Male             0             0              13  United-States  <=50K
2    Male             0             0              40  United-States  <=50K
3    Male             0             0              40  United-States  <=50K
4  Female             0             0              40           Cuba  <=50K
5  Female             0             0              40  United-States  <=50K
6  Female             0             0              16        Jamaica  <=50K
7    Male             0             0              45  United-States   >50K
8  Female         14084             0              50  United-States   >50K
9    Male          5178             0              40  United-States   >50K
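One detail the printed sample does not show: assuming the raw file separates fields with a comma followed by a space, every string value read this way keeps a leading space (for example " State-gov"). The sketch below uses two rows taken from the sample above, written in that assumed raw layout, to compare the default read with `skipinitialspace=True`. It is an illustration, not part of the original script.

```python
import io

import pandas as pd

col_names = ["age", "workclass", "fnlwgt", "education", "education-num",
             "marital-status", "occupation", "relationship", "race", "sex",
             "capital-gain", "capital-loss", "hours-per-week",
             "native-country", "result"]

# Two rows from the printed sample, in the assumed raw layout
# (fields separated by a comma followed by a space).
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# Default read: the space after each comma stays in the string values
raw = pd.read_csv(io.StringIO(sample), names=col_names)
# skipinitialspace=True drops the whitespace that follows each delimiter
stripped = pd.read_csv(io.StringIO(sample), names=col_names, skipinitialspace=True)

print(repr(raw["workclass"][0]))       # ' State-gov'
print(repr(stripped["workclass"][0]))  # 'State-gov'
```

Stripping the space at read time makes later comparisons such as `data["sex"] == "Male"` behave as expected.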
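Description of the columns:
age: age, continuous numeric variable;
workclass: type of employer, multi-category variable;
fnlwgt: the number of people the census takers believe the observation represents, continuous variable;
education: education level, multi-category variable;
education-num: years of education, continuous variable;
marital-status: marital status, multi-category variable;
occupation: occupation, multi-category variable;
relationship: relationship within the household, multi-category variable;
race: race, multi-category variable;
sex: sex, binary variable;
capital-gain: capital gains, continuous variable;
capital-loss: capital losses, continuous variable;
hours-per-week: hours worked per week, continuous variable;
native-country: native country, multi-category variable;
result: the outcome (<=50K or >50K), binary variable;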
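The continuous/categorical distinction in the list above can also be recovered from the column dtypes. A minimal sketch, assuming a small hand-built frame (values taken from the first three sample rows) rather than the full data set; the variable names `continuous_cols` and `categorical_cols` are illustrative:

```python
import pandas as pd

# Tiny frame with a few of the columns; values come from the first sample rows
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "hours-per-week": [40, 13, 40],
    "sex": ["Male", "Male", "Male"],
})

# Numeric dtypes correspond to the continuous variables
continuous_cols = df.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical here
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)   # ['age', 'hours-per-week']
print(categorical_cols)  # ['workclass', 'sex']

# Number of distinct categories per categorical column
print(df[categorical_cols].nunique())
```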
Feature processing
Check for missing values:
# Method 1
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
result            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
# Method 2
print(data.isnull().any())
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
result            False
dtype: bool
print(data.shape)
(32561, 15)
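Both checks report no missing values, but the raw data actually contains many invalid placeholder characters such as ?, . and $. Replace any value containing them with NaN so that it is treated as missing:
import numpy as np
# The special characters must be escaped, otherwise the pattern is not a valid regex
data_clean = data.replace(regex=[r'\?|\.|\$'], value=np.nan)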
print(data_clean.isnull().any())
age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
result            False
dtype: bool
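An alternative to replacing the placeholders after loading is to declare them as missing when the file is read. The sketch below is not the author's approach. It assumes "?" is the placeholder and that fields are separated by a comma plus a space; the column list is shortened and the second and third rows are made up for illustration.

```python
import io

import pandas as pd

# Shortened, illustrative column list (the real file has 15 columns)
col_names = ["age", "workclass", "occupation", "native-country"]

# Made-up rows in the assumed raw layout, with "?" marking unknown values
sample = (
    "39, State-gov, Adm-clerical, United-States\n"
    "54, ?, ?, United-States\n"
    "32, Private, Sales, ?\n"
)

# skipinitialspace strips the space after each comma, so the token is exactly
# "?" and na_values can turn it into NaN during parsing
df = pd.read_csv(io.StringIO(sample), names=col_names,
                 skipinitialspace=True, na_values="?")

# Count missing values per column instead of only flagging them
print(df.isnull().sum())
```

Using `isnull().sum()` rather than `isnull().any()` shows how many values are missing in each column, which helps when deciding between dropping rows and filling values.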
Drop every row that contains a missing value
adult = data_clean.dropna(how='any')
print(adult.shape)
(30162, 15)
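Going from 32561 to 30162 rows means 2399 rows were removed. Before dropping, it can be useful to count how many rows are affected and to compare `how='any'` with a narrower `subset`. A small sketch on a made-up frame (all values here are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up frame: NaN marks values that were invalid placeholders
df = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", np.nan],
    "occupation": ["Sales", np.nan, np.nan, np.nan],
    "age": [39, 54, 32, 25],
})

# Boolean mask of rows with at least one missing value
rows_with_gap = df.isnull().any(axis=1)
print(rows_with_gap.sum(), "of", len(df), "rows contain a missing value")

# how="any" drops a row as soon as one column is missing
cleaned = df.dropna(how="any")
print(cleaned.shape)  # (1, 3)

# subset limits the check to the listed columns
partial = df.dropna(subset=["workclass"])
print(partial.shape)  # (2, 3)
```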
Drop the feature that is not useful
adult = adult.drop(['fnlwgt'],axis=1)
adult.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 14 columns):
age               30162 non-null int64
workclass         30162 non-null object
education         30162 non-null object
education-num     30162 non-null int64
marital-status    30162 non-null object
occupation        30162 non-null object
relationship      30162 non-null object
race              30162 non-null object
sex               30162 non-null object
capital-gain      30162 non-null int64
capital-loss      30162 non-null int64
hours-per-week    30162 non-null int64
native-country    30162 non-null object
result            30162 non-null object
dtypes: int64(5), object(9)
memory usage: 3.5+ MB
Split into training and test sets
# Supervised machine learning
from sklearn.model_selection import train_test_split
# Separate the features from the target
# Note: col_names[1:13] starts at "workclass", so "age" is not among the features
col_names = ["age", "workclass", "education", "education-num", "marital-status", "occupation",
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
X_train, X_test, y_train, y_test = train_test_split(adult[col_names[1:13]], adult[col_names[13]], test_size=0.25, random_state=33)
print(X_train.shape)
print(X_test.shape)
print(X_train.head())
print(y_train.head())
(22621, 12)
(7541, 12)
      workclass     education  education-num      marital-status
20607   Private  Some-college             10  Married-civ-spouse
31257   Private       HS-grad              9  Married-civ-spouse
31892   Private       HS-grad              9       Never-married
20220   Private       HS-grad              9            Divorced
24044   Private  Some-college             10            Divorced

              occupation   relationship   race     sex  capital-gain
20607       Craft-repair        Husband  White    Male             0
31257      Other-service        Husband  Black    Male             0
31892       Adm-clerical  Not-in-family  White  Female             0
20220  Machine-op-inspct      Unmarried  Black  Female             0
24044              Sales  Not-in-family  White  Female             0

       capital-loss  hours-per-week native-country
20607             0              50  United-States
31257             0              50  United-States
31892             0              45  United-States
20220             0              40  United-States
24044             0              45  United-States
20607     >50K
31257    <=50K
31892    <=50K
20220    <=50K
24044     >50K
Name: result, dtype: object
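The positional slice `col_names[1:13]` used in the split leaves out "age", which is why the feature frames have 12 columns. If that omission is not intended, selecting features by name avoids the off-by-one risk. The sketch below shows that selection on a made-up frame, and also passes `stratify` so that both splits keep the same class ratio. The frame `toy` and its values are illustrative assumptions; `stratify` is an addition that the original split does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 12 rows of "<=50K" and 4 rows of ">50K"
toy = pd.DataFrame({
    "age": list(range(20, 36)),
    "hours-per-week": [40] * 8 + [50] * 8,
    "result": ["<=50K"] * 12 + [">50K"] * 4,
})

# Select features by name: every column except the target
feature_cols = [c for c in toy.columns if c != "result"]

X_train, X_test, y_train, y_test = train_test_split(
    toy[feature_cols],
    toy["result"],
    test_size=0.25,
    random_state=33,
    stratify=toy["result"],  # keep the 3:1 class ratio in both splits
)

print(feature_cols)
print(X_train.shape, X_test.shape)
print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

With an imbalanced target such as income bracket, a stratified split keeps the rarer class from being under-represented in the test set by chance.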