XGBoost Modeling

I am 含蓄乐曲, a blogger on 靠谱客 (Kaopuke). This article introduces modeling with XGBoost; I am sharing it here as a reference.

XGBoost parameters

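- Start with a relatively high learning rate. A value of 0.1 usually works, but depending on the problem the ideal learning rate can lie anywhere between 0.05 and 0.3. Then determine the ideal number of trees for that learning rate. XGBoost has a very useful function, "cv", which evaluates the model by cross-validation at every boosting round, so the ideal number of trees can be read off from its results.
- With the learning rate and the number of trees fixed, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree). Different values can be chosen when building each tree; examples follow later.
- Tune XGBoost's regularization parameters (lambda, alpha). These parameters reduce model complexity, which can improve how well the model generalizes.
- Lower the learning rate and settle on the final parameters.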

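1. Reading libsvm-format data and training with specified parameters

How to use XGBoost

① XGBoost's native data format + XGBoost's native training API
- Load the data as an xgb.DMatrix (from a libsvm file, or from dataframe.values, supplying X and y)
- Prepare a watch_list (the datasets to monitor and evaluate on)
- bst = xgb.train(param, dtrain, ...)
- bst.predict(dtest)
② pandas DataFrame data + XGBoost's sklearn interface
- estimator = xgb.XGBClassifier()/xgb.XGBRegressor()
- estimator.fit(df_train.values, df_target.values)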

#!/usr/bin/python
import numpy as np
import xgboost as xgb

# Basic example: read data from libsvm files and train a binary classifier.
# The data is in libsvm format, one sample per line: label index:value index:value ...
#1 3:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 41:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 105:1 117:1 124:1
#0 3:1 10:1 20:1 21:1 23:1 34:1 36:1 39:1 41:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 120:1
#0 1:1 10:1 19:1 21:1 24:1 34:1 36:1 39:1 42:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 122:1
# Load the files as DMatrix objects
dtrain = xgb.DMatrix('./data/agaricus.txt.train')
dtest = xgb.DMatrix('./data/agaricus.txt.test')
# Hyperparameters ('silent' is deprecated in later XGBoost releases in favor of 'verbosity')
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
# Watchlist: datasets whose metrics are reported after every boosting round
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)
# Predict with the trained model
preds = bst.predict(dtest)
# Compute the error rate on the test set
labels = dtest.get_label()
print('Error rate: %f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))
# Save the model
bst.save_model('./model/0001.model')

[15:49:14] 6513x127 matrix with 143286 entries loaded from ./data/agaricus.txt.train
[15:49:14] 1611x127 matrix with 35442 entries loaded from ./data/agaricus.txt.test
[0] eval-error:0.042831 train-error:0.046522
[1] eval-error:0.021726 train-error:0.022263
Error rate: 0.021726
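Each line of a libsvm file, like the three samples quoted in the script's comments, is a label followed by sparse index:value pairs; features that are not listed are simply absent from the line. The sketch below parses one such line by hand to make the layout concrete. The function parse_libsvm_line is a hypothetical helper written for this illustration; it is not part of XGBoost, which reads these files itself.

```python
def parse_libsvm_line(line):
    """Split one libsvm-formatted line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# A shortened version of the first sample shown in the script's comments.
sample = "1 3:1 10:1 11:1 21:1 30:1"
label, features = parse_libsvm_line(sample)
print(label)
print(features)
```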

2. Modeling with pandas DataFrame data

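# Pima Indians Diabetes dataset. Fields: number of pregnancies; plasma glucose concentration from an oral glucose tolerance test; diastolic blood pressure (mm Hg); triceps skinfold thickness (mm);
# 2-hour serum insulin (μU/ml); body mass index (kg/(height in m)^2); diabetes pedigree function; age (years)
import pandas as pd
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
data.head()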

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

#!/usr/bin/python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Basic example: read data from a CSV file and train a binary classifier
# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
# Split into training and test sets
train, test = train_test_split(data)
# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)
# Parameters
param = {'max_depth': 5, 'eta': 0.1, 'silent': 1, 'subsample': 0.7, 'colsample_bytree': 0.7, 'objective': 'binary:logistic'}
# Watchlist used to monitor the model during training
watchlist = [(xgtest, 'eval'), (xgtrain, 'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)
# Predict with the trained model
preds = bst.predict(xgtest)
# Compute the error rate on the test set
labels = xgtest.get_label()
print('Error rate: %f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))
# Save the model
bst.save_model('./model/0002.model')

[0] eval-error:0.322917 train-error:0.21875
[1] eval-error:0.244792 train-error:0.168403
[2] eval-error:0.255208 train-error:0.182292
[3] eval-error:0.270833 train-error:0.170139
[4] eval-error:0.244792 train-error:0.144097
[5] eval-error:0.25 train-error:0.145833
[6] eval-error:0.229167 train-error:0.144097
[7] eval-error:0.25 train-error:0.145833
[8] eval-error:0.239583 train-error:0.147569
[9] eval-error:0.234375 train-error:0.140625
Error rate: 0.234375
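The error rate in these scripts is computed with a generator loop over the predictions. The same quantity can be written as a single vectorized NumPy expression, shown here next to the loop form. The probabilities and labels are made-up values for illustration, not results from the Pima dataset.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels, for illustration only.
preds = np.array([0.9, 0.2, 0.6, 0.4, 0.1])
labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0])

# Loop form, as used in the scripts above.
loop_error = sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))

# Vectorized form: threshold, compare element-wise, average the mismatches.
vector_error = np.mean((preds > 0.5) != labels)

print(loop_error, vector_error)
```

Both forms threshold the probability at 0.5 and count disagreements with the label; the vectorized one avoids the Python-level loop.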

3. Using XGBoost's sklearn wrapper

#!/usr/bin/python
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Basic example: read data from a CSV file and train a binary classifier
# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
# Split into training and test sets
train, test = train_test_split(data)
# Extract the features X and the target y
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
train_X = train[feature_columns].values
train_y = train[target_column].values
test_X = test[feature_columns].values
test_y = test[target_column].values
# Initialize the model
xgb_classifier = xgb.XGBClassifier(n_estimators=20,
                                   max_depth=4,
                                   learning_rate=0.1,
                                   subsample=0.7,
                                   colsample_bytree=0.7)
# Fit the model
xgb_classifier.fit(train_X, train_y)
# Predict with the trained model (predict returns class labels, not probabilities)
preds = xgb_classifier.predict(test_X)
# Compute the error rate on the test set
print('Error rate: %f' % ((preds != test_y).sum() / float(test_y.shape[0])))
# Save the model
joblib.dump(xgb_classifier, './model/0003.model')

Error rate: 0.276042
['./model/0003.model']

4. Cross-validation

# Reuses param, dtrain and num_round as defined in the earlier examples
xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed=0)

	train-error-mean	train-error-std	test-error-mean	test-error-std
0	0.006832	0.001012	0.006756	0.001407
1	0.002994	0.002806	0.002303	0.002524
2	0.001382	0.000352	0.001382	0.001228
3	0.001190	0.000658	0.001382	0.001228
4	0.001382	0.000282	0.001075	0.000921
5	0.000921	0.000506	0.001228	0.001041
6	0.000921	0.000506	0.001228	0.001041
7	0.000921	0.000506	0.001228	0.001041
8	0.000921	0.000506	0.001228	0.001041
9	0.000921	0.000506	0.001228	0.001041

5. Cross-validation with preprocessing

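# Compute the negative-to-positive sample ratio and use it to reweight the positive class
def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    ratio = float(np.sum(label == 0)) / np.sum(label == 1)
    param['scale_pos_weight'] = ratio
    return (dtrain, dtest, param)

# The preprocessing step runs first on each fold (computing the class weight), then cross-validation is performed
xgb.cv(param, dtrain, num_round, nfold=5, metrics={'auc'}, seed=0, fpreproc=fpreproc)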

	train-auc-mean	train-auc-std	test-auc-mean	test-auc-std
0	0.999772	0.000126	0.999731	0.000191
1	0.999942	0.000044	0.999909	0.000085
2	0.999964	0.000035	0.999926	0.000084
3	0.999979	0.000036	0.999950	0.000089
4	0.999976	0.000043	0.999946	0.000098
5	0.999994	0.000010	0.999988	0.000020
6	0.999993	0.000012	0.999988	0.000020
7	0.999993	0.000012	0.999988	0.000020
8	0.999993	0.000012	0.999988	0.000020
9	0.999993	0.000012	0.999988	0.000020
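The value that fpreproc stores in scale_pos_weight is simply the count of negatives divided by the count of positives in the training part of each fold. A small worked example, with made-up labels and a hypothetical helper name, shows the arithmetic in isolation:

```python
import numpy as np


def negative_to_positive_ratio(labels):
    """Return (#negatives / #positives), the ratio used for scale_pos_weight."""
    labels = np.asarray(labels)
    return float(np.sum(labels == 0)) / np.sum(labels == 1)


# Hypothetical imbalanced labels: 8 negatives and 2 positives.
labels = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])
ratio = negative_to_positive_ratio(labels)
print(ratio)
```

With this ratio, each positive sample is weighted like four negatives, so the two classes contribute equal total weight. Because fpreproc receives the training split of each fold, the weight is derived from training data only.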

6. Custom loss functions and evaluation metrics

print('running cross validation, with customized loss function')
# A custom loss function must return the first- and second-order derivatives of the loss (gradient and Hessian)
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

# Custom evaluation metric: measures the gap between the predictions and the true labels.
# With a custom objective, preds are raw margins (before the sigmoid), so the cutoff is 0.0 rather than 0.5
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)

watchlist = [(dtest, 'eval'), (dtrain, 'train')]
param = {'max_depth': 3, 'eta': 0.1, 'silent': 1}
num_round = 5
# Train with the custom loss function
bst = xgb.train(param, dtrain, num_round, watchlist, obj=logregobj, feval=evalerror)
# Cross-validation
xgb.cv(param, dtrain, num_round, nfold=5, seed=0, obj=logregobj, feval=evalerror)

running cross validation, with customized loss function
[0] eval-rmse:0.306902 train-rmse:0.306163 eval-error:0.518312 train-error:0.517887
[1] eval-rmse:0.17919 train-rmse:0.177276 eval-error:0.518312 train-error:0.517887
[2] eval-rmse:0.172566 train-rmse:0.171727 eval-error:0.016139 train-error:0.014433
[3] eval-rmse:0.269611 train-rmse:0.271113 eval-error:0.016139 train-error:0.014433
[4] eval-rmse:0.396904 train-rmse:0.398245 eval-error:0.016139 train-error:0.014433

	train-error-mean	train-error-std	train-rmse-mean	train-rmse-std	test-error-mean	test-error-std	test-rmse-mean	test-rmse-std
0	0.517887	0.001085	0.308880	0.005170	0.517886	0.004343	0.309038	0.005207
1	0.517887	0.001085	0.176504	0.002046	0.517886	0.004343	0.177802	0.003767
2	0.014433	0.000223	0.172680	0.003719	0.014433	0.000892	0.174890	0.009391
3	0.014433	0.000223	0.275761	0.001776	0.014433	0.000892	0.276689	0.005918
4	0.014433	0.000223	0.399889	0.003369	0.014433	0.000892	0.400118	0.006243
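The gradient and Hessian returned by logregobj (p - y and p(1 - p), where p is the sigmoid of the raw margin) are the first and second derivatives of the logistic loss with respect to that margin. One way to gain confidence in a custom objective is to compare such formulas against finite differences. The check below is an illustrative sketch with arbitrary test values; it uses only the standard library and does not call XGBoost.

```python
import math


def logistic_loss(margin, label):
    """Negative log-likelihood of a label in {0, 1} given a raw margin."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def analytic_grad_hess(margin, label):
    """Same formulas as logregobj: grad = p - y, hess = p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - label, p * (1.0 - p)


# Arbitrary test point (assumed values, for illustration only).
margin, label, eps = 0.7, 1.0, 1e-4
grad, hess = analytic_grad_hess(margin, label)

# Central finite differences of the loss with respect to the margin.
num_grad = (logistic_loss(margin + eps, label) - logistic_loss(margin - eps, label)) / (2 * eps)
num_hess = (logistic_loss(margin + eps, label) - 2 * logistic_loss(margin, label)
            + logistic_loss(margin - eps, label)) / eps ** 2

print(grad, num_grad)
print(hess, num_hess)
```

The analytic and numerical values should agree to several decimal places; a larger gap would point to a mistake in the derivative formulas.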

7. Predicting with only the first n trees

#!/usr/bin/python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Basic example: read data from a CSV file and train a binary classifier
# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
# Split into training and test sets
train, test = train_test_split(data)
# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)
# Parameters
param = {'max_depth': 5, 'eta': 0.1, 'silent': 1, 'subsample': 0.7, 'colsample_bytree': 0.7, 'objective': 'binary:logistic'}
# Watchlist used to monitor the model during training
watchlist = [(xgtest, 'eval'), (xgtrain, 'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)
# ntree_limit is the argument in the XGBoost version used here; later releases replace it with iteration_range
# Predict using only the first tree
ypred1 = bst.predict(xgtest, ntree_limit=1)
# Predict using the first 9 trees
ypred2 = bst.predict(xgtest, ntree_limit=9)
label = xgtest.get_label()
print('Error rate using the first 1 tree: %f' % (np.sum((ypred1 > 0.5) != label) / float(len(label))))
print('Error rate using the first 9 trees: %f' % (np.sum((ypred2 > 0.5) != label) / float(len(label))))

[0] eval-error:0.28125 train-error:0.203125
[1] eval-error:0.182292 train-error:0.1875
[2] eval-error:0.21875 train-error:0.184028
[3] eval-error:0.213542 train-error:0.175347
[4] eval-error:0.223958 train-error:0.164931
[5] eval-error:0.223958 train-error:0.164931
[6] eval-error:0.208333 train-error:0.164931
[7] eval-error:0.192708 train-error:0.15625
[8] eval-error:0.21875 train-error:0.15625
[9] eval-error:0.208333 train-error:0.147569
Error rate using the first 1 tree: 0.281250
Error rate using the first 9 trees: 0.218750
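Conceptually, a boosted model's raw score is a sum of per-tree outputs, so predicting with the first n trees means truncating that sum before applying the sigmoid. This is also why the two error rates printed above match the eval-error logged at rounds [0] and [8]. The sketch below illustrates the idea with made-up per-tree contributions and assumes a zero initial margin; it is a conceptual model, not XGBoost's implementation.

```python
import numpy as np

# Hypothetical margin contributions: rows are trees, columns are samples.
tree_outputs = np.array([
    [0.4, -0.3, 0.1],    # tree 1
    [0.3, -0.2, -0.2],   # tree 2
    [0.2, -0.1, -0.3],   # tree 3
])


def predict_first_n(tree_outputs, n):
    """Sum the first n trees' outputs and map the margin to a probability."""
    margin = tree_outputs[:n].sum(axis=0)
    return 1.0 / (1.0 + np.exp(-margin))


p1 = predict_first_n(tree_outputs, 1)
p3 = predict_first_n(tree_outputs, 3)
print(p1 > 0.5)
print(p3 > 0.5)
```

In this toy example the third sample is classified as positive by the first tree alone but as negative once all three trees are summed, which is the kind of change the ntree_limit comparison exposes.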

Using sklearn together with XGBoost

1. Modeling with XGBoost, evaluating with sklearn

import xgboost as xgb
import numpy as np
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
from sklearn.datasets import load_iris, load_digits, load_boston

rng = np.random.RandomState(31337)

# Binary classification: confusion matrix
print("Binary classification of the digits 0 and 1")
digits = load_digits(n_class=2)
y = digits['target']
X = digits['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2 folds")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("Confusion matrix:")
    print(confusion_matrix(actuals, predictions))

# Multiclass classification: confusion matrix
print("\nIris: multiclass classification")
iris = load_iris()
y = iris['target']
X = iris['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2 folds")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("Confusion matrix:")
    print(confusion_matrix(actuals, predictions))

# Regression: MSE
print("\nBoston housing price regression")
boston = load_boston()
y = boston['target']
X = boston['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2 folds")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("MSE:", mean_squared_error(actuals, predictions))

Binary classification of the digits 0 and 1
Cross-validation on 2 folds
Confusion matrix:
[[87  0]
 [ 1 92]]
Confusion matrix:
[[91  0]
 [ 3 86]]

Iris: multiclass classification
Cross-validation on 2 folds
Confusion matrix:
[[19  0  0]
 [ 0 31  3]
 [ 0  1 21]]
Confusion matrix:
[[31  0  0]
 [ 0 16  0]
 [ 0  3 25]]

Boston housing price regression
Cross-validation on 2 folds
[15:53:36] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MSE: 9.860776812557337
[15:53:36] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MSE: 15.942418468446029
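In sklearn's confusion_matrix, rows correspond to the actual classes and columns to the predicted classes, so the diagonal holds the correct predictions. To make that layout concrete, the sketch below builds the same kind of matrix by hand. The labels and the helper name confusion_counts are made up for illustration.

```python
import numpy as np


def confusion_counts(actuals, predictions, n_classes):
    """Count samples per (actual class, predicted class) pair."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(actuals, predictions):
        matrix[actual, predicted] += 1
    return matrix


# Hypothetical labels for a 3-class problem.
actuals = [0, 0, 1, 1, 1, 2, 2]
predictions = [0, 0, 1, 2, 1, 2, 1]
cm = confusion_counts(actuals, predictions, 3)
print(cm)

# Accuracy is the diagonal (correct predictions) over the total count.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

Read this way, an off-diagonal entry in row i, column j counts samples of class i that were predicted as class j.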

2. Grid search for the best hyperparameters

# Tuning for the second training approach: sklearn-interface regressor + GridSearchCV
print("Parameter optimization:")
y = boston['target']
X = boston['data']
xgb_model = xgb.XGBRegressor()
param_dict = {'max_depth': [2, 4, 6],
              'n_estimators': [50, 100, 200]}
clf = GridSearchCV(xgb_model, param_dict, verbose=1)
clf.fit(X, y)
print(clf.best_score_)
print(clf.best_params_)

Parameter optimization:
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[15:53:37] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[... the same warning is repeated for every fit, 28 lines in total ...]
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
0.6001029721598573
{'max_depth': 4, 'n_estimators': 100}
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    0.7s finished
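The "9 candidates" and "27 fits" in the log follow directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. The sketch below reproduces that count with the standard library only; it enumerates the grid by hand for illustration and is not how GridSearchCV is implemented.

```python
from itertools import product

param_dict = {'max_depth': [2, 4, 6], 'n_estimators': [50, 100, 200]}

# Every combination of the listed values is one candidate.
names = sorted(param_dict)
candidates = [dict(zip(names, values))
              for values in product(*(param_dict[name] for name in names))]

n_folds = 3  # the fold count reported in the log
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

Since no scoring argument is passed, best_score_ is the mean cross-validated value of the estimator's own score method, which for a regressor is R², so the 0.60 printed here is an R² rather than an error.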

3. Early stopping

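# Tuning for either training approach (native or sklearn interface): early stopping
# Fit on the training set, adding trees one at a time while monitoring the validation set; stop adding trees once the validation metric no longer improves
# Note: newer XGBoost releases expect early_stopping_rounds and eval_metric in the XGBClassifier constructor rather than in fit()
X = digits['data']
y = digits['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf = xgb.XGBClassifier()
clf.fit(X_train,
        y_train,
        early_stopping_rounds=10,
        eval_metric="auc",
        eval_set=[(X_val, y_val)])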

[0] validation_0-auc:0.999497
Will train until validation_0-auc hasn't improved in 10 rounds.
[1] validation_0-auc:0.999497
[2] validation_0-auc:0.999497
[3] validation_0-auc:0.999749
[4] validation_0-auc:0.999749
[5] validation_0-auc:0.999749
[6] validation_0-auc:0.999749
[7] validation_0-auc:0.999749
[8] validation_0-auc:0.999749
[9] validation_0-auc:0.999749
[10] validation_0-auc:1
[11] validation_0-auc:1
[12] validation_0-auc:1
[13] validation_0-auc:1
[14] validation_0-auc:1
[15] validation_0-auc:1
[16] validation_0-auc:1
[17] validation_0-auc:1
[18] validation_0-auc:1
[19] validation_0-auc:1
[20] validation_0-auc:1
Stopping. Best iteration:
[10] validation_0-auc:1
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
              max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
              n_estimators=100, n_jobs=1, nthread=None,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
              subsample=1, verbosity=1)
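The log shows the stopping rule at work: the validation AUC last improved at round 10, and training stopped at round 20, ten rounds later. The sketch below replays that rule on the logged AUC values. It is a simplified re-implementation for illustration, assuming a metric where higher is better; it is not XGBoost's internal code, and early_stop is a hypothetical helper.

```python
def early_stop(scores, patience):
    """Return (best_round, stop_round) for a metric where higher is better."""
    best_round, best_score = 0, scores[0]
    for round_idx, score in enumerate(scores):
        if score > best_score:
            # Strict improvement: remember this round as the new best.
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_idx
    return best_round, len(scores) - 1


# Validation AUC per round, copied from the log (rounds 0 to 20).
auc_log = [0.999497] * 3 + [0.999749] * 7 + [1.0] * 11
best_round, stop_round = early_stop(auc_log, patience=10)
print(best_round, stop_round)
```

Ties do not count as improvements, which is why the run of identical AUC values after round 10 eventually triggers the stop.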

4. Feature importance

iris = load_iris()
y = iris['target']
X = iris['data']
xgb_model = xgb.XGBClassifier().fit(X, y)
print('Feature ranking:')
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# Get the feature importances
feature_importances = xgb_model.feature_importances_
# Indices of the features sorted from most to least important
indices = np.argsort(feature_importances)[::-1]
for index in indices:
    print("Feature %s has importance %f" % (feature_names[index], feature_importances[index]))
# Jupyter notebook magic; drop the next line when running as a plain script
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 8))
plt.title("feature importances")
plt.bar(range(len(feature_importances)), feature_importances[indices], color='b')
plt.xticks(range(len(feature_importances)), np.array(feature_names)[indices], color='b')

Feature ranking:
Feature petal_length has importance 0.595834
Feature petal_width has importance 0.358166
Feature sepal_width has importance 0.033481
Feature sepal_length has importance 0.012520
([<matplotlib.axis.XTick at 0x1ed5a5bc7b8>,
<matplotlib.axis.XTick at 0x1ed5a3e6278>,
<matplotlib.axis.XTick at 0x1ed5a65c780>,
<matplotlib.axis.XTick at 0x1ed5a669748>],
<a list of 4 Text xticklabel objects>)

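The ranking in this example relies on np.argsort, which sorts in ascending order; reversing its result gives the indices from most to least important. The short sketch below reproduces the ordering from the four importance values printed in the output of this example, without training a model.

```python
import numpy as np

feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# Importance values copied from the printed output, in feature_names order.
feature_importances = np.array([0.012520, 0.033481, 0.595834, 0.358166])

# argsort is ascending; reversing it yields most-important-first indices.
indices = np.argsort(feature_importances)[::-1]
ranked = [feature_names[i] for i in indices]
print(ranked)
```

The four printed values add up to roughly 1, so each one can be read as that feature's share of the total importance.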
5. Speeding up training with parallelism

import os

if __name__ == "__main__":
    # The start method must be set inside the __main__ guard, before any worker processes are created
    try:
        from multiprocessing import set_start_method
    except ImportError:
        raise ImportError("Unable to import multiprocessing.set_start_method."
                          " This example only runs on Python 3.4")
    set_start_method("forkserver")

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    # Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
    from sklearn.datasets import load_boston
    import xgboost as xgb

    rng = np.random.RandomState(31337)

    print("Parallel Parameter optimization")
    boston = load_boston()

    os.environ["OMP_NUM_THREADS"] = "2"  # or to whatever you want
    y = boston['target']
    X = boston['data']
    xgb_model = xgb.XGBRegressor()
    # n_jobs=2 runs the grid-search fits in two parallel worker processes
    clf = GridSearchCV(xgb_model,
                       {'max_depth': [2, 4, 6], 'n_estimators': [50, 100, 200]},
                       verbose=1,
                       n_jobs=2)
    clf.fit(X, y)
    print(clf.best_score_)
    print(clf.best_params_)

Reposted from: https://www.cnblogs.com/chenxiangzhen/p/10962893.html

Category: Artificial Intelligence
Published: 2023-11-16 02:50:03
Article link: https://www.kaopuke.com/article/k-p-k_13_u_23_o_14_f1_13_j_26_5.html
