Detailed Explanation of XGBoost Parameters


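I am 优美豌豆, a blogger on 靠谱客. This article gives a detailed explanation of XGBoost's parameters; I am sharing it here in the hope that it serves as a useful reference.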

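Basic Usage

The parameters that can be specified in XGBoost are listed first, with a detailed explanation of each below.

There are three categories of parameters: general parameters, booster parameters, and task parameters.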

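General Parameters

booster [default=gbtree]
- The booster to use: gbtree (tree-based models) or gblinear (linear models)
silent [default=0]
- 0 prints running messages; 1 is silent mode
nthread
- Number of threads used to run XGBoost; defaults to the maximum number of threads available
num_pbuffer [set automatically, no need to be set by the user]
- Size of the prediction buffer, normally set to the number of training instances. The buffers are used to save the prediction results of the last boosting step.
num_feature [set automatically, no need to be set by the user]
- Feature dimension used in boosting, set to the maximum dimension of the features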

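Booster Parameters

eta [default=0.3, can be regarded as the learning rate]
- Step size shrinkage used in each update to prevent overfitting. After each boosting step, the weights of the new features are obtained directly, and eta shrinks those weights to make the boosting process more conservative.
- range: [0,1]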
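
To make the effect of eta concrete, here is a minimal toy sketch (not XGBoost's actual code). It assumes squared-error loss and replaces each tree with a single constant leaf; the function name `boost_constant` is made up for illustration.

```python
def boost_constant(y, eta, n_rounds):
    """Toy boosting with squared error where every 'tree' is one constant leaf."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        # Best constant leaf for squared error is the mean residual.
        leaf = sum(residuals) / len(residuals)
        # eta shrinks the contribution of the new leaf before adding it.
        pred = [pi + eta * leaf for pi in pred]
    return pred


y = [1.0, 2.0, 3.0]
one_full_step = boost_constant(y, eta=1.0, n_rounds=1)
one_shrunk_step = boost_constant(y, eta=0.3, n_rounds=1)
many_shrunk_steps = boost_constant(y, eta=0.3, n_rounds=50)
```

With eta=1.0 a single round jumps straight to the mean of `y`; with eta=0.3 each round only closes 30% of the remaining gap, so more rounds are needed but each step is more conservative.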
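gamma [default=0, alias: min_split_loss]
- Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm.
- range: [0,∞]
max_depth [default=6]
- Maximum depth of a tree
- range: [1,∞]
min_child_weight [default=1]
- Minimum sum of instance weights (hessian) needed in a child. If a tree partition step produces a leaf node whose sum of instance weights is less than min_child_weight, the building process gives up that partition. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm.
- range: [0,∞]
max_delta_step [default=0]
- A value of 0 means there is no constraint. A positive value makes each update step more conservative. This parameter usually does not need to be set, but it can help in logistic regression when the classes are extremely imbalanced; setting it to a value between 1 and 10 can help control the update.
- range: [0,∞]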
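subsample [default=1]
- Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the instances to grow trees, which helps prevent overfitting.
- range: (0,1]
colsample_bytree [default=1]
- Subsample ratio of columns (features) when constructing each tree
- range: (0,1]
colsample_bylevel [default=1]
- Subsample ratio of columns for each split, in each level of the tree. This parameter is rarely used, because subsample and colsample_bytree can serve a similar purpose.
- range: (0,1]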
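
The sampling parameters above can be pictured with a small sketch. This only illustrates the idea of drawing a row subset and a column subset for one tree; it is not how XGBoost samples internally, and `sample_rows_and_columns` is a hypothetical helper.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, seed=0):
    """Pick the row and column indices one tree would be grown on (illustration only)."""
    rng = random.Random(seed)
    n_row_keep = max(1, int(n_rows * subsample))
    n_col_keep = max(1, int(n_cols * colsample_bytree))
    # Sample without replacement, then sort for readability.
    rows = sorted(rng.sample(range(n_rows), n_row_keep))
    cols = sorted(rng.sample(range(n_cols), n_col_keep))
    return rows, cols


# 10 instances, 4 features, half of each kept for this tree.
rows, cols = sample_rows_and_columns(10, 4, subsample=0.5, colsample_bytree=0.5)
```

Each tree sees a different random subset, which is what makes the ensemble less prone to overfitting.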
lambda [default=1, alias: reg_lambda]
- L2 regularization term on weights
alpha [default=0, alias: reg_alpha]
- L1 regularization term on weights
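
How gamma, min_child_weight, lambda and alpha interact can be sketched with the split-gain formula from the XGBoost paper. This is a simplified illustration under stated assumptions: G and H denote the sums of first- and second-order gradients in a node, alpha is applied as soft-thresholding on G, and all function names are made up for this example. The library's real implementation has more cases (for example max_delta_step).

```python
def threshold_l1(g, alpha):
    """Soft-threshold the gradient sum by the L1 penalty alpha."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(G, H, reg_lambda=1.0, alpha=0.0):
    """Optimal leaf value: larger lambda or alpha shrinks it toward zero."""
    return -threshold_l1(G, alpha) / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0, alpha=0.0):
    """Structure score of a single leaf (higher is better)."""
    return threshold_l1(G, alpha) ** 2 / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, alpha=0.0, gamma=0.0):
    """Loss reduction of splitting a node into left/right children, minus gamma."""
    children = leaf_score(GL, HL, reg_lambda, alpha) + leaf_score(GR, HR, reg_lambda, alpha)
    parent = leaf_score(GL + GR, HL + HR, reg_lambda, alpha)
    return 0.5 * (children - parent) - gamma


def accept_split(GL, HL, GR, HR, reg_lambda=1.0, alpha=0.0, gamma=0.0,
                 min_child_weight=1.0):
    """Reject splits whose children are too light or whose gain is not positive."""
    if HL < min_child_weight or HR < min_child_weight:
        return False
    return split_gain(GL, HL, GR, HR, reg_lambda, alpha, gamma) > 0.0


gain_no_penalty = split_gain(-4.0, 3.0, 4.0, 3.0, gamma=0.0)
kept = accept_split(-4.0, 3.0, 4.0, 3.0, gamma=0.0)
pruned_by_gamma = accept_split(-4.0, 3.0, 4.0, 3.0, gamma=5.0)
pruned_by_weight = accept_split(-4.0, 3.0, 4.0, 3.0, min_child_weight=5.0)
```

In this toy case the same split is kept with gamma=0 but rejected once gamma exceeds its raw gain, or once min_child_weight exceeds the hessian sum of a child.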
tree_method, string [default=‘auto’]
- The tree construction algorithm used in XGBoost (see the description in the reference paper)
- The distributed and external memory versions only support the approximate algorithm.
- Choices: {‘auto’, ‘exact’, ‘approx’}
  - ‘auto’: Use a heuristic to choose the faster one.
    - For small to medium datasets, exact greedy will be used.
    - For very large datasets, the approximate algorithm will be chosen.
    - Because the old behavior was to always use exact greedy on a single machine, the user gets a message when the approximate algorithm is chosen, to notify them of this choice.
  - ‘exact’: Exact greedy algorithm.
  - ‘approx’: Approximate greedy algorithm using sketching and histograms.
sketch_eps, [default=0.03]
- This is only used for approximate greedy algorithm.
- This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy.
- Usually the user does not have to tune this, but consider setting it to a lower number for more accurate enumeration.
- range: (0, 1)
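
As a rough order-of-magnitude illustration of the O(1 / sketch_eps) relationship, the sketch below simply takes the reciprocal. The constant factor hidden by the O() notation is ignored, so this is not the exact bin count XGBoost uses; `approx_num_bins` is a hypothetical helper.

```python
import math


def approx_num_bins(sketch_eps):
    """Order-of-magnitude estimate of candidate bins for a given sketch_eps."""
    if not 0.0 < sketch_eps < 1.0:
        raise ValueError("sketch_eps must lie in (0, 1)")
    return math.ceil(1.0 / sketch_eps)


default_bins = approx_num_bins(0.03)  # the default sketch_eps
finer_bins = approx_num_bins(0.01)    # a lower value gives more bins
```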
scale_pos_weight, [default=1]
- When the classes are highly imbalanced, setting this parameter to a positive value can help the algorithm converge faster.
- A typical value to consider: sum(negative cases) / sum(positive cases). See the Higgs Kaggle competition demo for examples.
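
The suggested ratio can be computed directly from the training labels. The sketch below assumes binary labels encoded as 0 (negative) and 1 (positive); the helper name is made up for illustration.

```python
def suggested_scale_pos_weight(labels):
    """Return sum(negative cases) / sum(positive cases) for 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive cases in labels")
    return n_neg / n_pos


# 90 negatives and 10 positives: a 9-to-1 imbalance.
labels = [0] * 90 + [1] * 10
ratio = suggested_scale_pos_weight(labels)
```

The resulting value would then be used as the scale_pos_weight entry of the parameter dictionary.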
updater, [default=‘grow_colmaker,prune’]
- A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it can also be set explicitly by the user. The following updater plugins exist:
  - ‘grow_colmaker’: non-distributed column-based construction of trees.
  - ‘distcol’: distributed tree construction with column-based data splitting mode.
  - ‘grow_histmaker’: distributed tree construction with row-based data splitting based on global proposal of histogram counting.
  - ‘grow_local_histmaker’: based on local histogram counting.
  - ‘grow_skmaker’: uses the approximate sketching algorithm.
  - ‘sync’: synchronizes trees in all distributed nodes.
  - ‘refresh’: refreshes tree’s statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
  - ‘prune’: prunes the splits where loss < min_split_loss (or gamma).
- In a distributed setting, the implicit updater sequence value would be adjusted as follows:
  - ‘grow_histmaker,prune’ when dsplit=‘row’ (or default) and prob_buffer_row == 1 (or default); or when data has multiple sparse pages
  - ‘grow_histmaker,refresh,prune’ when dsplit=‘row’ and prob_buffer_row < 1
  - ‘distcol’ when dsplit=‘col’
refresh_leaf, [default=1]
- This is a parameter of the ‘refresh’ updater plugin. When this flag is true, tree leaves as well as tree nodes’ stats are updated. When it is false, only node stats are updated.
process_type, [default=‘default’]
- A type of boosting process to run.
- Choices: {‘default’, ‘update’}
  - ‘default’: the normal boosting process which creates new trees.
  - ‘update’: starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updater plugins is run for that tree, and a modified tree is added to the new model. The new model would have either the same or a smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updater plugins can be meaningfully used with this process type: ‘refresh’, ‘prune’. With ‘update’, one cannot use updater plugins that create new trees.

Task Parameters

objective [ default=reg:linear ] This parameter defines the loss function to be minimized. The most commonly used values are:
- “reg:linear” --linear regression
- “reg:logistic” --logistic regression
- “binary:logistic” --logistic regression for binary classification; returns the predicted probability (not the class)
- “binary:logitraw” --logistic regression for binary classification; outputs the raw score before the logistic transformation
- “count:poisson” --Poisson regression for count data; outputs the mean of the Poisson distribution
  - max_delta_step is set to 0.7 by default in Poisson regression (used to safeguard optimization)
- “multi:softmax” --sets XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (the number of classes)
- “multi:softprob” --same as softmax, but outputs a probability matrix of dimension ndata * nclass
- “rank:pairwise” --sets XGBoost to do a ranking task by minimizing the pairwise loss
- “reg:gamma” --gamma regression with log-link. The output is the mean of the gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
- “reg:tweedie” --Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
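
The difference between raw scores and probabilities in the classification objectives above can be sketched in a few lines. This is a plain-Python illustration of the standard logistic and softmax transformations, not code taken from XGBoost; the variable names are assumptions.

```python
import math


def sigmoid(margin):
    """Logistic transformation: raw score (as in binary:logitraw) to probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def softmax(margins):
    """Turn one row of per-class raw scores into class probabilities."""
    m = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in margins]
    total = sum(exps)
    return [e / total for e in exps]


# Binary case: a raw score of 0 corresponds to probability 0.5.
probability = sigmoid(0.0)

# Multiclass case: one row of a softprob-style matrix, and the
# softmax-style class label obtained by taking the argmax.
class_probs = softmax([2.0, 1.0, 0.1])
predicted_class = class_probs.index(max(class_probs))
```

In other words, a softprob-style output keeps the whole probability row per instance, while a softmax-style output keeps only the index of the largest entry.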
base_score [ default=0.5 ]
- the initial prediction score of all instances, global bias
- for sufficient number of iterations, changing this value will not have too much effect.
eval_metric [ default is chosen automatically according to the objective ]
- The available choices are:
  - “rmse”: root mean squared error
  - “mae”: mean absolute error
  - “logloss”: negative log-likelihood
  - “error”: binary classification error rate
  - “error@t”: binary classification error rate computed with t as the threshold (instead of 0.5)
  - “merror”: multiclass classification error rate, calculated as #(wrong cases)/#(all cases)
  - “mlogloss”: multiclass logloss
  - “auc”: area under the ROC curve, for ranking evaluation
  - “ndcg”: Normalized Discounted Cumulative Gain
  - “map”: Mean Average Precision
  - “ndcg@n”, “map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
  - “ndcg-”, “map-”, “ndcg@n-”, “map@n-”: In XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. By adding “-” to the evaluation metric name, XGBoost evaluates these scores as 0, to be consistent under some conditions.
  - “poisson-nloglik”: negative log-likelihood for Poisson regression
  - “gamma-nloglik”: negative log-likelihood for gamma regression
  - “gamma-deviance”: residual deviance for gamma regression
  - “tweedie-nloglik”: negative log-likelihood for Tweedie regression (at a specified value of the tweedie_variance_power parameter)
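
A few of the simpler metrics above can be written out by hand to show what they measure. These are textbook definitions in plain Python, offered as a sketch rather than XGBoost's own implementation; in particular, `error_at` assumes that predictions strictly greater than the threshold count as positive.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def logloss(y_true, p_pred, eps=1e-15):
    """Negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def error_at(y_true, p_pred, t=0.5):
    """Binary error rate when probabilities above t are treated as positive."""
    wrong = sum(1 for y, p in zip(y_true, p_pred) if (1 if p > t else 0) != y)
    return wrong / len(y_true)


y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.7]
err_default = error_at(y_true, p_pred)         # threshold 0.5
err_strict = error_at(y_true, p_pred, t=0.65)  # like "error@0.65"
```

Raising the threshold in this toy data turns the 0.6 prediction into a negative, so the error rate changes even though the predicted probabilities are the same.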
seed [ default=0 ]
- random number seed.
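
Putting the three categories together, a parameter dictionary might be assembled as in the sketch below. The values are simply the defaults listed in this article plus an illustrative objective and metric. The parameter names follow the version described here; more recent XGBoost releases have renamed some of them (for example, `silent` gave way to `verbosity`), so check them against the version you have installed.

```python
# General parameters
general_params = {"booster": "gbtree", "silent": 0}

# Booster parameters (defaults from the list above)
booster_params = {
    "eta": 0.3,
    "gamma": 0,
    "max_depth": 6,
    "min_child_weight": 1,
    "subsample": 1,
    "colsample_bytree": 1,
    "lambda": 1,
    "alpha": 0,
}

# Task parameters (objective and metric chosen here only as an example)
task_params = {"objective": "binary:logistic", "eval_metric": "logloss", "seed": 0}

# Merge the three groups into the single dict that training expects.
params = {**general_params, **booster_params, **task_params}

# With xgboost installed, the dict would typically be passed as:
#   booster = xgboost.train(params, dtrain, num_boost_round=100)
# where dtrain is an xgboost.DMatrix built from your own data (not shown here).
```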