This article, collected by the 靠谱客 blogger 平淡野狼 during recent development work, describes the XGBoost parameters and is shared here for reference.

Overview

    • XGBoost Parameters
    • General Parameters
      • 1 booster [default=gbtree]
      • 2 silent [default=0]
      • 3 nthread [default to maximum number of threads available if not set]
      • 4 num_pbuffer [set automatically by xgboost, no need to be set by user]
      • 5 num_feature [set automatically by xgboost, no need to be set by user]
    • Parameters for Tree Booster
      • 1 eta [default=0.3, alias: learning_rate]
      • 2 gamma [default=0, alias: min_split_loss]
      • 3 max_depth [default=6]
      • 4 min_child_weight [default=1]
      • 5 max_delta_step [default=0]
      • 6 subsample [default=1]
      • 7 colsample_bytree [default=1]
      • 8 colsample_bylevel [default=1]
      • 9 lambda [default=1, alias: reg_lambda]
      • 10 alpha [default=0, alias: reg_alpha]
      • 11 lambda_bias
      • 12 tree_method, string [default=’auto’]
      • 13 sketch_eps [default=0.03]
      • 14 scale_pos_weight [default=1]
      • 15 updater [default=’grow_colmaker,prune’]
      • 16 refresh_leaf [default=1]
      • 17 process_type [default=’default’]
    • Additional parameters for Dart Booster
      • 1 sample_type [default=”uniform”]
      • 2 normalize_type [default=”tree”]
      • 3 rate_drop [default=0.0]
      • 4 one_drop [default=0]
    • Parameters for Linear Booster
      • 1 lambda [default=0, alias: reg_lambda]
      • 2 alpha [default=0, alias: reg_alpha]
      • 3 lambda_bias [default=0, alias: reg_lambda_bias]
    • Parameters for Tweedie Regression
    • Learning Task Parameters
    • Command Line Parameters

1 XGBoost Parameters

Before running XGBoost, we must set three types of parameters:
general parameters, booster parameters and task parameters.
 General parameters relate to which booster we are using to do boosting, commonly a tree or linear model.
 Booster parameters depend on which booster you have chosen.
 Learning task parameters decide on the learning scenario; for example, regression tasks may use different parameters than ranking tasks.
 Command line parameters relate to the behavior of the CLI version of XGBoost.
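
As a rough illustration of this grouping, the sketch below keeps the three groups in separate Python dictionaries and merges them into one flat parameter dictionary. Splitting them into three variables is only an organizational choice made here for readability; the parameter names and values are the defaults listed in this article.

```python
# Illustrative grouping only; the values are the defaults listed in this article.
general_params = {"booster": "gbtree", "silent": 0}
booster_params = {"eta": 0.3, "max_depth": 6, "subsample": 1}
task_params = {"objective": "reg:linear", "base_score": 0.5, "seed": 0}

# Merge the three groups into a single flat dictionary of parameters.
params = {**general_params, **booster_params, **task_params}

for name, value in params.items():
    print(f"{name} = {value}")
```

In the Python package, a flat dictionary of this kind is what is usually handed to the training call; the command line parameters are not part of it and belong to the CLI configuration instead.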


2 General Parameters

(1) booster [default=gbtree]

• Which booster to use: gbtree, gblinear or dart. gbtree and dart use tree-based models, while gblinear uses a linear function. The default is gbtree.

(2) silent [default=0]

0 means printing running messages, 1 means silent mode (no running messages are printed). The default is 0.

(3) nthread [default to maximum number of threads available if not set]

• Number of parallel threads used to run XGBoost. The default is the maximum number of threads available on the current system.

(4) num_pbuffer [set automatically by xgboost, no need to be set by user]

Size of the prediction buffer, normally set to the number of training instances. The buffers are used to save the prediction results of the last boosting step.

(5) num_feature [set automatically by xgboost, no need to be set by user]

• Feature dimension used in boosting, set to the maximum dimension of the features. XGBoost sets this automatically; there is no need to set it by hand.


3 Parameters for Tree Booster

(1) eta [default=0.3, alias: learning_rate]

 Step size shrinkage used in the update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
 range: [0,1]
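
To make the effect of eta concrete, here is a simplified sketch in which the final prediction is the base score plus the sum of the tree outputs, each scaled by eta. The function name and the tree outputs are hypothetical, and real implementations may fold the scaling into the stored leaf values rather than applying it at prediction time.

```python
def boosted_prediction(base_score, tree_outputs, eta):
    """Add up the base score and the eta-scaled contribution of each tree."""
    prediction = base_score
    for output in tree_outputs:
        prediction += eta * output  # eta shrinks every new tree's contribution
    return prediction


# Hypothetical raw outputs of three trees for a single instance.
tree_outputs = [1.0, 0.5, 0.25]

full = boosted_prediction(0.5, tree_outputs, eta=1.0)    # no shrinkage
shrunk = boosted_prediction(0.5, tree_outputs, eta=0.3)  # default eta

print(full, shrunk)
```

With a smaller eta each tree moves the prediction less, so more boosting rounds are usually needed to reach the same fit.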

(2) gamma [default=0, alias: min_split_loss]

 minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
 range: [0,∞]
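
One way to read this parameter is through the split gain formula from the XGBoost paper: a split is only worth keeping if the gain that remains after subtracting gamma is positive. The sketch below is an assumption-laden illustration rather than the library's code; the function name and the gradient and hessian sums are made up for the example.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty.

    g_* and h_* are the sums of first and second order gradients of the
    instances that fall into the left and right child.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return gain - gamma


# Hypothetical statistics for one candidate split.
print(split_gain(-4.0, 3.0, 5.0, 4.0))             # gamma = 0: split is kept
print(split_gain(-4.0, 3.0, 5.0, 4.0, gamma=5.0))  # large gamma: split is rejected
```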

(3) max_depth [default=6]

 Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
 range: [1,∞]

(4) min_child_weight [default=1]

 Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm will be.
 range: [0,∞]
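
The link between "sum of hessian" and "number of instances" can be sketched as follows. For squared error the hessian of every instance is 1, so the sum is just a count; for logistic loss the hessian is p * (1 - p), which becomes tiny for confidently predicted instances. The function names and the margin values are assumptions made for this example.

```python
import math


def hessian_squared_error():
    """Second derivative of 0.5 * (prediction - label)^2 is always 1."""
    return 1.0


def hessian_logistic(margin):
    """Second derivative of the logistic loss at a given raw margin."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p * (1.0 - p)


def child_is_allowed(hessians, min_child_weight=1.0):
    """A child is kept only if its hessian sum reaches min_child_weight."""
    return sum(hessians) >= min_child_weight


# Three instances under squared error: the hessian sum equals the count (3).
print(child_is_allowed([hessian_squared_error()] * 3))

# Three confidently predicted instances under logistic loss: the sum is far below 1.
print(child_is_allowed([hessian_logistic(6.0)] * 3))
```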

(5) max_delta_step [default=0]

 Maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update.
 range: [0,∞]

(6) subsample [default=1]

 Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which helps prevent overfitting.
 range: (0,1]

(7) colsample_bytree [default=1]

 Subsample ratio of columns when constructing each tree.
 range: (0,1]

(8) colsample_bylevel [default=1]

 Subsample ratio of columns for each split, in each level. The default is 1.
 range: (0,1]
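
The two column sampling ratios can be pictured with the sketch below, which first draws the columns for a tree and then draws the columns for one level from that subset. That the per-level sample is taken from the per-tree sample is an assumption made here, as are the helper name, the rounding rule and the fixed seed.

```python
import random


def sample_columns(columns, ratio, rng):
    """Pick roughly ratio * len(columns) columns without replacement."""
    k = max(1, int(round(ratio * len(columns))))
    return sorted(rng.sample(columns, k))


rng = random.Random(0)          # fixed seed so the sketch is repeatable
all_columns = list(range(10))   # hypothetical dataset with 10 features

# colsample_bytree = 0.5: columns available to one tree.
tree_columns = sample_columns(all_columns, 0.5, rng)

# colsample_bylevel = 0.6: columns considered at one depth level of that tree.
level_columns = sample_columns(tree_columns, 0.6, rng)

print(tree_columns, level_columns)
```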

(9) lambda [default=1, alias: reg_lambda]

L2 regularization term (penalty coefficient) on weights. Increasing this value makes the model more conservative.

(10) alpha [default=0, alias: reg_alpha]

L1 regularization term (penalty coefficient) on weights. Increasing this value makes the model more conservative.
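
How lambda and alpha make the model "more conservative" can be sketched with the usual closed form for a leaf weight: alpha soft-thresholds the gradient sum and lambda enlarges the denominator, so both pull the weight towards zero. This is an illustration under that assumption, not the library's source; the function name and the input statistics are hypothetical, and the eta scaling is left out.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight with L1 (alpha) and L2 (lambda) regularization."""
    # Soft-threshold the gradient sum by alpha (the L1 part).
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    # Dividing by H + lambda is the L2 part.
    return -shrunk / (hess_sum + reg_lambda)


# Hypothetical leaf with gradient sum -4 and hessian sum 3.
print(leaf_weight(-4.0, 3.0))                  # defaults: lambda=1, alpha=0
print(leaf_weight(-4.0, 3.0, reg_lambda=3.0))  # larger lambda, smaller weight
print(leaf_weight(-4.0, 3.0, reg_alpha=1.0))   # alpha shrinks the gradient sum
print(leaf_weight(-4.0, 3.0, reg_alpha=5.0))   # alpha >= |G| zeroes the weight
```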

(11) lambda_bias

L2 regularization term on bias. The default is 0 (there is no L1 regularization on bias because it is not important).

(12) tree_method, string [default=’auto’]

1) The tree construction algorithm used in XGBoost (see the description in the reference paper).
2) The distributed and external memory versions only support the approximate algorithm.
3) Choices: {‘auto’, ‘exact’, ‘approx’}

   ‘auto’: Use a heuristic to choose the faster one.
   For small to medium datasets, exact greedy will be used.
   For very large datasets, the approximate algorithm will be chosen.
   Because the old behavior was to always use exact greedy on a single machine, the user gets a message when the approximate algorithm is chosen, to notify them of this choice.
   ‘exact’: Exact greedy algorithm.
   ‘approx’: Approximate greedy algorithm using sketching and histograms.

(13) sketch_eps, [default=0.03]

   This is only used for the approximate greedy algorithm.
   It roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy.
   Usually the user does not have to tune this, but consider setting it to a lower number for more accurate enumeration.
   range: (0, 1)
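
As a back-of-the-envelope reading of the O(1 / sketch_eps) statement, the sketch below simply takes 1 / sketch_eps as the order of magnitude of the bin count. The O(...) notation hides a constant factor, so the numbers are only indicative, and the helper name is made up for this example.

```python
def approx_bin_count(sketch_eps):
    """Order-of-magnitude bin count implied by sketch_eps (constant factor ignored)."""
    if not 0.0 < sketch_eps < 1.0:
        raise ValueError("sketch_eps must lie in the open interval (0, 1)")
    return round(1.0 / sketch_eps)


print(approx_bin_count(0.03))  # the default value
print(approx_bin_count(0.01))  # a lower sketch_eps means more, finer bins
```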

(14) scale_pos_weight, [default=1]

Controls the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). See Parameters Tuning for more discussion, and the Higgs Kaggle competition demo for examples.
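
The typical value mentioned above can be computed directly from the training labels. The sketch below assumes binary labels coded as 0 and 1; the helper name and the label list are hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Return count(negative cases) / count(positive cases) for 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive cases, the ratio is undefined")
    return n_neg / n_pos


# Hypothetical imbalanced label set: 8 negatives, 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(suggested_scale_pos_weight(labels))
```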

(15) updater, [default=’grow_colmaker,prune’]

1) A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it can also be set explicitly by a user. The following updater plugins exist:

   ‘grow_colmaker’: non-distributed column-based construction of trees.
   ‘distcol’: distributed tree construction with column-based data splitting mode.
   ‘grow_histmaker’: distributed tree construction with row-based data splitting based on global proposal of histogram counting.
   ‘grow_local_histmaker’: based on local histogram counting.
   ‘grow_skmaker’: uses the approximate sketching algorithm.
   ‘sync’: synchronizes trees in all distributed nodes.
   ‘refresh’: refreshes tree’s statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
   ‘prune’: prunes the splits where loss < min_split_loss (or gamma).

2) In a distributed setting, the implicit updater sequence value would be adjusted as follows:

   ‘grow_histmaker, prune’ when dsplit=’row’ (or default) and prob_buffer_row == 1 (or default); or when data has multiple sparse pages
   ‘grow_histmaker, refresh, prune’ when dsplit=’row’ and prob_buffer_row < 1
   ‘distcol’ when dsplit=’col’

(16) refresh_leaf, [default=1]

This is a parameter of the ‘refresh’ updater plugin. When this flag is true, tree leaves as well as tree nodes’ stats are updated. When it is false, only node stats are updated.

(17) process_type, [default=’default’]

1) A type of boosting process to run.
2) Choices: {‘default’, ‘update’}
 ‘default’: the normal boosting process which creates new trees.
 ‘update’: starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updater plugins is run for that tree, and a modified tree is added to the new model. The new model will have either the same or a smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updater plugins can be meaningfully used with this process type: ‘refresh’, ‘prune’. With ‘update’, one cannot use updater plugins that create new trees.


4 Additional parameters for Dart Booster

(1) sample_type [default=”uniform”]

type of sampling algorithm.
   “uniform”: dropped trees are selected uniformly.
   “weighted”: dropped trees are selected in proportion to weight.

(2) normalize_type [default=”tree”]

type of normalization algorithm (k below is the number of dropped trees).

   “tree”: new trees have the same weight as each of the dropped trees.
   weight of new trees is 1 / (k + learning_rate)
   dropped trees are scaled by a factor of k / (k + learning_rate)
   “forest”: new trees have the same weight as the sum of the dropped trees (forest).
   weight of new trees is 1 / (1 + learning_rate)
   dropped trees are scaled by a factor of 1 / (1 + learning_rate)
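
The two normalization rules can be written out as a small function that returns the weight of the new tree and the scaling factor of the dropped trees. The function name is hypothetical, and k is taken to be the number of dropped trees.

```python
def dart_weights(normalize_type, k, learning_rate):
    """Return (weight of the new tree, scale factor for the dropped trees)."""
    if normalize_type == "tree":
        denom = k + learning_rate
        return 1.0 / denom, k / denom
    if normalize_type == "forest":
        denom = 1.0 + learning_rate
        return 1.0 / denom, 1.0 / denom
    raise ValueError(f"unknown normalize_type: {normalize_type!r}")


# Example: 3 dropped trees and learning_rate = 0.5.
print(dart_weights("tree", 3, 0.5))
print(dart_weights("forest", 3, 0.5))
```

Under “forest” the result does not depend on k, because the dropped trees are treated as one unit.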

(3) rate_drop [default=0.0]

dropout rate (a fraction of previous trees to drop during the dropout).
range: [0.0, 1.0]

(4) one_drop [default=0]

when this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper).

(5) skip_drop [default=0.0]

   Probability of skipping the dropout procedure during a boosting iteration.
   If a dropout is skipped, new trees are added in the same manner as gbtree.
   Note that non-zero skip_drop has higher priority than rate_drop or one_drop.
   range: [0.0, 1.0]
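
How rate_drop, one_drop and skip_drop interact in one boosting iteration can be sketched as below. This is a simplified reading of the descriptions above for the “uniform” sample_type only, not the library's implementation; the function name and the use of Python's random module are assumptions.

```python
import random


def select_dropped_trees(n_trees, rate_drop, one_drop, skip_drop, rng):
    """Return the indices of the trees dropped in one boosting iteration."""
    # skip_drop takes priority: with this probability no dropout happens at all.
    if n_trees == 0 or rng.random() < skip_drop:
        return []
    # Each existing tree is dropped independently with probability rate_drop.
    dropped = [i for i in range(n_trees) if rng.random() < rate_drop]
    # one_drop guarantees that at least one tree is dropped.
    if one_drop and not dropped:
        dropped = [rng.randrange(n_trees)]
    return dropped


rng = random.Random(0)
print(select_dropped_trees(10, rate_drop=0.1, one_drop=0, skip_drop=0.5, rng=rng))
```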

5 Parameters for Linear Booster

(1) lambda [default=0, alias: reg_lambda]

L2 regularization term (penalty coefficient) on weights. Increasing this value makes the model more conservative.

(2) alpha [default=0, alias: reg_alpha]

L1 regularization term (penalty coefficient) on weights. Increasing this value makes the model more conservative.

(3) lambda_bias [default=0, alias: reg_lambda_bias]

L2 regularization term on bias (there is no L1 regularization on bias because it is not important). The default is 0.


6 Parameters for Tweedie Regression

(1) tweedie_variance_power [default=1.5]

   parameter that controls the variance of the Tweedie distribution
   var(y) ~ E(y)^tweedie_variance_power
   range: (1,2)
   set closer to 2 to shift towards a gamma distribution
   set closer to 1 to shift towards a Poisson distribution.

7 Learning Task Parameters

Specify the learning task and the corresponding learning objective. The objective options are below:

(1) objective [ default=reg:linear ]
–defines the learning task and the corresponding learning objective; the available objective functions are listed below.
(2) “reg:linear” –linear regression
(3) “reg:logistic” –logistic regression
(4) “binary:logistic” –logistic regression for binary classification, output probability
(5) “binary:logitraw” –logistic regression for binary classification, output score before logistic transformation
(6) “count:poisson” –poisson regression for count data, output mean of poisson distribution
max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)
(7) “multi:softmax” –set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (number of classes)
(8) “multi:softprob” –same as softmax, but outputs a vector of ndata * nclass, which can be further reshaped to an ndata-by-nclass matrix. Each row contains the predicted probability of the data point belonging to each class.

(9) “rank:pairwise” –set XGBoost to do ranking task by minimizing the pairwise loss

(10) “reg:gamma” –gamma regression with log-link. Output is a mean of gamma distribution.
It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed
(11) “reg:tweedie” –Tweedie regression with log-link.
It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
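
For “multi:softprob”, the reshaping of the flat ndata * nclass vector can be sketched in plain Python. The sketch assumes that the probabilities of one data point are stored next to each other (row-major order); the helper name and the numbers are hypothetical.

```python
def reshape_softprob(flat, nclass):
    """Turn a flat list of ndata * nclass probabilities into ndata rows."""
    if len(flat) % nclass != 0:
        raise ValueError("length of the output is not a multiple of nclass")
    return [flat[i:i + nclass] for i in range(0, len(flat), nclass)]


# Hypothetical output for 2 data points and 3 classes.
flat = [0.7, 0.2, 0.1, 0.1, 0.3, 0.6]
matrix = reshape_softprob(flat, nclass=3)

# The most probable class per row is what "multi:softmax" would report.
predicted = [row.index(max(row)) for row in matrix]
print(matrix, predicted)
```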

(12) base_score [ default=0.5 ]

the initial prediction score of all instances (global bias)
   with a sufficient number of iterations, changing this value will not have much effect.

(13) eval_metric [ default according to objective ]
 evaluation metrics for validation data; a default metric will be assigned according to the objective (rmse for regression, error for classification, and mean average precision for ranking)
 Users can add multiple evaluation metrics. Python users should remember to pass the metrics in as a list of parameter pairs instead of a map, so that a later ‘eval_metric’ won’t override a previous one
 The choices are listed below:
 “rmse”: root mean square error
 “mae”: mean absolute error
 “logloss”: negative log-likelihood
 “error”: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
 “error@t”: a binary classification threshold value other than 0.5 can be specified by providing a numerical value through ‘t’.
 “merror”: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
 “mlogloss”: Multiclass logloss
 “auc”: Area under the curve for ranking evaluation.
 “ndcg”: Normalized Discounted Cumulative Gain
 “map”: Mean average precision
 “ndcg@n”, “map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
 “ndcg-”, “map-”, “ndcg@n-”, “map@n-”: In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding “-” to the evaluation metric, XGBoost will evaluate these scores as 0, to be consistent under some conditions.

   “poisson-nloglik”: negative log-likelihood for Poisson regression
   “gamma-nloglik”: negative log-likelihood for gamma regression
   “gamma-deviance”: residual deviance for gamma regression
   “tweedie-nloglik”: negative log-likelihood for Tweedie regression (at a specified value of the tweedie_variance_power parameter)
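
Two points from this entry lend themselves to a short sketch: how “error” and “error@t” are computed, and why Python users are told to pass a list of pairs rather than a map when giving several metrics. The function name, predictions and labels are hypothetical; the sketch only uses plain Python data structures and does not call the library.

```python
def error_rate(predictions, labels, threshold=0.5):
    """#(wrong cases) / #(all cases), predicting positive when p > threshold."""
    wrong = sum(
        1 for p, y in zip(predictions, labels)
        if (1 if p > threshold else 0) != y
    )
    return wrong / len(labels)


preds = [0.9, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0]
print(error_rate(preds, labels))       # "error": threshold 0.5
print(error_rate(preds, labels, 0.7))  # "error@0.7"

# A map keeps only one value per key, so the second eval_metric replaces the first.
as_map = {"objective": "binary:logistic"}
as_map["eval_metric"] = "auc"
as_map["eval_metric"] = "logloss"

# A list of pairs keeps both entries.
as_pairs = [
    ("objective", "binary:logistic"),
    ("eval_metric", "auc"),
    ("eval_metric", "logloss"),
]
metrics_in_pairs = [value for key, value in as_pairs if key == "eval_metric"]
print(as_map["eval_metric"], metrics_in_pairs)
```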

(14) seed [ default=0 ]
random number seed. The default is 0.

8 Command Line Parameters

The following parameters are only used in the console version of xgboost

   use_buffer [ default=1 ]
   Whether to create a binary buffer from text input. Doing so normally will speed up loading times.
   num_round
   The number of rounds for boosting.
   data
   The path of training data.
   test:data
   The path of test data to do prediction.
   save_period [default=0]
   The period to save the model. Setting save_period=10 means that for every 10 rounds XGBoost will save the model; setting it to 0 means not saving any model during the training.
   task [default=train] options: train, pred, eval, dump
   train: training using data
   pred: making prediction for test:data
   eval: for evaluating statistics specified by eval[name]=filename
   dump: for dumping the learned model into text format (preliminary)

   model_in [default=NULL]
   Path to the input model, needed for test, eval and dump. If it is specified in training, XGBoost will continue training from the input model.
   model_out [default=NULL]
   Path to the output model after training finishes. If not specified, the output will be named like 0003.model, where 0003 is the number of boosting rounds.
   model_dir [default=models]
   The output directory of the saved models during training.
   fmap
   Feature map, used for dumping the model.
   name_dump [default=dump.txt]
   Name of the model dump file.
   name_pred [default=pred.txt]
   Name of the prediction file, used in pred mode.
   pred_margin [default=0]
   Predict the margin instead of the transformed probability.
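
The console version reads these settings from a plain-text configuration file made of "name = value" lines. The sketch below only renders such a file as a string from a Python dictionary; the helper name and the file names train.txt and test.txt are hypothetical, and the exact quoting rules of the configuration format are assumed rather than taken from this article.

```python
def render_config(settings):
    """Render a dictionary as 'name = value' lines, one setting per line."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items()) + "\n"


# Hypothetical training configuration; file names are placeholders.
train_settings = {
    "booster": "gbtree",
    "objective": "binary:logistic",
    "num_round": 10,
    "save_period": 0,
    "data": '"train.txt"',
    "eval[test]": '"test.txt"',
    "test:data": '"test.txt"',
}

config_text = render_config(train_settings)
print(config_text)
```

The resulting text would be saved to a file and passed to the console program; individual settings such as task or model_in can then usually be overridden on the command line as additional name=value arguments.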
