Overview
As we all know, for regression tasks GBDT uses the loss (y - pred)^2, so the residual 2*(y - pred) — the negative gradient of the loss with respect to the prediction — is easy to understand.
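As a quick sanity check (a minimal numeric sketch, not sklearn code; the arrays are made up):

import numpy as np

y = np.array([3.0, -1.0, 2.0])    # true targets
pred = np.array([2.5, 0.0, 2.0])  # current model predictions

# loss = (y - pred)^2, so d(loss)/d(pred) = -2 * (y - pred);
# the negative gradient, i.e. the residual the next tree will fit, is:
residual = 2 * (y - pred)
print(residual)  # [ 1. -2.  0.]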
So when GBDT is trained as a classifier, what does the residual look like?
Gradient Boosting attempts to solve this minimization problem numerically via steepest descent. The steepest descent direction is the negative gradient of the loss function evaluated at the current model, which can be calculated for any differentiable loss function. The algorithms for regression and classification only differ in the concrete loss function used.
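To make "fit each stage to the negative gradient" concrete, here is a toy gradient-boosting loop for least squares (an illustrative sketch, not sklearn's implementation; the function name and parameters are mine):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_ls(X, y, n_estimators=100, learning_rate=0.1):
    """Toy gradient boosting with least-squares loss."""
    pred = np.full(len(y), y.mean())  # initial model: the mean
    trees = []
    for _ in range(n_estimators):
        residual = y - pred           # negative gradient of 0.5*(y - pred)^2
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residual)         # each stage fits the current residual
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

(The constant factor in the gradient only rescales the step, so it is conventional to work with the 0.5*(y - pred)^2 form.)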
Below we take the classification 'deviance' loss as the example: http://scikit-learn.org/stable/modules/ensemble.html#loss-functions
Classification
- Binomial deviance ('deviance'): The negative binomial log-likelihood loss function for binary classification (provides probability estimates). The initial model is given by the log odds-ratio.
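Spelled out (my paraphrase of the docs in code form; f denotes the raw model score and y is 0/1):

import numpy as np

def binomial_deviance(y, f):
    # -2 * log-likelihood of y under p = 1/(1 + exp(-f)),
    # written stably via logaddexp(0, f) = log(1 + exp(f))
    return -2.0 * np.mean(y * f - np.logaddexp(0.0, f))

def log_odds_init(y):
    # the initial model: the log odds-ratio of the positive class
    pos = y.sum()
    return np.log(pos / (len(y) - pos))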
0) In GradientBoostingClassifier's __init__:
Line 1415: loss='deviance'
Line 1423: super(GradientBoostingClassifier, self).__init__(loss=loss, ...)
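A minimal usage example (dataset and parameters are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = GradientBoostingClassifier(loss='deviance', n_estimators=50)  # 'deviance' is the default here
clf.fit(X, y)
print(clf.predict_proba(X[:3]))  # probabilities, i.e. the logistic of the raw scores

(Newer scikit-learn releases renamed this loss to 'log_loss'.)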
1) fit (https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/ensemble/gradient_boosting.py#L930) contains the key piece of code, which hands the actual boosting off to _fit_stages (step 2 below):
2) _fit_stages (https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/ensemble/gradient_boosting.py#L1035)
Line 1048: loss_ = self.loss_
We look at loss_ first, then at _fit_stage().
3) First, loss_:
https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/ensemble/gradient_boosting.py#L651
LOSS_FUNCTIONS = {'ls': LeastSquaresError,
                  'lad': LeastAbsoluteError,
                  'huber': HuberLossFunction,
                  'quantile': QuantileLossFunction,
                  'deviance': None,  # for both, multinomial and binomial
                  'exponential': ExponentialLoss,
                  }
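Note that 'deviance' maps to None: the concrete loss class can only be chosen once the number of classes is known. In _check_params the dispatch is essentially the following (paraphrased from that commit):

# in BaseGradientBoosting._check_params:
if self.loss == 'deviance':
    loss_class = (MultinomialDeviance
                  if len(self.classes_) > 2
                  else BinomialDeviance)
else:
    loss_class = LOSS_FUNCTIONS[self.loss]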
4) Then, _fit_stage() (https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/ensemble/gradient_boosting.py#L747)
From line 763 we can see that even in a classifier, the internal trees are regression trees.
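This is easy to verify from the outside (a quick check; the printed module path may differ across versions):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(random_state=0)
clf = GradientBoostingClassifier(n_estimators=5).fit(X, y)
print(type(clf.estimators_[0, 0]))  # a DecisionTreeRegressor, despite the classifier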
Also, line 759 is what we care about most: how the residual is computed. Following it back to line 491 of the same file, we find the code below:
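Paraphrasing what BinomialDeviance.negative_gradient in that version boils down to:

from scipy.special import expit

def negative_gradient(y, pred):
    """Compute the residual (= negative gradient of the binomial deviance)."""
    return y - expit(pred.ravel())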
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.special.expit.html
The expit function, also known as the logistic function, is defined as expit(x) = 1/(1+exp(-x)). It is the inverse of the logit function.
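A quick check of the definition and of the inverse relationship (illustrative values):

import numpy as np
from scipy.special import expit, logit

x = np.array([-2.0, 0.0, 3.0])
print(expit(x))                         # [0.1192... 0.5 0.9525...] = 1 / (1 + exp(-x))
print(np.allclose(logit(expit(x)), x))  # True: expit is the inverse of logit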
So our final conclusions are:
1) Classification also uses regression trees.
2) For binary classification, y can only be 1 or 0, and the residual is computed as y - pred, where pred is in fact a probability produced by the logistic function!
3) For N-class classification, the problem can only be handled as N binary classifications (each with y being 0 or 1). This conclusion can be inferred from the note below (see also the sketch after it); I have not read that part of the code in detail.
Note
Classification with more than 2 classes requires the induction of n_classes regression trees at each iteration; thus, the total number of induced trees equals n_classes * n_estimators. For datasets with a large number of classes we strongly recommend to use RandomForestClassifier as an alternative to GradientBoostingClassifier.
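A sketch of what one multiclass iteration looks like under this reading (my own illustration, not the actual MultinomialDeviance code; sklearn additionally re-estimates leaf values with a Newton step, which this skips):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def multiclass_stage(X, y, raw_preds, learning_rate=0.1):
    """One boosting iteration for K classes: K regression trees."""
    K = raw_preds.shape[1]
    # softmax turns the K raw scores into class probabilities
    p = np.exp(raw_preds - raw_preds.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    for k in range(K):
        y_k = (y == k).astype(float)  # one-vs-rest 0/1 target
        residual = y_k - p[:, k]      # negative gradient for class k
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        raw_preds[:, k] += learning_rate * tree.predict(X)
    return raw_preds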