Reference: http://scikit-learn.org/stable/modules/ensemble.html
In real projects we rarely reach for the simple classics such as LR, kNN, or NB; they are important textbook models, but on their own they are often not competitive enough for engineering work.
Today we focus on ensemble methods, which see much more use in practice.
Ensemble methods combine the predictions of several estimators, by weighted or unweighted voting, to produce the final result. There are two main families:
- In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.
  Examples: Bagging methods, Forests of randomized trees, ...
- By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
  Examples: AdaBoost, Gradient Tree Boosting, ...
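A minimal sketch contrasting the two families (the make_classification toy data and all parameter values here are illustrative assumptions, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Averaging: independently trained, fully grown trees whose predictions are averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: shallow trees built sequentially, each focusing on its predecessors' mistakes
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```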
The rest of this post mainly covers:
1. Bagging meta-estimator
Note the difference between bagging and boosting: bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods, which usually work best with weak models (e.g., shallow decision trees).
Different bagging methods differ mainly in how they draw the random subsets: some draw random subsets of the samples, some random subsets of the features, some random subsets of both samples and features, and some draw with replacement (so samples or features can be repeated).
scikit-learn offers a unified BaggingClassifier meta-estimator (resp. BaggingRegressor); the parameters max_samples and max_features control the size of the subsets, while bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement. A small example:
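A minimal sketch following the scikit-learn docs (the KNeighborsClassifier base estimator and the 0.5 values are illustrative choices):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Each base KNN is trained on a random 50% of the samples and 50% of the features;
# bootstrap / bootstrap_features control whether those draws are with replacement
# (by default samples are bootstrapped, features are not).
bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5)
```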
- See also the scikit-learn example: Single estimator versus bagging: bias-variance decomposition
2. Forests of randomized trees
Two algorithms are provided: the RandomForest algorithm and the Extra-Trees method. The final result is the average prediction of the individual classifiers. A simple example:
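A toy sketch, along the lines of the scikit-learn docs (the tiny X/Y data is purely for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1]]   # two training samples
Y = [0, 1]             # their class labels

clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)
```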
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).
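For example (a hypothetical multi-output toy set; each sample gets two labels, so Y has shape [n_samples, n_outputs]):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
Y = [[0, 1], [1, 0], [0, 0], [1, 1]]   # two outputs per sample

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, Y)
print(clf.predict([[1, 1]]))   # one predicted label per output
```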
RandomForest algorithm:
Two classes handle classification and regression respectively: RandomForestClassifier and RandomForestRegressor. Samples are drawn with replacement (a bootstrap sample), and when splitting a node the best split is found among a random subset of the features rather than all of them. Unlike the voting scheme of the original paper, scikit-learn combines the classifiers by averaging their probabilistic prediction.
Extremely Randomized Trees:
Again two classes handle classification and regression respectively: ExtraTreesClassifier and ExtraTreesRegressor. By default all samples are used (no bootstrap), and at each split a random subset of the features is considered; in addition, candidate thresholds are drawn at random for each feature and the best of these randomly generated thresholds is picked as the splitting rule.
A comparison example:
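A sketch along the lines of the scikit-learn docs comparison (the make_blobs parameters and n_estimators=10 are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Synthetic classification problem with many cluster centers
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

for clf in (DecisionTreeClassifier(random_state=0),
            RandomForestClassifier(n_estimators=10, random_state=0),
            ExtraTreesClassifier(n_estimators=10, random_state=0)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, scores.mean())
```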