Reference: http://scikit-learn.org/stable/modules/ensemble.html
In real projects we rarely reach for the simple classics such as LR, kNN, or NB; they are important textbook models, but on their own they are often not competitive enough for engineering work.
Today we focus on ensemble methods, which see much more use in practice.
Ensemble methods combine the predictions of several estimators, by weighted or unweighted voting, to produce the final result. There are two main families:
- In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.
  Examples: Bagging methods, Forests of randomized trees, ...
- By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
  Examples: AdaBoost, Gradient Tree Boosting, ...
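A minimal sketch contrasting the two families (the make_classification toy data and all parameter values here are illustrative assumptions, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Averaging: independently trained, fully grown trees whose predictions are averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: shallow trees built sequentially, each focusing on its predecessors' mistakes
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```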
The rest of this post mainly covers:
1. Bagging meta-estimator
Note the difference between bagging and boosting: bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods, which usually work best with weak models (e.g., shallow decision trees).
Different bagging methods differ mainly in how they draw the random subsets: some draw random subsets of the samples, some random subsets of the features, some random subsets of both samples and features, and some draw with replacement (so samples or features can be repeated).
scikit-learn offers a unified BaggingClassifier meta-estimator (resp. BaggingRegressor); the parameters max_samples and max_features control the size of the subsets, while bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement. A small example:
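A minimal sketch following the scikit-learn docs (the KNeighborsClassifier base estimator and the 0.5 values are illustrative choices):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Each base KNN is trained on a random 50% of the samples and 50% of the features;
# bootstrap / bootstrap_features control whether those draws are with replacement
# (by default samples are bootstrapped, features are not).
bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5)
```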
- See also the scikit-learn example: Single estimator versus bagging: bias-variance decomposition
2. Forests of randomized trees
Two algorithms are provided: the RandomForest algorithm and the Extra-Trees method. The final result is the average prediction of the individual classifiers. A simple example:
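A toy sketch, along the lines of the scikit-learn docs (the tiny X/Y data is purely for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1]]   # two training samples
Y = [0, 1]             # their class labels

clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)
```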
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).
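For example (a hypothetical multi-output toy set; each sample gets two labels, so Y has shape [n_samples, n_outputs]):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
Y = [[0, 1], [1, 0], [0, 0], [1, 1]]   # two outputs per sample

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, Y)
print(clf.predict([[1, 1]]))   # one predicted label per output
```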
RandomForest algorithm:
Two classes handle classification and regression respectively: RandomForestClassifier and RandomForestRegressor. Samples are drawn with replacement (a bootstrap sample), and when splitting a node the best split is found among a random subset of the features rather than all of them. Unlike the voting scheme of the original paper, scikit-learn combines the classifiers by averaging their probabilistic prediction.
Extremely Randomized Trees:
Again two classes handle classification and regression respectively: ExtraTreesClassifier and ExtraTreesRegressor. By default all samples are used (no bootstrap), and at each split a random subset of the features is considered; in addition, candidate thresholds are drawn at random for each feature and the best of these randomly generated thresholds is picked as the splitting rule.
A comparison example:
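A sketch along the lines of the scikit-learn docs comparison (the make_blobs parameters and n_estimators=10 are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Synthetic classification problem with many cluster centers
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

for clf in (DecisionTreeClassifier(random_state=0),
            RandomForestClassifier(n_estimators=10, random_state=0),
            ExtraTreesClassifier(n_estimators=10, random_state=0)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, scores.mean())
```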