Overview

Too many cooks spoil the broth.

Even back in 1575, George Gascoigne already knew that a sumptuous bowl of broth can’t be achieved with too many cooks in the kitchen. The rigor of that proverb extends to modern days, yes, even in Machine Learning.

Have you ever wondered why the performance of your model hit a plateau no matter how you fine-tune those hyperparameters? Or even worse that you only see a mediocre improvement on performance after using the most accurate set of data you could ever find? Well, the culprit might actually be the predictors (columns) you use to train your models.

Ideally, predictors should be statistically relevant to the output data a model intends to predict, and those predictors should be carefully hand-picked to ensure the best-expected performance. This article will give you a brief walkthrough on what feature selection is all about, accompanied by some practical examples in Python.

Why Feature Selection Matters

Figure: self-illustrated by the author.

Feature selection is primarily focused on removing redundant or non-informative predictors from the model. [1]

On the surface level, feature selection simply means discarding predictors and narrowing them down to the sweet spot of an optimal subset. Some reasons why feature selection is important in machine learning:

  • Parsimony (or simplicity) — simple models are easier to interpret than complex models, especially when making inferences.

  • Time is money. Fewer features mean less calculation time, which directly results in shorter training times.

  • Avoiding the curse of dimensionality — A high accuracy model trained with a lot of features can be delusive, as it can be a sign of overfitting and won’t generalize to new samples.

Approaches for Feature Selection

There are generally three methods for feature selection:

Filter methods use statistical calculation to evaluate the relevance of the predictors outside of the predictive models and keep only the predictors that pass some criterion. [2] Considerations when choosing filter methods are the types of data involved, both in predictors and outcome — either numerical or categorical.

Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance. [3] Generally three directions of procedures are possible — forward (starts with 1 predictor and adds more iteratively), backward (starts with all predictors and eliminates one-by-one iteratively), and step-wise (bi-directional).

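As a quick illustration (not part of this article's code examples, which focus on filter methods), a forward wrapper search can be sketched with scikit-learn's SequentialFeatureSelector, available from scikit-learn 0.24 onwards; the estimator, feature counts and cross-validation settings below are arbitrary choices for demonstration:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression


# generate a toy dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=5)


# forward search: start with zero predictors and greedily add the one that
# improves cross-validated accuracy the most, until 5 predictors are selected
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction='forward',
                                cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True)) #indices of the selected predictors
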
Embedded methods are models where the feature selection procedure occurs naturally in the course of the model fitting process. [4] Put simply, this method integrates feature selection as part of the machine learning algorithm itself. The most typical embedded techniques are tree-based algorithms, such as decision trees and random forests, where the feature to split on is chosen at each node based on information gain. Other examples of embedded methods are the LASSO with an L1 penalty and Ridge with an L2 penalty for constructing a linear model.

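As a minimal sketch (again, not from this article's main examples), an embedded approach with the LASSO can be expressed through scikit-learn's SelectFromModel; the alpha value below is an arbitrary choice:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso


# generate a toy dataset
X, y = make_regression(n_samples=100, n_features=50, n_informative=10)


#the L1 penalty shrinks the coefficients of uninformative predictors to
#exactly zero, so feature selection is a by-product of fitting the model
selector = SelectFromModel(Lasso(alpha=1.0))
X_selected = selector.fit_transform(X, y)
print(X_selected.shape) #only predictors with non-zero coefficients remain
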
Filter methods in Python

Figure: cheat-sheet for filter methods using the scikit-learn package. Color codes: light blue for input/output predictors, blue for numerical data, green for categorical data, orange for scikit-learn functions, light orange for the statistical/mathematical theories they map to. Self-illustrated by the author.

In this tutorial, we will be using the Scikit-learn package to perform the filter methods in Python, which means they are all performed using statistical techniques.

The complete Python code, as well as the raw data used in the categorical examples, can be found on GitHub.

1. Numerical input, numerical output — Pearson's correlation with f_regression()

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
import pandas as pd


# generate dataset
X, y = make_regression(n_samples=100, n_features=50, n_informative=10)
#assign column names 
col_list = ['col_' + str(x) for x in range(0,50)]
#create a dataframe table
df = pd.DataFrame(X, columns=col_list)


#feature selection using f_regression 
fs = SelectKBest(score_func=f_regression, k=5)
fit = fs.fit(X,y)


#create df for scores
dfscores = pd.DataFrame(fit.scores_)
#create df for column names
dfcolumns = pd.DataFrame(df.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
#naming the dataframe columns
featureScores.columns = ['Selected_columns','Score_pearsons'] 


#print 5 best features
print(featureScores.nlargest(5,'Score_pearsons'))
Printed output: the 5 most important features are shown. 50 features (columns) are randomly generated and f_regression() is used to rank their importance based on scoring.
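The snippet above only ranks the predictors. If you also want to keep just the selected columns, a short follow-up along these lines (not part of the original code; X_selected and selected_cols are names introduced here for illustration) can be appended, and the same pattern works for the other examples below:

#reduce the dataset to the 5 selected columns
X_selected = fit.transform(X)
#recover the names of the selected predictors from the dataframe
selected_cols = df.columns[fit.get_support()]
print(selected_cols.tolist())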

2. Numerical input, categorical output — ANOVA with f_classif()

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
import pandas as pd


# generate dataset
X, y = make_classification(n_samples=100, n_features=50, n_informative=10)
#assign column names 
col_list = ['col_' + str(x) for x in range(0,50)]
#create a dataframe table
df = pd.DataFrame(X, columns=col_list)


#feature selection using f_classif
fs = SelectKBest(score_func=f_classif, k=5)
fit = fs.fit(X,y)
#create df for scores
dfscores = pd.DataFrame(fit.scores_)
#create df for column names
dfcolumns = pd.DataFrame(df.columns)


#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
#naming the dataframe columns
featureScores.columns = ['Selected_columns','Score_ANOVA'] 
#print 5 best features
print(featureScores.nlargest(5,'Score_ANOVA'))
Printed output: the 5 most important features are shown. 50 features (columns) are randomly generated and f_classif() is used to rank their importance based on scoring.

3. Categorical input, categorical output — Chi-squared with chi2()

For the categorical example, a dataset for car evaluation is used. It consists of 6 features — buying price, maintenance price, number of doors, person capacity, luggage boot size, and safety rating — used to determine the acceptability class, which is distributed across 4 different possible outcomes.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd


#import raw data, data can be found in Github directory 
#https://github.com/jackty9/Feature_Selection_in_Python/blob/master/car_data.csv
df = pd.read_csv("car_data.csv")
X = df.iloc[:,0:5]
X = pd.get_dummies(X)
y = df.iloc[:,-1]
y = pd.get_dummies(y)


#feature selection using chi2
bestfeatures = SelectKBest(score_func=chi2, k=5)
fit = bestfeatures.fit(X,y)
#create df for scores
dfscores = pd.DataFrame(fit.scores_)
#create df for column names
dfcolumns = pd.DataFrame(X.columns)


#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
#naming the dataframe columns
featureScores.columns = ['Selected_columns','Score_chi2'] 
#print 5 best features
print(featureScores.nlargest(5,'Score_chi2'))
Printed output: the 5 most important features are shown. The categorical predictor columns are one-hot encoded into 18 dummy features via pd.get_dummies(), and the predicted class is encoded the same way. One-hot encoding is required for the chi-squared test because it relies on frequency distributions for the statistical hypothesis test.

3.1. Categorical input, categorical output — Mutual Info with mutual_info_classif()

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
import pandas as pd


#import raw data, data can be found in Github directory 
#https://github.com/jackty9/Feature_Selection_in_Python/blob/master/car_data.csv
df1 = pd.read_csv("car_data.csv")
X = df1.iloc[:,0:5]
X = pd.get_dummies(X)
y = df1.iloc[:,-1]


#feature selection using mutual_info_classif
bestfeatures = SelectKBest(score_func=mutual_info_classif, k=5)
fit = bestfeatures.fit(X,y)
#create df for scores
dfscores = pd.DataFrame(fit.scores_)
#create df for column names
dfcolumns = pd.DataFrame(X.columns)


#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
#naming the dataframe columns
featureScores.columns = ['Selected_columns','Score_MutualInfo']  
#print 5 best features
print(featureScores.nlargest(5,'Score_MutualInfo'))
Printed output: the 5 most important features are shown. The categorical predictor columns are one-hot encoded into 18 dummy features via pd.get_dummies(). Mutual information classification requires the features to be converted into numerical values so that entropy and information gain can be computed. Note that one-hot encoding is not required for the output variable, as mutual_info_classif() treats the output as class labels.

Summary

In this post, you discovered how to choose filter-based statistical measures for feature selection with numerical and categorical data. You also learned how to implement them in Python.

Some people may ask — what if there is a mix of numerical and categorical data in my predictors? Well, then you have to split the two types of data and apply the appropriate method to each subset, based on the output variable; a minimal sketch of that split follows below.

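A minimal sketch of that split, assuming a categorical outcome and a made-up dataframe (the column names here are purely illustrative), could look like this: score the numerical columns with ANOVA (f_classif) and the one-hot encoded categorical columns with chi2, then compare the two score tables:

import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import chi2


#a tiny made-up dataframe mixing numerical and categorical predictors
df_mixed = pd.DataFrame({
    'age':    [23, 45, 31, 52, 28, 40],
    'income': [30, 80, 45, 90, 35, 70],
    'colour': ['red', 'blue', 'red', 'green', 'blue', 'green'],
    'target': ['no', 'yes', 'no', 'yes', 'no', 'yes'],
})
features = df_mixed.drop(columns=['target'])
y = df_mixed['target']
num_cols = features.select_dtypes(include='number').columns
cat_cols = features.select_dtypes(exclude='number').columns


#numerical predictors -> ANOVA F-test
anova_fit = SelectKBest(score_func=f_classif, k='all').fit(features[num_cols], y)
#categorical predictors -> one-hot encode, then chi-squared
X_cat = pd.get_dummies(features[cat_cols])
chi2_fit = SelectKBest(score_func=chi2, k='all').fit(X_cat, y)


print(pd.Series(anova_fit.scores_, index=num_cols))
print(pd.Series(chi2_fit.scores_, index=X_cat.columns))
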
Is there an ideal number of predictors? Nope, at least not universally. It depends on the size of data — the number of rows vs the number of columns (predictors), the ML algorithm used (SVM is more susceptible to overfitting than tree-based algorithms), available computing resources, and of course time for the project. The rule of thumb is to consider all of the points mentioned and see what best fits in your situation.

What to anticipate next

The article covers the first approach in feature selection — filter methods using statistical measures. In the following articles, we will look into the second and third approaches — wrapper and embedded methods. Follow and stay tuned!

References:

[1] Applied Predictive Modeling, page 488

[2] Applied Predictive Modeling, page 490

[3] Applied Predictive Modeling, page 490

[4] Feature Engineering and Selection, page 17

Translated from: https://medium.com/@jackyeetan/feature-selection-for-machine-learning-in-python-filter-methods-6071c5d267d5
