Overview
References:
- Pattern Recognition, by Sergios Theodoridis and Konstantinos Koutroumbas (2009)
- Slides of CS4220, TUD
Content
- The Peaking Phenomenon (5.3)
- Class Separability Measures (5.6)
  - Divergence
  - Chernoff Bound and Bhattacharyya Distance
  - Scatter Matrices
- Feature Selection (5.7)
  - Sequential Backward Selection
  - Sequential Forward Selection
  - Floating Search Methods
- Feature Extraction
  - Supervised: Linear Discriminant Analysis (LDA) (5.8)
  - Unsupervised: Principal Component Analysis (PCA) (6.3)
    - Karhunen–Loève (KL) transform
    - Mean Square Error Approximation
    - Total Variance
  - Nonlinear: Kernel PCA (6.7.1)
The Peaking Phenomenon (5.3)
More features $\stackrel{?}{\Longrightarrow}$ better performance
- If the corresponding PDFs are known, the Bayesian error goes down with more features. We can perfectly discriminate the two classes by arbitrarily increasing the number of features.
- If the PDFs are unknown and the associated parameters must be estimated using a finite training set, then we must try to keep the number of features to a relatively low number.
In practice, for a finite $N$, by increasing the number of features one obtains an initial improvement in performance, but after a critical value, further increase of the number of features results in an increase of the probability of error. This phenomenon is also known as the peaking phenomenon.
What feature subset to keep? $\Longrightarrow$ Feature selection / extraction
- Feature selection: select $d$ out of $p$ measurements
- Feature extraction: map $p$ measurements to $d$ measurements
What do we need?
- Criterion functions, e.g., error, class overlap, information loss…
- Optimization or “search” algorithms to find mapping for given criterion
Class Separability Measures (5.6)
Divergence
Let us recall our familiar Bayes rule. Given two classes $\omega_1$ and $\omega_2$ and a feature vector $\mathbf x$, we select $\omega_1$ if
$$P(\omega_1|\mathbf x)>P(\omega_2|\mathbf x)$$
The classification error probability depends on the difference between $P(\omega_1|\mathbf x)$ and $P(\omega_2|\mathbf x)$; hence the ratio $P(\omega_1|\mathbf x)/P(\omega_2|\mathbf x)$ can convey useful information concerning the discriminatory capabilities associated with an adopted feature vector $\mathbf x$. Alternatively, the same information resides in the quantity
$$\ln \frac{p(\mathbf x|\omega_1)}{p(\mathbf x|\omega_2)}\equiv D_{12}(\mathbf x)$$
Since $\mathbf x$ takes different values, it is natural to consider the mean value over class $\omega_1$ (because $\mathbf x$ is classified to class $\omega_1$), that is,
$$D_{12}=\int_{-\infty}^{+\infty} p(\mathbf x|\omega_1)\ln \frac{p(\mathbf x|\omega_1)}{p(\mathbf x|\omega_2)}\,d\mathbf x\tag{DV.1}$$
Similar arguments hold for class $\omega_2$, and we define
$$D_{21}=\int_{-\infty}^{+\infty} p(\mathbf x|\omega_2)\ln \frac{p(\mathbf x|\omega_2)}{p(\mathbf x|\omega_1)}\,d\mathbf x\tag{DV.2}$$
The sum
$$d_{12}=D_{12}+D_{21}\tag{DV.3}$$
is known as the divergence and can be used as a separability measure for the classes $\omega_1,\omega_2$ with respect to the adopted feature vector $\mathbf x$.
For a multiclass problem, the divergence is computed for every class pair $\omega_i,\omega_j$:
$$d_{ij}=D_{ij}+D_{ji}=\int_{-\infty}^{+\infty}[p(\mathbf x|\omega_i)-p(\mathbf x|\omega_j)]\ln \frac{p(\mathbf x|\omega_i)}{p(\mathbf x|\omega_j)}\,d\mathbf x \tag{DV.4}$$
and the average class separability can be computed using the average divergence
$$d=\sum_{i=1}^M\sum_{j=1}^M P(\omega_i)P(\omega_j)\,d_{ij}\tag{DV.5}$$
If the components of the feature vector are statistically independent, then it can be shown that
$$\begin{aligned} d_{ij}(x_1,x_2,\cdots,x_l)&=\int_{-\infty}^{+\infty}[p(\mathbf x|\omega_i)-p(\mathbf x|\omega_j)]\sum_{r=1}^{l}\ln \frac{p(x_r|\omega_i)}{p(x_r|\omega_j)}\,d\mathbf x \\ &\stackrel{a}{=}\sum_{r=1}^{l}\int_{-\infty}^{+\infty}[p(x_r|\omega_i)-p(x_r|\omega_j)]\ln \frac{p(x_r|\omega_i)}{p(x_r|\omega_j)}\,dx_r\\ &=\sum_{r=1}^{l}d_{ij}(x_r) \end{aligned}\tag{DV.6}$$
where $\stackrel{a}{=}$ is due to
$$\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty}[p(\mathbf x|\omega_i)-p(\mathbf x|\omega_j)]\,dx_1\cdots dx_{r-1}\,dx_{r+1}\cdots dx_l=p(x_r|\omega_i)-p(x_r|\omega_j)$$
Assuming now that the density functions are Gaussians $\mathcal N(\boldsymbol\mu_i,\boldsymbol\Sigma_i)$ and $\mathcal N(\boldsymbol\mu_j,\boldsymbol\Sigma_j)$, respectively, the computation of the divergence is simplified, and it is not difficult to show that
$$d_{ij}=\frac{1}{2}\mathrm{trace}\{\boldsymbol\Sigma_i^{-1}\boldsymbol\Sigma_j+\boldsymbol\Sigma_j^{-1}\boldsymbol\Sigma_i-2I\}+\frac{1}{2}(\boldsymbol\mu_i-\boldsymbol\mu_j)^T(\boldsymbol\Sigma_i^{-1}+\boldsymbol\Sigma_j^{-1})(\boldsymbol\mu_i-\boldsymbol\mu_j)\tag{DV.7}$$
It can be seen that a class separability measure cannot depend only on the difference of the mean values; it must also be variance dependent. If the covariance matrices of the two Gaussian distributions are equal, then the divergence is further simplified to
$$d_{ij}=(\boldsymbol\mu_i-\boldsymbol\mu_j)^T\boldsymbol\Sigma^{-1}(\boldsymbol\mu_i-\boldsymbol\mu_j)$$
which is nothing other than the Mahalanobis distance between the corresponding mean vectors. This has a direct relation with the Bayes error, which is a desirable property for class separation measures.
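For Gaussian classes, (DV.7) is easy to evaluate numerically. Below is a minimal NumPy sketch (the function name and example numbers are my own) that computes the divergence and checks that, for equal covariance matrices, it reduces to the Mahalanobis distance between the means.

```python
import numpy as np

def gaussian_divergence(mu_i, Sigma_i, mu_j, Sigma_j):
    """Divergence d_ij between two Gaussian classes, Eq. (DV.7)."""
    d = mu_i.shape[0]
    Si_inv = np.linalg.inv(Sigma_i)
    Sj_inv = np.linalg.inv(Sigma_j)
    dm = mu_i - mu_j
    term_cov = 0.5 * np.trace(Si_inv @ Sigma_j + Sj_inv @ Sigma_i - 2 * np.eye(d))
    term_mean = 0.5 * dm @ (Si_inv + Sj_inv) @ dm
    return term_cov + term_mean

# With equal covariances the divergence reduces to the Mahalanobis distance
# between the class means.
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_divergence(mu1, Sigma, mu2, Sigma))       # divergence
print((mu1 - mu2) @ np.linalg.inv(Sigma) @ (mu1 - mu2))  # Mahalanobis distance
```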
Chernoff Bound and Bhattacharyya Distance
The minimum attainable classification error of the Bayes classifier for two classes $\omega_1,\omega_2$ can be written as
$$P_e=\int_{-\infty}^{+\infty} \min [P(\omega_i)p(\mathbf x|\omega_i),\,P(\omega_j)p(\mathbf x|\omega_j)]\,d\mathbf x \tag{CB.1}$$
Analytic computation of this integral in the general case is not possible. However, an upper bound can be derived based on the inequality
$$\min [a,b]\le a^s b^{1-s}\quad \text{for } a,b\ge 0 \text{ and } 0\le s \le 1 \tag{CB.2}$$
Combining $(CB.1)$ and $(CB.2)$, we get
$$P_e\le P(\omega_i)^sP(\omega_j)^{1-s}\int_{-\infty}^{+\infty}p(\mathbf x|\omega_i)^s\,p(\mathbf x|\omega_j)^{1-s}\,d\mathbf x\equiv \epsilon_{CB}\tag{CB.3}$$
$\epsilon_{CB}$ is known as the Chernoff bound. The minimum bound can be computed by minimizing $\epsilon_{CB}$ w.r.t. $s$.
A special form of the bound results for $s=1/2$. For Gaussian distributions $\mathcal N(\boldsymbol\mu_i,\boldsymbol\Sigma_i)$ and $\mathcal N(\boldsymbol\mu_j,\boldsymbol\Sigma_j)$, it reduces to
$$\epsilon_{CB}=\sqrt{P(\omega_i)P(\omega_j)}\exp(-B)$$
where
$$B=\frac{1}{8}(\boldsymbol\mu_i-\boldsymbol\mu_j)^T\left(\frac{\boldsymbol\Sigma_i+\boldsymbol\Sigma_j}{2}\right)^{-1}(\boldsymbol\mu_i-\boldsymbol\mu_j)+\frac{1}{2}\ln \frac{\left|\frac{\boldsymbol\Sigma_i+\boldsymbol\Sigma_j}{2}\right|}{\sqrt{|\boldsymbol\Sigma_i||\boldsymbol\Sigma_j|}}\tag{CB.4}$$
which is known as the Bhattacharyya distance.
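Again assuming Gaussian classes, (CB.4) and the $s=1/2$ bound can be evaluated directly. A small sketch (helper name and example numbers are mine):

```python
import numpy as np

def bhattacharyya_bound(mu_i, Sigma_i, mu_j, Sigma_j, P_i=0.5, P_j=0.5):
    """Bhattacharyya distance B, Eq. (CB.4), and the bound eps_CB at s = 1/2."""
    S = 0.5 * (Sigma_i + Sigma_j)
    dm = mu_i - mu_j
    B = (dm @ np.linalg.inv(S) @ dm) / 8.0 \
        + 0.5 * np.log(np.linalg.det(S) /
                       np.sqrt(np.linalg.det(Sigma_i) * np.linalg.det(Sigma_j)))
    eps_cb = np.sqrt(P_i * P_j) * np.exp(-B)
    return B, eps_cb

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S1 = np.array([[1.0, 0.2], [0.2, 1.0]])
S2 = np.array([[1.5, -0.1], [-0.1, 0.8]])
B, bound = bhattacharyya_bound(mu1, S1, mu2, S2)
print(f"B = {B:.3f}, error bound eps_CB = {bound:.3f}")
```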
Scatter Matrices
A major disadvantage of the class separability criteria considered so far is that they are not easily computed, unless the Gaussian assumption is employed. We will now turn our attention to a set of simpler criteria, built upon information related to the way feature vector samples are scattered in the $l$-dimensional space. To this end, the following matrices are defined:
- Within-class scatter matrix
  $$\mathbf S_w=\sum_{i=1}^M P_i \boldsymbol\Sigma_i \tag{SM.1}$$
  where $\boldsymbol\Sigma_i$ is the covariance matrix for class $\omega_i$,
  $$\boldsymbol\Sigma_i=E[(\mathbf x-\boldsymbol\mu_i)(\mathbf x-\boldsymbol\mu_i)^T]\simeq\frac{\sum_{\mathbf x\in \omega_i}\mathbf x\mathbf x^T}{n_i}-\boldsymbol\mu_i\boldsymbol\mu_i^T$$
  and $P_i$ is the a priori probability of class $\omega_i$. That is, $P_i\simeq n_i/N$, where $n_i$ is the number of samples in class $\omega_i$, out of a total of $N$ samples.
- Between-class scatter matrix
  $$\mathbf S_b=\sum_{i=1}^M P_i(\boldsymbol\mu_i-\boldsymbol\mu_0)(\boldsymbol\mu_i-\boldsymbol\mu_0)^T\tag{SM.2}$$
  where $\boldsymbol\mu_0$ is the global mean vector,
  $$\boldsymbol\mu_0=\sum_{i=1}^{M} P_i \boldsymbol\mu_i$$
- Mixture scatter matrix
  $$\mathbf S_m=E\left[(\mathbf x-\boldsymbol\mu_0)(\mathbf x-\boldsymbol\mu_0)^T\right] \simeq\frac{\sum\mathbf x\mathbf x^T}{N}-\boldsymbol\mu_0\boldsymbol\mu_0^T \tag{SM.3}$$
  That is, $\mathbf S_m$ is the covariance matrix of the feature vector with respect to the global mean. It can be shown that
  $$\mathbf S_m=\mathbf S_w+\mathbf S_b\tag{SM.4}$$
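A small NumPy sketch (all names are mine) that estimates the three scatter matrices from a labeled sample, with priors $P_i \simeq n_i/N$, and checks the identity (SM.4):

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class, between-class and mixture scatter matrices, Eqs. (SM.1)-(SM.3).

    X: (N, l) data matrix; y: (N,) class labels. Priors are estimated as n_i / N.
    """
    N, l = X.shape
    mu0 = X.mean(axis=0)                                 # global mean
    Sw = np.zeros((l, l))
    Sb = np.zeros((l, l))
    for c in np.unique(y):
        Xc = X[y == c]
        P_c = len(Xc) / N
        mu_c = Xc.mean(axis=0)
        Sw += P_c * np.cov(Xc, rowvar=False, bias=True)  # P_i * Sigma_i
        Sb += P_c * np.outer(mu_c - mu0, mu_c - mu0)
    Sm = np.cov(X, rowvar=False, bias=True)              # scatter about the global mean
    return Sw, Sb, Sm

# Quick check of (SM.4): S_m = S_w + S_b (up to numerical precision).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(2, 1, (40, 3))])
y = np.array([0] * 60 + [1] * 40)
Sw, Sb, Sm = scatter_matrices(X, y)
print(np.allclose(Sm, Sw + Sb))   # True
```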
From these definitions it is straightforward to see that the criterion
$$J_1=\frac{\operatorname{trace}\{\mathbf S_m\}}{\operatorname{trace}\{\mathbf S_w\}}\tag{SM.5}$$
takes large values when samples in the $l$-dimensional space are well clustered around their mean, within each class, and the clusters of the different classes are well separated. Sometimes $\mathbf S_b$ is used in place of $\mathbf S_m$.
An alternative criterion results if determinants are used in the place of traces. This is justified for scatter matrices that are symmetric positive definite, and thus their eigenvalues are positive. The trace is equal to the sum of the eigenvalues, while the determinant is equal to their product. Hence, large values of
$J_1$ also correspond to large values of the criterion
$$J_2=\frac{|\mathbf S_m|}{|\mathbf S_w|}=|\mathbf S_w^{-1}\mathbf S_m|\tag{SM.6}$$
or
$$J_3=\mathrm{trace}\{\mathbf S_w^{-1}\mathbf S_m\}\tag{SM.7}$$
These criteria take a special form in the one-dimensional, two-class problem. In this case, it is easy to see that for equiprobable classes $|\mathbf S_w|$ is proportional to $\sigma^2_1+\sigma^2_2$ and $|\mathbf S_b|$ is proportional to $(\mu_1-\mu_2)^2$. Using $J_2$, the so-called Fisher's discriminant ratio (FDR) results:
$$FDR=\frac{(\mu_1-\mu_2)^2}{\sigma^2_1+\sigma^2_2}\tag{SM.8}$$
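As a tiny illustration (names and example data are mine), (SM.8) can be used to score a single feature; such a per-feature score is one possible choice for the criterion $C(k)$ in the scalar feature selection described in the next section.

```python
import numpy as np

def fdr(x1, x2):
    """Fisher's discriminant ratio (SM.8) for a single feature and two classes."""
    return (x1.mean() - x2.mean()) ** 2 / (x1.var() + x2.var())

# The ratio grows when the class means move apart or the variances shrink.
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(1.5, 1.0, 500)
print(fdr(a, b))
```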
Feature Selection (5.7)
The objective is to select a subset of $l$ out of $m$ measurements that optimizes a chosen criterion. Theoretically, there are $C_m^l=\frac{m!}{l!(m-l)!}$ subsets to be compared. Since this computation may not be affordable in high dimensions, we settle for suboptimal search techniques.
The simplest way, named scalar feature selection, treats features individually. The value of the criterion $C(k)$ is computed for each of the features, $k=1,2,\ldots,m$. The $l$ features corresponding to the $l$ best values of $C(k)$ are then selected to form the feature vector.
However, such approaches do not take into account existing correlations between features. Therefore, we proceed to techniques dealing with vectors, named vector feature selection.
Sequential Backward Selection
Starting from all $m$ features, at each step we drop one feature from the "best" combination until we obtain a vector of $l$ features.
For example, let $m=4$, and let the originally available features be $x_1,x_2,x_3,x_4$. We wish to select two of them. The selection procedure consists of the following steps:
- Adopt a class separability criterion, $C$, and compute its value for the feature vector $[x_1,x_2,x_3,x_4]^T$.
- Eliminate one feature and, for each of the possible resulting combinations, that is, $[x_1,x_2,x_3]^T$, $[x_1,x_2,x_4]^T$, $[x_1,x_3,x_4]^T$, $[x_2,x_3,x_4]^T$, compute the corresponding criterion value. Select the combination with the best value, say $[x_1,x_2,x_3]^T$.
- From the selected three-dimensional feature vector, eliminate one feature and, for each of the resulting combinations, $[x_1,x_2]^T$, $[x_1,x_3]^T$, $[x_2,x_3]^T$, compute the criterion value and select the one with the best value.
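A minimal sketch of this greedy backward search (names are mine; `X` is assumed to be a NumPy array and `criterion` to return a score where larger is better):

```python
def sequential_backward_selection(X, y, criterion, l):
    """Greedy SBS sketch: drop one feature at a time, keeping the best subset."""
    selected = list(range(X.shape[1]))
    while len(selected) > l:
        # Evaluate every subset obtained by removing one feature.
        scores = {f: criterion(X[:, [g for g in selected if g != f]], y)
                  for f in selected}
        worst = max(scores, key=scores.get)   # its removal hurts the least
        selected.remove(worst)
    return selected
```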
Sequential Forward Selection
Here, the reverse to the preceding procedure is followed:
- Compute the criterion value for each of the features. Select the feature with the best value, say $x_1$.
- Form all possible two-dimensional vectors that contain the winner from the previous step, that is, $[x_1,x_2]^T$, $[x_1,x_3]^T$, $[x_1,x_4]^T$. Compute the criterion value for each of them and select the best one, say $[x_1,x_3]^T$.
A figure in the original notes shows how to select 2 out of 5 features using backward selection (red arrows) and forward selection (blue arrows); the black lines show all the combinations that would have to be evaluated to find the optimal solution.
Note: in that figure the best combination is the one with the lower criterion value; for other criteria the larger value is the better one, depending on the specific criterion.
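A matching sketch of the greedy forward search, under the same assumption that larger criterion values are better (names are mine):

```python
def sequential_forward_selection(X, y, criterion, l):
    """Greedy SFS sketch: add the single best feature at each step."""
    selected = []
    while len(selected) < l:
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        best = max(remaining, key=lambda f: criterion(X[:, selected + [f]], y))
        selected.append(best)
    return selected
```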
Floating Search Methods
The best subset of size $k+1$, $X_{k+1}$, is formed by "borrowing" an element from the set of remaining features $Y_{m-k}$. We then return to the previously selected lower-dimensional subsets to check whether the inclusion of this new element improves the criterion $C$; if it does, the new element replaces one of the previously selected features.
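A hedged sketch of the floating idea (sequential forward floating selection); the per-size bookkeeping below is my own simplification of the textbook algorithm, and `criterion` is again assumed to return a score to maximize:

```python
import numpy as np

def sffs(X, y, criterion, l):
    """Sequential forward floating selection (simplified sketch)."""
    best = {}                      # best (subset, score) found so far for each size
    current = []
    while len(current) < l:
        # Inclusion: add the feature that maximizes the criterion.
        remaining = [f for f in range(X.shape[1]) if f not in current]
        f_add = max(remaining, key=lambda f: criterion(X[:, current + [f]], y))
        current = current + [f_add]
        score = criterion(X[:, current], y)
        if score > best.get(len(current), ([], -np.inf))[1]:
            best[len(current)] = (list(current), score)
        # Conditional exclusion: drop a feature whenever this beats the best
        # subset of the smaller size recorded so far.
        while len(current) > 2:
            cand = [(criterion(X[:, [g for g in current if g != f]], y), f)
                    for f in current]
            s_drop, f_drop = max(cand)
            if s_drop > best[len(current) - 1][1]:
                current = [g for g in current if g != f_drop]
                best[len(current)] = (list(current), s_drop)
            else:
                break
    return best[l][0]
```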
Feature Extraction
We are going to introduce two classical linear feature extractors: LDA and PCA.
Let our data points, $\mathbf x$, be in the $m$-dimensional space and assume that they originate from two classes. Our goal is to generate a feature $y$ as a linear combination of the components of $\mathbf x$. In this way, we expect to "squeeze" the classification-related information residing in $\mathbf x$ into a smaller number of features.
Supervised: Linear Discriminant Analysis (LDA) (5.8)
In this section, this goal is achieved by seeking the direction $\mathbf w$ in the $m$-dimensional space along which the two classes are best separated in some way.
Given an $\mathbf x\in \mathcal R^m$, the scalar
$$y=\frac{\mathbf w^T \mathbf x}{\|\mathbf w\|} \tag{LDA.1}$$
is the projection of $\mathbf x$ along $\mathbf w$. Since scaling all our feature vectors by the same factor does not add any classification-related information, we will ignore the scaling factor $\|\mathbf w\|$.
For the one-dimensional, two-class problem, we can adopt Fisher's discriminant ratio (FDR) from the Scatter Matrices section:
$$FDR=\frac{(\mu_1-\mu_2)^2}{\sigma^2_1+\sigma^2_2}\tag{SM.8}$$
where $\mu_1,\mu_2$ are the mean values and $\sigma_1^2,\sigma^2_2$ the variances of $y$ in the two classes $\omega_1$ and $\omega_2$, respectively, after the projection along $\mathbf w$. Using the definition in $(LDA.1)$ and omitting $\|\mathbf w\|$, it is readily seen that
$$\mu_i=\mathbf w^T \boldsymbol\mu_i\tag{LDA.2}$$
where $\boldsymbol\mu_i$ is the mean value of the data in $\omega_i$ in the $m$-dimensional space. Assuming the classes to be equiprobable and recalling the definitions of $\mathbf S_b$ and $\mathbf S_w$ in Scatter Matrices, it is easily shown that
$$(\mu_1-\mu_2)^2=\mathbf w^T(\boldsymbol\mu_1-\boldsymbol\mu_2)(\boldsymbol\mu_1-\boldsymbol\mu_2)^T \mathbf w \propto \mathbf w^T\mathbf S_b\mathbf w\tag{LDA.3}$$
$$\sigma_i^2=E[(y-\mu_i)^2]=E[\mathbf w^T (\mathbf x-\boldsymbol\mu_i)(\mathbf x-\boldsymbol\mu_i)^T\mathbf w]=\mathbf w^T \boldsymbol\Sigma_i\mathbf w$$
$$\sigma_1^2+\sigma_2^2 \propto \mathbf w^T \mathbf S_w \mathbf w \tag{LDA.4}$$
Combining $(SM.8)$, $(LDA.3)$, and $(LDA.4)$, we conclude that the optimal direction is obtained by maximizing Fisher's criterion
$$FDR(\mathbf w)=\frac{\mathbf w^T\mathbf S_b\mathbf w}{\mathbf w^T\mathbf S_w\mathbf w} \tag{LDA.5}$$
w.r.t. $\mathbf w$. This is the celebrated generalized Rayleigh quotient. Since $\mathbf S_w$ is symmetric positive definite (assuming $\mathbf S_w$ is invertible), we can factor $\mathbf S_w=\mathbf D\mathbf D$ (e.g., with $\mathbf D$ the symmetric square root of $\mathbf S_w$) and let $\mathbf v=\mathbf D\mathbf w$. Then
$$FDR(\mathbf v)=\frac{\mathbf v^T \mathbf D^{-1} \mathbf S_b\mathbf D^{-1} \mathbf v}{\mathbf v^T\mathbf v}$$
It is maximized if $\mathbf v$ is chosen such that
$$\mathbf D^{-1} \mathbf S_b\mathbf D^{-1} \mathbf v=\lambda \mathbf v$$
or, in terms of $\mathbf w$,
$$\mathbf S_w^{-1}\mathbf S_b\mathbf w=\lambda \mathbf w\tag{LDA.6}$$
where $\lambda$ is the largest eigenvalue of $\mathbf S_w^{-1}\mathbf S_b$.
However, for our simple case we do not have to worry about any eigendecomposition. By the definition of $\mathbf S_b$ we have that
$$\lambda \mathbf S_w\mathbf w=(\boldsymbol\mu_1-\boldsymbol\mu_2)(\boldsymbol\mu_1-\boldsymbol\mu_2)^T\mathbf w=\alpha (\boldsymbol\mu_1-\boldsymbol\mu_2)$$
where $\alpha$ is a scalar. Solving the previous equation w.r.t. $\mathbf w$, and since we are only interested in the direction of $\mathbf w$, we can write
$$\mathbf w=\mathbf S_w^{-1}(\boldsymbol\mu_1-\boldsymbol\mu_2)\tag{LDA.7}$$
Thus, we have reduced the number of features from $m$ to $1$ in an optimal way. Classification can now be performed based on $y$. The resulting classifier is
$$g(\mathbf x)=y+w_0=(\boldsymbol\mu_1-\boldsymbol\mu_2)^T\mathbf S_w^{-1}\mathbf x+w_0\tag{LDA.8}$$
It can be interpreted in two ways. First, after the projection we obtain a scalar $y$; we compare it to some threshold $w_0$ and decide which class it belongs to. Second, we can omit the projection step and simply view $g(\mathbf x)=0$ as a decision boundary, a hyperplane perpendicular to $\mathbf w$.
However, the threshold $w_0$ is not provided by Fisher's criterion and has to be determined. For example, for the case of two Gaussian classes with the same covariance matrix, the optimal classifier is shown to take the form
$$g(\mathbf x)=(\boldsymbol\mu_1-\boldsymbol\mu_2)^T \mathbf S_w^{-1}\left(\mathbf x-\frac{1}{2}(\boldsymbol\mu_1+\boldsymbol\mu_2)\right)-\ln \frac{P(\omega_2)}{P(\omega_1)}$$
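A compact NumPy sketch (helper name and example data are mine) of the two-class Fisher direction (LDA.7), together with the threshold of the equal-covariance Gaussian classifier just given:

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher direction w = Sw^{-1}(mu1 - mu2), Eq. (LDA.7), for two classes.

    X1, X2: (n1, m) and (n2, m) samples of classes omega_1 and omega_2.
    Returns w and the threshold of the equal-covariance Gaussian classifier.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    P1 = len(X1) / (len(X1) + len(X2))
    P2 = 1.0 - P1
    Sw = P1 * np.cov(X1, rowvar=False, bias=True) + P2 * np.cov(X2, rowvar=False, bias=True)
    w = np.linalg.solve(Sw, mu1 - mu2)
    # Threshold of the equal-covariance Gaussian classifier shown above.
    w0 = -0.5 * w @ (mu1 + mu2) - np.log(P2 / P1)
    return w, w0

rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 200)
X2 = rng.multivariate_normal([2, 1], [[1, 0.5], [0.5, 1]], 200)
w, w0 = fisher_lda(X1, X2)
decide_class1 = (X1 @ w + w0) > 0   # decide omega_1 when g(x) > 0
print(decide_class1.mean())         # fraction of class-1 points classified correctly
```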
Unsupervised: Principal Component Analysis (PCA) (6.3)
In LDA, the class labels of the feature vectors were assumed known, and this information was optimally exploited to compute the transformation matrix. In PCA, however, we do not know the class labels of feature vectors. The transformation matrix will exploit the statistical information describing the data instead.
Karhunen–Loève (KL) transform
We assume that the data samples have zero mean. A desirable property of the generated features is to be mutually uncorrelated in an effort to avoid information redundancies. Therefore, we begin this section by first developing a method that generates mutually uncorrelated features, that is, $E[y(i)y(j)]=0$ for $i\ne j$.
Let
$$\mathbf y=\mathbf A^T \mathbf x\tag{PCA.1}$$
Since we have assumed that $E[\mathbf x]=\mathbf 0$, it is readily seen that $E[\mathbf y]=\mathbf 0$. From the definition of the correlation matrix we have
$$\mathbf R_y=E[\mathbf y\mathbf y^T]=\mathbf A^T \mathbf R_x \mathbf A \tag{PCA.2}$$
In practice, $\mathbf R_x$ is estimated as an average over the given set of training vectors.
Note that $\mathbf R_x$ is a symmetric matrix, and hence its eigenvectors are mutually orthogonal. Thus, if matrix $\mathbf A$ is chosen so that its columns are the orthonormal eigenvectors $\mathbf a_i$, $i=0,1,\ldots,N-1$, of $\mathbf R_x$, then $\mathbf R_y$ is diagonal:
$$\mathbf R_y=\mathbf A^{T}\mathbf R_x \mathbf A=\mathbf\Lambda\tag{PCA.3}$$
where $\mathbf\Lambda$ is the diagonal matrix having as elements on its diagonal the respective eigenvalues $\lambda_i$, $i=0,1,\ldots,N-1$, of $\mathbf R_x$. Furthermore, assuming $\mathbf R_x$ to be positive definite, the eigenvalues are positive. The resulting transform is known as the Karhunen–Loève (KL) transform, and it achieves our original goal of generating mutually uncorrelated features.
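A small numerical sketch of the KL transform (all names and example numbers are mine): estimate $\mathbf R_x$ from zero-mean samples, take its orthonormal eigenvectors as the columns of $\mathbf A$, and verify that $\mathbf R_y$ comes out diagonal as in (PCA.3).

```python
import numpy as np

# Minimal KL-transform sketch: zero-mean data, estimated correlation matrix,
# and a check that the transformed features are (numerically) uncorrelated.
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0, 0], [[4, 2, 1], [2, 3, 0.5], [1, 0.5, 1]], 1000)
X -= X.mean(axis=0)                       # enforce the zero-mean assumption

Rx = X.T @ X / len(X)                     # sample estimate of R_x
eigvals, A = np.linalg.eigh(Rx)           # columns of A: orthonormal eigenvectors
Y = X @ A                                 # y = A^T x for every sample (as rows)

Ry = Y.T @ Y / len(Y)
print(np.allclose(Ry, np.diag(eigvals), atol=1e-10))   # R_y is (almost) diagonal
```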
Although our starting point was to generate mutually uncorrelated features, the KL transform turns out to have a number of other important properties, which provide different ways of interpreting it and also explain its popularity.
Mean Square Error Approximation
Since $\mathbf A$ is unitary, from $(PCA.1)$ we have $\mathbf A\mathbf y=\mathbf x$, or
$$\mathbf x=\sum_{i=0}^{N-1}y(i)\mathbf a_i\quad \text{and} \quad y(i)=\mathbf a_i^T \mathbf x\tag{PCA.4}$$
To reduce the dimension, let us now define a new vector in the $m$-dimensional subspace
$$\hat{\mathbf x}=\sum_{i=0}^{m-1} y(i) \mathbf a_i\tag{PCA.5}$$
where only $m$ of the basis vectors are involved. Obviously, this is nothing but the projection of $\mathbf x$ onto the subspace spanned by the $m$ (orthonormal) eigenvectors involved in the summation. If we try to approximate $\mathbf x$ by its projection $\hat{\mathbf x}$, the resulting mean square error is given by
$$E\left[\|\mathbf x-\hat{\mathbf x}\|^{2}\right]=E\left[\left\|\sum_{i=m}^{N-1} y(i) \mathbf a_i\right\|^{2}\right]\tag{PCA.6}$$
Our goal now is to choose the eigenvectors that result in the minimum MSE. From $(PCA.6)$ and taking into account the orthonormality property of the eigenvectors, we have
$$\begin{aligned} E\left[\left\|\sum_{i=m}^{N-1} y(i) \mathbf a_i\right\|^{2}\right] &=E\left[\sum_{i} \sum_{j}\left(y(i) \mathbf a_i^{T}\right)\left(y(j) \mathbf a_j\right)\right] \\ &=\sum_{i=m}^{N-1} E\left[y^{2}(i)\right]=\sum_{i=m}^{N-1} \mathbf a_i^{T} E\left[\mathbf x \mathbf x^{T}\right] \mathbf a_i \end{aligned}\tag{PCA.7}$$
Using the eigenvector definition, we finally get
$$E\left[\|\mathbf x-\hat{\mathbf x}\|^{2}\right]=\sum_{i=m}^{N-1} \mathbf a_i^{T} \lambda_i \mathbf a_i=\sum_{i=m}^{N-1} \lambda_i\tag{PCA.8}$$
Thus, if we choose in $(PCA.5)$ the eigenvectors corresponding to the $m$ largest eigenvalues of the correlation matrix, then the error in $(PCA.8)$ is minimized, being the sum of the $N-m$ smallest eigenvalues.
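A quick numerical check of this property (example data and names are mine): keeping the $m$ dominant eigenvectors, the average reconstruction error matches the sum of the discarded eigenvalues of the sample correlation matrix.

```python
import numpy as np

# Sketch: keep the m eigenvectors with the largest eigenvalues and check that the
# average reconstruction error equals the sum of the discarded eigenvalues (PCA.8).
rng = np.random.default_rng(4)
X = rng.multivariate_normal(np.zeros(5), np.diag([5.0, 3.0, 1.0, 0.5, 0.1]), 5000)
X -= X.mean(axis=0)

Rx = X.T @ X / len(X)
eigvals, A = np.linalg.eigh(Rx)
order = np.argsort(eigvals)[::-1]          # sort eigenpairs by decreasing eigenvalue
eigvals, A = eigvals[order], A[:, order]

m = 2
X_hat = X @ A[:, :m] @ A[:, :m].T          # project onto the m dominant eigenvectors
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, eigvals[m:].sum())              # the two numbers should (almost) coincide
```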
Total Variance
Since
$$\sum_{i=m}^{N-1} \sigma_{y(i)}^2=\sum_{i=m}^{N-1} E\left[y^{2}(i)\right]=\sum_{i=m}^{N-1} \mathbf a_i^{T} E\left[\mathbf x \mathbf x^{T}\right] \mathbf a_i=\sum_{i=m}^{N-1} \mathbf a_i^{T} \lambda_i \mathbf a_i=\sum_{i=m}^{N-1} \lambda_i\tag{PCA.9}$$
we can see that the selected $m$ features retain most of the total variance associated with the original random variables $x(i)$.
Nonlinear: Kernel PCA (6.7.1)
As its name suggests, this is a kernelized version of the classical PCA. Given the data set $\mathbf X$, we make an implicit mapping into a reproducing kernel Hilbert space $H$,
$$\mathbf x\in \mathbf X \mapsto \boldsymbol\phi(\mathbf x)\in H$$
Let $\mathbf x_i$, $i=1,2,\cdots,n$, be the available training points. We will work with an estimate of the correlation matrix in $H$, obtained as an average over the known sample points,
$$\mathbf R=\frac{1}{n}\sum_{i=1}^n \boldsymbol\phi(\mathbf x_i) \boldsymbol\phi(\mathbf x_i)^T\tag{KPCA.1}$$
As in PCA, we perform the eigendecomposition of $\mathbf R$, that is,
$$\mathbf R\mathbf v=\lambda \mathbf v\tag{KPCA.2}$$
By the definition of $\mathbf R$, it can be shown that $\mathbf v$ lies in the span of $\{\boldsymbol\phi(\mathbf x_1), \boldsymbol\phi(\mathbf x_2),\cdots, \boldsymbol\phi(\mathbf x_n)\}$. Indeed,
$$\lambda \mathbf v=\left(\frac{1}{n}\sum_{i=1}^n \boldsymbol\phi(\mathbf x_i) \boldsymbol\phi(\mathbf x_i)^T\right)\mathbf v=\frac{1}{n}\sum_{i=1}^n \left(\boldsymbol\phi(\mathbf x_i)^T\mathbf v\right)\boldsymbol\phi(\mathbf x_i)$$
and for $\lambda\ne 0$ we can write
$$\mathbf v=\sum_{i=1}^n a(i)\,\boldsymbol\phi(\mathbf x_i) \tag{KPCA.3}$$
Denote
$$\mathcal K(i,j)=K(\mathbf x_i,\mathbf x_j)=\boldsymbol\phi(\mathbf x_i)^T \boldsymbol\phi(\mathbf x_j)\tag{KPCA.4}$$
where $K(\cdot,\cdot)$ is the adopted kernel function and $\mathcal K$ is the Gram matrix.
From $(KPCA.1)$, $(KPCA.2)$, and $(KPCA.3)$, we have
$$\frac{1}{n}[\boldsymbol\phi(\mathbf x_1)~\cdots~\boldsymbol\phi(\mathbf x_n)]\begin{bmatrix}\boldsymbol\phi(\mathbf x_1)^T\\\vdots\\ \boldsymbol\phi(\mathbf x_n)^T \end{bmatrix}[\boldsymbol\phi(\mathbf x_1)~\cdots~\boldsymbol\phi(\mathbf x_n)]\,\mathbf a=\lambda\, [\boldsymbol\phi(\mathbf x_1)~\cdots~\boldsymbol\phi(\mathbf x_n)]\, \mathbf a$$
which can be satisfied if
$$\mathcal K\mathbf a=n\lambda\mathbf a\tag{KPCA.5}$$
Thus, the $k$-th eigenvector of $\mathbf R$, corresponding to the $k$-th (nonzero) eigenvalue of $\mathcal K$, is expressed as
$$\mathbf v_{k}=\sum_{i=1}^{n} a_{k}(i)\, \boldsymbol\phi(\mathbf x_i), \quad k=1,2, \ldots, p\tag{KPCA.6}$$
where $\lambda_{1} \geq \lambda_{2} \geq \ldots \geq \lambda_{p}$ denote the respective eigenvalues in descending order, $\lambda_{p}$ being the smallest nonzero one, and $\mathbf a_{k}^{T} \equiv\left[a_{k}(1), \ldots, a_{k}(n)\right]$ is the $k$-th eigenvector of the Gram matrix. The latter is assumed to be normalized so that $\left\langle\mathbf v_{k}, \mathbf v_{k}\right\rangle=1$, $k=1,2, \ldots, p$, where $\langle\cdot, \cdot\rangle$ is the dot product in the Hilbert space $H$. This imposes an equivalent normalization on the respective $\mathbf a_{k}$'s, resulting from
$$\begin{aligned} 1=\left\langle \mathbf v_{k}, \mathbf v_{k}\right\rangle &=\left\langle\sum_{i=1}^{n} a_{k}(i)\,\boldsymbol\phi(\mathbf x_i), \sum_{j=1}^{n} a_{k}(j)\, \boldsymbol\phi(\mathbf x_j)\right\rangle \\ &=\sum_{i=1}^{n} \sum_{j=1}^{n} a_{k}(i)\, a_{k}(j)\, \mathcal{K}(i, j) \\ &=\mathbf a_{k}^{T} \mathcal{K} \mathbf a_{k}=n \lambda_{k}\mathbf a_{k}^{T}\mathbf a_{k}, \quad k=1,2, \ldots, p \end{aligned}\tag{KPCA.7}$$
We are now ready to summarize the basic steps for performing a kernel PCA. Given a vector $\mathbf x \in \mathcal{R}^{N}$ and a kernel function $K(\cdot, \cdot)$:
- Compute the Gram matrix $\mathcal{K}(i, j)=K(\mathbf x_i, \mathbf x_j)$, $i, j=1,2, \ldots, n$.
- Compute the $m$ dominant eigenvalues/eigenvectors $\lambda_{k}, \mathbf a_{k}$, $k=1,2, \ldots, m$, of $\mathcal{K}$ (Eq. $(KPCA.5)$).
- Perform the required normalization (Eq. $(KPCA.7)$).
- Compute the $m$ projections onto each one of the dominant eigenvectors,
$$y(k) \equiv\left\langle\mathbf v_{k}, \boldsymbol\phi(\mathbf x)\right\rangle=\sum_{i=1}^{n} a_{k}(i)\, K\left(\mathbf x_{i}, \mathbf x\right), \quad k=1,2, \ldots, m\tag{KPCA.8}$$
The operations given in $(KPCA.8)$ correspond to a nonlinear mapping in the input space. Note that, in contrast to linear PCA, the dominant eigenvectors $\mathbf v_{k}$, $k=1,2, \ldots, m$, are not computed explicitly.
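Putting the summarized steps together, here is a hedged NumPy sketch of kernel PCA with an RBF kernel (all names and the `gamma` value are my own; like the derivation above, it does not center the mapped data in feature space):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_pca_fit(X, m, gamma=1.0):
    """Steps (KPCA.5)-(KPCA.7): eigendecompose the Gram matrix and normalize a_k."""
    K = rbf_kernel(X, X, gamma)              # Gram matrix (no centering, as in the
                                             # derivation above)
    mu, vecs = np.linalg.eigh(K)             # eigenvalues of K equal n * lambda_k
    order = np.argsort(mu)[::-1][:m]
    mu, vecs = mu[order], vecs[:, order]
    alphas = vecs / np.sqrt(mu)              # enforce a_k^T K a_k = 1, Eq. (KPCA.7)
    return alphas, X, gamma

def kernel_pca_transform(x_new, alphas, X_train, gamma):
    """Projection of a new point, Eq. (KPCA.8): y(k) = sum_i a_k(i) K(x_i, x)."""
    k = rbf_kernel(X_train, x_new[None, :], gamma)[:, 0]
    return alphas.T @ k

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
alphas, Xtr, g = kernel_pca_fit(X, m=2, gamma=0.5)
print(kernel_pca_transform(X[0], alphas, Xtr, g))   # 2 nonlinear features of x_1
```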