【主成分分析】《数学建模算法与应用》第十章多元分析第二节主成分分析 python实现第十章多元分析

78 阅读 0 评论 52 点赞

我是靠谱客的博主眼睛大老鼠，这篇文章主要介绍【主成分分析】《数学建模算法与应用》第十章多元分析第二节主成分分析 python实现第十章多元分析，现在分享给大家，希望可以做个参考。

第十章多元分析

第二节主成分分析

介绍
这里是司守奎教授的《数学建模算法与应用》全书案例代码python实现，欢迎加入此项目将其案例代码用python实现
GitHub项目地址：Mathematical-modeling-algorithm-and-Application
CSDN专栏：数学建模
知乎专栏：数学建模算法与应用
联系作者
作者：STL_CC
邮箱：1459078309@qq.com

由于作者还是大一学生，才疏学浅，难免会有错误，欢迎指正
同时作者精力有限，希望更多大佬加入此项目，一来可以提高建模水平，二来可以分享建模经验

主成分回归分析

例10.5
Hald水泥问题，考察含如下四种化学成分
$x_1$ = $3CaO.Al_2O_3$ 的含量（%）， $x_2$ = $3CaO.SiO_2$ 的含量（%）
$x_3$ = $4CaO.Al_2O_3.Fe_2O_3$ 的含量（%）， $x_4$ = $2CaO.SiO_2$ 的含量（%）
的某种水泥，每一克所释放出的热量（卡） y 与这四种成分含量之间的关系数据共13组，见表7，对数据实施标准化，则 X T X /12就是样本相关系数阵（见表8）。
注：表格见教材

复制代码

1
2
3
4
5
6
7
8
9
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

复制代码

1
2
3
sn=pd.read_csv('sn.txt',header=None,sep='	')
m,n=sn.shape

复制代码

1
2
sn

	0	1	2	3	4
0	7	26	6	60	78.5
1	1	29	15	52	74.3
2	11	56	8	20	104.3
3	11	31	8	47	87.6
4	7	52	6	33	95.9
5	11	55	9	22	109.2
6	3	71	17	6	102.7
7	1	31	22	44	72.5
8	2	54	18	22	93.1
9	21	47	4	26	115.9
10	1	40	23	34	83.8
11	11	66	9	12	113.3
12	10	68	8	12	109.4

复制代码

1
2
3
x0=np.array(sn.iloc[:,0:n-1])
y0=np.array(sn.iloc[:,n-1])

复制代码

1
2
3
results = sm.OLS(y0, np.c_[np.ones((m,1)),x0]).fit()
print(results.summary())

复制代码

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     111.5
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           4.76e-07
Time:                        01:16:14   Log-Likelihood:                -26.918
No. Observations:                  13   AIC:                             63.84
Df Residuals:                       8   BIC:                             66.66
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         62.4054     70.071      0.891      0.399     -99.179     223.989
x1             1.5511      0.745      2.083      0.071      -0.166       3.269
x2             0.5102      0.724      0.705      0.501      -1.159       2.179
x3             0.1019      0.755      0.135      0.896      -1.638       1.842
x4            -0.1441      0.709     -0.203      0.844      -1.779       1.491
==============================================================================
Omnibus:                        0.165   Durbin-Watson:                   2.053
Prob(Omnibus):                  0.921   Jarque-Bera (JB):                0.320
Skew:                           0.201   Prob(JB):                        0.852
Kurtosis:                       2.345   Cond. No.                     6.06e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.06e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

D:Program Filesanacondalibsite-packagesscipystatsstats.py:1450: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
  "anyway, n=%i" % int(n))

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     111.5
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           4.76e-07
Time:                        01:16:14   Log-Likelihood:                -26.918
No. Observations:                  13   AIC:                             63.84
Df Residuals:                       8   BIC:                             66.66
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         62.4054     70.071      0.891      0.399     -99.179     223.989
x1             1.5511      0.745      2.083      0.071      -0.166       3.269
x2             0.5102      0.724      0.705      0.501      -1.159       2.179
x3             0.1019      0.755      0.135      0.896      -1.638       1.842
x4            -0.1441      0.709     -0.203      0.844      -1.779       1.491
==============================================================================
Omnibus:                        0.165   Durbin-Watson:                   2.053
Prob(Omnibus):                  0.921   Jarque-Bera (JB):                0.320
Skew:                           0.201   Prob(JB):                        0.852
Kurtosis:                       2.345   Cond. No.                     6.06e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.06e+03. This might indicate that there are
strong multicollinearity or other numerical problems.


D:Program Filesanacondalibsite-packagesscipystatsstats.py:1450: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
  "anyway, n=%i" % int(n))

复制代码

1
2
3
4
r=np.corrcoef(x0.T)
xd=zscore(x0)
yd=zscore(y0)

复制代码

1
2
3
pca=PCA()
pca.fit(xd)

复制代码

1
2
3
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

复制代码

1
2
pca.explained_variance_ratio_#计算各个变量的贡献率

复制代码

1
2
array([5.58926009e-01, 3.94016518e-01, 4.66515373e-02, 4.05936433e-04])

复制代码

1
2
3
4
num=3#由于我们使用jupyter演示程序，可能不好交互式
pca_=PCA(n_components=num)
xd_=pca_.fit_transform(xd)

复制代码

1
2
3
4
results = sm.OLS(yd, np.c_[np.ones((m,1)),xd_]).fit()
print(results.summary())
#原MATLAB程序中还做了使系数全为正的处理，在此报告中，x1为负，但是其实只要转成正的就行

复制代码

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     164.9
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           3.50e-08
Time:                        01:16:16   Log-Likelihood:                 7.7143
No. Observations:                  13   AIC:                            -7.429
Df Residuals:                       9   BIC:                            -5.169
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.012e-16      0.045   4.52e-15      1.000      -0.101       0.101
x1            -0.6570      0.030    -22.045      0.000      -0.724      -0.590
x2             0.0083      0.035      0.234      0.820      -0.072       0.089
x3             0.3028      0.103      2.935      0.017       0.069       0.536
==============================================================================
Omnibus:                        0.246   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.416
Skew:                           0.162   Prob(JB):                        0.812
Kurtosis:                       2.186   Cond. No.                         3.46
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

D:Program Filesanacondalibsite-packagesscipystatsstats.py:1450: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
  "anyway, n=%i" % int(n))

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     164.9
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           3.50e-08
Time:                        01:16:16   Log-Likelihood:                 7.7143
No. Observations:                  13   AIC:                            -7.429
Df Residuals:                       9   BIC:                            -5.169
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.012e-16      0.045   4.52e-15      1.000      -0.101       0.101
x1            -0.6570      0.030    -22.045      0.000      -0.724      -0.590
x2             0.0083      0.035      0.234      0.820      -0.072       0.089
x3             0.3028      0.103      2.935      0.017       0.069       0.536
==============================================================================
Omnibus:                        0.246   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.416
Skew:                           0.162   Prob(JB):                        0.812
Kurtosis:                       2.186   Cond. No.                         3.46
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


D:Program Filesanacondalibsite-packagesscipystatsstats.py:1450: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
  "anyway, n=%i" % int(n))

复制代码

1
2
xback=np.dot(xd_,pca.components_[:3])

复制代码

1
2
3
results = sm.OLS(yd, np.c_[np.ones((m,1)),xback]).fit()
print(results.summary())

复制代码

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     164.9
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           3.50e-08
Time:                        01:16:23   Log-Likelihood:                 7.7143
No. Observations:                  13   AIC:                            -7.429
Df Residuals:                       9   BIC:                            -5.169
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.012e-16      0.045   4.52e-15      1.000      -0.101       0.101
x1             0.5130      0.073      6.992      0.000       0.347       0.679
x2             0.2787      0.039      7.078      0.000       0.190       0.368
x3            -0.0608      0.070     -0.866      0.409      -0.220       0.098
x4            -0.4229      0.030    -13.871      0.000      -0.492      -0.354
==============================================================================
Omnibus:                        0.246   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.416
Skew:                           0.162   Prob(JB):                        0.812
Kurtosis:                       2.186   Cond. No.                     8.35e+15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.17e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     164.9
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           3.50e-08
Time:                        01:16:23   Log-Likelihood:                 7.7143
No. Observations:                  13   AIC:                            -7.429
Df Residuals:                       9   BIC:                            -5.169
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.012e-16      0.045   4.52e-15      1.000      -0.101       0.101
x1             0.5130      0.073      6.992      0.000       0.347       0.679
x2             0.2787      0.039      7.078      0.000       0.190       0.368
x3            -0.0608      0.070     -0.866      0.409      -0.220       0.098
x4            -0.4229      0.030    -13.871      0.000      -0.492      -0.354
==============================================================================
Omnibus:                        0.246   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.416
Skew:                           0.162   Prob(JB):                        0.812
Kurtosis:                       2.186   Cond. No.                     8.35e+15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.17e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

说明
1.在教材原来的matlab程序中，OLS普通最小二乘法系数他是用朴素的方法计算出来的，在本文中，借助python强大的拓展能力，笔者使用了statsmodel统计模型库中的OLS模型进行建模。它不仅能够返回相关系数，更重要的是他可以返回与之相关的各项参数并生成报表。
2.在matlab中，相关系数矩阵计算使用的corrcoef列指标是变量，然而np.corrcoef行指标是向量，所以要先转置。
3.PCA模型在python的sklearn机器学习库中，相比比较裸的MATLAB工具，它提供了更多强大的功能，可以去sklearn文档探索更多，这里写法会和MATLAB写法有所不同。
4.协方差矩阵和相关系数矩阵易混：

相关系数矩阵:相当于消除量纲的表示变量间相关性的一个矩阵
协方差矩阵:它是没有消除量纲的表示变量间相关性的矩阵
r=COV(x,y)/D(x)D(y)

参考文献
1.python中使用多个模块使用普通最小二乘法回归
2.最小二乘法原理
3.机器学习中的降维算法
4.浅谈方差、协方差矩阵、相关系数矩阵
5.用scikit-learn学习主成分分析(PCA)
6.sklearn中PCA官方文档
7.使用sklearn进行对数据标准化、归一化以及将数据还原的方法

案例研究

主成分分析案例－我国各地区普通高等教育发展水平综合评价
主成分分析试图在力保数据信息丢失最少的原则下，对多变量的截面数据表进行最佳综合简化，也就是说，对高维变量空间进行降维处理。本案例运用主成分分析方法综合评价我国各地区普通高等教育的发展水平。
具体资料见教材，部分数据将在程序中给出

复制代码

1
2
3
4
5
6
7
8
9
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

复制代码

1
2
3
4
gj=pd.read_csv('gj.txt',header=None,sep='	')
gj=zscore(gj)
r=np.corrcoef(gj.T)

复制代码

1
2
3
4
pca=PCA()
pca.fit(gj)
pca.explained_variance_ratio_#计算各个变量的得分

复制代码

1
2
3
4
array([7.50215857e-01, 1.57698725e-01, 5.36213464e-02, 2.06379017e-02,
       1.45001273e-02, 2.21867124e-03, 7.12026644e-04, 2.65835385e-04,
       7.25914518e-05, 5.69183083e-05])