Chapter 10: Multivariate Analysis
Section 2: Principal Component Analysis
Introduction
This is a Python implementation of the case-study code from Professor Si Shoukui's Mathematical Modeling Algorithms and Applications (《数学建模算法与应用》). You are welcome to join the project and implement more of its case-study code in Python.
GitHub project: Mathematical-modeling-algorithm-and-Application
CSDN column: 数学建模
Zhihu column: 数学建模算法与应用
Contact
Author: STL_CC
Email: 1459078309@qq.com
As the author is only a first-year undergraduate, mistakes are inevitable; corrections are welcome. The author's time is also limited, so more experienced contributors are encouraged to join the project, both to sharpen their own modeling skills and to share modeling experience.
Principal Component Regression
Example 10.5
The Hald cement problem considers a cement containing the following four chemical components:

- $x_1$ = content (%) of 3CaO·Al2O3,
- $x_2$ = content (%) of 3CaO·SiO2,
- $x_3$ = content (%) of 4CaO·Al2O3·Fe2O3,
- $x_4$ = content (%) of 2CaO·SiO2.

There are 13 observations relating the heat $y$ (in calories) released per gram of cement to the contents of these four components; see Table 7. After standardizing the data, $X^T X / 12$ is the sample correlation matrix (see Table 8).
Note: the tables are given in the textbook.
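The claim that $X^T X / 12$ equals the sample correlation matrix can be checked numerically. A minimal sketch using the 13 observations listed in the data table later in this notebook; note that `zscore` must use `ddof=1` to match the $n-1$ normalization:

```python
import numpy as np
from scipy.stats import zscore

# 13 observations of the four component contents (Table 7 in the text)
x0 = np.array([
    [ 7, 26,  6, 60],
    [ 1, 29, 15, 52],
    [11, 56,  8, 20],
    [11, 31,  8, 47],
    [ 7, 52,  6, 33],
    [11, 55,  9, 22],
    [ 3, 71, 17,  6],
    [ 1, 31, 22, 44],
    [ 2, 54, 18, 22],
    [21, 47,  4, 26],
    [ 1, 40, 23, 34],
    [11, 66,  9, 12],
    [10, 68,  8, 12],
])

xd = zscore(x0, ddof=1)           # standardize with the n-1 convention
gram = xd.T @ xd / (len(x0) - 1)  # X^T X / 12
print(np.allclose(gram, np.corrcoef(x0.T)))  # True
```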
```python
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
```
```python
sn = pd.read_csv('sn.txt', header=None, sep=' ')
m, n = sn.shape
```
```python
sn
```
|    | 0  | 1  | 2  | 3  | 4     |
|----|----|----|----|----|-------|
| 0  | 7  | 26 | 6  | 60 | 78.5  |
| 1  | 1  | 29 | 15 | 52 | 74.3  |
| 2  | 11 | 56 | 8  | 20 | 104.3 |
| 3  | 11 | 31 | 8  | 47 | 87.6  |
| 4  | 7  | 52 | 6  | 33 | 95.9  |
| 5  | 11 | 55 | 9  | 22 | 109.2 |
| 6  | 3  | 71 | 17 | 6  | 102.7 |
| 7  | 1  | 31 | 22 | 44 | 72.5  |
| 8  | 2  | 54 | 18 | 22 | 93.1  |
| 9  | 21 | 47 | 4  | 26 | 115.9 |
| 10 | 1  | 40 | 23 | 34 | 83.8  |
| 11 | 11 | 66 | 9  | 12 | 113.3 |
| 12 | 10 | 68 | 8  | 12 | 109.4 |
```python
x0 = np.array(sn.iloc[:, 0:n-1])
y0 = np.array(sn.iloc[:, n-1])
```
```python
results = sm.OLS(y0, np.c_[np.ones((m, 1)), x0]).fit()
print(results.summary())
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     111.5
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           4.76e-07
Time:                        01:16:14   Log-Likelihood:                -26.918
No. Observations:                  13   AIC:                             63.84
Df Residuals:                       8   BIC:                             66.66
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         62.4054     70.071      0.891      0.399     -99.179     223.989
x1             1.5511      0.745      2.083      0.071      -0.166       3.269
x2             0.5102      0.724      0.705      0.501      -1.159       2.179
x3             0.1019      0.755      0.135      0.896      -1.638       1.842
x4            -0.1441      0.709     -0.203      0.844      -1.779       1.491
==============================================================================
Omnibus:                        0.165   Durbin-Watson:                   2.053
Prob(Omnibus):                  0.921   Jarque-Bera (JB):                0.320
Skew:                           0.201   Prob(JB):                        0.852
Kurtosis:                       2.345   Cond. No.                     6.06e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.06e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

D:\Program Files\anaconda\lib\site-packages\scipy\stats\stats.py:1450: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
  "anyway, n=%i" % int(n))
```
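The large condition number reported above signals multicollinearity among the four regressors, which is what motivates principal component regression here. Another common diagnostic (not used in the original text, shown only as a supplement) is the variance inflation factor from statsmodels; a minimal sketch:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hald cement data: the same 13 observations of the four component contents
x0 = np.array([
    [7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20], [11, 31, 8, 47],
    [7, 52, 6, 33], [11, 55, 9, 22], [3, 71, 17, 6], [1, 31, 22, 44],
    [2, 54, 18, 22], [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
    [10, 68, 8, 12],
])
X = np.c_[np.ones(len(x0)), x0]  # design matrix with an intercept column

# A VIF well above 10 is conventionally read as serious multicollinearity
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)
```

For this data set all four VIFs come out far above 10, consistent with the condition-number warning.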
```python
r = np.corrcoef(x0.T)  # np.corrcoef treats rows as variables, hence the transpose
xd = zscore(x0)
yd = zscore(y0)
```
```python
pca = PCA()
pca.fit(xd)
```
```
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
```
```python
pca.explained_variance_ratio_  # contribution rate (explained variance ratio) of each component
```
```
array([5.58926009e-01, 3.94016518e-01, 4.66515373e-02, 4.05936433e-04])
```
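The first three components together explain about 99.96% of the variance, which is why three are kept below. A sketch of picking the count programmatically with a cumulative-variance threshold (the text simply fixes it at 3 by hand):

```python
import numpy as np

# Explained-variance ratios from the fitted PCA above
ratios = np.array([5.58926009e-01, 3.94016518e-01, 4.66515373e-02, 4.05936433e-04])

# Smallest number of components whose cumulative contribution reaches 99%
num = int(np.searchsorted(np.cumsum(ratios), 0.99) + 1)
print(num)  # 3
```

sklearn can also do this directly: passing a float, as in `PCA(n_components=0.99)`, keeps the smallest number of components explaining at least that fraction of variance.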
```python
num = 3  # chosen by hand; interactive selection is awkward in a Jupyter demo
pca_ = PCA(n_components=num)
xd_ = pca_.fit_transform(xd)
```
```python
results = sm.OLS(yd, np.c_[np.ones((m, 1)), xd_]).fit()
print(results.summary())
# The original MATLAB program also flips signs so that all coefficients are positive.
# In this report x1 comes out negative, but it can simply be negated
# (together with the corresponding component).
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     164.9
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           3.50e-08
Time:                        01:16:16   Log-Likelihood:                 7.7143
No. Observations:                  13   AIC:                            -7.429
Df Residuals:                       9   BIC:                            -5.169
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.012e-16      0.045   4.52e-15      1.000      -0.101       0.101
x1            -0.6570      0.030    -22.045      0.000      -0.724      -0.590
x2             0.0083      0.035      0.234      0.820      -0.072       0.089
x3             0.3028      0.103      2.935      0.017       0.069       0.536
==============================================================================
Omnibus:                        0.246   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.416
Skew:                           0.162   Prob(JB):                        0.812
Kurtosis:                       2.186   Cond. No.                         3.46
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

D:\Program Files\anaconda\lib\site-packages\scipy\stats\stats.py:1450: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
  "anyway, n=%i" % int(n))
```
```python
# Map the 3 principal-component scores back to the 4 standardized variables
xback = np.dot(xd_, pca.components_[:3])
```
```python
results = sm.OLS(yd, np.c_[np.ones((m, 1)), xback]).fit()
print(results.summary())
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     164.9
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           3.50e-08
Time:                        01:16:23   Log-Likelihood:                 7.7143
No. Observations:                  13   AIC:                            -7.429
Df Residuals:                       9   BIC:                            -5.169
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.012e-16      0.045   4.52e-15      1.000      -0.101       0.101
x1             0.5130      0.073      6.992      0.000       0.347       0.679
x2             0.2787      0.039      7.078      0.000       0.190       0.368
x3            -0.0608      0.070     -0.866      0.409      -0.220       0.098
x4            -0.4229      0.030    -13.871      0.000      -0.492      -0.354
==============================================================================
Omnibus:                        0.246   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.416
Skew:                           0.162   Prob(JB):                        0.812
Kurtosis:                       2.186   Cond. No.                     8.35e+15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.17e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
```
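The coefficients that statsmodels reports for this (singular) regression on `xback` are the minimum-norm solution, and they can be obtained without the second OLS call by folding the PCA loadings into the principal-component coefficients, $\beta = C^T \gamma$. A self-contained sketch of the whole pipeline:

```python
import numpy as np
from scipy.stats import zscore
from sklearn.decomposition import PCA

# Hald data: four component contents and heat y (Table 7)
data = np.array([
    [7, 26, 6, 60, 78.5], [1, 29, 15, 52, 74.3], [11, 56, 8, 20, 104.3],
    [11, 31, 8, 47, 87.6], [7, 52, 6, 33, 95.9], [11, 55, 9, 22, 109.2],
    [3, 71, 17, 6, 102.7], [1, 31, 22, 44, 72.5], [2, 54, 18, 22, 93.1],
    [21, 47, 4, 26, 115.9], [1, 40, 23, 34, 83.8], [11, 66, 9, 12, 113.3],
    [10, 68, 8, 12, 109.4],
])
xd, yd = zscore(data[:, :4]), zscore(data[:, 4])

pca = PCA(n_components=3)
scores = pca.fit_transform(xd)

# Least-squares coefficients gamma on the principal-component scores
gamma, *_ = np.linalg.lstsq(np.c_[np.ones(len(xd)), scores], yd, rcond=None)

# Fold the loadings into gamma: betas on the standardized original variables
beta = pca.components_.T @ gamma[1:]
print(np.round(beta, 4))  # ≈ [ 0.513   0.2787 -0.0608 -0.4229]
```

This matches the x1–x4 coefficients in the report above, and the result does not depend on the sign convention sklearn picks for each component, since flipping a component flips its gamma as well.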
Notes
1. In the textbook's original MATLAB program, the ordinary least-squares coefficients are computed by a hand-rolled method. Here, taking advantage of Python's ecosystem, the OLS model from the statsmodels library is used instead: it returns not only the coefficients but, more importantly, all the associated statistics, and generates a report.
2. In MATLAB, corrcoef treats columns as variables, whereas np.corrcoef treats rows as variables, so the data matrix must be transposed first.
3. The PCA model lives in Python's scikit-learn machine-learning library. Compared with MATLAB's fairly bare-bones tooling it offers many more features (the sklearn documentation is worth exploring), so the code here differs from the MATLAB version.
4. The covariance matrix and the correlation matrix are easily confused:
- Correlation matrix: expresses the correlations between variables with the units (dimensions) removed.
- Covariance matrix: expresses the correlations between variables without removing the units.
- $r = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{D(x)}\,\sqrt{D(y)}}$
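The formula above can be checked directly: dividing the covariance matrix elementwise by the outer product of the standard deviations reproduces np.corrcoef. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated sample: rows are observations, columns are variables
x = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))

cov = np.cov(x.T)              # covariance matrix (rows of the argument are variables)
d = np.sqrt(np.diag(cov))      # standard deviations sqrt(D(x_i))
r = cov / np.outer(d, d)       # r_ij = Cov(x_i, x_j) / (sqrt(D(x_i)) * sqrt(D(x_j)))
print(np.allclose(r, np.corrcoef(x.T)))  # True
```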
References
1. Ordinary least-squares regression in Python with several different libraries
2. The principle of least squares
3. Dimensionality-reduction algorithms in machine learning
4. A brief discussion of variance, covariance matrices, and correlation matrices
5. Learning principal component analysis (PCA) with scikit-learn
6. Official sklearn PCA documentation
7. Standardizing, normalizing, and de-normalizing data with sklearn
Case Study
PCA case: a comprehensive evaluation of the development level of regular higher education across China's regions
Principal component analysis attempts, under the principle of losing as little of the information in the data as possible, to optimally summarize and simplify a multivariate cross-sectional data table — that is, to reduce the dimensionality of a high-dimensional variable space. This case study applies PCA to build a comprehensive evaluation of the development level of regular higher education in each region of China.
See the textbook for the full details; part of the data is given in the program.
```python
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
```
```python
gj = pd.read_csv('gj.txt', header=None, sep=' ')
gj = zscore(gj)
r = np.corrcoef(gj.T)
```
```python
pca = PCA()
pca.fit(gj)
pca.explained_variance_ratio_  # contribution rate (explained variance ratio) of each component
```
```
array([7.50215857e-01, 1.57698725e-01, 5.36213464e-02, 2.06379017e-02,
       1.45001273e-02, 2.21867124e-03, 7.12026644e-04, 2.65835385e-04,
       7.25914518e-05, 5.69183083e-05])
```
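The first four components explain about 98.2% of the variance here. To turn the component scores into the comprehensive evaluation, a common approach (an assumption about the usual textbook treatment; `gj.txt` itself is not reproduced here, so synthetic stand-in data are used) is to weight the leading components' scores by their contribution rates and rank the regions by the resulting composite score:

```python
import numpy as np
from scipy.stats import zscore
from sklearn.decomposition import PCA

# Synthetic stand-in for gj.txt: 30 regions x 10 indicators (real data are in the textbook)
rng = np.random.default_rng(0)
gj = rng.normal(size=(30, 10)) @ rng.normal(size=(10, 10))
gj = zscore(gj)

pca = PCA(n_components=4)        # keep the leading components (~98% in the real data)
scores = pca.fit_transform(gj)

# Composite score: component scores weighted by their contribution rates
weights = pca.explained_variance_ratio_
composite = scores @ weights
ranking = np.argsort(-composite)  # regions ranked from highest to lowest composite score
print(ranking[:5])
```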