Chapter 10 Multivariate Analysis
Section 2 Principal Component Analysis
Introduction
This is the Python implementation of the case-study code from Professor 司守奎's Mathematical Modeling: Algorithms and Applications (《数学建模算法与应用》). You are welcome to join the project and implement more of its cases in Python.
GitHub repository: Mathematical-modeling-algorithm-and-Application
CSDN column: 数学建模
Zhihu column: 数学建模算法与应用
Contact
Author: STL_CC
Email: [email protected]
The author is still a first-year undergraduate, so mistakes are hard to avoid; corrections are welcome.
The author's time is also limited, so more experienced contributors are encouraged to join the project, both to improve their modeling skills and to share modeling experience.
Principal component regression
Example 10.5
The Hald cement problem. Consider a cement containing the following four chemical components:

- $x_1$: content (%) of $3\mathrm{CaO}\cdot\mathrm{Al_2O_3}$
- $x_2$: content (%) of $3\mathrm{CaO}\cdot\mathrm{SiO_2}$
- $x_3$: content (%) of $4\mathrm{CaO}\cdot\mathrm{Al_2O_3}\cdot\mathrm{Fe_2O_3}$
- $x_4$: content (%) of $2\mathrm{CaO}\cdot\mathrm{SiO_2}$

There are 13 observations relating the heat $y$ (in calories) released per gram of this cement to the contents of the four components (see Table 7). After standardizing the data, $X^{\mathsf{T}}X/12$ is the sample correlation matrix (see Table 8).
Note: see the textbook for the tables.
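As a quick numeric check of the standardization claim above (a minimal sketch; it reads the same sn.txt file as the program below and uses ddof=1, i.e. the sample standard deviation, as MATLAB's zscore does):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

sn = pd.read_csv('sn.txt', header=None, sep=' ')
X = zscore(sn.iloc[:, :4], ddof=1)      # standardize the four component columns
R = X.T @ X / (len(X) - 1)              # X'X/12 for the 13 standardized rows
assert np.allclose(R, np.corrcoef(X, rowvar=False))  # the sample correlation matrix
```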
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
sn=pd.read_csv('sn.txt',header=None,sep=' ')  # 13 observations: columns x1..x4 and y
m,n=sn.shape
sn
|    | 0  | 1  | 2  | 3  | 4     |
|----|----|----|----|----|-------|
| 0  | 7  | 26 | 6  | 60 | 78.5  |
| 1  | 1  | 29 | 15 | 52 | 74.3  |
| 2  | 11 | 56 | 8  | 20 | 104.3 |
| 3  | 11 | 31 | 8  | 47 | 87.6  |
| 4  | 7  | 52 | 6  | 33 | 95.9  |
| 5  | 11 | 55 | 9  | 22 | 109.2 |
| 6  | 3  | 71 | 17 | 6  | 102.7 |
| 7  | 1  | 31 | 22 | 44 | 72.5  |
| 8  | 2  | 54 | 18 | 22 | 93.1  |
| 9  | 21 | 47 | 4  | 26 | 115.9 |
| 10 | 1  | 40 | 23 | 34 | 83.8  |
| 11 | 11 | 66 | 9  | 12 | 113.3 |
| 12 | 10 | 68 | 8  | 12 | 109.4 |
x0=np.array(sn.iloc[:,0:n-1])  # regressors x1..x4
y0=np.array(sn.iloc[:,n-1])  # response y
results = sm.OLS(y0, np.c_[np.ones((m,1)),x0]).fit()  # ordinary least squares with an intercept column
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.982
Model: OLS Adj. R-squared: 0.974
Method: Least Squares F-statistic: 111.5
Date: Fri, 31 Jul 2020 Prob (F-statistic): 4.76e-07
Time: 01:16:14 Log-Likelihood: -26.918
No. Observations: 13 AIC: 63.84
Df Residuals: 8 BIC: 66.66
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 62.4054 70.071 0.891 0.399 -99.179 223.989
x1 1.5511 0.745 2.083 0.071 -0.166 3.269
x2 0.5102 0.724 0.705 0.501 -1.159 2.179
x3 0.1019 0.755 0.135 0.896 -1.638 1.842
x4 -0.1441 0.709 -0.203 0.844 -1.779 1.491
==============================================================================
Omnibus: 0.165 Durbin-Watson: 2.053
Prob(Omnibus): 0.921 Jarque-Bera (JB): 0.320
Skew: 0.201 Prob(JB): 0.852
Kurtosis: 2.345 Cond. No. 6.06e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.06e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
r=np.corrcoef(x0.T)  # correlation matrix (transpose: np.corrcoef expects variables in rows)
xd=zscore(x0)  # standardize the regressors column by column
yd=zscore(y0)  # standardize the response
pca=PCA()
pca.fit(xd)  # principal component analysis of the standardized regressors
pca.explained_variance_ratio_  # variance contribution rate of each principal component
array([5.58926009e-01, 3.94016518e-01, 4.66515373e-02, 4.05936433e-04])
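The first three components already explain about 99.96% of the variance. Instead of fixing num by hand, one could pick it from the cumulative contribution rate (a sketch; the 90% threshold and the name num_90 are our own choices, and with this threshold only two components would be kept):

```python
cum = np.cumsum(pca.explained_variance_ratio_)  # cumulative contribution rates
num_90 = int(np.argmax(cum >= 0.90) + 1)        # smallest number of components reaching 90%: here 2
```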
num=3  # the textbook program asks for this interactively; we hard-code it, since interactive input is awkward in a Jupyter demo
pca_=PCA(n_components=num)
xd_=pca_.fit_transform(xd)  # scores of the first num principal components
results = sm.OLS(yd, np.c_[np.ones((m,1)),xd_]).fit()  # regress the standardized y on the component scores
print(results.summary())
# The original MATLAB program additionally flips signs so that all coefficients come out positive;
# in this report the x1 coefficient is negative, but flipping the corresponding component's sign fixes that
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.982
Model: OLS Adj. R-squared: 0.976
Method: Least Squares F-statistic: 164.9
Date: Fri, 31 Jul 2020 Prob (F-statistic): 3.50e-08
Time: 01:16:16 Log-Likelihood: 7.7143
No. Observations: 13 AIC: -7.429
Df Residuals: 9 BIC: -5.169
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.012e-16 0.045 4.52e-15 1.000 -0.101 0.101
x1 -0.6570 0.030 -22.045 0.000 -0.724 -0.590
x2 0.0083 0.035 0.234 0.820 -0.072 0.089
x3 0.3028 0.103 2.935 0.017 0.069 0.536
==============================================================================
Omnibus: 0.246 Durbin-Watson: 1.943
Prob(Omnibus): 0.884 Jarque-Bera (JB): 0.416
Skew: 0.162 Prob(JB): 0.812
Kurtosis: 2.186 Cond. No. 3.46
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
xback=np.dot(xd_,pca.components_[:3])  # project the scores back onto the standardized x1..x4
results = sm.OLS(yd, np.c_[np.ones((m,1)),xback]).fit()  # same fit, coefficients now per standardized variable
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.982
Model: OLS Adj. R-squared: 0.976
Method: Least Squares F-statistic: 164.9
Date: Fri, 31 Jul 2020 Prob (F-statistic): 3.50e-08
Time: 01:16:23 Log-Likelihood: 7.7143
No. Observations: 13 AIC: -7.429
Df Residuals: 9 BIC: -5.169
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.012e-16 0.045 4.52e-15 1.000 -0.101 0.101
x1 0.5130 0.073 6.992 0.000 0.347 0.679
x2 0.2787 0.039 7.078 0.000 0.190 0.368
x3 -0.0608 0.070 -0.866 0.409 -0.220 0.098
x4 -0.4229 0.030 -13.871 0.000 -0.492 -0.354
==============================================================================
Omnibus: 0.246 Durbin-Watson: 1.943
Prob(Omnibus): 0.884 Jarque-Bera (JB): 0.416
Skew: 0.162 Prob(JB): 0.812
Kurtosis: 2.186 Cond. No. 8.35e+15
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.17e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
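The smallest-eigenvalue warning is expected here: xback has rank 3, so statsmodels falls back to a pseudoinverse, and the minimum-norm solution it returns is exactly the principal-component-regression coefficient vector. The same coefficients, and their version in the original units, can be obtained directly (a sketch of the standard back-transformation identities, reusing the variables defined above; zscore standardized with ddof=0, which matches NumPy's default std):

```python
gamma = sm.OLS(yd, np.c_[np.ones((m, 1)), xd_]).fit().params[1:]  # slopes in PC space
beta_std = pca_.components_.T @ gamma  # slopes per standardized x1..x4, matching the summary above

# Undo the z-scoring: y = b0 + x @ beta in the original units.
beta = beta_std * y0.std() / x0.std(axis=0)
b0 = y0.mean() - x0.mean(axis=0) @ beta
```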
Notes
1. In the textbook's original MATLAB program, the ordinary-least-squares coefficients are computed by hand. Here, drawing on Python's rich ecosystem, we use the OLS model from the statsmodels library instead: it returns not only the coefficients but also all the associated statistics, rendered as a summary report.
2. MATLAB's corrcoef treats columns as variables, whereas np.corrcoef by default treats rows as variables, so the data matrix must be transposed first (or rowvar=False passed).
3. The PCA model lives in Python's sklearn machine-learning library. Compared with MATLAB's fairly bare tooling it offers much more functionality (see the sklearn documentation), so the code here differs from the MATLAB version.
4. The covariance matrix and the correlation matrix are easy to confuse:
- correlation matrix: describes the pairwise relations between variables with the units of measurement removed
- covariance matrix: describes the same relations with the units retained
- $r = \mathrm{Cov}(x,y)/\sqrt{D(x)\,D(y)}$, where $D(\cdot)$ denotes the variance (a numeric check of notes 2 and 4 follows below)
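As promised, a tiny numeric check of notes 2 and 4, reusing the cement data x0 loaded earlier (a sketch, not part of the original program):

```python
C = np.cov(x0, rowvar=False)              # covariance matrix, columns as variables
d = np.sqrt(np.diag(C))                   # standard deviations of x1..x4
R = C / np.outer(d, d)                    # dividing by the std devs removes the units
assert np.allclose(R, np.corrcoef(x0.T))  # identical to the correlation matrix
```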
Case study
PCA case study: a comprehensive evaluation of the development level of regular higher education across China's regions
Principal component analysis tries to simplify a multivariate cross-sectional data table as thoroughly as possible while losing the least amount of information; in other words, it reduces the dimension of a high-dimensional variable space. This case study uses principal component analysis to build a comprehensive evaluation of the development level of regular higher education across the regions of China.
See the textbook for the full details; part of the data is given in the program.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
gj=pd.read_csv('gj.txt',header=None,sep=' ')  # one row per region, ten indicators per row
gj=zscore(gj)  # standardize the indicators
r=np.corrcoef(gj.T)  # correlation matrix (transpose: np.corrcoef expects variables in rows)
pca=PCA()
pca.fit(gj)
pca.explained_variance_ratio_  # variance contribution rate of each principal component
array([7.50215857e-01, 1.57698725e-01, 5.36213464e-02, 2.06379017e-02,
1.45001273e-02, 2.21867124e-03, 7.12026644e-04, 2.65835385e-04,
7.25914518e-05, 5.69183083e-05])
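The first four components together explain about 98% of the total variance. The textbook's case study then scores and ranks the regions; a minimal sketch of one common way to do this, weighting the leading component scores by their contribution rates (keeping four components is our choice here; the region names accompanying gj.txt are in the textbook):

```python
k = 4                                  # components covering about 98% of the variance
scores = pca.transform(gj)[:, :k]      # the leading component scores for each region
w = pca.explained_variance_ratio_[:k]  # contribution rates used as weights
composite = scores @ w                 # composite evaluation score per region
ranking = np.argsort(-composite)       # row indices of the regions, best first
```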