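Suppose the sample set $S$ contains $m$ observations of dimension $n$, where $m \ll n$. To estimate a Gaussian model for $S$ in this setting we make a concession: we assume that the data in $S$ are generated by taking a vector $z$ that follows an $N(0, I)$ distribution in a $k$-dimensional space, applying a linear transformation to it, and adding Gaussian noise $\epsilon \sim N(0, \Psi)$, where $\Psi$ is a diagonal matrix. In mathematical terms:
$$x = \mu + \Lambda z + \epsilon$$
where $\Lambda$ is the $n \times k$ transformation matrix and $\mu$ is the mean of $x$.
In this equation $x$ is the observed variable; $z$ is the hidden factor (an unknown variable), which follows a multivariate Gaussian distribution; and $\epsilon$ is the error (noise) factor, which also follows a multivariate Gaussian distribution and is independent of $z$. Because a linear transformation of a multivariate Gaussian is still Gaussian, the equation above gives:
$$x \sim N(\mu, \Lambda\Lambda^{T} + \Psi)$$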
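The implied covariance $\Lambda\Lambda^{T} + \Psi$ can be checked numerically. The following is a minimal base R sketch, not part of the original analysis: the values of `Lambda`, `Psi` and `mu` are made up, and the sample size is deliberately large (unlike the $m \ll n$ setting above) so that the empirical covariance is stable.

```r
# Simulate x = mu + Lambda z + eps and compare cov(x) with Lambda Lambda^T + Psi.
# All numeric values are illustrative assumptions.
set.seed(1)
n <- 6       # observed dimension
k <- 2       # latent dimension
m <- 50000   # sample size (large on purpose)

Lambda <- matrix(c(0.9, 0.8, 0.7, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.8, 0.7, 0.6), nrow = n, ncol = k)
Psi <- diag(c(0.3, 0.4, 0.5, 0.4, 0.5, 0.6))   # diagonal noise covariance
mu <- rep(1, n)

z   <- matrix(rnorm(m * k), nrow = m)                 # z ~ N(0, I)
eps <- matrix(rnorm(m * n), nrow = m) %*% sqrt(Psi)   # eps ~ N(0, Psi)
x   <- sweep(z %*% t(Lambda) + eps, 2, mu, "+")       # one observation per row

implied   <- Lambda %*% t(Lambda) + Psi   # covariance implied by the model
empirical <- cov(x)                       # covariance estimated from the sample
max_abs_diff <- max(abs(empirical - implied))
print(round(max_abs_diff, 3))
```

The largest entrywise difference should be small and shrink further as `m` grows.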
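1. Installing packages
# install each package if it is missing, then load it
if (!require(psych)) { install.packages("psych"); library(psych) }
## Warning: package 'psych' was built under R version 4.1.3
if (!require(car)) { install.packages("car"); library(car) }
# ElemStatLearn provides the prostate data; it has been archived on CRAN,
# so it may have to be installed from the CRAN archive instead
if (!require(ElemStatLearn)) { install.packages("ElemStatLearn"); library(ElemStatLearn) }
if (!require(GPArotation)) { install.packages("GPArotation"); library(GPArotation) }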
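2. Loading the data
To allow a comparison with principal component analysis, we again use the prostate data set. Although it has only 97 observations on 9 variables (plus a train indicator column), it is enough to illustrate the technique and to compare it with traditional methods. Stanford University Medical Center provided prostate-specific antigen (PSA) data for 97 patients, all of whom underwent radical prostatectomy. Our goal is to use the clinical measurements to build a model that predicts a patient's postoperative PSA level. PSA may be a fairly effective prognostic indicator of how well a patient recovers after surgery. After the operation, physicians check the patient's PSA level at various intervals and use several formulas to decide whether the patient has recovered. A preoperative predictive model, combined with postoperative data (not provided here), could improve the diagnosis and treatment of prostate cancer and its prognosis. The variables include:
svi: seminal vesicle invasion, an indicator variable showing whether cancer cells have passed through the prostate wall into the seminal vesicles (1 = yes, 0 = no);
gleason: the patient's Gleason score, assigned by a pathologist after a biopsy (2 to 10); it measures how abnormal the cancer cells are, and a higher score means a more dangerous condition;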
data(prostate)
str(prostate)
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
corPlot(prostate, gr = colorRampPalette(c("#2171B5", "white", "#B52127")))
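3. Factor analysis
a) Analysis of the raw data
First we decide how many common factors to extract; the fa.parallel function can be used for this. The parallel analysis below suggests three factors, which we then extract with the fa() function. Its call format is fa(r, nfactors=, n.obs=, rotate=, scores=, fm=), where:
r: a correlation matrix or a raw data matrix;
nfactors: the number of factors to extract (1 by default);
n.obs: the number of observations (used when r is a correlation matrix);
rotate: the rotation method (oblimin by default);
scores: whether, and how, factor scores are computed;
fm: the factoring method (minres by default).
# Exploratory factor analysis of prostate data
# options(digits = 2)
# determine the number of factors to extract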
fa.parallel(prostate[, -10], fa = "fa", n.obs = 97, n.iter = 100, show.legend = TRUE,
main = "Scree plot with parallel analysis")
## Parallel analysis suggests that the number of factors = 3 and the number of components = NA
abline(h = 0, lwd = 1, col = "green")
fa.parallel(prostate[, -10], fa = "both", n.obs = 97, n.iter = 100, show.legend = TRUE,
main = "Scree plot with parallel analysis")
## Parallel analysis suggests that the number of factors = 3 and the number of components = 2
abline(h = 0, lwd = 1, col = "green")
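To make the idea behind parallel analysis concrete, here is a simplified base R sketch. It is an illustration only and not the algorithm implemented in fa.parallel: it works on principal-component eigenvalues, compares them with the average eigenvalues of pure-noise data of the same shape, and uses made-up toy data. The function name `parallel_sketch` is hypothetical.

```r
# Simplified parallel analysis on principal-component eigenvalues.
# Illustrative sketch, not the psych::fa.parallel implementation.
parallel_sketch <- function(x, n_iter = 100) {
  x <- as.matrix(x)
  observed <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values
  # eigenvalues of correlation matrices of random normal data of the same shape
  random <- replicate(n_iter, {
    noise <- matrix(rnorm(nrow(x) * ncol(x)), nrow = nrow(x))
    eigen(cor(noise), symmetric = TRUE, only.values = TRUE)$values
  })
  threshold <- rowMeans(random)
  # keep the leading eigenvalues that exceed the random average
  exceeds <- observed > threshold
  n_keep <- if (all(exceeds)) length(exceeds) else which(!exceeds)[1] - 1
  list(observed = observed, threshold = threshold, n_keep = n_keep)
}

set.seed(42)
# toy data driven by two latent dimensions (made-up loadings)
z <- matrix(rnorm(300 * 2), ncol = 2)
load <- cbind(c(0.9, 0.9, 0.9, 0, 0, 0), c(0, 0, 0, 0.9, 0.9, 0.9))
toy <- z %*% t(load) + matrix(rnorm(300 * 6, sd = 0.5), ncol = 6)
res <- parallel_sketch(toy)
print(res$n_keep)
```

Only eigenvalues that are larger than what random data would produce are treated as evidence of real structure, which is the same logic the scree plot with parallel analysis displays.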
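Exploratory factor analysis (EFA) of latent variables can be carried out with the minimum residual, principal axis, weighted least squares or maximum likelihood methods. Among these, the minimum residual (minres) solution obtained by ordinary least squares (OLS) is one of the better choices: it produces solutions very similar to maximum likelihood, even for badly behaved matrices. A variant of minres is weighted least squares (WLS). Perhaps the most conventional technique is principal axis factoring (PAF): an eigenvalue decomposition of the correlation matrix is computed, and the communality of each variable is estimated from the first n factors. These communalities are placed on the diagonal and the procedure is repeated until sum(diag(R)) no longer changes. Another estimation method is maximum likelihood. For well-behaved matrices, maximum likelihood factor analysis (the fa or factanal function) is probably preferred. Calling fa with n.iter > 1 gives bootstrapped confidence intervals for the loadings and the correlations between factors.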
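The principal axis iteration described above can be sketched in a few lines of base R. This is a minimal illustration, not the code psych uses for fm = "pa": the starting communalities (squared multiple correlations), the stopping rule and the function name `paf_sketch` are assumptions, and the correlation matrix is built from made-up loadings so that the true communalities are known.

```r
# Iterated principal axis factoring on a correlation matrix R (illustrative sketch).
paf_sketch <- function(R, nfactors, max_iter = 500, tol = 1e-8) {
  h2 <- 1 - 1 / diag(solve(R))          # start from squared multiple correlations
  loadings <- NULL
  for (i in seq_len(max_iter)) {
    R_reduced <- R
    diag(R_reduced) <- h2               # communalities replace the unit diagonal
    e <- eigen(R_reduced, symmetric = TRUE)
    idx <- seq_len(nfactors)
    vals <- pmax(e$values[idx], 0)      # guard against small negative eigenvalues
    loadings <- e$vectors[, idx, drop = FALSE] %*% diag(sqrt(vals), nrow = nfactors)
    h2_new <- rowSums(loadings^2)       # updated communalities
    done <- abs(sum(h2_new) - sum(h2)) < tol
    h2 <- h2_new
    if (done) break                     # stop when the sum of communalities settles
  }
  list(loadings = loadings, communality = h2, iterations = i)
}

# correlation matrix with an exact two-factor structure (made-up loadings)
Lambda <- cbind(c(0.9, 0.8, 0.7, 0, 0, 0), c(0, 0, 0, 0.8, 0.7, 0.6))
R <- Lambda %*% t(Lambda)
diag(R) <- 1
fit <- paf_sketch(R, nfactors = 2)
true_h2 <- rowSums(Lambda^2)
print(round(cbind(estimated = fit$communality, true = true_h2), 3))
```

Because the toy matrix follows the factor model exactly, the estimated communalities should end up close to the true ones; with real data they are only estimates.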
fa <- fa(prostate[, -10], nfactors = 3, rotate = "none", fm = "pa")
## Factor Analysis using method = pa
## Call: fa(r = prostate[, -10], nfactors = 3, rotate = "none", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 h2 u2 com
## lcavol 0.78 -0.01 -0.27 0.68 0.32 1.2
## lweight 0.36 0.65 -0.12 0.57 0.43 1.6
## age 0.32 0.38 0.25 0.31 0.69 2.7
## lbph 0.15 0.64 0.20 0.47 0.53 1.3
## svi 0.67 -0.19 -0.25 0.55 0.45 1.5
## lcp 0.81 -0.23 -0.10 0.72 0.28 1.2
## gleason 0.67 -0.17 0.45 0.68 0.32 1.9
## pgg45 0.77 -0.20 0.45 0.83 0.17 1.8
## lpsa 0.77 0.17 -0.33 0.73 0.27 1.5
## PA1 PA2 PA3
## SS loadings 3.60 1.16 0.77
## Proportion Var 0.40 0.13 0.09
## Cumulative Var 0.40 0.53 0.61
## Proportion Explained 0.65 0.21 0.14
## Cumulative Proportion 0.65 0.86 1.00
## Mean item complexity = 1.6
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the null model are 36 and the objective function was 4.37 with Chi Square of 403.19
## The degrees of freedom for the model are 12 and the objective function was 0.33
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.05
## The harmonic number of observations is 97 with the empirical chi square 6.46 with prob < 0.89
## The total number of observations was 97 with Likelihood Chi Square = 29.37 with prob < 0.0035
## Tucker Lewis Index of factoring reliability = 0.855
## RMSEA index = 0.122 and the 90 % confidence intervals are 0.067 0.18
## BIC = -25.52
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## PA1 PA2 PA3
## Correlation of (regression) scores with factors 0.96 0.85 0.86
## Multiple R square of scores with factors 0.93 0.72 0.73
## Minimum correlation of possible factor scores 0.85 0.43 0.46
factor.plot(fa, labels = rownames(fa$loadings))
fa.diagram(fa, simple = FALSE)
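Factor rotation: should we rotate or not? Rotation changes the loadings of each variable, which helps with interpreting the factors. The total variance explained by the rotated factors stays the same, but the contribution of each factor to that total changes. During rotation the loadings move either further from 0 or closer to 0, which in theory helps us identify the variables that matter for each factor. It is an attempt to tie each variable to a single factor. Remember that factor analysis is unsupervised learning, so you are trying to understand the data rather than test a hypothesis, and rotation supports that effort. The most commonly used rotation method is varimax. There are other methods, such as quartimax and equimax, but we focus on varimax rotation. In my experience the other methods have never given a better solution than varimax, although you can of course experiment to decide which one to use. In varimax we maximize the total variance of the squared loadings. The procedure rotates the axes of the feature space and their coordinates, but it does not change the positions of the data points.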
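The claims about rotation can be verified on a small example. The sketch below uses base R's varimax() from the stats package rather than psych, with a made-up unrotated loading matrix `A`; Kaiser normalization is switched off so that the raw varimax criterion (the summed variance of the squared loadings, computed by the hypothetical helper `crit`) is the quantity being maximized.

```r
# Unrotated loadings for six variables on two factors (made-up numbers)
A <- matrix(c(0.7, 0.7, 0.6, 0.6, 0.5, 0.5,
              0.5, 0.4, 0.4, -0.5, -0.5, -0.4), ncol = 2,
            dimnames = list(paste0("v", 1:6), c("F1", "F2")))

rot <- varimax(A, normalize = FALSE)   # orthogonal rotation from the stats package
B <- unclass(rot$loadings)             # rotated loadings as a plain matrix

h2_before <- rowSums(A^2)              # communalities before rotation
h2_after  <- rowSums(B^2)              # communalities after rotation
ss_before <- colSums(A^2)              # variance attributed to each factor
ss_after  <- colSums(B^2)

# varimax criterion: total variance of the squared loadings, summed over factors
crit <- function(L) sum(apply(L^2, 2, var))

print(round(B, 2))
print(round(rbind(before = ss_before, after = ss_after), 2))
print(c(before = crit(A), after = crit(B)))
```

The communalities and the total explained variance are unchanged, while the split between the two factors and the criterion value differ after rotation.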
# Listing 14.7 - Factor extraction with orthogonal rotation
fa.varimax <- fa(prostate[, -10], nfactors = 3, rotate = "varimax", fm = "pa")
## Factor Analysis using method = pa
## Call: fa(r = prostate[, -10], nfactors = 3, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA3 PA2 h2 u2 com
## lcavol 0.78 0.23 0.15 0.68 0.32 1.2
## lweight 0.29 -0.09 0.69 0.57 0.43 1.4
## age 0.06 0.25 0.49 0.31 0.69 1.5
## lbph -0.07 0.04 0.68 0.47 0.53 1.0
## svi 0.70 0.24 -0.05 0.55 0.45 1.2
## lcp 0.72 0.45 -0.02 0.72 0.28 1.7
## gleason 0.27 0.77 0.10 0.68 0.32 1.3
## pgg45 0.35 0.83 0.10 0.83 0.17 1.4
## lpsa 0.79 0.13 0.30 0.73 0.27 1.4
## PA1 PA3 PA2
## SS loadings 2.52 1.69 1.32
## Proportion Var 0.28 0.19 0.15
## Cumulative Var 0.28 0.47 0.61
## Proportion Explained 0.46 0.31 0.24
## Cumulative Proportion 0.46 0.76 1.00
## Mean item complexity = 1.3
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the null model are 36 and the objective function was 4.37 with Chi Square of 403.19
## The degrees of freedom for the model are 12 and the objective function was 0.33
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.05
## The harmonic number of observations is 97 with the empirical chi square 6.46 with prob < 0.89
## The total number of observations was 97 with Likelihood Chi Square = 29.37 with prob < 0.0035
## Tucker Lewis Index of factoring reliability = 0.855
## RMSEA index = 0.122 and the 90 % confidence intervals are 0.067 0.18
## BIC = -25.52
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## PA1 PA3 PA2
## Correlation of (regression) scores with factors 0.92 0.91 0.84
## Multiple R square of scores with factors 0.84 0.82 0.71
## Minimum correlation of possible factor scores 0.67 0.65 0.42
# plot factor solution
factor.plot(fa.varimax, labels = rownames(fa.varimax$loadings))
fa.diagram(fa.varimax, simple = FALSE)
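b) Factor analysis from a covariance matrix
We first compute the covariance matrix of the prostate data set and then decide how many factors to extract. The output shows the PCA and EFA results together: PCA suggests extracting two or three components, and EFA suggests three factors. Note that the code uses fa = "both", so the plot shows the results of both the principal component and the common factor analysis.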
# compute the covariance matrix
p.cov <- cov(prostate[, -10])
# convert covariances to correlations
correlations <- cov2cor(p.cov)
correlations
## lcavol lweight age lbph svi lcp
## lcavol 1.0000000 0.2805214 0.2249999 0.027349703 0.53884500 0.675310484
## lweight 0.2805214 1.0000000 0.3479691 0.442264395 0.15538491 0.164537146
## age 0.2249999 0.3479691 1.0000000 0.350185896 0.11765804 0.127667752
## lbph 0.0273497 0.4422644 0.3501859 1.000000000 -0.08584324 -0.006999431
## svi 0.5388450 0.1553849 0.1176580 -0.085843238 1.00000000 0.673111185
## lcp 0.6753105 0.1645371 0.1276678 -0.006999431 0.67311118 1.000000000
## gleason 0.4324171 0.0568821 0.2688916 0.077820447 0.32041222 0.514830063
## pgg45 0.4336522 0.1073538 0.2761124 0.078460018 0.45764762 0.631528246
## lpsa 0.7344603 0.4333194 0.1695928 0.179809404 0.56621822 0.548813175
## gleason pgg45 lpsa
## lcavol 0.43241706 0.43365225 0.7344603
## lweight 0.05688210 0.10735379 0.4333194
## age 0.26889160 0.27611245 0.1695928
## lbph 0.07782045 0.07846002 0.1798094
## svi 0.32041222 0.45764762 0.5662182
## lcp 0.51483006 0.63152825 0.5488132
## gleason 1.00000000 0.75190451 0.3689868
## pgg45 0.75190451 1.00000000 0.4223159
## lpsa 0.36898681 0.42231586 1.0000000
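As a quick check of what cov2cor() does, the base R sketch below uses made-up toy data with very different scales and confirms that converting a covariance matrix gives the same result as calling cor() directly, and that applying cov2cor() to a matrix that is already a correlation matrix leaves it unchanged.

```r
set.seed(7)
# toy data with very different scales (made-up values)
toy <- data.frame(a = rnorm(50, sd = 1), b = rnorm(50, sd = 100))
toy$c <- 10 * toy$a + rnorm(50, sd = 5)

S <- cov(toy)                  # covariance matrix, scale dependent
R_from_cov <- cov2cor(S)       # rescaled to unit diagonal
R_direct <- cor(toy)           # correlation matrix computed directly

same_as_cor <- isTRUE(all.equal(R_from_cov, R_direct))
idempotent <- isTRUE(all.equal(cov2cor(R_direct), R_direct))
print(c(same_as_cor = same_as_cor, idempotent = idempotent))
```

This is why a factor analysis of the converted matrix reproduces the results obtained from the raw data, which fa() analyzes through its correlation matrix.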
# determine the number of factors to extract
fa.parallel(correlations, n.obs = 97, fa = "both", n.iter = 100, main = "Scree plots with parallel analysis")
## Parallel analysis suggests that the number of factors = 3 and the number of components = 2
abline(h = 0, lwd = 1, col = "green")
# Listing 14.6 - Principal axis factoring without rotation
fa <- fa(correlations, nfactors = 3, rotate = "none", fm = "pa")
## Factor Analysis using method = pa
## Call: fa(r = correlations, nfactors = 3, rotate = "none", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 h2 u2 com
## lcavol 0.78 -0.01 -0.27 0.68 0.32 1.2
## lweight 0.36 0.65 -0.12 0.57 0.43 1.6
## age 0.32 0.38 0.25 0.31 0.69 2.7
## lbph 0.15 0.64 0.20 0.47 0.53 1.3
## svi 0.67 -0.19 -0.25 0.55 0.45 1.5
## lcp 0.81 -0.23 -0.10 0.72 0.28 1.2
## gleason 0.67 -0.17 0.45 0.68 0.32 1.9
## pgg45 0.77 -0.20 0.45 0.83 0.17 1.8
## lpsa 0.77 0.17 -0.33 0.73 0.27 1.5
## PA1 PA2 PA3
## SS loadings 3.60 1.16 0.77
## Proportion Var 0.40 0.13 0.09
## Cumulative Var 0.40 0.53 0.61
## Proportion Explained 0.65 0.21 0.14
## Cumulative Proportion 0.65 0.86 1.00
## Mean item complexity = 1.6
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the null model are 36 and the objective function was 4.37
## The degrees of freedom for the model are 12 and the objective function was 0.33
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.05
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## PA1 PA2 PA3
## Correlation of (regression) scores with factors 0.96 0.85 0.86
## Multiple R square of scores with factors 0.93 0.72 0.73
## Minimum correlation of possible factor scores 0.85 0.43 0.46
# Listing 14.7 - Factor extraction with orthogonal rotation
fa.varimax <- fa(correlations, nfactors = 3, rotate = "varimax", fm = "pa")
## Factor Analysis using method = pa
## Call: fa(r = correlations, nfactors = 3, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA3 PA2 h2 u2 com
## lcavol 0.78 0.23 0.15 0.68 0.32 1.2
## lweight 0.29 -0.09 0.69 0.57 0.43 1.4
## age 0.06 0.25 0.49 0.31 0.69 1.5
## lbph -0.07 0.04 0.68 0.47 0.53 1.0
## svi 0.70 0.24 -0.05 0.55 0.45 1.2
## lcp 0.72 0.45 -0.02 0.72 0.28 1.7
## gleason 0.27 0.77 0.10 0.68 0.32 1.3
## pgg45 0.35 0.83 0.10 0.83 0.17 1.4
## lpsa 0.79 0.13 0.30 0.73 0.27 1.4
## PA1 PA3 PA2
## SS loadings 2.52 1.69 1.32
## Proportion Var 0.28 0.19 0.15
## Cumulative Var 0.28 0.47 0.61
## Proportion Explained 0.46 0.31 0.24
## Cumulative Proportion 0.46 0.76 1.00
## Mean item complexity = 1.3
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the null model are 36 and the objective function was 4.37
## The degrees of freedom for the model are 12 and the objective function was 0.33
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.05
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## PA1 PA3 PA2
## Correlation of (regression) scores with factors 0.92 0.91 0.84
## Multiple R square of scores with factors 0.84 0.82 0.71
## Minimum correlation of possible factor scores 0.67 0.65 0.42
# Listing 14.8 - Factor extraction with oblique rotation
# install.packages('GPArotation')
fa.promax <- fa(correlations, nfactors = 3, rotate = "promax", fm = "pa")
## Factor Analysis using method = pa
## Call: fa(r = correlations, nfactors = 3, rotate = "promax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA3 PA2 h2 u2 com
## lcavol 0.80 0.03 0.04 0.68 0.32 1.0
## lweight 0.29 -0.22 0.68 0.57 0.43 1.6
## age -0.06 0.24 0.49 0.31 0.69 1.5
## lbph -0.15 0.03 0.71 0.47 0.53 1.1
## svi 0.72 0.08 -0.15 0.55 0.45 1.1
## lcp 0.68 0.30 -0.13 0.72 0.28 1.5
## gleason 0.05 0.79 0.04 0.68 0.32 1.0
## pgg45 0.12 0.83 0.02 0.83 0.17 1.0
## lpsa 0.82 -0.10 0.21 0.73 0.27 1.2
## PA1 PA3 PA2
## SS loadings 2.58 1.66 1.29
## Proportion Var 0.29 0.18 0.14
## Cumulative Var 0.29 0.47 0.61
## Proportion Explained 0.47 0.30 0.23
## Cumulative Proportion 0.47 0.77 1.00
## With factor correlations of
## PA1 PA3 PA2
## PA1 1.00 0.51 0.25
## PA3 0.51 1.00 0.18
## PA2 0.25 0.18 1.00
## Mean item complexity = 1.2
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the null model are 36 and the objective function was 4.37
## The degrees of freedom for the model are 12 and the objective function was 0.33
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.05
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## PA1 PA3 PA2
## Correlation of (regression) scores with factors 0.94 0.93 0.85
## Multiple R square of scores with factors 0.89 0.87 0.73
## Minimum correlation of possible factor scores 0.78 0.74 0.45
# calculate the factor structure matrix (pattern matrix times factor correlations)
fsm <- function(oblique) {
    if (class(oblique)[2] == "fa" && is.null(oblique$Phi)) {
        warning("Object doesn't look like oblique EFA")
    } else {
        P <- unclass(oblique$loadings)
        F <- P %*% oblique$Phi
        # keep the factor order of the pattern matrix (PA1, PA3, PA2)
        colnames(F) <- colnames(P)
        return(F)
    }
}
fsm(fa.promax)
## PA1 PA3 PA2
## lcavol 0.82473291 0.44227350 0.24555125
## lweight 0.34220476 0.04243046 0.70717334
## age 0.18477222 0.29736739 0.51524467
## lbph 0.03268287 0.07473469 0.67218104
## svi 0.72626068 0.41757206 0.04449359
## lcp 0.79912417 0.62091939 0.08992345
## gleason 0.45900614 0.82088146 0.18990904
## pgg45 0.55288520 0.90185461 0.20082633
## lpsa 0.82381653 0.35706958 0.39508870
# factor score weights (coefficients used to compute the factor scores)
fa.promax$weights
## PA1 PA3 PA2
## lcavol 0.243980481 0.05617920 0.03965284
## lweight 0.063289015 -0.05858661 0.42084810
## age 0.001136759 0.04500146 0.19707194
## lbph -0.052983656 0.03009248 0.37132673
## svi 0.161318100 0.01106634 -0.08257798
## lcp 0.299975224 0.08308988 -0.12679762
## gleason -0.015790579 0.30869180 0.04827821
## pgg45 0.041112967 0.62164473 0.07257584
## lpsa 0.359024621 -0.10014835 0.15129501
# plot factor solution
factor.plot(fa.promax, labels = rownames(fa.promax$loadings))
fa.diagram(fa.promax, simple = FALSE)
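The relationship between the pattern matrix, the factor correlations and the structure matrix in an oblique solution can also be illustrated without psych. The sketch below uses base R's promax() on a made-up loading matrix `A`. Unlike fa(), promax() does not return Phi, so here it is derived from the rotation matrix as the inverse of its cross-product; that derivation is an assumption of this sketch, not something taken from the analysis above.

```r
# Unrotated loadings for six variables on two factors (made-up numbers)
A <- matrix(c(0.7, 0.7, 0.6, 0.6, 0.5, 0.5,
              0.5, 0.4, 0.4, -0.5, -0.5, -0.4), ncol = 2,
            dimnames = list(paste0("v", 1:6), c("F1", "F2")))

obl <- promax(A)                       # oblique rotation from the stats package
P <- unclass(obl$loadings)             # pattern matrix
Phi <- solve(crossprod(obl$rotmat))    # factor correlation matrix (derived here)
S <- P %*% Phi                         # structure matrix: variable-factor correlations
colnames(S) <- colnames(P)             # label columns in the pattern matrix's order

# the common part of the correlation matrix is the same before and after rotation
common_before <- A %*% t(A)
common_after <- P %*% Phi %*% t(P)

print(round(Phi, 2))
print(round(S, 2))
```

Taking the column names from the pattern matrix avoids mislabeling factors when the rotated solution reorders them.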
