Model Selection with Probabilistic PCA and Factor Analysis

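Probabilistic PCA and Factor Analysis (FA) are both probabilistic models, so the likelihood of new data can be used for model selection and covariance estimation. This article uses cross-validation to compare PCA and FA on low-rank data corrupted with homoscedastic noise (the noise variance is the same for every feature) or heteroscedastic noise (the noise variance differs from feature to feature). The model likelihoods are also compared with those obtained from shrinkage covariance estimators.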
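To make "likelihood of new data" concrete, here is a minimal sketch of how held-out scoring works in scikit-learn. The toy data and the names `X_demo`, `X_train`, and `X_test` are assumptions made for illustration and are not part of the article's experiment. For both `PCA` and `FactorAnalysis`, `score` returns the average log-likelihood of the samples it is given, and `cross_val_score` falls back to that method when no `scoring` argument is passed.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# Hypothetical toy data: a rank-2 signal in 6 dimensions plus unit-variance noise.
X_demo = rng.randn(200, 2) @ rng.randn(2, 6) + rng.randn(200, 6)
X_train, X_test = X_demo[:150], X_demo[150:]

pca = PCA(n_components=2, svd_solver="full").fit(X_train)
fa = FactorAnalysis(n_components=2).fit(X_train)

# score() is the mean of the per-sample log-likelihoods on held-out data.
pca_heldout = pca.score(X_test)
fa_heldout = fa.score(X_test)
pca_per_sample = pca.score_samples(X_test)

# With no `scoring` argument, cross_val_score uses estimator.score on each fold.
pca_cv = cross_val_score(PCA(n_components=2, svd_solver="full"), X_demo)

print("PCA held-out average log-likelihood:", pca_heldout)
print("FA held-out average log-likelihood:", fa_heldout)
print("PCA per-fold CV scores:", pca_cv)
```

Because the score is a log-likelihood, higher is better, and the values for PCA, FA, and covariance estimators are directly comparable on the same held-out data.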

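Under homoscedastic noise, both FA and PCA recover the size of the low-rank subspace, and the likelihood with PCA is higher than with FA. Under heteroscedastic noise, however, PCA fails and overestimates the rank. With an appropriate choice of the number of components, the held-out data is more likely under the low-rank PCA and FA models than under the shrinkage models.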
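The difference in behavior follows from the noise model each method assumes. The standard formulation can be written as below; the symbols ($W$, $z$, $\mu$, $\sigma^2$, $\Psi$, $k$, $d$) are notation introduced here for illustration and do not appear in the original article.

```latex
% Shared latent-variable model: d observed features, k latent factors
x = W z + \mu + \varepsilon, \qquad z \sim \mathcal{N}(0, I_k)

% Probabilistic PCA: one noise variance shared by all features
\varepsilon \sim \mathcal{N}(0, \sigma^2 I_d)
\quad\Rightarrow\quad
\Sigma_{\mathrm{PPCA}} = W W^\top + \sigma^2 I_d

% Factor Analysis: one noise variance per feature
\varepsilon \sim \mathcal{N}(0, \Psi), \quad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_d)
\quad\Rightarrow\quad
\Sigma_{\mathrm{FA}} = W W^\top + \Psi
```

When the true noise is homoscedastic, the PPCA covariance is already correct and has fewer parameters to estimate. When it is heteroscedastic, only the FA covariance can represent the per-feature variances, so PCA has to spend extra components on noise.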

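The comparison also includes the automatic dimensionality estimate proposed by Thomas P. Minka in "Automatic Choice of Dimensionality for PCA" (NIPS 2000), used below through PCA(n_components="mle").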

Create the data


import numpy as np
from scipy import linalg

n_samples, n_features, rank = 500, 25, 5
sigma = 1.0
rng = np.random.RandomState(42)
U, _, _ = linalg.svd(rng.randn(n_features, n_features))
X = np.dot(rng.randn(n_samples, rank), U[:, :rank].T)

# Add homoscedastic noise: the same standard deviation for every feature
X_homo = X + sigma * rng.randn(n_samples, n_features)

# Add heteroscedastic noise: a different standard deviation per feature
sigmas = sigma * rng.rand(n_features) + sigma / 2.0
X_hetero = X + rng.randn(n_samples, n_features) * sigmas
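As a quick sanity check on the generated data, the sketch below rebuilds the same arrays and confirms that the noiseless signal has rank 5, that adding noise makes the matrix full rank, and that the per-feature noise standard deviations fall in [0.5, 1.5). The names `signal_rank` and `noisy_rank` are introduced here for illustration, and `np.linalg.svd` is swapped in for `scipy.linalg.svd` only to keep the snippet NumPy-only.

```python
import numpy as np

n_samples, n_features, rank = 500, 25, 5
sigma = 1.0
rng = np.random.RandomState(42)

# Orthonormal basis; the first `rank` columns span the signal subspace.
U, _, _ = np.linalg.svd(rng.randn(n_features, n_features))
X = rng.randn(n_samples, rank) @ U[:, :rank].T

# Same draw order as in the article so that `sigmas` matches.
X_homo = X + sigma * rng.randn(n_samples, n_features)
sigmas = sigma * rng.rand(n_features) + sigma / 2.0

signal_rank = np.linalg.matrix_rank(X)
noisy_rank = np.linalg.matrix_rank(X_homo)

print("rank of noiseless signal:", signal_rank)
print("rank after adding noise:", noisy_rank)
print("noise std range:", sigmas.min(), sigmas.max())
```

Since the noisy matrix is full rank, the true dimensionality cannot be read off from the matrix rank alone, which is why a likelihood-based criterion is used in the next section.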

Fit the models


import matplotlib.pyplot as plt
from sklearn.covariance import LedoitWolf, ShrunkCovariance
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import GridSearchCV, cross_val_score

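# Candidate numbers of components: 0, 5, 10, 15, 20
n_components = np.arange(0, n_features, 5)

def compute_scores(X):
    """Return cross-validated average log-likelihoods of PCA and FA for each candidate."""
    pca = PCA(svd_solver="full")
    fa = FactorAnalysis()
    pca_scores, fa_scores = [], []
    for n in n_components:
        pca.n_components = n
        fa.n_components = n
        # With no `scoring` argument, cross_val_score uses each estimator's score method
        pca_scores.append(np.mean(cross_val_score(pca, X)))
        fa_scores.append(np.mean(cross_val_score(fa, X)))
    return pca_scores, fa_scores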

def shrunk_cov_score(X):
    shrinkages = np.logspace(-2, 0, 30)
    cv = GridSearchCV(ShrunkCovariance(), {"shrinkage": shrinkages})
    return np.mean(cross_val_score(cv.fit(X).best_estimator_, X))

def lw_score(X):
    return np.mean(cross_val_score(LedoitWolf(), X))

for X, title in [(X_homo, "Homoscedastic Noise"), (X_hetero, "Heteroscedastic Noise")]:
    pca_scores, fa_scores = compute_scores(X)
    n_components_pca = n_components[np.argmax(pca_scores)]
    n_components_fa = n_components[np.argmax(fa_scores)]
    pca = PCA(svd_solver="full", n_components="mle")
    pca.fit(X)
    n_components_pca_mle = pca.n_components_

    print("best n_components by PCA CV = %d" % n_components_pca)
    print("best n_components by FactorAnalysis CV = %d" % n_components_fa)
    print("best n_components by PCA MLE = %d" % n_components_pca_mle)

    plt.figure()
    plt.plot(n_components, pca_scores, "b", label="PCA scores")
    plt.plot(n_components, fa_scores, "r", label="FA scores")
    plt.axvline(rank, color="g", label="TRUTH: %d" % rank, linestyle="-")
    plt.axvline(n_components_pca, color="b", label="PCA CV: %d" % n_components_pca, linestyle="--")
    plt.axvline(n_components_fa, color="r", label="FactorAnalysis CV: %d" % n_components_fa, linestyle="--")
    plt.axvline(n_components_pca_mle, color="k", label="PCA MLE: %d" % n_components_pca_mle, linestyle="--")
    plt.axhline(shrunk_cov_score(X), color="violet", label="Shrunk Covariance MLE", linestyle="-.")
    plt.axhline(lw_score(X), color="orange", label="LedoitWolf MLE", linestyle="-.")
    plt.xlabel("number of components")
    plt.ylabel("CV scores")
    plt.legend(loc="lower right")
    plt.title(title)
    plt.show()

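Running the code above shows that under homoscedastic noise both PCA and FA recover the low-rank structure of the data, and the cross-validation score of PCA is higher than that of FA. Under heteroscedastic noise, PCA degrades because it overestimates the rank, while FA adapts better to the data. The comparison with the shrinkage covariance estimators also shows that, when the number of components is well chosen, the held-out likelihood of the low-rank models can exceed that of the shrinkage models.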
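One way to see why FA copes with heteroscedastic noise is to inspect the fitted noise variances. The sketch below is an illustration under assumptions rather than part of the original experiment: it regenerates heteroscedastic data with the same recipe, fits both models with the true rank, and compares `noise_variance_`. The variable `corr` is a name introduced here.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

n_samples, n_features, rank = 500, 25, 5
rng = np.random.RandomState(42)

# Low-rank signal plus noise whose standard deviation differs per feature.
U, _, _ = np.linalg.svd(rng.randn(n_features, n_features))
X = rng.randn(n_samples, rank) @ U[:, :rank].T
sigmas = rng.rand(n_features) + 0.5
X_hetero = X + rng.randn(n_samples, n_features) * sigmas

fa = FactorAnalysis(n_components=rank).fit(X_hetero)
pca = PCA(n_components=rank, svd_solver="full").fit(X_hetero)

# FA estimates one noise variance per feature; PCA estimates a single shared value.
corr = np.corrcoef(fa.noise_variance_, sigmas**2)[0, 1]

print("FA noise_variance_ shape:", fa.noise_variance_.shape)
print("PCA noise_variance_:", pca.noise_variance_)
print("correlation of FA estimates with true variances:", corr)
```

FA returns a vector with one entry per feature that can track the true variances, whereas PCA has only one scalar for all features and must absorb the remaining variation with additional components.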

