Bayesian Gaussian Mixture Model Analysis

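This article fits a Bayesian Gaussian mixture model (Bayesian Gaussian Mixture Model) to a toy dataset drawn from a mixture of three Gaussian distributions. Two types of weight concentration prior are used: a Dirichlet distribution prior and a Dirichlet process prior. Comparing the two shows how each prior shapes the fitted model and what that means in practice.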

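A Bayesian Gaussian mixture model is a flexible clustering method that can adapt the number of active mixture components to the data. The weight concentration prior is directly related to the number of components that end up with non-zero weight. A low concentration prior makes the model put most of the weight on a few components and push the remaining weights very close to zero. A high concentration prior, by contrast, allows more components to stay active in the mixture.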
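The effect of the concentration value can be seen on the prior alone, before any data is fitted. The sketch below draws weight vectors from a symmetric Dirichlet distribution and counts how many components receive more than 1% of the weight. The helper name `mean_active`, the 1% threshold and the concentration values are assumptions chosen for illustration; this samples the prior only and does not reproduce the posterior that scikit-learn estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_components = 6  # same upper bound as in the article's estimators


def mean_active(alpha, threshold=0.01, n_draws=2000):
    """Average number of components whose prior weight exceeds `threshold`.

    Hypothetical helper: draws weight vectors from a symmetric
    Dirichlet(alpha, ..., alpha) and counts the "active" entries per draw.
    """
    weights = rng.dirichlet(np.full(n_components, alpha), size=n_draws)
    return (weights > threshold).sum(axis=1).mean()


low = mean_active(0.1)      # sparse: weight piles up on a few components
mid = mean_active(1.0)      # uniform over the simplex
high = mean_active(100.0)   # weights close to 1 / n_components each

for alpha, value in [(0.1, low), (1.0, mid), (100.0, high)]:
    print(f"alpha={alpha:>6}: {value:.2f} active components on average")
```

The average count grows with the concentration value, which matches the behaviour described above: small values favour sparse weight vectors, large values favour near-uniform ones.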

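The Dirichlet process prior allows an infinite number of components to be defined and selects the number of components automatically: a component is activated only when it is needed. In scikit-learn the infinite model is approximated by truncation, so n_components acts as an upper bound. By contrast, the classical finite mixture model with a Dirichlet distribution prior favours more uniformly weighted components and therefore tends to split natural clusters into unnecessary sub-components.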
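One common way to picture the Dirichlet process prior is the truncated stick-breaking construction: a unit-length stick is broken repeatedly, and each piece becomes a component weight. The sketch below is a generic illustration under that assumption, not scikit-learn's internal code; the function name `stick_breaking_weights` and the choice to give the last component whatever is left of the stick are illustrative.

```python
import numpy as np


def stick_breaking_weights(concentration, n_components, rng):
    """Draw one weight vector from a truncated stick-breaking prior.

    Each fraction v_k ~ Beta(1, concentration) is the share of the
    remaining stick that component k takes. The last component takes
    everything that is left, so the weights sum to one.
    """
    v = rng.beta(1.0, concentration, size=n_components)
    v[-1] = 1.0  # truncation: the final piece is the rest of the stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining


rng = np.random.default_rng(0)
first_small = np.mean([stick_breaking_weights(0.1, 6, rng)[0] for _ in range(1000)])
first_large = np.mean([stick_breaking_weights(10.0, 6, rng)[0] for _ in range(1000)])
sample = stick_breaking_weights(1.0, 6, rng)

print("average first weight, concentration 0.1:", round(first_small, 3))
print("average first weight, concentration 10 :", round(first_large, 3))
```

With a small concentration the first piece takes most of the stick, so later components are rarely needed; with a large concentration the stick is spread over many pieces.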

Code implementation

The following Python example uses the scikit-learn library to fit Bayesian Gaussian mixture models to the data and plot the results.


import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.mixture import BayesianGaussianMixture
import numpy as np

# Draw one ellipse per fitted Gaussian component
def plot_ellipses(ax, weights, means, covars):
    for n in range(means.shape[0]):
        eig_vals, eig_vecs = np.linalg.eigh(covars[n])
        # np.linalg.eigh returns the eigenvectors as columns
        unit_eig_vec = eig_vecs[:, 0] / np.linalg.norm(eig_vecs[:, 0])
        angle = np.arctan2(unit_eig_vec[1], unit_eig_vec[0])
        # Ellipse expects the angle in degrees
        angle = 180 * angle / np.pi
        # Convert the eigenvalues (variances) to full axis lengths
        eig_vals = 2 * np.sqrt(2) * np.sqrt(eig_vals)
        ell = Ellipse(means[n], eig_vals[0], eig_vals[1], angle=180 + angle, edgecolor="black")
        ell.set_clip_box(ax.bbox)
        # Fade components with a small weight
        ell.set_alpha(weights[n])
        ell.set_facecolor("#56B4E9")
        ax.add_artist(ell)

# Plot the data with the fitted ellipses (ax1) and the component weights (ax2)
def plot_results(ax1, ax2, estimator, X, y, title, plot_title=False):
    ax1.set_title(title)
    # colors is the module-level NumPy array, so it can be indexed by the label array y
    ax1.scatter(X[:, 0], X[:, 1], s=5, marker="o", color=colors[y], alpha=0.8)
    ax1.set_xlim(-2.0, 2.0)
    ax1.set_ylim(-3.0, 3.0)
    ax1.set_xticks(())
    ax1.set_yticks(())
    plot_ellipses(ax1, estimator.weights_, estimator.means_, estimator.covariances_)
    ax2.get_xaxis().set_tick_params(direction="out")
    ax2.yaxis.grid(True, alpha=0.7)
    # One bar per component, labelled with its weight as a percentage
    for k, w in enumerate(estimator.weights_):
        ax2.bar(k, w, width=0.9, color="#56B4E9", zorder=3, align="center", edgecolor="black")
        ax2.text(k, w + 0.007, "%.1f%%" % (w * 100.0), horizontalalignment="center")
    ax2.set_xlim(-0.6, 2 * n_components - 0.4)
    ax2.set_ylim(0.0, 1.1)
    ax2.tick_params(axis="y", which="both", left=False, right=False, labelleft=False)
    ax2.tick_params(axis="x", which="both", top=False)
    if plot_title:
        ax1.set_ylabel("Estimated Mixtures")
        ax2.set_ylabel("Weight of each component")
# Parameters of the dataset
random_state, n_components, n_features = 2, 3, 2
colors = np.array(["#0072B2", "#F0E442", "#D55E00"])
covars = np.array([[[0.7, 0.0], [0.0, 0.1]], [[0.5, 0.0], [0.0, 0.1]], [[0.5, 0.0], [0.0, 0.1]]])
samples = np.array([200, 500, 200])
means = np.array([[0.0, -0.70], [0.0, 0.0], [0.0, 0.70]])

# Estimators: (title prefix, model, weight concentration prior values to try)
# The gamma_0 label is appended once, in the plotting loop
estimators = [
    ("Finite mixture with a\nDirichlet distribution prior,",
     BayesianGaussianMixture(weight_concentration_prior_type="dirichlet_distribution",
                             n_components=2*n_components, reg_covar=0, init_params="random",
                             max_iter=1500, mean_precision_prior=0.8, random_state=random_state),
     [0.001, 1, 1000]),
    ("Infinite mixture with a\nDirichlet process prior,",
     BayesianGaussianMixture(weight_concentration_prior_type="dirichlet_process",
                             n_components=2*n_components, reg_covar=0, init_params="random",
                             max_iter=1500, mean_precision_prior=0.8, random_state=random_state),
     [1, 1000, 100000]),
]

# Generate the data: samples[j] points from Gaussian j, with y holding the true labels
rng = np.random.RandomState(random_state)
X = np.vstack([rng.multivariate_normal(means[j], covars[j], samples[j]) for j in range(n_components)])
y = np.concatenate([np.full(samples[j], j, dtype=int) for j in range(n_components)])

# Plot the results: one figure per prior type, one column per concentration value
for title, estimator, concentrations_prior in estimators:
    plt.figure(figsize=(4.7*3, 8))
    plt.subplots_adjust(bottom=0.04, top=0.90, hspace=0.05, wspace=0.05, left=0.03, right=0.99)
    gs = plt.GridSpec(3, len(concentrations_prior))
    for k, concentration in enumerate(concentrations_prior):
        estimator.weight_concentration_prior = concentration
        estimator.fit(X)
        plot_results(plt.subplot(gs[0:2, k]), plt.subplot(gs[2, k]), estimator, X, y,
                      r"%s $\gamma_0=$%.1e" % (title, concentration), plot_title=k==0)
    plt.show()

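The code first defines `plot_ellipses`, which draws each fitted Gaussian on the scatter plot as an ellipse derived from its covariance matrix. It then defines `plot_results`, which plots the data points together with the weight of each component. Next come the dataset parameters: random state, number of components, number of features, colours, covariance matrices, sample counts and means. Finally, two estimators are created, one with a Dirichlet distribution prior and one with a Dirichlet process prior, and each is fitted with several values of the weight concentration prior.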
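The step from a covariance matrix to an ellipse can be checked in isolation. The sketch below builds a covariance matrix with a known orientation (30 degrees, an arbitrary value chosen for illustration) and recovers the orientation and axis lengths from its eigendecomposition. It relies on the documented behaviour of `np.linalg.eigh`: eigenvalues are returned in ascending order and the eigenvectors are the columns of the second return value.

```python
import numpy as np

# Build a covariance with variances 0.7 and 0.1 along axes rotated by 30 degrees
theta = np.deg2rad(30.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
cov = rotation @ np.diag([0.7, 0.1]) @ rotation.T

eig_vals, eig_vecs = np.linalg.eigh(cov)

# The last column belongs to the largest eigenvalue, i.e. the major axis
major_axis = eig_vecs[:, -1]
# An eigenvector is only defined up to sign, so reduce the angle modulo 180
angle_deg = np.degrees(np.arctan2(major_axis[1], major_axis[0])) % 180.0

# Same scaling as in the plotting code: variance -> full axis length
axis_lengths = 2.0 * np.sqrt(2.0) * np.sqrt(eig_vals)

print("eigenvalues :", eig_vals)
print("orientation :", angle_deg, "degrees")
print("axis lengths:", axis_lengths)
```

Taking the angle modulo 180 is enough here because an ellipse looks the same after a half turn.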

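Running the code produces two figures, one per prior type, each showing the fit to the dataset for three values of the weight concentration prior. Together they show how the choice of prior affects the clustering.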
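The figures can be complemented by a numeric summary. The sketch below regenerates the same toy data and counts, for a few prior settings, how many components end up with a weight above 1%. The helper `count_active` and the 1% threshold are assumptions made for illustration, and the default `reg_covar` is kept here instead of the article's `reg_covar=0`; the counts depend on the random seed, so none are quoted.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Same toy data as in the article: three elongated Gaussians stacked vertically
rng = np.random.RandomState(2)
means = np.array([[0.0, -0.70], [0.0, 0.0], [0.0, 0.70]])
covars = np.array([[[0.7, 0.0], [0.0, 0.1]],
                   [[0.5, 0.0], [0.0, 0.1]],
                   [[0.5, 0.0], [0.0, 0.1]]])
samples = [200, 500, 200]
X = np.vstack([rng.multivariate_normal(m, c, n)
               for m, c, n in zip(means, covars, samples)])


def count_active(weights, threshold=0.01):
    """Hypothetical helper: number of components with weight above `threshold`."""
    return int(np.sum(weights > threshold))


settings = [
    ("dirichlet_distribution", 0.001),
    ("dirichlet_distribution", 1000),
    ("dirichlet_process", 1),
]

results = {}
for prior_type, concentration in settings:
    model = BayesianGaussianMixture(
        weight_concentration_prior_type=prior_type,
        weight_concentration_prior=concentration,
        n_components=6,
        init_params="random",
        max_iter=1500,
        mean_precision_prior=0.8,
        random_state=2,
    )
    model.fit(X)
    results[(prior_type, concentration)] = count_active(model.weights_)
    print(prior_type, concentration, "->", results[(prior_type, concentration)], "active components")
```

Comparing the printed counts against the three true clusters gives a quick check of which settings over-split the data.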

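The analysis leads to the following conclusions:

Bayesian Gaussian mixture models are a powerful clustering tool that can adapt the number of active mixture components to the data.
The weight concentration prior has a strong effect on the clustering. A low value makes the model put most of the weight on a few components, while a high value allows more components to stay active in the mixture.
The Dirichlet process prior allows an infinite number of components to be defined and selects the number of components automatically, which is useful for datasets with complex structure.
Compared with the classical finite mixture model with a Dirichlet distribution prior, the model with a Dirichlet process prior is less prone to splitting natural clusters into unnecessary sub-components.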



沪ICP备2024098111号-1

上海秋旦网络科技中心 (Shanghai Qiudan Network Technology Center): Building 1, 8218 Jinda Highway, Fengxian District, Shanghai. Tel: 17898875485
