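Comparing the KMeans and MiniBatchKMeans Clustering Algorithms

In machine learning, clustering is an unsupervised learning method that partitions the samples of a dataset into clusters, so that samples in the same cluster are highly similar and samples in different clusters are not. KMeans and MiniBatchKMeans are two common clustering algorithms that behave differently on large datasets. This article shows how they differ by generating a dataset, clustering it with both algorithms, comparing the results, and visualizing the differences.

Data generation

First, we need a dataset to cluster. NumPy is used here only to fix the random seed; the points themselves are created with the make_blobs function from sklearn. Three centers are specified, one per cluster, and the standard deviation of each cluster is set to 0.7 so that the points scatter randomly around their centers.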


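import numpy as np
from sklearn.datasets import make_blobs

# Fix the global NumPy seed so that make_blobs draws the same points on every run.
np.random.seed(0)
batch_size = 45  # mini-batch size, used later by MiniBatchKMeans
centers = [
    [1, 1],
    [-1, -1],
    [1, -1]
]
n_clusters = len(centers)
# 3000 samples in total, split evenly across the three centers.
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)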
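
As a quick sanity check on the generated data, the sketch below inspects the array shape, the number of samples per cluster, and the per-cluster sample means. It passes random_state=0 to make_blobs instead of seeding NumPy globally; that is an assumption made here to keep the block self-contained, not something the original code does.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = [[1, 1], [-1, -1], [1, -1]]
# random_state=0 is an assumption of this sketch (the article seeds NumPy globally).
X, labels_true = make_blobs(
    n_samples=3000, centers=centers, cluster_std=0.7, random_state=0
)

print(X.shape)                   # (3000, 2)
print(np.bincount(labels_true))  # [1000 1000 1000]

# The sample mean of each cluster should lie close to the requested center.
cluster_means = np.array([X[labels_true == k].mean(axis=0) for k in range(len(centers))])
print(cluster_means.round(2))
```

Because the centers are only 2 units apart and the standard deviation is 0.7, the three blobs overlap noticeably, so neither algorithm can be expected to recover labels_true exactly.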

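KMeans clustering

Next, the generated data is clustered with KMeans. KMeans iteratively moves the cluster centers so as to minimize the sum of squared distances from each sample to its assigned center (the inertia). The initialization method is set to "k-means++", the number of clusters to 3, and n_init to 10, so the algorithm is run from 10 different initializations and the best result is kept, which makes the clustering more stable.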


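import time
from sklearn.cluster import KMeans

# n_init=10: run 10 initializations and keep the one with the lowest inertia.
k_means = KMeans(init="k-means++", n_clusters=3, n_init=10)
t0 = time.time()
k_means.fit(X)
t_batch = time.time() - t0  # KMeans training time in seconds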
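
To make the objective concrete, the following sketch recomputes the inertia by hand as the sum of squared Euclidean distances between each sample and the center it is assigned to, and compares it with the inertia_ attribute of the fitted model. The random_state=0 arguments are assumptions added here for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic setup as above; random_state=0 is an assumption of this sketch.
X, _ = make_blobs(
    n_samples=3000, centers=[[1, 1], [-1, -1], [1, -1]],
    cluster_std=0.7, random_state=0,
)
km = KMeans(init="k-means++", n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center assigned to every sample, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(np.sum((X - assigned_centers) ** 2))

print(manual_inertia, km.inertia_)
```

The two printed values should agree up to floating-point rounding.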

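MiniBatchKMeans clustering

Compared with KMeans, MiniBatchKMeans is more efficient on large datasets. Each update of the cluster centers uses only a small random subset of the data (a mini-batch), which reduces the amount of computation, usually at the cost of a slightly higher inertia. The initialization method is again "k-means++", the number of clusters is 3, and the batch size is 45.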


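from sklearn.cluster import MiniBatchKMeans

# max_no_improvement=10: stop early once 10 consecutive mini-batches
# fail to improve the smoothed inertia.
mbk = MiniBatchKMeans(init="k-means++", n_clusters=3, batch_size=batch_size, n_init=10, max_no_improvement=10, verbose=0)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0  # MiniBatchKMeans training time in seconds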
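
The idea of updating the centers from one mini-batch at a time can be sketched in plain NumPy. This is a simplified illustration and not the scikit-learn implementation: the function name mini_batch_kmeans_sketch, the random initialization, the fixed number of iterations, and the per-center learning rate 1/count are all assumptions chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs


def mini_batch_kmeans_sketch(X, n_clusters, batch_size, n_iter, seed=0):
    """Hypothetical helper: a simplified mini-batch k-means loop."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen samples (the article uses k-means++ instead).
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    counts = np.zeros(n_clusters, dtype=int)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch sample to its nearest current center.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Move each center toward its samples with a per-center rate 1/count,
        # so every center is a running mean of the samples assigned to it.
        for x, c in zip(batch, nearest):
            counts[c] += 1
            eta = 1.0 / counts[c]
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers, counts


X, _ = make_blobs(
    n_samples=3000, centers=[[1, 1], [-1, -1], [1, -1]],
    cluster_std=0.7, random_state=0,
)
centers, counts = mini_batch_kmeans_sketch(X, n_clusters=3, batch_size=45, n_iter=100)
print(centers)
print(counts)
```

Each iteration touches only 45 samples rather than all 3000, which is where the savings come from; the price is that the centers are noisier estimates than those of a full KMeans pass.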

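Comparing the clustering results

To compare the two results, the cluster centers of the two models have to be paired up, so that the same cluster is drawn in the same color for both algorithms. The pairwise_distances_argmin function from sklearn.metrics.pairwise does this: for each KMeans center it returns the index of the closest MiniBatchKMeans center.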


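from sklearn.metrics.pairwise import pairwise_distances_argmin

k_means_cluster_centers = k_means.cluster_centers_
# For each KMeans center, find the index of the closest MiniBatchKMeans center,
# then reorder the MiniBatchKMeans centers so that index k means the same cluster in both.
order = pairwise_distances_argmin(k_means.cluster_centers_, mbk.cluster_centers_)
mbk_means_cluster_centers = mbk.cluster_centers_[order]
# Label every sample by its nearest center under each model.
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)
mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)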
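
A small example may help to show what the pairing step does. The two arrays below are made-up centers (not output of the models above): centers_b holds roughly the same three points as centers_a, but in a different order.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances_argmin

# Hypothetical centers, chosen by hand for illustration.
centers_a = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
centers_b = np.array([[1.1, -0.9], [0.9, 1.1], [-1.0, -1.1]])

# For every row of centers_a, the index of the closest row in centers_b.
order = pairwise_distances_argmin(centers_a, centers_b)
print(order)  # [1 2 0]

# Reordering centers_b with `order` lines it up row by row with centers_a.
aligned_b = centers_b[order]
print(aligned_b)

# The nearest-center match is not guaranteed to be one-to-one in general,
# so it is worth checking that every index appears exactly once.
is_permutation = len(set(order.tolist())) == len(centers_b)
print(is_permutation)  # True
```

If two centers of one model were both closest to the same center of the other, order would contain a repeated index and the color matching in the plots would break, so the permutation check is a cheap safeguard.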

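Finally, matplotlib is used to visualize the results in three panels: the KMeans clustering, the MiniBatchKMeans clustering, and the points on which the two algorithms disagree. Together, the panels make it easy to compare the clustering quality of the two algorithms by eye.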


import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 3))
fig.subplots_adjust(left=0.02, right=0.98, bottom=0.05, top=0.9)
colors = ["#4EACC5", "#FF9C34", "#4E9A06"]

# KMeans
ax = fig.add_subplot(1, 3, 1)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.plot(X[my_members, 0], X[my_members, 1], "w", markerfacecolor=col, marker=".")
    ax.plot(cluster_center[0], cluster_center[1], "o", markerfacecolor=col, markeredgecolor="k", markersize=6)
ax.set_title("KMeans")
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, "train time: %.2f s\ninertia: %f" % (t_batch, k_means.inertia_))

# MiniBatchKMeans
ax = fig.add_subplot(1, 3, 2)
for k, col in zip(range(n_clusters), colors):
    my_members = mbk_means_labels == k
    cluster_center = mbk_means_cluster_centers[k]
    ax.plot(X[my_members, 0], X[my_members, 1], "w", markerfacecolor=col, marker=".")
    ax.plot(cluster_center[0], cluster_center[1], "o", markerfacecolor=col, markeredgecolor="k", markersize=6)
ax.set_title("MiniBatchKMeans")
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, "train time: %.2f s\ninertia: %f" % (t_mini_batch, mbk.inertia_))

# Mark the points that the two models assign to different clusters
different = np.zeros(len(X), dtype=bool)
ax = fig.add_subplot(1, 3, 3)
for k in range(n_clusters):
    different |= (k_means_labels == k) != (mbk_means_labels == k)
identical = np.logical_not(different)
ax.plot(X[identical, 0], X[identical, 1], "w", markerfacecolor="#bbbbbb", marker=".")
ax.plot(X[different, 0], X[different, 1], "w", markerfacecolor="m", marker=".")
ax.set_title("Difference")
ax.set_xticks(())
ax.set_yticks(())
plt.show()
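
Besides looking at the third panel, the disagreement can be summarized as a number. The sketch below refits both models and counts the samples that receive different labels after the centers have been aligned; the random_state=0 arguments are assumptions added for repeatability, so the count need not match a run of the code above.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances_argmin

X, _ = make_blobs(
    n_samples=3000, centers=[[1, 1], [-1, -1], [1, -1]],
    cluster_std=0.7, random_state=0,
)
k_means = KMeans(init="k-means++", n_clusters=3, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(
    init="k-means++", n_clusters=3, batch_size=45,
    n_init=10, max_no_improvement=10, random_state=0,
).fit(X)

# Align the MiniBatchKMeans centers with the KMeans centers, as in the article.
order = pairwise_distances_argmin(k_means.cluster_centers_, mbk.cluster_centers_)
k_means_labels = pairwise_distances_argmin(X, k_means.cluster_centers_)
mbk_labels = pairwise_distances_argmin(X, mbk.cluster_centers_[order])

# A sample counts as different when the two aligned labelings disagree.
different = k_means_labels != mbk_labels
n_different = int(different.sum())
print(f"{n_different} of {len(X)} points ({different.mean():.2%}) are labeled differently")
```

Comparing the two label arrays directly is equivalent to the per-cluster loop used for the plot, because after alignment a sample differs in some cluster mask exactly when its two labels differ.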

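The code above shows how the clustering results of KMeans and MiniBatchKMeans differ. KMeans clusters small datasets well, while on large datasets MiniBatchKMeans finishes faster because it updates the centers from mini-batches rather than from the full dataset. The visualization makes the differences between the two results directly visible, which helps when choosing a clustering algorithm for a real application.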


沪ICP备2024098111号-1

上海秋旦网络科技中心: 上海市奉贤区金大公路8218号1幢. Phone: 17898875485
