One-Class SVM versus One-Class SVM using Stochastic Gradient Descent

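In machine learning, the One-Class SVM (Support Vector Machine) is an algorithm for anomaly detection. It learns a decision boundary from the training data that separates regular observations from abnormal ones. On large datasets, however, the classical kernelized One-Class SVM becomes computationally expensive. One way around this is SGDOneClassSVM, a stochastic gradient descent (SGD) version of the One-Class SVM. It fits a linear model, which is why it scales much better with the number of samples.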

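This example shows how to approximate the solution of a kernelized OneClassSVM with SGDOneClassSVM. A kernel approximation first maps the data into a feature space in which the nonlinear problem becomes approximately linear; SGDOneClassSVM is then trained on the mapped features. The computational complexity of SGDOneClassSVM is linear in the number of samples, whereas that of the kernelized OneClassSVM is at best quadratic. The purpose of this example is not to demonstrate the savings in computation time, but to show that similar results can be obtained on a small dataset.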
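
To make the kernel approximation step concrete, the following minimal sketch compares an exact RBF kernel matrix with the one implied by a Nystroem feature map. The sample size, `n_components=100`, and the variable names are illustrative assumptions, not values taken from the example itself.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(200, 2)  # small illustrative sample (assumption)
gamma = 2.0

# Exact RBF kernel matrix between all pairs of samples
K_exact = rbf_kernel(X, gamma=gamma)

# Nystroem maps each sample to an explicit feature vector; inner products
# of these vectors approximate the kernel values
feature_map = Nystroem(gamma=gamma, n_components=100, random_state=42)
Z = feature_map.fit_transform(X)
K_approx = Z @ Z.T

max_abs_error = np.abs(K_exact - K_approx).max()
print(f"feature shape: {Z.shape}, max abs error: {max_abs_error:.4f}")
```

Because the mapped features `Z` are ordinary vectors, a linear model trained on them behaves approximately like a kernel model trained on `X`.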

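The example first generates training data plus two sets of new observations, one regular and one abnormal. It then sets the One-Class SVM hyperparameters nu and gamma and trains the model with them. Finally, the trained model predicts labels for the training data, the new regular observations, and the new abnormal observations, and the number of errors in each set is counted.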
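
As a rough illustration of what nu controls, the sketch below fits OneClassSVM with two values of nu and reports the fraction of training points flagged as outliers. In scikit-learn, nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. The second value, 0.2, is an assumption chosen only for contrast.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(500, 2)
X_train = np.r_[X + 2, X - 2]  # two clusters, as in the example

fractions = {}
for nu in (0.05, 0.2):  # 0.2 is an illustrative value, not from the example
    clf = OneClassSVM(kernel="rbf", gamma=2.0, nu=nu).fit(X_train)
    y_pred = clf.predict(X_train)  # +1 for inliers, -1 for outliers
    fractions[nu] = float(np.mean(y_pred == -1))
    print(f"nu={nu}: fraction of training points flagged = {fractions[nu]:.3f}")
```

A larger nu lets the frontier exclude more of the training data, so the flagged fraction grows with it.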

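To make the model's behavior easier to read, the decision boundary is visualized with DecisionBoundaryDisplay. Seeing where the training data, the new regular observations, and the abnormal observations fall relative to the learned frontier gives a direct impression of how well the model detects anomalies.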

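The code first imports the required libraries (matplotlib, numpy, and sklearn), then sets the font and fixes the random seed so that the results are reproducible. It generates the training and test data, sets the One-Class SVM hyperparameters, and trains both OneClassSVM and SGDOneClassSVM, computing predictions on every dataset for each model. Finally, it draws the decision boundary with DecisionBoundaryDisplay, overlays the training, test, and abnormal data, and reports the error counts in the axis label.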

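Comparing OneClassSVM and SGDOneClassSVM shows that the two behave similarly on this small dataset. This suggests that a kernel approximation combined with SGD optimization is a practical approach to anomaly detection: it yields results close to those of the kernelized One-Class SVM while scaling better to large numbers of samples.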

Code example


import matplotlib.lines as mlines
import matplotlib.pyplot as plt
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

# Set the font and the random seed
font = {"weight": "normal", "size": 15}
plt.rc("font", **font)
random_state = 42
rng = np.random.RandomState(random_state)

# Generate training data
X = 0.3 * rng.randn(500, 2)
X_train = np.r_[X + 2, X - 2]

# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]

# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# One-Class SVM hyperparameters
nu = 0.05
gamma = 2.0

# Fit the kernelized One-Class SVM
clf = OneClassSVM(gamma=gamma, kernel="rbf", nu=nu)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

# Fit a linear One-Class SVM with SGD on top of a kernel approximation
transform = Nystroem(gamma=gamma, random_state=random_state)
clf_sgd = SGDOneClassSVM(
    nu=nu,
    shuffle=True,
    fit_intercept=True,
    random_state=random_state,
    tol=1e-4,
)
pipe_sgd = make_pipeline(transform, clf_sgd)
pipe_sgd.fit(X_train)
y_pred_train_sgd = pipe_sgd.predict(X_train)
y_pred_test_sgd = pipe_sgd.predict(X_test)
y_pred_outliers_sgd = pipe_sgd.predict(X_outliers)

# Visualize the decision boundary
from sklearn.inspection import DecisionBoundaryDisplay

fig, ax = plt.subplots(figsize=(9, 6))
xx, yy = np.meshgrid(np.linspace(-4.5, 4.5, 50), np.linspace(-4.5, 4.5, 50))
X = np.concatenate([xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)], axis=1)
# Shade the plane by the value of the decision function
DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    plot_method="contourf",
    ax=ax,
    cmap="PuBu",
)
# Draw the learned frontier (decision function equal to 0)
DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    plot_method="contour",
    ax=ax,
    linewidths=2,
    colors="darkred",
    levels=[0],
)
# Fill the region classified as regular (decision function above 0)
DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    plot_method="contourf",
    ax=ax,
    colors="palevioletred",
    levels=[0, clf.decision_function(X).max()],
)

s = 20
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c="white", s=s, edgecolors="k")
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c="blueviolet", s=s, edgecolors="k")
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c="gold", s=s, edgecolors="k")
ax.set(
    title="One-Class SVM",
    xlim=(-4.5, 4.5),
    ylim=(-4.5, 4.5),
    xlabel=(
        f"error train: {y_pred_train[y_pred_train == -1].size} / {X_train.shape[0]}; "
        f"errors novel regular: {y_pred_test[y_pred_test == -1].size} / {X_test.shape[0]}; "
        f"errors novel abnormal: {y_pred_outliers[y_pred_outliers == 1].size} / {X_outliers.shape[0]}"
    ),
)
ax.legend(
    [mlines.Line2D([], [], color="darkred", label="learned frontier"), b1, b2, c],
    ["learned frontier", "training observations", "new regular observations", "new abnormal observations"],
    loc="upper left",
)
plt.show()

沪ICP备2024098111号-1

上海秋旦网络科技中心 (Shanghai Qiudan Network Technology Center): Building 1, No. 8218 Jinda Highway, Fengxian District, Shanghai. Phone: 17898875485
