In this article, we compare the performance of multinomial logistic regression with L1 regularization against one-vs-rest L1 logistic regression on the 20newsgroups dataset. L1 regularization is a widely used sparsity-inducing technique: it shrinks the weights of uninformative features exactly to zero, effectively performing feature selection. This makes it useful for extracting the strongly discriminative words for each class. If the goal is the best predictive accuracy, however, a non-sparsity-inducing L2 penalty is often the better choice.
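To make the sparsity effect concrete, here is a minimal sketch (separate from the example code below) that fits an L1-penalized logistic regression on TF-IDF features for a handful of newsgroups and prints the top-weighted words per class. The choice of categories, the max_features cap, and the number of top words shown are illustrative assumptions, not part of the original experiment.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative subset: three categories and a capped vocabulary keep this quick
categories = ["sci.space", "rec.autos", "talk.politics.guns"]
train = fetch_20newsgroups(subset="train", categories=categories)
vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(train.data)
# saga supports the L1 penalty on sparse input; most weights end up exactly zero
clf = LogisticRegression(penalty="l1", solver="saga", max_iter=200)
clf.fit(X, train.target)
feature_names = vectorizer.get_feature_names_out()
for class_idx, class_name in enumerate(train.target_names):
    row = clf.coef_[class_idx]
    top = np.argsort(row)[-10:][::-1]  # ten largest positive weights
    print("%s (%i non-zero weights): %s" % (class_name, np.count_nonzero(row), ", ".join(feature_names[top])))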
A more traditional (and arguably better) approach when a sparse subset of input features is expected to be predictive is univariate feature selection followed by a conventional (L2-penalized) logistic regression. In this study the 20newsgroups dataset is used, with 4500 training samples, 130107 features, and 20 classes. Two models are compared, one-vs-rest (One versus Rest) and multinomial, both fit with the SAGA solver.
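Before turning to the experiment itself, here is a minimal sketch of that traditional pipeline, assuming SelectKBest with a chi-squared score and an arbitrary k of 10000; both are illustrative choices that do not appear in the example code below.

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = fetch_20newsgroups_vectorized(subset="all", return_X_y=True)
X, y = X[:5000], y[:5000]  # same sample cap as the example, for run time
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.1)
# Univariate selection first, then a dense (L2-penalized) logistic regression
pipe = make_pipeline(
    SelectKBest(chi2, k=10000),  # k is an illustrative choice
    LogisticRegression(penalty="l2", solver="saga", max_iter=100),
)
pipe.fit(X_train, y_train)
print("Test accuracy with univariate selection + L2: %.4f" % pipe.score(X_test, y_test))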
In the experiment, the one-vs-rest model was trained for 1, 2, and 3 epochs and the multinomial model for 1, 2, and 5 epochs, recording test accuracy and the percentage of non-zero coefficients per class after each run. The one-vs-rest model reached a test accuracy of 0.5960 with per-class non-zero percentages beginning [0.26593496 0.43348936 ...], and its 3-epoch run took 1.51 seconds. The multinomial model reached a test accuracy of 0.6440 with per-class non-zero percentages beginning [0.36047253 0.1268187 ...], and its 5-epoch run took 1.39 seconds.
The results show that the multinomial model beats the one-vs-rest model on test accuracy while also training faster, suggesting that on large datasets multinomial logistic regression can deliver both more accurate results and shorter training times. Both models also produce highly sparse solutions: the reported non-zero percentages are all fractions of a percent per class, confirming that the L1 penalty isolates a small set of features that contribute to classification.
On the implementation side, the code first imports the necessary libraries: timeit for timing, warnings to suppress convergence warnings, matplotlib.pyplot for plotting, numpy for numerical work, and the relevant sklearn modules for data loading and model training. The SAGA solver is used, and the sample count is capped at 5000 to keep the run time short. The 20newsgroups data is then loaded and split into training and test sets, the parameters of the two models are defined, and each model is trained and evaluated in turn. Finally, train time is plotted against test accuracy, and the total run time of the experiment is printed.
import timeit
import warnings
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
t0 = timeit.default_timer()
# We use SAGA solver
solver = "saga"
# Turn down for faster run time
n_samples = 5000
X, y = fetch_20newsgroups_vectorized(subset="all", return_X_y=True)
X = X[:n_samples]
y = y[:n_samples]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.1)
train_samples, n_features = X_train.shape
n_classes = np.unique(y).shape[0]
print("Dataset 20newsgroup, train_samples=%i, n_features=%i, n_classes=%i" % (train_samples, n_features, n_classes))
models = {
    "ovr": {"name": "One versus Rest", "iters": [1, 2, 3]},
    "multinomial": {"name": "Multinomial", "iters": [1, 2, 5]},
}
for model in models:
    # Add initial chance-level values for plotting purpose
    accuracies = [1 / n_classes]
    times = [0]
    densities = [1]
    model_params = models[model]
    # Small number of epochs for fast runtime
    for this_max_iter in model_params["iters"]:
        print("[model=%s, solver=%s] Number of epochs: %s" % (model_params["name"], solver, this_max_iter))
        clf = LogisticRegression(solver=solver, penalty="l1", max_iter=this_max_iter, random_state=42)
        if model == "ovr":
            clf = OneVsRestClassifier(clf)
        t1 = timeit.default_timer()
        clf.fit(X_train, y_train)
        train_time = timeit.default_timer() - t1
        y_pred = clf.predict(X_test)
        accuracy = np.sum(y_pred == y_test) / y_test.shape[0]
        if model == "ovr":
            coef = np.concatenate([est.coef_ for est in clf.estimators_])
        else:
            coef = clf.coef_
        density = np.mean(coef != 0, axis=1) * 100
        accuracies.append(accuracy)
        densities.append(density)
        times.append(train_time)
    models[model]["times"] = times
    models[model]["densities"] = densities
    models[model]["accuracies"] = accuracies
    print("Test accuracy for model %s: %.4f" % (model, accuracies[-1]))
    print("%% non-zero coefficients for model %s, per class:\n%s" % (model, densities[-1]))
    print("Run time (%i epochs) for model %s: %.2f" % (model_params["iters"][-1], model, times[-1]))
fig = plt.figure()
ax = fig.add_subplot(111)
for model in models:
    name = models[model]["name"]
    times = models[model]["times"]
    accuracies = models[model]["accuracies"]
    ax.plot(times, accuracies, marker="o", label="Model: %s" % name)
ax.set_xlabel("Train time (s)")
ax.set_ylabel("Test accuracy")
ax.legend()
fig.suptitle("Multinomial vs One-vs-Rest Logistic L1\nDataset %s" % "20newsgroups")
fig.tight_layout()
fig.subplots_adjust(top=0.85)
run_time = timeit.default_timer() - t0
print("Example run in %.3f s" % run_time)
plt.show()