Multiclass sparse logistic regression on the 20newsgroups dataset

This article compares the performance of L1-penalized multinomial logistic regression against one-vs-rest L1-penalized logistic regression on the 20newsgroups dataset. L1 regularization is a widely used sparsity-inducing technique: it performs feature selection by shrinking the weights of uninformative features to exactly zero, which makes it useful for extracting the strongly discriminative words of each class. If the goal is purely the best predictive accuracy, however, a non-sparsity-inducing L2 penalty is often the better choice.
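To make this contrast concrete, here is a minimal sketch (on a small synthetic problem, not part of the experiment below) showing that an L1 penalty drives many coefficients to exactly zero while an L2 penalty merely shrinks them:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic problem: only 10 of 100 features carry signal.
X, y = make_classification(
    n_samples=500, n_features=100, n_informative=10, random_state=0
)

for penalty in ("l1", "l2"):
    clf = LogisticRegression(
        penalty=penalty, solver="saga", max_iter=1000, random_state=0
    )
    clf.fit(X, y)
    # Fraction of coefficients that are exactly zero.
    sparsity = np.mean(clf.coef_ == 0) * 100
    print("penalty=%s -> %.1f%% of coefficients are exactly zero" % (penalty, sparsity))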

A more traditional (and arguably better) way to predict from a sparse subset of input features is to run univariate feature selection first and then fit a conventional (L2-penalized) logistic regression model. This study uses the 20newsgroups dataset with 4,500 training samples, 130,107 features, and 20 classes. Two models are compared, one-vs-rest (One versus Rest) and multinomial, both trained with the SAGA solver.
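For reference, here is a minimal sketch of that traditional alternative, assuming a chi-squared univariate filter with an illustrative k=10000; neither choice is taken from the experiment itself:

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = fetch_20newsgroups_vectorized(subset="all", return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y, test_size=0.1
)

# Univariate filter first, then a dense L2-penalized model on the
# surviving features.
pipe = make_pipeline(
    SelectKBest(chi2, k=10000),
    LogisticRegression(penalty="l2", solver="saga", max_iter=100, random_state=42),
)
pipe.fit(X_train, y_train)
print("Test accuracy: %.4f" % pipe.score(X_test, y_test))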

In the experiment, both models were trained for increasing numbers of epochs, recording the test accuracy and the percentage of non-zero coefficients per class at each stage. The one-vs-rest model was trained for 1, 2, and 3 epochs; it reached a test accuracy of 0.5960 with per-class non-zero coefficient percentages of [0.26593496 0.43348936 ...], and its 3-epoch run took 1.51 seconds. The multinomial model was trained for 1, 2, and 5 epochs; it reached a test accuracy of 0.6440 with per-class non-zero coefficient percentages of [0.36047253 0.1268187 ...], and its 5-epoch run took 1.39 seconds.

The results show that the multinomial model outperforms the one-vs-rest model in test accuracy while also training faster. This suggests that on large-scale datasets, multinomial logistic regression can deliver both more accurate predictions and shorter training times. The per-class percentages of non-zero coefficients also differ between the two models, meaning each formulation selects a somewhat different set of features as discriminative for each class.
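To illustrate how those non-zero coefficients can be read back as discriminative words, here is a hypothetical sketch that refits a small L1 model on the raw text, since fetch_20newsgroups_vectorized with return_X_y=True does not expose feature names; the vectorizer and all parameters are illustrative assumptions, not part of the experiment:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

data = fetch_20newsgroups(subset="train")
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data.data)

# Few epochs to keep the sketch fast; a convergence warning is expected.
clf = LogisticRegression(penalty="l1", solver="saga", max_iter=5, random_state=42)
clf.fit(X, data.target)

feature_names = vectorizer.get_feature_names_out()
for i, class_name in enumerate(data.target_names[:3]):  # first 3 classes
    top = np.argsort(clf.coef_[i])[-5:][::-1]  # 5 largest positive weights
    print(class_name, "->", feature_names[top].tolist())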

On the implementation side, the code first imports the necessary libraries: timeit for timing, warnings to silence convergence warnings, matplotlib.pyplot for plotting, numpy for numerical work, and the relevant sklearn modules for data loading and model training. The SAGA solver is used, and the sample count is capped at 5,000 to keep the run time short. The 20newsgroups data is then loaded and split into training and test sets. Next, the parameters of the two models are defined, and each model is trained and evaluated in turn. Finally, the script plots training time against test accuracy and prints the total run time of the experiment.

import timeit
import warnings

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
t0 = timeit.default_timer()

# We use SAGA solver
solver = "saga"

# Turn down for faster run time
n_samples = 5000

X, y = fetch_20newsgroups_vectorized(subset="all", return_X_y=True)
X = X[:n_samples]
y = y[:n_samples]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y, test_size=0.1
)
train_samples, n_features = X_train.shape
n_classes = np.unique(y).shape[0]

print(
    "Dataset 20newsgroup, train_samples=%i, n_features=%i, n_classes=%i"
    % (train_samples, n_features, n_classes)
)

models = {
    "ovr": {"name": "One versus Rest", "iters": [1, 2, 3]},
    "multinomial": {"name": "Multinomial", "iters": [1, 2, 5]},
}

for model in models:
    # Add initial chance-level values for plotting purpose
    accuracies = [1 / n_classes]
    times = [0]
    densities = [1]

    model_params = models[model]

    # Small number of epochs for fast runtime
    for this_max_iter in model_params["iters"]:
        print(
            "[model=%s, solver=%s] Number of epochs: %s"
            % (model_params["name"], solver, this_max_iter)
        )
        clf = LogisticRegression(
            solver=solver, penalty="l1", max_iter=this_max_iter, random_state=42
        )
        if model == "ovr":
            clf = OneVsRestClassifier(clf)
        t1 = timeit.default_timer()
        clf.fit(X_train, y_train)
        train_time = timeit.default_timer() - t1

        y_pred = clf.predict(X_test)
        accuracy = np.sum(y_pred == y_test) / y_test.shape[0]

        # Per-class density: percentage of non-zero coefficients.
        if model == "ovr":
            coef = np.concatenate([est.coef_ for est in clf.estimators_])
        else:
            coef = clf.coef_
        density = np.mean(coef != 0, axis=1) * 100
        accuracies.append(accuracy)
        densities.append(density)
        times.append(train_time)

    models[model]["times"] = times
    models[model]["densities"] = densities
    models[model]["accuracies"] = accuracies

    print("Test accuracy for model %s: %.4f" % (model, accuracies[-1]))
    print(
        "%% non-zero coefficients for model %s, per class:\n%s"
        % (model, densities[-1])
    )
    print(
        "Run time (%i epochs) for model %s: %.2f"
        % (model_params["iters"][-1], model, times[-1])
    )

fig = plt.figure()
ax = fig.add_subplot(111)

for model in models:
    name = models[model]["name"]
    times = models[model]["times"]
    accuracies = models[model]["accuracies"]
    ax.plot(times, accuracies, marker="o", label="Model: %s" % name)
    ax.set_xlabel("Train time (s)")
    ax.set_ylabel("Test accuracy")
    ax.legend()

fig.suptitle("Multinomial vs One-vs-Rest Logistic L1\nDataset %s" % "20newsgroups")
fig.tight_layout()
fig.subplots_adjust(top=0.85)

run_time = timeit.default_timer() - t0
print("Example run in %.3f s" % run_time)

plt.show()