In this article, we compare the performance of multinomial logistic regression with L1 regularization against one-vs-rest L1 logistic regression on the 20newsgroups dataset. L1 regularization is a widely used sparsity-inducing technique: it shrinks the weights of uninformative features exactly to zero, effectively performing feature selection. This makes it useful for extracting the strongly discriminative words for each class. If the goal is the best predictive accuracy, however, a non-sparsity-inducing L2 penalty is often the better choice.
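To make the sparsity effect concrete, here is a minimal sketch (separate from the example code below) that fits an L1-penalized logistic regression on TF-IDF features for a handful of newsgroups and prints the top-weighted words per class. The choice of categories, the max_features cap, and the number of top words shown are illustrative assumptions, not part of the original experiment.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative subset: three categories and a capped vocabulary keep this quick
categories = ["sci.space", "rec.autos", "talk.politics.guns"]
train = fetch_20newsgroups(subset="train", categories=categories)
vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(train.data)
# saga supports the L1 penalty on sparse input; most weights end up exactly zero
clf = LogisticRegression(penalty="l1", solver="saga", max_iter=200)
clf.fit(X, train.target)
feature_names = vectorizer.get_feature_names_out()
for class_idx, class_name in enumerate(train.target_names):
    row = clf.coef_[class_idx]
    top = np.argsort(row)[-10:][::-1]  # ten largest positive weights
    print("%s (%i non-zero weights): %s" % (class_name, np.count_nonzero(row), ", ".join(feature_names[top])))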
A more traditional (and arguably better) approach when a sparse subset of input features is expected to be predictive is univariate feature selection followed by a conventional (L2-penalized) logistic regression. In this study the 20newsgroups dataset is used, with 4500 training samples, 130107 features, and 20 classes. Two models are compared, one-vs-rest (One versus Rest) and multinomial, both fit with the SAGA solver.
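Before turning to the experiment itself, here is a minimal sketch of that traditional pipeline, assuming SelectKBest with a chi-squared score and an arbitrary k of 10000; both are illustrative choices that do not appear in the example code below.

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = fetch_20newsgroups_vectorized(subset="all", return_X_y=True)
X, y = X[:5000], y[:5000]  # same sample cap as the example, for run time
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.1)
# Univariate selection first, then a dense (L2-penalized) logistic regression
pipe = make_pipeline(
    SelectKBest(chi2, k=10000),  # k is an illustrative choice
    LogisticRegression(penalty="l2", solver="saga", max_iter=100),
)
pipe.fit(X_train, y_train)
print("Test accuracy with univariate selection + L2: %.4f" % pipe.score(X_test, y_test))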
In the experiment, the one-vs-rest model was trained for 1, 2, and 3 epochs and the multinomial model for 1, 2, and 5 epochs, recording test accuracy and the percentage of non-zero coefficients per class after each run. The one-vs-rest model reached a test accuracy of 0.5960 with per-class non-zero percentages beginning [0.26593496 0.43348936 ...], and its 3-epoch run took 1.51 seconds. The multinomial model reached a test accuracy of 0.6440 with per-class non-zero percentages beginning [0.36047253 0.1268187 ...], and its 5-epoch run took 1.39 seconds.
The results show that the multinomial model beats the one-vs-rest model on test accuracy while also training faster, suggesting that on large datasets multinomial logistic regression can deliver both more accurate results and shorter training times. Both models also produce highly sparse solutions: the reported non-zero percentages are all fractions of a percent per class, confirming that the L1 penalty isolates a small set of features that contribute to classification.
On the implementation side, the code first imports the necessary libraries: timeit for timing, warnings to suppress convergence warnings, matplotlib.pyplot for plotting, numpy for numerical work, and the relevant sklearn modules for data loading and model training. The SAGA solver is used, and the sample count is capped at 5000 to keep the run time short. The 20newsgroups data is then loaded and split into training and test sets, the parameters of the two models are defined, and each model is trained and evaluated in turn. Finally, train time is plotted against test accuracy, and the total run time of the experiment is printed.
import timeit
import warnings
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
t0 = timeit.default_timer()
# We use SAGA solver
solver = "saga"
# Turn down for faster run time
n_samples = 5000
X, y = fetch_20newsgroups_vectorized(subset="all", return_X_y=True)
X = X[:n_samples]
y = y[:n_samples]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.1)
train_samples, n_features = X_train.shape
n_classes = np.unique(y).shape[0]
print("Dataset 20newsgroup, train_samples=%i, n_features=%i, n_classes=%i" % (train_samples, n_features, n_classes))
models = {
    "ovr": {"name": "One versus Rest", "iters": [1, 2, 3]},
    "multinomial": {"name": "Multinomial", "iters": [1, 2, 5]},
}
for model in models:
    # Add initial chance-level values for plotting purpose
    accuracies = [1 / n_classes]
    times = [0]
    densities = [1]
    model_params = models[model]
    # Small number of epochs for fast runtime
    for this_max_iter in model_params["iters"]:
        print("[model=%s, solver=%s] Number of epochs: %s" % (model_params["name"], solver, this_max_iter))
        clf = LogisticRegression(solver=solver, penalty="l1", max_iter=this_max_iter, random_state=42)
        if model == "ovr":
            clf = OneVsRestClassifier(clf)
        t1 = timeit.default_timer()
        clf.fit(X_train, y_train)
        train_time = timeit.default_timer() - t1
        y_pred = clf.predict(X_test)
        accuracy = np.sum(y_pred == y_test) / y_test.shape[0]
        if model == "ovr":
            coef = np.concatenate([est.coef_ for est in clf.estimators_])
        else:
            coef = clf.coef_
        density = np.mean(coef != 0, axis=1) * 100
        accuracies.append(accuracy)
        densities.append(density)
        times.append(train_time)
    models[model]["times"] = times
    models[model]["densities"] = densities
    models[model]["accuracies"] = accuracies
    print("Test accuracy for model %s: %.4f" % (model, accuracies[-1]))
    print("%% non-zero coefficients for model %s, per class:\n%s" % (model, densities[-1]))
    print("Run time (%i epochs) for model %s: %.2f" % (model_params["iters"][-1], model, times[-1]))
fig = plt.figure()
ax = fig.add_subplot(111)
for model in models:
    name = models[model]["name"]
    times = models[model]["times"]
    accuracies = models[model]["accuracies"]
    ax.plot(times, accuracies, marker="o", label="Model: %s" % name)
ax.set_xlabel("Train time (s)")
ax.set_ylabel("Test accuracy")
ax.legend()
fig.suptitle("Multinomial vs One-vs-Rest Logistic L1\nDataset %s" % "20newsgroups")
fig.tight_layout()
fig.subplots_adjust(top=0.85)
run_time = timeit.default_timer() - t0
print("Example run in %.3f s" % run_time)
plt.show()