Post-hoc tuning of the decision threshold

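Once a binary classifier is trained, its probability outputs are converted into class predictions by applying a threshold. The default threshold is usually 0.5, which is not always the best choice. This example shows how to use TunedThresholdClassifierCV to tune the decision threshold against a chosen performance metric, so that the model's predictions score better on that metric.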
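
As a minimal sketch of what "applying a threshold" means, the snippet below maps positive-class probabilities to labels. The helper `predict_with_threshold` and the probability values are hypothetical, made up for illustration, and the use of `>=` at the cut-off is an assumption of this sketch.

```python
def predict_with_threshold(probabilities, threshold=0.5,
                           labels=("tested_negative", "tested_positive")):
    """Map positive-class probabilities to class labels using a cut-off."""
    negative, positive = labels
    return [positive if p >= threshold else negative for p in probabilities]


# Hypothetical positive-class probabilities, made up for illustration.
probabilities = [0.10, 0.35, 0.45, 0.60, 0.90]

default_predictions = predict_with_threshold(probabilities)  # cut-off at 0.5
lowered_predictions = predict_with_threshold(probabilities, threshold=0.3)

# Lowering the cut-off turns more samples into positive predictions.
print(default_predictions.count("tested_positive"))
print(lowered_predictions.count("tested_positive"))
```

The model that produced the probabilities is untouched; only the rule that turns them into labels changes.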

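The diabetes dataset

To illustrate the tuning of the decision threshold, we use the diabetes dataset. It is available on OpenML, and we fetch it with the fetch_openml function.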


from sklearn.datasets import fetch_openml
diabetes = fetch_openml(data_id=37, as_frame=True, parser="pandas")
data, target = diabetes.data, diabetes.target

Looking at the target variable tells us what kind of problem we are dealing with.


target.value_counts()

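This is a binary classification problem. Because the labels are not encoded as 0 and 1, we state explicitly that the class labeled "tested_negative" is the negative class (it is also the most frequent one) and the class labeled "tested_positive" is the positive class.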
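
One way to make that convention explicit in code is to order the labels by frequency. The sketch below uses a hypothetical list of labels with made-up counts in place of `target`; with a pandas Series, `target.value_counts().index` lists the labels in the same most-frequent-first order.

```python
from collections import Counter

# Hypothetical label column standing in for `target`; the counts are made up.
labels = ["tested_negative"] * 7 + ["tested_positive"] * 3

# most_common() sorts the labels from most to least frequent, so the first
# entry is treated as the negative class and the second as the positive class.
(neg_label, neg_count), (pos_label, pos_count) = Counter(labels).most_common()

print(neg_label, neg_count)
print(pos_label, pos_count)
```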

Vanilla classifier

We define a basic predictive model made of a scaler followed by a logistic regression classifier.


from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model = make_pipeline(StandardScaler(), LogisticRegression())

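We evaluate the model with cross-validation and report both accuracy and balanced accuracy. Balanced accuracy is less sensitive to class imbalance, so it helps us put the accuracy score in perspective.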
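
To see why the two metrics can disagree, the sketch below computes both by hand on a toy imbalanced problem. The helper functions and the data are hypothetical; balanced accuracy is taken here as the average of the recall obtained on each class.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Average of the recall obtained on each class."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(y_pred[i] == label for i in indices)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Toy data: 8 negatives, 2 positives, and a classifier that always
# predicts the majority (negative) class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc, bal_acc)
```

A classifier that ignores the minority class entirely still looks acceptable on accuracy, while balanced accuracy falls to chance level.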


import pandas as pd
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
scoring = ["accuracy", "balanced_accuracy"]
cv_scores = ["train_accuracy", "test_accuracy", "train_balanced_accuracy", "test_balanced_accuracy"]
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
cv_results_vanilla_model = pd.DataFrame(cross_validate(model, data, target, scoring=scoring, cv=cv, return_train_score=True, return_estimator=True))
cv_results_vanilla_model[cv_scores].aggregate(["mean", "std"]).T

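The predictive model captures the relationship between the data and the target. The train and test scores are close to each other, which means the model is not overfitting. The balanced accuracy is lower than the accuracy because of the class imbalance mentioned above.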

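Tuning the decision threshold

The TunedThresholdClassifierCV meta-estimator tunes the decision threshold of a classifier for a metric of interest. We evaluate it with the same cross-validation strategy as before.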


from sklearn.model_selection import TunedThresholdClassifierCV
tuned_model = TunedThresholdClassifierCV(estimator=model, scoring="balanced_accuracy")
cv_results_tuned_model = pd.DataFrame(cross_validate(tuned_model, data, target, scoring=scoring, cv=cv, return_train_score=True, return_estimator=True))
cv_results_tuned_model[cv_scores].aggregate(["mean", "std"]).T

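Compared with the vanilla model, the balanced accuracy score increases. This comes at the cost of a lower accuracy score: the model is now more sensitive to the positive class but makes more mistakes on the negative class.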
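
The trade-off can be reproduced by hand with a simple threshold sweep. This is a simplified sketch, not the implementation of TunedThresholdClassifierCV, which additionally relies on internal cross-validation. The scores, labels, and helper functions are hypothetical.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Average of the recall on the negative (0) and positive (1) class."""
    recalls = []
    for label in (0, 1):
        indices = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in indices) / len(indices))
    return sum(recalls) / 2


def predict(scores, threshold):
    """Predict the positive class when the score reaches the threshold."""
    return [int(s >= threshold) for s in scores]


# Made-up positive-class scores: 7 negatives followed by 3 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.12, 0.15, 0.22, 0.33, 0.36, 0.45, 0.31, 0.42, 0.72]

# Try a coarse grid of thresholds and keep the one with the best
# balanced accuracy.
thresholds = [i / 10 for i in range(1, 10)]
best_threshold = max(
    thresholds, key=lambda t: balanced_accuracy(y_true, predict(scores, t))
)

for name, threshold in [("default", 0.5), ("tuned", best_threshold)]:
    y_pred = predict(scores, threshold)
    print(name, threshold, accuracy(y_true, y_pred),
          round(balanced_accuracy(y_true, y_pred), 3))
```

On this toy data the selected threshold is lower than 0.5: every positive is recovered, but more negatives are misclassified, so accuracy drops while balanced accuracy rises.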

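Comparing the model coefficients

It is important to note that the tuned predictive model is internally the same model as the vanilla one: both have the same fitted coefficients.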


import matplotlib.pyplot as plt
vanilla_model_coef = pd.DataFrame([est[-1].coef_.ravel() for est in cv_results_vanilla_model["estimator"]], columns=diabetes.feature_names)
tuned_model_coef = pd.DataFrame([est.estimator_[-1].coef_.ravel() for est in cv_results_tuned_model["estimator"]], columns=diabetes.feature_names)
fig, ax = plt.subplots(ncols=2, figsize=(12, 4), sharex=True, sharey=True)
vanilla_model_coef.boxplot(ax=ax[0])
ax[0].set_ylabel("Coefficient value")
ax[0].set_title("Vanilla model")
tuned_model_coef.boxplot(ax=ax[1])
ax[1].set_title("Tuned model")
plt.suptitle("Coefficients of the predictive models")

Only the decision threshold of each model changed during cross-validation.


decision_threshold = pd.Series([est.best_threshold_ for est in cv_results_tuned_model["estimator"]])
ax = decision_threshold.plot.kde()
ax.axvline(decision_threshold.mean(), color="k", linestyle="--", label=f"Mean decision threshold: {decision_threshold.mean():.2f}")
ax.set_xlabel("Decision threshold")
ax.legend(loc="upper right")
ax.set_title("Distribution of the decision threshold across different cross-validation folds")

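On average, a decision threshold of about 0.32 maximizes the balanced accuracy, which differs from the default threshold of 0.5. Tuning the decision threshold therefore matters most when the output of the predictive model is used to make decisions. The metric used to tune the threshold should also be chosen carefully. Balanced accuracy was used here, but it may not be the most appropriate metric for the problem at hand. The choice of the "right" metric usually depends on the problem and may require some domain knowledge. For more details, see the example titled "Post-tuning the decision threshold for cost-sensitive learning".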

沪ICP备2024098111号-1

上海秋旦网络科技中心：上海市奉贤区金大公路8218号1幢联系电话：17898875485
