Comparing Probability Calibration Methods

In machine learning, a classifier often provides not only a class label but also an associated probability, which conveys a degree of confidence in the prediction. However, not all classifiers provide well-calibrated probabilities: some are over-confident while others are under-confident. A separate calibration of the predicted probabilities is therefore often desirable as a postprocessing step. This post introduces two different calibration methods and evaluates the quality of the returned probabilities using the Brier score.
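For a binary problem, the Brier score is simply the mean squared difference between the predicted probability of the positive class and the actual 0/1 outcome. A minimal sketch with made-up toy numbers (not from this example) showing that the hand computation matches sklearn's brier_score_loss:

import numpy as np
from sklearn.metrics import brier_score_loss

# Toy labels and predicted probabilities for the positive class (illustrative only)
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.9, 0.8, 0.3, 0.6])

# Brier score: mean of (predicted probability - actual outcome)^2
manual = np.mean((y_prob - y_true) ** 2)
print(manual)                            # 0.062
print(brier_score_loss(y_true, y_prob))  # same value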

We compare the probabilities estimated by a Gaussian naive Bayes classifier without calibration, with sigmoid calibration, and with non-parametric isotonic calibration. One can observe that only the non-parametric model is able to return probabilities close to the expected 0.5 for most of the samples belonging to the middle cluster, whose labels are heterogeneous. This results in a significantly improved Brier score.
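For intuition on the two methods: sigmoid (Platt) calibration fits a logistic function to the classifier's scores, which assumes a particular S-shaped distortion, while isotonic calibration fits an arbitrary non-decreasing step function. Below is a simplified, standalone sketch of both mappings; the scores, labels, and the score**2 distortion are made up for illustration, and CalibratedClassifierCV wraps the same idea with internal cross-validation rather than fitting directly like this:

import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.RandomState(0)
scores = rng.rand(10000)                            # "uncalibrated" scores in [0, 1]
labels = (rng.rand(10000) < scores**2).astype(int)  # true P(y=1) = score^2, so raw scores are miscalibrated

# Sigmoid-style calibration: a logistic function of the score
platt = LogisticRegression().fit(scores.reshape(-1, 1), labels)
prob_sigmoid = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic calibration: non-decreasing step function, no parametric shape assumed
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, labels)
prob_isotonic = iso.predict(scores)

print("Brier, raw scores: %.3f" % brier_score_loss(labels, scores))
print("Brier, sigmoid:    %.3f" % brier_score_loss(labels, prob_sigmoid))
print("Brier, isotonic:   %.3f" % brier_score_loss(labels, prob_isotonic))

Note this sketch fits and evaluates on the same data for brevity; CalibratedClassifierCV avoids that by calibrating on held-out folds.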

Generating the synthetic dataset

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

n_samples = 50000
n_bins = 3  # one bin per cluster, used for the reliability check below

# Three Gaussian blobs; samples stay grouped by center because shuffle=False
centers = [(-5, -5), (0, 0), (5, 5)]
X, y = make_blobs(n_samples=n_samples, centers=centers, shuffle=False, random_state=42)

# Split the labels in half, so the middle cluster gets a mix of both classes
y[: n_samples // 2] = 0
y[n_samples // 2 :] = 1
sample_weight = np.random.RandomState(42).rand(y.shape[0])

# Hold out 90% of the data for testing
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
    X, y, sample_weight, test_size=0.9, random_state=42
)
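Since shuffle=False keeps the make_blobs output grouped by center, the middle third of the samples comes from the (0, 0) cluster, and the label split at n_samples // 2 cuts right through it. A quick sanity check (assuming, as make_blobs does by default, roughly equal-sized clusters generated in order):

# The middle third of the un-shuffled data is the (0, 0) cluster
third = n_samples // 3
mid_labels = y[third : 2 * third]
print("middle-cluster fraction of class 1: %.3f" % mid_labels.mean())  # ~0.5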

Gaussian Naive Bayes

from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
from sklearn.naive_bayes import GaussianNB

# Gaussian naive Bayes without calibration (fit without the sample weights)
clf = GaussianNB()
clf.fit(X_train, y_train)
prob_pos_clf = clf.predict_proba(X_test)[:, 1]

# Gaussian naive Bayes with isotonic calibration
clf_isotonic = CalibratedClassifierCV(clf, cv=2, method="isotonic")
clf_isotonic.fit(X_train, y_train, sample_weight=sw_train)
prob_pos_isotonic = clf_isotonic.predict_proba(X_test)[:, 1]

# Gaussian naive Bayes with sigmoid calibration
clf_sigmoid = CalibratedClassifierCV(clf, cv=2, method="sigmoid")
clf_sigmoid.fit(X_train, y_train, sample_weight=sw_train)
prob_pos_sigmoid = clf_sigmoid.predict_proba(X_test)[:, 1]

print("Brier score losses: (the smaller the better)")

clf_score = brier_score_loss(y_test, prob_pos_clf, sample_weight=sw_test)
print("No calibration: %1.3f" % clf_score)

clf_isotonic_score = brier_score_loss(y_test, prob_pos_isotonic, sample_weight=sw_test)
print("With isotonic calibration: %1.3f" % clf_isotonic_score)

clf_sigmoid_score = brier_score_loss(y_test, prob_pos_sigmoid, sample_weight=sw_test)
print("With sigmoid calibration: %1.3f" % clf_sigmoid_score)

Brier score losses: (the smaller the better)
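A single Brier number summarizes calibration and sharpness together; a reliability check shows where each model is off. Below is a minimal sketch using sklearn.calibration.calibration_curve, reusing the probabilities and the n_bins defined above (note that calibration_curve does not take the sample weights used for the Brier scores, so this is only an unweighted check):

from sklearn.calibration import calibration_curve

for name, probs in [
    ("No calibration", prob_pos_clf),
    ("Isotonic calibration", prob_pos_isotonic),
    ("Sigmoid calibration", prob_pos_sigmoid),
]:
    frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=n_bins)
    # A well-calibrated model has frac_pos close to mean_pred in every bin
    print(name, np.round(mean_pred, 2), np.round(frac_pos, 2))

Finally, we plot the training data and the predicted probabilities of each variant, sorting the test instances by their uncalibrated GNB score: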

import matplotlib.pyplot as plt
from matplotlib import cm

plt.figure()
y_unique = np.unique(y)
colors = cm.rainbow(np.linspace(0.0, 1.0, y_unique.size))
for this_y, color in zip(y_unique, colors):
    this_X = X_train[y_train == this_y]
    this_sw = sw_train[y_train == this_y]
    plt.scatter(
        this_X[:, 0],
        this_X[:, 1],
        s=this_sw * 50,
        c=color[np.newaxis, :],
        alpha=0.5,
        edgecolor="k",
        label="Class %s" % this_y,
    )
plt.legend(loc="best")
plt.title("Data")

plt.figure()
order = np.lexsort((prob_pos_clf,))
plt.plot(prob_pos_clf[order], "r", label="No calibration (%1.3f)" % clf_score)
plt.plot(
    prob_pos_isotonic[order],
    "g",
    linewidth=3,
    label="Isotonic calibration (%1.3f)" % clf_isotonic_score,
)
plt.plot(
    prob_pos_sigmoid[order],
    "b",
    linewidth=3,
    label="Sigmoid calibration (%1.3f)" % clf_sigmoid_score,
)
plt.plot(
    np.linspace(0, y_test.size, 51)[1::2],
    y_test[order].reshape(25, -1).mean(1),
    "k",
    linewidth=3,
    label=r"Empirical",
)
plt.ylim([-0.05, 1.05])
plt.xlabel("Instances sorted according to predicted probability (uncalibrated GNB)")
plt.ylabel("P(y=1)")
plt.legend(loc="upper left")
plt.title("Gaussian naive Bayes probabilities")
plt.show()