Techniques for Handling Missing Values

In machine learning, data preprocessing is an important step, especially when missing values are involved. Missing values can be handled in several ways, for example by filling them with a constant, or with the mean, median, or mode of the feature. This article explores different imputation techniques and compares them on two datasets: the diabetes dataset and the California housing dataset.
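As a quick illustration of these strategies before the full benchmark, here is a minimal sketch on a toy array (not part of the comparison below):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A toy matrix with one missing entry in the first column
X = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])

# Fill with the column mean: (1 + 7) / 2 = 4
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Fill with a constant value of 0
X_zero = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)

print(X_mean[1, 0], X_zero[1, 0])  # 4.0 0.0
```

Observed entries are left untouched in both cases; only the `np.nan` cell changes.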

The Datasets

The diabetes dataset consists of 442 samples with 10 features each; the task is to predict disease progression. The California housing dataset is much larger, with 20640 samples and 8 features; the task is to predict the median house value in each California district. Since neither dataset has any missing values to begin with, some values will be removed artificially to create versions with synthetic missing data. The performance of a random forest regressor on the full original data will then be compared with its performance on the artificially corrupted data after imputation with the different techniques.

Creating Missing Values

First, fetch the two datasets. The diabetes dataset ships with scikit-learn, while the California housing dataset needs to be downloaded. To speed up computation, only the first 300 samples are used, though the full datasets could be used as well.

import numpy as np

from sklearn.datasets import fetch_california_housing, load_diabetes

rng = np.random.RandomState(42)

X_diabetes, y_diabetes = load_diabetes(return_X_y=True)
X_california, y_california = fetch_california_housing(return_X_y=True)
X_california = X_california[:300]
y_california = y_california[:300]
X_diabetes = X_diabetes[:300]
y_diabetes = y_diabetes[:300]


def add_missing_values(X_full, y_full):
    n_samples, n_features = X_full.shape

    # Add missing values to 75% of the rows, one random feature per row
    missing_rate = 0.75
    n_missing_samples = int(n_samples * missing_rate)

    missing_samples = np.zeros(n_samples, dtype=bool)
    missing_samples[:n_missing_samples] = True

    rng.shuffle(missing_samples)
    missing_features = rng.randint(0, n_features, n_missing_samples)
    X_missing = X_full.copy()
    X_missing[missing_samples, missing_features] = np.nan
    y_missing = y_full.copy()

    return X_missing, y_missing


X_miss_california, y_miss_california = add_missing_values(X_california, y_california)
X_miss_diabetes, y_miss_diabetes = add_missing_values(X_diabetes, y_diabetes)

Imputing the Missing Data and Scoring

Next, write a function to score the results on the differently imputed data. Each imputer will be examined in turn:

rng = np.random.RandomState(0)

from sklearn.ensemble import RandomForestRegressor

# The IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

N_SPLITS = 4
regressor = RandomForestRegressor(random_state=0)


def get_scores_for_imputer(imputer, X_missing, y_missing):
    estimator = make_pipeline(imputer, regressor)
    impute_scores = cross_val_score(
        estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=N_SPLITS
    )
    return impute_scores

Estimating the Score on the Original Data

First, we want to estimate the score on the original, complete data:

def get_full_score(X_full, y_full):
    full_scores = cross_val_score(
        regressor, X_full, y_full, scoring="neg_mean_squared_error", cv=N_SPLITS
    )
    return full_scores.mean(), full_scores.std()

Replacing Missing Values with 0

Now estimate the score on data where the missing values have been replaced by 0:

def get_impute_zero_score(X_missing, y_missing):
    imputer = SimpleImputer(
        missing_values=np.nan, add_indicator=True, strategy="constant", fill_value=0
    )
    zero_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
    return zero_impute_scores.mean(), zero_impute_scores.std()

kNN-Imputation of the Missing Values

The kNN imputer fills each missing value with the weighted or unweighted mean of the desired number of nearest neighbors.
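To make the neighbor-averaging concrete, here is a minimal sketch on a toy matrix (separate from the benchmark code):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: row 2 is missing its first feature
X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])

# Distances are computed on the observed feature (column 1), so the two
# nearest neighbors of row 2 are rows 1 and 3. With uniform weights, the
# missing entry becomes the mean of their first-column values:
# (3.0 + 8.0) / 2 = 5.5
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed[2, 0])  # 5.5
```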

def get_impute_knn_score(X_missing, y_missing):
    imputer = KNNImputer(missing_values=np.nan, add_indicator=True)
    knn_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
    return knn_impute_scores.mean(), knn_impute_scores.std()

Imputing Missing Values with the Mean

Another option is to use the mean imputer.

def get_impute_mean(X_missing, y_missing):
    imputer = SimpleImputer(missing_values=np.nan, strategy="mean", add_indicator=True)
    mean_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
    return mean_impute_scores.mean(), mean_impute_scores.std()

Iterative Imputation of the Missing Values

The iterative imputer uses round-robin linear regression, modeling each feature with missing values as a function of the other features.
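A minimal sketch of this round-robin idea on a toy matrix with a perfect linear relationship (the default estimator, a Bayesian ridge regression, should recover it approximately):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# The second column is exactly twice the first; the last value is missing.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])

# The imputer fits a regression of each feature on the others, so the
# missing entry should land close to 2 * 4.0 = 8.0 (not exactly 8.0,
# because of the regularization in the default estimator).
imputer = IterativeImputer(random_state=0, max_iter=10)
X_imputed = imputer.fit_transform(X)
```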

def get_impute_iterative(X_missing, y_missing):
    imputer = IterativeImputer(
        missing_values=np.nan,
        add_indicator=True,
        random_state=0,
        n_nearest_features=3,
        max_iter=10,
        sample_posterior=True,
    )
    iterative_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
    return iterative_impute_scores.mean(), iterative_impute_scores.std()

Visualizing the Results

Finally, collect the scores and visualize them:

import matplotlib.pyplot as plt

# Collect the (negated) MSE of each strategy on both datasets
x_labels = [
    "Full data",
    "Zero imputation",
    "Mean Imputation",
    "KNN Imputation",
    "Iterative Imputation",
]
mses_diabetes = np.zeros(5)
stds_diabetes = np.zeros(5)
mses_california = np.zeros(5)
stds_california = np.zeros(5)

mses_diabetes[0], stds_diabetes[0] = get_full_score(X_diabetes, y_diabetes)
mses_california[0], stds_california[0] = get_full_score(X_california, y_california)
mses_diabetes[1], stds_diabetes[1] = get_impute_zero_score(X_miss_diabetes, y_miss_diabetes)
mses_california[1], stds_california[1] = get_impute_zero_score(X_miss_california, y_miss_california)
mses_diabetes[2], stds_diabetes[2] = get_impute_mean(X_miss_diabetes, y_miss_diabetes)
mses_california[2], stds_california[2] = get_impute_mean(X_miss_california, y_miss_california)
mses_diabetes[3], stds_diabetes[3] = get_impute_knn_score(X_miss_diabetes, y_miss_diabetes)
mses_california[3], stds_california[3] = get_impute_knn_score(X_miss_california, y_miss_california)
mses_diabetes[4], stds_diabetes[4] = get_impute_iterative(X_miss_diabetes, y_miss_diabetes)
mses_california[4], stds_california[4] = get_impute_iterative(X_miss_california, y_miss_california)

# cross_val_score returns negative MSE; flip the sign for plotting
mses_diabetes = mses_diabetes * -1
mses_california = mses_california * -1

n_bars = len(mses_diabetes)
xval = np.arange(n_bars)
colors = ["r", "g", "b", "orange", "black"]

plt.figure(figsize=(12, 6))

# Left: diabetes dataset
ax1 = plt.subplot(121)
for j in xval:
    ax1.barh(
        j, mses_diabetes[j], xerr=stds_diabetes[j],
        color=colors[j], alpha=0.6, align="center",
    )
ax1.set_title("Imputation Techniques with Diabetes Data")
ax1.set_xlim(left=np.min(mses_diabetes) * 0.9, right=np.max(mses_diabetes) * 1.1)
ax1.set_yticks(xval)
ax1.set_xlabel("MSE")
ax1.invert_yaxis()
ax1.set_yticklabels(x_labels)

# Right: California housing dataset
ax2 = plt.subplot(122)
for j in xval:
    ax2.barh(
        j, mses_california[j], xerr=stds_california[j],
        color=colors[j], alpha=0.6, align="center",
    )
ax2.set_title("Imputation Techniques with California Data")
ax2.set_yticks(xval)
ax2.set_xlabel("MSE")
ax2.invert_yaxis()
ax2.set_yticklabels([""] * n_bars)

plt.show()

Other techniques can also be tried. For instance, the median is a more robust estimator for data with high-magnitude variables, which could otherwise dominate the results (the so-called "long tail").
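As a sketch of that suggestion, a median-based scorer analogous to the mean version above could look like this (the function name get_impute_median_score is ours; the helper is written self-contained here rather than reusing get_scores_for_imputer):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def get_impute_median_score(X_missing, y_missing, cv=4):
    """Cross-validated neg-MSE of a median-imputation pipeline."""
    imputer = SimpleImputer(missing_values=np.nan, strategy="median", add_indicator=True)
    estimator = make_pipeline(imputer, RandomForestRegressor(random_state=0))
    scores = cross_val_score(
        estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=cv
    )
    return scores.mean(), scores.std()
```

The only change from the mean version is strategy="median"; it plugs into the same comparison as an extra bar.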
