机器学习与因果效应推断的局限性

机器学习模型在测量统计关联方面表现出色,但除非对数据做出强有力的假设,否则这些模型无法推断出因果效应。为了说明这一点,将模拟一个场景,尝试回答教育经济学中最重要的问题之一:获得大学学位对小时工资的因果效应是什么?尽管这个问题的答案对政策制定者至关重要,但遗漏变量偏差(OVB)阻碍了识别这种因果效应。

模拟小时工资数据集

数据生成过程在下面的代码中进行了概述。工作经验(以年为单位)和能力度量从正态分布中抽取;其中一个父母的小时工资从Beta分布中抽取。然后创建一个受能力和父母小时工资正向影响的大学学位指标。最后,将小时工资建模为所有先前变量的线性函数和一个随机组成部分。注意,所有变量对小时工资都有正面影响。

import numpy as np import pandas as pd n_samples = 10_000 rng = np.random.RandomState(32) experiences = rng.normal(20, 10, size=n_samples).astype(int) experiences[experiences < 0] = 0 abilities = rng.normal(0, 0.15, size=n_samples) parent_hourly_wages = 50 * rng.beta(2, 8, size=n_samples) parent_hourly_wages[parent_hourly_wages < 0] = 0 college_degrees = (9 * abilities + 0.02 * parent_hourly_wages + rng.randn(n_samples) > 0.7).astype(int) true_coef = pd.Series({ "college degree": 2.0, "ability": 5.0, "experience": 0.2, "parent hourly wage": 1.0, }) hourly_wages = (true_coef["experience"] * experiences + true_coef["parent hourly wage"] * parent_hourly_wages + true_coef["college degree"] * college_degrees + true_coef["ability"] * abilities + rng.normal(0, 1, size=n_samples)) hourly_wages[hourly_wages < 0] = 0

下面的图表显示了每个变量的分布和成对散点图。OVB故事的关键是能力和大学学位之间的正向关系。

import seaborn as sns df = pd.DataFrame({ "college degree": college_degrees, "ability": abilities, "hourly wage": hourly_wages, "experience": experiences, "parent hourly wage": parent_hourly_wages, }) grid = sns.pairplot(df, diag_kind="kde", corner=True)

在下一部分中,将训练预测模型,因此将目标列从特征中分离出来,并将数据分为训练集和测试集。

from sklearn.model_selection import train_test_split target_name = "hourly wage" X, y = df.drop(columns=target_name), df[target_name] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

首先,训练一个预测模型,一个线性回归模型。在这个实验中,假设所有用于真实生成模型的变量都是可用的。

from sklearn.linear_model import LinearRegression from sklearn.metrics import r2_score features_names = ["experience", "parent hourly wage", "college degree", "ability"] regressor_with_ability = LinearRegression() regressor_with_ability.fit(X_train[features_names], y_train) y_pred_with_ability = regressor_with_ability.predict(X_test[features_names]) R2_with_ability = r2_score(y_test, y_pred_with_ability) print(f"R2 score with ability: {R2_with_ability:.3f}")

这个模型预测小时工资做得很好,正如高R2分数所示。绘制模型系数,以显示准确地恢复了真实生成模型的值。

import matplotlib.pyplot as plt model_coef = pd.Series(regressor_with_ability.coef_, index=features_names) coef = pd.concat([ true_coef[features_names], model_coef], keys=["Coefficients of true generative model", "Model coefficients"], axis=1, ) ax = coef.plot.barh() ax.set_xlabel("Coefficient values") ax.set_title("Coefficients of the linear regression including the ability features") plt.tight_layout()

在实践中,智力能力是不被观察到的,或者只是从无意中也测量教育的代理中估计出来的(例如,通过智商测试)。但是,从线性模型中省略“能力”特征会通过正向OVB膨胀估计值。

features_names = ["experience", "parent hourly wage", "college degree"] regressor_without_ability = LinearRegression() regressor_without_ability.fit(X_train[features_names], y_train) y_pred_without_ability = regressor_without_ability.predict(X_test[features_names]) R2_without_ability = r2_score(y_test, y_pred_without_ability) print(f"R2 score without ability: {R2_without_ability:.3f}")

当省略能力特征时,模型的预测能力在R2分数方面是相似的。现在检查模型的系数是否与真实生成模型不同。

model_coef = pd.Series(regressor_without_ability.coef_, index=features_names) coef = pd.concat([ true_coef[features_names], model_coef], keys=["Coefficients of true generative model", "Model coefficients"], axis=1, ) ax = coef.plot.barh() ax.set_xlabel("Coefficient values") ax.set_title("Coefficients of the linear regression excluding the ability feature") plt.tight_layout() plt.show()

为了补偿省略的变量,模型通过膨胀大学学位特征的系数来补偿。因此,将这个系数值解释为真实生成模型的因果效应是不正确的。

沪ICP备2024098111号-1
上海秋旦网络科技中心:上海市奉贤区金大公路8218号1幢 联系电话:17898875485