Causal effect analysis in education economics

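A core question in education economics is the effect of a college degree on hourly wages. The answer matters greatly to policy makers. However, omitted-variable bias (OVB) makes this causal effect hard to identify. To illustrate the problem, we simulate a scenario that attempts to answer the question.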

The data generating process

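We first simulate years of work experience and a measure of ability, both drawn from normal distributions; the hourly wage of one of the parents is drawn from a beta distribution. We then create an indicator of college degree, which is positively influenced by ability and by parental hourly wage. Finally, we model hourly wages as a linear function of all the previous variables plus a random component. Note that every variable has a positive effect on hourly wages.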
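For reference, the data generating process described above can be written out as equations. The numeric constants are read from the simulation code in this section; the symbols ($a_i$ for ability, $e_i$ for experience, $p_i$ for parental hourly wage, $d_i$ for the college degree indicator, $w_i$ for hourly wage, and the noise terms $\eta_i$, $\varepsilon_i$) are notation introduced here for illustration only.

```latex
\begin{aligned}
e_i &= \max\bigl(0,\ \operatorname{trunc}(\tilde{e}_i)\bigr), & \tilde{e}_i &\sim \mathcal{N}(20,\ 10^2) \\
a_i &\sim \mathcal{N}(0,\ 0.15^2) \\
p_i &= 50\, b_i, & b_i &\sim \mathrm{Beta}(2,\ 8) \\
d_i &= \mathbf{1}\bigl\{\, 9\, a_i + 0.02\, p_i + \eta_i > 0.7 \,\bigr\}, & \eta_i &\sim \mathcal{N}(0,\ 1) \\
w_i &= \max\bigl(0,\ 0.2\, e_i + 1.0\, p_i + 2.0\, d_i + 5.0\, a_i + \varepsilon_i \bigr), & \varepsilon_i &\sim \mathcal{N}(0,\ 1)
\end{aligned}
```

Ability enters twice: directly in the wage equation and indirectly through the college degree indicator. That double role is what makes it a confounder once it is left out of a model.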


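import numpy as np
import pandas as pd

n_samples = 10_000
rng = np.random.RandomState(32)

# Years of work experience: normal draws truncated to integers, floored at 0.
experiences = rng.normal(20, 10, size=n_samples).astype(int)
experiences[experiences < 0] = 0
# Ability: normal with mean 0 and standard deviation 0.15.
abilities = rng.normal(0, 0.15, size=n_samples)
# Hourly wage of one parent: scaled beta draws, so values lie in [0, 50].
parent_hourly_wages = 50 * rng.beta(2, 8, size=n_samples)
# No-op safeguard: beta draws are never negative.
parent_hourly_wages[parent_hourly_wages < 0] = 0
# College degree indicator: more likely with higher ability and parental wage.
college_degrees = (
    9 * abilities + 0.02 * parent_hourly_wages + rng.randn(n_samples) > 0.7
).astype(int)

# Coefficients of the true generative model for the hourly wage.
true_coef = pd.Series({
    "college degree": 2.0,
    "ability": 5.0,
    "experience": 0.2,
    "parent hourly wage": 1.0,
})
# Hourly wage: linear in all variables plus standard normal noise, floored at 0.
hourly_wages = (true_coef["experience"] * experiences +
                true_coef["parent hourly wage"] * parent_hourly_wages +
                true_coef["college degree"] * college_degrees +
                true_coef["ability"] * abilities +
                rng.normal(0, 1, size=n_samples))
hourly_wages[hourly_wages < 0] = 0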

The code above generates the simulated data.

Description of the simulated data

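The plot below shows the distribution of each variable together with the pairwise scatter plots. The positive relationship between ability and college degree is the key to the OVB story.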
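The positive relationship between ability and college degree can also be checked numerically rather than read off a plot. The sketch below regenerates the relevant variables with the same seed and the same order of random draws as the simulation code, then compares mean ability between the two groups. The names `check_df`, `mean_ability_by_degree` and `corr_ability_degree` are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

n_samples = 10_000
rng = np.random.RandomState(32)
# Drawn only to keep the random stream aligned with the original simulation.
experiences = rng.normal(20, 10, size=n_samples).astype(int)
abilities = rng.normal(0, 0.15, size=n_samples)
parent_hourly_wages = 50 * rng.beta(2, 8, size=n_samples)
college_degrees = (
    9 * abilities + 0.02 * parent_hourly_wages + rng.randn(n_samples) > 0.7
).astype(int)

check_df = pd.DataFrame({
    "college degree": college_degrees,
    "ability": abilities,
})

# Mean ability among people without (0) and with (1) a college degree.
mean_ability_by_degree = check_df.groupby("college degree")["ability"].mean()
# Pearson correlation between the binary indicator and ability.
corr_ability_degree = check_df["ability"].corr(check_df["college degree"])

print(mean_ability_by_degree)
print(f"Correlation between ability and college degree: {corr_ability_degree:.3f}")
```

A higher mean ability in the degree group, together with a positive correlation, is what allows a model without the ability feature to attribute part of the ability effect to the degree.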


import seaborn as sns
df = pd.DataFrame({
    "college degree": college_degrees,
    "ability": abilities,
    "hourly wage": hourly_wages,
    "experience": experiences,
    "parent hourly wage": parent_hourly_wages,
})
grid = sns.pairplot(df, diag_kind="kde", corner=True)

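In the next sections we train predictive models, so we separate the target column from the features and split the data into a training set and a test set.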


from sklearn.model_selection import train_test_split
target_name = "hourly wage"
X, y = df.drop(columns=target_name), df[target_name]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Income prediction with fully observed variables

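First, we train a predictive model, namely a linear regression. In this experiment we assume that all variables used by the true generative model are available.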


from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

features_names = ["experience", "parent hourly wage", "college degree", "ability"]
regressor_with_ability = LinearRegression()
regressor_with_ability.fit(X_train[features_names], y_train)
y_pred_with_ability = regressor_with_ability.predict(X_test[features_names])
R2_with_ability = r2_score(y_test, y_pred_with_ability)
print(f"R2 score with ability: {R2_with_ability:.3f}")

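This model predicts hourly wages well, as shown by the high R2 score. We plot the model coefficients to show that they accurately recover the values of the true generative model.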


import matplotlib.pyplot as plt

# Put the true and the estimated coefficients side by side for plotting.
model_coef = pd.Series(regressor_with_ability.coef_, index=features_names)
coef = pd.concat(
    [true_coef[features_names], model_coef],
    keys=["Coefficients of true generative model", "Model coefficients"],
    axis=1,
)
ax = coef.plot.barh()
ax.set_xlabel("Coefficient values")
ax.set_title("Coefficients of the linear regression including the ability feature")
plt.tight_layout()
plt.show()

Income prediction with partial observations

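In practice, intellectual ability is usually not observed, or is only estimated from proxies that inadvertently measure education as well (for example, IQ tests). Omitting the "ability" feature from a linear model, however, inflates the estimated coefficient of college degree through a positive OVB.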
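The direction of the bias follows from the textbook omitted-variable bias formula for ordinary least squares, sketched below. The symbols are notation introduced here and are not taken from the original: $w$ is the hourly wage, $e$, $p$, $d$ and $a$ are experience, parental hourly wage, college degree and ability, $\gamma$ is the true coefficient of ability, and $\delta_d$ is the coefficient of college degree in an auxiliary regression of ability on the observed features.

```latex
\begin{aligned}
\text{long regression:} \quad & w = \beta_0 + \beta_e\, e + \beta_p\, p + \beta_d\, d + \gamma\, a + \varepsilon \\
\text{auxiliary regression:} \quad & a = \delta_0 + \delta_e\, e + \delta_p\, p + \delta_d\, d + u \\
\text{short regression (ability omitted):} \quad & \tilde{\beta}_d = \beta_d + \gamma\, \delta_d
\end{aligned}
```

In this simulation $\gamma = 5 > 0$, and $\delta_d$ is expected to be positive because degree holders have higher ability on average. The product $\gamma\,\delta_d$ is then positive, so the short regression overstates the coefficient of college degree.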


features_names = ["experience", "parent hourly wage", "college degree"]
regressor_without_ability = LinearRegression()
regressor_without_ability.fit(X_train[features_names], y_train)
y_pred_without_ability = regressor_without_ability.predict(X_test[features_names])
R2_without_ability = r2_score(y_test, y_pred_without_ability)
print(f"R2 score without ability: {R2_without_ability:.3f}")

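When the ability feature is omitted, the predictive power of the model, measured by the R2 score, remains similar. We now check whether the model coefficients differ from those of the true generative model.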


# Compare the true coefficients with those estimated without the ability feature.
model_coef = pd.Series(regressor_without_ability.coef_, index=features_names)
coef = pd.concat(
    [true_coef[features_names], model_coef],
    keys=["Coefficients of true generative model", "Model coefficients"],
    axis=1,
)
ax = coef.plot.barh()
ax.set_xlabel("Coefficient values")
ax.set_title("Coefficients of the linear regression excluding the ability feature")
plt.tight_layout()
plt.show()

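To compensate for the omitted variable, the model inflates the coefficient of the college degree feature. Interpreting this coefficient value as the causal effect in the true generative model is therefore incorrect.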
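The size of the inflation can be decomposed with a short numerical check. For ordinary least squares, each coefficient of the regression without ability equals the corresponding coefficient of the regression with ability, plus the ability coefficient times the coefficient from an auxiliary regression of ability on the observed features. The sketch below is a minimal illustration under two assumptions: it regenerates the data with the same seed as the simulation code, and it fits on the full dataset rather than on the training split so that the identity holds exactly in-sample. The names `long_model`, `short_model`, `aux_model`, `reconstructed` and `bias` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Regenerate the simulated data (same seed and draw order as the original code).
n_samples = 10_000
rng = np.random.RandomState(32)
experiences = rng.normal(20, 10, size=n_samples).astype(int)
experiences[experiences < 0] = 0
abilities = rng.normal(0, 0.15, size=n_samples)
parent_hourly_wages = 50 * rng.beta(2, 8, size=n_samples)
college_degrees = (
    9 * abilities + 0.02 * parent_hourly_wages + rng.randn(n_samples) > 0.7
).astype(int)
hourly_wages = (
    0.2 * experiences
    + 1.0 * parent_hourly_wages
    + 2.0 * college_degrees
    + 5.0 * abilities
    + rng.normal(0, 1, size=n_samples)
)
hourly_wages[hourly_wages < 0] = 0

data = pd.DataFrame({
    "experience": experiences,
    "parent hourly wage": parent_hourly_wages,
    "college degree": college_degrees,
    "ability": abilities,
    "hourly wage": hourly_wages,
})
observed = ["experience", "parent hourly wage", "college degree"]

# Long regression: wage on all features, including ability.
long_model = LinearRegression().fit(data[observed + ["ability"]], data["hourly wage"])
# Short regression: wage on the observed features only.
short_model = LinearRegression().fit(data[observed], data["hourly wage"])
# Auxiliary regression: the omitted feature (ability) on the observed features.
aux_model = LinearRegression().fit(data[observed], data["ability"])

long_coef = pd.Series(long_model.coef_, index=observed + ["ability"])
short_coef = pd.Series(short_model.coef_, index=observed)
aux_coef = pd.Series(aux_model.coef_, index=observed)

# OVB identity: short = long + (ability coefficient) * (auxiliary coefficient).
reconstructed = long_coef[observed] + long_coef["ability"] * aux_coef
bias = short_coef - long_coef[observed]

print(pd.DataFrame({
    "long": long_coef[observed],
    "short": short_coef,
    "reconstructed": reconstructed,
    "bias": bias,
}))
```

The `bias` entry for college degree is the part of the ability effect that the short regression credits to the degree.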

沪ICP备2024098111号-1

Shanghai Qiudan Network Technology Center: Building 1, 8218 Jinda Highway, Fengxian District, Shanghai. Tel: 17898875485
