A Comprehensive Tutorial on Linear Prediction Models

This article is aimed mainly at beginners. Since there is a lot of material, it will be split across several posts. Each post is built around hands-on code, so you can start coding as soon as you finish reading it. Let's get started.

Table of Contents

  • 1) Linear Prediction Models
  • 2) Importing Libraries and Loading the Data
  • 3) Getting Familiar with the Data
  • 4) Data Cleaning and Preparation
  • 5) Exploratory Data Analysis
  • 6) Correlation Heatmap
  • 7) One-Hot Encoding and K-Fold Cross-Validation
  • 8) Modeling
  • 9) What to Expect in the Next Part

Linear Prediction Models

Linear predictive modeling is used in many areas, including data forecasting, speech recognition, low-bit-rate coding, model-based spectral analysis, interpolation, and signal restoration. These linear algorithms have their origin in statistics, where such models are referred to as autoregressive (AR) processes. Some types of linear regression, such as Lasso and Ridge, will be explored in the next article. This article covers linear regression with and without regularization, as well as regression using regularization, Pipeline, cross_val_predict, and more.
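For reference, an AR(p) process (the standard definition from the statistics literature, not specific to this article) models each value of a series as a weighted linear combination of its p previous values plus an error term:

x_t = \sum_{i=1}^{p} a_i \, x_{t-i} + e_t

where the a_i are the predictor coefficients and e_t is the prediction error.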

For this series, data was scraped from the website AutoScout24, yielding a dataset of used-car sales in Germany. The extracted dataset can be found here. The notebook for this article is linked at the end, so be sure to check it out.

Importing Libraries and Loading the Data

First, import the necessary libraries:

import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
import warnings
warnings.filterwarnings("ignore")
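The loading step itself is not shown in this excerpt. A minimal sketch, assuming the scraped AutoScout24 data was saved as a CSV file (the filename below is a placeholder, not from the original notebook):

# Load the scraped dataset into a dataframe (filename is an assumption)
df = pd.read_csv('autoscout24-germany-dataset.csv')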

The goal is to predict the price of a car based on the scraped data.

Getting Familiar with the Data

df.shape
> (46405, 9)

df.describe()

df['gear'].unique()
> array(['Manual', 'Automatic', nan, 'Semi-automatic'], dtype=object)

df.info()

So the dataset contains some missing (nan) values. Since their number appears to be very low, they can hopefully simply be dropped. This is handled in the next section:

Data Cleaning and Preparation

df.isna().sum()

Since the missing values amount to less than 1% of the total data, these entries can be dropped without any problem.

df.dropna(inplace=True)

# Drop duplicate rows
df.drop_duplicates(keep='first', inplace=True)

# Now let's look at the shape of the dataframe
df.shape
> (43947, 9)

df.describe()

The column year can be used to derive the age of each vehicle, which is likely more helpful for prediction. The datetime module is used for this.

from datetime import datetime

df['age'] = datetime.now().year - df['year']
df.drop('year', axis=1, inplace=True)
df.head()

Exploratory Data Analysis

M = df.price.median()
print(M)
> 10990.0

m = df.price.mean()
print(m)
> 16546.56379275036

below_M = df.query("price < 10990")
no_below_M = below_M.value_counts().sum()
above_M = df.query("price > 10990.1")
no_above_M = above_M.value_counts().sum()

print(f'Median = {M}')
print('Number of cars with values above the median')
print(no_above_M)
print('Number of cars with values below the median')
print(no_below_M)
print('--------------------------------------------')

sns.scatterplot(x=df['hp'], y=df['price'])

# Change fuel from categorical values to integer values
df['fuel'] = df['fuel'].replace('Diesel', 0)
df['fuel'] = df['fuel'].replace('Gasoline', 1)
df['fuel'] = df['fuel'].replace(['Electric/Gasoline', 'Electric/Diesel', 'Electric'], 2)
df['fuel'] = df['fuel'].replace(['CNG', 'LPG', 'Others', '-/- (Fuel)', 'Ethanol', 'Hydrogen'], 3)
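As a side note, the four chained replace calls above can be collapsed into a single dictionary mapping; a compact equivalent sketch:

# Equivalent one-step mapping from fuel category to integer code
fuel_map = {'Diesel': 0, 'Gasoline': 1,
            'Electric/Gasoline': 2, 'Electric/Diesel': 2, 'Electric': 2,
            'CNG': 3, 'LPG': 3, 'Others': 3, '-/- (Fuel)': 3,
            'Ethanol': 3, 'Hydrogen': 3}
df['fuel'] = df['fuel'].replace(fuel_map)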

Correlation Heatmap

plt.figure(figsize=(14, 7))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

Now let's try to find the popular cars, i.e., those that are more affordable, offer better mileage, and that more people can buy.

min_price, max_price = df.price.quantile([0.01, 0.99])
min_price, max_price
> (3300.0, 83468.84000000004)

pop_cars = df[(df.price > min_price) & (df.price < max_price)]

print('Total number of cars:')
print(df.shape[0])
print('---------------------')
print('Numbers of cars that are above $3.300,0 and below $99.999,0')
print(pop_cars.shape[0])

One-Hot Encoding and K-Fold Cross-Validation

pop_cars = pop_cars.drop(columns=['make', 'model'], axis=1)
pop_cars.head()

pop_cars.dtypes.value_counts()
> object     3
> int64      3
> float64    1
> dtype: int64

mask = pop_cars.dtypes == object
categorical = pop_cars.columns[mask]
categorical
> Index(['fuel', 'gear', 'offerType'], dtype='object')

num_ohc_cols = (pop_cars[categorical]
                .apply(lambda x: x.nunique())
                .sort_values(ascending=False))
num_ohc_cols
> offerType    5
> fuel         4
> gear         3
> dtype: int64

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

data_ohc = pop_cars.copy()

# Reset the index so it lines up with the new one-hot frames created below
# (otherwise pd.concat on axis=1 would misalign rows)
data_ohc = data_ohc.reset_index(drop=True)

ohc = OneHotEncoder()

for col in num_ohc_cols.index:
    # This returns a sparse array
    new_dat = ohc.fit_transform(data_ohc[[col]])
    # Drop the original column from the original DF
    data_ohc = data_ohc.drop(col, axis=1)
    # Get the unique category names of the column
    cats = ohc.categories_
    # Create one column name per OHE column
    new_cols = ['_'.join([col, cat]) for cat in cats[0]]
    # Create the new dataset
    new_df = pd.DataFrame(new_dat.toarray(), columns=new_cols)
    # Append the new data to the df
    data_ohc = pd.concat([data_ohc, new_df], axis=1)
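For comparison, pandas can produce the same kind of dummy columns in a single call; a sketch that would replace the manual loop above (column names may differ slightly from the hand-built ones):

# pd.get_dummies expands each listed categorical column into one 0/1 column per category
data_ohc_alt = pd.get_dummies(pop_cars, columns=list(num_ohc_cols.index))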

y_col = 'price'

feature_cols = [x for x in data_ohc.columns if x != y_col]
X = data_ohc[feature_cols]
y = data_ohc[y_col]

from sklearn.model_selection import KFold

kf = KFold(shuffle=True, random_state=72018, n_splits=3)
kf.split(X)

This creates a generator that yields, for each of the 3 splits (n_splits), a tuple of train_index and test_index:

for train_index, test_index in kf.split(X):
    print("Train index:", train_index[:10], len(train_index))
    print("Test index:", test_index[:10], len(test_index))
    print('')

Modeling

Everything done so far has been preparation for modeling. Now several variants of linear regression will be implemented to see which model achieves the highest accuracy.

1) Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

scores = []
lr = LinearRegression()

for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :],
                                        X.iloc[test_index, :],
                                        y[train_index],
                                        y[test_index])
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    score = r2_score(y_test.values, y_pred)
    scores.append(score)

scores
> [0.8287930876292234, 0.8297633896297357, 0.8390539858927717]

2) Linear Regression with Regularization

from sklearn.preprocessing import StandardScaler

scores = []
lr = LinearRegression()
s = StandardScaler()

for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :],
                                        X.iloc[test_index, :],
                                        y[train_index],
                                        y[test_index])
    X_train_s = s.fit_transform(X_train)
    lr.fit(X_train_s, y_train)
    X_test_s = s.transform(X_test)
    y_pred = lr.predict(X_test_s)
    score = r2_score(y_test.values, y_pred)
    scores.append(score)

scores
> [0.8287665996258867, 0.829763389629736, 0.8390557075678731]
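Note that the snippet above only standardizes the features: LinearRegression itself has no penalty term. To apply actual regularization one could swap in scikit-learn's Ridge estimator, for example (a sketch, not part of the original notebook; Lasso and Ridge are covered in detail in the next article):

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # alpha controls the strength of the L2 penalty
ridge_scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y[train_index], y[test_index]
    X_train_s = s.fit_transform(X_train)  # fit the scaler on the training fold only
    ridge.fit(X_train_s, y_train)
    X_test_s = s.transform(X_test)
    ridge_scores.append(r2_score(y_test.values, ridge.predict(X_test_s)))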

3) Linear Regression with Regularization, Pipeline, and Cross Val Predict

from sklearn.pipeline import Pipeline

estimator = Pipeline([('scaler', s), ('linear_reg', lr)])
estimator.fit(X_train, y_train)
estimator.predict(X_test)

kf
> KFold(n_splits=3, random_state=72018, shuffle=True)

from sklearn.model_selection import cross_val_predict

predictions = cross_val_predict(estimator, X, y, cv=kf, verbose=100)
r2_score(y, predictions)
> 0.8326247491666151

As you can see, the scores are almost identical: linear regression changes very little under regularization.

4) Linear Regression with Polynomial Features

from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import KFold

powers = [2, 3, 4]
lr1 = LinearRegression()
scores = []

for power in powers:
    pf = PolynomialFeatures(power)
    estimator = Pipeline([('make_higher_degree', pf), ('linear_reg', lr1)])
    predictions = cross_val_predict(estimator, X, y, cv=kf, verbose=100)
    score = r2_score(y, predictions)
    scores.append(score)

list(zip(powers, scores))
> [(2, 0.8834636269383815), (3, 0.816677589576142), (4, -1026.6737588070525)]

As you can see, polynomial regression of degree 2 achieves the highest accuracy, at 88.3%. The sharply negative R² score at degree 4 is a sign of severe overfitting.
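Based on these scores, the chosen configuration could be refit on the full dataset for later use; a minimal sketch:

# Refit the best configuration (degree-2 polynomial features) on all of the data
best_model = Pipeline([('make_higher_degree', PolynomialFeatures(2)),
                       ('linear_reg', LinearRegression())])
best_model.fit(X, y)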
