This article is aimed primarily at beginners. Since there is a lot of material, it will be split across several articles. Each article is built around hands-on code, so you can start coding as soon as you finish reading one. Let's get started.
Linear predictive modeling is used in many fields, including data forecasting, speech recognition, low-bit-rate coding, model-based spectral analysis, interpolation, and signal restoration. These linear algorithms originate in statistics, where such models are known as autoregressive (AR) processes. The next article will explore several types of linear regression, such as Lasso and Ridge. This article covers linear regression with and without regularization, as well as regression using Regu, Pipeline, Cross Val Predict, and more.
For this series of articles, data was scraped from the website AutoScout24, yielding a dataset of used-car sales in Germany. You can find the extracted dataset here. The notebook for this article is linked at the end, so be sure to check it out.
First, import the necessary libraries:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
import warnings
warnings.filterwarnings("ignore")
The goal is to predict the price of a car based on the scraped data.
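The dataframe `df` used below is assumed to be loaded from the scraped dataset; a minimal sketch (the file name is an assumption):
# Load the scraped AutoScout24 dataset (file name is an assumption)
df = pd.read_csv('autoscout24-germany-dataset.csv')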
df.shape
> (46405, 9)
df.describe()
df['gear'].unique()
> array(['Manual', 'Automatic', nan, 'Semi-automatic'], dtype=object)
df.info()
So there are some missing values in the dataset. Since their number appears to be very low, we can hope to simply omit them. We will handle this in the next section:
df.isna().sum()
Since the missing values amount to less than 1% of the total data, we consider it safe to drop these entries.
df.dropna(inplace = True)
# Drop duplicate rows
df.drop_duplicates(keep = 'first', inplace = True)
# Now let's look at the shape of the dataframe
df.shape
> (43947, 9)
df.describe()
We can use the `year` column to derive the age of each vehicle, which is likely more useful for prediction. For this, we use the datetime module.
from datetime import datetime
df['age'] = datetime.now().year - df['year']
df.drop('year',axis = 1, inplace = True)
df.head()
M = df.price.median()
print(M)
> 10990.0
m = df.price.mean()
print(m)
> 16546.56379275036
below_M = df.query("price<10990")
no_below_M = below_M.value_counts().sum()
above_M = df.query("price > 10990.1")
no_above_M = above_M.value_counts().sum()
print(f'Median = {M}')
print('Number of cars with values above the median')
print(no_above_M)
print('Number of cars with values below the median')
print(no_below_M)
print('--------------------------------------------')
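As an aside, the same counts can be obtained more directly with boolean masks; a minimal equivalent sketch:
# Equivalent counts using boolean masks instead of query() + value_counts()
print((df.price < M).sum())  # cars priced below the median
print((df.price > M).sum())  # cars priced above the median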
sns.scatterplot(x=df['hp'], y=df['price'])
# Convert fuel from categorical values to integer codes
df['fuel'] = df['fuel'].replace('Diesel', 0)
df['fuel'] = df['fuel'].replace('Gasoline', 1)
df['fuel'] = df['fuel'].replace(['Electric/Gasoline', 'Electric/Diesel', 'Electric'], 2)
df['fuel'] = df['fuel'].replace(['CNG', 'LPG', 'Others', '-/- (Fuel)', 'Ethanol', 'Hydrogen'], 3)
plt.figure(figsize=(14,7))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')  # numeric_only avoids errors on object columns in recent pandas
Now let's try to find the popular cars, i.e., the more affordable ones that offer better mileage, which more people can buy.
min_price, max_price = df.price.quantile([0.01, 0.99])
min_price, max_price
> (3300.0, 83468.84000000004)
pop_cars = df[(df.price < max_price) & (df.price > min_price)]
print('Total number of cars:')
print(df.shape[0])
print('---------------------')
print(f'Number of cars priced above ${min_price:,.2f} and below ${max_price:,.2f}')
print(pop_cars.shape[0])
pop_cars = pop_cars.drop(columns=['make', 'model'], axis=1)
pop_cars.head()
pop_cars.dtypes.value_counts()
> object 3
int64 3
float64 1
dtype: int64
mask = pop_cars.dtypes == object  # np.object is removed in recent NumPy versions
categorical = pop_cars.columns[mask]
categorical
> Index(['fuel', 'gear', 'offerType'], dtype='object')
num_ohc_cols = (pop_cars[categorical].apply(lambda x: x.nunique()).sort_values(ascending=False))
num_ohc_cols
> offerType 5
fuel 4
gear 3
dtype: int64
from sklearn.preprocessing import OneHotEncoder
data_ohc = pop_cars.copy().reset_index(drop=True)  # reset the index so concat aligns with the new one-hot columns
ohc = OneHotEncoder()
for col in num_ohc_cols.index:
    # fit_transform returns a sparse array
    new_dat = ohc.fit_transform(data_ohc[[col]])
    # Drop the original column from the dataframe
    data_ohc = data_ohc.drop(col, axis=1)
    # Get the unique category names for this column
    cats = ohc.categories_
    # Create a column name for each one-hot-encoded column
    new_cols = ['_'.join([col, str(cat)]) for cat in cats[0]]
    # Build a dataframe of the new columns
    new_df = pd.DataFrame(new_dat.toarray(), columns=new_cols)
    # Append the new data to the dataframe
    data_ohc = pd.concat([data_ohc, new_df], axis=1)
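For comparison, the same one-hot encoding can be done in a single call with pandas; a minimal sketch under the same assumptions:
# Alternative: build the one-hot columns directly with pandas
data_ohc_alt = pd.get_dummies(pop_cars.reset_index(drop=True),
                              columns=list(num_ohc_cols.index))
The explicit OneHotEncoder loop is kept above because a fitted encoder can later be reused to transform unseen data with the same columns.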
y_col = 'price'
feature_cols = [x for x in data_ohc.columns if x != y_col]
X = data_ohc[feature_cols]
y = data_ohc[y_col]
from sklearn.model_selection import KFold
kf = KFold(shuffle=True, random_state=72018, n_splits=3)
kf.split(X)
This creates a generator that yields, for each of the 3 splits (n_splits), a tuple: (train_index, test_index)
for train_index, test_index in kf.split(X):
    print("Train index:", train_index[:10], len(train_index))
    print("Test index:", test_index[:10], len(test_index))
    print('')
Everything done so far has been to prepare the data for modeling. Now we will implement several types of linear regression and see which model scores best.
1) Linear regression without regularization
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
scores = []
lr = LinearRegression()
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :], X.iloc[test_index, :],
                                        y[train_index], y[test_index])
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    score = r2_score(y_test.values, y_pred)
    scores.append(score)
scores
> [0.8287930876292234, 0.8297633896297357, 0.8390539858927717]
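The same fold-wise R² scores can also be obtained in a single call with scikit-learn's cross_val_score; a minimal equivalent sketch:
from sklearn.model_selection import cross_val_score
# One call replaces the manual KFold loop above
print(cross_val_score(lr, X, y, cv=kf, scoring='r2'))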
2) Linear regression with regularization
from sklearn.preprocessing import StandardScaler
scores = []
lr = LinearRegression()
s = StandardScaler()
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :], X.iloc[test_index, :],
                                        y[train_index], y[test_index])
    X_train_s = s.fit_transform(X_train)
    lr.fit(X_train_s, y_train)
    X_test_s = s.transform(X_test)
    y_pred = lr.predict(X_test_s)
    score = r2_score(y_test.values, y_pred)
    scores.append(score)
scores
> [0.8287665996258867, 0.829763389629736, 0.8390557075678731]
3) Linear regression with Regu, Pipeline, and Cross Val Predict
from sklearn.pipeline import Pipeline
estimator = Pipeline([('scaler', s), ('linear_reg', lr)])
estimator.fit(X_train, y_train)
estimator.predict(X_test)
kf
> KFold(n_splits=3, random_state=72018, shuffle=True)
from sklearn.model_selection import cross_val_predict
predictions = cross_val_predict(estimator, X, y, cv=kf, verbose=100)
r2_score(y, predictions)
> 0.8326247491666151
We can see the results are almost identical. Plain linear regression changes very little with feature standardization; true regularized models such as Ridge and Lasso are covered in the next article.
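To see an actual regularized model in this setup, Ridge (explored in detail in the next article) can be swapped into the same pipeline; a minimal sketch, with an arbitrarily chosen alpha:
from sklearn.linear_model import Ridge
# Hypothetical example: Ridge regression in the same scaled pipeline
ridge_estimator = Pipeline([('scaler', StandardScaler()),
                            ('ridge_reg', Ridge(alpha=1.0))])  # alpha=1.0 is an arbitrary choice
ridge_predictions = cross_val_predict(ridge_estimator, X, y, cv=kf)
print(r2_score(y, ridge_predictions))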
4) Linear regression with polynomial features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import KFold
powers = [2, 3, 4]
lr1 = LinearRegression()
scores = []
for power in powers:
    pf = PolynomialFeatures(power)
    estimator = Pipeline([('make_higher_degree', pf), ('linear_reg', lr1)])
    predictions = cross_val_predict(estimator, X, y, cv=kf, verbose=100)
    score = r2_score(y, predictions)
    scores.append(score)
list(zip(powers, scores))
> [(2, 0.8834636269383815), (3, 0.816677589576142), (4, -1026.6737588070525)]
We can see that polynomial regression with degree 2 achieves the best R² score, about 0.883 (88.3%). The large negative score at degree 4 indicates severe overfitting: the extra polynomial terms fit noise in the training folds and generalize very poorly.
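As a closing usage example, the winning degree-2 configuration could be refit on the full dataset to obtain a reusable model; a minimal sketch:
# Refit the best configuration (degree-2 polynomial features) on all the data
best_model = Pipeline([('make_higher_degree', PolynomialFeatures(2)),
                       ('linear_reg', LinearRegression())])
best_model.fit(X, y)
# Sanity check: predict prices for the first five cars
print(best_model.predict(X.iloc[:5]))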