This article is aimed primarily at beginners. Since there is a lot of material, it will be split across several articles. Each article is built around hands-on code, so you can start coding as soon as you finish reading one. Let's get started.
Linear predictive modeling is used in many fields, including data forecasting, speech recognition, low-bit-rate coding, model-based spectral analysis, interpolation, and signal restoration. These linear algorithms originate in statistics, where such models are known as autoregressive (AR) processes. The next article will explore several types of linear regression, such as Lasso and Ridge. This article covers linear regression with and without regularization, as well as regression using Regu, Pipeline, Cross Val Predict, and more.
For this series of articles, data was scraped from the website AutoScout24, yielding a dataset of used-car sales in Germany. You can find the extracted dataset here. The notebook for this article is linked at the end, so be sure to check it out.
First, import the necessary libraries:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
import warnings
warnings.filterwarnings("ignore")
The goal is to predict the price of a car based on the scraped data.
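The dataframe `df` used below is assumed to be loaded from the scraped dataset; a minimal sketch (the file name is an assumption):
# Load the scraped AutoScout24 dataset (file name is an assumption)
df = pd.read_csv('autoscout24-germany-dataset.csv')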
df.shape
> (46405, 9)
df.describe()
df['gear'].unique()
> array(['Manual', 'Automatic', nan, 'Semi-automatic'], dtype=object)
df.info()
So there are some missing values in the dataset. Since their number appears to be very low, we can hope to simply omit them. We will handle this in the next section:
df.isna().sum()
Since the missing values amount to less than 1% of the total data, we consider it safe to drop these entries.
df.dropna(inplace = True)
# Drop duplicate rows
df.drop_duplicates(keep = 'first', inplace = True)
# Now let's look at the shape of the dataframe
df.shape
> (43947, 9)
df.describe()
We can use the `year` column to derive the age of each vehicle, which is likely more useful for prediction. For this, we use the datetime module.
from datetime import datetime
df['age'] = datetime.now().year - df['year']
df.drop('year',axis = 1, inplace = True)
df.head()
M = df.price.median()
print(M)
> 10990.0
m = df.price.mean()
print(m)
> 16546.56379275036
below_M = df.query("price<10990")
no_below_M = below_M.value_counts().sum()
above_M = df.query("price > 10990.1")
no_above_M = above_M.value_counts().sum()
print(f'Median = {M}')
print('Number of cars with values above the median')
print(no_above_M)
print('Number of cars with values below the median')
print(no_below_M)
print('--------------------------------------------')
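As an aside, the same counts can be obtained more directly with boolean masks; a minimal equivalent sketch:
# Equivalent counts using boolean masks instead of query() + value_counts()
print((df.price < M).sum())  # cars priced below the median
print((df.price > M).sum())  # cars priced above the median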
sns.scatterplot(x=df['hp'], y=df['price'])
# Convert fuel from categorical values to integer codes
df['fuel'] = df['fuel'].replace('Diesel', 0)
df['fuel'] = df['fuel'].replace('Gasoline', 1)
df['fuel'] = df['fuel'].replace(['Electric/Gasoline', 'Electric/Diesel', 'Electric'], 2)
df['fuel'] = df['fuel'].replace(['CNG', 'LPG', 'Others', '-/- (Fuel)', 'Ethanol', 'Hydrogen'], 3)
plt.figure(figsize=(14,7))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')  # numeric_only avoids errors on object columns in recent pandas
Now let's try to find the popular cars, i.e., the more affordable ones that offer better mileage, which more people can buy.
min_price, max_price = df.price.quantile([0.01, 0.99])
min_price, max_price
> (3300.0, 83468.84000000004)
pop_cars = df[(df.price < max_price) & (df.price > min_price)]
print('Total number of cars:')
print(df.shape[0])
print('---------------------')
print(f'Number of cars priced above ${min_price:,.2f} and below ${max_price:,.2f}')
print(pop_cars.shape[0])
pop_cars = pop_cars.drop(columns=['make', 'model'], axis=1)
pop_cars.head()
pop_cars.dtypes.value_counts()
> object 3
int64 3
float64 1
dtype: int64
mask = pop_cars.dtypes == object  # np.object is removed in recent NumPy versions
categorical = pop_cars.columns[mask]
categorical
> Index(['fuel', 'gear', 'offerType'], dtype='object')
num_ohc_cols = (pop_cars[categorical].apply(lambda x: x.nunique()).sort_values(ascending=False))
num_ohc_cols
> offerType 5
fuel 4
gear 3
dtype: int64
from sklearn.preprocessing import OneHotEncoder
data_ohc = pop_cars.copy().reset_index(drop=True)  # reset the index so concat aligns with the new one-hot columns
ohc = OneHotEncoder()
for col in num_ohc_cols.index:
    # fit_transform returns a sparse array
    new_dat = ohc.fit_transform(data_ohc[[col]])
    # Drop the original column from the dataframe
    data_ohc = data_ohc.drop(col, axis=1)
    # Get the unique category names for this column
    cats = ohc.categories_
    # Create a column name for each one-hot-encoded column
    new_cols = ['_'.join([col, str(cat)]) for cat in cats[0]]
    # Build a dataframe of the new columns
    new_df = pd.DataFrame(new_dat.toarray(), columns=new_cols)
    # Append the new data to the dataframe
    data_ohc = pd.concat([data_ohc, new_df], axis=1)
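For comparison, the same one-hot encoding can be done in a single call with pandas; a minimal sketch under the same assumptions:
# Alternative: build the one-hot columns directly with pandas
data_ohc_alt = pd.get_dummies(pop_cars.reset_index(drop=True),
                              columns=list(num_ohc_cols.index))
The explicit OneHotEncoder loop is kept above because a fitted encoder can later be reused to transform unseen data with the same columns.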
y_col = 'price'
feature_cols = [x for x in data_ohc.columns if x != y_col]
X = data_ohc[feature_cols]
y = data_ohc[y_col]
from sklearn.model_selection import KFold
kf = KFold(shuffle=True, random_state=72018, n_splits=3)
kf.split(X)
This creates a generator that yields, for each of the 3 splits (n_splits), a tuple: (train_index, test_index)
for train_index, test_index in kf.split(X):
    print("Train index:", train_index[:10], len(train_index))
    print("Test index:", test_index[:10], len(test_index))
    print('')
Everything done so far has been to prepare the data for modeling. Now we will implement several types of linear regression and see which model scores best.
1) Linear regression without regularization
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
scores = []
lr = LinearRegression()
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :], X.iloc[test_index, :],
                                        y[train_index], y[test_index])
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    score = r2_score(y_test.values, y_pred)
    scores.append(score)
scores
> [0.8287930876292234, 0.8297633896297357, 0.8390539858927717]
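The same fold-wise R² scores can also be obtained in a single call with scikit-learn's cross_val_score; a minimal equivalent sketch:
from sklearn.model_selection import cross_val_score
# One call replaces the manual KFold loop above
print(cross_val_score(lr, X, y, cv=kf, scoring='r2'))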
2) Linear regression with regularization
from sklearn.preprocessing import StandardScaler
scores = []
lr = LinearRegression()
s = StandardScaler()
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :], X.iloc[test_index, :],
                                        y[train_index], y[test_index])
    X_train_s = s.fit_transform(X_train)
    lr.fit(X_train_s, y_train)
    X_test_s = s.transform(X_test)
    y_pred = lr.predict(X_test_s)
    score = r2_score(y_test.values, y_pred)
    scores.append(score)
scores
> [0.8287665996258867, 0.829763389629736, 0.8390557075678731]
3) Linear regression with Regu, Pipeline, and Cross Val Predict
from sklearn.pipeline import Pipeline
estimator = Pipeline([('scaler', s), ('linear_reg', lr)])
estimator.fit(X_train, y_train)
estimator.predict(X_test)
kf
> KFold(n_splits=3, random_state=72018, shuffle=True)
from sklearn.model_selection import cross_val_predict
predictions = cross_val_predict(estimator, X, y, cv=kf, verbose=100)
r2_score(y, predictions)
> 0.8326247491666151
We can see the results are almost identical. Plain linear regression changes very little with feature standardization; true regularized models such as Ridge and Lasso are covered in the next article.
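To see an actual regularized model in this setup, Ridge (explored in detail in the next article) can be swapped into the same pipeline; a minimal sketch, with an arbitrarily chosen alpha:
from sklearn.linear_model import Ridge
# Hypothetical example: Ridge regression in the same scaled pipeline
ridge_estimator = Pipeline([('scaler', StandardScaler()),
                            ('ridge_reg', Ridge(alpha=1.0))])  # alpha=1.0 is an arbitrary choice
ridge_predictions = cross_val_predict(ridge_estimator, X, y, cv=kf)
print(r2_score(y, ridge_predictions))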
4) Linear regression with polynomial features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import KFold
powers = [2, 3, 4]
lr1 = LinearRegression()
scores = []
for power in powers:
    pf = PolynomialFeatures(power)
    estimator = Pipeline([('make_higher_degree', pf), ('linear_reg', lr1)])
    predictions = cross_val_predict(estimator, X, y, cv=kf, verbose=100)
    score = r2_score(y, predictions)
    scores.append(score)
list(zip(powers, scores))
> [(2, 0.8834636269383815), (3, 0.816677589576142), (4, -1026.6737588070525)]
We can see that polynomial regression with degree 2 achieves the best R² score, about 0.883 (88.3%). The large negative score at degree 4 indicates severe overfitting: the extra polynomial terms fit noise in the training folds and generalize very poorly.
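As a closing usage example, the winning degree-2 configuration could be refit on the full dataset to obtain a reusable model; a minimal sketch:
# Refit the best configuration (degree-2 polynomial features) on all the data
best_model = Pipeline([('make_higher_degree', PolynomialFeatures(2)),
                       ('linear_reg', LinearRegression())])
best_model.fit(X, y)
# Sanity check: predict prices for the first five cars
print(best_model.predict(X.iloc[:5]))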