In data science, logistic regression is a widely used classification algorithm for predicting the outcome of binary classification problems. Although libraries such as scikit-learn make it easy to use logistic regression off the shelf, writing the algorithm from scratch gives a much deeper understanding of the mathematics and mechanics behind it. This article walks through implementing logistic regression with Python and NumPy and evaluating its performance.
Before we start, we need to import the Python libraries and the dataset used in the implementation. We will use NumPy for array operations and scikit-learn's make_classification function to generate a classification dataset with four features.
import numpy as np
from numpy import log, dot, exp, shape
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic binary classification dataset with four features
X, y = make_classification(n_features=4, n_classes=2)

# Hold out 10% of the samples as a test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)
print(X_tr.shape, X_te.shape)
Standardization rescales the data so that each feature has zero mean and unit standard deviation. This matters for algorithms optimized with gradient descent, because features on very different scales can slow down or destabilize convergence. Standardization is applied column by column using the formula x' = (x − μ) / σ, where μ is the feature mean and σ is its standard deviation:
def standardize(X_tr):
    # Scale each column in place to zero mean and unit standard deviation
    for i in range(shape(X_tr)[1]):
        X_tr[:, i] = (X_tr[:, i] - np.mean(X_tr[:, i])) / np.std(X_tr[:, i])
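Note that this helper standardizes whatever array it is given using that array's own statistics, which is how the rest of this article uses it. A common refinement, shown here only as a minimal sketch (the standardize_with helper is not part of the article's code), is to compute the mean and standard deviation on the training set and reuse them for the test set:

def standardize_with(X, mean, std):
    # Hypothetical helper: reuse training-set statistics for new data
    return (X - mean) / std

mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr_scaled = standardize_with(X_tr, mu, sigma)
X_te_scaled = standardize_with(X_te, mu, sigma)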
In logistic regression we need matrices for the feature vectors and the parameters (weights). The feature matrix has dimensions m x n, where m is the number of observations and n is the number of features per observation, and the parameter vector has dimensions n x 1. We also prepend a column of ones to the feature matrix as a bias term, with a matching entry added to the weight vector.
def initialize(X):
    # One weight per feature plus one for the bias term, all initialized to zero
    weights = np.zeros((shape(X)[1] + 1, 1))
    # Prepend a column of ones so the bias is handled by the matrix multiplication
    X = np.c_[np.ones((shape(X)[0], 1)), X]
    return weights, X
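A quick, purely illustrative sanity check confirms the shapes: with four features, initialize returns a 5 x 1 weight vector and a feature matrix with one extra bias column:

w, X_b = initialize(X_tr)
print(w.shape, X_b.shape)   # e.g. (5, 1) and (90, 5) for a 90-sample training split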
In logistic regression, the sigmoid function squashes the output of the linear equation into the range between 0 and 1, which is exactly what we need for binary classification. The sigmoid curve crosses the y-axis at 0.5 (sigmoid(0) = 0.5), and this value is typically used as the classification threshold.
def sigmoid(z):
    sig = 1 / (1 + exp(-z))
    return sig
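As a quick illustration of the threshold behaviour, sigmoid(0) is exactly 0.5, and the output approaches 0 or 1 as the input becomes strongly negative or positive:

print(sigmoid(0))                     # 0.5
print(sigmoid(np.array([-5, 0, 5])))  # roughly [0.0067, 0.5, 0.9933]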
The cost function measures how far the model's predictions are from the actual labels. For logistic regression we use the log loss (binary cross-entropy), J(θ) = −(1/m) Σ [ y·log(hθ(x)) + (1 − y)·log(1 − hθ(x)) ], where hθ(x) is the sigmoid of θᵀx. It is used because it is convex and therefore well suited to gradient descent.
def cost(theta, X, y):
    # X is assumed to already contain the bias column added by initialize()
    z = dot(X, theta)
    cost0 = y.T.dot(log(sigmoid(z)))
    cost1 = (1 - y).T.dot(log(1 - sigmoid(z)))
    cost = -(cost1 + cost0) / len(y)
    return cost
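To see why log loss is a sensible penalty, compare a confident correct prediction with a confident wrong one for a single example whose true label is y = 1 (an illustrative calculation, not part of the article's code):

# Per-example log loss is -[y*log(p) + (1-y)*log(1-p)]; here y = 1
print(-log(0.9))   # ~0.105 -> confident and correct, small penalty
print(-log(0.1))   # ~2.303 -> confident and wrong, large penalty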
Gradient descent is the optimization algorithm used to find the best-fitting parameters. The gradient is the vector of first derivatives of the cost function and points in the direction of steepest ascent, so gradient descent updates the weights in the opposite direction of the gradient, repeating until convergence.
def fit(X, y, alpha=0.001, iter=100):
    weights, X = initialize(X)
    cost_list = np.zeros(iter)
    for i in range(iter):
        # Gradient of the log loss is X^T (sigmoid(Xw) - y); step against it
        weights = weights - alpha * dot(X.T, sigmoid(dot(X, weights)) - np.reshape(y, (len(y), 1)))
        cost_list[i] = cost(weights, X, y)
    return weights, cost_list
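Since matplotlib is already imported, a quick way to check that training actually converges is to plot the recorded cost values; this is a small illustrative sketch using the functions defined above, not part of the original walkthrough:

weights, cost_list = fit(X_tr, y_tr)
plt.plot(range(len(cost_list)), cost_list)
plt.xlabel("iteration")
plt.ylabel("log loss")
plt.show()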
With the trained model parameters in hand, we can make predictions on unseen data.
def predict(X, weights):
    # Add the bias column, compute probabilities, and threshold at 0.5
    z = dot(initialize(X)[1], weights)
    lis = []
    for i in sigmoid(z):
        if i > 0.5:
            lis.append(1)
        else:
            lis.append(0)
    return lis
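The loop above can also be written as one vectorized expression, which is equivalent but more idiomatic NumPy (shown as an alternative sketch, not as the article's final version):

def predict_vectorized(X, weights):
    probs = sigmoid(dot(initialize(X)[1], weights))
    # Boolean comparison, cast to 0/1 integers, flatten to a 1-D array
    return (probs > 0.5).astype(int).ravel()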
To evaluate the classifier we use the F1 score, the harmonic mean of precision and recall, computed from the confusion-matrix counts:
def f1_score(y, y_hat):
    # Count true positives, true negatives, false positives, false negatives
    tp, tn, fp, fn = 0, 0, 0, 0
    for i in range(len(y)):
        if y[i] == 1 and y_hat[i] == 1:
            tp += 1
        elif y[i] == 1 and y_hat[i] == 0:
            fn += 1
        elif y[i] == 0 and y_hat[i] == 1:
            fp += 1
        elif y[i] == 0 and y_hat[i] == 0:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return f1
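As a sanity check, the result can be compared against scikit-learn's own F1 implementation; the two should agree. This is a quick illustration that assumes the weights and predict function from the training sketch above:

from sklearn.metrics import f1_score as sk_f1_score

y_hat = predict(X_te, weights)
print(f1_score(y_te, y_hat), sk_f1_score(y_te, y_hat))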
Putting all of the pieces together, here is the complete implementation, with the model wrapped in a LogisticRegression class, followed by the training and evaluation code:
import numpy as np
from numpy import log, dot, exp, shape
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_features=4, n_classes=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)
def standardize(X_tr):
    for i in range(shape(X_tr)[1]):
        X_tr[:, i] = (X_tr[:, i] - np.mean(X_tr[:, i])) / np.std(X_tr[:, i])
def F1_score(y, y_hat):
    tp, tn, fp, fn = 0, 0, 0, 0
    for i in range(len(y)):
        if y[i] == 1 and y_hat[i] == 1:
            tp += 1
        elif y[i] == 1 and y_hat[i] == 0:
            fn += 1
        elif y[i] == 0 and y_hat[i] == 1:
            fp += 1
        elif y[i] == 0 and y_hat[i] == 0:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return f1
class LogisticRegression:
    def sigmoid(self, z):
        sig = 1 / (1 + exp(-z))
        return sig

    def initialize(self, X):
        # One weight per feature plus a bias weight; bias column of ones prepended to X
        weights = np.zeros((shape(X)[1] + 1, 1))
        X = np.c_[np.ones((shape(X)[0], 1)), X]
        return weights, X

    def fit(self, X, y, alpha=0.001, iter=400):
        weights, X = self.initialize(X)

        def cost(theta):
            # Log loss over the training data captured from the enclosing scope
            z = dot(X, theta)
            cost0 = y.T.dot(log(self.sigmoid(z)))
            cost1 = (1 - y).T.dot(log(1 - self.sigmoid(z)))
            cost = -(cost1 + cost0) / len(y)
            return cost

        cost_list = np.zeros(iter)
        for i in range(iter):
            # Gradient descent step on the log loss
            weights = weights - alpha * dot(X.T, self.sigmoid(dot(X, weights)) - np.reshape(y, (len(y), 1)))
            cost_list[i] = cost(weights)
        self.weights = weights
        return cost_list

    def predict(self, X):
        # Add the bias column, compute probabilities, and threshold at 0.5
        z = dot(self.initialize(X)[1], self.weights)
        lis = []
        for i in self.sigmoid(z):
            if i > 0.5:
                lis.append(1)
            else:
                lis.append(0)
        return lis
standardize(X_tr)
standardize(X_te)
obj1 = LogisticRegression()
model = obj1.fit(X_tr, y_tr)
y_pred = obj1.predict(X_te)
y_train = obj1.predict(X_tr)
f1_score_tr = F1_score(y_tr, y_train)
f1_score_te = F1_score(y_te, y_pred)
print(f1_score_tr)
print(f1_score_te)
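Finally, as an optional sanity check that is not part of the original walkthrough, the scores can be compared with scikit-learn's built-in LogisticRegression trained on the same split; the from-scratch model should land in the same ballpark:

from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression

clf = SklearnLogisticRegression().fit(X_tr, y_tr)
print(F1_score(y_te, clf.predict(X_te)))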