Logistic Regression: Implementation and Evaluation

In data science, logistic regression is a widely used classification algorithm for predicting the outcome of binary classification problems. Although libraries such as scikit-learn make logistic regression easy to apply, writing the algorithm from scratch gives a much deeper understanding of the mathematics and mechanics behind it. This article shows how to implement logistic regression with Python and NumPy, and how to evaluate its performance.

Before starting, import the Python libraries needed for the implementation and prepare a dataset. This article uses NumPy for the array operations and scikit-learn's make_classification function to generate a classification dataset with four features, which is then split into training and test sets.

import numpy as np
from numpy import log, dot, exp, shape
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a toy binary classification dataset with four features
X, y = make_classification(n_features=4, n_classes=2)

# Hold out 10% of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)
print(X_tr.shape, X_te.shape)

Standardization rescales the data so that each feature has a mean of zero and a standard deviation of one. This is particularly important for algorithms optimized with gradient descent, because features on very different scales can hurt the model's convergence. Standardization applies the following formula to each feature column: x_scaled = (x - mean(x)) / std(x).

def standardize(X_tr):
    # Rescale each feature column in place to zero mean and unit variance
    for i in range(shape(X_tr)[1]):
        X_tr[:,i] = (X_tr[:,i] - np.mean(X_tr[:,i]))/np.std(X_tr[:,i])
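A quick way to verify the transformation (a minimal sketch using X_tr from the split above; note that standardize modifies the array in place):

standardize(X_tr)
print(np.mean(X_tr, axis=0))   # every entry close to 0
print(np.std(X_tr, axis=0))    # every entry close to 1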

Logistic regression needs a matrix for the feature vectors and a vector for the parameters (weights). The feature matrix has dimensions m×n, where m is the number of observations and n is the number of features per observation. The parameter vector has dimensions n×1. In addition, a bias column of ones is appended to the feature matrix, with a corresponding bias entry added to the weight vector, making it (n+1)×1.

def initialize(X):
    # One weight per feature plus one for the bias term, initialized to zero
    weights = np.zeros((shape(X)[1]+1,1))
    # Prepend a column of ones so the bias is handled by a plain dot product
    X = np.c_[np.ones((shape(X)[0],1)),X]
    return weights, X
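A short shape check illustrates the dimensions discussed above (a sketch; w and X_b are just illustrative names, and the exact sample count depends on the split):

w, X_b = initialize(X_tr)
print(X_tr.shape, X_b.shape, w.shape)   # e.g. (90, 4) (90, 5) (5, 1)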

In logistic regression, the sigmoid function squashes the result of the linear equation into the range between 0 and 1, which is exactly what a binary classifier needs. The sigmoid curve crosses the y-axis at 0.5, and this value is usually taken as the classification threshold.

def sigmoid(z):
    sig = 1/(1+exp(-z))
    return sig
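A quick check of the function's behavior (a minimal sketch): sigmoid(0) is exactly 0.5, and large positive or negative inputs saturate toward 1 and 0.

print(sigmoid(0))                           # 0.5
print(sigmoid(np.array([-6.0, 0.0, 6.0])))  # approximately [0.0025 0.5 0.9975]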

The cost function measures how far the model's predictions deviate from the actual labels. For logistic regression the usual choice is the log loss (binary cross-entropy), because it is a convex function and therefore well suited to gradient descent.
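Written out, with σ denoting the sigmoid, m the number of observations, and x_i the i-th (bias-augmented) row of the feature matrix, the log loss is

J(θ) = -(1/m) · Σ_i [ y_i · log(σ(x_i·θ)) + (1 - y_i) · log(1 - σ(x_i·θ)) ]

which is exactly what the cost function below computes.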

def cost(theta, X, y):
    # X must be the bias-augmented feature matrix returned by initialize
    z = dot(X, theta)
    cost0 = y.T.dot(log(sigmoid(z)))
    cost1 = (1-y).T.dot(log(1-sigmoid(z)))
    cost = -((cost1 + cost0))/len(y)
    return cost
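As a sanity check (a sketch using X_tr and y_tr from the split above; theta0 and X_b are illustrative names): with all-zero weights every predicted probability is 0.5, so the cost should be close to log(2) ≈ 0.693.

theta0, X_b = initialize(X_tr)   # X_b is the bias-augmented training matrix
print(cost(theta0, X_b, y_tr))   # roughly [0.6931]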

Gradient descent is an optimization algorithm used to find the best-fitting parameters. The gradient is the vector of first derivatives of the cost function and points in the direction of steepest ascent, so gradient descent updates the weights in the opposite direction of the gradient until the cost converges.
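In the same notation, the gradient of the log loss is ∇J(θ) = (1/m) · Xᵀ(σ(Xθ) - y), and each iteration applies the update θ := θ - α · ∇J(θ), where α is the learning rate. Note that the implementation below drops the 1/m factor; it is effectively absorbed into α.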

def fit(X, y, alpha=0.001, iter=100):
    global weights                  # expose the learned weights so predict() can use them
    params, X = initialize(X)
    cost_list = np.zeros(iter)
    for i in range(iter):
        # Step against the gradient: X.T @ (predicted probability - label), scaled by alpha
        params = params - alpha * dot(X.T, sigmoid(dot(X, params)) - np.reshape(y, (len(y), 1)))
        cost_list[i] = cost(params, X, y)
    weights = params
    return cost_list

With the trained parameters, the model can make predictions on unseen data: any probability above the 0.5 threshold is mapped to class 1, everything else to class 0. To evaluate the predictions, the F1 score is also computed from scratch; it combines precision and recall into a single number, which makes it more informative than plain accuracy on imbalanced data.

def predict(X):
    # Add the bias column, apply the learned weights, and threshold the probabilities at 0.5
    z = dot(initialize(X)[1], weights)
    lis = []
    for i in sigmoid(z):
        if i > 0.5:
            lis.append(1)
        else:
            lis.append(0)
    return lis

def f1_score(y, y_hat):
    # Count true positives, true negatives, false positives and false negatives
    tp, tn, fp, fn = 0, 0, 0, 0
    for i in range(len(y)):
        if y[i] == 1 and y_hat[i] == 1:
            tp += 1
        elif y[i] == 1 and y_hat[i] == 0:
            fn += 1
        elif y[i] == 0 and y_hat[i] == 1:
            fp += 1
        elif y[i] == 0 and y_hat[i] == 0:
            tn += 1
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    f1 = 2*precision*recall/(precision+recall)
    return f1

Putting everything together, the complete implementation below wraps the model in a class and runs the full workflow: generate and split the data, standardize the features, fit the model, predict on both splits, and report the F1 scores.

import numpy as np
from numpy import log, dot, exp, shape
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_features=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)

def standardize(X_tr):
    for i in range(shape(X_tr)[1]):
        X_tr[:,i] = (X_tr[:,i] - np.mean(X_tr[:,i]))/np.std(X_tr[:,i])

def F1_score(y, y_hat):
    tp, tn, fp, fn = 0, 0, 0, 0
    for i in range(len(y)):
        if y[i] == 1 and y_hat[i] == 1:
            tp += 1
        elif y[i] == 1 and y_hat[i] == 0:
            fn += 1
        elif y[i] == 0 and y_hat[i] == 1:
            fp += 1
        elif y[i] == 0 and y_hat[i] == 0:
            tn += 1
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    f1 = 2*precision*recall/(precision+recall)
    return f1

class LogisticRegression:
    def sigmoid(self, z):
        sig = 1/(1+exp(-z))
        return sig

    def initialize(self, X):
        weights = np.zeros((shape(X)[1]+1, 1))
        X = np.c_[np.ones((shape(X)[0], 1)), X]
        return weights, X

    def fit(self, X, y, alpha=0.001, iter=400):
        weights, X = self.initialize(X)

        def cost(theta):
            z = dot(X, theta)
            cost0 = y.T.dot(log(self.sigmoid(z)))
            cost1 = (1-y).T.dot(log(1-self.sigmoid(z)))
            cost = -((cost1 + cost0))/len(y)
            return cost

        cost_list = np.zeros(iter)
        for i in range(iter):
            weights = weights - alpha*dot(X.T, self.sigmoid(dot(X, weights)) - np.reshape(y, (len(y), 1)))
            cost_list[i] = cost(weights)
        self.weights = weights
        return cost_list

    def predict(self, X):
        z = dot(self.initialize(X)[1], self.weights)
        lis = []
        for i in self.sigmoid(z):
            if i > 0.5:
                lis.append(1)
            else:
                lis.append(0)
        return lis

# Standardize the features, train the model, and evaluate on both splits
standardize(X_tr)
standardize(X_te)
obj1 = LogisticRegression()
model = obj1.fit(X_tr, y_tr)
y_pred = obj1.predict(X_te)
y_train = obj1.predict(X_tr)
f1_score_tr = F1_score(y_tr, y_train)
f1_score_te = F1_score(y_te, y_pred)
print(f1_score_tr)
print(f1_score_te)
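matplotlib.pyplot is imported above but never used; as a minimal sketch, the per-iteration costs returned by fit (stored in model above) can be plotted to confirm that gradient descent is converging:

plt.plot(model)
plt.xlabel("Iteration")
plt.ylabel("Log loss")
plt.title("Gradient descent convergence")
plt.show()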