In data science, logistic regression is a widely used classification algorithm for predicting the outcome of binary classification problems. Although libraries such as scikit-learn make it easy to use logistic regression off the shelf, writing the algorithm from scratch gives a much deeper understanding of the mathematics and mechanics behind it. This article walks through implementing logistic regression with Python and NumPy and evaluating its performance.
Before we start, we need to import the Python libraries and the dataset used in the implementation. We will use NumPy for array operations and scikit-learn's make_classification function to generate a classification dataset with four features.
import numpy as np
from numpy import log, dot, exp, shape
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic binary classification dataset with four features
X, y = make_classification(n_features=4, n_classes=2)

# Hold out 10% of the samples as a test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)
print(X_tr.shape, X_te.shape)
Standardization rescales the data so that each feature has zero mean and unit standard deviation. This matters for algorithms optimized with gradient descent, because features on very different scales can slow down or destabilize convergence. Standardization is applied column by column using the formula x' = (x − μ) / σ, where μ is the feature mean and σ is its standard deviation:
def standardize(X_tr):
    # Scale each column in place to zero mean and unit standard deviation
    for i in range(shape(X_tr)[1]):
        X_tr[:, i] = (X_tr[:, i] - np.mean(X_tr[:, i])) / np.std(X_tr[:, i])
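Note that this helper standardizes whatever array it is given using that array's own statistics, which is how the rest of this article uses it. A common refinement, shown here only as a minimal sketch (the standardize_with helper is not part of the article's code), is to compute the mean and standard deviation on the training set and reuse them for the test set:

def standardize_with(X, mean, std):
    # Hypothetical helper: reuse training-set statistics for new data
    return (X - mean) / std

mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr_scaled = standardize_with(X_tr, mu, sigma)
X_te_scaled = standardize_with(X_te, mu, sigma)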
In logistic regression we need matrices for the feature vectors and the parameters (weights). The feature matrix has dimensions m x n, where m is the number of observations and n is the number of features per observation, and the parameter vector has dimensions n x 1. We also prepend a column of ones to the feature matrix as a bias term, with a matching entry added to the weight vector.
def initialize(X):
    # One weight per feature plus one for the bias term, all initialized to zero
    weights = np.zeros((shape(X)[1] + 1, 1))
    # Prepend a column of ones so the bias is handled by the matrix multiplication
    X = np.c_[np.ones((shape(X)[0], 1)), X]
    return weights, X
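A quick, purely illustrative sanity check confirms the shapes: with four features, initialize returns a 5 x 1 weight vector and a feature matrix with one extra bias column:

w, X_b = initialize(X_tr)
print(w.shape, X_b.shape)   # e.g. (5, 1) and (90, 5) for a 90-sample training split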
In logistic regression, the sigmoid function squashes the output of the linear equation into the range between 0 and 1, which is exactly what we need for binary classification. The sigmoid curve crosses the y-axis at 0.5 (sigmoid(0) = 0.5), and this value is typically used as the classification threshold.
def sigmoid(z):
    sig = 1 / (1 + exp(-z))
    return sig
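As a quick illustration of the threshold behaviour, sigmoid(0) is exactly 0.5, and the output approaches 0 or 1 as the input becomes strongly negative or positive:

print(sigmoid(0))                     # 0.5
print(sigmoid(np.array([-5, 0, 5])))  # roughly [0.0067, 0.5, 0.9933]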
The cost function measures how far the model's predictions are from the actual labels. For logistic regression we use the log loss (binary cross-entropy), J(θ) = −(1/m) Σ [ y·log(hθ(x)) + (1 − y)·log(1 − hθ(x)) ], where hθ(x) is the sigmoid of θᵀx. It is used because it is convex and therefore well suited to gradient descent.
def cost(theta, X, y):
    # X is assumed to already contain the bias column added by initialize()
    z = dot(X, theta)
    cost0 = y.T.dot(log(sigmoid(z)))
    cost1 = (1 - y).T.dot(log(1 - sigmoid(z)))
    cost = -(cost1 + cost0) / len(y)
    return cost
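To see why log loss is a sensible penalty, compare a confident correct prediction with a confident wrong one for a single example whose true label is y = 1 (an illustrative calculation, not part of the article's code):

# Per-example log loss is -[y*log(p) + (1-y)*log(1-p)]; here y = 1
print(-log(0.9))   # ~0.105 -> confident and correct, small penalty
print(-log(0.1))   # ~2.303 -> confident and wrong, large penalty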
Gradient descent is the optimization algorithm used to find the best-fitting parameters. The gradient is the vector of first derivatives of the cost function and points in the direction of steepest ascent, so gradient descent updates the weights in the opposite direction of the gradient, repeating until convergence.
def fit(X, y, alpha=0.001, iter=100):
    weights, X = initialize(X)
    cost_list = np.zeros(iter)
    for i in range(iter):
        # Gradient of the log loss is X^T (sigmoid(Xw) - y); step against it
        weights = weights - alpha * dot(X.T, sigmoid(dot(X, weights)) - np.reshape(y, (len(y), 1)))
        cost_list[i] = cost(weights, X, y)
    return weights, cost_list
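Since matplotlib is already imported, a quick way to check that training actually converges is to plot the recorded cost values; this is a small illustrative sketch using the functions defined above, not part of the original walkthrough:

weights, cost_list = fit(X_tr, y_tr)
plt.plot(range(len(cost_list)), cost_list)
plt.xlabel("iteration")
plt.ylabel("log loss")
plt.show()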
With the trained model parameters in hand, we can make predictions on unseen data.
def predict(X, weights):
    # Add the bias column, compute probabilities, and threshold at 0.5
    z = dot(initialize(X)[1], weights)
    lis = []
    for i in sigmoid(z):
        if i > 0.5:
            lis.append(1)
        else:
            lis.append(0)
    return lis
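The loop above can also be written as one vectorized expression, which is equivalent but more idiomatic NumPy (shown as an alternative sketch, not as the article's final version):

def predict_vectorized(X, weights):
    probs = sigmoid(dot(initialize(X)[1], weights))
    # Boolean comparison, cast to 0/1 integers, flatten to a 1-D array
    return (probs > 0.5).astype(int).ravel()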
To evaluate the classifier we use the F1 score, the harmonic mean of precision and recall, computed from the confusion-matrix counts:
def f1_score(y, y_hat):
    # Count true positives, true negatives, false positives, false negatives
    tp, tn, fp, fn = 0, 0, 0, 0
    for i in range(len(y)):
        if y[i] == 1 and y_hat[i] == 1:
            tp += 1
        elif y[i] == 1 and y_hat[i] == 0:
            fn += 1
        elif y[i] == 0 and y_hat[i] == 1:
            fp += 1
        elif y[i] == 0 and y_hat[i] == 0:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return f1
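As a sanity check, the result can be compared against scikit-learn's own F1 implementation; the two should agree. This is a quick illustration that assumes the weights and predict function from the training sketch above:

from sklearn.metrics import f1_score as sk_f1_score

y_hat = predict(X_te, weights)
print(f1_score(y_te, y_hat), sk_f1_score(y_te, y_hat))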
Putting all of the pieces together, here is the complete implementation, with the model wrapped in a LogisticRegression class, followed by the training and evaluation code:
import numpy as np
from numpy import log, dot, exp, shape
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_features=4, n_classes=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)
def standardize(X_tr):
    for i in range(shape(X_tr)[1]):
        X_tr[:, i] = (X_tr[:, i] - np.mean(X_tr[:, i])) / np.std(X_tr[:, i])
def F1_score(y, y_hat):
    tp, tn, fp, fn = 0, 0, 0, 0
    for i in range(len(y)):
        if y[i] == 1 and y_hat[i] == 1:
            tp += 1
        elif y[i] == 1 and y_hat[i] == 0:
            fn += 1
        elif y[i] == 0 and y_hat[i] == 1:
            fp += 1
        elif y[i] == 0 and y_hat[i] == 0:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return f1
class LogisticRegression:
    def sigmoid(self, z):
        sig = 1 / (1 + exp(-z))
        return sig

    def initialize(self, X):
        # One weight per feature plus a bias weight; bias column of ones prepended to X
        weights = np.zeros((shape(X)[1] + 1, 1))
        X = np.c_[np.ones((shape(X)[0], 1)), X]
        return weights, X

    def fit(self, X, y, alpha=0.001, iter=400):
        weights, X = self.initialize(X)

        def cost(theta):
            # Log loss over the training data captured from the enclosing scope
            z = dot(X, theta)
            cost0 = y.T.dot(log(self.sigmoid(z)))
            cost1 = (1 - y).T.dot(log(1 - self.sigmoid(z)))
            cost = -(cost1 + cost0) / len(y)
            return cost

        cost_list = np.zeros(iter)
        for i in range(iter):
            # Gradient descent step on the log loss
            weights = weights - alpha * dot(X.T, self.sigmoid(dot(X, weights)) - np.reshape(y, (len(y), 1)))
            cost_list[i] = cost(weights)
        self.weights = weights
        return cost_list

    def predict(self, X):
        # Add the bias column, compute probabilities, and threshold at 0.5
        z = dot(self.initialize(X)[1], self.weights)
        lis = []
        for i in self.sigmoid(z):
            if i > 0.5:
                lis.append(1)
            else:
                lis.append(0)
        return lis
standardize(X_tr)
standardize(X_te)
obj1 = LogisticRegression()
model = obj1.fit(X_tr, y_tr)
y_pred = obj1.predict(X_te)
y_train = obj1.predict(X_tr)
f1_score_tr = F1_score(y_tr, y_train)
f1_score_te = F1_score(y_te, y_pred)
print(f1_score_tr)
print(f1_score_te)
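Finally, as an optional sanity check that is not part of the original walkthrough, the scores can be compared with scikit-learn's built-in LogisticRegression trained on the same split; the from-scratch model should land in the same ballpark:

from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression

clf = SklearnLogisticRegression().fit(X_tr, y_tr)
print(F1_score(y_te, clf.predict(X_te)))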