First, let's walk through the machine learning code. Start by importing the required libraries:
            import pandas as pd
            import matplotlib.pyplot as plt
            import numpy as np
            import seaborn as sns
            import pickle
        
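Next, load the data and concatenate the training and test sets so that the whole dataset can be preprocessed in one pass. Copy the output column (is_promoted) into a separate variable, then drop it from the data.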
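            train_data = pd.read_csv('train_.csv')
            test_data = pd.read_csv('test_.csv')
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            data = pd.concat([train_data, test_data], ignore_index=True)
            # keep the target separately (if test_.csv has no is_promoted column, its rows are NaN here)
            y = data['is_promoted']
            data = data.drop(['is_promoted'], axis=1)
            print(data)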
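The combine-then-separate step can be tried on a tiny example first. The sketch below uses two made-up frames in place of train_.csv and test_.csv (the rows and the region values are hypothetical; only the column name is_promoted comes from the code above) and assumes the test file has no target column.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train_.csv and test_.csv
train_data = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "region": ["region_7", "region_2", "region_7"],
    "is_promoted": [0, 1, 0],
})
test_data = pd.DataFrame({
    "employee_id": [4, 5],
    "region": ["region_2", "region_7"],
})

# Stack the two frames; columns missing from one frame are filled with NaN
data = pd.concat([train_data, test_data], ignore_index=True)

# Separate the target from the features
y = data["is_promoted"]
data = data.drop(columns=["is_promoted"])

# Remember where the training rows end so the data can be split again later
n_train = len(train_data)

print(data.shape)
print(y.isna().sum())  # number of rows without a label
```

Keeping n_train around is what makes it possible to cut the combined array back into its training and test parts after preprocessing.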
        
Now preprocess the combined data. Take a look at the following code.
            # inspect category frequencies (exploration only; not used later)
            dept_counts = data['department'].value_counts()
            region_count = data['region'].value_counts()
            # strip letters and underscores from region, leaving only the digits;
            # regex=True is needed since pandas 2.0, where literal matching became the default
            data['region'] = data['region'].str.replace("[a-zA-Z_]", "", regex=True)
            region = data['region'].astype(int)
            # one-hot encode the categorical columns, dropping one level of each
            data = pd.get_dummies(data, columns=['gender'])
            data = data.drop(['gender_f'], axis=1)
            data = pd.get_dummies(data, columns=['education'])
            data = data.drop(['education_Below Secondary'], axis=1)
            data = pd.get_dummies(data, columns=['recruitment_channel'])
            data = data.drop(['recruitment_channel_referred'], axis=1)
            # binarize department separately; the result is stacked onto X below
            from sklearn.preprocessing import LabelBinarizer
            lb_style = LabelBinarizer()
            lb = lb_style.fit_transform(data["department"])
            # fill missing ratings with the median
            data['previous_year_rating'] = data['previous_year_rating'].fillna(data['previous_year_rating'].median())
            data = data.drop(['department'], axis=1)
            # insert() modifies data in place and returns None, so its result is not assigned
            data.insert(1, 'Region', region)
            data = data.drop(['region'], axis=1)
            count_ofall_nan = data.isna().sum()
            X = data.iloc[:, 0:14].values
            X = np.hstack((X, lb))
            count_ = np.isnan(np.sum(lb))
            data = data.astype(np.int64)
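To see what the three main transformations do, here is a minimal sketch on a three-row frame. The rows are invented for illustration; the column names and the regular expression are taken from the code above.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the tutorial's dataset
df = pd.DataFrame({
    "region": ["region_7", "region_22", "region_7"],
    "gender": ["m", "f", "m"],
    "previous_year_rating": [3.0, None, 5.0],
})

# 1. Keep only the digits of the region label and convert them to integers
df["region"] = df["region"].str.replace("[a-zA-Z_]", "", regex=True).astype(int)

# 2. One-hot encode gender, then drop one level because it is implied by the other
df = pd.get_dummies(df, columns=["gender"]).drop(columns=["gender_f"])

# 3. Replace the missing rating with the median of the observed ratings
median_rating = df["previous_year_rating"].median()
df["previous_year_rating"] = df["previous_year_rating"].fillna(median_rating)

print(df)
```

Dropping one dummy column per variable keeps the encoding free of redundant columns: with two gender levels, gender_m alone carries the full information.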
        
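If you have some experience with machine learning problems, the preprocessing will be easy to follow. It essentially converts categorical variables to numeric ones and fills NaN values with the median or the mean; here the median is used. Pandas provides the get_dummies function, which handles the encoding, and sklearn provides LabelBinarizer. Only basic preprocessing is done here; study the dataset carefully, since better techniques can improve accuracy.
With preprocessing done, split the dataset back into training and test data. The output column also has to be paired with the training rows again, because the model needs it to learn. Save the test data as a .csv file. To do that...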
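            # divide into train and test
            n_train = len(train_data)  # number of rows that came from train_.csv
            train = X[:n_train, :]
            test = X[n_train:, :]
            # X is a NumPy array, so wrap it in a DataFrame before writing the CSV
            pd.DataFrame(test).to_csv('test_preprocessed.csv')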
        
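Next, fit several models to the training data. Only three models are used here; you can try others and tune them to find the one with the best accuracy. The models have to be saved, because Django will use them to predict output from the website. To save a model, use pickle and its dump function.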
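            # train only on the rows from train_.csv; the test rows are the ones to be predicted
            n_train = len(train_data)
            X_train, y_train = X[:n_train], y[:n_train]

            # Gaussian naive Bayes
            from sklearn.naive_bayes import GaussianNB
            nb = GaussianNB()
            nb.fit(X_train, y_train)
            with open('gNB.sav', 'wb') as f:
                pickle.dump(nb, f)
            # random forest classifier
            from sklearn.ensemble import RandomForestClassifier
            random = RandomForestClassifier(n_estimators=100)
            random.fit(X_train, y_train)
            with open('random_forest.sav', 'wb') as f:
                pickle.dump(random, f)
            # multinomial naive Bayes (requires non-negative feature values)
            from sklearn.naive_bayes import MultinomialNB
            classifier_multi = MultinomialNB()
            classifier_multi.fit(X_train, y_train)
            with open('classifier_multi_NB.sav', 'wb') as f:
                pickle.dump(classifier_multi, f)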
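A quick way to check that a saved model survives the round trip is to reload it and compare predictions. The following is a minimal sketch with an invented four-row training set and a temporary directory; only GaussianNB, pickle.dump and pickle.load mirror the saving step above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up training data: two well-separated classes
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X_train, y_train)

# Save the fitted model to disk, then load it back
path = os.path.join(tempfile.mkdtemp(), "gNB.sav")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should predict exactly what the original predicts
X_new = np.array([[0.1, 0.1], [5.1, 5.0]])
same = np.array_equal(model.predict(X_new), restored.predict(X_new))
print(same)
```

Opening the files in a with block closes them as soon as the dump or load finishes, which matters once the same files are read repeatedly by a web server.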
        
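Now that the models are saved with pickle, move on to Django to predict values from the website. On the front end, a form tag holds three buttons that interact with Django. The form action points to the link 'download', which is covered later. Here is that part of the code.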
            <form action="download" method="POST">
               {% csrf_token %}
               <input type="submit"
                   name="gNB"
                   value="Gaussian Naive Bayes" class="btn btn-success">
               <input type="submit"
                   name="multiNB"
                   value="Multinomial Naive Bayes" class="btn btn-success">
               <input type="submit"
                   name="rf"
                   value="Random Forest" class="btn btn-success">
            </form>
        
Next, go to the views.py file and first load the test data so that it can be used.
            test_data_preprocessed = pd.read_csv('test_preprocessed.csv')
            test_data_preprocessed = test_data_preprocessed.drop(['Unnamed: 0'],axis =1)
            test_data_preprocessed = test_data_preprocessed.iloc[:,:].values
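The Unnamed: 0 column dropped above is the row index that to_csv writes by default. The sketch below, using a made-up two-row array and in-memory buffers instead of files, shows where that column comes from and that passing index=False when saving would avoid it. This is an alternative, not what the tutorial's code does.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical preprocessed test rows
test = np.array([[1, 7], [2, 22]])

# Default behaviour: the row index is written as an unnamed first column
buf = io.StringIO()
pd.DataFrame(test).to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# With index=False only the data columns are written
buf2 = io.StringIO()
pd.DataFrame(test).to_csv(buf2, index=False)
buf2.seek(0)
without_index = pd.read_csv(buf2)

print(list(with_index.columns))     # ['Unnamed: 0', '0', '1']
print(list(without_index.columns))  # ['0', '1']
```

Either approach gives the same feature matrix in the end; the only requirement is that the saving and loading code agree on whether the index column is present.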
        
In views.py, create a function named home so that the site can display the three buttons along with the rest of the HTML.
            from django.shortcuts import render

            def home(request):
                return render(request, "index.html")
        
Add the following code to the urls.py file.
            urlpatterns = [
                path('', views.home, name='home')
            ]
        
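Now handle what the buttons do. Back in views.py, create a function named models. In the HTML above, each button was given a name (the name attribute). Those names are used here to find out which button the user clicked, and the prediction is then made with the corresponding model. Take a look at the following code.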
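            # imports needed by this view (place them at the top of views.py)
            import os
            import pickle

            from django.http import HttpResponse

            def models(request):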
                
                if 'gNB' in request.POST:
                    gaussian = pickle.load(open('gNB.sav','rb'))
                    y_pred = gaussian.predict(test_data_preprocessed)
                    output = pd.DataFrame(y_pred)
                    output.to_csv('gaussianNB.csv')
                    
                    filename = 'gaussianNB.csv'
                    response = HttpResponse(open(filename, 'rb').read(), content_type='text/csv')
                    response['Content-Length'] = os.path.getsize(filename)
                    response['Content-Disposition'] = 'attachment; filename=%s' % 'gaussianNB.csv'
                    return response
                
                if 'multiNB' in request.POST:
                    multi = pickle.load(open('classifier_multi_NB.sav','rb'))
                    y_pred = multi.predict(test_data_preprocessed)
                    output = pd.DataFrame(y_pred)
                    output.to_csv('multi_NB.csv')
                    
                    filename = 'multi_NB.csv'
                    response = HttpResponse(open(filename, 'rb').read(), content_type='text/csv')               
                    response['Content-Length'] = os.path.getsize(filename)
                    response['Content-Disposition'] = 'attachment; filename=%s' % 'multi_NB.csv'
                    return response
                
                if 'rf' in request.POST:
                    rf = pickle.load(open('random_forest.sav','rb'))
                    y_pred = rf.predict(test_data_preprocessed)
                    output = pd.DataFrame(y_pred)
                    output.to_csv('rf.csv')
                    
                    filename = 'rf.csv'
                    response = HttpResponse(open(filename, 'rb').read(), content_type='text/csv')             
                    response['Content-Length'] = os.path.getsize(filename)
                    response['Content-Disposition'] = 'attachment; filename=%s' % 'rf.csv'
                    return response
        
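Each if statement checks the button name, loads the matching pickled model, and runs it on the test data loaded earlier. The predict function produces the predictions, which are converted to a DataFrame and written to a CSV file. The main task, though, is downloading the file: Django's HttpResponse sends the file to the browser so that the user can download it as an attachment. That is how the prediction file gets downloaded. The download route is registered in urls.py as follows: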
            urlpatterns = [
                path('',views.home, name = 'home'),
                path('download',views.models)
            ]
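The three branches of the models view differ only in the file names they use, so the button-to-model choice can also be expressed as a lookup table. The sketch below is one possible way to do that, not the tutorial's implementation: MODEL_FILES and select_model are hypothetical names, while the button names and file names are the ones used in the view. It is plain Python, so it runs without Django; in a real view, request.POST would be passed in place of the dictionary.

```python
# Maps each submit button's name to (pickled model file, CSV download name)
MODEL_FILES = {
    "gNB": ("gNB.sav", "gaussianNB.csv"),
    "multiNB": ("classifier_multi_NB.sav", "multi_NB.csv"),
    "rf": ("random_forest.sav", "rf.csv"),
}


def select_model(post_data):
    """Return (model_file, csv_name) for the first known button name found
    in post_data, or None if no known button was submitted."""
    for name, files in MODEL_FILES.items():
        if name in post_data:
            return files
    return None


# A browser submits only the clicked button's name along with the CSRF token
print(select_model({"csrfmiddlewaretoken": "x", "rf": "Random Forest"}))
```

Returning None for an unknown button gives the caller a chance to answer with an error page; as written, the original view returns nothing in that case, which Django reports as an error.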