First, let's walk through the machine learning code. Start by importing the required libraries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pickle
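Next, load the data and combine the training and test sets so that the whole dataset can be preprocessed in one pass. Copy the output column into a separate variable, then drop that column from the data.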
train_data = pd.read_csv('train_.csv')
test_data = pd.read_csv('test_.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data = pd.concat([train_data, test_data], ignore_index=True)
y = data['is_promoted']
data = data.drop(['is_promoted'], axis=1)
print(data)
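To make the combine-then-split idea concrete, here is a minimal sketch on toy data. The frames `train_toy` and `test_toy` and their values are hypothetical stand-ins for the real CSV files, and the sketch assumes the test file carries no label column.

```python
import pandas as pd

# Toy stand-ins for train_.csv and test_.csv (hypothetical values).
train_toy = pd.DataFrame({"age": [30, 41, 25], "is_promoted": [0, 1, 0]})
test_toy = pd.DataFrame({"age": [37, 52]})  # assumed: no label column

# Stack the frames; rows that came from test_toy get NaN in is_promoted.
combined = pd.concat([train_toy, test_toy], ignore_index=True)
labels = combined["is_promoted"]
features = combined.drop(columns=["is_promoted"])

# The first len(train_toy) rows are the training rows, so the
# original split can be recovered by position later on.
n_train = len(train_toy)
train_features = features.iloc[:n_train]
test_features = features.iloc[n_train:]
train_labels = labels.iloc[:n_train]

print(features.shape)
print(int(labels.isna().sum()))
```

Because the test rows end up with NaN labels, only the first `n_train` labels should be used when a model is fitted.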
Now preprocess the whole dataset. Take a look at the following code.
# Value counts, for inspection only
dept_counts = data['department'].value_counts()
region_count = data['region'].value_counts()
# Strip letters and underscores so that only the region number remains
# (regex=True is required in pandas 2.0+, where the default is a literal match)
data['region'] = data['region'].str.replace("[a-zA-Z_]", "", regex=True)
region = data['region'].astype(int)
# One-hot encode the categorical columns, dropping one level of each
data = pd.get_dummies(data, columns=['gender'])
data = data.drop(['gender_f'], axis=1)
data = pd.get_dummies(data, columns=['education'])
data = data.drop(['education_Below Secondary'], axis=1)
data = pd.get_dummies(data, columns=['recruitment_channel'])
data = data.drop(['recruitment_channel_referred'], axis=1)
# Binarize department separately; the result is stacked onto X below
from sklearn.preprocessing import LabelBinarizer
lb_style = LabelBinarizer()
lb = lb_style.fit_transform(data["department"])
# Fill missing ratings with the median
data['previous_year_rating'] = data['previous_year_rating'].fillna(data['previous_year_rating'].median())
data = data.drop(['department'], axis=1)
# insert() modifies data in place and returns None, so its result is not assigned
data.insert(1, 'Region', region)
data = data.drop(['region'], axis=1)
d = data
count_ofall_nan = data.isna().sum()
X = data.iloc[:, 0:14].values
X = np.hstack((X, lb))
count_ = np.isnan(np.sum(lb))
data = data.astype(np.int64)
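If you have some experience with machine learning problems, the preprocessing part is easy to follow. It boils down to converting categorical variables to numbers and filling NaN values with the median or the mean; here the median is used. Pandas provides a get_dummies function that handles the encoding, and sklearn offers LabelBinarizer. Only basic preprocessing is done here; studying the dataset carefully and applying better techniques may improve accuracy.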
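The three preprocessing moves described above (extracting the region number, one-hot encoding, and median filling) can be seen on a tiny frame. This is only a sketch: the frame `toy` and its values are hypothetical, and it uses pandas alone rather than LabelBinarizer.

```python
import pandas as pd

# Toy rows with hypothetical values, shaped like the columns used above.
toy = pd.DataFrame({
    "region": ["region_7", "region_22", "region_7"],
    "gender": ["m", "f", "m"],
    "previous_year_rating": [3.0, None, 5.0],
})

# Keep only the digits of the region label, then convert to int.
toy["region"] = toy["region"].str.replace(r"[a-zA-Z_]", "", regex=True).astype(int)

# One-hot encode gender and drop one level, since it is implied by the other.
toy = pd.get_dummies(toy, columns=["gender"], dtype=int).drop(columns=["gender_f"])

# Fill the missing rating with the column median (median of 3.0 and 5.0 is 4.0).
toy["previous_year_rating"] = toy["previous_year_rating"].fillna(
    toy["previous_year_rating"].median()
)

print(toy)
```

Passing `dtype=int` keeps the dummy columns numeric; recent pandas versions otherwise produce boolean columns.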
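With preprocessing done, split the dataset back into training and test data. The output column also has to be paired with the training data again, because the model needs it to learn. Save the test data as a .csv file, like this: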
# Divide into train and test
n_train = len(train_data)
train = X[:n_train, :]
test = X[n_train:, :]
# test is a NumPy array, so wrap it in a DataFrame before writing the CSV
pd.DataFrame(test).to_csv('test_preprocessed.csv')
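Next, fit several models to the training data. Only three models are used here; you can try other models and tune them to find the one with the best accuracy. The models must be saved, because Django will use them to predict the output from the website. To save a model, use pickle and its dump function.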
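# Fit on the training rows only; the test rows are held out for prediction
# (and have no label if test_.csv lacks the is_promoted column)
n_train = len(train_data)
X_train, y_train = X[:n_train], y[:n_train]
# Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
pickle.dump(nb, open('gNB.sav','wb'))
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
random = RandomForestClassifier(n_estimators=100)
random.fit(X_train, y_train)
pickle.dump(random, open('random_forest.sav','wb'))
# Multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB
classifier_multi = MultinomialNB()
classifier_multi.fit(X_train, y_train)
pickle.dump(classifier_multi, open('classifier_multi_NB.sav','wb'))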
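The save-and-load round trip with pickle can be sketched without training anything. In the example below, `model_stub` is a hypothetical stand-in for a fitted model, and the temporary path is chosen only for illustration; a scikit-learn estimator is dumped and loaded with the same two calls.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: any picklable object
# goes through dump() and load() the same way.
model_stub = {"name": "gNB", "class_prior": [0.9, 0.1]}

# Write into a throwaway directory so the sketch leaves the project untouched.
path = os.path.join(tempfile.mkdtemp(), "gNB.sav")

# The with-statement closes the file even if dump() raises.
with open(path, "wb") as f:
    pickle.dump(model_stub, f)

# Later (for example inside a Django view) the object is restored.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)
```

Opening the files in a with-statement avoids leaving handles open. Note that pickle.load can execute arbitrary code, so only files you created yourself should be loaded this way.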
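With the models saved with pickle, move on to Django to make predictions from the website. On the front end, a form tag holds three buttons that interact with Django. The form action points to the URL 'download', which is covered later. Here is that part of the code.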
<form action="download" method="POST">
{% csrf_token %}
<input type="submit"
name="gNB"
value="Gaussian Naive Bayes" class="btn btn-success">
<input type="submit"
name="multiNB"
value="Multinomial Naive Bayes" class="btn btn-success">
<input type="submit"
name="rf"
value="Random Forest" class="btn btn-success">
</form>
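Next, go to the views.py file and first load the test data so that it can be used. The file also needs os, pickle, and pandas (as pd) imported, along with HttpResponse from django.http and render from django.shortcuts.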
test_data_preprocessed = pd.read_csv('test_preprocessed.csv')
test_data_preprocessed = test_data_preprocessed.drop(['Unnamed: 0'],axis =1)
test_data_preprocessed = test_data_preprocessed.iloc[:,:].values
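The 'Unnamed: 0' column dropped above comes from the way the CSV was written: to_csv stores the row index as a first column with an empty header. A small sketch with hypothetical values, using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Hypothetical preprocessed rows standing in for the real test matrix.
frame = pd.DataFrame([[1, 0, 3], [0, 1, 5]])

# By default to_csv writes the row index as a first column with no header.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# read_csv names that header-less column 'Unnamed: 0'.
loaded = pd.read_csv(buffer)
print(list(loaded.columns))

# Dropping the column recovers the original values.
values = loaded.drop(columns=["Unnamed: 0"]).values
print(values)
```

Writing the file with `index=False` would avoid the extra column altogether, in which case the drop step is no longer needed.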
In views.py, create a function named home that renders the page with the three buttons and the rest of the HTML content.
def home(request):
    return render(request, "index.html")
Add the following code to the urls.py file.
urlpatterns = [
    path('', views.home, name='home'),
]
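Now for the buttons themselves. Again in views.py, create a function named models. Each button in the HTML above was given a name attribute (gNB, multiNB, rf). Those names tell the view which button the user clicked, and it then predicts with the corresponding model. Take a look at the following code.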
def models(request):
    if 'gNB' in request.POST:
        gaussian = pickle.load(open('gNB.sav','rb'))
        y_pred = gaussian.predict(test_data_preprocessed)
        output = pd.DataFrame(y_pred)
        output.to_csv('gaussianNB.csv')
        filename = 'gaussianNB.csv'
        response = HttpResponse(open(filename, 'rb').read(), content_type='text/csv')
        response['Content-Length'] = os.path.getsize(filename)
        response['Content-Disposition'] = 'attachment; filename=%s' % 'gaussianNB.csv'
        return response
    if 'multiNB' in request.POST:
        multi = pickle.load(open('classifier_multi_NB.sav','rb'))
        y_pred = multi.predict(test_data_preprocessed)
        output = pd.DataFrame(y_pred)
        output.to_csv('multi_NB.csv')
        filename = 'multi_NB.csv'
        response = HttpResponse(open(filename, 'rb').read(), content_type='text/csv')
        response['Content-Length'] = os.path.getsize(filename)
        response['Content-Disposition'] = 'attachment; filename=%s' % 'multi_NB.csv'
        return response
    if 'rf' in request.POST:
        rf = pickle.load(open('random_forest.sav','rb'))
        y_pred = rf.predict(test_data_preprocessed)
        output = pd.DataFrame(y_pred)
        output.to_csv('rf.csv')
        filename = 'rf.csv'
        response = HttpResponse(open(filename, 'rb').read(), content_type='text/csv')
        response['Content-Length'] = os.path.getsize(filename)
        response['Content-Disposition'] = 'attachment; filename=%s' % 'rf.csv'
        return response
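Each if statement checks for a button name and loads the matching model, then calls predict on the test data loaded earlier. The predictions are converted to a DataFrame and written to a CSV file. The main task, though, is the download: Django's HttpResponse sends the file to the browser with a Content-Disposition header, so the user receives it as an attachment. That is how the prediction file is downloaded. Finally, register the view in urls.py so that the form's 'download' action reaches it: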
urlpatterns = [
    path('', views.home, name='home'),
    path('download', views.models),
]
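The three branches of the models view differ only in the button name and the two file names, so the choice can also be expressed as a lookup table. The sketch below is one possible arrangement, not the original author's code: `MODEL_FILES` and `pick_model` are hypothetical names, and the function takes any mapping of form fields so that it runs without Django.

```python
# Hypothetical lookup table: button name -> (pickled model file, CSV to send back).
# The file names mirror the ones used in the view above.
MODEL_FILES = {
    "gNB": ("gNB.sav", "gaussianNB.csv"),
    "multiNB": ("classifier_multi_NB.sav", "multi_NB.csv"),
    "rf": ("random_forest.sav", "rf.csv"),
}


def pick_model(post_data):
    """Return (model_path, csv_name) for the first known button in post_data.

    post_data is any mapping of submitted form fields, such as request.POST.
    Returns None when no known button name is present.
    """
    for button, files in MODEL_FILES.items():
        if button in post_data:
            return files
    return None


# A submit button sends its own name along with the other form fields.
print(pick_model({"csrfmiddlewaretoken": "x", "rf": "Random Forest"}))
print(pick_model({"csrfmiddlewaretoken": "x"}))
```

Inside a view, the returned pair would feed the same load, predict, and respond steps shown earlier, and a None result gives a natural place to return an error page instead of falling off the end of the function.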