Natural Language Processing: Email Spam Detection

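Welcome to the blog! Today we explore the basics of natural language processing (NLP) and walk through an end-to-end project using an email spam detection dataset. Whether or not you are familiar with NLP, this guide should serve as a quick reference.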

Introduction to Natural Language Processing (NLP)

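Natural language processing is a blend of art and science that helps us extract information from text and use it in computations and algorithms. The goal of this project is to build an automated spam detection model.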

Importing Libraries and the Dataset

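Importing the necessary libraries is the first step of any project. For an NLP project, remember in particular to install the NLTK package and import some useful modules from it. Here are some examples: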


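            import nltk
            from nltk.corpus import stopwords

            # One-time downloads of the NLTK resources used in this project
            nltk.download('stopwords')                   # stop-word lists
            nltk.download('wordnet')                     # WordNet lexical database
            nltk.download('punkt')                       # tokenizer models
            nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger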

Next, we load the dataset into a pandas DataFrame and inspect the first five rows.


            import pandas as pd
            # Read the messages file using the latin-1 encoding
            df = pd.read_csv("messages.csv", encoding='latin-1')
            df.head()

Data Preprocessing

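In this step we explore the data to learn about its shape, accuracy, and initial patterns. First, we inspect the summary information of the dataset. It contains 2893 records, and neither the message feature nor the label has any NaN values.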


            df.info()

Next, we check how many spam and non-spam emails the dataset contains.


            print("Count of label:\n", df['label'].value_counts())

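Here, 1 stands for spam and 0 for non-spam. We then check the proportion of each label, that is, the ratio of spam to non-spam emails, and observe that 17% of the emails are spam and the remaining 83% are not.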


            # Share of each label as a percentage of all messages (rounded after scaling)
            print("Not a Spam Email Ratio i.e. 0 label:", round(100 * len(df[df['label']==0]) / len(df['label']), 2), "%")
            print("Spam Email Ratio that is 1 label:", round(100 * len(df[df['label']==1]) / len(df['label']), 2), "%")

Now we create a new feature named length to record the length of each message, and convert every letter in the messages to lowercase.


            df['length'] = df.message.str.len()
            # Lowercase every message, as described above
            df['message'] = df['message'].str.lower()
            df.head()

Data Cleaning

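Beyond these steps, we now carry out some common tasks that apply to every NLP project. We clean the data with regular expressions, matching patterns in the email messages and replacing them with more organized counterparts. Cleaner data leads to a more effective model and higher accuracy. Preprocessing the messages involves the following steps: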


            1.) Replace email addresses
            2.) Replace URLs
            3.) Replace currency symbols
            4.) Replace 10-digit phone numbers (formats include parentheses, spaces, no spaces, dashes)
            5.) Replace numeric characters
            6.) Remove punctuation
            7.) Replace whitespace between terms with a single space
            8.) Remove leading and trailing whitespace
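The eight steps above can be sketched as a single function. This is a minimal illustration, not the original cleaning code: the regular expressions and the placeholder tokens (emailaddress, webaddress, moneysymbol, phonenumber, number) are assumptions chosen for this example, as is the helper name clean_message.

```python
import re

# NOTE: all patterns and placeholder tokens below are illustrative assumptions.
def clean_message(text: str) -> str:
    """Apply the eight cleaning steps to a single message."""
    # 1) Replace email addresses
    text = re.sub(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b', 'emailaddress', text)
    # 2) Replace URLs (http://, https:// or www. up to the next whitespace)
    text = re.sub(r'(?:https?://|www\.)\S+', 'webaddress', text)
    # 3) Replace currency symbols
    text = re.sub(r'[£$€]', ' moneysymbol ', text)
    # 4) Replace 10-digit phone numbers: (555) 123-4567, 555-123-4567, 5551234567
    text = re.sub(r'\(?\b\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}\b', 'phonenumber', text)
    # 5) Replace the remaining numeric characters
    text = re.sub(r'\d+(?:\.\d+)?', 'number', text)
    # 6) Remove punctuation (anything that is neither a word character nor whitespace)
    text = re.sub(r'[^\w\s]', ' ', text)
    # 7) Replace runs of whitespace with a single space
    text = re.sub(r'\s+', ' ', text)
    # 8) Remove leading and trailing whitespace
    return text.strip()


sample = ("Win $1000 now!!! Email bob@example.com or visit "
          "http://example.com/win, call (555) 123-4567.")
print(clean_message(sample))
```

With a DataFrame like the one used here, such a function could be applied with df['message'].apply(clean_message). The order matters: emails, URLs, and phone numbers are replaced before the generic number and punctuation steps, which would otherwise break them apart.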

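Now it is time for tokenization, a key step in NLP. We cannot jump straight to model building without first cleaning the text, and here that cleaning is completed by removing stop words.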


            9.) Remove stop words: There is a corpus of stop words, high-frequency words such as “the”, “to” and “also”, that we sometimes want to filter out of a document before further processing. Stop words usually have little lexical content, do not alter the general meaning of a sentence, and their presence in a text fails to distinguish it from other texts.

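After removing stop words, we create another feature named clean_length to compare the length of the cleaned messages with the length of the unprocessed ones.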


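            import string
            import nltk
            from nltk.corpus import stopwords

            # Standard English stop words plus a few informal tokens common in messages
            stop_words = set(stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure'])
            df['message'] = df['message'].apply(lambda x: " ".join(term for term in x.split() if term not in stop_words))
            # Length of each cleaned message, for comparison with the original 'length'
            df['clean_length'] = df.message.str.len()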

Now let's check some information about the lengths:


            # Total length removed by cleaning (lengths are character counts)
            print("Original Length:", df.length.sum())
            print("Cleaned Length:", df.clean_length.sum())
            print("Total Characters Removed:", df.length.sum() - df.clean_length.sum())

We observe that cleaning removed almost one third of the total message length as unwanted data.

Exploratory Data Analysis

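1.) Data visualization: the graphical representation of information and data. By using visual elements such as charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends and patterns in data.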


            a) Count of labels ('Spam and non-spam counts')
            b) Message distribution before cleaning
            c) Message distribution after cleaning

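Let's use a word cloud library (the wordcloud package is a common choice; it is separate from NLTK) to visualize the most frequent words in spam emails. A word cloud is a data visualization technique for representing text data in which the size of each word indicates its frequency or importance.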


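Since word size in a word cloud reflects frequency, the underlying numbers are simply word counts. The sketch below computes them with the standard library; the helper name top_words and the sample messages are made up for illustration and do not come from the dataset.

```python
from collections import Counter

def top_words(messages, n=3):
    """Return the n most frequent whitespace-separated words across messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.split())
    return counts.most_common(n)


# Hypothetical, already-cleaned spam messages (not taken from the dataset)
spam_messages = [
    "free money free prize",
    "win free money",
    "free money prize now",
]
print(top_words(spam_messages))
```

On the real data, the input would be the cleaned messages whose label is 1, and the most frequent words are the ones a word cloud would draw largest.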

With the processed data ready, we can move on to modeling.

Running and Evaluating the Selected Models

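1.) Use the Tf-idf vectorizer to convert the text messages into numerical form.
2.) After vectorizing the messages, feed them to the model as input, with the label feature as output.
3.) Two algorithms, Naïve Bayes and SVM, are used here to compare accuracy.
4.) With Naïve Bayes, accuracy rose from 83% to at most 88%, even with the test size minimized in the train-test split.
5.) With SVM, accuracy rose to 98%, a better result than Naïve Bayes. The SVM parameters were (C=1.0, kernel='linear', degree=3, gamma='auto').
6.) After applying all suitable algorithms to the dataset, SVM outperformed Naïve Bayes and is therefore the best fit for the goal. All further processing is done with SVM.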


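            from sklearn.feature_extraction.text import TfidfVectorizer
            from sklearn.svm import SVC
            # Convert the cleaned messages into a sparse Tf-idf feature matrix
            tf_vec = TfidfVectorizer()
            # With kernel='linear', the degree and gamma settings have no effect
            SVM = SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
            features = tf_vec.fit_transform(df['message'])
            X = features
            y = df['label']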
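To make the vectorization step less of a black box, here is a small from-scratch sketch of Tf-idf weighting. It follows the smoothed idf formula ln((1 + n) / (1 + df)) + 1 with L2 row normalization, which matches the default settings of scikit-learn's TfidfVectorizer, but it uses plain whitespace tokenization, so it is an approximation for illustration rather than a drop-in replacement. The function name tfidf and the sample documents are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows), where rows are L2-normalized Tf-idf vectors."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Hypothetical cleaned messages (not taken from the dataset)
vocab, rows = tfidf(["free prize now", "meeting now", "free free money"])
print(vocab)
print([round(w, 3) for w in rows[1]])
```

In the second message, "meeting" receives a higher weight than "now" because "now" also appears in another message. This is the property that lets the classifier focus on words that distinguish one message from another.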


            Checking model predictions: to check model performance, we now plot different performance metrics.
            a.) Confusion matrix: the confusion matrix shows high accuracy in predicting the true labels. Of all spam emails, very few are misclassified as non-spam; the rest are correctly identified.
            Across these metrics, the SVM model gives the best accuracy.
            b.) Classification report: we observe high values (~0.99) for accuracy, precision, and recall, which indicates that the model is a good fit for this prediction task.
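For readers who want to see where these numbers come from, the sketch below derives the confusion-matrix cells and the three metrics from a list of true and predicted labels. The function name binary_metrics and the toy labels are illustrative assumptions; they are not the model's actual predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts plus accuracy, precision, and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarm
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # ham kept
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels: 1 = spam, 0 = non-spam (made up for this example)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Because only a minority of the emails are spam, accuracy alone can look good even when many spam messages are missed, so precision and recall on the spam label are worth reading alongside it.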



沪ICP备2024098111号-1

上海秋旦网络科技中心：上海市奉贤区金大公路8218号1幢联系电话：17898875485
