Natural language processing (NLP) is an important branch of computer science and artificial intelligence that involves using computer programs to process and analyze large volumes of natural language data. Before performing text analysis or building models, cleaning the raw text is an essential step. This article introduces several common text cleaning techniques, including removing extra whitespace, removing punctuation, case normalization, tokenization, stopword removal, lemmatization, and stemming.
When working with text data, you will often encounter extra spaces between words. These redundant spaces not only hurt readability but also interfere with later text analysis. A regular expression makes it easy to collapse them.
import re

doc = "NLP  is an   interesting     field.  "  # extra spaces between and after words
# Collapse every run of whitespace into a single space, then trim the ends
new_doc = re.sub(r"\s+", " ", doc).strip()
print(new_doc)
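If you prefer to avoid regular expressions, a minimal alternative using only built-in string methods achieves the same result, because str.split() with no arguments splits on any run of whitespace and discards empty strings:

doc = "NLP  is an   interesting     field.  "
# split() breaks on any whitespace run; join glues the pieces back with single spaces
new_doc = " ".join(doc.split())
print(new_doc)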
Punctuation marks usually carry no useful information for text analysis, and removing them helps separate words more accurately. You can strip punctuation with a regular expression or with functions from the string library.
text = "Hello! How are you!! I'm very excited that you're going for a trip to Europe!! Yayy!"
re.sub("[^-9A-Za-z ]", "" , text)
import string

text = "Hello! How are you!! I'm very excited that you're going for a trip to Europe!! Yayy!"
# string.punctuation holds all ASCII punctuation characters
text_clean = "".join([i for i in text if i not in string.punctuation])
print(text_clean)
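For longer documents, a commonly used and faster alternative is str.translate; this is a small sketch that builds a deletion table with str.maketrans:

import string

text = "Hello! How are you!! I'm very excited that you're going for a trip to Europe!! Yayy!"
# The third argument of str.maketrans lists characters to delete
text_clean = text.translate(str.maketrans("", "", string.punctuation))
print(text_clean)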
In Python, strings are case-sensitive, so "Hello" and "hello" are treated as different words. To eliminate this difference, convert all characters to either uppercase or lowercase using the string methods .lower() or .upper().
import string

text = "Hello! How are you!! I'm very excited that you're going for a trip to Europe!! Yayy!"
# Lowercase each character while filtering out punctuation in a single pass
text_clean = "".join([i.lower() for i in text if i not in string.punctuation])
print(text_clean)
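If you only need case normalization, you can also lowercase the whole string at once; for text containing non-ASCII letters, str.casefold() is the more aggressive variant (a brief sketch, with "Straße" as an illustrative input):

text = "Hello! How are you!!"
print(text.lower())          # lowercase the entire string in one call
print("Straße".casefold())   # prints "strasse": casefold also folds special characters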
text = "Hello! How are you!! I'm very excited that you're going for a trip to Europe!! Yayy!"
nltk.tokenize.word_tokenize(text)
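If this is your first time using NLTK, the tokenizer models and corpora used in this article may not be present yet; a one-time download such as the following should cover everything below:

import nltk

nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stopword lists, used later in this article
nltk.download('wordnet')    # WordNet data, used by the lemmatizer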
For social media text, NLTK also offers TweetTokenizer, which keeps constructs such as @mentions, hashtags, and emoticons together as single tokens:

from nltk.tokenize import TweetTokenizer

tweet = TweetTokenizer()
print(tweet.tokenize(text))
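To see the difference between the two tokenizers, here is a small illustrative input (the handle and hashtag are made up for the example):

from nltk.tokenize import TweetTokenizer, word_tokenize

sample = "@nlp_fan loves #NLP :)"
print(word_tokenize(sample))              # splits into ['@', 'nlp_fan', 'loves', '#', 'NLP', ':', ')']
print(TweetTokenizer().tokenize(sample))  # keeps ['@nlp_fan', 'loves', '#NLP', ':)'] intact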
You can also split text yourself with a regular expression, for example to separate an @mention from the rest of a sentence:

import re

a = 'What are your views related to US elections @nitin'
# Split on whitespace followed by '@'
print(re.split(r'\s@', a))  # ['What are your views related to US elections', 'nitin']
Stopwords are high-frequency function words such as "the", "is", and "are" that usually contribute little to an analysis, so they are commonly filtered out after tokenization. NLTK ships stopword lists for many languages; note that the English list is all lowercase, so each token should be lowercased before the comparison:

stopwords = nltk.corpus.stopwords.words('english')
text = "Hello! How are you!! I'm very excited that you're going for a trip to Europe!! Yayy!"
# Remove punctuation before tokenizing
text_new = "".join([i for i in text if i not in string.punctuation])
print(text_new)
words = nltk.tokenize.word_tokenize(text_new)
print(words)
# Compare lowercased tokens, since the NLTK stopword list is lowercase
words_new = [i for i in words if i.lower() not in stopwords]
print(words_new)
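One practical note: membership tests against a Python list scan it element by element, so for larger documents it is worth converting the stopword list to a set first, as in this sketch built on the variables above:

stop_set = set(stopwords)  # set membership is O(1) instead of scanning a list
words_new = [i for i in words if i.lower() not in stop_set]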
Stemming reduces inflected words to a common stem by chopping off suffixes with heuristic rules; the result is not guaranteed to be a dictionary word. The classic algorithm is the Porter stemmer:

ps = nltk.PorterStemmer()
w = [ps.stem(word) for word in words_new]
print(w)
The Snowball stemmer (also known as Porter2) is a refinement of the Porter algorithm and supports multiple languages:

ss = nltk.SnowballStemmer(language='english')
w = [ss.stem(word) for word in words_new]
print(w)
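The two stemmers agree on most words but differ on some suffixes; "fairly" is a commonly cited example, reusing the ps and ss objects from above:

print(ps.stem('fairly'))  # 'fairli': Porter leaves a non-dictionary stem
print(ss.stem('fairly'))  # 'fair': Snowball strips the '-ly' suffix cleanly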
Lemmatization also reduces words to a base form, but it looks each word up in a vocabulary (here WordNet), so the output is always a valid dictionary word:

wn = nltk.WordNetLemmatizer()
w = [wn.lemmatize(word) for word in words_new]
print(w)
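By default, lemmatize() treats every word as a noun; passing a part-of-speech tag through the pos parameter usually gives better results. A minimal illustration:

print(wn.lemmatize('running'))           # 'running': treated as a noun by default
print(wn.lemmatize('running', pos='v'))  # 'run': treated as a verb
print(wn.lemmatize('better', pos='a'))   # 'good': adjective lookup finds the true lemma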