Applications of Regular Expressions in Natural Language Processing

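Regular expressions are a powerful text-processing tool for working through large volumes of text quickly and accurately. In natural language processing (NLP) they are used widely, for tasks that include validating the structure of a string, extracting substrings, searching and replacing, and splitting strings. This article walks through some common uses of regular expressions in NLP, each with a Python code example.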

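Common Uses of Regular Expressions

Regular expressions apply to many text-processing tasks. Common scenarios include:

Validating that a string follows a specific format
Extracting substrings from a structured string
Searching for, replacing, or rearranging parts of a string
Splitting a string into tokens

These tasks come up constantly when handling text data or solving NLP problems. The code examples later in this article show how regular expressions handle them in practice.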
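As a rough illustration of the four tasks listed above, the sketch below uses only Python's standard re module. The sample strings and patterns are assumptions chosen for brevity, not examples from the rest of the article.

```python
import re

# 1. Validate: does the whole string have the shape yyyy-mm-dd? (shape only)
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2021-06-21") is not None

# 2. Extract: pull the key and the value out of a structured string
match = re.search(r"(\w+)=(\w+)", "lang=python")
key, value = match.group(1), match.group(2)

# 3. Search and rearrange: turn "last, first" into "first last"
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Doval, Ajit")

# 4. Split: break a sentence into tokens on runs of non-word characters,
#    dropping the empty string left behind by the trailing "?"
tokens = [t for t in re.split(r"\W+", "Regex is fast, isn't it?") if t]

print(is_date, key, value, swapped, tokens)
```

Note that splitting on \W+ breaks "isn't" into "isn" and "t"; a real tokenizer would usually need a more careful pattern.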

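Regular Expression Functions

In Python, regular expression operations are provided by the re module. The most commonly used functions are:

re.findall - returns all non-overlapping substrings that match the pattern
re.sub - replaces the text that matches the pattern
re.match - matches the pattern only at the beginning of the string
re.search - scans the string and returns the first location where the pattern matches

This article relies mainly on re.findall to detect patterns.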
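To make the difference between the four functions concrete, here is a minimal sketch that runs the same digit pattern through each of them. The sample sentence is an assumption made up for this illustration.

```python
import re

text = "NLP 101: regex finds 3 numbers, 42 among them"

# findall returns every non-overlapping match as a list of strings
all_numbers = re.findall(r"\d+", text)

# sub replaces every match with the replacement string
masked = re.sub(r"\d+", "#", text)

# match succeeds only if the pattern matches at the very start of the string;
# the text starts with "NLP", so this is None
starts_with_number = re.match(r"\d+", text)

# search scans the whole string and returns the first match object
first_number = re.search(r"\d+", text).group()

print(all_numbers, masked, starts_with_number, first_number)
```

Both re.match and re.search return a match object or None, so the result should be checked before calling .group() on text that may not contain a match.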

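Code Examples

The examples below apply regular expressions to typical NLP tasks. They assume the re module has been imported; the emoji example also relies on the third-party emoji package.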


        import re

        def find_url(string):
            # Match http:// or https:// followed by URL characters or %XX escapes
            urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
            return "".join(urls)

        example = "I love spending time at https://www.leetcode.com/"
        find_url(example)

Output: 'https://www.leetcode.com/'


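        import emoji  # third-party package: pip install emoji

        def findEmoji(text):
            # Convert each emoji to its :name: alias, then capture the names
            emo_text = emoji.demojize(text)
            return re.findall(r':(.*?):', emo_text)

        example = "I love ⚽ very much 😁"
        findEmoji(example)

Output: ['soccer_ball', 'beaming_face_with_smiling_eyes']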


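        def findEmail(text):
            # One or more word characters, dots, or hyphens on each side of "@"
            line = re.findall(r'[\w.-]+@[\w.-]+', str(text))
            return ",".join(line)

        # Placeholder address: the original one is obscured in the source page
        example = "Gaurav's gmail is gaurav@example.com"
        findEmail(example)

Output: 'gaurav@example.com'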


        def findHash(text):
            # The lookbehind (?<=#) keeps the word after "#" without the "#" itself
            line = re.findall(r'(?<=#)\w+', text)
            return " ".join(line)

        example = "#Sushant is trending now in the world"
        findHash(example)

Output: 'Sushant'


        def findAt(text):
            # The lookbehind (?<=@) keeps the user name after "@" without the "@" itself
            line = re.findall(r'(?<=@)\w+', text)
            return " ".join(line)

        example = "@Ajit, please help me"
        findAt(example)

Output: 'Ajit'


        def findNumber(text):
            # Every run of consecutive digits
            line = re.findall(r'[0-9]+', text)
            return " ".join(line)

        example = "8853147 sq. km of area washed away in floods"
        findNumber(example)

Output: '8853147'


        def findPhoneNumber(text):
            # Exactly ten consecutive digits
            line = re.findall(r"\d{10}", text)
            return "".join(line)

        findPhoneNumber("9990001796 is a phone number of PMO office")

Output: '9990001796'
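The pattern \d{10} also matches the first ten digits of any longer number, such as an order ID. One way to avoid that, sketched here under the assumption that a phone number is exactly ten digits with no separators, is to add negative lookarounds. The function name find_phone_numbers and the sample text are hypothetical.

```python
import re

def find_phone_numbers(text):
    # (?<!\d) and (?!\d) reject candidates that are part of a longer digit run
    return re.findall(r"(?<!\d)\d{10}(?!\d)", text)

sample = "Call 9990001796, order id 123456789012345"
print(find_phone_numbers(sample))   # ['9990001796']
print(re.findall(r"\d{10}", sample))  # ['9990001796', '1234567890']
```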


        def findNonalp(text):
            # Any character that is not a letter, a digit, or a space
            line = re.findall("[^A-Za-z0-9 ]", text)
            return line

        example = "Twitter has lots of @ and # in posts.(2021 year is not good)"
        findNonalp(example)

Output: ['@', '#', '.', '(', ')']


        def findYear(text):
            # Any four consecutive digits
            line = re.findall(r"\d{4}", text)
            return line

        example = "My DOB year is 1998."
        findYear(example)

Output: ['1998']


        def find_punct(text):
            # Triple quotes let the character class contain both ' and "
            return re.findall(r"""[!"#$%&'()*+,\-./:;=@?\[\]^_`{|}~]""", text)

        example = "Corona virus killed #24506 people. #Corona is un(tolerable)"
        print(find_punct(example))

Output: ['#', '.', '#', '(', ')']


        def rep(match):
            # Keep only the first character of a repeated run
            return match.group(0)[0]

        def unique_char(rep, sentence):
            # (\w)\1+ matches a word character followed by one or more copies of itself
            return re.sub(r'(\w)\1+', rep, sentence)

        example = "heyyy this is a verrrry loong texttt"
        unique_char(rep, example)

Output: 'hey this is a very long text'
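The pattern (\w)\1+ also shortens legitimate double letters, so "good book" would become "god bok". One possible variation, sketched below, only touches runs of three or more identical characters and leaves two of them. The helper name collapse_long_runs is hypothetical, and the trade-off is that a stretched word such as "heyyy" comes out as "heyy" rather than "hey".

```python
import re

def collapse_long_runs(sentence):
    # Shrink runs of three or more identical word characters to exactly two
    return re.sub(r"(\w)\1{2,}", r"\1\1", sentence)

print(collapse_long_runs("heyyy that is a goooood book"))
# heyy that is a good book
```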


        def num_great(text):
            # Three-digit numbers from 930 to 999, or any number with four or more digits
            line = re.findall(r'9[3-9][0-9]|[1-9]\d{3,}', text)
            return " ".join(line)

        example = "Height of this bridge is 935m. Width of this bridge is 30 metre. It used 9274kg of steel."
        num_great(example)

Output: '935 9274'


        def num_less(text):
            # Check each whitespace-separated token; keep whole numbers from 0 to 929
            matches = []
            for token in text.split():
                matches.extend(re.findall(r'^(9[0-2][0-9]|[1-8][0-9]{2}|[1-9][0-9]|[0-9])$', token))
            return " ".join(matches)

        example = "There are some countries where less than 920 cases exist with 1100 observations"
        num_less(example)

Output: '920'


        def findDates(text):
            # mm/dd/yyyy with the month limited to 1-12 and the day to 1-31
            line = re.findall(r'\b(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[12]\d|3[01])/\d{4}\b', text)
            return line

        example = "Today's date is 06/21/2021 for format mm/dd/yyyy, not 31/09/2020"
        findDates(example)

Output: ['06/21/2021']
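A regular expression can check the shape of a date but not whether the day exists in that month (02/30/2021, for instance, has a plausible shape). One common approach, sketched here as an assumption rather than the article's own method, is to let a loose pattern propose candidates and have datetime.strptime from the standard library decide. The function name find_valid_dates is hypothetical.

```python
import re
from datetime import datetime

def find_valid_dates(text):
    # Loose shape check: one or two digits, one or two digits, four digits
    candidates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text)
    valid = []
    for candidate in candidates:
        try:
            # strptime raises ValueError for impossible mm/dd/yyyy dates
            datetime.strptime(candidate, "%m/%d/%Y")
        except ValueError:
            continue
        valid.append(candidate)
    return valid

sample = "Valid: 06/21/2021. Invalid: 31/09/2020 and 02/30/2021."
print(find_valid_dates(sample))   # ['06/21/2021']
```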


        def onlyWords(text):
            # [^\d\W] is a word character that is not a digit, so numbers are skipped
            line = re.findall(r'\b[^\d\W]+\b', text)
            return " ".join(line)

        example = "Harish reduced his weight from 100 Kg to 75 kg."
        onlyWords(example)

Output: 'Harish reduced his weight from Kg to kg'


        def only_numbers(text):
            # Every run of consecutive digits
            line = re.findall(r'\d+', text)
            return " ".join(line)

        example = "Harish reduced his weight from 100 Kg to 75 kg."
        only_numbers(example)

Output: '100 75'


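        def pick_only_key_sentence(text, keyword):
            # A sentence is any run of non-period characters containing the keyword;
            # re.escape protects keywords that contain regex metacharacters
            line = re.findall(r'[^.]*' + re.escape(keyword) + r'[^.]*', text)
            return [sentence.strip() for sentence in line]

        example = "People are fighting with covid these days. Economy has fallen down. How will we survive covid"
        pick_only_key_sentence(example, 'covid')

Output: ['People are fighting with covid these days', 'How will we survive covid']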


        def find_capital(text):
            # Words that begin with an uppercase letter
            line = re.findall(r'\b[A-Z]\w*', text)
            return line

        example = "Ajit Doval is the best National Security Advisor so far."
        find_capital(example)

Output: ['Ajit', 'Doval', 'National', 'Security', 'Advisor']


        def remove_tag(string):
            # <.*?> matches the shortest run between "<" and ">", one tag at a time
            text = re.sub('<.*?>', '', string)
            return text

        example = "Markdown sentences use <br> for breaks and <i></i> for italics"
        remove_tag(example)

Output: 'Markdown sentences use  for breaks and  for italics'


        def ip_add(string):
            # Four groups of one to three digits separated by dots
            text = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', string)
            return text

        example = "My public IP address is 165.19.120.1"
        ip_add(example)

Output: ['165.19.120.1']
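The pattern \d{1,3} accepts any number up to 999, so a string such as 999.300.1.1 would also be returned. A possible refinement, sketched below, keeps the regular expression for finding candidates and hands the range check to the standard ipaddress module. The function name find_valid_ipv4 and the sample text are hypothetical.

```python
import ipaddress
import re

def find_valid_ipv4(text):
    # Shape check only: four dot-separated groups of one to three digits
    candidates = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", text)
    valid = []
    for candidate in candidates:
        try:
            # Raises a ValueError subclass when an octet is out of range
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        valid.append(candidate)
    return valid

sample = "Servers: 165.19.120.1 and 999.300.1.1"
print(find_valid_ipv4(sample))   # ['165.19.120.1']
```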


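        def mac_add(string):
            # Six colon-separated pairs of hexadecimal digits
            text = re.findall(r'(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}', string)
            return text

        # The last pair is a placeholder: the source masks it as "xx", which is not hexadecimal and cannot match
        example = "MAC ADDRESSES of this TOSHIBA laptop is 00:21:27:b1:aa:0f."
        mac_add(example)

Output: ['00:21:27:b1:aa:0f']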


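        def validPan(string):
            # A PAN is five uppercase letters, four digits, then one uppercase letter
            if re.fullmatch(r'[A-Z]{5}\d{4}[A-Z]', string):
                print("{} is valid PAN number".format(string))
            else:
                print("{} is not a valid PAN number".format(string))

        validPan("ABCED3193P")
        validPan("lEcGD012eg")

Output:
ABCED3193P is valid PAN number
lEcGD012eg is not a valid PAN number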


        def find_percent(string):
            # Capture the number in front of "%"; only the captured group is returned
            text = re.findall(r'(100|\d{1,2})%', string)
            return text

        example = "COVID recovery rate is now 76%. But death rate is 4%"
        find_percent(example)

Output: ['76', '4']


        def find_files(string):
            # Each match is a (name, extension) tuple; join the two parts with a dot
            text = re.findall(r'([a-zA-Z0-9_]+)\.(jpg|png|gif|jpeg|pdf|ipynb|py)', string)
            return ['.'.join(parts) for parts in text]

        example = "This image file name is cheatsheet.png. Titanic.py file is most common among beginners."
        find_files(example)

Output: ['cheatsheet.png', 'Titanic.py']
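When the same patterns are applied to many documents, they can be compiled once with re.compile and reused. The sketch below groups three of the patterns used earlier into one helper; the PATTERNS dictionary, the extract_entities function, and the sample sentence are hypothetical names and data introduced only for this illustration.

```python
import re

# Compile each pattern once so it can be reused across many texts
PATTERNS = {
    "hashtags": re.compile(r"(?<=#)\w+"),
    "mentions": re.compile(r"(?<=@)\w+"),
    "numbers": re.compile(r"\d+"),
}

def extract_entities(text):
    # Run every compiled pattern over the text and collect the matches by name
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

result = extract_entities("@Ajit says #Corona cases fell by 40 today")
print(result)
# {'hashtags': ['Corona'], 'mentions': ['Ajit'], 'numbers': ['40']}
```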


沪ICP备2024098111号-1

上海秋旦网络科技中心: Building 1, No. 8218 Jinda Highway, Fengxian District, Shanghai. Tel: 17898875485
