Understanding Search Engine Indexing

How do search engines scan the internet and return relevant results in such a short time? The answer lies in crawling and indexing. Crawlers are automated bots that look for new or updated pages and store key information about them, such as the URL, title, and keywords, for later use. Indexing then analyzes the data gathered by the crawlers, works out what each page is about from its key content, images, and video files, and stores that information in an index so it can be returned for search queries. So when a search engine is asked to search, it does not scan the entire internet; it only scans the URLs indexed in that second step.

Building a Search Engine Prototype

This article walks through building a small prototype of a search engine's indexing capability. It uses a dataset of tweets about #COVID and tries to index them against a search term. This article was published as part of the Data Science Blogathon.

import pandas as pd
from rank_bm25 import *

rank_bm25 is a simple Python package that can be used to index data (in this case, tweets) against a search query. It is built on the TF-IDF concept, i.e., term frequency (TF) and inverse document frequency (IDF). TF is the number of times the search term appears in a tweet, while IDF measures how important the search term is. Because TF treats every word as equally important, term frequency alone cannot be used to weight a word in a text; frequently occurring words need to be weighted down and rare words weighted up to reflect their relevance to a tweet.
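As a quick illustration of how this weighting plays out (a minimal sketch with a made-up three-tweet corpus, not the article's dataset), BM25 scores a rare term such as "vaccine" higher than a term that appears in every document:

from rank_bm25 import BM25Okapi

corpus = [
    "covid cases rise again",
    "covid vaccine rollout begins",
    "covid lockdown extended",
]
tokenized = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized)

# "covid" occurs in every document, so its IDF (and therefore its score) stays low;
# "vaccine" occurs in only one document, so that document scores clearly higher.
print(bm25.get_scores(["covid"]))
print(bm25.get_scores(["vaccine"]))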

Since this article does not cover the Twitter API, we start from an Excel-based feed and clean the text data through the key steps below to make the search more robust.
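As a starting point (a minimal sketch: the file name covid_tweets.xlsx is an assumption, and the tweet text is assumed to sit in a Text column, matching the df['Text'] reference used later), the feed can be loaded with pandas:

import pandas as pd

# Hypothetical file name for the Excel-based tweet feed
df = pd.read_excel("covid_tweets.xlsx")

# Tweet text assumed to be in the "Text" column, as used in the search step below
print(df['Text'].head())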

import pandas as pd
from rank_bm25 import *
import warnings
warnings.filterwarnings('ignore')

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

sentence = "Jack is a sharp minded fellow"
words = word_tokenize(sentence)
print(words)

Tokenization is the process of splitting a sentence into words so that each word can be considered independently.
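For reference, word_tokenize in the snippet above returns the tokens as a list:

['Jack', 'is', 'a', 'sharp', 'minded', 'fellow']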

import re

def spl_chars_removal(lst):
    # Replace anything that is not a letter with a space, tweet by tweet
    lst1 = list()
    for element in lst:
        cleaned = re.sub("[^a-zA-Z]", " ", element)
        lst1.append(cleaned)
    return lst1

This removes special characters from the tweets.
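For example (a small usage sketch with a made-up tweet), hashes, mentions, digits, and punctuation all become spaces:

sample = ["#Covid vaccine rollout @WHO, 2021!"]
print(spl_chars_removal(sample))  # letters survive; every other character is replaced by a space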

from nltk.tokenize import word_tokenize
from gensim.parsing.preprocessing import STOPWORDS

# Add custom stop words on top of gensim's default list
all_stopwords_gensim = STOPWORDS.union(set(['disease']))

def stopwords_removal_gensim_custom(lst):
    lst1 = list()
    for text in lst:
        text_tokens = word_tokenize(text)
        tokens_without_sw = [word for word in text_tokens if word not in all_stopwords_gensim]
        lst1.append(" ".join(tokens_without_sw))
    return lst1

Stop words are words that occur frequently in tweets (such as is, for, the). They carry no real significance because they do not help distinguish one tweet from another.
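Putting the two cleaning steps together (a sketch that assumes df was loaded as above, with the tweet text in the Text column), the cleaned list lst1 is what gets indexed in the search step further down:

lst = df['Text'].tolist()                        # raw tweet text
lst1 = spl_chars_removal(lst)                    # strip special characters
lst1 = stopwords_removal_gensim_custom(lst1)     # drop stop words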

Text normalization is the process of converting text into a canonical (standard) form. For example, the words "gooood" and "gud" can both be converted to "good", their canonical form. Another example is mapping nearly identical words such as "stopwords", "stop-words", and "stop words" to just "stopwords".
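As a minimal sketch of this idea (the mapping below is hand-picked for illustration, not a full normalizer):

# Hand-picked canonical forms; a real normalizer would rely on a much larger lexicon
norm_map = {"gooood": "good", "gud": "good", "stop-words": "stopwords", "stop words": "stopwords"}

def normalize(text):
    # Replace each known variant with its canonical form
    for variant, canonical in norm_map.items():
        text = text.replace(variant, canonical)
    return text

print(normalize("gud stop-words guide"))  # -> good stopwords guide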

import nltk
from nltk.stem import PorterStemmer

ps = PorterStemmer()
sentence = "Machine Learning is cool"
for word in sentence.split():
    print(ps.stem(word))

Stemming is the process of reducing a word to its root form. It strips a word's inflections (e.g., troubled, troubles) down to its root (e.g., trouble). The "root" in this case may not be a real dictionary word, just a canonical form of the original word.
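Applied to the inflected forms mentioned above (a quick check using the PorterStemmer instance from the snippet), all three collapse to the same root, which is not itself a dictionary word:

for w in ["trouble", "troubled", "troubles"]:
    print(ps.stem(w))   # each prints "troubl"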

This is the core part of running a query. Here the tweets are searched against the word "vaccine". A user can also enter a phrase, since the search term is tokenized in the code below.

tokenized_corpus = [doc.split(" ") for doc in lst1]
bm25 = BM25Okapi(tokenized_corpus)

query = "vaccine"  ## enter the search query
tokenized_query = query.split(" ")

# Score every tweet against the search term
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)

# Pull the top 5 matching tweets and look them up in the original dataframe
docs = bm25.get_top_n(tokenized_query, lst1, n=5)
df_search = df[df['Text'].isin(docs)]
df_search.head()
  1. Tweet text: @MikeCarlton01 Re #ABC funding, looked up Budget Papers. After massive prior cuts, it got extra $4.7M in funding (.00044% far less than inflation).#Morrison wastes $Ms on over-priced & ineffective services eg useless #Covid app.; delivery vaccine #agedcare; consultancies vaccine roll-out..
  2. Tweet text: @TonyHWindsor @barriecassidy @4corners @abc730 For its invaluable work, #ABC got extra $4.7M in funding (.00044% far less than inflation).While #Morrison Govt spends like drunken sailor on buying over-priced & ineffective services from mates (eg useless #Covid app.; delivery vaccine #agedcare; vaccine roll-out) #auspol
  3. Tweet text: It’s going to be a month after my #Covid recovery. Now I will go vaccine 😎😎😎😎
  4. Tweet text: RT @pradeepkishan : What a despicable politician is #ArvindKejariwal ! The minute oxygen hoarding came to light his propaganda shifted to vaccine shortage. He is more dangerous than #COVID itself! @BJP4India @TajinderBagga
  5. Tweet text: RT @AlexBerenson : TL: DR – In the @pfizer teen #Covid vaccine trial, 4 or 5 (the exact figure is hidden) of 1,100 kids who got the vaccine had serious side effects, compared to 1 who got placebo.@US_FDA did not disclose specifics, so we have no idea what they were or if they follow any pattern. https://t.co/n5igf2xXFN