Search Engine Indexing Explained

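How does a search engine scan the internet and return relevant results in so little time? The answer lies in crawling and indexing. In crawling, automated bots look for new or updated pages and store key information about them, such as the URL, title, and keywords, for later use. Indexing then analyzes the data the crawlers collected, works out what each page is about from its key content, images, and video files, and stores that information in an index so it can be returned at query time. So when you ask a search engine to search, it does not scan the whole internet; it scans only the URLs that were indexed in the second step.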
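
To make the "search the index, not the internet" idea concrete, here is a minimal sketch of an inverted index in plain Python. The page URLs and texts are made up for illustration, and `build_index` and `search` are hypothetical helper names; real search engines use far more elaborate structures.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches

# Hypothetical crawled pages (URL -> extracted text)
pages = {
    "https://example.com/a": "covid vaccine rollout begins",
    "https://example.com/b": "covid cases rise",
}
index = build_index(pages)
hits = search(index, "covid vaccine")
```

A query touches only the dictionary entries for its own words, which is why lookup stays fast however many pages were crawled.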

Developing a Search Engine Prototype

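This article shows how to develop a small prototype that mimics a search engine's indexing feature. It uses a dataset of tweets about #COVID and tries to index them by search term. This article is part of the Data Science Blogathon.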


import pandas as pd
from rank_bm25 import BM25Okapi

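rank_bm25 is a simple Python package for indexing data (tweets, in this example) against a search query using the BM25 ranking function. BM25 builds on the TF-IDF idea, that is, term frequency (TF) and inverse document frequency (IDF). TF is the number of times the search term occurs in a tweet, while IDF measures how important the search term is. Because TF treats every word as equally important, raw term frequency alone cannot serve as a word's weight in the text. Frequently occurring words need to be down-weighted and rare words up-weighted to reflect how relevant they are to a tweet.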
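
The weighting described above can be sketched by hand. The function below is a textbook-style BM25 variant written for illustration; it is not the rank_bm25 source, and its IDF formula and the defaults `k1=1.5` and `b=0.75` are assumptions, so its numbers may differ from what the package returns. The three-tweet corpus is made up.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against the query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a large IDF, common terms a small one
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization dampens the score of long documents
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical, already-cleaned tweets
corpus = [
    "covid vaccine rollout starts".split(),
    "covid cases rise again".split(),
    "covid lockdown extended".split(),
]
vaccine_scores = bm25_scores(["vaccine"], corpus)
covid_scores = bm25_scores(["covid"], corpus)
```

In this toy corpus "covid" appears in every tweet, so it contributes far less to the first tweet's score than "vaccine", which appears only once.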

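Since this article does not cover the Twitter API, it starts from an Excel-based feed and cleans the text data in the key steps below to make the search more robust.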


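import pandas as pd
from rank_bm25 import BM25Okapi
import warnings
warnings.filterwarnings('ignore')
import nltk
# Tokenizer data needed by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Split a sample sentence into individual word tokens
sentence = "Jack is a sharp minded fellow"
words = word_tokenize(sentence)
print(words)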

Tokenization is the process of splitting a sentence into words so that each word can be considered on its own.


import re

def spl_chars_removal(lst):
    """Replace every non-letter character in each string with a space."""
    lst1 = list()
    for element in lst:
        cleaned = re.sub("[^a-zA-Z]", " ", element)
        lst1.append(cleaned)
    return lst1

This function strips special characters (anything that is not a letter) from the tweets.
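
One side effect worth noting: replacing each special character with a space leaves runs of spaces behind, and splitting such a string on a single space yields empty tokens. The sketch below shows one possible way to tidy this up. `clean_tweet` is a hypothetical helper, and lowercasing is an added assumption that the original steps do not show.

```python
import re

def clean_tweet(text):
    """Keep letters only, lowercase, and squeeze runs of whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # str.split() with no argument drops empty strings, unlike split(" ")
    return " ".join(letters_only.lower().split())

raw = "#Covid-19 vaccine!"
# Splitting the uncollapsed string on a single space keeps empty tokens
naive_tokens = re.sub("[^a-zA-Z]", " ", raw).split(" ")
# After collapsing whitespace, only real words remain
clean_tokens = clean_tweet(raw).split(" ")
```

Empty tokens would otherwise count toward document length when the corpus is scored, so collapsing whitespace before tokenizing keeps the length statistics honest.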


from nltk.tokenize import word_tokenize
from gensim.parsing.preprocessing import STOPWORDS

# Add custom stop words
all_stopwords_gensim = STOPWORDS.union(set(['disease']))

def stopwprds_removal_gensim_custom(lst):
    lst1 = list()
    for text in lst:
        text_tokens = word_tokenize(text)
        tokens_without_sw = [word for word in text_tokens if word not in all_stopwords_gensim]
        str_t = " ".join(tokens_without_sw)
        lst1.append(str_t)
    return lst1

Stop words are words used frequently in tweets (such as is, for, and the). They carry little weight because they do not help tell two tweets apart.

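Text normalization is the process of converting text to a canonical (standard) form. For example, the words "gooood" and "gud" can both be converted to their canonical form, "good". Another example is mapping near-identical variants such as "stopwords", "stop-words", and "stop words" to "stopwords".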
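
A minimal sketch of such normalization, assuming a small hand-written lookup table. `NORMALIZATION_MAP` and `normalize` are hypothetical names, and the rule that squeezes repeated letters down to two is only a rough heuristic (it turns "sooo" into "soo", not "so").

```python
import re

# Hypothetical lookup table; a real project would need a much larger one
NORMALIZATION_MAP = {
    "gud": "good",
    "stop-words": "stopwords",
    "stop words": "stopwords",
}

def normalize(text):
    """Lowercase, squeeze long letter runs, then apply the lookup table."""
    text = text.lower()
    # Collapse three or more repeats of a letter to two: "gooood" -> "good"
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)
    for variant, canonical in NORMALIZATION_MAP.items():
        # Word boundaries keep the variant from matching inside longer words
        text = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, text)
    return text

samples = ["Gooood", "gud vaccine", "stop-words, stop words"]
normalized = [normalize(s) for s in samples]
```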


import nltk
from nltk.stem import PorterStemmer

ps = PorterStemmer()
sentence = "Machine Learning is cool"
# Print the stem of each whitespace-separated word
for word in sentence.split():
    print(ps.stem(word))

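Stemming is the process of reducing a word to its root form. It reduces inflected forms of a word (for example, troubled and troubles) to a root form (for example, trouble). The "root" here may not be a real root word, only a canonical form of the original word.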

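This is the core part, where the query is run. The tweets are searched for the word "vaccine". The user can also enter a phrase, because the search term is tokenized on the line right after the query is defined.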


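# lst1 is the list of cleaned tweet strings produced by the steps above
tokenized_corpus = [doc.split(" ") for doc in lst1]
bm25 = BM25Okapi(tokenized_corpus)
query = "vaccine"  # enter the search query
tokenized_query = query.split(" ")
# Score each tweet's relevance to the search term
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)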


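# Retrieve the five cleaned tweets with the highest scores
docs = bm25.get_top_n(tokenized_query, lst1, n=5)
# Matching on text assumes df['Text'] holds the same cleaned strings as lst1
df_search = df[df['Text'].isin(docs)]
df_search.head()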
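
Looking results up by text only works when the DataFrame column holds exactly the strings that were indexed. An alternative is to rank by position and map the positions back to the raw tweets. The sketch below uses plain lists with made-up tweets and scores, and `top_n_indices` is a hypothetical helper, not part of rank_bm25.

```python
def top_n_indices(scores, n=5):
    """Return the positions of the n highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Hypothetical raw tweets and their relevance scores, in the same order
raw_tweets = [
    "Now I will go get my #Covid vaccine!",
    "Lockdown extended again #Covid",
    "Vaccine roll-out delayed, vaccine supply short",
]
scores = [0.8, 0.0, 1.3]

best = top_n_indices(scores, n=2)
results = [raw_tweets[i] for i in best]
```

With a pandas DataFrame whose rows are in the same order as the scored corpus, the same positions could be passed to `df.iloc[best]` to pull the original rows.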

Tweet text: @MikeCarlton01 Re #ABC funding, looked up Budget Papers. After massive prior cuts, it got extra $4.7M in funding (.00044% far less than inflation).#Morrison wastes $Ms on over-priced & ineffective services eg useless #Covid app.; delivery vaccine #agedcare; consultancies vaccine roll-out..
Tweet text: @TonyHWindsor @barriecassidy @4corners @abc730 For its invaluable work, #ABC got extra $4.7M in funding (.00044% far less than inflation).While #Morrison Govt spends like drunken sailor on buying over-priced & ineffective services from mates (eg useless #Covid app.; delivery vaccine #agedcare; vaccine roll-out) #auspol
Tweet text: It’s going to be a month after my #Covid recovery. Now I will go vaccine 😎😎😎😎
Tweet text: RT @pradeepkishan : What a despicable politician is #ArvindKejariwal ! The minute oxygen hoarding came to light his propaganda shifted to vaccine shortage. He is more dangerous than #COVID itself! @BJP4India @TajinderBagga
Tweet text: RT @AlexBerenson : TL: DR – In the @pfizer teen #Covid vaccine trial, 4 or 5 (the exact figure is hidden) of 1,100 kids who got the vaccine had serious side effects, compared to 1 who got placebo.@US_FDA did not disclose specifics, so we have no idea what they were or if they follow any pattern. https://t.co/n5igf2xXFN

沪ICP备2024098111号-1

上海秋旦网络科技中心 (Shanghai Qiudan Network Technology Center): Building 1, No. 8218 Jinda Highway, Fengxian District, Shanghai. Phone: 17898875485