A Search Engine Ranking Algorithm Based on Word Relevance

Published: 2018-11-03 20:27

[Abstract]: The main task of a search engine is to collect information from the web and, given the query terms a user supplies, return links to relevant pages. As the Internet grows and the volume of information increases, crawling enough pages is not the hard part; the difficulty lies in organizing those pages, choosing a suitable ranking algorithm, and presenting the resulting links to the user. Current search engine ranking algorithms are mainly based on link structure, such as PageRank and HITS, and are combined with other algorithms to form improved ranking models that have proven effective in practice. Link-based ranking has its own shortcomings, however: it does little analysis of natural language and is therefore, to some extent, detached from how people understand language. This thesis therefore proposes a ranking algorithm based on word relevance. First, working from a large corpus, it statistically analyzes the co-occurrence rate of words within documents, the distance between words, and the information gain of words in the corpus, in order to obtain the words related to a keyword in the document set and to record their degree of relevance. Second, once the user's keywords are obtained from the search interface, the related words and their relevance values are weighted into the PageRank algorithm according to a certain weighting scheme, which influences the ranking of the pages. Because no complete search engine was implemented for this work, documents were retrieved through the existing search engine Google, re-ranked with the algorithm above, and compared against Google's ranking. The comparative experiments indicate that the proposed algorithm can mitigate the problems of ranking based on link structure. It also has limitations: first, the corpus covers a single topic and the experiments are small in scope; second, the time efficiency of the algorithm at retrieval time was not fully considered. The algorithm needs further improvement across broader domains and with more experimental analysis.
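The two steps described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names `related_words` and `rerank`, the `window` and `alpha` parameters, the inverse-distance weighting, and the linear blend with a precomputed PageRank score are all hypothetical. The information-gain term is omitted. This is not the thesis's actual method.

```python
from collections import defaultdict


def related_words(docs, keyword, window=5):
    """Score words related to `keyword` using co-occurrence and word distance.

    docs: list of token lists. Returns {word: score}, scaled so the
    strongest related word has score 1.0. (Assumed scoring, not the thesis's.)
    """
    scores = defaultdict(float)
    for tokens in docs:
        positions = [i for i, t in enumerate(tokens) if t == keyword]
        if not positions:
            continue  # no co-occurrence in this document
        best = {}
        for i, t in enumerate(tokens):
            if t == keyword:
                continue
            d = min(abs(i - p) for p in positions)  # distance to nearest keyword
            if d <= window:
                # closer words contribute more; keep the best hit per document
                best[t] = max(best.get(t, 0.0), 1.0 / d)
        for t, s in best.items():
            scores[t] += s  # accumulate over co-occurring documents
    if not scores:
        return {}
    top = max(scores.values())
    return {t: s / top for t, s in scores.items()}


def rerank(pages, keyword, related, alpha=0.5):
    """Blend a link-based score with a related-word text score.

    pages: list of (url, pagerank, tokens). `alpha` is the assumed weight
    given to the PageRank score; the rest goes to the text score.
    """
    weights = dict(related)
    weights[keyword] = 1.0  # the query term itself counts fully

    def text_score(tokens):
        if not tokens:
            return 0.0
        return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)

    scored = [
        (alpha * pr + (1 - alpha) * text_score(tokens), url)
        for url, pr, tokens in pages
    ]
    return [url for _, url in sorted(scored, reverse=True)]


corpus = [
    ["search", "engine", "ranking"],
    ["search", "engine", "index"],
    ["cooking", "recipe"],
]
pages = [
    ("a", 0.6, ["cooking", "recipe"]),
    ("b", 0.5, ["search", "engine", "ranking"]),
]
rel = related_words(corpus, "search")
order = rerank(pages, "search", rel)
```

In this toy data, page "a" has the higher link-based score but contains no words related to the query, so the blended score places "b" first. A real system would also need the information-gain term and the efficiency considerations the abstract lists as open issues.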
[Degree-granting institution]: Lanzhou University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3

[References]