A Search Engine Ranking Algorithm Based on Word Relevance
[Abstract]: The main task of a search engine is to collect information from the web and, given the keywords a user supplies, return links to pages related to those keywords. As the Internet grows and the volume of information increases, crawling enough web pages is no longer the hard part for a search engine. The difficulty lies in ordering those pages: choosing a suitable ranking algorithm and presenting the resulting links to the user. Current search engine ranking algorithms are mainly based on link structure, such as the PageRank and HITS algorithms, which are combined with other algorithms to form improved ranking models; in practice these produce very good search results. Link-based ranking nevertheless has shortcomings: its analysis of natural language is weak, and it is to some extent divorced from an understanding of the language. This paper therefore proposes a ranking algorithm based on word relevance. First, using a large corpus, the co-occurrence rate of words, the distance between words, and the information gain of words in the corpus are analyzed statistically to obtain the related words in the document set and to compute their degree of correlation. Second, once the user's keywords are obtained from the retrieval interface, the related words and their correlation values are incorporated as weights into the PageRank algorithm according to a weighting scheme, which changes the ranking of the web pages. Because this work does not include a complete search engine system, documents are retrieved with the existing search engine Google, re-ranked with the algorithm above, and compared with Google's own results. Comparative experiments indicate that the proposed algorithm can mitigate the problems of ranking based on link structure.
At the same time, the work has some shortcomings: first, the corpus covers a single subject and the scope of the experiments is small; second, the time efficiency of the retrieval algorithm is not well considered. The proposed algorithm needs further improvement on the basis of a wider range of fields and more experimental analysis.
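The abstract does not state the formulas used, so the following is only a minimal Python sketch of the two-step idea under stated assumptions, not the thesis's actual method. It assumes that relatedness is a co-occurrence count weighted by the inverse of word distance inside a fixed window, that information gain is left out for brevity, and that a precomputed link-based score (standing in for PageRank) is multiplied by a boost derived from related-word frequency. The function names and the parameters `window`, `top_k`, and `alpha` are all hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_scores(docs, window=4):
    """Accumulate distance-weighted co-occurrence scores for word pairs.

    docs is a list of token lists. Two different words that appear within
    `window` positions of each other contribute 1/distance to the pair's
    score, so closer pairs count more (an assumed weighting).
    """
    scores = defaultdict(float)
    for tokens in docs:
        for i, left in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                right = tokens[j]
                if left == right:
                    continue
                weight = 1.0 / (j - i)
                scores[(left, right)] += weight
                scores[(right, left)] += weight
    return scores


def related_words(scores, keyword, top_k=3):
    """Return up to top_k words related to keyword, scaled so the best is 1.0."""
    partners = {b: s for (a, b), s in scores.items() if a == keyword}
    if not partners:
        return {}
    best = max(partners.values())
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(partners.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {word: s / best for word, s in ranked}


def rerank(pages, related, alpha=0.5):
    """Re-sort pages by link score boosted by related-word frequency.

    pages maps a page id to (link_score, tokens). The combined score
    link_score * (1 + alpha * boost) is an assumed combination rule.
    """
    def combined(item):
        link_score, tokens = item[1]
        counts = Counter(tokens)
        total = len(tokens) or 1
        boost = sum(rel * counts[w] / total for w, rel in related.items())
        return link_score * (1.0 + alpha * boost)

    ordered = sorted(pages.items(), key=combined, reverse=True)
    return [page_id for page_id, _ in ordered]


# Toy data, invented purely for illustration.
corpus = [
    ["search", "engine", "ranking", "algorithm"],
    ["search", "engine", "index"],
    ["ranking", "algorithm", "pagerank"],
]
related = related_words(cooccurrence_scores(corpus), "search")
pages = {
    "a": (0.50, ["search", "news", "weather"]),
    "b": (0.48, ["search", "engine", "ranking"]),
}
print(rerank(pages, related))
```

In this toy data, page "b" has a slightly lower link score than page "a" but contains words related to the query "search", so the boost moves it ahead. A multiplicative boost is only one possible design; it keeps the link-based order unchanged for pages that contain no related words.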
[Degree-granting institution]: Lanzhou University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3