Research on Web Page Ranking Algorithms Based on Link Similarity

Published: 2019-03-02 08:36

[Abstract]: This thesis discusses web page ranking algorithms, with a focus on link analysis. It first introduces the basic principles of web page ranking and compares several commonly used ranking techniques. It then examines two typical link analysis algorithms, PageRank and HITS, and analyzes the strengths and weaknesses of each. The main drawback of PageRank is that it distributes a page's PageRank value evenly across all of its outgoing links. Because it takes little account of semantic information, it is easily influenced by irrelevant links, which leads to topic drift. This thesis designs a simple computational model to improve PageRank: on top of PageRank's even distribution, the model takes link similarity into account and uses a naive Bayes model to evaluate it. Because the similarity between an outgoing link and its target page is considered, pages of little value (such as advertising pages) receive a smaller share of PageRank, and genuinely valuable pages receive a larger share. Finally, the model is applied in a simulated search engine that includes almost all the functions of a real search engine. A number of users tested it in a real Internet environment to validate the algorithm. The results of this small-scale user test indicate that incorporating link similarity improves user satisfaction with the search results.
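The abstract does not give the formula of the improved model, so the following is only a minimal sketch of the general idea, under stated assumptions: each outgoing link carries a non-negative similarity score (in the thesis this would come from the naive Bayes evaluation; here it is simply supplied as input), and a page's rank is split across its outgoing links in proportion to those scores instead of evenly. The function name `similarity_weighted_pagerank`, the damping factor of 0.85, and the handling of pages without outgoing links are illustrative choices, not details taken from the thesis.

```python
from typing import Dict, Hashable

# Hypothetical input format: page -> {target page: similarity score >= 0}.
Graph = Dict[Hashable, Dict[Hashable, float]]


def similarity_weighted_pagerank(
    links: Graph, damping: float = 0.85, iterations: int = 50
) -> Dict[Hashable, float]:
    """PageRank variant that splits rank in proportion to link similarity.

    Illustrative sketch only; the thesis's actual model may differ.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts with the random-jump share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, {})
            total = sum(targets.values())
            if total <= 0:
                # Assumption: a page with no usable out-links spreads
                # its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
                continue
            for target, similarity in targets.items():
                # Proportional split replaces PageRank's even split.
                new_rank[target] += damping * rank[page] * similarity / total
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy graph: "hub" links to a relevant article and to an advertisement.
    toy = {
        "hub": {"article": 0.9, "ad": 0.1},
        "article": {"hub": 1.0},
        "ad": {"hub": 1.0},
    }
    for page, score in sorted(similarity_weighted_pagerank(toy).items()):
        print(page, round(score, 4))
```

With equal similarity scores on every link, the proportional split reduces to the even split of standard PageRank, so the sketch can serve as a baseline as well as the weighted variant.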
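[Degree-Granting Institution]: Nanjing University of Science and Technology
[Degree Level]: Master's
[Year Degree Conferred]: 2008
[Classification Number]: TP391.3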

[Citing Literature]