Research on Web Data Mining Based on the PageRank Algorithm

Published: 2018-06-05 17:53

Topics: PageRank algorithm + web page similarity; Source: Tianjin University of Technology, 2017 master's thesis

[Abstract]: Given the enormous volume of data on the Internet, obtaining the information one needs has become a difficult research problem. The emergence of Web data mining as a discipline offers a way to address it. Web data mining comprises Web content mining, Web structure mining, and Web usage mining. The main algorithms in Web structure mining are PageRank and HITS. Because PageRank is more widely used than HITS and is also more efficient, this thesis studies the characteristics of the PageRank algorithm in Web structure mining and proposes improvements to it. The main contributions are as follows:
(1) To address PageRank's even distribution of PR values, this thesis proposes an improvement based on web page similarity. The link relationships between pages are treated as link vectors; each page is represented by its link vector, and the similarity between pages is computed from these vectors. PR value is then passed on according to the similarity between the current page and its in-link pages, replacing the even split used by the original PageRank algorithm. In an experimental comparison with PageRank, the improved algorithm achieves higher precision.
(2) To address PageRank's topic drift problem, this thesis proposes an improvement based on topic relevance. The underlying principle is that when a keyword is searched, a retrieval system can be considered accurate if it ranks results according to how relevant each page is to the user's needs. The thesis uses the mature probabilistic retrieval model BM25F to obtain the relevance between a web page and the query keywords. In an experimental comparison with PageRank and Topic-Sensitive PageRank, the improved algorithm shows a considerable improvement in page quality.
(3) To address PageRank's bias toward old pages, this thesis proposes an improvement based on web page update rate. Traditional PageRank considers only the link structure between pages and does not use time as an evaluation criterion. A new page, having existed only briefly, is far less likely to have been linked to by other pages, which puts it at a disadvantage. The improvement assumes that web page changes follow a Poisson process and computes each page's update rate with a Poisson distribution model. In an experimental comparison with PageRank, the improved algorithm raises the ranking of new pages.
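
The similarity-weighted propagation described in the first contribution can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's implementation: the abstract does not say how link vectors are built or how similarities are normalized. Here a page's link vector is assumed to be its out-link row followed by its in-link column, similarity is assumed to be cosine similarity, and a hypothetical smoothing term `eps` keeps every link's share positive. The names `cosine`, `link_vectors`, and `similarity_weighted_pagerank` are made up for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_vectors(adj):
    """Assumed representation: page i = its out-link row followed by its in-link column."""
    n = len(adj)
    return [list(adj[i]) + [adj[k][i] for k in range(n)] for i in range(n)]


def similarity_weighted_pagerank(adj, d=0.85, eps=0.1, tol=1e-10, max_iter=200):
    """PageRank in which page i splits its score among its out-links in
    proportion to (similarity + eps) instead of evenly.

    adj[i][j] is 1 if page i links to page j, else 0. eps must be > 0
    (assumed smoothing, so that links between dissimilar pages still pass some score).
    """
    n = len(adj)
    vecs = link_vectors(adj)

    # weights[i][j]: fraction of page i's score passed along the link i -> j.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        raw = {j: cosine(vecs[i], vecs[j]) + eps for j in range(n) if adj[i][j]}
        total = sum(raw.values())
        for j, r in raw.items():
            weights[i][j] = r / total

    pr = [1.0 / n] * n
    for _ in range(max_iter):
        # Pages without out-links spread their score evenly over all pages.
        dangling = sum(pr[i] for i in range(n) if not any(adj[i]))
        new = [
            (1 - d) / n
            + d * (dangling / n + sum(pr[i] * weights[i][j] for i in range(n)))
            for j in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new, pr)) < tol
        pr = new
        if converged:
            break
    return pr


if __name__ == "__main__":
    # Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2
    graph = [
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 0],
    ]
    for page, score in enumerate(similarity_weighted_pagerank(graph)):
        print(page, round(score, 4))
```

As `eps` grows, the weights approach the even split of the original PageRank algorithm, so the parameter controls how strongly similarity influences the ranking in this sketch.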
[Degree-granting institution]: Tianjin University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.09;TP311.13

[Similar Literature]