Research on Web Structure Mining Methods Based on Link Similarity Analysis

Published: 2018-08-24 08:41

[Abstract]: Web services and applications have developed rapidly in recent years. The volume of information on the Web is growing geometrically, and millions of pages are added every day. The Web has become a huge, widely distributed, global information service center covering education, government, e-commerce, news, advertising, consumer information, financial management, and many other information services. Web pages link to one another in a dense, tangled structure, forming a network that resembles human society. This thesis applies link similarity analysis to Web resource mining, with the aim of helping people obtain the information they need efficiently and find authoritative information in a given field. It studies four problems in Web structure mining: link prediction for Web pages, spam page identification, Web structure mining algorithms, and Web page clustering. First, a similarity-based multi-path walk link prediction algorithm is proposed. (1) A new decay factor is introduced and used to define a new similarity formula. (2) The Rubin algorithm is improved and combined with the new similarity formula to compute the similarity between nodes. (3) Nodes are ranked by similarity to judge how likely each link is, which yields the prediction result. (4) The algorithm is verified experimentally. Second, the concept of mutual-link similarity between pages is introduced, a hypothesis about the link structure of spam pages is stated, and a spam page identification algorithm based on clustering by mutual-link similarity is proposed. Because the algorithm takes into account the reciprocal links that arise between Web pages, it is better founded. Experiments confirm the hypothesis and demonstrate the effectiveness of the algorithm.
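The link prediction steps above (a decay factor, a path-based similarity, then ranking) can be illustrated with a minimal sketch. The abstract does not give the thesis's own decay factor, similarity formula, or improved Rubin algorithm, so this sketch swaps in the standard truncated Katz index, which sums walks between two nodes and discounts longer walks by `beta ** length`. The function names, the `beta` and `max_len` values, and the toy graph are all assumptions made for illustration.

```python
from itertools import combinations


def katz_scores(edges, beta=0.1, max_len=3):
    """Truncated Katz similarity: sum over l of beta**l * (#walks of length l)."""
    nodes = sorted({n for e in edges for n in e})
    adj = {n: set() for n in nodes}
    for u, v in edges:  # treat the graph as undirected for simplicity
        adj[u].add(v)
        adj[v].add(u)

    scores = {}
    for src in nodes:
        walks = {src: 1}  # walks[n] = number of walks of the current length src -> n
        for length in range(1, max_len + 1):
            nxt = {}
            for node, count in walks.items():
                for nb in adj[node]:
                    nxt[nb] = nxt.get(nb, 0) + count
            walks = nxt
            for dst, count in walks.items():
                if dst != src:
                    key = (src, dst)
                    scores[key] = scores.get(key, 0.0) + (beta ** length) * count
    return adj, scores


def predict_links(edges, top_k=3, beta=0.1, max_len=3):
    """Rank currently unlinked node pairs by similarity, highest first."""
    adj, scores = katz_scores(edges, beta, max_len)
    candidates = [
        (scores.get((u, v), 0.0), u, v)
        for u, v in combinations(sorted(adj), 2)
        if v not in adj[u]
    ]
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(u, v, s) for s, u, v in candidates[:top_k]]


# Toy graph (hypothetical data): a triangle a-b-c with a tail c-d-e.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
for u, v, s in predict_links(edges):
    print(u, v, round(s, 4))
```

On this toy graph the pairs a-d and b-d rank highest, followed by c-e, because each is joined by a short walk that the decay factor discounts least.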
Third, to address the "topic drift" of the PageRank algorithm and its bias toward old pages, an improved PageRank algorithm based on similarity and a time feedback factor is proposed. In the first step, the vector space model (VSM) is used to compute the similarity between a link's text and the page it points to. In the second step, a time feedback factor is designed from each page's creation time; it lowers the rank of old pages and raises the rank of new ones. In the third step, the similarity value and the time feedback factor are incorporated into the PageRank computation, and the improved PageRank value of each page is calculated following the algorithm flow. The performance of the algorithm is then analyzed experimentally. Fourth, existing heuristic clustering methods based on local information, from China and abroad, are surveyed and summarized. The label propagation method based on local information is studied in detail, and the problems caused by using a random strategy to choose a node's cluster during iteration are analyzed. An improved clustering method that targets this random selection problem is then proposed: a label propagation algorithm based on node attribute similarity. Finally, to help discover grouped information resources on the Internet efficiently, experiments verify the effectiveness and performance of the algorithm, and it is applied to real Web page clustering. The thesis closes with conclusions and an outlook on future work.
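The three PageRank steps described above can be sketched as follows. The abstract does not state the thesis's actual formulas, so everything specific here is an assumption: the link weight is taken as the cosine similarity between the link text and the target page text (bag-of-words VSM) multiplied by an exponential time factor, and the names `cosine`, `time_factor`, `weighted_pagerank`, the `rate` and `eps` parameters, and the sample pages are hypothetical.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity of two texts as bag-of-words term-frequency vectors (VSM)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def time_factor(age_years, rate=0.2):
    """Hypothetical time feedback factor: newer pages get a value closer to 1."""
    return math.exp(-rate * age_years)


def weighted_pagerank(pages, links, damping=0.85, iters=50, eps=0.01):
    """pages: {id: (text, age_years)}; links: [(src, dst, link_text)]."""
    n = len(pages)
    out = {p: {} for p in pages}
    for src, dst, link_text in links:
        # Assumed edge weight: link-text/target similarity times the target's
        # time factor. eps keeps every link usable even when similarity is 0.
        w = (cosine(link_text, pages[dst][0]) + eps) * time_factor(pages[dst][1])
        out[src][dst] = out[src].get(dst, 0.0) + w

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not out[p])
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in out.items():
            total = sum(targets.values())
            for dst, w in targets.items():
                new[dst] += damping * rank[src] * w / total
        rank = new
    return rank


# Hypothetical sample: one hub links to three pages with the same link text.
pages = {
    "hub": ("links to resources", 1),
    "new": ("python graph mining tutorial", 0),
    "old": ("python graph mining tutorial", 10),
    "off": ("cheap watches sale", 0),
}
links = [
    ("hub", "new", "graph mining"),
    ("hub", "old", "graph mining"),
    ("hub", "off", "graph mining"),
]
rank = weighted_pagerank(pages, links)
print(sorted(rank, key=rank.get, reverse=True))
```

In this sample the on-topic new page outranks the identical but older page, and both outrank the off-topic page, which is the qualitative behavior the abstract attributes to the similarity and time feedback terms.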
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Doctoral
[Year degree granted]: 2012
[Classification number]: TP393.092;TP311.13
