
Research on Web Structure Mining Methods Based on Link Similarity Analysis

Published: 2018-08-24 08:41
[Abstract]: Web services and applications have developed rapidly in recent years. The volume of information on the Web is growing geometrically, with millions of pages added every day. The Web has become a huge, widely distributed, global information service center covering education, government, e-commerce, news, advertising, consumer information, financial management, and many other information services. Web pages link to one another in intricate ways and form a network that resembles human society. This thesis applies link similarity analysis to Web resource mining, with the aim of helping people obtain the information they need efficiently and find authoritative information in a given field.

The thesis studies four problems in Web structure mining: Web page link prediction, spam page identification, Web structure mining algorithms, and Web page clustering.

First, a similarity-based multi-path walk link prediction algorithm is proposed. (1) A new decay factor is proposed and used to define a new similarity formula. (2) The Rubin algorithm is improved and combined with the new similarity formula to compute the similarity between nodes. (3) The node similarities are ranked to judge how likely each candidate link is, which yields the prediction result. (4) The algorithm is verified by experiments.

Second, the concept of mutual-link similarity between pages is introduced, a hypothesis about the link structure of spam pages is stated, and a spam page identification algorithm based on clustering by mutual-link similarity is proposed. Because the algorithm takes into account the reciprocal links that occur between web pages, it is more reasonable. Experiments confirm the hypothesis and demonstrate the effectiveness of the algorithm.
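The link prediction steps above (decay factor, similarity, ranking) can be illustrated with a minimal sketch. The abstract does not give the thesis's new decay factor or its improved Rubin algorithm, so this example substitutes a plain constant decay `beta` applied to walk counts of length 2 to `max_len` (a truncated Katz-style score). The function names, the parameters `beta`, `max_len`, and `top_k`, and the toy graph are all assumptions made for illustration.

```python
from itertools import combinations


def walk_counts(adj, source, max_len):
    """Return {length: {node: number of walks of that length from source}}."""
    counts = {0: {source: 1}}
    for length in range(1, max_len + 1):
        nxt = {}
        for node, n_walks in counts[length - 1].items():
            for nb in adj[node]:
                nxt[nb] = nxt.get(nb, 0) + n_walks
        counts[length] = nxt
    return counts


def predict_links(adj, beta=0.5, max_len=3, top_k=3):
    """Rank unlinked node pairs by a decayed multi-path walk similarity.

    Assumed stand-in score: sum over l = 2..max_len of beta**l * walks_l(u, v).
    This is NOT the thesis's formula, which the abstract does not state.
    """
    nodes = sorted(adj)
    per_source = {u: walk_counts(adj, u, max_len) for u in nodes}
    scores = {}
    for u, v in combinations(nodes, 2):
        if v in adj[u]:
            continue  # only score pairs that are not already linked
        score = sum(
            beta ** l * per_source[u][l].get(v, 0)
            for l in range(2, max_len + 1)
        )
        if score > 0:
            scores[(u, v)] = score
    # Highest similarity first; ties broken by node names for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical undirected toy graph stored as adjacency sets.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
ranked = predict_links(toy_graph)
```

Shorter walks contribute more because `beta < 1`, so pairs joined by many short paths rise to the top of the ranking; the highest-ranked pairs are the predicted links.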
Third, to address the "topic drift" of the PageRank algorithm and its bias toward old web pages, an improved PageRank algorithm based on similarity and a time feedback factor is proposed. In the first step, the vector space model (VSM) is used to compute the similarity between a link's text and the page it points to. In the second step, a time feedback factor is designed from each page's creation time; it lowers the rank of old pages and raises the rank of new ones. In the third step, the similarity values and the time feedback factor are incorporated into the PageRank computation, and the improved PageRank values are calculated following the algorithm's procedure. The performance of the algorithm is then analyzed through experiments.

Fourth, existing heuristic clustering methods based on local information, from China and abroad, are surveyed and summarized. The label propagation method based on local information is studied in detail, and the problems caused by using a random strategy to choose the cluster of a node during iteration are analyzed. An improved clustering method that targets this random selection problem is then proposed: a label propagation algorithm based on node attribute similarity. Finally, to help discover grouped information resources on the Internet efficiently, the effectiveness and performance of the algorithm are verified by experiments, and the algorithm is applied to real web page clustering. The thesis closes with conclusions and an outlook on future work.
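The three PageRank steps described above can be sketched as follows. The abstract does not state the actual formulas, so everything concrete here is an assumption: link text and page text are compared with bag-of-words cosine similarity, the time feedback factor is modeled as `exp(-rate * age_years)`, and each out-link is weighted by `(eps + similarity) * time_factor` before normalization. The names `cosine`, `time_factor`, `improved_pagerank`, and their parameters are hypothetical.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Step 1: cosine similarity of two texts as term-frequency vectors (VSM)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def time_factor(age_years, rate=0.3):
    """Step 2 (assumed form): newer pages get a factor closer to 1."""
    return math.exp(-rate * age_years)


def improved_pagerank(pages, links, damping=0.85, eps=0.1, iters=50):
    """Step 3: power iteration over similarity- and time-weighted links.

    pages: {page_id: (page_text, age_years)}
    links: list of (src_id, dst_id, link_text)
    """
    n = len(pages)
    weights = {p: {} for p in pages}
    for src, dst, link_text in links:
        # eps keeps every link weight positive even when no terms overlap.
        w = (eps + cosine(link_text, pages[dst][0])) * time_factor(pages[dst][1])
        weights[src][dst] = weights[src].get(dst, 0.0) + w

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not weights[p])
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for src, out in weights.items():
            total = sum(out.values())
            for dst, w in out.items():
                new[dst] += damping * rank[src] * w / total
        rank = new
    return rank
```

In this sketch, a link whose text matches the target page passes on more rank than an off-topic link, which is one way to limit topic drift, and two otherwise identical targets are separated by age, with the newer page ranked higher.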
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Doctoral
[Year degree awarded]: 2012
[Classification number]: TP393.092; TP311.13

