Design and Implementation of a Dynamic Adaptive Resource Collection System
Published: 2018-08-24 15:24
				
				
				
				
				
[Abstract]: The Internet offers a growing volume of valuable information, and people habitually retrieve it through search engines. The total number of web pages in China grew by nearly 41% from 2011 to 2012, which places higher demands on the web resource collection that search engines depend on. The number of pages is enormous, and dynamic pages in particular are growing rapidly. During collection, a crawler inevitably runs into abnormal conditions: slow server responses, large numbers of duplicate pages and invalid links, and link relationships between web resources that are hard to discover. This thesis focuses on solutions to these problems.
The main goal is to design and implement a resource collection system that not only adjusts dynamically and adapts automatically to abnormal conditions in a wide area network, but also discovers link relationships between pages from the information already collected and predicts more similar pages. The system uses real-time statistics gathered during collection as the basis for filtering links on the fly, with the aim of filtering out links with a high duplication rate, invalid links, and links that time out, and thereby improving collection efficiency. Compared with typical collection systems, it adapts better to unstable network conditions and handles large numbers of junk links better. For links that are hard to discover, the thesis proposes link analysis and prediction: predictions are made from an analysis of link statistics, which uncovers many similar pages, widens collection coverage, and compensates for the shortcomings of conventional link extraction.
The system is built on a distributed architecture. In addition to the basic modules for page downloading, page parsing, URL deduplication, and URL scheduling, it adds a real-time filtering module and a URL prediction module, along with auxiliary modules for statistics, URL clustering, and classification, which together give the system its dynamic adaptive behavior.
Tests show that the proposed method recognizes abnormal collection conditions as they occur and adjusts adaptively, which improves robustness and keeps the collection process stable. For links that are hard to discover, the system predicts them effectively, providing another effective way to find page links besides conventional extraction.
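The abstract does not give the details of the real-time filtering module, so the following is only a minimal sketch of the general idea: keep running per-host counters of duplicate, invalid, and timed-out fetches, and stop scheduling links from hosts whose failure rate is too high. The class names `HostStats` and `RealTimeFilter`, the per-host granularity, and the `max_bad_rate` and `min_samples` thresholds are all assumptions made for illustration, not the thesis's actual design.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlsplit

OUTCOMES = ("ok", "duplicate", "invalid", "timeout")


@dataclass
class HostStats:
    """Running fetch counters for one host (hypothetical structure)."""
    fetched: int = 0
    duplicate: int = 0
    invalid: int = 0
    timeout: int = 0

    def rate(self, count: int) -> float:
        """Share of this host's fetches that fall into one failure category."""
        return count / self.fetched if self.fetched else 0.0


class RealTimeFilter:
    """Rejects links from hosts whose crawl statistics so far look bad."""

    def __init__(self, max_bad_rate: float = 0.5, min_samples: int = 20):
        self.max_bad_rate = max_bad_rate  # assumed threshold, not from the thesis
        self.min_samples = min_samples    # wait for enough evidence before filtering
        self.stats = defaultdict(HostStats)

    def record(self, url: str, outcome: str) -> None:
        """Record the outcome of one fetch attempt."""
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        s = self.stats[urlsplit(url).netloc]
        s.fetched += 1
        if outcome != "ok":
            setattr(s, outcome, getattr(s, outcome) + 1)

    def allow(self, url: str) -> bool:
        """Return True if the link should still be scheduled for download."""
        s = self.stats[urlsplit(url).netloc]
        if s.fetched < self.min_samples:
            return True  # too little data to judge this host
        worst = max(s.rate(s.duplicate), s.rate(s.invalid), s.rate(s.timeout))
        return worst <= self.max_bad_rate


# Usage: three of four fetches from one host time out, so it gets filtered.
link_filter = RealTimeFilter(max_bad_rate=0.5, min_samples=4)
for outcome in ["timeout", "timeout", "timeout", "ok"]:
    link_filter.record("http://slow.example/page", outcome)
print(link_filter.allow("http://slow.example/other"))  # False
print(link_filter.allow("http://fresh.example/"))      # True
```

Counting per host is only one possible granularity; a system that clusters URLs, as the abstract mentions, could plausibly keep the same counters per URL cluster instead.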
[Degree-granting institution]: South China University of Technology
[Degree]: Master's
[Year conferred]: 2013
[Classification]: TP393.092; TP391.3
 
Article ID: 2201235
					
			
				
						
						
					
					
				
Article link: https://www.wllwen.com/kejilunwen/sousuoyinqinglunwen/2201235.html

