Research and Implementation of a Fixed-Address Web Crawler System Supporting AJAX

Published: 2018-01-27 16:32

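Keywords: AJAX; JavaScript; web crawler; data collection; fixed-address　Source: Beijing University of Posts and Telecommunications, 2013 master's thesis　Type: degree thesis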

[Abstract]: After the concept of Web 2.0 emerged, highly interactive web applications with a rich user experience, known as RIAs (rich Internet applications), appeared, such as blogs and microblogs. Because AJAX fits the needs of the Web 2.0 era, it is used more and more in web development. AJAX uses client-side JavaScript to modify the DOM structure dynamically, which allows pages to be rebuilt seamlessly and improves their interactivity, speed, and usability. At the same time, it changes the traditional web application model and breaks the crawling pattern of traditional crawlers, which depends on analyzing the hyperlinks in a page. Traditional crawlers therefore cannot collect the dynamic content of AJAX pages, which means that a large amount of meaningful data cannot be retrieved through search engines. To solve the problem of collecting dynamic data from AJAX websites, this thesis designs and implements a fixed-address web crawler system that supports AJAX. First, through a study of traditional web crawlers, it identifies the technical difficulties of AJAX crawling and, starting from a real AJAX website, describes the key problems traditional crawlers face on sites built with AJAX, together with the application scenarios of the research. Second, it introduces the related concepts and the problem model, and designs the system workflow and system architecture. Finally, through the analysis and design of the key problems in AJAX crawling, it implements a fixed-address web crawler system that supports AJAX. The system separates the two functions of a traditional crawler, URL extraction and page download, into two independent modules. The URL extraction module extracts a website's URLs to form a URL repository. A browser built on the WebKit rendering engine loads the HTML pages and interprets their JavaScript code. Combined with the JavaScript page-turning scripts produced by a script generator, the system identifies the page-navigation elements in the DOM representation of a page, automatically triggers the events on those elements, and generates and extracts the paginated content. The crawler collects only the pages that the link addresses in the URL repository point to. In other words, its crawling range is bounded entirely by the URL repository and is therefore controlled, which is what "fixed-address" means here. In addition, the recall, accuracy, and performance of the system were tested on six real websites of three types. The experimental results show that the recall of the system reached 100%, and that without page turning the average crawl rate reached 52.03 kb/s, which indicates good efficiency. The study shows that the system can accurately capture the dynamic content of AJAX websites and can collect paginated data from target pages with similar page structures. The system is flexible and widely applicable, and it can be used to build vertical search services and to collect open-source intelligence.
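The separation described in the abstract (a URL repository that bounds the crawl, plus a generated page-turning script run inside a browser) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. All names here (`UrlRepository`, `make_paging_script`, `crawl`, `Page`, `FakePage`) are hypothetical, the CSS selector for the navigation element is assumed to be supplied by the caller, and `FakePage` is only a stand-in for a real WebKit-based browser so that the sketch runs on its own.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


@dataclass
class UrlRepository:
    """Hypothetical URL repository filled by a separate URL extraction step."""
    urls: List[str] = field(default_factory=list)

    def add(self, url: str) -> None:
        # Keep insertion order and skip duplicates.
        if url not in self.urls:
            self.urls.append(url)


def make_paging_script(next_selector: str) -> str:
    """Generate JavaScript that clicks the page-navigation element.

    Assumption: the navigation element can be located with a CSS selector.
    The script evaluates to true if an element was found and clicked.
    """
    sel = json.dumps(next_selector)  # quote the selector as a JS string literal
    return (
        "(function () {"
        f" var el = document.querySelector({sel});"
        " if (!el) { return false; }"
        " el.click();"
        " return true;"
        " })();"
    )


class Page(Protocol):
    """Hypothetical interface a browser-backed page would expose."""
    def html(self) -> str: ...
    def run_script(self, script: str) -> bool: ...


def crawl(repo: UrlRepository,
          open_page: Callable[[str], Page],
          next_selector: str,
          max_pages: int = 10) -> Dict[str, List[str]]:
    """Visit only repository URLs; links inside pages are never followed."""
    script = make_paging_script(next_selector)
    results: Dict[str, List[str]] = {}
    for url in repo.urls:
        page = open_page(url)
        snapshots = [page.html()]
        # Trigger the navigation event until it fails or the cap is reached.
        while len(snapshots) < max_pages and page.run_script(script):
            snapshots.append(page.html())
        results[url] = snapshots
    return results


class FakePage:
    """Stand-in for a rendered page: each script run advances one page."""
    def __init__(self, pages: List[str]) -> None:
        self._pages = pages
        self._index = 0

    def html(self) -> str:
        return self._pages[self._index]

    def run_script(self, script: str) -> bool:
        if self._index + 1 < len(self._pages):
            self._index += 1
            return True
        return False
```

In this sketch the crawl range is exactly `repo.urls`, which mirrors the "fixed-address" constraint. With a real browser, `open_page` would load the URL and `run_script` would evaluate the generated JavaScript in the page and wait for the DOM to update before the next snapshot is taken.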
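[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3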

[References]