Research and Implementation of a Distributed Crawler System Supporting AJAX

Published: 2018-05-07 16:27

Topic: distributed crawler + AJAX; Reference: master's thesis, Huazhong University of Science and Technology, 2013

[Abstract]: Internet technology is advancing rapidly, new Internet products keep emerging, and AJAX is increasingly favored by developers. AJAX, however, is unfriendly to traditional web crawlers: content obtained with traditional page-fetching methods is incomplete. Studying a web crawler system that supports AJAX therefore has significant practical value. This thesis first surveys the state of research, in China and abroad, on fetching asynchronously loaded web pages, explains why such pages are hard to index, analyzes the strengths and weaknesses of current crawling approaches, and proposes fetching pages by issuing requests through a browser interface. Second, to improve crawling efficiency and coordinate resource allocation between the AJAX crawler and the static-page crawler, the thesis proposes a page-attribute classifier. The classifier uses the body-text extraction results from the page-processing module as feedback to correct its classifications, and each page is then crawled with the method that matches its class. Finally, to keep the distributed system running healthily, the system includes a heartbeat monitoring module that collects heartbeat information from the distributed system and computes statistics on system health. The AJAX-capable distributed crawler system studied and implemented in this thesis can index both asynchronously loaded pages and ordinary static pages and distributes crawl tasks efficiently, offering a new approach to crawling asynchronously loaded pages. System tests show that the expected functionality is achieved with good performance.
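The abstract does not say which features the page-attribute classifier uses, so the following is only a minimal sketch of the feedback loop it describes, under stated assumptions: pages are labeled per host, a page counts as incompletely fetched when the extracted body text is shorter than a threshold, and the fetchers and extractor are passed in as plain callables. The names `PageTypeClassifier`, `crawl`, and `min_text_chars` are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

STATIC, AJAX = "static", "ajax"


@dataclass
class PageTypeClassifier:
    """Per-host guess of whether pages need browser rendering (assumed design)."""

    min_text_chars: int = 200  # assumed threshold for a "complete" extraction
    host_labels: dict = field(default_factory=dict)

    def classify(self, url: str) -> str:
        # Unknown hosts default to the cheaper static fetcher.
        return self.host_labels.get(urlparse(url).netloc, STATIC)

    def feedback(self, url: str, used: str, extracted_text: str) -> str:
        """Correct the host label from the extraction result; return the new label."""
        host = urlparse(url).netloc
        if used == STATIC:
            too_short = len(extracted_text.strip()) < self.min_text_chars
            self.host_labels[host] = AJAX if too_short else STATIC
        return self.host_labels.get(host, STATIC)


def crawl(url, classifier, fetch_static, fetch_rendered, extract_text):
    """Fetch with the method chosen by the classifier, then feed the result back."""
    label = classifier.classify(url)
    fetch = fetch_rendered if label == AJAX else fetch_static
    text = extract_text(fetch(url))
    if classifier.feedback(url, label, text) != label:
        # The static fetch looked incomplete: retry once through the browser path.
        text = extract_text(fetch_rendered(url))
    return text
```

In a real deployment `fetch_rendered` would wrap whatever browser interface the system calls, and `fetch_static` an ordinary HTTP client. Once a host has been relabeled as `ajax`, later URLs on that host go straight to the rendered path, so the expensive fetcher is only used where the cheap one has already failed.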
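[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092;TP391.1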

[References]