Research and Implementation of a Vertical Crawler in a Hadoop-Based Distributed Environment

Published: 2018-06-10 12:32

Topic: Hadoop + ChainMapper/ChainReducer; Source: Beijing University of Posts and Telecommunications, 2017 master's thesis

[Abstract]: As demand for personalized information search grows, vertical crawlers avoid the main drawback of general-purpose crawlers, which crawl the whole web: a vertical crawler visits only user-specified sites and pages, which improves the efficiency and accuracy of information acquisition. With the explosive growth of data on the web, however, traditional single-machine vertical crawlers are far too slow for crawling at massive scale, and storing the resulting data is also a major challenge. At the same time, dynamic web page technology is now widely used, which makes pages much harder for crawlers to fetch. To address these two problems, this thesis proposes a distributed vertical crawler framework, improves a dynamic web page processing algorithm based on a state transition graph, and implements a distributed vertical crawler system that crawls mobile App information. The framework, ChainMR Crawler, designs the crawler modules on top of MapReduce's ChainMapper/ChainReducer, uses the Redis in-memory database to manage and store URLs and related data, and uses the distributed database HBase to store the feature content extracted from web pages. The dynamic web page processing algorithm improves on the original algorithm by judging page similarity on the main content block of a page and by selectively triggering only effective elements. It uses Selenium WebDriver to drive the headless browser PhantomJS, which triggers the events bound to page elements and downloads the content of dynamic pages. Finally, the distributed vertical crawler system for mobile App information is built on the ChainMR Crawler framework and the improved dynamic web page processing algorithm.
Experimental results show that ChainMR Crawler crawls 6% more efficiently than Nutch, which indicates that the framework performs well. The improved dynamic web page processing algorithm effectively reduces the number of events triggered on invalid elements and improves page relevance, which supports the efficiency of the improved algorithm. The mobile App information crawler system delivers the expected functionality, with relatively high crawling efficiency, good scalability, and good practicality.
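The chained-stage design in the abstract can be pictured as a sequence of small map steps, each consuming the key/value records of the previous one, with a shared URL store that rejects duplicates. The sketch below is only an illustration in plain Python under stated assumptions: it uses neither Hadoop, Redis, nor HBase, and `UrlFrontier`, `chain_mappers`, `fetch_mapper`, `parse_mapper`, `crawl`, and the in-memory `PAGES` dictionary are hypothetical names that do not come from the thesis.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, object]  # (key, value) pair, as in MapReduce
Mapper = Callable[[str, object], Iterable[Record]]

# Hypothetical in-memory "web" so the sketch runs without network access.
# Each body is "title|comma-separated links".
PAGES: Dict[str, str] = {
    "http://example.test/": "home|http://example.test/a,http://example.test/b",
    "http://example.test/a": "App A|http://example.test/b",
    "http://example.test/b": "App B|http://example.test/",
}


class UrlFrontier:
    """Stand-in for a Redis-backed URL store: FIFO queue plus a seen-set."""

    def __init__(self, seeds: Iterable[str]) -> None:
        self._queue: deque = deque()
        self._seen: set = set()
        for url in seeds:
            self.add(url)

    def add(self, url: str) -> bool:
        """Queue the URL unless it was seen before; return True if queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None


def fetch_mapper(url: str, _value: object) -> Iterable[Record]:
    """Stage 1: 'download' the page; unknown URLs produce no record."""
    body = PAGES.get(url)
    if body is not None:
        yield url, body


def parse_mapper(url: str, body: object) -> Iterable[Record]:
    """Stage 2: split the body into a title and its outgoing links."""
    title, _, links = str(body).partition("|")
    yield url, {"title": title, "links": [l for l in links.split(",") if l]}


def chain_mappers(mappers: List[Mapper], records: List[Record]) -> List[Record]:
    """Feed the output records of each mapper into the next one."""
    for mapper in mappers:
        records = [out for key, value in records for out in mapper(key, value)]
    return records


def crawl(seeds: Iterable[str]) -> Dict[str, str]:
    frontier = UrlFrontier(seeds)
    store: Dict[str, str] = {}  # stand-in for an HBase table keyed by URL
    while (url := frontier.pop()) is not None:
        stages = [fetch_mapper, parse_mapper]
        for key, page in chain_mappers(stages, [(url, None)]):
            store[key] = page["title"]
            for link in page["links"]:
                frontier.add(link)  # duplicates are silently dropped
    return store
```

Calling `crawl(["http://example.test/"])` visits each of the three toy pages once, even though they link to each other in a cycle, because the frontier refuses URLs it has already seen. In a real deployment the stages would presumably run as Hadoop tasks and the frontier would live in a shared store, which this single-process sketch does not attempt to model.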
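[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP393.092;TP311.56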

[Similar literature]