Research and Implementation of Distributed Web Crawler Technology Based on the Storm Cloud Platform

Published: 2018-01-06 21:06

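Keywords: Research and Implementation of Distributed Web Crawler Technology Based on the Storm Cloud Platform　Source: University of Electronic Science and Technology of China, 2015 master's thesis　Document type: degree thesis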

[Abstract]: With the rapid growth of the Internet, new business models such as O2O have moved online, so more and more sites are being created and the volume of information on the Internet keeps increasing. For people who want to find information quickly in this vast space, search engine technology is increasingly important. Web crawlers are a key component of search engines, and this growth poses new challenges for them. Traditional single-machine crawlers can no longer keep up with the rapidly growing volume of data to be fetched, which has led to distributed web crawlers. A distributed crawler divides the work across multiple machines, which raises crawl speed and so improves overall crawler performance. This thesis designs and implements a scalable distributed web crawler system based on Storm and uses the popular Sina Weibo platform as its data source. Specifically, the thesis completes the following work:
1. It analyzes the requirements of the distributed web crawler in four respects: the goals the system must achieve, the feasibility of the system, the functional requirements, and the performance requirements. The functional requirements analysis divides the system into six modules (simulated login, URL queue store, URL link optimization, page download, page parsing, and page storage) and describes the requirements of each module in detail.
2. It gives a detailed design of the crawler for Sina Weibo, including the database design and the system architecture design. The emphasis is on the overall architecture, with a detailed description of the design of each of the six modules.
3. It tests the implemented distributed crawler system in terms of both functionality and performance, and analyzes the test results.
4. It summarizes the thesis, discusses its remaining problems and shortcomings, and proposes directions for future research.
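To make the six-module split more concrete, the following is a minimal single-process sketch of how such stages could hand data to one another. It is not the thesis's code and does not use Storm; in a Storm topology, stages like these would typically run as spouts and bolts, but that mapping is an assumption here. All names (`simulated_login`, `optimize_url`, `crawl`, `fake_fetch`, `FAKE_SITE`) are hypothetical, the login is a stub, and an in-memory dictionary stands in for both the network and the storage layer.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Page-parsing module: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def simulated_login():
    """Simulated-login module (stub): returns headers with a placeholder session."""
    return {"Cookie": "session=PLACEHOLDER"}


def optimize_url(base, link):
    """URL-optimization module: resolve relative links and drop fragments."""
    absolute, _fragment = urldefrag(urljoin(base, link))
    return absolute


def crawl(seed, fetch, max_pages=10):
    """Run the stages in order; fetch(url, headers) returns HTML or None."""
    headers = simulated_login()
    queue = deque([seed])   # URL queue module
    seen = {seed}           # de-duplication set for the queue
    store = {}              # page-storage module (in-memory stand-in)
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url, headers)  # page-download module
        if html is None:
            continue  # skip pages that could not be downloaded
        store[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            candidate = optimize_url(url, link)
            if candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    return store


# Tiny in-memory "site" standing in for real pages; no network access.
FAKE_SITE = {
    "http://example.test/a": '<a href="/b">b</a> <a href="/b#top">b again</a>',
    "http://example.test/b": '<a href="/a">back</a> <a href="/missing">gone</a>',
}


def fake_fetch(url, headers):
    """Hypothetical downloader that reads from FAKE_SITE instead of the network."""
    return FAKE_SITE.get(url)


pages = crawl("http://example.test/a", fake_fetch)
print(sorted(pages))
```

Passing the downloader in as a function keeps the sketch runnable offline; a real crawler would replace `fake_fetch` with an HTTP client and the dictionary with a database, and a distributed version would need a shared URL queue rather than a local `deque`.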
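[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.3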

[References]