Research and Design of a Distributed Crawler Based on a Docker Cluster

Published: 2018-06-18 10:21

Topic: Docker + distributed crawler; Source: Zhejiang Sci-Tech University, 2017 master's thesis

[Abstract]: Since the government proposed the national big data strategy, Internet big data has become increasingly prominent as a strategic resource. Web crawlers, the main tool for mining that data, have grown in importance accordingly. Traditional crawlers, however, are built on VM clusters, where host resources are underused and the crawler system is hard to scale. The rise of Docker, an emerging virtualization technology, offers an opportunity to solve these problems of crawlers running in VM environments.

The study of a distributed crawler based on a Docker cluster covers two areas: distributed crawler technology and Docker cluster technology. Current open-source crawler frameworks differ in how well they support distribution; the Scrapy framework, for example, does not support distributed operation. Existing frameworks are also better suited to VM clusters, so they inherit the VM cluster's poor use of system resources. A Docker cluster is a newer virtualization cluster technology that uses host resources more sensibly and efficiently than a VM cluster. By studying open-source crawler architectures, this thesis designs and implements a crawler system that fully supports distribution and runs it on a Docker cluster. It also improves the crawler's URL deduplication algorithm by adopting a K-partition Bloom filter (K分型 Bloom filter), which deduplicates more effectively, and adapts it to distributed use.

The main work of this thesis is as follows:
(1) Study in depth how web crawlers work and the design patterns behind their overall architecture. Study in detail the orchestration and management tools for Docker clusters, including their working principles and their management and scheduling mechanisms. Study content deduplication algorithms and apply them to the distributed crawler system.
(2) By studying open-source crawler frameworks and the reasons they do not support distribution, design and implement distributed crawler modules suited to a Docker cluster, and combine these modules into a complete, efficient distributed crawler system. Use Kubernetes, a Docker cluster orchestration and management tool, to deploy and manage each functional module so that the system runs on a Docker cluster.
(3) Deploy the resulting distributed crawler on both a VM cluster and a Docker cluster and compare them experimentally at several levels, to show that on a Docker cluster the system crawls more efficiently, uses host resources more fully, and is easy to scale horizontally.
(4) Examine the principle of the classic Bloom filter and study its error probability. Improve the K-partition Bloom filter so that it meets the needs of distributed use, deduplicates better, and has a lower error probability. Experiments then show that the improved K-partition Bloom filter deduplicates more effectively.
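
The abstract builds its URL deduplication on the Bloom filter. As a point of reference, the following is a minimal sketch of a classic (single-array) Bloom filter used as a "seen URL" set. It is not the thesis's improved K-partition variant, whose details the abstract does not give. The class name `BloomFilter`, the helper `should_crawl`, the choice of SHA-256 with double hashing, and the sizing parameters are all assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Classic Bloom filter over one bit array (illustrative, not the thesis's variant)."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        ln2 = math.log(2)
        self.num_bits = max(1, math.ceil(-expected_items * math.log(false_positive_rate) / ln2**2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Double hashing (an assumed choice): derive k positions from one SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, non-zero stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))


def should_crawl(seen: BloomFilter, url: str) -> bool:
    """Return True the first time a URL is offered, False if it is probably a repeat."""
    if url in seen:
        return False
    seen.add(url)
    return True


if __name__ == "__main__":
    seen = BloomFilter(expected_items=1000, false_positive_rate=0.01)
    for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
        print(url, should_crawl(seen, url))
```

A Bloom filter never reports a seen URL as new, but it can occasionally report a new URL as seen; that false-positive rate is the "error probability" the abstract refers to. In a distributed crawler the bit array would presumably have to live in a store shared by all crawler nodes; the abstract does not say how the thesis arranges this.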
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.3

[References]