Research and Implementation of a Distributed Web Crawler

Published: 2018-06-15 04:46

  Topic: search engine + distributed; Source: Southeast University, master's thesis, 2017


[Abstract]: With the rapid development of Internet technology, the demand for Internet information in work and daily life keeps growing, and search engine technology has become correspondingly important. Search engines are now deeply woven into everyday life, and the web crawler is one of their most important components. The fetching capacity of a single-machine crawler can no longer satisfy the scale of today's Internet, which has driven the emergence of distributed crawling techniques. In a distributed system, multiple machines cooperate with an effective division of labor, which speeds up computation over very large data volumes and raises the crawler's fetch throughput; distributed storage likewise greatly improves the storage performance of the whole system. This thesis describes distributed web crawlers in detail and designs and implements one on the Hadoop platform to address the low speed and efficiency of single-machine crawlers. The main work is as follows:

(1) It introduces search engine technology, the working principles and key techniques of distributed web crawlers, and the architecture of the overall system, and it analyzes the implementation flow and principles of the key modules and how each module is realized with MapReduce (a minimal sketch appears below).

(2) For the page-fetching module, whose existing algorithms constrain both what is fetched and how fast, it proposes an optimized URL weight algorithm; since filtering and de-duplicating URLs after fetching is an equally critical step, the URL de-duplication strategy is also optimized. Together these changes resolve slow fetching and redundant content, and they considerably improve crawl speed and accuracy (illustrative sketches appear below).

(3) It sets up a test environment for the distributed system, designs test plans covering functional, performance, and scalability testing, and compares the URL weight algorithm and the de-duplication strategy before and after optimization.

In summary, this thesis designs and implements a distributed web crawler system that, to a certain extent, overcomes the low efficiency and poor scalability of single-machine crawlers and improves the speed and quality of information collection and page fetching.
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year conferred]: 2017
[CLC number]: TP391.3
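
The abstract states that each crawler module is realized with MapReduce but reproduces no code. As a minimal illustrative sketch (not the thesis's actual implementation), the following Hadoop job expresses URL de-duplication as one MapReduce pass: shuffling on the URL key routes every copy of a URL to the same reducer, which emits it exactly once. Class names and input/output paths are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hypothetical MapReduce job that de-duplicates a crawl frontier. */
public class UrlDedupJob {

    public static class UrlMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Each input line is assumed to hold one candidate URL.
            ctx.write(new Text(line.toString().trim()), NullWritable.get());
        }
    }

    public static class UrlReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text url, Iterable<NullWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            // All duplicates share the same key; emit the URL once.
            ctx.write(url, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "url-dedup");
        job.setJarByClass(UrlDedupJob.class);
        job.setMapperClass(UrlMapper.class);
        job.setReducerClass(UrlReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw frontier
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // de-duplicated frontier
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Expressing the step this way lets de-duplication scale with the cluster rather than with a single machine's memory, which is the general advantage the abstract claims for the Hadoop-based design.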
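The optimized URL weight algorithm itself is not reproduced in the abstract, so the sketch below only illustrates the mechanism a weighted frontier relies on: each candidate URL carries a score, and fetch order follows that score. The scoring formula used here (in-link count damped by crawl depth) is an assumption for illustration, not the thesis's formula.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical weighted URL frontier; the scoring blend is assumed. */
public class WeightedFrontier {

    static final class ScoredUrl {
        final String url;
        final int inLinks; // how many discovered pages link to this URL
        final int depth;   // link distance from the seed set

        ScoredUrl(String url, int inLinks, int depth) {
            this.url = url;
            this.inLinks = inLinks;
            this.depth = depth;
        }

        // Assumed scoring: prefer well-linked, shallow pages.
        double weight() {
            return inLinks / (1.0 + depth);
        }
    }

    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>(
            Comparator.comparingDouble(ScoredUrl::weight).reversed());

    public void offer(String url, int inLinks, int depth) {
        queue.offer(new ScoredUrl(url, inLinks, depth));
    }

    /** Highest-weight URL next, or null when the frontier is empty. */
    public String next() {
        ScoredUrl top = queue.poll();
        return top == null ? null : top.url;
    }

    public static void main(String[] args) {
        WeightedFrontier frontier = new WeightedFrontier();
        frontier.offer("https://example.com/", 120, 0);
        frontier.offer("https://example.com/deep/page", 3, 4);
        System.out.println(frontier.next()); // the well-linked seed page first
    }
}
```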
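For de-duplication inside a running fetcher (as opposed to the batch MapReduce pass above), a Bloom filter is a common memory-efficient choice in distributed crawlers. The abstract does not say which structure the thesis uses, so the following is a generic sketch with invented names.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Generic Bloom-filter "seen URL" test; a sketch, not the thesis's structure. */
public class UrlSeenFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public UrlSeenFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Marks the URL as seen; returns true only the first time it is offered
     *  (false positives are possible, false negatives are not). */
    public synchronized boolean addIfNew(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        // Two seeded base hashes combined per Kirsch-Mitzenmacher:
        // hash_i = h1 + i * h2 behaves like i independent hash functions.
        int h1 = seededHash(data, 0x9747b28c);
        int h2 = seededHash(data, 0x85ebca6b);
        boolean seenBefore = true;
        for (int i = 0; i < numHashes; i++) {
            int pos = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(pos)) {
                seenBefore = false;
                bits.set(pos);
            }
        }
        return !seenBefore;
    }

    // FNV-1a-style hash with a seed; adequate for a sketch.
    private static int seededHash(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h = (h ^ b) * 16777619;
        }
        return h;
    }

    public static void main(String[] args) {
        UrlSeenFilter filter = new UrlSeenFilter(1 << 20, 5);
        System.out.println(filter.addIfNew("https://example.com/a")); // true
        System.out.println(filter.addIfNew("https://example.com/a")); // false: duplicate
    }
}
```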

