Implementation of a Web Crawler Based on the Scrapy Framework and Analysis of Data Scraping

Published: 2018-03-14 19:42

Topic: web crawler　Focus: Scrapy　Source: Jilin University, 2017 master's thesis　Document type: degree thesis

[Abstract]: With the development of the information age and the spread of programming skills, search engines have become a necessity of daily life. Most search engines use crawler technology as a core module and return results for a user's query by keyword. However, the explosive growth of online information has made finding and locating information difficult. To address this problem, this thesis works in a Python and Scrapy environment and takes Sina Weibo as the crawl target. Building on a study and analysis of the principles, core modules, and execution flow of current crawler technology, it implements, as an exploratory effort, a web crawler based on the Scrapy framework that accomplishes data scraping and related goals. First, the thesis briefly presents the principles and current state of crawler technology, introduces several key techniques in crawler engineering, and gives particular attention to cookies and the Robots protocol, both of which strongly influenced this work. Second, it develops the crawler with Scrapy, an open-source crawler framework written in Python, points out the major role that NoSQL databases such as MongoDB play in metadata storage, and describes in detail the workflow and implementation details of building a crawler with Scrapy. Third, it discusses how the custom crawler implemented here handles key problems in crawler design: cookie replacement and User-Agent spoofing are used to get past site restrictions, while URL deduplication and multithreaded concurrency are handled by adopting and analyzing Scrapy's built-in solutions. Finally, the crawler is tested, the results are presented, and remaining problems and possible improvements are considered.
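The abstract mentions replacing cookies and spoofing the User-Agent to get past site restrictions, but it does not show how. The following is only a minimal sketch of that idea, not the thesis's actual code. It is written in the shape of a Scrapy downloader middleware (a plain class with a `process_request(request, spider)` method). The class name `RotateIdentityMiddleware`, the `USER_AGENTS` list, and the `COOKIE_POOL` entries are all hypothetical placeholders introduced here for illustration.

```python
import random

# Hypothetical pools (assumptions, not from the thesis). In practice the
# cookies would come from real logged-in sessions and the User-Agent
# strings from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 "
    "(KHTML, like Gecko) Version/10.1 Safari/603.1.30",
]
COOKIE_POOL = [
    {"session": "placeholder-account-1"},
    {"session": "placeholder-account-2"},
]


class RotateIdentityMiddleware:
    """Assign a randomly chosen User-Agent and cookie set to each request."""

    def __init__(self, user_agents=USER_AGENTS, cookie_pool=COOKIE_POOL):
        self.user_agents = list(user_agents)
        self.cookie_pool = list(cookie_pool)

    def process_request(self, request, spider):
        # Overwrite the User-Agent header so requests do not all look alike.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Swap in one cookie set from the pool (copied to avoid shared state).
        request.cookies = dict(random.choice(self.cookie_pool))
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

In a Scrapy project, a class like this would typically be enabled through the `DOWNLOADER_MIDDLEWARES` setting. Assuming the default middleware order, it would need a priority number below that of the built-in `CookiesMiddleware` so that the cookies it sets are picked up before the request is sent.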
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.3
