Research and Design of Parallelization for Topic-Focused Web Crawlers

Published: 2018-05-18 22:07

Topics: parallelization + crawler; source: Southwest Petroleum University, 2017 master's thesis

[Abstract]: With the spread of the mobile Internet, data is being generated ever faster and data volumes keep growing. The number of results returned by search engines can satisfy ordinary users, but it is not enough to support researchers' data analysis within a specific topic area. This thesis takes the acquisition of topic-specific information as its research problem and, driven by practical needs, studies the use of a topic-focused web crawler to collect relevant data from the Internet efficiently. It adopts cluster-based parallel processing and an improved web page similarity algorithm to collect pages and judge their topical relevance. The work has three parts: the working principles of crawlers and related background, the parallelization of the crawler, and the judgment of text topic relevance during data collection. First, because the crawler is an important component of a search engine, the thesis starts from search engines and the HTTP protocol that the Web follows, and then studies the crawler's collection workflow. Second, building on the ordinary crawler workflow and on commonly used search strategies, it proposes a multi-strategy fusion search algorithm that addresses the low efficiency of the original search and achieves a severalfold efficiency improvement. Next, because the scale of Internet data pushes crawlers toward parallel operation, suitable parallel frameworks are chosen according to the needs of each crawler component and the characteristics of the data: RabbitMQ for the multiple URL queues, the in-memory database Redis for URL deduplication, the parallel computing framework Storm for processing page data, and the distributed database MongoDB. Finally, the thesis proposes building a page's main content from a pruned content subtree centered on the title, and applies a discrimination algorithm that combines the vector space model with semantics to identify the page's topic, which improves the recognition rate for topic-relevant pages. The system architecture and each module were designed and implemented, and the system was tested with "big data" as the topic. The results show that the system can identify pages related to "big data" with a peak accuracy of 82%, and that parallelization improved its efficiency and stability. The system solves the problem of small and medium-sized crawlers autonomously collecting pages on a given topic, and the collected data is useful for subsequent analysis.
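
The two core steps named in the abstract, URL deduplication and vector-space topic matching, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the `Frontier` class is a hypothetical in-process stand-in (a plain `deque` and `set` where the thesis uses RabbitMQ and Redis), the relevance check covers only the plain vector space model without the semantic component, and the threshold of 0.3 is an arbitrary assumption.

```python
import math
from collections import Counter, deque


class Frontier:
    """Hypothetical in-process stand-in for the URL queue plus dedup store.

    The thesis uses RabbitMQ for the queues and Redis for deduplication;
    a deque and a set play those roles here for illustration only.
    """

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, url):
        """Enqueue a URL unless it was seen before; return True if enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_tokens, topic_tokens, threshold=0.3):
    """Treat a page as on-topic when its similarity reaches the threshold.

    The threshold value is an assumption, not a figure from the thesis.
    """
    score = cosine_similarity(Counter(page_tokens), Counter(topic_tokens))
    return score >= threshold
```

In this sketch a crawler worker would call `pop()` to get a URL, fetch and tokenize the page, keep it only if `is_relevant(...)` holds, and `push()` each outgoing link so that already-seen URLs are skipped.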
[Degree-granting institution]: Southwest Petroleum University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.3

[References]