Design and Implementation of a Distributed Crawling System for Heterogeneous Academic Resources

Published: 2019-04-02 20:22

[Abstract]: With the rapid expansion of academic information and the rapid development of Internet technology, academic resources on the web have in recent years become large in scale, fast-growing, and inconsistent in source and organizational structure, which makes them difficult to acquire. At the same time, our project team has been mining academic resources on the Internet to support academic modeling and academic recommendation, which places higher demands on acquiring massive, timely, and valid academic resource data. It is therefore particularly important to crawl academic resources quickly and efficiently from different academic resource search websites, extract the useful information, and build a unified academic resource database. The main work of this thesis includes studying web crawler technologies, the working principles of distributed computing, web page parsing methods, and mass data storage technologies. On this basis, and building on the distributed crawling framework Nutch, the thesis designs and implements a distributed crawling system for heterogeneous academic resources. This covers the design and implementation of parsing and storage for heterogeneous academic resource web pages; the overall structure, physical architecture, and storage structure of the Nutch-based distributed crawling system; and the methods and schemes used to extend Nutch. Detailed coding and system testing were then carried out according to this design. The system has been deployed and put into use in a laboratory environment.
Built on Nutch and Hadoop, the system addresses the slow crawling speed and poor scalability of single-machine crawling, increases the speed of academic resource collection, and expands its scale, providing academic data for the mining and study of academic resources.
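[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.52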
