Design and Implementation of a Network Information Collection System Based on Cluster Computing

Published: 2018-06-09 12:42

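Topic of this thesis: network information collection + bilingual network information update; source: Harbin Institute of Technology, 2012 master's thesis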

[Abstract]: As Web information technology has developed, network information collection technology has matured. As the foundation and an important component of many Web information services, it is widely used in search engines, machine translation, and other areas of natural language processing. Faced with the variety of information resources on the Internet, targeted network information collection systems continue to emerge and make obtaining network information far more convenient; at the same time, the massive growth of network information poses new challenges for acquiring it. For research on statistical machine translation, machine-aided translation, and translation knowledge acquisition, the task of network information collection is to discover, among massive numbers of Web pages, large websites that contain multilingual parallel page text, to collect the parallel page text from them, and to build a large-scale multilingual parallel corpus. This is the research goal of this thesis.
The thesis studies Hadoop, a distributed cluster computing framework for large-scale data processing, and on this basis designs and implements a configurable, extensible, Web-oriented distributed network information collection system. It also designs and implements an incremental update collection system that incrementally re-collects bilingual parallel web pages.
The thesis first introduces the research background and current state of development of network information collection systems, and surveys Hadoop, examining the design principles and architecture of its subsystem, the Hadoop Distributed File System (HDFS), and of its parallel computing model, MapReduce. It then analyzes the key techniques of network information collection, including URL deduplication, task scheduling, and the detection of web page updates, and on this basis designs and implements the Web-oriented distributed collection system and the incremental update collection system for bilingual websites. Finally, an analysis of the experimental results verifies that the Web-oriented distributed collection system is highly configurable, stable, and highly scalable and can collect large-scale multilingual web pages, and that the incremental update collection system can efficiently perform incremental update collection of pages from bilingual websites, thereby achieving the research goal of the project.
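The abstract names URL deduplication and web page update detection as key techniques but does not say how the thesis implements them. As a minimal single-machine sketch, assuming hash-based fingerprints (a common approach, not necessarily the author's), the two checks might look like the following; `normalize_url`, `fingerprint`, and `IncrementalFrontier` are hypothetical names introduced only for illustration.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case the scheme and host and drop the fragment,
    so that trivially different spellings of one URL compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def fingerprint(data: bytes) -> str:
    """Fixed-length digest used as a compact key (assumption: SHA-1 suffices here)."""
    return hashlib.sha1(data).hexdigest()


class IncrementalFrontier:
    """Hypothetical helper: remembers which URLs were queued and the
    last content digest seen for each fetched page."""

    def __init__(self) -> None:
        self.seen_urls: set[str] = set()
        self.page_digests: dict[str, str] = {}

    def add_url(self, url: str) -> bool:
        """Return True if the URL is new and should be scheduled for crawling."""
        key = fingerprint(normalize_url(url).encode("utf-8"))
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def has_changed(self, url: str, body: bytes) -> bool:
        """Return True if the page body differs from the previous crawl
        (or was never seen), and record the new digest."""
        key = normalize_url(url)
        digest = fingerprint(body)
        changed = self.page_digests.get(key) != digest
        self.page_digests[key] = digest
        return changed
```

In a Hadoop setting the same idea would typically be distributed, for example by using the URL fingerprint as the MapReduce key so that duplicates meet in one reducer; that mapping is likewise an assumption and is not described in the abstract.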
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP274.2; TP393.092

[References]