Research on Hadoop-Based Web Page Main-Text Extraction Technology

Published: 2018-07-17 07:51

[Abstract]: With the rapid development of Internet technology and the steady growth in the number of users, the volume of web page information has grown explosively, and Web information extraction has become an active research topic. The Web is now an important source of information for users, but because Web content changes dynamically, users often cannot quickly locate the main text of a page within this enormous body of online information. Filtering page noise quickly and accurately across a huge collection of Internet resources, and extracting the information that is useful to users, remains a difficult problem in the extraction field. The Hadoop-based method for Web page main-text extraction proposed in this thesis is one approach to this problem. The thesis studies how to keep main-text extraction both efficient and accurate when the input is a massive collection of Web pages. The work has two parts. In the first part, the thesis analyzes existing page segmentation methods based on visual information and improves the iterative separation process of the original algorithm, producing page information blocks that are semantically more complete and organizing them into a visual block tree. In the second part, it analyzes features of the page blocks such as style, content, and word frequency, and identifies the main-text blocks according to their importance. Building on this work and on an analysis of the structural characteristics of typical systems, a Hadoop-based Web page main-text extraction system is designed and implemented. The system is tested on data sources, and the experimental results show that the proposed extraction algorithm achieves good accuracy and high performance, and that the system handles extraction from massive numbers of web pages well. The proposed Hadoop-based extraction method offers a new approach for processing at massive data scale, and the distributed computing model addresses the performance problem well.
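
The second part of the abstract, identifying main-text blocks by importance, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the `Block` fields, the features used (text length, punctuation count, link density, font size), and all weights are assumptions chosen only for illustration, and `map_page` merely mimics the shape of a per-page map step rather than calling any Hadoop API.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical page block, e.g. one leaf of a visual block tree."""
    text: str         # visible text inside the block
    link_chars: int   # characters of that text that sit inside links
    font_size: float  # dominant font size in px (a style feature)


PUNCTUATION = set(",.;:!?，。；：！？")


def importance(block: Block, body_font_size: float = 14.0) -> float:
    """Score a block; higher means more likely to be main text.

    The formula and weights are illustrative assumptions.
    """
    length = len(block.text)
    if length == 0:
        return 0.0
    # Navigation bars and footers are mostly link text.
    link_density = min(block.link_chars / length, 1.0)
    # Running prose tends to contain sentence punctuation.
    punct = sum(ch in PUNCTUATION for ch in block.text)
    # Blocks whose font is far from the assumed body size are penalized.
    style = 1.0 if abs(block.font_size - body_font_size) <= 2.0 else 0.5
    return (length + 10 * punct) * (1.0 - link_density) * style


def extract_main_text(blocks):
    """Return the text of the highest-scoring block, or "" if none qualifies."""
    best = max(blocks, key=importance, default=None)
    if best is None or importance(best) == 0.0:
        return ""
    return best.text


def map_page(url, blocks):
    """Mapper-shaped wrapper: emits one (url, main_text) pair per page."""
    return url, extract_main_text(blocks)
```

Because each page is scored independently of every other page, a function shaped like `map_page` could in principle be run in parallel over a large crawl, which is the property a distributed framework such as Hadoop exploits.
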
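[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1;TP393.09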
