Web Information Extraction Based on Semantic DOM

Published: 2018-11-18 14:30

[Abstract]: With the rapid growth of the Internet, the Web has become the world's largest distributed, shared information resource. How to obtain useful information from a resource of this size is a pressing problem, and it has driven the rapid development of search engine technology. However, because Web pages are structurally complex, heterogeneous, dynamic, and open, the retrieval performance of current search engines remains unsatisfactory. To improve retrieval performance, data mining techniques have been introduced into search engines to convert Web pages into structured form, and a central research problem in this structuring is Web page information extraction. Addressing the complexity and heterogeneity of Web page data, this thesis develops an automatic Web information extraction technique based on a semantic DOM, and studies in depth three component techniques: template rule extraction, DOM-tree-based content extraction, and semantic-DOM-based content extraction.
The thesis first reviews the history of page information extraction and the state of research in China and abroad, and compares typical Web information extraction techniques, pointing out their strengths and weaknesses. It then introduces the theory and programming practice of semantic tags, the DOM model, and XHTML. The extraction technique studied here is based on the DOM (Document Object Model) and on semantic markup. The DOM is a W3C standard that describes a web document as a tree data structure and provides standard interface methods for operating on page nodes. Semantic markup is a tagging practice also advocated by the W3C; by using tags to state the meaning of the data they contain, it lets more software recognize and parse the data in an HTML page.
Next, the thesis describes in detail the architecture, design method, and processing flow of semantic-DOM-based information extraction. It first discusses how to standardize HTML and presents a technical solution that uses a DOM parser to convert HTML or XHTML text into a DOM tree. It then uses template detection to improve extraction efficiency. Finally, it prunes and denoises the DOM tree using semantic tags and text weighting, so that useful information can be extracted from the cleaned DOM tree and presented to the user in formatted form.
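The abstract names the stages of the pipeline but gives no algorithmic detail, so the following is only a minimal sketch of the parse, prune, and weight steps, not the thesis's actual method. It uses Python's standard `html.parser` module to build a simplified DOM-like tree. The tag sets `NOISE_TAGS` and `SEMANTIC_WEIGHTS`, the link-density scoring formula, and all function names are assumptions made for illustration; template detection is omitted entirely.

```python
from html.parser import HTMLParser

# Assumed tag sets and weights; the thesis does not publish its own.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
SEMANTIC_WEIGHTS = {"article": 2.0, "main": 2.0, "section": 1.5, "p": 1.2}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag = tag          # element name, "#text", or "#document"
        self.parent = parent
        self.children = []
        self.text = text        # only used by "#text" nodes


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree (no attributes, no error recovery)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:      # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(Node("#text", self.current, data))


def prune(node):
    """Drop every subtree rooted at an assumed noise tag, in place."""
    node.children = [c for c in node.children if c.tag not in NOISE_TAGS]
    for child in node.children:
        prune(child)


def text_of(node):
    """Concatenate all text under a node in document order."""
    if node.tag == "#text":
        return node.text
    return "".join(text_of(c) for c in node.children)


def link_chars(node):
    """Count characters of text that sit inside <a> elements."""
    if node.tag == "a":
        return len(" ".join(text_of(node).split()))
    return sum(link_chars(c) for c in node.children)


def score(node):
    """Text length, discounted by link density, times a semantic weight."""
    text = " ".join(text_of(node).split())
    if not text:
        return 0.0
    density = 1.0 - min(link_chars(node) / len(text), 1.0)
    return len(text) * density * SEMANTIC_WEIGHTS.get(node.tag, 1.0)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def extract(html):
    """Return the text of the highest-scoring element after pruning."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    skip = {"#document", "#text", "html", "body"}
    candidates = [n for n in walk(builder.root) if n.tag not in skip]
    if not candidates:
        return ""
    return " ".join(text_of(max(candidates, key=score)).split())
```

Calling `extract(page_source)` on a page whose main text sits inside an `<article>` element should return that element's text with navigation, scripts, and footers removed. Excluding `html` and `body` from the candidates is a design shortcut: without it, the outermost containers would always win because they hold the most text.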
[Degree-granting institution]: Guangxi Normal University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1

[Citing documents]