Research on Key Technologies and Systems for Precise Web Information Extraction
Published: 2018-02-24 15:01
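Keywords: precise Web information extraction; browsing navigation; data integration; data record; data item. Source: Nanjing University, doctoral dissertation, 2017. Document type: degree thesis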
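[Abstract]: With the development of Internet technology, the Web has become the main platform on which enterprises and institutions worldwide publish information and deploy applications. The emergence of large numbers of websites and Web applications has caused the volume of data on the Web to grow rapidly, and this massive data contains much valuable information. To obtain, analyze, and exploit that information, one usually first needs to acquire precise, useful structured data from the Web and then perform in-depth analysis on it. However, the wide distribution and autonomy of Web systems, the heterogeneity and unstructured nature of Web data, and the mismatch between the presentation structure of Web data and the target data structure make acquiring precise, useful structured data from the Web a considerable technical challenge. Web information extraction is the research field that arose to address this problem. It studies how to extract the data users are interested in from presentation-structured Web pages and convert it into structured data.
A complete Web information extraction process can be divided into three stages: web page browsing and navigation, web page data extraction, and web page data integration. Most existing research, however, focuses on web page data extraction and neglects browsing navigation and data integration, and therefore lacks the capability and process needed for complete Web information extraction. At the same time, most existing work overemphasizes fully automatic analysis and extraction in a theoretical sense. The corresponding methods fall into two main kinds: automatic web page data extraction methods and open heterogeneous web page data extraction methods. The former do not consider user needs and extract much redundant data that users are not interested in, so analysis applications must perform secondary processing such as transformation, cleaning, and filtering. The latter use no page-specific extraction rule templates and try to extract the data of interest from heterogeneous pages that describe the same entity, so their extraction precision is usually low.
To address these shortcomings, this thesis seeks to combine automated methods with the practical application requirements of precise Web information extraction. Targeting the complete Web information extraction process, it studies the basic model, language, and key techniques of precise Web information extraction, and presents the design and implementation of a corresponding prototype system. Specifically, the main research work and contributions are as follows:
(1) Basic model of three-stage integrated precise Web information extraction. First, a complete three-stage integrated model of precise Web information extraction is studied and proposed. Then, for each of the three stages, a web page browsing navigation model, a web page data extraction model, and a web page data integration model are studied and proposed. The browsing navigation model builds an interaction and navigation action model, a navigation path model, and a page link relationship model, which describe user interaction actions, the browsing navigation process, and the link relationships between pages, respectively. The data extraction model builds a basic data extraction model, a page data record model, and a rule model for extracting data records and data items, which describe the data extraction process, the structural forms of page data records, and the framework of data record and data item extraction rules, respectively. The data integration model describes the basic process of converting source page data into data with the target structure.
(2) Rule system and language for three-stage integrated precise Web information extraction. Based on the basic model above, a three-stage integrated rule system and language for precise Web information extraction is studied and designed. Corresponding to the three stages of the extraction process, it consists of three parts: a browsing navigation rule language, a data extraction rule language, and a data integration rule language. Compared with existing Web information extraction rule languages, its main advantages are: 1) the browsing navigation rule language can define navigation rules for a wide variety of complex browsing navigation processes; 2) the data extraction rule language can define extraction rules for data records with a wide variety of complex structures; 3) the data integration rule language allows data integration rules to be defined conveniently and flexibly.
(3) Automatic web page data extraction. Existing automatic web page data extraction methods are mainly suited to data records with simple structure (contiguous, fixed-length, linear data records) and have difficulty extracting data records with complex structure (non-contiguous, variable-length, or nested data records). To address this shortcoming, two automatic web page data extraction methods are studied and proposed: one based on cohesion and a DAG (directed acyclic graph), and one based on deterministic finite automata. The former is suited to contiguous, fixed-length (or variable-length), linear data records, while the latter can extract data records of any simple or complex structure.
(4) Generation of precise Web information extraction rules. To help users efficiently generate robust extraction rules, a rule generation method based on user interaction, automatic page structure analysis, and supervised rule learning is studied and proposed. Browsing navigation rules are generated by automatically recording the user's interaction and navigation actions. For data extraction rules, pages containing regular data records are first analyzed with the automatic extraction methods above, and the rules are then generated automatically by supervised rule learning; for pages containing irregular data records, the rules are generated from user interaction combined with supervised rule learning. Data integration rules are written in a simple scripting language.
(5) Design and implementation of a prototype system for precise Web information extraction. To validate the effectiveness of the proposed models, rule language, and key techniques, a prototype system for precise Web information extraction is designed and implemented. Experimental results show that the proposed model and key techniques are effective, achieving better extraction precision and stronger processing capability than existing techniques.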
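The three stages named in the abstract (browsing navigation, data extraction, data integration) can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the thesis's rule language or prototype: the in-memory `SITE`, the regular-expression "rules", the `Product` target schema, and every function name are hypothetical, and regular expressions stand in for whatever rule formalism the thesis actually defines.

```python
import re
from dataclasses import dataclass

# Hypothetical in-memory "site" (URL -> HTML), standing in for real page fetches.
SITE = {
    "/list?page=1": '<ul><li><b>Widget</b><i>$3.50</i></li>'
                    '<li><b>Gadget</b><i>$12.00</i></li></ul>'
                    '<a rel="next" href="/list?page=2">next</a>',
    "/list?page=2": '<ul><li><b>Doohickey</b><i>$0.99</i></li></ul>',
}

# Hypothetical "rules": one pattern per data record, one per data item,
# and one for the navigation link to the next page.
RECORD_RE = re.compile(r"<li>(.*?)</li>", re.S)
ITEM_RES = {"name": re.compile(r"<b>(.*?)</b>"),
            "price": re.compile(r"<i>(.*?)</i>")}
NEXT_RE = re.compile(r'<a rel="next" href="([^"]+)"')


def navigate(start):
    """Stage 1 (navigation): yield each page reached by following 'next' links."""
    url, seen = start, set()
    while url in SITE and url not in seen:
        seen.add(url)
        html = SITE[url]
        yield html
        match = NEXT_RE.search(html)
        url = match.group(1) if match else None


def extract(html):
    """Stage 2 (extraction): split a page into data records, then data items."""
    records = []
    for block in RECORD_RE.findall(html):
        record = {}
        for field, pattern in ITEM_RES.items():
            match = pattern.search(block)
            if match:
                record[field] = match.group(1)
        records.append(record)
    return records


@dataclass
class Product:
    """Hypothetical target structure for the integration stage."""
    title: str
    price_usd: float


def integrate(record):
    """Stage 3 (integration): map a raw source record onto the target structure."""
    return Product(title=record["name"],
                   price_usd=float(record["price"].lstrip("$")))


def run(start):
    """Chain the three stages over every page reachable from `start`."""
    return [integrate(r) for page in navigate(start) for r in extract(page)]
```

Calling `run("/list?page=1")` walks both pages and returns one `Product` per list item. Keeping the stages as separate functions mirrors the abstract's point that navigation and integration are distinct concerns from extraction itself.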
[Degree-granting institution]: Nanjing University
[Degree level]: Doctorate
[Year degree conferred]: 2017
[Classification number]: TP391.1;TP393.09
Article ID: 1530670
Article link: https://www.wllwen.com/shoufeilunwen/xxkjbs/1530670.html