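Research on a Rule Language and Key Technologies for Precise Web Information Extraction from Complex Structures

Published: 2018-03-20 12:01

Topic: precise Web information extraction　Focus: deep web pages　Source: Nanjing University, master's thesis, 2014　Document type: degree thesis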

[Abstract]: In the Internet era, the Web has become the main carrier of massive data and information of all kinds and the primary source from which people obtain useful information. The rapid growth of e-commerce, together with applications such as vertical search and public-opinion and sentiment analysis on social networks, depends on Web information extraction to obtain web page data at scale, so research on Web information extraction has both academic significance and commercial value. A central research problem is how to provide effective extraction rules that express, conveniently and quickly, the extraction logic for data records in web pages of varied and complex structure, so that data extraction does not have to be hard-coded. Existing research has made real progress, yet the evolution of Web page technology keeps raising new research questions for the field.

Existing work on Web information extraction techniques and extraction rules has the following main shortcomings:
1) in the design of extraction rule models and systems, the complete extraction process and its model have not been studied in depth, which makes it hard to handle the full pipeline of deep web page navigation, data extraction, and data integration;
2) models of complex-structure data records have received little study, which narrows the applicability of Web data extraction techniques;
3) mainstream extraction rule languages lack the expressive power needed to extract data from complex-structured deep Web pages;
4) for the wrapper failures caused by template updates on dynamic data pages, there is related research on rule detection and maintenance, but rule systems lack the ability to express rule detection, maintenance, and updating at the system level;
5) current research relies on the structural features of the page DOM tree and on visual features, which handle most conventional extraction tasks, but for complex-structured pages that these two kinds of features cannot cover, the definition and design of extraction rules lack further features to strengthen their expressive and processing capability;
6) the execution efficiency of rule languages has not been analyzed or improved, and existing rule execution processes have not been designed or refined with large-scale application scenarios in mind to raise extraction efficiency.

Building on a review of existing work on Web information extraction rules, and addressing these shortcomings, this thesis carries out research in five areas:
1) it designs a full-process model of Web information extraction that captures the navigation logic, data extraction logic, and data integration logic of a complete extraction process, providing guidance for designing an extraction rule language that combines navigation with data integration;
2) it studies the extraction rule system and its models: to describe the extraction process more clearly and to increase what extraction techniques can handle, the thesis examines the models involved in Web information extraction, including a complex-structure data record model, a top-down structured data extraction process model based on the DOM tree, a page rule model, and an extraction rule wrapper lifecycle model covering rule generation, rule detection, and rule maintenance and updating;
3) based on this study of the basic models, it proposes a hierarchical, integrated Web information extraction rule system and language that establishes a layered "data region - data record - data item" mapping for each web page, uses the structural, visual, and semantic features of DOM nodes and page elements at every level, and combines extraction predicates to provide rules for locating, restructuring, extracting, fine-grained filtering, extraction anomaly detection, and maintenance of data elements at every granularity, giving the language strong power to express extraction logic;
4) following the multi-function integrated rule model and system, it adds detection rules and maintenance rules to the rule language to detect whether a page template has changed and to repair failed extraction rules locally;
5) to extend the expressive power of the rule language, it adds semantics-based extraction rules, integrating semantic elements into the existing extraction rule system and solving extraction problems that structural and visual features alone can hardly handle.

On the basis of these key techniques, the thesis implements an extraction rule execution engine and designs and implements a complete prototype Web information extraction system. Extraction experiments on commercial websites show that the extraction techniques and the extraction rule language implemented in the thesis have strong expressive and processing capability.
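
The layered "data region - data record - data item" mapping and the template-change detection mentioned in the abstract can be pictured with a small sketch. This is not the thesis's rule language, whose syntax the abstract does not show. It is a minimal illustration that assumes a hypothetical rule dictionary (`RULE`), a hypothetical page fragment (`PAGE`), and structural paths only, with no visual or semantic features, evaluated with Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a product listing.
PAGE = """
<html><body>
  <div id="results">
    <div class="record"><span class="title">Widget A</span><span class="price">$9.50</span></div>
    <div class="record"><span class="title">Widget B</span><span class="price">$12.00</span></div>
  </div>
</body></html>
"""

# Hypothetical rule: one structural path per level of the hierarchy.
RULE = {
    "region": ".//div[@id='results']",   # data region
    "record": "./div[@class='record']",  # data records inside the region
    "items": {                           # data items inside each record
        "title": "./span[@class='title']",
        "price": "./span[@class='price']",
    },
}


def extract(page, rule):
    """Top-down extraction: locate the region, then records, then items.

    Returns (records, problems). A non-empty `problems` list is a crude
    signal that the page template may no longer match the rule.
    """
    root = ET.fromstring(page)
    region = root.find(rule["region"])
    if region is None:
        return [], ["data region not found"]

    records, problems = [], []
    for index, node in enumerate(region.findall(rule["record"])):
        record = {}
        for name, path in rule["items"].items():
            item = node.find(path)
            if item is None or not (item.text or "").strip():
                problems.append(f"record {index}: missing item '{name}'")
            else:
                record[name] = item.text.strip()
        records.append(record)

    if not records:
        problems.append("no data records found in region")
    return records, problems


records, problems = extract(PAGE, RULE)
print(records)
print(problems)
```

With the sample fragment the rule matches every level, so `problems` stays empty. If the page is altered, for example by renaming the `price` class, each record reports a missing item, which is the kind of signal a detection rule could use to trigger local repair of the failed rule.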

[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP391.1;TP393.092

[Related literature]