Research on Web Entity Activity and Entity Relation Extraction

Published: 2018-08-28 10:53

[Abstract]: With the rapid development of Internet technology, the Web has become an enormous data source holding massive amounts of data. How to integrate the valuable information on the Web efficiently, comprehensively, and accurately has become both a research focus and a challenge in data integration, information retrieval, natural language understanding, and related fields. Such integration would provide data support for systems such as market intelligence analysis, search engines, and intelligent question answering; enrich the knowledge bases of market intelligence analysis and question answering systems; help improve the results of analysis and reasoning; and let search engines return more precise results. To integrate Web data, the first problem is how to use information extraction techniques to convert the unstructured and semi-structured data on the Web into machine-readable structured data.
Web data is large-scale, heterogeneous, autonomous, and distributed, and existing information extraction techniques cannot satisfy the demand for efficient, comprehensive, and accurate data integration all at once. On the one hand, when facing large-scale, distributed Web data, existing techniques aim to extract named entities, entity relations, and entity attributes (data objects) efficiently, but the methods are restricted by the domain of the extraction targets, and the results are relatively simple and not rich in content. On the other hand, when facing highly heterogeneous and autonomous unstructured Web data, existing techniques aim at accurate results, and their efficiency cannot meet the needs of large-scale information extraction.
This thesis studies Web information extraction. Its goal is, while preserving the accuracy of the extraction results, to fully mine the valuable information in large-scale, heterogeneous Web data and to enrich the content of what is extracted. The Web holds a large amount of data describing the behavior and activities of entities, yet existing techniques neither characterize nor extract this special kind of information in detail. For large-scale Web data, existing relation extraction techniques mainly target binary relations and do not consider the temporal validity of those relations, which leaves the extracted relation instances with poor usability.
This thesis addresses the problem that existing Web information extraction techniques fail to make full use of the valuable data on the Web, so that their results are not rich in content and have poor usability. The main work and contributions are summarized as follows.
1. A Web entity activity extraction method based on SVM and an extended conditional random field is proposed. It works across multiple domains and accurately extracts entity activities, a previously unused data type, from Web data sources.
Web entity activities are data on the Web that describe the behavior and activities of entities; traditional information extraction techniques seldom treat this data type on its own. This thesis first characterizes Web entity activities in detail and gives a formal definition of entity activity based on case grammar. It then proposes an extraction method based on SVM and an extended conditional random field that accurately extracts activity information about entities from the Web. To avoid the heavy work of manually labeling training data, a training data generation algorithm based on heuristic rules is proposed; it converts a semantic role labeling training set into a training set suited to Web entity activity extraction, which is used to train a support vector machine (SVM) classifier and an extended conditional random field. During extraction, the classifier selects the valid sentences that contain entity activities. The extended conditional random field then models the label-frequency features and relation features that a traditional conditional random field cannot exploit, and labels the information to be extracted in natural-language sentences, improving labeling accuracy. Experiments across multiple domains show that the method is well suited to Web entity activity extraction.
2. A bootstrapping method for extracting the temporal validity information of Web entity relations is proposed. It effectively addresses the missing time dimension in traditional relation extraction, enriches the extracted content, and makes the results more usable.
Traditional relation extraction mainly studies binary relations, but existing techniques all assume that relation instances are independent of time, so the results lack a time dimension and have poor usability. To address this, the thesis proposes a bootstrapping method for extracting the temporal validity information of Web entity relations, which extracts all relation instances of a given relation type together with the temporal validity information of each instance. The method first remodels the ternary relation to be extracted, which consists of the two entities of a binary relation plus the relation's temporal validity information: by treating the entity relation as a fact dimension, it forms a new binary relation. It then applies a classical bootstrapping binary relation extraction method to extract relation instances and their temporal validity information. Compared with the traditional bootstrapping process, the thesis introduces Markov logic networks (MLN) to soften the hard constraints of rules and templates and raise recall; it introduces an L1-norm model to select high-quality templates, which helps raise precision; and, because the extraction targets are natural-language sentences on the Web, it introduces semantic parsing to make full use of the dependency features in those sentences. Experiments show that the method efficiently and accurately extracts, across multiple domains, the relation instances of a given relation type and their temporal validity information. Further experiments show that introducing MLN, L1-norm template selection, and semantic parsing into the bootstrapping process each significantly improves the extraction results.
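The remodeling step in contribution 2 can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not give the exact data model, so the `BinaryInstance` type, the slot names, the four-digit-year time pattern, and the toy sentences below are all assumptions made for illustration. The sketch folds the two entities and the relation type into one "fact" unit, pairs that unit with a time value, and then runs a single round of plain template-based bootstrapping. The MLN weighting, L1-norm template selection, and semantic parsing described in the abstract are deliberately left out.

```python
import re
from typing import NamedTuple


class BinaryInstance(NamedTuple):
    """A (fact, time) pair: the entity relation is folded into one 'fact' unit."""
    fact: tuple  # (subject, relation, object); hypothetical layout
    time: str


# Hypothetical slot patterns; a four-digit year stands in for time information.
SLOT_PATTERNS = {
    "{SUBJ}": r"(?P<subj>.+?)",
    "{OBJ}": r"(?P<obj>.+?)",
    "{TIME}": r"(?P<time>\d{4})",
}


def induce_templates(sentences, seeds):
    """Turn every sentence that mentions a whole seed into a slotted template."""
    templates = set()
    for sentence in sentences:
        for inst in seeds:
            subj, _, obj = inst.fact
            if subj in sentence and obj in sentence and inst.time in sentence:
                templates.add(
                    sentence.replace(subj, "{SUBJ}")
                    .replace(obj, "{OBJ}")
                    .replace(inst.time, "{TIME}")
                )
    return templates


def template_to_regex(template):
    """Escape the literal text and swap each slot for a named capture group."""
    parts = re.split(r"(\{SUBJ\}|\{OBJ\}|\{TIME\})", template)
    return re.compile("".join(SLOT_PATTERNS.get(p, re.escape(p)) for p in parts))


def bootstrap_once(sentences, seeds, relation):
    """One bootstrapping round: seeds -> templates -> new (fact, time) pairs."""
    regexes = [template_to_regex(t) for t in induce_templates(sentences, seeds)]
    found = set(seeds)
    for sentence in sentences:
        for rx in regexes:
            m = rx.fullmatch(sentence)
            if m:
                fact = (m["subj"], relation, m["obj"])
                found.add(BinaryInstance(fact, m["time"]))
    return found


# Toy data, invented for this example only.
SENTENCES = [
    "Alice Chen has been CEO of Acme since 2011.",
    "Bob Lee has been CEO of Initech since 2014.",
    "Initech opened a new office.",
]
SEEDS = [BinaryInstance(("Alice Chen", "ceo_of", "Acme"), "2011")]

for instance in sorted(bootstrap_once(SENTENCES, SEEDS, "ceo_of")):
    print(instance)
```

Because the fact and its time value form an ordinary pair, any binary bootstrapping loop can be reused unchanged. A fuller version would repeat the round until no new instances appear and would score templates before trusting them, which is the role the abstract assigns to MLN and the L1-norm model.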
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP311.13

[References]