Chinese Open N-ary Entity Relation Extraction

Published: 2017-12-31 20:23

  Keywords: Chinese open n-ary entity relation extraction  Source: Taiyuan University of Technology, 2017 master's thesis  Document type: Degree thesis


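  Related topics: open information extraction; entity relation extraction; machine learning; logistic regression classifier; support vector machine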


[Abstract]: Information extraction is the task of extracting multi-level semantic information from text, such as entity words, relation words, times, places, and events of specified types, and converting that information into a structured output format. With the exponential growth of online information and the rapid development of artificial intelligence, information extraction has become an active research area. Entity relation extraction is both an important component of information extraction and an important task in its own right; its main goal is to extract the types and values of the entity relations expressed in a text. It has theoretical and practical significance for knowledge graph construction, domain ontologies, question answering systems, text similarity computation, semantic understanding, text summarization, and other deeper natural language processing problems.

Research on entity relation extraction covers traditional and open entity relation extraction. Traditional entity relation extraction targets text from a restricted domain and restricted classes of entities and relations, and requires a language model built for that specific domain. Because Internet information grows exponentially and spans many domains, traditional entity relation extraction cannot meet the needs of web text extraction. Open information extraction has therefore become an important research area within information extraction. Its main task is to extract multi-level semantic information such as entities, relations, and events from large-scale, heterogeneous, cross-domain text and to output it in a structured format, so that web text can be processed across domains and at scale.

Open entity relation extraction for English text falls mainly into two stages: one in which entity words are extracted first, and one in which relation words are extracted first. Research on entity relation extraction for Chinese text has concentrated on binary relation extraction and on methods that use shallow semantic features. This thesis therefore proposes an open entity relation extraction method for Chinese text based on dependency analysis. The method can extract n-ary relations, and it incorporates deep semantic features to improve extraction accuracy. An extraction system is designed and implemented on the basis of this method.

The proposed dependency-based open information extraction method for large-scale, heterogeneous Chinese web text proceeds in four steps. First, the web text is preprocessed, which includes extracting the main text from web pages, Chinese word segmentation, Chinese part-of-speech tagging, and dependency parsing. Second, heuristic rules identify base noun phrases, and heuristic rules based on inter-word dependency relations produce candidate entity relation tuples. Third, a trained machine learning classifier filters the candidates to obtain the final entity relation tuples. Finally, the filtered tuples are normalized and stored in a database. The large set of extracted entity relation tuples can also be used for other natural language processing tasks.

The thesis uses the Language Technology Platform-Cloud (LTP-Cloud) for text preprocessing, and defines a set of part-of-speech combination rules for base noun phrases and a set of dependency-based rules for extracting entity relation tuples. In the filtering stage, a machine learning classifier is trained on features such as word count, part of speech, and inter-word distance, and is used to judge whether each candidate tuple is correct and to filter accordingly. In extraction experiments on the test corpus, the method achieved 81.25% accuracy. Finally, a Chinese open n-ary entity relation extraction system was built with the proposed method and used to extract a large number of entity relation tuples.
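The candidate-extraction and feature steps described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's rule set, which the abstract does not list. It assumes a sentence has already been segmented, tagged, and parsed, and it uses labels in the style of LTP (`SBV`, `VOB`, `ADV`, `POB`, and POS tags such as `v`, `p`, `nh`, `ns`) purely as an assumption. The `Token` and `Candidate` types, the single verb-centered rule, and the feature names are all hypothetical. Base noun phrase recognition is omitted, so each argument is a single token.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    word: str
    pos: str    # part-of-speech tag (assumed LTP-style)
    head: int   # index of the governing token, 0 for the root
    rel: str    # dependency label linking this token to its head


@dataclass
class Candidate:
    relation: Token      # the relation word (here: a verb)
    args: List[Token]    # entity arguments attached to the relation word


def children(tokens: List[Token], head_idx: int, rel: str) -> List[Token]:
    """Return the tokens governed by head_idx through dependency label rel."""
    return [t for t in tokens if t.head == head_idx and t.rel == rel]


def extract_candidates(tokens: List[Token]) -> List[Candidate]:
    """Hypothetical rule: a verb with its subject, object, and the objects
    of prepositions that modify it forms one candidate tuple."""
    candidates = []
    for verb in tokens:
        if verb.pos != "v":
            continue
        args = children(tokens, verb.idx, "SBV") + children(tokens, verb.idx, "VOB")
        for adv in children(tokens, verb.idx, "ADV"):
            if adv.pos == "p":  # prepositional modifier, e.g. a location phrase
                args += children(tokens, adv.idx, "POB")
        if len(args) >= 2:  # keep binary and higher-arity tuples only
            args.sort(key=lambda t: t.idx)
            candidates.append(Candidate(verb, args))
    return candidates


def features(c: Candidate) -> Dict[str, object]:
    """Illustrative filter features: argument count, POS tags, word distance."""
    distances = [abs(a.idx - c.relation.idx) for a in c.args]
    return {
        "num_args": len(c.args),
        "relation_pos": c.relation.pos,
        "arg_pos": [a.pos for a in c.args],
        "max_distance": max(distances),
    }


# Hand-annotated parse of "小明 在 北京 创办 了 公司" (illustrative, not LTP output).
SENTENCE = [
    Token(1, "小明", "nh", 4, "SBV"),
    Token(2, "在", "p", 4, "ADV"),
    Token(3, "北京", "ns", 2, "POB"),
    Token(4, "创办", "v", 0, "HED"),
    Token(5, "了", "u", 4, "RAD"),
    Token(6, "公司", "n", 4, "VOB"),
]

if __name__ == "__main__":
    for cand in extract_candidates(SENTENCE):
        print(cand.relation.word, [a.word for a in cand.args], features(cand))
```

In a full pipeline, feature dictionaries like these would be vectorized and passed to a trained classifier (the page's keywords mention logistic regression and support vector machines), which would accept or reject each candidate tuple.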

[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1





