Entity Relation Extraction Based on Deep Convolutional Neural Networks

Published: 2018-04-04 05:04

Topic: relation extraction　Focus: deep convolutional neural networks　Source: Taiyuan University of Technology, master's thesis, 2017

[Abstract]: Entity relation extraction has long been an active research topic in natural language processing. Accurately identifying the semantic relation between two entities is essential to information extraction and is also important for knowledge base construction and information retrieval. Following the rapid progress of deep learning in image and vision tasks, deep learning has in recent years been introduced into natural language processing, where it has become a research focus. Traditional entity relation extraction methods require discrete features to be selected by hand before the model is trained, and the quality of those features directly determines the final extraction results. It is impossible to know in advance which features will be most effective, more features are not necessarily better, and the usefulness of a feature is mostly judged from expert experience. Feature selection also relies heavily on existing natural language processing (NLP) tools, which is time-consuming and laborious and tends to propagate errors. In contrast, relation extraction algorithms based on deep learning learn features automatically from the raw corpus, which both reduces the dependence on NLP tools and makes full use of the structural information in the text. Previous work has further shown that the convolutional neural network (CNN), with its distinctive architecture, learns features particularly well. This thesis therefore uses a deep convolutional neural network for entity relation extraction. First, it proposes TP-ISP (term proportion-inverse sentence proportion), a sentence-based algorithm for measuring word importance. The algorithm yields a tpisp value for every word in each category, and sorting by that value ranks the words by importance. Second, the top-ranked words are taken as keyword features that characterize the category and are fed to the network as initial input together with the word-vector features and word position features of the original sentence. This mitigates a weakness of existing deep learning methods, which learn features from word vectors alone. Adding the category keyword features increases the separation between categories and compensates for shortcomings in the features the network learns automatically. Finally, in the network training stage, the thesis adopts a piecewise max pooling strategy: the highest-scoring feature in each segment is selected, and these features are combined as the input to the final classifier. This strategy reduces, to some extent, the information loss of conventional max pooling. In addition, because the scarcity of Chinese corpora and other factors have limited research on Chinese relation extraction, the thesis takes the dataset of the COAE (Chinese Opinion Analysis Evaluation) 2016 evaluation task as its object and adapts the model to the characteristics of Chinese text to address Chinese entity relation extraction. A Chinese word-vector table is trained with the Skip-gram model of the word2vec tool on Chinese Wikipedia data, and it outperforms the word-vector table produced by word2vec with random initialization alone. Experiments show that the model substantially improves entity relation extraction results on both English and Chinese corpora.
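
The piecewise (segment-wise) max pooling step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the segments are defined, so the code below assumes the common choice of splitting each filter's output at the two entity positions into three segments, with each entity token closing its segment. The function names `piecewise_max_pool` and `pool_all_filters`, the entity indices `e1` and `e2`, and the 0.0 fill value for an empty segment are all hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence


def piecewise_max_pool(feature_map: Sequence[float], e1: int, e2: int) -> List[float]:
    """Max-pool one convolution filter's outputs over three segments.

    Assumption (not stated in the abstract): the sentence is split at the
    two entity positions e1 < e2, and each entity token ends its segment.
    """
    if not 0 <= e1 < e2 < len(feature_map):
        raise ValueError("expected 0 <= e1 < e2 < len(feature_map)")
    segments = [
        feature_map[: e1 + 1],        # start of sentence through entity 1
        feature_map[e1 + 1 : e2 + 1],  # after entity 1 through entity 2
        feature_map[e2 + 1 :],        # after entity 2 to end of sentence
    ]
    # Assumption: an empty last segment (entity 2 is the final token)
    # contributes 0.0 so the output length stays fixed at three.
    return [max(seg) if len(seg) > 0 else 0.0 for seg in segments]


def pool_all_filters(feature_maps: Sequence[Sequence[float]], e1: int, e2: int) -> List[float]:
    """Concatenate the per-segment maxima of every filter into one vector."""
    pooled: List[float] = []
    for feature_map in feature_maps:
        pooled.extend(piecewise_max_pool(feature_map, e1, e2))
    return pooled


if __name__ == "__main__":
    # Two made-up filter outputs over a six-token sentence.
    maps = [
        [0.1, 0.9, 0.3, 0.2, 0.8, 0.4],
        [0.5, -0.2, 0.7, 0.1, 0.0, -0.3],
    ]
    print(pool_all_filters(maps, e1=1, e2=3))
```

Plain max pooling would keep a single value per filter (0.9 for the first filter above), whereas the piecewise variant keeps one value per segment, so a strong response after the second entity (0.8) is not discarded. The resulting vector has three entries per filter and would serve as the classifier input.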
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1;TP183
