Research on Entity Relation Extraction for Chinese News Text
Published: 2018-07-31 17:15
[Abstract]: With the rapid development of Internet technology, the volume of text on the Internet is growing quickly. How to extract the knowledge people need from massive text quickly and accurately is becoming a research hotspot, and automatically extracting entity relations from text is particularly important. Current research on entity relation extraction focuses mainly on English corpora and relies mainly on traditional machine learning algorithms. In addition, existing work rarely considers how the presence of a large number of no-relation samples affects relation classification. This thesis therefore focuses on entity relation classification for Chinese news text, based mainly on deep learning methods. To reduce the influence of no-relation samples, the entity relation extraction task is divided into two subtasks that are studied separately: relation existence detection (deciding whether an entity pair holds any relation at all) and entity relation classification. For the relation existence detection subtask, a detection method combining the bag-of-words model with logistic regression is designed and implemented. To address the large feature-space dimensionality and long running time of that method, a second detection method based on a convolutional neural network (CNN) is designed and implemented. Word vectors pre-trained on Sohu news data are used to map the words obtained by segmenting the ACE2005 Chinese entity relation extraction dataset into vectors, which serve as the CNN input for relation existence detection. Experiments on the ACE2005 Chinese entity relation extraction dataset show that this method achieves better detection performance, with an F-value of 81.78%. For the entity relation classification subtask, the thesis proposes a classification method based on a Bi-directional Long-Short Term Memory (BLSTM) model combined with feature fusion. First, word vectors are obtained by pre-training on the corpus, and entity-related features such as entity type, entity length, and relative entity position are extracted. A custom rule base is then built by analyzing how entity types and their contexts in the corpus relate to relation types. Finally, the word vectors, the entity-related features, and the custom rule base are fused as the input to the BLSTM model to build the classifier. Experiments on the ACE2005 dataset show that this method reaches a relation classification F-value of 91.74%, demonstrating the effectiveness of this work for entity relation classification in Chinese news text.
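To make the first detection method in the abstract more concrete, the following is a minimal sketch of a bag-of-words representation fed to a logistic regression classifier, scored with the F-value. It is not the thesis's implementation: the toy token lists, their labels, the learning rate, the epoch count, and all function names (`build_vocab`, `bow_vector`, `train_logreg`, `predict`, `f_value`) are assumptions made for illustration, and a real system would train on segmented ACE2005 sentences rather than six hand-written examples.

```python
import math
from collections import Counter

def build_vocab(token_lists):
    """Map each distinct token to a column index (sorted for determinism)."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}

def bow_vector(tokens, vocab):
    """Bag-of-words count vector; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(tokens, vocab, w, b):
    """Return 1 (some relation) or 0 (no relation) for one token list."""
    x = bow_vector(tokens, vocab)
    score = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if score >= 0.5 else 0

def f_value(gold, pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy data: context tokens for an entity pair, label 1 = relation exists.
train = [
    (["担任", "总裁"], 1),
    (["位于", "市中心"], 1),
    (["毕业", "于"], 1),
    (["和", "还有"], 0),
    (["以及", "另外"], 0),
    (["而", "则"], 0),
]
vocab = build_vocab([toks for toks, _ in train])
X = [bow_vector(toks, vocab) for toks, _ in train]
y = [label for _, label in train]
w, b = train_logreg(X, y)
train_pred = [predict(toks, vocab, w, b) for toks, _ in train]
print("training F-value:", f_value(y, train_pred))
```

The vector length equals the vocabulary size, which is the dimensionality problem the abstract cites as the motivation for moving to a CNN over dense word vectors. Samples predicted as 0 here would be filtered out before the relation classification stage.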
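[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
Article ID: 2156286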
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2156286.html