Research on Chinese Personal Social Relation Extraction Based on News Data

Published: 2018-12-18 21:25

【Abstract】: As the Internet continues to grow, so does the volume of information and data it contains. Information extraction aims to mine structured data from the massive unstructured data on the Internet. Entity relation extraction is a subtask of information extraction and has become a research hotspot in data mining and information retrieval. Personal relation extraction is one branch of entity relation extraction; personal relation triples are used to build personal relation networks and question answering systems, and therefore have high practical value. However, current relation extraction research focuses mainly on English corpora, while research on Chinese data has progressed slowly and is more difficult. Machine-learning-based relation extraction methods have become the focus of current research because of their good extraction results. According to how the training data are obtained, this thesis studies three kinds of methods, based on semi-supervised learning, distant supervision, and unsupervised learning. The main contributions are as follows:

1. Supervised relation extraction depends heavily on manually annotated training data, and manual annotation is expensive. To achieve high extraction performance with only a small amount of labeled data, this thesis studies semi-supervised relation extraction. A semi-supervised algorithm based on label propagation improves extraction when few labeled samples are available, but selecting the training samples at random hurts extraction performance. To improve the results of label propagation, this thesis combines it with active learning for personal relation extraction. The method actively selects for annotation the samples that help relation classification the most, which reduces the number of uninformative labeled samples and improves system performance for the same amount of labeled data.

2. In current relation extraction research, distant supervision is commonly used to construct training data automatically, but its basic assumption is not always accurate, which introduces noise into the training data. To address this problem, this thesis proposes a method that filters noise out of the training data with a scoring function, reducing the noise in training data obtained by distant supervision. In addition, because the accuracy of current relation extraction systems is not yet satisfactory, this thesis applies word embedding techniques to extract several word-vector-based features from single-sentence text and adds them to the commonly used relation extraction feature set in order to improve the performance of the personal relation extraction system.

3. All of the above methods require relation types to be defined in advance before relation instances can be extracted. This restricts the relation types a model can obtain, so triples of new relation types cannot be produced. This thesis therefore proposes an unsupervised relation extraction method that needs neither training data nor predefined relation types. The method first obtains highly associated person pairs from news headline data. It then crawls the news articles in which the associated person pairs appear, preprocesses them, and uses TF-IDF to obtain keywords from the sentences in which the two persons co-occur. Next, it derives associations between words from word co-occurrence information and builds a keyword association network. Finally, it applies graph clustering to the association network to obtain the personal relations.
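The unsupervised pipeline in the third contribution (TF-IDF keywords, a keyword co-occurrence network, then graph clustering) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not name the clustering algorithm, so connected components over thresholded co-occurrence edges are used here as a simple stand-in, and all function names, parameters (`top_k`, `min_weight`), and the toy data are hypothetical. The sketch also assumes that person pairs have already been selected and that their co-occurrence sentences have already been word-segmented.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Toy input (hypothetical): person pair -> tokenized co-occurrence sentences.
TOY_DOCS = {
    ("Li", "Wang"): [["wife", "wedding", "said"], ["wife", "wedding", "attended"]],
    ("Zhao", "Chen"): [["coach", "team", "said"], ["coach", "team", "attended"]],
    ("Sun", "Zhou"): [["wife", "wedding", "said"], ["wife", "wedding", "attended"]],
}


def tfidf_keywords(docs, top_k=2):
    """Return the top_k TF-IDF tokens for each person pair.

    All sentences of one pair are treated as a single document.
    """
    bags = {pair: Counter(tok for sent in sents for tok in sent)
            for pair, sents in docs.items()}
    n_docs = len(bags)
    # Document frequency: number of pairs whose sentences contain the token.
    df = Counter(tok for bag in bags.values() for tok in bag)
    keywords = {}
    for pair, bag in bags.items():
        total = sum(bag.values())
        scores = {tok: (cnt / total) * math.log(n_docs / df[tok])
                  for tok, cnt in bag.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[pair] = ranked[:top_k]
    return keywords


def cooccurrence_weights(docs, vocab):
    """Count how often two keywords appear in the same sentence."""
    weights = Counter()
    for sents in docs.values():
        for sent in sents:
            present = sorted(set(sent) & vocab)
            for a, b in combinations(present, 2):
                weights[(a, b)] += 1
    return weights


def cluster_keywords(vocab, weights, min_weight=2):
    """Stand-in clustering: connected components over edges >= min_weight."""
    adj = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_weight:
            adj[a].add(b)
            adj[b].add(a)
    seen, clusters = set(), []
    for node in sorted(vocab):
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters


keywords = tfidf_keywords(TOY_DOCS)
vocab = {tok for toks in keywords.values() for tok in toks}
weights = cooccurrence_weights(TOY_DOCS, vocab)
clusters = cluster_keywords(vocab, weights)
print(clusters)
```

In this toy run each resulting keyword cluster plays the role of one candidate relation type; tokens such as "said" that occur for every pair receive a TF-IDF score of zero and never enter the network. A real system would likely replace the connected-components step with a proper graph clustering algorithm.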
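【Degree-granting institution】: University of Science and Technology of China
【Degree level】: Master's
【Year degree conferred】: 2016
【Classification number】: TP391.1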

【References】