Research on Context-Aware Entity Linking

Published: 2018-02-09 20:05

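Keywords: entity linking; document vectors; information extraction; distributed representation   Source: Zhejiang University, master's thesis, 2017   Document type: degree thesis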

[Abstract]: Converting the massive volume of unstructured text on the Internet into the structured knowledge that applications need, so that computers can use it efficiently, is the foundation for improving search systems and for realizing intelligent question answering and machine reading. Entity linking plays a key role in this process. Its main goal is to eliminate the ambiguity caused by linguistic phenomena such as aliases, coreference, and polysemy, and to establish the correspondence between the proper nouns (entity names) that appear in text and the entities they refer to in a knowledge base. How to select the best candidate from the multiple candidate entities that correspond to the same entity name is the central research question in entity linking. This thesis studies that question in depth and proposes an entity linking method based on document vectors that incorporate entity information. First, it proposes a novel model for learning distributed vector representations of documents. The model fuses additional information that is crucial to entity linking, such as context entities and entity co-occurrence, into the conventional document vector learning process, so that the resulting document vectors are more discriminative for entity linking. Second, because this model is difficult to train directly, the thesis studies a training method that randomly samples training examples and combines them with Hierarchical Softmax or Negative Sampling, which not only makes the information fusion feasible but also speeds up training. Third, based on the document vector features learned by the model, it builds a model of the degree of semantic match between a candidate entity and the current input document. Finally, it combines the computed semantic match with the attributes of the candidate entities themselves to jointly search for the best candidate, forming a complete entity linking system. Because it is based on distributed vector representations, the system overcomes the need of traditional methods for hand-crafted features, automatically exploits the assumption that different entities mentioned in the same document are usually related, and jointly takes into account both the ordinary words and the mentioned entities in the context when linking. Compared with the deep neural network methods proposed in recent years, it has notable advantages such as not requiring large amounts of annotated entity linking data and needing little training time. A series of experiments on the TAC KBP entity linking datasets, which are commonly used in entity linking research, show that the system performs strongly: its accuracy can exceed that of the latest existing entity linking methods by more than 2 percentage points. In the English EDL (Entity Discovery and Linking) task of the 2016 international knowledge base construction competition (TAC KBP) organized by NIST (the U.S. National Institute of Standards and Technology), a system built on this work ranked first on 6 of the 8 metrics and second on the other 2, placing first in overall performance. A total of 13 teams from universities and research institutions in China and abroad took part in the task, including CMU, IBM, and iFLYTEK. The entity linking system described here has not only been recognized in an international competition but has also been applied directly in several national-level projects, such as the China Engineering Science and Technology Knowledge Center construction project led by the Chinese Academy of Engineering, where it plays an important role in automatic knowledge base construction and data structuring.
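The final step described in the abstract, combining a semantic match score with a candidate's own attributes to pick the best candidate, can be illustrated with a minimal sketch. It is not the thesis's actual model: cosine similarity as the match score, a single popularity prior as the candidate attribute, the linear weight `alpha`, and the toy vectors and entity identifiers are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(doc_vec, candidates, alpha=0.7):
    """Rank candidate entities for one mention, best first.

    doc_vec    -- vector of the input document (hypothetical; assumed to come
                  from some entity-aware document embedding model)
    candidates -- list of dicts with keys 'id', 'vec' and 'prior'
                  ('prior' is an assumed popularity attribute in [0, 1])
    alpha      -- assumed interpolation weight between match and prior
    """
    scored = []
    for cand in candidates:
        match = cosine(doc_vec, cand["vec"])
        score = alpha * match + (1.0 - alpha) * cand["prior"]
        scored.append((cand["id"], score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy data: two candidates for the ambiguous mention "Michael Jordan".
doc_vec = [0.9, 0.1, 0.0]
candidates = [
    {"id": "Michael_Jordan_(basketball)", "vec": [1.0, 0.0, 0.0], "prior": 0.6},
    {"id": "Michael_I._Jordan", "vec": [0.0, 1.0, 0.0], "prior": 0.4},
]
ranking = rank_candidates(doc_vec, candidates)
best_id = ranking[0][0]
```

In this toy setup the document vector lies close to the first candidate's vector, so that candidate ranks first. A real system would learn both the vectors and the way the scores are combined rather than fixing `alpha` by hand.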
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1

[Related literature]