Knowledge Graph Completion Based on Embedding Models
Published: 2018-03-22 18:01
Topic: knowledge graph; Focus: embedding models; Source: Sun Yat-sen University, 2017 doctoral dissertation; Document type: degree thesis
【Abstract】: A knowledge graph is a collection of triples of the form (subject, predicate, object), where the subject and object are entities and the predicate is a relation. Each triple, for example (Obama, place of birth, Honolulu), states a fact. When a knowledge graph is used in a question-answering system, it can provide the required answer only if it covers the fact the question corresponds to. Although several large-scale, open-domain knowledge graphs have been released, they are still far from complete; for instance, 30% of the person entities in Freebase lack triples recording their parents. Knowledge graph completion adds new triples to an existing knowledge graph, and every added triple must be an objective fact. Two main sources of information can be used for completion: 1. inferring new triples from the triples already present in the knowledge graph; 2. extracting new entities and new triples from text. To exploit the first source, a large body of recent work on knowledge graph embedding learns a dense vector representation for each entity and computes the plausibility of each triple from those entity vectors. Such embedding models can also be used to reason about triples that an information-extraction model extracts from text. Because the two sources are complementary, combining an embedding model with an information-extraction model can outperform either model alone.

We summarize the weaknesses of existing knowledge graph embedding models, and the challenges in combining them with information-extraction models, as follows: 1. TransE, the state-of-the-art knowledge graph embedding model, cannot properly handle relations that are reflexive or one-to-many / many-to-one / many-to-many. 2. When training a knowledge graph embedding model, existing negative-sampling algorithms may produce false-negative samples. 3. For a triple extracted from text, the subject and object are words; if they cannot be linked to entities in the knowledge graph under consideration, existing embedding models cannot reason about the triple because no entity vectors are available for the computation.

In this thesis we propose a series of techniques to address these problems. The main contributions are: 1. We show that the first problem arises because TransE models every relation as a translation applied to entity vectors. We therefore propose a new knowledge graph embedding model, TransH, which projects entity vectors onto a hyperplane defined for each relation before performing the translation; this fixes the weakness of TransE while avoiding an excessive increase in model complexity. 2. We propose a data-driven, relation-specific distribution for sampling negative examples when training a knowledge graph embedding model. The distribution reduces the chance of sampling false negatives, and its parameters can be determined from basic statistics of each relation. 3. We first show that, in the word-embedding model Word2Vec, implicit relations between words can be interpreted as translations applied to word vectors, analogous to how TransE models relations in a knowledge graph. Building on this, we propose a joint embedding model that learns a dense vector representation for every entity and every word, so that it can compute the plausibility of triples involving both words and entities; to our knowledge, it is the first method able to handle such triples. 4. We propose three alignment models based on entity links, entity names, and entity descriptions, respectively. The supervision needed to train them is easy to obtain and large in scale. Empirical evaluation shows that these models effectively align the vector space in which words are embedded with the vector space in which entities are embedded.

We conduct extensive experiments comparing the proposed models with baseline methods. The results show that our methods outperform state-of-the-art approaches, and a more detailed analysis of the results confirms the motivations behind the proposed models.
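The abstract does not reproduce the scoring functions it refers to, so the following is a brief sketch based on the commonly published formulations of TransE and TransH; the symbols are chosen here for illustration and may not match the thesis's notation.

TransE treats a relation r as a translation of entity vectors and scores a triple (h, r, t) by

    f_r(h, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert

TransH first projects the head and tail vectors onto a hyperplane with relation-specific unit normal \mathbf{w}_r and then translates by a relation vector \mathbf{d}_r lying in that hyperplane:

    \mathbf{h}_\perp = \mathbf{h} - (\mathbf{w}_r^\top \mathbf{h})\,\mathbf{w}_r, \qquad \mathbf{t}_\perp = \mathbf{t} - (\mathbf{w}_r^\top \mathbf{t})\,\mathbf{w}_r
    f_r(h, t) = \lVert \mathbf{h}_\perp + \mathbf{d}_r - \mathbf{t}_\perp \rVert_2^2

Because distinct entities can share the same projection onto a relation's hyperplane, one head can be linked to many tails (and a relation can be reflexive) without forcing the full entity embeddings to coincide, which is how the projection step addresses the weakness of TransE described above.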
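For concreteness, here is a minimal NumPy sketch of the TransH scoring step described above. It assumes the standard TransH formulation; the function and variable names are illustrative and not taken from the thesis.

import numpy as np

def transh_score(h, t, w_r, d_r):
    # Plausibility score of a triple under TransH (lower = more plausible).
    # h, t : embedding vectors of the head and tail entities
    # w_r  : normal vector of the relation-specific hyperplane
    # d_r  : translation vector of the relation, lying in that hyperplane
    w_r = w_r / np.linalg.norm(w_r)        # enforce the unit-norm constraint on the normal
    h_perp = h - np.dot(w_r, h) * w_r      # project the head onto the hyperplane
    t_perp = t - np.dot(w_r, t) * w_r      # project the tail onto the hyperplane
    return float(np.linalg.norm(h_perp + d_r - t_perp) ** 2)

In training, the score of an observed triple is typically pushed below the scores of sampled negative triples by a fixed margin, which is where the negative-sampling distribution of contribution 2 comes into play.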
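The abstract likewise does not spell out the relation-specific negative-sampling distribution. The sketch below follows the Bernoulli-style scheme published with TransH, in which the side to corrupt (head or tail) is chosen from two per-relation statistics, the average number of tails per head (tph) and of heads per tail (hpt); the data structures and names are illustrative assumptions.

import random
from collections import defaultdict

def relation_stats(triples):
    # Compute tph (average tails per head) and hpt (average heads per tail) for each relation.
    tails_of = defaultdict(set)   # (r, h) -> set of tails observed with that head
    heads_of = defaultdict(set)   # (r, t) -> set of heads observed with that tail
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)
    tph, hpt = {}, {}
    for r in {r for _, r, _ in triples}:
        tail_counts = [len(ts) for (rr, _), ts in tails_of.items() if rr == r]
        head_counts = [len(hs) for (rr, _), hs in heads_of.items() if rr == r]
        tph[r] = sum(tail_counts) / len(tail_counts)
        hpt[r] = sum(head_counts) / len(head_counts)
    return tph, hpt

def corrupt(triple, entities, tph, hpt, known):
    # Sample a negative triple, biasing which side is corrupted so that the
    # corruption is less likely to hit a true but unobserved fact.
    h, r, t = triple
    p_replace_head = tph[r] / (tph[r] + hpt[r])
    while True:
        e = random.choice(entities)
        cand = (e, r, t) if random.random() < p_replace_head else (h, r, e)
        if cand not in known:              # reject corruptions that are known facts
            return cand

Corrupting the head of a one-to-many relation (high tph), rather than its tail, makes it far less likely that the sampled "negative" is actually a valid triple, which is exactly the false-negative problem raised in contribution 2.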
【Degree-granting institution】: Sun Yat-sen University
【Degree level】: Doctoral
【Year degree conferred】: 2017
【Classification number】: TP391.1
【Similar literature】
Related doctoral dissertations (2 items)
1 Wang Zhen. Knowledge Graph Completion Based on Embedding Models [D]. Sun Yat-sen University, 2017.
2 Chen Xi. Research and Application of Elastic Semantic Reasoning Methods for Large-Scale Knowledge Graphs [D]. Zhejiang University, 2017.
Article ID: 1649765
Link: https://www.wllwen.com/shoufeilunwen/xxkjbs/1649765.html