Research on Joint Learning Methods for Distributed Representations of Natural Language

Published: 2018-04-12 16:30

Topics: natural language processing + neural networks; Source: doctoral dissertation, University of Science and Technology of China, 2016

[Abstract]: Distributed representations of natural language are vector representations of natural language objects (words, phrases, sentences, paragraphs, documents, and so on) obtained by training deep neural network algorithms; such vectors are also called natural language embedding vectors. In general, a distributed representation is a low-dimensional, dense, real-valued vector learned without supervision from a large-scale corpus. Because it carries the semantic information of the object it represents, it serves as an effective representation of natural language, can be applied to a wide range of natural language processing tasks, and has achieved excellent practical performance. Unlike earlier methods that learn distributed representations entirely from raw text corpora (learning from scratch), this thesis tries to incorporate additional information and train the distributed representation vectors jointly with it. This information may be external (for example, dictionary information and knowledge graph information), or it may be another abstraction or higher-level representation of the original corpus (for example, word polysemy information and topic information). Joint training of this kind can, on the one hand, use the additional information to improve the quality of the original distributed representation vectors and, on the other hand, use distributed representations to better support the corresponding task (for example, topic modeling), leading to better practical performance. Specifically: 1) we jointly train word polysemy information and word distributed representations to overcome the limitation that traditional word embeddings treat the word as the basic unit of semantic embedding; the proposed algorithm can accurately represent the different senses of a polysemous word and achieves good practical results, and we also describe a large-scale parallel implementation of the algorithm; 2) we jointly train knowledge graph representations and word distributed representations to overcome the inability of word embeddings driven by raw text alone to represent complex knowledge relations; 3) building on these two forms of joint training, we propose a method that uses word distributed representations to complete IQ tests automatically, achieving higher accuracy on a standard vocabulary IQ test task than the human participants in that test; 4) going further, we propose a method for jointly training a topic model and a sentence-level distributed representation model based on a recurrent neural network (RNN); the resulting topic model can capture word-order information and outperforms traditional topic models, which ignore that information, on both quantitative and qualitative tasks.
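As a rough illustration of how word vectors can answer a vocabulary-style analogy question, the sketch below picks the candidate whose vector is closest to a vector-offset target. It is not the algorithm proposed in the thesis: the three-dimensional vectors, the `TOY_EMBEDDINGS` table, and the `solve_analogy` helper are all invented for this example, and real systems would use embeddings trained on a large corpus.

```python
import math

# Hypothetical toy vectors, invented purely for illustration.
TOY_EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, candidates, embeddings=TOY_EMBEDDINGS):
    """Answer "a is to b as c is to ?" by vector offset (an assumed heuristic).

    The target vector is b - a + c; the candidate with the highest
    cosine similarity to that target is returned.
    """
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


if __name__ == "__main__":
    # "man is to king as woman is to ?"
    print(solve_analogy("man", "king", "woman", ["queen", "apple", "man"]))
```

With these toy vectors the offset target lies nearest to `queen`. A single vector per word cannot separate the senses of a polysemous word, which is the limitation that the multi-sense joint training described in the abstract is meant to address.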
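[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctorate
[Year degree granted]: 2016
[Classification number]: TP391.1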

[Similar literature]