Text Representation Based on Deep Neural Networks and Its Applications
Published: 2018-05-27 07:07
Topics: deep learning + language representation; Source: Ph.D. dissertation, Harbin Institute of Technology, 2016
【Abstract】: In recent years, deep neural networks have been explored in depth for tasks such as image classification and speech recognition, where they have achieved outstanding results and demonstrated excellent representation learning ability. Text representation has long been a core problem in natural language processing (NLP); the curse of dimensionality, data sparsity, and other problems of traditional text representations have become a bottleneck for improving performance on many NLP tasks. Learning text representations with deep neural networks has therefore gradually become a new research focus. However, because human language is flexible and variable and semantic information is complex and abstract, applying deep neural network models to text representation learning is particularly difficult. This thesis studies deep neural network representation learning for text at different granularities and applies it to related tasks.

First, word vector learning is studied, and a word vector learning model based on verb-noun separation is proposed. The model introduces part of speech (POS) into the word vector learning process while preserving word order information. Inspired by the verb-noun separation structure of the human brain, the model dynamically selects the network parameters of its top layer according to the POS tags produced by a tagging tool, thereby realizing verb-noun separation during learning. Experimental comparison with related word vector learning methods shows that the model learns high-quality word vectors with relatively low time complexity, that the similar words it retrieves for common words are more reasonable, and that its performance on named entity recognition and chunking tasks is significantly better than that of the compared word vectors.

Second, sentence representation learning is studied, and a sentence representation model based on deep convolutional neural networks (CNNs) is proposed. The model does not rely on a syntactic parse tree; it models a sentence through multiple overlapping layers of convolution and max pooling. Sentence matching is important for many NLP tasks: a good matching model must not only model the internal structure of each sentence reasonably but also capture matching patterns between the sentences at different levels. On this basis, two deep-CNN sentence matching architectures are proposed. Architecture I first represents the two sentences with two CNNs and then matches them with a multi-layer perceptron (MLP). Architecture II models the matching between the two sentences directly and then scores the matching representation with an MLP. Neither architecture requires any prior knowledge, so both can be widely applied to matching tasks of different natures and in different languages. Experimental results on three sentence-level matching tasks in different languages and of different natures show that both proposed architectures substantially outperform the compared models. Architecture II captures multi-level matching patterns between the two sentences more effectively than Architecture I and achieves excellent performance on all three tasks.

Third, the selection of phrase pairs in statistical machine translation (SMT) is studied, and a context-dependent convolutional neural network phrase matching model is proposed. When selecting target phrase pairs, the model considers not only the semantic similarity between the source and target phrases but also the sentence context of the source phrase. To train the model effectively, context-dependent bilingual word embeddings are used to initialize it, and a "curriculum" learning algorithm is designed to train it step by step, from easy examples to hard ones. Experiments show that integrating the model's matching scores for bilingual phrases into a strong SMT system significantly improves translation performance, with the BLEU score rising by 1.0%.

Fourth, the automatic generation of short-text summaries is studied. A relatively high-quality, large-scale Chinese short text summarization dataset is constructed, containing more than 2.4 million summaries, along with a high-quality test set. An encoder-decoder architecture based on recurrent neural networks (RNNs) is adopted to learn summary generation from the large-scale dataset, and two RNN-based summarization models are built. Model I encodes the source text with an RNN, takes its last state as the representation of the source passage, and decodes the summary from this representation with another RNN. Model II builds on Model I by dynamically combining all states of the encoder RNN into a context representation at each step and passing the current context representation to the decoder RNN to generate the summary. Both are generative models that require no hand-crafted features. Experiments show that both models represent the source text reasonably well and generate highly informative summaries; in particular, the summaries generated by Model II are of significantly higher quality than those generated by Model I.

In summary, taking deep neural networks as the means and text representation as the object of study, this thesis investigates representation learning and its applications for text at different granularities in natural language, namely words, sentences, and passages. The proposed methods are applied to sequence labeling, sentence matching, machine translation, and automatic summary generation, with good results.
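To make the verb-noun separation idea concrete, here is a minimal sketch of a window-based scorer whose top-layer parameters are selected by the POS class of the center word. The abstract includes no code, so everything here (the class name PosGatedScorer, the number of POS classes, the C&W-style ranking loss) is an illustrative assumption, not the thesis' implementation.

```python
import torch
import torch.nn as nn

class PosGatedScorer(nn.Module):
    """Window scorer whose top layer is selected by the center word's POS class."""
    def __init__(self, vocab_size, emb_dim=50, window=5, hidden=100, n_pos=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # shared word vectors
        self.proj = nn.Linear(window * emb_dim, hidden)   # concatenation keeps word order
        # One top layer per coarse POS class (e.g. noun / verb / other):
        # this is the "verb-noun separated" part of the model.
        self.tops = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_pos))

    def forward(self, window_ids, pos_id):
        # window_ids: (batch, window) word ids; pos_id: POS class of the center word
        h = torch.tanh(self.proj(self.emb(window_ids).flatten(1)))
        return self.tops[pos_id](h)                       # POS-specific score

# A ranking loss over true vs. corrupted windows would train the embeddings, e.g.
# loss = torch.clamp(1 - model(win, pos) + model(corrupt, pos), min=0).mean()
```

The key design point is that only the top layer is POS-specific; the embeddings and the order-preserving projection are shared across all POS classes, which keeps the added time complexity low.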
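The following is a rough sketch of Architecture I under the assumptions stated in the abstract: parse-free sentence modeling by stacked convolution and max pooling, followed by MLP matching. Layer sizes, kernel widths, and class names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    """Parse-free sentence model: stacked 1-D convolution + max pooling."""
    def __init__(self, emb_dim=100, channels=200):
        super().__init__()
        self.conv1 = nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                     # x: (batch, len, emb_dim)
        h = F.relu(self.conv1(x.transpose(1, 2)))
        h = F.max_pool1d(h, kernel_size=2)    # local max pooling
        h = F.relu(self.conv2(h))
        return h.max(dim=2).values            # global max pool -> fixed-size vector

class ArchitectureOne(nn.Module):
    """Encode the two sentences separately, then match with an MLP."""
    def __init__(self, emb_dim=100, channels=200):
        super().__init__()
        self.enc = SentenceEncoder(emb_dim, channels)
        self.mlp = nn.Sequential(nn.Linear(2 * channels, 200), nn.Tanh(),
                                 nn.Linear(200, 1))

    def forward(self, sent_x, sent_y):        # embedded sentences
        return self.mlp(torch.cat([self.enc(sent_x), self.enc(sent_y)], dim=1))
```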
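Architecture II matches the two sentences directly rather than through two independently pooled vectors. Below is a heavily simplified sketch of that direct-matching idea, using a word-by-word interaction map followed by 2-D convolution; the thesis' actual architecture may build and pool interactions differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArchitectureTwo(nn.Module):
    """Score an interaction map of the two sentences, so that multi-level
    matching patterns are visible to the model before any pooling."""
    def __init__(self, channels=50):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.mlp = nn.Sequential(nn.Linear(channels, 100), nn.Tanh(),
                                 nn.Linear(100, 1))

    def forward(self, sent_x, sent_y):    # (batch, len_x, d), (batch, len_y, d)
        inter = torch.bmm(sent_x, sent_y.transpose(1, 2))  # word-word interactions
        h = F.relu(self.conv(inter.unsqueeze(1)))          # (batch, C, len_x, len_y)
        h = h.amax(dim=(2, 3))                             # global max pool per channel
        return self.mlp(h)                                 # matching score
```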
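The "curriculum" training procedure for the phrase matching model can be pictured as follows. This is only a schematic sketch: the difficulty measure (shorter phrase pairs are easier) is an invented placeholder, since the abstract does not specify how easy and hard examples are distinguished.

```python
def curriculum_batches(pairs, n_stages=3, epochs_per_stage=1, batch_size=64):
    """Yield mini-batches of (src_phrase, tgt_phrase, label) examples,
    starting from an 'easy' subset and growing toward the full set."""
    ranked = sorted(pairs, key=lambda p: len(p[0]) + len(p[1]))  # easy -> hard
    for stage in range(1, n_stages + 1):
        pool = ranked[: len(ranked) * stage // n_stages]  # enlarge the pool each stage
        for _ in range(epochs_per_stage):
            for i in range(0, len(pool), batch_size):
                yield pool[i : i + batch_size]
```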
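Model II's dynamic combination of all encoder states into a per-step context representation is what is now commonly called attention. A minimal sketch, assuming simple dot-product scoring (the abstract does not specify the scoring function):

```python
import torch

def dynamic_context(decoder_state, encoder_states):
    """Combine all encoder states into a context vector for the current
    decoding step, weighted by their relevance to the decoder state."""
    # decoder_state: (batch, dim); encoder_states: (batch, src_len, dim)
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))  # (batch, src_len, 1)
    weights = torch.softmax(scores, dim=1)        # normalize over source positions
    return (weights * encoder_states).sum(dim=1)  # (batch, dim) context vector
```

Feeding this context vector (rather than only the encoder's last state, as in Model I) to the decoder at every step lets the summary condition on different parts of the source passage as it is generated.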
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Ph.D.
【Year conferred】: 2016
【CLC number】: TP391.1; TP183