
Research on Text Representation Modeling Methods Based on Convolutional Neural Networks

Published: 2018-04-01 07:16

  Topic: text representation | Angle: convolutional neural networks | Source: master's thesis, Central China Normal University (华中师范大学), 2017


【Abstract】: In machine learning, data representation is the key determinant of performance on downstream tasks. Text is a major class of data, and text representation is foundational to many natural language processing tasks: the purpose of a text representation model is to analyze and encode the semantic information of text so that tasks such as text classification, machine translation, and question answering achieve better results. Traditional text representation methods such as the bag-of-words model suffer from data sparsity and the curse of dimensionality, and generalize poorly. In recent years, with the development of machine learning, text representation models built on various neural networks have emerged. A neural text representation model maps text at multiple structural levels, through the learned transformations of the network, into low-dimensional continuous vectors; because all vectors lie in the same low-dimensional space, the representational power of the model improves. Among neural architectures, convolutional neural networks are particularly effective at feature extraction.

Existing neural text representation models nevertheless have two shortcomings. First, the same word occurring in different texts is given a single, identical vector, so feature extraction cannot properly distinguish polysemy, homonymy, and similar phenomena, and the resulting features do not serve classification well. Second, typical neural text representation models cannot effectively capture the semantic and structural information of different text units and their variable-length combinations, so performance degrades markedly on document-level text. Addressing these problems, this thesis compares several neural text representation methods at both the sentence level and the document level and, based on the shortcomings of existing methods, proposes improved representation models. The main contributions are as follows.

First, a convolutional neural network sentence representation model based on topic word vectors is proposed. For the word vector matrix at the network's input layer, the model exploits the observation that the same word should carry different semantic information in different texts: each word in a sentence is assigned the topic information of the text it appears in, yielding a topic word vector for every word. To avoid introducing irrelevant topic information into the network, a topic transfer matrix is added in an intermediate layer to filter out useless topic information; this matrix is computed from word-topic similarities and the topic probability distribution. Fusing the topic word vectors into the network through the topic transfer matrix allows the model to resolve the ambiguity of a word across different texts. Experiments show that the resulting representations perform better on sentence-level sentiment classification.

Second, a convolutional neural network document representation model based on long-distance dependencies is proposed. Because typical neural text representation models cannot capture long-distance semantic relations within a document, the topic word vector sequence of the entire document is first processed by a long short-term memory (LSTM) layer, producing a sequence of hidden states that encodes long-distance semantic relations and structural information; a convolutional neural network then extracts features from this sequence to obtain the document representation. Depending on whether semantic interaction between the sentences of a document is modeled, two variants are given: a document semantic memory representation model and a sentence-document semantic memory representation model. Experiments show that the resulting representations perform better on document-level sentiment classification.
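The thesis itself is not reproduced on this page, so the following is only a minimal sketch of the sentence-level model as the abstract describes it: each word is represented by its word vector concatenated with a topic vector, a "topic transfer" gate down-weights irrelevant topic information, and a convolution plus max-pooling layer produces the sentence representation used for sentiment classification. All layer sizes, the gating form, and the use of per-word topic assignments (e.g. from an LDA-style topic model) are assumptions made for illustration, not details taken from the thesis.

```python
# Minimal sketch of a topic-word-vector CNN for sentence-level sentiment classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicWordCNN(nn.Module):
    def __init__(self, vocab_size, num_topics, emb_dim=100, topic_dim=50,
                 num_filters=100, kernel_size=3, num_classes=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.topic_emb = nn.Embedding(num_topics, topic_dim)
        # "Topic transfer" gate: scores how relevant the assigned topic is to the word,
        # standing in for the similarity/probability computation mentioned in the abstract.
        self.transfer = nn.Linear(emb_dim + topic_dim, 1)
        self.conv = nn.Conv1d(emb_dim + topic_dim, num_filters, kernel_size, padding=1)
        self.classifier = nn.Linear(num_filters, num_classes)

    def forward(self, word_ids, topic_ids):
        # word_ids, topic_ids: (batch, seq_len); topic_ids hold the topic assigned
        # to each word in this particular text.
        w = self.word_emb(word_ids)                              # (batch, seq, emb_dim)
        t = self.topic_emb(topic_ids)                            # (batch, seq, topic_dim)
        gate = torch.sigmoid(self.transfer(torch.cat([w, t], dim=-1)))
        x = torch.cat([w, gate * t], dim=-1)                     # filtered topic word vectors
        x = x.transpose(1, 2)                                    # (batch, channels, seq)
        h = F.relu(self.conv(x))
        h = F.max_pool1d(h, h.size(-1)).squeeze(-1)              # (batch, num_filters)
        return self.classifier(h)

# Toy usage: two sentences of length 8 with per-word topic assignments.
model = TopicWordCNN(vocab_size=5000, num_topics=50)
words = torch.randint(0, 5000, (2, 8))
topics = torch.randint(0, 50, (2, 8))
logits = model(words, topics)   # (2, num_classes)
```

Under the same caveats, the sketch below illustrates the document-level model: the (topic) word vector sequence of the whole document first passes through an LSTM so that each hidden state carries long-distance context, and a convolution plus max-pooling layer is then applied to the hidden-state sequence to yield the document representation. This corresponds roughly to the "document semantic memory" variant; the sentence-document variant would additionally model interaction between per-sentence representations. Dimensions and single-layer choices are illustrative assumptions.

```python
# Minimal sketch of an LSTM + CNN document representation model for sentiment classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMConvDocModel(nn.Module):
    def __init__(self, input_dim=150, hidden_dim=128, num_filters=100,
                 kernel_size=3, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.conv = nn.Conv1d(hidden_dim, num_filters, kernel_size, padding=1)
        self.classifier = nn.Linear(num_filters, num_classes)

    def forward(self, x):
        # x: (batch, doc_len, input_dim) -- e.g. the topic word vectors of every
        # word in the document, in order.
        h, _ = self.lstm(x)                          # (batch, doc_len, hidden_dim)
        h = h.transpose(1, 2)                        # (batch, hidden_dim, doc_len)
        f = F.relu(self.conv(h))
        f = F.max_pool1d(f, f.size(-1)).squeeze(-1)  # (batch, num_filters)
        return self.classifier(f)

# Toy usage: two documents, each 200 tokens long, with 150-dimensional input vectors.
model = LSTMConvDocModel()
doc_vectors = torch.randn(2, 200, 150)
logits = model(doc_vectors)   # (2, num_classes)
```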

【Degree-granting institution】: Central China Normal University (华中师范大学)
【Degree level】: Master's
【Year conferred】: 2017
【CLC classification】: TP391.1; TP18


Article ID: 1694504


Link to this page: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/1694504.html


