Research on Improvements to Neural Network Based Language Models
Posted: 2019-04-08 09:43
【Abstract】: In natural language processing research, words and sentences are the primary units of study. A word is generally the smallest meaningful unit in text processing; for example, a search engine typically segments a query into words before looking them up. A sentence is a higher-level unit than a word, and if we place no limit on sentence length, a sentence can also be a paragraph or an entire document. Because words and sentences are the main units of text processing, research on their representations is especially important. Word representation learning methods fall into two broad types: word vectors trained with neural network models, and matrix-factorization-style methods such as LSA and LDA. For sentences, the main representations are the TF-IDF-based vector space model; topic models, which take a sentence's distribution over topics as its representation; and neural-network language models, which can learn sentence representations in an unsupervised way.

The main work of this thesis covers the following aspects. First, for the word vector model, a hierarchical softmax method based on inverse-frequency Huffman coding is proposed. Neural-network language models are usually accelerated with Huffman-coded hierarchical softmax or with negative sampling. This thesis argues that word2vec's scheme, in which frequent words get short codes and rare words get long codes, is not a reasonable choice, and therefore proposes Huffman coding based on inverse word frequency.
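The inverse-frequency idea can be sketched in a few lines. The Python snippet below is an illustration of mine, not the thesis's code: it builds an ordinary Huffman tree but weights each word by its inverse frequency, so frequent words land deeper in the tree and receive longer codes, the reverse of standard word2vec. The function name `inverse_freq_huffman` and the toy counts are invented for the example.

```python
import heapq
from itertools import count

def inverse_freq_huffman(freqs):
    """Huffman codes with inverse word frequencies as weights, so
    frequent words get LONGER codes (the thesis's proposal), unlike
    the standard word2vec Huffman coding."""
    tie = count()  # tie-breaker so heapq never compares the dicts
    heap = [(1.0 / f, next(tie), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in c1.items()}
        merged.update({w: "1" + code for w, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

codes = inverse_freq_huffman({"the": 1000, "cat": 50, "sat": 40, "rare": 1})
# Frequent "the" gets the longest code; rare "rare" gets the shortest.
for word, code in sorted(codes.items(), key=lambda x: len(x[1])):
    print(word, code)
```

In hierarchical softmax, a word's code is the path of binary decisions from the tree root to its leaf, so this weighting changes how many internal-node parameters each word's updates touch.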
Next, the thesis studies position-based weight vectors and weight factors, and uses both to improve the word2vec model. Finally, it proposes a word vector model in which the context representation and the target representation are shared. These normally correspond to two separate sets of vectors, but the experiments here show that sharing them gives better results. Second, for paragraph vectors, the thesis proposes the D-CBOW model to learn paragraph vectors and word vectors jointly. Unlike Quoc Le's model, which concatenates or averages the context vectors, D-CBOW fuses the word vectors and the paragraph vector through a paragraph weight vector and position weight vectors. Third, using the algorithms above, the thesis designs and implements sentiment classification for paragraphs. Several comparative experiments show that with position-based weight vectors and inverse-frequency coding, the results on sentiment classification of IMDB movie reviews are better than those of Quoc Le's method. The thesis also compares the sigmoid, tanh, and relu activation functions, and finds that relu works best on the sentiment classification task.
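To make the fusion step concrete, here is a minimal NumPy sketch of a D-CBOW-style forward pass under assumptions the abstract does not pin down (elementwise weight vectors, random initialization); every name in it is hypothetical, and it should be read as one plausible reading of the model rather than the thesis's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, P, D, C = 10_000, 500, 100, 4   # vocab, paragraphs, embedding dim, context size

W_word = rng.normal(0, 0.1, (V, D))   # word embeddings, shared between the
                                      # context and target roles (contribution 1)
W_para = rng.normal(0, 0.1, (P, D))   # one embedding per paragraph
pos_w  = np.ones((C, D))              # learned per-position weight vectors (assumed)
para_w = np.ones(D)                   # learned paragraph weight vector (assumed)

def dcbow_hidden(context_ids, para_id):
    """D-CBOW-style fusion: weight each context position and the paragraph
    vector elementwise, then sum, instead of concatenating or averaging
    the context as in Quoc Le's PV-DM."""
    ctx = W_word[context_ids]                                # (C, D)
    return (pos_w * ctx).sum(axis=0) + para_w * W_para[para_id]

h = dcbow_hidden(np.array([3, 17, 256, 9]), para_id=42)
print(h.shape)   # (100,) -- h would then feed the hierarchical softmax layer
```

The design point is that the weights let the model learn how much each context position and the paragraph itself should contribute, rather than fixing those contributions by concatenation or a uniform average.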
【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master's
【Year conferred】: 2016
【CLC number】: TP391.1; TP183