Research on a Short Text Feature Extension Method Based on Word Embedding
Published: 2018-05-20 13:11
Topic: Word Embedding; Source: Master's thesis, Jilin University, 2017
【Abstract】: With the development of the Internet and the popularization of mobile devices, communication between people has become more immediate and convenient. Social media such as SMS, QQ, and Weibo have become an indispensable part of our lives, and the form of information has become shorter and freer. The number of short texts on the network is growing rapidly, which poses new challenges to traditional automatic information processing and text mining techniques designed for long texts. How to address the inherent problems of short texts, such as feature sparsity and low feature coverage, has become a research focus for many scholars, and the most direct and effective approach is to extend the features of the short text. As deep learning has continued to develop, it has been widely applied in many fields, and natural language processing combined with deep learning has become an inevitable research trend; Word Embedding is an important outcome of this process. Word Embedding is a vector representation of words. Unlike traditional representations, in which words are mutually independent, it distributes words in a relatively low-dimensional vector space according to the strength of their semantic associations, encoding both the explicit and the implicit rules of language. A word vector is therefore no longer a mere symbol for identifying a word; it also carries rich semantic information. This thesis takes Word Embedding as the basis for short text feature extension and proposes a new text feature extension method. The method enriches the semantic information of short texts while expanding feature coverage. The specific research contents are as follows:

1. Training Word Embedding on a large-scale corpus. Word Embedding is trained with language models of neural-network structure. Following the development of Word Embedding and different application needs, this thesis introduces four common models: the neural network language model, the recurrent neural network model, CBOW, and Skip_gram. Based on other researchers' studies of these models and the task requirements of this thesis, Skip_gram was chosen as the Word Embedding training model, and the English Wikipedia database, rich in content and large in volume, was chosen as the training data, yielding Word Embedding representations for more than two million words (a minimal training sketch follows this abstract).

2. Performing simple inference within the scope of a short text through vector computation, based on the properties of Word Embedding. Some of the language rules encoded by Word Embedding can be expressed through addition and subtraction between the embeddings. This thesis applies this property to the ordered word sequence of a short text to obtain vector expressions related to the short text's semantics; the resulting inference vectors lie in the same vector space as the Word Embedding (see the second sketch below).

3. Representing the extended feature space with Word Embedding clusters. Unlike traditional fine-grained semantic units (words, phrases, concepts, etc.), this thesis exploits the spatial distribution of Word Embedding and obtains, through clustering, "semantic units" automatically partitioned by semantic similarity. These "semantic units" serve as the feature items of the extended feature space, and any vector expression of the same dimensionality (including the Word Embedding vectors of a short text and the inference vectors introduced above) can be mapped onto it (see the third sketch below).

Finally, short text classification and short text clustering experiments were carried out with the proposed Word Embedding-based feature extension method. On two datasets, Google search snippets and China Daily news summaries, classification accuracy improved over an LDA-based method by 3.7% and 1.0% respectively, and the clustering F-measure improved over traditional clustering methods by 30.64% and 17.54% respectively. The experimental results show that the proposed method expresses the information of short texts better and alleviates the problems of feature sparsity and low feature coverage.
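A minimal sketch of the Skip_gram training step described in research content 1. The abstract does not name a training toolkit, so gensim, the corpus file path, and all hyperparameter values below are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch: training Skip_gram Word Embeddings on a preprocessed
# English Wikipedia corpus. gensim, the file paths, and the hyperparameters
# are assumptions for illustration; the thesis does not specify them.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One preprocessed sentence per line, tokens separated by whitespace.
corpus = LineSentence("enwiki_preprocessed.txt")  # hypothetical path

model = Word2Vec(
    sentences=corpus,
    sg=1,             # 1 selects Skip_gram (0 would select CBOW)
    vector_size=200,  # dimensionality of the embedding space (assumed)
    window=5,         # context window size
    min_count=5,      # drop words with fewer than 5 occurrences
    workers=4,
)
model.save("wiki_skipgram.model")
```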
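Research content 2 rests on the additive compositionality of Word Embedding. The sketch below shows only this underlying arithmetic; the abstract does not spell out the exact composition scheme applied to a short text's word sequence, so the plain summation here is a stand-in assumption.

```python
# Sketch: simple inference through addition and subtraction between
# Word Embeddings; the result stays in the same vector space.
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("wiki_skipgram.model")  # model from the first sketch
wv = model.wv

# A language rule expressed by vector arithmetic: king - man + woman ~ queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# One simple additive combination over a short text's word sequence; the
# thesis's actual composition scheme may differ.
tokens = ["apple", "releases", "new", "phone"]
vectors = [wv[t] for t in tokens if t in wv]
inference = np.sum(vectors, axis=0)  # inference vector, same space as wv
print(wv.similar_by_vector(inference, topn=5))
```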
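One plausible realization of research content 3, sketched under assumptions: the "semantic units" are built by clustering all trained embeddings, and a short text (plus any inference vectors of the same dimensionality) is mapped onto the extended feature space as a histogram over those units. MiniBatchKMeans and the cluster count of 1000 are assumptions; the abstract says only that the units come from clustering by semantic similarity.

```python
# Sketch: "semantic units" via clustering of Word Embeddings, and mapping of
# same-dimensional vectors onto the extended feature space.
# MiniBatchKMeans and n_clusters=1000 are illustrative assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from gensim.models import Word2Vec

model = Word2Vec.load("wiki_skipgram.model")  # model from the first sketch
wv = model.wv

# Each cluster of the embedding space acts as one "semantic unit".
kmeans = MiniBatchKMeans(n_clusters=1000, batch_size=10000, random_state=0)
kmeans.fit(wv.vectors)

def extended_features(tokens, extra_vectors=()):
    """Histogram of a short text over the semantic units; inference vectors
    of the same dimensionality can be mapped through extra_vectors."""
    feats = np.zeros(kmeans.n_clusters, dtype=np.float32)
    vecs = [wv[t] for t in tokens if t in wv]
    for v in list(vecs) + list(extra_vectors):
        unit = kmeans.predict(np.asarray(v).reshape(1, -1))[0]
        feats[unit] += 1.0
    return feats

print(extended_features(["apple", "releases", "new", "phone"]).nonzero())
```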
【Degree-granting institution】: Jilin University
【Degree level】: Master's
【Year of degree conferral】: 2017
【CLC number】: TP391.1
Document No.: 1914747
Link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1914747.html