Research on Text Classification Based on Word2vec Word Vectors

Published: 2018-06-22 08:05

Topic: Word2vec model + text representation; Source: Southwest University, master's thesis, 2017

[Abstract]: Automatic text classification plays an important role in text mining, natural language processing, and machine learning, and it greatly facilitates information retrieval and text management. In recent years, with the rapid development of Internet technology, text data has been growing rapidly every day: microblog posts, news content on major news portals, e-mail exchanged between users, and posts on forums and blogs. Automatic text classification is an effective tool for processing and organizing this text data and has already been applied in many areas, such as microblog sentiment classification, spam filtering, and automatic distribution of news content. As text data on the Internet keeps growing, automatic text classification will play an increasingly important role in these areas.

Automatic text classification involves several techniques, such as text preprocessing, text representation, feature selection, feature extraction, and the choice of classification algorithm. Among these, text representation and the classification algorithm are the key, because they directly affect the classification result. Most current research on text classification likewise focuses on feature selection and extraction, text representation, and the optimization of classification algorithms. Among the many text representation models, the vector space model (VSM) weighted by term frequency-inverse document frequency (TF-IDF), abbreviated as the VSM_TFIDF model, is a mainstream choice. It performs well in both academia and industry, but it does not represent the semantic information of a text well: it cannot take the contextual semantics and syntactic information of feature words into account. In addition, common text distance measures, such as Euclidean distance and cosine distance, cannot adequately measure the similarity between texts represented by this kind of model.

To address these problems, this thesis uses Word2vec word vectors to introduce semantic information into the text representation model or the text distance measure, and thereby improves text classification. It studies in depth how Word2vec word vectors are generated, including the two training models (CBOW and Skip-gram) and the two optimization schemes that speed up word vector training (Hierarchical Softmax and Negative Sampling). On this basis, Word2vec word vectors are applied to the study of text representation models and text distance measures. The main work covers the following two aspects.

(1) A multi-granularity, multi-model text representation model, CombineTextVector, is proposed, based on Word2vec word vectors and the VSM_TFIDF model. Because Word2vec word vectors represent the semantic information of feature words well, the thesis combines them with the VSM_TFIDF model so that the two complement each other and improve the text representation. First, the category information of the texts is embedded into the TF-IDF weighting formula to improve the class-discriminating ability of the weighting factor (the result is named the wTFIDF weighting formula). This weighting is then combined with Word2vec word vectors to build a multi-granularity text representation model, Word2vec_wTFIDF. Finally, that model is combined with the traditional VSM_TFIDF model to build the CombineTextVector text representation model. To verify the performance of the new model, experiments are designed on the Fudan Chinese text classification corpus, and the model is compared with several mainstream text representation models. The experimental results show that the new model consistently achieves high classification F1 scores.

(2) A distance measure for topic models, TopMD, is proposed, based on Word2vec word vectors and the Earth Mover's Distance (EMD). The thesis first analyzes the text distance measures commonly used with the traditional VSM_TFIDF model and with topic models. To address the problem that semantic similarity between texts cannot be measured well, it combines the EMD measure with Word2vec word vectors and proposes the TopMD distance measure for topic models. Compared with common measures, TopMD takes the finer-grained similarity between feature words into account. To verify the effectiveness of the proposed method, experiments are carried out on both Chinese and English corpora, and the method is compared with several other distance measures. The experimental results show that, compared with traditional measures, the method improves text similarity measurement for topic models.
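
The first contribution described in the abstract, weighting word vectors by TF-IDF and joining the result with a TF-IDF vector space representation, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the wTFIDF formula, so plain smoothed TF-IDF is used instead, and the function names (`tfidf_weights`, `combined_vectors`), the normalization step, and the random toy vectors standing in for trained Word2vec vectors are all assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np


def tfidf_weights(docs):
    """Return the sorted vocabulary and one {term: tf-idf weight} dict per document."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    # Smoothed IDF (an assumption): terms found in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vocab, weights


def l2_normalize(v):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def combined_vectors(docs, word_vectors, dim):
    """Concatenate a TF-IDF vector with a TF-IDF-weighted sum of word vectors.

    Hypothetical stand-in for the combined representation: plain TF-IDF
    replaces the thesis's wTFIDF weighting, which the abstract does not define.
    """
    vocab, weights = tfidf_weights(docs)
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for doc_weights in weights:
        sparse = np.zeros(len(vocab))  # vector space model part
        dense = np.zeros(dim)          # weighted word-vector part
        for term, weight in doc_weights.items():
            sparse[index[term]] = weight
            if term in word_vectors:   # skip terms without a word vector
                dense += weight * word_vectors[term]
        rows.append(np.concatenate([l2_normalize(sparse), l2_normalize(dense)]))
    return np.vstack(rows)


# Toy usage: random vectors stand in for trained Word2vec embeddings.
docs = [["stock", "market", "rises"], ["team", "wins", "match"], ["market", "falls"]]
rng = np.random.default_rng(0)
toy_vectors = {t: rng.normal(size=4) for doc in docs for t in doc}
features = combined_vectors(docs, toy_vectors, dim=4)
print(features.shape)
```

Each half is normalized before concatenation so that neither part dominates a downstream distance computation; that is a design choice of this sketch, not something the abstract states. The resulting rows could be fed to any standard classifier.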
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1

[References]