A Short-Text Classification Method Fusing Word Vectors and LDA

Published: 2018-08-02 13:10

[Abstract]: [Objective] To address the weak topic focus and severe feature sparsity of short texts, this paper designs a short-text classification method that fuses word vectors with the LDA topic model. [Methods] Short texts are modeled semantically at both the "word" granularity and the "text" granularity. First, word vectors are trained with Word2Vec and averaged to form the word-granularity short-text vector. Next, an LDA topic model is trained by Gibbs sampling, and each short text is expanded with additional features according to the maximum topic probability principle; the weights of the expanded features are then computed from word-vector similarity to obtain the text-granularity short-text vector. Finally, the two vectors are concatenated into a short-text representation model that fuses word vectors and LDA, and a nearest-neighbor classifier performs the classification on this representation. [Results] Compared with three traditional single-model classification methods (based on the vector space model, on word vectors, and on the LDA topic model), the fused method improves accuracy, recall, and F_1 by at least 3.7%, 4.1%, and 3.9%, respectively. [Limitations] The method has been applied only to the nearest-neighbor classifier and has not yet been extended to other classifiers such as naive Bayes and support vector machines. [Conclusions] Classification based on the short-text representation model that fuses word vectors and LDA effectively mitigates the weak topic focus and feature sparsity of short texts and improves short-text classification performance.
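The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word vectors, topic word lists, and document-topic probabilities below are hand-written toy values standing in for the outputs of Word2Vec and a Gibbs-sampled LDA model, and all function and variable names are hypothetical. The abstract does not give the exact formula for the text-granularity vector, so the sketch assumes it is the similarity-weighted average of the expansion words' vectors, with negative similarities clamped to zero.

```python
import math

# Toy stand-ins for trained models (assumptions, not real model output).
WORD_VECS = {
    "goal": [1.0, 0.0], "match": [0.9, 0.1], "team": [0.8, 0.2],
    "stock": [0.0, 1.0], "market": [0.1, 0.9], "bank": [0.2, 0.8],
}
# Highest-probability words of each LDA topic (topic 0, topic 1).
TOPIC_WORDS = [["team", "match", "goal"], ["bank", "market", "stock"]]


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    if not vectors:
        raise ValueError("no in-vocabulary tokens to average")
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_level_vector(tokens, word_vecs):
    """Word granularity: average the vectors of the text's known tokens."""
    return mean_vector([word_vecs[t] for t in tokens if t in word_vecs])


def text_level_vector(tokens, doc_topics, word_vecs, topic_words):
    """Text granularity: expand with the words of the most probable topic,
    weighting each by its similarity to the text (assumed weighting scheme)."""
    doc_vec = word_level_vector(tokens, word_vecs)
    best_topic = max(range(len(doc_topics)), key=lambda k: doc_topics[k])
    expansion = [w for w in topic_words[best_topic] if w in word_vecs]
    weights = [max(cosine(doc_vec, word_vecs[w]), 0.0) for w in expansion]
    total = sum(weights)
    if total == 0.0:
        return [0.0] * len(doc_vec)
    return [
        sum(wt * word_vecs[w][i] for wt, w in zip(weights, expansion)) / total
        for i in range(len(doc_vec))
    ]


def fused_vector(tokens, doc_topics, word_vecs, topic_words):
    """Concatenate the word-granularity and text-granularity vectors."""
    return (word_level_vector(tokens, word_vecs)
            + text_level_vector(tokens, doc_topics, word_vecs, topic_words))


def nearest_neighbor(query, labeled_vectors):
    """1-nearest-neighbor by cosine similarity over (vector, label) pairs."""
    return max(labeled_vectors, key=lambda item: cosine(query, item[0]))[1]


# Toy training set: (tokens, document-topic probabilities, label).
TRAIN_DOCS = [
    (["goal", "match"], [0.9, 0.1], "sports"),
    (["stock", "market"], [0.2, 0.8], "finance"),
]
TRAIN = [(fused_vector(toks, topics, WORD_VECS, TOPIC_WORDS), label)
         for toks, topics, label in TRAIN_DOCS]

query = fused_vector(["team", "goal"], [0.85, 0.15], WORD_VECS, TOPIC_WORDS)
print(nearest_neighbor(query, TRAIN))
```

With real models, the dictionary lookups would be replaced by trained Word2Vec embeddings and the topic inputs by the inferred LDA distributions; the concatenation and nearest-neighbor steps would stay structurally the same.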
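[Author affiliation]: Electronic Engineering Institute of the Chinese People's Liberation Army
[Funding]: One of the research outcomes of the National Natural Science Foundation of China project "Research on Constructive Machine Learning Methods for Dynamic Data Mining" (Grant No. 61273302)
[Classification number]: TP391.1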

[Similar literature]