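Research on Feature Expansion Methods in LDA-Based Short Text Classification

Published: 2018-04-29 00:11

  Topics: topic model + feature expansion; Source: China University of Geosciences (Beijing), 2017 master's thesis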


[Abstract]: With the arrival of the information age, people spend more and more time online, and content distribution platforms and social networking sites have grown rapidly in recent years. Analyzing online public opinion and organizing online news both require classification according to given criteria, which calls for research on text classification, and on short text classification in particular. Methods designed for long texts cannot simply be carried over to short texts. One approach is to first expand the classification keywords of a short text and then classify it with a classifier. Following this approach, this thesis proposes a feature expansion method that combines LDA topic words with feature classification weights.

The thesis examines the vector space model, the representation commonly used for traditional long texts. It concludes that the vector space model suits long texts rich in keyword information, whereas for short texts with few keywords the feature vector space becomes too sparse, so the model cannot be used directly to represent short texts. Based on a review of research in China and abroad, the thesis studies the theoretical foundations of the LDA model, uses LDA to obtain the topic-word distribution of the corpus, uses LDA to infer the topic of each test sample, and analyzes the correlation between the test text and the topic words under its topic. It concludes that expanding short texts directly with LDA topic words has shortcomings.

In light of the characteristics of the LDA model, and to address these shortcomings, the thesis proposes a feature classification weight that reflects how the classification information carried by a feature word differs across categories. The weight takes into account the distribution of the feature word between classes, its dispersion within a class, and cases where the feature word is incompletely classified within a class. On this basis, a candidate-word self-selection mechanism is introduced for feature expansion with LDA topic words.

To verify the effectiveness of the method, a classification platform is built with ICTCLAS (the Chinese Academy of Sciences word segmentation tool) and LIBSVM, and the proposed feature expansion method is compared with the traditional LDA-based feature expansion method for short text classification. The experiments show that classification performance improves to a certain extent after short texts are expanded with the proposed method.
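The expansion idea described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the topic-word table, the per-word weights, the thresholds, and the function names `infer_topic` and `expand` are all hypothetical, and the crude topic scoring only stands in for real LDA inference. The thesis's actual weight formula is not given in the abstract, so the weights here are made-up numbers.

```python
# Hypothetical topic-word distribution, standing in for the output of a
# trained LDA model (topic id -> list of (word, probability)).
TOPIC_WORDS = {
    0: [("match", 0.30), ("team", 0.25), ("report", 0.20), ("goal", 0.15)],
    1: [("stock", 0.35), ("market", 0.30), ("report", 0.20), ("fund", 0.10)],
}

# Hypothetical feature classification weights in [0, 1]. A word spread evenly
# across classes ("report") gets a low weight; these values are invented.
CLASS_WEIGHT = {
    "match": 0.90, "team": 0.80, "goal": 0.85,
    "stock": 0.90, "market": 0.80, "fund": 0.70,
    "report": 0.10,
}


def infer_topic(tokens, topic_words):
    """Pick the topic that assigns the most probability mass to the tokens.

    This is a crude stand-in for LDA inference, used only for illustration.
    """
    scores = {}
    for topic, words in topic_words.items():
        probs = dict(words)
        scores[topic] = sum(probs.get(t, 0.0) for t in tokens)
    return max(scores, key=scores.get)


def expand(tokens, topic_words, class_weight, top_k=2, min_weight=0.5):
    """Append up to top_k topic words whose weight is at least min_weight."""
    topic = infer_topic(tokens, topic_words)
    ranked = sorted(topic_words[topic], key=lambda pair: -pair[1])
    candidates = [
        word for word, _ in ranked
        if word not in tokens and class_weight.get(word, 0.0) >= min_weight
    ]
    return tokens + candidates[:top_k]


short_text = ["team", "wins"]
# Direct expansion (no weight filter) admits the non-discriminative "report".
direct = expand(short_text, TOPIC_WORDS, CLASS_WEIGHT, min_weight=0.0)
# Weight-filtered expansion skips it and takes the next candidate instead.
filtered = expand(short_text, TOPIC_WORDS, CLASS_WEIGHT)
print(direct)
print(filtered)
```

In this toy setup, setting `min_weight=0.0` mimics expanding directly with the top topic words, while the default threshold mimics letting the weight decide which candidates are admitted. The expanded token list would then be vectorized and passed to a classifier such as an SVM.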
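[Degree-granting institution]: China University of Geosciences (Beijing)
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1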

