Microblog Topic Mining Based on Topic Models
Published: 2018-08-23 10:47
[Abstract]: As the number of microblog users keeps growing, Twitter abroad and Sina Weibo in China have become important platforms on which media and individuals publish information. Microblog posts are a special kind of text: they are usually shorter than 140 characters, carry rich social information, and contain not only topic-bearing text but also redundant text with no power to characterize a topic. Traditional text mining algorithms therefore extract microblog topics poorly. This thesis combines Chinese part-of-speech (POS) tagging with the LDA (Latent Dirichlet Allocation) topic model to extract microblog topics, and uses incremental clustering to determine the number of topics and to cluster the posts. POS tagging filters out the words in a post that have no power to characterize a topic. The LDA topic model represents the text in a low-dimensional topic space, so that topics are mined at the semantic level. Incremental clustering discovers the number of topics without requiring it to be specified in advance. Experiments show that, compared with traditional text analysis methods, combining Chinese POS tagging, the LDA topic model, and incremental clustering improves the accuracy of topic discovery.
The main work of this thesis is as follows:
(1) It analyzes topic extraction methods based on traditional text models, with experimental results showing their strengths and weaknesses, and proposes a method for detecting and extracting microblog topics based on the LDA topic model.
(2) While detecting microblog topics with the LDA topic model, it finds that text preprocessing is critical to topic extraction, because many posts contain large amounts of topic-irrelevant material that interferes with extraction. It proposes combining LDA-based topic extraction with Chinese POS tagging to improve the precision and accuracy of topic extraction, and confirms experimentally that POS tagging improves extraction accuracy.
(3) It analyzes a weakness of the clustering methods used in traditional topic extraction, namely that the number of topics must be specified, and therefore adopts the incremental clustering method single-pass for topic clustering. It then improves the single-pass algorithm by introducing batch processing. Experimental comparison shows that the improved single-pass algorithm can effectively discover the number of topics.
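The single-pass incremental clustering named in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not give the similarity measure, the threshold, or the details of the batch-processing improvement, so the cosine similarity, the `threshold` value, the running-mean centroid update, and the function names `cosine` and `single_pass` are all assumptions. The input vectors stand in for per-post topic distributions such as those an LDA model would produce.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(
    vectors: Sequence[Sequence[float]],
    threshold: float = 0.8,  # assumed value; the abstract does not state one
) -> Tuple[List[int], List[List[float]]]:
    """Assign each vector to a cluster in one pass over the data.

    Each vector joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster. The number of
    clusters is therefore discovered rather than fixed in advance.
    """
    centroids: List[List[float]] = []
    sizes: List[int] = []
    labels: List[int] = []
    for vec in vectors:
        # Find the closest existing centroid, if any.
        best_idx, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            # Join the cluster and update its centroid as a running mean.
            n = sizes[best_idx]
            centroids[best_idx] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best_idx], vec)
            ]
            sizes[best_idx] = n + 1
            labels.append(best_idx)
        else:
            # No cluster is similar enough: open a new one.
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical 3-topic distributions for five posts.
    docs = [
        [0.90, 0.05, 0.05],
        [0.85, 0.10, 0.05],
        [0.05, 0.90, 0.05],
        [0.10, 0.10, 0.80],
        [0.05, 0.85, 0.10],
    ]
    labels, centroids = single_pass(docs, threshold=0.8)
    print(labels, len(centroids))
```

Because a post either joins the nearest cluster or opens a new one, the result depends on the order in which posts arrive and on the chosen threshold. That order sensitivity is a known property of plain single-pass clustering.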
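[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP393.092;TP391.1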
Article ID: 2198849
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2198849.html