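Time-Series-Based Identification and Tracking of Hot Topics on Weibo
Topic: Weibo  Angle: time  Source: Xi'an University of Science and Technology, 2017 master's thesis  Document type: degree thesis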
[Abstract]: Weibo (microblogging) has become a major platform for sharing and spreading information, and the online public opinion it generates affects society offline. Identifying and tracking hot topics on Weibo is therefore important for maintaining normal social order. This thesis first reviews existing methods for identifying and tracking hot topics on Weibo, including methods based on the vector space model (VSM) and methods based on the latent Dirichlet allocation (LDA) model. It summarizes and analyzes topic identification methods, represented by the K-means algorithm, and topic tracking methods, represented by the decision tree algorithm, and finds that traditional VSM-based methods are computationally complex and produce results that are neither fine-grained nor precise. The thesis then focuses on the LDA model and analyzes three ways of adding time to it for hot topic identification and tracking: discretizing time first (pre-discretized time), discretizing time afterwards (post-discretized time), and introducing continuous time. The analysis finds that the post-discretized and continuous-time methods can track only the intensity of a hot topic, not changes in its content, whereas the pre-discretized LDA model can track intensity and content at the same time. The pre-discretized LDA model requires a topic relatedness calculation during identification and tracking. The analysis finds that the classical relatedness calculation based on KL distance (Kullback-Leibler divergence, KL) and its improved variants are flawed; for example, the KL algorithm accounts neither for the similarity of topic feature words nor for the way topic content changes over time. To address these flaws, the thesis proposes a relatedness calculation based on feature-word similarity and feature-word co-occurrence: the Jaccard-Word co-occurrence (JW) algorithm. It uses the similarity of the feature words in two hot topics to measure the probability that the topics have the same content, and the co-occurrence rate of the feature words to measure the probability that their content is related. Experiments on two datasets were run to validate the JW algorithm; its recall, precision, and F1 score are all higher than those of the classical KL algorithm, the JSD-Cosine algorithm, and the word co-occurrence algorithm. The tracked hot topics match the corresponding real-world events in how their intensity and content change over time, indicating that the identification and tracking results are consistent with how the events actually unfolded and that the JW algorithm is feasible and effective.
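The abstract does not state the JW formula, so the following is only a minimal sketch of the two ingredients it names: Jaccard similarity over topic feature words and a feature-word co-occurrence rate, alongside the KL-divergence baseline it compares against. The function names, the co-occurrence definition (share of relevant posts that contain words from both topics), and the weight `alpha` are illustrative assumptions, not the thesis's definitions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two topic-word distributions (baseline)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def jaccard(words_a, words_b):
    """Jaccard similarity between the feature-word sets of two topics."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cooccurrence_rate(words_a, words_b, documents):
    """Assumed definition: among posts that mention a feature word of either
    topic, the fraction that mention feature words of both topics."""
    a, b = set(words_a), set(words_b)
    either = both = 0
    for doc in documents:
        tokens = set(doc)
        has_a, has_b = bool(tokens & a), bool(tokens & b)
        if has_a or has_b:
            either += 1
        if has_a and has_b:
            both += 1
    return both / either if either else 0.0


def jw_relatedness(words_a, words_b, documents, alpha=0.5):
    """Hypothetical combination: weighted sum of the two scores (alpha is assumed)."""
    return (alpha * jaccard(words_a, words_b)
            + (1 - alpha) * cooccurrence_rate(words_a, words_b, documents))


# Feature words of one topic in two adjacent time slices (made-up toy data).
topic_t1 = ["earthquake", "rescue", "sichuan", "relief"]
topic_t2 = ["earthquake", "rescue", "donation", "relief"]
posts = [["earthquake", "donation"], ["rescue", "sichuan"], ["weather"], ["donation"]]

score = jw_relatedness(topic_t1, topic_t2, posts)
```

In a pre-discretized setting, a score like this would be computed between each topic in one time slice and each topic in the next, and pairs above some threshold would be linked as the same evolving topic; the threshold is likewise not given in the abstract.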
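[Degree-granting institution]: Xi'an University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP393.092; TP391.1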