Research on Weibo Topic Detection and Tracking Methods
Published: 2018-11-20 12:22
[Abstract]: Weibo, one of the most popular social applications today, has become a primary channel through which people obtain and spread information. Weibo data is in effect a high-speed, massive, and dynamic information stream that captures society's topics and how they change from moment to moment, so detecting and tracking topics in it is of great value for public-opinion monitoring and opinion polling. Against this background, this thesis proposes a clustering algorithm with high timeliness that can handle large-scale data streams, and applies it to Weibo topic detection and tracking with good results.
The proposed method, Affinity Propagation in Massive Data Stream (APMStream), is a large-scale data stream clustering approach based on affinity propagation. It consists of four parts: initial clustering, online clustering, cluster adjustment, and cluster maintenance. For initial clustering, the Affinity Propagation (AP) algorithm is improved in two respects, distributed iteration and dynamic adjustment of the damping coefficient, so that it suits large-scale data. Online clustering processes each tuple in real time: depending on its distance to the existing clusters, the tuple is either merged into a cluster or used to create a new one. Cluster adjustment first re-selects the cluster centers and then clusters the new centers with a weighted AP algorithm. Cluster maintenance keeps the system load within a reasonable range by deleting clusters that have not been updated for a long time and tuples of low importance.
Applying APMStream to topic detection and tracking mainly involves measuring the importance of each Weibo post and computing the distance between posts. Importance is computed from the relationships between posts and serves as the preference parameter of the AP algorithm, which determines how likely a post is to become a cluster center. The distance between posts is computed with a method based on common word chunks and is used to build the similarity matrix of the AP algorithm. APMStream is designed as a topology of the distributed stream-processing framework Apache Storm, with data processing distributed across the nodes of that topology. Experiments show that APMStream can process large-scale Weibo data streams quickly, detect Weibo topics, and reflect how those topics evolve over time.
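The online clustering and cluster maintenance steps described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the class and parameter names (`OnlineClusterer`, `radius`, `max_idle`, `min_importance`) are hypothetical, Euclidean distance on numeric points stands in for the text-based distance between posts, and the fixed distance threshold is an assumed merge rule.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, ...]


@dataclass
class Cluster:
    center: Point                      # representative point of the cluster
    last_update: float                 # time of the most recent merge
    members: List[Tuple[Point, float]] = field(default_factory=list)  # (point, importance)


class OnlineClusterer:
    """Hypothetical sketch of online clustering plus maintenance."""

    def __init__(self, radius: float, max_idle: float, min_importance: float):
        self.radius = radius                  # assumed merge threshold
        self.max_idle = max_idle              # idle time after which a cluster is dropped
        self.min_importance = min_importance  # tuples below this are pruned
        self.clusters: List[Cluster] = []

    def process(self, point: Point, importance: float, now: float) -> Cluster:
        """Merge the tuple into the nearest cluster, or start a new cluster."""
        nearest = min(
            self.clusters,
            key=lambda c: math.dist(c.center, point),
            default=None,
        )
        if nearest is not None and math.dist(nearest.center, point) <= self.radius:
            nearest.members.append((point, importance))
            nearest.last_update = now
            return nearest
        cluster = Cluster(center=point, last_update=now,
                          members=[(point, importance)])
        self.clusters.append(cluster)
        return cluster

    def maintain(self, now: float) -> None:
        """Drop stale clusters, then prune low-importance tuples."""
        self.clusters = [
            c for c in self.clusters if now - c.last_update <= self.max_idle
        ]
        for c in self.clusters:
            c.members = [m for m in c.members if m[1] >= self.min_importance]


clusterer = OnlineClusterer(radius=1.0, max_idle=10.0, min_importance=0.5)
clusterer.process((0.0, 0.0), importance=0.9, now=0.0)
clusterer.process((0.1, 0.0), importance=0.2, now=1.0)   # merged into the first cluster
clusterer.process((5.0, 5.0), importance=0.8, now=2.0)   # too far away: new cluster
clusterer.maintain(now=11.5)                              # first cluster is now stale
```

In this sketch the cluster center never moves after creation; in the abstract, re-selecting centers is the job of the separate cluster adjustment step, which is not modeled here.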
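[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1
Article ID: 2344900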
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2344900.html