Research on Hot Topic Detection and Tracking Techniques for Weibo
Published: 2018-10-23 20:31
[Abstract]: Topic detection and tracking aims to discover the most-discussed topics in massive data and to follow how those topics develop in subsequent information, helping people cope with the increasingly severe problem of information explosion. It saves users time, keeps them up to date on unfolding events, and provides data support for public opinion monitoring, so it has significant practical and security value. As more and more users publish information and discuss topics on Weibo, the display of hot topics has gradually become an important function of the platform. Because Weibo is highly immediate, breaking news spreads quickly on it, and for high-impact news events the number of users who report, repost, and comment is large, so Weibo often reacts before traditional news media. Accordingly, this thesis filters out invalid posts and designs and implements a hot topic detection and tracking method tailored to Weibo. The main work is as follows:

1) The characteristics of Weibo are analyzed and invalid posts are filtered out. Weibo users form a complex, wide-ranging, and highly varied population, and the content is heterogeneous. By analyzing user features, including follower count and the number of posts a user publishes per day, advertising accounts and zombie accounts are filtered out. By analyzing post content, merchant promotions, content shared by users, activities users take part in, and other posts that contribute nothing to any topic are filtered out in large numbers. By analyzing the word-segmented posts, posts containing too many or too few words are filtered out, which removes meaningless overly short texts and overly long texts with excessive repetition. This filters invalid posts effectively and reduces computational complexity.

2) A time-aware hot topic detection algorithm for Weibo is designed and implemented. Posts are processed in chronological order. Preliminary topic detection is performed with an improved Single-Pass clustering algorithm, whose improvements cover the similarity computation and a topic vector update method that incorporates user influence. The FP-Growth frequent itemset mining algorithm is then used to mine frequent feature word sets and correct errors made by the Single-Pass (SP) algorithm. Finally, an improved K-MEDOIDS algorithm clusters the frequent feature word sets to extract the final topics. Together these steps improve computational efficiency and the accuracy of topic detection.

3) A time-aware, multi-query-vector adaptive topic tracking algorithm is designed and implemented. Based on how the number of posts is distributed over time, posts are grouped by time period and processed in chronological order. The topics of each period are compared by similarity with all topics in all existing topic groups. Depending on a threshold, each topic is either assigned to an existing topic group or used to create a new one, and the topic vectors added to a group are adjusted adaptively. This tracks topic development effectively, improves accuracy, and reduces topic drift.
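The preliminary detection step in 2) can be pictured with a minimal Single-Pass clustering sketch. This is not the thesis's improved algorithm: it assumes plain term counts, cosine similarity, and an arbitrary threshold of 0.5, and it omits the user-influence weighting, the FP-Growth correction, and the K-MEDOIDS stage. The names `cosine` and `single_pass` and the `(post_id, timestamp, tokens)` tuple layout are illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(posts, threshold=0.5):
    """Cluster posts in one chronological pass.

    posts: iterable of (post_id, timestamp, tokens) tuples (assumed layout).
    Returns a list of clusters, each a dict with a "centroid" Counter and a
    "members" list of post ids in arrival order.
    """
    clusters = []
    # Process posts in increasing time order, as the abstract describes.
    for post_id, _, tokens in sorted(posts, key=lambda p: p[1]):
        vector = Counter(tokens)
        best, best_sim = None, 0.0
        # Find the existing cluster most similar to this post.
        for cluster in clusters:
            sim = cosine(vector, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Assumed update rule: add the post's term counts to the centroid.
            best["centroid"].update(vector)
            best["members"].append(post_id)
        else:
            # No cluster is similar enough, so the post starts a new topic.
            clusters.append({"centroid": vector, "members": [post_id]})
    return clusters


if __name__ == "__main__":
    sample = [
        ("a", 1, ["earthquake", "sichuan", "rescue"]),
        ("b", 2, ["earthquake", "rescue", "team"]),
        ("c", 3, ["football", "final", "goal"]),
    ]
    for cluster in single_pass(sample):
        print(cluster["members"])
```

Each post is compared only with the clusters that exist when it arrives, so the result depends on arrival order and on the threshold. That order sensitivity is one reason a later correction stage, such as the frequent-itemset step the abstract mentions, can be useful.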
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1; TP393.092
Article ID: 2290384
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/2290384.html