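Research on Hot Topic Detection Technology for Chinese Weibo
Published: 2018-05-23 10:30
Topic: Chinese Weibo + topic detection; Source: Chongqing University of Technology, 2014 master's thesis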
[Abstract]: With the rapid development of mobile Internet technology, Weibo, an emerging social networking platform, has quickly become a new way for users to communicate. Through Weibo, users can conveniently publish opinions, exchange information, interact, and share resources. The platform's immediacy and informality let Weibo posts spread quickly and exert strong influence on society. Weibo text contains a large amount of valuable information, such as key points of current affairs and breaking events. Extracting and retrieving hot topics from Weibo text helps users quickly grasp real-time hot information, and is of practical importance for online public opinion monitoring and real-time information search. However, Weibo text has the characteristics of big data and is difficult to identify and filter manually. Automatic detection of hot topics in Weibo text, supported by suitable information filtering, has therefore become an active research topic in information retrieval.

The thesis first introduces the background, research status, and related techniques of topic detection, then analyzes the information and propagation characteristics of Chinese Weibo. To address the information filtering problem in hot topic detection, it proposes a user role identification method. The method computes a user's attention degree from the user's follower and following counts, computes a post's influence from its repost and comment counts, and combines the two to assess user influence. Identifying user roles in this way provides coarse-grained information filtering before hot topic detection. An improved Single-Pass incremental clustering algorithm then performs preliminary topic detection on the posts. Finally, topic heat is evaluated and ranked using factors such as repost and comment counts, yielding the hot topics within a given time period. The thesis also optimizes text preprocessing and text feature selection for Chinese Weibo topic detection, and computes feature weights with a TF-IDF function combined with semantic similarity.

Based on these methods, experiments were carried out on a Sina Weibo corpus, and the results were analyzed and compared using the recall, miss rate, false alarm rate, and detection cost defined in the TDT evaluation specification. The experiments show that the proposed user role identification method effectively classifies Weibo users and provides a basis for information filtering in hot topic detection. With the evaluation method based on user attention degree and post influence, the miss rate and false alarm rate of hot topic extraction fell to 20.38% and 1.98% respectively, giving better efficiency and precision than traditional topic detection and demonstrating the effectiveness of the proposed method.
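The Single-Pass incremental clustering step named in the abstract can be illustrated with a minimal sketch. This is the plain textbook Single-Pass procedure, not the thesis's improved variant, whose details the abstract does not give. The function names (`tf_idf_vectors`, `cosine`, `single_pass`), the smoothed IDF formula, the similarity threshold, and the sample tokens are all assumptions made for illustration. Posts are assumed to be already segmented into tokens, and IDF is computed over the whole batch for simplicity, which a truly incremental system would have to update as posts arrive.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized doc."""
    n = len(docs)
    # Document frequency: number of docs in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed smoothed IDF: log((1 + n) / (1 + df)) + 1.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(vectors, threshold):
    """Assign each vector, in arrival order, to the most similar existing
    cluster if the similarity reaches `threshold`; otherwise open a new one.
    Returns a list of clusters, each a list of input indices."""
    centroids = []  # summed member vectors (cosine ignores the scale)
    clusters = []
    for i, vec in enumerate(vectors):
        best_sim, best_idx = 0.0, -1
        for j, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_sim, best_idx = sim, j
        if best_idx >= 0 and best_sim >= threshold:
            clusters[best_idx].append(i)
            for term, weight in vec.items():
                centroids[best_idx][term] = centroids[best_idx].get(term, 0.0) + weight
        else:
            centroids.append(dict(vec))
            clusters.append([i])
    return clusters


# Hypothetical pre-tokenized posts, used only to exercise the sketch.
posts = [
    ["earthquake", "sichuan", "rescue"],
    ["earthquake", "rescue", "team", "sichuan"],
    ["film", "festival", "award"],
    ["award", "film", "actor"],
]
print(single_pass(tf_idf_vectors(posts), threshold=0.3))
```

Because each post is compared only against existing cluster centroids and is never revisited, the result depends on arrival order and on the threshold, which is one reason variants of the basic procedure are commonly proposed.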
[Degree-granting institution]: Chongqing University of Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092