Research on a Public Event Detection Algorithm for Personal Microblogs

Published: 2018-06-15 22:57

Topic: microblog + topic words; Source: Inner Mongolia University of Science and Technology, master's thesis, 2014

[Abstract]: With the rapid development of computer application technology, Internet media has risen and quickly come to affect people's daily lives, becoming another news carrier after traditional media such as television, newspapers, and radio. Because information spreads rapidly across the Internet, it has become diverse, open, and real-time, and the Internet now plays an important role as a platform for spreading breaking social events. Microblogging, with Sina Weibo as its typical representative, is a network medium that has emerged and grown rapidly in China in recent years. Users can update their status and share information anytime and anywhere through web pages, mobile clients, and other channels. Sina Weibo is currently the most popular microblogging site in China and has the largest user base; according to statistics from July 2013, it had reached 330 million registered users, producing an enormous volume of microblog data. Because microblog data is irregular, massive, and real-time, the primary problem for personal microblog information detection is how to accurately extract, from a large volume of irregular personal microblog data, the public events a user followed during a given period. Using personal microblog data as the experimental test samples, this work studies how to detect which public events a given user followed during a given period from that user's microblog posts. Repeated experiments showed that applying traditional event extraction algorithms to personal microblog events gives unsatisfactory results. Therefore, building on a series of algorithm trials and experiments, the work takes into account the non-mainstream text characteristics of personal microblogs, takes short-text data mining as its research background and topic word extraction as its focus, and studies text acquisition, preprocessing, similarity measurement, feature value calculation, and finally forward and reverse matching against public templates. The result is a reasonable and complete workflow for detecting public events in personal microblogs, consisting of three modules: text preprocessing, topic word identification, and public template matching. Specifically, preprocessing removes noise from the text so that its representation is more standardized. Topic words are extracted by combining the calculation of three similarities (coupling, temporal, and popularity) with the proposed TF-DF function, which both takes the characteristics of the experimental data into account and improves the accuracy of topic word extraction. Public template matching compares the topic words with the template events of the Sina Fengyun list (新浪风云榜), first by forward matching and then by reverse matching, to obtain the final public event detection results.
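The three modules named in the abstract (preprocessing, topic word identification, template matching) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define the TF-DF function, the three similarity measures, or the exact meaning of forward and reverse matching. Everything here is therefore an assumption made for illustration. The function names, the noise patterns, the score `tf * df / n`, the `min_coverage` threshold, and the reading of forward matching as "topic word hits a template" and reverse matching as "template keywords are covered by the topic words" are all hypothetical. The coupling, temporal, and popularity similarities are omitted, and posts are assumed to be whitespace-tokenizable.

```python
import re
from collections import Counter

# Assumed noise patterns for microblog text (not taken from the thesis).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")  # bracketed emoticons such as [cry]


def preprocess(post):
    """Strip URLs, mentions, emoticons, and punctuation; return lowercase tokens."""
    text = URL_RE.sub(" ", post)
    text = MENTION_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = re.sub(r"[^\w\s]", " ", text)  # also drops the '#' hashtag markers
    return text.lower().split()


def tf_df_scores(token_lists):
    """Hypothetical TF-DF score: total term count times the share of posts using it."""
    tf, df = Counter(), Counter()
    for tokens in token_lists:
        tf.update(tokens)
        df.update(set(tokens))
    n = len(token_lists)
    return {term: tf[term] * df[term] / n for term in tf} if n else {}


def top_topic_words(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


def match_events(topic_words, templates, min_coverage=0.5):
    """Assumed two-pass matching of topic words against template events."""
    topics = set(topic_words)
    # Forward pass: keep templates that share at least one word with the topics.
    candidates = {name: set(kws) for name, kws in templates.items()
                  if topics & set(kws)}
    # Reverse pass: confirm a candidate only if enough of its keywords
    # are covered by the user's topic words.
    return sorted(name for name, kws in candidates.items()
                  if len(topics & kws) / len(kws) >= min_coverage)
```

A possible use, with made-up posts and templates: tokenize each post with `preprocess`, pass the token lists to `tf_df_scores`, keep the top few terms with `top_topic_words`, and hand them to `match_events` together with a dictionary that maps each template event name to its keyword list.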
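[Degree-granting institution]: Inner Mongolia University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092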
