
Research and Implementation of Microblog Hot Topic Discovery

Published: 2018-02-23 01:09

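  Keywords: microblog; hot topic discovery; microblog API; Single-Pass algorithm; LDA model. Source: Zhengzhou University, master's thesis, 2014. Document type: degree thesis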


[Abstract]: With the rapid development of the Internet and the widespread adoption of the mobile Internet, the ways in which Internet users communicate with one another have become increasingly diverse. As an emerging platform, the microblog is favored by users for its distinctive flexibility and convenience. While microblogs bring great convenience to daily life, they also have side effects: for example, some people use them to deliberately spread false information, which harms social stability. If such topics can be detected early, appropriate measures can be taken in time. From the user's side, a user sees only the posts on their own home page and cannot tell which events most users across the whole microblog network are discussing or following. Timely discovery of hot microblog topics is therefore of real value.
This thesis defines the heat of a topic so that hot topics can be expressed quantitatively: for a given topic, the more recently its posts were published and the more comments and reposts they have, the higher the topic's heat and the more likely it is to be a hot topic. Researchers in China and abroad have studied hot topic discovery extensively; their approaches fall broadly into three groups (clustering algorithms, the LDA model, and sentiment models) or are refinements of these. The first problem to solve in studying hot topic discovery on microblogs is the corpus. Traditional web crawlers are not suited to collecting microblog data, and the microblog API can fetch only the posts on the authenticated user's own home page, so it cannot retrieve posts in bulk. The thesis therefore collects a large set of users by traversing the follow relationships between users, and then fetches the latest posts published by those users. The posts are then preprocessed: spam filtering, word segmentation, stop-word removal, filtering of useless information, feature word extraction, and feature weight calculation, yielding a feature vector for each post. Finally, because microblog posts keep arriving, the incremental Single-Pass clustering algorithm is chosen; it produces a number of clusters, each representing a topic that contains many posts. Hot topics are then selected from these topics by the heat defined above.
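The clustering and ranking steps described in this paragraph can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not give the feature weighting, the similarity measure, the threshold, or the heat formula, so the TF-IDF weights, cosine similarity, the `threshold` of 0.3, and the recency-decayed `heat` score below are all assumptions, and the names `Post`, `Cluster`, `single_pass`, and `heat` are hypothetical. Posts are assumed to be already segmented and stripped of stop words.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Post:
    tokens: list          # already segmented, stop words removed (assumed)
    timestamp: float      # publication time, in hours since some epoch
    comments: int = 0
    reposts: int = 0


@dataclass
class Cluster:
    centroid: Counter = field(default_factory=Counter)  # summed term weights
    posts: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_vectors(posts):
    """Assumed weighting: TF times smoothed IDF, computed over the batch.
    A truly streaming version would have to update IDF incrementally."""
    n = len(posts)
    df = Counter(t for p in posts for t in set(p.tokens))
    vectors = []
    for p in posts:
        tf = Counter(p.tokens)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def single_pass(posts, threshold=0.3):
    """Single-Pass clustering: each post joins its most similar cluster,
    or starts a new one when no similarity reaches the threshold."""
    clusters = []
    for post, vec in zip(posts, tfidf_vectors(posts)):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster.centroid)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < threshold:
            best = Cluster()
            clusters.append(best)
        best.centroid.update(vec)   # Counter.update adds the weights
        best.posts.append(post)
    return clusters


def heat(cluster, now, half_life_hours=24.0):
    """Hypothetical heat score: newer posts and posts with more comments
    and reposts contribute more. The thesis's actual formula is not given."""
    total = 0.0
    for p in cluster.posts:
        age = max(0.0, now - p.timestamp)
        recency = 0.5 ** (age / half_life_hours)
        total += recency * (1 + p.comments + p.reposts)
    return total
```

Sorting the clusters returned by `single_pass` by `heat` in descending order would give a ranked list of candidate hot topics. Because cosine similarity ignores vector length, keeping the centroid as a running sum rather than a mean does not change which cluster a post joins.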
Prior research shows that the LDA topic model can also be used to discover topics, but it requires many iterations and runs slowly on large volumes of data. The LDA topic model does, however, have an advantage in how it represents topics. The thesis therefore combines the Single-Pass algorithm with the LDA model: the posts are first clustered with Single-Pass, and the LDA algorithm is then applied to each cluster to obtain the hot microblog topics. This yields more accurate topics than Single-Pass alone and runs faster than LDA alone.
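The second stage, running LDA on each cluster separately, might look like the following sketch. It uses a small collapsed Gibbs sampler written in plain Python so that it stays self-contained. The abstract does not say which LDA inference method, hyperparameters, or number of topics the thesis used, so `alpha`, `beta`, `iterations`, `num_topics`, and the function names `lda_gibbs`, `top_words`, and `describe_clusters` are all assumptions. Each cluster is assumed to be a list of tokenized posts.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA. Returns one word Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


def top_words(topic_word, n=3):
    """The n most frequent words of each topic."""
    return [[w for w, _ in counts.most_common(n)] for counts in topic_word]


def describe_clusters(clusters, num_topics=1, n=3):
    """Run LDA independently on each cluster of tokenized posts."""
    return [top_words(lda_gibbs(docs, num_topics=num_topics), n) for docs in clusters]
```

Because each cluster holds only a fraction of the corpus, every sampler run sees far fewer tokens than a single LDA run over all posts would, which is consistent with the speed advantage the abstract claims for the combined approach.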

[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TP393.092




Article ID: 1525763


Article link: https://www.wllwen.com/guanlilunwen/ydhl/1525763.html

