Research and Implementation of Microblog Hot Topic Discovery

Published: 2018-02-23 01:09

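Keywords: microblog; hot topic discovery; microblog API; Single-Pass algorithm; LDA model. Source: Zhengzhou University, 2014 master's thesis. Document type: degree thesis.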

[Abstract]: With the rapid development of the Internet and the widespread adoption of the mobile Internet, the ways in which Internet users communicate with one another have become increasingly diverse. As an emerging platform, the microblog has won users' favor through its distinctive flexibility and convenience. While microblogs bring great convenience to daily life, they also have side effects: some people, for example, use them to deliberately spread false information, which harms social stability. If such topics can be detected early, appropriate measures can be taken in time. From the user's side, a user sees only the posts on his or her own home page and cannot tell which events most users across the whole microblog network are discussing or following. Discovering hot microblog topics in a timely manner is therefore highly worthwhile.
This thesis defines the heat of a topic so that hot topics can be expressed quantitatively: the more recently the posts in a topic were published, and the more comments and reposts they have received, the higher the topic's heat and the more likely it is to be a hot topic. Researchers in China and abroad have studied hot topic discovery extensively. Their approaches fall broadly into three groups, namely clustering algorithms, the LDA model, and sentiment models, together with improvements built on these. The first problem to solve in this work is obtaining a microblog corpus. Traditional web crawlers are not suited to collecting microblog data, and the microblog API can only retrieve the posts shown on the authenticated user's own home page, so it cannot supply posts in large volume. This thesis therefore collects a large set of users by traversing the follow relationships among microblog users and then fetches the latest posts published by those users. Next, the posts are preprocessed: spam filtering, word segmentation, stop-word removal, filtering of useless information, feature-word extraction, and feature-weight calculation, which together produce a feature vector for each post. Finally, because the volume of microblog posts grows continuously, the Single-Pass incremental clustering algorithm is chosen. It yields multiple clusters, each of which represents one topic containing many posts. Hot topics are then selected from these topics by ranking them on the heat defined above.
Prior research shows that the LDA topic model can also be used to discover topics, but it requires many iterations and takes a long time to run on large data sets. LDA is, however, comparatively strong at expressing topics. This thesis therefore combines the Single-Pass algorithm with the LDA model: the posts are first clustered with Single-Pass, each cluster is then processed with LDA, and the hot microblog topics are obtained from the result. This produces more accurate topics than Single-Pass alone and runs faster than LDA alone.
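The clustering and ranking steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract gives no formulas, so the raw term-frequency vectors, the cosine similarity threshold of 0.3, the `Post` and `Cluster` types, and the half-life decay in `heat` are all assumptions made for illustration. Posts are assumed to be already segmented and stripped of stop words.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Post:
    tokens: list       # segmented words with stop words removed (assumed input)
    age_hours: float   # hours since publication; smaller means more recent
    comments: int
    reposts: int


@dataclass
class Cluster:
    centroid: Counter = field(default_factory=Counter)  # summed term weights
    posts: list = field(default_factory=list)

    def add(self, vec, post):
        self.centroid.update(vec)  # Counter.update adds the counts
        self.posts.append(post)


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(posts, threshold=0.3):
    """Assign each post, in arrival order, to its most similar cluster,
    or open a new cluster when no similarity reaches the threshold."""
    clusters = []
    for post in posts:
        # Raw term frequency stands in for the thesis's feature weights.
        vec = Counter(post.tokens)
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c.centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best.add(vec, post)
        else:
            fresh = Cluster()
            fresh.add(vec, post)
            clusters.append(fresh)
    return clusters


def heat(cluster, half_life_hours=24.0):
    """Hypothetical heat score: engagement per post, decayed by age, so
    newer posts with more comments and reposts contribute more."""
    return sum(
        (1 + p.comments + p.reposts) * 0.5 ** (p.age_hours / half_life_hours)
        for p in cluster.posts
    )
```

Sorting the clusters with `sorted(clusters, key=heat, reverse=True)` would put the candidate hot topics first. In the combined approach the abstract describes, a topic model such as LDA would then be fitted to each cluster separately, which keeps every LDA run small. That step is not shown here.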

[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092

[References]