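Microblog User Topic Mining Based on Distributed MBUT-LDA
Published: 2018-05-20 14:07
Topic of this thesis: microblog + user topics; Source: Chongqing University, master's thesis, 2014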
【Abstract】: As one of today's most mainstream social network platforms, microblogging (Weibo) has become an important means for users to publish and obtain real-time information. Microblog topic modeling can mine, from massive amounts of information, the topics and the other users that a user is interested in. However, because microblogs have short message texts, fast information updates, and a huge data volume, traditional topic modeling methods cannot effectively mine the information that users are really interested in.

Building on a study of existing topic modeling methods, this thesis proposes MBUT-LDA, a modeling method based on microblog users and the time dimension, where MB stands for MicroBlog, U for User, and T for Time. The method has the following characteristics:

(1) Building on an analysis of existing topic models, and exploiting the fact that the topics of microblog messages are clearly concentrated in time, each user's microblog messages are partitioned by time. This addresses the incomplete information caused by short microblog texts and improves the accuracy of the mined user topics.

(2) Based on an analysis of the relationships between microblog users and their friends, the concept of "attention degree" (关注度) is introduced. Combined with the TF-IDF algorithm, it yields a new weight formula, ATF-IDF, which measures how well a microblog word predicts a topic.

(3) The number of microblog users is growing rapidly, and microblog platforms let users post instant messages through various mobile clients. As a result, the volume of microblog documents is enormous, and a single node easily hits a performance bottleneck when analyzing it. Taking advantage of distributed and virtualization technology, the thesis deploys the proposed topic modeling method on the distributed computing platform Hadoop, implementing an MBUT-LDA microblog user topic mining method based on the Hadoop distributed framework.

Using the proposed distributed MBUT-LDA method, the thesis trains a microblog topic model on a large number of microblog messages and, on the basis of the trained topics, mines the topics that microblog users are interested in. Experiments show that MBUT-LDA optimized with ATF-IDF achieves higher generalization and topic accuracy than plain MBUT-LDA and U-LDA (topic modeling based on microblog users). Analysis of distributed MBUT-LDA experiments with different numbers of users and different numbers of nodes shows that adding nodes effectively reduces data processing time and that the method can handle very large datasets effectively.
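The time-partitioning idea in characteristic (1) and the term-weighting idea in characteristic (2) can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not give the ATF-IDF formula or the window size, so the code below assumes a 7-day window and uses plain TF-IDF as a stand-in for ATF-IDF. The sample posts and the function names `pool_by_user_and_window` and `tf_idf` are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime
import math

# Hypothetical sample posts: (user_id, ISO timestamp, pre-tokenized text).
posts = [
    ("u1", "2014-03-01T09:00:00", ["hadoop", "cluster", "job"]),
    ("u1", "2014-03-01T21:30:00", ["hadoop", "mapreduce"]),
    ("u1", "2014-03-09T10:00:00", ["football", "match"]),
    ("u2", "2014-03-02T08:15:00", ["football", "goal", "match"]),
]

def pool_by_user_and_window(posts, window_days=7):
    """Merge a user's posts from the same time window into one pseudo-document.

    The 7-day default is an assumption; the abstract does not state a window size.
    """
    pooled = defaultdict(list)
    epoch = datetime(1970, 1, 1)
    for user, ts, tokens in posts:
        days = (datetime.fromisoformat(ts) - epoch).days
        pooled[(user, days // window_days)].extend(tokens)
    return dict(pooled)

def tf_idf(docs):
    """Plain TF-IDF per pseudo-document (a stand-in, NOT the thesis's ATF-IDF)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    weights = {}
    for key, tokens in docs.items():
        counts = Counter(tokens)
        weights[key] = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return weights

docs = pool_by_user_and_window(posts)
weights = tf_idf(docs)
```

Pooling turns many short posts into fewer, longer pseudo-documents keyed by (user, window), which gives a topic model such as LDA more word co-occurrence evidence per document. An attention-based factor in the spirit of ATF-IDF would be an extra multiplier on each weight, but its exact form would have to come from the thesis itself.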
【Degree-granting institution】: Chongqing University
【Degree level】: Master's
【Year degree granted】: 2014
【Classification number】: TP393.092; TP391.1
Article ID: 1914921
Article link: https://www.wllwen.com/guanlilunwen/ydhl/1914921.html