A Hot Topic Discovery Method Based on an Improved H-K Clustering Algorithm
Published: 2019-04-04 17:24
[Abstract]: With the rapid development of social networks, microblogs have become one of the platforms people use for everyday communication and information dissemination. Within a very short time, a microblog platform can generate a massive data set in which information is widely scattered, and users find it hard to pick out hot topics from this volume of text. How to mine hot topics quickly and accurately from massive microblog text data sets has therefore become an active research question. Traditional topic discovery methods are usually based on feature-word matching and ignore the latent semantics of microblog text, so the quality of the topics they discover is low. Addressing the characteristics of microblogs, this thesis studies microblog hot topic discovery from a semantic perspective and proposes a topic discovery method based on an improved H-K clustering algorithm.
First, the thesis improves the H-K clustering algorithm used in hot topic discovery by taking into account the time-scale property of microblog texts and the persistence of topics. To handle massive microblog data sets, the algorithm is parallelized following the MapReduce programming model in Hadoop, which improves the processing efficiency of clustering. Second, the thesis analyzes microblog text at the semantic level: it introduces the LDA topic model to convert unstructured microblog text into a document-topic distribution and a topic-feature-word distribution. This reduces the dimensionality of the text and models microblogs semantically, which improves the accuracy of text similarity computation. In the text modeling stage, the LDA topic model is likewise parallelized with MapReduce to increase the capacity for processing microblog data sets.
Experiments show that the improved H-K clustering algorithm gives clearly better clustering quality and better time efficiency, is better suited to clustering microblog text, and addresses the low efficiency of traditional clustering algorithms; that introducing a cloud computing platform improves the capacity to process massive microblog text data sets; and that the proposed method can discover hot topics in microblog data sets quickly and accurately based on the latent semantics of the feature words in microblog text.
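The abstract does not spell out the H-K algorithm, so the following is only a minimal sketch under stated assumptions. It assumes that "H-K" refers to the common hybrid in which hierarchical (agglomerative) clustering supplies the initial centroids for K-means, and that each microblog post is already represented by its LDA document-topic distribution. The function names, the Euclidean distance, and the toy topic vectors are all hypothetical; the thesis's time-aware improvement and its MapReduce parallelization are not reproduced here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two topic-distribution vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hierarchical_seeds(points, k):
    """Merge the two closest clusters (by centroid distance) until k remain.

    Returns the centroids of the k remaining clusters, used as K-means seeds.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [mean(c) for c in clusters]


def kmeans(points, seeds, max_iter=50):
    """Standard K-means refinement starting from the given seed centroids."""
    centroids = [list(c) for c in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return assignment, centroids


def hk_cluster(points, k):
    """Hybrid sketch: hierarchical seeding followed by K-means refinement."""
    return kmeans(points, hierarchical_seeds(points, k))


# Hypothetical document-topic distributions for six posts over three topics.
docs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]
labels, centers = hk_cluster(docs, k=3)
print(labels)
```

Seeding K-means from a hierarchical pass avoids the sensitivity to random initial centroids that plain K-means has. The pairwise merge loop is expensive for large inputs, which is one reason a parallel implementation, such as the MapReduce one the abstract describes, becomes attractive at microblog scale.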
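[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1; TP393.092
Article ID: 2454016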
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2454016.html