Research and Implementation of Microblog Hot Topic Discovery Technology

Published: 2018-07-20 15:37

【Abstract】: With the rapid growth of Web 2.0 and social networking sites, the Internet has entered an entirely new era of "self-media". Microblog sites such as Sina Weibo and Twitter have become a focus of public attention, but the enormous volume of information they produce is also a burden on users, and extracting the latest hot topics from a massive stream of microblog posts has become a pressing need. Based on an analysis of the characteristics of microblog information, and drawing on topic detection and tracking methods from China and abroad, this thesis makes three contributions. First, it improves the single-pass clustering algorithm: the improved algorithm computes the centroid of the microblog stream and filters out the many posts that lie too far from it. This effectively reduces computational complexity, addresses the problem that clustering a large sample set is too expensive to run in real time, and mitigates the excessive dependence of single-pass clustering accuracy on the order in which samples arrive. Second, it improves the naive Bayes classification technique, proposing a method that raises classification accuracy when microblog texts are short and have few features. Third, for text feature extraction, it uses search engine technology to compute the mutual information needed when selecting feature terms, addressing the difficulty of computing mutual information over large-scale short texts. A microblog hot topic discovery platform was built, and long-term use indicates that the technique performs well: compared with traditional algorithms it is better suited to microblog platforms, offering speed, high accuracy, and real-time computation over large data volumes, and it has considerable theoretical and practical value.
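The abstract describes the centroid-filtered single-pass clustering only in outline, so the following is a minimal sketch under stated assumptions rather than the thesis's actual implementation. The whitespace tokenizer, the term-frequency vectors, cosine similarity as the distance measure, both threshold values, and all function names are assumptions introduced for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean term-weight vector of a non-empty list of vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}


def single_pass_with_filter(posts, filter_threshold=0.3, cluster_threshold=0.4):
    """Cluster posts in one pass, skipping posts far from the stream centroid.

    Returns a list of clusters, each a list of indices into `posts`.
    Both thresholds are hypothetical values, not taken from the thesis.
    """
    vectors = [vectorize(p) for p in posts]
    if not vectors:
        return []
    stream_centroid = centroid(vectors)

    clusters = []  # each cluster keeps a running term sum and its member indices
    for i, v in enumerate(vectors):
        # Filtering step: drop posts too dissimilar to the whole stream.
        if cosine(v, stream_centroid) < filter_threshold:
            continue
        # Single-pass step: find the most similar existing cluster.
        best, best_sim = None, 0.0
        for c in clusters:
            # Cosine is scale-invariant, so the term sum stands in for the mean.
            sim = cosine(v, c["sum"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= cluster_threshold:
            best["sum"].update(v)
            best["members"].append(i)
        else:
            clusters.append({"sum": Counter(v), "members": [i]})
    return [c["members"] for c in clusters]
```

In this sketch the filter runs before any cluster comparison, so posts that share little vocabulary with the stream as a whole never start a cluster of their own. For real Chinese microblog text, the tokenizer would need to be replaced with a word segmenter.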
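【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year degree conferred】: 2012
【Classification number】: TP393.092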

【References】