Research on Weibo Hot Topic Discovery Based on Optimized TF-IDF and Word Co-occurrence

Published: 2019-01-25 21:17
[Abstract]: Weibo hot topic discovery means mining topics from a large volume of microblog posts and selecting the hot ones according to a topic heat evaluation method. It helps people conveniently pick out the information that interests them or that they need from a massive stream, and it is also of considerable value for government guidance of public opinion, information security, financial assessment, and other fields. This thesis analyzes and summarizes the state of Weibo hot topic discovery and identifies three problems: a high word segmentation error rate, low accuracy of topic word extraction, and inconsistency in the heat evaluation methods that are chosen. To address them, the thesis focuses on three aspects. First, it examines Chinese word segmentation and new word discovery in depth and finds that current segmentation tools leave many single-character fragments in their output; new words in particular are split apart, so that the result differs greatly from the intended meaning. To reduce the segmentation error rate, the thesis proposes discovering new words with rules and an N-Gram model. Rules based on word structure are first used to build a fragment library; candidate strings are then extracted from the library under Bi-Gram and Tri-Gram patterns, and candidates with high probability under both patterns are selected as new words; finally, the new words are merged with the output of the system segmenter. Experimental results show that this algorithm effectively prevents new words from degrading the segmentation of Weibo text.
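The new word discovery steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's algorithm: the abstract does not say how the fragment library rules or the probabilities are defined, so the sketch treats runs of consecutive single-character tokens as the fragment library, considers three-character candidates only, and estimates conditional probabilities from raw counts. The function names and the thresholds `min_prob` and `min_count` are hypothetical.

```python
from collections import Counter


def fragment_runs(tokens):
    """Join consecutive single-character tokens into fragment strings (assumed rule)."""
    runs, current = [], []
    for tok in tokens:
        if len(tok) == 1:
            current.append(tok)
        else:
            if len(current) > 1:
                runs.append("".join(current))
            current = []
    if len(current) > 1:
        runs.append("".join(current))
    return runs


def ngram_counts(runs, n):
    """Count character n-grams inside the fragment strings."""
    counts = Counter()
    for run in runs:
        for i in range(len(run) - n + 1):
            counts[run[i:i + n]] += 1
    return counts


def discover_new_words(segmented_posts, min_prob=0.6, min_count=2):
    """Keep 3-character candidates that score well under both n-gram patterns."""
    runs = [r for post in segmented_posts for r in fragment_runs(post)]
    uni, bi, tri = (ngram_counts(runs, n) for n in (1, 2, 3))
    new_words = []
    for cand, count in tri.items():
        if count < min_count:
            continue
        a, b, c = cand
        # Bi-Gram pattern: the weakest character-to-character link in the candidate.
        p_bi = min(bi[a + b] / uni[a], bi[b + c] / uni[b])
        # Tri-Gram pattern: probability of the last character given the first two.
        p_tri = tri[cand] / bi[a + b]
        if p_bi >= min_prob and p_tri >= min_prob:
            new_words.append(cand)
    return new_words


def merge_new_words(tokens, new_words):
    """Re-join single-character tokens that spell out a discovered new word."""
    out, i = [], 0
    while i < len(tokens):
        for word in new_words:
            window = tokens[i:i + len(word)]
            if all(len(t) == 1 for t in window) and "".join(window) == word:
                out.append(word)
                i += len(word)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


posts = [
    ["今天", "奥", "利", "给", "加油"],
    ["大家", "奥", "利", "给", "努力"],
    ["我们", "奥", "利", "给"],
    ["明天", "他", "说", "好", "再见"],
]
words = discover_new_words(posts)
merged = merge_new_words(posts[1], words)
```

Requiring both scores to pass is one reading of "high probability under both patterns"; a real system would also need candidates of other lengths and a larger corpus for the counts to be meaningful.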
Second, to address the low accuracy of topic word extraction, the thesis combines the strengths of the TF-IDF algorithm and the word co-occurrence model and proposes a topic word extraction algorithm based on optimized TF-IDF and word co-occurrence. The study of TF-IDF finds that the traditional algorithm does not reflect where a word appears. To capture word importance more effectively, the thesis records in the data set whether a word belongs to the body, the title, or the comments of a post and assigns each position a different weight, thereby optimizing TF-IDF. On this basis, the word co-occurrence model is used to take into account the contextual semantics of words and the relationships between contexts when extracting topic words. Experiments verify that the algorithm reduces the deviation in topic word extraction and makes the results more accurate. Third, from a study of Weibo's structure and the way topics spread, the thesis selects features of participating users and features of topic words as the factors that influence hot topics, uses them to design a formula for a topic's heat value, computes the heat value of every topic, and finally selects the hot topics according to a heat threshold. Experimental results show that the hot topics obtained agree reasonably well with the actual situation.
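As a rough illustration of the second contribution, the sketch below weights term frequency by position and then boosts words that repeatedly co-occur with others. It is a minimal sketch under assumptions, not the thesis's formula: the abstract gives neither the position weights nor the way co-occurrence is combined with TF-IDF, so `POSITION_WEIGHTS`, the smoothed IDF, the `min_cooc` cutoff, and the multiplicative boost are all hypothetical choices made here for demonstration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical weights: the abstract only says positions get different weights.
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0, "comment": 0.5}


def weighted_tfidf(docs):
    """docs: list of {position: [tokens]}. Returns one {word: score} dict per post."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update({w for toks in doc.values() for w in toks})
    scores = []
    for doc in docs:
        wtf = Counter()
        for pos, toks in doc.items():
            for w in toks:
                wtf[w] += POSITION_WEIGHTS[pos]  # position-weighted term count
        total = sum(wtf.values()) or 1.0  # guard against empty posts
        scores.append({
            # Smoothed IDF (an assumed variant) keeps every score positive.
            w: (f / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, f in wtf.items()
        })
    return scores


def topic_words(docs, top_k=5, min_cooc=2):
    """Rank words by summed weighted TF-IDF, boosted by recurring co-occurrence."""
    base = Counter()
    for per_post in weighted_tfidf(docs):
        base.update(per_post)
    doc_sets = [{w for toks in doc.values() for w in toks} for doc in docs]
    df = Counter(w for s in doc_sets for w in s)
    cooc = Counter()
    for s in doc_sets:
        for a, b in combinations(sorted(s), 2):
            cooc[a, b] += 1
            cooc[b, a] += 1
    link = Counter()
    for (a, b), c in cooc.items():
        if c >= min_cooc:
            link[a] += c / df[a]  # share of a's posts that also contain b
    final = {w: base[w] * (1 + link[w]) for w in base}
    return sorted(final, key=final.get, reverse=True)[:top_k]


docs = [
    {"title": ["地震"], "body": ["救援"], "comment": []},
    {"title": ["地震"], "body": ["救援", "捐款"], "comment": []},
    {"title": [], "body": ["电影"], "comment": ["地震"]},
]
ranking = topic_words(docs, top_k=4)
```

In this toy data the title weight raises words that headline a post, and the co-occurrence boost favors words that keep appearing together across posts over a word that is frequent in only one post.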
[Degree-granting institution]: Nanchang University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1

Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2415197.html

