Research on Content-Based Public Opinion Prediction for Sina Weibo
Published: 2018-08-18 09:33
[Abstract]: With the rapid growth of the Internet, the web has become an important channel through which people obtain information and express opinions. Sina Weibo, with its short posts and simple means of expression, has attracted a large user base. Today it has more than 200 million monthly active users and daily active users in the tens of millions; users publish large volumes of posts around the clock and actively repost, comment on, and like them. While Weibo makes information dissemination and the discussion of trending topics more convenient, it also creates conditions in which false information can breed. The spread of negative and false information disrupts a healthy online environment and has a negative effect on society. The volume of data on the platform is enormous, however, and relying on manual operation and management alone both limits how much information can be gathered and consumes considerable manpower and resources. A public opinion monitoring system can detect hot events promptly and turn the whole monitoring process into an automated platform that runs efficiently. This thesis applies text mining techniques to classify and cluster massive numbers of Weibo posts. In the text vectorization stage, a distributed chi-square feature extraction method reduces dimensionality, and TF-IDF values serve as term weights. Classification uses a support vector machine and clustering uses k-means. Events are formed on the basis of the classification and clustering results. Event heat is computed from the total reposts, comments, and likes of the posts in an event. The result is monitoring data for hot events, and the system also supports analysis and display of historical event data. Building on earlier public opinion research, this thesis implements a content-based public opinion monitoring system and divides posts into categories before event clustering, which gives the monitored events broader coverage and richer content.
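The vectorization and heat steps named in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's distributed implementation: the function names, the pre-tokenized toy posts, the plain `log(N/df)` IDF, and the equal weights in the heat formula are all assumptions made for illustration, and the SVM and k-means stages are left out.

```python
import math
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Chi-square score of every term for one category, from a 2x2 table.

    docs   : list of token lists (assumed already word-segmented)
    labels : category label of each document
    target : the category being scored
    """
    n = len(docs)
    doc_sets = [set(doc) for doc in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        a = b = c = 0
        for tokens, label in zip(doc_sets, labels):
            present = term in tokens
            in_class = label == target
            if present and in_class:
                a += 1          # term present, document in target class
            elif present:
                b += 1          # term present, document in another class
            elif in_class:
                c += 1          # term absent, document in target class
        d = n - a - b - c       # term absent, document in another class
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def tfidf_vectors(docs, vocab):
    """Sparse TF-IDF vectors restricted to the selected vocabulary."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc) if t in vocab)
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        vectors.append(
            {t: (count / len(doc)) * math.log(n / df[t]) for t, count in tf.items()}
        )
    return vectors


def event_heat(posts, weights=(1.0, 1.0, 1.0)):
    """Heat of an event as a weighted sum of reposts, comments, and likes.

    Equal weights are an assumption; the abstract does not state the formula.
    """
    w_repost, w_comment, w_like = weights
    return sum(
        w_repost * p["reposts"] + w_comment * p["comments"] + w_like * p["likes"]
        for p in posts
    )


# Toy data (hypothetical): four tokenized posts in two categories.
docs = [
    ["match", "goal", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "fund"],
    ["market", "fund", "rate"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = chi_square_scores(docs, labels, "sports")
top_terms = {t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:4]}
vectors = tfidf_vectors(docs, top_terms)
heat = event_heat([
    {"reposts": 10, "comments": 5, "likes": 20},
    {"reposts": 1, "comments": 0, "likes": 4},
])
```

Because the chi-square statistic is symmetric, terms that are strongly absent from the target category score as high as terms that are strongly present, so the selected vocabulary here mixes both kinds. The resulting sparse vectors would then be passed to a classifier and a clustering step.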
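[Degree-granting institution]: Capital University of Economics and Business
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: G206;C912.63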
Article ID: 2189067
Article link: https://www.wllwen.com/shekelunwen/shgj/2189067.html