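Extracting Education-Domain Weibo Accounts Based on Topic Models
Published: 2018-06-22 05:13
Topics: information redundancy + object screening; Source: Central China Normal University, 2017 master's thesis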
[Abstract]: The rapid development and widespread adoption of the Internet are profoundly shaping social development and the spread of information. More and more people habitually use online channels such as Weibo, forums, and communities to share news, events, policies, and other information. Education is also changing quickly in this new era, and the growth of information platforms gives us a major shortcut to education-related information. Abundant information, however, brings information redundancy with it. In a fast-paced life, we therefore want to capture frontier content in education as quickly and comprehensively as possible. This thesis studies Weibo blogger accounts that publish education-related content. The goal is a method for screening a small set of bloggers out of a large pool of candidates, so that following the posts of these high-profile verified ("big V") accounts yields the latest education news with broad coverage.
To address this problem, we first analyze existing research and methods, then focus on topic models, which are comparatively effective. Taking into account the characteristics of the education domain and of Weibo text, we define criteria for a preliminary selection of candidates and identify suitable samples. We then collect their text data and use the word segmentation tool from the Chinese Academy of Sciences for data conversion and preprocessing. A vocabulary with corresponding word IDs is compiled, producing triples of the form "blogger index - word ID - word frequency" that can be fed directly into the model.
We run three progressive layers of experiments on the data. First, we draw a small sample and analyze it both with the AT model and with multiple rounds of manual review, look for the screening procedure whose results agree most closely between the two, and adopt it as the screening mechanism of this thesis: the AT model first partitions the topics; the topics are then summarized from their keywords and ranked by their proportions; and priority goes to the bloggers who appear most often at the same rank. Finally, we apply this procedure to the collected samples at two scales to find the optimal set of users to follow within the selected topics. This work provides an example for handling similar problems. Several directions remain open, such as updating the candidate blogger pool in real time to accommodate topic drift, analyzing the degree of association between bloggers, and considering blogger backgrounds. These require continued research, for which the results of this study can lay a foundation.
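The conversion into "blogger index - word ID - word frequency" triples described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual code: the function name `build_triples`, the 1-based numbering, and the rule of assigning word IDs in order of first appearance are all assumptions, and the input is assumed to be already segmented into tokens (the thesis used a Chinese Academy of Sciences segmenter for that step).

```python
from collections import Counter


def build_triples(tokens_by_blogger):
    """Turn segmented posts into (blogger index, word ID, frequency) triples.

    tokens_by_blogger: list in which item i holds all tokens for one blogger.
    Hypothetical helper: names and ID numbering are illustrative assumptions.
    """
    vocab = {}    # word -> word ID, assigned in order of first appearance
    triples = []
    for blogger_idx, tokens in enumerate(tokens_by_blogger, start=1):
        counts = Counter(tokens)  # word frequencies for this blogger
        for word, freq in counts.items():
            # Reuse the existing ID, or assign the next free one (1-based).
            word_id = vocab.setdefault(word, len(vocab) + 1)
            triples.append((blogger_idx, word_id, freq))
    return vocab, triples


# Toy input: two bloggers with already-segmented tokens.
sample = [
    ["education", "reform", "education"],
    ["exam", "education"],
]
vocab, triples = build_triples(sample)
```

Each triple is one nonzero cell of a sparse blogger-by-word count matrix, which is the kind of input a topic model typically consumes.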
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092
Article ID: 2051730
Article link: https://www.wllwen.com/guanlilunwen/ydhl/2051730.html