Research on Weibo Rumor Detection Based on Semi-Supervised Learning

Published: 2019-04-29 06:16

[Abstract]: Weibo is a product of the information age, and as it has grown rapidly, the rumors that spread quickly across it have become an increasingly prominent problem. Automatic rumor detection is a prerequisite for studying, monitoring, responding to, and governing rumors on social networks, so it is attracting growing attention and the body of work on Weibo rumor identification keeps expanding. Researchers in China and abroad have studied the credibility of social networks and microblogs extensively, Twitter in particular. The mainstream approach extracts features from user characteristics, text content, and propagation patterns, then builds a classifier to detect rumors. However, traditional machine learning algorithms do not effectively address two problems in Weibo rumor detection: the high cost of labeling data, and the low detection accuracy caused by class imbalance. Taking Sina Weibo as the setting and Weibo rumors as the research object, this thesis follows earlier work in treating detection as a classification problem, and focuses on the high labeling cost of traditional supervised learning by introducing semi-supervised learning into Weibo rumor detection. At the same time, because rumors on Weibo are far fewer than non-rumors, and because correctly identifying a rumor is more valuable than correctly identifying a non-rumor, the thesis defines Weibo rumor detection as a binary classification problem on imbalanced data. Combining these considerations, it proposes a semi-supervised learning algorithm for imbalanced datasets and applies it to the rumor detection classification task. The work consists of two main parts.
First, for imbalanced dataset classification, the thesis proposes ImCo-Forest (semi-supervised learning algorithm from imbalanced data based on Co-Forest), an improvement of the Co-Forest algorithm for imbalanced datasets. It uses the SMOTE algorithm and stratified sampling to balance the data distribution, and introduces cost-sensitive weighted voting to improve the accuracy of predictions on unlabeled samples. To verify the algorithm's effectiveness, comparative experiments were run on 10 UCI datasets. Second, building on this study of imbalanced classification, the thesis brings machine learning methods for imbalanced datasets into Weibo rumor detection and gives a flowchart of the detection process. Finally, two sets of empirical experiments on Weibo rumors demonstrate the effectiveness and superiority of the proposed method. Experiments on data collected from the Sina Weibo platform show that the method effectively addresses both the high cost of data labeling and the low detection accuracy caused by class imbalance, and that it is suitable for analyzing massive Weibo data and detecting rumors.
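As a rough illustration of two ingredients named in the abstract, the Python sketch below shows SMOTE-style oversampling of the minority (rumor) class and a cost-sensitive weighted vote over ensemble members. It is not the ImCo-Forest implementation: the function names `smote_oversample` and `cost_sensitive_vote`, the member weights, and the cost values are assumptions made for this example, and the Co-Forest training loop and the stratified sampling step are omitted.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, rng=None):
    """Create n_new synthetic minority points (SMOTE-style).

    Each synthetic point lies on the line segment between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(
            tuple(b + gap * (o - b) for b, o in zip(base, other))
        )
    return synthetic


def cost_sensitive_vote(votes, weights, cost):
    """Combine ensemble votes, scaling each vote by a per-class cost.

    votes   -- label predicted by each ensemble member
    weights -- weight of each member (for example, estimated accuracy)
    cost    -- assumed cost of missing each class; a higher cost for the
               minority class makes the ensemble more willing to predict it
    Returns the winning label and its share of the total scaled support.
    """
    support = {}
    for label, weight in zip(votes, weights):
        support[label] = support.get(label, 0.0) + weight * cost[label]
    best = max(support, key=support.get)
    return best, support[best] / sum(support.values())


if __name__ == "__main__":
    rumors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    print(smote_oversample(rumors, n_new=3, k=2))
    print(cost_sensitive_vote(
        ["rumor", "normal", "normal"],
        [0.6, 0.7, 0.7],
        {"rumor": 3.0, "normal": 1.0},
    ))
```

In a semi-supervised loop of the Co-Forest kind, the share returned by `cost_sensitive_vote` could serve as the confidence used to decide whether an unlabeled post is added to the training set; the threshold for that decision is not specified in the abstract.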
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP393.092

[References]