Research on Weibo Spam Comment Identification Based on AdaBoost-LC

Published: 2018-06-28 04:39

Topic: Weibo + spam comment identification; Source: Chongqing University, 2014 master's thesis

【Abstract】: With the rapid development of Web 2.0 and the Internet, social networks have grown explosively. Weibo, a major representative of social networks, has become part of everyday life and is now one of the main online activities of Internet users. Precisely because Weibo is convenient, fast, wide-reaching, efficient, and "back-to-face", it has attracted the attention of spammers. For various purposes, spammers post large numbers of spam comments on Weibo. The proliferation of these comments disrupts communication among users, can lead users to be deceived, and hinders comment-oriented data mining, so identifying and filtering spam comments is of great significance. This thesis studies spam comment identification in the Weibo domain. The main work and results are as follows: (1) Because Weibo comments are short and their features tend to be sparse after word segmentation, each comment is represented as a feature-value vector of nine feature values that describe the comment's content from several different angles. On this basis, an AdaBoost-LC method for identifying Weibo spam comments is proposed. It takes the simplest linear classifier, the single-threshold binary classifier, as the base classifier, and then uses an ensemble learning algorithm, AdaBoost, to improve the classification accuracy of the base classifier. (2) AdaBoost-LC has two shortcomings: the degradation caused by the sharp growth of the weights of "hard" samples, and the fact that, in spam comment identification, misclassifying a normal comment is the more costly error. To address them, an improved algorithm, AdaBoost-Ex, is proposed for identifying spam comments. (3) Spam comments may exhibit new features, and a classifier's performance may decline over time so that it needs to be retrained. To handle this, a modular incremental learning model is designed for the algorithm. The model retains the previously learned rules and only needs to learn rules from the new samples; the learned sub-classifiers are fused into the incremental learning system by linear weighting, which gives the algorithm progressive learning ability and makes it more practical. Finally, the proposed methods were each evaluated on a real data set of comments from popular Sina Weibo posts, and the experiments show that they identify Weibo spam comments well.
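The base scheme described in item (1), AdaBoost over single-threshold binary classifiers, can be sketched roughly as follows. This is a minimal illustration of standard discrete AdaBoost with threshold stumps, not the thesis's AdaBoost-LC implementation. The two features (`url_count`, `ad_keyword_ratio`), the toy data, and all function names are assumptions made for the example; the abstract does not list the nine features actually used.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """Single-threshold binary classifier: +polarity if value >= threshold."""
    return polarity if row[feature] >= threshold else -polarity

def best_stump(X, y, weights):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                error = sum(
                    w for row, label, w in zip(X, y, weights)
                    if stump_predict(row, feature, threshold, polarity) != label
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best

def adaboost_fit(X, y, rounds=10):
    """Train a list of (alpha, feature, threshold, polarity) stumps."""
    n = len(X)
    weights = [1.0 / n] * n          # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, polarity = best_stump(X, y, weights)
        if error >= 0.5:             # no stump beats chance; stop early
            break
        error = max(error, 1e-10)    # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, feature, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest.
        weights = [
            w * math.exp(-alpha * label * stump_predict(row, feature, threshold, polarity))
            for row, label, w in zip(X, y, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of the stumps: +1 = spam, -1 = normal."""
    score = sum(a * stump_predict(row, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: [url_count, ad_keyword_ratio]; +1 = spam, -1 = normal.
X_toy = [[0, 0.0], [2, 0.0], [0, 0.6], [1, 0.8]]
y_toy = [-1, 1, 1, 1]
model = adaboost_fit(X_toy, y_toy, rounds=3)
predictions = [adaboost_predict(model, row) for row in X_toy]
```

On this toy set no single stump separates the classes, but three boosted stumps do. The exponential reweighting step is also where the "hard" sample weights grow quickly, which is the degradation that item (2) attributes to AdaBoost-LC; how AdaBoost-Ex limits that growth or weights the two error types is not specified in the abstract, so it is not shown here.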
【Degree-granting institution】: Chongqing University
【Degree level】: Master's
【Year degree awarded】: 2014
【Classification number】: TP393.092; TP391.1

【References】