Research on Methods for Identifying Malicious Weibo Users

Published: 2018-04-05 18:34

Topic: Weibo  Focus: malicious users  Source: Beijing Jiaotong University, 2017 master's thesis

[Abstract]: With the rapid development of the Internet, social networks such as Twitter and Facebook have grown quickly and have become an indispensable part of modern life. In China, the most representative social network is Weibo, whose role has long gone beyond pure socializing: it has become a central hub for spreading information. At the same time, Weibo is exploited by malicious users. In large numbers, these users spread false and malicious information and influence how people view events. Research on countering malicious users therefore has real practical significance, and malicious user identification is an active research topic within it. Taking Sina Weibo users as its subject, this thesis focuses on identifying malicious users in the Weibo network. The research was supported by the National Natural Science Foundation of China (Nos. 61271308, 61172072, 61401015) and the Beijing Municipal Education Commission's graduate discipline construction project. The main work is as follows. Starting from the characteristics of malicious users, and drawing on Weibo's functional features and users' usage habits, the thesis finds that malicious users and normal users differ considerably in how they use Weibo's "favorites" function. It therefore adds "number of favorites" and "favoriting speed" to the feature list and verifies their contribution to identification performance. The thesis uses the Weka Java API to call and tune the algorithms in Weka and, for the case of missing user information, compares the classification performance of three algorithms (naive Bayes, the C4.5 decision tree, and random forest) before and after the missing data are handled. The comparison concludes that, when data are missing, both the C4.5 decision tree and random forest are robust, with random forest performing better of the two. The thesis also simulates a practical deployment and studies how to improve the efficiency of malicious user identification on larger-scale data. By deploying a Hadoop distributed architecture, it compares processing times for data sets of different sizes across different numbers of nodes, as well as the resulting identification performance. Overall, the thesis analyzes the differences between malicious and normal users from the perspective of user features and selects suitable classification algorithms based on those features, achieving an identification accuracy close to 90%.
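The feature step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which used the Weka Java API. The abstract does not define the favorites-based "speed" feature or say how missing data were handled, so two assumptions are made here: speed is taken to be favorites per day of account age, and missing values are filled with the column mean. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class WeiboUser:
    # Hypothetical field names; None marks a missing value.
    favorites_count: Optional[int]
    account_age_days: Optional[int]


def favorite_speed(user: WeiboUser) -> Optional[float]:
    """Favorites per day of account age (an assumed definition)."""
    if user.favorites_count is None or not user.account_age_days:
        return None  # missing data or zero age: speed is undefined
    return user.favorites_count / user.account_age_days


def build_features(users: List[WeiboUser]) -> List[List[Optional[float]]]:
    """One row per user: [favorites_count, favorite_speed]."""
    return [
        [
            None if u.favorites_count is None else float(u.favorites_count),
            favorite_speed(u),
        ]
        for u in users
    ]


def impute_column_means(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Replace each None with the mean of the observed values in its column.

    Mean imputation is an assumption; the abstract does not name the method.
    """
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        means.append(mean(observed) if observed else 0.0)
    return [
        [means[c] if r[c] is None else r[c] for c in range(n_cols)]
        for r in rows
    ]


if __name__ == "__main__":
    users = [WeiboUser(0, 100), WeiboUser(50, 100), WeiboUser(None, 200)]
    print(impute_column_means(build_features(users)))
```

The imputed rows could then be passed to any classifier. Comparing results on the rows before and after imputation mirrors, in spirit, the before-and-after comparison the abstract describes.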
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP18;TP393.092

[Similar literature]