利用图计算检测大规模社交网络虚假账户

发布时间：2018-03-05 16:24

本文选题：在线社交网络　切入点：网络安全　出处：《吉林大学》2017年硕士论文　论文类型：学位论文

【摘要】：近年来,随着在线社交网络规模的不断发展,攻击者们为了牟取利益通过虚假账户进行恶意攻击行为,包括传播垃圾信息,窃取用户隐私等,极大的威胁了社交网络用户的安全,阻碍了社交网络的发展。因此,社交网络虚假账户的检测成为了网络安全及信息安全的关键研究性问题之一。目前,关于虚假账户检测的方法有很多,包括基于用户行为特征,基于用户信息内容以及基于社交网络图结构等三大类检测方法。其中,基于图结构的检测方法有着检测效率高,攻击者不能轻易模拟去躲避检测,特征抓取简单等优点。但是随着社交网络规模的不断扩大,现有的复杂的检测算法很难扩展并应用到实际的大规模的社交网络检测中。并且,传统的大规模处理框架,如Map Reduce,很难去处理非结构化的图结构数据,因此,一些工作,开始使用Pregel/Giraph等,一类以点为中心的图计算系统,来实现检测大规模社交网络。但是目前的图计算系统的设计都是过度通用化的,即它们都是被设计出来处理通用图的,而不是针对具有特征的社交网络图,因此现有的系统在处理社交网络图时性能不佳。本文通过对社交网络图数据的研究,提出了一种针对社交网络图特征优化的以点为中心的单机图计算系统框架,并且分析现有的检测算法,在我们的系统实现了2种不同类别的基于图结构的虚假账户检测算法。具体工作如下:1.在系统层面,我们对目前的社交网络图进行分析,考虑到目前社交网络以及单机服务器的配置,参考利用核外计算技术来提高计算扩展性的单机图计算系统来设计检测系统框架。2.在数据层面,考虑到社交网络图的幂率分布,将图按照顶点度数分成2个不相交集合,称为重集合和轻集合,对轻重集合使用不同的存储格式,执行模式,选择性调度策略以及缓存策略来优化图系统。3.在算法层面,分析了目前现有的基于图结构的虚假账户检测算法,将其中涉及的图算法归为两类,一类是基于随机游走的幂迭代算法,如Sybil Rank;一类是基于社区发现的遍历型算法,如COLOR/COLOR+,本文提出d Sybil Rank和d COLOR算法,即将两类算法转化为以点为中心的并行迭代式图算法,来提高原有检测算法的效率。4.最后,我们在我们系统上测试了以上两类社交网络检测算法的性能。实验结果显示,我们的系统展现了很好的性能,比如,我们在单机服务器上处理5千万顶点的网络需要459秒,相比于Sybil Rank处理1.6亿顶点数据需要11台m1.large集群同时处理33小时表现的很优秀。同时,我们也将我们的系统与现有的图计算系统相比较,其性能在社交网络图上提高到1.14~5.91倍。
[Abstract]:In recent years, with the continuous development of the scale of online social network, the attackers, in order to gain profits, carry out malicious attacks through false accounts, including spreading spam information, stealing user privacy, etc. It greatly threatens the security of social network users and hinders the development of social network. Therefore, the detection of false accounts of social network has become one of the key research problems of network security and information security. There are many methods to detect false accounts, including three kinds of detection methods, which are based on user behavior characteristics, user information content and social network graph structure, among which, the detection method based on graph structure has high detection efficiency. Attackers can not easily simulate to avoid detection, feature capture simple and other advantages. But with the continuous expansion of the scale of social networks, the existing complex detection algorithms are difficult to extend and apply to the actual large-scale social network detection. Traditional large-scale processing framework, such as Map reduction, is very difficult to deal with unstructured graph structure data. Therefore, some work has begun to use Pregel/Giraph, a kind of point-centered graph computing system. To detect large-scale social networks. But the current graphic computing systems are designed to be overly generic, that is, they are designed to deal with universal graphs, not for characteristic social network diagrams. Therefore, the existing system has poor performance in dealing with the social network graph. Through the research of the social network graph data, this paper proposes a point-centered single-machine graph computing system framework for the social network graph feature optimization. And analyzing the existing detection algorithms, we implemented two different kinds of false account detection algorithms based on graph structure in our system. The specific work is as follows: 1. At the system level, we analyze the current social network graph. Considering the configuration of social network and single-machine server, the frame of the detection system is designed by using the technology of out-of-core computing to improve the expansibility of computing. 2. At the data level, considering the distribution of power rate of social network graph, The graph is divided into two disjoint sets according to the vertex degrees, which are called heavy set and light set. Different storage format, execution mode, selective scheduling strategy and cache policy are used to optimize graph system. In this paper, the existing algorithms of false account detection based on graph structure are analyzed. The graph algorithms are classified into two categories: one is power iteration algorithm based on random walk, such as Sybil rank, the other is traversal algorithm based on community discovery, such as COLOR/COLOR. In this paper, d Sybil Rank and d COLOR algorithms are proposed, which transform the two algorithms into point-centered parallel iterative graph algorithms to improve the efficiency of the original detection algorithms. Finally, We tested the performance of these two kinds of social network detection algorithms on our system. Experimental results show that our system shows good performance. For example, we need 459 seconds to handle the 50 million vertex network on a single server. Compared to Sybil Rank, it takes 11 m1.Large clusters to process 160 million vertex data and 33 hours of processing at the same time. At the same time, we have improved our system to 1.14591 times in social network compared with the existing graph computing system.
【学位授予单位】：吉林大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP393.08

【相似文献】