图挖掘算法的研究与应用

发布时间：2018-04-14 00:16

本文选题：图挖掘 + 极大团　；参考：《北京邮电大学》2016年硕士论文

【摘要】：近年来,随着社交网络的兴起,图挖掘越来越受到研究人员的重视,有关图挖掘研究的论文在SigKDD、ICDM、SiamDM等会议中有逐年递增的趋势。极大团挖掘是图挖掘中的一个基本问题,有着丰富的研究意义与极高的实践价值。如进行社交网络分析、利用邮件网络推断社会等级、研究行为与认知网络的结构、对金融网络进行统计分析、对动态网络进行聚类、侦测各种恐怖活动和紧急事件等。本论文以极大团算法的研究为切入点,对这一基本的图挖掘问题进行了理论上的研究并给出了其实践中的应用。论文的研究主要分为三大部分:1.研究并实现了 Base BK算法、Improved BK系列算法和Kose系列算法,对部分算法进行对比实验,分析算法的适用场景与时空优劣。首先分析Base BK算法,这是最早的一个使用递归方式挖掘极大团的经典算法。进而研究Improved BK系列算法,这些算法致力于解决Base BK算法递归空间树规模过于庞大的问题,提出了各种启发式选择标志节点的方法对Base BK算法的递归空间树进行剪枝。其中著名的是最小计数器版本的Improved BK算法和随机CANDIDATESNOT版本的Improved BK算法,论文都有理论分析与实现。也研究了 Kose系列算法,Kose算法使用迭代方式挖掘图中的极大团,虽然避免了生成不必要的子团,却导致内存消耗急剧增大。由此衍生了 Kose RAM算法和Kose Disk算法,它们分别将中间数据存入内存和硬盘,所以耗费的内存不同。本论文研究并实现了 Kose RAM算法。最后取七个经典的极大团挖掘算法,在单进程模式下实现,在相同的实验环境下做了 1000多个实验,根据实验结果反馈分析算法的原理并给出了不同环境下如何选择极大团挖掘算法的建议。2.研究FAMCELM算法,指出使其运行失效的场景,进而提出受限内存下的解决方案,并以理论和实验证明算法的可行性。Fast Algorithms for Maximal Clique Enumeration with Limited Memory 论文提出了在大规模图中挖掘极大团一个分布式的算法。这一算法有着三大优势:适用于内存有限的场景、减少了挖掘极大团的机器的CPU消耗、可以在并行环境下实现。研究中发现,面对图中存在度极大的节点的情形,论文中的SeqMCE算法会失效。分析了产生这个问题的原因,给出了解决这一问题的一个方案及严格的理论证明。在对SeqMCE做了其他的优化后提出了 Improved SeqMCE算法,这一算法解决了遇到度极大的节点的问题,代价则是重复挖掘了部分全局极大团。最后在内存受限的场景下运行Improved SeqMCE算法,从而证明其在实践中的可行性。3.依托于Improved BK算法和Improved SeqMCE算法,为解决统计段时间内不同的发送垃圾短信的手机号码总数的问题,提出了基于先验概率的Hash算法——GHash算法。作为算法的预处理部分,首先需挖掘出等长字符串(场景下则是手机号码)的统计规律。论文将字符串的各个位置抽象成一个图的节点,创造性地将挖掘统计规律的问题转化成寻找极长链的问题,进而转化成在其传递闭包中挖掘极大团的问题。在使用Improved BK算法挖掘出图中的极大团后,再将节点映射回原字符串的相应位置,从而得到原字符串的统计规律。为解决极大团挖掘算法为NP问题,算法耗时过久的问题,又提出了通过计算信息熵从而得到统计规律的一个子集的替代方案,这在手机号码的场景下运行良好。在一个有着3000多万手机号码的数据集上的实验表明,GHash算法所占用的内存远小于使用Bitmap算法所占用的内存;速度上则远快于使用AVL树、红黑树和Trie树完成检索的场景;相比较布隆过滤器,GHash算法是1000%准确的;相比较传统的部分Hash算法,如djb2、fnv-1、sdbm等,GHash算法无论是运行时间还是冲突数,都有明显的优势。
[Abstract]:In recent years, with the rise of social networks, graph mining researchers pay more attention to the related research, graph mining papers in SigKDD, ICDM, SiamDM etc. is increasing meeting. The maximum clique mining is a basic problem in graph mining, has abundant research significance and high practical value. Such as social network analysis, using e-mail network to infer the social hierarchy, structure of behavioral and cognitive networks, statistical analysis of the financial network, clustering of dynamic network, detection of various terrorist activities and emergencies. This thesis studies the maximum clique algorithm as the starting point, mining of the basic map the problem was presented on the theory research and its application in practice. The thesis is divided into three parts: 1. the research and implementation of Base BK algorithm, Improved algorithm and BK series Kose series of algorithm. Some algorithms are compared, the suitable scene analysis algorithm and spatial analysis of Base BK algorithm performance. First, this is a classic algorithm using the recursive method the earliest mining maximum. Then study the Improved series of BK algorithm, the algorithm to solve Base BK algorithm recursive tree space scale is too large, the the various heuristic node selection marker method of recursive space tree of Base BK algorithm for pruning. The famous Improved BK algorithm Improved BK algorithm and random version of the CANDIDATESNOT version of the minimum counter, the theory and implementation of the theoretical analysis. Also study the series of Kose algorithm, Kose algorithm using iterative method, mining maximal clique graph although, to avoid generating unnecessary sub group, but lead to memory consumption increases rapidly. The Kose derived from the RAM algorithm and Kose Disk algorithm, respectively. The intermediate data stored in memory and hard disk, so the cost of memory. The research and implementation of Kose RAM algorithm. The maximum clique finally take seven classical mining algorithm, implemented in a single process mode, more than 1000 experiments were done in the same experimental environment, according to the proposed.2. FAMCELM algorithm research results feedback analysis of the principle of the algorithm is given under different circumstances, how to choose the maximum clique algorithm, the operation failure of the scene, and then propose solutions under the limited memory, and by theoretical analysis and experimental results show the feasibility of the algorithm.Fast Algorithms for Maximal Clique Enumeration with Limited Memory is proposed in a large graph mining maximal clique a distributed algorithm. This algorithm has three major advantages: suitable for limited memory of the scene, reduce the mining maximal clique CPU machine consumption, can In the parallel environment. The study found that in the face of great presence of node degree in the case of SeqMCE algorithm in this paper will failure. Analysis of the causes of this problem, gives a solution to this problem and the strict theoretical proof. In the other optimization of SeqMCE is proposed Improved SeqMCE algorithm, this algorithm solves the problems encountered node degree greatly, it is repeated mining of partial global maximal clique. Finally, run the Improved SeqMCE algorithm in memory constrained scenarios, thus proving the feasibility in the practice of.3. based on Improved BK algorithm and Improved SeqMCE algorithm, for the total number of mobile phone number to send spam messages of different time. To solve the problem, Hash algorithm is proposed based on prior probability -- GHash algorithm. As a preprocessing algorithm, first of all need to dig out so long String (scenario is the mobile phone number) of the statistical law. This paper will abstract each position string nodes of a graph, creatively mining statistical regularity problem into finding the very long chain problem, and then transformed into mining maximal clique in the transitive closure of the problem. In the use of Improved BK algorithm the maximal cliques in graphs, then the corresponding node mapped back to the original string, so as to obtain the statistical regularity of the original string. In order to solve the maximum clique algorithm for NP problem, the algorithm is time-consuming too long, and put forward the alternative one by calculating the information entropy to obtain statistical rule sets, this running well in the mobile phone number of scenes. Show that in a about 30000000 mobile phone number of the data set on the experiment, the GHash algorithm is far less than the memory occupied by using the Bitmap algorithm the memory speed; Is far faster than the AVL tree, tree and Trie tree search scene; comparison of Bloom filter, GHash algorithm is 1000% accurate; part compared with the traditional Hash algorithm, such as djb2, fnv-1, sdbm, GHash algorithm both running time or the number of conflicts, have obvious advantages.

【学位授予单位】：北京邮电大学
【学位级别】：硕士
【学位授予年份】：2016
【分类号】：TP311.13

【参考文献】