A Frequent-Edge-Based Sampling Algorithm for Large-Scale Single Graphs on Spark

Published: 2019-03-01 17:12

[Abstract]: As social networks grow in popularity, demand for frequent subgraph mining on them is increasing. In the big data era, social networks keep growing in scale, which makes frequent subgraph mining increasingly difficult. In practice, frequent subgraphs often do not need to be mined exactly, and sampling can significantly improve mining efficiency while maintaining a certain level of accuracy. Most existing sampling algorithms sample according to node degree and are therefore not well suited to frequent subgraph mining. This paper proposes a frequent-edge-based sampling algorithm, DIMSARI (distributed Monte Carlo sampling algorithm based on random jump and graph induction). It extends the Monte Carlo algorithm with a random jump operation guided by frequent edges and then applies graph induction to the result, which further improves accuracy; the method is also proved theoretically to be unbiased. Experimental results show that frequent subgraph mining on samples produced by DIMSARI is considerably more accurate than with other existing sampling algorithms, and that at different sampling rates the node degrees of the sampled subgraph maintain a smaller normalized mean square deviation.
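
The abstract names three ingredients: a random-walk style sampler, random jumps guided by frequent edges, and a final graph induction step. The sketch below illustrates how these pieces could fit together on a single machine. It is not the authors' DIMSARI implementation and does not use Spark. The function name `sample_graph`, the parameters `min_support` and `jump_prob`, and the definition of an edge type as the unordered pair of endpoint labels are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def sample_graph(edges, labels, sample_rate, min_support, jump_prob=0.15, seed=0):
    """Random walk with jumps to frequent edges, followed by graph induction.

    Hypothetical sketch: edges is a non-empty list of undirected (u, v) pairs,
    labels maps each node to a label.
    """
    rng = random.Random(seed)

    # Adjacency lists of the undirected input graph.
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # Assumption: an edge "type" is the unordered pair of its endpoint labels.
    def edge_type(u, v):
        return tuple(sorted((labels[u], labels[v])))

    # An edge is treated as frequent if its type occurs at least min_support times.
    support = Counter(edge_type(u, v) for u, v in edges)
    frequent = [(u, v) for u, v in edges if support[edge_type(u, v)] >= min_support]
    jump_targets = frequent or edges  # fall back to all edges if none is frequent

    target = max(1, math.ceil(sample_rate * len(neighbors)))
    current = rng.choice(rng.choice(jump_targets))
    sampled = {current}

    # Step cap so the loop ends even if the target is unreachable.
    for _ in range(100 * len(neighbors)):
        if len(sampled) >= target:
            break
        if rng.random() < jump_prob:
            # Random jump: land on an endpoint of a randomly chosen frequent edge.
            current = rng.choice(rng.choice(jump_targets))
        else:
            # Ordinary walk step: move to a uniformly chosen neighbor.
            current = rng.choice(neighbors[current])
        sampled.add(current)

    # Graph induction: keep every original edge whose endpoints were both sampled.
    induced = [(u, v) for u, v in edges if u in sampled and v in sampled]
    return sampled, induced
```

Calling `sample_graph(edges, labels, 0.5, 2)` returns the sampled node set and the induced edge list. The induction step is what restores edges between sampled nodes that the walk itself never traversed.
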
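[Affiliation]: College of Information Science and Engineering, Ningbo University
[Funding]: National Natural Science Foundation of China (61572266, 61472194); Zhejiang Provincial Natural Science Foundation (Y16F020003); Ningbo Natural Science Foundation (2017A610114)
[Classification number]: TP301.6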
