基于有限内存的大图聚集算法研究

发布时间：2018-07-13 13:19

【摘要】：图聚集技术将原图中的顶点和边集合抽象到一个更高的层次,从而获得一个可以代表原图的简洁超图。相对于原图,它能有效的节省存储空间、解决社交网络中的隐私保护以及实现模糊查询等问题。而现有的图聚集算法是假设整个图结构数据能够一次性全部加载到的内存的,而对于无法一次性完整加载到内存的大规模图数据是无法处理的。随着真实世界的实体扩张和关系的复杂,图数据急剧增长,这样的规模巨大其结构复杂的大图越来越多,为图挖掘工作带来了巨大的挑战。同样,传统的图聚集算法也不适用有限内存下大图聚集。目前比较方便的解决方式是利用分布式图计算系统,例如如基于Spark框架的大规模的图计算引擎GraphX,基于BSP框架(Bulk Synchronous Parallel),即整体同步并行模型的Pregel图计算系统,都是专门用来处理和优化各类复杂的图计算任务,但其在成本花费的控制、安全性、算法调试和管理方面都是比较难。针对这种问题,学术界提出基于单机上的图计算平台,例如GraphChi、TurboGraph等图计算平台。相对于分布式图计算系统,基于单机的图计算平台不但在图计算效率上达到合理期望,其适用性更强。本文首先分析了图聚集和图聚类的区别和联系,帮助研究者深入了解图聚集的概念。针对目前图聚集算法迭代次数过多问题,提出新的图聚集算法,通过实验证明,该算法能有效时间内处理基于内存中的大图。并在单机图计算系统GraphChi上实现该算法,实现基于有限内存的大图聚集。
[Abstract]:The graph aggregation technique abstracts the set of vertices and edges in the original graph to a higher level, thus obtaining a simple hypergraph that can represent the original graph. Compared with the original image, it can save storage space effectively, solve privacy protection and fuzzy query in social network. The existing graph aggregation algorithm assumes that the whole graph structure data can be loaded into memory at one time, but the large scale graph data that can not be loaded into memory at one time can not be processed. With the expansion of entities and the complexity of relationships in the real world, the graph data increases rapidly, and more large graphs with such large scale and complex structure bring great challenges to graph mining work. Similarly, the traditional graph aggregation algorithm is not suitable for large graph aggregation in limited memory. At present, a more convenient solution is to use distributed graph computing systems, such as GraphX, a large-scale graph computing engine based on Spark framework, and Pregel graph computing system based on bulk synchronous parallel model. All of them are specially used to deal with and optimize all kinds of complex graph computing tasks, but it is difficult to control cost, security, algorithm debugging and management. In order to solve this problem, a graph computing platform based on a single computer is proposed, such as graph computing platform such as Graph Chii Turbo Graph. Compared with the distributed graph computing system, the graph computing platform based on single machine not only achieves reasonable expectation in graph computing efficiency, but also has stronger applicability. In this paper, the difference and relation between graph aggregation and graph clustering are analyzed, which helps researchers to understand the concept of graph aggregation. A new graph aggregation algorithm is proposed to solve the problem of too many iterations of graph aggregation algorithm. Experiments show that the algorithm can deal with large graph based on memory effectively in time. The algorithm is implemented on the single computer graph computing system GraphChi, and the large graph aggregation based on limited memory is realized.
【学位授予单位】：昆明理工大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP301.6

【参考文献】