Design and Implementation of a GPU-Accelerated MapReduce Cluster

Published: 2018-09-04 15:22

[Abstract]: Processing massive data faster is a constant pursuit in data-center computing. As data volumes grow explosively and applications demand ever more timely processing, the pressure on data processing keeps rising, and the existing software and hardware architectures for large-scale data processing have to be improved. MapReduce, a distributed parallel computing model, is widely used in enterprise big-data computing. In recent years researchers have explored the performance potential of the MapReduce model from many angles, and hardware-accelerated MapReduce is one novel approach. This thesis presents a MapReduce implementation platform accelerated by graphics processing units (GPUs). A GPU is a highly parallel many-core processor that can launch thousands of threads simultaneously and significantly increase computing speed. Heterogeneous coprocessors, typified by the GPU, are already widely accepted in fields such as high-performance computing. On this basis, we attempt to combine the computing power of the GPU with the strengths of the MapReduce model in data-intensive applications to build a high-performance, GPU-accelerated MapReduce cluster. The work and results are as follows:
1. We design and implement GAMR, a GPU-accelerated MapReduce framework and cluster system.
2. We propose a GPU-based parallel sorting algorithm and apply it in the GAMR cluster system, speeding up sorting during job execution by a factor of 3 to 8.
3. We analyze the data flow of a MapReduce job in detail and derive a formal quantitative performance model, so that the performance of a MapReduce job can be evaluated by computing formulas.
4. We propose an automated method for optimizing MapReduce cluster performance based on the conjugate gradient optimization algorithm, which reduces the workload of cluster operations staff.
The core idea of our work is to extend the parallelism of the MapReduce model from coarse-grained multi-computer parallelism across nodes to fine-grained many-core parallelism within each node, using heterogeneous coprocessors to improve the performance of the MapReduce runtime environment. Experiments show that, compared with other MapReduce implementations, MapReduce jobs running on the GAMR cluster achieve a speedup of about 5 times.
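The abstract does not say which sorting algorithm GAMR uses. Purely as an illustration of why sorting maps well onto many-core hardware, the sketch below uses bitonic sort, a standard data-parallel sorting network, written in plain Python rather than GPU code. The function name `bitonic_sort` and the power-of-two length restriction are assumptions made for this example and are not details taken from the thesis.

```python
def bitonic_sort(keys):
    """Return a sorted copy of `keys` using a bitonic sorting network.

    Illustrative stand-in only: the thesis's own GPU sorting algorithm
    is not described in the abstract. Length must be a power of two.
    """
    n = len(keys)
    if n & (n - 1):
        raise ValueError("length must be a power of two (pad the input otherwise)")
    data = list(keys)
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # compare-exchange distance for this stage
            # Within one stage, each pair (i, partner) is disjoint from all
            # other pairs, so a GPU could give every index i its own thread.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data


if __name__ == "__main__":
    # Hypothetical map-output records, sorted by (key, value) as in a shuffle.
    records = [("pear", 1), ("apple", 3), ("fig", 2), ("apple", 1)]
    print(bitonic_sort(records))
```

The inner `for` loop is the part that a many-core processor would run in parallel; the two outer loops are sequential stages, of which there are on the order of log²(n).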
[Degree-granting institution]: Yunnan University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP338.6

[Citing literature]