
Research on a Parallel Programming Communication Interface for GPU Clusters

Published: 2018-01-19 15:42

  Keywords: GPU cluster; parallel programming; cluster communication; global arrays. Source: Master's thesis, Huazhong University of Science and Technology, 2012. Document type: degree thesis.


【Abstract】: The graphics processing unit (GPU) excels at large-scale, data-intensive and data-parallel workloads, and the CUDA general-purpose parallel architecture has made GPUs increasingly common in general-purpose computing. Thanks to their high performance-to-cost ratio, GPU clusters are now widely used in high-performance computing, yet GPU-cluster parallel programming still lacks a standard communication model. The vast majority of cluster applications are written with CUDA plus MPI, both of which are difficult to program: the developer must understand the GPU hardware architecture and the MPI message-passing mechanism, and must explicitly manage data transfers between host memory and device memory and between nodes. Parallel programming on GPU clusters therefore remains a complex task.

The GPU-cluster communication interface CUDAGA combines the Global Arrays (GA) shared-memory programming model for distributed memory with the features of CUDA. Using shared device memory, it realizes GPU-to-GPU communication between nodes through a global shared address space, and maintains data consistency, and thus the correctness of communicated data, through an internally transparent temporary global array on the CPU side paired with the global array on the GPU side. The interface also solves the problem of initializing GPU devices in a multi-process, multi-GPU environment, and offers both a cluster-information query API and a graphical monitoring interface so that users can track device usage. In addition, CUDAGA optimizes the array operations of the GA library in two respects, data transfer and compute kernels, and the accelerated functions are directly available to users. CUDAGA thus provides a simple and convenient communication interface for GPU-cluster parallel programming that preserves communication performance while reducing programming difficulty and improving the productivity of programmers writing GPU-cluster applications.
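The abstract describes communication through one-sided put/get operations on a global shared address space, the model that Global Arrays (GA) provides and that CUDAGA extends to GPU device memory. The following is a minimal illustrative sketch of that style of access, not the real CUDAGA API; the `GlobalArray` class and its method names are hypothetical stand-ins:

```python
class GlobalArray:
    """Toy model of a global address space shared by all ranks.

    In CUDAGA the backing store would live in GPU device memory and
    put/get would translate into GPU-to-GPU transfers between nodes;
    here a plain Python list stands in for the distributed buffers.
    """

    def __init__(self, n):
        self.data = [0.0] * n  # stands in for distributed device memory

    def put(self, lo, hi, buf):
        # One-sided write: no matching receive is posted on the
        # rank that owns the target range.
        self.data[lo:hi] = buf

    def get(self, lo, hi):
        # One-sided read from anywhere in the global address space.
        return self.data[lo:hi]


# Two "ranks" exchange boundary data without explicit send/recv pairs:
ga = GlobalArray(8)
ga.put(0, 4, [1.0, 2.0, 3.0, 4.0])  # rank 0 writes its block
ga.put(4, 8, [5.0, 6.0, 7.0, 8.0])  # rank 1 writes its block
halo = ga.get(3, 5)                 # rank 0 reads across the block boundary
```

The point of the model is visible even in this toy: a rank addresses remote data by global index range rather than by pairing messages with another process, which is what lets a GA-style interface hide the explicit host/device and node-to-node transfers that CUDA+MPI forces the programmer to write.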
The interface was evaluated by implementing and running Cannon's parallel matrix-multiplication algorithm and the Jacobi iterative algorithm on a GPU cluster. Measurements of both programming complexity and communication performance show that, for applications whose basic data structure is the array and which involve heavy inter-node communication and many data-access operations, code written with CUDAGA outperforms the CUDA+MPI implementation while cutting code length by more than half, improving programming efficiency.
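Cannon's algorithm, one of the two benchmarks, arranges p = q×q processes in a torus, skews the initial distribution of the input blocks, and then alternates local block multiplication with cyclic shifts of the blocks. The thesis's GPU-cluster implementation is not reproduced here; the following single-process Python sketch only shows the block schedule, with the list rotations standing in for the inter-node transfers a cluster version would perform:

```python
def mat_mul(A, B):
    """Dense matrix product on nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def block(M, i, j, bs):
    """Extract the bs x bs block at block coordinates (i, j)."""
    return [row[j * bs:(j + 1) * bs] for row in M[i * bs:(i + 1) * bs]]

def cannon(A, B, q):
    """Multiply n x n matrices on a simulated q x q process torus."""
    n = len(A)
    bs = n // q  # assumes q divides n
    # Initial skew: process (i, j) holds A-block (i, i+j) and B-block (i+j, j).
    Ab = [[block(A, i, (i + j) % q, bs) for j in range(q)] for i in range(q)]
    Bb = [[block(B, (i + j) % q, j, bs) for j in range(q)] for i in range(q)]
    Cb = [[[[0] * bs for _ in range(bs)] for j in range(q)] for i in range(q)]
    for _ in range(q):
        # Local multiply-accumulate on every "process".
        for i in range(q):
            for j in range(q):
                Cb[i][j] = mat_add(Cb[i][j], mat_mul(Ab[i][j], Bb[i][j]))
        # Cyclic shifts: A-blocks move left, B-blocks move up.  On a real
        # cluster these rotations are the inter-node communication step.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the result from the distributed C blocks.
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(bs):
                for c in range(bs):
                    C[i * bs + r][j * bs + c] = Cb[i][j][r][c]
    return C
```

Because each step touches every block exactly once and each shift moves a full block between neighbours, the communication volume per step is fixed, which is why this pattern is a natural stress test for the put/get path of a cluster communication interface.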
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2012
【CLC number】: TP338.6






