
Research on a Parallel Programming Communication Interface for GPU Clusters

Published: 2018-01-19 15:42

  Keywords: GPU cluster; parallel programming; cluster communication; global arrays. Source: Master's thesis, Huazhong University of Science and Technology, 2012. Document type: degree thesis.


【Abstract】: The graphics processing unit (GPU) excels at large-scale, data-intensive and data-parallel workloads, and the CUDA general-purpose parallel architecture has made GPUs increasingly common in general-purpose computing. Thanks to their high performance-to-cost ratio, GPU clusters are now widely used in high-performance computing, yet GPU-cluster parallel programming still lacks a standard communication model. The vast majority of cluster applications are written with CUDA plus MPI, both of which are difficult to program: the developer must understand the GPU hardware architecture and the MPI message-passing mechanism, and must explicitly manage data transfers between host memory and device memory and between nodes. Parallel programming on GPU clusters therefore remains a complex task.

The GPU-cluster communication interface CUDAGA combines the Global Arrays (GA) shared-memory programming model for distributed memory with the features of CUDA. Using shared device memory, it realizes GPU-to-GPU communication between nodes through a global shared address space, and maintains data consistency, and thus the correctness of communicated data, through an internally transparent temporary global array on the CPU side paired with the global array on the GPU side. The interface also solves the problem of initializing GPU devices in a multi-process, multi-GPU environment, and offers both a cluster-information query API and a graphical monitoring interface so that users can track device usage. In addition, CUDAGA optimizes the array operations of the GA library in two respects, data transfer and compute kernels, and the accelerated functions are directly available to users. CUDAGA thus provides a simple and convenient communication interface for GPU-cluster parallel programming that preserves communication performance while reducing programming difficulty and improving the productivity of programmers writing GPU-cluster applications.
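The abstract describes communication through one-sided put/get operations on a global shared address space, the model that Global Arrays (GA) provides and that CUDAGA extends to GPU device memory. The following is a minimal illustrative sketch of that style of access, not the real CUDAGA API; the `GlobalArray` class and its method names are hypothetical stand-ins:

```python
class GlobalArray:
    """Toy model of a global address space shared by all ranks.

    In CUDAGA the backing store would live in GPU device memory and
    put/get would translate into GPU-to-GPU transfers between nodes;
    here a plain Python list stands in for the distributed buffers.
    """

    def __init__(self, n):
        self.data = [0.0] * n  # stands in for distributed device memory

    def put(self, lo, hi, buf):
        # One-sided write: no matching receive is posted on the
        # rank that owns the target range.
        self.data[lo:hi] = buf

    def get(self, lo, hi):
        # One-sided read from anywhere in the global address space.
        return self.data[lo:hi]


# Two "ranks" exchange boundary data without explicit send/recv pairs:
ga = GlobalArray(8)
ga.put(0, 4, [1.0, 2.0, 3.0, 4.0])  # rank 0 writes its block
ga.put(4, 8, [5.0, 6.0, 7.0, 8.0])  # rank 1 writes its block
halo = ga.get(3, 5)                 # rank 0 reads across the block boundary
```

The point of the model is visible even in this toy: a rank addresses remote data by global index range rather than by pairing messages with another process, which is what lets a GA-style interface hide the explicit host/device and node-to-node transfers that CUDA+MPI forces the programmer to write.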
The interface was evaluated by implementing and running Cannon's parallel matrix-multiplication algorithm and the Jacobi iterative algorithm on a GPU cluster. Measurements of both programming complexity and communication performance show that, for applications whose basic data structure is the array and which involve heavy inter-node communication and many data-access operations, code written with CUDAGA outperforms the CUDA+MPI implementation while cutting code length by more than half, improving programming efficiency.
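Cannon's algorithm, one of the two benchmarks, arranges p = q×q processes in a torus, skews the initial distribution of the input blocks, and then alternates local block multiplication with cyclic shifts of the blocks. The thesis's GPU-cluster implementation is not reproduced here; the following single-process Python sketch only shows the block schedule, with the list rotations standing in for the inter-node transfers a cluster version would perform:

```python
def mat_mul(A, B):
    """Dense matrix product on nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def block(M, i, j, bs):
    """Extract the bs x bs block at block coordinates (i, j)."""
    return [row[j * bs:(j + 1) * bs] for row in M[i * bs:(i + 1) * bs]]

def cannon(A, B, q):
    """Multiply n x n matrices on a simulated q x q process torus."""
    n = len(A)
    bs = n // q  # assumes q divides n
    # Initial skew: process (i, j) holds A-block (i, i+j) and B-block (i+j, j).
    Ab = [[block(A, i, (i + j) % q, bs) for j in range(q)] for i in range(q)]
    Bb = [[block(B, (i + j) % q, j, bs) for j in range(q)] for i in range(q)]
    Cb = [[[[0] * bs for _ in range(bs)] for j in range(q)] for i in range(q)]
    for _ in range(q):
        # Local multiply-accumulate on every "process".
        for i in range(q):
            for j in range(q):
                Cb[i][j] = mat_add(Cb[i][j], mat_mul(Ab[i][j], Bb[i][j]))
        # Cyclic shifts: A-blocks move left, B-blocks move up.  On a real
        # cluster these rotations are the inter-node communication step.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the result from the distributed C blocks.
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(bs):
                for c in range(bs):
                    C[i * bs + r][j * bs + c] = Cb[i][j][r][c]
    return C
```

Because each step touches every block exactly once and each shift moves a full block between neighbours, the communication volume per step is fixed, which is why this pattern is a natural stress test for the put/get path of a cluster communication interface.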
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2012
【CLC number】: TP338.6






