Design and Implementation of High-Performance DMA Transfer Modes for Scientific Computing on GPDSP

Published: 2018-03-11 08:17

Topic: scientific computing　Focus: GPDSP　Source: National University of Defense Technology, 2015 master's thesis　Document type: degree thesis

【Abstract】: High-performance computing is a major challenge facing computational science today, and it chiefly involves algorithm research and processor design. The GPDSP processor is a high-performance multi-core microprocessor designed independently, from the ground up, by our university; it combines the strengths of general-purpose processors and digital signal processors. An analysis of the HPL high-performance benchmark shows that the main factor affecting HPL execution efficiency is the matrix update operation, which is implemented by calling the matrix multiply-add operation (GEMM). GEMM can be implemented in many ways, and a large body of research shows that the scheme based on the GEPP-GEBP approach achieves the highest execution efficiency on the GPDSP processor. Drawing on the architectural features of the GPDSP processor and an analysis of the GEMM implementation scheme, this thesis designs special DMA transfer modes. These comprise DMA matrix transpose transfer, DMA segmented transfer, and DMA inter-core synchronous transfer, together with DMA blocking segmented transfer, which builds on DMA segmented transfer. DMA matrix transpose transfer moves a two-dimensional data block from the source memory space to the destination memory space and transposes the matrix during the move. Applying it can greatly improve the efficiency of matrix multiplication. In tests with a simulation and verification tool, the transfer efficiency of the DMA matrix transpose transfer designed in this thesis is at least 1.56 times that of conventional matrix transpose transfer. The general idea behind the GEMM implementation is to divide the off-core data into many small blocks, send them to the on-core memories of multiple DSP cores for computation, and then move all the results back to off-core memory for synchronization. For this reason, the thesis designs DMA segmented transfer, DMA inter-core synchronous transfer, and DMA blocking segmented transfer. DMA segmented transfer quickly moves data from off-core memory into the on-core memories of multiple cores, while DMA inter-core synchronous transfer quickly moves data from the on-core memories of multiple cores back off-core. In addition, DMA blocking segmented transfer can effectively hide data-movement time. In tests with Cadence's NC-VERILOG simulation and verification tool, the transfer speed of DMA segmented transfer is at least 1.24 times that of the conventional transfer mode, DMA blocking segmented transfer reduces the time of the GEMM core computation by at least 3,000 clock cycles, and the average transfer speed of DMA inter-core synchronous transfer is 2.56 times that of the conventional transfer mode. After thorough verification and experimental testing, the special DMA transfer modes designed in this thesis meet the requirements of the algorithm and can effectively improve the execution efficiency of the HPL high-performance benchmark.
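The data movement performed by three of the transfer modes named in the abstract can be illustrated with a small functional model. This is only a sketch under stated assumptions: plain Python lists stand in for off-core and on-core memories, all function names are hypothetical, the even split across cores is a simplification, and nothing here reflects the actual GPDSP hardware design or its timing.

```python
# Functional model only: Python lists stand in for memories.
# All function names are hypothetical and are not taken from the thesis.

def dma_transpose(src, rows, cols):
    """Copy a rows x cols row-major block, writing it transposed (cols x rows)."""
    dst = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # Element (r, c) of the source lands at (c, r) of the destination.
            dst[c * rows + r] = src[r * cols + c]
    return dst


def dma_segmented(src, n_cores):
    """Split off-core data into n_cores contiguous segments, one per core.

    Assumes len(src) is divisible by n_cores (an illustrative simplification).
    """
    seg = len(src) // n_cores
    return [src[i * seg:(i + 1) * seg] for i in range(n_cores)]


def dma_gather(core_mems):
    """Collect every core's local buffer back into one off-core buffer."""
    out = []
    for mem in core_mems:
        out.extend(mem)
    return out


block = [1, 2, 3, 4, 5, 6]                # 2 x 3 matrix, row-major
transposed = dma_transpose(block, 2, 3)   # 3 x 2 matrix, row-major
per_core = dma_segmented(block, 3)        # one segment per modeled core
restored = dma_gather(per_core)           # results moved back off-core
```

In this model the transpose happens during the copy itself, so no separate transposition pass over the destination is needed, which is the property the abstract attributes to DMA matrix transpose transfer. The scatter and gather helpers mirror the roles of segmented transfer and inter-core synchronous transfer respectively.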
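【Degree-granting institution】: National University of Defense Technology
【Degree level】: Master's
【Year degree granted】: 2015
【Classification number】: TP332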

【References】