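Design and Implementation of High-Performance Parallel BLAS3 for Multi-Core DSPs
Published: 2018-01-22 01:17
Keywords: multi-core processor; parallel; linear algebra library; matrix multiplication; blocking algorithm. Source: National University of Defense Technology, 2013 master's thesis. Document type: degree thesis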
[Abstract]: BLAS libraries have long played a central role in high-performance computing, and the efficiency they achieve is one of the field's main benchmarks. Studying a parallel BLAS library for multi-core DSPs is therefore of practical importance, both for evaluating and applying multi-core DSPs in high-performance computing and for exploiting their parallel computing performance. This thesis studies the BLAS3 routines in depth. It designs and implements single-core GEMM, SYMM, SYRK, SYR2K, and TRMM on the C6678 and, building on the C6678's multi-core communication and synchronization mechanisms, designs and implements parallel versions of the same five routines. The main work covers the following aspects:
1. Design and implementation of single-core GEMM on the C6678. Taking the multi-level memory hierarchy into account, the compute-to-memory-access ratio of GEMM's core loop is compared and analyzed at the cache level. Memory access is optimized to match the C6678's hardware resources and architecture, and the storage space is partitioned accordingly. The resulting GEMM reaches a measured 8.49 GFLOPS.
2. Design and implementation of single-core BLAS3 on the C6678. The computational characteristics of SYMM, SYRK, SYR2K, and TRMM are analyzed in detail. Data access to the symmetric matrix in SYMM is optimized. The way SYRK's BP kernel updates the symmetric matrix is optimized. The computation of SYR2K is reformulated so that it can call the SYRK interface routine directly. Access to the triangular matrix in TRMM is analyzed, and the BP kernel is optimized according to the characteristics of the data on the diagonal. Using the C6678's hardware mechanisms, the four routines are mapped efficiently onto a single C6678 core, reaching 8.241, 8.102, 8.008, and 8.203 GFLOPS for SYMM, SYRK, SYR2K, and TRMM respectively.
3. Design and implementation of multi-core parallel BLAS3 on the C6678. The algorithmic rules of each routine are analyzed in depth. The data are decomposed into blocks for parallel execution so that the computations on different blocks are independent of one another, and the load balance across cores is optimized. Using the C6678's multi-core communication and synchronization mechanisms, the parallel blocked algorithms are mapped efficiently onto multiple cores. In performance tests, the eight-core parallel speedups of GEMM, SYMM, SYRK, SYR2K, and TRMM are 6.21, 5.22, 4.49, 4.49, and 4.55 respectively.
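The block decomposition described in the abstract can be illustrated with a minimal sketch. It assumes row-major storage and a simple partition of the rows of C among the eight cores, so that each core's block can be computed independently of the others. The abstract does not give the thesis's actual partitioning scheme, block sizes, or C6678 synchronization calls, so the names `core_row_range`, `gemm_row_block`, `gemm_partitioned`, and the block size `BLK` are hypothetical, and the cores are simulated here by a sequential loop.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES 8  /* the C6678 has eight cores */
#define BLK 32       /* hypothetical cache-block size, not taken from the thesis */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Rows [*row_begin, *row_end) of C owned by `core`.
 * Leftover rows (m % num_cores) go one each to the lowest-numbered cores,
 * so no two cores differ by more than one row. */
void core_row_range(size_t m, size_t core, size_t num_cores,
                    size_t *row_begin, size_t *row_end)
{
    assert(num_cores > 0 && core < num_cores);
    size_t base = m / num_cores;
    size_t extra = m % num_cores;
    *row_begin = core * base + min_sz(core, extra);
    *row_end = *row_begin + base + (core < extra ? 1 : 0);
}

/* Computes rows r0..r1-1 of C = alpha*A*B + beta*C (row-major; A is m x k,
 * B is k x n, C is m x n). Only this core's rows of A and C are touched. */
void gemm_row_block(size_t r0, size_t r1, size_t n, size_t k,
                    double alpha, const double *A, const double *B,
                    double beta, double *C)
{
    /* Scale this core's rows of C by beta once, up front. */
    for (size_t i = r0; i < r1; ++i)
        for (size_t j = 0; j < n; ++j)
            C[i * n + j] = (beta == 0.0) ? 0.0 : beta * C[i * n + j];

    /* Cache-blocked accumulation over BLK-sized tiles. */
    for (size_t ii = r0; ii < r1; ii += BLK)
        for (size_t pp = 0; pp < k; pp += BLK)
            for (size_t jj = 0; jj < n; jj += BLK) {
                size_t i_end = min_sz(ii + BLK, r1);
                size_t p_end = min_sz(pp + BLK, k);
                size_t j_end = min_sz(jj + BLK, n);
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t p = pp; p < p_end; ++p) {
                        double a = alpha * A[i * k + p];
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[p * n + j];
                    }
            }
}

/* Sequential stand-in for the eight cores: each iteration is the work one
 * core would do. The row blocks do not overlap, so the iterations are
 * independent and could run concurrently. */
void gemm_partitioned(size_t m, size_t n, size_t k,
                      double alpha, const double *A, const double *B,
                      double beta, double *C)
{
    for (size_t core = 0; core < NUM_CORES; ++core) {
        size_t r0, r1;
        core_row_range(m, core, NUM_CORES, &r0, &r1);
        gemm_row_block(r0, r1, n, k, alpha, A, B, beta, C);
    }
}
```

An even row split balances the load for GEMM, where every row of C costs the same. For routines that work on a triangular or symmetric matrix (SYRK, SYR2K, TRMM), rows carry unequal amounts of work, so an even row split would leave some cores idle; this may be part of why the abstract singles out load balancing, although it does not say how the thesis addresses it.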
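[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338.6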
Article ID: 1453163
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1453163.html