Design and Implementation of QGEMM on ARMv8 64-bit Multi-Core Processors

Published: 2018-04-21 04:06

Topic: ARMv8 64-bit multi-core processor; Source: Chinese Journal of Computers (《计算机学报》), 2017, Issue 09

[Abstract]: This paper presents the first design, implementation, and optimization of quadruple-precision general matrix-matrix multiplication (QGEMM) on an ARMv8 64-bit multi-core processor, built on OpenBLAS. Because floating-point computation inevitably introduces rounding errors, double-precision matrix multiplication (DGEMM) cannot deliver satisfactory numerical results in some cases, so high-precision or multi-precision algorithms are needed for more accurate computation. Double-double arithmetic is an effective and widely used approach. The paper stores quadruple-precision floating-point data in a structure built on the double-double data format. Starting from the blocked algorithm for dense matrix computation in OpenBLAS, it adds the header and source files needed for the quadruple-precision data format and writes the core QGEMM kernel in assembly. Using error-free transformations, it adjusts and optimizes the algorithm flow inside the kernel to avoid the forced data dependencies caused by normalization steps. By analyzing the algorithm's data dependencies, it designs a register allocation and rotation strategy and optimizes the instruction schedule, exploiting instruction-level parallelism to raise the achieved performance of QGEMM. Because algorithms differ in how heavily they use fused multiply-add (FMA) instructions, the paper adopts the notion of an algorithm's theoretical peak performance, as distinct from the machine's theoretical peak, which better evaluates the actual efficiency of the proposed QGEMM. Numerical experiments show that the QGEMM implemented and optimized in assembly reaches up to 19.7 Gflops, which is 82.1% of the theoretical peak performance of the QGEMM algorithm on the ARMv8 64-bit multi-core platform. While meeting the accuracy requirements on the numerical results, it runs about 5.8 times as fast as both an unoptimized QGEMM written in C and the QGEMM in MBLAS, and 24 times as fast as a QGEMM using the long double data format as implemented by the GCC compiler. The experiments also show that the proposed QGEMM has good thread scalability across matrices of different sizes.
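[Affiliations]: College of Computer Science, National University of Defense Technology; College of Information Science and Engineering, Hunan University; Key Laboratory of Parallel and Distributed Processing, National University of Defense Technology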
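[Funding]: Supported by the National High Technology Research and Development Program of China (863 Program) (2012AA01A301) and the National Natural Science Foundation of China (61402495, 61303189, 61602166, 61170049, 61402496)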
[Classification code]: TP332

[Related literature]