Design and Implementation of QR Decomposition Algorithms for Vector Processors

Published: 2018-05-06 09:08

Topics: QR decomposition + vectorization; source: National University of Defense Technology, 2015 master's thesis

[Abstract]: QR decomposition is a core tool in digital signal processing, plays an important role in high-performance computing, and is an important benchmark of processor performance. It is particularly effective for solving least-squares problems, and studying QR decomposition algorithms is important for exploiting the parallel processing capability of multi-core vector processors. Given the characteristics of the Matrix vector architecture, research on efficient vectorized design and implementation of QR decomposition has both theoretical significance and practical value. This thesis analyzes in depth how to vectorize three QR decomposition algorithms, optimizes for the fused instructions of the Matrix vector architecture, and designs and implements single-core assembly programs for large-scale data for all three: Givens rotation, Gram-Schmidt orthogonalization, and Householder transformation. The main work is as follows.

(1) A Givens rotation program for a single Matrix core is designed and implemented. Registers shared between the scalar and vector units are used to reduce data transfer from DDR to SRAM. A software-pipelined implementation is designed and the program is optimized in hand-written assembly. The data layout requirements are analyzed in detail, and the initial data storage is offset to effectively reduce AM_Bsy. A double-buffered DMA data-movement strategy is designed so that data transfer time overlaps with computation time, improving program performance. Experimental results show that, compared with optimized C code on TI's TMS320C6713 platform, the average speedup for double-precision Givens rotation across different matrix sizes is 74.33. For a matrix of size 2048, the computational performance reaches 74.77%.

(2) A Gram-Schmidt orthogonalization program for a single Matrix core is designed and implemented. The traditional Gram-Schmidt method is modified to better suit the structure of the Matrix vector processor. A software-pipelined implementation is designed and the program is optimized in hand-written assembly; the data layout requirements are analyzed in detail and the minimum iteration interval is determined. A double-buffered DMA data-movement strategy overlaps data transfer time with computation time, improving computational efficiency. Experimental results show that, compared with optimized C code on TI's TMS320C6713 platform, the average speedup for double-precision Gram-Schmidt orthogonalization across different matrix sizes is 83.26. For a matrix of size 2048, the computational performance reaches 46.31%.

(3) A Householder transformation program for a single Matrix core is designed and implemented. The basic principle and algorithm flow of the Householder transformation for large-scale data are analyzed in detail. Two matrix multiplication approaches are compared, and the one better suited to the Matrix vector processor is selected. The computation of the Householder matrix is vectorized. A single-core Householder program built on double-buffered DMA transfer and computation is designed and optimized, again overlapping data transfer time with computation time. Experimental results show that, compared with optimized C code on TI's TMS320C6713 platform, the average speedup for double-precision Householder transformation across different matrix sizes is 95.76. For a matrix of size 1920, the computational performance reaches 83.64%.
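For readers unfamiliar with the three algorithms named in the abstract, the following is a minimal reference sketch of each in plain Python. It is not the thesis's vectorized assembly: it uses textbook formulations only, and the Gram-Schmidt routine is the standard modified Gram-Schmidt variant, which may differ from the modification the thesis describes. All function names (`qr_givens`, `qr_gram_schmidt`, `qr_householder`, and the helpers) are illustrative choices, and matrices are assumed to be dense lists of rows with full column rank.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def qr_givens(a):
    """Full QR: zero one subdiagonal entry at a time with a plane rotation."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    qt = identity(m)  # accumulates Q^T
    for j in range(n):
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue
            c, s = x / h, y / h
            for mat in (r, qt):  # rotate rows i-1 and i of both matrices
                for k in range(len(mat[0])):
                    u, v = mat[i - 1][k], mat[i][k]
                    mat[i - 1][k] = c * u + s * v
                    mat[i][k] = -s * u + c * v
    return transpose(qt), r

def qr_gram_schmidt(a):
    """Thin QR by modified Gram-Schmidt (assumes full column rank)."""
    n = len(a[0])
    v = transpose(a)  # columns of A
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        qj = [x / r[j][j] for x in v[j]]
        q_cols.append(qj)
        for k in range(j + 1, n):  # remove the q_j component from later columns
            r[j][k] = sum(x * y for x, y in zip(qj, v[k]))
            v[k] = [y - r[j][k] * x for x, y in zip(qj, v[k])]
    return transpose(q_cols), r

def qr_householder(a):
    """Full QR: one reflection H = I - 2 v v^T / (v^T v) per column."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    qt = identity(m)  # accumulates Q^T
    for j in range(min(m - 1, n)):
        x = [r[i][j] for i in range(j, m)]
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            continue
        v = x[:]
        v[0] += math.copysign(norm, x[0])  # sign choice avoids cancellation
        vv = sum(t * t for t in v)
        for mat in (r, qt):  # apply H to rows j..m-1 of both matrices
            for k in range(len(mat[0])):
                dot = sum(v[i - j] * mat[i][k] for i in range(j, m))
                f = 2.0 * dot / vv
                for i in range(j, m):
                    mat[i][k] -= f * v[i - j]
    return transpose(qt), r

def max_error(a, q, r):
    """Largest absolute entry of A - QR."""
    qr = matmul(q, r)
    return max(abs(x - y) for ra, rq in zip(a, qr) for x, y in zip(ra, rq))

if __name__ == "__main__":
    A = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
    for qr_fn in (qr_givens, qr_gram_schmidt, qr_householder):
        Q, R = qr_fn(A)
        print(qr_fn.__name__, "reconstruction error:", max_error(A, Q, R))
```

The three routines share the same contract (return `Q` and `R` with `Q R ≈ A`), but they differ in how work is organized: Givens touches two rows per rotation, Gram-Schmidt works column by column with dot products and vector updates, and Householder applies one reflection per column to a whole trailing submatrix. Those access patterns are what make the data layout and double-buffering choices described in the abstract algorithm-specific.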
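[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP332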

[References]