面向稀疏矩阵运算的异构并行算法研究

发布时间：2018-08-12 18:52

【摘要】：异构高性能体系结构已经成为高性能计算领域越来越流行的一种架构,极大的推动着当今社会大规模科学与工程领域计算的进步。稀疏矩阵运算是大规模科学与工程领域计算的非常重要的一种操作,但是由于稀疏矩阵运算是一种典型的不规则运算,严重限制着异构高性能体系结构计算性能的发挥。这主要是由于,一方面,稀疏矩阵运算的程序行为是由矩阵的稀疏结构确定的,不同应用所对应的稀疏矩阵的稀疏结构不同,难以发掘其中的局部性,另一方面,由于稀疏矩阵的压缩存储导致大量的间址访存和随机访存,从而系统的访存带宽成为稀疏矩阵运算的性能瓶颈。为了应对这些问题和挑战,本文以稀疏矩阵运算为背景,研究如何充分利用异构体系结构中的计算资源和访存带宽提高稀疏矩阵运算的性能,主要工作和创新点如下:(1)提出了面向异构体系结构的稀疏矩阵向量乘算法的高效并行优化算法本文提出了一种基于负载的多种并行模式混合的稀疏矩阵向量乘算法。针对SpMV操作中不规则访存多和负载不均衡等问题,本文基于计算负载对稀疏矩阵进行分块,减少由非零元个数变化所引入的存储开销;根据每行的计算负载选择不同的线程映射方式,实现线程间负载均衡;设计一种基于warp的负载分配策略,减少线程间同步开销。实验结果表明,本文提出稀疏矩阵向量乘算法能够有效提高算法的性能。(2)提出了面向异构体系结构的稀疏矩阵转置向量乘的高效并行优化算法本文提出了无矩阵转置的稀疏矩阵转置向量乘算法。针对稀疏矩阵转置向量乘算法的转置或原子开销大等问题,本文采用行压缩格式减少预处理开销,采用外积的计算方式避免稀疏矩阵转置的操作;采用多路并行归并排序消除原子操作。实验结果表明,本文所提出的稀疏矩阵转置向量乘算法能够获得有效的性能提升。(3)提出了面向异构体系结构的稀疏矩阵稀疏矩阵乘的高效并行优化算法本文提出了一种异构协同的稀疏矩阵稀疏矩阵乘算法。采用上限分配和逐次增加的混合空间分配策略,提高存储空间资源利用率;通过对中间结果进行多路并行归并排序和并行前向扫描操作,高效地实现中间结果的累加操作;根据结果矩阵每列非零元素个数进行任务划分,实现线程间的负载均衡;根据CPU-GPU的计算能力进行任务划分,实现异构协同并行,提高计算资源的利用率。实验结果表明,本文所提出的稀疏矩阵稀疏矩阵乘算法具有较好的性能提升。(4)提出了面向异构体系结构平台的稀疏三角方程组求解的高效并行优化算法本文提出了一种基于层次化并行的稀疏三角方程组并行求解算法。针对不同层次并行度不同的特征,设计了基于群并行模式和流水线并行模式两种方式的混合并行模式。利用群并行模式处理并行度较高的任务层,利用流水线并行模式处理并行度较低的任务层。在此基础上,基于任务层的计算任务实现计算过程中的线程间负载均衡策略。最后设计了一种CPU-GPU异构协同的稀疏三角方程组求解的并行算法,实验结果表明,本文提出的异构并行算法能够有效的挖掘CPU和GPU的计算性能,提高异构体系结构中的计算效率。
[Abstract]:Heterogeneous high-performance architectures have become more and more popular in the field of high-performance computing, which greatly promote the progress of large-scale computing in science and engineering. Sparse matrix computing is a very important operation in large-scale computing in science and engineering, but sparse matrix computing is a classic one. This is mainly because, on the one hand, the program behavior of sparse matrix operation is determined by the sparse structure of the matrix, and the sparse structure of the sparse matrix corresponding to different applications is different, so it is difficult to discover the locality of the sparse matrix, on the other hand, because of the sparse structure. Compressed storage of matrices leads to a large number of inter-address memory access and random memory access, so the system memory access bandwidth becomes the performance bottleneck of sparse matrix operation. In order to deal with these problems and challenges, this paper studies how to make full use of computing resources and memory access bandwidth in heterogeneous architectures to improve the performance of sparse matrix operation. The main work and innovations are as follows: (1) An efficient parallel optimization algorithm for sparse matrix vector multiplication algorithm based on heterogeneous architecture is proposed. A load-based sparse matrix vector multiplication algorithm with multiple parallel modes is proposed. The sparse matrix is partitioned based on the computing load to reduce the storage overhead caused by the change of the number of non-zero elements; different thread mapping methods are selected according to the computing load of each row to achieve load balance between threads; a load allocation strategy based on warp is designed to reduce the synchronization overhead between threads. Matrix vector multiplication algorithm can effectively improve the performance of the algorithm. (2) An efficient parallel optimization algorithm for sparse matrix transpose vector multiplication is proposed for heterogeneous architecture. A sparse matrix transpose vector multiplication algorithm without matrix transposition is proposed. The row compression scheme is used to reduce the preprocessing overhead, and the outer product calculation method is used to avoid the operation of sparse matrix transposition. The multi-channel parallel merge sort is used to eliminate the atom operation. The experimental results show that the proposed sparse matrix transposition vector multiplication algorithm can achieve effective performance improvement. (3) A heterogeneous architecture oriented sparse architecture is proposed. An efficient parallel optimization algorithm for sparse matrix multiplication is proposed in this paper.A heterogeneous cooperative sparse matrix multiplication algorithm is proposed.The upper bound allocation and the successive increase of the mixed space allocation strategy are used to improve the utilization of the storage space resources.The intermediate results are sorted by multi-parallel mergers and parallel forward scanning operations. The results show that the sparse matrix multiplication proposed in this paper can improve the utilization ratio of computing resources. (4) An efficient parallel algorithm for solving sparse trigonometric equations based on heterogeneous architecture platform is proposed. A parallel algorithm for solving sparse trigonometric equations based on hierarchical parallelism is proposed. A hybrid parallel mode of waterline parallel mode is proposed. The task layer with higher parallelism is processed by group parallel mode and the task layer with lower parallelism is processed by pipeline parallel mode. The parallel algorithm for solving sparse trigonometric equations is constructed. The experimental results show that the heterogeneous parallel algorithm proposed in this paper can effectively mine the computational performance of CPU and GPU, and improve the computational efficiency in heterogeneous architecture.
【学位授予单位】：国防科学技术大学
【学位级别】：博士
【学位授予年份】：2015
【分类号】：O241.6

【相似文献】