面向异构体系结构的稀疏矩阵算法研究

发布时间：2018-09-10 20:16

【摘要】：面向异构体系结构的稀疏矩阵算法已经成为高性能计算领域的关键问题之一。稀疏矩阵算法是自然科学和社会科学中许多领域进行数值模拟计算时的关键技术和性能瓶颈,为了提高稀疏矩阵算法的计算性能,需要提高相应算法在特定平台上的计算效率。然而,一方面,稀疏矩阵算法的计算与访存行为与矩阵的稀疏结构相关,是典型的不规则类算法,很难发掘时间与空间的局部性;另一方面,随着包含算法加速器的异构并行体系结构成为高性能计算机体系结构的主流,并行程序执行效率的影响因素日益复杂多样,传统的面向具体平台的程序并行与优化方法已经无法满足高效率并行程序的开发需求。为了应对这些问题和挑战,本文对面向异构体系结构的稀疏矩阵算法进行了深入的研究,主要工作与创新点如下:(1)提出了面向CPU-GPU异构平台的搜索方向优化的宽度优先搜索算法。本文实现了基于GPU的自底向上的宽度优先搜索算法,提高了GPU线程访存的连续性并降低了线程间负载的不均衡,并进一步实现了CPU-GPU协同的自底向上的宽度优先搜索算法,充分利用了CPU-GPU异构并行计算平台上所有的计算资源,并有效提高了宽度优先搜索算法对大规模节点前沿的处理速度。在此基础上,设计了面向CPU-GPU异构体系结构的优化搜索方向的宽度优先搜索算法,通过结合基于多核CPU的自顶向下的串行宽度优先搜索算法、自顶向下的并行宽度优先搜索算法以及CPU-GPU协同的自底向上的宽度优先搜索算法,提高了宽度优先搜索算法对不同规模节点前沿的适应性。(2)提出了面向CPU-GPU异构平台的稀疏矩阵向量乘算法。本文提出了基于索引信息压缩的稀疏矩阵分块存储数据结构,通过合并位置相近的同行非零元,减少了矩阵非零元素对应的索引信息的数据量,从而一方面提高了稀疏矩阵访存的规则性和局部性,另一方面控制了填零元所引入的额外的计算和访存开销,并通过采用二级混合存储数据结构提高了对于矩阵不同稀疏结构的适应性。在此基础上,实现了基于计算访存量的负载均衡算法,分别设计了针对多核CPU和GPU的优化的Sp MV算法,有效的提高了稀疏矩阵向量乘算法的计算效率,并进一步实现了CPU-GPU协同的稀疏矩阵向量乘算法,充分发掘了异构体系结构计算平台的计算能力。(3)提出了面向异构平台的基于超节点数据结构的稀疏矩阵分解算法。本文以稀疏矩阵Cholesky分解算法为研究对象,在数据组织方面,改进了稀疏矩阵超节点数据结构,通过超节点合并和分块控制计算粒度;在计算调度方面,将分解过程映射为一系列的数据块任务,并设计了相应的任务生成与调度算法,在满足数据依赖性的前提下提高任务的并行性,从而显著提高了稀疏矩阵Cholesky分解算法在GPU上的实现效率。通过将控制任务和计算任务分别映射到CPU和GPU上,有效提高了CPU-GPU异构平台上稀疏矩阵分解算法的计算性能。(4)提出了面向CPU-GPU异构平台的稀疏三角方程组求解算法。本文提出了面向稀疏结构的分块处理策略,根据稀疏三角矩阵非零元密度的不同将计算过程分解,并设计了基于分块数据结构的稀疏三角方程组求解算法。在此基础上,设计了面向负载均衡的线程映射算法,并针对GPU设计了基于线程warp的计算组织策略,降低了GPU线程的控制分支所造成的性能损失,并进一步实现了CPU-GPU协同的稀疏三角方程组求解并行算法,有效提高了CPU-GPU异构平台上稀疏三角方程组求解算法的计算效率。
[Abstract]:Sparse matrix algorithm for heterogeneous architecture has become one of the key problems in high performance computing. Sparse matrix algorithm is the key technology and performance bottleneck in many fields of natural science and social sciences. In order to improve the performance of sparse matrix algorithm, it is necessary to improve the corresponding algorithm in particular. However, on the one hand, the computation and memory access behavior of the sparse matrix algorithm is related to the sparse structure of the matrix, and it is a typical irregular algorithm. It is difficult to discover the locality of time and space; on the other hand, with the heterogeneous parallel architecture including the algorithm accelerator, it becomes the main architecture of the high performance computer architecture. In order to meet these problems and challenges, the sparse matrix algorithm for heterogeneous architecture is studied in this paper. The main work is as follows:1. The main innovations are as follows: (1) A breadth-first search algorithm for CPU-GPU heterogeneous platform search direction optimization is proposed. This paper implements a GPU-based bottom-up breadth-first search algorithm, which improves the continuity of GPU thread access and reduces the load imbalance between threads, and further realizes the bottom-up CPU-GPU cooperation. Width-first search algorithm makes full use of all computing resources on CPU-GPU heterogeneous parallel computing platform, and effectively improves the processing speed of width-first search algorithm for large-scale node frontier. On this basis, an optimized search direction search algorithm for CPU-GPU heterogeneous architecture is designed. The top-down serial width-first search algorithm based on multi-core CPU, the top-down parallel width-first search algorithm and the CPU-GPU cooperative bottom-up width-first search algorithm improve the adaptability of the width-first search algorithm to different scale node frontiers. (2) The sparse matrix orientation for CPU-GPU heterogeneous platform is proposed. In this paper, a sparse matrix block storage data structure based on index information compression is proposed. By merging peer-to-peer non-zero elements with similar positions, the amount of data of index information corresponding to non-zero elements of the matrix is reduced. Thus, the regularity and locality of sparse matrix access are improved on the one hand, and the zero-filling elements are controlled on the other hand. Additional computation and memory access overhead are added, and the adaptability to different sparse structures of the matrix is improved by using a two-level hybrid storage data structure. On this basis, load balancing algorithm based on computational memory access is implemented, and optimized Sp MV algorithm for multi-core CPU and GPU is designed, which effectively improves the sparse matrix vector. The computational efficiency of the multiplication algorithm is improved, and the CPU-GPU cooperative sparse matrix vector multiplication algorithm is further realized, which fully exploits the computing capability of heterogeneous architecture computing platform. (3) A sparse matrix decomposition algorithm based on super-node data structure for heterogeneous platform is proposed. In the aspect of data organization, the sparse matrix super-node data structure is improved, and the computing granularity is controlled by super-node merging and partitioning; in the aspect of computing scheduling, the decomposition process is mapped into a series of data block tasks, and the corresponding task generation and scheduling algorithm is designed to improve the parallelism of tasks under the premise of satisfying the data dependence. By mapping control tasks and computing tasks to CPU and GPU respectively, the performance of the sparse matrix decomposition algorithm on CPU-GPU heterogeneous platform is effectively improved. (4) A sparse triangular equations solving algorithm for CPU-GPU heterogeneous platform is proposed. A block processing strategy for sparse structure is proposed, and the calculation process is decomposed according to the non-zero element density of sparse triangular matrix, and a sparse triangular equations solving algorithm based on block data structure is designed. On this basis, a load balancing oriented thread mapping algorithm is designed, and a thread warp-based meter is designed for GPU. Computational organization strategy reduces the performance loss caused by the control branches of GPU threads, and further realizes the CPU-GPU cooperative parallel algorithm for sparse trigonometric equations, which effectively improves the computational efficiency of sparse trigonometric equations solving algorithm on CPU-GPU heterogeneous platform.
【学位授予单位】：国防科学技术大学
【学位级别】：博士
【学位授予年份】：2013
【分类号】：TP301.6;TP332

【相似文献】