Research on Key Technologies of Programming Models and Performance Optimization for Petascale CPU-GPU Heterogeneous Systems

Published: 2019-05-19 14:27

[Abstract]: The ever-growing computational demands of scientific computing have driven high-performance computer systems into the petascale era, and the key technologies developed for petascale systems will be the cornerstone for building future exascale systems. Constrained by CMOS feature size, power consumption, and heat dissipation, homogeneous systems that rely entirely on CPUs for computing power are difficult to scale further once they reach petascale. Heterogeneous systems that use GPUs as accelerators offer a better performance-per-watt ratio than homogeneous systems and are one of the most promising technical routes toward exascale. In November 2010, Tianhe-1A, built by the School of Computer Science of the National University of Defense Technology for the Tianjin supercomputing center and equipped with NVIDIA Fermi GPUs, ranked first in the world with a sustained performance of 2.566 PFLOPS. Such CPU-GPU heterogeneous systems provide enormous computing power, but programming and performance optimization on them differ from traditional homogeneous machines and have become the key to exploiting the performance of the whole system. To address the difficulty of programming and optimizing applications on large-scale heterogeneous systems, this thesis takes petascale CPU-GPU heterogeneous systems as its platform and studies their programming models and optimization methods. The main contributions are:

1. The MPI/OpenMP/Streaming hybrid programming model is introduced on a petascale CPU-GPU heterogeneous computer system for the first time and scaled to the full system. For the problem of mapping software tasks onto hardware resources in the hybrid model, three mappings are proposed: node-centric, CPU-centric, and GPU-centric. Seven requirements for intra-node programming models on large-scale parallel systems are summarized and used to evaluate current programming models: ease of use, performance scalability, storage scalability, model hierarchy, scheduling flexibility, model heterogeneity, and positioning accuracy. In addition, a shared-memory-based method that lets multiple processes share a GPU is proposed, and an efficient …

2. Measurement-based adaptive task partitioning. All tasks are placed in a task queue and fetched from it in a loop. Each fetched task is split into a CPU part and an accelerator part according to the current "task partition ratio"; the initial ratio is derived from the theoretical peak performance of the CPU and the accelerator. After the split, both parts are executed on the heterogeneous platform and their actual performance is measured. The measured results are combined with the workload assigned in that split to update the task partition ratio, which then serves as the basis for the next split. Because the ratio is adjusted adaptively after every split-and-execute step, the work assigned to the host and the accelerator is well balanced, which substantially improves the computational efficiency of the heterogeneous system.

3. Nested double-buffering software pipelining based on a finite state machine. The execution of a GPU program consists of three stages: data input, GPU computation, and data output. The thesis analyzes the execution model and cost model of software pipelining on heterogeneous systems and designs a nested double-buffering software pipelining mechanism. The implementation uses a finite-state-machine approach in which a single CPU thread controls the input, execution, and output of multiple tasks and overlaps the three stages in an orderly way. Experiments show that this method greatly alleviates the shortage of bandwidth between host and accelerator and effectively resolves the performance fluctuation of the original GPU library. In tests of the BLAS3 routine DGEMM over different problem sizes, the average performance improvement is 7.61%.

4. An efficient LINPACK program (Hybrid-LINPACK) is designed and implemented on petascale CPU-GPU heterogeneous systems. A heterogeneous BLAS library that uses the computing power of the CPU and the GPU simultaneously is designed and implemented first. Hybrid-LINPACK is then implemented and optimized on top of this library, using the MPI/OpenMP/Streaming hybrid programming model and building on the high-performance LINPACK implementation for homogeneous systems (HPL 2.0). The optimizations cover CPU-GPU task partitioning, CPU-GPU communication, parallelization of the SWAP algorithm, inter-node data transfer, and the traditional HPL optimizations and parameter tuning. Hybrid-LINPACK makes full use of the computing and communication capabilities provided by the hardware and architecture: on a single compute element of Tianhe-1 it achieves a 3.3x speedup over the LINPACK implementation released by AMD and reaches 70.1% computational efficiency. The full-system LINPACK runs on Tianhe-1 and Tianhe-1A achieved measured performance of 0.563 PFLOPS and 2.566 PFLOPS, respectively. As a result, Tianhe-1 ranked fifth on the TOP500 list in November 2009 and Tianhe-1A ranked first in November 2010, each setting the best TOP500 ranking in the history of Chinese supercomputers.
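The measurement-based adaptive task partitioning loop from the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract states only that the initial ratio comes from theoretical peak performance and that measured performance is combined with the assigned workload to update it; the concrete update rule used here (ratio of measured throughputs) is an assumption. All names (`initial_ratio`, `update_ratio`, `run_adaptive`) are hypothetical, and the CPU and GPU parts run one after the other through caller-supplied timing functions rather than concurrently on real devices.

```python
from collections import deque
from typing import Callable, Iterable, List


def initial_ratio(cpu_peak: float, gpu_peak: float) -> float:
    """GPU share of each task, derived from theoretical peak performance."""
    return gpu_peak / (cpu_peak + gpu_peak)


def update_ratio(gpu_work: float, gpu_time: float,
                 cpu_work: float, cpu_time: float) -> float:
    """GPU share from measured throughput (assumed update rule)."""
    gpu_rate = gpu_work / gpu_time   # work units per second on the GPU
    cpu_rate = cpu_work / cpu_time   # work units per second on the CPU
    return gpu_rate / (gpu_rate + cpu_rate)


def run_adaptive(tasks: Iterable[float],
                 run_cpu: Callable[[float], float],
                 run_gpu: Callable[[float], float],
                 cpu_peak: float, gpu_peak: float) -> List[float]:
    """Process every task; return the GPU share that was used for each one.

    run_cpu / run_gpu take an amount of work and return the elapsed time
    in seconds (hypothetical stand-ins for the real device back ends).
    """
    queue = deque(tasks)
    ratio = initial_ratio(cpu_peak, gpu_peak)
    used: List[float] = []
    while queue:
        work = queue.popleft()
        used.append(ratio)
        gpu_work = work * ratio
        cpu_work = work - gpu_work
        # A real system would run both parts concurrently; here they are
        # executed one after the other and only their times are recorded.
        gpu_time = run_gpu(gpu_work)
        cpu_time = run_cpu(cpu_work)
        # Update only when both measurements are usable.
        if min(gpu_work, cpu_work, gpu_time, cpu_time) > 0:
            ratio = update_ratio(gpu_work, gpu_time, cpu_work, cpu_time)
    return used


if __name__ == "__main__":
    # Simulated devices: the GPU sustains far less than its nominal peak,
    # so the peak-based initial split overloads it.
    shares = run_adaptive(
        tasks=[1000.0] * 4,
        run_cpu=lambda w: w / 100.0,   # CPU sustains 100 units/s
        run_gpu=lambda w: w / 300.0,   # GPU sustains 300 units/s
        cpu_peak=100.0, gpu_peak=900.0,
    )
    print(shares)
```

In this simulation the first task uses the peak-based split, and later tasks use a split that matches the sustained rates of the two simulated devices, so both parts finish at about the same time. With noisy real measurements, a smoothed update (for example, averaging the new ratio with the previous one) might be preferable; that choice is likewise not specified in the abstract.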
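[Degree-Granting Institution]: National University of Defense Technology
[Degree Level]: Doctoral
[Year Degree Conferred]: 2013
[Classification Number]: TP338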

[Similar Literature]