Research on Fault-Tolerant Design Techniques for Parallel Programs Based on NR-MPI

Published: 2018-05-28 18:02

Topics: high-performance computing + MPI parallel programs; source: National University of Defense Technology, master's thesis, 2012

[Abstract]: With the rapid development of high-performance computing, the scale of high-performance computing (HPC) systems has grown sharply, and the system mean time between failures (MTBF) has dropped accordingly, to well below the running time of large scientific computing programs on HPC systems, which seriously affects system availability. Fault tolerance is an important means of improving the availability of HPC systems. However, the commonly used approach, system-level checkpointing, usually incurs a very large fault-tolerance overhead and can no longer meet the needs of HPC applications. Application-level checkpointing keeps the overhead under better control, but it still has to reload the failed program, which can be costly on large-scale systems. MPI is the most widely used parallel programming model in HPC, and NR-MPI is a new, high-performance fault-tolerant MPI, so research on fault-tolerant design techniques for parallel programs based on NR-MPI is of considerable importance. Because MPI parallel programs are complex and diverse, it is hard to find a fault-tolerance technique that is both general and efficient. Targeting the widely used class of loop-iteration parallel programs, this thesis studies two fault-tolerance techniques in depth, data redundancy and node redundancy. The main work is as follows.
First, to evaluate fault-tolerance techniques, three metrics are defined: fault-tolerance space overhead, fault-tolerance time overhead, and failure recovery time. A fault-tolerance time factor is also defined to estimate whether a technique suits a given application on a given HPC system. This work provides the theoretical basis for fault-tolerant design of parallel programs based on NR-MPI. Second, a data-redundancy-based fault-tolerant parallel algorithm framework is proposed, the Data Redundancy based Fault Tolerant Framework (DRFTF), and its key issues are analyzed in detail: data backup strategy, global consistency, backup period, and key variables. DRFTF builds on the program's original algorithm and achieves fault tolerance without major changes to it; for algorithms in which key variables account for a small proportion, it ensures a small fault-tolerance overhead. Third, the algorithms of the test programs NPB and Sweep3D are analyzed, fault-tolerant versions of NPB and Sweep3D are implemented with DRFTF, and experiments and performance analysis are carried out on them. The experimental results confirm DRFTF's fault-tolerance capability and low overhead.
Fourth, for algorithms that can maintain a checksum relationship at every loop step, a node-redundancy-based fault-tolerant parallel algorithm framework is proposed, the Node Redundancy based Fault Tolerant Framework (NRFTF). NRFTF uses node redundancy: it builds checksums of the program data and stores them on redundant nodes. The checksum data is updated by redundant processes without pausing the original algorithm, so the overhead is very small. Finally, the parallel Gaussian elimination algorithm is analyzed and a fault-tolerant parallel Gaussian elimination algorithm is designed with NRFTF. Taking HPL, the benchmark used for the TOP500 supercomputer ranking, as an example, a fault-tolerant HPL program is implemented, and experiments and performance analysis are carried out on it. The experimental results confirm NRFTF's fault-tolerance capability and very low overhead.
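The data-redundancy idea in the abstract (periodically backing up key variables and restoring a globally consistent state after a failure) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the thesis's DRFTF implementation: the ring-neighbour backup placement, the loop body in `step`, and the names `run`, `backup_period`, `fail_at`, and `failed_rank` are all hypothetical, and no MPI or NR-MPI calls are used.

```python
import copy
from typing import Dict, List, Optional


def step(state: Dict[str, float]) -> None:
    """One hypothetical loop iteration that updates the key variables in place."""
    state["x"] = 0.5 * state["x"] + 1.0
    state["iteration"] += 1


def run(total_steps: int, backup_period: int,
        fail_at: Optional[int] = None, failed_rank: int = 1) -> List[float]:
    """Simulate an iterative program on 4 ranks with periodic in-memory backups."""
    ranks = 4
    states = [{"x": float(rank), "iteration": 0} for rank in range(ranks)]
    # backups[r] models the copy of rank r's key variables that is assumed
    # to live in the memory of its ring neighbour (r + 1) % ranks.
    backups = copy.deepcopy(states)
    failure_injected = False
    while states[0]["iteration"] < total_steps:
        iteration = states[0]["iteration"]
        if iteration % backup_period == 0:
            # Every rank backs up at the same iteration, so the set of
            # backups is globally consistent.
            backups = copy.deepcopy(states)
        if fail_at is not None and iteration == fail_at and not failure_injected:
            failure_injected = True
            states[failed_rank] = None  # the failed rank loses its memory
            # All ranks roll back to the last consistent backup and replay.
            states = copy.deepcopy(backups)
            continue
        for state in states:
            step(state)
    return [state["x"] for state in states]
```

Calling `run(10, 3, fail_at=7)` rolls every rank back to the backup taken at iteration 6 and replays from there, so it returns the same values as the failure-free `run(10, 3)`. Only the key variables are copied, which is why a small proportion of key variables keeps the backup cost low.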
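The checksum relationship that the node-redundancy approach relies on can be shown with a toy example. The sketch below assumes a purely linear loop body, so that applying the matching update to the checksum keeps it equal to the sum of the data blocks at every step; the functions `checksum_of`, `linear_update`, `recover_block`, and `run_demo` are hypothetical names for illustration, and the lists stand in for data held by separate processes. It is not the thesis's NRFTF code and does not model Gaussian elimination or HPL.

```python
from typing import List, Optional

Vector = List[float]


def checksum_of(blocks: List[Vector]) -> Vector:
    """Element-wise sum of all data blocks (kept on the redundant node)."""
    return [sum(column) for column in zip(*blocks)]


def linear_update(block: Vector, scale: float, shift: Vector,
                  copies: int = 1) -> Vector:
    """Hypothetical loop body: block <- scale * block + copies * shift."""
    return [scale * x + copies * s for x, s in zip(block, shift)]


def recover_block(blocks: List[Optional[Vector]], checksum: Vector) -> Vector:
    """Rebuild the single missing block as checksum minus the surviving blocks."""
    recovered = list(checksum)
    for block in blocks:
        if block is not None:
            recovered = [r - x for r, x in zip(recovered, block)]
    return recovered


def run_demo(steps: int = 3, lost: int = 1):
    """Run a few steps, drop one block, and rebuild it from the checksum."""
    blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one block per process
    checksum = checksum_of(blocks)                  # held by the redundant process
    shift = [1.0, -1.0]
    for _ in range(steps):
        # Data processes update their own blocks.
        blocks = [linear_update(b, 2.0, shift) for b in blocks]
        # The redundant process applies the matching update to the checksum,
        # without reading the data blocks again.
        checksum = linear_update(checksum, 2.0, shift, copies=len(blocks))
    expected = blocks[lost]
    damaged = [None if rank == lost else b for rank, b in enumerate(blocks)]
    return expected, recover_block(damaged, checksum), checksum, checksum_of(blocks)
```

In `run_demo`, the checksum is never recomputed from the data after initialisation, yet it still equals the sum of the blocks after every step, so the block of a failed process can be rebuilt from the checksum and the surviving blocks without rolling back.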
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP302.8

[References]