Analysis of the Propagation Behavior of Hardware Faults in Programs and Research on Fault-Tolerance Techniques

Published: 2018-05-14 06:07

Topics: hardware faults + fault propagation; Source: doctoral dissertation, National University of Defense Technology, 2012

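[Abstract]: Advances in device process technology, growth in system scale, and the rise of heterogeneous systems keep raising the performance of high-performance computers, but they also bring increasingly severe reliability problems. Reliability has become one of the major factors constraining the development of high-performance computing. Making devices more reliable or adding redundant components can improve the reliability of high-performance computers to some extent, but such hardware-based fault tolerance is costly. Software fault tolerance for hardware faults, by contrast, tolerates hardware faults by modifying the program, without any change to the hardware. Hardware faults and the errors they cause propagate as the program executes, so analyzing how hardware faults propagate in programs helps software methods tolerate them better. The work in this dissertation is therefore divided into a foundations part and an applications part: the foundations part analyzes the propagation behavior of hardware faults in programs, and the applications part uses those results to design corresponding fault-tolerance optimization methods.

The foundations part studies hardware fault propagation in three representative classes of programs: serial programs, homogeneous parallel programs, and heterogeneous parallel programs. The main work and contributions are:

1. A propagation model for hardware faults in serial programs (Chapter 2). Serial programs are the most basic type of program, and analyzing fault propagation in them is the basis for studying fault propagation in programs in general. The dissertation classifies the errors produced as a hardware fault propagates through a program into three categories: native errors, data-flow-derived errors, and control-flow-derived errors. Using forward data-flow analysis on a detailed control-flow graph of the program, it gives error propagation equations and solution algorithms for data-flow-derived and control-flow-derived errors in serial programs, which together establish the propagation model. Given the native errors, the model can be used to compute the error information at every program point of a serial program.

2. A propagation model for hardware faults in homogeneous parallel programs, using MPI programs as the example (Chapter 3). MPI is the de facto standard in parallel and distributed computing, and MPI programs are representative homogeneous parallel programs. Based on the characteristics of MPI programs, the dissertation further divides data-flow-derived errors into intra-process errors and inter-process errors. Taking either a variable as a whole or an individual variable copy as the error carrier, it focuses on inter-process error propagation and derives error propagation equations and solution algorithms for data-flow-derived errors in MPI programs, which establish the propagation model. Given the native errors and a choice of error carrier (whole variable or variable copy), the corresponding equations and algorithms yield the error information at every program point of an MPI program.

3. A propagation model for hardware faults in heterogeneous parallel programs, using GPGPU programs as the example (Chapter 4). CPU-GPU heterogeneous systems are widely used in high-performance computing, and GPGPU programs have become representative heterogeneous parallel programs. Based on the characteristics of GPGPU programs, the dissertation analyzes the errors caused by hardware faults and divides them into CPU errors and GPU errors. Because statements in a GPGPU program may execute asynchronously, the error at a given program point is uncertain; the dissertation analyzes this uncertainty and designs conservative equations and solution algorithms that accommodate it. It also proposes running error-analysis kernels on the GPU to accelerate the static analysis of error propagation inside kernels. Together these establish the propagation model. Given the native errors, the model can be used, with GPU acceleration, to compute the error information at every program point of a GPGPU program.

The applications part builds on the propagation analysis of the foundations part to design and implement fault-tolerance optimization methods for MPI programs and GPGPU programs. The main work and contributions are:

1. WBC-ALC, a weakly blocking coordinated application-level checkpointing method for MPI programs (Chapter 5). The dissertation analyzes the difficulties of implementing application-level checkpointing in MPI programs and proposes WBC-ALC to address them. It describes the basic idea and coordination mechanism of WBC-ALC, designs a programming method and fault-tolerance framework for implementing it, and gives an implementation based on them. Experimental results show that programmers can apply WBC-ALC to MPI programs with relative ease and that WBC-ALC effectively reduces the fault-tolerance cost of checkpointing.

2. LazyFT, a lazy error detection method for GPGPU programs (Chapter 6). The dissertation analyzes how errors caused by transient faults in GPU computing components propagate on a CPU-GPU heterogeneous platform and, based on that propagation pattern, proposes a lazy error detection method. On top of this detection method it designs LazyFT, a fault-tolerance method for CPU-GPU heterogeneous systems, and gives the LazyFT fault-tolerance framework. It builds an execution-time model for fault-tolerant GPGPU programs and uses the model to choose the optimal fault-tolerance granularity for two typical kinds of program segment in scientific computing programs. Experiments confirm the effectiveness of LazyFT: compared with the existing Eager fault-tolerance method, LazyFT significantly reduces fault-tolerance overhead for GPGPU programs both when faults occur and when they do not.

3. PartialRC, a partial recomputation method for GPGPU programs (Chapter 7). The dissertation analyzes which computations a GPGPU program actually needs to redo after a transient GPU hardware fault, proposes for the first time the idea of partial recomputation for GPGPU programs, and presents PartialRC, a fault recovery method based on it. It designs a programming model and fault-tolerance framework for applying this recovery method to GPGPU programs and gives the basic principles, implementation techniques, and optimizations of the key techniques in the framework. Experimental results show that, compared with existing recovery based on full recomputation, PartialRC effectively reduces the recovery cost of GPGPU programs after a transient GPU hardware fault.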
[Abstract]: Advances in device technology, the expansion of system scale, and the rise of heterogeneous systems continue to improve the performance of high-performance computers, but they also bring serious reliability problems. Reliability has become one of the important factors restricting the development of high-performance computing.

Hardware faults and the errors they cause propagate as the program executes. Analyzing the propagation behavior of hardware faults in programs helps software methods tolerate those faults better. The research in this dissertation is therefore divided into a foundations part and an applications part: the foundations part analyzes the propagation behavior of hardware faults in programs; the applications part uses those analysis results to design corresponding fault-tolerance optimization methods.

In the foundations part, three representative types of programs are selected as research objects: serial programs, homogeneous parallel programs, and heterogeneous parallel programs. The propagation behavior of hardware faults in each is studied. The main work and contributions are:

1. A propagation model for hardware faults in serial programs is established (Chapter 2). The serial program is the most basic program type. The errors produced as a hardware fault propagates through a program are classified into native errors, data-flow-derived errors, and control-flow-derived errors, and a propagation model for hardware faults in serial programs is obtained. Based on the model, the error information at each program point of a serial program can be computed for a given set of native errors.

2. Taking MPI programs as an example, a propagation model for hardware faults in homogeneous parallel programs is established (Chapter 3). MPI is the de facto standard in parallel and distributed computing.

3. Taking GPGPU programs as an example, a propagation model for hardware faults in heterogeneous parallel programs is established (Chapter 4). CPU-GPU heterogeneous systems are widely used in high-performance computing, and GPGPU programs have become representative heterogeneous parallel programs. Based on the characteristics of GPGPU programs, the errors caused by hardware faults are analyzed and further divided into CPU errors and GPU errors.
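
The serial-program model in item 1 can be pictured with a small forward data-flow pass. The following is only a minimal sketch under stated assumptions, not the dissertation's equations or algorithm: the control-flow graph encoding (`defs`, `uses`, `succ`, `ctrl`), the function name `propagate_errors`, and the gen/kill transfer rule are all hypothetical choices made for illustration. A variable defined from an erroneous operand becomes a data-flow-derived error; a variable defined under a branch whose condition is erroneous becomes a control-flow-derived error.

```python
def propagate_errors(cfg, native):
    """Forward data-flow sketch of error propagation (illustrative only).

    cfg:    {node: {"defs": set, "uses": set, "succ": list, "ctrl": node or None}}
            "ctrl" names the branch node this statement is control-dependent on.
    native: {node: set of variables corrupted right after that node}
    Returns {node: set of possibly erroneous variables after the node}.
    """
    # Build predecessor lists from the successor lists.
    preds = {n: [] for n in cfg}
    for n, info in cfg.items():
        for s in info["succ"]:
            preds[s].append(n)

    err_in = {n: set() for n in cfg}
    err_out = {n: set() for n in cfg}

    changed = True
    while changed:  # iterate to a fixed point; the transfer rule is monotone
        changed = False
        for n, info in cfg.items():
            # Errors entering n: union over all predecessors.
            new_in = set().union(*(err_out[p] for p in preds[n]))
            # Data-flow-derived: an operand read by n is erroneous.
            bad_data = bool(info["uses"] & new_in)
            # Control-flow-derived: the controlling branch read an erroneous value.
            ctrl = info.get("ctrl")
            bad_ctrl = ctrl is not None and bool(cfg[ctrl]["uses"] & err_in[ctrl])
            gen = set(info["defs"]) if (bad_data or bad_ctrl) else set()
            # A clean redefinition removes the error from the defined variable.
            new_out = (new_in - info["defs"]) | gen | native.get(n, set())
            if new_in != err_in[n] or new_out != err_out[n]:
                err_in[n], err_out[n] = new_in, new_out
                changed = True
    return err_out


# Hypothetical five-statement program:
#   n1: a = input()   n2: b = a + 1   n3: if b > 0:   n4: c = 1   n5: d = 2
EXAMPLE_CFG = {
    "n1": {"defs": {"a"}, "uses": set(), "succ": ["n2"], "ctrl": None},
    "n2": {"defs": {"b"}, "uses": {"a"}, "succ": ["n3"], "ctrl": None},
    "n3": {"defs": set(), "uses": {"b"}, "succ": ["n4", "n5"], "ctrl": None},
    "n4": {"defs": {"c"}, "uses": set(), "succ": ["n5"], "ctrl": "n3"},
    "n5": {"defs": {"d"}, "uses": set(), "succ": [], "ctrl": None},
}
```

With a native error on `a` after `n1`, the error reaches `b` through data flow and `c` through the corrupted branch at `n3`, while `d` stays clean because it is assigned a constant outside the branch.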

In the applications part, based on the analysis of hardware fault propagation in programs, corresponding fault-tolerance optimization methods are designed and implemented for MPI programs and GPGPU programs respectively. The main work and contributions are as follows:

1. WBC-ALC, a weakly blocking coordinated application-level checkpointing method for MPI programs, is proposed (Chapter 5). The basic idea and coordination mechanism of WBC-ALC are introduced, and a programming method and fault-tolerance framework for implementing WBC-ALC are designed. Experimental results show that programmers can apply WBC-ALC to MPI programs with relative ease and that WBC-ALC effectively reduces the fault-tolerance cost of checkpointing.

2. LazyFT, a lazy error detection method for GPGPU programs, is proposed (Chapter 6). The propagation of errors caused by transient faults in GPU computing components is analyzed, a lazy error detection method is proposed based on this propagation pattern, and the LazyFT fault-tolerance framework is designed on top of that detection method.

3. PartialRC, a partial recomputation method for GPGPU programs, is proposed (Chapter 7). The computations that a GPGPU program actually needs to redo after a transient GPU hardware fault are analyzed, and the idea of partial recomputation is put forward together with a fault-tolerance framework for GPGPU programs based on it. The basic principles, implementation techniques, and optimizations of the framework's key techniques are presented. Experimental results show that, compared with existing recovery based on full recomputation, PartialRC effectively reduces the recovery cost of GPGPU programs after a transient GPU hardware fault.
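
The contrast between full and partial recomputation in item 3 can be illustrated with a toy example. This is a hedged sketch only and not PartialRC itself: the abstract does not say how the affected computations are identified, so the sketch assumes, purely for illustration, that the work is split into independent blocks and that a mismatch between two redundant executions marks a block as suspect. The names `square_block` and `run_with_partial_recompute`, and the fault-injection flag, are hypothetical, and plain Python lists stand in for GPU thread blocks.

```python
def square_block(block, inject_fault=False):
    """Stand-in for one block of a kernel: squares every element."""
    out = [x * x for x in block]
    if inject_fault:
        out[0] += 1  # simulated transient fault corrupting one result
    return out


def run_with_partial_recompute(data, block_size, faulty_blocks=()):
    """Compute all blocks, then redo only the blocks whose two copies disagree.

    Returns (results, recomputed), where `recomputed` lists the indices of the
    blocks that were executed again. Full recomputation would redo every block.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]

    # Primary execution, with faults injected into the chosen blocks.
    primary = [
        square_block(b, inject_fault=(i in faulty_blocks))
        for i, b in enumerate(blocks)
    ]
    # Redundant execution used only for detection (an assumption of this sketch).
    replica = [square_block(b) for b in blocks]

    recomputed = []
    for i, b in enumerate(blocks):
        if primary[i] != replica[i]:
            # Only the suspect block is re-executed.
            primary[i] = square_block(b)
            recomputed.append(i)

    results = [y for blk in primary for y in blk]
    return results, recomputed
```

For eight inputs in blocks of two with a fault injected into block 2, only that one block is re-executed, whereas full recomputation would rerun all four.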

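[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree granted]: 2012
[Classification number]: TP302.8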
