Research on Configurable Fault-Tolerance Techniques for Transient Faults
Topics: transient faults + program analysis; Source: National University of Defense Technology, 2013 doctoral dissertation
[Abstract]: As processor design moves toward smaller transistor feature sizes, lower operating voltages, and higher frequencies, the reliability problems caused by transient faults have drawn the attention of the entire computing market. Because users in different domains place different requirements on system reliability, cost, performance, and power consumption, processor designers must provide reliability solutions whose reliability and cost satisfy each user's constraints. To meet this challenge, this dissertation focuses on configurable, low-cost fault-tolerance techniques. To analyze the impact of transient faults and the reliability of fault-tolerance techniques, it also studies reliability analysis based on fault injection. The work falls into four parts:

1. Faults in a processor's execution units can cause data-flow errors or control-flow errors during program execution. Data-flow error detection usually relies on redundant computation, and reducing the overhead of that redundancy (performance, hardware cost, and so on) remains a long-standing difficulty in fault-tolerance research. To address it, this dissertation combines the strengths of software and hardware fault tolerance and proposes Epipe, a configurable data-flow detection technique. Epipe first modifies an existing superscalar pipelined processor to provide a hardware platform that can apply redundant protection to selected instructions. Because superscalar processors have abundant computing resources, the Epipe platform needs very little additional hardware. To reduce the performance overhead of redundant protection, Epipe also uses program analysis to estimate the importance of each instruction, defined as the probability that a fault in the instruction causes the program to output an incorrect result. At run time, Epipe protects the most important subset of instructions according to the user's performance and reliability requirements. The novelty of Epipe is that it redundantly protects only those instructions whose faults lead to incorrect program output. Faults that lead to system exceptions or timeouts are handled directly by the system's existing exception detection mechanisms, and the remaining faults, which do not affect program execution (masked faults), need no handling at all. Classifying faults this way substantially reduces the number of instructions that need redundant protection. Combined with a hardware instruction-protection technique that has low time and space overhead, it allows Epipe to protect program data flow at lower cost.

2. Software-implemented signature (label) analysis is an effective technique for control-flow checking. Existing signature-analysis techniques suffer from excessive time and space overhead and insufficient reliability, and they also lack configurability, so they cannot meet the differing needs of different users. Moreover, the redundant code that software detection introduces can itself fail, and existing control-flow checking techniques have paid little attention to protecting the fault-tolerance mechanism itself. To overcome these shortcomings, this dissertation proposes CFCES, a configurable control-flow checking algorithm. CFCES designs a specially formatted signature for each program block and instruments the block with additional control-flow checking instructions, which eliminates the detection blind spots of existing algorithms at low overhead. CFCES also introduces an invariant called "equivalence" (对等性) into the design of its detection mechanism. By checking this invariant, CFCES achieves self-protection of the error-detection mechanism at very low cost. In addition, CFCES offers configurable optimizations, based on analyzing the importance of functions and adjusting the size of program blocks, that can satisfy users' differing time and space overhead and reliability constraints. These optimizations improve the fault-tolerance efficiency of CFCES and can also be applied to other signature-based control-flow checking algorithms.

3. Transient faults can occur not only in a processor's execution units but also in its storage units. ECC, widely used to protect off-chip memory, is poorly suited to on-chip storage structures: these structures already occupy most of the chip area and are accessed frequently, so ECC protection would incur substantial area, performance, and power overhead. Because existing fault-tolerance research offers few reasonable protection schemes for on-chip storage, this dissertation proposes PPS, a low-cost protection technique for one particular on-chip storage structure, the SPM (scratchpad memory). Although fully protecting the SPM with ECC is expensive, protecting part of the SPM with ECC and allocating it judiciously is still very valuable. PPS first designs a storage architecture in which part of the SPM is ECC-protected (the protected fraction can be chosen according to each application's reliability and performance requirements). It then performs vulnerability analysis on the program variables to be allocated and partitions the SPM space into "registers". Finally, it uses priority-based graph coloring to allocate the more vulnerable variables preferentially to the ECC-protected "registers". With this approach, PPS achieves high storage reliability at low overhead.

4. Fault injection is an effective and widely used reliability analysis method. Its main difficulty is balancing simulation speed against accuracy. Because existing fault injection techniques do not resolve this trade-off effectively, this dissertation proposes a new fault injection framework, Smart Injector. Smart Injector first uses program analysis to remove equivalence-class faults and outcome-determined faults from the fault space. Equivalence-class faults are faults that occur in similar data-flow or control-flow contexts. Such faults tend to produce the same system response, so they can be grouped into an equivalence class from which a single representative is selected for simulated injection, and the other faults in the class can be removed from the fault space. Outcome-determined faults are those whose system response can be determined by program analysis alone. Smart Injector also develops, for the first time according to the author, a fault outcome prediction technique: by predicting the outcome a fault will produce and the location at which that outcome can be judged, it can determine the result of an injection before the program finishes, which shortens each simulation run. By combining fault pruning with outcome prediction, Smart Injector greatly reduces the time cost of fault injection with only a small loss of accuracy.
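The equivalence-class pruning idea in the fault injection part of the abstract can be illustrated with a minimal Python sketch. This is not Smart Injector's actual implementation, which the abstract does not specify. The fault representation (instruction address, flipped bit), the `context_key` function that decides which faults share a context, and the `inject` callback are all hypothetical names introduced here for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

# Hypothetical fault representation: (instruction address, flipped bit).
Fault = Tuple[int, int]


def group_faults(faults: Iterable[Fault],
                 context_key: Callable[[Fault], Hashable]
                 ) -> Dict[Hashable, List[Fault]]:
    """Partition the fault space into classes of faults sharing a context key."""
    classes: Dict[Hashable, List[Fault]] = defaultdict(list)
    for fault in faults:
        classes[context_key(fault)].append(fault)
    return dict(classes)


def estimate_outcomes(classes: Dict[Hashable, List[Fault]],
                      inject: Callable[[Fault], str]) -> Dict[str, int]:
    """Inject one representative per class; credit its outcome to all members."""
    counts: Dict[str, int] = defaultdict(int)
    for members in classes.values():
        outcome = inject(members[0])  # only the representative is simulated
        counts[outcome] += len(members)
    return dict(counts)


# Toy fault space: 3 instructions x 4 bits = 12 candidate faults.
faults = [(addr, bit) for addr in (0x10, 0x14, 0x18) for bit in range(4)]

# Made-up context key: faults are "similar" if they fall in the same
# address region and the same half of the bit range.
classes = group_faults(faults, context_key=lambda f: (f[0] >= 0x14, f[1] < 2))

runs: List[Fault] = []  # records which faults were actually simulated


def toy_inject(fault: Fault) -> str:
    """Stand-in for a real simulation run; the outcome rule is invented."""
    runs.append(fault)
    return "masked" if fault[1] < 2 else "wrong_output"


print(estimate_outcomes(classes, toy_inject))
print(f"simulated {len(runs)} of {len(faults)} faults")
```

In this toy setup only one fault per class is simulated, so 4 injections stand in for 12. The accuracy of the estimate depends entirely on how well the context key captures faults that really do behave alike, which is the speed versus accuracy trade-off the abstract describes.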
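[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree conferred]: 2013
[Classification number]: TP332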
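Related doctoral dissertations (top 1)
1 Li Jianli (李建立); Research on Configurable Fault-Tolerance Techniques for Transient Faults [D]; National University of Defense Technology; 2013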
Article ID: 1906611
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1906611.html