Research on Rollback-Recovery Fault-Tolerance Techniques Based on Dependency Tracking and Message Counting

Published: 2019-04-09 10:51

[Abstract]: A large share of scientific research and engineering applications now runs on distributed computing systems. As these systems scale up and the number of nodes grows, the probability of a failure during execution rises as well. For a system to keep producing correct results, or to keep meeting application requirements, after a fault or exception occurs, it must be fault tolerant. Rollback-recovery fault tolerance relies on time redundancy and requires no redundant nodes, and it is the mainstream technique for making high-performance distributed computing reliable. However, while it safeguards reliability, rollback recovery also introduces substantial additional overhead, which greatly limits its application and development. Studying ways to reduce the overhead of rollback-recovery protocols and improve execution efficiency is therefore of considerable importance. The work in this thesis covers two aspects.

First, to address the high message-logging overhead that synchronization constraints cause in traditional message-logging protocols, the thesis proposes a lightweight message-logging protocol based on dependency tracking. The protocol exploits the characteristics of message passing at run time and uses piggybacking to remove the synchronization constraints from message logging. The message data itself is kept at the sender, which imposes no constraint. The commit information of each message is piggybacked on subsequent messages and stored at the dependent processes as the dependency relation extends, which likewise introduces no constraint. Because the processes holding the commit information track it, unnecessary transmission is avoided as far as possible and the amount of piggybacked information is reduced, which keeps the protocol lightweight. Experiments show that, compared with the Egida protocol, the proposed protocol lowers both message-logging overhead and checkpoint overhead by about 10%.

Second, to address the blocking behavior or high coordination overhead of existing coordinated checkpoint protocols, the thesis proposes a non-blocking coordinated checkpoint protocol based on message counting. The protocol divides the run-time state of a process into three categories. It exploits the fact that, in distributed parallel programs, checkpoints are taken far more often than failures occur: using piggybacking and a non-blocking execution mechanism, it shifts part of the coordination overhead from checkpointing to the rollback-recovery phase that follows a failure. It also marks whether each process has communicated within the checkpoint interval, so that processes do not take unnecessary checkpoints, which lowers the overall overhead of checkpointing. Experimental results show that the protocol reduces coordinated checkpoint overhead by 20% to 40% compared with the two-phase checkpoint protocol, and by about 20% compared with the distributed snapshot protocol.
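The idea of counting messages within a checkpoint interval so that idle processes can skip a checkpoint can be illustrated with a minimal sketch. This is not the protocol from the thesis: the `Process` class, its fields, and `run_demo` are hypothetical names chosen for illustration, and the sketch assumes a simplified model in which a process may skip a checkpoint whenever it has neither sent nor received a message since the previous checkpoint request.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process model; names are illustrative, not from the thesis."""
    pid: int
    sent: int = 0        # messages sent in the current checkpoint interval
    received: int = 0    # messages received in the current checkpoint interval
    checkpoints: list = field(default_factory=list)  # epochs with a saved checkpoint

    def send(self, other: "Process") -> None:
        # Count the message on both sides of the channel.
        self.sent += 1
        other.received += 1

    def on_checkpoint_request(self, epoch: int) -> bool:
        """Save a checkpoint only if this process communicated in the interval."""
        communicated = (self.sent + self.received) > 0
        if communicated:
            self.checkpoints.append(epoch)
        # Reset the counters for the next interval in either case.
        self.sent = 0
        self.received = 0
        return communicated


def run_demo() -> dict:
    procs = [Process(pid) for pid in range(3)]
    procs[0].send(procs[1])  # only P0 and P1 communicate in this interval
    # Map each pid to whether it took a checkpoint for epoch 1.
    return {p.pid: p.on_checkpoint_request(epoch=1) for p in procs}
```

In this simplified model, a process that exchanged no messages has no new dependencies on other processes, so its earlier checkpoint still belongs to a consistent global state and the new one can be skipped. The three-state process model and the non-blocking coordination described in the abstract are not modeled here.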
[Degree-granting institution]: Hunan University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP302.7
