基于消息数目检验和消息重排序理论的检查点算法的研究

发布时间：2018-11-22 14:14

【摘要】：随着大型分布式系统的不断发展,人们越来越关注系统的可靠性。例如中国研制的天河一号系统、航空火车等分布式控制交通系统以及基于MPI的FT-MPI系统等。分布式系统不仅关系到经济社会各方面的发展,而且与我们每个人息息相关。分布式系统容错性的质量保证的特性决定了其应用的广泛性以及重要性。分布式系统的容错性可以理解为容忍错误,消除错误影响。分布式系统容错主要分为前向容错和后向容错。考虑到存储量以及恢复过程,与前向容错技术相比,后向容错技术在实际应用中更为广泛。本课题来自于“基于后向恢复的异构分布式系统容错技术的研究与实现”的山东省自然科学基金项目。后向容错技术分为两种：基于检查点的容错算法与基于消息日志的容错协议。如何保存分布式系统的系统状态以及当系统失效时如何使进程恢复到全局一致状态是后向容错技术中的两个主要问题。现存文献中存在很多判定分布式全局状态一致的方法,但存在不同程度的缺陷。本文主要创新点及贡献为： (1)提出消息数目检验模型。通过研究进程间消息接收事件数目与消息发送事件数目的关系,本文提出消息数目检验模型。在此模型中,若一个全局状态中不含孤儿消息,则判定此全局状态是一致的。 (2)基于消息数目检验模型,提出一种新的求解包含给定检查集的最大最小全局一致检查点算法。此算法首先利用消息数目检验方法判定给定的检查点集中是否存在孤儿消息。如果存在孤儿消息,则分布式系统中不存在包含给定检查点集的最大最小全局一致检查点,减少搜索时间开销。否则,通过全局搜索算法查找包含给定检查点集的最大最小全局一致检查点。 (3)提出了消息重排序理论。首先此理论描述了消息发送事件与接收事件间的总是在先发生关系,并利用进程改进的逻辑时钟标记事件间的总是在先发生关系。其次此理论引入了等价消息接收序列的概念。在消息恢复过程中不存在和进程失效前执行结果完全一致的等价的消息接收序列。最后此理论解决了在乐观消息日志恢复协议中,进程的接收消息次序由于故障丢失的问题。 (4)在消息数目检验模型和消息重排序理论的基础上,提出一种新的消息日志协议。此消息日志协议的主要创新点是：①此协议表明了在乐观消息日志协议中,系统故障恢复时未做日志的消息不可能以系统失效前的接收顺序准确重现。②使用基于接收的日志协议,消息做日志事件与消息发送事件可在消息发送方异步进行。
[Abstract]:With the development of large distributed system, people pay more and more attention to the reliability of the system. Such as Tianhe 1 system developed in China, distributed traffic control system such as air train and FT-MPI system based on MPI. Distributed system is not only related to economic and social development, but also closely related to each of us. The quality assurance of fault tolerance in distributed systems determines the universality and importance of its applications. The fault tolerance of distributed systems can be understood as tolerating errors and eliminating error effects. Fault tolerance in distributed systems is mainly divided into forward fault tolerance and backward fault tolerance. Considering the storage capacity and recovery process, the backward fault-tolerant technology is more widely used in practice than the forward fault-tolerant technology. This topic comes from the project of Shandong Natural Science Foundation, "Research and implementation of fault-tolerant technology for heterogeneous distributed systems based on backward recovery". Backward fault-tolerant techniques can be divided into two types: fault-tolerant algorithm based on checkpoint and fault-tolerant protocol based on message log. How to preserve the system state of distributed systems and how to restore the process to a globally consistent state when the system fails are two main problems in the backward fault-tolerant technology. There are many methods to determine the consistency of distributed global state in the existing literature, but there are some defects in different degrees. The main innovations and contributions of this paper are as follows: (1) A message number test model is proposed. By studying the relationship between the number of message receiving events and the number of message sending events, this paper proposes a message number verification model. In this model, if there is no orphan message in a global state, it is determined that the global state is consistent. (2) based on the message number test model, a new algorithm for solving the maximum and minimum global uniform checkpointing with a given check set is proposed. The algorithm first uses the number of messages test method to determine whether an orphan message exists in a given checkpoint set. If orphan messages exist in distributed systems there is no maximum and minimum globally uniform checkpoint containing a given set of checkpoints which reduces the search time overhead. Otherwise, a global search algorithm is used to find the maximum and minimum globally uniform checkpoints containing a given set of checkpoints. (3) the theory of message reordering is proposed. Firstly, this theory describes the relationship between the sending and receiving events, and uses the improved logical clock to mark the relationship between the events. Secondly, this theory introduces the concept of equivalent message receiving sequence. In the process of message recovery, there is no equivalent message receiving sequence that is consistent with the result of execution before the process fails. Finally, this theory solves the problem that the order of received messages in the optimistic message log recovery protocol is lost due to failure. (4) based on the message number checking model and message reordering theory, a new message log protocol is proposed. The main innovations of this message logging protocol are: 1 this protocol indicates that in the optimistic message logging protocol, Messages that are not logged during system fault recovery can not be accurately reproduced in the order of receiving before system failure. 2 using the received log protocol, message log events and message sending events can be performed asynchronously at the sender of the message.
【学位授予单位】：山东大学
【学位级别】：硕士
【学位授予年份】：2013
【分类号】：TP302.8

【参考文献】