Research on Fault-Tolerant Checkpoint Algorithms and Software Design

Published: 2018-02-26 04:23

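Keywords: fault tolerance; unreliable non-FIFO channel; consistent global checkpoint; Windows checkpoint. Source: Shandong University, master's thesis, 2012. Document type: degree thesis.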

[Abstract]: In recent years, distributed systems have been adopted in a growing number of fields, including the military, aviation, and finance. As distributed software grows more complex and the number of nodes in a distributed system increases, the probability of failure rises and system reliability degrades. If a failure occurs during operation and no protective measures are in place, it can cause serious loss of life and property. Research on fault-tolerant checkpointing therefore has great practical significance. This work was carried out under the Shandong Provincial Natural Science Foundation project "Research and Implementation of Fault-Tolerance Technology for Heterogeneous Distributed Systems Based on Backward Recovery". The thesis first describes the significance and current state of research on checkpointing, and introduces the basic failure models and basic fault-tolerant building blocks of distributed systems. It then proposes a checkpoint algorithm for unreliable, non-FIFO communication channels. On such channels, messages can be lost, received more than once, or delivered out of order. A lost message is never processed, a duplicated message may be processed several times, and reordered messages may not be processed in the order in which they were sent; any of these can produce incorrect computation results and prevent the processes from taking consistent checkpoints. The algorithm addresses these problems by assigning a sequence number to every message. During checkpointing, a consistent checkpoint is determined from the send and receive sequence numbers. Comparing the send and receive sequence numbers identifies lost, duplicate, and out-of-order messages: lost messages are retransmitted, out-of-order messages are saved, and duplicates are discarded. The algorithm thus enables the system to take a consistent global checkpoint. The thesis also describes how to take and restore a checkpoint of a Windows process, which divides into saving and restoring the user address space and saving and restoring kernel objects, and it simulates process checkpointing and recovery in a Visual Studio 2005 environment.
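The sequence-number handling summarized in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details are not given here; it only shows one plausible way a receiver could use per-message sequence numbers to discard duplicates, hold out-of-order messages, and report lost ones for retransmission. The names `Receiver`, `on_message`, `missing`, and `channel_consistent` are hypothetical, and the consistency test (every sent message delivered, nothing still buffered) is an assumption about how send and receive sequence numbers might be compared.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Receive-side state for one channel (hypothetical, for illustration only)."""
    next_expected: int = 0                         # lowest sequence number not yet delivered
    buffered: dict = field(default_factory=dict)   # out-of-order messages keyed by sequence number
    delivered: list = field(default_factory=list)  # payloads handed to the application, in send order

    def on_message(self, seq, payload):
        """Classify an incoming message and deliver whatever is now in order."""
        if seq < self.next_expected or seq in self.buffered:
            return "duplicate"                     # already seen: discard
        self.buffered[seq] = payload
        if seq > self.next_expected:
            return "out_of_order"                  # save until the gap is filled
        # seq == next_expected: deliver it and any buffered successors
        while self.next_expected in self.buffered:
            self.delivered.append(self.buffered.pop(self.next_expected))
            self.next_expected += 1
        return "delivered"

    def missing(self, sent_count):
        """Sequence numbers in 0..sent_count-1 that have not arrived (candidates for retransmission)."""
        return [s for s in range(self.next_expected, sent_count)
                if s not in self.buffered]


def channel_consistent(sent_count, receiver):
    """Assumed rule: the channel is consistent when every sent message has been delivered."""
    return receiver.next_expected == sent_count and not receiver.buffered
```

For example, if a sender has sent messages 0 through 3 and the receiver has seen 0, then 2, then 0 again, `on_message` returns "delivered", "out_of_order", and "duplicate" in turn, and `missing(4)` lists 1 and 3 as the messages to request again. Once those arrive, `channel_consistent(4, receiver)` becomes true and, under the assumption above, a checkpoint could be taken on that channel.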
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP302.8

[References]