Research on Checkpoint Fault-Tolerance Technology Based on VxWorks

Published: 2018-11-27 11:00

[Abstract]: Checkpointing is a common fault-tolerance technique that is widely used in distributed and cluster systems. In a message-passing system, each process periodically records its running state to reliable storage as a checkpoint file. When a process fails, it can recover quickly from the stored checkpoint file instead of repeating its earlier work, which reduces lost computation. Coordinated checkpointing is one form of checkpointing: it coordinates the checkpoint setting of the processes so that the set of checkpoints forms a globally consistent state. A fault-tolerance technique is usually evaluated by its fault-tolerance overhead, which, taking the point of failure as the boundary, divides into failure-free overhead and failure-recovery overhead. Because of its globally consistent state, coordinated checkpointing performs well on failure-recovery overhead, but the coordination control messages exchanged between processes while checkpoints are being set increase the failure-free overhead of the system. For checkpoint setting, this thesis proposes a non-blocking coordinated checkpoint algorithm with O(n) complexity, which uses a globally shared message channel to prevent message reception by processes from leaving the system state inconsistent. Unlike the two-phase blocking protocol of traditional blocking coordinated checkpoint algorithms, the proposed algorithm is single-phase and non-blocking; by relying on global shared memory, it reduces the complexity of coordination messages from the traditional O(n^2) to O(n) and thereby reduces the failure-free overhead. In addition, because the algorithm is non-blocking, a task does not have to suspend its execution or its message sending while a checkpoint is being set, and once it has completed its checkpoint it can process subsequently received messages without waiting, which improves the processing speed and real-time performance of the system. To keep the algorithm non-blocking, each process stores its checkpoint file independently, so a failure during checkpoint setting can leave the processes' checkpoint files inconsistent; the thesis uses double checkpoint files to prevent this inconsistency. The non-blocking approach greatly increases the autonomy of the processes, but it also weakens the checkpoint state of the system from a strongly consistent global state to a consistent global state, because the checkpoint state may now contain in-transit messages. Checkpoint setting therefore has to be combined with message logging to guarantee that the system state is recoverable. It follows from the coordinated checkpoint algorithm that the message log only needs to store messages after the checkpoint trigger point, which removes the need for garbage collection. The checkpoint fault-tolerance scheme in this thesis is built on the VxWorks embedded real-time system, which offers good reliability and real-time performance. Tailored to this system, the scheme improves file storage and message transmission: a tape-style storage scheme improves file storage efficiency and reduces storage space usage, and memory management reduces the amount of data copied in message queues and improves data transmission efficiency. Finally, three simple experiments verify the feasibility of the checkpoint fault-tolerance scheme.
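The abstract states that double checkpoint files protect against a failure that strikes while a checkpoint is being written, but it does not describe the mechanism. The following is a minimal sketch of one common way to realise that idea, not the thesis's implementation: the class name `DoubleCheckpoint`, the file names, the sequence number, and the CRC32 header are all assumptions made for illustration, and Python is used for brevity rather than VxWorks C. Each new checkpoint overwrites the older of two files, so the newest intact checkpoint is never the one being written.

```python
import json
import os
import tempfile
import zlib


class DoubleCheckpoint:
    """Hypothetical two-slot checkpoint store: one intact copy always survives."""

    def __init__(self, directory):
        # Two alternating slots; the names are illustrative only.
        self.paths = [os.path.join(directory, "ckpt_a.bin"),
                      os.path.join(directory, "ckpt_b.bin")]

    def _read_slot(self, path):
        """Return (sequence, state) if the slot is intact, otherwise None."""
        try:
            with open(path, "rb") as f:
                raw = f.read()
        except FileNotFoundError:
            return None
        if len(raw) < 4:
            return None
        stored_crc = int.from_bytes(raw[:4], "big")
        payload = raw[4:]
        if zlib.crc32(payload) != stored_crc:
            return None  # torn or partial write
        try:
            record = json.loads(payload.decode("utf-8"))
        except ValueError:
            return None
        return record["seq"], record["state"]

    def load(self):
        """Return the newest intact (sequence, state), or None if neither slot is valid."""
        slots = [s for s in map(self._read_slot, self.paths) if s is not None]
        return max(slots, key=lambda s: s[0]) if slots else None

    def save(self, state):
        """Write a new checkpoint over the older (or invalid) slot."""
        seqs = []
        for path in self.paths:
            slot = self._read_slot(path)
            seqs.append(slot[0] if slot else -1)
        target = 0 if seqs[0] <= seqs[1] else 1  # never the newest valid slot
        next_seq = max(seqs) + 1
        payload = json.dumps({"seq": next_seq, "state": state}).encode("utf-8")
        with open(self.paths[target], "wb") as f:
            f.write(zlib.crc32(payload).to_bytes(4, "big"))
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data to stable storage
        return next_seq


if __name__ == "__main__":
    store = DoubleCheckpoint(tempfile.mkdtemp())
    store.save({"step": 1})
    store.save({"step": 2})
    print(store.load())
```

If a failure interrupts the third `save`, only the slot holding the oldest checkpoint is damaged; `load` rejects it on the checksum and falls back to the second checkpoint, so recovery still starts from a complete file.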
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP302.8
