Research on Rollback-Recovery Techniques in Distributed Systems

Published: 2018-02-27 22:11

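Keywords: distributed systems; rollback recovery; checkpoint; XMPP protocol; prototype system  Source: Chongqing University, 2012 doctoral dissertation  Type: degree thesis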

[Abstract]: Distributed systems carry low investment risk for users, scale well, let users reuse existing software and hardware resources, and are simple to construct, so they are applied ever more widely: large-scale scientific computing, weather forecasting, time-sharing telephone systems, airline reservation, banking, stock trading, online shopping, and so on. As systems grow, the probability of a failure during computation grows exponentially, and a system failure can have disastrous consequences, so distributed computing systems urgently need fault-tolerance mechanisms. Checkpoint and rollback-recovery is an important class of software fault-tolerance techniques. It is simple to implement and use and has low resource requirements, which makes it suitable for distributed computing environments.
In a distributed computing environment, uncertain communication bandwidth, limited storage space, node dynamics, and frequent disconnections mean that rollback-recovery techniques developed for single-machine systems cannot be applied directly. The core problems for rollback-recovery research in distributed environments are, while preserving system consistency, to reduce the storage overhead of checkpoints and message logs, reduce the communication overhead introduced by the rollback-recovery mechanism, improve node autonomy, reduce the coupling between nodes caused by inter-process dependencies, and make the rollback-recovery mechanism transparent to nodes. This dissertation addresses these problems; its main contributions are as follows.
(1) A non-blocking coordinated checkpointing and rollback-recovery algorithm for distributed environments is proposed. In practical distributed computing applications, nodes are highly autonomous, and the desired fault-tolerance mechanism is a transparent service. The proposed checkpointing algorithm relies on the sending process to ensure that no orphan messages arise and needs no information from the receiving process. Every checkpoint the algorithm obtains is a globally consistent checkpoint, and it is taken directly as a permanent checkpoint, skipping the tentative-checkpoint phase, which shortens the time needed to form a checkpoint. Whether a process takes a checkpoint is independent of other processes and depends only on its send flag, which gives the algorithm high parallelism. After a node fails, a process only needs to broadcast one synchronization message; each process that receives it acts independently according to the algorithm, with no additional messages from other processes, so nodes execute the rollback-recovery algorithm transparently and in parallel. Performance analysis and simulation experiments confirm the algorithm's low overhead both in failure-free operation and during rollback recovery.
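The terms "orphan message" and "globally consistent checkpoint" can be made concrete with a small check. The sketch below is not the dissertation's algorithm; it only encodes the standard definition that a message is an orphan when its receipt is recorded in the receiver's checkpoint while its send is not recorded in the sender's. The `Message` type, the event-index representation, and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_seq: int   # index of the send event in the sender's local history
    receiver: str
    recv_seq: int   # index of the receive event in the receiver's local history


def orphan_messages(checkpoints, messages):
    """Return messages received before the receiver's checkpoint
    but sent after the sender's checkpoint.

    checkpoints maps each process name to the index of the last local
    event covered by its checkpoint (hypothetical representation).
    """
    return [
        m for m in messages
        if m.recv_seq <= checkpoints[m.receiver]
        and m.send_seq > checkpoints[m.sender]
    ]


def is_consistent(checkpoints, messages):
    """A set of local checkpoints is globally consistent if it has no orphans."""
    return not orphan_messages(checkpoints, messages)
```

For example, with a single message sent as event 3 of `p1` and received as event 2 of `p2`, the checkpoint set `{"p1": 2, "p2": 2}` is inconsistent, because `p2` remembers a message that `p1` has not yet sent, while `{"p1": 3, "p2": 2}` is consistent.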
(2) A two-level checkpointing and rollback-recovery fault-tolerance algorithm based on dynamic grouping is proposed. In an application spanning many nodes, the frequency with which nodes exchange information varies, sometimes greatly, so a mechanism is needed that adapts to the dynamic cooperation of processes in a distributed system. The proposed algorithm groups nodes dynamically according to inter-node communication frequency, communication latency, communication bandwidth, and the number of nodes per group, yielding groups with high cohesion and low coupling. Within a group, latency is low and nodes are few, which suits coordinated checkpointing, so a coordinated checkpointing algorithm is used at the group level. Groups are usually interconnected by high-latency, low-bandwidth networks and communicate with each other infrequently, and the proposed system-level checkpointing algorithm takes these characteristics into account. Whether a group takes a checkpoint is independent of other groups, so groups can take system-level checkpoints independently and in parallel. The sending group ensures that no orphan messages arise between groups, every system-level checkpoint obtained is a globally consistent checkpoint, and the domino effect is avoided. The algorithm thus adapts dynamically to the application's own requirements and improves overall resource efficiency, and by having the sending group prevent inter-group orphan messages it replaces the traditional two-phase commit algorithm with a single-phase one. Experimental results show a low execution time: compared with the traditional two-phase commit algorithm, time complexity falls from the usual O(n²) to O(n).
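The abstract does not give the grouping procedure, so the following is only a rough sketch of what frequency-driven grouping could look like. It assumes a greedy strategy that merges the most heavily communicating node pairs first, subject to a group-size cap. The function name `group_nodes`, the parameters `max_size` and `min_freq`, and the decision to ignore latency and bandwidth are all simplifications introduced here, not details from the dissertation.

```python
def group_nodes(nodes, traffic, max_size, min_freq):
    """Greedily group nodes that communicate often (illustrative only).

    traffic maps a pair (a, b) to an observed message frequency.
    Pairs below min_freq are never merged; groups never exceed max_size.
    """
    parent = {n: n for n in nodes}
    size = {n: 1 for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Visit pairs from the most to the least frequent.
    for (a, b), freq in sorted(traffic.items(), key=lambda kv: -kv[1]):
        if freq < min_freq:
            break
        ra, rb = find(a), find(b)
        if ra != rb and size[ra] + size[rb] <= max_size:
            parent[rb] = ra
            size[ra] += size[rb]

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

With nodes `a` to `d`, heavy traffic on the pairs `(a, b)` and `(c, d)`, and light traffic on `(b, c)`, a size cap of 3 keeps two groups of two nodes each, which matches the high-cohesion, low-coupling goal described above. Rerunning the function as traffic statistics change would give the dynamic behavior.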
(3) A general-purpose message-passing mechanism is built on the XMPP protocol. Existing checkpointing and rollback-recovery algorithms each define their own message-passing scheme, so none of them is general. Based on the characteristics of distributed systems and of the messages exchanged by checkpointing algorithms, we build a general message-passing mechanism on XMPP that delivers messages across platforms in near real time. Extending the XML tags of the XMPP protocol unifies the transmission formats of the various checkpoint messages and improves program reusability.
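XMPP allows stanzas to carry extension elements qualified by their own XML namespace, which is one plausible way to realize the tag extension described above. The sketch below builds and parses such a message using only Python's standard library. The namespace `urn:example:checkpoint:0`, the `checkpoint` element, and its `kind` and `seq` attributes are hypothetical, since the abstract does not specify the dissertation's actual format. For brevity the `message` element is left unqualified, whereas on a real client stream it would belong to the `jabber:client` namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the checkpoint extension element.
CKPT_NS = "urn:example:checkpoint:0"


def build_checkpoint_stanza(sender, recipient, kind, seq):
    """Serialize a message stanza carrying one checkpoint extension element."""
    msg = ET.Element("message", {"from": sender, "to": recipient, "type": "normal"})
    ET.SubElement(msg, f"{{{CKPT_NS}}}checkpoint", {"kind": kind, "seq": str(seq)})
    return ET.tostring(msg, encoding="unicode")


def parse_checkpoint_stanza(xml_text):
    """Extract the checkpoint payload, or return None if the extension is absent."""
    msg = ET.fromstring(xml_text)
    ext = msg.find(f"{{{CKPT_NS}}}checkpoint")
    if ext is None:
        return None
    return {
        "from": msg.get("from"),
        "to": msg.get("to"),
        "kind": ext.get("kind"),
        "seq": int(ext.get("seq")),
    }
```

Because every algorithm-specific message becomes a different `kind` value inside the same envelope, one parser can serve several checkpointing algorithms, which is the reusability benefit the abstract claims.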
(4) A prototype system is designed and implemented. Building on theoretical research, designing and implementing a prototype to verify that the theory is realizable is an important step on the way from theory to engineering practice. Drawing on the preceding theoretical results, the dissertation studies the prototype's system architecture, the client software requirements analysis, the client's overall framework, its functional modules, and its processing flow, and implements a prototype system that demonstrates the realizability of the theoretical results.

[Degree-granting institution]: Chongqing University
[Degree level]: Doctorate
[Year degree granted]: 2012
[Classification number]: TP338.8

[Citing literature]