A Message Log Recovery Method Based on Message Reordering and Message Count Verification

Published: 2018-03-20 11:23

Topic: distributed systems　Focus: fault tolerance　Source: Shandong University, 2013 master's thesis　Type: degree thesis

[Abstract]: With the rapid development of computer technology, distributed systems have come into wide use. However, because distributed systems themselves are not yet sufficiently stable and reliable, and because of external factors such as environment and personnel, they have a high failure rate, and a failure inevitably disrupts normal production and daily life. This raises several questions: how to ensure that a failure at one node does not affect the normal operation of the whole system, how to keep valid data intact, how to determine quickly where a problem occurred and what caused it, and how to resolve it as soon as possible so that the whole system keeps running efficiently. Fault-tolerance technology addresses these problems well, which makes it an important subject of practical research. Shandong Province set up a dedicated project, funded through its Natural Science Foundation, on fault tolerance for distributed systems based on backward recovery. Building on that work, this thesis introduces the significance of checkpoint research and the current state of the field, surveys the active directions in research on fault-tolerance mechanisms for distributed systems, analyzes the faults that can occur in distributed systems, explains the basic fault-tolerant building blocks in detail, and presents the notion of a globally consistent system state. It explores technical means of reducing process blocking during checkpointing and rollback recovery, so that checkpoints can be taken more efficiently and the number of messages stays within a reasonable range. It also studies rollback-recovery techniques and checkpoint algorithms in depth, describing three checkpoint protocols and three message-logging protocols.
Because of lost messages, out-of-order messages, in-transit messages, and duplicate messages, a system may be unable to take consistent checkpoints, so it often fails to produce correct results, which causes considerable trouble for users. Similarly, in optimistic message logging, saving messages to the log file is asynchronous with respect to inter-process communication. When a process fails, the order of the messages received in the system can become logically inconsistent, so messages the system delivers to the user may be missing, unprocessed, or processed incorrectly. Even in the absence of failures, the system's runtime performance leaves room for improvement. To address this problem, this thesis proposes a message log recovery method based on message reordering and message count verification.
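The abstract names the two ingredients of the proposed method (message reordering and message count verification) but does not give the algorithm. The following Python sketch is only an illustration of how those two steps could fit together during recovery, not the thesis's actual procedure. It assumes, hypothetically, that every logged message carries a receive sequence number (`rsn`, starting at 1) and that the recovering process knows how many messages it had received (`expected_count`). The names `LoggedMessage`, `IncompleteLogError`, and `replay_order` are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedMessage:
    sender: str
    rsn: int       # assumed: receive sequence number assigned at delivery, starting at 1
    payload: str


class IncompleteLogError(Exception):
    """Raised when the log does not contain exactly the expected messages."""

    def __init__(self, missing):
        super().__init__(f"log is missing receive sequence numbers: {missing}")
        self.missing = missing


def replay_order(log, expected_count):
    """Return logged messages in delivery order, verifying the message count.

    Hypothetical sketch: reorder by rsn, drop duplicate log entries,
    then check that rsn 1..expected_count are all present.
    """
    # Duplicate handling: keep the first log entry seen for each rsn.
    unique = {}
    for msg in log:
        unique.setdefault(msg.rsn, msg)

    # Count verification: every rsn from 1 to expected_count must be present,
    # and no unexpected extra entries may remain.
    missing = [r for r in range(1, expected_count + 1) if r not in unique]
    if missing or len(unique) != expected_count:
        raise IncompleteLogError(missing)

    # Reordering: log writes are asynchronous, so file order may differ
    # from delivery order; sort by rsn to restore it.
    return [unique[r] for r in sorted(unique)]
```

In this sketch a failed count check signals that replay from the log alone cannot reproduce the pre-failure state, so some other recovery action (for example, rolling back further) would be needed. What that action should be is outside the scope of the example.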
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338.8;TP302.8

[References]