Research on Key Technologies of Coordinated Rollback Recovery for Cloud Platforms

Published: 2018-06-10 00:04

Topic: fault tolerance + cloud computing; Source: Harbin Institute of Technology, master's thesis, 2014

[Abstract]: Cloud computing builds on traditional technologies while adding new ideas: it processes large volumes of data on clusters of commodity computers, and it is becoming a popular computing model. The fault tolerance of cloud computing systems, however, is increasingly a bottleneck and urgently needs improvement. The rollback-recovery techniques addressed in this work, including coordinated checkpointing and message logging, are not new and are already widely used. They nevertheless fall short in cloud computing, since most of them provide fault tolerance only for individual virtual machine instances on the cloud platform. This thesis therefore studies rollback-recovery fault tolerance for cloud computing, with the aim of providing system-wide fault tolerance in a cloud environment.
The coordinated rollback-recovery system implemented in this thesis periodically takes semi-coordinated checkpoints. It synchronizes the virtual machines to avoid orphan messages and uses a message-flushing protocol to eliminate in-transit messages, so that checkpoints are globally consistent. When a virtual machine on the cloud platform fails, the system detects the failure quickly and performs rollback recovery of the cloud platform. In general, the virtual machine instances that a cloud platform allocates to different users are independent of one another, so rolling back all instances after a failure can waste a large amount of computation. To reduce the number of virtual machines that take part in a rollback, this thesis proposes a log-based coordinated checkpointing algorithm: when a virtual machine fails, only the virtual machines that have a dependency relationship with it are recovered. Unlike traditional fault-tolerance techniques, the fault-tolerance platform implemented here is transparent to applications and operating systems. Apart from the control module on the cloud platform management server, all functional modules are implemented in the virtual machine privileged domain, so neither application software nor the operating system needs to be modified.
After surveying and comparing various cloud platforms, the open-source software CloudStack and XenServer were chosen to build a small IaaS cloud platform on which the designed and implemented coordinated rollback-recovery system was tested. The results show that, while the coordinated rollback-recovery algorithms provide fault tolerance for the cloud platform, semi-coordinated checkpointing reduces user waiting time and the log-based coordinated rollback-recovery algorithm reduces the number of virtual machines involved in a rollback.
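The abstract's idea of rolling back only the virtual machines that depend on a failed one can be illustrated with a small sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes, purely for illustration, that a VM depends on a rolled-back VM if it received a message from it, directly or transitively, since the last checkpoint. The function name `rollback_set` and the `(sender, receiver)` log format are hypothetical.

```python
from collections import defaultdict, deque


def rollback_set(failed_vm, message_log):
    """Return the set of VMs to roll back when failed_vm fails.

    message_log is an iterable of (sender, receiver) pairs recorded since
    the last checkpoint (a hypothetical log format, assumed for this sketch).
    """
    # Map each sender to the VMs that received its messages.
    receivers = defaultdict(set)
    for sender, receiver in message_log:
        receivers[sender].add(receiver)

    # Breadth-first search: a message sent by a rolled-back VM would become
    # an orphan message at its receiver, so the receiver rolls back as well.
    to_roll_back = {failed_vm}
    queue = deque([failed_vm])
    while queue:
        vm = queue.popleft()
        for peer in receivers[vm]:
            if peer not in to_roll_back:
                to_roll_back.add(peer)
                queue.append(peer)
    return to_roll_back


# vm4 and vm5 belong to an unrelated group and are left running.
log = [("vm1", "vm2"), ("vm2", "vm3"), ("vm4", "vm5")]
print(sorted(rollback_set("vm1", log)))  # ['vm1', 'vm2', 'vm3']
```

Under this assumption, VMs that exchanged no messages with the failed group keep their progress, which is the saving the abstract attributes to the log-based approach.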
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.09
