Design and Implementation of a MapReduce Fault Recovery Mechanism

Published: 2018-05-30 10:14

Topic: cloud computing + MapReduce; source: Huazhong University of Science and Technology, 2012 master's thesis

[Abstract]: As large-scale data processing continues to grow, computing clusters are becoming larger and the demands on system reliability are rising accordingly. In clusters of this size, failures of many kinds are unavoidable, and task failures and node failures are especially common while MapReduce jobs are running. The existing fault handling in MapReduce, however, has some shortcomings, so studying and designing a fault recovery mechanism for the MapReduce computing model is of considerable value. This thesis first describes the concept, characteristics, and current state of cloud computing, briefly introduces the characteristics of Hadoop clusters, and on that basis explains why fault recovery for large-scale clusters is worth studying and surveys related work in China and abroad. It then briefly introduces the MapReduce computing model, covering its basic idea, working principle, and task scheduling flow, presents the main fault types in the model, and analyzes in depth how each type is handled. Next, building on the existing MapReduce computing model, the thesis adds an automatic node restart module so that each node can restart quickly after a failure, and it designs and implements a recovery mechanism for failed tasks so that a rescheduled task does not have to run from the beginning but can continue from the progress it had reached before the failure. With these optimizations, the cluster recovers more quickly from failures that occur during computation. Finally, the thesis tests and evaluates the functionality and performance of the optimized system. The results show that the optimized fault recovery mechanism achieves its intended functionality and outperforms the original MapReduce computing model.
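The abstract does not say how the thesis records task progress, so the following is only a rough illustration of the general idea of resuming a rescheduled task from saved progress. It is a minimal Python sketch, assuming progress is checkpointed to a local JSON file. All names (`run_map_task`, `run_with_retry`, `checkpoint_path`, `fail_at`, `checkpoint_every`) are hypothetical and are neither Hadoop APIs nor the thesis's implementation.

```python
import json
import os


class TaskFailure(Exception):
    """Simulated task failure (hypothetical, for illustration only)."""


def run_map_task(records, checkpoint_path, fail_at=None, checkpoint_every=2):
    """Count words in `records`, saving progress so a retry can resume.

    Returns (word_counts, records_processed_in_this_attempt).
    """
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"next_index": 0, "counts": {}}

    processed = 0
    for i in range(state["next_index"], len(records)):
        if fail_at is not None and i == fail_at:
            raise TaskFailure(f"simulated crash before record {i}")
        for word in records[i].split():
            state["counts"][word] = state["counts"].get(word, 0) + 1
        state["next_index"] = i + 1
        processed += 1
        # Persist progress periodically. Writing to a temporary file and then
        # renaming it keeps a crash mid-write from corrupting the checkpoint.
        if state["next_index"] % checkpoint_every == 0:
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, checkpoint_path)

    return state["counts"], processed


def run_with_retry(records, checkpoint_path, fail_at):
    """Run once with a simulated failure, then rerun as a rescheduled task."""
    try:
        run_map_task(records, checkpoint_path, fail_at=fail_at)
    except TaskFailure:
        pass  # a real scheduler would reschedule the task at this point
    return run_map_task(records, checkpoint_path)
```

In this sketch, work done after the last checkpoint is lost and repeated on retry, so the checkpoint interval trades checkpointing overhead against the amount of recomputation after a failure.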
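[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP306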
