Research and Implementation of a MapReduce Fault-Tolerance Method Based on Intermediate-Result Checkpointing
Published: 2018-04-28 11:48
Topic: checkpoint fault tolerance + intermediate results; Source: master's thesis, Inner Mongolia University, 2017
[Abstract]: With the rapid development of the Internet, the volume of data generated on the network has begun to grow explosively, and traditional storage and computing models can no longer meet the storage and computing demands of applications. Cloud computing, with its mature distributed processing techniques, has become the dominant data-processing approach; within it, MapReduce, an efficient parallel computing framework, is applied to big-data processing more and more widely. The MapReduce model faces two common failure types: task failures and node failures. For task failures, MapReduce adopts "re-execution": a failed task is rescheduled and run again from the beginning. Each re-execution not only wastes considerable computing resources but also lengthens the average task completion time and lowers computational efficiency. Node failures are generally divided into Master-node failures and Worker-node failures. Master-node failures are commonly handled by duplexing (a standby Master). For Worker-node failures, however, the intermediate results produced by Map tasks are stored on the Worker node, so a failure loses those results and tasks that had already completed must be reassigned and re-executed; for this failure type, the MapReduce computing model currently lacks a suitable, efficient fault-tolerance method.

To address the low fault-tolerance efficiency and wasted computing resources caused by these shortcomings of the current MapReduce fault-tolerance mechanism, this thesis applies checkpointing: task execution state and intermediate results are saved as checkpoint files, guaranteeing that intermediate results are not lost, so that when a job is recovered from its checkpoint files after a failure, recovery and re-execution are more efficient. The thesis completes three main pieces of work. (1) Analysis of the weaknesses of the MapReduce fault-tolerance mechanism in the Hadoop source code: by reading the Hadoop source, it studies how task failures and node failures are handled during MapReduce execution and where that handling falls short, providing the analytical basis for improving the current fault-tolerance approach. (2) Design and implementation of a checkpoint fault-tolerance mechanism: targeting the task and node failures common in MapReduce computation, the execution state of each task and the metadata of its intermediate results are saved as checkpoint files; when a task is reassigned for execution, the corresponding checkpoint file is used to resume it quickly. Specifically, a local-checkpoint mechanism is designed and implemented for task failures, and remote-checkpoint and metadata-query-checkpoint mechanisms are designed and implemented for node failures. (3) Testing of the checkpoint fault-tolerance mechanism: after design and implementation, a Hadoop cluster is set up, test applications are written, and faults are injected into them to verify that the checkpoint mechanism provides effective fault tolerance when failures occur; its fault-tolerance efficiency is then measured experimentally.
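The local-checkpoint idea summarized above — periodically persisting a task's execution state plus metadata about the intermediate output it has already produced, then resuming from the last checkpoint on re-execution instead of restarting from scratch — can be sketched as follows. This is an illustrative sketch only, not code from the thesis or from Hadoop; the class and method names (`CheckpointedMapTask`, `run`, `fail_at`) are invented for the example.

```python
import json
import os
import tempfile

class CheckpointedMapTask:
    """Sketch of a map task that checkpoints its progress so a re-execution
    resumes from the last saved offset rather than reprocessing everything."""

    def __init__(self, task_id, records, ckpt_dir):
        self.records = records  # the task's input split
        self.ckpt_path = os.path.join(ckpt_dir, f"{task_id}.ckpt")

    def _save_checkpoint(self, offset, outputs):
        # Persist execution state + intermediate-result data atomically:
        # write to a temp file, then rename over the old checkpoint.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"offset": offset, "outputs": outputs}, f)
        os.replace(tmp, self.ckpt_path)

    def _load_checkpoint(self):
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                return json.load(f)
        return {"offset": 0, "outputs": []}  # fresh start: no checkpoint yet

    def run(self, map_fn, fail_at=None, ckpt_interval=2):
        """Process records from the last checkpoint onward; `fail_at` lets a
        test inject a task fault at a given record index."""
        state = self._load_checkpoint()
        outputs = state["outputs"]
        for i in range(state["offset"], len(self.records)):
            if fail_at is not None and i == fail_at:
                raise RuntimeError("injected task fault")
            outputs.append(map_fn(self.records[i]))
            if (i + 1) % ckpt_interval == 0:
                self._save_checkpoint(i + 1, outputs)
        self._save_checkpoint(len(self.records), outputs)
        return outputs
```

In use, a first run that fails partway (e.g. via `fail_at=3`) leaves a checkpoint behind; a second `run` on a fresh task object with the same `task_id` reloads it and only redoes the records after the last checkpointed offset, which is the efficiency gain the thesis aims for. Work done after the last checkpoint but before the fault (record 2 in this example) is redone, which is the usual checkpoint-interval trade-off.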
【Degree-granting institution】: Inner Mongolia University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP302.8; TP311.13