
Research on MapReduce Fault-Tolerance Techniques in a Cloud Environment

Published: 2018-08-25 10:29
【Abstract】: Cloud computing has become one of the most important technologies in today's computer industry. With the rapid development of cloud technology, data has gradually shifted from traditional structured data toward semi-structured and unstructured data, and its volume has grown explosively. Traditional database technology can no longer cope with data at this scale, so processing this big data has become an urgent problem. In 2004, Google proposed its solution, MapReduce, to meet the challenges that big data poses in the cloud era.

Simply put, MapReduce is a programming model for batch, parallel processing of massive data. It not only addresses the performance problem of processing such data, but also simplifies how programmers develop distributed parallel programs. More importantly, MapReduce handles scalability and reliability well, which is its biggest advantage over traditional databases. A wide range of research has grown up around this emerging programming framework, and its fault tolerance has remained one of the hot topics. Existing fault-tolerance schemes, at home and abroad, fall broadly into two approaches: re-execution and backup. Both aim to run recovery operations after a failure has been discovered, but neither can take full effect if the failure is not detected in time. This thesis therefore studies MapReduce fault tolerance from a new angle: how to detect failed nodes more quickly and more accurately.

To that end, the thesis proposes two ideas: an adaptive timeout and a reputation-based detection model. The adaptive timeout replaces the strict, fixed expiry interval used in a MapReduce cluster: the execution time of each job is first estimated, and the timeout is then adapted to that estimate. At run time, if the JobTracker receives no heartbeat from a node within the adaptive timeout, the node is considered failed. The reputation-based detection model assigns each node a reputation value and evaluates it in real time, using the failures that reduce tasks encounter when remotely fetching map output. If a node's reputation decays below a preset lower bound because of too many failed fetches, the node is considered failed.

Extensive experiments show that both schemes clearly outperform the original Hadoop cluster: when a node fails, they substantially shorten the time needed to discover the failure. A comparison of the two further shows that the adaptive timeout favors short jobs, while the reputation-based model suits large jobs better. Used together with existing fault-tolerance techniques, they give a Hadoop cluster stronger fault tolerance overall: failures are both located quickly and recovered from quickly. The main contributions of this thesis are the adaptive timeout and the reputation-based detection model, which also broaden the lines of research on Hadoop fault tolerance.
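To make the programming model concrete, here is a toy, single-process word count in Java. It only illustrates the map and reduce phases; the class name WordCountSketch and the sample input are invented for this sketch, and a real Hadoop job would instead implement the framework's Mapper and Reducer classes and run across a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Toy, single-process illustration of the map/reduce model:
 * map emits words, the framework groups them by key, and
 * reduce sums the counts. Hypothetical example, not Hadoop code.
 */
public class WordCountSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("big data needs big clusters",
                                     "map and reduce split the work");

        // Map phase: split each line into words.
        // Shuffle + reduce phase: group by word and count occurrences.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Here the in-memory grouping by key stands in for the shuffle phase, which a real cluster performs over the network between map and reduce nodes.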
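The abstract only outlines the adaptive-timeout mechanism, so the following is a minimal sketch of one plausible reading of it. The class AdaptiveTimeoutMonitor, its minTimeoutMs and scale parameters, and the way the job-runtime estimate is obtained are all assumptions made for illustration; the thesis does not publish its formula here, and stock Hadoop 1.x instead expires a TaskTracker after a fixed interval (10 minutes by default).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of the adaptive-timeout idea: the expiry interval
 * follows the estimated job runtime instead of a fixed constant.
 * All names and parameters here are illustrative assumptions.
 */
public class AdaptiveTimeoutMonitor {
    // Timestamp of the last heartbeat seen from each node, in ms.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    private final long minTimeoutMs;   // floor so short jobs don't misfire
    private final double scale;        // fraction of the estimated runtime

    public AdaptiveTimeoutMonitor(long minTimeoutMs, double scale) {
        this.minTimeoutMs = minTimeoutMs;
        this.scale = scale;
    }

    /** Record a heartbeat from a node. */
    public void onHeartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    /** Derive the timeout from the estimated job runtime: short jobs
     *  get short timeouts, so their failures surface sooner. */
    public long adaptiveTimeout(long estimatedRuntimeMs) {
        return Math.max(minTimeoutMs, (long) (scale * estimatedRuntimeMs));
    }

    /** A node is declared failed once its heartbeat is older than the timeout. */
    public boolean isFailed(String nodeId, long estimatedRuntimeMs) {
        Long last = lastHeartbeat.get(nodeId);
        if (last == null) return false; // never seen; nothing to judge yet
        return System.currentTimeMillis() - last > adaptiveTimeout(estimatedRuntimeMs);
    }
}
```

The floor value exists so that a very short runtime estimate cannot produce a timeout so tight that ordinary heartbeat jitter is misread as a node failure.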
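Likewise, a minimal sketch of the reputation-based detector, under stated assumptions: the class ReputationDetector, the multiplicative decay per failed fetch, and the recovery rule on successful fetches are illustrative choices, since the abstract specifies only that reduce-side fetch failures lower a node's reputation until it crosses a preset lower bound.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of a reputation-based failure detector. The decay
 * factor, threshold, and recovery rule are assumptions, not values
 * taken from the thesis.
 */
public class ReputationDetector {
    private final Map<String, Double> reputation = new ConcurrentHashMap<>();

    private final double initial;    // reputation assigned to a fresh node
    private final double decay;      // multiplicative penalty per failed fetch (< 1)
    private final double threshold;  // below this, the node is declared failed

    public ReputationDetector(double initial, double decay, double threshold) {
        this.initial = initial;
        this.decay = decay;
        this.threshold = threshold;
    }

    /** Called when a reduce task fails to fetch map output from a host. */
    public void onFetchFailure(String host) {
        // Absent hosts start from the initial value with one penalty applied.
        reputation.merge(host, initial * decay, (old, unused) -> old * decay);
    }

    /** Called on a successful fetch; reputation recovers, capped at initial. */
    public void onFetchSuccess(String host) {
        reputation.merge(host, initial, (old, unused) -> Math.min(initial, old / decay));
    }

    /** The node is considered failed once its reputation sinks below threshold. */
    public boolean isFailed(String host) {
        return reputation.getOrDefault(host, initial) < threshold;
    }
}
```

In this reading, a shuffle handler would call onFetchFailure whenever a reduce task reports that it could not pull map output from a host, and the scheduler would poll isFailed before assigning further work there.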
【Degree-granting institution】: Shanghai Jiao Tong University
【Degree level】: Master's
【Year conferred】: 2012
【CLC number】: TP302.8





Link to this article: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2202608.html

