
Research on MapReduce Fault-Tolerance Techniques in a Cloud Environment

Published: 2018-08-25 10:29
【Abstract】: Cloud computing has become one of the most important technologies in today's computer industry. With the rapid development of cloud technology, data has gradually shifted from traditional structured data toward semi-structured and unstructured data, and its volume has grown explosively. Traditional database technology can no longer cope with data at this scale, so processing this big data has become an urgent problem. In 2004, Google proposed its solution, MapReduce, to meet the challenges that big data poses in the cloud era.

Simply put, MapReduce is a programming model for batch, parallel processing of massive data. It not only addresses the performance problem of processing such data, but also simplifies how programmers develop distributed parallel programs. More importantly, MapReduce handles scalability and reliability well, which is its biggest advantage over traditional databases. A wide range of research has grown up around this emerging programming framework, and its fault tolerance has remained one of the hot topics. Existing fault-tolerance schemes, at home and abroad, fall broadly into two approaches: re-execution and backup. Both aim to run recovery operations after a failure has been discovered, but neither can take full effect if the failure is not detected in time. This thesis therefore studies MapReduce fault tolerance from a new angle: how to detect failed nodes more quickly and more accurately.

To that end, the thesis proposes two ideas: an adaptive timeout and a reputation-based detection model. The adaptive timeout replaces the strict, fixed expiry interval used in a MapReduce cluster: the execution time of each job is first estimated, and the timeout is then adapted to that estimate. At run time, if the JobTracker receives no heartbeat from a node within the adaptive timeout, the node is considered failed. The reputation-based detection model assigns each node a reputation value and evaluates it in real time, using the failures that reduce tasks encounter when remotely fetching map output. If a node's reputation decays below a preset lower bound because of too many failed fetches, the node is considered failed.

Extensive experiments show that both schemes clearly outperform the original Hadoop cluster: when a node fails, they substantially shorten the time needed to discover the failure. A comparison of the two further shows that the adaptive timeout favors short jobs, while the reputation-based model suits large jobs better. Used together with existing fault-tolerance techniques, they give a Hadoop cluster stronger fault tolerance overall: failures are both located quickly and recovered from quickly. The main contributions of this thesis are the adaptive timeout and the reputation-based detection model, which also broaden the lines of research on Hadoop fault tolerance.
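To make the programming model concrete, here is a toy, single-process word count in Java. It only illustrates the map and reduce phases; the class name WordCountSketch and the sample input are invented for this sketch, and a real Hadoop job would instead implement the framework's Mapper and Reducer classes and run across a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Toy, single-process illustration of the map/reduce model:
 * map emits words, the framework groups them by key, and
 * reduce sums the counts. Hypothetical example, not Hadoop code.
 */
public class WordCountSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("big data needs big clusters",
                                     "map and reduce split the work");

        // Map phase: split each line into words.
        // Shuffle + reduce phase: group by word and count occurrences.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Here the in-memory grouping by key stands in for the shuffle phase, which a real cluster performs over the network between map and reduce nodes.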
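The abstract only outlines the adaptive-timeout mechanism, so the following is a minimal sketch of one plausible reading of it. The class AdaptiveTimeoutMonitor, its minTimeoutMs and scale parameters, and the way the job-runtime estimate is obtained are all assumptions made for illustration; the thesis does not publish its formula here, and stock Hadoop 1.x instead expires a TaskTracker after a fixed interval (10 minutes by default).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of the adaptive-timeout idea: the expiry interval
 * follows the estimated job runtime instead of a fixed constant.
 * All names and parameters here are illustrative assumptions.
 */
public class AdaptiveTimeoutMonitor {
    // Timestamp of the last heartbeat seen from each node, in ms.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    private final long minTimeoutMs;   // floor so short jobs don't misfire
    private final double scale;        // fraction of the estimated runtime

    public AdaptiveTimeoutMonitor(long minTimeoutMs, double scale) {
        this.minTimeoutMs = minTimeoutMs;
        this.scale = scale;
    }

    /** Record a heartbeat from a node. */
    public void onHeartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    /** Derive the timeout from the estimated job runtime: short jobs
     *  get short timeouts, so their failures surface sooner. */
    public long adaptiveTimeout(long estimatedRuntimeMs) {
        return Math.max(minTimeoutMs, (long) (scale * estimatedRuntimeMs));
    }

    /** A node is declared failed once its heartbeat is older than the timeout. */
    public boolean isFailed(String nodeId, long estimatedRuntimeMs) {
        Long last = lastHeartbeat.get(nodeId);
        if (last == null) return false; // never seen; nothing to judge yet
        return System.currentTimeMillis() - last > adaptiveTimeout(estimatedRuntimeMs);
    }
}
```

The floor value exists so that a very short runtime estimate cannot produce a timeout so tight that ordinary heartbeat jitter is misread as a node failure.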
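Likewise, a minimal sketch of the reputation-based detector, under stated assumptions: the class ReputationDetector, the multiplicative decay per failed fetch, and the recovery rule on successful fetches are illustrative choices, since the abstract specifies only that reduce-side fetch failures lower a node's reputation until it crosses a preset lower bound.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of a reputation-based failure detector. The decay
 * factor, threshold, and recovery rule are assumptions, not values
 * taken from the thesis.
 */
public class ReputationDetector {
    private final Map<String, Double> reputation = new ConcurrentHashMap<>();

    private final double initial;    // reputation assigned to a fresh node
    private final double decay;      // multiplicative penalty per failed fetch (< 1)
    private final double threshold;  // below this, the node is declared failed

    public ReputationDetector(double initial, double decay, double threshold) {
        this.initial = initial;
        this.decay = decay;
        this.threshold = threshold;
    }

    /** Called when a reduce task fails to fetch map output from a host. */
    public void onFetchFailure(String host) {
        // Absent hosts start from the initial value with one penalty applied.
        reputation.merge(host, initial * decay, (old, unused) -> old * decay);
    }

    /** Called on a successful fetch; reputation recovers, capped at initial. */
    public void onFetchSuccess(String host) {
        reputation.merge(host, initial, (old, unused) -> Math.min(initial, old / decay));
    }

    /** The node is considered failed once its reputation sinks below threshold. */
    public boolean isFailed(String host) {
        return reputation.getOrDefault(host, initial) < threshold;
    }
}
```

In this reading, a shuffle handler would call onFetchFailure whenever a reduce task reports that it could not pull map output from a host, and the scheduler would poll isFailed before assigning further work there.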
【Degree-granting institution】: Shanghai Jiao Tong University
【Degree level】: Master's
【Year conferred】: 2012
【CLC number】: TP302.8





Link to this article: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2202608.html

