Research on Iterative Distributed Data Processing Based on MapReduce
Published: 2018-05-04 18:47
Topic: MapReduce + distributed; Source: Shandong University, 2013 master's thesis
[Abstract]: The information age is the age of data. As data volumes grow rapidly, data processing in many fields has far outgrown what a personal computer can handle and is increasingly massive and parallel in nature. Traditional parallel programming techniques such as MPI and grid computing are complex to develop with and scale poorly, so they cannot meet the growing demands of large-scale data processing. A new and better programming model for large-scale data processing is urgently needed. MapReduce emerged in response to this challenge.

MapReduce is a distributed programming framework, first proposed by Google, for parallel computation over large data sets. It is simple to program, fault tolerant, and easy to scale, and it greatly simplifies the implementation of parallel processing of massive data on clusters. Since its introduction, MapReduce has attracted a great deal of attention and a large body of related research, and it has been widely applied in a growing number of practical scenarios.

However, existing conventional MapReduce implementations such as Hadoop and Sphere cannot support iterative data processing effectively, even though iterative computation is a very important class of applications in practice. In scientific computing, data mining, information retrieval, machine learning, and other fields, many algorithms are implemented through repeated iteration. Improving the efficiency of iterative data processing in MapReduce is therefore a pressing research problem of considerable practical value. This thesis analyzes the problem in depth and, by extending and modifying Hadoop, proposes an improved MapReduce framework named myHadoop.

myHadoop improves the programming model and the task scheduler, adopts a new task parallelism strategy, and adds a loop control module and a data cache module. These changes not only extend MapReduce's programming support for iterative programs but also greatly improve their execution efficiency. The thesis first analyzes how MapReduce handles iterative programs and the problems that arise, then describes the design and implementation of myHadoop in detail. Finally, experiments on several typical applications compare the iterative distributed data processing efficiency of myHadoop with that of Hadoop. The thesis also discusses how to set the number of Map task splits when using myHadoop, and how myHadoop handles non-iterative data processing.
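The two additions the abstract names, a loop control module and a data cache module, can be illustrated with a small in-memory sketch. This is not myHadoop's API or implementation, which the abstract does not describe. The function names (`run_mapreduce`, `iterate`), the three-node link graph, the damping factor, and the choice of a PageRank-style update as the iterative workload are all assumptions made for illustration. The sketch only shows the general idea: a driver repeats a MapReduce pass until a convergence test succeeds, while loop-invariant data is loaded once and reused by every pass.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one in-memory MapReduce pass: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def iterate(state, invariant, make_mapper, reducer, converged, max_iters=500):
    """Hypothetical loop-control driver (illustrative, not myHadoop's design).

    `invariant` stands in for cached loop-invariant data: it is built once,
    outside the loop, and every iteration reuses the same object.
    """
    mapper = make_mapper(invariant)
    for iteration in range(1, max_iters + 1):
        new_state = run_mapreduce(state.items(), mapper, reducer)
        if converged(state, new_state):
            return new_state, iteration
        state = new_state
    return state, max_iters


# Assumed toy workload: a PageRank-style update on a three-node graph.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # loop-invariant data
DAMPING = 0.85


def make_mapper(links):
    def mapper(record):
        node, rank = record
        yield node, 0.0  # keep every node present in the output
        for target in links[node]:
            yield target, rank / len(links[node])
    return mapper


def reducer(node, contributions):
    return (1 - DAMPING) / len(LINKS) + DAMPING * sum(contributions)


def converged(old, new, tol=1e-8):
    return max(abs(new[k] - old[k]) for k in new) < tol


initial = {node: 1.0 / len(LINKS) for node in LINKS}
ranks, iterations = iterate(initial, LINKS, make_mapper, reducer, converged)
```

In a plain Hadoop setup, each pass of such a loop would typically be submitted as a separate job by an external driver, and the unchanged link structure would be re-read on every pass. Here the loop and the reuse of `LINKS` live inside the driver, which is the kind of overhead the abstract says myHadoop targets.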
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338.8
Article ID: 1844205
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1844205.html