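Research on Hadoop-Based Job Scheduling Schemes
Published: 2018-04-03 13:13
Topic: cluster; Focus: job scheduling; Source: Northeastern University (东北大学), master's thesis, 2013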
[Abstract]: In recent years, as information technology has advanced and enterprise digitization has deepened, the volume of data that enterprises must process has grown explosively. To improve enterprises' process efficiency, profitability, and production capacity, a series of new technologies has emerged, with cloud computing as the representative example. Hadoop is an open-source parallel distributed computing platform that belongs to the PaaS service layer of cloud computing. Job scheduling in Hadoop means allocating idle system resources to jobs according to a scheduling policy; the quality of that policy affects the cluster's resource utilization, job completion time, and overall cluster performance. This thesis analyzes the MapReduce and HDFS architectures in Hadoop and describes Hadoop's scheduling process and how to write a scheduler. The Hadoop platform currently relies mainly on four schedulers: the default FIFO scheduler, the Fair scheduler, the Capacity scheduler, and the speculative task scheduler. The thesis presents the algorithmic ideas behind these schedulers, compares the performance of the four experimentally, and analyzes their shortcomings. On this basis, it proposes a job scheduling scheme consisting of a scheduler and a cluster load-balancing algorithm, and describes in detail the core idea of the algorithm, its pseudocode, and the class diagrams used by the scheme. In the experimental chapter, simulation experiments with a Java program are used to test the scheduler's parameters and obtain a parameter combination with better performance. A Hadoop cluster is then built to test the performance of the load-balancing algorithm, after which the complete job scheduling scheme is deployed on the cluster, tested in both homogeneous and heterogeneous environments, and compared with Hadoop's original schedulers. The experimental results show that, in a heterogeneous environment, the scheme outperforms the original schedulers on two metrics: total job running time and average turnaround time.
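The two metrics named in the abstract, total job running time and average turnaround time, can be illustrated with a toy model. The sketch below is not the thesis's scheduler or its experiment: it assumes a single execution slot, three hypothetical jobs that are all submitted at time 0, and an idealized equal-share policy standing in for fair scheduling. The `Job` class, the function names, and the workload numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    work: float  # seconds of work on the single slot (hypothetical unit)


def fifo_finish_times(jobs):
    """Run jobs one at a time, in submission order, on one slot."""
    clock = 0.0
    finish = {}
    for job in jobs:
        clock += job.work
        finish[job.name] = clock
    return finish


def fair_finish_times(jobs):
    """Split the single slot equally among all unfinished jobs."""
    remaining = {job.name: job.work for job in jobs}
    clock = 0.0
    finish = {}
    while remaining:
        active = len(remaining)
        least = min(remaining.values())
        # Each active job gets 1/active of the slot, so the job with the
        # least remaining work completes after least * active seconds.
        clock += least * active
        for name in list(remaining):
            remaining[name] -= least
            if remaining[name] <= 1e-9:
                finish[name] = clock
                del remaining[name]
    return finish


def metrics(finish):
    """Return (total running time, average turnaround time).

    All jobs are submitted at time 0, so turnaround equals finish time.
    """
    total = max(finish.values())
    average = sum(finish.values()) / len(finish)
    return total, average


# Hypothetical workload: one long job submitted ahead of two short ones.
jobs = [Job("long", 10.0), Job("short_a", 1.0), Job("short_b", 1.0)]
fifo_total, fifo_avg = metrics(fifo_finish_times(jobs))
fair_total, fair_avg = metrics(fair_finish_times(jobs))
print("FIFO:", fifo_total, fifo_avg)
print("Fair:", fair_total, fair_avg)
```

In this toy workload both policies finish all work at the same time, but equal sharing lets the short jobs complete early, which lowers the average turnaround time. Real Hadoop schedulers allocate many task slots across nodes, so this model only shows how the two metrics are computed, not how any Hadoop scheduler behaves.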
[Degree-granting institution]: Northeastern University (东北大学)
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338.6
Article ID: 1705342
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1705342.html