Research on an Energy-Aware Scheduler for the Hadoop Platform

Published: 2019-02-13 03:21

[Abstract]: Data in every industry is growing rapidly, and both academia and industry have recognized the value hidden in it. This demand has driven the development of many data analysis frameworks and platforms, among which Hadoop is currently the most popular open-source platform; it implements the MapReduce computing model and the GFS storage model proposed by Google. Accumulating greenhouse gases are changing the global climate, so data center construction should give priority to reducing carbon emissions; at the same time, enterprises are spending more and more on data center electricity. As the number of hosts in Hadoop clusters keeps increasing, controlling data center energy consumption is becoming a more prominent problem. Studying how to reduce the energy consumption of a Hadoop cluster at the platform level therefore matters both for environmental protection and for lowering enterprise costs. Based on the working principles of Hadoop and the architecture of the MapReduce runtime environment, this thesis builds an energy consumption control architecture in the Hadoop platform from the perspective of resource and task scheduling. The single-queue first-in, first-out scheduler (FIFO Scheduler) and the Capacity Scheduler are two commonly used schedulers that ship with the platform. By testing and analyzing them, the thesis summarizes their shortcomings as a basis for an energy control framework on Hadoop. To address these shortcomings, it designs and implements an energy-aware Hadoop scheduler that contains an energy control framework and uses a two-layer scheduling strategy to map jobs to resources in an energy-saving way.
The energy-aware scheduler has two characteristics: 1) it can adjust and balance QoS against total energy consumption while jobs run on the Hadoop cluster; 2) its scheduling strategy is itself efficient. The overall framework is built on multiple queues. The two-layer scheduling strategy dynamically matches the tasks of each job to computing resources in an energy-saving way, and its time complexity is linear. Jobs are assigned to the queues with a method similar to consistent hashing, which keeps job-to-queue assignment efficient and dynamic and supports high concurrency. Finally, a Hadoop cluster of 32 virtual machines was built on the XCP (Xen Cloud Platform) cloud platform. On this cluster, the energy-saving scheduler was compared experimentally with Hadoop's built-in FIFO Scheduler and Capacity Scheduler. One set of experiments compares the total energy consumption and running time of jobs under the different schedulers for different job inputs; another evaluates the energy-saving scheduler's own ability to control job energy consumption and running time. The results show that the energy-saving scheduler controls energy consumption well without increasing the running time of cluster jobs, and that it can also adjust the balance between job running time and energy consumption effectively.
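The abstract states only that jobs are assigned to queues with a method "similar to consistent hashing" and gives no further detail. As an illustration of the general idea, the following is a minimal Python sketch of plain consistent hashing with virtual nodes. It is not the thesis's implementation: the class `QueueRing`, its methods, the `replicas` parameter, and the use of MD5 are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 is stable across runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class QueueRing:
    """Hypothetical consistent-hash ring mapping job IDs to queue names."""

    def __init__(self, queues, replicas=64):
        self.replicas = replicas  # virtual nodes per queue (assumed value)
        self._points = []         # sorted hash points on the ring
        self._owner = {}          # hash point -> queue name
        for queue in queues:
            self.add_queue(queue)

    def add_queue(self, queue):
        """Place `replicas` virtual nodes for the queue on the ring."""
        for i in range(self.replicas):
            point = _hash(f"{queue}#{i}")
            self._owner[point] = queue
            bisect.insort(self._points, point)

    def remove_queue(self, queue):
        """Remove the queue's virtual nodes from the ring."""
        for i in range(self.replicas):
            point = _hash(f"{queue}#{i}")
            if self._owner.get(point) == queue:
                del self._owner[point]
                self._points.remove(point)

    def queue_for(self, job_id):
        """Return the queue owning the first point clockwise of the job's hash."""
        if not self._points:
            raise ValueError("no queues on the ring")
        idx = bisect.bisect_right(self._points, _hash(job_id)) % len(self._points)
        return self._owner[self._points[idx]]
```

With this structure, looking up a queue is a binary search over the ring, and adding a queue moves only the jobs that now fall on the new queue's points; every other job keeps its previous queue. That property is one plausible reason a consistent-hashing-style scheme suits dynamic job-to-queue assignment, though the thesis abstract does not say which variant it uses.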
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.09
