Research and Development of a New Scheduler for the Hadoop Cloud Platform Based on an Improved Simulated Annealing Algorithm

Published: 2018-07-27 11:59

[Abstract]: With the rapid growth of cloud computing platforms, more and more universities, research institutes, IT companies, and Internet enterprises are studying and launching cloud platform projects to prepare for the era of big data. Among these platforms, Apache Hadoop, being fully open source, is favored by most enterprises, engineers, and researchers, many of whom have joined in researching and developing the Hadoop cloud computing platform. At the same time, cloud service providers face ever larger and more complex data processing workloads, and structured and unstructured data at the petabyte scale strain existing Hadoop platforms. For certain special kinds of jobs, native Hadoop can no longer cope effectively with the complex tasks that users submit. This thesis addresses the excessive task waiting times and long job completion times that Hadoop's existing schedulers exhibit under the MapReduce framework when handling jobs with large memory requirements. It studies the scheduling policies of different schedulers, then proposes and designs a queue-level scheduling policy based on the simulated annealing algorithm. The policy uses queue resource utilization as the annealing probability, takes the expected job completion time and resource limits as design parameters, and exploits the high efficiency and weak dependence on initial conditions of simulated annealing to improve the scheduling performance of the Capacity Scheduler. The work is organized as follows. First, the thesis analyzes the design philosophy and operating mechanism of the current Hadoop platform and the MapReduce processing framework, and examines Hadoop's existing schedulers in depth: the default FIFO (first-in, first-out) scheduler; the Fair Scheduler and the Capacity Scheduler that ship with Hadoop; and the resource-aware scheduler and the adaptive scheduler, which were formally proposed in the MapReduce issue list and have been designed but were not officially used in Hadoop versions before 2.0. For these five schedulers, it discusses their design philosophies, analyzes their scheduling mechanisms, and identifies the problems present in each. Second, building on the common problems found in the existing schedulers, the thesis proposes and designs a new scheduler that effectively relieves the difficulty earlier schedulers have in scheduling jobs with large memory requirements. The design uses an improved simulated annealing algorithm: the thesis first analyzes the traditional algorithm, then describes how it is adapted for use in a scheduler, designs a new simulated-annealing-based scheduling policy in line with how Hadoop schedulers work, and develops a new Hadoop scheduler from that policy. Finally, the thesis tests the new scheduler in practice, including switching freely between schedulers in Hadoop, scheduling tests for different types of jobs, and comparison tests against the Capacity Scheduler on the same type of job. The experiments show that, when scheduling jobs with large memory requirements, the new scheduler effectively reduces task waiting and achieves shorter job completion times and better resource utilization. It provides the basic functions required of a Hadoop scheduler and also schedules jobs reasonably in special cases.
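The abstract does not give the thesis's actual algorithm, so the following is only a minimal Python sketch of the general idea of using simulated annealing to choose a queue for a memory-heavy job. Everything in it is an assumption made for illustration: the `Queue` class, the `placement_cost` function (utilization after placing the job), the parameter defaults, and the standard Metropolis acceptance rule, which stands in for the thesis's improved rule based on queue resource utilization.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Queue:
    """Hypothetical model of a scheduler queue (not a Hadoop class)."""
    name: str
    capacity_mb: int  # memory allotted to the queue
    used_mb: int      # memory currently in use


def placement_cost(queue: Queue, job_mb: int) -> float:
    """Assumed cost: queue utilization after placing the job.

    Returns infinity when the job does not fit in the queue.
    """
    after = queue.used_mb + job_mb
    if queue.capacity_mb <= 0 or after > queue.capacity_mb:
        return math.inf
    return after / queue.capacity_mb


def pick_queue(queues, job_mb, t_start=1.0, t_end=1e-3, alpha=0.9, rng=None):
    """Pick a queue for the job with textbook simulated annealing.

    Returns the best feasible queue seen, or None if none was found.
    """
    rng = rng or random.Random()
    if not queues:
        return None
    current = rng.choice(queues)
    best = current
    t = t_start
    while t > t_end:
        candidate = rng.choice(queues)
        cand_cost = placement_cost(candidate, job_mb)
        if cand_cost != math.inf:  # never move to an infeasible queue
            delta = cand_cost - placement_cost(current, job_mb)
            # Always accept improvements; accept a worse queue with
            # probability exp(-delta / t), which shrinks as t cools.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current = candidate
            if placement_cost(current, job_mb) < placement_cost(best, job_mb):
                best = current
        t *= alpha  # geometric cooling schedule
    return best if placement_cost(best, job_mb) != math.inf else None


if __name__ == "__main__":
    demo = [
        Queue("A", capacity_mb=4096, used_mb=3500),
        Queue("B", capacity_mb=8192, used_mb=1024),
        Queue("C", capacity_mb=2048, used_mb=0),
    ]
    chosen = pick_queue(demo, job_mb=3000, rng=random.Random(0))
    print(chosen.name if chosen else "no queue can hold the job")
```

With only a handful of queues an exhaustive scan would be simpler; the annealing loop is shown because it is the technique the thesis names, and a real scheduler would also weigh factors such as expected job completion time, which this sketch leaves out.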
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.05

[References]