Research on SLA-Based MapReduce Scheduling Mechanisms

Published: 2019-03-09 14:24

【Abstract】: MapReduce has been widely adopted as an effective solution for large-scale data analysis and processing. As its use has grown, more and more service providers offer commercial MapReduce services: the provider runs MapReduce jobs that implement a series of business logic and returns the analysis and processing results to the user. To protect both parties, the user and the service provider sign a service level agreement (SLA). The provider must honor the SLA and meet performance requirements such as job response time, or it may incur a penalty for breach. How to schedule jobs and tasks effectively so that users' SLAs are met has therefore become a key concern for service providers.

The diversity of SLAs and the shared nature of clusters make this problem challenging. 1) Differing user needs lead to diverse job types. A cluster may simultaneously run ad hoc queries, large production jobs, machine learning jobs, and so on; even over the same data set, short interactive jobs and long batch jobs may be mixed together. Users accordingly place very different requirements on job response time in their SLAs. 2) To save the network and storage costs of building separate clusters and replicating data across them, service providers share a MapReduce cluster among multiple user groups. As a result, job performance is easily affected by other concurrent jobs, which makes meeting users' SLAs harder.

Existing MapReduce scheduling mechanisms focus on fair sharing of cluster resources among users, or allocate and schedule resources through priority-based policies. These mechanisms are not SLA-aware: job priorities are too coarse-grained to reflect the specific differences among users' SLAs, and no accurate mapping between priorities and SLAs can be established. They also lack awareness of dynamic changes in cluster state and job execution state, so they cannot meet users' SLAs accurately and effectively. To address these problems and challenges, this thesis proposes an SLA-based MapReduce scheduling mechanism, covering job performance modeling, job-level scheduling, and task-level scheduling optimization. The main work and contributions are as follows:

1. An SLA-based MapReduce scheduling architecture is proposed. It introduces a pluggable scheduling support node that provides flexible support for users' SLAs at both the job level and the task level. A dynamic, adaptive job performance model is given for this architecture; based on historical records and on cluster and job runtime state, the model accurately predicts and determines whether a job is likely to violate the response time upper bound in its SLA.

2. To handle the differences among users' SLAs, a two-phase SLA-based job scheduling mechanism is proposed on top of the job performance model. It predicts the minimum amount of resources needed to meet each user's SLA and the expected marginal revenue of each job, partitions cluster resources accordingly, and schedules jobs so as to satisfy users' SLAs to the greatest extent possible, avoid blind allocation of idle cluster resources, and increase the overall revenue the service provider can obtain.

3. Building on the job-level scheduling policy, a data-distribution-aware task assignment optimization mechanism is proposed. It reduces, as far as possible, the data movement cost incurred while executing the tasks that make up a job, and thereby, through the architecture's feedback loop, improves execution efficiency, shortens job response time, and raises the SLA satisfaction rate. With awareness of data distribution as its core idea, the mechanism takes into account the different input data distribution characteristics of map tasks and reduce tasks, using the local scheduling weight of tasks and the data transfer cost, respectively, as the basis for a greedy task assignment.

4. Experiments evaluate the accuracy of the job performance model, the effectiveness of the job-level scheduling policy in meeting users' SLAs, and the degree to which task-level assignment optimization improves task execution efficiency, verifying the feasibility and effectiveness of the work.
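The abstract states that reduce tasks are assigned greedily on the basis of data transfer cost, but gives no algorithm. The following is only a minimal sketch of that general idea, not the thesis's method. The function name `assign_reduce_tasks`, the input layout, the cost definition (bytes that must be fetched from other nodes), and the largest-partition-first ordering are all assumptions made for illustration.

```python
from typing import Dict


def assign_reduce_tasks(
    partition_bytes: Dict[str, Dict[str, int]],
    free_slots: Dict[str, int],
) -> Dict[str, str]:
    """Greedily place each reduce task on the node with the lowest transfer cost.

    partition_bytes[task][node]: bytes of the task's input already stored on node
                                 (hypothetical input layout).
    free_slots[node]:            number of free reduce slots on node.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment: Dict[str, str] = {}

    # Assumed heuristic: handle the largest partitions first, since a poor
    # placement for them moves the most data.
    order = sorted(
        partition_bytes,
        key=lambda t: sum(partition_bytes[t].values()),
        reverse=True,
    )

    for task in order:
        total = sum(partition_bytes[task].values())
        candidates = [node for node, free in slots.items() if free > 0]
        if not candidates:
            break  # no capacity left; remaining tasks wait for a later round
        # Transfer cost = bytes that are NOT already local to the candidate node.
        # The node name is a tie-breaker so the result is deterministic.
        best = min(
            candidates,
            key=lambda n: (total - partition_bytes[task].get(n, 0), n),
        )
        assignment[task] = best
        slots[best] -= 1

    return assignment
```

For example, with two reduce tasks whose input sits mostly on node `a`, and only one free slot on `a`, the larger task takes `a` and the smaller one falls back to `b`. A real scheduler would also have to weigh SLA deadlines and map-side locality, which this sketch ignores.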
【Degree-granting institution】: Shandong University
【Degree level】: Master's
【Year degree conferred】: 2014
【Classification number】: TP393.09

【References】