Research and Design of a MapReduce Cluster Simulator for Performance Tuning

Published: 2018-05-24 04:43

Topic: cloud computing + MapReduce; Source: Hangzhou Dianzi University, master's thesis, 2013

[Abstract]: Internet applications today face the problem of storing and processing massive volumes of data, and rapid data growth poses a major challenge to the scalability of data processing systems. The rise of cloud technologies, typified by MapReduce, offers a workable solution for processing data at this scale. Hadoop, an open-source implementation of the MapReduce framework, is increasingly favored by enterprises. On the one hand, it provides HDFS, a reliable and highly scalable storage platform for massive data. On the other hand, it implements the MapReduce framework, which makes parallel applications easier to design and gives large-scale parallel data processing a simple, easy-to-use programming framework.
However, as Hadoop clusters grow, many benchmark tests on the Hadoop platform fail to reflect the real load characteristics of production clusters, and building a test cluster of the same scale is expensive. Meanwhile, scheduler performance, an important aspect of Hadoop performance tuning, has long been a focus of attention. As the number of cluster users and jobs keeps increasing, users have differing requirements for job response time, and job scheduling in shared clusters is becoming an increasingly prominent problem. Many existing schedulers, such as the Fair Scheduler, the Capacity Scheduler, and HOD, struggle with these problems, especially with diverse job types. Building on an analysis of the principles and techniques of the Hadoop platform, this thesis carries out research in the following two areas:
(1) A load generation method is proposed that simulates the real load of a cluster by analyzing the job types in the real load and reconstructing its job submission model. The thesis also designs a MapReduce simulator that can simulate a large-scale cluster using a small number of nodes and accurately simulates the execution of jobs, providing a complete performance testing platform for Hadoop clusters and helping to solve the problem of testing large-scale clusters. Experiments show that the load generation method accurately produces simulated loads that reflect the real load, and that the simulator can simulate a large-scale cluster with a small number of nodes while providing fairly accurate simulation of job execution.
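The abstract does not say how the job types or the submission model are represented. As a rough illustration only, the Python sketch below estimates a job-type mix and a mean inter-arrival time from a trace, then samples a synthetic workload from them. The `Job` fields, the type labels, and the exponential inter-arrival model are all assumptions made for this example, not details taken from the thesis.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Job:
    submit_time: float  # seconds since the start of the trace
    job_type: str       # hypothetical label, e.g. "small" or "large"


def fit_workload(trace):
    """Estimate the job-type mix and mean inter-arrival time from a trace.

    Expects at least two jobs with distinct submit times.
    """
    ordered = sorted(trace, key=lambda j: j.submit_time)
    counts = Counter(j.job_type for j in ordered)
    mix = {t: c / len(ordered) for t, c in counts.items()}
    gaps = [b.submit_time - a.submit_time for a, b in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return mix, mean_gap


def generate_workload(mix, mean_gap, n_jobs, seed=0):
    """Sample a synthetic workload with the fitted mix.

    Assumption: inter-arrival times are exponentially distributed.
    """
    rng = random.Random(seed)
    types = list(mix)
    weights = [mix[t] for t in types]
    now = 0.0
    jobs = []
    for _ in range(n_jobs):
        now += rng.expovariate(1.0 / mean_gap)
        jobs.append(Job(now, rng.choices(types, weights=weights)[0]))
    return jobs
```

A real trace would likely need a richer submission model (for example, daily peaks), which this sketch deliberately leaves out.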
(2) To address job diversity, a Static Priority based Preemptive Scheduling Algorithm (SPPSA) is proposed. The algorithm decomposes scheduling into three subproblems: job pool scheduling, job priority scheduling, and task scheduling. It thereby provides pool-level fairness and resource control, job responsiveness guarantees, and data locality guarantees. Experiments show that SPPSA can meet users' differing job responsiveness requirements in large-scale shared clusters, and that the impact of preemption stays within an acceptable range.
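The abstract names the three scheduling levels but not their rules. The Python sketch below shows one way such a three-level decision could be structured; it is not the SPPSA implementation. The pool rule (serve the pool furthest below its share), the locality rule (prefer a task whose input is on the requesting node), the preemption rule, and all class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    input_nodes: set  # nodes holding a replica of this task's input


@dataclass
class Job:
    name: str
    priority: int  # larger means more urgent; fixed at submission (static)
    pending: list = field(default_factory=list)


@dataclass
class Pool:
    name: str
    min_share: int    # slots the pool is entitled to (must be > 0)
    running: int = 0  # slots the pool currently uses
    jobs: list = field(default_factory=list)


def pick_task(pools, node):
    """Pick the next task for a free slot on `node`, or None if nothing waits."""
    # Level 1, pool scheduling: serve the pool furthest below its share.
    candidates = [p for p in pools if any(j.pending for j in p.jobs)]
    if not candidates:
        return None
    pool = min(candidates, key=lambda p: p.running / p.min_share)
    # Level 2, job priority scheduling: highest static priority first.
    job = max((j for j in pool.jobs if j.pending), key=lambda j: j.priority)
    # Level 3, task scheduling: prefer a data-local task, else take the first.
    task = next((t for t in job.pending if node in t.input_nodes), job.pending[0])
    job.pending.remove(task)
    pool.running += 1
    return pool.name, job.name, task.name


def pick_victim(running, incoming_priority):
    """Return the lowest-priority running (priority, task_name) entry if it
    ranks below the incoming job, otherwise None (no preemption)."""
    if not running:
        return None
    victim = min(running, key=lambda r: r[0])
    return victim if victim[0] < incoming_priority else None
```

Keeping the three levels as separate steps mirrors the decomposition described in the abstract, so that each rule can be swapped out independently.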
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338
