Research and Implementation of a Locality-First MapReduce Job Scheduling Strategy for Massive Data

Published: 2018-01-13 12:29

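Keywords: Research and Implementation of a Locality-First MapReduce Job Scheduling Strategy for Massive Data　Source: National University of Defense Technology, 2012 master's thesis　Document type: degree thesis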

【Abstract】: Over recent decades, information networks have grown steadily in both technology and scale, and applications built on massive data have multiplied. An ordinary computing cluster built by a single enterprise can no longer meet the challenges that ever-growing data volumes pose for effective management and efficient computation, so industry proposed pushing computation to the cloud, that is, cloud computing. Today the concept of cloud computing is widely accepted by enterprises and research institutions, and much progress has been made in areas such as reliability and availability. Among these results, MapReduce is one of the significant solutions for distributed computation over massive data, and its core functionality has been implemented in the Hadoop distributed computing system. Because Hadoop is open source, it has become an important base platform for research on MapReduce distributed computing, and the work in this thesis is built on it.

Job scheduling in the MapReduce distributed computing model has a major impact on system performance, reliability, and other properties. To address the poor data locality of existing job scheduling algorithms in multi-job scenarios, this thesis proposes a locality-first job scheduling algorithm. The method takes a new approach to the conflict between data locality and system load balance: while guaranteeing data locality, it improves load balance through job-level scheduling and reduces I/O overhead during computation, thereby increasing system throughput and shortening the execution time of individual jobs.

The thesis designs and implements the locality-first job scheduling algorithm in a MapReduce programming model that uses HDFS as its distributed storage system, and validates it experimentally in a simulation environment. The results show that, under a mechanism that achieves full data locality, system throughput improves effectively while the average execution time of a single job drops substantially.
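The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general idea of locality-first, job-level scheduling, not the thesis's actual method. It assumes a simplified model in which each node reports one free map slot per heartbeat, each task knows which nodes hold a replica of its input block, and a slot is left idle when no queued job has local work for that node. The names `Task`, `Job`, and `pick_local_task` are hypothetical and do not correspond to Hadoop classes.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    replica_nodes: frozenset  # nodes holding a replica of the task's input block


@dataclass
class Job:
    job_id: str
    pending: list = field(default_factory=list)  # tasks not yet assigned


def pick_local_task(jobs, node):
    """Return (job, task) for the first queued job that has a pending task
    whose input block is stored on `node`, or None if no job has local work.

    Assumption: jobs are scanned in queue order, and the scheduler looks
    across jobs instead of forcing the head job to run a remote task."""
    for job in jobs:
        for task in job.pending:
            if node in task.replica_nodes:
                job.pending.remove(task)  # safe: we return immediately
                return job, task
    return None  # strict locality: leave the slot idle rather than read remotely


jobs = [
    Job("job-A", [Task("A1", frozenset({"n1", "n2"})),
                  Task("A2", frozenset({"n2"}))]),
    Job("job-B", [Task("B1", frozenset({"n3"}))]),
]

# Simulated heartbeats: each node reports one free map slot.
assignments = []
for node in ["n3", "n1", "n2", "n4"]:
    picked = pick_local_task(jobs, node)
    assignments.append((node, picked[1].task_id if picked else None))

print(assignments)
```

In this toy run, node `n3` holds no data for `job-A`, so the scheduler gives it a task from `job-B` instead of a non-local `job-A` task. That is the sense in which looking across jobs can keep nodes busy without giving up locality. Node `n4` holds no input blocks at all and stays idle.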

【Degree-granting institution】: National University of Defense Technology
【Degree level】: Master's
【Year conferred】: 2012
【Classification number】: TP333;TP311.13

【References】