Research on Data Allocation Strategies for the MapReduce Model

Published: 2018-05-22 13:28

Topics: cloud computing + Hadoop; Source: Huazhong University of Science and Technology, 2013 master's thesis

[Abstract]: Since cloud computing emerged in 2007, it has become one of the most widely discussed concepts in the IT industry at home and abroad. As the Internet has grown and data volumes have surged, storing and processing massive data sets quickly and effectively has become a pressing problem, and this is the original driving force behind cloud computing. Cloud computing itself, however, is only a way of thinking: hardware provides the necessary environment, but a programming model capable of supporting it matters even more. The MapReduce parallel programming model proposed by Google provides that software support for processing massive data in the cloud.

Hadoop works in a reliable, efficient, and scalable way, and within a few years it has become the mainstream open-source cloud computing platform. It is still a relatively young platform with shortcomings in many areas, so improving it is necessary. This thesis studies the MapReduce parallel programming model on the Hadoop platform in depth and proposes a solution to the uneven distribution of the intermediate data emitted on the Map side. The solution handles the problem with two MapReduce stages. The first stage samples the source data set in parallel and uses the sample to estimate the characteristics of the data; based on this estimate, an allocation strategy called LAB is proposed that distributes the intermediate data evenly. The second stage runs the MapReduce job under that allocation strategy. Experiments show that the scheme shortens job running time and balances the input load on the Reduce side, which demonstrates the feasibility and advantages of the improved scheme. The scheme makes full use of computing resources, avoids wasting them, and improves program efficiency.
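
The two-stage idea described in the abstract can be illustrated with a small single-process sketch. The abstract does not give the details of LAB, so the code below is not the thesis's algorithm: it substitutes a generic greedy "largest key first, lightest reducer first" assignment, and every function name in it (`sample_key_counts`, `balanced_assignment`, `partition`, `reducer_loads`) is hypothetical. It only shows how sampled key frequencies could drive a partition table that a second job then uses.

```python
import heapq
import random
import zlib
from collections import Counter


def sample_key_counts(records, key_of, rate, seed=0):
    """Stage 1 (sketch): sample records at the given rate and count keys."""
    rng = random.Random(seed)
    counts = Counter()
    for rec in records:
        if rng.random() < rate:
            counts[key_of(rec)] += 1
    return counts


def balanced_assignment(key_counts, num_reducers):
    """Assumed stand-in for LAB: greedy largest-first assignment.

    Keys are visited from most to least frequent, and each key goes to
    the reducer with the smallest estimated load so far.
    """
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(key_counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return assignment


def partition(key, assignment, num_reducers):
    """Stage 2 (sketch): look up the table; hash keys the sample missed."""
    if key in assignment:
        return assignment[key]
    # crc32 is stable across processes, unlike Python's salted str hash.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def reducer_loads(key_counts, assignment, num_reducers):
    """Estimated number of records each reducer would receive."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[partition(key, assignment, num_reducers)] += count
    return loads


if __name__ == "__main__":
    counts = {"a": 50, "b": 30, "c": 20, "d": 10, "e": 10}
    table = balanced_assignment(counts, num_reducers=2)
    print(table)
    print(reducer_loads(counts, table, num_reducers=2))
```

In a real Hadoop job the lookup in `partition` would live in a custom partitioner class and the table would be shipped to the tasks; how the thesis does this is not stated in the abstract.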
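[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333;TP311.1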
