MapReduce中基于抽样技术的倾斜问题研究

发布时间：2018-10-08 11:41

【摘要】：随着互联网的快速发展,信息正在呈爆炸式增长,每天都会产生海量的数据,存储和分析海量数据是目前的一个巨大挑战。近年来,云计算这一新计算模型自从诞生以来就备受关注,各大IT巨头们纷纷将云计算作为首要发展战略,提出了自己的云计算平台和云计算服务,并且已经有了显著的成果。 MapReduce作为一种大规模数据的并行处理模型在云计算环境下受到广泛的应用,它以其简单易用,高可扩展性和容错性等特点被应用于很多领域。然而,它也存在问题,它不能有效地处理倾斜的数据。当MapReduce处理的数据分布不均匀时,会造成有些任务比其他任务运行较慢的情况,而整个作业的执行时间是由最慢的那个任务决定的,因此增加了整个作业的完成时间,使系统性能下降。本文对MapReduce中的倾斜问题进行了研究,提出了一种处理方法。本文的出发点是考虑当倾斜的数据存在时,如何高效地将MapReduce中Map阶段产生的中间结果划分给Reduce,使所有Reduce能够达到负载平衡。主要工作为：(1)统计输入文件中所有key的频次分布,由于统计所有数据的开销较大,所以本文采用抽样技术,估算keys的出现次数。将统计key频次分布这一操作用一个单独MapReduce作业来完成。并且,文中给出抽样的理论分析,证明抽取出的样本能够代替源输入文件进行key的频次估计。(2)根据统计出来的所有key的频次分布结果,提出两种划分方法：Cluster组合和Cluster分割,前者在数据倾斜度不大的时候较有效,后者在数据倾斜度较大的时候较有效。(3)实验证明使用抽样技术处理小部分数据能够较快地估计出key的频次分布,两种划分方法可以获得较快的执行时间,使Reduce得到很好的负载平衡。
[Abstract]:With the rapid development of the Internet, the information is increasing explosively. Every day, huge amounts of data are produced. It is a great challenge to store and analyze the massive data. In recent years, cloud computing, a new computing model, has attracted much attention since its birth. The major IT giants have put forward their own cloud computing platform and cloud computing services as the primary development strategy, and have made remarkable achievements. As a parallel processing model of large-scale data, MapReduce is widely used in cloud computing environment. It is widely used in many fields because of its simplicity, high scalability and fault tolerance. However, it also has problems, it can not effectively handle skewed data. When the distribution of data processed by MapReduce is uneven, some tasks run slower than others, and the execution time of the entire job is determined by the slowest task, thus increasing the completion time of the entire job. Make the system performance degrade. In this paper, the tilting problem in MapReduce is studied, and a method is proposed to deal with it. The starting point of this paper is to consider how to efficiently divide the intermediate results from the Map phase in MapReduce to Reduce, so that all Reduce can achieve load balance when skewed data exist. The main work is as follows: (1) the frequency distribution of all key in the statistical input file. Because of the high cost of statistics all data, this paper uses sampling technique to estimate the frequency of keys. The operation of statistical key frequency distribution is done with a single MapReduce job. Moreover, the theoretical analysis of sampling is given, and it is proved that the extracted sample can estimate the frequency of key instead of the source input file. (2) according to the frequency distribution results of all key, two partition methods: cluster combination and Cluster partition are proposed. The former is more effective when the data inclination is small, and the latter is more effective when the data inclination is high. (3) the experiment proves that the frequency distribution of key can be estimated quickly by using sampling technique to process a small part of data. The two partitioning methods can obtain faster execution time and make Reduce load balance very good.
【学位授予单位】：大连海事大学
【学位级别】：硕士
【学位授予年份】：2013
【分类号】：TP338.6

【参考文献】