Research on MapReduce Parallel Computing Application Cases and Performance Optimization of Its Execution Framework

Published: 2018-05-19 02:40

Topic: parallel computing + big data processing; Reference: Nanjing University, 2012 master's thesis

[Abstract]: Data generated in business, science, and everyday life is growing at a staggering rate. Traditional data storage and processing technologies and tools, typified by relational databases, can no longer store, manage, and process data that is this large and growing this fast. Big data contains more useful information, but it also brings more challenges, and big data processing has become an active research topic. Against this background, academia and industry broadly agree that parallel computing is the way to solve big data processing problems. Parallel computing techniques, however, are tightly coupled to the application problems they address, and those problems differ in complexity and kind. This makes big data processing technically challenging and calls for effective parallel computing models and systems. The MapReduce parallel computing technology published by Google has become the most successful big data processing technology to date thanks to its high scalability and ease of use, and it is widely applied. Hadoop, the current mainstream open-source implementation of the MapReduce framework, has become the de facto industry standard for big data processing applications. However, existing implementations of the MapReduce execution framework mainly target large-scale batch data processing jobs, whereas a growing number of online data processing and query applications across industries place high demands on job response time. When handling such applications, the response performance of existing MapReduce frameworks is clearly inadequate. To address this problem, this thesis works from upper-level MapReduce applications down to the underlying framework: building on a study of a MapReduce parallel computing application case, it investigates and implements performance optimizations for the existing MapReduce execution framework.

The research consists of two parts.

(1) A case study of a MapReduce parallel computing application. Taking BLAST, the well-known sequence alignment tool in bioinformatics, as the case, the thesis analyzes the difficulties of data partitioning and computation partitioning in parallelizing the BLAST algorithm, proposes and implements two MapReduce-based parallelization schemes, and evaluates and compares the two schemes through multiple sets of experiments. The case study also revealed problems with job scheduling and program execution performance in the MapReduce model and execution framework, which lead into the second part of the work.

(2) Performance optimization of the MapReduce execution framework. Based on a detailed analysis of the internal processing steps and time overhead of MapReduce job execution, and a close study of the structure of the execution framework, the state transitions of jobs and tasks, and the job and task scheduling flow, the thesis proposes and implements two optimizations. First, moving the job setup and job cleanup work from the TaskTracker to the JobTracker reduces the time spent preparing and cleaning up the job's runtime environment. Second, task assignment is changed from "pull" mode to "push" mode, and task state change messages are separated from the existing, relatively time-consuming periodic heartbeat mechanism and delivered immediately instead, which improves task scheduling efficiency and the utilization of computing resources.

Finally, the application case from the first part is used to test the optimized MapReduce execution framework. The experimental results show that the proposed optimization methods are effective and that the actual performance improvement is fairly significant.
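
The second optimization in the abstract (push-mode assignment with immediate status messages in place of heartbeat-driven pull) can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the thesis's implementation: it models a single task slot, assumes that in pull mode a finished task is reported and replaced only at the next heartbeat tick, and uses hypothetical function names and an illustrative 3000 ms heartbeat interval.

```python
def pull_makespan(task_ms, heartbeat_ms):
    """Total time for one slot when new tasks arrive only at heartbeat ticks.

    Assumption: ticks occur at multiples of heartbeat_ms, and a slot that
    becomes free between ticks stays idle until the next tick.
    """
    now = 0
    for duration in task_ms:
        # Round the current time up to the next heartbeat tick (integer ceiling).
        start = -(-now // heartbeat_ms) * heartbeat_ms
        now = start + duration
    return now


def push_makespan(task_ms, dispatch_ms=0):
    """Total time for one slot when the scheduler pushes the next task
    as soon as the previous one finishes (plus an optional dispatch delay)."""
    now = 0
    for duration in task_ms:
        now += dispatch_ms + duration
    return now


tasks = [1000, 2500, 700]          # illustrative task durations in ms
print(pull_makespan(tasks, 3000))  # 6700
print(push_makespan(tasks))        # 4200
```

In this model the gap between the two results is the idle time the slot spends waiting for heartbeat ticks, which is the overhead the push design aims to remove. Real gains depend on cluster size, task lengths, and messaging costs that the sketch ignores.
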
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP338.6
