Research on Efficiency Optimization Mechanisms for In-Memory MapReduce Systems

Published: 2018-05-30 07:50

Topic: MapReduce + in-memory computing; Source: Huazhong University of Science and Technology, master's thesis, 2016

[Abstract]: In the big data era, data processing and analysis have become critically important. To meet the demand for timely data processing, big data processing systems based on in-memory computing have become a new research hotspot. In existing high-performance computing clusters, memory is markedly under-provisioned relative to CPU, so when a MapReduce system running on such a cluster handles data-intensive applications, it tends to spill data to disk unnecessarily and incur avoidable I/O; memory efficiency urgently needs optimization. When processing large-scale data sets, if the number of partitions is too large, the hash-based Shuffle mechanism causes excessive file operations and poor memory usage. If instead the partition blocks are too large, each task consumes more memory, which easily leads to a performance bottleneck in which CPU and memory usage are poorly coordinated. In addition, when the intermediate data assigned to the worker nodes is unevenly distributed, load imbalance results and system performance suffers. To address these problems of MapReduce systems on high-performance computing clusters, this thesis presents a memory efficiency optimization system for big data processing. Drawing on the characteristics of in-memory computing, it proposes and implements optimizations for the efficient use of memory resources, with the aim of building a fast and efficient big data processing platform. First, the system designs a Shuffle mechanism based on object reuse: by reusing file write handles and their associated objects, it effectively resolves the overly rapid memory allocation that occurs when there are too many partitions and keeps memory usage steady. Second, the system establishes a task dispatch mechanism based on feedback, sampling, and decision, which effectively coordinates CPU and memory usage when partition blocks are too large and greatly reduces the I/O overhead of spilling intermediate data to disk. Finally, the system implements a task scheduling mechanism with an embedded load balancer, which ensures that each worker node processes nearly the same amount of intermediate data while minimizing the amount of data transferred over the network. The proposed optimizations are integrated into Spark, are transparent to users, and are fully compatible with existing Spark applications. Tests on typical workloads show that, compared with native Spark, the improved system uses memory more efficiently and performs far less disk I/O when processing large-scale data sets, yielding a 1.25x to 3.18x improvement in total execution time.
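The abstract does not spell out how the embedded load balancer assigns intermediate data to worker nodes. As a rough illustration of the stated goal (near-equal intermediate data per node with little network transfer), the following is a minimal sketch that assumes a simple greedy heuristic. The function names, the `local_bytes` input format, and the `slack` parameter are all hypothetical and are not taken from the thesis or from Spark.

```python
from typing import Dict, List


def assign_partitions(local_bytes: Dict[int, Dict[str, int]],
                      workers: List[str],
                      slack: float = 0.1) -> Dict[int, str]:
    """Greedily assign reduce partitions to workers (illustrative only).

    local_bytes[p][w] is the number of bytes of partition p that already
    reside on worker w (a hypothetical input format).
    """
    if not workers:
        raise ValueError("at least one worker is required")
    sizes = {p: sum(per_worker.values()) for p, per_worker in local_bytes.items()}
    target = sum(sizes.values()) / len(workers)  # ideal load per worker
    load = {w: 0 for w in workers}
    assignment: Dict[int, str] = {}
    # Place the largest partitions first so that small ones can even out loads.
    for p in sorted(sizes, key=lambda q: (-sizes[q], q)):
        min_load = min(load.values())
        # Only workers whose load is close to the current minimum are eligible.
        candidates = [w for w in workers if load[w] <= min_load + slack * target]
        # Among them, prefer the worker that already holds the most data of p,
        # which reduces the bytes that must cross the network.
        best = max(candidates, key=lambda w: (local_bytes[p].get(w, 0), -load[w]))
        assignment[p] = best
        load[best] += sizes[p]
    return assignment


def worker_loads(local_bytes: Dict[int, Dict[str, int]],
                 assignment: Dict[int, str]) -> Dict[str, int]:
    """Total intermediate bytes each assigned worker has to process."""
    loads: Dict[str, int] = {}
    for p, w in assignment.items():
        loads[w] = loads.get(w, 0) + sum(local_bytes[p].values())
    return loads


def remote_bytes(local_bytes: Dict[int, Dict[str, int]],
                 assignment: Dict[int, str]) -> int:
    """Bytes that must be fetched from workers other than the assigned one."""
    return sum(sum(per_worker.values()) - per_worker.get(assignment[p], 0)
               for p, per_worker in local_bytes.items())


if __name__ == "__main__":
    data = {
        0: {"w1": 80, "w2": 20},
        1: {"w1": 10, "w2": 90},
        2: {"w1": 30, "w2": 30},
        3: {"w1": 5, "w2": 55},
    }
    plan = assign_partitions(data, ["w1", "w2"])
    print(plan, worker_loads(data, plan), remote_bytes(data, plan))
```

In this toy input, the two workers end up with 160 bytes each and 65 bytes are fetched remotely. A larger `slack` trades balance for locality, and a real scheduler would also need to account for task slots and stragglers.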
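[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: TP311.13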
