Application Characterization and System Performance Optimization for Hadoop

Published: 2018-09-16 21:48

[Abstract]: Hadoop is currently the most widely used big data processing system. Although Hadoop provides an efficient solution for large-scale distributed data processing, it still faces a series of challenges: 1) the abstract programming interface that Hadoop exposes hides the underlying implementation details, which makes performance analysis of applications difficult; 2) Hadoop configuration parameters have a significant impact on system performance, but the default configuration cannot guarantee the best performance for every application, so the parameters must be tuned for the application at hand; 3) frequent data movement severely limits the performance of big data systems, so new solutions are needed to reduce its adverse impact. This thesis studies performance characterization and performance optimization for applications running on Hadoop. First, based on dynamic binary bytecode tracing, it designs and implements a lightweight, non-intrusive, distributed performance analysis framework for Hadoop applications. The framework dynamically captures the runtime state of an application and analyzes its performance, helping users understand the application's performance characteristics on Hadoop and thereby pointing out directions for optimization. Second, it proposes a performance model of Hadoop applications for dynamic resource allocation scenarios and, on the basis of this model, uses a genetic algorithm to search the global high-dimensional configuration parameter space, thereby addressing the Hadoop configuration tuning problem. The prediction error rate of the proposed performance model is below 6%; compared with the default configuration, the tuned configuration yields an average performance improvement of 9.52 times and a maximum improvement of 18.76 times. Finally, targeting the data-parallel processing characteristics of MapReduce applications in Hadoop, it proposes a near-data processing system that provides complete hardware and software interfaces, a dynamic task migration mechanism, and a runtime environment, and it implements a lightweight MapReduce framework that supports migrating Map and Reduce tasks to near-data processing units. Compared with a baseline system without near-data processing, the proposed system achieves a 4.83 times performance improvement and reduces system power consumption by 26%; compared with the SMC system, which uses near-data processing but does not support data-parallel processing, its power consumption is 37% higher, but it achieves a 2.32 times performance improvement.
[Degree-Granting Institution]: Zhejiang University
[Degree Level]: Master's
[Year Degree Conferred]: 2017
[Classification Number]: TP311.13
