Optimizing HADOOP Job Startup Performance in Practice

Published: 2018-04-26 10:14

Topic: Hadoop + Split; Source: Beijing Jiaotong University, 2012 master's thesis

[Abstract]: This thesis describes a project I carried out in the distributed computing group at Baidu to reduce the submission time of HADOOP jobs. The project focuses on the time and memory consumed by the split process at job submission. Split is the most time-consuming step of job submission and the most important of the preparations made before a job is submitted, because it determines how the input data is partitioned: how many map tasks the job has, how much data each map task processes, and which node's TaskTracker is preferred for each map task.

Neither Baidu's earlier HADOOP versions nor the current community HADOOP version has seen any major modification or optimization of the split process. As Baidu's HADOOP clusters grew and large jobs became more numerous, the volume of input data and the number of input files per job kept increasing. Running split over these inputs before submission exposed two problems: high memory consumption and long running time. Both seriously reduced the efficiency with which Baidu's HADOOP clusters handle large jobs and drew complaints from users in departments such as data mining and log analysis. The split process therefore had to be optimized to improve cluster efficiency and the user experience.

The optimization work, which I completed independently, has four parts: (1) speeding up the retrieval of blockLocations; (2) optimizing the ls step for the case where the regular expression in an input path matches a file at an intermediate level of the path; (3) reducing the excessive memory used by getSplits; and (4) moving the getSplit process to the TaskTracker. The first part speeds up the retrieval of blockLocation information. The second speeds up path traversal when a match hits a file at an intermediate level. The third greatly reduces the memory used throughout split and makes that memory depend on user-specified parameters rather than on the volume of the job's input data. The fourth moves the whole split process from the client to the TaskTracker, which relieves pressure on the client and uses the advantage of network transfer within the same cluster to shorten split further.

These optimizations have been deployed on Baidu's HADOOP clusters with very good results. Submission time for large jobs fell from hours to minutes, and the split process became 30 to 60 times faster on average. Memory use for the whole split process now stays at around 200 MB, whereas it previously grew with the job's input size and could exceed 3 GB. The project was well received by department colleagues and by users.
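To make the role of split concrete, here is a minimal Python sketch of how input files could be cut into splits and why holding all of them in memory scales with input size. The split-size rule and the 1.1 slack factor follow what stock Apache Hadoop's FileInputFormat does, as far as I know; the thesis does not give Baidu's modified implementation, so the lazy generator and the `in_batches` helper with its `batch_size` parameter are hypothetical illustrations of "memory bounded by a user-specified parameter", not the author's method. Block locations (preferred hosts) are left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, NamedTuple, Tuple

# Stock FileInputFormat lets the last split of a file run up to 10% over size.
SPLIT_SLOP = 1.1


class FileSplit(NamedTuple):
    path: str
    start: int   # byte offset within the file
    length: int  # bytes covered by this split


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Clamp the block size between the configured minimum and maximum."""
    return max(min_size, min(max_size, block_size))


def iter_splits(
    files: Iterable[Tuple[str, int]],
    block_size: int,
    min_size: int = 1,
    max_size: int = 2**63 - 1,
) -> Iterator[FileSplit]:
    """Yield splits one at a time instead of building the full list.

    `files` is an iterable of (path, length_in_bytes) pairs.
    Zero-length files are skipped in this sketch.
    """
    split_size = compute_split_size(block_size, min_size, max_size)
    for path, length in files:
        offset, remaining = 0, length
        while remaining / split_size > SPLIT_SLOP:
            yield FileSplit(path, offset, split_size)
            offset += split_size
            remaining -= split_size
        if remaining > 0:
            yield FileSplit(path, offset, remaining)


def in_batches(items: Iterable[FileSplit], batch_size: int) -> Iterator[List[FileSplit]]:
    """Hypothetical helper: group splits so at most `batch_size` are held at once."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded split corresponds to one map task, so the number of splits is the number of map tasks. Consuming `iter_splits` through `in_batches` keeps the number of split objects in memory at `batch_size` or fewer regardless of how many files the job reads, whereas `list(iter_splits(...))` grows with the input.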
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP338.8
