Research on and Improvement of a Massive Log Data Processing Model on the Hadoop Platform

Published: 2018-05-18 18:15

Topics: Hadoop + hierarchical job scheduling; Source: Zhejiang Sci-Tech University, master's thesis, 2013

[Abstract]: As computer technology and the Internet have spread rapidly into every aspect of production and daily life, data volumes have grown explosively. To meet the processing demands of massive-data applications, parallel computing on large-scale computer clusters has become the main approach, and MapReduce is a framework originally designed by Google for running parallel computations on large clusters. It reduces the complexity of concurrent programming and lets developers write distributed programs without understanding the underlying details of the distributed system. Hadoop is an open-source cluster platform that implements MapReduce. It is already used at many Internet companies and is arguably the most widely used open-source cloud computing platform. Hadoop is nevertheless a relatively young platform and still needs improvement in many areas. The main work and contributions of this thesis are as follows:
1) The architecture and core technologies of the Hadoop platform are studied in depth, and the design, implementation, strengths, and weaknesses of Hadoop's existing scheduling algorithms (FIFO, the Capacity Scheduler, and the Fair Scheduler) are described. To address the problems of FIFO, namely a single fixed scheduling policy, long waits for large jobs, and low cluster CPU utilization, a hierarchical scheduling algorithm based on a red-black tree (HSBRB) is proposed and introduced into the Hadoop platform.
2) HSBRB uses a red-black tree as the data structure for storing job information. A red-black tree is an efficient, approximately balanced binary search tree whose insertions and deletions stay fast (O(log n)) as the number of nodes grows, which the thesis reports improves CPU utilization across the cluster. HSBRB also schedules jobs with a hierarchical model: when multiple users share the cluster, each user is assigned a pool, and each pool holds multiple jobs. This addresses the low cluster resource utilization that results from FIFO being designed only for a single user submitting jobs.
3) Processing of massive log data. All of the massive log data used in this thesis comes from the NBER patent data set. To obtain the number of patents at each citation frequency, a small Hadoop cluster is built and a distributed parallel program is developed on it, with the results saved to a specified output directory.
4) To evaluate HSBRB, two experimental scenarios are designed to compare Hadoop's existing FIFO scheduler and Fair Scheduler with the proposed HSBRB algorithm. The experimental results confirm that HSBRB is sound and effective and that, compared with the existing schedulers, it reduces job running time and raises CPU utilization more effectively, making it a well-suited task scheduling algorithm.
Finally, the thesis summarizes the work and discusses directions for future work.
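The citation-frequency job described in item 3) can be illustrated with a minimal sketch. It is not the thesis's program: it simulates two MapReduce passes in plain Python rather than running on Hadoop, and it assumes the input is a comma-separated "citing,cited" file with one header row. The function names (map_citations, reduce_sum, run_stage, citation_histogram) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_citations(line):
    """Stage 1 mapper: emit (cited_patent, 1) for one 'citing,cited' record."""
    parts = line.strip().split(",")
    # Assumed input layout: two comma-separated fields; skip the header row
    # and any malformed line.
    if len(parts) != 2 or not parts[1].strip('"').isdigit():
        return []
    return [(parts[1].strip('"'), 1)]


def reduce_sum(key, values):
    """Reducer shared by both stages: sum the counts collected for one key."""
    return (key, sum(values))


def run_stage(records, mapper, reducer):
    """Simulate one MapReduce pass: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


def citation_histogram(lines):
    """Return {citation_count: number_of_patents_cited_that_many_times}."""
    # Stage 1: cited patent -> number of times it is cited.
    per_patent = run_stage(lines, map_citations, reduce_sum)
    # Stage 2: citation count -> number of patents with that count.
    histogram = run_stage(per_patent, lambda kv: [(kv[1], 1)], reduce_sum)
    return dict(histogram)


# Made-up sample records, used only to exercise the sketch.
sample = [
    '"CITING","CITED"',
    "3858241,956203",
    "3858241,1324234",
    "3858242,1324234",
    "3858243,1324234",
    "3858244,956203",
    "3858245,2000001",
    "3858246,2000002",
]
print(citation_histogram(sample))
```

On a real cluster each stage would be a separate MapReduce job, with the framework performing the grouping that run_stage does here in memory.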
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP301.6;TP338.6
