Establishment and Optimization of a Distributed JobTracker Node Model in Cloud Computing

Published: 2019-03-15 17:18

[Abstract]: Cloud computing is the fourth revolution in the IT industry, following the development of mainframe computers, personal computers, and the Internet; Google was the first to define and develop it. Hadoop, the open-source model of cloud computing, is a Java-based open-source distributed computing platform that analyzes and processes big data by running distributed, intensive applications. Its single points of failure create a performance bottleneck. For the single NameNode in the HDFS storage architecture, Hadoop 2.0 introduced a multi-node high-availability scheme, but it offers no corresponding solution for the single JobTracker node. This thesis builds a distributed JobTracker node model to address single-JobTracker failure in the traditional computing architecture, so that job failures caused by the failure of a single JobTracker node can be avoided automatically. The main work and contributions, which build on a thorough analysis of earlier improvements to the single-JobTracker model and earlier tuning of scheduling and load-balancing algorithms, are as follows. First, the distributed JobTracker node model is established by studying the shortest-path algorithm Dijkstra, the web page ranking algorithm PageRank, and the web page de-duplication algorithm Bloom Filter. The Dijkstra algorithm is used to optimize many-to-many communication between nodes in the model, so that the multiple JobTracker nodes and the task nodes can communicate in a balanced way. Second, job scheduling is optimized on the basis of the PageRank algorithm. Finally, the Counting Bloom Filter algorithm is used to improve the number of task replicas on each node, thereby optimizing node load in the distributed JobTracker model.
After analyzing the communication mode of the distributed JobTracker node model and the related scheduling optimizations, the author built a small experimental Hadoop cluster to verify the results. The experiments show that, when a node in the cluster goes down, the distributed JobTracker node model is more reliable than the single-JobTracker model, and that the Dijkstra-based communication mode selects a JobTracker node more quickly. For the improved job scheduling algorithm, when the submitted jobs have dependencies on one another, the PageRank-based algorithm further improves the overall job processing time. The improved load-balancing algorithm optimizes cluster load from the perspective of replica storage load, which raises the storage-space utilization for duplicate data replicas. The experiments conclude with a comparison of overall cluster performance. Because the optimizations under the distributed JobTracker node model mainly target specific kinds of jobs, overall job-processing performance is lower than that of the original cluster. However, when a JobTracker node goes down, the model improves the safety and reliability of the cluster, and it is valuable for job processing in special scenarios.
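The abstract states that Dijkstra-based communication is used to select a JobTracker node, but it gives neither the graph model nor the edge weights. The following is a minimal sketch of one way such a selection could work, not the thesis's implementation. It assumes the cluster is a weighted graph whose edge weights stand for link cost; the node names (`tt1`, `sw1`, `jt1`, `jt2`), the weights, and the function names are all hypothetical.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative weights.

    graph[u][v] is the assumed link cost from u to v (hypothetical metric).
    Returns a dict mapping each reachable node to its distance from source.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def nearest_jobtracker(graph, task_node, jobtrackers):
    """Pick the live JobTracker with the smallest shortest-path cost.

    jobtrackers lists the candidates currently considered alive (assumption:
    liveness is tracked elsewhere). Returns None if none is reachable.
    """
    dist = shortest_distances(graph, task_node)
    reachable = [jt for jt in jobtrackers if jt in dist]
    if not reachable:
        return None
    # Break ties by name so the choice is deterministic.
    return min(reachable, key=lambda jt: (dist[jt], jt))


# Hypothetical topology: one task node, one switch, two JobTracker candidates.
links = {
    "tt1": {"sw1": 1, "jt1": 7},
    "sw1": {"tt1": 1, "jt1": 2, "jt2": 5},
    "jt1": {"tt1": 7, "sw1": 2},
    "jt2": {"sw1": 5},
}

print(nearest_jobtracker(links, "tt1", ["jt1", "jt2"]))
print(nearest_jobtracker(links, "tt1", ["jt2"]))  # jt1 treated as failed
```

In this sketch, failover amounts to dropping the failed JobTracker from the candidate list and selecting again, so a task node falls back to the next-nearest JobTracker without manual intervention.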
[Degree-granting institution]: Hebei University of Engineering
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP393.09

[References]