Research on Hadoop-Based MapReduce Performance Optimization
Topic: MapReduce + load balancing; Source: Nanjing University of Posts and Telecommunications, 2017 master's thesis
[Abstract]: As Internet technology continues to develop, the volume of data that networks and enterprises must process keeps growing, and cloud computing has become a popular computing model for big data processing. Hadoop, an open-source cloud computing platform, quickly became a mainstream big data processing technology. As Hadoop clusters have come into wide use, their performance has drawn increasing attention. Load balancing has a significant effect on cluster performance and is the focus of this thesis, which studies and analyzes the load-balancing problems that arise while MapReduce runs in order to optimize cluster performance. In a heterogeneous environment, nodes differ in computing power, so MapReduce task scheduling tends to load nodes unevenly; individual nodes then run for too long, which lengthens the response time of the whole job. To address this, the thesis proposes a task scheduling algorithm based on load balancing. By analyzing the characteristics of task execution and the performance of nodes in a heterogeneous cluster, the algorithm derives a load-balancing metric for task scheduling. This metric provides the basis for assigning tasks to nodes, so that each node receives a computational load matching its performance. During task execution, a node communication model adjusts the load dynamically, which keeps task scheduling balanced. MapReduce's default hash partitioning mechanism can skew the data load that nodes receive when processing intensive data. To address this, the thesis proposes a partition cost model that evaluates the load-balancing cost of partitions and, building on this model, a new fine-grained partitioning algorithm. The algorithm increases the number of partitions to reduce the skewed data within each partition, and uses the partition cost model to keep the volume of data received by each node relatively balanced. Finally, an experimental environment was built and experiments were designed to verify that the proposed task scheduling algorithm and fine-grained partitioning algorithm improve cluster load balance.
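The fine-grained partitioning idea described in the abstract can be illustrated with a small simulation. The abstract does not give the thesis's partition cost model, so the sketch below substitutes a simple greedy rule (largest partition first, each assigned to the reducer holding the least data so far) as a stand-in. The function names, the `crc32`-based hash, and the max-over-mean imbalance measure are all assumptions made for illustration, not the author's method.

```python
import heapq
import zlib


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in str hash is salted per process)."""
    return zlib.crc32(key.encode("utf-8"))


def partition_sizes(key_counts: dict[str, int], num_partitions: int) -> list[int]:
    """Record count per partition under hash(key) mod num_partitions."""
    sizes = [0] * num_partitions
    for key, count in key_counts.items():
        sizes[stable_hash(key) % num_partitions] += count
    return sizes


def assign_partitions(sizes: list[int], num_reducers: int) -> list[int]:
    """Greedy stand-in for a cost model (an assumption, not the thesis's model):
    take partitions largest first and give each to the least-loaded reducer.
    Returns the resulting data volume per reducer."""
    heap = [(0, r) for r in range(num_reducers)]
    loads = [0] * num_reducers
    for size in sorted(sizes, reverse=True):
        load, reducer = heapq.heappop(heap)
        loads[reducer] = load + size
        heapq.heappush(heap, (loads[reducer], reducer))
    return loads


def imbalance(loads: list[int]) -> float:
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


if __name__ == "__main__":
    # Hypothetical skewed input: a few hot keys and many small ones.
    key_counts = {f"key{i}": 1000 // (i + 1) for i in range(200)}
    reducers = 4
    default_loads = partition_sizes(key_counts, reducers)  # one partition per reducer
    fine_sizes = partition_sizes(key_counts, reducers * 8)  # 8x more partitions
    fine_loads = assign_partitions(fine_sizes, reducers)
    print("default hash imbalance:", round(imbalance(default_loads), 3))
    print("fine-grained imbalance:", round(imbalance(fine_loads), 3))
```

With more, smaller partitions, a hot key no longer drags a fixed set of co-hashed keys onto the same reducer, so the assignment step has more freedom to even out the totals. A single key that is larger than the mean reducer load still cannot be balanced this way, because a key is never split across partitions.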
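【Degree-granting institution】: Nanjing University of Posts and Telecommunications
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP311.13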