A Multi-way Join Optimization Method Based on MapReduce
Published: 2018-07-21 17:27
[Abstract]: Multi-way join is one of the most commonly used operations in data analysis, and MapReduce is a programming model widely used for large-scale data analysis and processing. MapReduce brings new challenges to multi-way join optimization: traditional optimization methods cannot simply be applied to MapReduce, and MapReduce join execution algorithms still leave room for optimization. To address the former, considering that I/O cost is the dominant cost of join operations, a heuristic algorithm is first proposed to determine the multi-way join execution order with the goal of reducing I/O cost; the resulting order is then further optimized, and finally a parallel execution strategy is designed for MapReduce to improve the overall performance of multi-way joins. To address the latter, considering that load balancing can effectively mitigate the "bucket effect" in MapReduce, a fair task assignment algorithm is used to increase the degree of parallelism within a join, and on this basis a method for determining the number of Reduce tasks is given. Finally, experiments verify the optimization effect of the proposed execution plan determination method and load balancing algorithm. This study offers guidance for applying MapReduce multi-way joins in big data environments and can improve the performance of applications such as star joins in OLAP analysis and chain joins for community detection in social networks.
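The abstract does not spell out its fair task assignment algorithm, so the following is only a minimal sketch of the general idea behind load balancing against the "bucket effect", where the most heavily loaded Reduce task determines when a join finishes. It swaps in a standard greedy heuristic (largest group first, always to the least-loaded reducer), which is not claimed to be the paper's method. The function names `assign_groups` and `hash_loads` and the sample group sizes are hypothetical.

```python
import heapq


def assign_groups(group_sizes, num_reducers):
    """Greedily assign join-key groups to reducers, largest group first."""
    # Min-heap of (current load, reducer id): the least-loaded reducer is on top.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    # Placing large groups first lets the small groups even out the loads later.
    for key, size in sorted(group_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, r = heapq.heappop(heap)
        assignment[r].append(key)
        heapq.heappush(heap, (load + size, r))
    loads = {r: sum(group_sizes[k] for k in keys) for r, keys in assignment.items()}
    return assignment, loads


def hash_loads(group_sizes, num_reducers):
    """Baseline for comparison: partition by key modulo the reducer count."""
    loads = {r: 0 for r in range(num_reducers)}
    for key, size in group_sizes.items():
        loads[key % num_reducers] += size
    return loads


# Hypothetical skewed input: join key -> number of tuples sharing that key.
sizes = {0: 90, 1: 10, 2: 80, 3: 10, 4: 70, 5: 10}
_, balanced = assign_groups(sizes, 2)
baseline = hash_loads(sizes, 2)

# The heaviest reducer bounds the completion time of the Reduce phase.
print("modulo partitioning, heaviest reducer:", max(baseline.values()))
print("greedy assignment, heaviest reducer:", max(balanced.values()))
```

With this sample input the modulo baseline sends all three large groups to the same reducer, while the greedy assignment spreads them out, so the heaviest load drops even though the total work is unchanged. Choosing the number of reducers, which the abstract also addresses, is outside the scope of this sketch.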
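[Author affiliation]: School of Computer Science and Engineering, Northeastern University (东北大学); Software College, Northeastern University
[Funding]: Major Program of the National Natural Science Foundation of China (61433008); Young Scientists Fund of the National Natural Science Foundation of China (61202088); General Program of the China Postdoctoral Science Foundation (2013M540232); Fundamental Research Funds for the Central Universities (N120817001); Doctoral Supervisor Fund of the Ph.D. Programs Foundation of the Ministry of Education of China (20120042110028)
[Classification number]: TP311.13
Article ID: 2136254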
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2136254.html