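Research on Interval Join Methods under MapReduce
Published: 2018-05-08 20:16
Topics: interval join + set classification; Source: Huazhong University of Science and Technology, master's thesis, 2016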
[Abstract]: With the rapid development of network technology, the volume of data worldwide is multiplying, which makes big-data analysis and processing difficult. MapReduce, an emerging programming model for data-intensive computing, plays an important role in big-data analysis and processing. An interval join is a join whose attribute values range over an interval; it is an important operation in big-data analysis and processing, so using the MapReduce programming platform to make interval joins more efficient is of great significance. Building on the concepts of interval tuples and interval-tuple relations proposed by Allen, this thesis designs algorithms based on set classification for two-way and multi-way interval joins. First, the interval tuples taking part in the operation are divided evenly into a number of partitions by interval range. Each tuple is mapped to the set of partitions it intersects, the position of each tuple within a partition is classified, four types of set classification are defined, and the share of each partition's total data taken by each of the four types is analyzed. Second, the two-way and multi-way interval join algorithms are implemented with the MapReduce distributed programming framework. Key-value pairs built from the four set classifications filter out tuples that need not take part in the join, which reduces the amount of data transferred from the Map side and the amount of computation on the Reduce side and so improves join efficiency. Finally, based on the share of each partition's total data taken by each set classification, load-balancing strategies are formulated separately for two-way and multi-way interval joins: the set classifications across partitions are recombined to generate new key-value pairs, balancing the data received by each Reduce node and further improving the completion efficiency of interval join jobs. Both the two-way and the multi-way interval join methods were validated on a distributed Hadoop platform built for the purpose. The experimental results show that the set-classification-based interval join method applies to a variety of cases, has certain advantages over existing two-way and multi-way interval join methods, and that the load-balancing strategies further improve efficiency.
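The partition-and-classify idea for the two-way case can be illustrated with a minimal single-process sketch in Python. It is not the thesis's implementation: the abstract does not define the four set classifications, so the four `Kind` values below (an interval lies inside a partition, crosses its right edge, crosses its left edge, or spans it) are an assumed reading, and the function names `partition_index`, `map_phase`, `reduce_phase`, and `interval_join` are hypothetical. The rule that skips a pair when both tuples start in an earlier partition is a standard duplicate-avoidance technique swapped in here for illustration, and the load-balancing step is omitted.

```python
from collections import defaultdict
from enum import Enum
from itertools import chain, product


class Kind(Enum):
    """Assumed four-way classification of an interval relative to one partition."""
    INSIDE = 1       # starts and ends in this partition
    RIGHT_CROSS = 2  # starts here, ends in a later partition
    LEFT_CROSS = 3   # starts in an earlier partition, ends here
    SPAN = 4         # starts earlier and ends later


STARTED_EARLIER = {Kind.LEFT_CROSS, Kind.SPAN}


def partition_index(x, dmin, dmax, k):
    """Index of the equal-width partition of [dmin, dmax] that contains x."""
    width = (dmax - dmin) / k
    return min(k - 1, max(0, int((x - dmin) // width)))


def map_phase(relation, tuples, dmin, dmax, k):
    """Emit (partition, (relation, kind, interval)) for every partition an interval touches."""
    for start, end in tuples:
        first = partition_index(start, dmin, dmax, k)
        last = partition_index(end, dmin, dmax, k)
        for p in range(first, last + 1):
            if first == p == last:
                kind = Kind.INSIDE
            elif first == p:
                kind = Kind.RIGHT_CROSS
            elif last == p:
                kind = Kind.LEFT_CROSS
            else:
                kind = Kind.SPAN
            yield p, (relation, kind, (start, end))


def reduce_phase(values):
    """Join the R and S intervals of one partition, using the kinds to skip repeats."""
    r_side = [(kind, iv) for rel, kind, iv in values if rel == "R"]
    s_side = [(kind, iv) for rel, kind, iv in values if rel == "S"]
    for (r_kind, r), (s_kind, s) in product(r_side, s_side):
        # If both intervals began in an earlier partition, the pair was
        # already produced there, so it is skipped without comparison.
        if r_kind in STARTED_EARLIER and s_kind in STARTED_EARLIER:
            continue
        if r[0] <= s[1] and s[0] <= r[1]:  # closed intervals overlap
            yield r, s


def interval_join(r_tuples, s_tuples, dmin, dmax, k):
    """Simulate map, shuffle, and reduce in one process."""
    groups = defaultdict(list)
    emitted = chain(map_phase("R", r_tuples, dmin, dmax, k),
                    map_phase("S", s_tuples, dmin, dmax, k))
    for key, value in emitted:
        groups[key].append(value)  # shuffle: group values by partition key
    result = []
    for key in sorted(groups):
        result.extend(reduce_phase(groups[key]))
    return result


if __name__ == "__main__":
    R = [(0, 10), (20, 60), (70, 80)]
    S = [(5, 30), (55, 75), (90, 95)]
    for pair in interval_join(R, S, 0, 100, 4):
        print(pair)
```

With these inputs each overlapping pair is reported once, in the partition that contains the later of the two start points, even though intervals such as (20, 60) are replicated to three partitions. Note that this sketch filters candidate pairs on the Reduce side; the thesis describes filtering tuples through the key-value design itself, which the abstract does not specify in enough detail to reproduce.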
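[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13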
Article ID: 1862908
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1862908.html