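Design of an SSD-Oriented Data Persistence Method for Spark
Posted: 2018-04-14 00:20
Topics: big data + hybrid storage; Source: Journal of Computer Research and Development (《计算机研究与发展》), 2017, Issue 06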
[Abstract]: Data centers built on hybrid storage that combines solid-state drives (SSDs) and hard disk drives (HDDs) have become a high-performance platform for big data computing. Workloads in such data centers should be able to persist data with different characteristics to SSD or HDD on demand, so as to improve overall system performance. Spark is an efficient big data computing framework widely used in industry and is especially well suited to applications that iterate many times, because it can persist intermediate data in memory or on disk; persisting to disk also removes the limit that insufficient memory capacity places on dataset size. However, the current Spark implementation does not provide an explicit SSD-oriented persistence interface. Although data can be distributed proportionally across different storage media according to configuration, users cannot specify the persistence medium of an RDD on demand according to the characteristics of its data, so the mechanism lacks specificity and flexibility. This not only becomes a bottleneck to further improving Spark performance but also seriously limits the performance of the hybrid storage system. In view of this, the paper proposes, for the first time, an SSD-oriented data persistence strategy. It explores the principles of Spark data persistence, optimizes Spark's persistence architecture for hybrid storage systems, and provides a dedicated persistence API through which users can explicitly and flexibly specify the persistence medium of an RDD. Experimental results based on SparkBench show that, compared with the native version, Spark optimized with this scheme improves performance by 14.02% on average.
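The core idea of the abstract, letting the caller name the storage medium for each persisted dataset instead of relying on a global configuration, can be illustrated with a small stand-alone sketch. This is not the paper's API and not Spark code: `Medium`, `HybridStore`, `persist`, `load`, and the `.blk` file suffix are all hypothetical names chosen for illustration, and the "SSD" and "HDD" are simply two directories.

```python
import os
import pickle
import tempfile
from enum import Enum


class Medium(Enum):
    """Hypothetical persistence targets (not Spark's StorageLevel)."""
    MEMORY = "memory"
    SSD = "ssd"
    HDD = "hdd"


class HybridStore:
    """Toy block store: an in-memory dict plus one directory per disk medium."""

    def __init__(self, ssd_dir, hdd_dir):
        self._dirs = {Medium.SSD: ssd_dir, Medium.HDD: hdd_dir}
        self._memory = {}
        self._location = {}  # block id -> Medium chosen by the caller

    def _path(self, block_id, medium):
        return os.path.join(self._dirs[medium], f"{block_id}.blk")

    def persist(self, block_id, data, medium):
        """Persist a block on the medium the caller explicitly names."""
        if medium is Medium.MEMORY:
            self._memory[block_id] = data
        else:
            with open(self._path(block_id, medium), "wb") as f:
                pickle.dump(data, f)
        self._location[block_id] = medium

    def load(self, block_id):
        """Read a block back from wherever it was persisted."""
        medium = self._location[block_id]
        if medium is Medium.MEMORY:
            return self._memory[block_id]
        with open(self._path(block_id, medium), "rb") as f:
            return pickle.load(f)

    def medium_of(self, block_id):
        return self._location[block_id]


if __name__ == "__main__":
    # Two temporary directories stand in for SSD and HDD mount points.
    with tempfile.TemporaryDirectory() as ssd, tempfile.TemporaryDirectory() as hdd:
        store = HybridStore(ssd, hdd)
        # Frequently re-read iteration data goes to the faster medium.
        store.persist("hot_block", list(range(5)), Medium.SSD)
        # Rarely re-read data goes to the larger, slower medium.
        store.persist("cold_block", {"rows": 1000}, Medium.HDD)
        print(store.medium_of("hot_block"), store.load("hot_block"))
```

The design point mirrored here is only that the medium is an argument of the persist call, so placement follows the data's access pattern; how the paper's implementation integrates this with Spark's persistence architecture is not described in the abstract.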
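[Affiliations]: College of Computer Science and Software Engineering, Shenzhen University; School of Computers, Guangdong University of Technology; State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences); National Computer Network Emergency Response Technical Team/Coordination Center of China; Center for Strategic Studies, Chinese Academy of Engineering
[Funding]: National High Technology Research and Development Program of China (863 Program) (2015AA015305); Natural Science Foundation of Guangdong Province (2014A030313553); Guangdong Province Industry-University-Research Project (2013B090500055); Shenzhen Basic Research Discipline Layout Project (JCYJ20150529164656096)
[CLC number]: TP311.13; TP333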
Article ID: 1746881
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1746881.html