Research on Spark Caching Strategy Based on Task Structure Optimization
Published: 2018-08-30 16:58
[Abstract]: Spark, a big data computing framework, uses memory to greatly improve task execution efficiency. Because memory is limited, however, Spark tasks often run slowly or even fail when they hit a memory bottleneck, a problem closely tied to shortcomings of the framework itself and to the caching strategy for RDDs (Resilient Distributed Datasets). Since its inception, Spark has used LRU (Least Recently Used) as its cache replacement algorithm, but because Spark's cache scheduler cannot accurately predict how data will be used over the whole task, LRU performs poorly in some cases. To reduce task execution time and improve memory utilization, this thesis parses the structure of a Spark task, optimizes it to a certain extent, obtains the data and memory usage over the entire task, and uses the analysis results to optimize the existing caching strategy. The thesis first analyzes Spark's existing caching mechanism, compares the effect of different caching methods on task performance, and shows through practical examples that the existing caching strategy leaves considerable room for optimization. It then proposes methods for task structure analysis and task structure optimization. For task structure analysis, dynamic analysis is used to extract the key information of a Spark task; the dependency graph of the whole task is derived from the dependencies between RDDs, together with the data and memory usage during task execution. For task structure optimization, once the task information has been obtained, the positions of Stages are adjusted so that uses of the same RDD are more concentrated during computation, which reduces the probability of memory replacement and improves the execution efficiency of the whole task.
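The effect of concentrating uses of the same RDD can be illustrated with a toy LRU simulation. This is only a sketch under simplifying assumptions that are not taken from the thesis: every RDD occupies exactly one cache slot, the stages are independent so they may be reordered freely, and the names `count_lru_misses`, `interleaved`, and `grouped` are hypothetical.

```python
from collections import OrderedDict


def count_lru_misses(trace, capacity):
    """Count cache misses for an access trace under plain LRU replacement."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for rdd_id in trace:
        if rdd_id in cache:
            # Hit: mark this RDD as the most recently used entry.
            cache.move_to_end(rdd_id)
        else:
            # Miss: the RDD would have to be recomputed or reloaded.
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[rdd_id] = True
    return misses


# Each entry stands for one stage reading one cached RDD.
interleaved = ["A", "B", "C", "A", "B", "C"]  # uses of each RDD far apart
grouped = ["A", "A", "B", "B", "C", "C"]      # uses of each RDD adjacent

print(count_lru_misses(interleaved, capacity=2))
print(count_lru_misses(grouped, capacity=2))
```

With room for two RDDs, the interleaved order misses on every access, while the grouped order misses only on the first use of each RDD. In a real Spark job the reordering is constrained by the dependencies between Stages, which this sketch ignores.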
Building on the analysis and optimization of the task structure, the thesis introduces the concept of RDD weight and establishes an RDD weight model that combines several factors affecting RDD usage, including usage count, size, span, the ratio of partitions to cores, and computation cost. Based on this model, it proposes a new cache replacement strategy, RWR (RDD Weight Replace), which ensures that relatively more valuable data is kept in memory during replacement. This raises the cache hit rate and memory utilization, reduces computation errors caused by memory bottlenecks, and improves the fault tolerance of the Spark framework to some extent. Finally, comparative experiments across a variety of workloads, running single tasks, adjusting the cluster configuration, and mixing multiple tasks, compare the default unmodified Spark with the optimized Spark. The results show that the proposed task structure optimization strategy and cache replacement strategy effectively improve task execution efficiency.
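A weight-based replacement policy of this general kind might be sketched as follows. The abstract does not give the actual RWR weight formula, so the `weight` function here is a hypothetical stand-in that uses only three of the listed factors (remaining uses, computation cost, size); `RddInfo` and `WeightedCache` are likewise illustrative names, not part of Spark or of the thesis.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RddInfo:
    name: str
    size_mb: float
    remaining_uses: int    # how many later stages still read this RDD
    compute_cost_s: float  # time needed to recompute it if evicted


def weight(rdd: RddInfo) -> float:
    # Hypothetical formula: value grows with reuse and recomputation cost
    # and shrinks with the memory the RDD occupies.
    return rdd.remaining_uses * rdd.compute_cost_s / rdd.size_mb


class WeightedCache:
    def __init__(self, capacity_mb: float):
        self.capacity_mb = capacity_mb
        self.entries = {}

    def used_mb(self) -> float:
        return sum(r.size_mb for r in self.entries.values())

    def put(self, rdd: RddInfo) -> Optional[List[str]]:
        """Cache rdd; return evicted names, or None if rdd is not admitted."""
        free = self.capacity_mb - self.used_mb()
        victims = []
        # Consider the lowest-weight entries first.
        for candidate in sorted(self.entries.values(), key=weight):
            if free >= rdd.size_mb:
                break
            if weight(candidate) >= weight(rdd):
                return None  # never evict data more valuable than the newcomer
            victims.append(candidate)
            free += candidate.size_mb
        if free < rdd.size_mb:
            return None  # does not fit even after evicting everything allowed
        for victim in victims:
            del self.entries[victim.name]
        self.entries[rdd.name] = rdd
        return [v.name for v in victims]


cache = WeightedCache(capacity_mb=100)
cache.put(RddInfo("b", size_mb=40, remaining_uses=3, compute_cost_s=8))  # weight 0.6
cache.put(RddInfo("a", size_mb=60, remaining_uses=1, compute_cost_s=6))  # weight 0.1
evicted = cache.put(RddInfo("c", size_mb=50, remaining_uses=2, compute_cost_s=10))
print(evicted, sorted(cache.entries))
```

In this example LRU would evict `b`, the entry that was inserted first and never touched again, even though it is reused most often. The weight-based policy evicts `a` instead, and it declines to cache a newcomer when doing so would displace more valuable data.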
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP333
Article ID: 2213697
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2213697.html