Research and Implementation of Memory Optimization in the Cluster Computing Engine Spark

Published: 2018-12-16 15:46

[Abstract]: Parallel computing frameworks that pass data between iterations through memory are a current research focus. Compared with traditional computation based on disk and network, using memory reduces data-transfer time; for data-intensive tasks it can speed up execution by more than ten times. As this new generation of frameworks develops rapidly, how to make full use of memory, which remains relatively scarce, while keeping tasks efficient has become a pressing problem. Based on the cluster computing engine Spark, this thesis studies how parallel computing clusters use memory. By modeling and analyzing memory behavior, it automates caching decisions and optimizes the replacement policy, improving task efficiency under limited resources as well as the stability of task efficiency across different cluster environments. The main contributions are as follows. First, by analyzing the semantics of the code, the memory policy is automated: the scheduler automatically identifies valuable datasets (RDDs) and places them in the cache, which avoids cache pollution and also reduces the programmer's burden. Second, building on the detailed task information obtained from code semantic analysis, the memory replacement policy is optimized. This mainly includes computing RDD sizes and weights, reordering operations, forming a new replacement algorithm from a register-allocation model combined with weight information to replace the original LRU algorithm, and making the multi-level cache model intelligent. Third, a preliminary analysis is made of memory behavior on heterogeneous clusters.
Finally, experiments verify that the optimized scheme improves the adaptability of tasks to different cluster environments and makes tasks run more efficiently when memory is relatively limited, which enhances the overall practicality of the system. The results also offer a practical reference for memory usage in other parallel systems.
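The contrast between LRU and a weight-based replacement policy can be illustrated with a minimal sketch. It is not the thesis's algorithm (the abstract gives no formula and the register-allocation model is not reproduced here); the `CachedRDD` fields, the `weight` formula, and the sample numbers are all hypothetical, chosen only to show how information from static analysis, such as how many later stages still read an RDD, could drive eviction instead of recency alone.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CachedRDD:
    name: str
    size_mb: int            # estimated in-memory size (assumed known)
    remaining_uses: int     # later stages that still read this RDD (assumed from code analysis)
    recompute_cost: float   # assumed cost of rebuilding the RDD from its lineage


def weight(rdd: CachedRDD) -> float:
    """Hypothetical weight: benefit of keeping one MB of this RDD cached."""
    return rdd.remaining_uses * rdd.recompute_cost / rdd.size_mb


def choose_victims(cache: list[CachedRDD], needed_mb: int) -> list[CachedRDD]:
    """Evict lowest-weight RDDs first until at least needed_mb is freed."""
    victims: list[CachedRDD] = []
    freed = 0
    for rdd in sorted(cache, key=weight):
        if freed >= needed_mb:
            break
        victims.append(rdd)
        freed += rdd.size_mb
    return victims


# Illustrative cache contents for an iterative job (made-up numbers).
cache = [
    CachedRDD("points", size_mb=400, remaining_uses=9, recompute_cost=20.0),
    CachedRDD("tmp_filtered", size_mb=300, remaining_uses=0, recompute_cost=5.0),
    CachedRDD("centroids", size_mb=10, remaining_uses=9, recompute_cost=1.0),
]
evicted = [rdd.name for rdd in choose_victims(cache, needed_mb=200)]
print(evicted)
```

In this sketch an RDD that no later stage reads has weight zero and is evicted first, even if it was touched most recently, which is the kind of decision a pure LRU policy cannot make.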
[Degree-granting institution]: Tsinghua University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP333.1

[Co-cited literature]