
Optimization of Spark's Fault-Tolerance Mechanism Based on Compensation Functions

Published: 2018-10-14 18:38
【Abstract】: In the big data era, as data volumes grow and the value hidden in data is mined, distributed big-data computing systems have been widely adopted and studied by enterprises and institutions. As the number of nodes in a distributed system increases, so does its failure rate, making fault tolerance a key technology that research on distributed big-data computing systems cannot ignore. In big-data applications, particularly data mining and machine learning, iterative computation is a defining characteristic of the algorithms: they approach an optimal solution through repeated iteration. Spark, an emerging general-purpose big-data processing framework built on in-memory computing, delivers excellent performance on iterative workloads and has quickly become the most popular distributed big-data computing platform. However, Spark relies mainly on its Lineage mechanism for data fault tolerance: Lineage records how a dataset was derived from other datasets, and when a data partition is lost, Spark traces the recorded lineage to reconstruct the lost data's dependencies and recomputes it. In long-running workloads such as iterative computation, this recomputation-based recovery can take excessively long. This thesis analyzes the iterative computation process and its convergence, concluding that iterative computations are stable in the sense that they converge from different starting states. On this basis it proposes an optimistic fault-tolerance mechanism based on compensation functions and uses it to optimize Spark's fault tolerance. Unlike traditional fault tolerance, which recovers data by recomputation, when a failure causes data loss this mechanism quickly generates compensation values from a defined compensation function to stand in for the lost data, rather than recomputing the original data, while preserving the consistency of the overall dataset. The algorithm can then continue executing, with subsequent iterations correcting the substituted data until it converges to the correct result. In the failure-free case, the mechanism is optimistic: it adds no fault-tolerance measures and therefore incurs no extra overhead. Experimental results show that the compensation-function-based optimistic fault-tolerance mechanism effectively ensures the reliability of iterative data and outperforms existing fault-tolerance mechanisms.
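The core idea of the abstract can be illustrated with a minimal sketch in plain Python (this is not the thesis's actual Spark implementation; the partition layout, the `compensate` function, and the choice of zero as a compensation value are all illustrative assumptions). A contractive fixed-point iteration converges to the same result from any starting state, so when a partition's data is lost mid-run, a cheap compensation value can be substituted instead of recomputing the lost data, and later iterations correct the error.

```python
# Sketch of compensation-based optimistic fault tolerance (illustrative only).
# We run the contractive fixed-point iteration x <- 0.5*x + b, whose unique
# fixed point is x* = 2b. Because the iteration converges from ANY state,
# a lost partition can be replaced by a compensation value rather than
# recomputed, and subsequent iterations repair the substituted data.

def compensate(partition_size):
    """Hypothetical compensation function: return neutral values
    (zeros) for a lost partition instead of recomputing it."""
    return [0.0] * partition_size

def iterate(partitions, b, lost_partition=None):
    """One iteration over all partitions; optionally simulate losing one."""
    result = []
    for i, part in enumerate(partitions):
        if i == lost_partition:           # failure: this partition's data is gone
            part = compensate(len(part))  # substitute compensation values
        result.append([0.5 * x + b for x in part])
    return result

b = 3.0
parts = [[10.0, -4.0], [7.5, 0.0]]        # arbitrary initial state, 2 partitions
for k in range(60):
    # simulate a node failure that loses partition 1 at iteration 5
    parts = iterate(parts, b, lost_partition=1 if k == 5 else None)

# despite the simulated loss, every element converges to the fixed point 2b = 6.0
print(parts)
```

The contrast with Lineage-based recovery is that no dependency chain is replayed: the failure costs only one call to the compensation function, and the extra iterations (if any) needed to re-converge. In the failure-free path the sketch, like the proposed mechanism, does nothing at all.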
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Master's
【Year awarded】: 2017
【CLC number】: TP311.13





