Research and Application of Distributed ETL Based on Spark

Published: 2018-05-02 21:12

Topic: big data + distributed ETL; source: Donghua University, master's thesis, 2017

[Abstract]: In the big data era, ever more data must be processed and used. For enterprises, data has become a foundation for survival, and whether an enterprise makes good use of its own data is critical to its future development. Data warehouse technology offers an effective way for enterprises to analyze massive data, and in building a data warehouse, ETL is often the most time-consuming and complex stage. The steadily growing data volume places higher performance demands on ETL technology and poses greater challenges. To meet the ETL processing needs of massive data, performing ETL with distributed parallel technology is necessary. Current distributed ETL schemes built on the MapReduce paradigm can process massive data efficiently, but the MapReduce programming model offers only two kinds of processing, Map and Reduce, and multi-step processing incurs high I/O overhead. As a result, these schemes have performance problems in the ETL transformation phase, and considerable room remains to optimize processing efficiency and speed. Addressing the "massive volume" characteristic of big data and the limitations of MapReduce-based distributed ETL schemes, this thesis combines data warehouse theory with distributed processing technology to study distributed parallel ETL based on Spark. It proposes a design for distributed ETL, focusing on the parallel implementation of transformation processing during data transformation, and gives a suitable solution for each type of transformation. For early-stage non-aggregate operations, such as basic data cleaning and data format standardization, it proposes a partition-based parallel pipeline processing algorithm that processes data partition by partition, improving transformation efficiency. For aggregate operations, such as aggregating numeric data in fact tables, it adopts a partition pre-aggregation method to reduce the frequency of data transfer. Experimental results show that the proposed methods markedly accelerate the transformation of large data volumes, thereby improving the performance and processing efficiency of distributed ETL. The thesis then studies performance optimization of the Spark-based data processing flow. It analyzes in detail the data skew problems common in Spark processing and gives corresponding parallel tuning strategies for data skew in different scenarios. Related experiments demonstrate the effectiveness of the tuning strategies. Finally, based on the development of a real decision support system, the thesis describes the design and application of Spark-based distributed ETL, including a comparative analysis with traditional ETL development approaches; the analysis shows that the proposed Spark-based distributed ETL scheme is effective and highly scalable.
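The two transformation strategies named in the abstract can be illustrated with a minimal plain-Python sketch. It is not the thesis's implementation and does not use Spark: partitions are modelled as ordinary lists, and the record layout `(key, amount)`, the cleaning rules, and the function names `clean_partition`, `pre_aggregate`, and `merge_partials` are all hypothetical. In Spark itself, similar behaviour is commonly obtained with `RDD.mapPartitions` (per-partition processing) and `reduceByKey` (which combines values on the map side before the shuffle); whether the thesis uses those operators is not stated in the abstract.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (raw key, raw amount as text).
Record = Tuple[str, str]
Clean = Tuple[str, float]


def clean_partition(rows: Iterable[Record]) -> Iterator[Clean]:
    """Non-aggregate stage: clean and standardize one partition lazily.

    Each row flows through every step before the next row is read,
    so no intermediate dataset is materialized between steps.
    """
    for key, amount in rows:
        key = key.strip().upper()      # format standardization (assumed rule)
        if not key:
            continue                   # cleaning: drop rows with an empty key
        try:
            value = float(amount)
        except ValueError:
            continue                   # cleaning: drop unparsable amounts
        yield key, value


def pre_aggregate(rows: Iterable[Clean]) -> Dict[str, float]:
    """Aggregate stage, step 1: sum per key inside a single partition."""
    partial: Dict[str, float] = defaultdict(float)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def merge_partials(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate stage, step 2: merge the small per-partition results."""
    total: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Two toy partitions standing in for a distributed dataset.
partitions: List[List[Record]] = [
    [(" a ", "1.5"), ("b", "2"), ("a", "bad")],
    [("A", "2.5"), ("", "9"), ("b", "3")],
]

partials = [pre_aggregate(clean_partition(p)) for p in partitions]
totals = merge_partials(partials)
```

Only one partial sum per key per partition crosses the partition boundary, rather than every cleaned record, which is the reduction in data transfer that the abstract attributes to pre-aggregation.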
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP311.13
