Design and Implementation of a Distributed ETL System for Industrial Big Data

Published: 2018-08-26 10:59

[Abstract]: Since the start of the Industry 4.0 era, the rapid development of the Internet and computer technology, and their deep integration with industrial systems, have driven profound changes in productivity, relations of production, production technology, business models, and innovation models, moving the industrial system as a whole toward full intelligence. Industrial big data analysis is a key area in which future industry can gain a competitive advantage in the global market. With the arrival of the Internet of Things and cyber-physical systems, more data can be collected, analyzed, and used to make better-informed decisions. The foundation of the whole analysis process is how historical data is gathered from the various data sources into the analysis system and how real-time data is loaded into it from the sensors. This is the job of ETL (Extract-Transform-Load) tools. Traditional ETL tools mostly run in parallel on a single machine, and their processing speed and throughput fall far short of what industrial data analysis requires. Commercial ETL products perform well, but they are expensive and their hardware requirements are too high for widespread adoption. To address this, the thesis designs and implements a low-cost, high-performance distributed ETL system for industrial data processing.
The design is organized into three modules: data extraction, data transformation, and data loading. The extraction stage provides three schemes: change data capture based on split-table triggers, differential data synchronization based on data verification, and real-time data extraction based on the Redis Pub/Sub messaging pattern. The transformation stage is divided into a batch layer and a speed (acceleration) layer according to the data's requirements for processing speed and volume. The batch layer handles historical data with low real-time requirements and is implemented with Hadoop MapReduce; the speed layer handles real-time data and is implemented with Spark Streaming. In the loading stage, Sqoop loads structured data and the HDFS client loads unstructured data. Finally, the thesis reports functional and performance tests of the system. The experimental results show that the system performs well on industrial big data workloads, which is of practical value for the informatization of industrial data.
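As an illustration of the differential synchronization idea mentioned in the abstract, the following is a minimal sketch that compares per-row digests of a source and a target table to find the rows that need to be inserted, updated, or deleted. The abstract does not say how the thesis performs data verification, so the use of MD5 row digests, the in-memory dictionaries keyed by primary key, and the names `row_digest` and `diff_rows` are all assumptions made for this example, not the author's implementation.

```python
import hashlib
from typing import Dict, Hashable, List, Tuple

Row = Tuple[object, ...]


def row_digest(row: Row) -> str:
    """Return an MD5 digest of a row (hypothetical helper; MD5 is an assumption).

    repr() keeps None and the empty string distinct, and the unit-separator
    character keeps adjacent column values from running together.
    """
    joined = "\x1f".join(repr(value) for value in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def diff_rows(
    source: Dict[Hashable, Row], target: Dict[Hashable, Row]
) -> Tuple[List[Hashable], List[Hashable], List[Hashable]]:
    """Compare two tables keyed by primary key.

    Returns (inserts, updates, deletes): keys present only in the source,
    keys whose digests differ, and keys present only in the target.
    """
    src = {key: row_digest(row) for key, row in source.items()}
    tgt = {key: row_digest(row) for key, row in target.items()}
    inserts = sorted(key for key in src if key not in tgt)
    deletes = sorted(key for key in tgt if key not in src)
    updates = sorted(key for key in src if key in tgt and src[key] != tgt[key])
    return inserts, updates, deletes


if __name__ == "__main__":
    # Made-up sample rows: (id, device name, reading).
    source_table = {1: (1, "pump", 3.5), 2: (2, "valve", 1.0), 3: (3, "motor", 7.2)}
    target_table = {1: (1, "pump", 3.5), 2: (2, "valve", 0.9), 4: (4, "fan", 2.0)}
    print(diff_rows(source_table, target_table))
```

In a real deployment the digests would more likely be computed inside each database and only the digests transferred, so that full rows cross the network only when they differ; the sketch keeps everything in memory for clarity.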
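[Degree-granting institution]: University of Chinese Academy of Sciences (Shenyang Institute of Computing Technology, Chinese Academy of Sciences)
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13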
