Research and Application of Big Data Stream Processing Technology for Railway Operation and Maintenance

Published: 2018-06-07 16:47

Topic: big data + railway operation and maintenance; source: Beijing Jiaotong University, master's thesis, 2017

【Abstract】: We are now in the era of big data, and the railway transport industry is no exception. China's high-speed rail has reached a world-leading level, and the country has mastered many core technologies of high-speed trains. In railway operation and maintenance (O&M), advanced sensor technology, data acquisition equipment, and computer storage have accumulated massive amounts of O&M data. Analyzing and processing these data is of great significance for railway repair and maintenance work. Railway O&M data are now large in volume, diverse, and fast-accumulating, and traditional data processing methods struggle to handle them effectively. Their main drawback is long processing time, which makes it hard to meet the real-time requirements of O&M. This thesis therefore proposes a scheme based on stream processing technology and applies it to railway O&M data processing, addressing the long processing times incurred when handling large volumes of rapidly growing data. The thesis surveys the characteristics of current railway O&M data, compares stream processing with traditional processing, and proposes a data processing scheme based on a streaming framework. On this basis, it implements a stream processing system for railway communication optical fiber monitoring log files on the Spark Streaming framework, studies in depth how parameters such as concurrentJobs and batchDuration affect processing performance, and optimizes the system. The main work of the thesis is as follows:
(1) Building on an analysis of the core technologies of stream processing frameworks, and according to the data characteristics and processing requirements of current railway O&M, the thesis proposes a solution based on a streaming framework. Streaming data in the railway industry is growing rapidly, yet railway O&M still relies on traditional data processing techniques for application analysis, so data processing is not timely. The proposed stream processing scheme addresses the long processing times of traditional techniques on large, rapidly growing data. Experiments show that stream processing is considerably more timely than traditional processing.
(2) The thesis designs and implements an optical fiber monitoring log data processing system based on Spark Streaming. First, a distributed stream processing experimental environment is built. Then the streaming framework is used to process the log files in memory and in a distributed manner, extracting the key fields from the log files and storing them in a data warehouse. Finally, an interactive query tool is used to perform business analysis on the extracted data.
(3) Building on the work in (2), the Spark Streaming based system is optimized to improve its performance. Specifically, the distributed message queue Kafka is first integrated into the architecture to parallelize data ingestion; then Spark Streaming parameters such as concurrentJobs and batchDuration are tuned to improve the efficiency of log data processing.
The proposed stream processing scheme is validated experimentally. The experiments use optical fiber monitoring log data accumulated in a production environment, and several experiments are designed to compare the scheme with traditional data processing. The results show that the proposed scheme processes log files faster, that the distributed architecture scales well, and that system performance improves further as the number of nodes increases. The stream processing system implemented in the thesis meets the timeliness requirements of O&M, processes accumulated O&M data quickly, and improves the efficiency of data processing in railway O&M.
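The micro-batch idea behind step (2) of the abstract, extracting key fields from log lines one batch at a time, can be sketched in plain Python. This is an illustration only and not the thesis's Spark Streaming implementation. The log layout (`<epoch seconds> <fiber id> <optical power in dBm>`), the function names, and the per-fiber mean are all assumptions made for the example; the abstract does not describe the real log format or the analysis performed.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical log layout (an assumption, not taken from the thesis):
# "<epoch seconds> <fiber id> <optical power in dBm>"
LINE_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)\s+(\S+)\s+(-?\d+(?:\.\d+)?)$")

Record = Tuple[float, str, float]  # (timestamp, fiber_id, power_dbm)


def parse_line(line: str) -> Optional[Record]:
    """Extract the key fields from one log line, or return None if it is malformed."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    ts, fiber_id, power = match.groups()
    return float(ts), fiber_id, float(power)


def micro_batches(lines: Iterable[str], batch_duration: float) -> Dict[int, List[Record]]:
    """Group parsed records into consecutive windows of batch_duration seconds."""
    if batch_duration <= 0:
        raise ValueError("batch_duration must be positive")
    batches: Dict[int, List[Record]] = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip lines that do not match the assumed layout
        batch_index = int(record[0] // batch_duration)
        batches[batch_index].append(record)
    return dict(batches)


def mean_power_per_fiber(batch: List[Record]) -> Dict[str, float]:
    """Compute the mean optical power per fiber within a single batch."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for _, fiber_id, power in batch:
        totals[fiber_id] += power
        counts[fiber_id] += 1
    return {fiber: totals[fiber] / counts[fiber] for fiber in totals}


if __name__ == "__main__":
    sample = ["0 F1 -3.0", "1 F1 -5.0", "2 F2 -10.0", "garbage", "5 F1 -4.0"]
    for index, batch in sorted(micro_batches(sample, batch_duration=5.0).items()):
        print(index, mean_power_per_fiber(batch))
```

In this sketch, `batch_duration` plays a role loosely analogous to Spark Streaming's batchDuration: a shorter window yields results sooner but produces more, smaller batches to schedule. The effect of concurrentJobs is not modeled here.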
【Degree-granting institution】: Beijing Jiaotong University
【Degree level】: Master's
【Year degree awarded】: 2017
【Classification number】: TP311.13

【References】