
Research and Implementation of a Big-Data-Based Algorithm for Mining EMU Fault Association Rules

Published: 2018-05-22 15:07

  Topic: association rules + big data; Source: Beijing Jiaotong University, master's thesis, 2017


【Abstract】: As the most important mobile equipment for carrying out high-speed rail transport, the electric multiple unit (EMU) is an integration of advanced technologies. Compared with conventional rolling stock, an EMU differs greatly in vehicle structure, and its operating speed is far beyond what conventional rolling stock can reach. During operation, fault management and maintenance are an important part of the comprehensive support engineering of the high-speed railway system and a necessary guarantee for the safe operation and efficient use of EMUs. Within maintenance, the maintenance regime plays a guiding, key role, and a sound, well-designed regime is the basic premise for fast, safe, comfortable and efficient EMU operation. However, the emphasis on safety inevitably makes EMU maintenance procedures complex, which in turn has a large impact on efficiency. Improving EMU maintenance efficiency requires, on the one hand, deeper theoretical study of EMU structure; on the other hand, the large volume of EMU data accumulated in the past contains valuable information that has not yet been exploited. As big-data technologies mature, the value of these data becomes increasingly apparent. To make good use of them, the fault correlation information hidden in massive fault data must be extracted so that faults can be detected earlier.

There are three main maintenance strategies: periodic maintenance, condition-based maintenance, and corrective (run-to-failure) maintenance. Periodic maintenance is currently the dominant approach: maintenance is divided into five levels, and after a train has been in service for a certain time or mileage the corresponding maintenance is carried out and certain components are replaced. In this approach the maintenance intervals are set according to expert experience, with a margin built in to guarantee safety. This ensures safety but leads to over-maintenance: components that are still in good health are replaced anyway, raising operation and maintenance costs. Corrective maintenance is the other extreme, replacing a component only after it has failed completely, which is clearly unacceptable. Condition-based maintenance is therefore proposed as a compromise: the degree of degradation is judged from the component's current operating state, and the component is replaced shortly before it fails, ensuring transport safety while reducing cost.

Big-data analysis has already been applied in several areas of China's railway sector. A scheme for analyzing and processing EMU vibration data has been designed and implemented on the Hadoop platform; it removes linear drift from high-speed-rail vibration data, detects anomalies, and infers the type of component fault from the data distribution. Also based on Hadoop, historical traffic-flow data have been analyzed to estimate traffic flow efficiently and accurately, and an approach for building an EMU data warehouse, including EMU fault data, has been proposed. Big-data analysis is thus a future direction for the vast railway system and has already been applied in some areas of EMU operation and management. As demand in the EMU maintenance field grows, EMU fault maintenance will likewise need the support of big-data analysis.

Data mining on big data generally consists of data cleaning, data integration, data transformation, data mining, pattern evaluation and knowledge representation, and these stages are executed repeatedly during mining. Data mining includes association pattern mining, clustering, decision-tree mining and so on. The main work of this thesis, association rule mining, consists of two steps: mining frequent patterns and generating association rules from them. Rule generation is comparatively simple, so frequent pattern mining is the main factor determining the efficiency of association rule algorithms and the core issue that distinguishes them; any progress in frequent pattern mining therefore has an important impact on the efficiency of association rule mining and of other data mining tasks. In summary, this thesis implements an association rule algorithm on a distributed computing platform to analyze EMU fault data and to fill the gaps in current EMU operation and maintenance in China.

The earliest association rule algorithm, AIS, dates back to 1993. Because of its low efficiency, Agrawal et al. improved it and proposed the Apriori algorithm, whose characteristic is a level-wise, iterative search for frequent itemsets in a transaction database; its efficiency is much higher than that of AIS. As a classic algorithm, it became the basis for many later algorithms such as AprioriHybrid. Apriori gains its efficiency from two important properties of frequent itemsets: if an itemset R is frequent, all of its subsets are frequent; if R is not frequent, all of its supersets are infrequent. These two properties effectively reduce the number of candidate itemsets generated. Apriori uses an iterative, level-wise search in which frequent k-itemsets are used to explore (k+1)-itemsets. First the database is scanned, the count of each individual item is accumulated, and the items that satisfy the minimum support are collected into the set of frequent 1-itemsets, denoted L1. L1 is then used to find L2, the set of frequent 2-itemsets, and so on until no further frequent k-itemsets can be found; each Lk requires one full scan of the database. Besides its great usefulness in fault diagnosis, Apriori is widely applied in business, price analysis and other fields. It is intuitive, simple and easy to implement, but it also generates many candidate itemsets and requires many database scans; its strengths and weaknesses are equally obvious. This thesis improves the algorithm with respect to these weaknesses, optimizing its performance along two lines, ant colony optimization and Bloom filters, mainly by eliminating redundancy in the intermediate steps of generating associations so that the algorithm runs faster. The resulting algorithms are compared and the better-performing one is selected for further work.

On the other hand, analyzing the data well requires big-data tools so that computation is efficient and well organized. This thesis studies the Hadoop big-data platform in depth, including the Hadoop Distributed File System (HDFS) and the Spark framework. As a mainstream distributed storage system, HDFS has the following advantages: ①scalability: it can reliably store and process petabytes of data; ②low cost: data can be distributed and processed by clusters of commodity machines totalling thousands of nodes; ③efficiency: by distributing and replicating data, Hadoop can process data in parallel on the nodes where it resides; ④fault tolerance: rather than relying on better hardware to avoid errors, it provides mechanisms that handle corrupted or faulty data on ordinary nodes gracefully. HDFS can be seen as a solution designed for high data error rates; this fault tolerance keeps data safe and reliable and allows deployment on ordinary commodity hardware. Spark is an open-source, memory-based cluster computing system aimed at making data analysis faster. It is compact, developed by a small team led by Matei at the AMP Lab of the University of California, Berkeley. Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it superior for certain workloads: Spark uses in-memory distributed datasets and, besides supporting interactive queries, it optimizes iterative workloads. Spark is implemented in Scala and uses Scala as its application framework; unlike Hadoop, Spark and Scala are tightly integrated, so Scala can manipulate distributed datasets as easily as local collections. Although Spark was created to support iterative jobs on distributed datasets, it actually complements Hadoop and can run in parallel on the Hadoop file system.

Finally, based on the association rule algorithm and the big-data platform, the preceding theory is combined with EMU fault data to determine a mining scheme for fault association rules, with the goal of mining EMU fault association rules quickly and accurately and providing optimization suggestions for management departments to formulate more complete and reasonable EMU maintenance procedures. With the large-scale deployment of EMUs, maintenance management regulations have been supplemented, revised and improved, and maintenance plans and work flows have been adjusted and optimized; but because this work is still at an early stage, maintenance plans will continue to change with railway construction, component lifetimes and other factors, so in many respects China is still at the research stage. The main problem facing big-data analysis in China is a low return on investment: it consumes considerable resources without yet delivering the expected benefits. In the long run, however, as related industries become standardized and raw data accumulate across industries, big-data analysis is bound to have broad prospects. This thesis, "Research and Implementation of a Big-Data-Based Algorithm for Mining EMU Fault Association Rules", uses EMU operation and maintenance data to acquire and optimize EMU fault knowledge. It mines frequent itemsets and association rules of faults from massive EMU fault data with an improved Apriori algorithm, improves the algorithm with respect to its weaknesses, and ports the improved algorithm to Spark to carry out this work faster.
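For reference, the two quantities that drive the frequent-pattern and rule-generation steps described above are conventionally defined over a transaction database D as follows (standard textbook definitions, not notation taken from the thesis itself):

```latex
\mathrm{support}(A) = \frac{\lvert \{\, t \in D : A \subseteq t \,\} \rvert}{\lvert D \rvert},
\qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
```

An itemset A is frequent when its support meets the minimum support threshold; a rule A ⇒ B is reported when its confidence meets the minimum confidence threshold.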
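To make the level-wise Apriori search concrete, the following is a minimal, self-contained Python sketch of frequent-itemset mining and rule generation. The fault codes (F01, F07, F12), the records and the thresholds are invented for illustration only; they are not taken from the thesis or from real EMU fault data.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: frequent k-itemsets (Lk) seed the (k+1)-candidates."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # First scan: count single items and keep those meeting min_support (L1).
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent, k = {}, 1
    while current:
        frequent.update({s: support(s) for s in current})
        # Join Lk with itself, then prune any candidate that has an infrequent
        # k-subset (the Apriori property), before the next counting scan.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k))}
        current = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

def rules(frequent, min_confidence):
    """Rules A -> B with confidence = support(A ∪ B) / support(A)."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    out.append((set(antecedent), set(itemset - antecedent), conf))
    return out

# Hypothetical records: fault codes observed together on one EMU inspection.
records = [{"F01", "F07"}, {"F01", "F07", "F12"}, {"F01", "F12"}, {"F07", "F12"}]
frequent = apriori(records, min_support=0.5)
for lhs, rhs, conf in rules(frequent, min_confidence=0.6):
    print(lhs, "->", rhs, round(conf, 2))
```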
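The abstract names two directions for improving Apriori, ant colony optimization and a Bloom filter, but does not spell out the design here. The sketch below is therefore only one plausible way a Bloom filter could remove redundancy in the intermediate candidate step: while scanning at level k, every (k+1)-itemset that actually occurs in some transaction is hashed into the filter, so any later candidate the filter has never seen has zero support and can be discarded before the exact counting scan. A Bloom filter never produces false negatives, so no genuinely frequent itemset is lost; occasional false positives only mean a few useless candidates still get counted.

```python
import hashlib
from itertools import combinations

class BloomFilter:
    """Minimal bit-array Bloom filter: no false negatives, rare false positives."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size, self.num_hashes = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def key(itemset):
    return ",".join(sorted(itemset))

# During the level-1 counting scan, also record every pair that really co-occurs.
transactions = [{"F01", "F07"}, {"F01", "F12"}]          # hypothetical fault records
seen_pairs = BloomFilter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        seen_pairs.add(key(pair))

# Candidate pruning: a pair the filter has never seen cannot be frequent,
# so it is dropped before the expensive exact counting scan.
candidates = [{"F01", "F07"}, {"F07", "F12"}]
survivors = [c for c in candidates if key(c) in seen_pairs]
print(survivors)   # only the pair that actually co-occurs survives
```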
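The thesis finally ports the improved algorithm to Spark to run against fault data held in HDFS. As an illustration of how one Apriori pass maps onto Spark's in-memory distributed datasets, here is a small PySpark sketch; the HDFS path, the file format (one comma-separated list of fault codes per record) and the support threshold are assumptions made for the example, not the thesis's actual implementation.

```python
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="EMUFaultApriori")

# Hypothetical input: one maintenance record per line, comma-separated fault codes.
transactions = (sc.textFile("hdfs:///emu/fault_records.csv")
                  .map(lambda line: frozenset(line.strip().split(","))))
transactions.cache()                      # reused on every Apriori pass
total = transactions.count()
min_support = 0.01

# Pass 1: count individual fault codes; keep those meeting the threshold (L1).
l1 = (transactions.flatMap(lambda t: [(item, 1) for item in t])
                  .reduceByKey(lambda a, b: a + b)
                  .filter(lambda kv: kv[1] / total >= min_support))
frequent_items = sc.broadcast(set(l1.keys().collect()))

# Pass 2: build candidate pairs only from frequent items (Apriori pruning),
# then count them all in a single distributed scan.
l2 = (transactions.flatMap(lambda t: [(pair, 1) for pair in
                                      combinations(sorted(t & frequent_items.value), 2)])
                  .reduceByKey(lambda a, b: a + b)
                  .filter(lambda kv: kv[1] / total >= min_support))
print(l2.take(10))
```

Caching the transaction RDD keeps it in memory across passes, which is exactly the iterative-workload pattern the abstract credits Spark with optimizing.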
【Degree-granting institution】: Beijing Jiaotong University
【Degree level】: Master's
【Year of award】: 2017
【Classification codes】: U269; TP311.13




Document ID: 1922572


Permalink: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1922572.html


