Association Rule Mining in Cluster Environments and Its Applications

Published: 2018-03-24 15:10

Topic: big data. Focus: intelligent manufacturing. Source: Taiyuan University of Science and Technology, 2017 doctoral dissertation

[Abstract]: Big data has driven rapid development across industries, and new products, technologies, services, and business models have emerged in every field. The strategic significance of big data lies not in possessing huge data resources but in improving the ability to "process" data, and in adding value to the data through that processing. Data mining is an effective means of knowledge discovery in big data, and data mining techniques make it possible to understand in depth the value behind the data. Association rules, a major research topic in data mining, can effectively discover interesting associations among large numbers of itemsets when the association function or model of the data is unknown or cannot be determined. Existing association rule mining algorithms, because of their high time and space complexity and I/O cost, are poorly suited to big data analysis and processing tasks. This dissertation makes full use of the data processing power of MapReduce cluster systems to study association rule mining methods and performance optimization techniques for big data, and applies them to the quality analysis of cold roll machining. The main results are as follows:
(1) Two parallel frequent itemset mining algorithms for Hadoop clusters, FiDoop and FiDoop-HD, are proposed. FiDoop makes full use of the computing power of the MapReduce programming model, stores the frequent pattern tree in compressed form, and avoids the recursive construction of conditional pattern bases, which effectively improves parallel mining efficiency. FiDoop-HD, an extension of FiDoop, reduces the cost of decomposing itemsets and can therefore handle high-dimensional datasets effectively. Experiments on a Hadoop cluster verify the feasibility and effectiveness of the parallel algorithms.
(2) To address the data non-locality problem in parallel frequent pattern mining tasks, FiDoop included, a data partitioning strategy for parallel frequent itemset mining, FiDoop-DP, is proposed. Using Voronoi diagrams and LSH (locality-sensitive hashing), the strategy places highly correlated transactions in the same data partition as far as possible, which effectively reduces network transfer and computation costs and improves the efficiency of analyzing massive data. Experiments on a Hadoop cluster verify the effectiveness of the partitioning strategy.
(3) A parallel frequent itemset mining algorithm based on Spark in-memory computing is proposed. The algorithm makes full use of the Spark cluster's in-memory computing and its support for iterative data processing, and uses a newly defined model for estimating per-node computation load to resolve the load imbalance that arises during computation. Experiments on a Spark cluster verify the effectiveness of the algorithm.
(4) A prototype system for cold roll machining quality analysis in a cluster environment is designed and implemented. Set against the cold roll production of a steel enterprise, the prototype applies the frequent itemset mining algorithms and the data partitioning strategy above, and the dissertation gives a detailed analysis of its production data preprocessing, software architecture, and module functions. Results from running the prototype show that it can effectively identify the key processes in cold roll machining and the correlations between processes, offering enterprises a new technique and approach for product quality control.
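For readers unfamiliar with the terms used in the abstract, the following is a minimal Python sketch of frequent itemset counting in a map/reduce style, followed by the derivation of association rules from the result. It is not FiDoop, FiDoop-HD, or any other algorithm from the dissertation: it enumerates candidate itemsets naively and runs in a single process. The function names, the `max_size` cap, and the process labels in the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def map_phase(transaction, max_size):
    """Emit (itemset, 1) for every itemset of up to max_size items."""
    items = sorted(set(transaction))  # sorted so each itemset has one canonical key
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            yield itemset, 1


def reduce_phase(pairs):
    """Sum the counts emitted for each itemset (its support count)."""
    counts = Counter()
    for itemset, n in pairs:
        counts[itemset] += n
    return counts


def frequent_itemsets(transactions, min_support, max_size=2):
    """Keep the itemsets that occur in at least min_support transactions."""
    pairs = (kv for t in transactions for kv in map_phase(t, max_size))
    counts = reduce_phase(pairs)
    return {s: c for s, c in counts.items() if c >= min_support}


def association_rules(frequent, min_confidence):
    """Derive rules (antecedent, consequent, confidence) from frequent itemsets."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for antecedent in combinations(itemset, k):
                consequent = tuple(i for i in itemset if i not in antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so the antecedent is always present in `frequent`.
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical machining records: each transaction lists the labels seen for one roll.
transactions = [
    ["grinding", "quenching", "defect"],
    ["grinding", "quenching", "defect"],
    ["grinding", "tempering"],
    ["quenching", "defect"],
]

frequent = frequent_itemsets(transactions, min_support=2)
rules = association_rules(frequent, min_confidence=1.0)
```

On this toy data, `("defect", "quenching")` appears in three transactions, and both `quenching -> defect` and `defect -> quenching` reach a confidence of 1.0. Enumerating every candidate itemset grows combinatorially with transaction width, which is the kind of cost that tree-based and partition-aware methods such as those summarized above are meant to reduce.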
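[Degree-granting institution]: Taiyuan University of Science and Technology
[Degree level]: Doctoral
[Year conferred]: 2017
[Classification number]: TP311.13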
