Research on Parallel Distributed Association Rule Mining Algorithms Based on the Hadoop Platform

Published: 2018-02-04 21:11

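Keywords: association rule mining algorithm; parallelization of data mining; Apriori algorithm; Hadoop　Source: Jilin University, 2017 master's thesis　Document type: degree thesis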

[Abstract]: With the rapid development of science and technology in recent years, the things people do every day on computers, mobile phones, and other terminal platforms generate large amounts of data, and the ways of producing and acquiring data keep multiplying. In today's data era, data of all kinds is growing at great speed, and large Internet companies that produce hundreds of terabytes or even petabytes of data per day are common. How to obtain information quickly, efficiently, and accurately from such huge databases is one of the current hot topics in computer science research. Parallel distributed mining algorithms are an important means of analyzing massive data that may be spread across regions, and they have significant research and practical value.

Association rule mining is one of the classic data mining algorithms and has strong value for study and reference. The traditional association rule mining algorithm caches and outputs candidate sets one by one, and when parallelized these candidates must also be exchanged over the network. With large data volumes, however, the number of generated candidate itemsets explodes, which easily burdens machine memory and degrades the algorithm's efficiency. To address this defect, this thesis proposes an optimized algorithm, Y-IDA, which completes the merge-and-count process directly in memory in place of the traditional approach of outputting candidate sets one by one. It also modifies the Hadoop interface to change the MapReduce input mode and uses the first generated frequent itemset to clean the database, which reduces memory consumption and CPU time and improves execution efficiency.

The main work of this thesis includes: 1) implementing the basic serial Apriori algorithm, which lays the foundation for the subsequent parallelization; 2) proposing the optimized algorithm Y-IDA for parallel Apriori, which completes the merge-and-count process in memory instead of outputting candidate sets one by one, changes the traditional MapReduce input mode to reduce communication during execution, and cleans the data after the candidate 1-itemsets are generated to remove invalid data; 3) implementing the parallelized association rule algorithm on the Hadoop platform and designing an experimental scheme under the available experimental conditions, which verifies that Y-IDA produces the same results as the classic algorithm and compares the two in detail in terms of time efficiency, memory consumption, disk reads and writes, and CPU usage. Experiments on a fully distributed Hadoop platform using discrete data mining test data show that the improved algorithm shortens execution time and performs well in memory consumption, CPU usage, and disk I/O, leading to the conclusion that the improved algorithm is feasible and generally applicable.
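The abstract gives no code for Y-IDA, so the following is only a minimal sketch of the two ideas it names: cleaning transactions with the frequent 1-itemsets, and merging candidate counts in memory on the map side instead of emitting every candidate occurrence. It simulates the map and reduce steps in plain Python rather than on Hadoop. All function names, the sample transactions, and the split boundaries are assumptions made for illustration, and the candidates are simply all k-item subsets of each cleaned transaction, which is a simplification of Apriori's join-and-prune candidate generation.

```python
from collections import Counter
from itertools import combinations


def frequent_items(transactions, min_count):
    """Return the items that occur in at least min_count transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item for item, n in counts.items() if n >= min_count}


def clean(transactions, keep):
    """Drop infrequent items, then drop transactions too short to hold a pair."""
    cleaned = [sorted(set(t) & keep) for t in transactions]
    return [t for t in cleaned if len(t) >= 2]


def map_emit_each(split, k):
    """Baseline mapper: one (candidate, 1) record per candidate occurrence."""
    return [(c, 1) for t in split for c in combinations(t, k)]


def map_merge_in_memory(split, k):
    """Mapper that merges counts in memory: one record per distinct candidate."""
    local = Counter()
    for t in split:
        local.update(combinations(t, k))
    return list(local.items())


def reduce_counts(records, min_count):
    """Reducer: sum partial counts and keep candidates meeting min_count."""
    total = Counter()
    for candidate, n in records:
        total[candidate] += n
    return {c: n for c, n in total.items() if n >= min_count}


# Hypothetical sample data; the thesis's test data is not given in the abstract.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
min_count = 3

cleaned = clean(transactions, frequent_items(transactions, min_count))
splits = [cleaned[:3], cleaned[3:]]  # stand-in for two input splits

baseline = [r for s in splits for r in map_emit_each(s, 2)]
merged = [r for s in splits for r in map_merge_in_memory(s, 2)]

print(len(baseline), "records emitted one by one")
print(len(merged), "records after in-memory merging")
print(reduce_counts(merged, min_count))
```

Both mappers lead to the same frequent pairs after the reduce step; the in-memory version only changes how many intermediate records would have to be shuffled. The saving grows with the number of transactions per split that share candidates, at the cost of holding the per-split counter in mapper memory.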
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

[References]