
Research on Parallel Association Rule Mining Algorithms Based on the Hadoop Platform

Published: 2018-07-04 23:55

  Topics: big data + association rules; Source: Xi'an University of Science and Technology, master's thesis, 2017


[Abstract]: The explosive growth of data volumes challenges traditional computing techniques and serial algorithms, and it also creates new opportunities; "big data" has emerged in response. At this scale, serial association rule algorithms have to be rewritten, so parallelizing them is urgent, and parallel computing on a big data platform is a sound solution. Association rule mining, which discovers relationships among items of information, is an important data mining task. When the classic Apriori and FP-Growth algorithms process big data on a single machine, they run out of memory. Hadoop lowers the programming difficulty and takes care of splitting the data, so parallel association rule algorithms on Hadoop are an important research topic. This thesis addresses the problem in two parts.

(1) The H-Apriori (Apriori algorithm based on Hadoop) algorithm is studied and improved. In a big data environment the serial Apriori algorithm cannot cope with massive data, while H-Apriori emits a large number of key/value pairs with value 1 in its intermediate stage and reads every transaction in full, which generates many candidate itemsets and costs running time. The thesis removes infrequent items to cut redundant data, rebuilds the database, and optimizes the transaction-reading step, giving an improved Hadoop-based algorithm. The transaction database is reduced effectively, and hash-tree counting shortens the counting time, which improves the efficiency of the algorithm.

(2) An improved FP-Growth algorithm with load-balanced data partitioning on the Hadoop platform is proposed. In a big data environment the serial FP-Growth algorithm cannot cope with massive data, and PFP (Parallel FP-Growth) also struggles once the data reaches a certain volume. The improved algorithm combines load estimation with an improved balanced grouping method, which overcomes two weaknesses of PFP: its failure to process growing data volumes and its unbalanced load. The improved algorithm effectively balances the load across the cluster nodes and shortens the overall running time on the cluster.

After the Hadoop big data platform was set up, comparative experiments were carried out, and the algorithms were validated on authoritative data sets. The experiments show that the improved algorithms adapt better to big data and run more efficiently.
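The two ideas in the abstract can be illustrated with a small single-machine sketch. It is not the thesis's implementation: the abstract does not give the algorithms in detail, so the function names `prune_infrequent_items` and `balanced_groups`, the use of an absolute support count, and the greedy "heaviest item to lightest group" rule are all assumptions made here for illustration. The per-item load values are taken as given inputs because the abstract does not say how load is estimated.

```python
from collections import Counter
import heapq


def prune_infrequent_items(transactions, min_count):
    """Transaction reduction (assumed form of the Apriori improvement).

    Counts each item once per transaction, keeps only items whose support
    count is at least min_count, and drops transactions left empty.
    Returns the reduced transactions and the support counts of kept items.
    """
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_count}
    reduced = []
    for t in transactions:
        kept = sorted(set(t) & frequent)
        if kept:  # a transaction with no frequent item cannot support any itemset
            reduced.append(kept)
    return reduced, {item: counts[item] for item in frequent}


def balanced_groups(loads, n_groups):
    """Balanced grouping (hypothetical greedy rule, not the thesis's method).

    loads maps each frequent item to an estimated workload. Items are taken
    from heaviest to lightest and each goes to the group whose total load is
    currently smallest, tracked with a min-heap of (total_load, group_index).
    """
    heap = [(0, g) for g in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    # Ties on load are broken by item name so the result is deterministic.
    for item, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load, g))
    return groups


if __name__ == "__main__":
    txns = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["e"]]
    reduced, support = prune_infrequent_items(txns, min_count=2)
    print(reduced, support)
    print(balanced_groups({"a": 5, "b": 4, "c": 3, "d": 2}, n_groups=2))
```

In a Hadoop setting the pruning step would typically run as a counting job followed by a filtering pass over the transactions, and each group produced by the grouping step would be handled by one reducer; the sketch only shows the logic, not the MapReduce wiring.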
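[Degree-granting institution]: Xi'an University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP311.13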



