Optimization and Application of Association Rule Algorithms on a Cloud Platform

Published: 2018-04-24 10:02

Topic: cloud computing + data mining; source: Henan University of Technology, 2017 master's thesis

[Abstract]: With the rapid growth of the Internet, networks have reached into every aspect of daily life. The Internet has enriched and simplified everyday life and has, to some extent, changed the way people work. As Internet technology has spread, the volume of data generated by back-end systems has become massive, and extracting valuable information from big data has drawn attention across industries. Mining association rules between items from large, noisy data sets is one of the more widely used applications of data mining. Traditional single-machine data mining, however, cannot analyze massive data comprehensively, and the emergence of cloud computing offers the data mining field a new approach. The Hadoop cloud platform, developed under the Apache Foundation, lowers the technical barrier to cloud computing development. Combining the platform's parallel computing with an improved association rule algorithm makes it possible to mine massive data more effectively, uncover the patterns contained in the data set, and so support better decisions in commercial applications. This thesis takes the traditional Apriori algorithm as its theoretical basis, analyzes the algorithm's execution flow to identify the key points that can be optimized, improves the algorithm accordingly, and deploys the improved Apriori algorithm on the Hadoop platform to parallelize it and thereby process massive data. The thesis discusses in detail the current state and development of cloud computing and data mining research, and within Hadoop it focuses on the two core technologies, HDFS and MapReduce. Chapter 3 analyzes the traditional Apriori association algorithm, uses a worked example to show the shortcomings of its execution, surveys existing optimization methods, and compares their performance.
Chapters 4 and 5 are the core of the research. Chapter 4 proposes an improvement to the traditional Apriori algorithm that reduces its time complexity and raises its execution efficiency. It then introduces an interest threshold to further filter the rules the algorithm produces, improving the validity and usability of strong association rules, and presents the experimental results as line charts from which conclusions are drawn by comparison. Chapter 5 describes the process and routine configuration for setting up a Hadoop platform, explains the idea behind parallelizing the algorithm, and outlines the retail industry's demand for cloud-based association analysis. It then deploys the optimized Apriori algorithm on Hadoop and compares its execution efficiency with that of an ordinary serial algorithm, using the experimental results to discuss the feasibility and advantages of parallelization.
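To make the pipeline named in the abstract concrete, here is a minimal sketch of textbook Apriori followed by rule filtering. It is not the thesis's improved algorithm, whose details the abstract does not give. Two points are assumptions made for illustration: lift stands in for the "interest" measure, and the sample retail baskets are invented. The function names `frequent_itemsets` and `strong_rules` are likewise hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Textbook level-wise Apriori. Returns {itemset: support}."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)
    # Level 1 candidates: every distinct item as a 1-itemset.
    candidates = {frozenset([item]) for t in tx for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Counting pass. In a MapReduce setting this is the part that
        # parallelizes naturally: mappers emit (candidate, 1) per
        # transaction split, reducers sum the counts.
        counts = {c: 0 for c in candidates}
        for t in tx:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: cnt / n for c, cnt in counts.items()
                 if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def strong_rules(frequent, min_confidence, min_lift):
    """Generate rules lhs -> rhs, filtered by confidence and by lift.

    Lift is used here as an ASSUMED interest measure; the thesis may
    define its interest threshold differently.
    """
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs_items in combinations(itemset, r):
                lhs = frozenset(lhs_items)
                rhs = itemset - lhs
                confidence = support / frequent[lhs]
                lift = confidence / frequent[rhs]
                if confidence >= min_confidence and lift >= min_lift:
                    rules.append((lhs, rhs, support, confidence, lift))
    return rules


# Invented sample baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
freq = frequent_itemsets(baskets, min_support=0.6)
for lhs, rhs, supp, conf, lift in strong_rules(freq, 0.7, 1.0):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2), round(lift, 2))
```

On these baskets, support and confidence alone accept rules such as bread -> milk, even though buying bread makes milk slightly less likely than its baseline rate. Adding the lift filter leaves only the diaper/beer rules, which is the kind of extra screening an interest threshold is meant to provide.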
[Degree-granting institution]: Henan University of Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13;TP393.09
