Research on a Parallel Apriori Data Mining Algorithm

Published: 2018-08-21 08:26

[Abstract]: Data is valuable because analyzing it reveals underlying patterns that can guide future production and work. With the rapid progress of the Internet and information technology, every industry has accumulated large amounts of data. In China, the first to apply big data were a few large Internet companies. They have hundreds of millions of customers, whose online behavior generates large volumes of data. By analyzing customers' purchasing or reading habits, these companies can selectively push products and information to them. Big data is also very valuable in traditional industries. Power companies, for example, can use data analysis to predict line loads and then optimize the storage and dispatch of electricity more precisely. Traditional manufacturers use feedback from usage data to plan the research and development of next-generation products. In short, using data analysis to guide future work has become a trend, so using data effectively and mining the patterns behind it has become especially important. Data mining technology emerged against this background. Data mining falls into six main categories: association, classification, regression, clustering, prediction, and diagnosis algorithms. This thesis focuses on association algorithms. One of the classic algorithms for association rule mining is the Apriori algorithm, which can accurately mine items in the data that are associated with one another. A typical application is product placement in supermarkets, where merchants place goods that customers tend to buy together next to each other. The original algorithm design did not fully account for data scale, and it may be inefficient on very large data sets.
The approach of this thesis is therefore to optimize the Apriori algorithm to some extent and to port it to the Hadoop platform using MapReduce. The traditional Apriori algorithm thereby becomes a distributed algorithm: tasks and data can be distributed across a cluster to improve mining efficiency. Hadoop is a cloud computing platform. Its advantage is that it can store and process data on large numbers of inexpensive, not highly reliable machines, and its programming model makes it convenient to convert some serial algorithms to run concurrently. This thesis introduces the background of Hadoop and association algorithms in detail, discusses the feasibility of implementing the Apriori algorithm with the MapReduce programming framework and deploying it on Hadoop, and demonstrates the efficiency gains of this approach. It is hoped that this work will serve as a reference for researchers porting algorithms to cloud platforms.
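The abstract does not give the thesis's implementation, so the following is only a minimal sketch of how Apriori's support counting might be split into a map phase and a reduce phase. It runs in a single Python process and simulates the cluster with a list of data partitions; it does not use any Hadoop API. All function names (`map_count`, `reduce_counts`, `generate_candidates`, `apriori`) and the sample transactions are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def map_count(partition, candidates):
    """Map phase (simulated): count candidate itemsets inside one partition."""
    local = Counter()
    for transaction in partition:
        for cand in candidates:
            if cand <= transaction:  # candidate is a subset of the transaction
                local[cand] += 1
    return local


def reduce_counts(partials, min_support):
    """Reduce phase (simulated): sum partial counts, keep frequent itemsets."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {itemset: n for itemset, n in total.items() if n >= min_support}


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets, then prune any candidate
    that has an infrequent (k-1)-subset (the Apriori property)."""
    prev = set(frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(partitions, min_support):
    """Level-wise Apriori; each level is one simulated map/reduce round."""
    items = {item for part in partitions for t in part for item in t}
    candidates = {frozenset([item]) for item in items}
    result = {}
    k = 1
    while candidates:
        partials = [map_count(part, candidates) for part in partitions]
        frequent = reduce_counts(partials, min_support)
        result.update(frequent)
        k += 1
        candidates = generate_candidates(frequent, k)
    return result


# Hypothetical data: two partitions standing in for two cluster nodes.
partitions = [
    [{"bread", "milk"}, {"bread", "diapers", "beer"}],
    [{"milk", "diapers", "beer"}, {"bread", "milk", "diapers"},
     {"bread", "milk", "beer"}],
]
result = apriori(partitions, min_support=3)
```

In this sketch each pass over the data corresponds to one map/reduce round, so the number of rounds grows with the size of the largest frequent itemset. On a real cluster that per-round overhead is one of the costs an optimized design would need to weigh.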
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13

[References]