Research on Distributed Clustering Algorithms Based on MapReduce

Published: 2018-07-25 18:55

[Abstract]: Cluster analysis is one of the most fundamental data analysis techniques in data mining and is widely used in economics, the social sciences, computer science, and other fields. With the rapid development of Internet technology, however, the volume of data generated by network applications has grown sharply, posing major technical challenges for traditional clustering methods. Extracting valuable information from massive data quickly and effectively has become an urgent problem in many industries, and the maturing of cloud computing technology makes such processing feasible. Hadoop is an open-source distributed cloud computing platform whose core components are the Hadoop Distributed File System (HDFS) and MapReduce: HDFS stores massive data, and the MapReduce programming model processes that data in parallel. Compared with traditional parallel programming models, MapReduce encapsulates low-level details such as data partitioning, task scheduling, and parallel execution, so users can develop distributed applications without understanding the underlying distributed machinery, which greatly simplifies the design of parallel programs. K-means, a classic clustering algorithm, is applied in many industries, but as the data scale grows its number of iterations increases markedly, which reduces its efficiency. To make it suitable for clustering large-scale data, this thesis first parallelizes the algorithm on the Hadoop platform according to the MapReduce programming model, and then addresses two weaknesses of K-means: the blindness of randomly selecting initial cluster centers and the tendency of the clustering result to fall into a local optimum.
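As a rough illustration of how one K-means iteration fits the MapReduce model, the sketch below simulates the map, combine, and reduce steps in plain Python on in-memory "splits". It is not the thesis's Hadoop implementation: the function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans`), the combiner-style partial sums, and the convergence tolerance are all assumptions made for illustration.

```python
import math

def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))

def map_phase(split, centers):
    """Map + combine: one (cluster_id, (vector_sum, count)) pair per cluster seen in this split."""
    partial = {}
    for point in split:
        j = nearest(point, centers)
        s, n = partial.get(j, ([0.0] * len(point), 0))
        partial[j] = ([a + b for a, b in zip(s, point)], n + 1)
    return list(partial.items())

def reduce_phase(pairs, centers):
    """Reduce: merge the partial sums per cluster and return the new centers."""
    totals = {}
    for j, (s, n) in pairs:
        ts, tn = totals.get(j, ([0.0] * len(s), 0))
        totals[j] = ([a + b for a, b in zip(ts, s)], tn + n)
    # Assumption: a cluster that received no points keeps its previous center.
    return [[x / totals[j][1] for x in totals[j][0]] if j in totals else list(centers[j])
            for j in range(len(centers))]

def kmeans(splits, centers, max_iter=100, tol=1e-9):
    """Run one simulated MapReduce job per iteration until the centers stop moving."""
    for _ in range(max_iter):
        pairs = [kv for split in splits for kv in map_phase(split, centers)]
        new_centers = reduce_phase(pairs, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

Each iteration corresponds to one job, and only per-cluster sums and counts cross the map/reduce boundary, so fewer iterations translate directly into less data transfer. That is why the choice of initial centers matters for the parallel version.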
The main work of this thesis is as follows:
(1) Building on an analysis of the traditional K-means algorithm and borrowing the max-min distance idea, a parallel K-means algorithm based on max-min distance is proposed. Cluster centers are selected by the max-min distance criterion and used as the initial centers for K-means, which avoids the initial centers lying too close together, as can easily happen with random selection, and thereby improves the quality of the clustering result. To improve efficiency, a parallel version of the algorithm is designed and implemented.
(2) The principle, strengths, and weaknesses of the one-pass clustering algorithm are analyzed and, combined with the characteristics of traditional K-means, the parallel OPKMEANS algorithm is proposed. Exploiting the simplicity and efficiency of one-pass clustering, the algorithm first performs a fast "coarse" clustering of the data set and then uses the resulting centers as the initial centers for K-means. This avoids the blindness of random center selection and reduces the number of K-means iterations, which lowers the data transfer overhead of the parallel process and improves execution efficiency.
(3) To verify the effectiveness of the improved algorithms, a Hadoop distributed computing platform was built on virtual machines on the basis of a study of Hadoop's principles, and multiple groups of experiments were run to verify the superiority of the above algorithms in terms of clustering quality, speedup, and scalability.
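The two seeding ideas named in items (1) and (2) can be sketched serially as follows. The abstract does not give the thesis's exact rules, so everything specific here is an assumption: `maxmin_seeds` is the common farthest-first variant that starts from the first point, and `one_pass_seeds` uses a hypothetical distance threshold `radius` to decide when a point opens a new cluster. How OPKMEANS reconciles the number of coarse clusters with k is not stated in the source and is not modeled.

```python
import math

def maxmin_seeds(points, k):
    """Max-min (farthest-first) seeding: each new seed maximizes its
    distance to the nearest seed chosen so far."""
    seeds = [points[0]]  # assumption: the first point is the first seed
    # d[i] = distance from points[i] to its nearest seed so far
    d = [math.dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        i = max(range(len(points)), key=d.__getitem__)
        seeds.append(points[i])
        d = [min(di, math.dist(p, points[i])) for di, p in zip(d, points)]
    return seeds

def one_pass_seeds(points, radius):
    """One-pass coarse clustering: scan the data once; a point joins the
    nearest cluster if its center is within `radius`, else opens a new one.
    Returns the cluster means, usable as initial K-means centers."""
    sums, counts = [], []
    for p in points:
        centers = [[x / n for x in s] for s, n in zip(sums, counts)]
        if centers:
            j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            if math.dist(p, centers[j]) <= radius:
                sums[j] = [a + b for a, b in zip(sums[j], p)]
                counts[j] += 1
                continue
        # No cluster is close enough: start a new one at this point.
        sums.append(list(p))
        counts.append(1)
    return [[x / n for x in s] for s, n in zip(sums, counts)]
```

Both functions return centers that can be passed to any K-means routine in place of randomly drawn ones. The max-min variant fixes the number of seeds at k, while in the one-pass variant the threshold determines how many coarse clusters appear.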
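[Degree-granting institution]: Jiangxi University of Science and Technology (江西理工大学)
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13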

[References]