Research on Optimizing the Parallelized k-means Clustering Algorithm Based on Mahout

Published: 2018-05-13 06:56

Topics: cluster analysis + k-means algorithm; Source: Huazhong University of Science and Technology, master's thesis, 2016

[Abstract]: Cluster analysis is an important means of extracting useful information from large volumes of data, and the algorithms used for it are called clustering algorithms. The k-means clustering algorithm is simple, fast, and effective, and is one of the most widely used classical clustering algorithms. The rapid growth of the Internet industry has caused data volumes to surge, and traditional k-means can no longer meet the clustering needs of massive data sets. MapReduce parallelization of k-means, and optimization of the parallelized algorithm, are therefore particularly important. This thesis examines how parallelized k-means is implemented and, on that basis, adopts an optimization strategy suited to massive data processing, aiming to reduce the algorithm's time and space complexity while obtaining better clustering results.

Starting from the state of research on k-means optimization and parallelization, the thesis finds that existing optimization methods mainly target serial k-means, while parallelization research mainly centers on algorithm design. Optimization of parallelized k-means is therefore still a weak area of research both in China and abroad, which motivates the approach taken here: optimizing parallelized k-means with an algorithm of low time complexity. As background, the thesis introduces the open-source distributed framework Hadoop, the MapReduce programming model, and Mahout, an algorithm library that provides distributed implementations of large-scale machine learning algorithms such as collaborative filtering, clustering, and classification. It then studies in detail the principle of k-means, its shortcomings, and its parallelized implementation in Mahout. Finally, it adopts an optimization method aimed at parallelized k-means: using Canopy, a "coarse clustering" algorithm with very low time complexity, to optimize parallelized k-means.

In the performance testing phase, k-means before and after Canopy optimization is implemented through the algorithm driver and other interfaces provided by the Mahout library, and both versions are run on a distributed Hadoop test platform. Parameters are adjusted using the control-variable method (varying one parameter at a time), and clustering performance is tested on Gaussian-distributed data sets. The experimental data show that the optimized algorithm clusters markedly better: while maintaining efficiency, it converges to more accurate centroids in fewer iterations, and its stability also improves significantly. Overall, Canopy-based optimization of k-means is clearly effective.
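The Canopy-then-k-means idea described in the abstract can be sketched in a few lines. What follows is a minimal single-machine Python illustration, not the thesis's Mahout/MapReduce implementation: the function names (`canopy_centers`, `kmeans`), the threshold values, and the sample points are all assumptions chosen for illustration. It assumes the common formulation of Canopy with a loose threshold T1 and a tight threshold T2 (T1 > T2), and uses the canopy means as initial centroids for standard (Lloyd's) k-means.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def canopy_centers(points, t1, t2):
    """Single cheap pass that groups points into canopies (assumes t1 > t2).

    Points closer than t1 to a canopy's seed join that canopy; points closer
    than t2 are also dropped from the candidate list, so they cannot seed a
    new canopy. Returns the mean of each canopy as a candidate centroid.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        members = [seed]
        still_remaining = []
        for p in remaining:
            d = distance(seed, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                still_remaining.append(p)
        remaining = still_remaining
        centers.append(mean(members))
    return centers


def kmeans(points, centroids, max_iter=100, tol=1e-6):
    """Lloyd's k-means starting from the given initial centroids.

    Returns (final_centroids, iterations_used).
    """
    centroids = list(centroids)
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        shift = max(distance(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tol:
            return centroids, iteration
    return centroids, max_iter


if __name__ == "__main__":
    # Hypothetical toy data: three well-separated groups of three points.
    data = [(0, 0), (0, 1), (1, 0),
            (10, 10), (10, 11), (11, 10),
            (20, 0), (20, 1), (21, 0)]
    seeds = canopy_centers(data, t1=5.0, t2=3.0)
    final, iterations = kmeans(data, seeds)
    print(len(seeds), "canopy seeds; k-means stopped after", iterations, "iteration(s)")
    print(final)
```

In this sketch the canopy pass also determines the number of initial centroids, so k is not chosen by hand. Whether the thesis uses Canopy in exactly this way, and which threshold values it uses, is not stated in the abstract.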
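[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13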

[References]