Research on the K-Means Clustering Algorithm Based on the Hadoop Cloud Computing Platform

Published: 2018-03-11 21:47

Topic: Hadoop　Focus: cloud computing　Source: Harbin University of Science and Technology, 2017 master's thesis　Type: degree thesis

【Abstract】: Cluster analysis is one of the most active research directions in data mining and has long attracted researchers and developers. Clustering partitions the raw data objects supplied by the user into several clusters; the goal is for objects within the same cluster to be highly similar and for objects in different clusters to be dissimilar. With the development of the mobile Internet, the Internet of Things, and artificial intelligence, the volume of information generated on the Web keeps growing, and clustering very large datasets efficiently and stably has become a new research problem. The rise of the Hadoop distributed cloud computing platform makes it possible to address the performance problems of traditional serial algorithms by computing in parallel across multiple nodes. This thesis studies the Hadoop distributed cloud computing platform, clustering algorithms, and related technologies, and designs and implements a cluster analysis system based on Hadoop. The system has a three-layer architecture: a bottom driver layer, a middle logic layer, and an external service layer. The thesis describes the system's design and implementation in detail. The aim is to encapsulate the concrete clustering operations internally and expose a simple operating interface, so that the algorithm implementation is transparent to the user and cluster analysis runs stably and efficiently. Based on an analysis of the problems in the K-Means algorithm, an improvement scheme for the Hadoop distributed platform is designed. With the experimental environment configured using the cluster analysis system implemented in the thesis, the algorithm's execution is optimized in three respects: parallel random sampling, parallelized sample distance computation, and parallelized clustering of data objects. The flow of the improved parallel K-Means algorithm is also described in detail. Finally, the improved parallel K-Means algorithm is tested experimentally in four respects: convergence speed, accuracy, initialization sampling rate, and speedup in a cluster environment. The results show that the Hadoop-based cluster analysis system designed in the thesis can provide efficient, stable, and configurable cluster analysis services, and that the improved parallel K-Means clustering algorithm can process large-scale clustering computations quickly.
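The abstract does not give the improved algorithm itself, so the following is only a minimal sketch of how a standard K-Means iteration is commonly split into a map step (distance computation and assignment) and a reduce step (centroid update). It runs in a single process and does not use Hadoop; the function names `map_assign`, `reduce_mean`, `kmeans_iteration`, and `kmeans` are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_mean(values):
    """Reduce step: average all points that were assigned to one centroid."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centroids):
    """One pass: group the map outputs by key, then reduce each group."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its old position.
    return [
        reduce_mean(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Repeat iterations until no centroid moves farther than tol."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids
```

On a real cluster, the map calls would run on separate data splits and the grouping by centroid index would be handled by the framework's shuffle phase; here a dictionary stands in for that step.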
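【Degree-granting institution】: Harbin University of Science and Technology
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP311.13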
