Research on Distributed Clustering Algorithms Based on MapReduce

Published: 2018-07-25 18:55
[Abstract]: Cluster analysis is one of the most fundamental data analysis techniques in data mining and is widely used in economics, the social sciences, computer science, and other fields. With the rapid development of Internet technology, however, the volume of data generated by network applications has grown sharply, posing major technical challenges for traditional cluster analysis methods. Extracting valuable information from massive data quickly and effectively has become an urgent problem in many industries. The growing maturity of cloud computing makes fast, effective processing of massive data possible. Hadoop is an open-source distributed cloud computing platform whose core components are the distributed file system (HDFS) and MapReduce: HDFS stores massive data, and the MapReduce programming model processes that data in parallel. Compared with traditional parallel programming models, MapReduce encapsulates low-level details such as data partitioning, task scheduling, and parallel processing, so users can develop distributed applications without understanding the underlying distributed details, which greatly simplifies the design of parallel programs. K-means, a classic clustering algorithm, is applied in many industries, but as the data scale grows its number of iterations increases markedly, which reduces execution efficiency. To make it better suited to cluster analysis of large-scale data, this thesis first parallelizes the algorithm on the Hadoop platform according to the MapReduce programming principle, and then addresses two weaknesses of K-means: the blindness of randomly selecting cluster centers and the tendency of the clustering result to fall into a local optimum.

The main work of the thesis is as follows:

(1) Building on an analysis of the traditional K-means algorithm and borrowing the idea of max-min distance, a parallel K-means algorithm based on max-min distance is proposed. Cluster centers are selected by the max-min distance rule and used as the initial centers of K-means, which avoids the case, common with random selection, in which the initial centers lie too close together, and thereby improves the quality of the clustering result. To improve efficiency, a parallel version of the algorithm is designed and implemented.

(2) The principle, advantages, and disadvantages of the one-pass clustering algorithm are analyzed and, combined with the characteristics of the traditional K-means algorithm, the parallel OPKMEANS algorithm is proposed. Exploiting the simplicity and efficiency of one-pass clustering, the algorithm first performs a fast "coarse" clustering of the data set and then uses the resulting centers as the initial centers of K-means. This avoids the blindness of random center selection and reduces the number of K-means iterations, which lowers the data transfer overhead of the parallel process and improves execution efficiency.

(3) To verify the effectiveness of the improved algorithms, a Hadoop distributed computing platform was built on virtual machines, based on a study of how Hadoop works, and several groups of experiments were run to evaluate the algorithms in terms of clustering quality, speedup, and scalability.
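The two seeding ideas and the map/reduce structure of a K-means iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis's implementation: the abstract gives no code, so every function name here (`max_min_seeds`, `one_pass_seeds`, `kmeans_map`, `kmeans_reduce`, `kmeans`), the choice of the first point as the first seed, and the `radius` threshold for the one-pass step are assumptions. It runs in a single process, with a dictionary standing in for Hadoop's shuffle phase.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_seeds(points, k):
    """Pick k initial centers by the max-min distance rule.

    Assumption: the first center is simply points[0]; each later center is
    the point whose distance to its nearest already-chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def one_pass_seeds(points, radius):
    """One-pass "coarse" clustering: a single scan with a distance threshold.

    Assumption: a point joins the nearest existing cluster if it lies within
    `radius` of that cluster's running mean; otherwise it starts a new cluster.
    """
    sums, counts = [], []
    for p in points:
        means = [tuple(s / n for s in sm) for sm, n in zip(sums, counts)]
        best = min(range(len(means)), key=lambda i: dist(p, means[i]), default=None)
        if best is not None and dist(p, means[best]) <= radius:
            sums[best] = [s + x for s, x in zip(sums[best], p)]
            counts[best] += 1
        else:
            sums.append(list(p))
            counts.append(1)
    return [tuple(s / n for s in sm) for sm, n in zip(sums, counts)]


def kmeans_map(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points assigned to one center."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Run map -> shuffle -> reduce rounds until the centers stop moving."""
    for _ in range(max_iter):
        groups = defaultdict(list)  # stands in for the shuffle phase
        for p in points:
            idx, value = kmeans_map(p, centers)
            groups[idx].append(value)
        # A center that received no points keeps its previous position.
        new_centers = [kmeans_reduce(groups[i]) if i in groups else centers[i]
                       for i in range(len(centers))]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

Either seeding function can feed `kmeans`, for example `kmeans(points, max_min_seeds(points, k))` or `kmeans(points, one_pass_seeds(points, radius))`. In the second case the number of clusters is determined by `radius` rather than given up front, which is one plausible reading of the "coarse" clustering step.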
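[Degree-granting institution]: Jiangxi University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13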

Article ID: 2144761


Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2144761.html

