Research on Distributed Clustering Algorithms Based on MapReduce
[Abstract]: Cluster analysis is one of the most fundamental data analysis techniques in data mining and is widely used in economics, social science and computer science. With the rapid development of Internet technology, however, the data generated by network applications has grown rapidly, which poses great technical challenges to traditional clustering methods. Extracting valuable information from massive data quickly and effectively has become an urgent problem in many industries. The maturing of cloud computing technology makes it possible to process massive data quickly and effectively. Hadoop is an open-source distributed cloud computing platform whose core components are the Hadoop Distributed File System (HDFS) and MapReduce: HDFS provides storage for large amounts of data, and MapReduce provides a programming model for processing that data in parallel. Compared with traditional parallel programming models, MapReduce encapsulates the details of data partitioning, task scheduling and parallel processing, so users can develop distributed applications without understanding the low-level details of the distributed system, which greatly simplifies the design of parallel programs. K-means, a classical clustering algorithm, is applied in many industries. As the data scale grows, however, the number of iterations the algorithm needs increases noticeably, which reduces its efficiency. To apply it to the cluster analysis of large-scale data, this thesis first parallelizes the algorithm on the Hadoop platform according to the MapReduce programming model. It then addresses two weaknesses of K-means: the blindness of randomly selecting the initial cluster centers, and the tendency of the clustering result to fall into a local optimum.
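The MapReduce formulation of K-means mentioned in the abstract can be illustrated with a minimal single-process sketch in Python. It is not the thesis's Hadoop implementation: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) are hypothetical, the shuffle phase is simulated with a dictionary, and Euclidean distance is assumed.

```python
import math
from collections import defaultdict


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance, assumed)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans_map(point, centers):
    """Map step: emit (cluster index, (point, 1)) for one input point."""
    return nearest(point, centers), (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that were assigned to one cluster."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for vec, n in values:
        for d in range(dim):
            total[d] += vec[d]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centers):
    """One simulated MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)  # stands in for the shuffle phase
    # A center that attracted no points keeps its old position.
    return [kmeans_reduce(groups[i]) if i in groups else tuple(centers[i])
            for i in range(len(centers))]


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat jobs until no center moves by more than tol."""
    for _ in range(max_iter):
        new_centers = kmeans_iteration(points, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

Each call to `kmeans_iteration` corresponds to one MapReduce job, so on a real cluster every extra iteration means another full pass over the data. This is why the abstract treats the iteration count, and therefore the choice of initial centers, as central to efficiency.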
The main work of this thesis is as follows. (1) Based on an analysis of the traditional K-means algorithm and the max-min distance idea, a parallel K-means algorithm based on max-min distance is proposed. Cluster centers are selected according to the max-min distance criterion and used as the initial centers of K-means, which avoids initial centers lying too close to one another, as can happen with random selection, and thereby improves the quality of the clustering results. To improve efficiency, a parallel version of the algorithm is designed and implemented. (2) The principle, advantages and disadvantages of the one-pass clustering algorithm are analyzed and, combined with the characteristics of traditional K-means, the parallel OPKMEANS algorithm is proposed. Exploiting the simplicity and efficiency of one-pass clustering, the algorithm first performs a fast, coarse clustering of the data set and then uses the resulting centers as the initial centers of K-means, avoiding the blindness of random center selection. This reduces the number of K-means iterations, which in turn lowers the data transmission overhead of the parallel process and improves efficiency. (3) To verify the effectiveness of the improved algorithms, the thesis studies the principles of Hadoop, builds a Hadoop distributed computing platform on virtual machines, and carries out a series of experiments that verify the advantages of the above algorithms in terms of clustering quality, speedup and scalability.
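The two seeding strategies described in items (1) and (2) can be sketched in a few lines of serial Python. This is an illustration under stated assumptions, not the thesis's algorithms: `max_min_centers` assumes a fixed number of seeds `k` and takes the first point as the first seed, `one_pass_centers` assumes a distance threshold called `radius`, and both names and parameters are hypothetical. Euclidean distance is assumed throughout.

```python
import math


def max_min_centers(points, k):
    """Max-min distance seeding: repeatedly pick the point that is
    farthest from its nearest already-chosen seed."""
    centers = [points[0]]  # assumption: the first point is the first seed
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[idx])
        # Refresh each point's distance to its nearest seed.
        nearest = [min(d, math.dist(p, points[idx]))
                   for p, d in zip(points, nearest)]
    return centers


def one_pass_centers(points, radius):
    """One-pass clustering: scan the data once; a point joins the nearest
    cluster if its centroid is within radius, otherwise it starts a new one."""
    sums, counts = [], []  # per-cluster coordinate sums and sizes
    for p in points:
        best, best_d = None, None
        for i, (s, n) in enumerate(zip(sums, counts)):
            d = math.dist(p, [x / n for x in s])
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= radius:
            sums[best] = [a + b for a, b in zip(sums[best], p)]
            counts[best] += 1
        else:
            sums.append(list(p))
            counts.append(1)
    return [tuple(x / n for x in s) for s, n in zip(sums, counts)]
```

Either function returns a list of centers that can be passed to a K-means routine in place of randomly drawn ones. Note that in this sketch the one-pass variant determines the number of clusters from `radius` rather than from a preset `k`, and its result depends on the order in which points are scanned.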
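[Degree-granting institution]: Jiangxi University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13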
Article ID: 2144761
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2144761.html