A Density-Based Distributed Clustering Method

Published: 2018-12-12 08:55

[Abstract]: Clustering is an important data analysis method in data mining. It partitions unlabeled data into clusters according to the similarity between data points. CSDP is a density-based clustering algorithm whose efficiency is relatively low when the data volume is large or the dimensionality is high. To improve efficiency, a density-based distributed clustering method, MRCSDP, is proposed, which clusters the experimental data using the MapReduce framework. The method defines the concepts of independent computing unit and independent computing block. First, the data are split into several data blocks, independent computing units and independent computing blocks are constructed, and the tasks for the independent computing blocks are assigned across the cluster. Next, distributed computation yields the local density of each data block; the local densities are merged into a global density, the center value is computed from the global density, and the candidate cluster centers in each data block are obtained from the global density and the center value. Finally, the final cluster centers are elected from the candidate cluster centers. MRCSDP achieves good clustering results while substantially reducing time complexity. Experimental results show that, in a distributed environment, MRCSDP processes large-scale data faster and more effectively than CSDP and balances the load across nodes.
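The pipeline described in the abstract (split into blocks, compute local densities, merge into a global density, compute center values, elect centers) can be illustrated with a small single-process sketch. It is not the authors' implementation and does not use Hadoop. The abstract does not define the density or the center value, so the sketch assumes a cutoff-count density and a center value equal to the distance to the nearest point of higher density, in the style of density-peak clustering. Treating one pair of blocks as the unit of independent work is likewise an assumption, and all function names and parameters (`split_blocks`, `map_block_pair`, `cutoff`, `k`, and so on) are hypothetical.

```python
from itertools import combinations_with_replacement
from math import dist  # Python 3.8+


def split_blocks(points, n_blocks):
    """Split the points into blocks, keeping each point's global index."""
    indexed = list(enumerate(points))
    return [indexed[i::n_blocks] for i in range(n_blocks)]


def map_block_pair(block_a, block_b, cutoff):
    """Map step (assumed): partial neighbor counts for one pair of blocks."""
    partial = {}
    same = block_a is block_b
    for i, p in block_a:
        for j, q in block_b:
            # Inside a single block, visit each unordered pair only once.
            if i == j or (same and j < i):
                continue
            if dist(p, q) < cutoff:
                partial[i] = partial.get(i, 0) + 1
                partial[j] = partial.get(j, 0) + 1
    return partial


def reduce_density(partials, n_points):
    """Reduce step: merge local (partial) densities into the global density."""
    density = [0] * n_points
    for partial in partials:
        for idx, count in partial.items():
            density[idx] += count
    return density


def center_values(points, density):
    """Assumed center value: distance to the nearest higher-density point."""
    order = sorted(range(len(points)), key=lambda i: (-density[i], i))
    delta = [0.0] * len(points)
    for rank, i in enumerate(order[1:], start=1):
        delta[i] = min(dist(points[i], points[j]) for j in order[:rank])
    top = order[0]  # the densest point gets the largest distance instead
    delta[top] = max(dist(points[top], p) for p in points)
    return delta


def elect_centers(blocks, density, delta, k):
    """Pick k candidates per block, then elect the k best overall."""
    def score(i):
        return density[i] * delta[i]

    candidates = []
    for block in blocks:
        ranked = sorted((i for i, _ in block), key=score, reverse=True)
        candidates.extend(ranked[:k])
    return sorted(sorted(candidates, key=score, reverse=True)[:k])


def mrcsdp_sketch(points, n_blocks, cutoff, k):
    """Run the whole sketch; returns (global density, center indices)."""
    blocks = split_blocks(points, n_blocks)
    partials = [map_block_pair(a, b, cutoff)
                for a, b in combinations_with_replacement(blocks, 2)]
    density = reduce_density(partials, len(points))
    delta = center_values(points, density)
    return density, elect_centers(blocks, density, delta, k)
```

Each call to `map_block_pair` depends only on its two input blocks, so in a real MapReduce job these calls could run on different nodes. Only the merge in `reduce_density` needs all partial results.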
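[Author affiliation]: College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
[Classification number]: TP311.13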

[Similar literature]