一类基于密度的聚类算法研究

发布时间：2018-05-22 15:26

本文选题：聚类算法 + 聚类有效性指标　；参考：《山东师范大学》2017年硕士论文

【摘要】：聚类分析中,基于密度的聚类算法占有非常重要的地位,在信息的过滤、检索、医疗卫生和公共服务等各个领域都得到广泛地应用,是聚类分析的重点研究内容。本文对层次聚类算法的特征和密度聚类算法的特征进行研究,提出了基于层次的密度聚类算法,结果表明新算法聚类的准确率和聚类的效率均得到提高。根据Alex Rodriguez和Alessandro Laio提出的一种新的密度聚类算法CFSFDP(Clustering by Fast Search and Find of Density Peaks),提出了Map Reduce框架下该算法的并行化模型。和其他密度聚类算法一样,该算法在并行条件下能对复杂形状的聚类进行处理,并且数据中类的数量也不需要提前指定,同时,CFSFDP算法需要用户指定的参数较少。和需要迭代的聚类算法相比,该算法的运行时间得到很大程度地降低。本文主要的研究工作包括:(1)针对传统的聚类算法需要反复地对数据集聚类,且计算效率在大规模数据集上欠佳的问题,提出了一种改进算法,即基于层次聚类确定最佳聚类数和初始聚类中心的CODHD算法。该算法研究计算过程,对数据集不需要反复进行聚类。首先,通过对数据集进行扫描,进而获得聚类特征的所有的统计值;其次,采用自下而上的方法生成层次不相同的数据划分,对每个划分的数据点的密度进行计算,将密度最大的点定为中心点,计算中心点距离更高密度点的最小距离,将最小距离与中心点的密度作乘积,取乘积之和的平均值作为有效性指标,根据聚类结果,增量地构建一条属于不同层次的曲线;最后,曲线极值点处对应的划分,用来估计初始的聚类中心和最佳的聚类数。实验结果表明,相比较COPS算法,本文提出的CODHD算法,聚类准确率和效率均得到提高。(2)传统的CFSFDP算法能够很好地识别空间中任意形状和任意维度的聚类,但是当处理大规模数据集时,两点之间距离的计算耗费太长时间,为克服提到的缺点,本文提出了一种基于Map Reduce的CFSFDP算法,又称mr CFSFDP。mr CFSFDP只需要读取数据集一遍,因此运行时间很快,运行在多个节点的mr CFSFDP算法的每个阶段都划分为两步:Map阶段和Reduce阶段。在许多数据集上测试了这个算法,实验结果表明,此算法模型是可行的,并且在准确率和效率上都有很好的效果。本文数据集全部取自UCI真实数据集。根据经典的聚类模型,建立了两种新的聚类模型。文中与其他算法进行一些比较,证明了新提出算法在聚类方面具有更好的聚类效果。
[Abstract]:In clustering analysis, density-based clustering algorithm plays a very important role. It is widely used in the fields of information filtering, retrieval, medical and health, public services and so on. In this paper, the characteristics of hierarchical clustering algorithm and density clustering algorithm are studied, and a hierarchical density clustering algorithm is proposed. The results show that the accuracy and efficiency of the new algorithm are improved. Based on a new density clustering algorithm CFSFDP(Clustering by Fast Search and Find of Density Peaks proposed by Alex Rodriguez and Alessandro Laio, a parallelization model of the algorithm under Map Reduce framework is proposed. Like other density clustering algorithms, the algorithm can deal with the clustering of complex shapes under parallel conditions, and the number of classes in the data does not need to be specified in advance, and the CFSFDP algorithm requires fewer parameters specified by the user. Compared with the clustering algorithm which needs iteration, the running time of the algorithm is greatly reduced. The main research work of this paper includes: (1) aiming at the problem that the traditional clustering algorithms need to repeatedly cluster data and the computational efficiency is poor in large-scale data sets, an improved algorithm is proposed. That is, CODHD algorithm based on hierarchical clustering to determine the best clustering number and initial clustering center. The algorithm does not need to cluster data sets repeatedly. Firstly, all the statistical values of the clustering feature are obtained by scanning the data set. Secondly, the bottom-up method is used to generate the data partition with different levels, and the density of the data points in each partition is calculated. Taking the maximum density point as the center point and calculating the minimum distance between the center point and the higher density point, the minimum distance between the minimum distance and the density of the center point is multiplied, and the average value of the sum of the product is taken as the validity index, according to the clustering result, Finally, the corresponding partition at the extreme point of the curve is used to estimate the initial cluster center and the optimal clustering number. The experimental results show that compared with the COPS algorithm, the clustering accuracy and efficiency of the CODHD algorithm proposed in this paper are improved. (2) the traditional CFSFDP algorithm can recognize the clustering of arbitrary shape and dimension in space very well. However, when dealing with large-scale data sets, the calculation of the distance between two points takes too long. In order to overcome the mentioned shortcomings, this paper proposes a CFSFDP algorithm based on Map Reduce, which is also called mr CFSFDP.mr CFSFDP, which only needs to read the data set once. Therefore, the running time is very fast, and each phase of the Mr CFSFDP algorithm running on multiple nodes is divided into two steps: map stage and Reduce stage. The algorithm is tested on many data sets. The experimental results show that the algorithm model is feasible and has good accuracy and efficiency. All the data sets in this paper are taken from UCI real data sets. According to the classical clustering model, two new clustering models are established. Compared with other algorithms, it is proved that the new algorithm has better clustering effect.
【学位授予单位】：山东师范大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP311.13

【参考文献】