Research on a Density Peak Clustering Algorithm Based on the Normal Distribution

Published: 2018-09-11 07:48

[Abstract]: Clustering is an important class of machine learning algorithms that partitions a data set into groups according to similarity. Cluster analysis is widely used in machine learning, pattern recognition, bioinformatics, and image processing. In 2014, Alex Rodriguez et al. proposed in Science a new density-based algorithm, density peak clustering (clustering by fast search and find of density peaks, DPC). The algorithm uses two features of each data point, its density and its distance to a point of higher density, to find potential cluster centers. DPC is simple, produces the clustering result in a single step, and clusters well. However, it requires a person to inspect the decision graph and pick the potential cluster centers by hand, which reduces the algorithm's efficiency. To make clustering automatic, this thesis exploits how points are distributed on the decision graph: it proposes the product Z of density and distance as a new criterion for potential cluster centers, and uses a statistical method to select the centers. Because only potential cluster centers have both high density and large distance, their Z values are far larger than those of non-center points. Assuming Z is normally distributed, an upper bound can be determined statistically, and any point whose Z value exceeds that bound is automatically treated as a cluster center. Experimental results show that this normal-distribution-based statistical method correctly identifies the potential cluster centers, and that it selects them in much the same way a person would by inspecting the decision graph. Compared with other well-regarded clustering algorithms, the normal-distribution-based density peak clustering algorithm performs better on data sets of varying shapes and yields good clustering results.
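The selection rule described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not say how the density is computed or how the upper bound is derived, so the Gaussian-kernel density, the cutoff distance `d_c`, and the bound `mean(Z) + k * std(Z)` with `k = 3` are all assumptions made here for illustration, as is the function name `dpc_normal_centers`.

```python
import numpy as np


def dpc_normal_centers(X, d_c, k=3.0):
    """Sketch of automatic center selection; returns (rho, delta, z, centers).

    X   : array of shape (n_points, n_features)
    d_c : cutoff distance for the density kernel (assumed parameter)
    k   : number of standard deviations above the mean used as the
          upper bound on Z (assumed; the abstract gives no value)
    """
    X = np.asarray(X, dtype=float)

    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density rho: Gaussian kernel over the other points
    # (assumed variant; subtracting 1 removes each point's self-term).
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0

    # delta: distance to the nearest point of higher density.
    # The globally densest point gets its largest distance instead.
    n = X.shape[0]
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()

    # Z is the product of density and distance.
    z = rho * delta

    # Treat Z as normally distributed and flag values above the upper bound.
    upper = z.mean() + k * z.std()
    centers = np.flatnonzero(z > upper)
    return rho, delta, z, centers


# Usage on two synthetic, well-separated blobs (made-up data).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(100, 2))
points = np.vstack([blob_a, blob_b])

rho, delta, z, centers = dpc_normal_centers(points, d_c=0.5)
print("selected center indices:", centers)
```

Once centers are chosen, the original DPC procedure assigns each remaining point to the cluster of its nearest higher-density neighbor; that step is omitted here to keep the sketch focused on the selection rule.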
[Degree-granting institution]: Zhejiang University of Technology
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13
