基于密度聚类算法的研究与改进

发布时间：2018-05-12 08:12

本文选题：聚类 + 密度峰值　；参考：《内蒙古大学》2017年硕士论文

【摘要】：聚类分析,是一种在没有任何先验知识的情况下对待聚类数据根据数据间的相似性来进行分类的一种技术,在模式识别中被称为无监督分类,在统计学中被称为非参数估计。聚类分析被广泛地应用于众多学术领域,比如生物信息学、信息安全、文本聚类等。在过去发展的几十年,数以千计的聚类算法被不同学者提出,但是仍存在很大的研究空间,例如如何处理不同形状及密度的簇,对高维数据的合理计算,如何有效测定聚类结果当中簇的数量,噪声点的合理检测及如何定义及评判一个正确的簇等等。Alex Rodriguez与Alessandro Laio在2014年提出了一种新的启发式聚类算法 CFSFDP(Clustering by Fast Search and Find of Density Peaks)。该算法具有初始参数少、执行速度快、可有效探测目标簇数目及对噪声数据不敏感的特点,本文通过一系列实验证明了该算法的有效性,并且该算法提出者利用Olivetti人脸数据库中的图片聚类来证明该算法可以处理高维度数据。然而通过学习研究发现,该算法在遇到某些情况时表现不好。首先,该算法的初始簇中心的选取需要依靠人工选定且对处于密度稀疏区域的簇中心无法有效提取。其次,该算法认定数据集中的每个簇有且仅有一个局部密度值极点,这将导致拥有多密度极值点的簇及共享密度极值点的簇被错误划分。再者,该算法对噪声点的识别方法会致使较多的数据点被判定为噪声。基于这些发现,本文提出一种新的基于密度峰值的算法,改进算法通过改进的决策值计算方法来构建决策图,通过发现决策图拐点来自动识别簇中心。然后通过加入构建子簇的局部密度分布图的操作以及改进的层次聚类算法思想对错误划分的子簇进行分割和合并,最后通过新引入的数据点离群度计算公式来识别噪声。通过实验表明,该改进算法在多个数据集上的聚类效果优于原有的算法及其他基于密度的聚类算法。
[Abstract]:Clustering analysis is a technique to classify clustering data according to the similarity of data without any prior knowledge. It is called unsupervised classification in pattern recognition and nonparametric estimation in statistics. Clustering analysis is widely used in many academic fields, such as bioinformatics, information security, text clustering and so on. In the past decades, thousands of clustering algorithms have been proposed by different scholars, but there is still a lot of research space, such as how to deal with clusters with different shapes and densities, and how to calculate the high-dimensional data reasonably. In 2014, Alex Rodriguez and Alessandro Laio proposed a new heuristic clustering algorithm, CFSFDP(Clustering by Fast Search and Find of Density Peaks, how to effectively determine the number of clusters in clustering results, how to reasonably detect noise points and how to define and judge a correct cluster. The algorithm has the advantages of less initial parameters, fast execution speed, effective detection of the number of target clusters and insensitivity to noise data. The effectiveness of the algorithm is proved by a series of experiments in this paper. The proposed algorithm uses image clustering in Olivetti face database to prove that the algorithm can deal with high dimensional data. However, it is found that the algorithm does not perform well in some cases. Firstly, the selection of initial cluster centers depends on manual selection and can not be effectively extracted from clusters located in dense sparse regions. Secondly, the algorithm determines that each cluster in the dataset has only one local density extremum, which leads to the misdivision of clusters with multi-density extremum points and clusters with shared density extremum points. Furthermore, more data points are judged as noise by the method of noise recognition. Based on these findings, this paper proposes a new algorithm based on the peak density. The improved algorithm constructs the decision graph by the improved method of calculating the decision value, and automatically identifies the cluster center by finding the inflection point of the decision graph. Then the sub-clusters are segmented and merged by adding the operation of constructing the local density distribution map of the subclusters and the idea of improved hierarchical clustering algorithm. Finally, the noise is identified by the newly introduced formula for calculating the outliers of data points. The experimental results show that the improved algorithm is superior to the original algorithm and other density-based clustering algorithms in clustering performance on multiple datasets.
【学位授予单位】：内蒙古大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP311.13

【参考文献】