Research on an Arbitrary-Shape Clustering Algorithm Based on Hierarchy and Density

Published: 2018-01-24 16:03

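Keywords: hierarchical clustering; density clustering; arbitrary-shape clustering; subcluster merging; density peak points; boundary region density   Source: Henan Polytechnic University, 2016 master's thesis   Document type: degree thesis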

[Abstract]: Clustering is an important research direction in data mining; it helps people understand the distribution and characteristics of data in preparation for further analysis. Although many clustering algorithms exist, clustering still faces many problems and challenges. Combining hierarchical clustering and density-based clustering, this thesis proposes a new algorithm for clustering arbitrary shapes. Within the framework of hierarchical clustering, the algorithm uses ideas from density-based clustering to define subclusters and the subcluster merging method. The main contributions are as follows. (1) Current hierarchical clustering algorithms have high computational time complexity and require the user to supply the number of clusters or a threshold parameter as the termination condition. To address this, the thesis proposes a novel density-based subcluster merging method that merges adjacent subclusters whose inter-cluster boundary-region density is greater than or equal to the average density of either of the two clusters. This merging criterion uses a dynamic model, adapts automatically to the internal characteristics of the clusters being merged, and determines the number of clusters and the termination point automatically, so that clusters of arbitrary shape can be discovered. (2) Density-based clustering algorithms tend to overlook density peak points in sparse regions. To address this, the thesis selects points that lie far from other high-density points as density peak points, which relaxes the selection range for density peaks. The data set is then partitioned by these density peak points into a large number of initial subclusters, and the resulting subclusters are relatively accurate. (3) Density-based clustering methods use a single global distance parameter, which is ill-suited to data sets with large density differences. To address this, the thesis separates low-density data from high-density data into layers, filters out the low-density subclusters, and clusters them with a suitable distance parameter. Comparative experiments on test data sets and real data sets show that the proposed algorithm determines the number of clusters automatically, effectively discovers clusters of arbitrary shape and size, is robust to the choice of input parameters, and is suitable for data sets with non-uniform density distribution.
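The merging criterion in contribution (1) can be illustrated with a small Python sketch. The abstract does not define how density or the boundary region is computed, so everything concrete here is an assumption made for illustration: density is taken as the number of neighbors within a hypothetical `radius`, the boundary region is taken as the points of either subcluster that have a neighbor within `radius` in the other subcluster, and "either of the two clusters" is read as "at least one of them". The function names `local_density`, `should_merge`, and `merge_subclusters` are hypothetical and are not taken from the thesis.

```python
from itertools import combinations
from math import dist
from statistics import mean


def local_density(points, radius):
    """Assumed density: number of other points within `radius` of each point."""
    n = len(points)
    return [
        sum(1 for j in range(n) if j != i and dist(points[i], points[j]) <= radius)
        for i in range(n)
    ]


def should_merge(points, density, cluster_a, cluster_b, radius):
    """Return True if two subclusters (lists of point indices) should be merged.

    Assumed boundary region: points of one subcluster that have at least one
    point of the other subcluster within `radius`.
    """
    border = {
        i for i in cluster_a
        if any(dist(points[i], points[j]) <= radius for j in cluster_b)
    }
    border |= {
        j for j in cluster_b
        if any(dist(points[i], points[j]) <= radius for i in cluster_a)
    }
    if not border:
        return False  # the subclusters are not adjacent
    border_density = mean(density[i] for i in border)
    avg_a = mean(density[i] for i in cluster_a)
    avg_b = mean(density[i] for i in cluster_b)
    # Merge when the boundary is at least as dense as one of the two subclusters.
    return border_density >= min(avg_a, avg_b)


def merge_subclusters(points, subclusters, radius):
    """Repeatedly merge qualifying pairs until no pair satisfies the criterion."""
    density = local_density(points, radius)
    clusters = [list(c) for c in subclusters]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            if should_merge(points, density, clusters[a], clusters[b], radius):
                clusters[a] = clusters[a] + clusters[b]
                del clusters[b]
                merged = True
                break  # cluster indices changed, so restart the scan
    return clusters


# Toy data: one evenly spaced chain split into two subclusters, plus a distant group.
points = [(float(x), 0.0) for x in (0, 1, 2, 3, 4, 5, 100, 101, 102)]
subclusters = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(merge_subclusters(points, subclusters, radius=1.5))
```

In this sketch no cluster count or stopping threshold is passed in: merging simply stops when no adjacent pair satisfies the density test, which mirrors the abstract's claim that the number of clusters and the termination point are determined automatically. The pairwise scan is quadratic and is meant only to make the criterion concrete, not to reflect the efficiency of the thesis's method.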
[Degree-granting institution]: Henan Polytechnic University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13
