Design of a Multidimensional Clustering Mining Algorithm Based on Grid-Density Partitioning

Published: 2018-03-19 10:28

Topic: clustering algorithms　Focus: grid　Source: Xi'an University of Finance and Economics (西安财经学院), 2014 master's thesis　Type: degree thesis

[Abstract]: Cluster analysis is an important component of data mining and one of its core analytical activities. The clustering algorithm is the heart of cluster analysis and determines the quality of the overall result. How to further improve clustering efficiency and reduce user cost and burden, while keeping the algorithm stable and effective, is currently a meaningful research direction. Because traditional clustering algorithms place high demands on computer hardware resources and take a long time to cluster massive data sets, this thesis proposes a new clustering algorithm based on grids and density. Grid-based clustering is generally fast and efficient, but its clustering quality is not very good; density-based clustering can identify clusters of arbitrary shape, but its time complexity is high on high-dimensional data. Because the two approaches are complementary, partitioning the sample space with a strategy that combines grids and density can greatly improve clustering efficiency.
The algorithm works as follows. First, a grid is created to give an initial partition of the data space. Second, the sample space is partitioned: according to the computed grid density threshold, the grid cells are divided into a high-density region and a low-density region; all cells in the high-density region are sorted by density, the densest cell is located, and the nearest surrounding low-density cells are used to find the first high-density cluster; the points of that cluster are then removed, the remaining high-density cells are sorted again, and the process repeats until the final partition of the space is formed. Finally, the centroid of each sub-cluster is computed, neighboring sub-clusters are merged by centroid to form new cluster centroids, and merging continues until the number of clusters equals the given number, which yields the final clustering result.
The thesis first describes the algorithm theoretically to verify that its design is sound and well founded. It then carries out an empirical analysis on several data sets randomly generated in Matlab, which verifies that the algorithm significantly improves running time while its between-group sum of squared deviations differs little from that of the classical K-means algorithm.
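The steps described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the thesis's Matlab implementation, and it fills several gaps with assumptions that the abstract does not confirm: the function name `grid_density_cluster` and its parameters are hypothetical; the default density threshold is taken to be the mean point count per non-empty cell; a high-density cluster is grown from the densest remaining cell through adjacent high-density cells, so that low-density cells act as its boundary; and points in low-density cells are attached to the nearest final centroid. Input is assumed to be a non-empty list of equal-length tuples.

```python
import math
from collections import defaultdict
from itertools import combinations, product


def centroid(pts):
    """Arithmetic mean of a non-empty list of equal-length point tuples."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def grid_density_cluster(points, cell_size, k, threshold=None):
    """Group points into at most k clusters (hypothetical sketch)."""
    # Step 1: initial grid partition; each point falls into one hypercube cell.
    cells = defaultdict(list)
    for p in points:
        cells[tuple(math.floor(x / cell_size) for x in p)].append(p)

    # Step 2: split cells into high- and low-density sets.
    # Assumption: default threshold = mean point count per non-empty cell.
    if threshold is None:
        threshold = len(points) / len(cells)
    dense = {c for c in cells if len(cells[c]) >= threshold}
    sparse_points = [p for c in cells if c not in dense for p in cells[c]]

    # Step 3: seed at the densest remaining cell and grow through adjacent
    # dense cells; low-density cells bound the cluster (an interpretation).
    offsets = [o for o in product((-1, 0, 1), repeat=len(points[0])) if any(o)]
    clusters = []
    while dense:
        seed = max(dense, key=lambda c: len(cells[c]))
        dense.remove(seed)
        stack, members = [seed], []
        while stack:
            cell = stack.pop()
            members.extend(cells[cell])
            for o in offsets:
                nb = tuple(a + b for a, b in zip(cell, o))
                if nb in dense:
                    dense.remove(nb)
                    stack.append(nb)
        clusters.append(members)

    # Step 4: merge the two sub-clusters with the closest centroids
    # until only k clusters remain.
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid

    # Assumption: points from low-density cells join the nearest centroid.
    if not clusters:
        return clusters
    centers = [centroid(c) for c in clusters]
    for p in sparse_points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters
```

A call such as `grid_density_cluster(points, cell_size=1.0, k=3)` returns a list of point lists. If the grid yields fewer than `k` high-density sub-clusters, the sketch returns fewer than `k` groups rather than splitting any of them. The neighbor search visits 3^d - 1 offsets per cell, which is fine for low dimensions but would need a different adjacency rule for genuinely high-dimensional data.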
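[Degree-granting institution]: Xi'an University of Finance and Economics (西安财经学院)
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: C81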
