Research on Subspace-Based Clustering Algorithms

Published: 2018-05-29 08:59

Topics: high-dimensional data + cluster analysis; Source: Jiangnan University, master's thesis, 2017

[Abstract]: With the rapid development of life sciences, mobile communications, e-commerce, social networks, and related fields, large volumes of high-dimensional data have emerged, and clustering such data effectively has become an active and difficult research topic. Traditional cluster analysis usually takes all attributes of the data objects into account. High-dimensional data, however, often contain many irrelevant or redundant attributes. These attributes make the distances between sample points nearly equal, so that clusters are very unlikely to exist in the full feature space. Subspace clustering methods address this problem by searching for clusters in different subspaces of the same data set. According to how attributes are weighted, existing algorithms fall into two groups: hard subspace clustering and soft subspace clustering. This thesis studies subspace clustering algorithms from both perspectives. The main work is as follows:
(1) The hard subspace clustering algorithm SUBCLU searches bottom-up for clusters in maximal interesting subspaces and repeatedly generates intermediate clusters along the way, which consumes a great deal of time. To address this, the thesis proposes an improved algorithm, BDFS-SUBCLU, which uses a depth-first search with backtracking to mine the clusters in maximal interesting subspaces. This strategy avoids generating intermediate clusters and reduces the time complexity of the algorithm. BDFS-SUBCLU also adds a constraint on core points within a subspace, which to some extent prevents adjacent clusters from being merged into one because of particular data points. Experiments on synthetic and real data sets show that BDFS-SUBCLU improves on SUBCLU in both efficiency and accuracy.
(2) Most soft subspace clustering algorithms built on the k-means framework are sensitive to the initial cluster centers, and poorly chosen initial centers cause them to fall into a local optimum prematurely. To address this, the thesis proposes an improvement on top of the original algorithms: feedback is used to check whether the algorithm has fallen into a local optimum. When it has, the current best is taken as the clustering result, and the feedback check is repeated until no better clustering result can be found. A comparison group is also added to raise the chance that the algorithm escapes the local optimum. Experiments on real UCI data sets show that the improved FSC and EWKM algorithms both achieve higher accuracy.
(3) The open-source Chinese word segmenter mmseg4j is used to segment Chinese text. The text is then converted, using the vector space model, into a numeric matrix that the algorithms can process. Finally, the soft subspace clustering algorithms proposed in this thesis are applied to cluster it.
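The abstract's idea of checking a k-means-style soft subspace algorithm against a local optimum can be illustrated with a small sketch. The code below is not the thesis's procedure. It assumes an entropy-weighted update in the style of EWKM (per-cluster attribute weights given by a softmax of negative within-cluster dispersion), and it reads "feedback verification" and "comparison group" as a loop in which a group of fresh random starts challenges the current best result until no challenger improves the objective. The names `ewkm_once`, `ewkm_with_feedback`, `gamma`, and `group_size` are hypothetical.

```python
import numpy as np


def ewkm_once(X, k, gamma, rng, n_iter=50):
    """One EWKM-style run from a random start (assumed update rules)."""
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    weights = np.full((k, d), 1.0 / d)   # one attribute-weight vector per cluster
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Weighted squared distance of every point to every center, shape (n, k).
        sq = (X[:, None, :] - centers[None, :, :]) ** 2
        dist = np.einsum("nkd,kd->nk", sq, weights)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                        # assignments stable: a local optimum
        labels = new_labels
        for l in range(k):
            members = X[labels == l]
            if len(members) == 0:
                continue                 # empty cluster: keep old center and weights
            centers[l] = members.mean(axis=0)
            disp = ((members - centers[l]) ** 2).sum(axis=0)
            # Softmax of -dispersion / gamma; shifting by the minimum avoids overflow.
            z = -(disp - disp.min()) / gamma
            weights[l] = np.exp(z) / np.exp(z).sum()
    # Objective: weighted within-cluster dispersion plus a negative-entropy penalty.
    cost = (weights[labels] * (X - centers[labels]) ** 2).sum()
    neg_entropy = (weights * np.log(weights + 1e-12)).sum()
    return cost + gamma * neg_entropy, labels, weights


def ewkm_with_feedback(X, k, gamma=1.0, group_size=3, max_rounds=20, seed=0):
    """Keep the best run; stop once a whole comparison group fails to beat it."""
    rng = np.random.default_rng(seed)
    best = ewkm_once(X, k, gamma, rng)
    history = [best[0]]
    for _ in range(max_rounds):
        # Comparison group (assumed reading): several fresh starts per round.
        challengers = [ewkm_once(X, k, gamma, rng) for _ in range(group_size)]
        top = min(challengers, key=lambda run: run[0])
        if top[0] >= best[0]:
            break                        # feedback found nothing better
        best = top
        history.append(best[0])
    return best, history


# Usage on synthetic data: two clusters, each tight in a different attribute.
demo_rng = np.random.default_rng(1)
A = demo_rng.normal(0.0, 1.0, (50, 6))
A[:, 0] = demo_rng.normal(4.0, 0.1, 50)
B = demo_rng.normal(0.0, 1.0, (50, 6))
B[:, 1] = demo_rng.normal(-4.0, 0.1, 50)
(objective, labels, weights), history = ewkm_with_feedback(np.vstack([A, B]), k=2)
print("objective per accepted round:", history)
print("attribute weights per cluster:\n", weights.round(3))
```

In this sketch each row of `weights` sums to 1, and a large weight marks an attribute along which that cluster is compact, which is the sense in which the clustering is "soft" over subspaces. The stopping rule trades extra runs for a lower chance of reporting a poor local optimum.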
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13
