
Research on Clustering Algorithms for High-Dimensional Data

Published: 2018-03-03 17:10

  Topic: subspace clustering  Focus: high-dimensional data  Source: Shenzhen University, 2017 master's thesis  Type: degree thesis


【Abstract】: In recent years, with the rapid development of Internet technology, both the scale and the dimensionality of data have grown dramatically, bringing with them the curse of dimensionality and the problem of density sparsity. High-dimensional data usually contain many redundant or irrelevant features and much noise, which poses a great challenge to clustering analysis. Research shows that the cluster structure of high-dimensional data usually lies in some subspace of the data rather than in the whole data space. To handle high-dimensional data, researchers at home and abroad have proposed many subspace clustering methods. Among them, soft subspace clustering is an important research topic: it assigns a weight to each feature of a sample and identifies the subspace structure of each cluster through the features with larger weights. However, a single feature in high-dimensional data is weak; it is hard to discover cluster structure through one weak feature, and methods that weight individual features also perform poorly on data with thousands of features. Many high-dimensional data sets are the integrated result of observations from different aspects, so the features can be grouped by aspect, and different feature groups have different importance in different clusters. Researchers have therefore proposed FG-k-means, which assigns weights to the feature groups of high-dimensional data: it divides the features into several groups and introduces two-level weights, on feature groups and on individual features, obtaining a large performance gain. FG-k-means, however, cannot group features automatically; the grouping must be supplied from human prior knowledge, yet for many high-dimensional data sets the grouping information is not known in advance. To address these problems, this thesis takes high-dimensional data as its research object.
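The soft subspace idea described above — one weight per feature, with larger weights on the features that define the cluster subspace — can be sketched as follows. This is a minimal W-k-means-style illustration written for this page, not the thesis's own algorithm; the function name, parameters, and the farthest-point initialisation are our assumptions.

```python
import numpy as np

def soft_subspace_kmeans(X, k, beta=2.0, n_iter=20, seed=0):
    """Feature-weighted ("soft subspace") k-means sketch.

    Each feature j carries a weight w[j]; features whose within-cluster
    dispersion D[j] is small receive larger weights, so the clusters are
    effectively found in a weighted (soft) subspace of the data.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # farthest-point initialisation: spreads the initial centers out
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    w = np.full(d, 1.0 / d)  # feature weights, kept summing to 1
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assign each point to the nearest center under weighted distance
        dist = (((X[:, None, :] - centers[None]) ** 2) * w**beta).sum(-1)
        labels = dist.argmin(1)
        # recompute centers
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
        # per-feature within-cluster dispersion -> new weights
        D = ((X - centers[labels]) ** 2).sum(0) + 1e-12
        w = (1.0 / D) ** (1.0 / (beta - 1))
        w /= w.sum()
    return labels, w
```

On data whose clusters live in a few informative features, the learned weights concentrate on those features — which also illustrates the limitation noted above: with thousands of individually weak features, per-feature weights alone spread too thin, motivating group-level weighting.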
The main work includes the following two parts.

(1) A latent feature group learning model (LFGL) for subspace clustering is proposed. Previous methods cannot group features automatically during clustering and must rely on human prior knowledge, but for many high-dimensional data sets the grouping information is unknown. To address this, LFGL first constructs a feature grouping model (FGM), then embeds it into a subspace clustering algorithm to form an optimization problem, and finally solves that problem with optimization algorithms under the constraints of the FGM. Experiments on real data sets such as image and gene data show that, compared with previous clustering methods, LFGL not only groups features automatically but also achieves better clustering results.

(2) Dimensionality reduction and clustering based on a deep denoising sparse autoencoder (DDSAE) are proposed. High-dimensional data suffer from the curse of dimensionality and density sparsity: as the dimensionality increases, the performance of all kinds of clustering methods drops markedly, and ultra-high-dimensional data may even cause memory overflow on a single machine. This thesis exploits the nonlinear representation power of the autoencoder: an L2 penalty is introduced to prevent over-fitting, noise is added to the input to improve robustness, cross-entropy is used as the loss function, and several encoders are stacked to form the deep denoising sparse autoencoder. The DDSAE learns low-dimensional abstract features that capture the essence of the high-dimensional data, and the low-dimensional feature vectors are then clustered with the LFGL model of Chapter 3. Comparison with PCA and LLE shows that the method performs better in both dimensionality reduction and clustering of high-dimensional data. Moreover, the clustering results of DDSAE are clearly better than those of LFGL alone, which further demonstrates the effectiveness of the method.
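The training ingredients listed for the DDSAE — input noise, an L2 weight penalty, and a cross-entropy reconstruction loss — can be illustrated with a single denoising layer. The NumPy sketch below is our own simplified illustration under the stated assumptions (tied weights, Gaussian corruption, inputs in [0, 1]); it is not the thesis's implementation, and all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_denoising_layer(X, n_hidden, lr=0.5, l2=1e-4,
                          noise=0.1, epochs=300, seed=0):
    """One denoising autoencoder layer (tied weights, NumPy sketch).

    - Gaussian noise is added to the input, so the layer must denoise.
    - An L2 penalty on W discourages over-fitting.
    - The sigmoid output is scored with cross-entropy, which assumes
      X lies in [0, 1].
    Feeding each trained layer's hidden code into the next such layer
    stacks them into a deep denoising autoencoder.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 0.1, (d, n_hidden))  # tied: decoder uses W.T
    b_h = np.zeros(n_hidden)
    b_o = np.zeros(d)
    for _ in range(epochs):
        X_noisy = X + rng.normal(0.0, noise, X.shape)
        H = sigmoid(X_noisy @ W + b_h)     # encode the corrupted input
        X_hat = sigmoid(H @ W.T + b_o)     # reconstruct the clean input
        # for sigmoid + cross-entropy, the output error is simply X_hat - X
        delta_o = (X_hat - X) / n
        delta_h = (delta_o @ W) * H * (1.0 - H)
        grad_W = X_noisy.T @ delta_h + delta_o.T @ H + l2 * W
        W -= lr * grad_W
        b_h -= lr * delta_h.sum(0)
        b_o -= lr * delta_o.sum(0)
    return W, b_h, b_o

def encode(X, W, b_h):
    return sigmoid(X @ W + b_h)
```

After training, `encode` maps the high-dimensional input to its low-dimensional hidden code, which is the representation that would then be handed to a clustering algorithm such as LFGL.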
【Degree-granting institution】: Shenzhen University
【Degree level】: Master
【Year conferred】: 2017
【CLC number】: TP311.13


Link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1561927.html

