Research and Application of Subspace Clustering Algorithms
Published: 2018-09-12 13:13
【Abstract】: Cluster analysis is one of the important research tools in data mining; it can be summarized as the process of partitioning all the data in a data set into different classes or clusters according to similarity. Owing to its unsupervised nature, clustering has been widely applied in many fields, including e-commerce, bioinformatics, Web log analysis, and financial transactions. However, because of the "curse of dimensionality", the accuracy and stability of clustering results degrade sharply when traditional clustering algorithms are applied to high-dimensional data. In recent years, how to cluster high-dimensional data has become one of the difficult and active topics in artificial intelligence research. The concept of subspace clustering emerged in response: its basic idea is to partition the feature space of the original data into different feature subsets, evaluate the significance of each candidate grouping from the perspective of different subspaces according to certain rules, and at the same time find the corresponding feature subspace for each group of data. Subspace clustering algorithms have achieved fairly satisfactory results on high-dimensional data sets. Addressing shortcomings of the two existing families of subspace clustering algorithms, those based on membership representation and those based on self-representation models, this thesis analyzes the existing algorithms and proposes improved subspace clustering algorithms. The main contributions are as follows:

(1) Existing soft subspace clustering algorithms choose cluster centers by randomly selecting sample points from the data set, which tends to become trapped in local optima. Within the soft subspace clustering framework, we propose a new method that optimizes the soft subspace clustering objective function by combining the quantum-behaved particle swarm optimization (QPSO) algorithm with gradient descent: QPSO's global search capability is used to solve for the cluster centers in each subspace, while the fast convergence of gradient descent is used to solve for the fuzzy feature weights and memberships of the sample points. Experiments on UCI data sets show that the improved algorithm increases both clustering accuracy and the stability of the clustering results.

(2) The objective function of traditional soft subspace clustering is based on Euclidean distance; when the dimensionality of the sample points is very high, the "curse of dimensionality" arises and the Euclidean distance metric fails. To address this, we incorporate the idea of correntropy into the objective function. Under the new objective, newly derived iterative update formulas solve for the fuzzy weights and the membership matrix, while QPSO solves for the cluster centers. On UCI data sets, we report experimental analyses in terms of the Rand index, normalized mutual information, and statistical significance tests; the results show that our algorithm achieves better clustering performance.

(3) Traditional self-representation-based subspace clustering algorithms comprise two main approaches, sparse representation and low-rank representation. These algorithms incorporate the errors and noise in the sample points into the objective function, solve for the coefficient representation matrix via the alternating direction method of multipliers (ADMM), and then construct a similarity matrix from the coefficient matrix. However, these methods require the error structure of the sample points as prior knowledge, and their iterative optimization is time-consuming. To address these two problems, we apply an affine constraint on the coefficients together with the Lagrange multiplier method to subspace clustering, take a ridge regression formulation as the objective function, and solve for the coefficient matrix in closed form. Experiments on popular face databases show that our algorithm improves clustering accuracy and stability while reducing computational complexity.
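Contribution (1) alternates QPSO's global search over the cluster centers with updates of the fuzzy feature weights and memberships. The sketch below illustrates that alternation under stated assumptions: the objective, the fuzzifier exponents `m` and `tau`, and all QPSO parameters are illustrative, and the weights/memberships are computed with standard closed-form soft-subspace updates rather than the thesis's explicit gradient-descent steps.

```python
import numpy as np

def soft_subspace_objective(X, centers, m=2.0, tau=2.0):
    # Evaluate a generic soft subspace clustering objective for given
    # centers. Fuzzy memberships U (k x n) and per-cluster feature
    # weights W (k x d) are updated in closed form (standard FCM-style
    # updates; exponents m and tau are assumed fuzzifiers).
    D = (X[None, :, :] - centers[:, None, :]) ** 2        # (k, n, d)
    S = D.sum(axis=1) + 1e-12                             # (k, d) scatter
    W = (1.0 / S) ** (1.0 / (tau - 1.0))                  # small scatter -> big weight
    W /= W.sum(axis=1, keepdims=True)
    dist = np.einsum('kd,knd->kn', W ** tau, D) + 1e-12   # weighted distances
    U = (1.0 / dist) ** (1.0 / (m - 1.0))
    U /= U.sum(axis=0, keepdims=True)
    return float(np.sum((U ** m) * dist))

def qpso(fitness, dim, n_particles=20, iters=100, bounds=(0.0, 1.0), seed=0):
    # Quantum-behaved PSO: new positions are sampled around local
    # attractors using the mean personal-best position and a linearly
    # decreasing contraction-expansion coefficient beta.
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_particles, dim))
    pbest = X.copy()
    pval = np.array([fitness(x) for x in X])
    g = pbest[pval.argmin()].copy()
    for t in range(iters):
        beta = 1.0 - 0.5 * t / iters
        mbest = pbest.mean(axis=0)
        phi = rng.uniform(size=(n_particles, dim))
        p = phi * pbest + (1 - phi) * g                   # local attractors
        u = rng.uniform(1e-12, 1.0, size=(n_particles, dim))
        sign = np.where(rng.uniform(size=(n_particles, dim)) < 0.5, 1.0, -1.0)
        X = np.clip(p + sign * beta * np.abs(mbest - X) * np.log(1.0 / u), lo, hi)
        for i in range(n_particles):
            v = fitness(X[i])
            if v < pval[i]:
                pval[i], pbest[i] = v, X[i].copy()
        g = pbest[pval.argmin()].copy()
    return g, float(pval.min())
```

Here QPSO optimizes the centers (flattened into one vector) while the inner closed-form updates play the role of the fast local solver for weights and memberships.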
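The intuition behind contribution (2) can be seen in a toy comparison: a Gaussian-kernel correntropy-induced loss saturates for large residuals, so each corrupted coordinate contributes at most 1, while squared Euclidean distance grows without bound. This is a minimal illustration, not the thesis's objective; `sigma` is an assumed kernel bandwidth.

```python
import numpy as np

def sq_euclidean(x, c):
    # Squared Euclidean distance: unbounded in every coordinate.
    return float(np.sum((x - c) ** 2))

def correntropy_loss(x, c, sigma=1.0):
    # Correntropy-induced metric with a per-feature Gaussian kernel.
    # Each coordinate contributes at most 1, so a single wildly
    # corrupted feature cannot dominate the objective.
    return float(np.sum(1.0 - np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))))

x_clean = np.array([1.0, 1.0, 1.0])
x_noisy = np.array([1.0, 1.0, 100.0])   # one corrupted coordinate
c = np.zeros(3)
# Squared Euclidean distance explodes on x_noisy, while the
# correntropy loss stays bounded by the dimension (here, 3).
```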
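Contribution (3) replaces the iteratively solved sparse/low-rank self-representation objectives with a ridge regression one, whose coefficient matrix has a closed-form solution. A minimal sketch of that idea follows; note the thesis additionally enforces an affine constraint on the coefficients via Lagrange multipliers, which this sketch omits, and `lam` is an assumed regularization weight.

```python
import numpy as np

def ridge_self_representation(X, lam=0.1):
    # Express each sample (a column of X) as a linear combination of
    # all samples:  min_C ||X - X C||_F^2 + lam ||C||_F^2.
    # Unlike sparse/low-rank formulations solved iteratively by ADMM,
    # this objective has the closed-form solution
    #   C = (X^T X + lam I)^{-1} X^T X.
    n = X.shape[1]
    G = X.T @ X
    C = np.linalg.solve(G + lam * np.eye(n), G)
    np.fill_diagonal(C, 0.0)   # common heuristic: suppress self-loops
    return C

def affinity_from_coefficients(C):
    # Symmetrized similarity matrix built from the coefficient matrix;
    # spectral clustering is then typically applied to this affinity.
    return 0.5 * (np.abs(C) + np.abs(C).T)
```

On data drawn from independent subspaces, the closed-form coefficients connect samples mainly within their own subspace, which is what makes the affinity matrix useful for the final spectral clustering step.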
【Degree-granting institution】: Jiangnan University
【Degree level】: Master's
【Year conferred】: 2016
【CLC number】: TP311.13
Article ID: 2239111
Link to this article: https://www.wllwen.com/jingjilunwen/dianzishangwulunwen/2239111.html