Research on Self-Representation Attribute Selection
Published: 2018-03-14 23:24
Topic: data mining. Focus: graph learning. Source: Guangxi Normal University, 2017 master's thesis. Document type: degree thesis.
【Abstract】: High-dimensional data usually contain noise and redundancy. In particular, high attribute dimensionality not only increases storage requirements; once the dimensionality exceeds a critical value, the performance of a given data mining algorithm actually declines, the so-called "curse of dimensionality". On the other hand, class labels are difficult to obtain in practical applications because of limited resources and other factors. Unsupervised attribute reduction, which addresses these problems by reducing the dimensionality of unlabeled data, is therefore of great significance in data mining. Existing attribute reduction methods can be divided into subspace learning and attribute selection. Subspace learning is more efficient than attribute selection, but the results of attribute selection are more interpretable. Combining the ideas of subspace learning and attribute selection, this thesis proposes two unsupervised attribute selection methods that select meaningful attributes from high-dimensional input data (that is, remove redundant and noisy attributes), so that the low-dimensional output both improves learning performance and remains interpretable. The specific contents and contributions are:

(1) Building on the successful use of sample self-representation, this thesis exploits the self-representation ability of attributes and proposes a simple yet effective unsupervised attribute selection framework: a robust self-representation attribute selection algorithm based on sparse learning (the SRFS algorithm). Specifically, SRFS first adopts a loss function containing attribute self-representation, expressing each attribute as a linear combination of the other attributes to obtain a self-representation coefficient matrix; it then applies sparse learning theory (using the l2,1-norm of the coefficient matrix as the sparse regularization term) to obtain a sparse coefficient matrix. When the resulting objective function is optimized, the sparse regularization term makes the self-representation coefficients of important attributes larger than those of redundant or irrelevant attributes, which distinguishes attribute importance and thereby achieves attribute selection. Through attribute self-representation, every attribute is well represented by the full set of attributes, while unimportant attributes and noisy, redundant attributes receive very small or zero weights. In experiments on real data, a support vector machine (SVM) classifier was used as the evaluation method for attribute selection, applied to data processed by SRFS and by other attribute reduction algorithms; the results show that SRFS outperforms the competing algorithms.

(2) Traditional attribute selection methods usually do not consider relations among attributes, such as the local or global structure of the data, and noise or outliers increase the rank of the data matrix. Based on these facts, this thesis combines low-rank constraints, manifold learning, hypergraph theory, and attribute self-representation in a single framework for unsupervised attribute selection, proposing the hypergraph-based attribute self-representation unsupervised low-rank attribute selection algorithm (the SHLFS algorithm). Specifically, SHLFS first extends the attribute self-representation theory above, representing each attribute by the other attributes, and then embeds a low-rank constraint term to remove the influence of noise and outliers. In addition, since a hypergraph can capture more complex relations than an ordinary graph, SHLFS uses a hypergraph regularization term to account for the high-order relations and local structure of the data, and uses l2,1-norm regularization to achieve sparsity of the coefficient matrix. The thesis further proves that the low-rank constraint gives SHLFS a subspace-learning effect. In the end, SHLFS considers both the global data structure (through the low-rank constraint) and the local data structure (through hypergraph regularization), and performs subspace learning while selecting attributes, so the resulting attribute selection model is both interpretable and high-performing. Because it uses stronger constraints than the previous method and considers the relations among the data, SHLFS is more robust than the earlier model. In the experiments, two evaluation methods, SVM classification and k-means clustering, were applied on multi-class and binary datasets; verified by multiple evaluation metrics, SHLFS performs better than the competing attribute reduction methods.

In summary, this thesis designs new attribute selection methods for the characteristics of high-dimensional data. Specifically, it innovatively uses attribute self-representation to achieve unsupervised attribute selection, uses a hypergraph model and low-rank constraints to represent high-order relations among the data, and combines sparse learning theory to assign each attribute a different weight that indicates its importance. To ensure the effectiveness of the designed methods, the experiments were conducted on several public datasets, comparing against both recently popular algorithms and classic algorithms in the field, with classification and clustering as evaluation methods and multiple metrics including classification accuracy (ACC) and normalized mutual information (NMI). The results show that the proposed methods achieve the best performance in all cases. Future work will explore semi-supervised learning and deep learning frameworks to design new attribute selection methods.
[Abstract]: High-dimensional data usually contain noise and redundancy. In particular, high attribute dimensionality not only increases storage requirements; once the dimensionality exceeds a critical value, the performance of a given data mining algorithm actually declines, the so-called "curse of dimensionality". On the other hand, class labels are difficult to obtain in practical applications because of limited resources and other factors, so unsupervised attribute reduction, which reduces the dimensionality of unlabeled data to address these problems, is of great significance in data mining. Existing attribute reduction methods can be divided into subspace learning and attribute selection. Subspace learning is more efficient than attribute selection, but the results obtained by attribute selection are more interpretable. Combining the ideas of subspace learning and attribute selection, this thesis proposes two unsupervised attribute selection methods that select meaningful attributes from high-dimensional input data (that is, remove redundant and noisy attributes), so that the low-dimensional output both improves learning performance and remains interpretable. The specific contents and contributions are: (1) Building on the successful use of sample self-representation, this thesis exploits the self-representation ability of attributes and proposes a simple yet effective unsupervised attribute selection framework: a robust self-representation attribute selection algorithm based on sparse learning (the SRFS algorithm). Specifically, SRFS first adopts a loss function containing attribute self-representation, expressing each attribute as a linear combination of the other attributes to obtain a self-representation coefficient matrix; it then applies sparse learning theory (using the l2,1-norm of the coefficient matrix as the sparse regularization term) to obtain a sparse coefficient matrix.
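The SRFS formulation just described can be sketched in a few lines of numpy. This is a minimal illustrative implementation of the generic self-representation idea, not the thesis's code: the regularization weight, iteration count, and the plain iteratively reweighted least-squares solver are assumptions. Here ||W||_{2,1} is the sum of the Euclidean norms of the rows of W, so a whole row (attribute) can be driven toward zero.

```python
import numpy as np

def srfs_scores(X, lam=1.0, n_iter=50, eps=1e-8):
    """Rank attributes via self-representation:
    min_W ||X - X W||_F^2 + lam * ||W||_{2,1},
    solved by iteratively reweighted least squares (IRLS)."""
    d = X.shape[1]
    XtX = X.T @ X
    # Initialize with a ridge-regularized solve.
    W = np.linalg.solve(XtX + lam * np.eye(d), XtX)
    for _ in range(n_iter):
        # IRLS reweighting for the l2,1-norm: D = diag(1 / (2 ||w_i||_2)).
        row_norms = np.sqrt((W ** 2).sum(axis=1)) + eps
        D = np.diag(1.0 / (2.0 * row_norms))
        # Closed-form update: W = (X'X + lam D)^{-1} X'X.
        W = np.linalg.solve(XtX + lam * D, XtX)
    # Importance score of attribute i = l2-norm of row i of W.
    return np.sqrt((W ** 2).sum(axis=1))

rng = np.random.RandomState(0)
X = rng.randn(100, 20)            # toy data: 100 samples, 20 attributes
scores = srfs_scores(X)
top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 highest-scoring attributes
```

Attributes whose rows of W have large norms are kept; rows pushed near zero by the l2,1 penalty mark redundant or noisy attributes, matching the weighting behavior described above.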
When the resulting objective function is optimized, the sparse regularization term makes the self-representation coefficients of important attributes larger than those of redundant or irrelevant attributes, which distinguishes attribute importance and thereby achieves attribute selection. Through attribute self-representation, SRFS lets every attribute be well represented by the full set of attributes, while unimportant attributes and noisy, redundant attributes receive very small or zero weights during self-representation. In experiments on real data, a support vector machine (SVM) was used as the evaluation method, classifying data processed by SRFS and by other attribute reduction algorithms; the results show that SRFS outperforms the competing algorithms. (2) Traditional attribute selection methods usually do not consider relations among attributes, such as the local or global structure of the data, and noise or outliers increase the rank of the data matrix. Based on these facts, this thesis combines low-rank constraints, manifold learning, hypergraph theory, and attribute self-representation in a single framework for unsupervised attribute selection, proposing the hypergraph-based attribute self-representation unsupervised low-rank attribute selection algorithm (the SHLFS algorithm). Specifically, SHLFS first extends the attribute self-representation theory above, representing each attribute by the other attributes, and then embeds a low-rank constraint term to remove the influence of noise and outliers. In addition, since a hypergraph can capture more complex relations than an ordinary graph, SHLFS uses a hypergraph regularization term to account for the high-order relations and local structure of the data, and uses l2,1-norm regularization to achieve sparsity of the coefficient matrix.
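The abstract names SHLFS's ingredients (self-representation loss, l2,1 sparsity, hypergraph regularization, low-rank constraint) but not its objective function. One plausible way these pieces combine, written here as an assumption in standard notation rather than a quotation from the thesis, is:

```latex
\min_{W}\; \|X - XW\|_F^2
  + \lambda_1 \|W\|_{2,1}
  + \lambda_2 \operatorname{tr}\!\left(W^{\top} X^{\top} L_H X W\right)
  \quad \text{s.t.}\; \operatorname{rank}(W) \le r
```

where X is the n-by-d data matrix, W the d-by-d self-representation coefficient matrix, and L_H the normalized hypergraph Laplacian, commonly defined as L_H = I - D_v^{-1/2} H W_e D_e^{-1} H^T D_v^{-1/2} with incidence matrix H, hyperedge weight matrix W_e, and vertex and hyperedge degree matrices D_v and D_e. The rank constraint is typically enforced by factorizing W = AB with inner dimension r, and a factorization of exactly this kind is what produces the subspace-learning effect the thesis proves for the low-rank constraint.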
This thesis further proves that the low-rank constraint gives the SHLFS algorithm a subspace-learning effect. In the end, SHLFS considers both the global data structure (through the low-rank constraint) and the local data structure (through hypergraph regularization), and performs subspace learning while selecting attributes, which makes the resulting attribute selection model both interpretable and high-performing. Because it uses stronger constraints than the previous method and considers the relations among the data, SHLFS is more robust than the earlier model. In the experiments, two evaluation methods, SVM classification and k-means clustering, were applied on multi-class and binary datasets; verified by multiple evaluation metrics, SHLFS performs better than the competing attribute reduction methods. In summary, this thesis designs new attribute selection methods for the characteristics of high-dimensional data. Specifically, it innovatively uses attribute self-representation to achieve unsupervised attribute selection, uses a hypergraph model and low-rank constraints to represent high-order relations among the data, and combines sparse learning theory to assign each attribute a different weight that indicates its importance. To ensure the effectiveness of the designed methods, the experiments were conducted on several public datasets, comparing against both recently popular algorithms and classic algorithms in the field, with classification and clustering as evaluation methods and multiple metrics including classification accuracy (ACC) and normalized mutual information (NMI). The results show that the proposed methods achieve the best performance in all cases. Future work will explore semi-supervised learning and deep learning frameworks to design new attribute selection methods.
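The evaluation protocol described above (classify the selected attributes with an SVM for ACC, cluster them with k-means for NMI) can be sketched with scikit-learn. The dataset, the number of selected attributes, and the variance-based stand-in ranking below are illustrative assumptions; the thesis ranks attributes by its self-representation coefficient matrices instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import normalized_mutual_info_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Stand-in attribute ranking (per-attribute variance); a real run would
# use the scores produced by the attribute selection method under test.
scores = X.var(axis=0)
selected = np.argsort(scores)[::-1][:20]
Xs = X[:, selected]

# ACC: mean 5-fold cross-validated SVM accuracy on the selected attributes.
acc = cross_val_score(SVC(), Xs, y, cv=5).mean()

# NMI: agreement between k-means cluster assignments and the true labels.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Xs)
nmi = normalized_mutual_info_score(y, labels)
print(f"ACC={acc:.3f}  NMI={nmi:.3f}")
```

Running the same protocol on data reduced by each competing method, as the thesis does, makes the ACC and NMI numbers directly comparable across methods.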
Document ID: 1613423
Link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1613423.html