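Research on Dimensionality Reduction Algorithms Based on Preserving Reconstruction Information
Published: 2018-05-28 07:08
Topics: dimensionality reduction + feature extraction; Source: Shandong Normal University, master's thesis, 2017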
[Abstract]: With the continuing development of network and storage technology, more and more data sets are both large in volume and high in dimension. Such massive high-dimensional data carry richer information, but they also bring problems such as the curse of dimensionality and heavy computational cost, which pose new challenges for data analysis. How to describe high-dimensional data effectively and mine the meaningful information in it has therefore become a pressing problem. Dimensionality reduction is one of the effective ways to address this problem and is widely used in face recognition, bioinformatics, image retrieval, and other fields. As the techniques have matured, the requirements placed on dimensionality reduction algorithms have risen, because the quality of the algorithm directly affects the accuracy of the information extracted from the data and of the subsequent analysis. This thesis aims to improve the separability of high-dimensional data after dimensionality reduction. Taking the particular characteristics of the data sets into account, it proposes two dimensionality reduction algorithms that preserve the reconstruction information of the data, and it verifies and analyzes the accuracy and reliability of each method on different data sets. The main work and contributions are summarized as follows:
1. A neighborhood preserving embedding algorithm based on global distance and label information (Neighborhood Preserving Embedding Algorithm based on Global Distance and Label Information, GLI-NPE) is proposed. In the formula that neighborhood preserving embedding (NPE) uses to build its neighborhood graph from the conventional Euclidean distance, GLI-NPE adds a global factor that characterizes global distance and a function term that encodes the class labels of the data. The global factor smooths out unevenly distributed samples, which makes NPE more robust on such data. The label information makes samples of the same class compact and pushes samples of different classes apart. By improving the quality of the selected neighbors and thereby optimizing the local neighborhoods, the algorithm makes the reduced data more separable. Experimental results show that GLI-NPE effectively improves classification accuracy after dimensionality reduction.
2. For high-dimensional gene expression data, the goal is to reduce the dimension of the data while improving the separability of tumor data. After analyzing the respective limitations of sparse representation and neighbor representation, as well as the particular nature of classification in tumor data, the thesis proposes a feature extraction algorithm based on Discriminative Hybrid Structure Preserving Projections (DHSPP). DHSPP linearly combines the sparse representation and the neighbor representation into a hybrid representation, then uses the class labels to split it into a within-class hybrid representation and a between-class hybrid representation, and constructs its objective function on the principle of maximizing between-class distance while minimizing within-class distance. In addition, because tumor data are mostly imbalanced, a balance adjustment factor is added when computing the within-class distance to balance the majority and minority classes. Experimental results show that reducing the dimension of tumor expression data with DHSPP effectively improves the classification accuracy of the reduced tumor data.
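The neighbor-selection idea behind GLI-NPE can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything specific here is an assumption made for illustration: the global factor is taken to be the mean distance from each point to all other points, the label term is a simple multiplicative scale controlled by a hypothetical parameter `alpha`, and the function names are invented for this example. It is not the thesis's implementation.

```python
import math


def euclidean_distances(X):
    """Plain pairwise Euclidean distance matrix for a list of points."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]


def adjusted_distances(X, y, alpha=0.5):
    """Distances rescaled by an ASSUMED global factor and label term.

    Assumes at least two points and that no point coincides with all others
    (otherwise a mean distance would be zero).
    """
    n = len(X)
    d = euclidean_distances(X)
    # Assumed global factor: mean distance from each point to the others.
    mean_d = [sum(row) / (n - 1) for row in d]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Dividing by the geometric mean of the two global factors
            # evens out differences between dense and sparse regions.
            g = d[i][j] / math.sqrt(mean_d[i] * mean_d[j])
            # Assumed label term: shrink same-class distances and
            # stretch different-class distances.
            scale = (1 - alpha) if y[i] == y[j] else (1 + alpha)
            adj[i][j] = scale * g
    return adj


def k_nearest(D, k):
    """Indices of the k nearest neighbors of each point under matrix D."""
    n = len(D)
    return [
        sorted((j for j in range(n) if j != i), key=lambda j: D[i][j])[:k]
        for i in range(n)
    ]


# Toy 1-D data: two classes whose boundary points lie close together.
X = [[0.0], [1.0], [1.8], [3.0]]
y = [0, 0, 1, 1]
print("Euclidean neighbors:", k_nearest(euclidean_distances(X), 1))
print("Adjusted neighbors: ", k_nearest(adjusted_distances(X, y), 1))
```

On this toy data the two middle points are each other's nearest neighbor under plain Euclidean distance even though they belong to different classes. Under the adjusted distance each one selects a same-class neighbor instead, which is the kind of neighbor-quality improvement the abstract describes. The neighbor lists would then feed the usual NPE reconstruction-weight and projection steps.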
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP301.6
Article ID: 1945781
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1945781.html