Research on Dimensionality Reduction and Visualization of Normalized High-Dimensional Data

Published: 2018-04-13 19:37

Topics: data embedding + data visualization; source: 2016 master's thesis, Beijing University of Posts and Telecommunications

[Abstract]: This thesis studies dimensionality reduction algorithms that are designed specifically for normalized high-dimensional data and that map such data into a low-dimensional space suitable for visualization. In industry and in research, directly visualizing high-dimensional data is an effective and convenient way to analyze and present its clustering behavior intuitively: in a scatter plot, each point corresponds to one high-dimensional data item, so the plot shows the distribution of the data, and even its cluster structure, at a glance. Data can generally be visualized directly only if it has at most 3 dimensions, so dimensionality reduction is an effective route to visualizing high-dimensional data. The essence of dimensionality reduction is to make the structure of the data in the low-dimensional space it is mapped to as close as possible to its structure in the high-dimensional space, so a dimensionality reduction algorithm must take the distribution structure of the data into account. Normalized data is a common case: it lies either on a hyperplane or on a hypersphere, and an algorithm must be optimized specifically for that structure to achieve better results on it.

Many methods for low-dimensional embedding of high-dimensional data already exist, and some of them are effective for data visualization. One example is the t-SNE algorithm, which rests on the basic assumption that the data is distributed in an unconstrained Euclidean space and that the data within a neighborhood follows a Gaussian distribution. In practice, however, data often lies in a constrained space, where its distribution is hard to model with a Gaussian. For high-dimensional hyperspherical data (L2-normalized data), the vMF (von Mises-Fisher) distribution is a better description than the Gaussian; for hyperplane data (L1-normalized data), the Dirichlet distribution is a better description. On this basis, the thesis proposes two data embedding methods, one based on the vMF distribution and one based on the Dirichlet distribution.

Any data with at most 3 dimensions can be visualized as described above, and the plotting itself has little research value. What has research value, and what this thesis focuses on, is the dimensionality reduction algorithm that maps normalized high-dimensional data into a space of at most 3 dimensions. The thesis therefore does not cover visualization methods as such and instead studies the dimensionality reduction algorithms suited to visualizing normalized high-dimensional data.

The main work of the thesis is:
1. An analysis of traditional algorithms for low-dimensional embedding of high-dimensional data, in particular t-SNE, which currently performs well: a detailed analysis of its advantages over other traditional methods and of its shortcomings on data distributed in a "constrained space".
2. For hyperspherically distributed data, a new embedding algorithm that describes the data with the vMF distribution: the vMF-SNE algorithm. The thesis analyzes how the algorithm executes and compares it experimentally with t-SNE.
3. For hyperplane-distributed data, a new embedding algorithm that describes the data with the Dirichlet distribution: the Dirichlet-SNE algorithm. The thesis likewise analyzes how the algorithm executes and compares it experimentally with t-SNE.

For two kinds of normalized high-dimensional data, the thesis studies two new dimensionality reduction algorithms suited to visualization, compares them experimentally with t-SNE, currently one of the better-performing methods, and analyzes their advantages. The work has value for both theory and application.
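The geometric premise of the abstract can be illustrated with a minimal Python sketch, assuming NumPy. It shows that L2-normalized rows lie on the unit hypersphere, that L1-normalized non-negative rows lie on the hyperplane where coordinates sum to 1, and how SNE-style neighbor probabilities could be built from a vMF-style kernel `exp(kappa * cos)`. The function names, the single fixed `kappa`, and the softmax form are illustrative assumptions; the abstract does not give the thesis's actual vMF-SNE or Dirichlet-SNE formulation.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit Euclidean norm (points on the unit hypersphere)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def l1_normalize(X):
    """Scale each non-negative row so its entries sum to 1 (points on a hyperplane)."""
    return X / X.sum(axis=1, keepdims=True)

def conditional_probs(logits):
    """Row-wise softmax over the other points, with self-similarity excluded."""
    logits = logits.copy()
    np.fill_diagonal(logits, -np.inf)             # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)   # shift for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def vmf_affinities(X_unit, kappa=1.0):
    """Neighbor probabilities from a vMF-style kernel exp(kappa * cosine similarity).

    Assumption: one fixed concentration `kappa` for all points; a real method
    would likely tune it per point, as t-SNE tunes its Gaussian bandwidth.
    """
    cos = X_unit @ X_unit.T   # dot products of unit vectors are cosines
    return conditional_probs(kappa * cos)
```

For example, `vmf_affinities(l2_normalize(X), kappa=5.0)` returns a matrix whose rows each sum to 1 and whose diagonal is 0, playing the role that the Gaussian-based conditional probabilities play in t-SNE's high-dimensional space.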
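[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13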

[References]