Research and Application of Feature Screening Methods for Ultrahigh-Dimensional Data

Published: 2018-12-05 20:11

[Abstract]: With the arrival of the big-data era, ultrahigh-dimensional data are common in fields such as weather forecasting, pattern recognition, and genetic research. In such data only a small number of covariates are associated with the response variable, so the model is sparse. Because the dimension is so high, traditional robust statistical methods and high-dimensional variable selection methods no longer apply, and the dimension must be reduced before the data can be analyzed well. In recent years many scholars have proposed convenient screening methods for ultrahigh-dimensional variables. An effective and reasonable approach works in two steps: first, a fast and efficient screening procedure reduces the data to a suitable size below the sample size while retaining all important variables; then established methods perform variable selection on the reduced high-dimensional data. This thesis proposes two ultrahigh-dimensional feature screening methods. For complex ultrahigh-dimensional data with heteroscedasticity or heavy tails, it proposes a robust screening method based on interval conditional quantiles. For incomplete ultrahigh-dimensional data in which the response variable is missing at random, it proposes a marginal correlation measure screening method based on inverse probability weighting.
The main work of this master's thesis is as follows. Chapter 1 surveys the history and current state of variable screening for ultrahigh-dimensional data and systematically reviews quantiles and missing data. Chapter 2 proposes a robust interval conditional quantile feature screening method for complex ultrahigh-dimensional data with heavy tails or outliers. Most existing work on conditional quantiles is carried out at a single quantile level, so the screening result depends on the quantile chosen in advance, and a perturbation of that quantile can make the screening unstable. This thesis introduces the idea of global quantile regression and lets the quantile level range over an interval, giving an interval-based conditional quantile screening method with a more accurate screening criterion. Theoretical proofs, simulation studies, and a real-data example show that the improved method is more stable. Chapter 3 proposes a feature screening method for ultrahigh-dimensional data whose response variable is missing at random. Existing feature screening research focuses mainly on complete data, yet responses missing at random (MAR) occur frequently in market research, social surveys, and medical research. For such data, a marginal screening procedure based on inverse probability weighting is proposed, and its effectiveness is likewise verified by theoretical proofs, numerical simulation, and a real-data example. Chapter 4 summarizes the two proposed feature screening methods and suggests directions for further research.
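The two screening ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's procedure: the abstract does not give the actual statistics, so the sketch substitutes two simple stand-ins. The first averages the absolute quantile correlation over a grid of quantile levels in an interval; the second uses a Pearson correlation weighted by the inverse of the observation probability. All function names, the simulated data, the interval [0.25, 0.75], the cutoff d = n / log(n), and the use of known observation probabilities are assumptions made for illustration.

```python
import numpy as np


def interval_quantile_utility(X, y, taus):
    """Average |quantile correlation| between y and each column of X.

    At level tau the quantile correlation is
        cov(tau - 1{y < Q_tau(y)}, x) / sqrt((tau - tau^2) * var(x)).
    Averaging over the grid `taus` is an illustrative stand-in for an
    interval-based criterion, not the statistic used in the thesis.
    """
    Xc = X - X.mean(axis=0)
    sd = X.std(axis=0)
    total = np.zeros(X.shape[1])
    for tau in taus:
        psi = tau - (y < np.quantile(y, tau)).astype(float)  # shape (n,)
        qcov = psi @ Xc / len(y)                              # shape (p,)
        total += np.abs(qcov) / (np.sqrt(tau - tau**2) * sd)
    return total / len(taus)


def ipw_correlation_utility(X, y, observed, propensity):
    """|Weighted Pearson correlation| with weights observed_i / propensity_i.

    `propensity` holds P(response observed | covariates); it is assumed
    known here, whereas in practice it would have to be estimated.
    """
    w = observed / propensity               # zero weight where y is missing
    w = w / w.sum()
    y_filled = np.where(observed, y, 0.0)   # keep NaNs out of the arithmetic
    yc = y_filled - w @ y_filled
    Xc = X - w @ X
    cov = (w * yc) @ Xc
    return np.abs(cov) / np.sqrt((w @ Xc**2) * (w @ yc**2))


def screen(utility, d):
    """Indices of the d features with the largest utility."""
    return np.argsort(utility)[::-1][:d]


# Simulated sparse model (hypothetical): 3 active features out of 1000.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
active = [0, 1, 2]
y = 3.0 * X[:, active].sum(axis=1) + rng.standard_t(df=3, size=n)  # heavy tails
d = int(n / np.log(n))  # screened model size, kept below the sample size

# Screening 1: quantile levels spread over the interval [0.25, 0.75].
taus = np.linspace(0.25, 0.75, 11)
kept_quantile = screen(interval_quantile_utility(X, y, taus), d)

# Screening 2: response missing at random, missingness driven by X[:, 3].
propensity = 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * X[:, 3])))
observed = rng.random(n) < propensity
y_missing = np.where(observed, y, np.nan)
kept_ipw = screen(ipw_correlation_utility(X, y_missing, observed, propensity), d)
```

Either set of retained indices would then be passed to an established variable selection method as the second step of the two-step approach.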
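[Degree-granting institution]: Nanjing University of Information Science and Technology (南京信息工程大学)
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: O212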
