Research and Application of Feature Screening Methods for Ultra-High-Dimensional Data
[Abstract]: With the advent of the big data era, ultra-high-dimensional data are frequently encountered in meteorological forecasting, pattern recognition, genetic research and other fields. In such data only a small number of covariates are related to the response variable, so the model is sparse despite its high dimension. Traditional robust statistical methods and variable selection methods designed for high-dimensional data are no longer applicable, and the dimension must be reduced before the data can be analyzed well. In recent years scholars have proposed a variety of convenient variable screening methods for ultra-high-dimensional data. One effective and reasonable approach proceeds in two steps. First, a fast and efficient screening procedure reduces the dimension to a suitable size below the sample size while retaining all important variables. Second, established variable selection methods for high-dimensional data are applied to the reduced data. This thesis proposes two feature screening methods for ultra-high-dimensional data. For complex ultra-high-dimensional data with heteroscedasticity and heavy tails, it proposes a robust feature screening method based on interval conditional quantiles. For incomplete ultra-high-dimensional data in which the response variable is missing at random, it proposes a feature screening method based on an inverse-probability-weighted marginal correlation measure. The main work of the thesis is as follows. Chapter 1 summarizes the history and current state of variable selection for ultra-high-dimensional data and systematically reviews quantiles and missing data.
Chapter 2 proposes a robust feature screening method based on interval conditional quantiles for complex ultra-high-dimensional data with heavy tails and outliers. Most existing studies of conditional quantile screening work at a single quantile level, so the selected variables depend on a quantile level fixed in advance, and a small perturbation of that level can make the selection unstable. Drawing on the idea of global quantile regression, this thesis proposes a conditional quantile screening method defined over an interval of quantile levels, which makes the screening criterion more accurate. Simulation studies and real-data examples show that the improved method is more stable. Chapter 3 proposes a feature screening method for response variables that are missing at random. Existing work on feature screening focuses mainly on complete data, yet responses missing at random (MAR) are common in market research, social surveys and medical research. A marginal screening procedure based on inverse probability weighting is proposed for data whose responses are missing at random, and its validity is verified through theoretical results, numerical simulation and a real-data example. Chapter 4 summarizes the two feature screening methods and points out directions for further study.
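To make the Chapter 3 idea concrete, the following is a minimal sketch of an inverse-probability-weighted marginal screening step. It is not the thesis's estimator: the abstract does not state which marginal correlation measure or which propensity model is used, so the sketch assumes a weighted Pearson correlation as the marginal utility and takes the observation probabilities as given. The function name `ipw_marginal_screen`, its parameters, and the simulated data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def ipw_marginal_screen(X, y, observed, propensity, d):
    """Keep the d covariates with the largest IPW-weighted |correlation| with y.

    X          : (n, p) covariate matrix, fully observed
    y          : (n,) response; entries with observed == 0 may be NaN
    observed   : (n,) 0/1 indicator that y is observed
    propensity : (n,) P(observed = 1 | X), assumed known or pre-estimated
    d          : number of covariates to retain (typically below n)
    """
    observed = np.asarray(observed, dtype=float)
    w = observed / propensity                      # IPW weights, zero where y is missing
    w = w / w.sum()                                # normalise weights to sum to one
    y_filled = np.where(observed == 1, y, 0.0)     # missing entries carry zero weight
    y_c = y_filled - np.sum(w * y_filled)          # weighted centring of the response
    X_c = X - w @ X                                # weighted centring of each column
    cov = (w * y_c) @ X_c                          # weighted covariances, shape (p,)
    sd_x = np.sqrt(w @ X_c**2)                     # weighted standard deviations of X
    sd_y = np.sqrt(np.sum(w * y_c**2))             # weighted standard deviation of y
    utility = np.abs(cov / (sd_x * sd_y))          # marginal utility per covariate
    keep = np.argsort(utility)[::-1][:d]           # indices of the d largest utilities
    return keep, utility

# Simulated example (assumed setup): n = 200, p = 500, two active covariates.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.standard_normal(n)
propensity = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 2])))   # missingness depends on X only (MAR)
observed = (rng.random(n) < propensity).astype(float)
y_obs = np.where(observed == 1, y, np.nan)             # hide the unobserved responses
d = int(n / np.log(n))                                 # a common screening size below n
keep, utility = ipw_marginal_screen(X, y_obs, observed, propensity, d)
```

In practice the propensities would have to be estimated from the data, for example with a parametric or kernel-based model, and the retained set `keep` would then be passed to a second-stage variable selection method, in line with the two-step strategy described in the abstract.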
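[Degree-granting institution]: Nanjing University of Information Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: O212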
Article ID: 2365405
Article link: https://www.wllwen.com/kejilunwen/yysx/2365405.html