Research on Fusion Methods and Strategies for Imputing Missing Data in Gene Expression Profiles
[Abstract]:
Background and significance: Gene expression profiles often contain large amounts of missing data, which seriously degrades the accuracy of downstream analysis. How to impute missing values effectively given the characteristics of an existing dataset, how to build an imputation strategy, and how to evaluate the effect of different imputation methods on subsequent analysis of gene expression profiles are open problems in functional genomics and tumor genomics. Solving them would allow more precise imputation strategies to further improve the performance of analytical techniques, so that researchers can make better use of gene expression data.

Methods: The study drew on theoretical and literature research methods from statistics, computer science and biomedicine. A non-parametric multiple imputation fusion method based on support vector regression (SVR-NPMI) and the non-parametric missing forest method were used to estimate and fill missing values in gene expression profiles of different sequence types, under six missingness mechanisms and a range of missing ratios. The results were compared with K-nearest neighbor (KNN) imputation, Bayesian principal component analysis (BPCA) and multiple imputation (MI). On this basis, datasets with different sequence types, missingness mechanisms and missing ratios were constructed to clarify the biological effects of the different imputation methods for different analysis purposes.

Results: (1) Five methods were used to impute the datasets. Comparative analysis showed that the normalized root mean square error (NRMSE) increased with the missing ratio. At a missing ratio of 30%, the NRMSE values of BPCA, KNN, the non-parametric missing forest method, Monte Carlo multiple imputation and the SVR-based method were 0.2877, 0.3335, 0.2018, 0.2550 and 0.1621, respectively. For the breast cancer data under the missing-at-random, time-series setting with a missing ratio of 20%, the NRMSE values of the five methods were 0.1810, 0.3874, 0.0780, 0.0917 and 0.0744, respectively; the NRMSE of the five methods was 0. [...] At a missing ratio of 10%, the conserved pairs proportion (CPP) values of the five methods were 0.8762, 0.8753, 0.8972, 0.8811 and 0.9797, respectively. SVR-NPMI was the most robust and gave the best imputation, followed by the non-parametric missing forest method and multiple imputation; KNN was the worst. Results on the other datasets were consistent with these two datasets. (2) Retention of cluster structure declined as the proportion of missing data increased, and an inappropriate imputation method can mislead subsequent analysis of gene expression profiles. Among the methods compared, SVR-NPMI was the most robust, and datasets imputed with SVR-NPMI clustered better than those imputed with the other four methods. (3) Case analyses were used to summarize imputation strategies for different gene expression datasets with missing values. SVR-NPMI imputes well under the various factors considered, but it has high computational complexity and a long running time. The non-parametric missing forest method performs well on gene expression datasets with few genes and many experimental conditions. MI performs well when the dataset is approximately normal and low-dimensional and has a low missing ratio.

Conclusion: The SVR-NPMI fusion method proposed in this study develops and enriches the models available for imputing missing data in gene expression profiles and advances new methods for bioinformatics analysis. It has academic and theoretical value as a methodological reference for large-scale data analysis in biomedicine and other fields. The imputation and analysis strategy, together with the software "filling and analyzing system of missing data in gene expression profiles" developed for the first time in this work, can help researchers determine the imputation method suited to their dataset more quickly and reliably, making data analysis more convenient.
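The evaluation loop the abstract describes (hide known values at a chosen missing ratio, impute them, then score the result with NRMSE) can be sketched as follows. This is a minimal illustration and not the thesis's implementation. The abstract does not state its NRMSE formula, so the sketch assumes one common form: RMSE over the hidden cells divided by the standard deviation of their true values. The function names `mask_mcar`, `impute_row_mean` and `nrmse` are hypothetical, only a completely-at-random mechanism is simulated, and row-mean filling is a simple stand-in baseline rather than any of the five methods compared.

```python
import math
import random


def mask_mcar(matrix, ratio, seed=0):
    """Hide about `ratio` of all cells completely at random (set them to None).

    Returns the masked copy and the sorted list of hidden (row, col) positions.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = set(rng.sample(cells, int(round(ratio * len(cells)))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
              for i, row in enumerate(matrix)]
    return masked, sorted(hidden)


def impute_row_mean(masked):
    """Baseline imputation: fill each gap with the mean of its row's observed values.

    A row with no observed values at all falls back to 0.0.
    """
    filled = []
    for row in masked:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def nrmse(truth, filled, hidden):
    """Assumed definition: RMSE over hidden cells / std of the true hidden values."""
    true_vals = [truth[i][j] for i, j in hidden]
    sq_errs = [(filled[i][j] - truth[i][j]) ** 2 for i, j in hidden]
    rmse = math.sqrt(sum(sq_errs) / len(sq_errs))
    mu = sum(true_vals) / len(true_vals)
    sd = math.sqrt(sum((v - mu) ** 2 for v in true_vals) / len(true_vals))
    if sd == 0:
        raise ValueError("true hidden values have zero variance")
    return rmse / sd


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stand-in for an expression matrix: 50 genes x 8 conditions.
    truth = [[rng.gauss(g % 5, 1.0) for _ in range(8)] for g in range(50)]
    for ratio in (0.1, 0.2, 0.3):
        masked, hidden = mask_mcar(truth, ratio, seed=1)
        filled = impute_row_mean(masked)
        print(f"missing ratio {ratio:.0%}: NRMSE = {nrmse(truth, filled, hidden):.4f}")
```

Swapping `impute_row_mean` for another imputer with the same input and output shape would allow methods to be compared on identical masks, which is the kind of side-by-side comparison the abstract reports.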
【Degree-granting institution】: Third Military Medical University
【Degree level】: Master's
【Year degree awarded】: 2017
【Classification number】: R3416
Article ID: 2239636
Article link: https://www.wllwen.com/kejilunwen/jiyingongcheng/2239636.html