Research on Fusion Methods and Strategies for Imputing Missing Data in Gene Expression Profiles

Published: 2018-09-12 16:55

[Abstract]: Background and significance: Missing values are pervasive in gene expression profiles and seriously compromise the accuracy of downstream analyses. How to impute missing data effectively according to the characteristics of the data set at hand, how to build an imputation strategy, and how to assess the effect of different imputation methods on the goals of downstream analysis are questions of considerable scientific importance in functional genomics and cancer genomics. They are also key difficulties in data analysis research within statistics and bioinformatics. Solving them would likely improve the performance of analytical techniques through more accurate imputation strategies, allowing researchers to make better use of the information in gene expression profile data and to diagnose and treat disease more effectively.
Methods: Theoretical and literature-based research methods from statistics, computer science, and biomedicine were combined to explore and verify the main questions of the study. Specifically, a nonparametric multiple imputation fusion method based on support vector regression and the nonparametric missing forest imputation method were used to estimate and impute missing data from 6 gene expression profiles of different sequence types, under different missingness mechanisms and different missing proportions. The results were compared with those of the K-nearest neighbor method, Bayesian principal component analysis, and multiple imputation (MI). Based on a set of principles for building imputation strategies, and taking the performance of each method into account, imputation strategies were constructed for different sequence types, missingness mechanisms, and missing proportions, and the biological effect of the different imputation methods on different downstream analysis goals was clarified.
Results: (1) Five methods were applied to gene expression data sets with different characteristics. Comparison showed that the Normalized Root Mean Square Error (NRMSE) increases with the missing proportion. For the non-time-series liver cancer data set at a missing proportion of 30%, the NRMSE of Bayesian principal component analysis, the K-nearest neighbor method, the nonparametric missing forest method, Monte Carlo multiple imputation, and the nonparametric multiple imputation method based on support vector regression was 0.2877, 0.3335, 0.2018, 0.2550, and 0.1621, respectively. For the time-series breast cancer data under random missingness at a missing proportion of 20%, the NRMSE of the five methods, in the same order, was 0.1810, 0.3874, 0.0780, 0.0917, and 0.0744. For the non-time-series lymphoma data set at a missing proportion of 10%, the Conserved Pairs Proportion (CPP) values of the five methods, in the same order, were 0.8762, 0.8753, 0.8972, 0.8811, and 0.9797. Overall, Support Vector Regression Nonparametric Multiple Imputation (SVR-NPMI) was relatively robust and gave the best imputation, followed by the nonparametric missing forest method and multiple imputation, with the K-nearest neighbor method performing worst. Results on the remaining data sets were consistent with these. (2) CPP tends to decrease as the missing proportion of a data set increases, and an inappropriate imputation method can mislead subsequent gene expression studies. Among the methods compared, SVR-NPMI was relatively robust, and clustering on data sets imputed with SVR-NPMI was better than with the other four methods. (3) Through case analyses, imputation strategies for different gene expression data sets with missing values were summarized. SVR-NPMI imputes well under all of the factors examined, but its computational complexity is high and imputation takes a long time. The nonparametric missing forest method achieves good results on data sets with few genes and many experimental conditions. MI gives acceptable results when the data set is approximately normal, low-dimensional, and has a low missing proportion. Whether Bayesian principal component analysis and the K-nearest neighbor method perform well depends on the choice of their key parameters.
Conclusions: The SVR-NPMI fusion method proposed in this study develops and enriches the imputation models available for missing gene expression data, advances new methods in bioinformatics analysis, and offers a methodological reference for big data analysis in biomedicine and related fields, giving it important academic and theoretical value. The imputation analysis strategy for missing gene expression data, constructed here for the first time, and the accompanying software, the "Gene Expression Profile Missing Data Imputation Analysis System", can help researchers identify the imputation method suited to their data set more quickly and reliably and carry out data analysis more conveniently.
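The NRMSE comparison reported in the abstract can be illustrated with a small sketch. The abstract does not give the exact normalization or the masking procedure, so this assumes one common form, the root of the mean squared error over the hidden cells divided by the variance of their true values, and hides cells completely at random. The function names, the synthetic matrix, and the row-mean imputer are hypothetical stand-ins for illustration only; none of the five methods compared in the thesis is implemented here.

```python
import math
import random


def mask_completely_at_random(matrix, proportion, rng):
    """Return a copy of matrix with about `proportion` of cells set to None,
    plus the sorted list of hidden (row, column) positions."""
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = set(rng.sample(cells, int(round(proportion * len(cells)))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
              for i, row in enumerate(matrix)]
    return masked, sorted(hidden)


def impute_row_mean(masked):
    """Baseline stand-in: fill each gap with the mean of the observed values
    in the same row (gene). Rows with no observed value are filled with 0.0."""
    filled = []
    for row in masked:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def nrmse(truth, imputed, hidden):
    """Assumed definition: sqrt(MSE over hidden cells / variance of their true values)."""
    true_vals = [truth[i][j] for i, j in hidden]
    mse = sum((imputed[i][j] - truth[i][j]) ** 2 for i, j in hidden) / len(hidden)
    mean_true = sum(true_vals) / len(true_vals)
    variance = sum((v - mean_true) ** 2 for v in true_vals) / len(true_vals)
    if variance == 0:
        raise ValueError("true values of the hidden cells have zero variance")
    return math.sqrt(mse / variance)


def demo():
    # Synthetic 50 genes x 8 conditions: a gene-specific baseline plus noise.
    rng = random.Random(0)
    truth = []
    for _ in range(50):
        baseline = rng.gauss(0.0, 2.0)
        truth.append([baseline + rng.gauss(0.0, 0.5) for _ in range(8)])
    for proportion in (0.1, 0.2, 0.3):
        masked, hidden = mask_completely_at_random(truth, proportion, rng)
        score = nrmse(truth, impute_row_mean(masked), hidden)
        print(f"missing proportion {proportion:.0%}: NRMSE = {score:.4f}")


if __name__ == "__main__":
    demo()
```

Replacing `impute_row_mean` with another imputer that takes the masked matrix and returns a filled one lets the same harness score that method at each missing proportion. A lower NRMSE means the imputed values are closer to the hidden true values.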
[Degree-granting institution]: Third Military Medical University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: R3416
