ISIS-Based Variable Selection in the Semiparametric Additive Hazards Model for High-Dimensional Data, and Its Application

Published: 2018-04-29 00:35

Topic: high-dimensional data + AHAZISIS model; Source: master's thesis, Chongqing Medical University, 2017

[Abstract]:
Objective: This thesis introduces ISIS-based variable selection methods for the semiparametric additive hazards model with high-dimensional data, and compares the strengths and weaknesses of the AHAZISIS, AHAZLASSOISIS, AHAZENISIS, and AHAZSSCADISIS models in high-dimensional survival analysis. The aim is to reveal the relationship between gene expression and the time to death or to another survival outcome, and so to provide a gene-level basis for diagnosis, treatment, prognosis, and the improvement of treatment plans.
Methods: The basic principles of the AHAZISIS, AHAZLASSOISIS, AHAZENISIS, and AHAZSSCADISIS models are presented. Data are simulated to reflect the typical features of bioinformatics data (high dimension, strong correlation, small sample size), and the four models are compared across the simulated settings. An empirical study is then carried out on prostate cancer data from TCGA.
Results: (1) In all simulated settings, the three initial penalty functions differed little in consistency and accuracy. (2) In all settings, among the four re-penalty functions, OS-SCAD was the most consistent, followed by SSCAD, then Lasso, with EN the worst; in accuracy, OS-SCAD and SSCAD performed better, Lasso came next, and EN was the worst. (3) In all settings, for the SSCAD re-penalty function with different numbers of steps, steps=1 gave the best consistency while steps=2,3,4,5 were close to one another; in accuracy, steps=1 was the worst while steps=2,3,4,5 were close to one another. (4) For the three initial penalty functions, the four re-penalty functions, and the different SSCAD steps, accuracy was negatively related to the size of the correlation coefficient among covariates: smaller correlation gave higher accuracy, and larger correlation gave lower accuracy. (5) In the empirical study, the AHAZISIS and AHAZSSCADISIS models selected few genes and were more interpretable. Judged by the p-values of the log-rank test, these two models also showed better predictive ability.
Conclusion: The models behaved consistently across the simulation and empirical studies. The AHAZISIS and AHAZSSCADISIS models are more interpretable and give more accurate estimates, which makes them comparatively reliable for high-dimensional, strongly correlated, small-sample data. The AHAZLASSOISIS and AHAZENISIS models performed poorly on such data; AHAZENISIS in particular had the worst interpretability and the worst estimation accuracy.
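To make the simulation setting concrete, the following minimal Python sketch generates a small-sample, high-dimensional, strongly correlated data set under an additive hazards model and ranks covariates by a Lin-Ying-type marginal statistic. It is an illustration only, not the thesis's code: the function name `marginal_stats`, the AR(1) correlation structure, all parameter values, and the screening size `n / log(n)` are assumptions made here. It shows only a single marginal screening pass, not the iterative ISIS procedure or any of the penalized refits, and it assumes there are no tied event times.

```python
import numpy as np

def marginal_stats(time, delta, Z):
    """Per-covariate statistic d_j = sum over observed events of
    (Z_ij - mean of Z_j over subjects still at risk). Assumes no ties."""
    order = np.argsort(time)                 # sort subjects by follow-up time
    Zs, ds = Z[order], delta[order]
    n = len(time)
    # After sorting, the risk set at the i-th time is rows i..n-1.
    suffix_sum = np.cumsum(Zs[::-1], axis=0)[::-1]
    at_risk = n - np.arange(n)
    risk_mean = suffix_sum / at_risk[:, None]
    return ((Zs - risk_mean) * ds[:, None]).sum(axis=0)

# Hypothetical simulation settings (not taken from the thesis).
rng = np.random.default_rng(0)
n, p, rho = 100, 500, 0.8

# AR(1)-correlated Gaussian covariates, clipped so the hazard stays positive.
Z = np.empty((n, p))
Z[:, 0] = rng.standard_normal(n)
for j in range(1, p):
    Z[:, j] = rho * Z[:, j - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(n)
Z = np.clip(Z, -2.0, 2.0)

# Additive hazard: constant baseline 3 plus linear effects of 3 covariates.
beta = np.zeros(p)
beta[:3] = 0.4
hazard = 3.0 + Z @ beta                      # at least 3 - 3*0.4*2 = 0.6 > 0
T = rng.exponential(1.0 / hazard)            # event times
C = rng.exponential(1.0, size=n)             # independent censoring times
time, delta = np.minimum(T, C), (T <= C).astype(int)

# Keep the k covariates with the largest absolute marginal statistic.
d = marginal_stats(time, delta, Z)
k = int(n / np.log(n))                       # a conventional screening size
top = np.argsort(-np.abs(d))[:k]
print("indices kept by marginal screening:", np.sort(top))
```

Raising `rho` toward 1 makes neighbouring covariates harder to tell apart, which is the kind of setting in which the abstract reports reduced accuracy.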
[Degree-granting institution]: Chongqing Medical University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: O212;R195.1
