Regularized Estimation of the Accelerated Failure Time Model Based on the Elastic Net

Published: 2018-03-19 03:20

Topic: accelerated failure time model　Focus: elastic net　Source: Southwest Jiaotong University, master's thesis, 2016　Type: degree thesis

【Abstract】: An important goal in the study of high-dimensional genetic data is to identify genetic markers associated with the onset and progression of disease; a highly representative example is the prognostic analysis of microarray data. Searching microarray gene expression data for significantly associated biomarkers is very difficult. The high dimensionality of gene expression data means that standard survival analysis techniques cannot be applied directly, and of the thousands of genes under study only a small fraction are related to the disease. When the outcome is time-to-event data, exact values are often unavailable because of censoring, which makes screening for relevant genes very challenging. We propose a Gehan estimation method for the accelerated failure time model regularized with an elastic net penalty, in order to select the genes that have an important effect on survival time. The estimates are obtained with an algorithm similar to that used for the LASSO, and properties of the estimator are proved. Unlike existing approaches based on inverse probability weighting and the Buckley and James estimator, the proposed method requires no additional assumptions about the censoring, which makes it more generally applicable. We carried out extensive numerical simulations, some of which adopt the simulation settings of the article published by Cai, T. in 2009, to assess the proposed method in finite samples. Comparison with the method of Cai, T. shows that the proposed method improves variable selection and can handle the case in which the number of variables exceeds the number of observations, which the method of Cai, T. cannot. The proposed method also has shortcomings; for example, when the correlation between covariates is large, its mean squared error is larger than that of Cai, T. Finally, we apply the proposed method to the lung adenocarcinoma study data from the article by Beer, D. and select genes associated with lung adenocarcinoma. The final selection includes genes that the article by Beer, D. did not identify, and t-tests indicate that these genes have a significant effect on whether a patient has the disease. Whether the selected genes are truly related to the disease still has to be confirmed by follow-up clinical studies.
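As a rough illustration of the kind of objective the abstract describes, the sketch below evaluates a Gehan loss for the accelerated failure time model plus an elastic net penalty. The abstract does not state the exact formulation, so several things here are assumptions: the pairwise form of the Gehan loss with a 1/n² scaling, the penalty parametrization `lam1 * ||beta||_1 + lam2 * ||beta||_2^2`, and all function and argument names. It only evaluates the objective and does not reproduce the LASSO-like algorithm used in the thesis.

```python
import numpy as np


def gehan_loss(beta, X, log_t, delta):
    """Gehan loss for the AFT model log T = X @ beta + error.

    X      : (n, p) covariate matrix
    log_t  : (n,) log of the observed (possibly censored) times
    delta  : (n,) event indicator, 1 = event observed, 0 = censored
    The 1/n^2 scaling is an assumption; other normalizations are in use.
    """
    e = log_t - X @ beta                # residuals on the log-time scale
    diff = e[None, :] - e[:, None]      # diff[i, j] = e_j - e_i
    # Only pairs whose first member i is uncensored contribute.
    return np.sum(delta[:, None] * np.maximum(diff, 0.0)) / len(e) ** 2


def elastic_net_gehan_objective(beta, X, log_t, delta, lam1, lam2):
    """Gehan loss plus an (assumed) elastic net penalty on beta."""
    l1 = lam1 * np.sum(np.abs(beta))    # LASSO part: encourages sparsity
    l2 = lam2 * np.sum(beta ** 2)       # ridge part: stabilizes correlated covariates
    return gehan_loss(beta, X, log_t, delta) + l1 + l2
```

Because the Gehan loss uses only residual comparisons in which the first member is uncensored, evaluating it needs no model for the censoring distribution, which matches the abstract's remark that no additional censoring assumptions are required. The objective is convex in `beta`, so in principle any convex solver that handles non-smooth terms could minimize it.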
【Degree-granting institution】: Southwest Jiaotong University
【Degree level】: Master's
【Year degree conferred】: 2016
【Classification number】: C81

【Related literature】