Statistical Methods for High-Dimensional Genetic Data

Published: 2018-02-15 10:32

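Keywords: scan statistic; genome-wide association study; asymptotic properties; generalized linear model; variable selection
Source: Tsinghua University, doctoral dissertation, 2016
Document type: dissertation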

[Abstract]: An important goal of human genetics research is to discover and identify the genetic basis of human diseases. Existing tests examine the association between a phenotype and the genetic variants in a prespecified region, for example the association between a disease and a gene. However, because whole-genome sequencing data contain a large number of intergenic regions, the units of analysis along the sequence are not clearly defined. We therefore propose a testing method based on a quadratic scan statistic. The method scans the whole genome sequence continuously to test for the existence and location of signal regions. It accounts for three situations: correlation among single nucleotide polymorphisms (SNPs) caused by linkage disequilibrium; the simultaneous presence of causal and non-causal mutations in a signal region; and the simultaneous presence of causal mutations with positive and negative effects in a signal region. We establish the asymptotic properties of the proposed scan method. We derive a theoretical threshold that asymptotically controls the family-wise error rate, and we show that, under certain regularity conditions, the method selects the exact signal regions with probability tending to 1. We assess the finite-sample properties of the method in simulation studies. The simulations show that our method outperforms existing methods in three settings: when mutations in the signal region are correlated, when the signal region contains non-causal mutations, and when the signal region contains causal mutations with both positive and negative effects. We apply the method to a genome-wide association study of lung cancer and identify regions of genetic variation associated with lung cancer.
Another important question in genetic studies is estimating the effect sizes of the selected variables. Selecting a set of disease-associated variables from high-dimensional genetic data while also building a reasonable prediction model on those variables is a very challenging task. Sound variable selection and accurate effect-size estimation help build prediction models that are both interpretable and effective. Penalized likelihood provides a statistical approach that performs variable selection and parameter estimation simultaneously. Motivated by this, we propose a method for variable selection and parameter estimation in generalized linear models using the SELO penalty, which we call SELO-GLM. The SELO penalty is a smooth penalty function that approximates the non-smooth L0 penalty. We give an efficient algorithm for SELO-GLM and prove the oracle property of the SELO-GLM estimator. Under fairly general regularity conditions, we show that when BIC is used to choose the tuning parameter, SELO-GLM/BIC selects the correct model with probability tending to 1. We compare SELO-GLM with several existing penalized likelihood methods in numerical simulations. The results show that SELO-GLM has better finite-sample properties than the existing methods when the number of variables is large and the signals are weak. Finally, we apply SELO-GLM to a breast cancer genetic data set and select SNPs associated with breast cancer incidence.
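The abstract describes the SELO penalty only as a smooth approximation to the L0 penalty and does not state its formula. As a rough illustration, the sketch below uses the form commonly given in the SELO literature, p(β) = (λ / log 2) · log(|β| / (|β| + τ) + 1). This form, the parameter names `lam` and `tau`, and the function names are assumptions for illustration and are not taken from the thesis.

```python
import math


def selo_penalty(beta, lam, tau):
    """SELO-style penalty for a single coefficient.

    Assumed form: (lam / log 2) * log(|beta| / (|beta| + tau) + 1).
    It equals 0 at beta = 0 and approaches lam as |beta| grows.
    """
    a = abs(beta)
    return (lam / math.log(2.0)) * math.log(a / (a + tau) + 1.0)


def l0_penalty(beta, lam):
    """L0 penalty for a single coefficient: lam if nonzero, else 0."""
    return lam if beta != 0 else 0.0


if __name__ == "__main__":
    # Compare the two penalties at a few coefficient values (tau small).
    for beta in (0.0, 0.05, 0.5, 5.0):
        print(beta, selo_penalty(beta, lam=1.0, tau=0.01), l0_penalty(beta, lam=1.0))
```

With a small `tau`, the assumed penalty is exactly zero at a zero coefficient and close to `lam` for any coefficient well away from zero, which is the sense in which it mimics the L0 penalty while remaining continuous.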
[Degree-granting institution]: Tsinghua University
[Degree level]: Doctoral
[Year degree awarded]: 2016
[Classification number]: O212

[Similar literature]