Rare Variant Association Analysis for Longitudinal Data Based on Penalized Regression

Published: 2018-04-27 03:43

Topic: longitudinal data + rare variant association analysis; Reference: Shanxi Medical University, 2017 doctoral thesis

[Abstract]:

Objective: Compared with cross-sectional data, longitudinal next-generation sequencing data make it possible to study how complex traits change over time and how genetic loci act dynamically on complex diseases, and so increase the share of complex disease that genetic variation can explain. Because rare variants occur at very low frequency, the single-locus analyses commonly used in genome-wide association studies (GWAS) have too little statistical power when applied to them. Most existing rare variant analyses therefore take the gene as the unit and study the joint genetic effect of a group of rare variants. Methods for rare variant association analysis of longitudinal data are still in their infancy. Because longitudinal next-generation sequencing data have limited sample sizes and unavoidable missing values, existing rare variant association analyses under the generalized estimating equations (GEE) and linear mixed model (LMM) frameworks face computational challenges. For longitudinal next-generation sequencing data there is thus an urgent need for association methods that are both efficient and computationally feasible, that overcome the shortcomings of existing methods, and that screen out the variants or genes with important effects on complex human diseases. Such methods would give methodological support to the identification of disease-related genes and provide evidence for the development of precision medicine and the discovery of new targets.

Methods: This thesis proposes rare variant association analysis methods for longitudinal data based on penalized GEE (pGEE) and the penalized quadratic inference function (pQIF). Within the pGEE and pQIF frameworks, and borrowing the ideas of the weighted sum statistic (WSS) and of genetic risk scores, all common and rare variants within a gene are combined in a weighted sum to give a new gene score variable. The gene score variables are then entered into pGEE and pQIF to study the relationship between gene scores and disease and so screen for genes related to complex diseases. Using the real genetic data from GAW18, continuous and binary blood pressure phenotypes were simulated to evaluate parameter estimation and gene selection by the pGEE and pQIF methods under different model conditions, and to examine the robustness and consistency of gene selection under different working correlation matrices. Finally, a pathway-based analysis of the real GAW18 data was carried out on two important hypertension-related pathways, the renin-angiotensin system (RAS) and the Ca2+/AT-IIR/a-AR signaling pathway, to identify hypertension-related genes.

Results: Parameter estimates from penalized GEE and penalized QIF were far more accurate than those from unpenalized GEE and QIF. As the sample size increased, the estimation accuracy of the penalized models approached that of the oracle model, that is, the true model containing only the variables with nonzero coefficients. For continuous responses, parameter estimation and variable selection by the pGEE and pQIF methods were slightly better than for binary responses, which reflects the greater complexity of the binary model. pQIF had a very low false selection rate, and its parameter estimates were robust and consistent across different working correlation matrix settings, in which respect it outperformed pGEE. However, when the sample size was small and the dimension high, pQIF failed to select the effect genes correctly, whereas pGEE still selected them with a fairly high correct selection rate. In rare variant association analysis of longitudinal data, therefore, pQIF should be used to avoid false selections when the sample size is small and the dimension low, and pGEE should be used when the sample size is small and the dimension high. In the Ca2+/AT-IIR/a-AR signaling pathway, pGEE and pQIF both identified the gene AGTR1. In the RAS pathway, pGEE identified THOP1 and PRCP, and pQIF identified THOP1 and ACE.

Conclusion: For the analysis of longitudinal next-generation sequencing data, pGEE-based and pQIF-based rare variant association analysis methods were constructed. The two methods complement each other, can be applied when the number of covariates grows with the sample size, and effectively identify genes related to complex diseases. As longitudinal next-generation sequencing data become more plentiful, these methods will find wider application.
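
The gene score step described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not give the weighting scheme, so the sketch assumes inverse-standard-deviation weights in the spirit of the weighted sum statistic, with a smoothed allele frequency estimated from all subjects. The function names `estimate_maf` and `gene_scores` and the toy genotype matrix are hypothetical.

```python
import math


def estimate_maf(genotypes):
    """Smoothed minor-allele frequency for each variant (column).

    genotypes: list of rows, one per subject, holding minor-allele
    counts (0, 1 or 2) for every variant in a single gene.
    The +1 / +2 smoothing keeps each frequency strictly inside (0, 1).
    """
    n_subjects = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        (sum(row[j] for row in genotypes) + 1) / (2 * n_subjects + 2)
        for j in range(n_variants)
    ]


def gene_scores(genotypes):
    """Weighted sum of allele counts per subject (one score per gene).

    Assumed weight: 1 / sqrt(q * (1 - q)), so rarer variants count more.
    """
    maf = estimate_maf(genotypes)
    weights = [1.0 / math.sqrt(q * (1.0 - q)) for q in maf]
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


# Toy data: 4 subjects, 3 variants in one hypothetical gene.
genotypes = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
    [2, 0, 1],
]
maf = estimate_maf(genotypes)
scores = gene_scores(genotypes)
```

Each subject's score would then enter the penalized model as a single covariate for that gene. Since germline genotypes do not change between visits, the same score would be repeated across a subject's repeated measurements.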

[Degree-granting institution]: Shanxi Medical University
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: O212.1
