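Research on Epistasis Detection Algorithms Based on Random Forest and Gradient Boosting Models
Published: 2018-06-21 02:11
Topics: genome-wide association analysis + epistasis; Source: Harbin Institute of Technology, 2016 master's thesis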
[Abstract]: Over the past decade, genome-wide association studies (GWAS) have improved our understanding of disease genetics and played a key role in discovering genotype-phenotype relationships. In GWAS, geneticists rely on DNA polymorphism markers to detect these associations. Single nucleotide polymorphisms (SNPs) are the most popular class of genetic marker and can be used to explore the causes of disease and the underlying biological mechanisms. To date, most genetic association studies have used a single-locus analysis strategy, in which each genetic variant is tested individually for association with a specific phenotype. This strategy has been unsuccessful for complex diseases such as hypertension, diabetes, and asthma, because single-locus analysis ignores epistasis: some loci affect disease only through interaction with other genes, while their own main effects are very small or absent. This phenomenon is also known as "missing heritability". Studies have shown that epistasis is a ubiquitous component of the etiology of complex human diseases and plays a vital role in the genetic control of many traits. The advent of high-throughput sequencing technology allows researchers to detect epistasis on a genome-wide scale and to better reveal the genetic mechanisms underlying complex diseases. The first challenge in genome-wide epistasis detection is the computational burden. This thesis proposes a prescreening model based on a hybrid random forest framework to select the best candidate set, and then applies the MDR (multifactor dimensionality reduction) algorithm within the candidate set to detect epistasis. The hybrid random forest model can screen out both epistatic models with significant main effects and pure epistatic models, in which the main effects are weak but the combined effect is significant. The algorithm is validated in experiments on four model types: the additive model, the multiplicative model, the threshold model, and the pure epistasis model.
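The MDR step applied to the candidate set can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `mdr_pair_score` and `best_pair` and the toy data are hypothetical, the random-forest prescreening is assumed to have already produced the candidate indices, and the cross-validation that MDR normally uses is omitted, so the score below is a training balanced accuracy only.

```python
from collections import Counter
from itertools import combinations


def mdr_pair_score(genotypes, labels, i, j):
    """Balanced accuracy of the MDR high-risk/low-risk rule for SNPs i and j."""
    cases, controls = Counter(), Counter()
    for row, y in zip(genotypes, labels):
        cell = (row[i], row[j])  # one multilocus genotype combination
        if y == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    threshold = n_cases / n_controls  # overall case:control ratio
    true_pos = true_neg = 0
    for cell in set(cases) | set(controls):
        # A cell is "high risk" when its case:control ratio reaches the threshold.
        if cases[cell] >= threshold * controls[cell]:
            true_pos += cases[cell]
        else:
            true_neg += controls[cell]
    return 0.5 * (true_pos / n_cases + true_neg / n_controls)


def best_pair(genotypes, labels, candidates):
    """Exhaustively score every pair drawn from the prescreened candidate set."""
    return max(combinations(sorted(candidates), 2),
               key=lambda pair: mdr_pair_score(genotypes, labels, *pair))


# Toy pure-epistasis data (an assumption for illustration): the phenotype is
# the XOR of SNP 0 and SNP 1, and SNP 2 is unrelated noise.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, _ in genotypes]
top_pair = best_pair(genotypes, labels, candidates=[0, 1, 2])
```

In this toy setting neither SNP 0 nor SNP 1 has a main effect, yet the pair separates cases from controls, which is the situation the prescreening stage has to preserve. Restricting the search to a small candidate set matters because the number of pairs grows quadratically with the number of SNPs.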
Experimental results indicate that this prescreening algorithm has practical value. In addition, we propose a permutation method based on the gradient boosting model (GBM) to detect pure epistasis with weak main effects. The proposed permutation-based gradient boosting model, pGBM, detects the SNP pairs most likely to interact by removing the influence of SNP interactions on the GBM classifier. We define interaction strength by the average difference in AUC, which allows the model to be applied to imbalanced datasets. In the experiments, when heritability is greater than 0.01 the detection power of the algorithm reaches 100%, and when heritability is below 0.01 its detection power is still much higher than that of the pRF algorithm. We also use CPU parallel computing to speed up the model and shorten computation time: with 6 CPUs running in parallel, pGBM is 4.78 times faster than pRF. The method shows great potential: detecting gene-gene interactions to study the underlying genetic architecture helps reveal the mechanisms of complex diseases.
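The idea of scoring a SNP pair by an average AUC difference can be sketched as follows. The abstract does not give the permutation scheme or the model configuration, so several things here are assumptions: each SNP of the pair is shuffled independently within cases and within controls (which keeps each SNP's marginal association but breaks their joint pattern), the model sees only the two SNPs of the pair, and a simple genotype-table scorer stands in for the GBM so that the example needs only the standard library. All function names are hypothetical.

```python
import random
from collections import defaultdict


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def table_scores(cells, labels):
    """Stand-in classifier (NOT a GBM): score = case fraction of each genotype cell."""
    counts = defaultdict(lambda: [0, 0])
    for cell, y in zip(cells, labels):
        counts[cell][y] += 1
    return [counts[cell][1] / sum(counts[cell]) for cell in cells]


def permute_within_class(column, labels, rng):
    """Shuffle one SNP column separately inside controls and inside cases."""
    out = list(column)
    for cls in (0, 1):
        idx = [k for k, y in enumerate(labels) if y == cls]
        vals = [column[k] for k in idx]
        rng.shuffle(vals)
        for k, v in zip(idx, vals):
            out[k] = v
    return out


def mean_auc_drop(genotypes, labels, i, j, n_perm=20, seed=0):
    """Average of (AUC on original data) - (AUC after breaking the i-j interaction)."""
    rng = random.Random(seed)
    col_i = [row[i] for row in genotypes]
    col_j = [row[j] for row in genotypes]
    base = auc(table_scores(list(zip(col_i, col_j)), labels), labels)
    drops = []
    for _ in range(n_perm):
        perm_i = permute_within_class(col_i, labels, rng)
        perm_j = permute_within_class(col_j, labels, rng)
        permuted = auc(table_scores(list(zip(perm_i, perm_j)), labels), labels)
        drops.append(base - permuted)
    return sum(drops) / n_perm


# Toy pure-epistasis data (an assumption): phenotype = SNP0 XOR SNP1, SNP2 is noise.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 25
labels = [a ^ b for a, b, _ in genotypes]
drop_interacting = mean_auc_drop(genotypes, labels, 0, 1)
drop_noise = mean_auc_drop(genotypes, labels, 0, 2)
```

On this toy data the interacting pair shows a clearly positive average drop, while the non-interacting pair does not. Because AUC compares the ranking of cases against controls, it does not depend on the case:control ratio the way raw accuracy does, which is consistent with the abstract's remark about imbalanced datasets. Each pair is scored independently of the others, so the loop over pairs is a natural unit to distribute across CPU cores.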
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: R440
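[Similar literature]
Related master's theses (top 2)
1 Zhang Junwei; Research on Epistasis Detection Algorithms Based on Random Forest and Gradient Boosting Models [D]; Harbin Institute of Technology; 2016
2 Sun An; Research on Epistasis Detection Algorithms and Their Implementation under the MapReduce Framework [D]; Jilin University; 2014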
Article ID: 2046717
Article link: https://www.wllwen.com/linchuangyixuelunwen/2046717.html