Design and Implementation of a Hadoop-Based Genome-Wide Association Study System

Published: 2018-04-10 01:15

Topic: genome-wide association study　Focus: Hadoop　Source: Tianjin University, master's thesis, 2012

[Abstract]: With the release of the fine-scale map of the human genome, genome-wide association studies (GWAS) have developed rapidly and become an important means of studying the genetic factors behind complex human diseases. Genotype imputation increases the density of single nucleotide polymorphisms (SNPs) in the study data and improves the power of GWAS to detect disease-causing genes, so imputation-based GWAS methods are widely used. In practice, however, this approach faces two problems: (1) there is no integrated tool that carries out the data processing and analysis of an entire GWAS; (2) current GWAS tools for genotype imputation and association testing cannot cope effectively with the large growth in data volume and computation caused by growing reference data. Building on a study of imputation-based GWAS methods and the Hadoop platform, this thesis implements CloudAssoc, a genome-wide association study system based on Hadoop. The system has three main functional modules: data preprocessing, genotype imputation, and SNP association testing. The data preprocessing module provides common data conversion and quality control functions. The genotype imputation module, designed and implemented on Hadoop, uses public reference data to predict genotypes at SNP loci that were not genotyped in the study data. The association testing module, also implemented on Hadoop, runs SNP association tests on the imputed study data.

The key to CloudAssoc's improvement in GWAS efficiency is the parallel implementation of the imputation and association testing modules. Based on an analysis of the models and algorithms used by the imputation software IMPUTE2, this thesis splits the analysis interval of the data so that a computation consuming large amounts of time and resources is divided into many small tasks that execute in a distributed fashion on a Hadoop cluster, and implements parallel genotype imputation on the Hadoop streaming framework. A similar method is used to parallelize the association testing module.

Finally, the system is tested. The first set of tests examines the scalability and efficiency of the parallelized software in CloudAssoc, and the relationship between running time and the size of the data-splitting window. The results show that the parallelized software achieves near-linear speedup, with good scalability and efficiency. An overall test of CloudAssoc then shows that the system can efficiently complete an imputation-based GWAS analysis of whole-genome data.
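The interval-splitting idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's code: the function names, the tab-separated task format, and the example coordinates are all assumptions made for illustration. It only shows how one large analysis interval might be cut into fixed-size windows, with one text line per window that a Hadoop streaming job could hand to a mapper as an input record.

```python
from typing import Iterator, List, Tuple


def split_region(start: int, end: int, window: int) -> Iterator[Tuple[int, int]]:
    """Yield consecutive, non-overlapping (lo, hi) windows covering [start, end]."""
    if window <= 0:
        raise ValueError("window must be positive")
    lo = start
    while lo <= end:
        # The last window is truncated so it never runs past the interval end.
        hi = min(lo + window - 1, end)
        yield lo, hi
        lo = hi + 1


def make_task_lines(chrom: str, start: int, end: int, window: int) -> List[str]:
    """Build one tab-separated task line per window (hypothetical format)."""
    return [f"{chrom}\t{lo}\t{hi}" for lo, hi in split_region(start, end, window)]


if __name__ == "__main__":
    # Hypothetical example: a 12 Mb interval cut into 5 Mb windows.
    for line in make_task_lines("chr22", 1, 12_000_000, 5_000_000):
        print(line)
```

In such a setup each mapper would read its task line and run the imputation or association tool on just that window, which is why the window size affects both the number of tasks and the running time of each one.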
[Degree-granting institution]: Tianjin University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: R394; TP311.52

[References]