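Key Technologies and System Development for Variant Analysis of Omics Big Data
Published: 2018-04-27 06:05
Topics: genomics + variation; Source: Harbin Institute of Technology, master's thesis, 2017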
[Abstract]: The rapid development of next-generation sequencing technology has brought profound changes to bioinformatics research. The volume of human genome data keeps growing and more and more variant information is being produced, which gives precision medicine an opportunity to explore the underlying causes of disease. At the same time, this puts unprecedented pressure on computing infrastructure: there is a large gap between how fast genomic data is generated and how well it can be processed, and the accompanying data analysis, storage, and retrieval technologies lag behind. This has become the bottleneck for knowledge mining from omics big data. The emergence of cloud computing, designed to handle PB-scale data, offers a promising solution to these growing demands. This thesis explores how to use big data technology to achieve efficient analysis, secure storage, and fast retrieval of omics variant data.
In this work, building on a study of the genomic variant detection and analysis process and drawing on big data technologies, we parallelize the variant detection tool GATK in a distributed fashion and implement GATK-Spark, which is based on an in-memory computing model. We then use the distributed database HBase to store the highly annotated VCF variant files produced by GATK-Spark, and apply Fisher's exact test to the stored variant information for allele frequency analysis, forming a complete pipeline for omics variant big data analysis. The genomic variant big data management and analysis platform we developed integrates variant detection, query, and analysis modules. The variant detection tool GATK-Spark delivers a large performance improvement over GATK: on a 28-core Spark cluster, the analysis time for individual whole-genome resequencing data drops from 3 days to 4 hours (roughly an 18-fold reduction). In addition, the variants produced by GATK-Spark are stored directly in the query engine for subsequent variant analysis.
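The allele frequency analysis mentioned above relies on Fisher's exact test. As a rough illustration only, the sketch below computes a two-sided p-value for a 2x2 table of allele counts using the Python standard library. The function name `fisher_exact_two_sided` and the counts in the usage lines are assumptions made for this example; the abstract does not describe how the thesis implements the test.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Hypothetical layout for allele counts:
        a = alternate alleles in group 1, b = reference alleles in group 1
        c = alternate alleles in group 2, d = reference alleles in group 2
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the
    # observed one; the small relative tolerance absorbs rounding error.
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Made-up counts: 30 alternate / 70 reference alleles in one group,
# 10 alternate / 90 reference alleles in the other.
print(fisher_exact_two_sided(30, 70, 10, 90))
```

A small p-value here would suggest that the alternate allele frequency differs between the two groups. In practice a library routine such as `scipy.stats.fisher_exact` would normally be preferred over a hand-written version.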
The query engine provides a programmable and interactive query interface and supports integration with a variety of widely used genome browsers and tools. To compensate for HBase supporting only a primary index on the row key, we use Elasticsearch to provide a secondary index mechanism for HBase, which improves the performance of queries that do not use the row key by nearly a hundred times. In addition, this thesis presents an allele frequency analysis algorithm based on Fisher's exact test, which offers an approach for downstream analysis of the variant information stored in HBase. Good integration with existing tools and a scalable database make the system well suited to the growing needs of storing, searching, and analyzing genomic big data. The system greatly simplifies the variant analysis process and provides strong support for further exploration of the relationship between variants and the causes of disease.
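The secondary index idea described above can be sketched with a toy in-memory model. This is not HBase or Elasticsearch code and uses neither client library: the class `VariantStore`, the row key layout `chrom:position:ref:alt`, and the column names are all hypothetical, chosen only to show why a lookup on a non-row-key field is cheap with an extra index and needs a full scan without one.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


class VariantStore:
    """Toy stand-in for a row-key table plus an external secondary index."""

    def __init__(self) -> None:
        # Primary table: row key -> column values (the only index a
        # row-key store offers natively).
        self.rows: Dict[str, Dict[str, str]] = {}
        # Secondary index: (column, value) -> set of row keys.
        self.index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def put(self, row_key: str, columns: Dict[str, str]) -> None:
        # Drop stale index entries when an existing row is overwritten.
        for col, val in self.rows.get(row_key, {}).items():
            self.index[(col, val)].discard(row_key)
        self.rows[row_key] = dict(columns)
        for col, val in columns.items():
            self.index[(col, val)].add(row_key)

    def get(self, row_key: str) -> Optional[Dict[str, str]]:
        # Direct lookup by row key.
        return self.rows.get(row_key)

    def scan_where(self, column: str, value: str) -> List[str]:
        # Without a secondary index: inspect every row.
        return sorted(k for k, cols in self.rows.items() if cols.get(column) == value)

    def find_where(self, column: str, value: str) -> List[str]:
        # With a secondary index: one dictionary lookup, then fetch by row key.
        return sorted(self.index.get((column, value), set()))


store = VariantStore()
store.put("chr1:000012345:A:G", {"gene": "GENE_A", "filter": "PASS"})
store.put("chr1:000067890:C:T", {"gene": "GENE_B", "filter": "PASS"})
store.put("chr2:000000321:G:A", {"gene": "GENE_A", "filter": "LowQual"})

# Both paths return the same row keys; only the amount of work differs.
print(store.find_where("gene", "GENE_A"))
print(store.scan_where("gene", "GENE_A"))
```

In a real deployment the index would live in a separate search service and would have to be kept consistent with the table on every write, which `put` only hints at here.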
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: Q811.4;TP311.13
Article ID: 1809499
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1809499.html