Research on Parallel Processing and Analysis Methods for Gene Information in the Omics Big Data Environment

Published: 2017-12-28 20:00

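Keywords: Research on Parallel Processing and Analysis Methods for Gene Information in the Omics Big Data Environment  Source: University of Science and Technology of China, 2017 master's thesis  Document type: degree thesis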

[Abstract]: As next-generation sequencing technology has developed and matured, high-throughput sequencing has become a routine tool in biological and medical research and is about to see wide use in industries such as agriculture and healthcare, giving rise to emerging fields such as precision medicine and molecular breeding. Unlike earlier low-throughput technologies, high-throughput sequencing produces multiple kinds of omics data (whole-genome, whole-exome, transcriptome, metagenome, and others) that are high in throughput, large in volume, complex, and heterogeneous. Processing and analyzing these data involves many tedious steps and places high demands on both software and hardware. Processing and analyzing high-throughput sequencing data quickly and efficiently has therefore become the bottleneck to wider adoption of the technology. For example, precision medicine, which currently attracts wide attention, relies mainly on gene sequencing; efficiently processing and analyzing massive amounts of patient sequencing data to extract personalized information on cancer drivers is the key difficulty in achieving precise tumor diagnosis and treatment. From the first generation to the current third generation, sequencing throughput has grown explosively. First-generation sequencing delivers only 0.2MB/run, second-generation sequencing, represented by Illumina, reaches about 1500GB/run, and third-generation sequencing reaches 30-400bp/s. These advances strongly support related biological and medical research, but handling the massive volume of sequencing data has become an urgent problem for both academia and industry.
To address this problem, this thesis designs and implements SeqReduce, an automated parallel processing system for high-throughput sequencing data built on Hadoop. Its main goal is to use computer clusters to provide an efficient, stable, and low-cost automated tool for analyzing massive sequencing data. The core design idea is to use the MapReduce parallel computing framework to split the sequencing data, align it, and query related information, and finally to output a mutated-gene information file or a transcript file. The system has the following advantages: (1) It is compatible with sequencing data produced by multiple platforms, including the mainstream Illumina platform and Roche 454. (2) It can process DNA-seq data and can also analyze RNA-seq data. (3) It provides two parallel processing modes, a low-performance mode and a high-performance mode, so that it can adapt to hardware environments with different configurations.
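The split, align, and aggregate flow described in the abstract can be illustrated with a small, single-machine sketch. This is not SeqReduce's code and does not use Hadoop: every function name here (`parse_fastq`, `split_reads`, `map_chunk`, `reduce_hits`, `run_pipeline`) is hypothetical, the exact substring match is only a stand-in for a real aligner, and the `workers` parameter is an assumed way to mimic a lighter and a heavier execution mode, since the abstract does not say how the two modes differ.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Read = Tuple[str, str]  # (read_id, sequence)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (read_id, sequence) from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        # Drop the leading '@' of the header line.
        yield record[0].strip()[1:], record[1].strip()


def split_reads(reads: Iterable[Read], chunk_size: int) -> Iterator[List[Read]]:
    """Split step: group reads into fixed-size chunks."""
    it = iter(reads)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk: List[Read], reference: str) -> List[Tuple[int, str]]:
    """Map step: emit (position, read_id) for each read found in the reference.

    Assumption: exact substring search replaces a real alignment tool.
    """
    hits = []
    for read_id, seq in chunk:
        pos = reference.find(seq)
        if pos >= 0:
            hits.append((pos, read_id))
    return hits


def reduce_hits(mapped: Iterable[List[Tuple[int, str]]]) -> Dict[int, List[str]]:
    """Reduce step: group read ids by the reference position they matched."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for pairs in mapped:
        for pos, read_id in pairs:
            grouped[pos].append(read_id)
    return dict(grouped)


def run_pipeline(fastq_lines: Iterable[str], reference: str,
                 chunk_size: int = 2, workers: int = 1) -> Dict[int, List[str]]:
    """Run split -> map -> reduce; workers > 1 maps chunks concurrently."""
    chunks = list(split_reads(parse_fastq(fastq_lines), chunk_size))
    if workers > 1:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            mapped = list(pool.map(lambda c: map_chunk(c, reference), chunks))
    else:
        mapped = [map_chunk(c, reference) for c in chunks]
    return reduce_hits(mapped)
```

Calling `run_pipeline(lines, reference, workers=4)` on the lines of a small FASTQ file returns a dictionary from reference position to the ids of the reads that matched there. In a real Hadoop deployment the split and shuffle stages would be handled by the framework rather than by in-process lists.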
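[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4;TP311.13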

[References]