Building a Hybrid Second- and Third-Generation Genome Assembly Pipeline and Parallel Optimization Methods for Sequence Assembly

Published: 2018-07-05 18:26

  Topics: bioinformatics + Rocks cluster; Source: Kunming University of Science and Technology, master's thesis, 2017


[Abstract]: With the rapid development of bioinformatics, the world has entered an era of life science and information science. Third-generation sequencing, with its long reads, has thoroughly transformed genomics. As sequencing technology advances, bioinformatics faces new challenges: the growing volume of sequencing data demands more computing resources for analysis, and reads with new characteristics call for new assembly techniques. Starting from these challenges, this thesis studies a hybrid second- and third-generation assembly strategy and parallel optimization methods for sequence assembly, both to meet researchers' needs for analyzing second- and third-generation sequencing data and to save computing resources during assembly. It comprises three pieces of work.

First, biological data are large in volume and diverse in kind, so processing them depends on powerful computing resources, and the project required a dedicated bioinformatics platform. The thesis builds such a platform on the Rocks cluster system (Rocks Cluster), using existing cluster computing technology to integrate computing resources and to give bioinformatics research a convenient, fast, and powerful data processing environment.

Second, the thesis analyzes the characteristics of the two kinds of data: third-generation reads are long but have a relatively high error rate, while second-generation reads are short but have a low error rate. On this basis it builds a hybrid second- and third-generation genome assembly pipeline on the platform that exploits the strengths of both. Second-generation data are used to correct errors in the third-generation reads, and the corrected third-generation reads are then assembled, which yields a better assembly.

Finally, the error-correction step of hybrid assembly consumes a large amount of memory, and for species with large genomes the existing platform cannot meet that demand. To address this, the thesis analyzes memory usage during assembly and designs solutions suited to the structure of the laboratory's platform. The first uses GlobalArray to virtualize and manage the memory of different nodes, so that data and computation run separately. The second is a process-level parallel optimization method that relieves memory pressure on a single node. In search of a better solution, the thesis also takes the algorithm behind the hybrid error-correction method itself as the point of attack. Following the idea of assembling second- and third-generation data jointly, it first assembles the second-generation data into a high-accuracy assembly graph, then aligns the third-generation reads to that graph and uses their long read length to decide which paths to follow. This simplifies the graph and avoids the error-correction step altogether.
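The graph-based idea at the end of the abstract can be illustrated with a toy sketch. The abstract does not specify the graph type or the alignment method, so everything below is an assumption made for illustration: the graph is a de Bruijn graph over short reads, the "alignment" is exact k-mer matching against an error-free long read, and the function names `build_de_bruijn` and `walk_with_long_read` are hypothetical. Real third-generation reads contain errors and would need a tolerant aligner.

```python
from collections import defaultdict


def build_de_bruijn(short_reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some short read."""
    graph = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_with_long_read(graph, start, long_read, k):
    """Extend a path from `start`; at a branch, follow the edge the long read supports.

    Assumption: support means the edge's k-mer occurs exactly in the long read.
    """
    long_kmers = {long_read[i:i + k] for i in range(len(long_read) - k + 1)}
    path, node, seen = start, start, {start}
    while graph.get(node):
        successors = sorted(graph[node])
        if len(successors) > 1:
            # Keep only branches whose edge k-mer appears in the long read.
            successors = [s for s in successors if node + s[-1] in long_kmers]
            if len(successors) != 1:
                break  # the long read does not resolve this branch
        nxt = successors[0]
        if nxt in seen:
            break  # stop rather than loop around a cycle
        path += nxt[-1]
        seen.add(nxt)
        node = nxt
    return path


if __name__ == "__main__":
    # Two short reads that share a prefix and then diverge, creating a branch at "GGC".
    graph = build_de_bruijn(["ATGGCA", "TGGCTT"], k=4)
    print(walk_with_long_read(graph, "ATG", "ATGGCTT", k=4))  # ATGGCTT
```

In this toy case the node `GGC` has two successors, and the long read selects the one branch it spans, so the path is resolved without correcting any read first. That is the simplification the abstract describes, reduced to its smallest form.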
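[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4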


Article ID: 2101287


Link to this article: https://www.wllwen.com/shoufeilunwen/benkebiyelunwen/2101287.html

