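Research on Fast Migration Algorithms Between Reference-Based Compressed Gene Databases
Published: 2018-02-09 06:44
Keywords: reference-based genome compression; DNA data compression; reference sequence conversion; FASTA; Loongson. Source: Shenzhen University, 2017 master's thesis. Document type: dissertation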
[Abstract]: With the falling cost of gene sequencing and the demand for large-scale genomic data from emerging technologies such as precision medicine and deep learning on genomic data, we have entered an era of explosive growth in genomic data. How to store and transmit such massive data has become an active research topic, and reference-based compression algorithms are widely used in major gene databases because of their high compression ratios. However, these algorithms depend on the reference genome data, which severely restricts sharing, merging, and transmitting the compressed data they produce. This thesis studies the problem that compressed gene databases built on different reference genomes cannot be shared directly, and proposes a set of methods for quickly converting compressed data from one reference sequence to another. The main work is as follows. (1) Gene compression algorithms are classified and the characteristics of each class are discussed; several recent reference-based compression algorithms are analyzed in detail. (2) For data compressed with the same algorithm but against different reference sequences, a fast reference conversion algorithm is developed. The algorithm mainly exploits the similarity between reference genome sequences to migrate quickly from one reference to another. Experimental results show that migration takes far less time than the baseline approach of decompressing and then recompressing, and they also point the way for the subsequent work. (3) Building on (2), the work is extended to migration across different compression algorithms and different reference sequences. Three compression algorithms are selected and analyzed to extract their common features. Combining the characteristics of the three algorithms, and building on the fast migration algorithm of (2), the compression ratio of the migrated data is improved, and two migration algorithms are designed to enable mutual migration among the three compression algorithms. Extensive experiments verify the efficiency of the algorithms. (4) Finally, TReC, a complete genomic data management tool with compression, migration, and decompression functions, is implemented for the Loongson platform. Its performance is analyzed, and TReC is then optimized with multiple processes so that it can make full use of Loongson's multiple cores to run faster. Starting from the observation that reference-based compression algorithms depend too heavily on the reference sequence, this thesis proposes two effective migration algorithms with a large advantage in migration time. These techniques can effectively alleviate the problem of migrating between reference-based compressed gene databases and offer experience and a reference for subsequent research.
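The idea in point (2) of the abstract, moving compressed data from one reference to another by exploiting the similarity between the two references rather than decompressing and recompressing, can be illustrated with a toy sketch. This is not the thesis's algorithm or the TReC tool: the token format (`"M"` for a copy from the reference, `"L"` for a literal), the function names, and the use of Python's `difflib.SequenceMatcher` to find identical blocks between the references are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def compress(target, ref, min_len=4):
    """Toy reference-based compressor (hypothetical format):
    ("M", pos, length) copies ref[pos:pos+length]; ("L", text) is a literal."""
    tokens, i = [], 0
    while i < len(target):
        k = min_len
        if i + k <= len(target) and ref.find(target[i:i + k]) != -1:
            # Greedily extend the match while it still occurs in the reference.
            while i + k < len(target) and ref.find(target[i:i + k + 1]) != -1:
                k += 1
            tokens.append(("M", ref.find(target[i:i + k]), k))
            i += k
        else:
            tokens.append(("L", target[i]))
            i += 1
    return tokens

def decompress(tokens, ref):
    """Rebuild the target from tokens and the reference they point into."""
    return "".join(ref[t[1]:t[1] + t[2]] if t[0] == "M" else t[1] for t in tokens)

def identical_blocks(ref_a, ref_b):
    """Blocks (a_start, b_start, size) where ref_a and ref_b are identical.
    Computed once per pair of references, sorted by a_start, non-overlapping."""
    matcher = SequenceMatcher(None, ref_a, ref_b, autojunk=False)
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size]

def migrate(tokens, ref_a, blocks):
    """Rewrite tokens that point into ref_a so that they point into ref_b.
    Parts of a match inside an identical block are re-addressed; parts outside
    any block fall back to literals taken from ref_a."""
    out = []
    for tok in tokens:
        if tok[0] == "L":
            out.append(tok)          # literals do not depend on the reference
            continue
        _, pos, length = tok
        cur, end = pos, pos + length
        for a, b, size in blocks:
            if a + size <= cur:
                continue             # block lies entirely before the match
            if a >= end:
                break                # block lies entirely after the match
            if a > cur:
                out.append(("L", ref_a[cur:a]))   # gap not shared with ref_b
                cur = a
            stop = min(end, a + size)
            out.append(("M", b + (cur - a), stop - cur))
            cur = stop
        if cur < end:
            out.append(("L", ref_a[cur:end]))
    return out

# Small demonstration with made-up sequences.
ref_a = "ACGTACGTTTGACCAGTAGGCTAACGT"
ref_b = "ACGTACCTTTGACCAGTAGGCTTAACGT"
target = "ACGTACGTTTGACCAGAAGGCTAACG"
tokens_a = compress(target, ref_a)
tokens_b = migrate(tokens_a, ref_a, identical_blocks(ref_a, ref_b))
print(decompress(tokens_b, ref_b) == target)
```

In this sketch the expensive step, aligning the two references, is done once and reused for every compressed sequence, which is one plausible reason a migration of this kind can be cheaper than decompressing and recompressing each sequence. The sketch makes no attempt to preserve compression ratio; the abstract's point (3) addresses that concern separately.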
[Degree-granting institution]: Shenzhen University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: Q811.4
[Citation]: Zhang Yijun (张义军); Research on Fast Migration Algorithms Between Reference-Based Compressed Gene Databases [D]; Shenzhen University; 2017