基于Hadoop基因序列比对BWT索引的建立方法研究
发布时间:2018-02-04 12:12
本文关键词: 基因序列 BWT索引 后缀数组 Hadoop MapReduce 出处:《内蒙古农业大学》2016年硕士论文 论文类型:学位论文
【摘要】:由于基因数据的增长速度飞快,人工进行序列比对已经无法满足科研人员的需求,那么机器比对已经走上了舞台,基因序列比对是基因数据分析和处理的基础。而现在的序列比对算法大致分为两类,一类是精确比对算法,另一类是模糊比对算法。目前,大部分的基因序列比对方法都是启发式算法,该类算法大致分为两步:建立索引和序列比对,所以无论是精确比对算法和非精确比对算法都离不开索引结构。由此可见,建立索引是基因序列比对算法的重要步骤,常见的索引构建算法大致分为两类,一类是基于哈希表的算法,另一类是基于后缀树或后缀数组的算法。而BWT(Burrows-Wheeler Transform)索引是基于后缀数组中比较重要的索引结构。目前,构建较大基因组序列(例如,人类基因组序列)的BWT索引需要几个小时的串行计算。本文提出一种基于Hadoop的并行计算方法构建后缀数组和BWT索引。算法使用MapReduce的数据处理功能,并且更改了原有的使用哈希方式的Partitioner,本文使用直接分配任务来建立索引。本文依次将基因链首的一个碱基轮转到基因链尾并与链尾的17个字符形成一个Key以及相应的Map任务,将这些Map任务根据新改写Partitioner分配给Reduce。最终得到全序的后缀数组和BWT索引,减少建立索引的时间。通过实验数据表明,本文提出的方法可以节省索引构建的时间,达到了预期目的,并验证了算法的有效性。
[Abstract]:Because of the rapid growth of genetic data, artificial sequence alignment has been unable to meet the needs of researchers, so machine alignment has been on the stage. Sequence alignment is the basis of gene data analysis and processing. Now, sequence alignment algorithms are divided into two categories, one is accurate alignment algorithm, the other is fuzzy alignment algorithm. Most gene sequence alignment methods are heuristic algorithms, which can be divided into two steps: indexing and sequence alignment. Therefore, both accurate alignment algorithm and imprecise alignment algorithm can not be separated from index structure. Thus, building index is an important step of gene sequence alignment algorithm, and the common index construction algorithm can be divided into two categories. One is the algorithm based on hash table. The other is the algorithm based on suffix tree or suffix array. And BWT(Burrows-Wheeler transform). Index is based on the suffix array of the more important index structure. Construction of large genome sequences (for example. Human genome sequence. This paper presents a parallel computing method based on Hadoop to construct suffix array and BWT index. The algorithm uses the data processing of MapReduce. Function. And changed the original hash Partitioner. In this paper, the direct assignment task is used to build the index. In this paper, one base of the first gene chain is rotated to the tail of the gene chain and a Key and the corresponding Map task are formed with the 17 characters of the tail. Assign these Map tasks to Reducebased on the new overwrite Partitioner, resulting in a fully ordered suffix array and BWT index. The experimental data show that the proposed method can save the time of index construction, achieve the expected purpose, and verify the effectiveness of the algorithm.
【学位授予单位】:内蒙古农业大学
【学位级别】:硕士
【学位授予年份】:2016
【分类号】:TP391.3
,
本文编号:1490179
本文链接:https://www.wllwen.com/kejilunwen/jiyingongcheng/1490179.html
最近更新
教材专著