基于k-mer短序列的DNA数据压缩算法研究

发布时间：2019-04-01 15:56

【摘要】：DNA序列数据量巨大，其相关压缩技术是生物信息学中必不可少的关键技术，是DNA序列数据有效存储、读取和传输的基础，是进行DNA测序拼接、序列比对、基因预测等的前提，因此，对DNA序列数据的压缩技术进行研究具有非常重要的理论意义与应用价值。近年来，随着信息处理技术的发展以及对DNA序列数据自身特点研究的深入，各种专门针对DNA序列数据的压缩算法大量涌现。本文从DNA序列数据具有高度重复性的特点出发，对序列中长度很小的k-mer子序列片段重复性进行了统计分析，并归纳和总结了DNA序列数据碱基及k-mer短序列分布的重复性规律。针对DNA序列中，不同片段区域k-mer分布具有很大差异性的特点，提出了基于分段编码的DNA数据压缩算法。在预处理阶段，将DNA序列分割成64个碱基一组的短序列片段，对每一个片段分别进行独立考虑。统计片段中重复率最高的3-mer子序列，利用其在片段中出现的次数和位置等信息进行替代编码，从而对DNA序列进行压缩。分段编码压缩算法简单，对常用基准测试序列都能具有比较好的压缩性能。针对DNA序列中，k-mer长度很小时，部分k-mer具有很高重复性的特点，提出了基于GA-PSO混合优化的DNA数据压缩算法，将DNA序列中等长k-mer的不同组合抽象成不同的寻优粒子，用GA-PSO混合优化算法搜索序列中重复性高，能达到最大压缩率的最优k-mer组合，对序列中出现的最优k-mer进行编码，从而对序列进行压缩。GA-PSO混合优化算法中，每一轮迭代寻优前，先用支持向量机模型将DNA碱基粒子群分成两组，，然后分别采用GA算法和PSO算法优化。实验结果表明，本算法能获得比较高的压缩率，而且相比于传统算法，具有更好的鲁棒性。
[Abstract]:DNA sequence is a huge amount of data, and its correlation compression technology is the essential key technology in bioinformatics, is the basis of effective storage, reading and transmission of DNA sequence data, and is the premise of DNA sequence splicing, sequence alignment, gene prediction and so on. Therefore, the research on the compression technology of DNA sequence data has very important theoretical significance and application value. In recent years, with the development of information processing technology and in-depth research on the characteristics of DNA sequence data, a large number of compression algorithms for DNA sequence data have emerged. Based on the high repeatability of the DNA sequence data, the reproducibility of the small length k-mer subsequence fragment in the sequence was statistically analyzed in this paper. The repeatability of base and k-mer short sequence distribution of DNA sequence data were summarized and summarized. In view of the great difference of k-mer distribution in different fragment regions in DNA sequences, a DNA data compression algorithm based on piecewise coding is proposed. In the pre-processing phase, the DNA sequence is divided into 64-base groups of short sequence fragments, and each fragment is considered independently. In order to compress the DNA sequence, the 3-mer subsequence, which has the highest repetition rate, is substituted for coding by using the information of the number and position of its occurrence in the fragment. Piecewise coding compression algorithm is simple and has good compression performance for common benchmark sequences. In view of the fact that the length of k-mer is very small in DNA sequences and some k-mer have high repeatability, a DNA data compression algorithm based on GA-PSO hybrid optimization is proposed. The different combinations of equal length k-mer in DNA sequences are abstracted into different optimization particles. The GA-PSO hybrid optimization algorithm is used to search the optimal k-mer combinations with high repeatability and maximum compression ratio. The optimal k-mer appeared in the sequence is encoded to compress the sequence. In the GA-PSO hybrid optimization algorithm, the DNA base particle swarm is divided into two groups by using the support vector machine model before each iteration optimization. Then GA algorithm and PSO algorithm are used to optimize. The experimental results show that the proposed algorithm can achieve high compression ratio and has better robustness than the traditional algorithm.
【学位授予单位】：华南理工大学
【学位级别】：硕士
【学位授予年份】：2014
【分类号】：TN911.7

【参考文献】