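Research on Gene Identification Algorithms Based on Sequence Statistical Features
Topics: gene identification + multi-feature fusion; Source: Harbin Institute of Technology, master's thesis, 2017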
[Abstract]: Given the vast whole-genome data available for model organisms, identifying protein-coding gene sequences efficiently and accurately is of great practical value. For this reason gene identification, as a foundation of bioinformatics research and development, has long attracted researchers' attention. Traditional approaches rely mainly on laborious biological experiments, which are slow and costly in time and effort. This thesis instead draws on the theory and methods of signal processing, such as the Fourier transform, filter algorithms, intelligent computing, and statistical learning, to study the problem in depth from the perspective of sequence statistical features. The period-3 property, an important statistical feature, has been widely used in gene identification. To obtain better identification performance, researchers have contributed substantially to the signal filtering of gene sequences and to the enhancement of the period-3 feature, but significant shortcomings remain. To address the problems of the fixed-step-size LMS adaptive filter algorithm in gene prediction, this thesis combines the system's feedback output with feature information on the base composition of the gene sequence and proposes an improved variable-step-size LMS adaptive filter algorithm that filters better and strengthens the period-3 feature. Its performance is verified through simulation experiments, and the results show that the proposed algorithm has a clear accuracy advantage over existing algorithms. In addition, because short gene sequences carry weak feature information, which hinders gene identification, this thesis also proposes an improved algorithm that fuses multiple features with weights set according to the representational ability of each single feature. The analysis focuses on its identification performance on a dataset of short gene sequences under 200 bp. Compared with traditional multi-feature fusion algorithms, the proposed algorithm is effective and robust.
Building on these two lines of work, this thesis implements a gene identification system dedicated to the human genome that combines the strengths of digital signal processing and multi-feature fusion. Because it does not depend on traditional machine learning methods such as conditional random fields, hidden Markov models, and support vector machines, the system is simple to implement, requires no training or storage of large numbers of model parameters, is not overly influenced by the knowledge structure of existing training datasets, and can perform identification in real time. System performance is evaluated comprehensively on the benchmark datasets ALLSEQ and HMR195.
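The period-3 feature mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's filter or fusion algorithm, whose details the abstract does not give. It assumes the common binary indicator mapping (one 0/1 sequence per base) and measures the DFT power at frequency 1/3 inside a sliding window. The function names, the window length of 351, and the step of 3 are illustrative assumptions.

```python
import cmath

BASES = "ACGT"

# The three possible values of exp(-j*2*pi*n/3), indexed by n % 3.
PHASES = [cmath.exp(-2j * cmath.pi * k / 3) for k in range(3)]


def period3_power(window):
    """Sum over the four bases of |DFT at frequency 1/3|^2.

    Each base is treated as a binary indicator sequence (1 where the
    base occurs, 0 elsewhere). This mapping is an assumption here.
    """
    total = 0.0
    for base in BASES:
        acc = 0j
        for n, ch in enumerate(window):
            if ch == base:
                acc += PHASES[n % 3]
        total += abs(acc) ** 2
    return total


def sliding_period3(seq, win=351, step=3):
    """Period-3 power for each window of length `win`, moved by `step`."""
    return [
        period3_power(seq[i:i + win])
        for i in range(0, len(seq) - win + 1, step)
    ]


if __name__ == "__main__":
    periodic = "ATG" * 50   # strongly period-3 toy sequence
    flat = "A" * 150        # no period-3 structure
    print(period3_power(periodic), period3_power(flat))
```

In this sketch a window with strong codon-position bias yields a large value and a window without it yields a value near zero. Candidate coding regions would then be picked by thresholding the sliding-window curve, and the threshold itself would need to be tuned on real data.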
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q811.4