Application of High-Order Markov Models in Phylogenetic Tree Reconstruction and Motif Discovery
[Abstract]: Traditional biological sequence analysis is based on sequence alignment. Alignment has several limitations: there is no uniform standard for choosing nucleic acid and amino acid substitution matrices; it fails on highly divergent sequences such as gene regulatory sequences; and, because it is time-consuming, alignment-based methods are impractical for the massive data produced by next-generation sequencing. In the post-genome era, biological sequence analysis therefore urgently needs faster and more efficient alignment-free methods. The Markov model is an important model for describing stochastic processes and has a long history in biological sequence analysis; for example, many classical methods for CpG island recognition and gene finding use Markov models. Earlier work, however, mostly used low-order Markov models, whereas this thesis discusses the application of high-order Markov models to biological sequence analysis. The main work is as follows:

1. The Markov Shannon entropy maximization (MME) order-identification method. Markov models are widely applied in biological sequence analysis, but the problem of identifying the order has received little attention; the order is usually determined with the chi-square statistic or with the AIC/BIC information criteria. For sequence comparison, a high-order Markov model should capture as much of the information in the sequence as possible. This thesis is the first to propose the MME order-identification method. Tests on a number of data sets show that the order it identifies is higher than the order given by the AIC/BIC information criteria, and that it has a significant advantage in biological sequence comparison.

2. One-dimensional chaos game representation. The chaos game representation (CGR) of DNA sequences proposed by Jeffrey, which is based on function iteration, is a one-to-one two-dimensional graphical representation that maps a DNA sequence to a set of points in the unit square. The frequency specificity of oligomers of different lengths in the sequence appears as density specificity in different regions of the scatter plot, and the combinatorial preference among oligomers at different levels is reflected in the fractal character of the scatter plot. CGR is therefore widely used to characterize DNA sequences. However, Jeffrey's CGR is tailored to DNA sequences and can handle at most sequences defined over a set of 12 characters. The one-dimensional CGR is a one-to-one numerical representation based on a similar function iteration; it can also be applied to protein sequences over the 20 amino acids and even to English text over 26 letters. Apart from the visual effect, the one-dimensional CGR inherits all the other features of Jeffrey's CGR. This thesis is the first to give the inversion formula of the one-dimensional CGR and the structural index for the k-string representation of biological sequences, and it discusses the relation between the one-dimensional CGR and high-order Markov models. The two key problems in applying high-order Markov models are identifying the order and estimating a large number of parameters; these properties of the one-dimensional CGR help with both.

3. Phylogenetic tree reconstruction. Traditionally, a phylogenetic tree is built from biological sequences by aligning a particular gene under the molecular clock hypothesis, deriving genetic distances from a nucleic acid or amino acid substitution matrix, and constructing a gene tree. The genes used are generally highly conserved, such as 16S rRNA and 18S rRNA, but gene trees based on different genes are often inconsistent. Because of the limitations of alignment-based methods, a number of alignment-free methods have emerged. The widely used composition vector (CV) method takes the frequencies of words of a fixed length as the feature vector describing a genome or proteome, with the background probabilities obtained from a high-order Markov model. Inspired by this, the thesis is the first to propose using a high-order Markov model directly to represent the whole proteome or whole genome, with the corresponding transition probability matrix serving as the feature vector that describes the sequence. The order is identified with the new MME method. Results on a number of whole-proteome and whole-genome data sets demonstrate that this alignment-free phylogenetic tree reconstruction method is very effective.

4. Motif discovery. The gene is the basic unit carrying genetic information in a DNA sequence, and the transcription and expression of genes are influenced and controlled through binding at the binding sites of gene regulatory elements (promoters, enhancers, silencers, etc.). These binding sites are DNA sequence patterns 5-20 bp long that are relatively fixed and recur; they are called motifs. Understanding gene expression is a major challenge in biology, and identifying gene regulatory elements is an important part of that challenge. Inspired by the method of Tompa et al., the thesis proposes a new enumeration method based on the high-order Markov model. First, a high-order Markov model describes the background sequence set, and the expected frequency of each k-string in the sequence set is computed under this background model. The relative deviation of the observed frequency from the expected frequency is then computed, and each k-string is judged to come either from the random background sequence or from a motif. Multiple HT-SELEX data sets demonstrate the effectiveness of this enumeration method.
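The motif-scoring step described above (fit a background Markov model, compute the expected frequency of each fixed-length string, then compare it with the observed frequency) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. It assumes a DNA alphabet of A/C/G/T, additive (pseudocount) smoothing, a stationary approximation for the expected count, and the score (observed - expected) / expected as the relative deviation. The function names `fit_markov` and `kmer_scores` and their parameters are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: plain DNA with no ambiguity codes


def fit_markov(seqs, order, pseudo=1.0):
    """Estimate an order-`order` Markov model from background sequences.

    Returns a function prob(context, next_char) giving the smoothed
    transition probability, plus the raw context counts.
    """
    ctx_counts = Counter()
    trans_counts = Counter()
    for s in seqs:
        for i in range(len(s) - order):
            ctx, nxt = s[i:i + order], s[i + order]
            ctx_counts[ctx] += 1
            trans_counts[ctx, nxt] += 1

    def prob(ctx, nxt):
        # Additive smoothing keeps unseen transitions at non-zero probability.
        return (trans_counts[ctx, nxt] + pseudo) / (
            ctx_counts[ctx] + pseudo * len(ALPHABET)
        )

    return prob, ctx_counts


def kmer_scores(sample, background, k, order, pseudo=1.0):
    """Relative deviation (observed - expected) / expected for every k-mer."""
    prob, ctx_counts = fit_markov(background, order, pseudo)
    total_ctx = sum(ctx_counts.values())
    n_ctx = len(ALPHABET) ** order

    # Observed counts of all length-k windows in the sample set.
    observed = Counter(
        s[i:i + k] for s in sample for i in range(len(s) - k + 1)
    )
    n_windows = sum(observed.values())

    scores = {}
    for w in map("".join, product(ALPHABET, repeat=k)):
        # P(w) = P(first `order` letters) * product of transition probabilities.
        p = (ctx_counts[w[:order]] + pseudo) / (total_ctx + pseudo * n_ctx)
        for i in range(order, k):
            p *= prob(w[i - order:i], w[i])
        expected = n_windows * p
        scores[w] = (observed[w] - expected) / expected
    return scores
```

Sorting the returned dictionary by score puts the most over-represented strings first. How large a deviation must be before a string is treated as a motif sample rather than background is a separate decision that this sketch leaves open.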
[Degree-granting institution]: Xiangtan University
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: Q811.4
Article ID: 2500351
Article link: https://www.wllwen.com/shoufeilunwen/jckxbs/2500351.html