Applications of Higher-Order Markov Models in Phylogenetic Tree Reconstruction and Motif Discovery

Published: 2019-06-15 16:23

[Abstract]: Traditional methods of biological sequence analysis are built on sequence alignment, which has inherent limitations: there is no uniform standard for choosing nucleotide and amino acid substitution matrices; alignment fails for highly divergent sequences such as gene regulatory sequences; and, because of its heavy time cost, alignment-based analysis is impractical for the massive data produced by next-generation sequencing. In the post-genomic era, biological sequence analysis therefore urgently needs faster and more efficient alignment-free methods. The Markov model is an important model of stochastic processes and has a long history in biological sequence analysis; for example, many classical methods for CpG island identification and gene finding use Markov models. Past work, however, has mostly relied on low-order Markov models. This dissertation discusses applications of higher-order Markov models in biological sequence analysis. The main contributions are as follows.

1. Order selection by Markov Shannon entropy maximization (MME). Markov models are widely used in biological sequence analysis, but the problem of identifying their order has received little attention; the order is usually inferred with the χ² statistic or selected with the AIC/BIC information criteria. When a higher-order Markov model is used for sequence comparison, one wants the model to capture as much of the information in the sequence as possible. We propose, for the first time, the MME order-selection method. Tests on several data sets show that the order identified by MME is higher than the order identified by AIC/BIC, and that it gives a clear advantage in biological sequence comparison.

2. One-dimensional chaos game representation. The chaos game representation of DNA sequences proposed by Jeffrey, which is based on function iteration, is a one-to-one two-dimensional graphical representation. It maps a DNA sequence to a set of points in the unit square, so that the characteristic frequencies of oligomers of different lengths appear as characteristic densities in different regions of the scatter plot, and the compositional preferences of oligomers at different levels appear as fractal features of the plot. The chaos game representation has therefore been widely used to characterize DNA sequences. However, Jeffrey's chaos game is tailored to DNA sequences and can only handle sequences defined over an alphabet of at most four characters. The one-dimensional chaos game representation is a one-to-one numerical representation based on a similar function iteration. It maps a symbolic sequence over any finite alphabet to a sequence of numbers in the unit interval of the real line, so it can handle not only DNA and RNA sequences but also protein sequences over the 20 amino acids and even English text over 26 letters. Apart from the visualization, it inherits all the other features of Jeffrey's chaos game. We give, for the first time, an inversion formula for the one-dimensional chaos game representation and a structure index for representing the k-strings of a biological sequence, and we discuss the relationship between the one-dimensional chaos game representation and higher-order Markov models. The two key problems in applying higher-order Markov models are identifying the order and estimating a large number of parameters; these properties of the one-dimensional chaos game representation help with both.

3. Phylogenetic tree reconstruction. The traditional way to build a phylogenetic tree from biological sequences is to align a chosen gene under the molecular clock assumption and derive evolutionary distances between genes from nucleotide or amino acid substitution matrices, thereby constructing a gene tree. Such genes are generally highly conserved, for example 16S rRNA and 18S rRNA, but in many cases gene trees built from different genes are not consistent with one another. Because of the limitations of alignment-based methods, many alignment-free methods have appeared. The widely used composition vector (CV) method takes word frequencies at a fixed word length as the feature vector of a genome or proteome, with the background probabilities obtained from a higher-order Markov model. Inspired by this, we propose, for the first time, representing a whole proteome or whole genome directly by a higher-order Markov model and taking the corresponding transition probability matrix as the feature vector of the sequence. The order is identified with our MME order-selection method. Results on several whole-proteome and whole-genome data sets confirm that this alignment-free phylogenetic reconstruction method is effective.

4. Motif discovery. Genes are the basic units of DNA that carry genetic information. Their transcription and expression are influenced and controlled by transcription factors that bind to binding sites within gene regulatory elements (promoters, enhancers, silencers, etc.). These binding sites are relatively fixed, recurring DNA sequence patterns 5-20 bp long, called motifs. Understanding gene expression is a major challenge in biology, and identifying gene regulatory elements, motifs in particular, is an important part of that challenge. Inspired by the method of Tompa et al., we propose a new k-string method based on higher-order Markov models. First, a higher-order Markov model is used to describe the background sequence set, and the expected count of each k-string in the sequence set is determined under this background model. The relative deviation of the observed count from the expected count is then used to judge whether a k-string comes from the random background or is an instance of a motif. We confirm the effectiveness of this k-string method on several HT-SELEX data sets.
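As an illustration of the k-string scoring described for motif discovery, the following is a minimal Python sketch, not the dissertation's implementation. It assumes DNA over the alphabet ACGT, add-one (pseudocount) smoothing, an initial-context probability taken from smoothed context frequencies, and a relative deviation defined as (observed - expected) / expected. None of these choices is specified in the abstract, and the function names `fit_markov`, `kmer_probability`, and `score_kmers` are hypothetical.

```python
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # assumption: DNA sequences containing only these symbols


def fit_markov(seqs, order):
    """Count contexts and transitions for an order-`order` Markov model."""
    context_counts = Counter()                # occurrences of each length-`order` context
    transition_counts = defaultdict(Counter)  # symbols observed right after each context
    for seq in seqs:
        for i in range(len(seq) - order + 1):
            context = seq[i:i + order]
            context_counts[context] += 1
            if i + order < len(seq):
                transition_counts[context][seq[i + order]] += 1
    return context_counts, transition_counts


def kmer_probability(kmer, order, context_counts, transition_counts, pseudocount=1.0):
    """Probability of `kmer` (length >= order) under the smoothed background model."""
    total = sum(context_counts.values())
    n_contexts = len(ALPHABET) ** order
    # Smoothed probability of the first `order` symbols (assumed initial distribution).
    prob = (context_counts.get(kmer[:order], 0) + pseudocount) / (total + pseudocount * n_contexts)
    # Multiply by one smoothed transition probability per remaining symbol.
    for i in range(order, len(kmer)):
        followers = transition_counts.get(kmer[i - order:i], {})
        row_total = sum(followers.values())
        prob *= (followers.get(kmer[i], 0) + pseudocount) / (row_total + pseudocount * len(ALPHABET))
    return prob


def score_kmers(seqs, k, order, background=None, pseudocount=1.0):
    """Relative deviation (observed - expected) / expected for every observed k-string."""
    # If no separate background set is given, the model is fitted on `seqs` themselves.
    context_counts, transition_counts = fit_markov(background if background is not None else seqs, order)
    observed = Counter(seq[i:i + k] for seq in seqs for i in range(len(seq) - k + 1))
    n_windows = sum(observed.values())
    scores = {}
    for kmer, obs in observed.items():
        expected = n_windows * kmer_probability(kmer, order, context_counts, transition_counts, pseudocount)
        scores[kmer] = (obs - expected) / expected
    return scores
```

Calling `score_kmers(seqs, k=7, order=2)` returns a dictionary that maps each observed 7-string to its relative deviation; strings with large positive values are candidate motif instances. How large a deviation should count as significant is left open in this sketch.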
[Degree-granting institution]: Xiangtan University
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: Q811.4

[Similar literature]