Prediction of Escherichia coli sRNAs Based on Transcription Termination Signals or Conservation
Published: 2018-08-10 07:27
[Abstract]: Bacterial sRNAs (small regulatory RNAs) are a class of regulatory RNA molecules, 40 to 500 nucleotides long, that are widespread in bacteria. Most are encoded in intergenic regions, although some lie in the 5' or 3' non-coding regions of protein-coding genes. Unlike familiar non-coding RNAs such as tRNA and rRNA, bacterial sRNAs vary widely in length and have no conserved secondary-structure features.
Current studies show that bacterial sRNAs act mainly by binding target mRNAs or target proteins, and that they take part in regulating many cellular processes in response to environmental change, including plasmid replication, phage development, stress response, quorum sensing, bacterial virulence, and iron homeostasis. Moreover, of the more than one thousand bacterial genomes sequenced so far, only a few, such as E. coli, have been studied thoroughly for sRNAs, and many bacterial sRNAs remain to be discovered. Identifying bacterial sRNAs is therefore an important research goal.
However, experimental discovery of sRNAs at the genome level has many drawbacks: the procedures are complex, the turnaround is long, accuracy is low, and some sRNAs are expressed only under specific conditions. Bacterial sRNAs are therefore usually identified by combining bioinformatics prediction with experimental validation, so research on bioinformatics prediction of sRNAs is important and can speed up sRNA discovery. In addition, the completed sequencing of many bacterial genomes of various types and the construction of various RNA databases provide the data needed to develop bioinformatics methods for predicting sRNA genes.
Unlike protein-coding genes, which have easily recognizable features, sRNA genes usually have no clear coding signature and are not affected by frameshift or nonsense mutations, so dedicated bioinformatics prediction methods are needed. The methods developed so far fall into three main categories: comparative genomics, transcription-signal searching, and machine learning.
Comparative genomics approaches rest on the premise that sRNA genes show some sequence and structural conservation across the genomes of closely related species. This approach is currently in common use, but it has several limits: it cannot predict sRNA genes specific to a single bacterium; it requires genome data from closely related species; conserved intergenic regions may be other kinds of gene structures rather than sRNAs; and it cannot identify sRNA genes on the antisense strand of coding regions.
Transcription-signal approaches search intergenic regions for potential promoters or transcription factor binding sites together with Rho-independent terminator structures. Because current predictions of promoters and transcription factor binding sites have a high false-positive rate, the resulting sRNA predictions do too. In addition, these approaches cannot predict sRNAs that have Rho-dependent terminators.
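To make the terminator part of this search concrete, here is a toy sketch. A Rho-independent terminator is generally described as a GC-rich hairpin followed by a run of U residues (T in the DNA sequence). The function name `find_terminator_like_sites` and all of its parameters (stem length, loop range, minimum T run, minimum GC fraction) are hypothetical choices made for illustration; this is not the procedure used in the thesis or in any published terminator finder, and it only accepts perfectly complementary stems.

```python
import re
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminator_like_sites(
    seq: str,
    stem: int = 6,                      # assumed stem length (nt)
    loop_range: Tuple[int, int] = (3, 8),  # assumed loop sizes to try
    min_t_run: int = 4,                 # assumed minimum T-run length
    min_gc: float = 0.5,                # assumed minimum GC fraction of the stem
) -> List[Tuple[int, int]]:
    """Return (hairpin_start, t_run_end) for each T run that is directly
    preceded by a perfect, GC-rich hairpin. Coordinates are 0-based,
    end-exclusive."""
    seq = seq.upper()
    hits = []
    for match in re.finditer("T{%d,}" % min_t_run, seq):
        t_start = match.start()
        for loop in range(loop_range[0], loop_range[1] + 1):
            hp_start = t_start - (2 * stem + loop)
            if hp_start < 0:
                continue  # not enough upstream sequence for this loop size
            left = seq[hp_start:hp_start + stem]
            right = seq[t_start - stem:t_start]
            gc_fraction = sum(base in "GC" for base in left) / stem
            if right == reverse_complement(left) and gc_fraction >= min_gc:
                hits.append((hp_start, match.end()))
                break  # report the smallest loop that works
    return hits


if __name__ == "__main__":
    # Made-up sequence: flank + stem + loop + complementary stem + T run + flank
    toy = "AAAC" + "GCCGCC" + "GAAA" + "GGCGGC" + "TTTTTT" + "ACG"
    print(find_terminator_like_sites(toy))
```

Because it demands a perfect stem and an uninterrupted T run, a heuristic this strict would miss many real terminators. Loosening it raises the false-positive rate, which is the same trade-off the paragraph describes for promoter-based searches.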
Machine learning approaches assume that the sequence of a bacterial sRNA gene can be distinguished from the rest of the genome. When features are extracted, however, sequences usually have to be split into windows (some studies use a 50 nt window), and because sRNA lengths vary so widely it is hard to find an optimal window size.
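The windowing step can be sketched as follows. This is a minimal illustration, not the feature pipeline of any particular study: the 50 nt window comes from the text, while the step size, the function names, and the choice of GC content as the per-window feature are assumptions made for the example.

```python
from typing import Iterator, Tuple


def sliding_windows(seq: str, window: int = 50, step: int = 10) -> Iterator[Tuple[int, str]]:
    """Yield (start, fragment) for every full-length window over seq.

    The step size is an assumed parameter; the source only mentions
    a 50 nt window.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def gc_content(fragment: str) -> float:
    """Fraction of G and C bases in a fragment (0.0 for an empty string)."""
    if not fragment:
        return 0.0
    return sum(base in "GC" for base in fragment.upper()) / len(fragment)


if __name__ == "__main__":
    toy_region = "ATGC" * 30  # made-up 120 nt sequence
    for start, fragment in sliding_windows(toy_region, window=50, step=35):
        print(start, len(fragment), gc_content(fragment))
```

A sequence shorter than the window yields no windows at all, so a 40 nt sRNA would never fill a 50 nt window, while a 500 nt sRNA would be spread over many of them. That is the mismatch the paragraph points to.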
To overcome some of these shortcomings and to better support experimental sRNA discovery, we re-examined the prediction problem and propose two methods, one that predicts sRNAs from transcription termination signals and one that predicts them through conservation analysis. We apply both to the Escherichia coli genome.
The method based on transcription end-point features assumes that, over long-term evolution, the start and end points of bacterial sRNAs have acquired characteristic sequence and structure patterns, so these patterns are not randomly distributed in the genome. Analysis of known E. coli sRNAs showed that the 5' end signal is weak while the 3' end sequence signal is strong. We therefore built a prediction model based on sRNA transcription end-point features, which can predict the end position of an sRNA fairly accurately. The model describes the end-point features with a base frequency matrix and uses statistical methods to separate the positive and negative data sets. It was built from 63 positive training samples and 100,000 randomly generated negative training samples. At a threshold of 28.9524, sensitivity and specificity on the training set were 34.92% and 100.00%, and the PPV reached its maximum of 100.00%. On the test set (22 positive and 10,000 negative samples), sensitivity and specificity were 4.30% and 99.99%, with a positive predictive value (PPV) of 90.90%. The model's high specificity and PPV make it well suited to supporting experimental validation.
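A base frequency matrix with a score threshold can be sketched as below. The abstract does not say how the matrix is scored, so the log-odds score against a uniform background, the pseudocount, the toy sequences, and all function names here are assumptions for illustration, not the thesis's statistic. The training-set count of 22 true positives is inferred from the reported 34.92% sensitivity on 63 positives and is likewise labeled as an inference.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def build_frequency_matrix(seqs: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-position base frequencies from equal-length, uppercase ACGT sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for s in seqs:
            counts[s[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in BASES})
    return matrix


def log_odds_score(seq: str, matrix: List[Dict[str, float]], background: float = 0.25) -> float:
    """Sum over positions of log2(frequency / background). Assumed scoring rule."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match the matrix")
    return sum(math.log2(matrix[i][base] / background) for i, base in enumerate(seq))


def predict(seq: str, matrix: List[Dict[str, float]], threshold: float) -> bool:
    """Call a sequence positive when its score reaches the threshold."""
    return log_odds_score(seq, matrix) >= threshold


def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Sensitivity, specificity, and PPV from confusion-matrix counts."""
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp), "ppv": ppv}


if __name__ == "__main__":
    # Made-up aligned 3' end fragments, not data from the thesis
    toy_positives = ["GCTTTTTT", "GCTTTTTA", "CCTTTTTT"]
    pfm = build_frequency_matrix(toy_positives)
    print(predict("GCTTTTTT", pfm, threshold=5.0))
    print(predict("ACGACGAC", pfm, threshold=5.0))
    # Training-set counts: 22 TP inferred from 34.92% of 63; 0 FP from 100% specificity
    print(confusion_metrics(tp=22, fn=41, fp=0, tn=100000))
```

Raising the threshold trades sensitivity for specificity. The reported training figures (34.92% sensitivity at 100.00% specificity and PPV) correspond to a threshold strict enough that none of the random negatives passes.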
The conservation-based method builds on two observations: known sRNAs are evolutionarily conserved across closely related species, and the Rho-independent terminator of an sRNA is an important functional element that is itself conserved to some degree. We therefore consider a region a candidate sRNA only if it has a Rho-independent terminator conserved across several species together with a conserved sequence fragment of a certain length. Under this assumption, we searched the intergenic regions of E. coli for Rho-independent terminators and tested them for conservation. For each conserved terminator, we then analyzed the conservation of the terminator and its upstream fragment across 39 Enterobacteriaceae genomes, and called the region an sRNA if the conserved upstream fragment was at least 20 nt long. With the number of conserved genomes set to 7, the method predicted 335 possible sRNAs in 6,340 intergenic regions and recovered 21 of the 65 known sRNAs, giving a sensitivity of 32.3% and a specificity of 94.4%. The specificity is comparable to that of sRNAPredict2 and the sensitivity is 12 percentage points higher, which indicates that sRNA prediction based on sequence conservation and Rho-independent terminators has been further improved.
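The final filtering rule can be sketched roughly as follows. The thresholds of 7 genomes and 20 nt come from the text, but how the two criteria are combined is not spelled out in the abstract. The `Candidate` structure, the rule "count genomes whose conserved upstream fragment is at least 20 nt", and the function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict

MIN_GENOMES = 7        # "number of conserved genomes" from the text
MIN_UPSTREAM_NT = 20   # minimum conserved upstream fragment length from the text


@dataclass
class Candidate:
    name: str
    # Hypothetical layout: genome name -> length (nt) of the conserved fragment
    # upstream of the conserved terminator. Genomes in which the terminator
    # is not conserved are simply absent.
    conserved_upstream: Dict[str, int]


def is_predicted_srna(
    candidate: Candidate,
    min_genomes: int = MIN_GENOMES,
    min_upstream: int = MIN_UPSTREAM_NT,
) -> bool:
    """Assumed rule: enough genomes carry a long enough conserved upstream fragment."""
    supporting = sum(
        1 for length in candidate.conserved_upstream.values() if length >= min_upstream
    )
    return supporting >= min_genomes


def sensitivity(recovered_known: int, total_known: int) -> float:
    """Fraction of known sRNAs recovered by the predictions."""
    return recovered_known / total_known


if __name__ == "__main__":
    # Made-up candidates, not results from the thesis
    strong = Candidate("cand_A", {f"genome_{i}": 45 for i in range(9)})
    weak = Candidate("cand_B", {f"genome_{i}": 45 for i in range(4)})
    print(is_predicted_srna(strong), is_predicted_srna(weak))
    # Reported result: 21 of 65 known sRNAs recovered
    print(round(100 * sensitivity(21, 65), 1))
```

Requiring support from more genomes would be expected to cut false positives at the cost of missing sRNAs that are conserved in fewer species.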
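[Degree-granting institution]: Academy of Military Medical Sciences of the Chinese People's Liberation Army
[Degree level]: Master's
[Year degree awarded]: 2011
[Classification number]: R378
Article ID: 2175381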
Article link: https://www.wllwen.com/xiyixuelunwen/2175381.html