Research on Algorithms for Predicting Regulatory Motifs and Regulons in Prokaryotes

Published: 2018-05-28 02:23

Topics: regulatory motifs + regulon prediction; Reference: Shandong University, doctoral dissertation, 2014

[Abstract]: Bioinformatics is a rapidly developing interdisciplinary field that draws on biology, mathematics, and computer science to analyze biological data and study the phenomena of life. Sequence analysis is an important part of bioinformatics, and the prediction of DNA sequence motifs has long been one of its central research problems. The prediction of transcription factor binding sites in particular is both biologically important and algorithmically difficult. This dissertation studies algorithms for predicting gene expression regulatory motifs and regulons in prokaryotes.
A gene must be expressed as its corresponding protein to carry out its biological function, and its expression must be regulated in response to different internal and external conditions. In prokaryotes, expression is regulated mainly through interactions between RNA polymerase and regulatory proteins. A regulatory protein recognizes specific sequence segments in the genomic DNA and binds to them to exert its regulatory effect; these specific sequences are called regulatory protein binding sites. The genome therefore contains not only the gene sequences that encode proteins and RNAs but also the regulatory sequences that control gene expression. The binding sites of a given regulatory protein generally have the same length and are highly conserved in sequence; this conserved sequence pattern is called a cis-regulatory motif. In prokaryotes, several consecutive genes in the genome often form an operon and are transcribed together; a single gene can also be regarded as a special case of an operon. The set of operons regulated by the same regulatory protein is called a regulon.
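To make the notion of a motif as a conserved pattern over equal-length binding sites concrete, here is a minimal Python sketch that builds a per-position count matrix and reads off a consensus sequence. The site strings and the function names `count_matrix` and `consensus` are illustrative assumptions, not data or code from the dissertation.

```python
from collections import Counter

def count_matrix(sites):
    """Return one Counter of base counts per alignment column."""
    if not sites:
        raise ValueError("at least one binding site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    return [Counter(site[i] for site in sites) for i in range(length)]

def consensus(sites):
    """Most frequent base in each column; ties go to the alphabetically first base."""
    bases = []
    for column in count_matrix(sites):
        # sorted() fixes the iteration order so that max() breaks ties deterministically.
        bases.append(max(sorted(column), key=lambda base: column[base]))
    return "".join(bases)

# Hypothetical aligned binding sites, for illustration only.
example_sites = ["TTGACA", "TTGACT", "TTTACA", "TTGATA"]
print(consensus(example_sites))
```

A consensus string discards the per-position frequencies, which is why motif tools usually keep the full count matrix as the motif model.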
In this dissertation, we first briefly introduce the model representations of regulatory motifs and the algorithms for predicting them. Building on existing motif prediction algorithms and on the distribution of regulatory binding sites across whole prokaryotic genomes, we design a method for assessing the biological significance of predicted motifs, which allows the predicted motifs to be screened accurately. We use motif information content and conservation features for motif similarity analysis and clustering, and we use statistical tools such as the hypergeometric distribution to analyze the co-occurrence of motifs across the whole genome. Together, these methods make up the motif prediction and analysis toolkit BoBro2.0; the software can be downloaded free of charge from http://code.google.com/p/bobro/.
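Two of the quantities mentioned here, motif information content and a hypergeometric co-occurrence test, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the BoBro2.0 implementation: it assumes a uniform background over the four DNA bases, applies no small-sample correction, and uses hypothetical function names and counts.

```python
import math
from collections import Counter

def column_information(column):
    """Information content (bits) of one alignment column.

    Assumes a DNA alphabet with a uniform background, so the maximum is 2 bits.
    """
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 - entropy

def cooccurrence_pvalue(total, with_a, with_b, with_both):
    """Hypergeometric upper tail P(X >= with_both).

    total     : number of promoter regions considered
    with_a    : regions containing motif A
    with_b    : regions containing motif B
    with_both : regions containing both motifs
    """
    upper = min(with_a, with_b)
    favourable = sum(
        math.comb(with_a, i) * math.comb(total - with_a, with_b - i)
        for i in range(with_both, upper + 1)
    )
    return favourable / math.comb(total, with_b)

# Hypothetical numbers, for illustration only.
print(column_information("AAAA"))
print(cooccurrence_pvalue(total=10, with_a=4, with_b=3, with_both=2))
```

A small p-value indicates that the two motifs share promoter regions more often than expected by chance, which is the kind of evidence a co-occurrence analysis looks for.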
Combining motif prediction with phylogenetic footprinting, we design a new method for genome-wide regulon prediction. Phylogenetic footprinting lets us discover regulatory motifs in the regulatory regions of homologous genes, but the results tend to have a very high false positive rate. To overcome this problem, we design a bipartite-graph-based method for comparing motif similarity. It performs a preliminary screening of all motifs and produces a score reflecting the co-regulation relationship between operons: if two operons have a high score, they are more likely to belong to the same regulon or regulons. We keep only the motifs that yield high scores and use them to build a motif similarity graph, in which each motif is a vertex and each sufficiently significant similarity score is an edge, so that the graph as a whole reflects the similarity relationships among the predicted motifs. By analyzing the vertex sets that correspond to known regulons, we find that the subgraphs induced by these vertex sets have higher edge density and clustering coefficients than the original graph, and thus capture a characteristic of prokaryotic regulons. Using this finding, we design a clustering algorithm and obtain from the graph the operon sets that correspond to real regulons. Compared with two other scores that reflect co-regulation, our score reflects co-regulation relationships more accurately. Moreover, because we predict regulons using motifs as vertices, our method handles the problem that overlaps between regulons make the clustering inaccurate, and so predicts regulons more accurately. Our prediction pipeline is based entirely on genome sequence data and does not require extensive biological annotation as auxiliary input, which makes it especially valuable for newly sequenced genomes.
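The comparison between an induced subgraph and the full motif similarity graph can be illustrated with a small Python sketch. It computes edge density and the average local clustering coefficient for a toy undirected graph and for the subgraph induced by a chosen vertex set. The graph, the motif names, and the helper functions are hypothetical; the dissertation's own scoring and clustering algorithm is not reproduced here.

```python
from itertools import combinations

def build_graph(edges):
    """Build an undirected graph as a dict mapping each vertex to its neighbour set."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def induced_subgraph(adj, nodes):
    """Keep only the given vertices and the edges between them."""
    nodes = set(nodes)
    return {u: adj.get(u, set()) & nodes for u in nodes}

def edge_density(adj):
    """Fraction of possible vertex pairs that are joined by an edge."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(neighbours) for neighbours in adj.values()) / 2
    return m / (n * (n - 1) / 2)

def average_clustering(adj):
    """Mean local clustering coefficient; vertices of degree < 2 count as 0."""
    if not adj:
        return 0.0
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)

# Hypothetical motif similarity graph: m1, m2, m3 stand in for motifs of one regulon.
edges = [("m1", "m2"), ("m1", "m3"), ("m2", "m3"),
         ("m3", "m4"), ("m4", "m5"), ("m5", "m6")]
graph = build_graph(edges)
regulon = induced_subgraph(graph, ["m1", "m2", "m3"])
print(edge_density(graph), average_clustering(graph))
print(edge_density(regulon), average_clustering(regulon))
```

In this toy example the induced subgraph is a triangle, so both of its statistics exceed those of the full graph, mirroring the pattern reported for known regulons.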
To make it easier for biologists to use our algorithms and tools, we developed DOOR2.0, an online database centered on operon data. It contains the operon structures of 2072 completely sequenced prokaryotic genomes, together with gene function annotations and experimentally verified regulatory protein binding sites. Compared with the previous version published in 2009, DOOR2.0 has a series of new features: (i) it includes 250000 transcription unit structures, either experimentally verified or computationally predicted from RNA-seq data, providing a dynamic functional view of operons; (ii) it integrates operon-centered data resources, providing not only the operon structure of each genome but also functional and regulatory information such as cis-regulatory factor binding sites, promoters, and terminators; (iii) it offers an efficient web service for operon prediction on user-supplied genomes; (iv) it visualizes user-selected data in an intuitive genome browser; and (v) it provides a Google-like keyword search engine for quickly finding information in the database. The database is updated as new sequencing data are released and can be accessed at http://csbl.bmb.uga.edu/DOOR/; all data and functions are freely available to users. Finally, using a variety of comparative genomics methods together with our motif analysis tools, we carried out a systematic analysis of 40 species of Clostridium, with particular attention to genes and functions related to biomass degradation. These studies not only produced findings of value for biological research but also demonstrated the practical value of the methods we developed.
[Degree-granting institution]: Shandong University
[Degree level]: Doctoral
[Year degree granted]: 2014
[Classification number]: Q811.4
