Research and Feature Analysis of the Bacterial Essential Gene Cluster Model
Published: 2019-06-17 11:40
[Abstract]: Essential genes are genes that an organism cannot lose if it is to survive and grow under optimal growth conditions. Studying them matters for two reasons: (1) essential genes form the basis for constructing a minimal gene set, so studying them helps us understand the origin and evolution of life and engineer microorganisms for industrial use; (2) the proteins encoded by essential genes usually take part in the most important and fundamental metabolic processes, so they can serve as targets for antibacterial drugs. In recent years, essential genes have become one of the hot topics in bioinformatics. This thesis studies bacterial essential gene sets determined by wet-lab experiments. The original essential gene data come from the essential gene database DEG (http://tubic.tju.edu.cn/deg/). Inspired by the clusters in COGs (https://www.ncbi.nlm.nih.gov/COG/), we propose a bacterial essential gene cluster model, in which essential genes with the same or similar functions are stored together as a cluster. This is the main difference from most current gene databases, and the size of a cluster reflects how strongly conserved that class of genes is. Bacterial essential gene data have continued to grow: the latest version of DEG (as of March 2017) contains 46 bacterial essential gene data sets and 16 eukaryotic essential gene data sets, which lays a foundation for related research. Based on the cluster model and the latest data, we built an updated version of the bacterial essential gene cluster database (CEG, Cluster of Essential Genes, http://cefg.cn/ceg/), called CEG 2.0.
In this database, essential genes are stored as clusters, and the information associated with them has been extended to include the structures of the encoded proteins, virulence factors, the metabolic pathways in which the genes participate, and gene-related drugs. In addition, we align bacterial essential genes against human gene sequences and provide users with the resulting homology information, which is a valuable reference when searching for new drug targets. Cluster size also carries biological meaning: the larger the cluster, the more conserved the genes it contains. By looking at cluster size, users can see directly whether genes with a given function are widespread across many species or present only in a few. Building on the CEG database, we propose a new algorithm, K-value, that predicts the essentiality of bacterial genes from the stored cluster sizes. Because the prediction rests on cluster size, the user only needs to supply a gene name for a gene to be predicted. We implemented the algorithm as a program called CEG_Match. In CEG 2.0, CEG_Match gains new functionality: users can predict not only from gene function but also from gene sequence information, which removes the CEG 1.0 limitation that prediction could only be based on the gene name. Compared with traditional methods of identifying essential genes, the tool maintains reasonably high accuracy while achieving a higher recognition rate for non-essential genes and running faster. Finally, we present statistics on the database, covering species, clusters, and gene functions, and outline possible future work.
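The cluster-size idea behind K-value can be illustrated with a minimal sketch. The abstract does not give the actual scoring rule, so everything concrete here is an assumption made for illustration: cluster size is taken to be the number of distinct genomes in which a gene of that name was reported essential, a gene is called essential when that size reaches a threshold k, and the function names, the threshold, and the toy records are all hypothetical rather than taken from CEG or CEG_Match.

```python
from collections import defaultdict


def build_clusters(records):
    """Group (genome, gene_name) pairs into clusters keyed by gene name.

    Assumption: genes sharing a name share a function, so the name is
    used as the cluster key. Each cluster stores the set of genomes.
    """
    clusters = defaultdict(set)
    for genome, gene_name in records:
        clusters[gene_name.lower()].add(genome)
    return dict(clusters)


def cluster_size(gene_name, clusters):
    """Return the number of genomes in the gene's cluster (0 if absent)."""
    return len(clusters.get(gene_name.lower(), set()))


def predict_essential(gene_name, clusters, k):
    """Hypothetical threshold rule: essential if cluster size >= k."""
    return cluster_size(gene_name, clusters) >= k


# Toy records for illustration only; they are not DEG or CEG data.
TOY_RECORDS = [
    ("genome_A", "dnaA"), ("genome_B", "dnaA"), ("genome_C", "dnaA"),
    ("genome_A", "ftsZ"), ("genome_B", "ftsZ"),
    ("genome_C", "hypX"),
]

if __name__ == "__main__":
    clusters = build_clusters(TOY_RECORDS)
    for name in ("dnaA", "ftsZ", "hypX", "unknownGene"):
        size = cluster_size(name, clusters)
        print(name, size, predict_essential(name, clusters, k=2))
```

Under this assumed rule, raising k makes the predictor more conservative: only genes reported essential in many genomes are called essential, which matches the abstract's statement that larger clusters contain more conserved genes.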
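[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q933
Article ID: 2500958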
Article link: https://www.wllwen.com/kejilunwen/jiyingongcheng/2500958.html