
Genome Re-annotation of Yersinia pestis and Construction of Its Trans-omics Database System

Published: 2018-04-04 02:17

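  Topic: Yersinia pestis  Focus: re-annotation  Source: Academy of Military Medical Sciences of the Chinese People's Liberation Army, doctoral dissertation, 2016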


[Abstract]: Yersinia pestis is a high-risk bacterium that can cause fatal systemic infection. Three plague pandemics have occurred in history, with a combined death toll of more than 100 million. According to WHO data, 18 plague-related public safety incidents occurred worldwide between 2001 and 2015 alone. China has so far identified 12 typical natural plague foci, distributed across 15 provinces and covering 15% of the country's land area. Since the Sanger laboratory published the first complete genome of a Y. pestis strain, CO92, in 2001, complete genome sequences of 12 Y. pestis strains have been released, all of them annotated. With the rapid development of high-throughput experimental techniques, research on Y. pestis has produced a large volume of data, and the theoretical understanding of its pathogenesis and transmission has improved. Re-examining genome annotations that were based on earlier knowledge shows that the existing information has many limitations and even errors, and that these errors are continually copied, amplified and spread by annotation work that relies on homologous sequence alignment. Researchers have re-annotated individual Y. pestis genomes using comparative genomics, transcriptomics and proteogenomics, but those efforts focused on correcting gene functions and finding new genes, and their data content is not comprehensive. It is therefore necessary to summarize, integrate and improve the existing Y. pestis knowledge base: starting from earlier annotation results, adding new experimental data, re-analyzing the sequences with improved algorithms and correcting likely annotation errors, so as to refine the Y. pestis genome annotation and ultimately deepen, in a systematic way, the understanding of its functions, biological behavior and pathogenic mechanisms.

Data sharing is an important means of advancing research knowledge. Apart from the large public databases, however, only a very small number of prokaryotic model organisms (for example, Escherichia coli) have dedicated omics databases. To give researchers Y. pestis annotation information that is more complete, more accurate and easier to look up, it is necessary to build a trans-omics database system for this species on the basis of the collected and integrated experimental data of multiple types and the re-annotation results.

The main data sources of this study are: (1) Genome sequences. The complete genome sequences of 12 Y. pestis strains provided by NCBI form the basis of the annotation. (2) Proteomic mass spectrometry data for strain 91001. Mass spectrometry results are formatted, standardized data that can be used conveniently after filtering, and because they come directly from experiments their quality is high. (3) RNA-seq data from RNA sequencing of strain 91001. (4) Expression profile data from microarray experiments on strain 91001. These data show gene expression levels under a variety of environments; although they are difficult to use in annotation for now, they provide a reference for researchers. Relevant data published in the literature were added as a supplement.

The re-annotation work consists of two parts. The first part is data preprocessing. Multi-omics data were combined with bioinformatics software and databases, and a de novo annotation approach was used to carry out the re-annotation. Starting from gene prediction, CDS regions were re-identified and the start sites of some genes were corrected; gene functions were determined by combining several protein annotation databases; in non-genic regions, ncRNAs were annotated using prediction tools, databases and the literature; finally, repeat sequences, mobile elements and similar features were annotated across the whole genome with dedicated annotation tools. The second part is data curation and analysis. The data were classified and filtered, data standards were defined and a standardization procedure was applied, after which gene homology analysis, allelic analysis and related analyses were performed. The whole process required local installation and use of more than 30 software packages and databases.

The trans-omics database is built on a set of omics database tables, and the different types of data are closely interrelated. It was constructed following an information-systems approach that takes the biological characteristics of Y. pestis into account. Once the research goals were set, the work started from researchers' needs: requirements analysis was carried out first, the feasibility of the system was assessed, functional and business requirements were identified, preliminary data standards were drawn up and a data model was built. The structure and functions of the omics database were then designed from the data model. Finally, the web service system was developed on a MySQL relational database using the Python Django framework.

Using genomic, proteomic, transcriptomic and other multi-omics data together with the methods above, this study first carried out a comprehensive re-annotation of Y. pestis strain 91001: 137 unreliable coding regions were removed; the start sites of 41 genes and the functions of 7 pseudogenes and 392 hypothetical genes were corrected; and annotations were added for special genomic elements such as ncRNAs, repeat sequences and mobile elements, as well as for genomic segment diversity. By sorting out and integrating the analysis algorithms and software, a semi-automated re-annotation workflow applicable to other Y. pestis strains was established, and the workflow was then applied to the complete genome sequences of the other 11 strains. Finally, a relational database and a web framework were used to build an Internet-based trans-omics database system for Y. pestis, the TODY analysis platform (http://tody.bmi.ac.cn/), which lets researchers query and use the re-annotation data. Parallel computing and a distributed scheduling system were used in allelic diversity processing and in implementing the web service system, which greatly reduced computation time and provides know-how and technical support for future large-scale data analysis and processing.

This work combines biological experiments, biological knowledge, bioinformatics tools and computer technology, and is of significance for clarifying the structure and function of the Y. pestis genome and for revealing more of its biological characteristics. In the next step, more literature and experimental data will be added to keep enriching the Y. pestis omics database; the accuracy of the re-annotation results will be verified experimentally; suitable data mining models will be sought for deeper data analysis and for building a Y. pestis knowledge base; the web service system will be improved continually; and the whole system will be migrated to a cloud computing platform to serve large-scale data processing.
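The idea of a database built from several interlinked omics tables can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the table and column names, the demo locus tags and the "no supporting evidence" query are hypothetical and are not the TODY schema or the thesis's criterion for removing coding regions, and Python's built-in sqlite3 module stands in for MySQL and Django so that the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with three linked, hypothetical omics tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (
            locus_tag TEXT PRIMARY KEY,
            strain    TEXT NOT NULL,
            start_pos INTEGER NOT NULL,
            end_pos   INTEGER NOT NULL,
            product   TEXT
        );
        CREATE TABLE peptide_evidence (          -- proteomics (mass spectrometry)
            id        INTEGER PRIMARY KEY,
            locus_tag TEXT NOT NULL REFERENCES gene(locus_tag),
            peptide   TEXT NOT NULL
        );
        CREATE TABLE expression (                -- transcriptomics
            id               INTEGER PRIMARY KEY,
            locus_tag        TEXT NOT NULL REFERENCES gene(locus_tag),
            growth_condition TEXT NOT NULL,
            level            REAL NOT NULL
        );
    """)
    # Made-up demo rows; these are not real Y. pestis annotations.
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        [
            ("DEMO_0001", "91001", 100, 400, "hypothetical protein"),
            ("DEMO_0002", "91001", 500, 900, "hypothetical protein"),
            ("DEMO_0003", "91001", 950, 1100, "hypothetical protein"),
        ],
    )
    conn.executemany(
        "INSERT INTO peptide_evidence (locus_tag, peptide) VALUES (?, ?)",
        [("DEMO_0001", "MKTAYIAK"), ("DEMO_0001", "LLGEVR")],
    )
    conn.executemany(
        "INSERT INTO expression (locus_tag, growth_condition, level) VALUES (?, ?, ?)",
        [("DEMO_0001", "26C", 5.0), ("DEMO_0002", "37C", 2.5)],
    )
    return conn


def evidence_summary(conn):
    """Map each locus tag to (number of peptides, number of expression records)."""
    rows = conn.execute("""
        SELECT g.locus_tag, COUNT(DISTINCT p.id), COUNT(DISTINCT e.id)
        FROM gene AS g
        LEFT JOIN peptide_evidence AS p ON p.locus_tag = g.locus_tag
        LEFT JOIN expression AS e ON e.locus_tag = g.locus_tag
        GROUP BY g.locus_tag
    """).fetchall()
    return {tag: (n_pep, n_expr) for tag, n_pep, n_expr in rows}


def genes_without_evidence(conn):
    """List genes that have neither peptide nor expression support."""
    summary = evidence_summary(conn)
    return sorted(tag for tag, counts in summary.items() if counts == (0, 0))


if __name__ == "__main__":
    db = build_demo_db()
    print(evidence_summary(db))
    print(genes_without_evidence(db))
```

Keying every table on a shared gene identifier is what lets one query cross data types. In a Django project the same relationships would more likely be expressed as model classes with foreign keys.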

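[Degree-granting institution]: Academy of Military Medical Sciences of the Chinese People's Liberation Army
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: R378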


Article ID: 1707930


Article link: https://www.wllwen.com/yixuelunwen/jichuyixue/1707930.html

