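Integrative Analysis of Omics Data to Build a Platform for Plant Gene Structure Annotation and Functional Analysis
Published: 2018-05-20 11:59
Topics: bioinformatics + big data; Source: China Agricultural University, 2016 doctoral dissertation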
[Abstract]: Big data refers to massive data sets that exceed what traditional relational database systems can handle. With advances in sequencing technology and related biological applications, the life sciences have entered the era of big data. How to mine and analyze complex sequencing data is a major challenge for bioinformaticians. Starting from the needs of gene function research in plants, this thesis explores how existing bioinformatics methods can be used to dissect the multi-dimensional omics data produced by experimental scientists and to reveal the biology behind those data. I first designed a standardized pipeline for analyzing large-scale functional omics data in order to discover novel plant genes and novel alternative splicing forms, then built an online gene set enrichment analysis tool for plants, and finally constructed a comprehensive database and analysis platform for plant non-coding RNAs. Together these are useful attempts in several key directions of large-scale omics data mining.

Once the complete genome sequence of a species is available, genome-wide annotation of gene structure becomes a research priority. With the rapid progress of sequencing technology, epigenomic and transcriptomic data are also accumulating quickly. To use these omics data effectively for gene structure annotation, I built a standardized analysis pipeline. First, using data from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq), I studied two epigenetic modifications, H3K4me3 and H3K27ac, at the genome-wide level in plants, and then used known functional genome annotation to examine how these histone modifications are distributed across gene structures. Transcriptomic data confirmed a positive correlation between both histone modifications and gene expression. After integrating transcriptomic data generated in our laboratory with data from public platforms, I predicted novel genes in rice (cultivar Nipponbare) and Asian cotton, and used the distribution of histone modifications along genes to determine the strand (plus or minus) of each novel gene. Several of these genes were validated by qRT-PCR. After predicting the locations of the novel genes, I analyzed their gene structures, the tissue specificity of their expression, and their histone modification features on the chromosome. Finally, I derived a set of rules for predicting alternative splicing sites in Asian cotton from RNA-seq and ChIP-seq data.

Building on gene structure annotation, the next focus is how to use existing data for a comprehensive analysis of gene function. Existing plant GO enrichment tools such as EasyGO and AgriGO run statistical analyses on GO terms to identify specific genes associated with enriched terms, which helps biologists narrow the scope of their research. To enable deeper and more detailed functional study of one or more groups of differentially expressed genes, I extended GO terms by introducing the concept of a "gene set", describing gene function through as many as nine categories of gene sets, including Gene Ontology (GO), Plant Ontology (PO), gene families, KEGG annotations, and PlantCyc annotations. Compared with any single category, gene sets markedly increase the proportion of the genome that is annotated and substantially improve both the precision and the breadth of functional descriptions. Using the GSEA algorithm, I developed PlantGSEA (http://structuralbiology.cau.edu.cn/PlantGSEA), a gene set enrichment analysis tool for plants. Since its publication the tool has been updated several times at users' request and has been widely recognized by researchers.

In addition, secondary bioinformatics databases can provide multi-faceted functional information about individual DNA or protein sequences. Epigenetics research covers not only histone modifications but also regulation by non-coding sequences. While surveying work on plant non-coding RNAs, I found that few existing databases cover multiple types of plant non-coding sequences together with multiple levels of functional information. After analyzing the strengths and weaknesses of existing platforms, and drawing on the information and techniques available to me, I built a comprehensive database platform for plant non-coding sequences and named it PNRD (http://structuralbiology.cau.edu.cn/PNRD). PNRD collects 25,739 non-coding RNA sequences in 11 categories from 150 plant species, 178,138 interaction pairs between miRNAs and their target genes from 46 plant species, expression profile data for 35 miRNAs, and an information-mining pool that integrates 148 publications. The platform has five functional modules: search, browse, tools, download, and help.

This thesis aims to build a platform system for plant gene structure and function annotation and for omics data mining, and to offer some solutions for the integrative analysis of massive data. Faced with high-throughput data that have complex backgrounds and heavy noise, it is our mission as bioinformaticians to sharpen experimental scientists' insight so that they can uncover the value hidden in the data.
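The abstract attributes PlantGSEA's analysis to the GSEA algorithm without giving details. As a rough illustration only, the following is a minimal sketch of an unweighted GSEA-style running-sum enrichment score over a ranked gene list. The function name `enrichment_score` and the gene identifiers are hypothetical, and this is not PlantGSEA's actual implementation, whose statistical details are not described here.

```python
from typing import Sequence, Set


def enrichment_score(ranked_genes: Sequence[str], gene_set: Set[str]) -> float:
    """Unweighted GSEA-style running-sum enrichment score (illustrative sketch).

    Walk down the ranked list, stepping up on a hit (gene in the set)
    and down on a miss. The score is the running-sum value with the
    largest absolute deviation from zero.
    """
    n_total = len(ranked_genes)
    n_hits = sum(1 for gene in ranked_genes if gene in gene_set)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    # Step sizes are chosen so the running sum returns to zero at the end.
    step_up = 1.0 / n_hits
    step_down = 1.0 / (n_total - n_hits)

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += step_up if gene in gene_set else -step_down
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    # Hypothetical gene identifiers, ranked from most to least up-regulated.
    ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
    print(enrichment_score(ranked, {"g1", "g2"}))  # set concentrated at the top
    print(enrichment_score(ranked, {"g5", "g6"}))  # set concentrated at the bottom
```

A positive score suggests the gene set is concentrated near the top of the ranking and a negative score near the bottom. In practice the score is compared against a permutation-based null distribution to obtain a significance estimate, which this sketch omits.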
[Degree-granting institution]: China Agricultural University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: Q943.2
Article ID: 1914540
Article link: https://www.wllwen.com/kejilunwen/jiyingongcheng/1914540.html