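Discovery of the Core Complex Network System of the Molecular Mechanism of Tuberculosis Pathogenesis Based on Multi-omics, and Construction of a Predictive Diagnostic Model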

Published: 2018-06-19 15:25

Topic: multi-omics + tuberculosis; source: Beijing Tuberculosis and Thoracic Tumor Research Institute, 2017 doctoral dissertation

【Abstract】:
Objective: The global situation for tuberculosis (TB) prevention and control remains severe. A deeper understanding of TB pathogenesis is urgently needed, along with more effective prevention and control strategies built on that understanding. This study takes a multi-omics perspective to search for the core complex network system underlying the molecular mechanism of TB pathogenesis, and uses it to build a predictive diagnostic model for screening populations at high risk of TB, providing usable methods and tools for new control strategies and supporting precision prevention of TB. The study also seeks evidence for a central-dogma-like principle at the multi-omics level, as a basis for the future development of theoretical biology.
Methods: This is a trans-omics (Transomics) study based on multi-omics big data and computational algorithms. First, genome, nucleome, transcriptome, and proteome data were obtained from major international omics databases. Next, the genome and transcriptome data were processed with standard pipelines such as PLINK and limma, respectively, to obtain disease-association statistics for each genetic variant site and for each gene's expression. The existing nucleome data were then integrated and clustered as a network, and combined with the genome and transcriptome association statistics in a chromatin disease-association analysis, which yielded TB-associated chromatin disease modules. Based on the variant-site information in these modules, a preliminary TB predictive diagnostic model was built with machine learning. The proteome data were then integrated into a protein-protein interaction network, and the standardized network matrices of the nucleome and the proteome were tested for correlation to support the multi-omics central-dogma-like principle. Finally, the protein-protein interaction network was integrated with the TB-associated chromatin disease modules, and network decomposition was used to obtain the core complex network system of the molecular mechanism of TB pathogenesis. This core system was used to optimize the predictive diagnostic model, whose classification performance was validated by ROC analysis.
Results: Analysis of the genomic data identified 49236 SNP sites with p < 0.05 before correction for multiple testing. Differential expression analysis of the transcriptome data identified 1594 differentially expressed genes, of which 738 were up-regulated and 856 were down-regulated. The nucleome data were integrated into a 3044 × 3044 standardized matrix. Cluster partitioning and chromatin disease-association analysis produced TB-associated chromatin disease modules containing 101417 SNPs. The predictive diagnostic model built on these SNPs reached an AUC of 0.926, with sensitivity and specificity of 0.87 and 0.866, both above 0.85, which indicates high classification performance. The proteome data were integrated using the standardized nucleome matrix as the reference, and comparison of the two matrices showed that they are correlated, indicating a biological association between higher-order chromatin structure and protein-protein interactions. Integrating the TB-associated chromatin modules with the protein-protein interaction network gave a complex network system of the molecular mechanism of TB pathogenesis with 5846 nodes and 458653 edges. Hierarchical clustering reduced this to a core network of 2015 nodes and 61318 edges, and analysis with the iNP algorithm then yielded 15 kernel network units containing 228 genes. The predictive diagnostic model optimized with these kernel units and a forward search algorithm had an AUC of 0.841, with sensitivity and specificity of 0.768 and 0.769, and used 2260 SNP parameters. Without a large loss of classification performance, the number of model parameters was reduced to about 1/50 of the original model, which keeps the cost of applying the model with existing technology under control.
Conclusions: Using multi-omics big data from the genome, nucleome, transcriptome, and proteome and a Transomics approach, this study made a preliminary identification of the core complex network system of the molecular mechanism of TB pathogenesis and its kernel units. On that basis, machine learning was used to build a TB predictive diagnostic model with good classification performance, providing a candidate tool for screening populations at high risk of TB. At the level of theoretical biology, the study found some indications that a multi-omics central-dogma-like principle exists, as early exploratory work toward the formation and refinement of that theory. Regarding the molecular mechanisms of complex diseases, the study made a first attempt at a multi-omics analysis workflow that may suit complex diseases in general, and found partial evidence for the importance of relationships between genes, which suggests that future studies should take these relationships into account. Finally, because this is a Transomics study based on biological big data and computational algorithms, its conclusions still need to be examined and validated by future laboratory, clinical, and epidemiological research.
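The abstract reports model quality as AUC, sensitivity, and specificity from ROC analysis. As a reading aid, the sketch below shows how these three quantities can be computed from predicted risk scores. It is not the thesis's code: the function names `roc_auc` and `sensitivity_specificity`, the 0.5 threshold, and the toy labels and scores are all assumptions made up for illustration.

```python
from typing import List, Tuple


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def sensitivity_specificity(
    labels: List[int], scores: List[float], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): 1 = TB case, 0 = control; scores are model outputs.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]

auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"AUC={auc:.3f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

The pairwise definition of AUC is used here because it needs no external library and makes the meaning of the number explicit. For data sets of the size described in the abstract, a sorted or rank-based implementation would be the practical choice.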
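【Degree-granting institution】: Beijing Tuberculosis and Thoracic Tumor Research Institute
【Degree level】: Doctoral
【Year conferred】: 2017
【Classification number】: R52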

【References】