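Data Mining and Systems Modeling of Biological Diseases
Published: 2018-03-05 15:18
Topic: dimensionality reduction. Focus: model selection. Source: Shanghai Jiao Tong University, 2014 doctoral dissertation. Type: degree thesis.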
[Abstract]: In the post-genomic era, processing biological data at every level is a central task for bioinformatics. Learning and selecting informative signals from massive data, in order to identify and analyze the molecular features and patterns of specific diseases, is essential for diagnosis and prognosis. More importantly, studying the molecular mechanisms of disease from a systems biology perspective and building quantitative regulatory network models has become a key step in research on major diseases. However, existing learning algorithms are not tailored to the characteristics of disease-related data and do not offer computational methods designed for learning from high-throughput data on specific diseases, so they fail to capture all of a disease's key features. In particular, the lack of quantitative models means that some gene expression regulatory networks have not been effectively built and analyzed. One of the main causes is the "small sample problem": disease-related features are too numerous while biological experimental data are too scarce. This dissertation focuses on learning the key features of a series of diseases and the quantitative molecular dynamics related to them, and it designs dedicated algorithms for different biomedical problems, specifically to handle the small sample problem. The main work consists of three parts.

1. A "feature merging and selection algorithm" is designed for metagenomic 16S rRNA data on pneumonia and dental caries, to learn and extract combinations of features describing microbial taxa. The algorithm compresses the feature space through substantial dimensionality reduction while retaining a sufficient number of the original features, and the new combined features do not overlap with one another, which makes them easier to interpret. Validation on metagenomic data from the two diseases shows that the algorithm achieves a higher recognition rate than other methods while keeping the dimensionality low, which makes the model more stable.

2. Biological experiments show that the genes Maff and Egr3 are both highly expressed in normal hematopoietic stem cells in leukemic mice and that they affect the cell cycle in opposite ways. Using online bioinformatics resources, this work selects a quantitative model of cell cycle regulation by Maff and Egr3 through an "exhaustive enumeration and model selection" procedure. The model is validated by simulating the expression levels of a series of key cell cycle molecules and by scanning binding site sequences. Dynamic simulation then leads to a series of conclusions: Egr3 strongly inhibits the cell cycle, whereas the promotion of the cell cycle by Maff is constrained by Egr3. These results also support the "cancerization and self-protection" mechanism of normal cells in the leukemic environment.

3. For the gene expression regulatory network active during adipocyte differentiation, a parameter estimation algorithm for quantitative gene regulatory networks, the "small-sample iterative optimization algorithm", is designed to address the small sample problem in gene expression data. The algorithm can estimate reasonable parameters correctly and accurately even when the sample size is clearly insufficient, which makes it possible to construct quantitative regulatory networks. It is validated on the regulatory networks of two species, human and mouse. In addition, by searching for genes whose expression differs markedly before and after differentiation, comparative computation identifies a series of additional feedback structures, which are also validated. Based on the estimated quantitative networks, human and mouse adipocyte differentiation are compared in terms of parameter magnitudes, dynamic behavior, and statistical differences in regulatory strength. The comparison points to a potential relationship between the many differences in the details of gene expression regulation in the two species and the difference in efficiency between the human and mouse adipocyte differentiation systems.
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Doctoral
[Year degree granted]: 2014
[Classification number]: R318
Article ID: 1570787
Article link: https://www.wllwen.com/yixuelunwen/swyx/1570787.html