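Cancer Differentiation Grade Prediction and Biological Pathway Analysis Based on an Improved Multiple Kernel Learning Algorithm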

Published: 2018-05-07 14:22

Topic: multiple kernel learning + feature selection; Reference: Jilin University, 2017 master's thesis

[Abstract]: The rapid development of high-throughput sequencing technologies for different types of omics data has transformed research in bioinformatics, making it possible to generate large volumes of omics data quickly and at low cost. These data include genomic, transcriptomic, epigenetic, proteomic, and metabolomic data. More importantly, studies that obtain multiple types of omics data by sequencing the same sample are becoming increasingly common. At the same time, many public databases holding high-quality, high-confidence omics data have appeared, which makes it easier to collect omics data and carry out related research. A key goal of integrative analysis of these data is to identify an effective model that can predict phenotypic traits and related outcomes, find important biomarkers, or explain the genetic basis underlying complex traits. Current strategies for multi-omics data integration fall into two main categories: multi-stage analysis and meta-dimensional analysis. Multi-stage analysis assumes that the relationships between different omics data and a complex trait are linear and hierarchical; it analyzes two related types of omics data at a time and builds the model step by step from the results. In the vast majority of cases, however, a complex trait results from simultaneous changes across different omics data, so multi-stage analysis cannot model it effectively. Meta-dimensional analysis, by contrast, builds the model by integrating multiple types of omics data at the same time. Cancer differentiation grade, a complex trait of cancer, describes the degree of abnormality of cancer cells in cell morphology and tissue structure. It carries important information about the clinical behavior of cancer, such as progression and invasion, and plays an important role in planning clinical treatment and improving prognosis. Predicting the differentiation grade can greatly improve the early detection rate of cancer and effectively guide treatment. Although many researchers have recognized the importance of cancer differentiation grade, and some studies on predicting it have appeared, few have addressed the problem by integrating multiple types of omics data. An advanced algorithm that can use multiple types of omics data to predict cancer differentiation grade is therefore needed. In this thesis, we first propose a multiple kernel learning algorithm that is based on the meta-dimensional analysis strategy and constrained by ℓp-norm regularization, and we improve its computational efficiency with the sequential minimal optimization (SMO) algorithm. We also add biological pathway information to the original model so that it can be used to evaluate the importance of different biological pathways at different cancer differentiation grades. Finally, using breast cancer as a case study, we apply the proposed algorithm to integrate gene expression data and methylation data obtained after feature selection, and construct predictors for the different differentiation grades of breast cancer. Our experimental results show that the proposed model outperforms several other currently popular multi-omics data integration models in prediction, and that it offers an explanation at the biological pathway level for differences in breast cancer differentiation grade. In addition, the model can further reveal the relationships between the relevant omics data and breast cancer differentiation grade, giving deeper insight into the biological model underlying those differences.
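The abstract names ℓp-norm regularized multiple kernel learning but gives no formulas, so the following is only a minimal sketch of the general idea and not the thesis's implementation. It assumes the standard alternating scheme for ℓp-norm MKL: fit a kernel machine on the weighted kernel sum, then update the kernel weights in closed form and rescale them to unit ℓp norm. For brevity the inner solver is a least-squares (kernel ridge) fit rather than the SMO-trained classifier the abstract describes. All names (`rbf_kernel`, `lp_mkl_fit`, `expr`, `meth`) and all data are hypothetical and synthetic.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def combine(theta, kernels):
    """Weighted sum of kernel matrices."""
    return sum(t * Km for t, Km in zip(theta, kernels))

def lp_mkl_fit(kernels, y, p=2.0, lam=1.0, n_iter=50, tol=1e-6):
    """Alternate a kernel ridge fit with the closed-form lp-norm weight update.

    Assumption: least-squares loss stands in for the hinge loss / SMO solver.
    Returns kernel weights theta (||theta||_p = 1) and dual coefficients alpha.
    """
    M, n = len(kernels), len(y)
    theta = np.full(M, M ** (-1.0 / p))  # uniform start with unit lp norm
    for _ in range(n_iter):
        alpha = np.linalg.solve(combine(theta, kernels) + lam * np.eye(n), y)
        # Squared norm of each kernel's weight block: theta_m^2 * alpha^T K_m alpha
        norms2 = np.array([t ** 2 * (alpha @ Km @ alpha)
                           for t, Km in zip(theta, kernels)])
        norms2 = np.maximum(norms2, 1e-12)  # guard against exact zeros
        new_theta = norms2 ** (1.0 / (p + 1.0))
        new_theta /= np.sum(new_theta ** p) ** (1.0 / p)  # rescale to ||theta||_p = 1
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    alpha = np.linalg.solve(combine(theta, kernels) + lam * np.eye(n), y)
    return theta, alpha

# Synthetic stand-ins for two omics views of the same samples.
rng = np.random.default_rng(0)
n = 60
y = np.where(rng.random(n) < 0.5, -1.0, 1.0)        # two hypothetical grade classes
expr = 0.8 * y[:, None] + rng.normal(size=(n, 5))   # "expression": carries signal
meth = rng.normal(size=(n, 5))                      # "methylation": pure noise here
kernels = [rbf_kernel(expr, 0.1), rbf_kernel(meth, 0.1)]

theta, alpha = lp_mkl_fit(kernels, y, p=2.0)
train_scores = combine(theta, kernels) @ alpha      # sign gives the predicted class
```

In this sketch each omics type contributes one kernel, and `theta` indicates how strongly each kernel enters the combined model. If one kernel were built per biological pathway instead, the learned weights could be read as a rough pathway importance score; whether the thesis constructs its pathway analysis this way is not stated in the abstract.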
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: R73-3;TP18

[Similar literature]