Clustering Research on Cancer Subtype Discovery Based on Genomic Data

Published: 2018-01-28 05:52

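  Keywords: cancer; cancer subtypes; cancer genome; The Cancer Genome Atlas (TCGA); gene regulatory network; data mining; cluster analysis. Source: University of Science and Technology of China, 2016 doctoral dissertation. Document type: dissertation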


[Abstract]: Defining and discovering cancer subtypes is an important part of personalized cancer treatment: correctly assigning cancer samples to subtypes provides an important reference for choosing the right treatment for a patient. Advances in genomic technology make it possible to obtain high-throughput, whole-genome sequencing data from cancer cases, which creates the conditions for studying, at the whole-genome level, the differences between individual cancers and the mechanisms of cancer initiation, progression, and metastasis. However, cancer genomic data are large biological datasets that span multiple data types and have high-dimensional features. High dimensionality, high noise, and small sample sizes are common to such data and pose new challenges for traditional data mining techniques. Genomic technology has accumulated large amounts of cancer sample data, and using data mining and big-data analysis to process these data, and to explore the possible subtypes of each cancer and their corresponding tumor molecular markers, is of great practical significance for cancer research and treatment.
This dissertation takes cancer genomic data as its object of study. Addressing the high dimensionality and multiple data types of these data, it mainly studies methods for processing and fusing cancer genomic data in cluster analysis for cancer subtype discovery, and it explores new clustering algorithms for cancer genomic data. Cancer genomics links genes to cancer research through high-throughput sequencing. Gene chip (microarray) technology and second-generation sequencing are currently the main sources of cancer genomic data, and the dissertation discusses their technical characteristics and details at length. It also gives a fairly comprehensive introduction to The Cancer Genome Atlas (TCGA), the largest cancer genome research project to date.
The dissertation builds an analysis framework for cancer subtype discovery based on genomic data, comprising preprocessing methods, methods for extracting important features, clustering methods, and methods for evaluating clustering results. It describes in detail the preprocessing steps of data filtering, missing-value imputation, and data standardization, and it proposes four feature selection methods for genomic data. Clustering algorithms are the core of subtype discovery based on genomic data, and the dissertation systematically introduces four main computational biology methods for it: consensus clustering, consensus non-negative matrix factorization, integrative clustering of multiple genomic data types, and similarity network fusion. For evaluating clustering results, it presents survival analysis, the Silhouette method, and statistical significance tests for clustering.
Clustering of multiple genomic data types is a very effective way to define and discover cancer subtypes and has already led to important findings and applications in many cancer studies. New computational biology methods for subtype discovery continue to be developed. The existing methods for cancer subtype discovery based on genomic data are all "pure" machine learning methods, but the complexity of the life sciences means that "pure" machine learning cannot fully solve the problem of identifying cancer subtypes. The dissertation therefore introduces gene regulatory network analysis and integrates the gene regulatory network into the multi-genomic fusion clustering process. It proposes a weighted similarity fusion algorithm based on the miRNA-TF-mRNA gene regulatory network, which integrates genomic expression data with gene regulatory network information to cluster cancer samples, and it obtains biologically meaningful cancer subtypes.
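The preprocessing and evaluation stages of the framework described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names `preprocess` and `silhouette`, the `max_missing` threshold, and the use of mean imputation are all assumptions made here for illustration, and the tiny expression matrix is invented toy data.

```python
import math
from statistics import mean, pstdev


def preprocess(matrix, max_missing=0.2):
    """Filter, impute and standardize a genes-by-samples matrix (None = missing)."""
    cleaned = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        missing_fraction = 1 - len(observed) / len(row)
        # Filtering: drop genes with too many missing values.
        if not observed or missing_fraction > max_missing:
            continue
        gene_mean = mean(observed)
        # Imputation: fill gaps with the gene's mean (an assumed, simple choice).
        filled = [gene_mean if v is None else v for v in row]
        spread = pstdev(filled)
        # Filtering: a constant gene cannot separate samples.
        if spread == 0:
            continue
        # Standardization: z-score each gene across samples.
        center = mean(filled)
        cleaned.append([(v - center) / spread for v in filled])
    return cleaned


def silhouette(samples, labels):
    """Mean silhouette width; expects at least two distinct cluster labels."""
    clusters = set(labels)
    widths = []
    for i, x in enumerate(samples):
        own = [math.dist(x, y) for j, y in enumerate(samples)
               if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        a = mean(own)  # mean distance to the sample's own cluster
        b = min(  # mean distance to the nearest other cluster
            mean(math.dist(x, y) for y, lab in zip(samples, labels) if lab == c)
            for c in clusters if c != labels[i]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(widths)


# Toy data (hypothetical): 4 genes x 6 samples, two obvious sample groups.
expression = [
    [1.0, 1.2, 0.9, 5.0, 5.1, None],
    [2.0, 2.1, 1.9, 8.0, 7.9, 8.1],
    [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],     # constant gene, filtered out
    [None, None, 1.0, 2.0, None, 3.0],  # too many gaps, filtered out
]
clean = preprocess(expression)
samples = [list(col) for col in zip(*clean)]  # one feature vector per sample
print(len(clean), round(silhouette(samples, [0, 0, 0, 1, 1, 1]), 3))
```

A silhouette width close to 1 suggests compact, well-separated clusters, while values near 0 or below suggest that samples sit between clusters or are misassigned. In practice the cluster labels would come from one of the clustering methods named in the abstract rather than being supplied by hand.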
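[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctorate
[Year degree conferred]: 2016
[Classification number]: R73-3;TP311.13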

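[Related literature]

Related newspaper articles (top 2)

1 蓝岸; China Expected to Be the First to Obtain Asian Genome Data [N]; Shenzhen Special Zone Daily; 2007

2 Staff reporter 任荃; Public Genome Data Has Been Contaminated [N]; Wenhui Bao; 2011

Related doctoral dissertations (top 1)

1 许桃胜; Clustering Research on Cancer Subtype Discovery Based on Genomic Data [D]; University of Science and Technology of China; 2016

Related master's theses (top 2)

1 董伯Oz; Construction of a Genome Data Platform for Aegilops tauschii [D]; Jilin University; 2013

2 林延春; Research on Personal Genome Data Management [D]; Harbin Institute of Technology; 2010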



Article ID: 1469954


Article link: https://www.wllwen.com/shoufeilunwen/xxkjbs/1469954.html

