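Discrimination and Subtyping of Colorectal Cancer Based on Gene Expression Profiles
Published: 2018-02-02 04:00
Keywords: gene expression profile; colorectal cancer discrimination; colorectal cancer subtyping; feature gene selection; curse of dimensionality; data imbalance. Source: Southern Medical University, doctoral dissertation, 2017. Document type: dissertation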
[Abstract]: Cancer discrimination based on gene expression profiles means designing an effective classification algorithm for a gene expression dataset that separates normal samples from cancer samples and identifies the discriminating genes (feature genes). Cancer subtyping based on gene expression profiles means designing an effective classification algorithm that divides cancer samples into several subtypes and identifies the feature genes that distinguish each subtype, which helps in determining drug targets and in the precision treatment of patients. However, four salient characteristics of gene expression datasets make both tasks difficult: the curse of dimensionality, high redundancy, high noise, and data imbalance. This study addresses these characteristics, together with the fact that the number of colorectal cancer subtypes is unknown. It examines state-of-the-art algorithms and, building on them, proposes more reasonable methods for discriminating and subtyping colorectal cancer, improving the accuracy of both tasks and identifying a series of feature gene sets with high discriminative power.
The thesis has four parts. Chapter 1 is the introduction.
Chapter 2 studies the algorithms used for discriminating and subtyping colorectal cancer from gene expression profiles, mainly:
(1) The RUSBoost classification algorithm for imbalanced datasets. This binary classification algorithm is extended to a multi-class algorithm named RUSBoost.M2.
(2) DEFSw, a feature gene selection algorithm based on differential evolution (DE) and a roulette-wheel search strategy. To handle the sample imbalance of gene expression data, the classification evaluation measure and the classifier wrapped by DEFSw are replaced with weighted accuracy and RUSBoost.M2, respectively, yielding the DEFSw.wAcc and DEFSw.RUSBoost.M2.wAcc algorithms and improving the discriminative power of the selected feature genes.
(3) BRPCA (Bayesian Robust Principal Component Analysis), an algorithm used in video surveillance processing, which is suitably modified and applied to gene expression data for dimensionality reduction and noise reduction.
(4) A hierarchical information clustering algorithm based on the Planar Maximally Filtered Graph (abbreviated DBHT). The study focuses on its clustering principle and uses the fact that it determines the number of clusters automatically and clusters without supervision to perform subtyping.
Chapter 3 addresses the discrimination of colorectal cancer. Using the DEFSw.RUSBoost.M2.wAcc feature selection algorithm proposed in Chapter 2, feature gene sets are selected from the TCGA COAD (colon cancer) dataset and validated on the GEO GSE39582, GSE41657, and TCGA READ (rectal cancer) datasets. This yields 13 single-gene sets and 88 two-gene sets that discriminate both colon cancer and rectal cancer with high accuracy, as well as 14 single-gene sets that discriminate only colon cancer with high accuracy. Some of these genes have not previously been reported in connection with cancer or colorectal cancer. In addition, for each of 5 previously reported promising colorectal cancer biomarkers, several auxiliary genes are found that significantly improve the biomarker's ability to discriminate colon cancer.
Chapter 4 uses the TCGA COAD dataset. The BRPCA algorithm as modified in Chapter 2 first reduces the dimensionality and noise of the gene expression data. The DBHT algorithm then performs unsupervised clustering on the sparse component S separated by BRPCA. With the correct clustering of the normal samples as the reference, colon cancer is divided into 7 subtypes. The DEFSw.wAcc algorithm then selects 44 feature genes for subtyping. These include MSH6, a gene known to be associated with hereditary colorectal cancer that appears directly in the KEGG colorectal cancer pathway.
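The wrapper idea behind DEFSw.wAcc can be illustrated with a small sketch: a differential evolution search proposes gene subsets, and each subset is scored by a class-weighted accuracy so that the minority class is not ignored. Everything below is an illustrative assumption rather than the thesis's implementation. The weighted accuracy is taken to be the mean per-class recall (the abstract does not give the exact formula), a nearest-centroid classifier stands in for RUSBoost.M2, duplicate indices are redrawn uniformly instead of through the roulette-wheel strategy, subsets are scored on the training data without cross-validation, and the data are synthetic.

```python
import random


def weighted_accuracy(y_true, y_pred):
    """Mean per-class recall, so a small class counts as much as a large one.

    Assumption: one common class-weighted accuracy; the abstract does not
    give the thesis's exact wAcc formula.
    """
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def nearest_centroid_predict(X, y, features):
    """Stand-in classifier (not RUSBoost.M2): the closest class centroid wins."""
    centroids = {}
    for c in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == c]
        centroids[c] = [sum(r[f] for r in rows) / len(rows) for f in features]
    preds = []
    for x in X:
        dist = {c: sum((x[f] - m) ** 2 for f, m in zip(features, mu))
                for c, mu in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds


def decode(vector, n_features, rng):
    """Round a real-valued vector to distinct feature (gene) indices."""
    chosen = []
    for v in vector:
        f = int(round(v)) % n_features
        while f in chosen:  # simplification: redraw duplicates uniformly
            f = rng.randrange(n_features)
        chosen.append(f)
    return chosen


def de_feature_selection(X, y, k=2, pop_size=10, generations=20,
                         F=0.5, CR=0.9, seed=0):
    """DE/rand/1/bin search over k-gene subsets, scored by weighted accuracy."""
    rng = random.Random(seed)
    d = len(X[0])

    def evaluate(vec):
        feats = decode(vec, d, rng)
        # Simplification: scored on the training data, no cross-validation.
        return feats, weighted_accuracy(y, nearest_centroid_predict(X, y, feats))

    pop = [[rng.uniform(0, d - 1) for _ in range(k)] for _ in range(pop_size)]
    scored = [evaluate(v) for v in pop]
    history = [max(s for _, s in scored)]  # best score per generation
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(k)  # at least one coordinate is mutated
            trial = [pop[a][j] + F * (pop[b][j] - pop[c][j])
                     if (rng.random() < CR or j == j_rand) else pop[i][j]
                     for j in range(k)]
            feats, score = evaluate(trial)
            if score >= scored[i][1]:  # greedy selection: keep the better vector
                pop[i], scored[i] = trial, (feats, score)
        history.append(max(s for _, s in scored))
    best_feats, best_score = max(scored, key=lambda fs: fs[1])
    return best_feats, best_score, history


# Synthetic, imbalanced toy data: 8 samples of class 0, 32 of class 1, 20 "genes".
data_rng = random.Random(1)
X, y = [], []
for label, n in ((0, 8), (1, 32)):
    for _ in range(n):
        row = [data_rng.gauss(0, 1) for _ in range(20)]
        row[3] += 6 * label  # gene 3 is the only informative one, by construction
        X.append(row)
        y.append(label)

features, score, history = de_feature_selection(X, y)
print("selected genes:", features, "weighted accuracy:", score)
```

Because selection is greedy, the best score in `history` never decreases from one generation to the next. The choice of metric matters on imbalanced data: a classifier that always predicts the majority class on a 2-versus-8 split has a plain accuracy of 0.8 but a weighted accuracy of only 0.5 under this definition.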
[Degree-granting institution]: Southern Medical University
[Degree level]: Doctoral
[Year degree granted]: 2017
[Classification number]: R735.34
Article ID: 1483574
Article link: https://www.wllwen.com/kejilunwen/jiyingongcheng/1483574.html