A Cancer Driver Gene Prediction Model Based on Principal Component Analysis and Neural Networks

Published: 2018-06-07 22:30

Topics: principal component analysis + neural network; Source: Beijing Jiaotong University, 2017 master's thesis

[Abstract]: Cancer is one of the main threats to human life and health. It places a heavy psychological and financial burden on individuals and families and also seriously affects global economic development and social progress. Research on the mechanisms of cancer and its control has become a priority of global health strategy. Earlier cancer research focused mainly on external causes, and little was known about the intrinsic carcinogenic mechanisms until methods such as high-throughput sequencing made it possible to analyze internal causes at the gene level. By analyzing changes in intracellular gene expression levels during cancer formation, researchers found that some genes exert control over tumors: if the expression of these genes or their pathways is inhibited, the events associated with tumor development can be halted. These genes are called cancer driver genes. Driver genes are the most important internal cause of cancer, and targeted therapy directed at them could make cancer treatment far more effective. At present, cancer driver genes are predicted mainly by analyzing sequence alignment results from large numbers of samples. This biology-based approach is easy to understand, but it usually requires sequencing many cancer samples, which is expensive. With the rapid development of molecular biology, organizations such as TCGA (The Cancer Genome Atlas) provide researchers with large and frequently updated data resources, and the emergence of techniques such as machine learning and data mining provides strong support for analyzing these data. Driver gene prediction is gradually becoming data-driven. This thesis introduces the research background, significance, and methods of driver gene research, and describes in detail the basic principles of principal component analysis (PCA) and neural networks and how they are applied in this work. Based on these two methods, we propose a systems biology model for predicting cancer driver genes. Starting from microarray data, the model derives a predicted set of driver genes step by step, which reduces systematic and human error in the relevant experimental steps, can effectively reduce cost and experimental time, and provides a basis for targeted cancer therapy. Glioblastoma multiforme was chosen as the test case for validation. First, the sample data are preprocessed: the tumor expression profile data are normalized, and PCA is then used to further filter out expression data that carry no or too little expression information. Second, inspired by module networks, the selected genes are partitioned so that genes with similar mutation rates fall into the same block, and the blocks are ranked. Finally, a restricted Boltzmann machine is trained to obtain the predicted set of driver genes. Comparing the predictions with text-mining results shows that about 80% of the predicted genes agree with the text-mining results, which indicates that the proposed model is feasible and effective.
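The first two steps described in the abstract (normalization followed by PCA-based filtering, then grouping genes into ranked blocks by mutation rate) might be sketched roughly as follows. The abstract does not give the exact procedure, so everything here is an assumption made for illustration: the log2 transform with per-sample median centering, the scoring rule (variance of each gene captured by the leading principal components), the equal-size blocks, and all function and parameter names. The restricted Boltzmann machine step is not covered.

```python
import numpy as np


def normalize_expression(expr):
    """Log2-transform a genes x samples matrix and median-center each sample.

    Assumption: the thesis only says the data are "normalized"; this is one
    common choice for microarray intensities, not necessarily the author's.
    """
    logged = np.log2(np.asarray(expr, dtype=float) + 1.0)
    return logged - np.median(logged, axis=0, keepdims=True)


def pca_gene_scores(norm_expr, n_components=2):
    """Variance of each gene that is captured by the leading principal components."""
    # Treat samples as observations and genes as features (samples x genes).
    x = norm_expr.T - norm_expr.T.mean(axis=0, keepdims=True)
    # SVD of the centered matrix: rows of vt hold the gene loadings of each component.
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    eigvals = s ** 2 / (x.shape[0] - 1)
    k = min(n_components, len(s))
    return (vt[:k] ** 2 * eigvals[:k, None]).sum(axis=0)


def filter_genes(norm_expr, keep_fraction=0.5, n_components=2):
    """Return sorted indices of the genes with the highest PCA scores.

    Hypothetical filtering rule: genes with little captured variance are
    treated as carrying too little expression information.
    """
    scores = pca_gene_scores(norm_expr, n_components)
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    return np.sort(np.argsort(scores)[::-1][:n_keep])


def block_by_mutation_rate(mutation_rate, n_blocks=3):
    """Group gene indices into blocks of similar mutation rate, highest rates first.

    Assumption: blocks are formed by sorting genes by rate and cutting the
    sorted list into roughly equal-size pieces.
    """
    order = np.argsort(mutation_rate)[::-1]
    return [block for block in np.array_split(order, n_blocks) if len(block)]


# Toy data: 7 genes x 4 samples; only genes 0 and 4 vary across samples.
toy_expr = np.full((7, 4), 7.0)
toy_expr[0] = [1.0, 3.0, 15.0, 63.0]
toy_expr[4] = [63.0, 1.0, 15.0, 3.0]
toy_norm = normalize_expression(toy_expr)
toy_kept = filter_genes(toy_norm, keep_fraction=2 / 7)
print("kept gene indices:", toy_kept)
```

In this sketch a gene whose expression is flat across samples receives a score near zero and is dropped, which is one plausible reading of filtering out data with "no or too little expression information". The kept genes, split into ranked blocks, would then be the input to the learning step.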
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: R73-3; TP183

[References]