Research on a Cancer Subtype Analysis Algorithm Based on Tensor Decomposition

Published: 2018-10-30 12:06

[Abstract]: Cancers named by morphology or by their tissue or organ of origin are not precisely characterized; clinical treatment requires more precise subtypes so that therapy can be matched to the disease and targeted. Analysis of gene-chip (microarray) data such as mRNA, miRNA, DNA, and protein data can discover and identify more accurate cancer subtypes. Integrating multi-source genomic data reveals not only the relationship between tumors and genomic data, but also the synergistic effects that the different kinds of gene data jointly exert on tumors. The difficulty in cancer subtype analysis lies in considering different gene data together and analyzing the structure they share without losing information. This thesis uses the tensor, a multi-dimensional array structure, to integrate multi-source genomic data without intermediate data transformation, preserving the information specific to each original single-source dataset while mining cooperative pathogenic patterns across datasets. The thesis introduces the principle and framework of the tensor model and builds a tensor model on breast cancer gene expression profile data and DNA methylation data. The construction performs differential expression analysis on the preprocessed chip data: genes with significant differences are set to 1 in the tensor or keep their original chip values, while genes that are normally expressed or show no significant difference are sparsified to 0. In this way the gene expression profile data and the methylation data are integrated into a three-dimensional tensor. Building on the existing CP-ARP decomposition algorithm, the thesis introduces non-negativity and sparsity constraints to optimize the CP decomposition model. These constraints are motivated by the high-dimension, small-sample nature of gene-chip data and by the polarization between differentially expressed genes and genes with normal expression levels. The improved model uses an ALS optimization method based on stochastic gradient descent, which improves computational performance. Comparing the results of the improved decomposition with the five already validated breast cancer subtypes demonstrates the effectiveness of the tensor decomposition model for cancer subtyping. Analysis of the subtyping results verifies Her2, a subtype whose existence has been clinically proven. The average silhouette coefficient and survival analysis support the performance of the algorithm and the validity of the identified subtypes. The proposed method can therefore provide some reference for cancer subtyping and for cancer diagnosis and treatment.
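The two steps the abstract describes, building a sparse three-dimensional tensor from two data types and factorizing it under non-negativity and sparsity constraints, can be illustrated with a minimal sketch. This is not the thesis's implementation. Everything here is an assumption made for illustration: the function names, the per-sample z-score rule against normal samples (the abstract does not say which differential test is used), the threshold, the penalty weight `lam`, and the synthetic data. The abstract reports an ALS method based on stochastic gradient descent; the sketch substitutes plain multiplicative updates with an L1 penalty, a simpler standard way to obtain non-negative, sparse CP factors.

```python
import numpy as np

def build_binary_tensor(expr, meth, normal_expr, normal_meth, z_thresh=2.0):
    """Stack two (samples x genes) matrices into a samples x genes x 2 tensor.

    Assumed rule: an entry becomes 1 when the tumor value lies more than
    z_thresh normal-sample standard deviations from the normal mean, else 0.
    """
    layers = []
    for tumor, normal in ((expr, normal_expr), (meth, normal_meth)):
        mu = normal.mean(axis=0)
        sd = normal.std(axis=0) + 1e-12  # avoid division by zero
        z = (tumor - mu) / sd
        layers.append((np.abs(z) > z_thresh).astype(float))
    return np.stack(layers, axis=2)

def nonneg_sparse_cp(X, rank, lam=0.1, n_iter=200, seed=0, eps=1e-12):
    """Non-negative CP factors of a 3-way tensor via multiplicative updates.

    lam is an L1 penalty weight that pushes factor entries toward zero.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    # einsum specs for the matricized-tensor-times-Khatri-Rao product per mode
    specs = ["ijk,jr,kr->ir", "ijk,ir,kr->jr", "ijk,ir,jr->kr"]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            mttkrp = np.einsum(specs[mode], X, *others)
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            # the ratio is non-negative, so the factors stay non-negative
            factors[mode] *= mttkrp / (factors[mode] @ gram + lam + eps)
    return factors

def reconstruct(factors):
    """Rebuild the tensor from its three CP factor matrices."""
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

# Synthetic example: 20 tumor samples, 30 genes, two planted sample groups.
rng = np.random.default_rng(1)
normal_expr = rng.normal(size=(10, 30))
normal_meth = rng.normal(size=(10, 30))
expr = rng.normal(size=(20, 30))
meth = rng.normal(size=(20, 30))
expr[:10, :5] += 4.0     # group 1: shifted expression in genes 0-4
meth[10:, 5:10] += 4.0   # group 2: shifted methylation in genes 5-9

X = build_binary_tensor(expr, meth, normal_expr, normal_meth)
factors = nonneg_sparse_cp(X, rank=2)
rel_error = np.linalg.norm(X - reconstruct(factors)) / np.linalg.norm(X)
# One simple (assumed) way to read off groups: the dominant component per sample.
sample_component = factors[0].argmax(axis=1)
```

In this layout the first factor matrix holds one row per sample, so clustering its rows is one way to derive candidate subtypes; the choice of rank and of the clustering rule is left open here, as the abstract does not specify them.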
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: R73-3

[References]