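Research on Cancer Gene Data Classification Based on the SVM Algorithm
Published: 2018-04-27 11:53
Topics: DNA microarray + gene expression data; Source: Soochow University, 2015 master's thesis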
[Abstract]: Cancer is one of the major diseases that seriously threaten human life, and early diagnosis is the key to improving the survival rate of cancer patients. With the rapid development of DNA microarray technology, massive amounts of cancer gene expression data have accumulated. How to use these data, on the basis of molecular biology, for early cancer diagnosis has become a research hotspot of the post-genomic era. However, cancer gene expression data are typically high-dimensional, small in sample size, and nonlinear, which makes them difficult to classify. To address these characteristics, this thesis classifies cancer data samples with a method based on the support vector machine (SVM). SVM is a machine learning method developed from statistical learning theory. It adopts the structural risk minimization principle in place of empirical risk minimization and uses kernel functions to transform a nonlinear problem into a linear one, which gives it distinctive advantages in small-sample, nonlinear, and high-dimensional pattern recognition. Although SVM effectively addresses the problems of underfitting and overfitting, the small sample size and high dimensionality of gene expression data still inevitably affect classification accuracy. Classifying the raw data directly is laborious and does not give satisfactory results, so dimensionality reduction becomes the key issue in cancer gene data classification. This thesis therefore first applies dimensionality reduction to the raw gene expression data and then runs SVM classification on the resulting lower-dimensional data. By comparing several dimensionality reduction methods and setting the SVM parameters appropriately, a high cancer diagnosis accuracy can be obtained. The dimensionality reduction methods used include sparse principal component analysis, generalized discriminant analysis (GDA), and Laplacian eigenmaps, among others. The focus of the research is how to use dimensionality reduction to optimize the data. Experiments on two publicly available datasets show that GDA gives the best dimensionality reduction for the Prostate Tumor data, while MDS (multidimensional scaling) gives the best for the Leukemia data. The experimental results indicate that finding the best dimensionality reduction method and tuning the SVM parameters appropriately can effectively optimize gene data, improve the classification performance of SVM, and achieve higher classification accuracy.
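The workflow described in the abstract (reduce the dimension first, classify with an SVM, compare reducers, and tune the SVM parameters) might be sketched roughly as follows. This is an illustrative sketch using scikit-learn, not the thesis's implementation. The data is synthetic rather than the Prostate Tumor or Leukemia sets, PCA and kernel PCA are stand-ins for the reducers named in the abstract, and the grid values, component count, and variable names are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for microarray data: few samples, many features.
# (Assumption: the real datasets are not bundled with scikit-learn.)
X, y = make_classification(
    n_samples=100, n_features=2000, n_informative=20,
    n_redundant=0, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Scale -> reduce dimension -> SVM. Keeping the reducer inside the pipeline
# means it is refit on each training fold, so no test information leaks in.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA()),   # placeholder, replaced by the grid below
    ("svm", SVC()),
])

# Candidate reducers and SVM settings to compare (arbitrary example values).
param_grid = {
    "reduce": [PCA(n_components=10),
               KernelPCA(n_components=10, kernel="rbf")],
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 1.0, 10.0],
    "svm__gamma": ["scale", 0.001],  # ignored by the linear kernel
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best settings:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
print("held-out accuracy:", test_accuracy)
```

Cross-validation is used here because, with so few samples, a single train/test split gives a noisy estimate of which reducer and parameter setting is best.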
[Degree-granting institution]: Soochow University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: R730.4; TP311.13
Article ID: 1810630
Article link: https://www.wllwen.com/yixuelunwen/zlx/1810630.html