Research on Tumor Gene Expression Profile Data Classification Based on Back-Projection Representation
Published: 2018-10-08 16:02
[Abstract]: With the rapid development of gene chip technology, tumor gene expression profile data can now be obtained quickly and accurately. Analyzing these data provides a powerful tool for early tumor diagnosis and for studying tumors at the molecular level. Feature selection and sample classification are the two basic problems in tumor classification based on gene expression profiles. In recent years, tumor classification based on sparse representation has attracted growing attention. However, sparse-representation classifiers have three problems: (1) they depend heavily on having sufficient training samples; (2) they ignore the information contained in the test samples; (3) classification by reconstruction error is unstable. In addition, the current trend is toward gene selection methods that are both efficient and biologically meaningful. To address these problems, this thesis makes two contributions. First, it proposes a tumor classification method based on back-projection representation and category contribution rate, and proves the feasibility and stability of the method theoretically. A new back-projection representation model is constructed by mining the information embedded in the test samples, which reduces the influence of the number of training samples. To perform classification with this model, a new classification criterion, the category contribution rate, is proposed. Finally, a new statistical index, the classification stability index, is defined to quantify the stability of different classification criteria. Second, building on the first contribution, the thesis proposes a tumor classification method that combines two-stage hybrid gene selection with the back-projection representation model. The first stage of the gene selection is a preliminary selection that combines three filter methods: BW, SNR, and the F test. The second stage applies a statistical Lasso method to the preselected informative genes to select again and obtain possible pathogenic genes. Classification is then completed with the back-projection representation model. In the experiments for the first contribution, the effectiveness of back-projection representation on the small-sample problem is verified first, the stability of the classification criterion based on category contribution rate is then verified with the classification stability index, and the robustness of the classification method is tested last. For the second contribution, the necessity of gene selection and the feasibility of Lasso are verified first; the efficiency of the two-stage hybrid gene selection method is then verified through classification performance and through visual projection plots, based on principal component analysis (PCA), at the different stages. The method was also used to select candidate pathogenic genes, which were then analyzed biologically.
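The two-stage gene selection described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation. The abstract does not say how the three filter scores are combined, so summing their ranks is an assumption made here. The second stage uses plain Lasso solved by coordinate descent as a stand-in for the "statistical Lasso" the abstract mentions. All function names are hypothetical, labels are assumed to be two-class, and only the Python standard library is used so the sketch stays self-contained.

```python
from statistics import mean, stdev


def column(X, j):
    """Expression values of gene j across all samples (rows of X are samples)."""
    return [row[j] for row in X]


def bw_ratio(col, y):
    """Between-class over within-class sum of squares for one gene."""
    overall = mean(col)
    between = within = 0.0
    for c in set(y):
        grp = [x for x, lab in zip(col, y) if lab == c]
        m = mean(grp)
        between += len(grp) * (m - overall) ** 2
        within += sum((x - m) ** 2 for x in grp)
    return between / within if within > 0 else 0.0


def f_stat(col, y):
    """One-way ANOVA F statistic, which equals BW * (n - k) / (k - 1)."""
    n, k = len(y), len(set(y))
    return bw_ratio(col, y) * (n - k) / (k - 1)


def snr(col, y):
    """Two-class signal-to-noise ratio |mu_a - mu_b| / (sd_a + sd_b)."""
    a_lab, b_lab = sorted(set(y))  # raises if there are not exactly two classes
    a = [x for x, lab in zip(col, y) if lab == a_lab]
    b = [x for x, lab in zip(col, y) if lab == b_lab]
    denom = stdev(a) + stdev(b)
    return abs(mean(a) - mean(b)) / denom if denom > 0 else 0.0


def stage1_filter(X, y, keep):
    """Stage 1: rank genes under each filter and keep the `keep` best rank sums.

    ASSUMPTION: rank summation is illustrative; the source does not specify
    how BW, SNR and the F test are combined.
    """
    p = len(X[0])
    rank_sum = [0] * p
    for score in (bw_ratio, f_stat, snr):
        s = [score(column(X, j), y) for j in range(p)]
        for rank, j in enumerate(sorted(range(p), key=lambda g: -s[g])):
            rank_sum[j] += rank
    return sorted(range(p), key=lambda g: rank_sum[g])[:keep]


def soft(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    return z - t if z > t else z + t if z < -t else 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w, resid = [0.0] * p, list(y)  # resid = y - Xw, and w starts at zero
    sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if sq[j] == 0:
                continue
            rho = sum(row[j] * r for row, r in zip(X, resid)) / n + sq[j] * w[j]
            new = soft(rho, lam) / sq[j]
            if new != w[j]:
                resid = [r - row[j] * (new - w[j]) for row, r in zip(X, resid)]
                w[j] = new
    return w


def two_stage_select(X, y, keep, lam):
    """Stage 1 filter ranking, then Stage 2 Lasso on the surviving genes."""
    genes = stage1_filter(X, y, keep)
    labs = sorted(set(y))
    t = [1.0 if lab == labs[1] else -1.0 for lab in y]  # code classes as -1/+1
    t_mean = mean(t)
    cols = [column(X, j) for j in genes]
    means = [mean(c) for c in cols]
    # Centre each gene and the target so no intercept term is needed.
    Xc = [[cols[g][i] - means[g] for g in range(len(genes))] for i in range(len(y))]
    w = lasso_cd(Xc, [v - t_mean for v in t], lam)
    return [j for j, wj in zip(genes, w) if wj != 0.0]
```

Calling `two_stage_select(X, y, keep, lam)` with `X` as a list of samples (each a list of gene expression values) and `y` as two-class labels returns the indices of the genes that survive both stages. Because the F statistic is a fixed multiple of the BW ratio for a given dataset, those two filters always produce the same ranking, so the rank sum used here effectively gives that ranking double weight relative to SNR. A real pipeline would more likely use NumPy and an established Lasso solver.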
[Degree-granting institution]: Henan University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: R73-3
Document ID: 2257373
Article link: https://www.wllwen.com/kejilunwen/jiyingongcheng/2257373.html