Research on Tumor Informative Gene Selection and Classification Methods
[Abstract]: The rapid development of large-scale gene expression profiling provides a new technology platform for tumor research. Data mining based on gene expression profiles is of great significance for discovering pathogenic genes, diagnosing tumors clinically, evaluating drug efficacy, and understanding pathogenesis. Tumor gene expression data are characterized by high feature dimensionality, small or relatively small sample sizes, large differences in sample background, high redundancy, non-linearity, and interaction effects between genes, so traditional statistical and pattern recognition methods are of limited use. Based on these characteristics, this thesis studies methods for selecting informative genes and constructing classifiers. The main results are as follows:
(1) The Binary Matrix Shift Filter (BMSF), a binary matrix rearrangement filter for high-dimensional feature selection, was developed on the basis of the support vector machine (SVM). Most informative gene selection methods consider only the effect of a single gene or a pair of genes and ignore interactions among multiple genes. The proposed BMSF algorithm takes multi-gene interactions into account: it introduces a randomly generated intermediate (0,1) binary matrix to transform the classification problem into a regression problem, and performs SVM-based high-dimensional feature selection with optimized kernel function parameters. During gene selection, the subset of genes retained in the model is recursively optimized and repeatedly updated according to each gene's contribution to tumor classification in the context of the other genes.
On nine two-class tumor gene expression data sets, BMSF achieved prediction accuracy far higher than that reported in the literature while using a small subset of informative genes, and the selected subset improved the prediction accuracy of several classifiers at the same time.
(2) A robust, training-free algorithm for high-dimensional feature selection and classification, TSG (Top-scoring genes), was developed based on the chi-square test. Prediction accuracy depends not only on feature selection but also on the classifier, and training is the main cause of overfitting in most classifiers. The mainstream top-scoring pair (TSP) family of algorithms serves as both a feature selection method and a classifier. TSG is proposed to overcome the shortcomings of that family, such as its limitations with respect to sample size, the fixed number of selected informative genes, and the cumbersome handling of multi-class problems. TSG proposes and implements direct classification based on transductive inference, with no training required. The decision process is as follows: first, assume that the sample to be tested belongs to the positive (+) class and combine it with the training samples to obtain the chi-square value Chi+; then assume that it belongs to the negative (-) class and combine it with the training samples to obtain the chi-square value Chi-. If Chi+ > Chi-, the sample is assigned to the positive class, and vice versa. Multi-class problems are handled analogously.
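The decision procedure described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how expression values enter the chi-square test, so this sketch assumes each gene is binarized at its median over the pooled samples and that per-gene chi-square values are summed. The function names `chi_square_2x2`, `total_chi_square`, and `classify_direct` are hypothetical.

```python
from statistics import median


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # a row or column is empty: no association can be measured
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def total_chi_square(samples, labels):
    """Sum over genes of the chi-square between binarized expression and class.

    Assumption of this sketch: a gene is "high" when its value exceeds the
    median of that gene over all pooled samples. Labels are +1 or -1.
    """
    total = 0.0
    for g in range(len(samples[0])):
        values = [s[g] for s in samples]
        cut = median(values)
        pairs = list(zip(values, labels))
        a = sum(1 for v, y in pairs if v > cut and y == 1)    # high, positive
        b = sum(1 for v, y in pairs if v > cut and y != 1)    # high, negative
        c = sum(1 for v, y in pairs if v <= cut and y == 1)   # low, positive
        d = sum(1 for v, y in pairs if v <= cut and y != 1)   # low, negative
        total += chi_square_2x2(a, b, c, d)
    return total


def classify_direct(train_x, train_y, test_sample):
    """Training-free decision: try both labels for the test sample, keep the larger chi-square."""
    pooled = train_x + [test_sample]
    chi_pos = total_chi_square(pooled, train_y + [1])    # hypothesis: positive class
    chi_neg = total_chi_square(pooled, train_y + [-1])   # hypothesis: negative class
    label = 1 if chi_pos > chi_neg else -1               # ties fall to the negative class
    return label, chi_pos, chi_neg


if __name__ == "__main__":
    # Toy data: gene 0 is high in the positive class, gene 1 is high in the negative class.
    train_x = [[9, 1], [8, 2], [7, 3], [1, 9], [2, 8], [3, 7]]
    train_y = [1, 1, 1, -1, -1, -1]
    print(classify_direct(train_x, train_y, [8.5, 1.5]))
```

No model is fitted at any point. The test sample is simply pooled with the labeled samples twice, once per hypothesized label, which is what makes the procedure transductive.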
TSG selects features as follows: the gene with the highest score forms the initial informative gene subset; then, one at a time, the remaining gene that combines best with the genes already selected is added to the subset; the final subset is determined automatically from the leave-one-out accuracy on the training set. TSG was evaluated by independent prediction on 9 two-class and 10 multi-class data sets. Notably, its leave-one-out accuracy on the training set is very close to its accuracy on the independent test set, and on some data sets the independent test accuracy is even higher than the training-set accuracy. This shows that TSG's distinctive training-free direct classification effectively controls overfitting.
(3) A new informative gene selection and classification method, χ²-IRG-DC (Chi-square test-based Integrated Rank Gene and Direct Classifier), was developed based on gene interactions and the chi-square test. Feature selection in χ²-IRG-DC proceeds as follows: first, the chi-square values of single genes and of pairwise gene interactions are used to compute an integrated weighted score for each gene, which ranks the genes by importance; next, the ranked genes are introduced one by one using the χ²-DC classifier, with the leave-one-out accuracy on the training set as the first criterion and the chi-square gain as the second criterion for removing redundancy, which yields a more robust informative gene subset; finally, independent prediction is carried out with χ²-DC and the selected informative genes. The integrated weighted score greatly reduces the complexity of the algorithm, and the second criterion, chi-square gain, makes feature selection more robust.
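The stepwise selection described above, where genes are added one at a time and judged by leave-one-out accuracy on the training set, can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's algorithm: the classifier is a simple nearest-centroid stand-in rather than the chi-square direct classifier, the seed gene is chosen by leave-one-out accuracy rather than a chi-square score, and the stopping rule (stop when accuracy no longer improves) is assumed. All function names are hypothetical.

```python
def nearest_centroid(train_x, train_y, sample):
    """Stand-in classifier (an assumption of this sketch): label of the closest class mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(x, y, genes, classify):
    """Leave-one-out accuracy on the training set, using only the listed genes."""
    sub = [[row[g] for g in genes] for row in x]
    hits = 0
    for i in range(len(sub)):
        rest_x = sub[:i] + sub[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += classify(rest_x, rest_y, sub[i]) == y[i]
    return hits / len(sub)


def forward_select(x, y, classify=nearest_centroid, max_genes=10):
    """Greedy forward selection driven by leave-one-out accuracy."""
    n_genes = len(x[0])
    # Seed with the gene that performs best on its own.
    seed = max(range(n_genes), key=lambda g: loo_accuracy(x, y, [g], classify))
    selected = [seed]
    best = loo_accuracy(x, y, selected, classify)
    while len(selected) < min(max_genes, n_genes):
        candidates = [g for g in range(n_genes) if g not in selected]
        # Pick the candidate that combines best with the genes already selected.
        gene = max(candidates,
                   key=lambda g: loo_accuracy(x, y, selected + [g], classify))
        acc = loo_accuracy(x, y, selected + [gene], classify)
        if acc <= best:  # assumed stopping rule: no improvement, so stop
            break
        selected.append(gene)
        best = acc
    return selected, best


if __name__ == "__main__":
    # Toy data: gene 0 separates the two classes, gene 1 does not.
    x = [[1, 5], [2, 3], [1.5, 4], [8, 4], [9, 5], [8.5, 3]]
    y = [0, 0, 0, 1, 1, 1]
    print(forward_select(x, y))
```

Passing the classifier as a parameter keeps the selection loop independent of how samples are classified, so a chi-square-based direct classifier could be substituted without changing `forward_select`.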
Independent prediction accuracy on 9 two-class and 10 multi-class tumor gene expression data sets shows that the χ²-IRG-DC model outperforms results reported in the literature. As a feature selection method, χ²-IRG-DC is clearly superior to four reference feature selection methods: mRMR, SVM-RFE, HC-K-TSP, and TSG. As a classifier, χ²-DC outperforms reference classifiers such as NB and KNN. The methods developed in this thesis have theoretical and practical value for advancing high-dimensional feature selection and tumor classification.
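[Degree-granting institution]: Hunan Agricultural University (湖南农业大学)
[Degree level]: Doctoral
[Year degree conferred]: 2015
[Classification number]: R730.2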
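Related doctoral dissertations (top 1)
1 Zhang Hongyan (张红燕); Research on Tumor Informative Gene Selection and Classification Methods [D]; Hunan Agricultural University; 2015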
Article ID: 2453761
Article link: https://www.wllwen.com/yixuelunwen/zlx/2453761.html