
Research on Informative Gene Selection and Classification Methods for Tumors

Published: 2019-04-04 11:23
[Abstract]: Tumors result from the combined action of multiple genes and the environment, and the emergence and rapid development of large-scale gene expression profiling has provided a new technology platform for tumor research. Data mining based on gene expression profiles is of great significance for discovering pathogenic genes, diagnosing tumors clinically, judging drug efficacy, and elucidating pathogenesis. Tumor gene expression data are typically high-dimensional, with small or relatively small sample sizes, large differences in sample background, non-random noise such as batch effects, high redundancy, non-linearity, and interaction effects among genes, so traditional statistical and pattern recognition methods are of limited use. Addressing these characteristics, this thesis studies informative gene selection methods and classifier construction. The main results are as follows. (1) Based on the support vector machine (SVM), a new high-dimensional feature selection method, the Binary Matrix Shift Filter (BMSF), was developed. Most informative gene selection methods consider only single genes or gene pairs and ignore interactions among multiple genes. The proposed BMSF algorithm takes multi-gene interactions into account: by introducing a randomly generated intermediate (0,1) binary matrix, it converts the classification problem into a regression problem and performs SVM-based high-dimensional feature selection with kernel parameter optimization. During gene selection, the gene subset retained in the model is recursively optimized and repeatedly updated according to each gene's contribution to the other genes in tumor classification.
On 9 binary-class cancer gene expression data sets, BMSF achieved leave-one-out prediction accuracy far better than that reported in the literature while using smaller informative gene subsets, and the selected subsets simultaneously improved the leave-one-out accuracy of several classifiers. (2) Based on the chi-square test, a new algorithm, TSG (Top-scoring genes), was developed for robust high-dimensional feature selection and training-free direct classification. Prediction accuracy depends on both feature selection and the classifier, and training is the main reason most classifiers overfit. The mainstream TSP (Top score pairs) family of algorithms serves as both a feature selection method and a classifier. TSG was proposed to overcome the shortcomings of TSP: it cannot reflect sample size, it always selects an even number of informative genes, and it is cumbersome for multi-class problems. TSG proposes and implements training-free direct classification based on transductive inference. Its decision process is as follows: first assume that a test sample belongs to the positive (+) class and merge it with the training samples to obtain the chi-square value Chi+; then assume that it belongs to the negative (-) class and merge it with the training samples to obtain the chi-square value Chi-; if Chi+ > Chi-, the test sample is assigned to the positive class, otherwise to the negative class. Multi-class problems are handled analogously.
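The decision process described above can be illustrated with a minimal sketch. The abstract does not say how the chi-square value is computed from expression data, so this sketch assumes a TSP-style 2x2 contingency table for a single gene pair (class versus whether gene `i` is expressed above gene `j`); the function names `chi_square`, `pair_table`, and `direct_classify` are hypothetical and are not taken from the thesis.

```python
def chi_square(table):
    """Pearson chi-square statistic of a contingency table (no continuity correction)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_sums[r] * col_sums[c] / total
            if expected > 0:  # skip cells whose row or column is empty
                stat += (observed - expected) ** 2 / expected
    return stat


def pair_table(samples, labels, i, j):
    """ASSUMPTION: rows = class (1, then 0); columns = (x[i] > x[j], x[i] <= x[j])."""
    table = [[0, 0], [0, 0]]
    for x, y in zip(samples, labels):
        row = 0 if y == 1 else 1
        col = 0 if x[i] > x[j] else 1
        table[row][col] += 1
    return table


def direct_classify(train_x, train_y, new_x, i, j):
    """Try each label for the new sample; keep the label giving the larger chi-square."""
    scores = {}
    for label in (1, 0):
        table = pair_table(train_x + [new_x], train_y + [label], i, j)
        scores[label] = chi_square(table)
    # Ties fall to the negative class in this sketch.
    return 1 if scores[1] > scores[0] else 0


# Toy data (made up): gene 0 exceeds gene 1 in the positive class only.
train_x = [[5, 1], [6, 2], [7, 1], [1, 5], [2, 6], [1, 7]]
train_y = [1, 1, 1, 0, 0, 0]
print(direct_classify(train_x, train_y, [8, 2], 0, 1))
print(direct_classify(train_x, train_y, [1, 9], 0, 1))
```

No model is fitted here: the test sample is scored only by how its tentative label changes the chi-square statistic of the merged data, which is the sense in which the classification is "training-free".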
TSG selects features as follows: the highest-scoring gene pair, TS2, is taken as the initial informative gene subset; then, at each step, the remaining gene with the best joint effect with the already selected genes is added to the subset; the final subset is determined automatically from the leave-one-out accuracy on the training set. In independent prediction on 9 binary-class and 10 multi-class data sets, TSG obtained results clearly better than those reported in the literature. In particular, its training-set leave-one-out accuracy was very close to its independent test accuracy, and on some data sets the independent test accuracy even exceeded the training-set leave-one-out accuracy, indicating that TSG's distinctive training-free direct classification effectively controls overfitting. (3) Based on interaction and the chi-square test, a new informative gene selection method, χ²-IRG-DC (Chi-square test-based Integrated Rank Gene and Direct Classifier), was developed. Its feature selection proceeds as follows: first, single-gene chi-square values and pairwise gene interaction chi-square values are used to compute a combined weighted score for each gene, which gives an importance ranking of the genes; next, the ranked genes are introduced sequentially using the χ²-DC classifier, and redundancy is removed with training-set leave-one-out accuracy as the first criterion and chi-square gain as the second, yielding a more robust informative gene subset; finally, independent prediction is performed with χ²-DC and the selected informative genes. While inheriting the advantages of TSG, χ²-IRG-DC greatly reduces algorithmic complexity through the combined weighted gene score and strengthens the robustness of feature selection by introducing chi-square gain as the second criterion.
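The sequential introduction of ranked genes under two criteria can be sketched as below. This is only one possible reading of the description: a gene is kept if it raises leave-one-out accuracy, or if it ties on accuracy and raises the chi-square value. The function `sequential_select`, its callable parameters, and the toy scores are all hypothetical; the thesis's actual acceptance rule and scoring functions are not given in the abstract.

```python
def sequential_select(ranked_genes, loo_accuracy, chi_square_score):
    """Add genes in rank order, keeping only those that pass the two criteria.

    ranked_genes     -- gene identifiers, most important first
    loo_accuracy     -- callable(subset) -> leave-one-out accuracy (first criterion)
    chi_square_score -- callable(subset) -> chi-square value (second criterion)
    """
    selected = []
    best_acc = float("-inf")
    best_chi = float("-inf")
    for gene in ranked_genes:
        candidate = selected + [gene]
        acc = loo_accuracy(candidate)
        chi = chi_square_score(candidate)
        improves_accuracy = acc > best_acc
        ties_with_gain = acc == best_acc and chi > best_chi
        if improves_accuracy or ties_with_gain:
            selected, best_acc, best_chi = candidate, acc, chi
        # Otherwise the gene is treated as redundant and skipped.
    return selected


# Made-up scores keyed by gene subset, for illustration only.
TOY_ACC = {("g1",): 0.8, ("g1", "g2"): 0.8, ("g1", "g3"): 0.9, ("g1", "g3", "g4"): 0.9}
TOY_CHI = {("g1",): 10.0, ("g1", "g2"): 9.0, ("g1", "g3"): 12.0, ("g1", "g3", "g4"): 15.0}

result = sequential_select(
    ["g1", "g2", "g3", "g4"],
    lambda subset: TOY_ACC[tuple(subset)],
    lambda subset: TOY_CHI[tuple(subset)],
)
print(result)
```

In the toy run, `g2` is skipped because it neither improves accuracy nor adds chi-square gain, while `g4` is kept on the second criterion alone.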
Independent prediction accuracy on 9 binary-class and 10 multi-class tumor gene expression data sets shows that the χ²-IRG-DC model is clearly better than results reported in the literature. As a feature selection method, χ²-IRG-DC is clearly better than the four reference feature selection methods mRMR, SVM-RFE, HC-K-TSP, and TSG; as a classifier, χ²-DC is clearly better than reference classifiers such as NB and KNN and comparable to the SVM classifier. The methods in this thesis have theoretical significance and practical value for advancing high-dimensional feature selection and tumor classification.
[Degree-granting institution]: Hunan Agricultural University
[Degree level]: Doctoral
[Year degree conferred]: 2015
[Classification number]: R730.2

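Related doctoral dissertation

1 张红燕 (Zhang Hongyan); Research on Informative Gene Selection and Classification Methods for Tumors [D]; Hunan Agricultural University; 2015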
Article ID: 2453761

Article link: https://www.wllwen.com/yixuelunwen/zlx/2453761.html