
Research on a Colorectal Cancer Prediction Model Based on Feature Selection and Ensemble Learning

Published: 2018-02-05 03:06

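  Keywords: colorectal cancer; feature selection; ensemble learning; HELM algorithm  Source: Southwest University, 2017 master's thesis  Document type: degree thesis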


[Abstract]: Colorectal cancer (CRC) is one of the most common and most dangerous malignant tumors worldwide. Its high-incidence regions are concentrated in economically developed Western countries, including Europe and America, New Zealand, and Australia. Although China has traditionally been a low-incidence region, the incidence of colorectal cancer in China is rising year by year as lifestyles and dietary habits become increasingly Westernized. This seriously endangers people's health and also affects their quality of life. Although colorectal cancer has long been one of the most harmful tumors globally, its etiology and pathogenesis are still not fully understood. A large body of epidemiological research indicates that the development of colorectal cancer is a complex process, influenced not only by environmental factors, genetic factors, and other factors individually, but possibly also by the interactions among them. There is still no consensus, however, on exactly which environmental factors, genetic factors, or interactions affect the occurrence and progression of colorectal cancer. It is therefore important to build a colorectal cancer prediction model and to study how multiple factors, including environment, diet, and genetic susceptibility, influence the disease. Based on colorectal cancer case-control sample data provided by the Third Military Medical University, this thesis uses machine learning methods to build a colorectal cancer prediction model, providing a reliable basis for the early diagnosis and prevention of colorectal cancer. The main work of the thesis is as follows:

1. A multi-aspect feature selection method is proposed. Because the data are high-dimensional, and in order to reduce the computational complexity of the model, the thesis reduces the dimensionality of the data in two ways: the Relief feature selection algorithm and a correlation test. Relief is used to compute a weight for each feature; features with small weights are removed and features with large weights are retained as a feature subset. A correlation analysis is then run on the subset produced by Relief: for each highly correlated pair of features, only the feature with the larger weight is kept and the one with the smaller weight is removed. The result is a subset of high-weight, mutually uncorrelated features, which the thesis calls the optimal feature subset.

2. A hybrid ensemble learning model (HELM) is proposed. The HELM algorithm builds on the classic ensemble learning algorithm AdaBoost. To improve the generalization ability of AdaBoost, the thesis studies how to increase the diversity of AdaBoost's base classifiers and proposes the HELM method. HELM combines homogeneous and heterogeneous ensemble methods: base classifiers of different types are each used to train an AdaBoost homogeneous ensemble classifier, and these AdaBoost ensemble classifiers are then combined, as base classifiers, into the final HELM model. The results show that the HELM algorithm performs well.

3. A CRC prediction model is built. The overall prediction model has four parts: (1) Data collection and preprocessing, completed in two steps. The data are first cleaned (noise removal, handling of missing values, and so on). Then, under the guidance of professors and experts in colorectal cancer research at the Third Military Medical University, the sample attributes, which span more than one hundred dimensions, are grouped from a biological perspective into four categories: gene loci (SNPs), demographic characteristics, lifestyle, and food. (2) Feature selection. Features are selected from two aspects, namely each feature's contribution to classification (Relief feature selection) and the redundancy between features (correlation test), to obtain the optimal features. (3) Classification prediction. The proposed HELM algorithm is used to classify the data. (4) Comparative analysis. Related algorithms are compared with the HELM classification algorithm.

In summary, the thesis combines Relief-based feature selection with correlation-test-based feature selection and, together with the proposed HELM algorithm, builds a CRC prediction model that can predict colorectal cancer effectively. Comparison with related algorithms shows that the model has good stability and generalization ability. In the future, the model could be applied to etiological studies of other complex diseases.
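The two-stage feature selection in item 1 (Relief weighting, then a correlation filter) can be sketched as below. This is a minimal illustration and not the thesis's implementation: the abstract does not say which Relief variant, which correlation measure, or which cut-offs were used, so the basic binary-class Relief, the Pearson correlation, and the `keep` and `corr_threshold` parameters are all assumptions, and the function names `relief_weights` and `select_features` are hypothetical. The data are synthetic.

```python
import numpy as np


def relief_weights(X, y, seed=0):
    """Basic Relief for binary labels (assumes >= 2 samples per class).

    A feature gains weight when it differs on the nearest sample of the
    other class (miss) and loses weight when it differs on the nearest
    sample of the same class (hit).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                      # constant features: avoid 0/0
    Xs = (X - X.min(axis=0)) / span            # scale every feature to [0, 1]
    n, d = Xs.shape
    w = np.zeros(d)
    for i in rng.permutation(n):               # visit every sample once
        diff = np.abs(Xs - Xs[i])              # per-feature differences
        dist = diff.sum(axis=1)                # Manhattan distance
        dist[i] = np.inf                       # never pick the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        w += (diff[miss] - diff[hit]) / n
    return w


def select_features(X, y, keep=10, corr_threshold=0.8):
    """Keep the `keep` heaviest features, then drop the lighter feature of
    every pair whose absolute Pearson correlation reaches the threshold."""
    X = np.asarray(X, dtype=float)
    w = relief_weights(X, y)
    order = np.argsort(w)[::-1][:keep]         # heaviest feature first
    corr = np.abs(np.corrcoef(X[:, order], rowvar=False))
    kept = []                                  # positions within `order`
    for pos in range(len(order)):
        # Earlier positions carry larger weights, so a later feature is
        # discarded if it is highly correlated with any kept one.
        if all(corr[pos, p] < corr_threshold for p in kept):
            kept.append(pos)
    return [int(order[p]) for p in kept], w


# Synthetic check: feature 0 is informative, feature 1 is a near-copy of
# feature 0, and features 2-5 are pure noise.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
x0 = y + 0.1 * rng.normal(size=200)
x1 = x0 + 0.01 * rng.normal(size=200)
X = np.column_stack([x0, x1, rng.normal(size=(200, 4))])
selected, weights = select_features(X, y, keep=3, corr_threshold=0.8)
print("selected feature indices:", selected)
```

On this toy data the two informative columns receive the largest weights, and the correlation filter keeps only one of them because they are nearly identical. Real SNP or questionnaire data would likely need a categorical-aware distance and a tuned threshold.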
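The structure of HELM described in item 2 (one AdaBoost ensemble per type of base classifier, with those ensembles then combined) can be sketched with scikit-learn as below. This is an illustration under stated assumptions, not the thesis's model: the abstract names neither the base classifier types nor the rule for combining the AdaBoost ensembles, so the three base learners, the majority vote, and the helper name `build_helm` are hypothetical choices, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def build_helm(n_rounds=30, seed=0):
    """Hybrid ensemble sketch: homogeneous AdaBoost ensembles, one per
    base-learner type, combined heterogeneously by majority vote."""
    # Assumed base learners; each supports sample weights, which AdaBoost needs.
    base_learners = {
        "stump": DecisionTreeClassifier(max_depth=1),
        "naive_bayes": GaussianNB(),
        "logistic": LogisticRegression(max_iter=1000),
    }
    # Homogeneous level: boost each learner type separately. The base
    # learner is passed positionally because its keyword name changed
    # between scikit-learn releases.
    boosted = [
        (name, AdaBoostClassifier(learner, n_estimators=n_rounds, random_state=seed))
        for name, learner in base_learners.items()
    ]
    # Heterogeneous level: majority vote over the boosted ensembles
    # (the combination rule is an assumption).
    return VotingClassifier(estimators=boosted, voting="hard")


# Synthetic stand-in for the case-control data.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, class_sep=2.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
helm = build_helm().fit(X_train, y_train)
accuracy = helm.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Boosting each learner type separately gives members that make different kinds of errors, which is the diversity argument the abstract gives for HELM. A weighted vote or a stacked meta-classifier would be equally plausible ways to combine them.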

[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification numbers]: R735.34; TP181





