Research on Random Forest Feature Selection and Classification Methods for Medical Data

Published: 2017-12-28 11:36

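Keywords: Research on Random Forest Feature Selection and Classification Methods for Medical Data. Source: Harbin Engineering University, 2016 doctoral dissertation. Document type: dissertation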

[Abstract]: Medical data mining is an important research direction in data mining and has long been a research focus in both computer science and medicine. In recent years, the objects of medical data mining have been shifting from clinical diagnostic data to gene chip (microarray) data. Many well-established data mining algorithms, such as decision trees, support vector machines, and artificial neural networks, have been applied in medical research. However, medical data inherently have high-dimensional feature spaces, highly redundant and highly correlated features, and imbalanced class distributions, and medical research requires that mining results be interpretable; together these make it difficult to apply traditional data mining algorithms directly to medical data mining tasks. Random forest is an ensemble machine learning algorithm based on decision trees. On the one hand, random forests are widely used in medical data analysis because they offer high classification accuracy and fast computation and can identify the main relevant features among features with small marginal effects and complex interactions. On the other hand, studies have shown that the classification ability and stability of random forests are weakened on class-imbalanced and high-dimensional datasets. To address the high dimensionality, redundancy, and correlation of features and the class imbalance of samples in medical datasets, this dissertation studies random-forest-based feature selection and data classification methods using UCI benchmark datasets, a diabetes clinical diagnosis dataset, and gene chip datasets. The main work is as follows.

First, for the class imbalance problem in medical datasets, an improved random forest algorithm is proposed that combines random resampling with replacement and ensemble learning. The algorithm first builds class-balanced datasets from the original training set by random resampling with replacement, then trains a random forest classifier on each sampled dataset, and finally determines the class of an unknown sample by majority vote over all the random forest classifiers. Experiments on UCI datasets show that, compared with traditional methods based on random undersampling and cost-sensitive learning, the proposed algorithm effectively improves classification performance, especially recall on minority-class samples.

Second, for the high-dimensional feature space and high feature correlation of clinical medical datasets, a random-forest-based Filter feature selection algorithm is proposed. The algorithm first ranks the features by random forest variable importance score, then determines the selection threshold through iterative experiments and takes the top-ranked features with the largest importance scores as the feature subset, and finally trains a classifier on the selected subset. Experiments on UCI datasets and the diabetes clinical dataset show that the algorithm based on random forest variable importance scores achieves markedly higher classification performance than existing algorithms based on measures such as feature subset discriminability and feature correlation.

Third, for highly correlated and highly redundant features in medical datasets, a Wrapper feature selection algorithm based on random forest and a combined sequential search strategy is proposed. Exploiting the ability of random forests to identify the main relevant features among features with small marginal effects and complex interactions, the algorithm uses the random forest variable importance score as the measure of feature importance, selects feature subsets with a search strategy that combines sequential backward and sequential forward search, evaluates each subset by the classification accuracy of a classifier trained on it, and finally takes the subset with the highest accuracy as the optimal feature subset. Simulation experiments on UCI datasets, the diabetes clinical dataset, and microarray expression datasets show that the proposed algorithm outperforms filter-based methods and methods based on other measures in both classification accuracy and feature subset quality.

Finally, for the large numbers of irrelevant, noisy, and redundant features in microarray expression datasets, a random forest feature selection algorithm that combines Filter and Wrapper approaches is proposed. The algorithm first uses a Filter feature selection algorithm to remove genes that are clearly unrelated to the target variable, and then uses the random-forest-based Wrapper feature selection algorithm to select the optimal feature subset. Within the Wrapper stage, a feature search strategy tailored to microarray expression data is proposed: based on random forest variable importance scores, it combines sequential forward and sequential backward selection and removes redundant and irrelevant features layer by layer. Simulation experiments on microarray expression datasets show that the proposed algorithm achieves higher classification accuracy than existing algorithms.
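The resampling-and-voting scheme described for the class imbalance problem (build class-balanced samples with replacement, train one classifier per sample, combine by majority vote) can be illustrated with a minimal sketch. This is not the dissertation's implementation. The function names, the number of ensemble members, and the choice to draw as many rows per class as the minority class contains are all assumptions made for illustration. To keep the example dependency-free, a nearest-centroid classifier stands in for the random forest that the abstract trains on each sample.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(X, y, rng, per_class=None):
    """Draw a class-balanced sample from (X, y), sampling with replacement."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    # Assumption: every class contributes as many rows as the minority class has.
    if per_class is None:
        per_class = min(len(idx) for idx in by_class.values())
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.choices(by_class[label], k=per_class))
    return [X[i] for i in chosen], [y[i] for i in chosen]


def train_centroid(X, y):
    """Stand-in base learner (NOT a random forest): nearest class centroid."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        # Return the class whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )

    return predict


def fit_balanced_ensemble(X, y, train, n_members=11, seed=0):
    """Train one base classifier per class-balanced bootstrap sample."""
    rng = random.Random(seed)
    return [train(*balanced_bootstrap(X, y, rng)) for _ in range(n_members)]


def predict_majority(members, row):
    """Combine the member predictions by majority vote."""
    votes = Counter(member(row) for member in members)
    return votes.most_common(1)[0][0]


# Toy imbalanced data: 20 majority rows near (0, 0), 4 minority rows near (3, 3).
X = [[(i % 5) * 0.1, (i // 5) * 0.1] for i in range(20)]
X += [[3.0, 3.0], [3.1, 2.9], [2.9, 3.1], [3.2, 3.0]]
y = [0] * 20 + [1] * 4

members = fit_balanced_ensemble(X, y, train_centroid)
print(predict_majority(members, [3.0, 3.1]), predict_majority(members, [0.2, 0.1]))
```

In practice the `train` argument would wrap a random forest implementation. Because every member sees equal numbers of rows from each class, the minority class is not outvoted during training, which is the mechanism the abstract credits for the improved minority-class recall.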
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Doctorate
[Year degree awarded]: 2016
[Classification number]: TP311.13

[Similar literature]