Research on Relief Feature Selection and Mixed-Kernel SVM in Disease Diagnosis

Published: 2018-01-19 16:07

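Keywords: Relief feature selection; SVM; combined optimization; mixed kernel function　Source: Taiyuan University of Technology, 2017 master's thesis　Document type: degree thesis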

[Abstract]: Medical diagnosis is the process by which a doctor examines a patient and classifies and differentiates the cause and pathogenesis of the patient's disease, as the basis for devising a treatment plan. It is essentially a classification process, also known as pattern recognition. Existing classification methods include the support vector machine (SVM), K-nearest neighbor (KNN), neural networks (NN), and decision tree algorithms. SVM is robust on pattern recognition problems involving small samples, nonlinearity, and high-dimensional data, and shows good recognition ability and adaptability. When an SVM classification model is built, its learning ability on the training samples and its generalization performance on test data are determined mainly by three factors: the preprocessing of the original data set, the choice of kernel function, and the kernel function's parameters.

The main problems with SVM classification at present are: (1) SVMs currently use a single kernel function, which is either a global kernel or a local kernel. Global kernels generalize well but learn weakly, while local kernels learn strongly but generalize weakly, so SVM classification results often cannot achieve both strong learning ability and strong generalization performance. (2) There are two main approaches to selecting SVM parameters: traditional grid search and heuristic algorithms. Grid search can always find the optimal solution but is time-consuming and inefficient; heuristic algorithms search quickly, but their solutions are less precise than those of grid search, and the genetic algorithm in particular reaches the optimum only with some probability.

To improve the classification performance of SVM, this thesis studies the following: (1) Feature selection with the Relief algorithm. In disease diagnosis, the clinical features a patient presents differ in how strongly they relate to the disease, and doctors cannot quantify the association between each feature and the disease. To diagnose more accurately, a feature selection algorithm is therefore needed to compute a weight for each feature, that is, the degree of association between each clinical symptom and the disease. (2) A linear combination of a global kernel and a local kernel is proposed, yielding a mixed kernel function with improved learning ability and generalization performance. (3) The kernel parameters are optimized in a combined search: a genetic algorithm, one of the heuristic algorithms, first quickly locates the approximate range of the optimal solution, and grid search then performs a second, precise search within that small range. This greatly reduces the computing time of grid search and also finds a better solution than the genetic algorithm alone.

The models are built with Matlab R2015b and the LIBSVM toolkit developed by Professor Chih-Jen Lin of Taiwan. The thesis covers the Matlab development environment, the interface configuration of the LIBSVM toolkit, how to set the kernel function and its parameters, how to construct the mixed kernel function, and how to carry out the combined parameter search. The disease diagnosis models are built and validated on the Heart disease and Breast cancer data sets from the public UCI repository.
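As a rough illustration of the Relief feature weighting and the mixed kernel described in the abstract, the sketch below implements both in plain Python. The thesis itself uses Matlab and LIBSVM, and the abstract does not name the specific kernels, so several choices here are assumptions: a polynomial kernel stands in for the global kernel, an RBF kernel for the local one, the mixing weight `lam` and all function names are hypothetical, and the Relief variant is the basic two-class form on features pre-scaled to [0, 1], visiting every sample instead of drawing a random subset.

```python
import math


def relief_weights(X, y):
    """Basic two-class Relief (assumed variant, not taken from the thesis).

    X: list of feature vectors scaled to [0, 1]; y: class labels.
    Each class needs at least two samples so a nearest hit exists.
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def dist(a, b):
        # Manhattan distance between two samples
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))      # nearest same-class sample
        miss = min(misses, key=lambda j: dist(X[i], X[j]))   # nearest other-class sample
        for f in range(d):
            # A useful feature differs across classes and agrees within a class.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


def poly_kernel(a, b, degree=2, coef0=1.0):
    """Polynomial kernel, used here as the assumed global kernel."""
    return (sum(p * q for p, q in zip(a, b)) + coef0) ** degree


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel, used here as the assumed local kernel."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def mixed_kernel(a, b, lam=0.5, degree=2, coef0=1.0, gamma=1.0):
    """Convex combination lam * global + (1 - lam) * local, with 0 <= lam <= 1."""
    return lam * poly_kernel(a, b, degree, coef0) + (1.0 - lam) * rbf_kernel(a, b, gamma)


# Toy data (made up for illustration): feature 0 separates the classes, feature 1 does not.
X_demo = [[0.1, 0.5], [0.2, 0.1], [0.9, 0.4], [0.8, 0.9]]
y_demo = [0, 0, 1, 1]

w = relief_weights(X_demo, y_demo)
K = [[mixed_kernel(a, b) for b in X_demo] for a in X_demo]  # Gram matrix
print("Relief weights:", w)
print("Mixed-kernel Gram matrix row 0:", K[0])
```

A Gram matrix built this way could be handed to an SVM solver that accepts precomputed kernels (LIBSVM offers this as kernel type `-t 4`), with `lam`, `degree`, and `gamma` being the kind of parameters the two-stage genetic-algorithm and grid search would tune.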
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: R44;TP18
