
Research on Test-Cost-Sensitive Bayesian Classifiers

【Abstract】: Classification, a central research topic in data mining and machine learning, has attracted wide attention and has been applied in practical areas such as customer churn prediction, intrusion detection, medical diagnosis, and text classification. Traditional classification research and its applications usually assume that instance data are already stored in a database, or can be obtained free of charge and used at will; the goal is therefore to build a classification model that maximizes predictive accuracy, using methods such as Bayesian networks, decision trees, artificial neural networks, and support vector machines. In most real applications, however, this assumption does not hold: acquiring each attribute value of an instance incurs a cost in money, time, risk, and so on, referred to as the test cost. To turn traditional algorithms into usable real-world systems, researchers must not only maximize classification accuracy but also try to minimize the test cost the model requires, which makes test-cost-sensitive learning important.

Test-cost-sensitive learning must optimize classification accuracy and test cost simultaneously and is therefore a typical multi-objective optimization problem. It can be solved directly with a multi-objective optimization algorithm that searches for a set of Pareto-frontier solutions, or it can be converted into a single-objective optimization problem. The latter strategy can take two forms (both are sketched after this abstract): 1) transform the problem into a single-objective constrained optimization problem, treating classification accuracy as the constraint and test cost as the objective function; 2) combine the multiple objectives into a single new objective function, integrating classification accuracy and test cost into one objective over which the optimal solution is searched.

With the explosive growth of data in the information society, data dimensionality has also grown exponentially. Too many attributes increase an algorithm's storage consumption and time complexity, and large numbers of irrelevant or redundant attributes can even reduce its final classification accuracy. Attribute selection, which chooses the best attribute subset from the original attribute space, has become one of the main directions for improving the naive Bayes classifier, and a large body of research shows that it significantly improves naive Bayes accuracy. Existing work, however, rarely combines the attribute-selection problem of naive Bayes with test-cost sensitivity to study test-cost-sensitive naive Bayes classifiers specifically.

Taking the naive Bayes classifier as the basic research object, this thesis uses the two approaches above to convert the multi-objective problem of test-cost-sensitive learning into a single-objective one, and proposes a constrained-optimization-based test-cost-sensitive naive Bayes classifier (COTCSNB) and an optimization-objective-based test-cost-sensitive naive Bayes classifier (OOTCSNB). Experiments on the WEKA platform show that both new algorithms maintain high classification accuracy while minimizing the model's test cost. Finally, several medical diagnosis problems are used to examine in detail how the new algorithms transfer to practical problems and how they perform there.

The main innovations and contributions of the thesis are:
1) A new constrained-optimization-based test-cost-sensitive naive Bayes classifier (COTCSNB). In a traditional greedy search, each attribute-selection step chooses the attribute that most improves the classifier's accuracy, aiming to maximize classification performance. In cost-sensitive learning, COTCSNB instead takes "deleting an attribute must not reduce the model's classification accuracy" as the constraint; at each step of the backward attribute-selection search it deletes the attribute with the largest test cost that satisfies this constraint, and the search stops once deleting any remaining attribute would violate it.
2) A test-cost-sensitive wrapper framework for attribute selection; a new test-cost-sensitive attribute-selection objective function obtained by taking the difference between the classification-accuracy measure and the test-cost measure; and, based on this objective function and an optimal search strategy, a new optimization-objective-based test-cost-sensitive naive Bayes classifier (OOTCSNB).
3) An analysis of the test cost of obtaining pathological values in medical diagnosis. Using real diagnosis problems for heart disease, hepatitis, diabetes, and thyroid disease, the thesis studies how the new algorithms (COTCSNB, OOTCSNB) apply in practice; the experimental results show that they significantly reduce the test cost required in the diagnostic process while maintaining classification accuracy.
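As a rough formalization of the two single-objective reformulations mentioned in the abstract (the notation below is added for illustration and is not taken from the thesis): let A be the full attribute set, S ⊆ A a candidate subset, c_a the test cost of attribute a, Acc(S) the classification accuracy of the classifier restricted to S, and C(S) = \sum_{a \in S} c_a the total test cost.

% (1) Constrained form (COTCSNB-style): minimise test cost subject to an accuracy constraint
\min_{S \subseteq A} \; C(S) \quad \text{s.t.} \quad \mathrm{Acc}(S) \ge \mathrm{Acc}(A)

% (2) Combined-objective form (OOTCSNB-style): maximise accuracy minus test cost
\max_{S \subseteq A} \; \mathrm{Acc}(S) - \lambda\, \widetilde{C}(S)

Here \widetilde{C}(S) denotes the test cost rescaled to be comparable with accuracy and \lambda an assumed trade-off weight; the abstract only states that the accuracy measure and the test-cost measure are combined by taking their difference.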
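To make the COTCSNB search procedure in contribution 1) concrete, the following is a minimal sketch. It assumes scikit-learn's GaussianNB as a stand-in for the thesis's WEKA naive Bayes classifier, 10-fold cross-validated accuracy as the accuracy estimate, and a dict test_costs mapping attribute index to test cost; none of these implementation details come from the thesis itself.

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, attrs, cv=10):
    """Cross-validated accuracy of naive Bayes restricted to an attribute subset."""
    return cross_val_score(GaussianNB(), X[:, attrs], y, cv=cv).mean()

def cotcsnb_select(X, y, test_costs, cv=10):
    """Backward greedy search: repeatedly delete the most expensive attribute whose
    removal does not lower accuracy; stop when any deletion would violate the
    accuracy constraint."""
    selected = list(range(X.shape[1]))
    current_acc = cv_accuracy(X, y, selected, cv)
    while len(selected) > 1:
        candidates = []
        for a in selected:
            rest = [s for s in selected if s != a]
            acc = cv_accuracy(X, y, rest, cv)
            if acc >= current_acc:                # constraint: accuracy must not drop
                candidates.append((test_costs[a], a, acc))
        if not candidates:
            break                                 # every deletion violates the constraint
        _, attr, acc = max(candidates)            # most expensive attribute that is still safe
        selected.remove(attr)
        current_acc = acc
    return selected

# Hypothetical usage (X is a numeric numpy array, y the class labels;
# the per-attribute costs here are illustrative, not from the thesis):
# selected = cotcsnb_select(X, y, {0: 1.0, 1: 5.2, 2: 7.3})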
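Contribution 2) combines accuracy and test cost into a single objective. The sketch below is one plausible reading: the objective is taken as cross-validated accuracy minus a normalised total test cost weighted by lam, and a greedy forward search stands in for the thesis's optimal search strategy, which the abstract does not detail; these choices are assumptions for illustration, not the thesis's exact design.

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, attrs, cv=10):
    """Cross-validated accuracy of naive Bayes restricted to an attribute subset."""
    return cross_val_score(GaussianNB(), X[:, attrs], y, cv=cv).mean()

def ootcsnb_objective(X, y, attrs, test_costs, lam=1.0, cv=10):
    """Combined objective: accuracy minus a weighted, normalised total test cost."""
    acc = cv_accuracy(X, y, attrs, cv)
    cost = sum(test_costs[a] for a in attrs) / sum(test_costs.values())
    return acc - lam * cost

def ootcsnb_select(X, y, test_costs, lam=1.0, cv=10):
    """Greedy forward search that keeps adding attributes while the objective improves."""
    remaining = list(range(X.shape[1]))
    selected, best = [], float("-inf")
    while remaining:
        score, attr = max((ootcsnb_objective(X, y, selected + [a], test_costs, lam, cv), a)
                          for a in remaining)
        if score <= best:
            break                                 # no candidate improves the combined objective
        selected.append(attr)
        remaining.remove(attr)
        best = score
    return selected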
【Degree-granting institution】: China University of Geosciences
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP18





Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/2390398.html

