A Comparative Study of Classifier Performance on Class-Imbalanced Data and Method Improvements
Published: 2018-08-30 10:47
[Abstract]: Class-imbalanced data are widespread in the real world. In some domains, correctly classifying minority-class samples matters far more than correctly classifying majority-class samples. Most classical classification algorithms, however, assume that the prior class distribution is balanced or that all misclassification costs are equal. On imbalanced data, the information carried by minority-class samples is often swamped by that of majority-class samples, so the minority-class error rate is far higher than the majority-class error rate. The classification of class-imbalanced data has therefore attracted growing attention. Because the sample counts in an imbalanced dataset are severely skewed or unevenly distributed, applying a traditional classification algorithm directly to such a dataset yields poor accuracy on the minority class. This thesis therefore applies hybrid sampling at the data level to change the class distribution and, at the algorithm level, proposes an improved selective ensemble algorithm based on a hybrid genetic algorithm. Together these improve overall classification performance and raise minority-class accuracy. The main work and results are as follows:
(1) Selecting base classifiers. On the WEKA platform, four classifiers (C4.5 decision tree, BP neural network, naive Bayes, and support vector machine) are compared in terms of classification performance and stability on balanced and imbalanced datasets.
(2) The effect of selective ensembles on balanced and imbalanced datasets. Using WEKA, the classification accuracy of single classifiers and ensemble classifiers is compared on all datasets to find the base-classifier combinations with the most room for improvement under ensemble learning. The difference in classification performance between selective and non-selective ensembles on imbalanced datasets verifies the feasibility of selective ensembles, and the difference in ensemble performance between balanced and imbalanced datasets shows that imbalanced datasets need to be modified at the data level.
(3) An integrated ensemble method for imbalanced data classification is proposed. Given the distribution characteristics of class-imbalanced data, a relatively balanced training set is built by combining SMOTE over-sampling with Bootstrap under-sampling. A hybrid genetic algorithm then selects C4.5 decision tree base classifiers for ensemble learning, improving classification of the minority class in imbalanced datasets.
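The data-level step in item (3), SMOTE over-sampling combined with Bootstrap under-sampling, can be illustrated with a minimal sketch. This is not the thesis's implementation (which the abstract does not give, and which was run on WEKA); the function names `smote`, `bootstrap_downsample`, and `hybrid_resample`, the `target` class size, and the toy Gaussian data are all assumptions made for illustration. The sketch uses only the Python standard library and omits the genetic-algorithm selection step.

```python
import random
from math import dist


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic minority points by interpolating between a seed
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        seed = rng.choice(minority)
        # k nearest neighbours of the seed among the other minority samples
        others = [p for p in minority if p is not seed]
        neighbours = sorted(others, key=lambda p: dist(seed, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment seed -> mate
        synthetic.append(tuple(s + gap * (m - s) for s, m in zip(seed, mate)))
    return synthetic


def bootstrap_downsample(majority, n_keep, rng=None):
    """Draw n_keep majority samples with replacement (a bootstrap sample)."""
    rng = rng or random.Random(0)
    return [rng.choice(majority) for _ in range(n_keep)]


def hybrid_resample(majority, minority, target, k=5, seed=0):
    """Bring both classes to `target` samples (hypothetical parameter):
    the minority class grows via SMOTE, the majority class shrinks via bootstrap."""
    rng = random.Random(seed)
    n_new = max(0, target - len(minority))
    new_minority = list(minority) + smote(minority, n_new, k, rng)
    new_majority = bootstrap_downsample(majority, target, rng)
    return new_majority, new_minority


# Toy data (assumed): 200 majority points near (0, 0), 10 minority points near (3, 3).
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
minority = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(10)]

maj, mino = hybrid_resample(majority, minority, target=50)
print(len(maj), len(mino))  # 50 50
```

Calling `hybrid_resample` repeatedly with different `seed` values would give a set of different, relatively balanced training sets, one per candidate base classifier, from which a selection procedure could then pick a subset. That pairing is an assumption about how the two steps fit together, not a detail stated in the abstract.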
[Degree-granting institution]: Dalian Maritime University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181
Article ID: 2212842
Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/2212842.html