Research on Ensemble Classification Methods for Class-Imbalanced and Cost-Sensitive Data

Published: 2018-05-28 21:29

  Topics: machine learning + class-imbalanced classification; Source: University of Science and Technology of China, 2017 master's thesis


[Abstract]: With the arrival of the big-data era, machine learning, as the theoretical cornerstone of modern data analysis, plays a vital role while also facing challenges of many kinds. Classification, one of the most basic and central problems in machine learning, continues to attract close attention from academia. Traditional classification algorithms generally rest on two assumptions: first, that the classes have roughly equal numbers of samples; second, that misclassification costs are roughly equal across classes. In the real world, however, datasets often exhibit class imbalance and cost sensitivity, which makes traditional accuracy-based classification algorithms unsuitable. Class imbalance means that the number of samples is unevenly distributed across classes; cost sensitivity means that misclassification costs differ greatly between classes. On class-imbalanced datasets, traditional algorithms, in pursuit of high accuracy, tend to misclassify minority-class samples, yet these samples are often the more important ones. On cost-sensitive datasets, traditional algorithms are insensitive to misclassification cost and cannot minimize the total misclassification cost.
Because both problems are widespread and important in practice, researchers in China and abroad have studied them extensively and in depth and have proposed a wide variety of solutions. In summary, these methods address the problems at two levels: at the data level, by reconstructing the training set to change the sample distribution, typically through resampling; and at the algorithm level, by redesigning existing algorithms to suit the two problems, typically through cost-sensitive learning and Boosting-based methods. Ensemble learning plays a pivotal role among these methods. After more than a decade of research, the field has made remarkable progress, but problems such as overfitting and information loss remain and affect the stability and reliability of classification models.
This thesis addresses class imbalance and cost sensitivity and makes the following two contributions:
· It proposes two resampling-based ensemble classification methods, xEnsemble and RSEnsemble. It first introduces the theoretical foundations of the two methods, then improves on existing algorithms, and finally proves their effectiveness theoretically from the perspectives of the bias-variance decomposition and the error-ambiguity decomposition, respectively.
· It applies xEnsemble and RSEnsemble to a real diabetes diagnosis dataset. The dataset is large, highly class-imbalanced, and cost-sensitive. The evaluation criteria are defined first, the dataset is then preprocessed, and the final experimental results show that the two methods achieve better classification performance than other similar methods.
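The general idea of combining resampling with an ensemble, and of judging a classifier by total misclassification cost instead of accuracy, can be illustrated with a minimal sketch. This is not xEnsemble or RSEnsemble, whose details the abstract does not give; it is plain random undersampling with majority voting. The synthetic 1-D data, the nearest-centroid base learner, the ensemble size, and the costs (10 for a missed minority sample, 1 for a false alarm) are all assumptions chosen for illustration.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Total misclassification cost; label 1 is the minority class."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn * cost_fn + fp * cost_fp

def balanced_subset(X, y, rng):
    """Keep every minority sample plus an equally sized random draw of majority samples."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    chosen = minority + rng.sample(majority, len(minority))
    return [X[i] for i in chosen], [y[i] for i in chosen]

def fit_centroids(X, y):
    """Toy base learner (assumption): one mean per class for 1-D features."""
    means = {}
    for label in (0, 1):
        values = [x for x, t in zip(X, y) if t == label]
        means[label] = sum(values) / len(values)
    return means

def predict_centroids(means, x):
    """Assign x to the class whose mean is closer."""
    return 1 if abs(x - means[1]) < abs(x - means[0]) else 0

def fit_ensemble(X, y, n_members, seed=0):
    """Train each member on its own balanced random subset."""
    rng = random.Random(seed)
    return [fit_centroids(*balanced_subset(X, y, rng)) for _ in range(n_members)]

def predict_ensemble(members, x):
    """Majority vote over the members."""
    votes = sum(predict_centroids(m, x) for m in members)
    return 1 if 2 * votes > len(members) else 0

# Synthetic, hypothetical data: 950 majority samples (label 0), 50 minority (label 1).
data_rng = random.Random(42)
X = [data_rng.gauss(0.0, 1.0) for _ in range(950)]
X += [data_rng.gauss(3.0, 1.0) for _ in range(50)]
y = [0] * 950 + [1] * 50

# Baseline that always predicts the majority class: high accuracy, every minority sample missed.
baseline_pred = [0] * len(y)

members = fit_ensemble(X, y, n_members=11)
ensemble_pred = [predict_ensemble(members, x) for x in X]

COST_FN, COST_FP = 10, 1  # assumed costs, not taken from the thesis
print("baseline:", accuracy(y, baseline_pred), total_cost(y, baseline_pred, COST_FN, COST_FP))
print("ensemble:", accuracy(y, ensemble_pred), total_cost(y, ensemble_pred, COST_FN, COST_FP))
```

The always-majority baseline reaches 95% accuracy while misclassifying all 50 minority samples, which is the failure mode the abstract attributes to accuracy-driven classifiers. Training each member on a different balanced subset is one common way to limit the information loss of a single undersampling pass, since different majority samples are seen by different members.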
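[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181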




Article ID: 1948257


Article link: https://www.wllwen.com/shoufeilunwen/xixikjs/1948257.html

