A Redundancy-Removed Sampling Method for Imbalanced Data Classification
Published: 2018-11-19 21:38
[Abstract]: Undersampling is a common technique for imbalanced data classification. Traditional sampling methods (such as Kennard-Stone sampling and density-preserving sampling) consider only preserving the data distribution. Existing undersampling methods focus on drawing samples near the classification boundary, which may alter the original distribution of the data and thereby degrade classification performance. This paper proposes the concept of data redundancy: a majority-class sample that lies in a dense region of the majority class and is far from the classification boundary or from minority-class samples has high redundancy. Redundancy-removed Sampling (RRS) applies traditional sampling rules to remove the majority-class samples with relatively high redundancy. The resulting subset retains, as far as possible, the samples most helpful for classification while preserving the original data distribution, and the two classes are relatively balanced in size. Experimental results show that classification after RRS achieves higher overall accuracy than other sampling methods, especially on datasets where classification accuracy is low. The accuracy on minority-class samples also improves.
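The abstract gives no formula for redundancy, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' RRS procedure. The score used here (distance to the nearest minority sample divided by the mean distance to the `k` nearest majority neighbours) and the function names `redundancy_scores` and `remove_redundant` are hypothetical. The sketch also simply keeps the least redundant majority samples, whereas the paper applies a traditional sampling rule that is not specified in the abstract.

```python
import math


def redundancy_scores(majority, minority, k=3):
    """Hypothetical redundancy score for each majority-class sample.

    The score is high when a point lies in a dense majority region
    (small distance to its majority neighbours) and is far from every
    minority sample. This formula is an assumption for illustration.
    """
    scores = []
    for i, p in enumerate(majority):
        # Distances to the other majority samples, nearest first.
        d_major = sorted(math.dist(p, q) for j, q in enumerate(majority) if j != i)
        neighbours = d_major[:k] or [0.0]
        mean_knn = sum(neighbours) / len(neighbours)  # small value = dense region
        # Distance to the nearest minority sample (large = far from the boundary).
        d_minor = min(math.dist(p, q) for q in minority)
        scores.append(d_minor / (mean_knn + 1e-12))
    return scores


def remove_redundant(majority, minority, k=3, n_keep=None):
    """Keep the n_keep least redundant majority samples.

    By default n_keep equals the minority-class size, so the two
    classes end up balanced. Original sample order is preserved.
    """
    if n_keep is None:
        n_keep = len(minority)
    scores = redundancy_scores(majority, minority, k)
    order = sorted(range(len(majority)), key=lambda i: scores[i])
    kept = sorted(order[:n_keep])
    return [majority[i] for i in kept]


if __name__ == "__main__":
    # Toy data: a dense majority cluster near the origin plus two
    # majority points close to the minority class.
    majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (-0.1, 0.0),
                (4.0, 4.0), (4.2, 4.5)]
    minority = [(5.0, 5.0), (5.2, 5.1)]
    print(remove_redundant(majority, minority))
```

In this toy setting the dense cluster near the origin scores far higher than the two points near the minority class, so the cluster is what gets discarded. A fuller version would combine such a score with a distribution-preserving rule such as Kennard-Stone sampling, as the abstract indicates.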
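[Author affiliations]: Research Office, Taiyuan Normal University; Department of Computer Science, Taiyuan Normal University
[Funding]: Shanxi Province Youth Science Foundation (201601D202040)
[Classification number]: O212.2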
Article ID: 2343495
Article link: https://www.wllwen.com/kejilunwen/yysx/2343495.html