Research on an Improved K-Nearest Neighbor Method for Imbalanced Dataset Classification
Published: 2018-09-19 15:34
[Abstract]: In today's era of information explosion, efficiently extracting the information people need from complex and ever-changing data is an urgent problem. To address it, researchers in machine learning, artificial intelligence, and pattern recognition have conducted extensive studies, and classification is one of the major research directions. Years of effort have produced many methods with good classification performance. However, these methods mostly optimize aggregate targets such as overall error rate, precision, and recall; on imbalanced datasets, such evaluation metrics tend to depress the recognition rate of minority classes and sparsely distributed classes. Because real-world applications increasingly care about the classification accuracy of minority classes, improving the recognition rate of minority-class samples while preserving the overall classification quality of an imbalanced dataset is a topic worth studying. This thesis studies the K-nearest neighbor (KNN) method for imbalanced-dataset classification. The main contributions are as follows: (1) To address the slow classification speed of the traditional KNN method, caused by the large amount of similarity computation needed to find nearest neighbors, representative samples and thresholds are introduced. The nearest neighbors of each test sample are selected only from classes whose representative samples have a similarity to the test sample no smaller than the corresponding threshold. This reduces the amount of computation and speeds up classification without affecting accuracy.
(2) To address the low classification accuracy of the traditional KNN method on imbalanced datasets, class representativeness and sample representativeness are proposed. By assigning larger weights to highly representative nearest neighbors and to minority-class samples, the influence of majority classes and densely distributed classes on the classification decision is weakened, which improves the accuracy of the traditional KNN method on imbalanced datasets. Experiments use UCI classification datasets. Comparing the performance metrics of the traditional and improved KNN methods shows that the improved method raises classification performance to a certain extent.
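The two improvements described above can be sketched in code. This is a minimal illustration under stated assumptions, not the thesis's actual algorithm: it assumes per-class mean vectors serve as the "representative samples", cosine similarity as the similarity measure, and inverse class frequency combined with inverse neighbor distance as a stand-in for the proposed class/sample representativeness weights.

```python
import numpy as np

def class_representatives(X, y):
    """One hypothetical representative per class: the class mean vector."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def weighted_knn_predict(X_train, y_train, x, k=5, threshold=0.0):
    """Improved KNN sketch: prune candidate classes by similarity to class
    representatives, then take a representativeness-weighted neighbor vote."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Improvement (1): keep only classes whose representative is similar
    # enough to the test sample, shrinking the neighbor-search pool.
    reps = class_representatives(X_train, y_train)
    candidates = [c for c, r in reps.items() if cos(x, r) >= threshold]
    if not candidates:            # fall back to all classes if pruning is too strict
        candidates = list(reps)
    mask = np.isin(y_train, candidates)
    Xc, yc = X_train[mask], y_train[mask]

    # Find the k nearest neighbors among the surviving samples.
    d = np.linalg.norm(Xc - x, axis=1)
    idx = np.argsort(d)[:k]

    # Improvement (2): weight each neighbor's vote. Inverse class frequency
    # up-weights minority classes; inverse distance up-weights neighbors
    # that represent the test sample better.
    counts = {c: int((y_train == c).sum()) for c in np.unique(y_train)}
    votes = {}
    for i in idx:
        c = yc[i]
        w = (1.0 / counts[c]) * (1.0 / (1.0 + d[i]))
        votes[c] = votes.get(c, 0.0) + w
    return max(votes, key=votes.get)
```

On a toy imbalanced set (six majority points near the origin, two minority points near (1, 1)), a query at (0.6, 0.6) draws three majority and two minority neighbors for k=5, so a plain majority vote would pick the majority class; the weighted vote instead favors the minority class.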
【Degree-granting institution】: Southwest Jiaotong University (西南交通大学)
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP311.13
Article ID: 2250543
Link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2250543.html