Research on SVM-Based Imbalanced Data Classification Algorithms and Their Applications

Published: 2018-06-25 09:21

Topic: SVM + imbalanced data classification; source: Huaqiao University, master's thesis, 2017

[Abstract]: With the development of computer and information technology, large volumes of data are generated every day in production and daily life. Discovering the knowledge and regularities in these data, and using them for classification and prediction, has become an important research topic in artificial intelligence and machine learning. The support vector machine (SVM) is a classification algorithm based on statistical learning theory and the structural risk minimization principle. Its decision function is determined by only a small number of support vectors, so adding or removing some non-support-vector samples does not affect the model's performance. Compared with traditional classification algorithms, SVM generalizes well, is not prone to local minima, is well suited to high-dimensional, small-sample problems, and handles classification on balanced datasets effectively. When the two classes are unevenly distributed, however, SVM shows two weaknesses. First, because SVM is based on soft-margin maximization, the separating hyperplane in the boundary region is skewed toward the minority class. Second, the imbalance ratio among the support vectors means that the neighborhood of a test sample is filled with more negative support vectors. To address these difficulties, this thesis studies imbalanced data classification at both the data level and the algorithm level, and applies the resulting algorithms to microblog (Weibo) sentiment classification. The main work covers three aspects. (1) At the data level, a resampling method named BADASYN, based on adaptive synthesis of class-boundary samples, is proposed. The algorithm first identifies the minority-class samples in the class-boundary region, then adaptively synthesizes new minority-class samples according to their distribution and adds them to the training set. For an SVM trained on a dataset resampled with BADASYN, the support vectors consist mainly of the newly synthesized samples, which ultimately moves the separating hyperplane toward the majority-class samples. (2) At the algorithm level, a selective ensemble learning method named NCAB-SVM, based on negative correlation learning and the AdaBoost SVM algorithm, is proposed. Negative correlation learning is integrated into the AdaBoost SVM training process in order to train a set of strong, highly diverse SVM classifiers and combine them into a still stronger ensemble, that is, a combination of strong learners. The algorithm uses negative correlation learning to compute the correlation between base classifiers, adaptively adjusts the weight of each base classifier according to that correlation, and thereby obtains the weighted decision classifier. (3) To address the imbalance in both sample distribution and feature distribution that arises in microblog sentiment classification, the data-level and algorithm-level methods are combined, and the SVM-based imbalanced data classification algorithms are used to classify the sentiment polarity of microblog posts. First, BADASYN is used to adaptively synthesize minority-class samples and adjust the degree of imbalance in the training set. Next, NCAB-SVM is used to train a series of SVM base classifiers, which are selectively combined into the decision system. Finally, the method is evaluated on Sina Weibo datasets crawled from different domains and on a public evaluation dataset.
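
The data-level idea in the abstract (find minority-class samples near the class boundary, then synthesize new minority samples adaptively from their distribution) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the thesis's BADASYN algorithm, whose exact rules the abstract does not give. It borrows two well-known ideas as stand-ins: a Borderline-SMOTE-style boundary test (a minority point counts as a boundary point if some, but not all, of its k nearest neighbors are majority points) and an ADASYN-style allocation (boundary points with more majority neighbors are chosen as seeds more often). The function name, its parameters, and the boundary criterion are all hypothetical.

```python
import math
import random


def borderline_adaptive_oversample(minority, majority, n_new, k=5, seed=0):
    """Illustrative stand-in for boundary-aware adaptive oversampling.

    NOT the thesis's BADASYN: the boundary test and the weighting rule
    below are assumptions made for this sketch.
    minority, majority: lists of equal-length numeric tuples.
    Returns a list of n_new synthetic minority points (or [] if no
    boundary point is found).
    """
    rng = random.Random(seed)
    # Label 0 = minority, 1 = majority; keep the minority index to skip self.
    labelled = [(p, 0, i) for i, p in enumerate(minority)]
    labelled += [(p, 1, None) for p in majority]

    # Step 1: for each minority point, the fraction of majority points
    # among its k nearest neighbours (Euclidean distance).
    ratios = []
    for i, p in enumerate(minority):
        neighbours = sorted(
            (q for q in labelled if q[2] != i),
            key=lambda q: math.dist(p, q[0]),
        )[:k]
        ratios.append(sum(q[1] for q in neighbours) / max(len(neighbours), 1))

    # Assumed boundary criterion: mixed neighbourhood (neither all-minority
    # nor all-majority, the latter being treated as likely noise).
    border = [i for i, r in enumerate(ratios) if 0 < r < 1]
    if not border or len(minority) < 2:
        return []

    # Step 2: pick seeds, favouring points with more majority neighbours.
    seeds = rng.choices(border, weights=[ratios[i] for i in border], k=n_new)

    # Step 3: interpolate each seed toward one of its nearest minority
    # neighbours, as in SMOTE-style synthesis.
    synthetic = []
    for i in seeds:
        p = minority[i]
        mates = sorted(
            (q for j, q in enumerate(minority) if j != i),
            key=lambda q: math.dist(p, q),
        )[:k]
        q = rng.choice(mates)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic
```

The synthetic points would then be appended to the minority class before training the SVM. Because each new point is an interpolation between two existing minority samples, it always lies within the bounding box of the minority class.
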
[Degree-granting institution]: Huaqiao University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP18
