Research on Data Mining Classification Algorithms for Imbalanced Datasets

Published: 2018-04-28 22:10

Topic: data mining + imbalanced datasets; Source: Lanzhou University of Technology, 2017 master's thesis

[Abstract]: The 21st century is a highly information-driven era. Data, as a carrier, conceals a large amount of useful information that can be mined, and how to process data and extract valuable information has become an urgent problem. Classification is an important research branch of data mining and an important form of data analysis. In practice, the data classes that matter and are worth studying are often the rare ones; datasets containing such rare classes are referred to as imbalanced datasets. How to effectively extract minority-class data from imbalanced datasets is the focus of this thesis. The main contributions are as follows.

(1) To address the low classification accuracy of the positive class in imbalanced datasets, an algorithm that integrates C4.5 with an improved naive Bayes classifier (C4.5-INB) is proposed. First, the improved naive Bayes classification result is obtained by multiplying the majority-class probability by a proportional coefficient; the original data are then classified with the C4.5 algorithm. Based on the two classification results, the weights of the two base classifiers are determined by the equal-weight method or the optimal-combiner-priority method, and the new classification result is obtained by average voting. The three algorithms were validated on UCI datasets, and the results show that the proposed algorithm classifies more accurately and is more stable.

(2) To address the tendency of imbalanced datasets to produce noisy data and low classification accuracy during classification, an active-learning SVM classification algorithm for imbalanced datasets based on an improved SMOTE is proposed. On the training set, the algorithm uses the membership values of minority-class samples and a majority-vote selection rule to control the number of synthesized minority-class samples, and determines a hyperplane using a distance formula as the criterion. It then selects equal numbers of the nearest majority-class samples, symmetrically on both sides of the classification hyperplane, to form a balanced sampled dataset, which is classified with a support vector machine (SVM) to obtain an optimized classifier. Active learning then applies the optimized classifier iteratively to the imbalanced dataset with the training samples removed, until no samples remain. Experiments on UCI data show that the proposed algorithm effectively reduces the influence of noisy data on classification and improves classification accuracy on imbalanced datasets.

(3) To address poor classification performance on high-dimensional imbalanced datasets, a classification algorithm based on improved unsupervised linear differential projection (I-ULDP) is proposed. The algorithm first places the local patches into which a sample is divided on the same manifold, so that each sample has its own manifold space. It then constructs the minimum local embedding and the maximum global variance of each sub-manifold, and solves the resulting objective function by optimization to obtain the low-dimensional manifold embedded in the high-dimensional space. Finally, the SVM classification hyperplane is set using manifold distance, and the final classifier is obtained by training the SVM. Validation on UCI datasets shows that the I-ULDP classification algorithm has a clear advantage on high-dimensional imbalanced datasets.
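The combination step described in contribution (1) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract gives no formulas, so the function names `scale_majority` and `weighted_vote`, the renormalization after scaling, the coefficient value 0.4, and the example probabilities are all assumptions made for illustration. The two base classifiers are represented only by hypothetical probability outputs rather than trained models.

```python
def scale_majority(probs, majority_label, coef):
    """Multiply the majority-class probability by coef, then renormalize.

    Renormalizing to sum 1 is an assumption; the abstract only states that
    the majority-class probability is multiplied by a proportional coefficient.
    """
    scaled = {
        label: p * (coef if label == majority_label else 1.0)
        for label, p in probs.items()
    }
    total = sum(scaled.values())
    if total == 0:
        raise ValueError("all scaled probabilities are zero")
    return {label: p / total for label, p in scaled.items()}


def weighted_vote(nb_probs, tree_probs, w_nb=0.5, w_tree=0.5):
    """Weighted average of two classifiers' class probabilities.

    The default weights (0.5, 0.5) correspond to equal weighting.
    Returns the winning label and the combined scores.
    """
    combined = {
        label: w_nb * nb_probs[label] + w_tree * tree_probs[label]
        for label in nb_probs
    }
    predicted = max(combined, key=combined.get)
    return predicted, combined


# Hypothetical outputs for one sample (illustrative numbers only).
nb_raw = {"majority": 0.70, "minority": 0.30}    # plain naive Bayes
tree_out = {"majority": 0.45, "minority": 0.55}  # decision tree (e.g. C4.5)

nb_adj = scale_majority(nb_raw, "majority", coef=0.4)
label, scores = weighted_vote(nb_adj, tree_out)
baseline_label, _ = weighted_vote(nb_raw, tree_out)
print(label, baseline_label)
```

With these illustrative numbers, averaging the unscaled naive Bayes output with the tree output favors the majority class, while scaling the majority-class probability first shifts the combined vote to the minority class. How the coefficient and the weights are actually chosen is specific to the thesis and is not reproduced here.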
[Degree-granting institution]: Lanzhou University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

[References]