Research on Immune System-Based Classification Methods for Imbalanced Data

Published: 2018-10-25 06:27

[Abstract]: With the development of cloud computing and mobile technology, the Internet has entered the era of big data. Faced with rapidly expanding multimedia information, people need effective content management and fast information retrieval. Classification algorithms build models by learning from labeled data and then classify and label new data; they are widely used in computer vision, text recognition, speech recognition, document categorization, and other fields. Classification algorithms based on labeled data, such as naive Bayes, logistic regression, support vector machines, and decision trees, are now mature. However, all of these algorithms depend on the size of the data set: according to learning theory, accuracy can exceed a critical threshold only when the sample size exceeds a prescribed lower bound. At the same time, imbalanced data sets are abundant in real life, where the minority-class samples are of greater interest and misclassifying them is more costly. To resolve this conflict, this thesis studies immune system-based classification methods for imbalanced data. Drawing on the principles and properties of the human immune system, it addresses binary imbalanced classification, multi-class imbalanced classification, imbalanced classification when minority-class data are sparse, and imbalanced classification when the clusters within a class are themselves imbalanced. The main work and contributions are as follows:
(1) For the binary imbalanced setting, the thesis studies the theory and method of improving classifier performance through oversampling based on immune centroids. In binary learning, the majority (negative) class has more samples than the minority (positive) class, so standard learning algorithms tend to favor the majority class, and the misclassification rate of the minority class is markedly higher than that of the majority class. The proposed immune centroid-based oversampling method (ICOTE) draws on immune network theory: through proliferation, mutation, and suppression, it generates immune centroids that augment the minority class until the class distribution is balanced. The immune centroids reflect the distribution of the minority class, and the augmented sample set preserves the shape of the original samples and prevents new clusters from forming. ICOTE therefore avoids overfitting while also overcoming the failure of random synthetic sampling methods to account for the spatial distribution of the samples.
(2) For the multi-class imbalanced setting, the thesis studies the theory and method of improving classifier performance through oversampling based on multiple immune subnetworks. Compared with binary learning, multi-class learning faces new problems such as a larger search space, higher algorithmic complexity, and overlapping class regions, so binary methods often cannot simply be carried over to multi-class problems. The imbalance problem also becomes more pronounced: there is more than one minority class and overlap between class regions is more common, which leads traditional classifiers to neglect the minority classes and to favor reducing the misclassification rate of the majority classes. The proposed global immune centroid-based oversampling method (Global-IC) draws on immune network theory to build an immune subnetwork in the region of each minority class. The network nodes are used to augment the minority-class samples until the whole sample distribution is class-balanced, so that the classifier gives every class equal weight when it builds its model and predicts unseen samples correctly.
(3) For the case where minority-class data are sparse, the thesis studies the theory and method of improving classifier performance through oversampling based on negative selection. Compared with the majority-class region, the minority-class region not only contains few samples but is also sparsely populated, forming many isolated points or small clusters, so classifiers are easily biased toward the majority class. Drawing on the negative selection mechanism of the human immune system, the thesis combines non-self antigen detectors with outlier detection to learn the distribution of the whole data space, generate synthetic samples that follow the density distribution of the minority class, and enlarge the decision region of the minority class. Because as much of the sample data as possible is used, and a larger or denser decision region is generated in the minority-class space, the decision tree algorithm has enough information to classify unlabeled samples correctly.
(4) For the case where the clusters within a class are imbalanced, the thesis studies the theory and method of improving classifier performance through shape-based oversampling. Imbalance is not only a matter of imbalance between classes: a class may also contain many "small clusters", and the imbalance among these clusters lowers prediction accuracy. Drawing on immune network theory and outlier detection, the thesis proposes the shape-based oversampling method (SBO). SBO uses a clustering algorithm to identify the clusters within a class and then builds an immune subnetwork inside each cluster, whose nodes are used to augment the minority-class samples. The thesis also addresses the dependence of the CURE algorithm on its input parameters by using the immune network to generate representative points in place of the vector mean used previously. In addition, SBO checks for "false clusters" introduced by the clustering algorithm and augments only the real clusters, which avoids the overfitting caused by duplicated samples. Because the oversampled data set is balanced both between and within classes, and the augmented data set has a spatial distribution similar to that of the original, the resulting decision tree can classify unlabeled samples correctly.
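The proliferation, mutation, and suppression cycle described in contribution (1) can be illustrated with a minimal sketch. This is not the thesis's ICOTE algorithm, whose details are not given in the abstract; it is a generic immune-network-style oversampler written under stated assumptions. The function name `immune_oversample` and the parameters `n_clones`, `sigma`, `suppress_dist`, and `max_rounds` are hypothetical, mutation is assumed to be Gaussian noise scaled by the per-feature standard deviation, and suppression is assumed to be a simple minimum-distance rule.

```python
import numpy as np


def immune_oversample(X_min, n_new, n_clones=5, sigma=0.1,
                      suppress_dist=0.05, max_rounds=100, seed=0):
    """Generate up to n_new synthetic minority samples (illustrative only).

    Assumptions, not taken from the thesis: Gaussian mutation scaled by the
    per-feature standard deviation, and distance-based network suppression.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    scale = sigma * X_min.std(axis=0)      # per-feature mutation strength
    network = [x for x in X_min]           # network starts from real samples
    n_orig = len(network)

    for _ in range(max_rounds):
        # Proliferation: clone every minority sample n_clones times.
        clones = np.repeat(X_min, n_clones, axis=0)
        # Mutation: perturb each clone with scaled Gaussian noise.
        mutants = clones + rng.normal(0.0, 1.0, clones.shape) * scale
        # Suppression: drop mutants too close to any existing network node,
        # which avoids near-duplicate samples.
        for m in mutants:
            d = np.linalg.norm(np.asarray(network) - m, axis=1)
            if d.min() >= suppress_dist:
                network.append(m)
                if len(network) - n_orig == n_new:
                    return np.array(network[n_orig:])

    # May return fewer than n_new points if suppression is too strict.
    return np.array(network[n_orig:]).reshape(-1, X_min.shape[1])
```

To balance a binary data set, one would call the function with `n_new = len(X_maj) - len(X_min)` and append the returned points to the minority class before training a classifier. Because every synthetic point is a small perturbation of a real minority sample, the augmented set stays close to the original minority distribution, which is the property the abstract attributes to immune centroids.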
[Degree-granting institution]: Soochow University
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP301.6

[Related literature]