Research on Imbalanced Data Classification Methods Based on the Immune System
[Abstract]: With the development of cloud computing and mobile technology, the Internet has entered the age of big data. People face a rapid expansion of multimedia information, which calls for effective content management and fast information retrieval. Classification algorithms are widely used in computer vision, text recognition, speech recognition, document classification and other fields. Classification algorithms that learn from labeled data, such as naive Bayes, logistic regression, support vector machines and decision trees, are mature. However, these algorithms depend on the size of the data set: according to learning theory, accuracy can exceed a critical threshold only when the sample size exceeds a prescribed lower bound. Meanwhile, imbalanced data sets are common in real life, and people care more about the minority samples, whose misclassification carries a higher cost. To resolve this contradiction, this thesis studies imbalanced data classification based on the immune system. Drawing on the principles and characteristics of the human immune system, it addresses four problems: binary (two-class) imbalanced classification, multi-class imbalanced classification, imbalanced classification under sparse data density, and imbalanced classification with imbalance between clusters. The main work and contributions are as follows. (1) For the binary imbalanced setting, the thesis studies theory and methods for improving classifier performance through oversampling based on immune center points.
In binary classification, the majority (negative) class has far more samples than the minority (positive) class, and standard classification learning algorithms tend to favor the majority class, so the misclassification rate of the minority class is significantly higher than that of the majority class. This thesis proposes an immune center point-based oversampling technique (ICOTE), which uses immune network principles such as proliferation, mutation and suppression to generate immune center points that expand the minority class until the class distribution is balanced. The immune center points reflect the distribution characteristics of the minority class, and the expanded sample set preserves the shape of the original sample distribution, which prevents new clusters from forming. ICOTE thus avoids overfitting while overcoming the weakness of random synthetic sampling methods, which do not take the distribution of the sample space into account. (2) For the multi-class imbalanced setting, the thesis studies theory and methods for improving classifier performance through oversampling based on multiple immune subnetworks. Compared with binary learning, multi-class learning faces new problems such as a large search space, high algorithmic complexity and overlapping class regions, so binary methods cannot simply be carried over to multi-class problems. At the same time, the imbalance problem becomes more pronounced, and overlap among the regions of several minority classes is more common, which leads traditional classification algorithms to ignore the minority classes and favor a lower error rate on the majority classes.
The global immune center point-based oversampling method (Global-IC) builds on immune center points: it uses immune network principles to generate an immune subnetwork in each minority-class space and uses the network nodes to expand the minority samples until the class distribution of the whole sample set is balanced. The classification algorithm is then driven to build a model in which each class carries the same weight, so that unknown samples are predicted correctly. (3) For sparse minority data with low density, the thesis studies theory and methods for improving classifier performance through oversampling based on negative selection. Compared with the majority-class space, the minority-class space contains few samples and sparse data, forming many isolated points or small clusters, so classification algorithms are easily biased toward the majority class. Based on the negative selection mechanism of the human immune system, the thesis proposes a method that combines non-self antigen detectors with outlier detection and studies the distribution characteristics of the whole data space. Because as much of the sample data as possible is used, larger or denser decision regions are generated in the minority-class space, the decision tree classification algorithm has sufficient classification information, and the resulting decision tree can correctly classify unlabeled samples. (4) For imbalance between clusters, the thesis studies theory and methods for improving classifier performance through shape-based oversampling. Imbalance is not simply imbalance between classes: a class may contain several internal "clusters", and imbalance between these clusters lowers prediction accuracy. Based on immune network principles and outlier detection, the thesis proposes the shape-based oversampling method (SBO), which first identifies the "clusters" and then constructs an immune subnetwork within each cluster, using the network nodes to expand the minority samples. The thesis also studies the dependence of the CURE algorithm on its input parameters, using the immune network to generate representative points in place of the previous vector mean. In addition, the SBO cluster-checking algorithm introduces the notion of a "false cluster" and expands the sample size only for real clusters, which avoids the overfitting caused by duplicated samples. Because the oversampled data set is balanced both between and within classes, and the expanded data set has a spatial distribution similar to that of the original, the resulting decision tree can correctly classify unlabeled samples.
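The idea of expanding the minority class around center points without changing its shape can be illustrated with a minimal sketch. This is not the ICOTE algorithm from the thesis: a single arithmetic centroid stands in, as an assumption, for the immune center points that the thesis generates through proliferation, mutation and suppression, and the function names `centroid` and `oversample_toward_center` are hypothetical. Each synthetic point is interpolated between an existing minority sample and the centroid, so new points stay inside the region the minority class already occupies.

```python
import random


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def oversample_toward_center(minority, target_size, seed=0):
    """Grow `minority` to `target_size` by interpolating samples toward the centroid.

    Hypothetical simplification: one centroid replaces the immune center
    points that the thesis derives from an immune network.
    """
    if not minority:
        raise ValueError("minority set must not be empty")
    rng = random.Random(seed)
    center = centroid(minority)
    result = list(minority)  # original samples are kept unchanged
    while len(result) < target_size:
        base = rng.choice(minority)
        t = rng.random()  # t = 0 keeps the sample, t near 1 approaches the centroid
        result.append(tuple(b + t * (c - b) for b, c in zip(base, center)))
    return result


# Toy data: 10 majority samples versus 3 minority samples.
majority = [(float(i), float(i % 3)) for i in range(10)]
minority = [(0.0, 5.0), (1.0, 6.0), (2.0, 5.0)]

balanced_minority = oversample_toward_center(minority, len(majority))
print(len(majority), len(balanced_minority))  # prints: 10 10
```

Because every synthetic point lies on a segment between a real sample and the centroid, the expanded set cannot form a new cluster outside the original minority region. That is the property the abstract attributes to center point-based oversampling, shown here only in its simplest form.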
[Degree-granting institution]: Soochow University (苏州大学)
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP301.6
Article ID: 2292894
Article link: https://www.wllwen.com/shoufeilunwen/xxkjbs/2292894.html