
Application Research on Ensemble Learning for Imbalanced Classification

Published: 2018-07-02 19:40

  Topic: imbalanced classification + ensemble learning; Source: master's thesis, Nanjing University of Information Science and Technology, 2017


[Abstract]: Data sets with skewed class distributions are widespread in the real world. In many fields, correctly classifying minority-class samples matters more than correctly classifying majority-class samples. Most classical classification algorithms, however, build their models on the assumption that the training set has a balanced class distribution or that all classes carry the same misclassification cost, so an imbalanced class distribution degrades their performance to some degree. The information carried by minority-class samples tends to be masked by that of the majority class, and the classification error rate on minority-class samples ends up far higher than on majority-class samples. For this reason, imbalanced classification has attracted growing attention and has become both a popular and a difficult problem in applied data mining.

Before turning to applications, this thesis first introduces the research scope and current state of imbalanced classification, with a detailed survey of sampling methods and classification algorithms. Then, because ensemble learning algorithms can achieve better performance than a single classifier on imbalanced data, it goes on to discuss how ensemble combination methods handle imbalanced classification and describes the related applications in detail. Two applications of ensemble learning models to imbalanced classification follow.

In the first part, based on 1,000 sets of financial data from companies listed on the Shanghai A-share market in 2014, a Hellinger Distance based Random Forest (HDRF) is used to study the construction of a financial early-warning model for listed companies, treated as an imbalanced classification of ST stocks. HDRF combines the diversity of random forests with the skew insensitivity of Hellinger distance decision trees. The comparison experiments use a traditional random forest; Bagging, AdaBoost, and rotation forest ensembles with C4.5 decision trees as base classifiers; and the corresponding ensembles with Hellinger distance decision trees as base classifiers. The results show that, on the imbalanced classification of ST stocks, the HDRF ensemble achieves relatively better overall performance in terms of area under the ROC curve (AUC) and F-measure. In addition, using the Hellinger distance decision tree as the base classifier improves the imbalanced classification performance of the ensemble models.

The second part extends the application of imbalanced classification models to customer retention in customer relationship management, focusing on customer churn in commercial banks. CVParameterSelection is applied to optimize the parameters of a support vector machine with a combined kernel function, and an EasyEnsemble-based Relief-SVM customer churn prediction model is constructed. Experiments on commercial bank customer data show that this model improves on both AUC and F-measure compared with the single-kernel EasyEnsemble-Relief-SVM model and with traditional Bagging and AdaBoost ensembles that use C4.5 decision trees as base classifiers. With parameter optimization, the combined-kernel EasyEnsemble Relief-SVM model is therefore an effective method for predicting customer churn in commercial banks: it predicts potential churners more accurately while maintaining overall classification accuracy, which makes it possible to target retention decisions at customers likely to leave and, ultimately, to retain as many customers as possible.

Finally, the thesis summarizes the two applications of ensemble-learning-based imbalanced classification, analyzes their shortcomings, and outlines future work, with the aim of supporting effective knowledge discovery on imbalanced data in economics and management.
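The skew insensitivity attributed to Hellinger distance decision trees comes from the split criterion, which compares the share of each class that lands in each branch rather than raw counts. The abstract does not give the thesis's own implementation, so the following is only a minimal sketch of the commonly used two-class criterion, $d_H = \sqrt{\sum_j \left(\sqrt{|X_{+j}|/|X_+|} - \sqrt{|X_{-j}|/|X_-|}\right)^2}$; the function name `hellinger_split_score` and the example counts are hypothetical.

```python
import math


def hellinger_split_score(branches):
    """Hellinger distance between the two class distributions over a split.

    `branches` is a list of (minority_count, majority_count) pairs, one pair
    per branch of the candidate split. This is an illustrative helper, not
    code from the thesis.
    """
    total_pos = sum(pos for pos, _ in branches)
    total_neg = sum(neg for _, neg in branches)
    if total_pos == 0 or total_neg == 0:
        raise ValueError("both classes must be present at the node")

    squared = 0.0
    for pos, neg in branches:
        # Compare the fraction of each class that falls into this branch.
        squared += (math.sqrt(pos / total_pos) - math.sqrt(neg / total_neg)) ** 2
    return math.sqrt(squared)


# A split that separates the classes completely reaches the maximum, sqrt(2).
perfect = hellinger_split_score([(10, 0), (0, 990)])

# A split that sends the same share of both classes to each branch scores 0.
useless = hellinger_split_score([(5, 495), (5, 495)])

# Shrinking the majority class tenfold leaves the score unchanged, because
# only within-class fractions enter the formula.
skewed = hellinger_split_score([(8, 100), (2, 890)])
rebalanced = hellinger_split_score([(8, 10), (2, 89)])
```

Because the class prior never appears in the formula, a tree grown with this criterion ranks candidate splits the same way whether the minority class makes up 1% or 10% of the node.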
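The EasyEnsemble step in the churn model can be illustrated by its sampling stage: several subsets are drawn independently from the majority class, each the same size as the minority class, and each is paired with the full minority class to train one base model. The sketch below covers only that sampling stage, under the assumption that the standard EasyEnsemble scheme is what the thesis uses; the Relief feature selection and the combined-kernel SVM are not shown, and the function name, seed, and toy records are hypothetical.

```python
import random


def easy_ensemble_subsets(minority, majority, n_subsets, seed=0):
    """Build balanced training subsets in the style of EasyEnsemble.

    Each subset keeps every minority-class record and adds an equally sized
    random sample (without replacement) of majority-class records.
    Illustrative helper only; not code from the thesis.
    """
    if len(majority) < len(minority):
        raise ValueError("majority class must be at least as large as minority")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append(list(minority) + sampled_majority)
    return subsets


# Toy data: 5 churned customers against 95 retained customers.
churned = [("churned", i) for i in range(5)]
retained = [("retained", i) for i in range(95)]

subsets = easy_ensemble_subsets(churned, retained, n_subsets=4)
```

One base classifier would then be trained per subset and their outputs combined, so every majority-class record has a chance of being used while each individual model still sees balanced classes.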
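[Degree-granting institution]: Nanjing University of Information Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: F832.51; F275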




Article ID: 2090707


Article link: https://www.wllwen.com/jingjilunwen/touziyanjiulunwen/2090707.html

