Personal Credit Score Modeling and Analysis Based on Data Mining
Published: 2019-04-11 07:22
[Abstract]: As the economy develops, more and more households need credit for housing, cars, education, and everyday consumption. Avoiding the resulting personal credit risk is therefore a major challenge for banks and other lending institutions. Building a personal credit model with statistical methods or data mining techniques, so that an individual's probability of default can be predicted with reasonable accuracy, is of great practical value to them. Predicting personal credit is in essence a classification problem: individual consumers are divided into those who repay principal and interest on schedule ("good" customers) and those who default ("bad" customers). For this problem, the thesis builds models with Logistic regression and decision tree classification, compares the strengths and weaknesses of the two, and selects the better model.
The empirical data come from a Kaggle competition, and the analysis is carried out in SAS and SPSS. First, the original data are randomly sampled in SAS and divided into a training set, a validation set, and a test set. The data are then preprocessed: missing values and outliers are checked, multicollinearity is tested, and imputation and variable cluster analysis are applied to clean the data and screen variables. From the ten variables x1-x10, five (x1, x2, x4, x8, x9) are retained for Logistic regression modeling. Next, the full-model method of Logistic regression analysis yields three candidate models; parameter estimation and model significance tests on these candidates leave two prediction models, both with an AUC of 0.714, which indicates reasonably good predictive performance. To choose a robust and parsimonious final model, ROC curves are plotted and AUC values computed on the validation set, where both models exceed 70%. After an overall comparison, the final Logistic regression model is built on x2, x8, and x9.
A decision tree classification model is then built on the training set in SPSS with the Exhaustive CHAID algorithm, which selects five variables: x1, x3, x4, x7, and x9. Checking the model on the validation set gives an AUC of 0.839, indicating good robustness. Finally, the two models are compared on the test set: the sum of squared errors between the predicted default probability p and the actual outcome is 823.298 for the Logistic regression model and 231.559 for the decision tree model. The decision tree model therefore outperforms the Logistic regression model in both predictive accuracy and robustness.
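The abstract relies on three evaluation ingredients: a random three-way data split, the AUC, and the sum of squared errors between predicted default probabilities and actual outcomes. The thesis performs these steps in SAS and SPSS; the following is only a minimal Python sketch of the same quantities, not the author's code. The function names and the 60/20/20 split proportions are assumptions made for illustration, since the abstract does not state the proportions used.

```python
import random


def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and cut them into training, validation and test lists.

    The 60/20/20 default is an assumption; the abstract gives no proportions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


def auc(labels, scores):
    """AUC as the share of (bad, good) pairs in which the bad customer
    (label 1) receives the higher predicted score; ties count as half.

    This pairwise form is quadratic in the sample size, which is fine
    for a sketch but slow for large data sets.
    """
    bad = [s for y, s in zip(labels, scores) if y == 1]
    good = [s for y, s in zip(labels, scores) if y == 0]
    if not bad or not good:
        raise ValueError("both classes must be present")
    wins = 0.0
    for b in bad:
        for g in good:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bad) * len(good))


def sum_squared_error(labels, probabilities):
    """Sum of squared differences between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probabilities))


if __name__ == "__main__":
    # Toy labels (1 = default) and predicted default probabilities.
    y = [1, 0, 1, 0]
    p = [0.9, 0.1, 0.4, 0.6]
    print("AUC:", auc(y, p))
    print("SSE:", sum_squared_error(y, p))
    train, valid, test = three_way_split(range(10))
    print(len(train), len(valid), len(test))
```

A lower sum of squared errors means the predicted probabilities lie closer to the observed 0/1 outcomes, which is the sense in which the abstract compares 823.298 with 231.559; a higher AUC means the model ranks defaulters above non-defaulters more often.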
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13
Article ID: 2456202
Article link: https://www.wllwen.com/jingjilunwen/jiliangjingjilunwen/2456202.html