A Data-Mining Study of the Factors Influencing P2P Online Loan Approval and of Lending Decision Models
Topic: P2P online lending + random forest model; Source: Shanghai Normal University, master's thesis, 2017
[Abstract]: P2P online lending refers to unsecured loans made between lenders and borrowers through an online lending platform rather than through a financial institution. P2P online lending in China has grown very rapidly since 2015. According to the 2015 Annual Report Brief of China's P2P Online Lending Industry (《中国P2P网贷行业2015年年报简报》), the number of P2P lending platforms nationwide rose from 2,918 to 5,121 during 2015, and annual cumulative transaction volume rose from 252.8 billion yuan in 2014 to 982.304 billion yuan in 2015. However, as of February 2017, 3,547 of the 5,882 P2P lending platforms ever established nationwide had shut down or run into problems, which makes risk control for P2P lending platforms an urgent issue. Using real loan data from the P2P lending platform 好贷网 ("Haodai"), this thesis identifies, from a set of applicant feature variables, the factors that significantly affect whether a loan is granted, and builds an effective lending decision model to predict each applicant's loan outcome. The work has three parts.
Data preprocessing: the loan application table and the applicant information table in the raw data are joined with SQL into a single personal loan analysis table; invalid records are removed by logical checks; missing values are imputed with KNN imputation; and outliers are handled with WOE (weight of evidence) binning. This yields 3,003 valid records with 20 applicant feature variables.
Identifying the influencing factors: the IV (information value) of each of the 20 variables is computed first, to select the 14 variables that are significant for the loan outcome. A random forest model is then used to compute the mean decrease in Gini for each significant variable; the larger the mean decrease, the greater that variable's influence on the loan outcome.
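As an illustration of the IV screening step, the sketch below computes WOE per bin and the resulting IV for one binned variable, using the common definitions WOE = ln(share of approved in the bin / share of rejected in the bin) and IV = sum over bins of (share approved - share rejected) x WOE. The abstract does not give the thesis's exact formula, sign convention, or IV cutoff, so these are assumptions; the function name `woe_iv` and the counts in `card_bins` are hypothetical and are not data from the thesis.

```python
import math


def woe_iv(bins):
    """Return (woe_by_bin, iv) for one binned variable.

    bins maps a bin name to (n_approved, n_rejected). Every bin is assumed
    to contain at least one approved and one rejected applicant, otherwise
    the logarithm below is undefined.
    """
    total_ok = sum(ok for ok, _ in bins.values())
    total_no = sum(no for _, no in bins.values())
    woe, iv = {}, 0.0
    for name, (ok, no) in bins.items():
        share_ok = ok / total_ok  # share of all approved applicants in this bin
        share_no = no / total_no  # share of all rejected applicants in this bin
        woe[name] = math.log(share_ok / share_no)
        iv += (share_ok - share_no) * woe[name]
    return woe, iv


# Hypothetical counts for a "has credit card" variable (not thesis data).
card_bins = {"has_card": (80, 120), "no_card": (20, 280)}
woe, iv = woe_iv(card_bins)
print(woe, iv)
```

A variable whose bins all have the same approved-to-rejected mix gets an IV of zero, which is why a low IV is a natural reason to drop a variable before modeling.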
The results show that the applicant's past credit record has the greatest influence on the loan outcome, followed by occupation and assets, and then by loan amount and loan term, while basic personal characteristics such as gender and marital status have very little influence. The success-to-failure ratio (odds) is then used to identify the direction and size of each factor's effect: the loan success rate of applicants with a credit card is found to be 20 times higher than that of applicants without one, and the highest single-card credit limit, the time since the card was opened, salary, years of work experience, and education level are all significantly and positively related to the loan success rate.
Building the lending decision model: six of the most common models are used, namely the Logistic regression model (a statistical model), the SVM and neural network models (non-statistical models), and the AdaBoost, GBDT, and XGBoost models (ensemble models). Applicants are first grouped with K-means clustering and the characteristics of each group are summarized. A separate model is then built for each group, the per-group predictions are pooled, and the pooled result is compared with that of the model built before clustering. After clustering, accuracy, sensitivity, and specificity improve significantly, by 3.31%, 17.39%, and 11.05% respectively, which means that, compared with the unclustered model, the clustered model can bring the P2P lending platform 17.39% more business and reduce the risk of misjudgment by 11.05%. The thesis therefore concludes that applicants differ considerably from one another, and that modeling all applicants together ignores these differences and lowers model accuracy.
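The three evaluation measures quoted above can be read off a confusion matrix. The sketch below assumes that the positive class is "loan granted" (label 1), which matches the abstract's reading of sensitivity as captured business and specificity as avoided misjudgment; the function name `confusion_metrics` and the label vectors are hypothetical, not thesis data.

```python
def confusion_metrics(truth, pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = loan granted)."""
    pairs = list(zip(truth, pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # granted, predicted granted
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # refused, predicted refused
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # refused, predicted granted
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # granted, predicted refused
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of granted loans the model finds
        "specificity": tn / (tn + fp),  # share of refused loans the model rejects
    }


# Hypothetical outcomes for ten applicants.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = confusion_metrics(truth, pred)
print(metrics)
```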
Clustering applicants with K-means first and then building a model within each group significantly strengthens the model's ability to capture the characteristics of different types of applicants, and thereby improves its risk control capability.
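The cluster-then-model idea can be sketched on toy data. This is a minimal illustration under stated assumptions, not the thesis's pipeline: the per-group model is a single threshold rule on loan amount, swapped in for the six models the thesis uses; K-means is a plain Lloyd's iteration with a deterministic start; accuracy is measured on the same rows used for fitting; and the rows in `rows` (income, loan amount, approved) are made up so that the two groups have different approval cutoffs.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic start: k evenly spaced rows (an illustrative choice).
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def fit_threshold(amounts, approved):
    """Cutoff t that maximises accuracy of the rule 'approve if amount <= t'."""
    def hits(t):
        return sum((a <= t) == bool(y) for a, y in zip(amounts, approved))
    return max(sorted(set(amounts)), key=hits)


def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Toy rows: (income, loan amount, approved). Hypothetical, not thesis data.
rows = [(2, 1, 1), (3, 2, 1), (4, 3, 0), (3, 5, 0),
        (20, 5, 1), (21, 8, 1), (22, 10, 1), (20, 15, 0)]
features = [(inc, amt) for inc, amt, _ in rows]
amounts = [amt for _, amt, _ in rows]
approved = [y for _, _, y in rows]

# Baseline: one rule fitted on all applicants together.
t_all = fit_threshold(amounts, approved)
pooled_pred = [int(a <= t_all) for a in amounts]

# Cluster first, then fit one rule per cluster and pool the predictions.
labels = kmeans(features, k=2)
clustered_pred = [0] * len(rows)
for c in set(labels):
    idx = [i for i, lab in enumerate(labels) if lab == c]
    t_c = fit_threshold([amounts[i] for i in idx], [approved[i] for i in idx])
    for i in idx:
        clustered_pred[i] = int(amounts[i] <= t_c)

pooled_acc = accuracy(pooled_pred, approved)
clustered_acc = accuracy(clustered_pred, approved)
print(pooled_acc, clustered_acc)
```

On these rows the single pooled cutoff cannot suit both income groups, while the per-cluster cutoffs can, which is the effect the abstract describes. A real comparison would use held-out data rather than the fitting rows.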
[Degree-granting institution]: Shanghai Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: F724.6;F832.4
Article ID: 1779511
Article link: https://www.wllwen.com/jingjilunwen/touziyanjiulunwen/1779511.html