Factors Influencing Loan Approval Outcomes and a Lending Decision Model for P2P Online Lending, Based on Data Mining

Published: 2018-04-20 21:23

Topic: P2P online lending + random forest model; Source: Shanghai Normal University, master's thesis, 2017

[Abstract]: P2P online lending refers to unsecured loans arranged between lenders and borrowers through an online lending platform rather than a financial institution. Since 2015, P2P online lending in China has grown very rapidly. According to the "China P2P Online Lending Industry 2015 Annual Report Brief", the number of P2P lending platforms nationwide rose from 2918 to 5121 during 2015, and annual cumulative transaction volume grew from 252.8 billion yuan in 2014 to 982.304 billion yuan in 2015. However, as of February 2017, 3547 of the 5882 P2P lending platforms ever established nationwide had ceased operations or run into problems. Risk control on P2P lending platforms is therefore an urgent issue. Using real loan data from the P2P lending platform Haodai (好贷网), this thesis identifies, from a set of applicant feature variables, the factors that significantly affect whether an applicant obtains a loan, and builds a lending decision model to predict that outcome.
Data preprocessing: the loan application table and the applicant information table in the raw data are joined with SQL into a single personal loan analysis table. Invalid records are removed through logical checks, missing values are imputed with KNN imputation, and outliers are handled with WOE binning. The result is 3003 valid records with 20 applicant feature variables.
Identification of influencing factors: the IV (information value) of each of the 20 variables is computed first, which selects 14 variables that are significant for the loan outcome. A random forest model is then used to compute the mean decrease in Gini for each significant variable; the larger the mean decrease, the greater the variable's influence on the loan outcome. The most influential factor is the applicant's past credit record, followed by occupation and assets, and then loan amount and loan term, while basic personal characteristics such as gender and marital status have very little influence. The success-to-failure ratio (odds) is then used to identify the direction and size of each factor's effect: applicants with a credit card have a loan success rate 20 times higher than applicants without one, and the highest single-card credit limit, the length of time the card has been held, salary, years of work experience, and education level are all significantly and positively related to the loan success rate.
Lending decision model: six commonly used models are selected: the Logistic regression model among statistical models, the SVM and neural network models among non-statistical models, and the AdaBoost, GBDT, and XGBoost models among ensemble models. Applicants are first grouped with K-means clustering and the characteristics of each group are summarized. A separate model is then built for each group, and the per-group predictions are pooled and compared with the results of the models built without clustering. After clustering, model accuracy, sensitivity, and specificity improve significantly, by 3.31%, 17.39%, and 11.05% respectively. The thesis interprets this as the clustered model bringing a P2P lending platform 17.39% more business and reducing the risk of misclassification by 11.05% compared with the unclustered model.
The conclusions are as follows. Applicants differ considerably from one another, and modeling all applicants as a whole ignores these differences and lowers model accuracy. Grouping applicants with K-means clustering first and then building a model within each group significantly improves the model's ability to capture the characteristics of different types of applicants, and thereby strengthens its risk control capability.
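The IV screening step described in the abstract can be illustrated with a minimal sketch. The thesis does not give its formulas here, so this uses the standard textbook definitions: for each bin, WOE is the log of the share of granted loans in that bin over the share of rejected loans in that bin, and IV sums the share difference times WOE over all bins. The function name `woe_iv`, the additive smoothing of 0.5, and the toy data are assumptions made for illustration, not the author's implementation or data.

```python
import math
from collections import Counter


def woe_iv(bins, outcomes, smoothing=0.5):
    """Return ({bin label: WOE}, IV) for one already-binned variable.

    bins     -- bin label for each applicant (hypothetical input layout)
    outcomes -- 1 if the loan was granted, 0 if it was rejected
    smoothing -- additive constant that avoids log(0) for empty cells
                 (an assumption; the thesis does not state how it handles them)
    """
    granted = Counter(b for b, y in zip(bins, outcomes) if y == 1)
    rejected = Counter(b for b, y in zip(bins, outcomes) if y == 0)
    labels = sorted(set(bins))

    total_granted = sum(granted.values()) + smoothing * len(labels)
    total_rejected = sum(rejected.values()) + smoothing * len(labels)

    woe = {}
    iv = 0.0
    for label in labels:
        # Share of all granted / rejected applicants that fall in this bin.
        p_granted = (granted[label] + smoothing) / total_granted
        p_rejected = (rejected[label] + smoothing) / total_rejected
        woe[label] = math.log(p_granted / p_rejected)
        iv += (p_granted - p_rejected) * woe[label]
    return woe, iv


# Invented toy data: 12 applicants, 6 granted and 6 rejected.
has_card = ["yes"] * 6 + ["no"] * 6
gender = ["m", "f"] * 6
loan_granted = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

woe_card, iv_card = woe_iv(has_card, loan_granted)
woe_gender, iv_gender = woe_iv(gender, loan_granted)

print("has_card:", woe_card, "IV =", round(iv_card, 3))
print("gender:  ", woe_gender, "IV =", round(iv_gender, 3))
```

In this toy data the credit card variable separates granted from rejected applicants more sharply than gender does, so it receives the larger IV. A screening step of the kind the abstract describes would keep variables whose IV exceeds a chosen cutoff; the cutoff used in the thesis is not stated in the abstract.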
[Degree-granting institution]: Shanghai Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: F724.6;F832.4
