Modeling and Application of an Imbalanced Support Vector Machine Based on Improved SMOTE

Published: 2018-08-11 11:28

[Abstract]: The support vector machine (SVM) is a classical classification method in machine learning. It offers good classification performance and fast training, and it performs particularly well in nonlinear classification scenarios. Built on rigorous mathematical derivation and solid statistical methods, the SVM is now widely applied in industrial production, intrusion detection, medical identification, user recommendation, management evaluation, decision systems, financial credit reporting, bioscience, and other fields. Meanwhile, as society and the economy develop, personal credit reporting has become increasingly important. As data mining techniques continue to advance, machine learning methods based on big data have gradually replaced manual screening and play an ever larger role in the credit reporting industry.

However, as the cost of collecting and storing data has fallen rapidly, the complexity of data in classification problems has grown along with the sharp rise in data volume: data dimensionality keeps increasing and class distributions are increasingly skewed toward one side. These changes pose growing challenges for classification, and for the SVM they seriously degrade the performance of the classical classifier in specific scenarios. Addressing them requires starting from the intrinsic properties of the SVM, fully considering how imbalanced data and indicator complexity affect classification results, and working from the root causes that limit classification performance. Only then can the classical SVM be improved in a targeted way, raising its practical value while retaining its rigorous theoretical foundation.

This thesis systematically studies the theory and properties of the classical SVM, discusses the data imbalance problem in SVMs together with the modeling of solutions and their concrete implementation, and proposes an improved SVM algorithm that is adaptive and robust to imbalanced data. The algorithm is applied to customer credit risk assessment at a microloan company as a practical case; in testing, the proposed method improved the classification accuracy for potentially defaulting customers. The main research contents are as follows:

(1) The modeling and application of SVM classifiers under fuzzy conditions are studied, including an SVM classifier based on interval numbers. For samples that contain interval numbers, a sampling method based on hypercube vertex sampling is proposed, and an algorithm that uses a binary tree to sample interval-number samples is given.

(2) The drawbacks of traditional SMOTE on imbalanced data are analyzed: it ignores the meaning of the samples themselves and operates on the entire set of minority-class samples. Building on SMOTE's interpolation of minority-class samples, an improved oversampling method based on the selection of key indicators is proposed. The classification properties of the interval-number SVM are used to improve the distribution of the newly synthesized samples. Finally, the complete model and algorithm flow of the improved SMOTE SVM for imbalanced data are given.

(3) The influence of the key indicators and related parameters chosen when using the improved SMOTE on classification results is analyzed, and an optimized SMOTE SVM algorithm based on information gain is proposed. First, a hypercube vertex-sampling SMOTE SVM based on information gain is built; then an optimization algorithm automatically tunes the parameters of the improved SMOTE-SVM model. This makes the parameter settings more reasonable and improves classification performance. The concrete flow of the combined algorithm is also given.

(4) The practical problems that microloan companies face in credit risk assessment are studied, and their disadvantages in assessing customer credit are analyzed. A credit risk assessment indicator system is built according to how such companies actually operate. The improved SVM algorithm proposed in this thesis is applied to the real problem and compared with other classical classification algorithms on overall classification performance. Starting from the key indicators, the distribution of defaulting customers across those indicators is analyzed, and finally user profiles are drawn up from the typical characteristics of the two classes of users.
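The abstract names three building blocks without giving their details: SMOTE interpolation of minority-class samples, sampling at the vertices of a hypercube for interval-valued samples, and information gain for choosing key indicators. The Python sketch below shows only the textbook form of each piece. It is not the thesis's algorithm: the function names (`smote_interpolate`, `hypercube_vertices`, `information_gain`), the parameters `k`, `n_new` and `seed`, and the toy data are all assumptions made for illustration.

```python
import itertools
import math
import random
from collections import Counter


def smote_interpolate(minority, k=3, n_new=5, seed=0):
    """Classic SMOTE: place each synthetic point on the segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def hypercube_vertices(intervals):
    """All 2^d corner points of an interval-valued sample given as
    [(low_1, high_1), ..., (low_d, high_d)] (illustrative assumption)."""
    return list(itertools.product(*intervals))


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of a discrete feature with respect to class labels."""
    total = len(labels)
    conditional = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Toy minority class: four points at the corners of the unit square.
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    print(smote_interpolate(minority, k=2, n_new=3))
    print(hypercube_vertices([(0, 1), (2, 3)]))
    print(information_gain(["a", "a", "b", "b"], [1, 1, 0, 0]))
```

In this sketch every synthetic point lies between two existing minority samples, which is the behaviour the abstract criticises when interpolation is applied to the whole minority class without regard to what the features mean. A key-indicator variant would presumably restrict or weight the interpolation using a score such as `information_gain`, but that step is left out here because the abstract does not specify it.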
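[Degree-granting institution]: Nanjing University of Aeronautics and Astronautics
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181; F832.4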
