A Study of Data Normalization Methods for Improving SVM Training Efficiency

Published: 2018-08-13 14:06

[Abstract]: The support vector machine (SVM) is a machine learning method grounded in statistical learning theory and built on the structural risk minimization principle and VC dimension theory. Over recent decades its strong classification ability has led to wide use in many fields, it remains one of the most active research topics in machine learning, and many researchers in China and abroad have worked to improve SVM training efficiency. Data normalization is an essential preprocessing step for training an SVM. Commonly used strategies include scaling to [-1, +1] and standardizing to N(0,1), yet no study of the scientific basis for these common methods has been found in the existing literature. By studying how the sequential minimal optimization (SMO) algorithm operates within the SVM, this thesis finds that the Gaussian kernel function is affected by the attribute values of the data samples: attribute values that are either too large or too small reduce the degree to which the Gaussian kernel participates. Data normalization confines the data to a certain range so that it works better with the Gaussian kernel radius, which keeps the optimal separating hyperplane from becoming overly rugged. The thesis uses empirical experiments to explore the underlying mechanism of data normalization and the effect of normalizing versus not normalizing on training efficiency and on the predictive ability of the model. Using standard datasets, SVMs were trained on the original unnormalized data, on data normalized by different methods, on artificially de-normalized data, and on arbitrarily selected attribute columns. For each run, the objective function value as a function of iteration count, the training time, the model test results, and the k-fold cross-validation (k-CV) performance were recorded. The main results are as follows:
(1) Building on the traditional SMO algorithm, expressions for the objective function value and its change were derived, and the algorithm was implemented in C++11 to compute and output the objective function value, its change, the training time, and the test accuracy. A close analysis of representative literature on SMO with a Gaussian kernel determined the optimal Gaussian kernel radius λ and the tolerance κ for violations of the KKT conditions. Experimental results show that the chosen λ and κ achieve the best generalization ability, and analysis of the output curves supports the conclusion that SVM training efficiency can be improved through data preprocessing.
(2) Data preprocessing methods were studied in depth. In particular, three normalization methods (min-max normalization, median normalization, and standard-score normalization) were implemented and integrated with the SVM classifier. Experimental results show that data normalization can compensate for shortcomings in the manual selection of the Gaussian kernel radius, so that the Gaussian kernel is applied more effectively in SVM classification.
(3) Standard experimental datasets were preprocessed with the three normalization methods, several experimental setups were designed, and training time and test accuracy were recorded and compared in detail using k-CV. Analysis of how SVM training efficiency changes after normalization yielded a fairly fundamental underlying mechanism by which data normalization improves SVM training efficiency.
(4) From the analysis of how data normalization affects SVM training efficiency and the comparison of differences in classification ability, an optimal bounding principle for normalization was identified, namely the one that most improves SVM training efficiency: keep the values of each data attribute within a conventional, comparable numerical range, such as [-0.5, +0.5] to [-5, +5] or N(0,1) to N(0,5).
Extensive experimental analysis and verification show that data normalization effectively improves SVM training efficiency. This thesis provides a scientific basis for data normalization in SVMs and in machine learning algorithms generally.
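
The abstract's central claim, that attribute values which are too large or too small weaken the contribution of the Gaussian kernel, can be illustrated with a small sketch. It is not the thesis's C++11 implementation. It assumes the common kernel form K(x, z) = exp(-||x - z||^2 / (2 * radius^2)), which the abstract does not spell out, and it uses synthetic data and a radius of 1.0 chosen only for illustration. The function names are hypothetical. Only min-max and standard-score normalization are shown, because the abstract does not define its median normalization.

```python
import numpy as np


def min_max_scale(X, lo=-1.0, hi=1.0):
    """Linearly rescale every attribute column to the interval [lo, hi]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0.0] = 1.0  # constant columns map to lo, no division by zero
    return lo + (hi - lo) * (X - col_min) / col_range


def z_score_scale(X):
    """Standardize every attribute column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    col_std = X.std(axis=0)
    col_std[col_std == 0.0] = 1.0  # constant columns become all zeros
    return (X - X.mean(axis=0)) / col_std


def gaussian_kernel_matrix(X, radius=1.0):
    """Pairwise Gaussian kernel values for the rows of X.

    ASSUMPTION: K[i, j] = exp(-||x_i - x_j||^2 / (2 * radius^2)); the thesis
    may parameterize its kernel radius differently.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]   # shape (n, n, d)
    sq_dist = (diff ** 2).sum(axis=-1)     # squared Euclidean distances
    return np.exp(-sq_dist / (2.0 * radius ** 2))


def off_diagonal(K):
    """Return the kernel values between distinct samples only."""
    return K[~np.eye(K.shape[0], dtype=bool)]


# Synthetic data (an assumption for illustration, not a dataset from the thesis).
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 4))
too_large = base * 1e3   # attribute values far larger than the kernel radius
too_small = base * 1e-3  # attribute values far smaller than the kernel radius

cases = [
    ("too large, raw", too_large),
    ("too small, raw", too_small),
    ("too large, min-max", min_max_scale(too_large)),
    ("too large, z-score", z_score_scale(too_large)),
]
for name, data in cases:
    values = off_diagonal(gaussian_kernel_matrix(data))
    print(f"{name:>20}: min={values.min():.4f} max={values.max():.4f} std={values.std():.4f}")
```

With raw large values every pair of distinct samples has a kernel value near 0, and with raw small values every pair has a value near 1, so in both cases the kernel matrix carries almost no information about which samples are similar. After either normalization the values spread out across a usable range for the same fixed radius.
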
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP181

[References]