Research on Linear Methods for Online AUC Optimization

Published: 2019-02-10 19:42

[Abstract]: AUC is an important measure of classifier performance and is widely used in class-imbalance learning, learning to rank, anomaly detection, and cost-sensitive learning. Online learning has attracted broad attention in machine learning because it handles large-scale and streaming data efficiently. To address AUC optimization in big-data settings, researchers have proposed many online AUC optimization algorithms. The difficulty of online AUC optimization is that each loss term is defined on a pair of samples from different classes, so an objective function built from the sum of these losses grows quadratically with the number of training samples and cannot be solved directly with conventional online learning methods. Existing online AUC optimization algorithms focus on avoiding the direct computation of all loss terms during the solution process, which reduces the problem size and makes online AUC optimization possible, but their complexity is still higher than that of conventional online learning algorithms of the same type. How to make the AUC optimization objective scale linearly, rather than quadratically, with the number of training samples is therefore a question worth studying. Based on the least-squares loss function, this thesis proposes a new objective function for AUC optimization that scales only linearly with the number of training samples. Theoretical analysis shows that minimizing this objective is equivalent to minimizing the AUC optimization objective composed of an L2 regularization term and the least-squares loss. Building on this objective, the thesis proposes a linear method for online AUC optimization (LOAM) and derives two algorithms from different optimization strategies: LOAMILSC, which solves the problem with the incremental least squares method (ILSC), and LOAMAda, which solves it with AdaGrad. The space complexity and per-iteration complexity of LOAMILSC are the same as those of ILSC, and the space complexity and per-iteration time complexity of LOAMAda are the same as those of conventional online gradient descent. Neither algorithm needs to store any historical samples, and each scans the data set only once. Experimental results show that LOAMILSC achieves better AUC performance than existing methods, while LOAMAda is more efficient for real-time or high-dimensional learning tasks.
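The quadratic-versus-linear point in the abstract can be illustrated with a small sketch. The code below is not the LOAM objective, which the abstract does not spell out. It only shows the standard identity by which the pairwise least-squares loss (1 - w·(x⁺ - x⁻))², averaged over all positive-negative pairs, can be computed from per-class first and second moments gathered in a single pass. All names (`ClassMoments`, `moment_sq_loss`, `demo`) are hypothetical. The sketch keeps a d×d matrix per class, which is costlier than the gradient-descent-level complexity the abstract claims for LOAMAda.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pairwise_sq_loss(w, pos, neg):
    """Brute force: mean of (1 - w.(xp - xn))^2 over all pairs, O(n_pos * n_neg)."""
    total = 0.0
    for xp in pos:
        for xn in neg:
            total += (1.0 - (dot(w, xp) - dot(w, xn))) ** 2
    return total / (len(pos) * len(neg))


class ClassMoments:
    """Running sums of x and x x^T for one class; no samples are stored."""

    def __init__(self, dim):
        self.n = 0
        self.sum_x = [0.0] * dim
        self.sum_xx = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        self.n += 1
        for i, xi in enumerate(x):
            self.sum_x[i] += xi
            row = self.sum_xx[i]
            for j, xj in enumerate(x):
                row[j] += xi * xj

    def mean_score(self, w):
        # E[w.x] over the class
        return dot(w, self.sum_x) / self.n

    def mean_sq_score(self, w):
        # E[(w.x)^2] = w^T (sum of x x^T) w / n
        return sum(w[i] * dot(self.sum_xx[i], w) for i in range(len(w))) / self.n


def moment_sq_loss(w, pos_m, neg_m):
    """Same quantity as pairwise_sq_loss, computed from class moments only.

    With a = w.x+ and b = w.x-:
    E[(1 - a + b)^2] = 1 - 2(E[a] - E[b]) + E[a^2] + E[b^2] - 2 E[a] E[b]
    """
    a = pos_m.mean_score(w)
    b = neg_m.mean_score(w)
    return (1.0 - 2.0 * (a - b)
            + pos_m.mean_sq_score(w) + neg_m.mean_sq_score(w)
            - 2.0 * a * b)


def demo(seed=0, dim=3, n_pos=20, n_neg=50):
    """Compare both computations on synthetic data (assumed Gaussian classes)."""
    rng = random.Random(seed)
    pos = [[rng.gauss(1.0, 1.0) for _ in range(dim)] for _ in range(n_pos)]
    neg = [[rng.gauss(-1.0, 1.0) for _ in range(dim)] for _ in range(n_neg)]
    w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    pos_m, neg_m = ClassMoments(dim), ClassMoments(dim)
    for x in pos:  # one pass over the stream, one sample at a time
        pos_m.update(x)
    for x in neg:
        neg_m.update(x)
    return pairwise_sq_loss(w, pos, neg), moment_sq_loss(w, pos_m, neg_m)
```

Calling `demo()` returns the brute-force value and the moment-based value, which agree up to floating-point error. The squared loss is what makes this decoupling possible: expanding the square separates the positive and negative samples, so no pair ever has to be enumerated.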
[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181
