面向不平衡二分类准则的稀疏模型构造算法研究

发布时间：2018-03-05 14:28

本文选题：不平衡二分类　切入点：QM　出处：《安徽大学》2017年硕士论文　论文类型：学位论文

【摘要】：社会的进步,科学的发展,给人们生活带来了日新月异的变化。与此同时各种数据信息的不断积累,在方便人们的同时,也带来了新的挑战。如何从这些大量数据中发现有用信息成为当前急需解决的迫切问题。机器学习的出现为解决上述挑战提供了一种有效的手段,其中的分类学习特别是二分类学习由于在众多领域的广泛应用更是成为当前的研究热点。然而在现实的生活中,很多应用(如网络搜索引擎、个性化推荐系统等)都是不平衡二分类问题,且具有数据维度高的特点,已有面向小数据的传统二分类算法很难直接应用在上述问题中。对此,近些年有学者提出研究直接优化不平衡准则的稀疏二分类模型构造算法,并取得了较好的效果。但这些研究考虑的不平衡准则都是AUC或F1等简单易分解的标准,对于其他较复杂的不平衡准则,如何获得相应的稀疏模型,则研究较少。本文就是在这样的背景下,主要研究了面向复杂不平衡准则的稀疏模型构造算法。全文的主要工作如下:(1)文中从二分类学习入手,首先介绍了传统二分类和不平衡二分类在评估准则的差异,然后总结了面向不平衡二分类算法的研究现状,重点分析了不平衡稀疏模型构造算法的进展,在此基础上,提出研究基于L1范式的复杂不平衡稀疏模型构造算法。(2)不同于已有不平衡稀疏模型构造算法多关注AUC或F1等简单准则,本文研究了面向复杂不可分QM准则的稀疏模型构造算法。算法首先定义了基于QM的新目标函数,针对该目标非光滑难以直接优化,提出使用割平面算法进行求解,不仅解决上述问题,且算法的外围迭代次数仅为O(1/ε)。不平衡基准数据集上的实验结果表明,当用QM为评价标准时,本文提出的算法不仅有很好的精度还有较高的稀疏度。(3)针对已有不平衡稀疏模型构造算法都采用批学习,当面对大规模数据集时,计算效率较差,本文提出一种基于随机学习的稀疏模型构造算法。更具体的说,我们关注的不是某一个具体的不平衡标准,而是具有一类通用特性(如伪线性)的评价准则。文中首先将直接优化伪线性准则问题变成一个代价敏感问题。针对新问题,如果直接使用随机梯度法求解难以获得满意的稀疏度,因此提出使用COlMID算法作为优化方法,确保了解的稀疏性。同时针对已有COMID算法即使是强凸目标函数,也仅能获得O(logT/T)收敛速度,给出一种基于多项式衰减的改进方法,并从理论上证明了所提新方法具有O(1/T)的最优收敛效率。不平衡基准数据集上的实验证明了本文所提算法的高效性和有效性。
[Abstract]:With the progress of society and the development of science, people's life has changed with each passing day. At the same time, the accumulation of various kinds of data and information is convenient for people. It has also brought new challenges. How to find useful information from these large amounts of data has become an urgent and urgent problem. The emergence of machine learning provides an effective means to solve these challenges. The classification learning, especially the second classification learning, has become the current research hotspot because of its wide application in many fields. However, in the real life, many applications (such as network search engine, web search engine, etc.). Personalized recommendation system is an unbalanced two-classification problem, and has the characteristics of high data dimension. The traditional two-classification algorithm for small data is difficult to be directly applied to the above problems. In recent years, some scholars have proposed a sparse two-class model construction algorithm, which directly optimizes the disequilibrium criteria, and has achieved good results. However, the disequilibrium criteria considered in these studies are simple and easy to decompose standards such as AUC or F1. There are few studies on how to obtain the corresponding sparse model for other more complex unbalanced criteria. This paper mainly studies the sparse model construction algorithm for complex imbalance criterion. The main work of this paper is as follows: (1) in this paper, we first introduce the difference between traditional two classification and unbalanced two classification in evaluation criteria. Then, the research status of unbalanced two-classification algorithm is summarized, and the development of unbalanced sparse model construction algorithm is analyzed. An algorithm for constructing complex unbalanced sparse models based on L1 normal form is proposed. It is different from the simple criteria such as AUC or F1. In this paper, a sparse model construction algorithm based on complex indivisible QM criterion is studied. Firstly, a new objective function based on QM is defined. In view of the non-smooth and difficult direct optimization of the objective, a cutting plane algorithm is proposed to solve the problem. Not only to solve the above problem, but also to solve the problem, and the number of the peripheral iterations of the algorithm is only 1 / 蔚. The experimental results on the unbalanced datum dataset show that, when QM is used as the evaluation criterion, The algorithm presented in this paper not only has good accuracy but also has a high sparsity. (3) for the existing algorithms of constructing unbalanced sparse models, batch learning is used, and the computational efficiency is poor when dealing with large data sets. In this paper, we propose a sparse model construction algorithm based on random learning. In this paper, the problem of direct optimization of pseudo linear criteria is first turned into a cost sensitive problem. If it is difficult to obtain satisfactory sparsity by using the stochastic gradient method directly, the COlMID algorithm is proposed as an optimization method to ensure the sparsity of the solution. At the same time, for the existing COMID algorithm, even if it is a strongly convex objective function, We can only get the convergence rate of OGlogT / T, and give an improved method based on polynomial attenuation. It is proved theoretically that the proposed method has the optimal convergence efficiency of 1 / T). The experiment on the unbalanced datum dataset proves the efficiency and effectiveness of the proposed algorithm.
【学位授予单位】：安徽大学
【学位级别】：硕士
【学位授予年份】：2017
【分类号】：TP181;TP311.13

【参考文献】