Research on Logistic Regression Classification Learning Algorithms for the Class Imbalance Problem

Published: 2018-04-24 16:39

Topic: logistic regression + class imbalance; Source: Xinyang Normal University (信阳师范学院), master's thesis, 2017

【Abstract】: Class imbalance is one of the most actively studied problems in pattern recognition and machine learning. It is characterized by some classes having markedly fewer instances than others. In practical applications, correctly identifying minority-class instances is often more valuable than correctly identifying majority-class instances. In medical diagnosis, for example, only a very small fraction of people are cancer patients, so identifying them correctly is of great importance. However, as a classical statistical classification method, logistic regression pursues high overall accuracy under the assumption that the classes in the data set contain roughly equal numbers of instances. As a result, the learned model often fails to capture the characteristics of minority-class instances and misclassifies them. To address this problem, this thesis proposes two logistic regression classification learning algorithms for the class imbalance problem.

(1) A new logistic regression learning algorithm for class imbalance. Logistic regression estimates its model parameters by maximum likelihood, which makes it hard for the model to capture the characteristics of minority-class instances. To address this, the thesis constructs MLER (Maximum Likelihood Evaluation and Recall), a metric based on the maximum likelihood function and recall. Unlike the maximum likelihood objective, MLER takes both the accuracy and the recall of the model into account, and thereby ensures the model's performance on all classes. Based on MLER, the thesis proposes LRIL (Logistic Regression for Imbalanced Learning), a new logistic regression algorithm for the class imbalance problem. Guided by MLER, LRIL learns the relevant parameters with Newton's method. Experimental results show that LRIL, while retaining the high accuracy of logistic regression, effectively improves its recall, f-measure, and g-mean, and also shows a clear advantage over other advanced methods.

(2) Targeting the uneven class distribution that characterizes class imbalance problems, the thesis proposes ILKLR (Imbalanced Learning based on k-means and Logistic Regression), a class-imbalance learning algorithm based on a hybrid strategy of k-means and logistic regression. Unlike traditional logistic regression, ILKLR uses the k-means algorithm to partition the majority-class data into several sub-clusters and assigns each sub-cluster a new class label, with the aim of making the training set linearly separable. Experimental results show that the proposed data preprocessing method outperforms traditional logistic regression, under-sampled logistic regression, and over-sampled logistic regression on recall, g-mean, and f-measure.
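The relabeling idea behind ILKLR can be illustrated with a small, self-contained sketch. It is not the thesis's implementation: the abstract does not say how many sub-clusters are used, how the sub-cluster labels are mapped back at prediction time, or which solver is used for ILKLR. The sketch below therefore assumes a plain Lloyd's k-means, a softmax (multi-class logistic) regression trained by stochastic gradient descent, and a prediction rule that collapses every sub-cluster label back to "majority". All function names, the toy data, and the hyperparameters are hypothetical. Recall and g-mean (the geometric mean of recall and specificity) are computed at the end.

```python
import math
import random


def nearest(p, centers):
    """Index of the center closest to p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centers = random.Random(seed).sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def relabel_majority(X, y, k, minority=1):
    """Assumed labeling: minority -> 0, majority sub-cluster c -> c + 1."""
    maj_idx = [i for i, t in enumerate(y) if t != minority]
    clusters = kmeans([X[i] for i in maj_idx], k)
    new_labels = [0] * len(y)
    for i, c in zip(maj_idx, clusters):
        new_labels[i] = c + 1
    return new_labels


def softmax(W, x):
    """Class probabilities; the last entry of each weight row is the bias."""
    scores = [sum(w * v for w, v in zip(row, x)) + row[-1] for row in W]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train_softmax(X, labels, n_classes, lr=0.1, epochs=200):
    """Multi-class logistic regression fitted by per-sample gradient steps."""
    d = len(X[0])
    W = [[0.0] * (d + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, t in zip(X, labels):
            probs = softmax(W, x)
            for c in range(n_classes):
                g = probs[c] - (1.0 if c == t else 0.0)
                for j in range(d):
                    W[c][j] -= lr * g * x[j]
                W[c][d] -= lr * g
    return W


def predict(W, x):
    """Collapse sub-cluster labels: 0 -> minority (1), anything else -> majority (0)."""
    probs = softmax(W, x)
    return 1 if max(range(len(probs)), key=probs.__getitem__) == 0 else 0


def recall_and_gmean(y_true, y_pred):
    """Minority recall and g-mean = sqrt(recall * specificity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    recall, specificity = tp / (tp + fn), tn / (tn + fp)
    return recall, math.sqrt(recall * specificity)


def make_toy_data(seed=0):
    """Hypothetical data: two majority blobs on either side of a small minority blob."""
    rng = random.Random(seed)

    def blob(cx, n):
        return [[rng.gauss(cx, 0.5), rng.gauss(0.0, 0.5)] for _ in range(n)]

    return blob(-3.0, 100) + blob(3.0, 100) + blob(0.0, 20), [0] * 200 + [1] * 20


if __name__ == "__main__":
    X, y = make_toy_data()
    labels = relabel_majority(X, y, k=2)
    W = train_softmax(X, labels, n_classes=3)
    print(recall_and_gmean(y, [predict(W, x) for x in X]))
```

The toy data is chosen so that the minority class sits between two majority blobs: a single linear boundary between "majority" and "minority" cannot separate them, whereas each majority sub-cluster on its own can be separated from the minority class by a line. This is only an illustration of why relabeling may help, evaluated on the training data itself, and says nothing about the results reported in the thesis.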
【Degree-granting institution】: Xinyang Normal University (信阳师范学院)
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP181
