
Research on an Improved LMS-KNN Nearest Neighbor Classification Method

Published: 2018-08-15 19:12
【Abstract】: As one of the classic machine learning algorithms, nearest neighbor classification requires no parameter estimation, is easy to implement, and is well suited to multi-class problems; in recent years it has been widely applied in advertising, chatbots, network security, healthcare, marketing, and other fields. Among its variants, nearest neighbor classification based on local mean and class mean (LMS-KNN) improves on K-nearest neighbor (KNN) classification by addressing KNN's sensitivity to outliers and its failure to exploit the global information of the training samples. Although LMS-KNN raises classification accuracy and efficiency to some extent, it still has drawbacks: class imbalance degrades its accuracy, and the algorithm involves many parameter settings, such as the choice of the neighborhood size K, the determination of the weights, and the choice of distance metric. To further improve the classification accuracy of LMS-KNN, this thesis carries out the following work:

1) Surveys and analyzes several common nearest neighbor classification methods and the local-mean/class-mean nearest neighbor algorithm, compares their principles, advantages, and disadvantages, and briefly introduces the optimization algorithms used in the thesis.

2) To counter the effect of imbalanced data on LMS-KNN classification accuracy, preprocesses the data with an iterative nearest-neighbor oversampling algorithm, then classifies the resulting approximately balanced data set with a semi-supervised local-mean/class-mean classifier.

3) Determines the parameters of the LMS-KNN classifier with cross-validation and a traditional iterative algorithm: the cross-validation error of the classifier is first modeled, the weight of the class-mean vector is then expressed as a formula based on objective decision information, and finally a step-size-optimized unified iterative method selects the weights, improving the classification accuracy and efficiency of the traditional algorithm while balancing subjective and objective decision rules.

4) To optimize the parameter determination of LMS-KNN, exploits the ability of the genetic algorithm (GA) to solve nonlinear, multi-objective optimization problems without relying on domain-specific knowledge, and proposes a genetic-algorithm-based local-mean/class-mean nearest neighbor classifier: the class-mean weights form the initial population, classification error serves as the fitness function, and genetic iteration selects the best class-mean feature weights. Experimental comparison with the traditional KNN, LM-KNN (a local mean-based nonparametric classifier), and LMS-KNN algorithms shows that the proposed method effectively searches out suitable feature weights on UCI data sets and achieves better classification accuracy.
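The core LMS-KNN decision rule combines, for each class, the distance to the mean of the query's k nearest neighbors in that class (the local mean) with the distance to the overall class mean. The sketch below illustrates one plausible form of this rule under assumptions not fixed by the abstract: Euclidean distance and a single fixed weight `w` shared by all classes (the thesis tunes these weights; here they are hard-coded).

```python
import numpy as np

def lms_knn_predict(X_train, y_train, x, k=3, w=0.5):
    """Classify x by combining local-mean and class-mean distances.

    For each class c:
      d_local(c) = distance from x to the mean of its k nearest
                   training points belonging to c
      d_class(c) = distance from x to the overall mean of class c
    The predicted class minimizes w*d_local + (1-w)*d_class.
    The weight w is a hyperparameter (fixed here for illustration).
    """
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        # k nearest neighbors of x within class c
        d = np.linalg.norm(Xc - x, axis=1)
        idx = np.argsort(d)[:k]
        local_mean = Xc[idx].mean(axis=0)
        class_mean = Xc.mean(axis=0)
        scores[c] = (w * np.linalg.norm(x - local_mean)
                     + (1 - w) * np.linalg.norm(x - class_mean))
    return min(scores, key=scores.get)

# Toy usage: two well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(lms_knn_predict(X, y, np.array([0.15, 0.1]), k=2))  # → 0
```

Setting `w = 1` recovers a pure local-mean (LM-KNN-style) rule, while `w = 0` classifies by the nearest class centroid; the thesis's contribution lies precisely in how this balance is chosen.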
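Contribution 2 preprocesses imbalanced data with nearest-neighbor oversampling before classification. The exact iterative variant is not specified in the abstract; the sketch below shows a SMOTE-style scheme as one plausible reading, where synthetic minority samples are interpolated between a minority point and one of its k nearest minority-class neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample_minority(X_min, n_new, k=3):
    """Generate n_new synthetic minority samples by interpolating a
    random seed point toward one of its k nearest minority neighbors
    (a SMOTE-style scheme; the thesis's iterative variant may differ)."""
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        # nearest neighbors excluding the seed point itself
        nbrs = np.argsort(d)[1:k + 1]
        j = rng.choice(nbrs)
        t = rng.random()  # interpolation factor in [0, 1)
        new.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.vstack(new)

# Toy usage: grow a 4-sample minority class to 8 samples.
X_min = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]])
X_syn = oversample_minority(X_min, n_new=4)
print(X_syn.shape)  # → (4, 2)
```

Because each synthetic point lies on a segment between two existing minority samples, the approximately balanced data set stays inside the minority class's region, which is what makes the subsequent semi-supervised LMS-KNN classification meaningful.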
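Contribution 4 evolves the class-mean weights with a genetic algorithm whose fitness is the classification error. The following minimal sketch assumes details the abstract does not fix: a single scalar weight as the gene, leave-one-out error as the fitness function, truncation selection, blend crossover, and Gaussian mutation.

```python
import numpy as np

rng = np.random.default_rng(0)

def loo_error(X, y, w, k=2):
    """Leave-one-out error of the local-mean/class-mean rule for weight w."""
    errs = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        Xtr, ytr, x = X[mask], y[mask], X[i]
        best, best_s = None, np.inf
        for c in np.unique(ytr):
            Xc = Xtr[ytr == c]
            d = np.linalg.norm(Xc - x, axis=1)
            lm = Xc[np.argsort(d)[:k]].mean(axis=0)  # local mean
            cm = Xc.mean(axis=0)                     # class mean
            s = w * np.linalg.norm(x - lm) + (1 - w) * np.linalg.norm(x - cm)
            if s < best_s:
                best, best_s = c, s
        errs += best != y[i]
    return errs / len(X)

def ga_select_weight(X, y, pop=10, gens=15, mut=0.1):
    """Evolve a population of candidate weights; fitness = LOO error."""
    P = rng.random(pop)                      # initial population in [0, 1]
    for _ in range(gens):
        fit = np.array([loo_error(X, y, w) for w in P])
        elite = P[np.argsort(fit)[: pop // 2]]  # selection: keep better half
        children = []
        while len(children) < pop - len(elite):
            a, b = rng.choice(elite, 2)
            c = (a + b) / 2                  # crossover: blend parents
            c += rng.normal(0, mut)          # mutation: Gaussian noise
            children.append(np.clip(c, 0.0, 1.0))
        P = np.concatenate([elite, children])
    fit = np.array([loo_error(X, y, w) for w in P])
    return P[np.argmin(fit)]

# Toy usage on two separable clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.0],
              [3.0, 3.0], [3.1, 3.2], [2.9, 3.1], [3.2, 2.9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w_best = ga_select_weight(X, y)
```

The thesis uses a vector of per-class (or per-feature) weights as the chromosome rather than a single scalar; the same loop applies with array-valued genes and elementwise crossover and mutation.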
【Degree-granting institution】: University of Electronic Science and Technology of China
【Degree level】: Master's
【Year of conferral】: 2017
【CLC number】: TP181




Article ID: 2185147


Link to this article: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/2185147.html


