Research on a Particle Swarm Optimized Weighted Random Forest Algorithm

Published: 2018-06-14 00:15

Topic: random forest + particle swarm; Source: Zhengzhou University, master's thesis, 2017

[Abstract]: The Random Forest (RF) algorithm is a classification model proposed by Breiman in 2001. In essence, it combines bagging (bootstrap aggregating) with Ho's random subspace method and determines the final class by a vote over the predictions of many decision trees. Since its introduction, random forest has been widely applied to data mining and classification problems, and many researchers have proposed improvements to the model. It is an efficient classifier: it requires no background knowledge of the samples, needs no variable selection, and tolerates noise well, so much of the tedious work of data preprocessing can be skipped. However, the voting mechanism gives decision trees with low training accuracy the same voting power as accurate ones, which lowers the accuracy of the vote. In addition, the number of decision trees and the other model parameters usually have a large effect on the final classification result. Through detailed experiments and analysis of the traditional random forest algorithm, this thesis identifies the main cause of its performance shortfall: because every tree votes equally, trees with low training accuracy significantly degrade the accuracy of the final classification. Voting can also leave several classes tied for the highest number of votes, which makes the sample hard to classify; this thesis defines this as the "deadlock phenomenon".

To address the classification difficulties caused by low-accuracy trees and tied top votes, this thesis proposes an Accuracy Weighted Random Forest (AWRF) algorithm built on the traditional random forest model: during voting, each decision tree's vote is multiplied by a weight proportional to its training accuracy. To address the difficulty of choosing parameters, particle swarm optimization (PSO) is used to iteratively optimize the parameters that affect the new model and to select their values. Simulation experiments were designed and run in Matlab on six standard datasets from the UCI database, and the strengths and weaknesses of the new model were compared against other algorithms. The comparison shows the advantages of the new model in classifying this type of data.
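The accuracy-weighted vote and the particle swarm search described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's Matlab implementation. The function names (`tally`, `winners`, `pso_minimize`) are hypothetical, as are the use of the raw training accuracy as the weight, the PSO coefficients, and the toy quadratic objective, which stands in for whatever model error the thesis actually minimizes.

```python
import random
from collections import defaultdict


def tally(predictions, weights=None):
    """Sum the votes per class; every tree counts 1.0 unless weights are given."""
    if weights is None:
        weights = [1.0] * len(predictions)
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return dict(scores)


def winners(scores, tol=1e-12):
    """Return all classes tied for the top score (more than one means a deadlock)."""
    top = max(scores.values())
    return sorted(c for c, s in scores.items() if abs(s - top) <= tol)


def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO inside a box; returns (best_position, best_value).

    The coefficient values are common textbook defaults, assumed here.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move, then clamp the position to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


# Four hypothetical trees: two vote "A", two vote "B".
votes = ["A", "A", "B", "B"]
train_acc = [0.60, 0.65, 0.90, 0.85]  # made-up training accuracies

print(winners(tally(votes)))             # equal votes: both classes tie
print(winners(tally(votes, train_acc)))  # weighted votes: one class wins

# Toy stand-in for a model-error objective, with its minimum at (3, -1).
best, value = pso_minimize(lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2,
                           [(-10, 10), (-10, 10)])
print(best, value)
```

With equal votes the two classes tie 2 to 2, which is the deadlock case. With the accuracy weights, the two more accurate trees outweigh the other two and the tie disappears. In a real setting the PSO objective would presumably train a weighted forest for each candidate parameter vector and return its error, which costs far more than the toy function used here.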
[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP18;TP311.13

[References]