Improvement of the Random Forest Algorithm for Imbalanced Data and Its Parallelization

Published: 2018-10-25 19:30

[Abstract]: Random forest builds a forest of many decision trees in a randomized way, and the individual trees are independent of one another. Each tree is trained on a random sample drawn with replacement, and the forest classifies or predicts by voting. Because the algorithm overcomes the performance bottleneck of a single classifier, it is widely used. It still has weaknesses, however. To address the low efficiency of random forest on imbalanced datasets, this thesis proposes a new method for handling class imbalance. In addition, because the amount of computation grows exponentially, prediction speed and running time become concerns, so the thesis proposes a parallelization scheme based on the way a random forest is constructed. Building on a detailed review of the domestic and international literature, the thesis optimizes random forest in two respects: data preprocessing, and parallelization on the MapReduce framework. First, on data preprocessing, it proposes a new preprocessing method. Random forest handles imbalanced datasets poorly, and the SMOTE algorithm selects samples somewhat blindly and is prone to marginalization. To address both problems, the thesis combines K-means with SMOTE to obtain the K_SMOTE algorithm. Its main idea is to use K-means to find the center points of the original negative class, apply SMOTE to generate a "new negative class", replace all negative-class samples in the original dataset with the "new negative class", and then apply SMOTE again to obtain the "new dataset". Experimental results show that this method improves the classification performance of random forest.
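The abstract gives only the outline of K_SMOTE, so the following is a minimal Python sketch of one possible reading, not the thesis's implementation. It assumes that the negative class is the minority class, that the "new negative class" is formed by SMOTE-style interpolation between each negative sample and its nearest K-means center, and that the second SMOTE pass interpolates among nearest neighbours until the two classes are the same size. All function names and parameters (`kmeans`, `smote`, `k_smote`, `k`, `k_neighbors`) are hypothetical.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[idx].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


def interpolate(a, b, rng):
    """SMOTE-style synthesis: a random point on the segment from a to b."""
    t = rng.random()
    return tuple(x + t * (y - x) for x, y in zip(a, b))


def smote(samples, n_new, k_neighbors, rng):
    """Basic SMOTE: interpolate between a sample and one of its nearest neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        others = [s for s in samples if s is not base]
        neighbours = sorted(others, key=lambda s: dist2(base, s))[:k_neighbors]
        synthetic.append(interpolate(base, rng.choice(neighbours), rng))
    return synthetic


def k_smote(majority, minority, k=2, k_neighbors=3, seed=0):
    """Assumed reading of K_SMOTE; returns (majority, rebalanced minority)."""
    rng = random.Random(seed)
    # Step 1: K-means centers of the original negative (minority) class.
    centers = kmeans(minority, k, seed=seed)
    # Step 2 (assumption): the "new negative class" replaces each original
    # sample with a point between it and its nearest center.
    new_minority = [
        interpolate(p, min(centers, key=lambda c: dist2(p, c)), rng)
        for p in minority
    ]
    # Step 3: SMOTE again on the replaced class until the classes are balanced.
    n_new = max(0, len(majority) - len(new_minority))
    return majority, new_minority + smote(new_minority, n_new, k_neighbors, rng)


# Toy data: 20 majority points and 5 minority points in two loose groups.
majority = [(float(i), float(i % 5)) for i in range(20)]
minority = [(50.0, 50.0), (51.0, 50.0), (50.0, 52.0), (60.0, 61.0), (61.0, 60.0)]
maj_out, min_out = k_smote(majority, minority)
```

Pulling each sample toward a cluster center before oversampling is one way to keep synthetic points away from the class boundary, which is the marginalization issue the abstract attributes to plain SMOTE.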
Regarding parallelization, as data volumes in modern society grow exponentially, classification with random forest takes a great deal of time, and classification performance is also low. Against this background, the thesis exploits the fact that the individual decision trees of a random forest are built independently and, drawing on MapReduce, the distributed framework of the Hadoop platform, proposes parallelizing random forest on MapReduce. The main idea of MapReduce is divide and conquer: a complex problem is split into several identical subproblems, each of which is much easier to solve. For random forest, divide and conquer mainly means building the individual decision trees in parallel and then combining the trained trees to vote. Experimental results show that the parallelized random forest improves in both running time and efficiency.
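The map and reduce roles described above can be illustrated without Hadoop. In this hedged Python sketch, a thread pool stands in for the MapReduce mappers, each map task trains one tree on a bootstrap sample, and the reduce step takes a majority vote. The base learner is a one-split stump used as a stand-in for a full decision tree, and all names (`train_stump`, `map_task`, `reduce_vote`, `train_forest`) are hypothetical, not taken from the thesis. The thread pool only shows the structure; it is not expected to reproduce the speed-up of a real cluster.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_label(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(data, rng):
    """Fit a one-split tree on one randomly chosen feature.

    data is a list of (features, label) pairs. Returns a tuple
    (feature_index, threshold, left_label, right_label).
    """
    feature = rng.randrange(len(data[0][0]))
    best = None
    for x, _ in data:
        thr = x[feature]
        left = [y for xi, y in data if xi[feature] <= thr]
        right = [y for xi, y in data if xi[feature] > thr]
        if not right:
            continue  # this threshold does not split the sample
        l_lab, r_lab = majority_label(left), majority_label(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, thr, l_lab, r_lab)
    if best is None:
        # Every value of the feature is identical: always predict the majority.
        lab = majority_label([y for _, y in data])
        return (feature, 0.0, lab, lab)
    return (feature,) + best[1:]


def map_task(task):
    """Map step: draw a bootstrap sample (with replacement) and train one tree."""
    seed, data = task
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]
    return train_stump(sample, rng)


def predict_tree(tree, x):
    """Classify one point with a single stump."""
    feature, thr, left_label, right_label = tree
    return left_label if x[feature] <= thr else right_label


def reduce_vote(trees, x):
    """Reduce step: combine the independently built trees by majority vote."""
    return majority_label([predict_tree(t, x) for t in trees])


def train_forest(data, n_trees=25, workers=4):
    """Run the independent map tasks concurrently and collect the trees."""
    tasks = [(seed, data) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(map_task, tasks))


# Toy data: class 0 near the origin, class 1 near (10, 10).
data = [((i * 0.1, i * 0.2), 0) for i in range(10)]
data += [((10 + i * 0.1, 10 + i * 0.2), 1) for i in range(10)]
forest = train_forest(data)
```

Because each map task depends only on its own seed and a read-only copy of the data, the tasks need no coordination until the vote, which is the independence property the abstract relies on.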
[Degree-granting institution]: Guangdong University of Technology
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13

[Similar literature]