Research on an Improved Random Forest Algorithm Based on Spark

Published: 2018-05-09 10:38

Thesis topic: random forest + classification accuracy; Source: Taiyuan University of Technology, master's thesis, 2017

[Abstract]: Random forest is a machine learning algorithm with strong classification performance. It handles large-scale data sets well, can work with data sets that have up to several thousand attributes, has few parameters to tune, and is resistant to overfitting. As a result, it has been widely applied and developed in many fields, and a large number of researchers have studied and improved it with fruitful results. However, when the traditional random forest algorithm builds its model, two problems arise. First, the generated decision trees vary widely in classification performance. Second, the decision trees are correlated with one another. Trees that classify poorly, and trees that are strongly correlated with each other, degrade the overall classification performance of the random forest model. To address these two issues, this thesis proposes an improved random forest algorithm based on classification accuracy and similarity. The algorithm uses the AUC metric to evaluate the classification performance of each decision tree in the random forest model and keeps the trees whose performance is above a set threshold. It then computes the pairwise similarity of the retained trees to obtain a similarity matrix. Because highly similar decision trees are also highly correlated, the trees are then clustered according to the similarity matrix and a similarity criterion. Finally, the decision tree with the highest AUC value in each cluster is selected as the representative of that cluster, and the representatives together form the new random forest model.
The improved algorithm was first implemented on the MATLAB platform, and experiments on four UCI data sets (heart disease, breast cancer, Pima Indians diabetes, and Indian liver disease) compared its classification accuracy with that of the traditional random forest algorithm. The results show that the improved algorithm achieves a certain gain in classification accuracy. However, because it adds two optimization steps to the traditional algorithm, its classification speed decreases, and a single-machine MATLAB platform is very slow at processing and iterating over larger data sets. The improved algorithm was therefore also implemented on the Spark platform, which greatly increased its classification speed.
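The selection procedure described in the abstract (filter trees by AUC, build a similarity matrix, cluster, keep the best tree per cluster) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the abstract does not say how similarity is measured or how the clustering is done, so the sketch assumes similarity is the fraction of validation samples on which two trees predict the same class, and it uses a simple greedy clustering. The function names (`auc`, `agreement`, `prune_forest`) and the threshold values `auc_min`, `sim_min`, and `cut` are hypothetical.

```python
from typing import List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive sample is scored
    above a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Assumed similarity measure: share of samples with identical predictions."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def prune_forest(labels: Sequence[int],
                 tree_scores: Sequence[Sequence[float]],
                 auc_min: float = 0.7,
                 sim_min: float = 0.9,
                 cut: float = 0.5) -> List[int]:
    """Return the indices of the trees kept in the pruned forest."""
    # Step 1: keep only trees whose AUC reaches the threshold.
    aucs = [auc(labels, s) for s in tree_scores]
    kept = [i for i, a in enumerate(aucs) if a >= auc_min]

    # Step 2: similarity matrix over the hard predictions of the kept trees.
    preds = {i: [int(s >= cut) for s in tree_scores[i]] for i in kept}
    sim = {(i, j): agreement(preds[i], preds[j]) for i in kept for j in kept}

    # Step 3: greedy clustering (an assumption). Trees are visited in
    # descending AUC order and join the first cluster whose seed tree is
    # similar enough; otherwise they seed a new cluster.
    clusters: List[List[int]] = []
    for i in sorted(kept, key=lambda k: -aucs[k]):
        for cluster in clusters:
            if sim[(i, cluster[0])] >= sim_min:
                cluster.append(i)
                break
        else:
            clusters.append([i])

    # Step 4: the highest-AUC tree of each cluster represents it.
    return sorted(max(c, key=lambda k: aucs[k]) for c in clusters)


# Toy validation set: 3 positives, 3 negatives, scores from 4 trees.
labels = [1, 1, 1, 0, 0, 0]
tree_scores = [
    [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],  # accurate tree
    [0.8, 0.9, 0.6, 0.4, 0.1, 0.2],  # accurate, same predictions as tree 0
    [0.9, 0.4, 0.8, 0.6, 0.2, 0.1],  # decent, but disagrees with tree 0
    [0.2, 0.3, 0.6, 0.7, 0.8, 0.4],  # poor tree, removed by the AUC filter
]
print(prune_forest(labels, tree_scores))  # [0, 2]
```

In the toy data, tree 3 is dropped by the AUC filter, trees 0 and 1 fall into one cluster because their predictions agree on every sample, and tree 2 forms its own cluster, so the pruned forest keeps trees 0 and 2. The two extra passes (AUC scoring and pairwise similarity) are the added optimization steps that the abstract identifies as the source of the slowdown.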
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181
