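Research on an Improved Random Forest Algorithm Based on Spark
Topic: random forest + classification accuracy; Reference: Taiyuan University of Technology, master's thesis, 2017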
[Abstract]: The random forest algorithm is a machine learning algorithm with excellent classification performance. It handles large-scale datasets well, copes with datasets of up to several thousand attributes, has few parameters to tune, and is resistant to overfitting. As a result, it has been widely applied and developed in many fields and has attracted many researchers to study and improve it, with fruitful results. However, the traditional random forest algorithm has two weaknesses when it builds a forest: first, the decision trees it generates vary widely in classification performance; second, the decision trees are correlated with one another. Trees with poor classification performance and trees that are strongly correlated with each other both degrade the overall classification performance of the random forest model. To address these two properties, this thesis proposes an improved random forest algorithm based on classification accuracy and similarity. The algorithm first evaluates each decision tree in the random forest with the AUC classification metric and keeps the trees whose AUC is above a set threshold. It then computes the pairwise similarity of the retained trees to obtain a similarity matrix. Because highly similar trees are also highly correlated, the retained trees are clustered according to the similarity matrix and a similarity criterion. Finally, the tree with the highest AUC in each cluster is chosen as that cluster's representative, and the representatives form the new random forest model.
Tests on UCI datasets including heart disease, breast cancer, Pima Indians diabetes, and Indian liver disease show that the improved random forest algorithm based on classification accuracy and correlation achieves somewhat higher classification accuracy than the traditional random forest algorithm. The improved algorithm was first implemented on the MATLAB platform, and experiments on the four UCI datasets compared its classification accuracy with that of the traditional algorithm. Because the improved algorithm adds two optimization steps to the traditional algorithm, its classification speed drops. In addition, single-machine MATLAB processes and iterates over larger datasets very slowly. The improved algorithm was therefore also implemented on the Spark platform, which substantially increased its classification speed.
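The selection procedure described in the abstract (AUC filtering, similarity matrix, clustering, one representative per cluster) can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the abstract does not specify the similarity measure or the clustering criterion, so the sketch assumes binary labels, uses the fraction of identical predictions on a validation set as the similarity, and groups trees into connected components of pairs whose similarity reaches a threshold. The function names (`auc`, `agreement`, `prune_forest`), the 0.5 score cutoff, and the threshold values are all hypothetical.

```python
from itertools import combinations


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def agreement(preds_a, preds_b):
    """Assumed similarity: fraction of samples on which two trees agree."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def prune_forest(tree_scores, labels, auc_threshold=0.7, sim_threshold=0.9):
    """Return sorted indices of the trees kept as cluster representatives.

    tree_scores[i][j] is tree i's positive-class score for validation sample j.
    Both thresholds are illustrative defaults, not values from the thesis.
    """
    aucs = [auc(scores, labels) for scores in tree_scores]

    # Step 1: keep only trees whose AUC is above the threshold.
    kept = [i for i, a in enumerate(aucs) if a > auc_threshold]

    # Step 2: pairwise similarity of the kept trees (hard labels at a 0.5 cutoff).
    preds = {i: [int(s >= 0.5) for s in tree_scores[i]] for i in kept}
    sim = {(i, j): agreement(preds[i], preds[j]) for i, j in combinations(kept, 2)}

    # Step 3: cluster with union-find; similar pairs end up in the same cluster.
    parent = {i: i for i in kept}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), s in sim.items():
        if s >= sim_threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in kept:
        clusters.setdefault(find(i), []).append(i)

    # Step 4: the tree with the highest AUC represents each cluster.
    return sorted(max(members, key=lambda i: aucs[i]) for members in clusters.values())


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0, 0]
    tree_scores = [
        [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],    # strong tree
        [0.9, 0.8, 0.55, 0.6, 0.2, 0.1],   # similar to tree 0, slightly weaker
        [0.4, 0.6, 0.3, 0.7, 0.5, 0.2],    # weak tree, removed by the AUC filter
        [0.9, 0.4, 0.45, 0.3, 0.6, 0.1],   # decent tree that behaves differently
    ]
    print(prune_forest(tree_scores, labels, auc_threshold=0.7, sim_threshold=0.8))  # [0, 3]
```

In this toy input, tree 2 is dropped for low AUC, trees 0 and 1 fall into one cluster and tree 0 represents it, and tree 3 forms its own cluster. Working on per-tree validation scores keeps the sketch independent of any particular tree library. On Spark, the per-tree scoring step is the natural part to distribute.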
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP181
Article number: 1865699
Article link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/1865699.html