Research on Random Forest Classification Algorithms Based on the Spark Distributed Platform

Published: 2018-03-25 23:15

Topic: high-dimensional big data  Focus: classification  Source: Civil Aviation University of China, master's thesis, 2017

【Abstract】: The rapid development of information technology and networks has produced large volumes of high-dimensional, complex data, and classifying these data effectively in order to mine valuable information is a problem of great significance. Random forest is an important classification algorithm: it tolerates noise and outliers well and lends itself to parallelization. The original random forest classification algorithm and most of its improved variants run on a single machine, and when they face large volumes of high-dimensional, complex data, neither their running time nor their memory footprint meets practical needs. Spark is an efficient distributed computing framework that provides parallel computation combining performance and speed, which makes it an effective way to address this problem. In addition, many features of high-dimensional data carry little information and correlate only weakly with the class label, which lowers the classification accuracy of random forests. This thesis therefore improves the random forest algorithm on the Spark platform to make the classification of high-dimensional data more effective in the big-data era. First, when the random forest algorithm combines its decision trees and makes a classification decision, it cannot treat individual trees differently, so trees with weak classification ability degrade the overall classification performance. To address this, a weighted-tree random forest algorithm is proposed and implemented on the Spark platform. The algorithm adopts a weighted-tree ensemble strategy that strengthens the influence of trees with strong classification ability on the classification decision and weakens the influence of trees with weak classification ability, improving the classification ability of the forest as a whole. Experimental results show that, compared with the original random forest algorithm, the proposed algorithm achieves higher classification accuracy, scales well, and can effectively classify high-dimensional big data. Second, when the random forest algorithm generates a feature subspace at a node, the simple random sampling it uses often yields subspaces containing many features with weak classification ability, which harms classification performance. To address this, a stratified-subspace random forest algorithm is proposed by improving the way the stratified subspace is implemented, and the algorithm is implemented on the Spark platform. The improved implementation both guarantees the correctness of the feature stratification and reduces the computational cost, making it suitable for high-dimensional big data. Experimental results confirm that the proposed algorithm can effectively classify high-dimensional big data: compared with the original random forest algorithm, it achieves higher classification accuracy and better generalization, and it scales well. Finally, the weighted-tree random forest algorithm and the stratified-subspace random forest algorithm are applied to flight delay prediction. After a detailed analysis of the features in the data set, the data are preprocessed by min-max normalization and by dividing delays into delay levels. Experiments confirm that both algorithms can effectively classify and predict the delay level of flight delays.
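
The weighted-tree ensemble strategy described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how the tree weights are computed, so the weights below are hypothetical placeholders (for example, each tree's accuracy on held-out data), and the function names `majority_vote` and `weighted_vote` are illustrative rather than taken from the thesis. The sketch runs on a single machine and does not use Spark.

```python
from collections import defaultdict


def majority_vote(predictions):
    """Plain random-forest vote: every tree counts equally."""
    counts = defaultdict(float)
    for label in predictions:
        counts[label] += 1.0
    # Highest count wins; ties go to the smallest label for determinism.
    return min(counts, key=lambda label: (-counts[label], label))


def weighted_vote(predictions, weights):
    """Weighted vote: each tree contributes its weight instead of 1."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per tree")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # Highest total weight wins; ties go to the smallest label.
    return min(scores, key=lambda label: (-scores[label], label))


# Hypothetical forest of five trees classifying one sample.
tree_predictions = [1, 1, 1, 0, 0]
# Hypothetical per-tree weights (assumed: accuracy on held-out data).
tree_weights = [0.52, 0.55, 0.51, 0.93, 0.95]

plain = majority_vote(tree_predictions)
weighted = weighted_vote(tree_predictions, tree_weights)
print("plain vote:", plain)
print("weighted vote:", weighted)
```

In this toy example three weak trees outvote two strong ones under the plain vote, while the weighted vote sides with the two strong trees because their combined weight is larger.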
【Degree-granting institution】: Civil Aviation University of China
【Degree level】: Master's
【Year degree awarded】: 2017
【Classification number】: TP181

【References】