Research on a Bus Passenger Flow Prediction Method Based on the Spark Platform

Published: 2018-10-31 11:02

[Abstract]: Urban public transport is an important part of urban construction and social life, with a far-reaching and comprehensive influence on the urban economy and on residents' lives. However, low utilization of transport resources, traffic congestion, and traffic pollution are growing increasingly serious, and these problems bear directly on the public's vital interests. Bus passenger flow prediction provides important information for urban public transport policy making, system planning, and operations management, and helps transit managers draw up reasonable operating plans and policies. It is an important means of improving the utilization of transport resources and strengthening urban functions, and it plays a very important role in relieving congestion and reducing traffic pollution.
Random forest is an ensemble model built from multiple decision trees and has a number of advantages over other algorithms. On a single machine, however, both decision-tree construction and prediction voting run serially, so efficiency is low, and with large data volumes the traditional single-machine random forest algorithm consumes a great deal of time. Spark is a distributed computing platform that handles massive data with ease and makes large-scale distributed iterative computation possible. This thesis combines the strengths of the two, using random forest as the bus passenger flow prediction model and Spark as the platform for parallelizing it.
Working from existing bus passenger flow data, the thesis uses Spark SQL to aggregate and extract useful information and analyzes the travel patterns of bus passengers. It studies the temporal distribution of passenger flow and its dynamic influencing factors, analyzing how flow varies between weekdays and weekends and how weather, temperature, holidays, and other factors affect short-term bus passenger flow. To address the long running time of random forest on a single machine, it proposes a Spark-based parallel random forest method that parallelizes both tree building and voting. Experimental results show that the parallel random forest runs more efficiently than the traditional single-machine version. In addition, a comparison of experimental results across several regression models confirms that the parallel random forest achieves good results in both model fit and prediction accuracy.
Most existing work on improving random forest targets classification problems; comparatively little addresses regression. Drawing on earlier research, the thesis proposes an improved method for computing sample similarity in random forest and, based on it, optimizes the voting process with a weighted voting method. It also implements an improved feature selection algorithm that shrinks the feature subset drawn when the random forest selects features, reducing the effect of unimportant features on prediction. Experimental results show that the improved random forest model predicts passenger flow more accurately than the unimproved model.
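The two stages the abstract describes as parallelized, tree building and voting, can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses a Python thread pool as a stand-in for Spark executors (it shows the structure only, not a real speedup), one-split regression stumps as a stand-in for full decision trees, and caller-supplied `weights` as a placeholder for the similarity-based weights, whose formula the abstract does not give. All function names here are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_stump(rows, seed):
    """Fit a one-split regression tree on a bootstrap sample of (x, y) rows."""
    rng = random.Random(seed)
    sample = [rng.choice(rows) for _ in rows]  # sample with replacement
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        if not right:  # largest x cannot split the sample
            continue
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every sampled x is identical: predict a constant
        constant = mean([y for _, y in sample])
        return (float("inf"), constant, constant)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    """Return the leaf mean on the side of the split that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(rows, n_trees=25, workers=4):
    """Build trees independently; each task only needs the data and a seed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seed: fit_stump(rows, seed), range(n_trees)))


def predict_forest(forest, x, weights=None, workers=4):
    """Collect one vote per tree, then combine them by a weighted average.

    With weights=None this is the plain random forest regression average.
    The weights are an assumption of this sketch, not the thesis's formula.
    """
    if weights is None:
        weights = [1.0] * len(forest)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        votes = list(pool.map(lambda tree: predict_stump(tree, x), forest))
    return sum(w * v for w, v in zip(weights, votes)) / sum(weights)
```

Because each tree depends only on the training rows and its own seed, the same structure maps naturally onto a distributed setting where each worker builds a share of the trees and the driver aggregates the votes.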
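[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: U491.17;TP181;TP311.13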

[References]