Research on Rainfall Prediction with an Online Sequential Extreme Learning Machine Based on Storm

Published: 2018-11-22 14:15

[Abstract]: Rainfall is a key parameter in disaster prevention and mitigation and largely reflects the trend of disaster occurrence. It has an important influence on agricultural production, soil erosion, and engineering applications. Accurate rainfall prediction for a region helps agricultural and water conservancy departments improve their ability to prevent and control droughts and floods and keep the damage to a minimum. With floods occurring frequently in China in recent years, using meteorological data to forecast rainfall accurately and promptly has become increasingly important. The arrival of the big data era has also brought new challenges to the weather forecasting industry. Meteorological data come mainly from ground observation, meteorological satellite remote sensing, weather radar, and numerical forecast products. These four categories account for more than 90% of the total data volume and are used directly in meteorological operations, weather forecasting, climate prediction, and meteorological services.

Stream data is a sequence of digitally encoded, continuous signals. In general, a data stream can be regarded as a data collection that grows without bound over time, and stream data is widely used in online public opinion analysis, stock market trend analysis, satellite positioning, real-time financial monitoring, Internet of Things monitoring, and real-time weather monitoring. There is still considerable room for development in rainfall prediction based on large-scale meteorological stream data. Traditional rainfall prediction typically takes offline meteorological data and trains a machine learning method in batch mode: all training samples are learned in a single pass, after which learning stops. In practice, however, the full training sample space is not available at once; samples usually arrive in time order. A large cluster can partly relieve the shortage of computing power caused by large data volumes, but it cannot quickly process and learn from newly arrived data or promptly update the knowledge already learned. To address real-time computation and massive-scale processing of meteorological data, this thesis proposes a rainfall prediction model that uses an online sequential extreme learning machine on the Storm platform.

The main contents and contributions of this thesis are as follows:

(1) Because offline batch prediction on meteorological data cannot reflect changes in rainfall in time, a rainfall prediction model based on the online sequential extreme learning machine is proposed. The extreme learning machine algorithm is adapted to online sequential learning to suit the large scale and real-time nature of meteorological data. The model first initializes several online extreme learning machine models. As new batches of data keep arriving, the model continues to learn the new samples on top of its existing training results. Stochastic gradient descent and error-weight adjustment are introduced to feed back the error of each new prediction and update the error-weight parameters in real time, which improves prediction accuracy.

(2) Because meteorological data are massive and high-dimensional, the data preprocessing stage analyzes the data using the correlation coefficients between decision attributes and uses those coefficients to select the attributes used for prediction. This reduces the complexity of the meteorological data and improves training efficiency. In addition, the Storm stream processing framework is combined with the Kafka distributed message queue to train on large-scale meteorological data in parallel.

Experimental results show that the algorithm, running on the Storm platform, achieves excellent parallel performance and prediction accuracy.
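The incremental learning step described in contribution (1) can be sketched with the standard online sequential extreme learning machine (OS-ELM) recursion. The sketch below is illustrative only and is not the thesis's implementation: it trains a single model rather than several, omits the stochastic gradient descent error-weight adjustment (the abstract does not specify it), and runs in plain NumPy rather than on Storm and Kafka. The class name `OSELM`, its method names, the `ridge` term, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


class OSELM:
    """Single OS-ELM with a sigmoid hidden layer (illustrative sketch)."""

    def __init__(self, n_inputs, n_hidden, ridge=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        # Hidden-layer weights and biases are drawn once and never trained.
        self.W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))
        self.b = rng.uniform(-1.0, 1.0, size=n_hidden)
        # Assumed small ridge term that keeps the initial inverse well defined.
        self.ridge = ridge
        self.P = None      # running inverse of (H^T H + ridge * I)
        self.beta = None   # output weights

    def hidden(self, X):
        # Sigmoid activation of the random hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def init_fit(self, X0, T0):
        # Initialization phase: regularized least squares on the first batch.
        H = self.hidden(X0)
        self.P = np.linalg.inv(H.T @ H + self.ridge * np.eye(H.shape[1]))
        self.beta = self.P @ H.T @ T0

    def partial_fit(self, X, T):
        # Sequential phase: update P and beta from the new chunk only,
        # without revisiting earlier samples (Woodbury identity).
        H = self.hidden(X)
        K = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self.hidden(X) @ self.beta


# Synthetic stand-in for selected weather attributes; not real rainfall data.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(300, 4))
T = (X @ np.array([0.5, -1.0, 2.0, 0.3]))[:, None]

model = OSELM(n_inputs=4, n_hidden=20)
model.init_fit(X[:50], T[:50])               # first batch
for start in range(50, 300, 25):             # later chunks arrive over time
    model.partial_fit(X[start:start + 25], T[start:start + 25])

rmse = float(np.sqrt(np.mean((model.predict(X) - T) ** 2)))
print(f"training RMSE after streaming all chunks: {rmse:.4f}")
```

With the same random hidden layer, this recursion yields the same output weights as a regularized batch fit over all samples seen so far, which is why each new chunk can be absorbed without retraining from scratch.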
[Degree-granting institution]: Xiangtan University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: P457.6;TP181
