Design and Implementation of a Parallelized Random Forest Model Based on Big Data Technology

Published: 2018-01-07 15:26

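Keywords: Design and Implementation of a Parallelized Random Forest Model Based on Big Data Technology　Source: Taiyuan University of Technology, 2017 master's thesis　Document type: Degree thesis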

【Abstract】: Landslides are a frequent and highly destructive type of geological hazard. They cause huge economic losses and heavy casualties and undermine social stability. Landslide hazards are widely distributed in China; regions such as Sichuan and Guizhou, where the geological structure is complex and varied, are especially prone to them. In recent years, as large-scale human activity has expanded, geological hazards such as collapses, landslides, and debris flows have occurred frequently, making hazard prevention particularly important. Providing more accurate methods for landslide prevention and control is therefore an urgent task. When a disaster occurs, the first priority is to make correct and rapid emergency decisions. For disaster management, how to assess the occurrence and development of geological hazards quickly and accurately is a pressing problem, so research on improving the efficiency of geological hazard assessment has both research value and practical significance.

This thesis introduces the significance of landslide research, the progress and current state of landslide research in China and abroad, relevant background on cloud platforms, and the basic theory of assessment models. The random forest model is selected as the experimental model. Province-wide 1:500,000-scale base data for Shanxi Province since 2000 are used, covering landform, rock and soil mass, geological structure, seismic peak ground acceleration, slope, precipitation, and other factors. A Hadoop big data platform was built, the model was parallelized using the MapReduce parallel programming framework, and the validity of the improved model was verified. The experiments lead to the following conclusions:

1. The accuracy of the improved model was verified on a single node. The parallelized random forest model achieves higher accuracy than the traditional serial random forest model, which indicates that the improved model is feasible and practical.
2. On the Hadoop platform, algorithm execution time was compared for different numbers of machines. With the total amount of landslide sample data held constant, execution time decreases as the number of machines increases, which shows that the improved model runs more efficiently.
3. Different total sample sizes were then considered, with 1, 2, and 3 machines running:
(1) For the smallest dataset, Data1, running time differs little as the number of servers increases. This is because parallel computation on Hadoop requires communication and data exchange between machines, which is costly in time, so the algorithm's efficiency sometimes even decreases.
(2) For larger datasets, comparing the standalone case with the case of 1 machine taking part in the computation shows that the curve is steepest over this interval. That is, the running time of the parallelized random forest model drops markedly, showing a clear improvement in efficiency.
(3) Comparing 1, 2, and 3 machines shows that the running time of the improved random forest model does decrease steadily as machines are added, but the slope of the curve also flattens. More machines make the algorithm more efficient, but the time spent on data communication between machines grows as well, which is why the slope gradually decreases.
(4) With 2 and 3 machines, the algorithm takes less time on the Data2, Data3, and Data4 sample sets relative to Data1. This shows that the parallelized random forest model is better suited to large-scale data, where the optimization effect is more pronounced.

This thesis has largely achieved its original aim: by parallelizing the assessment model, assessment efficiency and accuracy are improved, enabling rapid assessment and providing a basis for rapid emergency decision-making for geological hazards in the future.
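
The abstract says the random forest model was parallelized with MapReduce but gives no implementation details, so the following is only a minimal sketch of one common way to split random forest training into a map phase and a reduce phase, not the thesis's actual design. It rests on several assumptions: all function names are hypothetical, the toy data are made up, each "tree" is simplified to a depth-1 decision stump, and a local thread pool stands in for Hadoop map tasks.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels, default=0):
    """Most common label, or `default` for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_stump(rows, labels, rng, n_features):
    """Fit a depth-1 tree on a random feature subset (a simplification of a full tree)."""
    features = rng.sample(range(len(rows[0])), n_features)
    fallback = majority(labels)
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left, fallback), majority(right, fallback))
    return best[1:]


def map_train_tree(task):
    """Map phase: draw a bootstrap sample and train one tree on it."""
    seed, rows, labels, n_features = task
    rng = random.Random(seed)  # per-task generator keeps tasks independent
    idx = [rng.randrange(len(rows)) for _ in rows]
    return train_stump([rows[i] for i in idx], [labels[i] for i in idx], rng, n_features)


def reduce_forest(trees):
    """Reduce phase: collect the independently trained trees into one forest."""
    return list(trees)


def predict(forest, row):
    """Classify a row by majority vote over all trees."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


def train_forest(rows, labels, n_trees=25, n_features=1, workers=3):
    """Run the map tasks on a local thread pool (a stand-in for Hadoop nodes)."""
    tasks = [(seed, rows, labels, n_features) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce_forest(pool.map(map_train_tree, tasks))


# Made-up toy data: (slope in degrees, rainfall in mm) -> 1 = landslide, 0 = none.
ROWS = [(5, 300), (8, 350), (10, 320), (12, 400),
        (30, 800), (35, 900), (40, 850), (45, 950)]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

forest = train_forest(ROWS, LABELS)
print(predict(forest, (50, 1000)), predict(forest, (3, 250)))
```

Because each tree depends only on its own bootstrap sample, the map tasks need no coordination until the reduce step. On a real cluster the training data must still be shipped to every node, which is consistent with the communication overhead the abstract reports for small datasets.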
【Degree-Granting Institution】: Taiyuan University of Technology
【Degree Level】: Master's
【Year Degree Awarded】: 2017
【Classification Number】: P642.22

【Similar Literature】