Research and Implementation of a Hadoop-Based Maritime Big Data Analysis and Processing Platform

Published: 2018-10-20 20:49

[Abstract]: Mining maritime data makes it possible to identify the causes of waterway transport accidents in existing records and to build models that predict such accidents, which can reduce or prevent their occurrence. Maritime data is large in scale, varied in type, and low in value density, and it must be processed quickly. These properties distinguish it from traditional data and give it the characteristics of big data, so researching and implementing a maritime big data platform is of considerable value. Many data mining tools exist, such as Weka and SPSS, but they run only on a single machine and take a long time to compute when the data volume is large. Mahout provides distributed implementations of some common machine learning algorithms, but whenever the data changes it must recompute over the complete data set. The data handled by a maritime data platform grows continuously, and current data mining platforms cannot provide both distributed analysis of big data and incremental computation. To address this, and based on the characteristics of maritime data, this thesis studies how to implement distributed data mining algorithms on Hadoop, designs an incremental computation scheme on top of them, and finally implements a Hadoop-based maritime big data analysis and processing platform. The main contributions are the incremental computation of the naive Bayes, DBSCAN, and Apriori algorithms on the Hadoop platform and a method for detecting incremental data, which together improve data processing efficiency. Experiments show that the platform meets the need for distributed data mining on maritime data and completes classification, clustering, and association analysis tasks efficiently and accurately. Incremental computation also effectively reduces running time without affecting the accuracy of the results.
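The abstract states that incremental computation shortens running time without changing the results. For naive Bayes this is plausible because the model is fully determined by counts, and counts from a new batch can simply be added to the stored ones. The following is a minimal single-machine sketch of that idea, not the thesis's Hadoop implementation: the class name `IncrementalNaiveBayes`, its methods, and the use of Laplace smoothing are assumptions made for illustration. On Hadoop, one would expect mappers to emit the same counts and reducers to sum them, but the abstract does not describe that design.

```python
import math
from collections import Counter


class IncrementalNaiveBayes:
    """Categorical naive Bayes that stores raw counts (hypothetical helper)."""

    def __init__(self):
        self.class_counts = Counter()    # label -> number of records
        self.feature_counts = Counter()  # (label, feature index, value) -> count
        self.feature_values = {}         # feature index -> set of values seen

    def update(self, records, labels):
        """Fold one batch into the stored counts; earlier batches are not re-read."""
        for features, label in zip(records, labels):
            self.class_counts[label] += 1
            for i, value in enumerate(features):
                self.feature_counts[(label, i, value)] += 1
                self.feature_values.setdefault(i, set()).add(value)

    def predict(self, features):
        """Return the label with the highest log posterior (Laplace-smoothed)."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for i, value in enumerate(features):
                k = len(self.feature_values.get(i, ())) or 1
                count = self.feature_counts[(label, i, value)]
                score += math.log((count + 1) / (n + k))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Calling `update` once on all records, or once per batch as new records arrive, yields identical counts and therefore identical predictions, which is why the incremental path does not affect accuracy in this simplified setting.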
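[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13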

[References]