Design and Implementation of a Network Data Parallel Processing System Based on the Hadoop Platform

Published: 2018-06-04 13:40

  Topics: clustering algorithms + Hadoop; Source: Southeast University, master's thesis, 2017


[Abstract]: The mobile Internet era has brought many conveniences to daily life, and it has also produced ever-growing volumes of data. Extracting value from this data is an important research problem. Clustering is one tool for doing so, with a wide range of uses, including grouping previously unclassified items and building applications on the resulting groups. As data volumes surge, clustering algorithms running on a single machine increasingly hit performance bottlenecks, so massive data places new demands on both clustering algorithms and the systems that run them. This thesis presents the design and implementation of a network data parallel processing system based on the Hadoop platform.

The thesis first studies Spark performance optimization in two parts: optimization during development and shuffle optimization. The first part covers avoiding shuffle operators and persisting RDDs that are used multiple times. The second part examines the scenarios to which sort shuffle and hash shuffle are each suited, together with the corresponding optimizations, and verifies them experimentally.

To develop parallel clustering algorithms that can cope with massive data, the thesis introduces the Hadoop platform and deploys Spark on top of it. Because k-means selects its initial centers at random, it can need too many iterations. To address this, the thesis proposes a Spark-based k-means algorithm that uses Kruskal's algorithm to choose the initial centers, and evaluates it by iteration count and iteration time. Spark's k-means++ algorithm serves as the baseline. The experiments show that the Kruskal-improved k-means algorithm has a shorter running time and fewer iterations than Spark's k-means++. Because k-means also ignores the similarity between vectors, the thesis further proposes a Spark-based k-means algorithm improved with both Kruskal's algorithm and the Tanimoto distance. With the squared-error function as the evaluation metric, this algorithm achieves a lower value than both Spark's k-means++ and the Kruskal-improved k-means algorithm, and therefore a better clustering result.

Finally, the thesis builds a complete network data parallel processing system on the Hadoop platform. Its architecture gives the system the capacity for large-scale, high-complexity data computation. Hadoop lets the system rely on inexpensive hardware while providing high computing and storage capacity, and it gives the system good horizontal scalability: as data grows, cluster capacity can be increased simply by adding machines. The system is also broadly applicable, not only to movie recommendation and network anomaly detection but to any scenario that uses clustering for data processing.
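The abstract does not say how Kruskal's algorithm is used to pick the initial centers. One plausible reading, sketched below in plain single-machine Python rather than Spark, is to run Kruskal's algorithm over the pairwise-distance graph until exactly k connected components remain, then take each component's mean as an initial center. The function name `kruskal_initial_centers`, the use of Euclidean distance, and the component-mean step are all assumptions made for illustration, not the thesis's confirmed method.

```python
from itertools import combinations
import math


def kruskal_initial_centers(points, k):
    """Hypothetical seeding: stop Kruskal's algorithm at k components
    and return the mean of each component as an initial center."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    parent = list(range(n))  # union-find forest, one tree per point

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges sorted by Euclidean length.
    # This is O(n^2) edges, so a large data set would need sampling first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    components = n
    for _, i, j in edges:
        if components == k:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two different components
            parent[root_i] = root_j
            components -= 1

    # Group the points by component and average each coordinate.
    groups = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(points[idx])
    return sorted(
        tuple(sum(coord) / len(group) for coord in zip(*group))
        for group in groups.values()
    )


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kruskal_initial_centers(data, 2))
```

Seeding this way is deterministic for a given data set, which is one reason it could cut down the number of iterations compared with random starting centers.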
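For reference, the Tanimoto distance between two real-valued vectors is commonly defined as 1 - (a·b) / (|a|² + |b|² - a·b), and a squared-error score sums each point's squared distance to its nearest center. The sketch below shows both in plain Python. The function names and the choice to make the distance function a parameter are assumptions for illustration. The abstract does not state which distance the thesis uses when computing the squared-error value.

```python
import math


def tanimoto_distance(a, b):
    """Tanimoto distance: 1 - (a.b) / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:  # only happens when both vectors are all zeros
        return 0.0
    return 1.0 - dot / denom


def sum_squared_error(points, centers, distance=math.dist):
    """Sum of squared distances from each point to its nearest center."""
    return sum(
        min(distance(p, c) for c in centers) ** 2
        for p in points
    )


if __name__ == "__main__":
    print(tanimoto_distance((1, 1), (1, 0)))
    print(sum_squared_error([(0, 0), (2, 0)], [(1, 0)]))
    print(sum_squared_error([(1, 1), (1, 0)], [(1, 0)], tanimoto_distance))
```

Passing `tanimoto_distance` as the `distance` argument scores a clustering by vector similarity instead of Euclidean distance, which matches the motivation given in the abstract for the second algorithm.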
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13


Article ID: 1977563

Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/1977563.html

