Design and Implementation of a Parallelized BIRCH Algorithm Based on Spark

Published: 2018-06-26 19:41

Topic: Spark + BIRCH parallelization; Reference: Computer Engineering & Science (《计算机工程与科学》), 2017, Issue 01

[Abstract]: In an era of distributed computing in which memory is king, Spark, a distributed framework built on in-memory computing, has attracted unprecedented attention and adoption. This paper focuses on the design and implementation of a parallelized BIRCH algorithm on Spark. A theoretical performance analysis identifies the Spark transformations that consume the most time during parallelization. Guided by the directed acyclic graph (DAG) of the parallelized BIRCH algorithm, the frequency of shuffles and of disk reads and writes is then reduced in order to optimize performance. Finally, the parallelized BIRCH algorithm is compared experimentally with the single-machine BIRCH algorithm and with the K-Means clustering algorithm in MLlib. The results show that parallelizing BIRCH on Spark causes no obvious loss of clustering quality while achieving satisfactory running time and speedup.
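The abstract does not describe how the clustering work is split across partitions. As background only, the following minimal sketch illustrates the standard additivity property of BIRCH clustering features (CF = point count, linear sum, sum of squared norms), which is what makes per-partition summaries mergeable in principle. It is plain Python rather than Spark code, it is not the paper's implementation, and the names `CF`, `summarize`, and `partitions` are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class CF:
    """BIRCH clustering feature: count, per-dimension linear sum, sum of squared norms."""
    n: int
    ls: Tuple[float, ...]
    ss: float

    @staticmethod
    def of_point(p: Sequence[float]) -> "CF":
        # A single point is a CF with n = 1.
        return CF(1, tuple(p), sum(x * x for x in p))

    def merge(self, other: "CF") -> "CF":
        # CF additivity: summaries of disjoint point sets add component-wise.
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(x / self.n for x in self.ls)


def summarize(points: Iterable[Sequence[float]]) -> CF:
    """Fold a non-empty collection of points into one CF."""
    return reduce(CF.merge, (CF.of_point(p) for p in points))


# Hypothetical data, split into two "partitions" (assumption for illustration).
partitions = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0)]]

# Summarize each partition independently, then merge the small summaries.
local = [summarize(part) for part in partitions]
merged = reduce(CF.merge, local)

# The same summary computed over all points at once.
whole = summarize([p for part in partitions for p in part])
```

Because only the compact summaries need to be combined, a merge step of this kind moves far less data than exchanging raw points, which is in the spirit of the shuffle reduction the abstract mentions.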
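[Author affiliations]: Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications; School of Computer Science, Beijing University of Posts and Telecommunications; State Grid Shandong Electric Power Research Institute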
[Funding]: National 863 Program of China (2015AA050204); State Grid science and technology project (60873120)
[Classification number]: TP311.13
