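Design and Implementation of a Parallelized BIRCH Algorithm on Spark
Published: 2018-06-26 19:41
Topic: Spark + BIRCH parallelization; Reference: Computer Engineering & Science (《计算机工程与科学》), 2017, Issue 01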
[Abstract]: In an era dominated by distributed and in-memory computing, Spark, a distributed framework built on in-memory computation, has attracted unprecedented attention and adoption. This paper focuses on the design and implementation of a parallelized BIRCH algorithm on Spark. A theoretical performance analysis identifies the Spark transformation operations that consume the most time during parallelization, and the directed acyclic graph (DAG) of the parallelized BIRCH algorithm is used to reduce the frequency of shuffles and of disk reads and writes, with the aim of optimizing performance. Finally, the parallelized BIRCH algorithm is compared experimentally with single-machine BIRCH and with the K-Means clustering algorithm in MLlib. The results show that parallelizing BIRCH with Spark causes no obvious loss of clustering quality and achieves reasonably good running time and speedup.
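The abstract does not spell out how the work is split across partitions, so the following is only a minimal sketch of the general property that makes BIRCH amenable to parallelization, not the authors' implementation. BIRCH summarizes a set of points as a clustering feature (N, LS, SS), and the features of disjoint point sets add component-wise. The sketch uses plain Python lists as stand-ins for Spark partitions; the names `CF`, `summarize`, and `partitions` are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    """BIRCH clustering feature: point count, linear sum, sum of squared norms."""
    n: int
    ls: Tuple[float, ...]
    ss: float

    @staticmethod
    def of_point(p: Sequence[float]) -> "CF":
        # A single point is a CF with count 1.
        return CF(1, tuple(p), sum(x * x for x in p))

    def merge(self, other: "CF") -> "CF":
        # CF additivity: summaries of disjoint point sets add component-wise.
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(x / self.n for x in self.ls)


def summarize(partition) -> CF:
    """Collapse one partition's points into a single CF (the local step)."""
    return reduce(CF.merge, (CF.of_point(p) for p in partition))


# Assumption: two in-memory lists play the role of two Spark partitions.
partitions = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(0.0, 2.0), (2.0, 2.0)],
]

local_cfs = [summarize(part) for part in partitions]  # per-partition work
global_cf = reduce(CF.merge, local_cfs)               # combine step
```

Because `merge` is associative and commutative, the local summaries can be combined in any order, which is why only small summaries rather than raw points need to cross partition boundaries in a scheme of this kind. A real BIRCH implementation would also maintain a CF tree with a distance threshold, which this sketch omits.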
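[Author affiliations]: Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications; School of Computer Science, Beijing University of Posts and Telecommunications; State Grid Shandong Electric Power Research Institute
[Funding]: National 863 Program (2015AA050204); State Grid science and technology project (60873120)
[Classification number]: TP311.13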
Article ID: 2071189
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2071189.html