基于BSP的大规模图处理系统中通信和缓存技术研究
发布时间:2018-02-27 03:20
本文关键词: BSP 图处理 消息通信 磁盘缓存 云计算 出处:《东北大学》2012年硕士论文 论文类型:学位论文
【摘要】:随着计算机以及网络技术的发展,在计算机集群中采用并行的分布式计算方式提高计算处理能力已经成为发展趋势。云计算(Cloud Computing)的一个最主要的优势就是它的强大的并行计算处理能力,而这种能力是建立在一个简便高效的并行编程模型的基础上的。其中,最有代表性的就是Google提出的MapReduce分布式并行编程模型。然而,随着近年来互联网应用的迅猛发展,Web网络、社交网络等大规模网络图数据的分析处理成为了研究热点,例如社交网络中的最短路径、网页搜索的PageRank等。这些图处理问题通常需要多次迭代,而MapReduce适合于通用的大数据集计算问题,在处理具有多次迭代性质的图挖掘问题时会导致次优的性能。因而这些图算法往往更适合于采用基于消息传递的并行模型来处理。BSP (Bulk Synchronous Parallel)整体同步并行模型就是一种支持消息传递的块内异步并行,块间显式同步的并行计算模型。随着Google基于BSP模型实现的大规模图处理系统Pregel的提出,在云环境中采用BSP模型实现大规模图处理系统成为了主要的解决途径。 本文旨在以BSP模型为核心,研究基于BSP模型的大规模图处理系统中的消息通信原理和磁盘缓存技术的设计方案及其实现等问题。提出了一种基于队列的消息组织方式和通信方案,并在此基础上提出了基于消息打包、多发送者线程池以及支持消息合并的优化通信方案。针对基于BSP的大图处理系统可能存在的内存不足以存放计算中所有的图和消息数据的问题,本文建立了数据的内存管理模型,并基于内存优先(Memory First)的思想,分别提出了图数据和消息数据的磁盘缓存策略及相应的算法:MF-GHIC算法、MLF图数据遍历算法和基于消息队列优先级的消息数据磁盘缓存算法等。 将本文提出的通信和缓存技术应用于NEU-BSP系统中,我们通过实验,首先分析了通信方案中各类参数的较优值及其相互的制约关系;其次证明了在磁盘缓存率低于30%时,系统的时间性能下降并不显著;最后,我们以PageRank和单源最短路径为例,通过与Hadoop系统的对比实验,证明了在数据完全驻留内存时,NEU-BSP系统比Hadoop系统快1.2到18倍,在数据超过30%以上缓存到磁盘时,NEU-BSP系统仍然能保持与Hadoop系统基本持平的时间性能。
[Abstract]:With the development of computer and network technology, the computer cluster used in parallel distributed computing capacity increase has become the development trend of computing mode. Cloud computing (Cloud Computing) is one of the main advantages is that its powerful parallel computing ability, and this ability is based on a simple and efficient parallel programming model on the basis of MapReduce. Among them, the most representative is the distributed parallel programming model proposed by Google. However, in recent years with the rapid development of Internet application, Web network, social network analysis and other large-scale network data has become a hot research topic, such as social networks in the shortest path, web search PageRank these problems. Graph processing usually requires multiple iterations, while the MapReduce big data set is suitable for general computation, with multiple iterations in processing The problem of qualitative graph mining will lead to suboptimal performance. These tend to be more suitable for the graph algorithm based on parallel model to deal with the.BSP message (Bulk Synchronous Parallel) bulk synchronous parallel model is in an asynchronous message passing parallel support, inter block explicit synchronization parallel computing model with Google. BSP model to realize large-scale graph processing system is proposed based on the Pregel, in a cloud environment using BSP model to realize large-scale graph processing system has become the main way.
The purpose of this paper is to use BSP model as the core, the research of BSP model in the large-scale system message communication principle and disk caching graph processing design and Realization Based on the presented. A news organization and communication scheme based on the queue, and on this basis put forward based on message package, optimized communication scheme for multi sender the thread pool and support message merge. In picture processing system of BSP are not enough memory to store all the maps and message data problems in calculation based on memory management model is established in this paper based on the data, and memory priority (Memory First) thoughts, respectively put forward disk caching strategy map data and message the data and the corresponding algorithm: MF-GHIC algorithm, MLF graph traversal algorithm and based on message queue priority message data disk cache algorithm.
The proposed communication and cache technology in NEU-BSP system, we first analyzed by experiments, the optimal value and the relation between the various parameters of mutual communication in the solution; it is proved that in the disk cache rate is less than 30%, the time performance of the system is not significantly decreased; at last, we use PageRank and single source shortest path as an example, through the contrast experiment with Hadoop system, proved that the complete data memory, NEU-BSP System 1.2 to 18 times faster than the Hadoop system, the data of more than 30% of the cache to disk, NEU-BSP system can still maintain the performance of time with basically the same as the Hadoop system.
【学位授予单位】:东北大学
【学位级别】:硕士
【学位授予年份】:2012
【分类号】:TP338.6;TP333
【参考文献】
相关期刊论文 前3条
1 宋青;汪小帆;;最短路径算法加速技术研究综述[J];电子科技大学学报;2012年02期
2 李稚楹;杨武;谢治军;;PageRank算法研究综述[J];计算机科学;2011年S1期
3 于戈;谷峪;鲍玉斌;王志刚;;云计算环境下的大规模图数据处理技术[J];计算机学报;2011年10期
,本文编号:1540905
本文链接:https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1540905.html