Research on an Iterative MapReduce Computing Model for Cluster Analysis

Published: 2018-05-08 00:29

Topic: clustering algorithms + MapReduce; Source: Tianjin University, master's thesis, 2012

[Abstract]: The MapReduce computing model is an efficient approach to large-scale data processing and is widely used in search engines, e-commerce, social networks, and other fields. However, repeated initialization of the runtime environment, repeated loading of static data, and the network load imposed by intermediate results prevent MapReduce from handling iterative computation efficiently. To address this, this thesis divides data into two classes: medium-scale data, which can be distributed across and cached in the memory of the nodes in a distributed environment, and large-scale data, which cannot. It then designs two schemes for optimizing the efficiency of iterative MapReduce, one for each data scale. First, the thesis designs the MapCombine scheme to improve the efficiency of MapReduce when it processes medium-scale data iteratively. MapCombine adds a data-caching capability to the Combine task, which avoids reloading static data; it adds a new component named Controller to schedule the iterations, which avoids repeated initialization of the distributed environment; and it provides an HBase-based interaction layer that persists intermediate data to keep the design robust. Second, the thesis designs the CycleMap scheme to improve the efficiency of MapReduce when it processes large-scale data iteratively. CycleMap adds a new component named Collector that takes over the work of the Reduce task, which removes the cost of the sort and shuffle phases; it also runs tasks in a pipelined fashion, which indirectly achieves the design goal that the whole iterative job is initialized only once and so avoids repeated initialization of the distributed environment. Finally, three clustering algorithms (K-Means, Fuzzy K-Means, and Dirichlet Process) are implemented on top of each of the two schemes. In a performance comparison against the same clustering algorithms in the MapReduce-based Mahout library, MapCombine and CycleMap achieved speedups of 1.10 and 1.05, respectively.
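
The idea of caching static data and driving the iterations from a single controller can be illustrated with a minimal, single-process K-Means sketch. This is not the thesis's MapCombine implementation: the function names (`load_partitions`, `map_combine`, `merge`, `controller`) are hypothetical, the in-memory lists only stand in for per-node caches, and HBase persistence and real task scheduling are left out. It assumes points are numeric tuples and that only the centroids change between iterations.

```python
from collections import defaultdict


def load_partitions(points, n_workers):
    """Split the static input once; each slice stands in for one node's in-memory cache."""
    return [points[i::n_workers] for i in range(n_workers)]


def map_combine(cached_points, centroids):
    """Map plus local combine on one cached partition.

    Returns, per cluster index, the partial (sum vector, point count).
    """
    dim = len(centroids[0])
    partial = defaultdict(lambda: ([0.0] * dim, 0))
    for p in cached_points:
        # Assign the point to its nearest centroid (squared Euclidean distance).
        k = min(
            range(len(centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        s, n = partial[k]
        partial[k] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial


def merge(partials, centroids):
    """Merge the partial sums into new centroids; an empty cluster keeps its old centroid."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for partial in partials:
        for k, (s, n) in partial.items():
            sums[k] = [a + b for a, b in zip(sums[k], s)]
            counts[k] += n
    return [
        tuple(x / counts[k] for x in sums[k]) if counts[k] else tuple(centroids[k])
        for k in range(len(centroids))
    ]


def controller(points, centroids, n_workers=2, max_iter=20, tol=1e-9):
    """Drive all iterations from one loop: the static points are partitioned once,
    and only the (small) centroid list is passed around on each iteration."""
    partitions = load_partitions(points, n_workers)  # done once, not per iteration
    for _ in range(max_iter):
        partials = [map_combine(part, centroids) for part in partitions]
        new_centroids = merge(partials, centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift <= tol:  # converged: centroids stopped moving
            break
    return centroids
```

Calling `controller(points, initial_centroids)` runs the whole iterative job. In a plain MapReduce job chain, by contrast, each iteration would be a separate job that starts up again and rereads the input, which is the overhead the abstract attributes to repeated initialization and static-data reloading.
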
[Degree-granting institution]: Tianjin University
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP311.13
