Research and Implementation of a Distributed Vector Computing Framework Based on MPI and MapReduce

Published: 2018-03-01 04:41

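Keywords: distributed computing framework; machine learning; vector; MPI; MapReduce  Source: Zhejiang University, master's thesis, 2013  Document type: degree thesis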

[Abstract]: Machine learning is an interdisciplinary field that has emerged over the past 20 years, drawing on probability theory, statistics, approximation theory, convex analysis, and other disciplines. Machine learning algorithms are now widely applied in areas such as data mining, natural language processing, and search engines. Open-source single-machine implementations exist for many of these algorithms, but with the rapid growth of the Internet and the sharp increase in user data, single-machine implementations no longer meet industry needs. To obtain high-performance implementations, developers must write parallel programs on computing frameworks such as MPI and Hadoop/MapReduce. MPI is efficient, flexible to program, and scalable, which makes it well suited to high-performance computing, but it also has drawbacks: its interface is large and costly to learn; writing high-performance MPI programs usually requires handling data partitioning and network communication by hand, and the absence of a MapReduce-like computing model adds to the programmer's burden; algorithm implementations tend to be special-purpose, which hinders code reuse, and there is no unified, abstract distributed data structure; and fault tolerance is poor. To address these drawbacks, this thesis surveys MPI fault-tolerance schemes and the applications and improvements of MapReduce and, building on an abstract vector interface design, proposes a distributed computing framework for MPI based on vectors and MapReduce. The framework abstracts the matrix operations in machine learning algorithms as operations on distributed vectors, and uses asynchronous sends and receives to improve network transfer efficiency, overlapping CPU computation with network communication as far as possible. On top of this, a checkpoint mechanism is introduced to improve the fault tolerance of multi-round iterative algorithms in an MPI environment. To verify the efficiency and correctness of the implementation, the PageRank algorithm was chosen for comparative experiments. The experiments show that the proposed framework is suitable for, and can effectively solve, the distributed implementation of machine learning algorithms that fit the MapReduce model.
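The ideas summarized in the abstract (a partitioned vector, a map step and a reduce step per iteration, and a checkpoint after each round) can be illustrated with a minimal single-process sketch of PageRank. This is not the thesis's framework or API: every name here (`partition`, `map_block`, `pagerank`, `ckpt_path`) is hypothetical, each "rank" is simulated by a loop over index blocks rather than by an MPI process, and the asynchronous send/receive overlap is not modeled at all.

```python
import json
import os

# Hypothetical names throughout; a single-process illustration, not the thesis's API.

def partition(n, parts):
    """Split indices 0..n-1 (n > 0) into contiguous blocks, one per simulated rank."""
    size = (n + parts - 1) // parts
    return [range(i, min(i + size, n)) for i in range(0, n, size)]

def map_block(block, ranks, out_links):
    """Map step: each locally owned node emits (target, contribution) pairs."""
    for u in block:
        targets = out_links[u]
        for v in targets:
            yield v, ranks[u] / len(targets)

def pagerank(out_links, parts=2, damping=0.85, iters=50, ckpt_path=None):
    n = len(out_links)
    ranks, start = [1.0 / n] * n, 0
    # Resume from the last checkpoint if one exists.
    if ckpt_path and os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
        ranks, start = state["ranks"], state["iter"]
    blocks = partition(n, parts)
    for it in range(start, iters):
        sums = [0.0] * n
        # Rank mass held by nodes without out-links is spread uniformly.
        dangling = sum(ranks[u] for u in range(n) if not out_links[u])
        for block in blocks:  # in an MPI program each rank would own one block
            for v, c in map_block(block, ranks, out_links):
                sums[v] += c  # reduce step: sum contributions per target node
        ranks = [(1 - damping) / n + damping * (sums[v] + dangling / n)
                 for v in range(n)]
        if ckpt_path:
            # Write to a temporary file, then rename, so a crash mid-write
            # never leaves a truncated checkpoint behind.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"iter": it + 1, "ranks": ranks}, f)
            os.replace(tmp, ckpt_path)
    return ranks
```

Calling `pagerank([[1], [2], [0]])` on a three-node cycle leaves every node with equal rank. If a run with `ckpt_path` set is interrupted, calling `pagerank` again with the same path continues from the last completed iteration instead of restarting, which is the property a checkpoint mechanism is meant to give multi-round iterative algorithms.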
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP181
