Research on Key Technologies for the Computation Dependency Problem on an MPI-Based Cloud Computing Platform

Published: 2018-07-05 02:55

Topic: MPI + computation dependencies; source: Wuhan University of Technology, 2014 master's thesis

[Abstract]: Clusters built from ordinary commodity computers are becoming an increasingly popular platform for high-performance computing. To make full use of a cluster's computing and storage capacity while simplifying the design of distributed parallel applications, research institutions and technology companies have developed a range of distributed parallel computing frameworks and cloud computing platforms. An analysis of their programming models shows, however, that these frameworks and platforms are not suited to jobs with computation dependencies, or at least cannot handle such jobs effectively. This thesis proposes a directed-graph-based programming model for jobs with computation dependencies. Its core idea is to use a directed graph to express the tasks into which such a job is decomposed and the dependencies among the computations those tasks perform. Starting from the structure of the programming model, the thesis analyzes the core processes of the corresponding parallel computing framework and studies the types of dependency between task computations, how those dependencies are represented, and the task scheduling mechanism. On this basis, a parallel computing framework for the programming model is designed and implemented on top of MPICH, a concrete implementation of the Message Passing Interface (MPI). MPI itself provides no fault-tolerance mechanism. To improve the reliability and availability of the system, the thesis analyzes the strengths and weaknesses of traditional checkpoint-based rollback-recovery protocols and then designs an improved rollback-recovery protocol based on communication-induced checkpointing. The communication-induced checkpointing protocol ensures that a job is restored correctly from its checkpoints. When a process takes a checkpoint, a user-directed checkpointing mechanism effectively reduces the failure-free runtime overhead. When a job recovers from a failure, a three-level fault-tolerant recovery protocol confines recovery to the processes that have a direct dependency relationship with the failed process, without disturbing the normal execution of other processes, which speeds up recovery. To support the three-level recovery protocol for jobs with computation dependencies, the thesis also studies and designs an inter-Worker communication mechanism in which Workers do not share a communication domain (communicator). As a result, application developers only need to follow the framework's conventions to write and submit a sequential program for each computation vertex (task) together with the vertex dependency graph. The system then processes the job in a distributed, parallel fashion automatically, including load balancing, task scheduling, return of computation results, and fault tolerance that is transparent to the user. The prototype of the parallel computing framework is deployed on the MPI-based, multi-level fault-tolerant high-performance cloud computing platform previously developed by the laboratory, so that the platform supports jobs with computation dependencies. Experimental results show that the prototype system handles jobs with computation dependencies correctly and effectively.
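The directed-graph model described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's framework or its MPICH-based implementation. The function names `schedule_waves` and `recovery_scope`, the dictionary representation of the graph, and the reading of "direct dependency relationship" as a task's direct prerequisites plus its direct dependents are all assumptions made for illustration.

```python
def schedule_waves(deps):
    """Group tasks into waves (hypothetical helper, not from the thesis).

    deps maps each task name to the set of tasks it depends on.
    Every task in a wave has all of its prerequisites in earlier waves,
    so the tasks of one wave could be dispatched in parallel.
    """
    tasks = set(deps)
    for prereqs in deps.values():
        tasks |= set(prereqs)
    # Unfinished prerequisites per task, and the reverse edges.
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    dependents = {t: set() for t in tasks}
    for t, prereqs in remaining.items():
        for p in prereqs:
            dependents[p].add(t)

    ready = sorted(t for t, p in remaining.items() if not p)
    waves, done = [], 0
    while ready:
        waves.append(ready)
        done += len(ready)
        next_ready = []
        for t in ready:
            for d in dependents[t]:
                remaining[d].discard(t)
                if not remaining[d]:
                    next_ready.append(d)
        ready = sorted(next_ready)
    if done != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return waves


def recovery_scope(deps, failed):
    """Tasks assumed to take part in recovery when `failed` fails:
    the failed task, its direct prerequisites, and its direct dependents.
    This interpretation is an assumption, not the thesis's protocol."""
    scope = {failed} | set(deps.get(failed, ()))
    scope |= {t for t, prereqs in deps.items() if failed in prereqs}
    return scope


# Example: D needs the results of B and C, which both need A.
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(schedule_waves(deps))
print(sorted(recovery_scope(deps, "B")))
```

In this example, C has no direct dependency relationship with B, so a failure of B leaves C outside the recovery scope, which mirrors the abstract's point that unrelated processes keep running normally.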
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP38
