Asynchronous Checkpoint Fault-Tolerance Technique Based on In-Memory Caching
Published: 2018-11-15 20:35
[Abstract]: As high-performance computer systems grow in scale, system reliability becomes an increasingly serious problem. Checkpointing is the most typical fault-tolerance method, but because parallel file system performance has improved relatively slowly and data write bandwidth is low, the traditional checkpoint method incurs a severe performance penalty. Current computer systems have abundant compute and storage resources, while the write bandwidth of parallel file systems lags behind. To address this, an asynchronous checkpoint fault-tolerance technique based on in-memory caching is proposed. It splits traditional checkpointing into two steps: the checkpoint file is first cached in the local memory of the compute node, and an independent helper task then copies the data to the parallel file system. By exploiting the high bandwidth of local memory and running the helper task in parallel with the compute task, the new method greatly reduces the time overhead introduced by checkpoint-based fault tolerance. Simulations and tests with real programs verify the effectiveness of the asynchronous checkpoint technique.
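The two-step scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `AsyncCheckpointer`, the file naming, and the use of a Python thread as the "helper task" are assumptions made for illustration, and a temporary directory stands in for the parallel file system.

```python
import os
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical two-step checkpointer: cache in memory, flush in background."""

    def __init__(self, target_dir):
        self.target_dir = target_dir          # stand-in for the parallel file system
        self._pending = queue.Queue()         # in-memory cache of checkpoint snapshots
        self._helper = threading.Thread(target=self._drain, daemon=True)
        self._helper.start()

    def checkpoint(self, step, state):
        # Step 1: snapshot the state into local memory and return immediately.
        # bytes() copies a mutable buffer, so later changes do not affect the snapshot.
        self._pending.put((step, bytes(state)))

    def _drain(self):
        # Step 2: the helper task copies cached snapshots to slow storage
        # while the compute task keeps running.
        while True:
            item = self._pending.get()
            if item is None:                  # sentinel: no more checkpoints
                break
            step, data = item
            path = os.path.join(self.target_dir, f"ckpt_{step:06d}.bin")
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)             # publish only fully written files

    def close(self):
        # Wait until every cached checkpoint has reached storage.
        self._pending.put(None)
        self._helper.join()


def demo():
    """Take three checkpoints of a changing buffer and read them back."""
    with tempfile.TemporaryDirectory() as d:
        ckpt = AsyncCheckpointer(d)
        state = bytearray(1024)
        for step in range(3):
            state[0] = step                   # the "computation" mutates the state
            ckpt.checkpoint(step, state)
        ckpt.close()
        files = sorted(os.listdir(d))
        first_bytes = []
        for name in files:
            with open(os.path.join(d, name), "rb") as f:
                first_bytes.append(f.read(1)[0])
        return files, first_bytes


if __name__ == "__main__":
    print(demo())
```

The compute loop pays only for the in-memory copy in `checkpoint`; the file write happens on the helper thread. A checkpoint that is still only in the memory cache is lost if the node fails before the helper finishes copying it, which is why the sketch publishes a file only after it is fully written.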
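[Author affiliations]: College of Computer, National University of Defense Technology; China North Vehicle Research Institute
[Funding]: National Natural Science Foundation of China (60903059, 61003087, 61170049, 61120106005); National High-Tech R&D Program of China ("863" Program) (2012AA01A309); National Science and Technology Major Project on Core Electronic Devices, High-End Generic Chips and Basic Software (2009ZX01036-001-003-001)
[CLC number]: TP302.8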
Article ID: 2334384
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2334384.html