实时系统工作流的能量感知容错算法

发布时间:2020-12-24 06:26
  科学计算需求量的爆发式增长,是高性能计算机(HPC)发展的直接驱动力。计算能力的提升,能够极大推动各个科学领域研究成果的重大突破,但同时也为系统设计提出了更多的挑战。本论文重点研究了高性能计算领域现阶段亟待解决的两个主要难题:容错和能耗。为满足科学计算所需的算力,近年来超级计算机的计算单元数量成倍增长,这就直接导致了错误频率的升高。显然,在如此庞大的计算系统中引入容错机制是必须的,否则一个需要在大量计算单元上长时间运行的大型程序,可能永远都无法执行完成。另一方面,出于预算限制与环境保护的考虑,我们必须要降低系统能耗。尤其因为容错机制引入的时间与空间冗余,也导致了额外的能量消耗。同时,节能技术通常会引起系统故障率的升高。因此,在降低能耗的同时,我们必须要考虑到系统性能与可靠性的降级。在此研究背景下,我们通过调度算法的设计,权衡系统执行时间、容错、能耗等多个因素,以解决大规模高性能计算系统中的若干优化问题,具体来说:1.本论文对工作流任务在大规模并行系统上的调度和检查点策略(时间冗余)进行研究,解决应用的容错与调度长度最小化的问题。该问题的解决方案包括两个阶段:决定任务在可用资源上的调度;... 

【文章来源】:华东师范大学上海市 211工程院校 985工程院校 教育部直属院校

【文章页数】:127 页

【学位级别】:博士

【文章目录】:
摘要
abstract
Introduction
Ⅰ Scheduling and checkpointing workflows for fail-stop errors
    1 Framework
        1.1 Introduction
        1.2 Related work
            1.2.1 Soft and silent errors
            1.2.2 Fail-stop failures
            1.2.3 Branch and bound methods
    2 Optimal solutions for special classes of task graphs
        2.1 Example
        2.2 Preliminaries
            2.2.1 Execution model
            2.2.2 Fault-tolerance model
            2.2.3 Minimal Series Parallel Graphs (M-SPG)
            2.2.4 Problem description and proposed approach
            2.2.5 Evaluation of expected makespan
        2.3 Scheduling M-SPGs
        2.4 Placing checkpoints in superchains
            2.4.1 From chains to superchains
            2.4.2 Checkpointing algorithm
            2.4.3 Technical remarks
        2.5 The CKPTNONE strategy
            2.5.1 #P-completeness
            2.5.2 Approximating the makespan
        2.6 Experiments
            2.6.1 Experimental methodology
            2.6.2 Expected makespan
        2.7 Conclusion
    3 Generic approaches for arbitrary task graphs
        3.1 Example
        3.2 Scheduling and checkpointing algorithms
            3.2.1 Scheduling heuristics
            3.2.2 Checkpointing strategies
        3.3 Experiments
            3.3.1 Experimental methodology
            3.3.2 Simulator
            3.3.3 Results
        3.4 Conclusion
Ⅱ Energy-aware strategies for reliability-oriented real-time taskallocation
    4 Framework
        4.1 Introduction
        4.2 Related work
            4.2.1 Scheduling real-time applications on homogeneous platforms
            4.2.2 Scheduling for heterogeneous platforms
            4.2.3 Scheduling real-time applications on heterogeneous platforms
    5 Homogeneous platforms
        5.1 Previous approach
            5.1.1 Optimization problem
            5.1.2 Replica sets
            5.1.3 Mapping and static schedule
            5.1.4 Dynamic schedule
        5.2 Motivational example
        5.3 New strategies
            5.3.1 Replica sets
            5.3.2 Mapping and static schedule
            5.3.3 Dynamic schedule
            5.3.4 Heuristics
            5.3.5 Complexity analysis
        5.4 Performance evaluation
            5.4.1 Experimental methodology
            5.4.2 Results
        5.5 Conclusion
    6 Heterogeneous platforms
        6.1 Model
            6.1.1 Platform and tasks
            6.1.2 Power and energy
            6.1.3 Reliability
            6.1.4 Optimization objective
            6.1.5 Complexity
        6.2 Mapping
        6.3 Scheduling
        6.4 Lower bound
        6.5 Performance evaluation
            6.5.1 Experimental methodology
            6.5.2 Results
        6.6 Conclusion
Conclusion
Bibliography
Publications



本文编号:2935157

资料下载
论文发表

本文链接:https://www.wllwen.com/shoufeilunwen/xxkjbs/2935157.html


Copyright(c)文论论文网All Rights Reserved | 网站地图 |

版权申明:资料由用户18ee4***提供,本站仅收录摘要或目录,作者需要删除请E-mail邮箱bigeng88@qq.com