Research on Reliable Primary-Backup Task Scheduling Methods in Distributed Environments

Published: 2018-10-24 19:23

[Abstract]: With the development of computing and network technology, data centers and computing centers built on distributed computing systems, which rest on distributed and parallel computing, are widely used in industry, commerce, science and technology, and the military. In these applications, large and complex computing tasks are decomposed into subtasks that are processed in parallel, and the partial results are then merged into the final result. During decomposition and computation, an effective task scheduling mechanism is the key factor determining the performance and efficiency of a distributed computing system; a poor scheduling method seriously reduces the system's computing capacity and parallel efficiency, and may even fail to deliver the benefit that parallel computing should provide. Task scheduling has therefore long been a core issue and an active research topic for distributed systems, grid systems, and cloud computing systems.

However, as distributed systems grow in scale and computing power, their stability and reliability have become decisive for whether parallel applications can run to completion. In supercomputers and large-scale clusters such as Tianhe-2 and Google's data centers, complex upper-layer applications and very high power consumption make the system highly prone to failure. Designing a complete reliability assurance mechanism is therefore particularly important, and designing highly reliable scheduling algorithms for the scheduling stage is one of the main means of doing so. Starting from the goal of "guaranteeing performance while improving reliability", this thesis studies how to ensure the reliability of distributed computing systems while using computing resources efficiently. Tasks are divided into two types, real-time periodic tasks and non-real-time tasks, and highly reliable, high-performance scheduling strategies are realized through primary-backup scheduling. The specific contributions are as follows.

(1) For the reliable scheduling of real-time tasks in distributed computing systems, a scheduling algorithm (DRCAMD) based on the reliability cost of computing nodes and communication links is proposed. By setting weights, the method adjusts the system's weighted objective function to balance users' differing requirements for scheduling performance and reliability. In addition, for real-time tasks with dependencies, a schedulability analysis method is proposed that does not need to consider the various overlap states between primary and backup copies of tasks. Experimental results show that, under given failure probabilities for computing nodes and communication links, the algorithm has certain advantages in both reliability and performance.

(2) For the reliable scheduling of mixed-criticality tasks, a two-stage reliable scheduling algorithm (MCRSS) and a schedulability analysis method are proposed, based on the primary-backup scheduling strategy and combined with the handling of task criticality levels. The first stage schedules the mixed-criticality tasks in order of priority, using backup overlapping to reduce the system overhead introduced by replicating tasks as backup copies. The second stage performs schedulability analysis on the tasks assigned to each target processor and upgrades tasks that fail the analysis until their deadline requirements can be met. Simulation experiments show that MCRSS effectively handles the reliable scheduling of tasks at different criticality levels while keeping the distributed computing system flexible and performant.

(3) For DAG tasks with precedence constraints, a scheduling algorithm (EFTBT) based on the earliest finish time of backup copies is proposed. By analyzing the scheduling state of the primary copy, the method derives the earliest finish time of the backup copy in each case, together with constraints on the target processor to which the backup may be assigned, and the validity of these constraints is proved. The method achieves good scheduling performance while still guaranteeing reliable scheduling. In addition, for scientific workflow applications in which multiple DAGs must be scheduled at the same time, a layer-based multi-DAG scheduling strategy (MDDL) is proposed to address the unfairness that leaves later DAGs unable to be scheduled. Experimental results show that both algorithms improve scheduling performance compared with classical algorithms.

(4) To account for the heterogeneity and dynamics of large-scale distributed computing systems, a reliable scheduling strategy for dependent DAG tasks is proposed, based on an analysis of node and link failure characteristics. Building on EFTBT, the strategy introduces a communication model and a backup execution strategy that better match the needs of real applications, and establishes a method for analyzing the failure characteristics of distributed computing systems. On this basis, a fault-tolerant scheduling algorithm (RAPA) based on a communication contention model is proposed. Experimental results show that RAPA achieves better performance and reliability than HEFT and EFTBT.
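The weighted trade-off between scheduling performance and reliability described in contribution (1) can be illustrated with a small sketch. This is not the DRCAMD algorithm itself, whose details the abstract does not give. It assumes a commonly used model in which the reliability cost of running a task on a node is the node's failure rate multiplied by the execution time, and it omits communication links. The names `Node`, `weighted_score`, `place_primary_and_backup`, and the parameter `alpha` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    failure_rate: float  # assumed: failures per time unit, must be > 0
    speed: float         # assumed: work units processed per time unit


def exec_time(work: float, node: Node) -> float:
    """Time needed to run `work` units on `node`."""
    return work / node.speed


def reliability_cost(work: float, node: Node) -> float:
    """Assumed model: failure rate multiplied by execution time."""
    return node.failure_rate * exec_time(work, node)


def weighted_score(work, node, ready_time, alpha, time_scale, cost_scale):
    """Lower is better. alpha=1 favours early finish, alpha=0 favours reliability."""
    finish = ready_time + exec_time(work, node)
    return (alpha * finish / time_scale
            + (1.0 - alpha) * reliability_cost(work, node) / cost_scale)


def place_primary_and_backup(work, nodes, ready, alpha):
    """Pick the best-scoring node for the primary copy and the next-best,
    necessarily different, node for the backup copy."""
    if len(nodes) < 2:
        raise ValueError("primary-backup placement needs at least two nodes")
    # Normalise both terms so that the weight alpha is meaningful.
    time_scale = max(ready[n.name] + exec_time(work, n) for n in nodes)
    cost_scale = max(reliability_cost(work, n) for n in nodes)
    ranked = sorted(
        nodes,
        key=lambda n: weighted_score(work, n, ready[n.name], alpha,
                                     time_scale, cost_scale),
    )
    return ranked[0], ranked[1]


if __name__ == "__main__":
    nodes = [Node("fast", 0.02, 4.0), Node("safe", 0.001, 1.0), Node("mid", 0.005, 2.0)]
    ready = {n.name: 0.0 for n in nodes}
    for alpha in (1.0, 0.5, 0.0):
        primary, backup = place_primary_and_backup(8.0, nodes, ready, alpha)
        print(alpha, primary.name, backup.name)
```

In this toy setting, changing `alpha` moves the primary copy between the fastest node and the most reliable one, which is the kind of user-tunable balance the abstract attributes to the weighted objective function. Placing the backup on a different node from the primary is what allows a single node failure to be tolerated.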
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP338.8

[Similar literature]