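Research on Correlated Modeling Theory and Optimization Techniques for Reliability, Performance, and Energy Consumption in Large-Scale Complex IT Systems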

Published: 2019-05-21 16:07

[Abstract]: With the rapid development of the Internet, a new generation of information technologies, represented by cloud computing and big data processing, continues to integrate and share resources of all kinds, giving rise to a new class of Large Scale Complex IT Systems (LSCITS). Compared with traditional IT systems, an LSCITS must not only manage large-scale, heterogeneous, and complex infrastructure resources effectively, but also satisfy diverse application requirements, especially those of reliable computing, high-performance computing, and energy conservation and emission reduction. Evaluating system metrics with theoretical models is indispensable for reliable, efficient, and energy-saving scheduling and management in such systems. Existing studies, however, usually analyze reliability, performance, and energy consumption as separate metrics and ignore the Reliability-Performance-Energy (R-P-E) correlation through which these metrics influence one another. In addition, the sheer scale of the infrastructure resources poses new challenges for efficient scheduling and management under multi-objective optimization. To address these key problems, this dissertation presents a systematic study of R-P-E correlated modeling theory for two typical classes of LSCITS, namely cloud computing systems and big data processing systems. It also applies the idea of Bionic Autonomic Nervous Systems (BANS) to the design of the scheduling and management system and, building on the established correlation models, studies optimized scheduling and management techniques that jointly consider reliability, performance, and energy consumption. The main research work and innovative results are as follows.

1) A modeling method based on Hierarchical and Interacting Stochastic Models (HISM) is proposed, and an R-P-E correlation model is built for the important scenario in which a traditional service system is migrated to a cloud computing system. At the infrastructure layer, a semi-Markov reliability model is built on the failure and repair behavior of physical machines and virtual machines, with a detailed analysis of the complex common-cause failure that is specific to virtualized environments, where the failure of one physical machine causes several virtual machines to fail. At the application service layer, a queueing-theoretic performance model is built with the amount of available resources as its conditioning parameter, and it is used to analyze important events in the service system such as queue overflow and timeout failure. At the system state monitoring layer, the effect of failure and repair behavior on the random variation of the system's dynamic energy consumption is analyzed in detail, and a corresponding system energy consumption model is built. Finally, using a Markov reward model and Bayesian theory, comprehensive evaluation metrics that capture the R-P-E correlation, such as expected performance and expected energy consumption, are proposed. Based on these metrics, a new metric that quantifies the complex trade-off between performance and energy, the Performance-Energy Efficiency Ratio (PEER), is further proposed. The analytical results of the model are verified by simulation, and the experiments show that PEER helps a cloud computing system choose a more reasonable and comprehensive resource allocation strategy for the migrated traditional service system.

2) Following the HISM method, R-P-E correlation models are built for emerging cloud service systems, both private and public. To meet the need for timely repair under multiple types of failure, a hierarchical repair mechanism composed of several repair behaviors is proposed, and a corresponding Markov reliability model is built. For the performance analysis of private cloud service systems, a new Jackson queueing network model is proposed in order to analyze the operating state of the core cloud scheduler. The model covers both the time the core cloud scheduler takes to parse a user request and the service time of virtual machines in the resource pool. The performance model of public cloud service systems additionally accounts for the complex behavior of user requests that demand virtual machines in batches. The energy consumption model of the cloud computing system considers not only the random changes in energy consumption caused by failure and repair behavior, but also the effect of random resource occupancy while serving users on the system's dynamic energy consumption. Finally, the R-P-E correlation models of the cloud service systems are verified by simulation, and the main trends of expected performance and expected energy consumption under the influence of the resource allocation decision variables are analyzed in detail.

3) A second correlated modeling method, based on the Laplace-Stieltjes Transform (LST), is proposed, and R-P-E correlation models are built for big data processing systems. For computation-intensive tasks, where task completion time directly determines the actual energy consumed, a semi-Markov reliability model is proposed that accounts for several factors, including the limit on the ideal task completion time, hardware failures, and failures of the data processing program. The LST method is then used to evaluate the expected task execution time and the expected energy consumption. For tasks with large data volumes, the complex decisions of subtask partitioning and redundant subtask execution are fully considered, and an algorithm is designed to compute the probability distribution function of the random task completion time in this distributed, redundant, parallel computing environment. The R-P-E correlation model is then built on Bayesian theory. The experimental results show that the models provide a sound theoretical basis for evaluating and selecting the optimal resource allocation strategy for computation-intensive tasks and the optimal partitioning and redundant execution strategy for large-data-volume tasks.

4) Multi-objective optimization models based on the R-P-E correlation models are proposed. Depending on the type and complexity of the decision variables, several methods for finding optimal solutions are designed, including Pareto-optimal solution analysis, a convergence algorithm, and a genetic algorithm, and a new cloud scheduling and management system based on BANS is built. For local autonomic resource management, sensitivity analysis of the user request arrival rate is used to construct an "optimality distribution map" that describes where each resource allocation strategy is optimal. A trigger mechanism for autonomic resource management is then designed on top of this map, so that dynamic, autonomic resource reallocation keeps the allocation strategy optimal while the intensity of user request arrivals changes. For global request scheduling, a new optimized scheduling method based on the optimality distribution map is designed, which spares the core cloud scheduling node a laborious search for the optimal solution over large-scale, complex infrastructure resources. The experimental results show that the BANS-based cloud scheduling and management system achieves good optimization of the system's expected net profit and also makes the core cloud scheduling node's search for the optimal solution more efficient.
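The abstract does not give the formulas behind expected performance, expected energy consumption, or PEER, so the following Python sketch is only an illustration of the general idea of conditioning a queueing model on the number of available resources and then averaging over the availability distribution. Everything in it is an assumption made for illustration: virtual machines fail and are repaired independently (no common-cause failure), performance is the accepted-request rate of an M/M/c/K queue, power is linear in the number of busy machines, failed machines draw no power, and the ratio is taken as expected throughput divided by expected power. The function names `up_vm_distribution`, `mmck_throughput`, and `evaluate` are hypothetical and do not come from the dissertation.

```python
from math import comb, factorial


def up_vm_distribution(n_vms, fail_rate, repair_rate):
    """P(k VMs are up), k = 0..n_vms, assuming independent two-state VMs."""
    availability = repair_rate / (fail_rate + repair_rate)
    return [
        comb(n_vms, k) * availability**k * (1 - availability) ** (n_vms - k)
        for k in range(n_vms + 1)
    ]


def mmck_throughput(arrival, service, servers, capacity):
    """Accepted-request rate of an M/M/c/K queue (K includes requests in service)."""
    if servers == 0:
        return 0.0  # no VM is up, so every request is lost
    load = arrival / service
    weights = []  # unnormalized steady-state probabilities of j requests in system
    for j in range(capacity + 1):
        if j <= servers:
            weights.append(load**j / factorial(j))
        else:
            weights.append(load**j / (factorial(servers) * servers ** (j - servers)))
    p_overflow = weights[-1] / sum(weights)  # probability the system is full
    return arrival * (1 - p_overflow)


def evaluate(n_vms, fail_rate, repair_rate, arrival, service, capacity,
             idle_watts, busy_watts):
    """Return (expected throughput, expected power, their ratio)."""
    expected_throughput = 0.0
    expected_power = 0.0
    for k, p_k in enumerate(up_vm_distribution(n_vms, fail_rate, repair_rate)):
        throughput = mmck_throughput(arrival, service, k, capacity)
        busy_vms = throughput / service  # mean number of busy VMs
        # Assumption: each up VM draws idle power, busy VMs draw the extra
        # (busy - idle) power, and failed VMs draw nothing.
        power = k * idle_watts + busy_vms * (busy_watts - idle_watts)
        expected_throughput += p_k * throughput
        expected_power += p_k * power
    ratio = expected_throughput / expected_power if expected_power > 0 else 0.0
    return expected_throughput, expected_power, ratio


if __name__ == "__main__":
    # Compare candidate allocations by the illustrative ratio.
    for n in (2, 4, 8):
        print(n, evaluate(n, 0.01, 0.5, 3.0, 1.0, 10, 100.0, 200.0))
```

Under these assumptions, allocating more virtual machines raises throughput by reducing overflow but also raises idle power, so the ratio gives one way to compare allocations on both axes at once. The dissertation's own models are richer (semi-Markov failure behavior, common-cause failures, timeout failures), and its definition of PEER may differ from the ratio used here.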
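[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Doctoral
[Year conferred]: 2016
[Classification number]: TP302.7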

[Similar literature]