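Evaluation of Resource Management Methods for Large-Scale High Energy Physics Computing Clusters
Published: 2018-05-13 03:24
Topics: resource management system + job scheduler; Source: Computer Science (《计算机科学》), 2017, No. 10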
[Abstract]: High energy physics data consist of physics events that are independent of one another. Large numbers of jobs can therefore process many different data files at the same time, which parallelizes high energy physics computing tasks and makes them a typical high-throughput computing workload. The computing cluster at the Institute of High Energy Physics (IHEP) uses the open-source TORQUE/Maui for resource management and job scheduling, and ensures fairness by partitioning cluster resources into separate queues and capping the number of jobs each user may run at once; however, this also leaves the overall resource utilization of the cluster very low. SLURM and HTCondor are open-source resource management systems that have become popular in recent years. The former offers a rich set of job scheduling policies and the latter is well suited to high-throughput computing; both can replace the aging, poorly maintained TORQUE/Maui, and both are feasible options for managing computing cluster resources. The job submission behavior of Daya Bay experiment users was simulated on SLURM and HTCondor test clusters to test the resource allocation behavior and efficiency of the two systems, and the results were compared with the actual scheduling of the same jobs on the IHEP TORQUE/Maui cluster. The strengths and weaknesses of SLURM and HTCondor are analyzed, and the feasibility of using either system to manage the IHEP computing cluster is discussed.
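The abstract's point that fixed queues can leave a cluster underused can be illustrated with a minimal Python sketch. It is not the paper's test method: the queue names, slot counts and job demands below are hypothetical numbers chosen only to show the effect, and the model ignores job durations, priorities and per-user caps.

```python
def static_partition_utilization(queue_slots, queue_demand):
    """Fraction of slots busy when each queue may only use its own slots."""
    busy = sum(min(slots, queue_demand.get(name, 0))
               for name, slots in queue_slots.items())
    return busy / sum(queue_slots.values())


def shared_pool_utilization(queue_slots, queue_demand):
    """Fraction of slots busy when any job may run on any free slot."""
    total = sum(queue_slots.values())
    busy = min(total, sum(queue_demand.values()))
    return busy / total


# Hypothetical cluster: three queues of 100 job slots each.
queue_slots = {"dyb": 100, "bes": 100, "juno": 100}
# Hypothetical demand: one queue is oversubscribed, the others are nearly idle.
queue_demand = {"dyb": 250, "bes": 20, "juno": 0}

static_util = static_partition_utilization(queue_slots, queue_demand)
shared_util = shared_pool_utilization(queue_slots, queue_demand)
print(f"static queues: {static_util:.0%}, shared pool: {shared_util:.0%}")
```

In this toy case the oversubscribed queue cannot borrow idle slots from the others, so the partitioned cluster sits well below the shared pool's utilization. A real scheduler would recover fairness through mechanisms such as fair-share priorities instead of hard partitions.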
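[Author affiliation]: Institute of High Energy Physics, Chinese Academy of Sciences
[Funding]: Supported by the National Natural Science Foundation of China (11475210)
[Classification number]: O572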
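[Related literature]
Related conference papers (top 1)
1 Pei Erming; Karim Bernardet; Yu Chuansong; Sun Gongxing; An Agent-based grid job scheduling system combining "push" and "pull" [A]; Proceedings of the 14th National Annual Conference on Nuclear Electronics and Nuclear Detection Technology (2) [C]; 2008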
Article ID: 1881445
Article link: https://www.wllwen.com/kejilunwen/wulilw/1881445.html