Optimization of MPI Collective Communication on the "Tianhe" High-Speed Interconnect Network

Published: 2018-06-02 01:20

Topic: MPI + collective communication; Source: National University of Defense Technology, master's thesis, 2014

[Abstract]: Collective communication operations are widely used in modern MPI parallel applications. They account for a large share of the running time of scientific computations (sometimes as much as 70%) and also give programmers a more convenient programming interface. Software implementations of collective operations, however, are built on point-to-point operations. As the system grows, the number of communication steps, the volume of data processed, and the communication distance all increase, so the time spent in collective communication rises quickly and scalability is poor. Applications, meanwhile, demand ever better scalability as systems grow, which makes optimizing collective communication increasingly important. One effective way to accelerate collective operations is to offload them to the network interface card (NIC): dedicated components added to the NIC assist the host processor with the data movement or data computation of a collective operation, an approach also known as combined hardware-software collective communication. The "Tianhe" high-speed interconnect network uses trigger logic added to the NIC to offload data-movement tasks from the host processor. Building on this network, this thesis studies methods for optimizing MPI collective communication. The main results are as follows. 1) The α-β model is extended, and the extended model is used to compute the latency of point-to-point-based collective operations for comparison with offload-based collective communication. The classical α-β model is suitable only for qualitative analysis of collective operations, not quantitative analysis; the extended model supports quantitative analysis and provides the baseline for the later comparison with offload-mode collective operations. 2) An evaluation model for offload-based collective communication is proposed. In tests it predicts the measured data well, and it provides the theoretical analysis for designing the barrier synchronization and broadcast algorithms. The model gives a theoretical basis for, and guides, the subsequent optimization of collective communication on the "Tianhe" high-speed interconnect network. 3) The offload-based barrier synchronization and broadcast operations are optimized algorithmically and evaluated experimentally, and an algorithm is designed for an offload-based gather operation. Barrier synchronization and broadcast are the most commonly used collective operations and can also serve as building blocks for many other collectives. Both operations are implemented with two algorithms, the k-ary tree and the k-nomial tree, and the value of k at which each operation performs best is given from both theory and measurement. According to the theoretical model, trigger-based collective operations scale well. The extended α-β model and the offload evaluation model are both validated experimentally. The results show that the two models quantify software-based and offload-based collective operations well, respectively, and provide a theoretical basis for subsequent optimization. At a scale of 64 nodes, offload-based barrier synchronization improves on point-to-point-based barrier synchronization by a factor of 2.17; after optimizing the offload-based barrier synchronization and broadcast, barrier performance improves by a further factor of 1.1 and broadcast performance by a factor of 1.46.
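To make the two tree shapes named in the abstract concrete, the following is a minimal Python sketch, not the thesis's implementation. It builds k-ary and k-nomial broadcast trees rooted at rank 0 and estimates broadcast completion time under the classical α-β model only, where one message of n bytes costs α + n·β. The function names, the single-port assumption (a node sends to its children one after another), and the send order are all assumptions made for illustration; the thesis's extended α-β model and its offload evaluation model are not reproduced here.

```python
from typing import Callable, List


def kary_children(rank: int, k: int, size: int) -> List[int]:
    """Children of `rank` in a k-ary tree rooted at rank 0."""
    first = k * rank + 1
    return [c for c in range(first, first + k) if c < size]


def knomial_children(rank: int, k: int, size: int) -> List[int]:
    """Children of `rank` in a k-nomial tree rooted at rank 0.

    Children are listed largest subtree first, which is the order
    in which this sketch assumes they are sent to.
    """
    # Climb powers of k until `rank` stops being a multiple of mask * k
    # (or, for the root, until mask covers all ranks).
    mask = 1
    while mask < size and rank % (mask * k) == 0:
        mask *= k
    children = []
    mask //= k
    while mask >= 1:
        for j in range(1, k):
            child = rank + j * mask
            if child < size:
                children.append(child)
        mask //= k
    return children


def broadcast_time(
    size: int,
    k: int,
    alpha: float,
    beta: float,
    nbytes: int,
    children_of: Callable[[int, int, int], List[int]],
) -> float:
    """Completion time of a tree broadcast under the classical alpha-beta model.

    Assumption: each node sends to its children sequentially, and a child
    may start forwarding as soon as its own copy has fully arrived.
    """
    per_send = alpha + nbytes * beta
    recv_time = [0.0] * size
    # In both trees a parent always has a lower rank than its children,
    # so a single ascending pass sees every parent before its children.
    for rank in range(size):
        for i, child in enumerate(children_of(rank, k, size)):
            recv_time[child] = recv_time[rank] + (i + 1) * per_send
    return max(recv_time)


if __name__ == "__main__":
    # Illustrative parameters only; they are not measurements from the thesis.
    alpha, beta, nbytes, size = 1.0e-6, 1.0e-9, 1024, 64
    for k in (2, 4, 8):
        t_ary = broadcast_time(size, k, alpha, beta, nbytes, kary_children)
        t_nom = broadcast_time(size, k, alpha, beta, nbytes, knomial_children)
        print(f"k={k}: k-ary {t_ary:.3e} s, k-nomial {t_nom:.3e} s")
```

Sweeping k in this way mirrors, in a much simplified form, the search for the best-performing k that the abstract describes. Because the cost model here is the plain software one, the k it favors need not match the optimum the thesis reports for the offload-based operations.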
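[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.03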
