Research on Key Technologies for Accelerating Particle Transport Simulation Based on High-Performance Coprocessors

Published: 2018-01-06 13:30

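Keywords: Research on Key Technologies for Accelerating Particle Transport Simulation Based on High-Performance Coprocessors　Source: National University of Defense Technology, 2016 doctoral dissertation　Document type: dissertation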

[Abstract]: The particle transport equation describes the conservation of mass, charge, momentum, and energy of particles in a given medium. Its numerical solution is widely used in many fields of physics and engineering, such as celestial-body detection in astrophysics, radiotherapy in medicine, and the design of nuclear reactors and nuclear weapons. Driven by application demands, the need for high-accuracy simulation of the particle transport equation keeps growing and physical models keep becoming finer, which increases the computational scale by factors of thousands to tens of thousands. In addition, in certain application areas, numerical simulation of particle transport also faces demanding real-time requirements. In recent years, accelerating computation with coprocessors that offer a high performance-to-power ratio has become an important trend in high-performance parallel computing. However, the variety of coprocessor types and the growing complexity of their hardware mean that coprocessor-accelerated particle transport simulation faces several challenges: parallel algorithm design and optimization, the diversity of programming models, and the difficulty of choosing a suitable coprocessor. To address these challenges, this dissertation studies parallel acceleration algorithms and architectures for the deterministic method and the Monte Carlo method of particle transport on two high-performance coprocessors, MIC and FPGA. The main results are as follows:
1. A multi-level parallel sweep algorithm for three-dimensional structured grids on MIC is proposed, for the parallel solution of the finite-difference discrete ordinates equations of particle transport on structured grids. The algorithm exploits the multi-level parallelism in the wavefront sweep: it maps the sweep of the I-line grid columns in a wavefront directly onto the parallel hardware threads of the MIC, and it vectorizes the iterative solution of the finite-difference discrete ordinates equations along each I-line grid column by isolating the computation of key physical quantities. Numerical experiments show that, compared with the parallel implementation on the CPU, the MIC achieves a speedup of 2.03x without negative-flux fixup and 1.50x with negative-flux fixup.
2. Two multi-level parallel sweep algorithms for two-dimensional unstructured grids on MIC are proposed, for the parallel solution of the discontinuous finite element discrete ordinates equations of particle transport on unstructured grids. Before the parallel solve, a search-and-sort algorithm determines the wavefronts of the sweep and their order. The first algorithm uses parallel hardware threads to exploit the parallelism across all cells of a wavefront over all energy groups, and uses the vector units to exploit the data-level parallelism in solving the discrete equations for a single cell in a single energy group. The second algorithm uses parallel threads and vector units together to exploit the parallelism of the computation across all cells of a single wavefront over all energy groups, and adds memory performance optimizations. Numerical experiments show that, compared with the serial implementation on the CPU, the two algorithms achieve speedups of 39.92x and 71.54x, respectively.
3. A multi-level parallel fast Monte Carlo simulation algorithm on MIC is proposed, for the fast solution of DPM, a fast Monte Carlo radiation transport method for coupled photon-electron transport. In this algorithm, a multi-level parallel access data structure is designed to meet the memory access needs of both thread-level and vector-level parallelism. On top of thread-level parallelization, data locality is optimized, and a multi-level parallel random number generator is constructed so that most of the electron transport process can be simulated in vectorized form. Numerical experiments show that the MIC-based DPM implementation matches the CPU-based DPM implementation in accuracy. Compared with the serial DPM implementation on the CPU, the MIC-based multi-level parallel algorithm achieves a speedup of 16.22x to 18.82x.
4. An FPGA-based parallel acceleration architecture for fast Monte Carlo simulation is proposed, for the fast solution of fast Monte Carlo radiation transport of photons. The architecture is implemented in single-precision floating point and achieves low power consumption and high performance through pipeline parallelism, bit-level parallelism, and special structural design. Numerical experiments show that the dose distribution produced by the FPGA-based architecture agrees with the results computed in software. Compared with serial implementations on a 3.40 GHz CPU and a 2.30 GHz CPU, the architecture achieves speedups of 22.15x and 33.18x, respectively.
In summary, this dissertation presents an in-depth study of parallel numerical algorithms for particle transport on MIC and FPGA that make the most of the computing potential of both devices. It lays a foundation for the practical application of the corresponding numerical simulations, for building dedicated large-scale parallel computing systems for particle transport on high-performance coprocessors, and for large-scale coprocessor-based parallel solution of particle transport.
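
The abstract does not describe how the search-and-sort step of result 2 works internally. As an illustration only, one common way to obtain sweep wavefronts is a level-by-level topological sort (Kahn's algorithm) over the upwind dependency graph of a single sweep direction. The sketch below assumes that approach; the names `sweep_wavefronts` and `upwind_edges` are hypothetical and are not taken from the dissertation.

```python
from collections import defaultdict


def sweep_wavefronts(num_cells, upwind_edges):
    """Group mesh cells into wavefronts for one sweep direction.

    upwind_edges is an iterable of (u, d) pairs meaning that cell d
    needs the outgoing flux of cell u before it can be solved.
    (Hypothetical input format, assumed for this sketch.)
    Returns a list of wavefronts; cells within one wavefront do not
    depend on each other and could be processed in parallel.
    """
    downstream = defaultdict(list)
    pending = [0] * num_cells  # unsolved upwind neighbours per cell
    for u, d in upwind_edges:
        downstream[u].append(d)
        pending[d] += 1

    # Cells with no upwind neighbour form the first wavefront.
    front = [c for c in range(num_cells) if pending[c] == 0]
    wavefronts = []
    visited = 0
    while front:
        wavefronts.append(sorted(front))
        visited += len(front)
        next_front = []
        for c in front:
            for d in downstream[c]:
                pending[d] -= 1
                if pending[d] == 0:
                    next_front.append(d)
        front = next_front

    if visited != num_cells:
        # A dependency cycle would need to be broken before sweeping.
        raise ValueError("cyclic upwind dependency in this direction")
    return wavefronts


# A 2x2 grid numbered row by row, swept from cell 0 toward cell 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(sweep_wavefronts(4, edges))  # [[0], [1, 2], [3]]
```

In this small example cells 1 and 2 land in the same wavefront, which is the kind of independence that the abstract's algorithms map onto hardware threads and vector units. A structured grid is a special case of the same dependency graph.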

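[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: O572.2; TP332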
