GPU-Based Parallel Implementation of TOUGHREACT

Published: 2018-08-24 19:20

[Abstract]: High-performance parallel computing has advanced rapidly in recent years. Efficiently simulating numerical models of physical and chemical states under complex geological conditions on new multi-core, many-core, and GPU platforms has become a topic of growing interest to geoscientists. With the emergence and rapid development of general-purpose GPU computing, more and more researchers use GPUs to accelerate subsurface multiphase flow simulators in order to meet the demands of large-scale, high-accuracy applications. TOUGHREACT, developed at Lawrence Berkeley National Laboratory, is currently the most widely used program for simulating the coupled processes of subsurface multiphase fluid flow and geochemical reactive transport. However, it runs inefficiently on complex geological problems that require large scale and high accuracy, such as geological storage of carbon dioxide. Accelerating TOUGHREACT with GPU parallel computing therefore has considerable engineering and research value. To that end, this thesis parallelizes TOUGHREACT on a heterogeneous CPU-GPU platform.
First, the basic simulation workflow of the software is outlined on the basis of the relevant domain knowledge, and its modular structure is analyzed in detail with reference to existing work. The multiphase flow module and the geochemical reactive transport module are compared in terms of their solution procedures. Taking into account the size of the linear systems and the concurrency available in the iterative solve within each time step, the multiphase flow simulation is identified as the part better suited to parallel implementation on the GPU.
Partial differential equations (PDEs) are often used as numerical models of mass and energy conservation when solving practical problems in the natural and social sciences, and solving sparse linear systems is one of the main computational steps in the discretized solution of PDEs. For some site-scale, large problems in particular, the sparse linear solve can account for more than 80% of the run time. This thesis therefore compares the execution time of the individual TOUGHREACT modules and focuses the parallelization effort on the linear solver.
Because the coefficient matrices arising in multiphase flow problems are nonsymmetric and not positive definite, several biconjugate gradient methods from the family of Krylov subspace methods are used to solve the systems. To avoid sacrificing solution efficiency, the preconditioning step is not ported to the GPU. Instead, the CUDA implementation targets the two most time-consuming operations in the solve: sparse matrix-vector multiplication (SpMV) and vector inner products. Once the mapping to kernel functions is fixed, developing the CUDA program is not difficult, but several optimizations significantly improve its performance. The following are applied: choosing a suitable sparse matrix storage format to reduce memory usage and host-device transfer overhead; optimizing memory access by using shared memory and page-locked memory and by fusing sequentially executed kernels to reduce global memory accesses; optimizing the instruction stream by avoiding unnecessary synchronization and by unrolling loops; and implementing multiple kernel versions together with a decision tree for thread configuration, so that threads are organized according to problem size, the GPU's processors are fully used, and the load is balanced.
Finally, the parallel preconditioned conjugate gradient solver is integrated into TOUGHREACT. Practical problems of different sizes are simulated on the CPU-GPU platform to test the performance of the parallel BiCG and BiCGSTAB implementations. The experiments show a speedup of up to 3.4x for the parallel linear solver over the serial CPU program, and up to 2.8x for the overall multiphase flow simulation. These results support the parallelization strategy adopted here and provide a solid foundation and practical experience for a future GPU port of the geochemical reactive transport module.
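To illustrate why the abstract singles out SpMV and vector inner products, the following is a minimal serial Python sketch of unpreconditioned BiCGSTAB. It is not the thesis's CUDA code. The CSR storage format is an assumption (the abstract does not name the format chosen), and the names `spmv_csr`, `dot`, and `bicgstab` as well as the 4x4 test system are hypothetical. Each iteration consists of two SpMV calls and a handful of inner products, which are the operations the thesis offloads to the GPU.

```python
import math

def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A @ x for a matrix A stored in CSR form (assumed format)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Rows are independent of each other, so a GPU version could
        # assign one thread (or one group of threads) per row.
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

def dot(u, v):
    """Inner product; on a GPU this would be a parallel reduction."""
    return sum(a * b for a, b in zip(u, v))

def bicgstab(row_ptr, col_idx, values, b, tol=1e-10, max_iter=200):
    """Unpreconditioned BiCGSTAB for A x = b, starting from x = 0.

    Returns (x, iterations_used).
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)            # initial residual r0 = b - A*0 = b
    r_hat = list(r)        # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        if rho_new == 0.0:
            return x, it   # breakdown: give up with the current iterate
        beta = (rho_new / rho) * (alpha / omega)
        p = [r[i] + beta * (p[i] - omega * v[i]) for i in range(n)]
        v = spmv_csr(row_ptr, col_idx, values, p)      # first SpMV
        alpha = rho_new / dot(r_hat, v)
        s = [r[i] - alpha * v[i] for i in range(n)]
        if math.sqrt(dot(s, s)) / b_norm < tol:
            return [x[i] + alpha * p[i] for i in range(n)], it
        t = spmv_csr(row_ptr, col_idx, values, s)      # second SpMV
        omega = dot(t, s) / dot(t, t)
        x = [x[i] + alpha * p[i] + omega * s[i] for i in range(n)]
        r = [s[i] - omega * t[i] for i in range(n)]
        rho = rho_new
        if math.sqrt(dot(r, r)) / b_norm < tol:
            return x, it
    return x, max_iter

# Hypothetical nonsymmetric 4x4 system:
#   [ 4 -1  0  0]
#   [-2  4 -1  0]
#   [ 0 -2  4 -1]
#   [ 0  0 -2  4]
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
values = [4.0, -1.0, -2.0, 4.0, -1.0, -2.0, 4.0, -1.0, -2.0, 4.0]
b = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0, 4.0])
x, iters = bicgstab(row_ptr, col_idx, values, b)
```

In this sketch all of the matrix work sits inside `spmv_csr` and all of the global reductions inside `dot`, so a GPU port along the lines the abstract describes would replace those two helpers and leave the iteration logic on the host.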
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP338.6

[References]