Research on the Implementation and Optimization of Typical Image Processing Algorithms on the Xeon Phi Platform
Published: 2018-12-10 11:05
[Abstract]: With the rise of heterogeneous platforms, high-performance computing has developed rapidly. Heterogeneous CPU+GPU platforms are widely used in fields such as bioinformatics, medical imaging, and computational fluid dynamics. However, CPUs and GPUs use different instruction sets and programming models, which places high demands on program development and optimization. In 2012 Intel introduced the Xeon Phi coprocessor, which is based on a many-core architecture and is compatible with the traditional x86 programming model and features, lowering the difficulty of programming and optimization to some extent. Xeon Phi integrates more than 50 lightweight x86 cores, each supporting 4 hardware threads and 512-bit SIMD vector processing, and therefore offers strong parallel processing capability. Research on accelerating algorithms with Xeon Phi is still at an early stage. This thesis studies the implementation and acceleration of typical image processing algorithms on the Xeon Phi platform. Image processing algorithms are computationally demanding, handle large volumes of data, and often have strict real-time requirements. Two representative algorithms are chosen as case studies: the 2D IDCT and the 3D GVF field computation. The main work includes:
(1) Implementation and optimization of the 2D IDCT on the Xeon Phi platform. The 2D IDCT is first implemented serially using row-column separation, and this version serves as the performance baseline for later optimizations. The serial code is then vectorized with 512-bit SIMD and scaled across threads with OpenMP, and finally data prefetching is applied. Experimental results show that, for single-precision image data, vectorization yields a speedup of about 5.84× over the non-vectorized version, and performance increases approximately linearly with the number of threads. Data prefetching yields a further speedup of about 1.24× on top of the existing optimizations. Overall, the best performance of the optimized 2D IDCT on Xeon Phi is about 1.53× the best performance on a single E5-2670 CPU.
(2) Implementation and optimization of the 3D GVF field computation on the Xeon Phi platform. Besides general optimizations such as vectorization and thread scaling, the work focuses on how stencil-computation optimization affects performance and proposes a loop-blocking (tiling) strategy that effectively improves cache utilization. Experimental results show that, for double-precision image data, thread scaling and vectorization significantly improve 3D GVF performance. With the proposed blocking strategy, for problem sizes of 256×256×256 and 512×512×512, the 3D GVF computation on Xeon Phi achieves speedups of about 1.78× and 2.77×, respectively, over a single E5-2670 CPU.
(3) A summary of optimization patterns for image processing algorithms on the Xeon Phi platform, collecting practical techniques to guide the optimization of other image processing algorithms. In general, for compute-intensive algorithms, directly applying general techniques such as vectorization and thread scaling gives good performance gains. For image processing algorithms with a low compute-to-memory-access ratio, cache utilization must be improved, and the loop-blocking strategy proposed in this thesis is one effective approach.
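The row-column separation named in contribution (1) can be illustrated with a small sketch. The abstract does not show the thesis's own code, so everything below is an assumption made for illustration: the function names are hypothetical, the orthonormal DCT scaling is one common convention, and plain Python stands in for the vectorized, multithreaded implementation the thesis describes. The sketch runs a 1D IDCT over every row, then over every column, and compares the result with the direct double-sum formula.

```python
import math


def idct_1d(coeffs):
    """Orthonormal 1D inverse DCT (DCT-III) of a sequence of coefficients."""
    n = len(coeffs)
    out = []
    for i in range(n):
        # The k = 0 term uses scale sqrt(1/n); all other terms use sqrt(2/n).
        total = math.sqrt(1.0 / n) * coeffs[0]
        for k in range(1, n):
            angle = math.pi * (2 * i + 1) * k / (2 * n)
            total += math.sqrt(2.0 / n) * coeffs[k] * math.cos(angle)
        out.append(total)
    return out


def idct_2d_row_column(block):
    """2D IDCT by row-column separation: 1D IDCT on every row, then every column."""
    row_pass = [idct_1d(row) for row in block]
    # zip(*m) transposes, so the second pass also walks plain sequences.
    col_pass = [idct_1d(col) for col in zip(*row_pass)]
    return [list(row) for row in zip(*col_pass)]


def idct_2d_direct(block):
    """Direct double-sum 2D IDCT for a square block, used only as a reference."""
    n = len(block)
    scale = [math.sqrt((1.0 if k == 0 else 2.0) / n) for k in range(n)]
    basis = [[scale[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
              for k in range(n)] for i in range(n)]
    return [[sum(basis[i][u] * basis[j][v] * block[u][v]
                 for u in range(n) for v in range(n))
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Hypothetical 8x8 coefficient block, chosen only to exercise the code.
    block = [[float((3 * u + 5 * v) % 7) for v in range(8)] for u in range(8)]
    fast = idct_2d_row_column(block)
    ref = idct_2d_direct(block)
    err = max(abs(a - b) for fr, rr in zip(fast, ref) for a, b in zip(fr, rr))
    print("max abs difference vs direct formula:", err)
```

For an N×N block the separable form needs on the order of N^3 multiply-adds instead of N^4 for the direct formula. Each row (and then each column) transform is independent of the others, which is the property that makes this structure a natural fit for SIMD lanes and OpenMP threads.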
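The loop blocking named in contribution (2) can be sketched in the same spirit. The abstract does not describe the thesis's actual blocking scheme, so this is a generic tiling example under stated assumptions: a 7-point Laplacian diffusion step stands in for the full GVF update (which also has a data-fidelity term), the y and z loops are tiled while x is left unit-stride, and all function and parameter names are hypothetical. The sketch shows only that the blocked traversal visits the same points and produces the same result as the plain triple loop. Any cache benefit would appear in a compiled implementation, not in Python.

```python
def update_point(src, c, n, mu):
    """One Jacobi-style update with a 7-point Laplacian at flat index c."""
    plane = n * n
    lap = (src[c - 1] + src[c + 1] + src[c - n] + src[c + n]
           + src[c - plane] + src[c + plane] - 6.0 * src[c])
    return src[c] + mu * lap


def sweep_naive(src, n, mu):
    """Plain z-y-x sweep over the interior of an n*n*n grid stored as a flat list."""
    dst = list(src)  # boundary values are copied unchanged
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            row = (z * n + y) * n
            for x in range(1, n - 1):
                dst[row + x] = update_point(src, row + x, n, mu)
    return dst


def sweep_blocked(src, n, mu, block_y, block_z):
    """Same sweep, but y and z are tiled so each tile's neighbours stay close in memory."""
    dst = list(src)
    for z0 in range(1, n - 1, block_z):          # tile origin in z
        for y0 in range(1, n - 1, block_y):      # tile origin in y
            for z in range(z0, min(z0 + block_z, n - 1)):
                for y in range(y0, min(y0 + block_y, n - 1)):
                    row = (z * n + y) * n
                    # x stays unblocked: it is the unit-stride direction.
                    for x in range(1, n - 1):
                        dst[row + x] = update_point(src, row + x, n, mu)
    return dst


if __name__ == "__main__":
    n = 6
    grid = [float((i * 37) % 11) for i in range(n * n * n)]  # hypothetical test data
    same = sweep_blocked(grid, n, 0.1, 2, 3) == sweep_naive(grid, n, 0.1)
    print("blocked sweep matches naive sweep:", same)
```

Because every point reads only from `src` and writes only to `dst`, the order in which tiles are visited does not change the result, so tile sizes can be tuned to the cache without affecting correctness.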
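[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP38; TP391.41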
Article ID: 2370464
Article link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2370464.html