Research on Parallel 3D-FFT Based on Multi-core Processors
Published: 2019-03-11 14:12
[Abstract]: As computing workloads grow, single-core processors can no longer meet users' needs, chiefly because power consumption blocks further single-core performance scaling; multi-core processors offer a way forward for multi-task, big-data problems. The 3D-FFT is one of the most important algorithms in science and engineering, playing a vital role in computer vision and pattern recognition, video telephony, and magnetic resonance imaging. Computing the 3D-FFT quickly on multi-core processors is therefore a central research challenge. Targeting parallel 3D-FFT on multi-core processors, this thesis studies the algorithm from three angles: data preprocessing, data computation, and data transposition.

In the preprocessing stage, a load-balancing algorithm distributes tasks evenly across cores, so that no core is saturated while others sit idle. To avoid unnecessary data migration between cores during computation, a thread-CPU affinity algorithm is proposed that keeps each thread's data on its designated CPU core. To address false sharing on shared-memory multi-core processors, a cache-line padding algorithm places data belonging to different cores in separate cache lines. Cache hit rates were measured on 2 and 8 cores: with cache-line padding and thread-CPU affinity, on 2 cores the L1 data-cache miss rate fell by 0.2% on average and the L3 miss rate by 0.2%; on 8 cores they fell by 0.2% and 0.1% respectively.

In the computation stage, a multi-column FFT algorithm is parallelized with multiple threads. Each FFT invocation uses the classic six-step fast Fourier transform to split the data into smaller blocks, speeding up processing. The bit-reversal step is optimized with a two-end reversal strategy that handles four data points at a time, halving the time complexity of that step.

In the transposition stage, an optimized global-transpose algorithm is proposed that reduces the total number of data points involved in inter-core communication, accelerating the global transpose. Compared with the existing global-transpose algorithm, communication time drops by 0.05 s on average, roughly a 9.5% improvement. Power consumption was measured with the Sniper Simulator multi-core simulator: on 8 compute cores, the optimized and existing transpose algorithms averaged about 20 W vs. 22 W for inter-core power, 7.2 W vs. 7.5 W for core-memory power, and 16.9 W vs. 17.1 W for L1 data-cache power. To further demonstrate the algorithm's performance advantage, it was compared against FFTW and an OpenMP multithreaded version: on 8 cores, the optimized 3D-FFT's cache miss rate is nearly zero, while FFTW's and OpenMP's are not. Against FFTW's 3D discrete Fourier transform library routine, the optimized algorithm is 1.59 times faster on 2 cores and 4.93 times faster on 8 cores; against FFTW accelerated with multiple OpenMP threads, it is 1.07 and 1.48 times faster on 2 and 8 cores respectively.
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2017
[CLC number]: TP332
Article ID: 2438347
Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2438347.html