基于CUDA的FFT并行计算研究

发布时间：2018-07-22 12:24

【摘要】：离散傅立叶变换是数字信号处理系统中常用的重要数学变换，算法的可行性、复杂度和运行效率等都是影响计算结果的重要因素。近年来，GPU正在以大大超过摩尔定律的速度高速发展，主流GPU的单精度浮点处理能力和外部存储器带宽相对于同时期的CPU都有明显的优势，基于图形硬件GPU的通用计算正成为并行领域的研究热点。特别是NVIDIA公司于2007年推出的CUDA统一计算设备架构，在编程、优化等方面都得到了显著的提升，极大地增强了GPU的通用计算能力。CUDA不需要借助于图形API，采用类C语言进行开发，使开发人员比较容易的从CPU编程模式过渡到GPU编程模式。随着GPU可编程能力、并行处理能力以及应用范围的不断提升和扩展，GPU已发展成为一种高度并行化、多线程、多核的处理器。利用GPU的并行处理能力，以CPU+GPU混合加速为特征的异构并行计算系统将会成为未来高性能计算的主流。本文首先分析了CUDA硬件架构和编程模型，在分析GPU通用计算现状的基础上，提出CUDA程序设计的方法。然后深入探讨了快速傅立叶变换的基本原理，详细介绍了时域抽取基2-FFT算法的实现过程及相关性质，根据快速傅立叶算法高度并行分治的特征，结合CUDA编程模型及实现机制，用CUDA的类C语言设计了快速傅立叶变换的并行算法。改进算法采用CPU+GPU异构模型方式，将GPU引入到计算中来，让GPU承担程序中的大规模计算——复数的加法与算数的乘法。传统串行算法实现N点序列的快速傅立叶变换需要三层循环，时间复杂度为O (Nlog2N)。改进后的算法采用线程层次组织结构，将同一级中相互独立的N/2个蝶形运算实现并行操作，原有的三层循环可以用两层循环来完成，，时间复杂度变为O (N)，从而实现对快速傅立叶变换的加速与优化。文章最后搭建CUDA实验运行环境，实现传统快速傅立叶算法在CPU上的运行，以及改进后的算法在GPU上的运行，同时还调用了FFTW函数库的程序代码和CUFFT函数库的程序代码，并将以上结果进行比较，通过对实验数据的分析证明了运用CUDA架构实现快速傅立叶算法的优越性，也验证了GPU在处理大量数据计算时所占的优势。
[Abstract]:Discrete Fourier transform (DFT) is an important mathematical transformation commonly used in digital signal processing system. The feasibility, complexity and efficiency of the algorithm are the important factors that affect the calculation results. In recent years, GPU is developing at a speed that greatly exceeds Moore's Law. The single precision floating-point processing capability and external memory bandwidth of mainstream GPUs are obviously superior to those of CPU in the same period. General computing based on graphics hardware GPU is becoming a research hotspot in parallel field. In particular, NVIDIA introduced the CUDA unified computing equipment architecture in 2007, which has been greatly improved in programming, optimization and so on, greatly enhanced the GPU's general computing capability .CUDA does not need to use graphics API, using C language to develop. Make it easier for developers to transition from CPU programming mode to GPU programming mode. With the development of GPU's programmable ability, parallel processing ability and application scope, GPU has developed into a highly parallel, multithreaded, multi-core processor. Using the parallel processing capability of GPU, heterogeneous parallel computing systems characterized by CPU-GPU hybrid acceleration will become the mainstream of high-performance computing in the future. In this paper, the hardware architecture and programming model of CUDA are analyzed. Based on the analysis of the current situation of GPU general computing, the method of CUDA programming is put forward. Then the basic principle of Fast Fourier transform (FFT) is discussed in depth, and the implementation process and related properties of 2-FFT algorithm in time domain are introduced in detail. According to the characteristics of high parallel division and control of FFT algorithm, combined with CUDA programming model and implementation mechanism, The parallel algorithm of fast Fourier transform is designed with C-like language of CUDA. The improved algorithm adopts the CPU GPU heterogeneous model, and introduces GPU into the calculation, which allows GPU to undertake the addition and multiplication of the complex number and the large scale computation in the program. The traditional serial algorithm for the fast Fourier transform of N-point sequence needs three-layer cycle, and the time complexity is O (Nlog2N). The improved algorithm uses thread-level organization structure to realize parallel operation of independent N / 2 butterfly operations in the same level, and the original three-layer loop can be completed by two-layer loop. The time complexity becomes O (N), which can accelerate and optimize the fast Fourier transform (FFT). Finally, the paper builds up the CUDA experimental running environment, realizes the traditional fast Fourier algorithm running on CPU, and the improved algorithm running on GPU. At the same time, the program code of FFTW function library and CUFFT function library are also called. Through the analysis of experimental data, the superiority of fast Fourier algorithm using CUDA architecture is proved, and the advantage of GPU in dealing with a large amount of data is verified.
【学位授予单位】：湖南大学
【学位级别】：硕士
【学位授予年份】：2012
【分类号】：TP338.6

【参考文献】