Parallel Model Design and Optimization of the NCS Imaging Algorithm

Published: 2018-04-23 10:51

Topic: radar imaging algorithms + NCS algorithm; source: Nanjing University, 2014 master's thesis

[Abstract]: Synthetic aperture radar (SAR) imaging is an advanced microwave Earth-observation technology. After decades of development, its applications have spread into many fields of science and engineering. Among SAR imaging algorithms, frequency-domain algorithms based on the fast Fourier transform (FFT) resolve the dependence on azimuth frequency, but they are poorly suited to parallel computation in high-resolution, low-frequency cases; the NCS algorithm is one such algorithm. Its processing of the source data consists mainly of FFT/IFFT operations and complex arithmetic, together with some transposition and reverse-order steps. This thesis introduces parallel computing and the multi-core system architecture used to implement the NCS algorithm, along with the algorithm's concrete implementation flow. It describes in detail the overall system architecture, the computation cluster, the transpose cluster, and the FFT module inside the computation cluster. It also reviews common parallel computing models such as PRAM and analyzes their performance in detail, which provides a theoretical basis for evaluating the performance of the NCS algorithm. On the design side, the thesis presents an implementation of the NCS algorithm in a Linux environment. The most important modules are the FFT module and the matrix transpose module; because the FFT module accounts for most of the program's run time, its design and implementation receive the most attention. The NCS algorithm model is a memory-accurate system model. In such a model, any parallel algorithm must be decomposed before the computation starts, and the required results and intermediate data are manually placed at specific memory addresses, so that the model is accurate at the storage level.
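As an illustration of the memory-accurate idea described above, the following minimal sketch plans every buffer of a small radix-2 FFT at a fixed offset in one flat memory array before any computation runs. It is not the thesis's implementation: the layout, the size `N`, and all function names are hypothetical, and the bit-reversed reordering is only an assumption about what the abstract's "reverse-order" step refers to.

```python
import cmath

N = 8  # FFT length, a power of two (hypothetical size for illustration)

# Hand-planned address map (assumed layout): each buffer has a fixed
# (offset, length) in one flat memory array, decided before computing.
LAYOUT = {
    "input": (0, N),               # source samples
    "twiddle": (N, N // 2),        # precomputed twiddle factors
    "output": (N + N // 2, N),     # FFT result, computed in place here
}
MEM_SIZE = N + N // 2 + N


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_in_layout(mem):
    """Radix-2 decimation-in-time FFT that touches only planned addresses."""
    src, _ = LAYOUT["input"]
    tw, tw_len = LAYOUT["twiddle"]
    dst, _ = LAYOUT["output"]
    bits = N.bit_length() - 1

    # Store W_N^k = exp(-2*pi*i*k/N) at the planned twiddle addresses.
    for k in range(tw_len):
        mem[tw + k] = cmath.exp(-2j * cmath.pi * k / N)

    # Data movement: copy input to the output region in bit-reversed order.
    for i in range(N):
        mem[dst + bit_reverse(i, bits)] = mem[src + i]

    # Butterfly stages, performed in place inside the output region.
    size = 2
    while size <= N:
        half = size // 2
        stride = N // size  # step through the shared twiddle table
        for start in range(0, N, size):
            for k in range(half):
                a = mem[dst + start + k]
                b = mem[dst + start + k + half] * mem[tw + k * stride]
                mem[dst + start + k] = a + b
                mem[dst + start + k + half] = a - b
        size *= 2


def run_fft(samples):
    """Allocate the flat memory, load the input region, and run the FFT."""
    mem = [0j] * MEM_SIZE
    src, length = LAYOUT["input"]
    mem[src:src + length] = samples
    fft_in_layout(mem)
    dst, out_len = LAYOUT["output"]
    return mem[dst:dst + out_len]


def naive_dft(x):
    """Direct O(n^2) DFT, used only as a reference check."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


samples = [complex(i, -i) for i in range(N)]
max_error = max(abs(a - b) for a, b in zip(run_fft(samples), naive_dft(samples)))
print("max deviation from reference DFT:", max_error)
```

Because every read and write goes through `LAYOUT`, intermediate data can be inspected at a known address after any stage, which is the property the abstract says supports later hardware debugging. The price is the up-front planning of the address map.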
The advantage of the memory-accurate approach is that later modifications to the problem take less time; the drawback is the larger amount of work in the initial stage. On the memory side, the thesis describes in detail the memory storage scheme and the data-movement flow used in the NCS implementation, which together give precise control over memory. Finally, the implemented NCS algorithm model is partially optimized. The model is built to give the hardware a task-partitioning scheme and intermediate data, and to support later debugging of the hardware system, so the efficiency of the task-partitioning scheme directly affects the final hardware implementation process and its efficiency. For the optimization, multi-core parallel computing is simulated with multithreading in a virtual machine environment. The program's efficiency under different thread counts is then analyzed from its measured run time. This also gives the hardware a reference point: for a given workload, more processor cores are not always better, and sensible task partitioning and full use of processor resources are essential for a multi-core system. The experimental results show that, for sub-aperture processing, the run times before optimization were TFFT = 89.1 s and Tstart = 5.5 s; after optimization they were Tsub = 65.3 s, TFFT = 48.6 s, and Tstart = 5.5 s, for a speedup of 1.39. Serial code accounts for about R = 25.6% of the system, so its speedup limit is 3.9. Under the experimental conditions, leaving aside algorithm logic and power requirements, the speedup limit is 2.56, mainly because optimization under real conditions cannot fully neglect the time the parallel part requires. The project has so far completed a demonstration of an FPGA-based prototype system.
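The 3.9 limit quoted above matches what Amdahl's law gives for a serial fraction of R = 25.6%, although the abstract does not name the law. The sketch below assumes that model; the function names are hypothetical, and it does not attempt to reproduce the 2.56 figure, whose derivation the abstract does not give.

```python
def amdahl_speedup(serial_fraction, workers):
    """Speedup with `workers` parallel workers under Amdahl's law (assumed model)."""
    if not 0.0 < serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in (0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def amdahl_limit(serial_fraction):
    """Upper bound on speedup as the number of workers grows without bound."""
    if not 0.0 < serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in (0, 1]")
    return 1.0 / serial_fraction


R = 0.256  # serial fraction reported in the abstract

for workers in (1, 2, 4, 8, 16, 64):
    print(workers, "workers:", round(amdahl_speedup(R, workers), 2))

print("limit:", round(amdahl_limit(R), 1))  # 1 / 0.256 = 3.90625, i.e. about 3.9
```

Under this model each doubling of the worker count buys less than the previous one, which is in line with the abstract's observation that more cores are not always better for a fixed workload.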

[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: TN957.52

[Related literature]