Design of a Multi-Core Parallel Computing System for Backprojection Imaging

Published: 2018-03-26 11:46

Topic: Backprojection algorithm　Focus: radar imaging　Source: Nanjing University, master's thesis, 2013

[Abstract]: The Backprojection radar imaging algorithm is extremely computation-intensive and places very high demands on the performance of the imaging system. Based on an analysis of the algorithm's characteristics, this thesis uses several parallel computing techniques to design a high-performance Backprojection radar imaging system, and proposes and implements several key techniques for improving performance. The pulse preprocessing stage of the algorithm consists largely of complex vector operations and FFTs on long vectors, so a SIMD vector processor that directly supports FFT acceleration instructions is designed for it. For performance reasons, earlier system designs performed the FFT in a dedicated hardware accelerator. The SIMD vector processor not only executes all of the long-vector operations in the pulse preprocessing stage efficiently, but also supports FFT acceleration instructions that deliver the same FFT acceleration efficiency as a dedicated hardware accelerator, which avoids the extra hardware cost of adding such an accelerator to the system. The backprojection stage has extremely high performance requirements, so a backprojection accelerator is designed for it. Its function is to project the preprocessed pulse data onto every pixel of the image, and it completes the backprojection of one pixel per clock cycle. On the basis of a thorough error analysis, a carefully designed fixed-point representation replaces the double-precision floating-point representation. This reduces logic resource usage by about 50% and on-chip memory usage by 37.5%, and it also improves computational accuracy: the maximum phase error shrinks from 11° to 1.4°.
The backprojection stage is so computation-intensive that a single backprojection accelerator falls far short of the system's performance requirements. This thesis therefore integrates multiple backprojection accelerators into a backprojection subsystem that raises performance further through parallel computation, which involves parallelizing the backprojection algorithm and mapping the parallel algorithm onto multiple computing units. Starting from the original pixel-parallel scheme, a pulse-parallel scheme is designed and the architecture of the backprojection subsystem is redesigned. For a backprojection subsystem that integrates 8 backprojection acceleration cores, both the main-memory bandwidth requirement and the number of on-chip pixel memory banks are reduced by 87.5%. Compared with a single backprojection accelerator, the subsystem uses exactly the same on-chip pixel memory and exactly the same main-memory bandwidth, with 8 times as many backprojection acceleration cores and 8 times as much on-chip pulse memory, and achieves a speedup greater than 7.99. In addition, to address the excessively long algorithm simulation times encountered during development, the thesis also explores accelerating the simulation of the Backprojection radar imaging algorithm through GPU parallel computing. Based on an analysis of the characteristics of the GPU computing platform and of the algorithm, the pixel-parallel scheme is chosen for acceleration. A simulation that originally took 5 hours 23 minutes takes only 3 minutes 20 seconds after GPU acceleration, a speedup of 97 times.
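The difference between the pixel-parallel and pulse-parallel schemes named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function names, the 2-D geometry, the nearest-range-bin lookup, the phase term exp(j·4πr/λ), and the synthetic data in `demo_scene` are all hypothetical and are not taken from the thesis. Workers are simulated sequentially. The sketch only shows that the same double loop over pixels and pulses can be partitioned along either axis and still produce the same image.

```python
import cmath
import math


def contribution(pixel, antenna, pulse, r0, dr, wavelength):
    """Contribution of one preprocessed pulse to one pixel (nearest-bin lookup)."""
    r = math.hypot(pixel[0] - antenna[0], pixel[1] - antenna[1])
    bin_index = round((r - r0) / dr)
    if not 0 <= bin_index < len(pulse):
        return 0j  # pixel lies outside the sampled range window
    # Assumed phase-compensation convention; the thesis may use a different one.
    return pulse[bin_index] * cmath.exp(1j * 4 * math.pi * r / wavelength)


def backproject_pixel_parallel(pixels, antennas, pulses, n_workers, **geom):
    """Each (simulated) worker owns a subset of pixels and visits every pulse."""
    image = [0j] * len(pixels)
    for w in range(n_workers):
        for p in range(w, len(pixels), n_workers):
            for antenna, pulse in zip(antennas, pulses):
                image[p] += contribution(pixels[p], antenna, pulse, **geom)
    return image


def backproject_pulse_parallel(pixels, antennas, pulses, n_workers, **geom):
    """Each (simulated) worker owns a subset of pulses; a pixel is visited once
    and the workers' contributions are summed before the pixel is updated."""
    image = [0j] * len(pixels)
    for p, pixel in enumerate(pixels):
        acc = 0j
        for w in range(n_workers):
            for k in range(w, len(pulses), n_workers):
                acc += contribution(pixel, antennas[k], pulses[k], **geom)
        image[p] += acc
    return image


def demo_scene():
    """Small synthetic scene (hypothetical values, not data from the thesis)."""
    pixels = [(float(i), float(j)) for i in range(4) for j in range(4)]
    antennas = [(float(x), -100.0) for x in range(-8, 8)]
    pulses = [
        [complex(math.cos(0.1 * k * b), math.sin(0.05 * b)) for b in range(64)]
        for k in range(len(antennas))
    ]
    geom = {"r0": 90.0, "dr": 0.5, "wavelength": 0.03}
    return pixels, antennas, pulses, geom
```

Calling both functions on the output of `demo_scene()` with 8 workers yields images that agree up to floating-point rounding, since only the summation order differs. In the pulse-parallel form each pixel is read and updated once for all 8 workers rather than once per worker. The 87.5% reduction quoted in the abstract equals 1 − 1/8, which is consistent with that reading, although the abstract does not describe the actual hardware mapping.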
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338.6
