Research and Implementation of a High-Performance SIMD Floating-Point Multiply-Add Unit in FT-XDSP

Published: 2018-08-15 11:25

[Abstract]: FT-XDSP is an independently developed 64-bit high-performance SIMD digital signal processor (DSP) with a very long instruction word (VLIW) architecture. It targets high-performance computing, wireless communication, and video and image processing, and has a design clock frequency of 1.25 GHz. Each FT-XDSP core contains 50 floating-point fused multiply-accumulate units (Floating-point fused Multiply ACcumulator, FMAC), whose performance directly determines the peak floating-point performance of FT-XDSP. Building on the development of FT-XDSP, this thesis studies and implements a high-performance SIMD floating-point multiply-add unit for wireless communication base stations and high-performance computing. The main work and contributions are as follows:

1. Based on the classic low-latency floating-point fused multiply-add architecture, a multi-function fast floating-point fused multiply-add unit is designed and implemented. The overall structure of the floating-point multiply-add datapath is analyzed in detail, the structure is partitioned into pipeline stages, and a 6-stage pipelined high-performance SIMD floating-point multiply-add architecture is proposed. It supports double-precision and SIMD dual single-precision floating-point multiplication, multiply-accumulate, and addition, as well as single-precision complex multiplication and dot product. Multiplication executes in a 4-stage pipeline, addition and subtraction in a 5-stage pipeline, and all other operations in a 6-stage pipeline.

2. Within the double-precision floating-point multiply-add structure, key modules are reused to implement multiple functions and reduce area overhead. The design of each key module is studied, including the floating-point mantissa multiplier, alignment shifter, compound adder, leading-zero prediction module, and normalization module, and these modules are reused according to the requirements of the architecture. On top of the double-precision structure, the datapaths for SIMD dual single-precision floating-point multiply-add, floating-point addition, and single-precision complex multiplication and dot product are implemented through reuse. The floating-point multiplier is also extended to support 64-bit fixed-point multiplication without affecting the critical-path delay of the floating-point multiply-add, so that fixed-point and floating-point multiplication share one multiplier.

3. The multi-function floating-point multiply-add unit is verified by simulation and optimized in synthesis. The unit is verified in detail at the module level and in a DSP-core-level verification environment. The results show that the instructions function correctly and that boundary-value handling at each function point conforms to the IEEE 754 standard. The critical path of the FMAC unit is optimized following a logic-delay optimization strategy. The design is synthesized with Cadence RTL Compiler in a 45 nm process under Typical operating conditions. The synthesis results show a longest critical path of 550 ps, a power consumption of 14.11 mW, and a cell area of 166854 μm². Overall performance is higher than that of the traditional low-latency floating-point multiply-add architecture and meets FT-XDSP's requirements for the floating-point multiply-add unit.
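
To illustrate what "fused" means for a fused multiply-add unit, the following Python sketch contrasts a multiply-add that rounds once (after the exact product and sum) with one that rounds after the multiplication and again after the addition. It is not taken from the thesis and does not model the FMAC hardware. The function names and operand values are hypothetical, and exact rational arithmetic is used here only as a stand-in for the wide internal datapath of a real unit.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with a single rounding (hypothetical helper)."""
    # Fraction represents each double exactly, so the product and the sum
    # are computed without error; float() then rounds once to a double.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Compute a*b + c with two roundings: one after '*', one after '+'."""
    return a * b + c


# Illustrative operands: the exact product is 1 - 2**-60, which is not
# representable as a double and rounds to 1.0 when the multiply is
# rounded on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fused_multiply_add(a, b, c)
unfused = unfused_multiply_add(a, b, c)

print("fused:  ", fused)
print("unfused:", unfused)
```

With these operands the unfused form loses the small term entirely when the product is rounded, while the single-rounding form keeps it. Avoiding that intermediate rounding is the usual numerical motivation for a fused multiply-add datapath.
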
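[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP332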
