Research on Key Techniques for FPGA Acceleration of Floating-Point Matrix Multiplication in Embedded Environments

Published: 2018-02-02 21:07

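Keywords: floating-point matrix multiplication; multiply-accumulator; FPGA acceleration; embedded system; PCI-E　Source: Hunan University, 2013 master's thesis　Type: degree thesis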

[Abstract]: Floating-point matrix multiplication is a basic algorithm of digital signal processing and is widely used in communications, networking, industrial control, medicine, and other fields. As embedded systems are applied more deeply in these fields, floating-point matrix multiplication, with its high computational complexity and low processing efficiency, often becomes the bottleneck that limits their computing speed. Field-programmable gate array (FPGA) coprocessors are fast, programmable, and flexible to use, which makes them an effective way to raise the computing speed of embedded systems, and they have attracted wide attention from researchers in China and abroad. Studying FPGA acceleration of floating-point matrix multiplication in embedded environments is therefore of considerable importance.

This thesis addresses the floating-point matrix multiplication required by a three-dimensional fluorescence mathematical separation algorithm. Based on an analysis of the matrix multiplication algorithm and of the FPGA hardware structure, it studies a pipelined floating-point matrix multiplier built on a parallel architecture, together with a communication mechanism for heterogeneous multiprocessors, in order to improve the FPGA performance of floating-point matrix multiplication in embedded environments. The main work is as follows.

For the multiply-accumulator, the core computing unit of matrix multiplication, the thesis analyzes the multiply-accumulate computation in each clock cycle and, building on intellectual property (IP) cores for a floating-point multiplier and adder, proposes a pipelined floating-point multiply-accumulator structure. In this structure, after the data has passed through the pipelined multiplier and adder, only the sum of the results in the adder's last N pipeline stages needs to be computed to obtain the accumulated sum. The structure is also flexible and broadly applicable: the number of pipeline stages can be adjusted to meet the performance requirements of different applications.

On the basis of this multiply-accumulator, the thesis designs a floating-point matrix multiplier with a parallel architecture, which reduces computational complexity and increases computing speed. The row and column parameters of the two matrices to be multiplied are configurable, and the number of processing units can be set according to the FPGA resources actually available. Adjacent processing units exchange no data, so the design scales well.

For communication between the FPGA coprocessor that performs the floating-point matrix multiplication and the embedded CPU, the thesis designs two communication structures, one based on the UART serial port and one based on the PCI-E bus. In the PCI-E structure, an FPGA-side design based on a system-on-a-programmable-chip architecture is combined with a driver on the embedded host so that hardware and software work together.

The floating-point multiply-accumulator and the matrix multiplier are implemented in the Verilog hardware description language, and their performance is analyzed through simulation and synthesis. To further verify performance in an embedded environment, communication between the floating-point matrix multiplier and the Intel E6x5C embedded platform used in the supporting project is implemented over both UART and PCI-E. The experimental results show that accelerating floating-point matrix multiplication over the high-speed PCI-E bus raises the computation rate by about 8 times and 200 times compared with the current mainstream Cortex A9 and ARM9 embedded platforms, respectively. This acceleration approach can therefore effectively improve the floating-point computing performance of embedded platforms.
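The multiply-accumulator idea in the abstract (products fed into a pipelined adder, with the accumulated sum recovered from the adder's last N pipeline-stage results) can be illustrated with a small software model. The following Python sketch is not the thesis's Verilog design; it is a behavioral approximation under stated assumptions: the adder is modeled as a queue of N registers whose oldest value is fed back as the second operand, multiplier latency is ignored, and the names `pipelined_mac`, `matmul_parallel_pe`, `adder_stages`, and `num_pe` are hypothetical. The round-robin assignment of output rows to processing units is likewise an assumption chosen only to show units working without exchanging data.

```python
from collections import deque


def pipelined_mac(a, b, adder_stages=4):
    """Behavioral model of a multiply-accumulator with an N-stage adder.

    Assumption: the value leaving the adder each cycle is fed back and added
    to the newly arrived product, so N independent partial sums circulate in
    the adder pipeline. Multiplier latency only delays the stream and is
    not modeled here.
    """
    # Adder pipeline registers, oldest result on the left.
    pipeline = deque([0.0] * adder_stages)
    for x, y in zip(a, b):
        product = x * y
        feedback = pipeline.popleft()        # result leaving the adder this cycle
        pipeline.append(feedback + product)  # new addition enters the pipeline
    partial_sums = list(pipeline)
    # Final step: add up the last N pipeline results to get the accumulated sum.
    return sum(partial_sums), partial_sums


def matmul_parallel_pe(A, B, num_pe=2, adder_stages=4):
    """Matrix product where each (hypothetical) processing unit owns whole rows.

    Rows of the result are dealt out round-robin; no unit reads another
    unit's results, mirroring the 'no data exchange' property.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for pe in range(num_pe):                 # sequential stand-in for parallel units
        for i in range(pe, rows, num_pe):
            for j in range(cols):
                column = [B[k][j] for k in range(inner)]
                C[i][j], _ = pipelined_mac(A[i], column, adder_stages)
    return C


if __name__ == "__main__":
    total, partials = pipelined_mac([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
    print(total, partials)
    print(matmul_parallel_pe([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because the N partial sums are combined in a different order than a strictly sequential accumulation, the floating-point result may differ from a sequential sum in the last few bits; exact agreement should only be expected for inputs that are exactly representable, such as the small integers used here.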
[Degree-granting institution]: Hunan University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TN791

[References]