Research on Hardware Architecture Synthesis for Floating-Point Fourier Transforms

Published: 2018-01-19 02:21

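Keywords: discrete Fourier transform; floating point; fixed point; coprime product; FPGA; ASIC; automatic generation; synthesis; convolutional neural network　Source: University of Science and Technology of China, master's thesis, 2017　Document type: degree thesis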

【Abstract】: The discrete Fourier transform (DFT) is used in almost every field of scientific and engineering computing. Modern large-scale data processing applications, such as audio and video signal processing, increasingly rely on features that are computationally complex and demanding in hardware: hardware DFT units for very long transforms and for non-power-of-two lengths, and floating-point arithmetic with a wide dynamic range and high effective precision. Applications such as audio and video coding, orthogonal frequency-division multiplexing (OFDM), and big-data processing need dedicated hardware units to meet real-time requirements, IEEE 754-compliant floating-point numbers to meet precision and generality requirements, and long, non-power-of-two DFTs to meet sample-count requirements. This thesis proposes a matrix-factorization-based Fourier transform algorithm for non-power-of-two lengths that are products of coprime numbers, and designs AutoNFT, a synthesis tool for DFT hardware architectures that implements the algorithm. The main work is as follows.
The thesis studies a matrix-factorization-based DFT algorithm for lengths that are products of pairwise coprime numbers. Compared with existing algorithms for lengths equal to a small odd number (3, 5, 9) times a power of two, it covers a wider range of lengths. The correctness of the algorithm is proved by rigorous mathematical derivation, and formulas for the input and output ordering, which differ from those of the traditional algorithm, are given so that coprime-length DFT modules can be cascaded.
The AutoNFT synthesis tool automatically generates fully pipelined hardware DFT units. It supports power-of-two lengths and lengths that are products of pairwise coprime numbers, is highly portable, and supports both fixed-point and floating-point samples. An automatic generation algorithm for fully pipelined structures and automatic cascading is proposed; using first-in-first-out (FIFO) units built from shift registers, it handles the L-shaped structure of the split-radix algorithm, which is more efficient than the radix-2/4 algorithms. High-performance floating-point adder and multiplier units with eight pipeline stages are designed; they can run at 1 GHz in the SMIC 40 nm process.
The fixed-point and floating-point arithmetic units, a handwritten-digit neural network, and 16-point and 15-point floating-point DFT units are verified on the Zynq 7000 platform. An FPGA implementation of the handwritten-digit recognition network LeNet-5 is given. Compared with implementations on general-purpose devices such as CPUs and GPUs, it reaches the same low error rate as the software algorithm (0.999%) while running 37% faster than Caffe and consuming up to 93.7% less energy. Fixed-point and floating-point DFT units for long lengths and for prime-product lengths are also synthesized and simulated in the SMIC 40 nm process at 500 MHz. In particular, the 256-point DFT unit can process 115 billion fixed-point samples per second, and the 30-point DFT unit can process 13.5 billion floating-point samples per second.
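To illustrate the coprime-length idea described in the abstract, the sketch below uses the classic Good-Thomas prime-factor mapping, which turns a DFT of length N = N1·N2 with gcd(N1, N2) = 1 into an N1×N2 two-dimensional DFT with no twiddle factors between stages. It is an assumption chosen for illustration: the thesis states that its own input and output ordering formulas differ from the traditional ones, and those formulas are not given here. The function names `dft` and `pfa_dft` are hypothetical, and the code requires Python 3.8+ for the modular inverse `pow(a, -1, m)`.

```python
import cmath
from math import gcd


def dft(x):
    """Direct O(N^2) DFT; serves as the small sub-transform and as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]


def pfa_dft(x, n1, n2):
    """Length n1*n2 DFT via the Good-Thomas mapping (illustrative sketch only)."""
    n = n1 * n2
    if len(x) != n or gcd(n1, n2) != 1:
        raise ValueError("need len(x) == n1*n2 with n1 and n2 coprime")

    # Input permutation: sample index (a*n2 + b*n1) mod n goes to grid[a][b].
    grid = [[x[(a * n2 + b * n1) % n] for b in range(n2)] for a in range(n1)]

    # Length-n2 DFT along each row, then length-n1 DFT along each column.
    rows = [dft(row) for row in grid]
    cols = [dft([rows[a][b] for a in range(n1)]) for b in range(n2)]

    # Output permutation from the Chinese remainder theorem:
    # k is the unique index with k % n1 == k1 and k % n2 == k2.
    q2 = pow(n2, -1, n1)  # inverse of n2 modulo n1
    q1 = pow(n1, -1, n2)  # inverse of n1 modulo n2
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            k = (k1 * n2 * q2 + k2 * n1 * q1) % n
            out[k] = cols[k2][k1]
    return out
```

For example, `pfa_dft(x, 3, 5)` computes a 15-point transform from 3-point and 5-point sub-transforms, and `pfa_dft(x, 5, 6)` computes a 30-point one; both lengths appear in the abstract. Because the two stages exchange no twiddle multiplications, this kind of decomposition suits cascading independent small DFT modules.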
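【Degree-granting institution】: University of Science and Technology of China
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP301.6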
