Research on Key SIMD Computing Units in High-Performance DSPs

Published: 2018-07-08 14:10

Topics: SIMD + subword parallelism; Source: master's thesis, National University of Defense Technology, 2012

[Abstract]: Embedded-processor applications are moving toward larger scale and real-time operation, and high-performance functional units are a foundation for improving processor performance. Set against the development of the FT-X high-performance floating-point DSP, this thesis studies subword parallelism in functional units and proposes high-performance algorithms for functional units that support it.
1) The performance of functional units that use subword parallelism is analyzed for different applications, taking the particular characteristics of these units into account. Separate speedup analyses are given for multiplication and addition units, the most common arithmetic units in a DSP.
2) Following an in-depth analysis of multiplication algorithms, a multiplication algorithm supporting subword parallelism is proposed. It uses a new Booth encoding technique and a structure that separates ES encoding from CS encoding, which gives a speed advantage for wide multiplications. Three bit-width modes are supported: the thesis illustrates a multiplier structure that can perform one 64-bit multiplication, four 32-bit multiplications, or sixteen 16-bit multiplications at the same time, signed or unsigned. To support the use of the multiplication matrix algorithm in dot-product instructions, an overflow-detection compensation technique is proposed, which solves the overflow-detection problem for dot products and matrix multiplication across multiple data paths.
3) Algorithms for finite-field multiplication units are studied, and the finite-field algorithm is made subword-parallel. A finite-field multiplier whose operand width and primitive polynomial are both configurable is proposed. Compared with existing single-function finite-field multipliers, it has some advantage in overall metrics.
4) Addition algorithms are analyzed. Based on a comparison of relatively advanced addition algorithms, an addition algorithm supporting subword parallelism is proposed. It is suited to ALUs that support logic instructions and addition/subtraction, scales well, and performs well.
5) The algorithms above are applied in the functional units of the FT-X high-performance floating-point processor. The thesis describes the detailed design and simulation-based verification of these units and reports the final synthesis results. The proposed subword-parallel multiplication algorithm has a short critical path, rich functionality, and small area. Synthesis results show that it speeds up 64-bit SIMD-capable multiplication by about 4%. The proposed subword-parallel adder supports multiple subword-parallel modes with only a small increase in scalar addition delay and embeds result selection inside the adder itself; compared with the carry-elimination algorithm, performance improves by 11%. The M unit built on the proposed multiplication algorithm meets the instruction-set requirements of the target applications. Synthesized with the DC synthesis tool in a TSMC 40 nm process, the M unit of the FT-X DSP has an area of 142275 um2 and a dynamic power consumption of 28.6863 mW, and reaches a maximum frequency of 1 GHz.
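The subword-parallel modes described in the abstract can be illustrated with a small software model. The Python sketch below is not the thesis's hardware algorithm; it only assumes the behaviour the abstract states (a 64-bit datapath split into 16-, 32-, or 64-bit lanes, and a finite-field multiplier with configurable width and polynomial). The names `subword_add`, `lane_high_bits`, and `gf_mul` are hypothetical, and the lane-wise addition uses the common masking trick rather than the adder structure proposed in the thesis.

```python
MASK64 = (1 << 64) - 1  # models a 64-bit datapath


def lane_high_bits(lane_bits: int) -> int:
    """Return a 64-bit mask with only the top bit of each lane set."""
    if lane_bits not in (16, 32, 64):
        raise ValueError("lane_bits must be 16, 32, or 64")
    mask = 0
    for shift in range(lane_bits - 1, 64, lane_bits):
        mask |= 1 << shift
    return mask


def subword_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 64-bit words lane by lane, wrapping inside each lane."""
    high = lane_high_bits(lane_bits)
    low = MASK64 & ~high
    # Add the lower bits of every lane. The top bit of each lane is
    # cleared first, so no carry can cross a lane boundary.
    partial = (a & low) + (b & low)
    # The top bit of each lane is a XOR b XOR carry-in; the carry-in is
    # already sitting in 'partial', so one more XOR finishes the sum.
    return (partial ^ ((a ^ b) & high)) & MASK64


def gf_mul(a: int, b: int, width: int, poly: int) -> int:
    """Multiply a and b in GF(2^width).

    'poly' is the reduction polynomial including its x^width term,
    e.g. 0x11D for x^8 + x^4 + x^3 + x^2 + 1. Both operands are
    assumed to be smaller than 2**width.
    """
    result = 0
    for _ in range(width):
        if b & 1:
            result ^= a          # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> width) & 1:
            a ^= poly            # reduce modulo the polynomial
    return result
```

For example, `subword_add(0x00000000FFFFFFFF, 1, 32)` returns `0` because the carry is dropped at the 32-bit lane boundary, while the same call with `lane_bits=64` returns `0x0000000100000000`. In the same spirit, `gf_mul(0x80, 0x02, 8, 0x11D)` returns `0x1D`, and changing `width` and `poly` to `4` and `0x13` switches the same routine to GF(2^4).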
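[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP332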
