Design and Implementation of a Vector Memory for a Multi-Width SIMD DSP

Published: 2018-08-04 13:46

【Abstract】: Over the past decade, as integrated-circuit and computer technology have advanced, CPU performance has grown by nearly 60% per year while memory access latency has improved by only 7% per year [1]. The "memory wall" created by memory bandwidth and latency has become the bottleneck limiting further gains in microprocessor performance. A multi-width SIMD digital signal processor (DSP) aimed at data-intensive processing integrates multiple vector processing elements on chip, which places higher demands on parallel memory access performance. Supplying those vector processing elements with sufficient memory bandwidth, reducing extra operations such as data shuffling between them, improving the memory access efficiency of algorithms, and lowering power consumption are therefore key problems in vector memory system design.

YHFT-Matrix is a high-performance DSP for soft base stations, developed in-house with independent intellectual property by the Institute of Microelectronics and Microprocessors at the National University of Defense Technology. It uses a 10-issue very long instruction word (VLIW) and multi-width SIMD architecture. Its vector processing unit (VPU) contains 16 homogeneous vector processing elements, each with two multiply-accumulate units and other ALU units, so high data throughput and memory bandwidth are needed to fully exploit the VPU's computing power. Based on the YHFT-Matrix design requirements and the memory access characteristics of communication algorithms, this thesis designs and implements an efficient, novel, large-capacity on-chip vector memory (VM) that buffers the large volume of data required by VPU computation.

The VM has a dedicated vector address generation unit that supports linear and circular addressing. Its capacity is 1 MB, and its storage is organized as multiple banks with double buffering, interleaved on the low-order address bits. This provides parallel access to multiple vector data streams at a small cost in area and power and effectively reduces parallel access conflicts. To accelerate the relevant communication algorithms, the VM also implements a vector access reordering unit and a vector write-back reordering unit, which let it support unaligned vector access and conditional vector access. Together these give all vector processing elements in the VPU limited sharing of, and conditional access to, the VM storage space, and the VM can sustain at the same time up to 512 Gbps of vector, 256 Gbps of DMA and 32 Gbps of scalar data access. After later improvement, the VM can also access consecutive vector bytes and halfwords. YHFT-QMBase, a multi-core DSP chip based on four YHFT-Matrix cores, has been taped out successfully. Early logic verification and later chip testing show that the VM functions correctly. The chip, built in a 65 nm process, reaches a clock frequency above 500 MHz, and 700 MHz after later logic optimization. The multi-bank interleaved double-buffer structure substantially reduces access conflicts, and the limited-sharing and conditional vector access structure reduces or eliminates shuffle operations in the relevant algorithms, which makes the code more compact and speeds up their execution.
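The abstract names two addressing ideas, low-order interleaving across banks and linear or circular address generation, without giving parameters. The following Python sketch illustrates both under stated assumptions: the bank count (16), the word size (4 bytes), and all function names are hypothetical choices for illustration and are not taken from the thesis.

```python
# Illustrative sketch only. NUM_BANKS and WORD_BYTES are assumed values,
# not parameters reported for the YHFT-Matrix vector memory.
NUM_BANKS = 16
WORD_BYTES = 4


def bank_and_row(addr, num_banks=NUM_BANKS, word_bytes=WORD_BYTES):
    """Low-order interleaving: the low bits of the word index select the bank."""
    word = addr // word_bytes
    return word % num_banks, word // num_banks


def linear_addresses(base, stride, count):
    """Linear addressing: each element is a fixed stride after the previous one."""
    return [base + i * stride for i in range(count)]


def circular_addresses(base, stride, count, buf_start, buf_len):
    """Circular addressing: addresses wrap around inside [buf_start, buf_start + buf_len)."""
    return [buf_start + (base - buf_start + i * stride) % buf_len
            for i in range(count)]


def count_conflicts(addrs, num_banks=NUM_BANKS, word_bytes=WORD_BYTES):
    """Number of accesses that land in a bank already used by this vector access."""
    banks = [bank_and_row(a, num_banks, word_bytes)[0] for a in addrs]
    return len(banks) - len(set(banks))


if __name__ == "__main__":
    unit_stride = linear_addresses(0, WORD_BYTES, 16)
    unaligned = linear_addresses(4, WORD_BYTES, 16)
    bank_stride = linear_addresses(0, WORD_BYTES * NUM_BANKS, 16)
    print(count_conflicts(unit_stride))   # consecutive words hit distinct banks
    print(count_conflicts(unaligned))     # still distinct banks, but two rows
    print(count_conflicts(bank_stride))   # every access hits bank 0
    print(circular_addresses(56, 4, 4, buf_start=0, buf_len=64))
```

In this model a unit-stride vector touches every bank exactly once, even when it starts at an unaligned word and so spans two rows, while a stride equal to the bank count sends every element to the same bank. That is the kind of conflict an interleaved, multi-bank layout is meant to avoid for the common access patterns.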
【Degree-granting institution】: National University of Defense Technology
【Degree level】: Master's
【Year conferred】: 2012
【Classification number】: TP333

【References】