Research and Implementation of a Low-Power Two-Level Cache for DSP

Published: 2018-01-18 01:34

Keywords: Research and Implementation of a Low-Power Two-Level Cache for DSP  Source: Nanchang University, 2012 master's thesis  Type: degree thesis

[Abstract]: A DSP (digital signal processor) is a microprocessor that processes digital signals at high speed. It converts a received analog signal into a digital signal, applies a series of operations to that signal (such as attenuating, amplifying, or removing components), and finally converts the result back into an analog signal or into a form suited to the target environment. DSPs are widely used in transportation, aviation, networking, medicine, and many other fields. However, as integrated circuits advance, processor speed, process technology, and integration density keep improving, while memory read/write speed improves much more slowly. The widening speed gap makes memory a bottleneck that limits overall system performance. Placing a small but fast cache between the microprocessor and memory effectively addresses this problem.

The main work of this thesis is the design and implementation of a low-power two-level cache for a DSP chip. Building on a detailed study of the G1000 architecture and its on-chip two-level memory hierarchy, and on modern cache design techniques and low-power theory, the thesis completes the design and implementation of the two-level low-power cache. The first-level cache uses a Harvard architecture that separates instructions from data, giving a level-1 instruction cache (L1P) and a level-1 data cache (L1D). The CPU can only read L1P and has no permission to modify it, whereas it accesses L1D through two read/write paths. L1D is organized as a two-way set-associative cache with a pseudo-LRU replacement policy and a write-back write policy, which improves the cache hit rate and the read/write speed. L2, the second-level cache, uses a Princeton (unified) architecture: instructions and data are stored together and storage space is allocated dynamically, which raises the hit rate without increasing capacity. To keep data consistent, snoop requests maintain coherence among L1D, L1P, and L2. To reduce cache power consumption, the design adopts a way-prediction algorithm based on pseudo-LRU and Valid bits and a reconfiguration algorithm based on timestamp monitoring. Finally, the design was synthesized and optimized, simulated at system level, and debugged on a board; the two-level cache controller correctly performs the functions assigned to it within the chip.

The innovations of this thesis are as follows. First, among the replacement algorithms commonly used in cache design, a pseudo-LRU replacement algorithm is proposed. It is an improvement on the least-recently-used (LRU) algorithm that avoids counters: an 8-bit register alone does the job of counters that record access counts. Second, a write buffer is introduced. L1D allocates a line on a read miss but not on a write miss, so writing write-miss data directly into L2 would seriously slow the CPU, because L2 transfers data slowly and handles many requests with long latencies. The write buffer holds write-miss data temporarily, which decouples write-miss handling from the CPU and raises its processing speed. Third, a way-prediction algorithm based on pseudo-LRU and Valid bits is proposed. It exploits the principles on which caches rely, temporal locality and spatial locality, to raise the way-prediction hit rate, reducing power consumption without reducing performance. Fourth, timestamps are used to monitor the cache hit rate, and the SRAM/Cache capacity split is configured dynamically on that basis, which lowers power consumption while preserving the hit rate.
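The abstract does not give the details of the way-prediction algorithm, so the following Python model is only a rough sketch of the general idea, not the thesis's implementation. It assumes a two-way set-associative cache in which a single LRU bit per set stands in for the pseudo-LRU state, the Valid bits can be read without touching the tag array, and the prediction policy is "probe the only valid way, otherwise the most recently used way". The class names, the `way_probes` counter (a crude proxy for tag/data array energy), and the cache geometry are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CacheSet:
    """One set of a 2-way set-associative cache (illustrative model)."""
    tags: list = field(default_factory=lambda: [None, None])
    valid: list = field(default_factory=lambda: [False, False])
    lru: int = 0  # index of the least recently used way; 1 bit is enough for 2 ways


class TwoWayPredictedCache:
    """Hypothetical model: counts how many ways are probed per access."""

    def __init__(self, num_sets=4, line_bytes=32):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.sets = [CacheSet() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0
        self.way_probes = 0           # proxy for array-access energy
        self.correct_predictions = 0  # hits found in the first probed way

    @staticmethod
    def _predict_way(s):
        # Assumed policy: if exactly one way is valid, probe that way;
        # otherwise probe the most recently used way (the non-LRU one).
        if s.valid[0] != s.valid[1]:
            return 0 if s.valid[0] else 1
        return 1 - s.lru

    def access(self, addr):
        """Simulate one access; return True on a hit, False on a miss."""
        line = addr // self.line_bytes
        s = self.sets[line % self.num_sets]
        tag = line // self.num_sets

        hit_way = None
        if any(s.valid):  # an empty set needs no probe at all
            first = self._predict_way(s)
            self.way_probes += 1
            if s.valid[first] and s.tags[first] == tag:
                hit_way = first
                self.correct_predictions += 1
            else:
                other = 1 - first
                if s.valid[other]:  # Valid bit filters the second probe
                    self.way_probes += 1
                    if s.tags[other] == tag:
                        hit_way = other

        if hit_way is not None:
            self.hits += 1
            s.lru = 1 - hit_way  # the other way becomes least recently used
            return True

        # Miss: fill an invalid way if one exists, else evict the LRU way.
        self.misses += 1
        victim = s.valid.index(False) if False in s.valid else s.lru
        s.tags[victim] = tag
        s.valid[victim] = True
        s.lru = 1 - victim
        return False


if __name__ == "__main__":
    cache = TwoWayPredictedCache()
    trace = [0, 0, 128, 128, 0]  # all addresses map to set 0
    results = [cache.access(a) for a in trace]
    print(results, cache.way_probes, "probes vs", 2 * len(trace), "without prediction")
```

A conventional two-way cache probes both ways on every access, so comparing `way_probes` with twice the number of accesses gives a rough feel for the potential saving. Traces that alternate between two lines in the same set defeat the most-recently-used guess, which is one reason a real design would need a more careful policy than this sketch.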
[Degree-granting institution]: Nanchang University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP332
