Research on Optimizing the Cache Subsystem of Chip Multiprocessors

Published: 2019-01-21 13:17

[Abstract]: Modern chip multiprocessors (CMPs) need large cache systems to narrow the performance gap between fast processors and slow off-chip main memory. This thesis argues that the characteristics of CMPs can be exploited to optimize both the performance and the power consumption of their cache subsystems, and it studies several mechanisms to that end. Specifically, the work covers three topics: 1) designing an efficient multicast routing algorithm to improve network-on-chip (NoC) performance; 2) using an emerging non-volatile memory to build a low-power cache system for CMPs; and 3) exploiting thread progress information to design a more efficient cache coherence protocol.

For the first topic, we propose an efficient multicast routing mechanism for NoCs. As CMPs integrate more and more cores, the NoC provides them with an efficient and scalable communication infrastructure. One-to-many communication is common in NoCs for multi-core architectures. Without effective multicast routing support, a conventional unicast-based NoC handles such multicast traffic inefficiently. This thesis proposes a multicast routing mechanism based on network partitioning, called DPM. DPM effectively reduces the average packet transmission latency in the NoC as well as the NoC's power consumption. Specifically, DPM selects routes dynamically according to the current level of load balance in the network and the link-sharing characteristics of the multicast communication.

The second topic uses an emerging non-volatile memory, spin-transfer torque random access memory (STT-RAM), to design a low-power cache for CMPs. STT-RAM offers fast access, high storage density, and negligible leakage power. However, its long write latency and high write power consumption constrain its large-scale use as a multi-core cache. Recent studies show that reducing the data retention time of the STT-RAM storage cell (the magnetic tunnel junction, MTJ) can effectively improve write performance. STT-RAM with reduced retention time is volatile, however, and its cells must be refreshed periodically to avoid data loss. When such STT-RAM serves as the last-level cache (LLC) of a multi-core processor, frequent refresh operations increase energy consumption and also degrade system performance. This thesis proposes an efficient refresh scheme, called CCear, that minimizes refresh operations on this kind of STT-RAM. CCear eliminates unnecessary refreshes mainly by interacting with the cache coherence protocol and the cache management algorithm.

Finally, we propose an efficient coherence protocol adaptation mechanism to optimize the performance of parallel programs running on CMPs. A main goal of CMPs is to keep improving application performance by exploiting thread-level parallelism. For multithreaded programs running on such systems, however, uneven task allocation and contention for shared resources usually cause different threads to progress at different rates. This imbalance is one of the biggest bottlenecks in multithreaded program performance. Because of the synchronization mechanisms inherent in multithreaded programs, such as memory barriers and locks, cores running faster threads must stop and wait for slower cores. This idle waiting not only lowers system performance but also wastes power. This thesis proposes a thread-progress-aware coherence adaptation mechanism, called TEACA. TEACA uses thread progress information to dynamically adjust each thread's coherence policy, with the goal of improving the utilization of NoC bandwidth and reducing power consumption. Specifically, TEACA dynamically classifies threads into two categories: leader threads and laggard threads. It then applies a specific coherence policy to each thread's coherence requests according to the thread's category.
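
The abstract's first topic claims that a unicast-based network-on-chip handles one-to-many traffic inefficiently. The sketch below illustrates that claim only. It is not DPM, whose partitioning and adaptive route selection the abstract does not spell out. It assumes a 2D mesh with dimension-ordered (X then Y) routing, and all function names are hypothetical. It counts the link traversals needed to reach a set of destinations, first with one unicast packet per destination and then with a single multicast packet that is replicated only where paths diverge.

```python
from typing import Iterable, List, Set, Tuple

Node = Tuple[int, int]      # (x, y) coordinate of a router in the mesh
Link = Tuple[Node, Node]    # directed link between two adjacent routers


def xy_path(src: Node, dst: Node) -> List[Link]:
    """Links crossed by dimension-ordered routing: travel along X first, then Y."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def unicast_link_traversals(src: Node, dsts: Iterable[Node]) -> int:
    """One separate packet per destination: every hop of every path is paid for."""
    return sum(len(xy_path(src, d)) for d in dsts)


def tree_multicast_link_traversals(src: Node, dsts: Iterable[Node]) -> int:
    """One packet replicated at branch points: each shared link is crossed once."""
    shared: Set[Link] = set()
    for d in dsts:
        shared.update(xy_path(src, d))
    return len(shared)


if __name__ == "__main__":
    source = (0, 0)
    destinations = [(3, 0), (3, 1), (3, 2), (3, 3)]  # illustrative sharer set
    print("unicast:", unicast_link_traversals(source, destinations))
    print("multicast:", tree_multicast_link_traversals(source, destinations))
```

For this illustrative destination set, the four unicast packets cross 18 links in total, while the shared tree crosses 6. That gap is the kind of redundancy a multicast routing mechanism aims to remove.
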
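[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctoral
[Year degree granted]: 2013
[Classification number]: TP332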
