Research on Latency-Oriented Cache Data Management Mechanisms for Multi-core Processors

Published: 2018-06-05 21:34

Topics: multi-core processor + large-capacity cache; Source: doctoral dissertation, National University of Defense Technology, 2013

[Abstract]: Steady advances in semiconductor process technology and rapid growth in integrated-circuit design capability created the conditions for multi-core processors to emerge and have continued to drive their design techniques toward maturity. With strong computing power, low design complexity, and good scalability, multi-core processors are now widely used in commercial servers, high-performance computing, personal computers, and embedded systems, where they show a strong competitive advantage. However, as the gap between multi-core computing power and off-chip memory access speed keeps widening, the "memory wall" has become a key bottleneck that severely limits multi-core processor performance. The on-chip cache, which bridges the speed gap between processor and memory, is the best starting point and the key breakthrough for alleviating the memory wall. Organizing and fully exploiting on-chip cache resources, and designing efficient cache data management mechanisms, are critical to overall microprocessor performance. Growing on-chip cache capacity, complex on-chip interconnects, and increasingly diverse application memory access behavior pose many new and serious challenges for large-capacity cache design in multi-core environments: traditional private or shared cache organizations cannot effectively trade off low miss rate against low hit latency, which severely limits memory system performance. This dissertation studies the memory wall problem in microprocessor design. Based on an analysis of the challenges and optimization opportunities of private, shared, and hybrid cache organizations, it explores latency-oriented cache data management mechanisms for multi-core processors. The main contributions are as follows.
First, to address capacity misses in multi-core private cache organizations, the dissertation proposes CSFP, an inter-core capacity sharing mechanism based on fine-grained pseudo-partitioning. At fine granularity, CSFP equips each cache bank with an array of weighted saturating counters that measure and predict differences in memory access demand among threads. The counters control the ratio between the private region and the shared region of each core in each cache set, and this ratio guides the victim-block replacement, spill, and receive decisions on each core. This inter-core capacity borrowing balances the differences in memory demand across cores and thereby reduces capacity misses in private caches. CSFP was evaluated for a 16-core tiled architecture on Simics, a cycle-accurate, architecture-level full-system simulator. The experiments show that CSFP effectively reduces capacity misses in multi-core private caches and shortens the execution time of multithreaded benchmarks by about 8.57% on average.
Second, to address conflict misses caused by multiple threads competing for cache resources in a multi-core shared cache, the dissertation proposes IMI-SM, a conflict-miss isolation mechanism based on deflection mapping. When a miss in the on-chip shared last-level cache requires fetching data from off-chip memory, and evicting the LRU victim candidate in the static target cache set could cause an inter-thread or intra-thread conflict miss, the deflection mapping mechanism is triggered. By introducing a dedicated conflict-isolation buffer, or by applying an intra-bank vertical pressure-balancing strategy that widens the range of candidate target sets for data mapping, IMI-SM allows a new block fetched from memory to be placed in the on-chip conflict-isolation buffer or in another statically coupled cache set under relatively low storage pressure. This mitigates the negative impact of conflict misses on the overall on-chip hit rate of the shared cache. The experimental results show that IMI-SM significantly reduces the conflict misses that multi-core processors incur when sharing cache resources and lowers program execution time by about 7.35% on average, so it achieves a substantial memory performance gain at a small hardware cost.
Third, to address long-latency hits in the distributed shared cache of tiled multi-core processors, the dissertation proposes E-VR, an enhanced selective victim replication mechanism. E-VR adds victim-candidate filtering and target-set detection to the original victim replication operation. When replicating a victim block, it considers not only the block's sharing pattern and read/write behavior but also, at fine granularity, the vertically non-uniform distribution of access pressure within the local cache bank. By reducing the probability of costly replication operations and widening the range of candidate target sets in which a victim can be placed, it increases the performance benefit of replication. The experimental results show that E-VR lowers the execution time of the applications by about 6.97% on average. While reducing on-chip hit latency, E-VR avoids an excessive negative impact on the global hit rate of the shared cache and dynamically trades off low hit latency against low miss rate, further improving memory system performance.
Fourth, for the virtual shared-domain partitioned organization of tiled multi-core distributed caches, the dissertation proposes F-RMR, a unified data management framework that integrates adaptive data replacement, migration, and replication. During replacement, F-RMR is aware of the liveness and the on-chip uniqueness of the victim candidates in the local target cache set. When making migration and replication decisions across multiple virtual shared domains, it jointly considers how active the hit data is and how idle the target cache set is. Through the cooperation of replacement, migration, and replication, the tension between long-latency on-chip hits and effective capacity utilization is properly handled. The experimental results show that, with a shared-domain partition granularity of 4, F-RMR lowers the average memory access latency of multithreaded benchmarks by about 7.59%. Compared with the original virtual shared-domain partitioning mechanism, F-RMR improves performance at every shared-domain partition granularity evaluated, with negligible area overhead.
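As a rough illustration of the first contribution, the following Python sketch shows how a saturating counter could steer the private/shared split of one cache set and the resulting spill and receive decisions. The abstract does not give counter widths, weights, or the partitioning rule, so the class names, the 3-bit counter, the miss/hit weights, and the linear mapping from counter value to private ways are all assumptions made for illustration, not the actual CSFP design.

```python
from dataclasses import dataclass, field


@dataclass
class SaturatingCounter:
    """Bounded counter that clamps instead of wrapping (width is assumed)."""
    bits: int = 3
    value: int = 0

    @property
    def max_value(self) -> int:
        return (1 << self.bits) - 1

    def add(self, weight: int) -> None:
        # Clamp to [0, max_value] so the counter saturates at both ends.
        self.value = max(0, min(self.max_value, self.value + weight))


@dataclass
class SetPartition:
    """Hypothetical private/shared way split for one cache set of one core."""
    ways: int = 8
    min_private: int = 2   # assumed floor on the private region
    miss_weight: int = 2   # assumed: a local miss raises the demand estimate
    hit_weight: int = -1   # assumed: a local hit slowly lowers it
    demand: SaturatingCounter = field(default_factory=SaturatingCounter)

    def record_access(self, hit: bool) -> None:
        self.demand.add(self.hit_weight if hit else self.miss_weight)

    def private_ways(self) -> int:
        # Assumed rule: grow the private region linearly with demand.
        span = self.ways - self.min_private
        return self.min_private + (span * self.demand.value) // self.demand.max_value

    def shared_ways(self) -> int:
        return self.ways - self.private_ways()

    def should_spill(self) -> bool:
        # Saturated demand: try to place victims in another core's cache.
        return self.demand.value == self.demand.max_value

    def can_receive(self) -> bool:
        # Any shared way left means this set can host a neighbor's victim.
        return self.shared_ways() > 0


if __name__ == "__main__":
    part = SetPartition()
    for _ in range(4):
        part.record_access(hit=False)
    print(part.private_ways(), part.shared_ways(), part.should_spill())
```

In this toy model a set with low demand keeps most ways shared and can accept spilled victims, while a set whose counter saturates claims all ways as private and becomes a spiller.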
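[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree conferred]: 2013
[Classification number]: TP332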

[Similar literature]