
Research on Key Techniques for Last-Level Cache Performance Optimization

Posted: 2018-08-02 20:31
【Abstract】: Modern processors commonly adopt a multi-level cache hierarchy to bridge the ever-widening performance gap between the processor and main memory. Unlike the first-level caches, which are split into separate instruction and data caches, the shared last-level cache (LLC) only sees accesses that have been filtered by the inner cache levels, so the data that reaches it exhibits relatively poor locality. Management policies designed for small, private first-level caches therefore struggle to use LLC capacity effectively, which severely limits improvements in processor memory performance. Managing the LLC well and reducing LLC misses are thus of great importance to overall system performance.
The operating system allocates physical memory and establishes the virtual-to-physical address mapping. By modifying its physical page frame allocation policy, the OS can influence how data is laid out in the LLC, improve data locality, and reduce LLC misses. Compared with traditional LLC optimizations based on hardware design or compiler techniques, this approach requires little hardware modification and is transparent to applications. However, existing operating systems were not designed with LLC optimization in mind and lack effective means to control and manage the LLC. This dissertation studies key techniques for LLC performance optimization from two directions: OS memory management policy design and hardware/software co-designed LLC management. The main contributions are as follows:
1. A region-based software partitioning method that reduces LLC pollution. Data with poor locality, once brought into the LLC, may evict frequently accessed data, causing LLC pollution. The method uses a memory-trace-based locality profiling and feedback mechanism to detect the poorly-local, polluting data regions of memory-intensive programs, and then modifies the OS physical page frame allocation policy so that the polluting data set is mapped into a small portion of the LLC. This protects data with good locality in the LLC and raises the LLC hit rate. Experiments show that, compared with the stock Linux kernel, the method reduces LLC misses per kilo-instruction (MPKI) by 15.23% on average and improves program performance by 7.01% on average.
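The page frame allocation technique in contribution 1 is, in spirit, a page-coloring scheme: in a physically indexed LLC, some set-index bits overlap the physical page frame number, so the allocator can steer a page toward a chosen group of LLC sets simply by choosing which frame to hand out. The C sketch below illustrates the idea under assumed, illustrative parameters (4 KB pages, 64 page colors, 4 colors reserved for polluting data); the dissertation's actual interface, parameters, and profiling mechanism are not reproduced here. MPKI in the results above means LLC misses per thousand retired instructions.

/*
 * Minimal page-coloring sketch (illustrative parameters, not the thesis's).
 * Assume: 4 KB pages and a physically indexed LLC in which the physical
 * address bits just above the page offset select the "page color", i.e.
 * the group of LLC sets a page's lines can occupy.
 */
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT        12
#define LLC_COLORS        64    /* assumed number of page colors */
#define POLLUTION_COLORS   4    /* small LLC slice reserved for polluting data */

/* Color of a physical page frame: the low bits of the frame number. */
static inline unsigned page_color(uint64_t pfn)
{
    return (unsigned)(pfn % LLC_COLORS);
}

/*
 * Hypothetical allocator hook: when profiling has marked a virtual region
 * as "polluting", only hand out frames whose color falls in the small
 * reserved range; all other data keeps the remaining colors.
 */
static bool frame_allowed(uint64_t pfn, bool region_is_polluting)
{
    unsigned c = page_color(pfn);
    return region_is_polluting ? (c < POLLUTION_COLORS)
                               : (c >= POLLUTION_COLORS);
}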
2. A shared-LLC optimization method for multicore processors that combines inter-process partitioning with pollution-region isolation. Data from concurrently running processes, as well as different data regions within a single process, compete for the shared LLC of a multicore processor and cause severe access conflicts. The method detects how an application's polluting data regions are distributed under different shared-LLC configurations, and sets up a global pollution buffer in the LLC into which the polluting regions of all concurrent processes are mapped. On top of inter-process partitioning, this further improves shared-LLC utilization when multiple processes run concurrently on a multicore. Experiments show that, compared with the stock Linux kernel and the inter-process partitioning scheme RapidMRC, overall multicore system performance improves by 26.31% and 5.86%, respectively.
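Contribution 2 can be read as stretching the same color budget across processes: each concurrent process receives a disjoint range of page colors, while one small shared range acts as the global pollution buffer into which every process's polluting regions are mapped. The table below is a hypothetical illustration of such budgets; the concrete partitioning policy and the RapidMRC-style profiling used in the dissertation are not shown.

/* Hypothetical per-process color budgets (illustrative values). */
#define LLC_COLORS        64
#define POLLUTION_COLORS   4

struct color_budget {
    unsigned first_color;   /* inclusive */
    unsigned num_colors;
};

/* Colors 0..3 form the shared global pollution buffer. */
static const struct color_budget global_pollution_buffer = { 0, POLLUTION_COLORS };

/* Example split of the remaining 60 colors between two processes. */
static const struct color_budget proc_budget[2] = {
    { POLLUTION_COLORS,      30 },   /* process A: colors 4..33  */
    { POLLUTION_COLORS + 30, 30 },   /* process B: colors 34..63 */
};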
3. A page-granularity, software-controlled LLC insertion policy with lightweight hardware support. Because the access information that hardware alone can record is limited, purely hardware-based LLC management policies have difficulty distinguishing the access behavior of different data regions within a program, and cannot effectively detect and locate poorly-local, polluting data. The method uses reserved bits in existing page table entries to build a software control interface for the LLC insertion policy; guided by profiling information, it controls at page granularity the position at which data from polluting regions is inserted into the LLC. The hardware overhead is small, and the method further reduces LLC pollution on top of tournament-based (set-dueling) hardware insertion policies, improving processor memory performance. Experiments show that, compared with LRU, DIP, and DRRIP, the method lowers LLC MPKI by 14.33%, 9.68%, and 6.24% on average, and improves average processor performance by 8.3%, 6.23%, and 4.24%, respectively.
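Contribution 3 exposes the insertion decision to software through otherwise unused page table entry bits. The sketch below shows one plausible shape of that interface, assuming a hypothetical hint bit placed in one of the software-available PTE bits (bit 52 is ignored by the MMU on x86-64); the modified cache hardware is assumed to read the hint on a fill and insert lines from the marked page near the LRU position instead of the MRU position. The actual bit assignment and encoding in the thesis may differ.

#include <stdint.h>
#include <stdbool.h>

/*
 * Hypothetical software interface for the page-granularity insertion hint.
 * On x86-64, PTE bits 52-58 are ignored by the MMU and available to software;
 * the assumed LLC extension would read this hint on a fill from the page and
 * insert the line at the LRU end rather than the MRU end.
 */
#define PTE_LLC_INSERT_LRU   (1ULL << 52)   /* illustrative choice of bit */

static inline uint64_t pte_set_llc_hint(uint64_t pte, bool polluting)
{
    return polluting ? (pte | PTE_LLC_INSERT_LRU)
                     : (pte & ~PTE_LLC_INSERT_LRU);
}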
4. A hardware/software cooperative LLC management policy oriented to virtual address regions. At run time, data in a contiguous virtual address region is usually mapped to scattered physical page frames. Existing LLC performance monitors cannot account for this distribution and therefore cannot guide run-time optimization. The method first designs a per-region LLC performance monitor over the virtual address space, which records LLC access information for the different data regions of a program online; second, it designs an online profiling analysis, supported by this monitor, that characterizes each region's access behavior and locality at run time; finally, it designs a software control interface for the LLC. Guided by the profiling information, the operating system configures appropriate bypass and insertion policies for each data region according to its access behavior. The method effectively improves LLC utilization without significantly increasing hardware overhead. Experiments show that, compared with LRU, DIP, and DRRIP, average processor performance improves by 8.05%, 5.94%, and 4.01%, respectively.
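Contribution 4 couples an online per-region monitor with per-region policy settings. The following sketch shows one plausible form of the decision step: per virtual-address-region counters of LLC accesses, misses, and reuses feed a simple rule that either bypasses the LLC or picks an insertion position. The counters, thresholds, and rule are illustrative assumptions rather than the dissertation's actual design.

#include <stdint.h>

/* Per-virtual-region counters an assumed region monitor might maintain. */
struct region_stats {
    uint64_t llc_accesses;
    uint64_t llc_misses;
    uint64_t llc_reuses;    /* hits on lines previously filled by this region */
};

enum llc_policy { LLC_INSERT_MRU, LLC_INSERT_LRU, LLC_BYPASS };

/* Illustrative decision rule: streaming regions bypass the LLC, low-reuse
 * regions insert at the LRU position, everything else keeps the default
 * MRU insertion. */
static enum llc_policy choose_policy(const struct region_stats *s)
{
    if (s->llc_accesses == 0)
        return LLC_INSERT_MRU;

    double miss_ratio  = (double)s->llc_misses / (double)s->llc_accesses;
    double reuse_ratio = (double)s->llc_reuses / (double)s->llc_accesses;

    if (miss_ratio > 0.95 && reuse_ratio < 0.01)
        return LLC_BYPASS;          /* almost pure streaming */
    if (reuse_ratio < 0.10)
        return LLC_INSERT_LRU;      /* weak locality: evict soon */
    return LLC_INSERT_MRU;
}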
【Degree-granting institution】: Peking University
【Degree level】: Doctoral (PhD)
【Year conferred】: 2013
【CLC classification number】: TP333



Link to this article: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2160641.html

