On-Chip Cache Resource Management Techniques Based on Performance Monitoring Hardware Support
Published: 2018-12-10 06:52
[Abstract]: Making efficient use of the on-chip cache is an important topic in multi-core processor research. Existing on-chip cache management mechanisms are transparent to software: they cannot perceive at run time the locality characteristics of a program's data set, nor distinguish the memory-access requests issued by different threads. On the one hand, when multiple threads run simultaneously on a multi-core processor, existing cache management policies cannot guarantee the performance of each task, and they allow unpredictable cache contention among the tasks sharing a cache, causing mutual interference and reducing system throughput. On the other hand, because software cannot control the allocation of cache space and management is left entirely to hardware, programs use the cache inefficiently; in particular, a single-threaded program cannot exploit the abundant on-chip cache resources of a multi-core processor to gain speedup.

To address these problems, this dissertation studies how to use the hardware performance monitoring unit (PMU) to observe a program's memory-access characteristics at run time, so as to manage shared-cache contention among concurrently running threads and to allocate cache space to single-threaded programs, thereby improving the throughput and performance stability of multi-task systems and providing an efficient cache-control mechanism for single-threaded execution. The research contents and main contributions of this dissertation are as follows:

(1) We study performance monitoring mechanisms that can sense a program's memory-access behavior at run time and propose LWM, a lightweight memory-access performance monitoring scheme built on the PMU. LWM exposes run-time memory performance information to user space, supplies system-level resource-usage information to the cache manager, and reduces the cost of memory performance monitoring. In the implementation, we add performance-event members to each task structure, provide system-call interfaces for event configuration, and correct the miscounting that arises from counter overflow and context switches. We also optimize the time-division multiplexing of performance counters, improving both the accuracy of multi-event monitoring and counter utilization (a sketch of this style of per-task counting follows this abstract).

(2) We study contention among multiple tasks for shared cache resources, introduce the concept of memory load, and design a memory-load-balancing scheduling algorithm that improves multi-task system throughput and program performance stability. The algorithm follows the design of the operating system's computational load-balancing scheduler and can serve as an extension of it; because we implement it as a user-level load-scheduling system, no kernel modification is needed (see the second sketch below). Experimental comparison with other scheduling algorithms shows considerable improvement in weighted speedup and overall system throughput; the algorithm lowers the intensity of contention for the shared cache and reduces the system's total number of off-chip memory requests. Thanks to its stable behavior, memory-load-balanced scheduling also narrows the performance variation across repeated runs of a program, which can support fair and dependable task scheduling in the operating system.

(3) We study the poor cache-space utilization of single-threaded programs running on multi-core platforms and propose VSCP, a new cache-control mechanism that raises the cache utilization of single-threaded programs and accelerates their execution. VSCP pools the cache resources of the entire system and offers the programmer an explicit cache-control interface: the physically distributed cache space is virtualized into a centralized cache under user control. Unlike program parallelization, which maximizes the use of computing resources, VSCP seeks to maximize the utilization of cache resources. VSCP keeps a single-threaded program on one processor core over any given period, reducing the power cost of keeping multiple cores active. Furthermore, when the on-chip cache cannot hold a program's entire working set, VSCP can pin selected strong-locality data sets in the cache, guaranteeing that they are neither evicted nor polluted, which lowers the cache miss rate and ultimately speeds up the program (the third sketch below shows a commodity-hardware analogue).

From this research we draw the following conclusions: (1) Memory-access performance is critical to both individual programs and overall system performance; as the "memory wall" grows ever more severe, reducing the cache miss rate is more effective than reducing the number of executed instructions for improving either. (2) Existing cache management policies, including the implementations of OS task scheduling and cache replacement, are unaware of inter-thread cache contention and sharing, which leads to inefficient cache management; cache resource management must be thread-aware, or it cannot support system performance, fairness, or quality of service. (3) Cache resource management on multi-core processors ultimately requires hardware/software cooperation: the interface between the program runtime and the cache manager must be redesigned, including better performance-monitoring infrastructure (in both software and hardware) for observing run-time behavior inside the system, as well as fine-grained cache-allocation mechanisms; solving these problems calls for the joint effort of operating system designers, hardware architects, and program developers.

The solutions to the key problems of cache resource management proposed in this dissertation were all designed and implemented on real hardware platforms, making them practical; the implementations are also general and can serve as a reference for implementing cache resource management mechanisms on future processor architectures.
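The abstract does not reproduce LWM's code, so the following is only a minimal sketch of the kind of low-overhead, per-task memory-event counting it describes, written against Linux's perf_event_open(2) interface to a real PMU. The event choice (generalized hardware cache misses) and the measured workload are illustrative assumptions, not LWM itself.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* glibc provides no wrapper for this system call. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* generalized LLC misses */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* pid = 0, cpu = -1: the counter follows the calling thread across
     * CPUs and context switches, analogous to the per-task performance-
     * event members LWM adds to each task structure. */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... workload under measurement ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses;
    if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
        printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```

When more events are programmed than the PMU has physical counters, the kernel time-multiplexes them; this is the situation whose accuracy and counter utilization LWM's optimized multiplexing targets. With Linux perf, requesting time_enabled and time_running through read_format lets user space rescale the raw counts.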
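The memory-load-balancing algorithm itself is not given in the abstract. As a rough sketch of how a user-level scheduler of this kind could place tasks without kernel changes, the code below greedily assigns tasks, sorted by a measured memory load such as LLC misses per second, to the least-loaded shared-cache domain and pins them with sched_setaffinity(2). The task and domain structures and the greedy heuristic are illustrative assumptions, not the thesis's algorithm.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/types.h>

struct task {
    pid_t  pid;
    double mem_load;   /* e.g., LLC misses per second, from the PMU */
};

/* Sort tasks by memory load, heaviest first (LPT-style heuristic). */
static int by_load_desc(const void *a, const void *b)
{
    double d = ((const struct task *)b)->mem_load -
               ((const struct task *)a)->mem_load;
    return (d > 0) - (d < 0);
}

/* Balance tasks across n_domains shared-cache domains, where each
 * domain_mask[d] holds the cores that share one last-level cache. */
void balance(struct task *tasks, int n_tasks,
             const cpu_set_t *domain_mask, int n_domains)
{
    double load[64] = {0};
    assert(n_domains <= 64);

    qsort(tasks, n_tasks, sizeof(*tasks), by_load_desc);
    for (int i = 0; i < n_tasks; i++) {
        int best = 0;                          /* least-loaded domain */
        for (int d = 1; d < n_domains; d++)
            if (load[d] < load[best]) best = d;
        load[best] += tasks[i].mem_load;
        if (sched_setaffinity(tasks[i].pid, sizeof(cpu_set_t),
                              &domain_mask[best]) != 0)
            perror("sched_setaffinity");
    }
}
```

A balancer like this would run periodically from the user-level monitoring loop, refreshing each task's memory load from the PMU counters before re-placing tasks; the spirit matches the thesis's design of operating as an extension of OS load balancing without touching the kernel.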
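VSCP's programmer interface is specific to the thesis and not shown in the abstract. As a commodity-hardware analogue of its pinning of strong-locality data, the sketch below uses Intel Cache Allocation Technology through the Linux resctrl filesystem to reserve last-level-cache ways for one task. The mount point, partition name, and way mask are platform-dependent assumptions, and fully excluding other tasks from those ways additionally requires shrinking the default group's mask.

```c
#include <stdio.h>

/* Write a string to a sysfs-style control file. */
static int write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    int rc = (fputs(s, f) >= 0) ? 0 : -1;
    fclose(f);
    return rc;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    /* Assumes resctrl is mounted and a partition "pinned" exists:
     *   mount -t resctrl resctrl /sys/fs/resctrl
     *   mkdir /sys/fs/resctrl/pinned                              */

    /* Reserve 4 of the L3 ways (mask 0xf) on cache domain 0 ...   */
    if (write_str("/sys/fs/resctrl/pinned/schemata", "L3:0=f\n"))
        return 1;
    /* ... and place the target task in that partition, so its
     * strong-locality working set stays resident in those ways.   */
    if (write_str("/sys/fs/resctrl/pinned/tasks", argv[1]))
        return 1;

    puts("task pinned to reserved LLC ways");
    return 0;
}
```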
Article ID: 2370152
Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2370152.html