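Self-Tuning Data Prefetching for Embedded Systems
Published: 2018-03-23 06:09
Topic: data prefetching  Focus: multicore processors  Source: Zhejiang University, 2013 master's thesis  Type: degree thesis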
[Abstract]: To address the memory wall in computer systems, modern processors use prefetching, which exploits regular address access patterns in applications to predict memory accesses and reduce the number of cache misses. However, current prefetching techniques from industry and academia have the following problems: 1) applications contain many linked-list pointer patterns, yet the prefetch engines in mainstream commercial processors predict only linear address patterns; 2) existing pointer prefetching methods test whether returned values look like addresses, and their prefetch accuracy is low, typically below 10%; 3) on multicore processors, data prefetch engines intensify contention for shared resources, which degrades overall system performance.
This thesis develops a cycle-level software simulator compatible with the MIPS32 instruction set to model the function, timing, and cost of embedded single-core and multicore processors. Solutions to the problems above are explored on this platform. Based on an analysis of application characteristics and an exploration of the optimization space, a multi-mode self-tuning data prefetching scheme for embedded single-core processors is proposed. Using runtime information collected by hardware, the scheme adaptively adjusts the aggressiveness of its two prefetch modes through special prefetch instructions, and it improves prefetch accuracy by distinguishing chained (pointer) patterns from linear ones. Running the EEMBC, SPEC CPU2006, and OLDEN benchmarks on the single-core simulator shows that the multi-mode prefetch engine reaches average accuracies of 36%, 40%, and 56%, respectively, whereas content-directed prefetching (CDP) pointer prefetching reaches 8%, 9%, and 24%. Performance improves by 7%, 6%, and 9% over stream prefetching, CDP pointer prefetching, and GHB prefetching, respectively.
For multicore, multithreaded workloads, the thesis proposes a prefetching mechanism based on thread classification to reduce the memory-system resource contention caused by data prefetching. The mechanism has two parts: (1) a filter that notifies the hardware unit to discard prefetch requests that would invalidate data across threads; (2) classification of threads from runtime information, used to adjust the on/off state and aggressiveness of each thread's data prefetch engine, which reduces resource conflicts between threads. A 16-core system is modeled and evaluated with PARSEC, SPLASH-2, and scientific computing programs. Relative to the baseline prefetch engine, the filtering mechanism and the thread-classification prefetch adjustment improve system performance by 2% and 6%, respectively. Relative to applying feedback-directed prefetching (FDP) to the baseline prefetch engine, the proposed mechanism improves system performance by 4% and reduces the energy-delay product by 4%.
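The abstract says that the aggressiveness of each prefetch mode is adjusted adaptively from runtime statistics gathered by hardware, but it does not give the counters, thresholds, or update rule. The Python sketch below is only an illustration of that general idea, not the thesis's design: the class `PrefetchMode`, the thresholds `HIGH_ACCURACY` and `LOW_ACCURACY`, and the level range are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical tuning constants: the abstract does not state any of these values.
HIGH_ACCURACY = 0.40   # at or above this, prefetch more aggressively
LOW_ACCURACY = 0.10    # below this, back off
MAX_LEVEL = 4          # upper bound on aggressiveness
MIN_LEVEL = 0          # level 0 stands for "mode switched off"


@dataclass
class PrefetchMode:
    """One prefetch mode (e.g. linear or chained) with its interval counters."""
    name: str
    level: int = 1     # aggressiveness: prefetches issued per trigger
    issued: int = 0    # prefetches sent during the current interval
    useful: int = 0    # prefetched lines later hit by a demand access

    def accuracy(self) -> float:
        """Fraction of issued prefetches that turned out to be useful."""
        return self.useful / self.issued if self.issued else 0.0

    def retune(self) -> None:
        """End-of-interval step: adjust the level, then reset the counters.

        A mode that issued nothing keeps its level, so a mode that reaches
        level 0 stays off here; re-enabling it is outside this sketch.
        """
        if self.issued:
            acc = self.accuracy()
            if acc >= HIGH_ACCURACY:
                self.level = min(self.level + 1, MAX_LEVEL)
            elif acc < LOW_ACCURACY:
                self.level = max(self.level - 1, MIN_LEVEL)
        self.issued = 0
        self.useful = 0
```

Keeping separate counters for each mode lets an accurate mode ramp up while an inaccurate one backs off independently, which is one plausible reading of "self-tuning" over two modes.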
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333