Microarchitecture Optimization of Embedded Processors
Posted: 2018-05-31 04:44
Topic: Embedded Systems + Microprocessor; Reference: Master's thesis, Zhejiang University, 2013
【Abstract】: Continuous advances in fabrication technology and the demands of emerging applications keep driving rapid improvements in processor performance. Embedded processors, however, face new challenges: on the one hand, the performance gap between memory and the processor increasingly constrains overall system performance; on the other hand, the high-precision floating-point requirements of many new applications place new demands on processor design. Based on an analysis of application characteristics, this thesis uses data prefetching to optimize the processor's memory system and designs a floating-point unit to accelerate data processing.

The design and configuration of mainstream prefetching mechanisms are not well suited to embedded processors: overly aggressive prefetching policies interfere with the processor's normal memory accesses, and complex prediction and control logic consumes considerable power and area. This thesis designs a variable-stride stream prefetching mechanism based on a stream information table. An optimized minimum-delta method identifies and filters data streams, reducing circuit complexity; a prefetch buffer lowers the cache port conflict rate; and a separate cache replacement policy is applied to prefetched data to offset the negative effect of cache pollution on prefetching. Simulation on the NoCOP hardware simulation platform shows that, on the EEMBC and SPEC2006 benchmark suites, the proposed stream prefetching mechanism improves performance by 4.3% on average and by up to 16% compared with no prefetching, and by 10.5% on average compared with the MSP (minimum delta prefetching) mechanism, at a cost of 35,000 additional equivalent gates and 30.1 mW of additional total power.
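The stream prefetcher above is a hardware design whose internals are not given in the abstract. As an illustration only, the following C sketch models the core idea of a stream information table with variable stride and minimum-delta stream detection; the table size, filtering window, confidence threshold, and field names are assumptions made for this example, not the parameters used in the thesis.

```c
/* Behavioral sketch of a stream-information-table prefetcher with variable
 * stride and minimum-delta stream detection.  All sizes and thresholds are
 * illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define STREAM_TABLE_ENTRIES 8   /* assumed size of the stream information table */
#define CONFIDENCE_THRESHOLD 2   /* issue prefetches only for confirmed streams  */

typedef struct {
    int      valid;
    unsigned last_addr;   /* last miss address seen by this stream   */
    int      stride;      /* current (variable) stride in bytes      */
    int      confidence;  /* how many times the stride has repeated  */
} stream_entry_t;

static stream_entry_t table[STREAM_TABLE_ENTRIES];

/* Minimum-delta matching: pick the entry closest to the new miss address;
 * a delta outside the window means "does not belong to any known stream". */
static stream_entry_t *find_closest(unsigned addr)
{
    stream_entry_t *best = NULL;
    unsigned best_delta = 0x1000;          /* assumed filtering window: 4 KiB */
    for (int i = 0; i < STREAM_TABLE_ENTRIES; i++) {
        if (!table[i].valid) continue;
        unsigned delta = addr > table[i].last_addr
                       ? addr - table[i].last_addr
                       : table[i].last_addr - addr;
        if (delta < best_delta) { best_delta = delta; best = &table[i]; }
    }
    return best;
}

/* Called on every cache miss; returns a prefetch address to push into the
 * prefetch buffer, or 0 if the access is filtered out. */
unsigned on_cache_miss(unsigned addr)
{
    stream_entry_t *e = find_closest(addr);
    if (e == NULL) {                       /* allocate a new stream entry */
        e = &table[rand() % STREAM_TABLE_ENTRIES];
        e->valid = 1; e->last_addr = addr; e->stride = 0; e->confidence = 0;
        return 0;
    }
    int new_stride = (int)(addr - e->last_addr);
    if (new_stride == e->stride) e->confidence++;
    else { e->stride = new_stride; e->confidence = 0; }
    e->last_addr = addr;

    if (e->confidence >= CONFIDENCE_THRESHOLD && e->stride != 0)
        return addr + (unsigned)e->stride; /* candidate for the prefetch buffer */
    return 0;
}

int main(void)
{
    /* A strided miss sequence: once the stride repeats, prefetches appear. */
    unsigned misses[] = {0x1000, 0x1040, 0x1080, 0x10c0, 0x1100};
    for (int i = 0; i < 5; i++) {
        unsigned p = on_cache_miss(misses[i]);
        if (p) printf("miss 0x%x -> prefetch 0x%x\n", misses[i], p);
    }
    return 0;
}
```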
Most existing prefetching mechanisms cannot serve both streaming and linked (pointer-chasing) data structures, and existing pointer prefetchers suffer from either large storage overhead or low prefetch accuracy. This thesis designs an adaptive multi-mode prefetching system that integrates a stream prefetch engine and a pointer prefetch engine: it judges the efficiency of the current working mode from the processor's runtime information and switches among stream prefetching, pointer prefetching, and no prefetching. Within this system, the proposed FCDP (filtered content directed prefetching) pointer prefetching mechanism improves the accuracy of CDP (content directed prefetching) through offset-address-based filtering and reduces the number of issued prefetches by 35% on average. Simulation on the NoCOP hardware simulation platform shows that, on the EEMBC, SPEC2006, and Olden benchmark suites, the prefetching system outperforms stream-only prefetching and FCDP-only prefetching by 11.7% and 50.6%, respectively, and can shut down the prefetch engines in time to reduce system power when prefetching is not effective.
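The abstract states only that FCDP filters content-directed prefetch candidates by their offset address. As a rough illustration of that idea, the C sketch below scans a fetched cache line for pointer-looking words (plain CDP) and suppresses offsets that have repeatedly proven useless; the line size, heap address window, and per-offset confidence counters are assumptions for the example and are not the thesis's actual design.

```c
/* Behavioral sketch of content-directed pointer prefetching with an
 * offset-based filter in the spirit of FCDP.  Parameters are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define LINE_WORDS 8              /* assumed 32-byte line of 32-bit words */
#define HEAP_BASE  0x10000000u    /* assumed address range used by the    */
#define HEAP_TOP   0x20000000u    /* "looks like a heap pointer" test     */

static int offset_conf[LINE_WORDS];   /* learned usefulness per word offset */

/* Plain CDP would issue a prefetch for every pointer-looking word in the
 * line; the filter drops offsets whose confidence has fallen below zero. */
int scan_line(const uint32_t line[LINE_WORDS], uint32_t out[LINE_WORDS])
{
    int n = 0;
    for (int off = 0; off < LINE_WORDS; off++) {
        uint32_t v = line[off];
        if (v < HEAP_BASE || v >= HEAP_TOP) continue;  /* not pointer-like */
        if (offset_conf[off] < 0) continue;            /* filtered offset  */
        out[n++] = v;                                  /* prefetch target  */
    }
    return n;
}

/* Feedback path: credit an offset when its prefetch was later demanded,
 * decay it otherwise, so useless offsets sink below zero and get filtered. */
void train_offset(int off, int was_useful)
{
    offset_conf[off] += was_useful ? 1 : -1;
}

int main(void)
{
    uint32_t line[LINE_WORDS] =
        { 0x10001000u, 0x00000042u, 0x10002000u, 0, 0, 0, 0x10003000u, 0 };
    uint32_t targets[LINE_WORDS];

    train_offset(6, 0);   /* pretend offset 6 has already proven useless */
    train_offset(6, 0);
    int n = scan_line(line, targets);
    for (int i = 0; i < n; i++)
        printf("prefetch 0x%08x\n", (unsigned)targets[i]);
    return 0;
}
```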
Given the large amount of floating-point data in new applications and their ever-increasing precision requirements, this thesis designs a floating-point unit (FPU) suitable for embedded processors to accelerate floating-point processing, and presents a worked example of using a software simulator to profile application characteristics and guide RTL (register transfer level) design. The FPU handles floating-point load/store instructions separately from floating-point arithmetic instructions, heavily reuses the logic of the original integer pipeline, and is tightly coupled with it. Experimental and logic synthesis results show that the FPU supports the MIPS32 single-precision floating-point instruction set, reaches a maximum operating frequency of 495 MHz in the worst case and 794 MHz in the typical case, adds 248,000 equivalent gates of area, and consumes 88.3 mW.
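The abstract mentions, without detail, a method of profiling application characteristics in a software simulator to guide the RTL design. Purely as an illustration of that workflow, the C sketch below tallies how often each class of MIPS32 single-precision floating-point operation appears in an instruction trace, the kind of statistic that could steer where pipelining and integer-datapath reuse matter most; the trace format and categories are assumptions, not the thesis's tool.

```c
/* Minimal sketch of "profile in a software simulator, then guide the RTL":
 * count the mix of floating-point instruction classes in a mnemonic trace. */
#include <stdio.h>
#include <string.h>

enum { FP_LOADSTORE, FP_ADDSUB, FP_MUL, FP_DIV, FP_OTHER, FP_CLASSES };
static const char *names[FP_CLASSES] =
    { "load/store", "add/sub", "mul", "div", "other" };

static int classify(const char *mnemonic)
{
    if (!strcmp(mnemonic, "lwc1") || !strcmp(mnemonic, "swc1")) return FP_LOADSTORE;
    if (!strcmp(mnemonic, "add.s") || !strcmp(mnemonic, "sub.s")) return FP_ADDSUB;
    if (!strcmp(mnemonic, "mul.s")) return FP_MUL;
    if (!strcmp(mnemonic, "div.s")) return FP_DIV;
    return FP_OTHER;
}

int main(void)
{
    /* Stand-in for a simulator-generated trace: one MIPS32 single-precision
     * mnemonic per line on stdin, e.g. dumped from an ISS run of a benchmark. */
    char mnemonic[32];
    long count[FP_CLASSES] = {0}, total = 0;

    while (scanf("%31s", mnemonic) == 1) {
        count[classify(mnemonic)]++;
        total++;
    }
    for (int i = 0; i < FP_CLASSES; i++)
        printf("%-10s %6.2f%%\n", names[i],
               total ? 100.0 * count[i] / total : 0.0);
    return 0;
}
```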
【Degree-Granting Institution】: Zhejiang University
【Degree Level】: Master's
【Year of Degree】: 2013
【Classification Number】: TP332
Article ID: 1958332
Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1958332.html