Research on Dynamic Instruction Mapping Algorithms for the EDGE Architecture

Published: 2018-07-24 13:38

[Abstract]: The centralized structures that pervade out-of-order superscalar processors have severely limited further gains in microprocessor performance. EDGE (Explicit Data Graph Execution), one of the models proposed to address this bottleneck, removes from its architectural model the power-hungry, poorly scalable centralized structures of superscalar designs. In a distributed EDGE architecture, instructions are mapped onto multiple tiles and execute concurrently. Passing operands between tiles incurs latency, which degrades performance. An instruction mapping algorithm tries to eliminate the performance loss introduced by tiling by carefully trading off program parallelism against inter-tile communication latency. The TRIPS microprocessor combines a topologically asymmetric distribution of critical resources with a static instruction mapping algorithm (SPDI, Static Placement Dynamic Issue). This leads to substantial load imbalance across the ETs (Execute Tiles) and to communication hot spots in the operand network, both of which lower IPC.
This thesis implements an EDGE architecture similar to TRIPS in the M5-EDGE simulator and uses it to study the dynamic Deep instruction mapping algorithm. Without compiler scheduling, the Deep algorithm with round-robin mapping achieves 85% and 98.3% of the IPC of SPDI at issue widths of 1 and 2, respectively. Taking into account the topological positions of the RTs (Register Tiles) and DTs (Data-cache Tiles), three optimizations of Deep mapping are examined: selecting ETs preferentially in ET-number order, in zigzag order, and by computing the total number of global communication hops of the hyperblock. At an issue width of 1, the three optimizations reduce the average hop count by 2.63%, 2.18%, and 4.70% relative to the basic Deep algorithm, and raise IPC by 1.07%, 1.21%, and 2.11%, respectively. This shows that, under Deep mapping, reducing the number of inter-instruction communication hops improves IPC significantly.
Under Deep mapping, more than 90% of operands are delivered through the operand bypass, which greatly reduces the load on the operand network. When the bypass width is twice the issue width, the local operand delivery latency drops to almost zero; increasing the local bypass width therefore effectively reduces operand delivery latency. Assigning the RTs to ETs by number raises the IPC of the basic Deep mapping algorithm by 1.77%. Two further optimizations target the DT positions: preferring ETs close to the DTs, and selecting ETs by the total communication hop count of the hyperblock. These raise IPC over basic Deep mapping by 1.17% and 1.89%, respectively. Tiling the RTs and DTs into the ETs yields a 4x4 topology. In this structure, Deep mapping achieves 97.18% and 113.42% of the IPC of SPDI at issue widths of 1 and 2; when ETs are selected by computed hop count, the ratios are 97.32% and 114.06%. System IPC improves significantly when a microarchitectural change shortens topological distances or when the Deep mapping algorithm optimizes the number of communication hops.
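The contrast between round-robin placement and hop-count-based ET selection can be illustrated with a minimal sketch. It is not the thesis's implementation: the row-major numbering of a 4x4 mesh, the use of Manhattan distance as the hop count, the per-tile capacity, the greedy placement order, and every function name below are assumptions made purely for illustration.

```python
from itertools import cycle

GRID = 4  # assumption: 4x4 mesh of tiles, numbered row-major 0..15


def coords(tile):
    """Return the (row, column) of a tile under row-major numbering."""
    return divmod(tile, GRID)


def hops(a, b):
    """Hop count between two tiles, assumed to be their Manhattan distance."""
    (ra, ca), (rb, cb) = coords(a), coords(b)
    return abs(ra - rb) + abs(ca - cb)


def round_robin(num_instructions, tiles):
    """Place instruction i on the i-th tile, cycling through the tile list."""
    order = cycle(tiles)
    return [next(order) for _ in range(num_instructions)]


def pick_min_hops(partners, candidates):
    """Pick the candidate tile with the smallest total hop count to the
    tiles in `partners`; ties go to the lowest tile number."""
    return min(candidates, key=lambda t: (sum(hops(t, p) for p in partners), t))


def greedy_place(num_instructions, edges, tiles, capacity):
    """Hypothetical greedy placement: each instruction goes to the free tile
    closest (in total hops) to its already-placed producers."""
    placement = {}
    load = {t: 0 for t in tiles}
    for i in range(num_instructions):
        partners = [placement[s] for s, d in edges if d == i and s in placement]
        free = [t for t in tiles if load[t] < capacity]
        tile = pick_min_hops(partners, free)
        placement[i] = tile
        load[tile] += 1
    return placement


def total_hops(placement, edges):
    """Sum of hop counts over all producer -> consumer edges."""
    return sum(hops(placement[src], placement[dst]) for src, dst in edges)


if __name__ == "__main__":
    chain = [(0, 1), (1, 2), (2, 3)]  # a 4-instruction dependence chain
    tiles = list(range(GRID * GRID))
    print(total_hops(round_robin(4, tiles), chain))
    print(total_hops(greedy_place(4, chain, tiles, capacity=2), chain))
```

For the four-instruction dependence chain in the demo, the greedy placement keeps producers and consumers on the same or adjacent tiles and so accumulates fewer hops than round-robin placement. A real mapper would also have to weigh load balance and the positions of the RTs and DTs, which this sketch ignores.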
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP332; TP301.6
