
Research on Dynamic Instruction Mapping Algorithms for the EDGE Architecture

Published: 2018-07-24 13:38
[Abstract]: The centralized structures that pervade out-of-order superscalar processors have severely limited further gains in microprocessor performance. EDGE (Explicit Data Graph Execution), one of the models proposed to address this bottleneck, removes from the architectural model the power-hungry, poorly scalable centralized structures of superscalar designs. In a distributed EDGE architecture, instructions are mapped onto multiple tiles and execute concurrently. Passing operands between tiles incurs latency, which degrades performance. An instruction mapping algorithm carefully weighs program parallelism against inter-tile communication latency in an attempt to eliminate the performance loss introduced by tiling.

The TRIPS microprocessor combines an asymmetric topological placement of critical resources with a static instruction mapping algorithm (SPDI, Static Placement Dynamic Issue). This causes substantial load imbalance across the ETs (Execute Tiles) and communication hot spots in the operand network, which lowers IPC.

This thesis implements a TRIPS-like EDGE architecture in the M5-EDGE simulator and uses it to study the dynamic instruction mapping algorithm Deep. Without compiler scheduling, Deep with round-robin mapping achieves 85% and 98.3% of the IPC of SPDI at issue widths of 1 and 2, respectively. Three optimizations of Deep mapping address the topological positions of the RTs (Register Tiles) and DTs (Data-cache Tiles); they prioritize ETs by ET number order, by zigzag order, and by the computed sum of global communication hops for the hyperblock. At an issue width of 1, compared with basic Deep, the three optimizations reduce the average hop count by 2.63%, 2.18%, and 4.70% and raise IPC by 1.07%, 1.21%, and 2.11%, respectively. This shows that, under Deep mapping, reducing inter-instruction communication hops can significantly improve IPC.

Under Deep mapping, more than 90% of operands are delivered through the operand bypass, which greatly reduces the load on the operand network. When the bypass width is twice the issue width, local operand delivery latency falls almost to zero. Widening the local bypass effectively reduces operand delivery latency.

Assigning RTs to ETs by number raises the IPC of basic Deep mapping by 1.77%. Two optimizations target DT position: preferring ETs close to the DTs, and selecting ETs by the computed sum of hyperblock communication hops. They raise IPC over basic Deep mapping by 1.17% and 1.89%, respectively. Tiling the RTs and DTs into the ETs yields a 4x4 topology. In that structure, at issue widths of 1 and 2, Deep mapping achieves 97.18% and 113.42% of the IPC of SPDI; when ETs are selected by computed hop count, the ratios are 97.32% and 114.06%. System IPC improves significantly when microarchitectural changes shorten topological distances or when the Deep mapping algorithm optimizes communication hop counts.
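The three ET selection orders named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the 4x4 row-major tile numbering, the Manhattan-distance hop model (a mesh with dimension-order routing), the serpentine reading of "zigzag", and all function names are assumptions made only for illustration.

```python
ROWS, COLS = 4, 4  # assumed 4x4 ET grid; tile ids are row-major (hypothetical layout)


def coords(tile):
    """Convert a row-major tile id into (row, col)."""
    return divmod(tile, COLS)


def hops(a, b):
    """Hop count between two tiles, assumed to be the Manhattan distance on a mesh."""
    (ra, ca), (rb, cb) = coords(a), coords(b)
    return abs(ra - rb) + abs(ca - cb)


def numbered_order():
    """Visit ETs in plain tile-number order."""
    return list(range(ROWS * COLS))


def zigzag_order():
    """Serpentine order: even rows left to right, odd rows right to left."""
    order = []
    for r in range(ROWS):
        cols = range(COLS) if r % 2 == 0 else reversed(range(COLS))
        order.extend(r * COLS + c for c in cols)
    return order


def path_hops(order):
    """Total hops when walking the tiles in the given order."""
    return sum(hops(a, b) for a, b in zip(order, order[1:]))


def min_hop_tile(producer_tiles, free_tiles):
    """Pick the free tile with the smallest summed hop count to all producers.

    Ties are broken by the lowest tile id.
    """
    return min(free_tiles, key=lambda t: (sum(hops(t, p) for p in producer_tiles), t))
```

Under these assumptions, consecutive tiles in the serpentine order are always one hop apart, whereas plain number order pays a long hop at each row boundary. `min_hop_tile` stands in for the hop-sum strategy: it places a consumer on whichever free tile is closest, in total, to the tiles that produce its operands.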
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP332;TP301.6





