Research on the Coordination of Instruction Issue in Data-Parallel Processors

Published: 2019-01-04 12:03

[Abstract]: Over the past 20 years, advances in semiconductor process technology and computer architecture have improved microprocessor performance more than a thousandfold. Yet the gap between the performance that applications demand and the performance that processors actually deliver keeps widening. As further process scaling becomes increasingly difficult and the negative effects of chip power consumption grow more prominent, narrowing this gap has become a difficult and urgent task. Data-parallel processors, which combine multi-core, SIMD (Single Instruction Multiple Data) and VLIW (Very Long Instruction Word) techniques, exploit data parallelism efficiently and offer a promising path to further performance gains. They nevertheless still suffer from a coordination problem in instruction issue. This thesis addresses that problem. Taking instruction issue techniques as its focus, it strengthens the coordination of instruction issue in data-parallel processors in two respects: the efficient fusion of multiple instruction issue modes, and cooperation among hardware resources achieved by overcoming performance bottlenecks. The main contributions are as follows:

1). An efficient fusion model for multi-core, SIMD and VLIW in data-parallel processors is analyzed and derived with power overhead taken into account. By adding representations of SIMD and VLIW to Amdahl's law, the thesis applies Amdahl's law to data-parallel processors and gives design guidance on the number of cores, the SIMD width and the VLIW length. It also identifies the key bottlenecks that limit data-parallel processor performance: serial processing, branch structures, and support for simultaneous multi-width SIMD.

2). A dual-core framework is proposed that accelerates serial-processing applications and provides efficient cooperation with control processing. It comprises three key techniques: kernel-level software pipelining, a dynamic decoupling/coupling mechanism, and unified branching with fast data sharing. Kernel-level software pipelining exposes a large amount of parallelism between serial and parallel application kernels, and the dynamic decoupling/coupling mechanism exploits that parallelism efficiently, eliminating the bottleneck effect of serial-processing applications. In addition, unified branching and fast data sharing further improve the performance of the dual-core framework in the tightly coupled state.

3). An instruction shuffling mechanism is proposed to overcome the bottleneck effect of branch structures. The mechanism retains the efficiency of the SIMD structure while providing the flexibility of the MIMD structure in handling branches: different SIMD lanes fetch the instructions that correspond to their own branch outcomes, so different branch paths execute in parallel. Because SIMD lanes that follow the same branch path still execute in SIMD fashion, the inherent efficiency of the SIMD structure is preserved. The instruction shuffling mechanism builds a bridge between the SIMD and MIMD structures and greatly improves the execution efficiency of data-parallel processors.

4). The instruction shuffling mechanism is extended into a multiple-SIMD multiple-data (MSMD) structure that supports both dynamic and static grouping of SIMD lanes. While handling branches efficiently, this structure also meets applications' need for simultaneous multi-width SIMD and supports the parallel execution of multiple application kernels with different SIMD width requirements. In addition, the MSMD structure improves the mapping algorithm of the instruction buffer used in the instruction shuffling mechanism, further raising the performance of the SIMD structure on branches.

5). The dual-core framework and the MSMD structure are combined into a coordinated instruction issue technique, which jointly addresses the serial processing, branch, and simultaneous multi-width SIMD problems in data-parallel processors and achieves cooperation among hardware resources. The structure was also designed and implemented in a full-chip RTL-level environment. The implementation results show that the coordinated instruction issue technique achieves efficient cooperation among the hardware resources of a data-parallel processor at reasonable overhead.

Data-parallel processor architecture remains an active research topic, and many key issues still call for more systematic and more practically meaningful study. Through its study of a fusion model for multiple instruction issue modes, this thesis provides systematic guidance for the design of data-parallel processors and proposes efficient solutions to the key bottlenecks that limit their performance. Verification and evaluation results show that the proposed solutions are effective and can be applied to the design and implementation of future data-parallel processors.
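The benefit claimed for the instruction shuffling mechanism in contribution 3) can be illustrated with a toy cycle count. The Python sketch below is only an illustration under stated assumptions, not the hardware design from the thesis: it assumes an idealized model in which each branch path has a fixed instruction count, one instruction takes one cycle, and lane groups on different paths never contend for resources. The function names (`masked_simd_cycles`, `shuffled_cycles`, `lane_groups`) and the example path lengths are hypothetical.

```python
def lane_groups(lane_paths):
    """Group SIMD lane indices by the branch path each lane takes."""
    groups = {}
    for lane, path in enumerate(lane_paths):
        groups.setdefault(path, []).append(lane)
    return groups


def masked_simd_cycles(lane_paths, path_lengths):
    """Baseline SIMD with one shared instruction stream (assumed model).

    Every path taken by at least one lane is executed in turn while the
    other lanes are masked off, so the path lengths add up.
    """
    return sum(path_lengths[path] for path in set(lane_paths))


def shuffled_cycles(lane_paths, path_lengths):
    """Idealized instruction shuffling (assumed model).

    Each lane group receives the instructions of its own path and the
    groups run concurrently, so the longest taken path sets the time.
    """
    return max((path_lengths[path] for path in set(lane_paths)), default=0)


if __name__ == "__main__":
    # Hypothetical 8-lane example: six lanes take "then", two take "else".
    lane_paths = ["then", "else", "then", "then", "else", "then", "then", "then"]
    path_lengths = {"then": 6, "else": 10}  # instructions per path (made up)

    print(lane_groups(lane_paths))
    print("masked SIMD cycles:", masked_simd_cycles(lane_paths, path_lengths))
    print("shuffled cycles:   ", shuffled_cycles(lane_paths, path_lengths))
```

In this toy model the two lanes on the "else" path still execute as one SIMD group, which mirrors the abstract's point that lanes sharing a branch path keep executing in SIMD fashion. A real design would also pay for instruction buffering and lane-to-buffer mapping, which the sketch ignores.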
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year conferred]: 2013
[Classification number]: TP332

[References]