Research on Loop Pipelining Optimization Techniques in Reconfigurable Compilation

Published: 2018-03-18 08:21

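Topic: reconfigurable computing; Focus: reconfigurable compilation; Source: Harbin Engineering University, 2016 doctoral dissertation; Document type: dissertation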

[Abstract]: With advances in semiconductor technology, reconfigurable computing architectures, which compute across both time and space, have broken through the limitations of the von Neumann architecture. Reconfigurable computing combines the efficiency of application-specific integrated circuits (ASICs) with the flexibility of general-purpose processors, and is widely used in important fields such as high-performance computing, digital signal processing, and network and information security. Its potential commercial and technical value is increasingly recognized, and it is becoming another mainstream computing paradigm. For general-purpose computing, reconfigurable architectures built on heterogeneous GPP+FPGA platforms outperform general-purpose processors with traditional architectures in energy consumption, storage, performance, and other respects, which makes reconfigurable computing an important research direction for future computing. Research on reconfigurable computing for general-purpose computing is still at an early stage; although many results have been obtained, many problems still call for in-depth study. One important factor limiting the practical adoption of reconfigurable computing systems is the immaturity of the software ecosystem. Because compiler technology is also not constrained by semiconductor manufacturing processes or related hardware technology, reconfigurable compilers for such systems have become a focus of research worldwide. Analysis of how reconfigurable computing systems accelerate general-purpose applications in hardware points to one topic of current interest: improving the compiler's automatic mapping of loop structures in applications onto parallel, pipelined hardware acceleration units on the reconfigurable platform.

Building on previous work, this dissertation studies the automatic mapping and optimization of the three main functional modules of a loop program: the computation unit, the control unit, and the storage unit. The specific contributions are as follows:

(1) When existing reconfigurable compilers map a loop program onto a pipelined computation unit, they usually partition the pipeline directly, without considering the actual hardware delay of basic operation instructions on the FPGA, so the resulting partition is suboptimal. To address this, the dissertation designs an automatic pipeline partitioning algorithm based on hardware delay characteristics. Using the hardware delays of basic operation instructions when a loop program runs on an FPGA, it builds a hardware delay library for basic instructions, then uses these delays as weights to merge and optimize pipeline stages, achieving automatic pipeline partitioning. Experimental results show that the algorithm effectively reduces the number of pipeline stages, which reduces the hardware resource overhead caused by pipeline partitioning and lowers the number of clock cycles the computation unit needs for a single iteration.

(2) In existing reconfigurable compilers, the initiation interval between iterations of a pipelined loop is controlled by compiler directive statements. This approach can only produce a fixed initiation interval, cannot fully improve pipelined loop performance, and limits the compiler's level of automation. To address this, the dissertation designs a method for automatically analyzing and optimizing the initiation interval between loop iterations. It builds an information model of the initiation interval, adopts a non-fixed initiation interval control strategy to analyze the interval automatically, and applies pipeline scheduling to optimize it. Experimental results show that the non-fixed initiation interval control strategy effectively reduces the waiting delay between iterations during pipelined execution, and that the automatic analysis algorithm effectively raises the compiler's level of automation.

(3) Many parallel memory structures have been proposed for reconfigurable computing systems. To improve the parallelism and reuse of data accesses, they usually trade space for time, yet there is still room for improvement in both resource overhead and performance. To address this, the dissertation designs an automatic mapping method for a parameterized parallel memory structure. For applications with affine-like array subscripts, it designs a parameterized parallel memory architecture, builds the memory access data dependence graph of the loop program with an automatic generation algorithm, computes the parameters of the parallel memory structure template, and implements automatic mapping and generation of the parallel memory architecture in the reconfigurable compiler. Experimental results show that the memory architecture fully exploits data parallelism and reuse in loops and, compared with existing schemes, improves pipelined loop performance while using fewer hardware resources.

Finally, the dissertation combines these results, applying the delay-based automatic pipeline partitioning algorithm, the initiation interval analysis and optimization method, and the parameterized parallel memory mapping method to the automatic generation of the computation unit, control unit, and storage unit of a loop program, respectively, to build an automatic loop pipelining mapping framework for reconfigurable compilers. Experimental results show that the approach raises the automation level of the reconfigurable compiler while effectively improving the pipelined performance of loop programs on reconfigurable computing systems, and is therefore feasible.
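
As a rough illustration of contribution (1) in the abstract, the sketch below shows one way delay-weighted pipeline partitioning could work. It is not the dissertation's algorithm, whose details the abstract does not give. The delay values in `DELAY_NS`, the function names, the example loop body, and the greedy chaining rule are all assumptions made for this example.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical delay library in nanoseconds per operation.
# The numbers are illustrative only, not measured FPGA delays.
DELAY_NS = {"add": 2.0, "mul": 6.0, "shl": 0.5}


def _topo_order(ops, deps):
    """Yield operation names so that every predecessor comes first."""
    return TopologicalSorter({n: deps.get(n, ()) for n in ops}).static_order()


def naive_stage_count(ops, deps):
    """Baseline assumed here: one pipeline stage per dependence level."""
    level = {}
    for n in _topo_order(ops, deps):
        level[n] = 1 + max((level[p] for p in deps.get(n, ())), default=-1)
    return max(level.values()) + 1


def partition_stages(ops, deps, clock_ns):
    """Greedily chain operations into a stage while the accumulated
    delay along the dependence path still fits in one clock period.

    ops:  operation name -> opcode (a key of DELAY_NS)
    deps: operation name -> set of predecessor names
    Returns operation name -> stage index.
    """
    stage, finish = {}, {}  # finish = time at which an op's result is ready within its stage
    for n in _topo_order(ops, deps):
        delay = DELAY_NS[ops[n]]
        if delay > clock_ns:
            raise ValueError(f"{n} alone exceeds the clock period")
        preds = deps.get(n, ())
        s = max((stage[p] for p in preds), default=0)
        # Only predecessors in the same stage contribute combinational delay;
        # results from earlier stages arrive registered at time 0.
        start = max((finish[p] for p in preds if stage[p] == s), default=0.0)
        if start + delay > clock_ns:  # does not fit: open the next stage
            s, start = s + 1, 0.0
        stage[n], finish[n] = s, start + delay
    return stage


# Example loop body: y = (a + b) * c + (d << 1)
OPS = {"t1": "add", "t2": "mul", "t3": "shl", "t4": "add"}
DEPS = {"t2": {"t1"}, "t4": {"t2", "t3"}}

if __name__ == "__main__":
    print("baseline stages:", naive_stage_count(OPS, DEPS))
    for clock in (10.0, 8.0):
        stages = partition_stages(OPS, DEPS, clock)
        print(f"{clock} ns clock:", max(stages.values()) + 1, "stage(s)", stages)
```

On this example the level-by-level baseline needs three stages, while chaining by delay fits all four operations into one stage at a 10 ns clock and into two stages at 8 ns. Fewer stages means fewer pipeline registers and fewer cycles per iteration, which is the kind of saving the abstract reports.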

[Degree-granting institution]: Harbin Engineering University
[Degree level]: Doctoral
[Year conferred]: 2016
[Classification number]: TP314

[Related literature]