Compilation and Program Optimization Techniques on a Reconfigurable Many-Core Stream Processor

Published: 2018-07-08 16:26

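Topics: stream programming model + reconfigurable many-core stream processor; Source: University of Science and Technology of China, 2013 doctoral dissertation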

[Abstract]: Continued progress in semiconductor process technology and the introduction of the stream programming model are two major forces driving the evolution of many-core stream processor architectures. To exploit the abundant, cheap transistors delivered by Moore's law, the chip multiprocessor has become one of the industry-accepted solutions for the next generation of processor architecture. Meanwhile, the stream programming model provides an effective tool for expressing and exploiting application parallelism, and has become a general approach to program parallelization. However, the runtime characteristics of stream applications place new demands on the flexible configuration of processor resources. On the one hand, different programs differ widely in their demands for compute, storage, and control hardware; on the other hand, some programs show phase behavior, requiring different hardware resources in different phases. To address this problem, reconfigurable many-core processor architectures have attracted wide attention in recent years. They aim to compose logical processors of different granularity from homogeneous, lightweight physical cores, adapting hardware resources to maximize the performance of serial programs. Building on these observations, we propose TPA-S, a reconfigurable many-core stream processor architecture intended to give stream applications a hardware substrate whose on-chip resources can be flexibly reconfigured.
Centered on this architecture, this dissertation studies compilation and program optimization techniques for stream programming models, represented by CUDA, on TPA-S. It covers the execution model of the reconfigurable stream processor, instruction set extensions, compiler system design, and program optimization techniques. The main research contents are:
(1) We study the execution models of the stream programming model and of the reconfigurable stream architecture, and explore schemes for mapping stream applications onto the reconfigurable stream processor TPA-S. The essence of the stream programming model is the separation of control from computation: the compute-intensive parts of a stream program are extracted as kernel functions and run by multiple compute threads to exploit data-level parallelism, while the control thread handles the control flow outside the computation, organizing data for the kernels and exploiting producer-consumer locality. We design two program mapping schemes, master-slave and phased, which map the compute threads and the control thread onto multiple logical processors of TPA-S asynchronously and synchronously, respectively. We also propose two ways of organizing kernel functions, single threads and combined threads, to seek the best balance between single-thread performance and system throughput.
(2) We study the design requirements that the stream programming model places on the TPA-S instruction set architecture, and propose DISC-S, an extended dataflow-like EDGE instruction set. The TPA-S stream processor is based on the explicit dataflow graph execution (EDGE) instruction set: each thread of a program is organized as a sequence of superblocks that execute and commit atomically, and instructions inside a superblock execute in a dataflow-like manner. The special target-field encoding of the EDGE instruction set is one of the foundations of physical-core reconfigurability in the TPA-S processor. However, the stream programming model brings the TPA-S microarchitecture several new features that the EDGE instruction set does not cover: compute threads need read-only special registers to obtain thread index information quickly; compute threads need to access data in software-managed on-chip shared memory; and threads need an efficient barrier synchronization mechanism. To address these features, the DISC-S extended instruction set adds instructions for reading and writing special registers, for the software-managed on-chip shared memory hierarchy, and for inter-thread synchronization, which makes the stream programming model easier to map.
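The three features listed in (2), a read-only thread index, software-managed shared memory, and barrier synchronization, are what a CUDA kernel sees as `threadIdx`, `__shared__` memory, and `__syncthreads()`. The sketch below imitates that pattern with ordinary Python threads; it is not DISC-S or CUDA code, and the name `run_block` and the neighbor-sum computation are illustrative assumptions rather than anything taken from the dissertation.

```python
import threading

def run_block(data):
    """Simulate one thread block over a non-empty list `data`.

    Each thread stages one element into a shared buffer, waits at a
    barrier, then adds its right neighbor's element to its own.
    """
    n = len(data)
    shared = [None] * n             # stands in for software-managed on-chip shared memory
    out = [None] * n
    barrier = threading.Barrier(n)  # stands in for a barrier-synchronization instruction

    def compute_thread(tid):        # tid plays the role of a read-only thread-index register
        shared[tid] = data[tid]     # phase 1: stage this thread's element
        barrier.wait()              # every write to `shared` completes before any read
        out[tid] = shared[tid] + shared[(tid + 1) % n]  # phase 2: read a neighbor's slot

    threads = [threading.Thread(target=compute_thread, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Without the barrier, a thread could read a neighbor's slot before it has been written, which is why the source treats efficient barrier synchronization as an ISA-level requirement.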
(3) We design and implement a TPA-S compiler system for the CUDA programming language. Following the NVCC compilation framework, the compiler system is split into two compilation chains, kernel level and stream level. At the kernel level, we design the compiler Ptx2EDGE with reference to the scale compiler; it compiles compute-thread source code in PTX assembly form into binary object code for the DISC-S extended instruction set. At the stream level, we implement the CUDA C syntax extensions and API function interfaces, port the runtime library, and reuse the scale serial compiler to generate control-thread code. Most of the compiler work consists of implementing a series of compiler modules and software tools, including front-end recognition of PTX and CUDA syntax, intermediate representation conversion, superblock generation and optimization, resource allocation and instruction scheduling, and the assembler and linker. In addition, to implement the device management, thread management, execution control, and memory management modules used by CUDA control threads, we port the CUDA runtime library and implement the software runtime system Mpsim. The correctness and execution efficiency of the compiler system are verified and evaluated experimentally on a set of benchmark programs.
(4) We study program optimization methods for irregular programs on stream processors. We analyze the performance bottlenecks of irregular programs on many-core stream processors and carry out a case study of graph breadth-first search (BFS) on the GPU platform, aiming to find general methods for tuning the performance of irregular programs on stream processors. For the complex, variable, unstructured data parallelism in irregular programs, we propose FlexBFS, an efficient implementation method based on parallelism feedback. For the inefficient dynamic-queue memory accesses caused by irregular memory access, and for the load imbalance caused by irregular inputs, we propose corresponding program optimization techniques. These implementation methods and optimization techniques can all be applied to other irregular programs.
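As a rough illustration of what parallelism feedback can mean for BFS, the sketch below runs a level-synchronous BFS in which the frontier size measured at each level decides how many workers that level is split across. This is not the FlexBFS implementation, whose details the abstract does not give; the function name `bfs_with_feedback`, the `max_workers` parameter, and the round-robin partitioning are assumptions, and the workers are simulated sequentially.

```python
def bfs_with_feedback(adj, source, max_workers=8):
    """Level-synchronous BFS; returns (dist, workers_per_level).

    adj: dict mapping each vertex to a list of its neighbors.
    The frontier size measured at each level is fed back to choose how
    many (simulated) workers process that level.
    """
    dist = {source: 0}
    frontier = [source]
    workers_per_level = []
    while frontier:
        # Feedback step: the frontier size bounds the useful parallelism.
        workers = min(max_workers, len(frontier))
        workers_per_level.append(workers)
        # Partition the frontier round-robin among the workers.
        chunks = [frontier[w::workers] for w in range(workers)]
        next_frontier = []
        for chunk in chunks:            # each chunk would run on its own worker
            for u in chunk:
                for v in adj.get(u, []):
                    if v not in dist:   # the first visit fixes the BFS level
                        dist[v] = dist[u] + 1
                        next_frontier.append(v)
        frontier = next_frontier
    return dist, workers_per_level
```

On a small diamond-shaped graph the frontier grows from one vertex to two and back to one, so the worker count follows the same shape; narrow levels do not occupy resources they cannot use.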
This dissertation reaches the following key conclusions: (1) the stream programming model can drive the design of the instruction set architecture, and the ISA needs to reflect the new features the programming model introduces; (2) multithreaded mapping of stream programs requires the cooperation of a software runtime library, and future operating systems could add support for the stream programming model; (3) compiler system design should use appropriate programming patterns and software engineering methods, such as the visitor pattern, to improve the modularity and readability of the code; (4) exploiting parallelism in irregular programs requires accurate parallelism metrics, and online profiling can guide a sensible partitioning of compute resources.
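Conclusion (3) names the visitor pattern. A minimal sketch of how it keeps compiler passes separate from intermediate-representation node classes follows; the node types `Const` and `Add` and the two passes are hypothetical and are not drawn from Ptx2EDGE.

```python
from dataclasses import dataclass

@dataclass
class Const:
    value: int
    def accept(self, visitor):
        return visitor.visit_const(self)

@dataclass
class Add:
    left: object
    right: object
    def accept(self, visitor):
        return visitor.visit_add(self)

class Evaluator:
    """One pass: fold the expression tree to a value."""
    def visit_const(self, node):
        return node.value
    def visit_add(self, node):
        return node.left.accept(self) + node.right.accept(self)

class Printer:
    """Another pass over the same nodes: render the tree as text."""
    def visit_const(self, node):
        return str(node.value)
    def visit_add(self, node):
        return f"({node.left.accept(self)} + {node.right.accept(self)})"
```

Adding a new pass means adding a new visitor class; the node classes stay unchanged, which is the modularity benefit the conclusion refers to.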
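[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctoral
[Year degree conferred]: 2013
[Classification number]: TP314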

[Similar literature]