面向可重构系统的软硬件划分技术研究

发布时间：2018-04-21 10:07

本文选题：可重构计算系统 + 现场可编程门阵列　；参考：《哈尔滨工程大学》2013年博士论文

【摘要】：基于现场可编程门阵列的可重构计算系统兼有通用处理器的灵活性和现场可编程门阵列的高效性,所以在高性能计算领域中正在被广泛应用。一个高效率的软硬件划分算法能够将应用程序自动而有效地分配到通用处理器和现场可编程门阵列上,可以使两种运算部件最大限度地发挥出各自计算模式的优势,因此,对软硬件划分的研究正逐渐成为可重构计算系统领域的研究热点。纵观国内外研究现状,对软硬件划分的研究已经取得了很多成果,但仍存在许多亟待解决的问题。在前人工作的基础上,本文以现场可编程门阵列的面积作为约束条件,以系统整体性能作为优化目标,设计了一种面向中央处理器/现场可编程门阵列的可重构加速系统的软硬件划分框架。该框架的主体由三大主要功能模块组成,在每个模块中,分别对应用程序片段在中央处理器和现场可编程门阵列上实现时花费代价的估计、以及软硬件划分算法等关键技术进行了深入研究,希望上述框架不仅能够确定程序片段是放在中央处理器上或是现场可编程门阵列上运行,并且能对被选中放在现场可编程门阵列上运行的每个程序片段(例如循环)可能的多个硬件版本进行确定,以得到尽可能佳的划分解决方案。具体的研究内容包括:在计算密集型应用程序中循环部分往往是其主要的工作负载,经过分析,采用传统面向循环的静态分析技术无法得到循环执行次数等动态信息;而采用边剖析等动态分析技术虽能得到程序片段的执行次数等信息,但却不能判定该程序片段是否是循环结构,针对这种情况,本文将基于支配关系的循环识别技术和边剖析的分析技术相结合,设计了一种动静态结合的循环运行时分析算法,并在LLVM平台上实现。实验结果表明,该算法既能够自动识别所有循环结构,又能对循环部分的平均迭代次数、循环调用次数、循环软件运行时间及在现场可编程门阵列上实现时软硬件间通信开销等进行精确分析,进而为可重构计算系统待加速循环的选择提供较全面、精确的依据。在可重构计算系统的高层次设计过程中,采用估计技术获取硬件实现及执行时的性能参数是一种快速可行的方法。但是现有的高层次硬件执行时间/面积估计方法往往与特定的硬件实现环境(例如现场可编程门阵列的某种结构及其使用的工具链属性等)相关,通用性差;另外,对循环实现时可能的多个版本的硬件实现代价的估计也支持不足。针对通用性差的问题,本文在评估时首先根据程序语言中不同的运算表达式,结合其通常的电路实现模式,推导出一整套与实现环境无关的针对每个运算的硬件执行时间/面积估计公式,再利用真实反馈信息对推导出的估计公式进行修正,使其可以适用于各种不同的实现环境;针对硬件多版本的估计支持不足的问题,设计了一种面向多版本的细化到以运算操作为基本单位的参数输入统一接口,再结合各个运算操作经过修正后的估计公式,构建了一种面向循环在FPGA上实现时多版本特征的估计算法。该方法能够快速、精确估计出不同程序片段在FPGA上实现时的硬件执行时间/面积,尤其能够对循环实现时各个不同硬件版本的执行时间/面积进行估计,为硬件多版本设计空间探索和软硬件划分提供了精确的信息支持。承上所述,目前在RCS领域已经有很多软硬件划分算法的成果,但这些方法通常默认循环在FPGA上实现时只有一种硬件实现方式,忽略了循环的硬件多版本特征,降低了划分解的质量。另外在基于CPU/FPGA的可重构加速系统中,通信开销往往是系统整体性能的瓶颈。针对以上两种情况,本文首先构建了一个带有硬件多版本特征的软硬件划分模型,然后面向软硬件间通信开销最优对循环进行分簇,并依据分簇的结果对划分模型中的优化目标函数进行更新,最后从全局优化的角度,采用以浮点数编码的遗传算法来进行求解,从而形成了本文设计的一种带有硬件多版本探索和划分粒度优化再选择的软硬件划分算法。通过该算法,不仅可以确定程序中某循环片段应该放在CPU或在FPGA上实现,而且还可以确定循环在FPGA上实现的较佳硬件版本形式,从全局性能最优的角度提高了软硬件划分解的质量。实验结果表明,采用遗传算法求解带有硬件多版本探索及划分粒度再选择的软硬件划分问题得到了较好的效果,但随着待划分集合的规模增大,遗传算法较弱的局部搜索能力又会影响划分解的质量。经过分析,发现在选择,交叉,变异算子中,遗传算法的局部搜索能力在很大程度上依靠变异算子,该算子传统上采用的随机变异策略容易对优秀的染色体造成破坏,产生较差的个体。因此本文在上述遗传算法的基础上,经过改进,又设计了一种性能更佳的基于Q-学习和遗传算法的面向硬件多版本探索的软硬件划分算法。依据硬件多版本的性能、面积的矛盾特征、将Q-学习算法和贪婪规则相结合,自适应选择合适的变异方向,成为改进后遗传算法的明显特征。实验结果表明,与标准遗传算法相比,改进算法在搜索质量、收敛性方面都具有良好的效果,增强了针对硬件多版本探索的局部搜索能力,进一步提高了软硬件划分解的质量。
[Abstract]:The reconfigurable computing system based on the field programmable gate array has the flexibility of the universal processor and the efficiency of the field programmable gate array, so it is being widely used in the field of high performance computing. An efficient software and hardware partition algorithm can automatically and effectively allocate the application to the general purpose processor and the field available. In the range gate array, the two operating components can maximize the advantages of their respective computing modes. Therefore, the research on the partition of hardware and software is becoming a hot topic in the field of reconfigurable computing systems. On the basis of the previous work, this paper takes the area of the field programmable gate array as the constraint condition, and takes the overall performance of the system as the optimization target, and designs a software and hardware partition frame for the reconfigurable acceleration system oriented to the central processor / field programmable gate array. The main body of the framework is composed of three main functional modules. In each module, the cost estimation of the application fragment in the central processor and the field programmable gate array, as well as the hardware and software partitioning algorithms are studied in depth. It is hoped that the framework not only can determine the program fragment on the central processor or the field programmable gate array. It runs on the column, and can determine the possible multiple hardware versions of each program fragment (such as loops) that are selected to run on the field programmable gate array to get the best possible partition solutions. The specific research content includes that the loop part in the computing intensive application is often its main workload, After analysis, the traditional cyclic static analysis technology can not get the dynamic information such as the number of cyclic execution. While the dynamic analysis technology such as edge analysis can get the information of the execution times of the program fragment, but it can not determine whether the program fragment is a cyclic structure. Combining ring recognition and edge analysis, a dynamic and static cyclic running time analysis algorithm is designed and implemented on the LLVM platform. The experimental results show that the algorithm can automatically identify all the cyclic structures, the average iteration number of the cyclic parts, the number of cycle calls, the running time of the cyclic software and the The communication overhead between hardware and software is accurately analyzed in the field programmable gate array, which provides a more comprehensive and accurate basis for the selection of the reconfigurable computing system to be accelerated cycle. In the high level design process of the reconfigurable computing system, it is a fast to use the estimation technique to obtain the performance parameters of the hard parts and execution. But the existing high level hardware execution time / area estimation methods are often related to the specific hardware implementation environment, such as some structure of the field programmable gate array and the tool chain properties used, and the generality is poor; in addition, the estimation of the hardware implementation costs of possible multiple versions of the loop is also supported. In order to solve the problem of poor generality, this paper first derives a set of formulae for estimating the execution time / area of each operation based on the different operational expressions in the program language and the common circuit implementation mode, and then uses the real feedback information to deduce the estimated formula. To make it correct, it can be applied to a variety of different implementation environment. Aiming at the problem of insufficient support for multi version of hardware, a kind of unified interface is designed for the parameter input of multi version to the basic unit of operation operation, and then a kind of circular orientation is constructed by combining the modified estimation formula of each operation operation. An estimation algorithm for multi version features implemented on FPGA. This method can quickly and accurately estimate the execution time / area of the hardware when different program segments are implemented on the FPGA, and it is especially able to estimate the execution time / area of different hardware versions when the cycle is implemented. There are many software and hardware partitioning algorithms in the RCS field, but these methods usually have only one hardware implementation in the FPGA implementation, ignoring the multi version features of the loop hardware and reducing the quality of the decomposition. In addition, the reconfigurable acceleration based on the CPU/FPGA is made. In the system, communication overhead is often the bottleneck of the overall performance of the system. In this paper, a software and hardware partition model with multi version features of hardware is constructed first, and then the optimal communication overhead between hardware and software is optimized to cluster the cycle, and the optimization objective function in the partition model is more based on the results of the cluster. Finally, from the point of view of global optimization, a genetic algorithm based on floating point number coding is used to solve the problem. Thus, a software and hardware partition algorithm with hardware multi version exploration and granularity optimization and re selection is formed in this paper. Through this algorithm, it can not only determine a cyclic fragment in the sequence of CPU or FPGA. In addition, the better hardware version of the loop on FPGA can be determined, and the quality of the software and hardware decomposition is improved from the point of the best global performance. The experimental results show that the application of genetic algorithm to the problem of hardware and software partitioning with multiple versions of hardware and the partition of granularity and re selection is better. The size of the partition set increases, and the weak local search ability of the genetic algorithm will affect the quality of the decomposition. It is found that the local search ability of the genetic algorithm depends largely on the mutation operator in the selection, cross and mutation operators. The traditional random mutation strategy used in the operator is easy to cause the excellent chromosomes. In this paper, on the basis of the above genetic algorithm, this paper designs a software and hardware partition algorithm based on Q- learning and genetic algorithm, which is based on the performance of the hardware and the contradictory features of the area, and combines the Q- learning algorithm with the greedy rule. The experimental results show that, compared with the standard genetic algorithm, the improved algorithm has a good effect on the search quality and convergence, and enhances the local search capability for the exploration of the hardware multi version, and further improves the quality of the software and hardware decomposition.

【学位授予单位】：哈尔滨工程大学
【学位级别】：博士
【学位授予年份】：2013
【分类号】：TP38

【相似文献】