Research on GPGPU Architecture and Performance Analysis

Published: 2018-06-24 14:17

Topic: GPGPU + Fermi; Reference: Jilin University, master's thesis, 2017

[Abstract]: Over the past decade or more, GPU processing performance has grown rapidly. GPUs differ substantially from CPUs in structure: a GPU devotes more of its transistors to computation, whereas a CPU devotes more of them to control logic. Because of these different design goals, the two play different roles. GPUs have also moved quickly from image processing into general-purpose computing, opening a new field known as GPGPU (General-Purpose Computing on the Graphics Processing Unit). GPGPUs are designed for parallel tasks, so studying parallel computing models is worthwhile. Although classical parallel computing models such as the PRAM, BSP, and LogP models were proposed many years ago, studying them gives a deeper understanding of GPGPU structure. Since the GPGPU concept was introduced, much research has concentrated on exploiting its computing power to greatly improve efficiency on particular problems. The main reason is that the detailed chip structure, pipeline, and memory design are trade secrets, so this information is hard to obtain for research. NVIDIA and AMD are the two main GPGPU manufacturers; of the two, NVIDIA's official documentation is more detailed and its CUDA toolkit is more complete, so this thesis focuses on NVIDIA chips. It uses the open-source GPGPU-Sim simulator to simulate NVIDIA GPUs. The thesis compares several parallel computing models, including the PRAM, BSP, and LogP models, examining the similarities and differences in their parameters and their core ideas, and briefly surveys the current state of GPU research. It then proposes a new model, NKGPGPU, which gives a detailed framework for the hardware structure, the logical structure of tasks, the code structure, and the mappings among them.
Overall, NKGPGPU comprises five sub-models: the hardware structure sub-model, the task structure sub-model, the task organization sub-model, the task execution sub-model, and the task scheduling sub-model. The hardware structure sub-model describes the main components of an NKGPGPU chip. The task organization sub-model describes the code structure suited to NKGPGPU and the mapping between code and tasks, and also gives a model of the launch relationships between tasks. The task execution sub-model gives the mapping between code and hardware. The task scheduling sub-model gives the mapping between the task topology and the hardware structure. The thesis also presents a performance analysis model that is consistent with the proposed NKGPGPU. For the three main factors that affect GPGPU performance (the GPGPU pipeline, shared memory, and global memory), it reports detailed experiments under different thread counts. The pipeline experiments study how the number of execution cycles differs across instruction types and use that difference to infer the relationship between instructions and the pipeline. Shared memory and global memory are studied in a similar way, by measuring the cycles needed to complete a sequence of consecutive memory-access instructions. The proposed NKGPGPU enriches the theoretical models of GPGPU and gives GPGPU hardware engineers and software programmers a basis for improvement; the experimental methods and ideas developed for GPGPU-Sim can serve as a foundation for further GPGPU research.
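The abstract compares the parameters of the classical parallel models but does not list them. As a rough illustration only, the Python sketch below writes out the textbook parameters of the BSP and LogP models and the standard cost expressions built from them. The class and function names are hypothetical, the numeric values in the usage note are made up, and nothing here reproduces the NKGPGPU performance model, whose parameters are not given in this record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BSPParams:
    g: float  # communication gap: cost per word of an h-relation
    l: float  # cost of one barrier synchronization


@dataclass(frozen=True)
class LogPParams:
    L: float  # network latency of one small message
    o: float  # processor overhead to send or to receive one message
    g: float  # minimum gap between consecutive messages at one processor
    P: int    # number of processors


def bsp_superstep_cost(w: float, h: float, p: BSPParams) -> float:
    """Textbook BSP superstep cost: local work + h-relation + barrier."""
    return w + h * p.g + p.l


def logp_message_time(p: LogPParams) -> float:
    """Textbook LogP time for one small message: send o + latency L + receive o."""
    return p.o + p.L + p.o


def logp_pipelined_send_time(k: int, p: LogPParams) -> float:
    """Time until the last of k back-to-back messages from one sender arrives.

    Successive sends are spaced by max(g, o); the last one then needs a
    full point-to-point message time.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k - 1) * max(p.g, p.o) + logp_message_time(p)
```

With illustrative values, `bsp_superstep_cost(100, 10, BSPParams(g=2, l=50))` gives 170 and `logp_message_time(LogPParams(L=6, o=2, g=4, P=8))` gives 10. PRAM is omitted because it assumes unit-cost shared-memory access and has no communication parameters to tabulate.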
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.41;TP332
