Research on Interconnection Network Architecture and Performance Analysis for Large-Scale Many-Core Microprocessors

Published: 2018-07-16 11:16

【摘要】：基于多核甚至众核设计的高性能处理器，是未来艾级高性能计算机的支撑技术。高带宽、低延迟、低功耗和强扩展性的互连网络对于释放处理器核强大的并行计算能力、提高众核处理器的性能有十分重要的意义。目前，众核系统的设计挑战中，互连通信逐渐成为制约系统性能提升的瓶颈。新兴的3D集成技术和硅基光子器件在芯片功能、集成密度和功耗方面有独特优势。这些新技术、新器件的发展成熟为解决众核系统互连瓶颈带来新的机会。
本文以研究众核系统互连瓶颈为出发点，探索众核微处理器互连网络的创新型体系结构，并利用网络演算理论对众核互连网络进行建模与分析。主要研究内容包括四个方面：
（1）众核系统片上核间互连网络体系结构
核间传输的报文以控制报文为主，对实时性有着极高的要求。随着计算核节点数增多，传输延迟成为限制大规模众核处理器核间互连网络性能的首要因素。以Mesh为代表的简单低维片上网络结构，虽然布线简单，但由于其网络传输跳步数随着系统节点规模呈比例增长，很难满足大规模众核芯片的低延迟传输需求。利用3D集成技术，本文提出了一种三维扁平蝴蝶形网络的拓扑结构，用于大规模众核处理器的核间电报文传输。采用整数线性规划模型，我们克服了蝶形网络中高阶路由器和长互连线的布线挑战，成功地将扁平蝴蝶形网络嵌入到三维叠层中。扁平蝴蝶形拓扑是一种高维拓扑结构，扩展性强，尤其适合大规模计算核节点之间的互连。三维蝶形网络在保证Mesh连通性的同时增加了额外的捷径链路，同时利用高速的垂直互连线，实现了核间报文的快速传递。实验结果表明，三维蝶形网络能够有效的降低核间互连延迟，显著的提升众核处理器性能。
（2）众核微处理器光访存网络体系结构
访存互连对众核处理器至关重要，如果不能快速的存取数据，众核处理器强大的并行计算能力将很难发挥。随着单片上集成的处理器核数越来越多，访存通信带宽需求也急剧增长。传统的基于电IO管脚的“处理器-存储器”互连方案在大规模众核芯片中遇到了挑战，电互连方式很难在满足严格的功耗预算的前提下，为片上众核提供足够大的访存带宽。利用新兴的硅基光电子器件和3D集成技术，我们提出了一种高带宽、低功耗的光访存网络方案，用于众核处理器与DRAM之间的互连通信。这种基于光突发交换协议的访存网络采用光互连接口代替电IO管脚，能够实现众核处理器和存储器的高带宽无缝互连。除了带宽优势外，与以往的光访存网络相比，新方案的波长资源利用率得到了极大的提高，进一步提高了访存通信的功耗效率。实验结果表明，基于光突发交换协议的访存网络的功耗效率比光线路交换的访存网络提高了近2倍，比电接口方案提高了6倍。
（3）芯片尺度光网络中的电控制层拥塞避免方案
由于光缓存、光逻辑器件缺失，光电混合网络大都采用电控制层，负责资源仲裁、链路控制。在芯片尺度光突发交换网络研究中，我们发现，大量的细粒度光突发报文、严格的传输延迟限制和中等的网络工作频率限制了光网络的电控制层处理能力，极易导致严重的网络拥塞。因而，我们提出了一套流量整形方案，解决电控制层网络拥塞问题。在注入网络前，系统中所有报文流首先进行全局协调和整形，确保中间任何节点上的控制报文聚合流速率不会超过其最大处理能力，以达到减轻控制层拥塞的目的。我们采用优化算法，选取报文流整形器的整形参数（比如，报文流速度和报文突发性参数）。这种拥塞控制方案在一定程度上，为各个报文流的端到端传输进行资源预约，在带宽方面提供基本的服务质量保证，可以有效的缓解由控制层拥塞引起的光突发报文丢失现象。基于合成流量和真实运用轨迹的实验表明，这种新方法能有效避免控制层拥塞，降低报文丢失率，提高芯片尺度光突发交换网络的系统性能。
（4）芯片尺度光互连网络性能分析
芯片尺度光互连网络的设计需要平衡多方面的因素，包括网络延迟、吞吐量、能耗和硅片面积占用。这些系统级互连参数的选择直接影响整个芯片的性能，因而进行片上网络的性能分析，对系统的设计具有重要意义。为此，我们开展了芯片尺度光网络的解析建模工作。利用随机网络演算理论，我们建立了光突发交换网络的存储资源需求模型，以及光器件的波长资源需求估算模型。仿真实验与数值分析的结果表明，这些解析模型计算得到的边界相当紧致。利用这些随机网络演算分析模型，我们可以快速评估众核系统光互连网络的系统级设计参数，比如存储器资源需求、传输延迟、光器件资源需求等。在设计初期，建模分析网络的性能，还可以提前降低设计风险。总的说来，我们的解析模型刻画了系统性能与网络负载、体系结构之间的关系，有助于迅速找出影响性能的关键因素和设计瓶颈，促进设计空间收敛。
综上所述，本文研究了众核系统的互连瓶颈问题，提出了新的网络体系结构，并基于网络演算理论，对该体系结构进行了解析建模和性能分析。本文理论与实际结合紧密，为众核处理器互连瓶颈问题提供了新的解决方案，对推动高性能处理器技术发展做出了积极的贡献，并进一步扩展了网络演算理论的运用领域。
[Abstract]: High-performance processors based on multi-core and even many-core designs are an enabling technology for future exascale high-performance computers. An interconnection network with high bandwidth, low latency, low power consumption, and strong scalability is essential for unleashing the parallel computing power of the processor cores and improving many-core processor performance. Among the design challenges that many-core systems face today, interconnect communication is becoming the bottleneck that limits system performance. Emerging 3D integration technology and silicon photonic devices offer unique advantages in chip functionality, integration density, and power consumption. The maturing of these technologies and devices brings new opportunities for solving the interconnect bottleneck of many-core systems.
Taking the interconnect bottleneck of many-core systems as its starting point, this thesis explores innovative architectures for many-core microprocessor interconnection networks and uses network calculus theory to model and analyze them. The main research contents cover four aspects:
(1) Inter-core on-chip interconnection network architecture for many-core systems
Messages transmitted between cores are mostly control messages, which have very stringent real-time requirements. As the number of compute core nodes grows, transmission latency becomes the primary factor limiting the performance of inter-core interconnection networks in large-scale many-core processors. Simple low-dimensional on-chip network structures, typified by the Mesh, are easy to wire, but because their hop count grows in proportion to the scale of the system, they struggle to meet the low-latency requirements of large-scale many-core chips. Using 3D integration technology, this thesis proposes a three-dimensional flattened butterfly network topology for transmitting electrical inter-core messages in large-scale many-core processors. With an integer linear programming model, we overcome the wiring challenges posed by the high-radix routers and long interconnect wires of the butterfly network and successfully embed the flattened butterfly network in a 3D stack. The flattened butterfly is a high-dimensional topology with strong scalability, particularly suited to interconnecting large numbers of compute core nodes. The 3D butterfly network preserves Mesh connectivity while adding extra shortcut links, and it exploits high-speed vertical interconnects to deliver inter-core messages quickly. Experimental results show that the 3D butterfly network effectively reduces inter-core interconnect latency and significantly improves many-core processor performance.
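To make the hop-count argument concrete, the following Python sketch compares the average minimal hop count of a k x k Mesh with that of a two-dimensional flat (flattened) butterfly of the same size. It is an illustration only: it assumes the standard flattened-butterfly definition, in which every router links directly to all routers in its row and in its column, and it does not model the thesis's three-dimensional embedding, its integer-linear-programming layout, or wire delay. The function names `mesh_avg_hops` and `flat_butterfly_avg_hops` are hypothetical.

```python
from itertools import product


def mesh_avg_hops(k):
    """Average minimal hop count between distinct nodes of a k x k Mesh.

    In a Mesh a message moves one grid step per hop, so the minimal hop
    count between two nodes is their Manhattan distance.
    """
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes for (bx, by) in nodes)
    return total / (len(nodes) * (len(nodes) - 1))


def flat_butterfly_avg_hops(k):
    """Average minimal hop count in a k x k 2D flattened butterfly.

    Assumption: each router has a direct link to every router in its row
    and in its column, so at most one hop is needed per dimension.
    """
    nodes = list(product(range(k), repeat=2))
    total = sum((ax != bx) + (ay != by)
                for (ax, ay) in nodes for (bx, by) in nodes)
    return total / (len(nodes) * (len(nodes) - 1))


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(k, round(mesh_avg_hops(k), 2), round(flat_butterfly_avg_hops(k), 2))
```

Under these assumptions the flat butterfly never needs more than two hops regardless of k, while the Mesh average keeps growing with k; the price is higher-radix routers and longer wires, which is the wiring challenge the paragraph above refers to.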
(2) Optical memory-access network architecture for many-core microprocessors
Memory-access interconnection is critical for many-core processors: if data cannot be read and written quickly, their parallel computing power is difficult to exploit. As more and more processor cores are integrated on a single chip, the demand for memory-access bandwidth also grows sharply. The traditional "processor-memory" interconnection scheme based on electrical I/O pins runs into difficulty in large-scale many-core chips, because electrical interconnects can hardly provide sufficient memory bandwidth for the on-chip cores within a strict power budget. Using emerging silicon optoelectronic devices and 3D integration technology, we propose a high-bandwidth, low-power optical memory-access network for communication between many-core processors and DRAM. This memory-access network, based on an optical burst switching protocol, replaces electrical I/O pins with optical interconnect interfaces and achieves high-bandwidth, seamless interconnection between many-core processors and memory. Beyond the bandwidth advantage, the new scheme greatly improves wavelength resource utilization compared with previous optical memory-access networks, which further improves the power efficiency of memory-access communication. Experimental results show that the power efficiency of the memory-access network based on optical burst switching is nearly 2 times higher than that of a memory-access network based on optical circuit switching, and 6 times higher than that of the electrical interface scheme.
(3) Congestion avoidance scheme for the electrical control layer in chip-scale optical networks
Because optical buffers and optical logic devices are lacking, most hybrid optical-electrical networks use an electrical control layer for resource arbitration and link control. In our study of chip-scale optical burst switching networks, we found that the large number of fine-grained optical burst messages, strict transmission latency constraints, and moderate network operating frequency limit the processing capacity of the optical network's electrical control layer and can easily cause severe network congestion. We therefore propose a traffic shaping scheme to solve congestion in the electrical control layer. Before being injected into the network, all message flows in the system are first globally coordinated and shaped, ensuring that the aggregate rate of control messages at any intermediate node does not exceed that node's maximum processing capacity, thereby relieving control-layer congestion. We use an optimization algorithm to select the shaping parameters of the flow shapers (for example, the flow rate and burstiness parameters). To some extent, this congestion control scheme reserves resources for the end-to-end transmission of each flow and provides a basic quality-of-service guarantee in terms of bandwidth, which effectively alleviates the loss of optical burst messages caused by control-layer congestion. Experiments based on synthetic traffic and real application traces show that the new method effectively avoids control-layer congestion, reduces the message loss rate, and improves the system performance of chip-scale optical burst switching networks.
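The shaping idea in this paragraph can be sketched in Python as two small pieces: a per-flow shaper parameterized by a rate and a burst size, and a check that the aggregate shaped rate at each control node stays within that node's processing capacity. This is a minimal sketch under stated assumptions, not the thesis's scheme: the abstract does not name the shaper type, so a token bucket is assumed here, and the class `TokenBucket`, the function `overloaded_nodes`, and the flow and capacity data layout are all hypothetical. The optimization step that selects the shaping parameters is not shown.

```python
from collections import defaultdict


class TokenBucket:
    """Token-bucket shaper: long-run rate `rate`, burst allowance `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate        # tokens (messages) added per time unit
        self.burst = burst      # maximum number of stored tokens
        self.tokens = burst     # the bucket starts full
        self.last = 0.0         # time of the previous update

    def try_send(self, now, size=1.0):
        """Return True if a message of `size` tokens may be injected at `now`."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False


def overloaded_nodes(flows, capacity):
    """List nodes whose aggregate shaped rate exceeds their capacity.

    `flows` is a list of (rate, path) pairs, where `path` is the sequence
    of control nodes the flow traverses; `capacity` maps node -> max rate.
    """
    load = defaultdict(float)
    for rate, path in flows:
        for node in path:
            load[node] += rate
    return sorted(node for node, r in load.items() if r > capacity[node])


if __name__ == "__main__":
    flows = [(0.4, ["a", "b"]), (0.7, ["b", "c"])]
    print(overloaded_nodes(flows, {"a": 1.0, "b": 1.0, "c": 1.0}))
```

In this toy example node `b` carries both flows and is reported as overloaded, so a global coordinator would have to lower at least one of the two flow rates before injection.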
(4) Performance analysis of chip-scale optical interconnection networks
Designing a chip-scale optical interconnection network requires balancing many factors, including network latency, throughput, energy consumption, and silicon area. The choice of these system-level interconnect parameters directly affects the performance of the whole chip, so performance analysis of the on-chip network is important to system design. To this end, we carried out analytical modeling of chip-scale optical networks. Using stochastic network calculus, we established a model of the storage resource requirements of optical burst switching networks and a model for estimating the wavelength resource requirements of optical devices. Simulation experiments and numerical analysis show that the bounds computed by these analytical models are quite tight. With these stochastic network calculus models, we can quickly evaluate system-level design parameters of optical interconnection networks for many-core systems, such as memory resource requirements, transmission latency, and optical device resource requirements. Modeling and analyzing network performance early in the design also reduces design risk ahead of time. Overall, our analytical models capture the relationship between system performance and network load and architecture, which helps to quickly identify the key factors and design bottlenecks that affect performance and speeds up the convergence of the design space.
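As a rough illustration of the kind of bound such models produce, the Python sketch below evaluates the textbook deterministic network calculus result for a token-bucket arrival curve alpha(t) = sigma + rho*t served by a rate-latency server beta(t) = R*max(0, t - T): the delay is at most T + sigma/R and the backlog (buffer requirement) is at most sigma + rho*T, provided rho <= R. This is deliberately a simplification: the thesis uses stochastic network calculus, whose bounds hold with a violation probability rather than absolutely, and its actual models are not reproduced here. The function name `deterministic_bounds` and all parameter names are hypothetical.

```python
def deterministic_bounds(sigma, rho, R, T):
    """Worst-case delay and backlog bounds from deterministic network calculus.

    sigma: burst size of the arrival curve (e.g. messages)
    rho:   sustained arrival rate (messages per time unit)
    R:     service rate of the node (messages per time unit)
    T:     service latency of the node (time units)
    """
    if rho > R:
        # The queue grows without bound, so no finite bound exists.
        raise ValueError("arrival rate exceeds service rate")
    delay_bound = T + sigma / R       # horizontal deviation between the curves
    backlog_bound = sigma + rho * T   # vertical deviation between the curves
    return delay_bound, backlog_bound


if __name__ == "__main__":
    delay, backlog = deterministic_bounds(sigma=4.0, rho=0.5, R=2.0, T=3.0)
    print(f"delay <= {delay}, backlog <= {backlog}")
```

A bound of this form shows directly how buffer demand scales with burstiness and node latency, which is why such models are useful for sizing resources before a detailed simulation exists.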
In summary, this thesis studies the interconnect bottleneck of many-core systems, proposes new network architectures, and carries out analytical modeling and performance analysis of these architectures based on network calculus theory. The work closely combines theory with practice, provides new solutions to the interconnect bottleneck of many-core processors, contributes to the development of high-performance processor technology, and further extends the application domain of network calculus theory.
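[Degree-granting institution]: National University of Defense Technology (国防科学技术大学)
[Degree level]: Doctoral
[Year degree granted]: 2012
[Classification number]: TP332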

[References]