GTC 2015 – University of Tokyo / University of Tsukuba Tightly Coupled Accelerator “System Configuration”


The TCA of the University of Tsukuba and the University of Tokyo was presented at GTC 2014, but at that time the system had only just started running, and the presentation contained little measured data. A year later, this year's presentation includes a wealth of measurements, so this report first delivers the "System Configuration" part, which also serves as a review, to be followed by a "System Performance" part summarizing the measured results.

Associate Professor Toshihiro Hanawa of the University of Tokyo giving a presentation on TCA

At GTC, the presenter was Associate Professor Toshihiro Hanawa of the University of Tokyo, the same as last year. HA-PACS/TCA was originally a University of Tsukuba project, but since Associate Professor Hanawa, one of its main members, has moved to the University of Tokyo, the logos of both universities now appear on the slides.

When processing is shared across multiple GPUs, data must be exchanged between the GPUs at the boundaries where the work is divided. Normally this requires a DMA transfer over PCI Express from the sending GPU into the CPU's memory, followed by another DMA transfer over PCI Express to the receiving GPU. However, there is no inherent need to stage the data in CPU memory; efficiency should improve if the data can be sent directly from GPU to GPU.
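To make the difference concrete, here is a minimal CUDA sketch for two GPUs in a single node (not PEACH2 itself) that contrasts staging a buffer through host memory with a direct peer-to-peer copy; the buffer size and device numbers are arbitrary assumptions, and error checking is omitted.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 1 << 20;      // 1 MiB transfer (arbitrary size)
    float *src, *dst, *host;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);           // buffer on GPU 0 (sender)
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);           // buffer on GPU 1 (receiver)
    cudaMallocHost(&host, bytes);      // pinned staging buffer in CPU memory

    // Path 1: staged through CPU memory (two PCI Express DMA transfers)
    cudaMemcpy(host, src, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(dst, host, bytes, cudaMemcpyHostToDevice);

    // Path 2: direct GPU-to-GPU copy, skipping the CPU-memory staging step
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) {
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
    }

    cudaFreeHost(host);
    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    printf("done\n");
    return 0;
}
```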

The University of Tsukuba’s “PEACH” was developed based on this idea, and “PEACH2” is its second generation. As shown in the following figure, PEACH2 uses PCI Express 2.0 as its data transmission path, and PEACH2 chips can communicate directly with one another.

PCI Express normally operates with the CPU as the master (root) node, controlling the other I/O nodes and transferring data to and from itself. PEACH2, however, treats PCI Express simply as an address-based communication path, so any node can communicate with any other node. Because PEACH2 implements its own control logic, two GPUs can communicate without any CPU intervention.

PEACH2 has four PCI Express 2.0 x8 ports: one port connects to the CPU and the remaining three ports connect to other PEACH2 chips. A Gen2 x8 PCI Express link has a bandwidth of 40Gbps, the same as QDR InfiniBand x4, but because its protocol is simpler, the latency is lower.
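As a quick check of that figure: PCI Express 2.0 signals at 5GT/s per lane, so an x8 link carries 5GT/s × 8 lanes = 40Gbps of raw bandwidth, and QDR InfiniBand signals at 10Gbps per lane, so a 4x link likewise carries 10Gbps × 4 = 40Gbps.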

PEACH2 uses PCI Express as a transmission line to directly DMA transfer data between GPUs.

PEACH2 is implemented with an Altera “Stratix IV” FPGA and is packaged as a PCI Express card called the PEACH2 board, as shown in the following picture.

The functions of PEACH2 are implemented in the FPGA, and the board provides PCIe x8 and x16 cable connectors.

This board is mounted on the server motherboard as a mezzanine card.

The PEACH2 board is mounted flat on the motherboard as a mezzanine card. The K20X GPUs are visible in the back.

“HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)” is a major supercomputer of the Center for Computational Sciences, University of Tsukuba, and is used jointly by researchers inside and outside the university. HA-PACS/TCA, on the other hand, is a system built under JST’s AC-CREST project for researching direct connections between accelerators and for developing the software, parallel languages, and compilers that exploit them.

The specifications of HA-PACS Base Cluster and TCA are shown in the following table.

HA-PACS Base Cluster and TCA specifications. Base Cluster has 802TFlops in 26 chassis and TCA has 364TFlops in 10 chassis, which exceeds 1PFlops when both are combined.

The TCA and Base Cluster are installed next to each other so that they can be operated as one.

The HA-PACS system. The racks in front are the TCA, and the racks behind are the Base Cluster.

The configuration of the TCA nodes is as shown in the following figure.

In a TCA node, two K20X GPUs are connected to each of the two Ivy Bridge CPUs, and the PEACH2 board is connected to one of the Ivy Bridge CPUs.

The connections between the TCA compute nodes are shown in the figure below, forming groups of 16 nodes. A 16-node subcluster contains 64 K20X GPUs, but in each node only the two GPUs attached to the CPU that hosts the PEACH2 board can communicate via PEACH2. This is because PCIe transfers that cross QPI perform too poorly to be usable; it is not a limitation of PEACH2 itself.

There are two 8-node ring connections, and the two rings are joined by the eight links drawn with short red arrows to form a 16-node subcluster.
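To make the topology concrete, the following sketch computes a node's three PEACH2 neighbors under an assumed numbering (nodes 0-7 form one ring, nodes 8-15 the other, and each node is cross-linked to the node at the same position in the other ring); this numbering is an assumption for illustration, not the actual cabling, but it is consistent with each PEACH2 having three ports toward other PEACH2 chips.

```cpp
#include <cstdio>

// Hypothetical numbering of one 16-node subcluster: two rings of 8 nodes
// plus one cross link per node to the other ring.
void peach2_neighbors(int node, int out[3]) {
    int ring = node / 8;                    // which of the two rings
    int pos  = node % 8;                    // position within the ring
    out[0] = ring * 8 + (pos + 1) % 8;      // next node around the ring
    out[1] = ring * 8 + (pos + 7) % 8;      // previous node around the ring
    out[2] = (1 - ring) * 8 + pos;          // cross link to the other ring
}

int main() {
    int n[3];
    peach2_neighbors(5, n);
    printf("node 5 links to %d, %d, %d\n", n[0], n[1], n[2]);
    return 0;
}
```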

The entire TCA system consists of 64 compute nodes, which are connected by QDR InfiniBand. This InfiniBand network is therefore used for communication that crosses subcluster boundaries.

TCA supports DMA and PIO, with a minimum latency of under 2μs for DMA and under 1μs for PIO. PEACH2 has a 4-channel DMA engine; each channel supports descriptor chaining, which executes multiple transfers consecutively without CPU intervention, and block-stride transfer, which can transfer two-dimensional matrices.

TCA supports DMA and PIO. The DMA engine has 4 channels and supports descriptor chaining and block-stride transfer.
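As an illustration of what descriptor chaining and block-stride transfer mean, here is a hypothetical sketch of a chained DMA descriptor; the field names and layout are invented for explanation and do not represent PEACH2's actual register format.

```cpp
#include <cstdint>

// Hypothetical DMA descriptor: field names and layout are illustrative only.
struct DmaDescriptor {
    uint64_t src_addr;        // source PCIe address (e.g. GPU memory)
    uint64_t dst_addr;        // destination PCIe address on another node
    uint32_t block_size;      // bytes per contiguous block (one matrix row)
    uint32_t block_count;     // number of blocks (number of rows)
    uint32_t src_stride;      // bytes between consecutive source blocks
    uint32_t dst_stride;      // bytes between consecutive destination blocks
    DmaDescriptor *next;      // next descriptor in the chain, or nullptr to stop
};

// Conceptually, the DMA engine walks the chain without CPU intervention:
// it performs each block-stride transfer (a 2-D submatrix copy), then
// follows 'next' until the chain ends.
inline void emulate_dma_chain(const DmaDescriptor *d) {
    for (; d != nullptr; d = d->next) {
        for (uint32_t b = 0; b < d->block_count; ++b) {
            // copy block_size bytes from (src_addr + b * src_stride)
            // to (dst_addr + b * dst_stride)
        }
    }
}
```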
