Kepler GPU configuration

Balance between dispatcher instruction supply and execution units

Returning to Figure 3-36, the Kepler GPU is configured so that four warp schedulers can issue up to eight instructions in one cycle. Below them is a register file with 65,536 32-bit entries in total. Instructions from the dispatchers and operands from the register file are sent to the group of arithmetic units, which carry out the operations.

Figure 3-36 (repost) SM block diagram of NVIDIA Kepler GPU

In Figure 3-36, the single-precision arithmetic units labeled Core are arranged in vertical columns of 16, and a group of 32 arithmetic units in two columns processes one warp.

The small green box labeled Core is a unit that performs 32-bit single-precision floating-point multiply-accumulate operations. It also handles 32-bit integer operations and logical operations. The slightly larger orange box labeled DP Unit performs 64-bit double-precision floating-point multiply-accumulate operations.

The SFU (Special Function Unit) calculates functions that are often used in graphics calculations, such as trigonometric functions, reciprocals, reciprocal square roots, logarithms, and powers of 10. A transcendental function is one that cannot be expressed by a polynomial, but in practice the SFU divides the input range into intervals and approximates the function with a polynomial on each one. Although the result is an approximation, within 32-bit single-precision floating point the error, even at its largest, amounts to only a few of the least significant bits, which is usually not a problem.
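The idea of dividing the interval and approximating with a polynomial can be sketched as follows. This is only an illustration and not the actual SFU design: the choice of the reciprocal function, the range [1, 2), the 64 sub-intervals, the quadratic degree, and the names `build_table` and `approx` are all assumptions made for the example.

```python
def build_table(f, lo, hi, segments):
    """Fit one quadratic per sub-interval of [lo, hi] (Newton form)."""
    width = (hi - lo) / segments
    table = []
    for i in range(segments):
        a = lo + i * width          # left end of the sub-interval
        m = a + width / 2           # midpoint
        b = a + width               # right end
        fa, fm, fb = f(a), f(m), f(b)
        d1 = (fm - fa) / (m - a)                    # first divided difference
        d2 = ((fb - fm) / (b - m) - d1) / (b - a)   # second divided difference
        table.append((a, m, fa, d1, d2))
    return table, width


def approx(table, width, lo, x):
    """Look up the sub-interval containing x and evaluate its quadratic."""
    i = min(int((x - lo) / width), len(table) - 1)
    a, m, fa, d1, d2 = table[i]
    return fa + (x - a) * (d1 + (x - m) * d2)


# Assumed example: approximate 1/x on [1, 2) with 64 sub-intervals.
LO, HI, SEGMENTS = 1.0, 2.0, 64
TABLE, WIDTH = build_table(lambda x: 1.0 / x, LO, HI, SEGMENTS)

# Worst absolute error over a dense sample, also expressed in float32 LSBs.
# Results below 1.0 have an LSB weight of 2**-24 in single precision.
SAMPLES = [LO + k * (HI - LO) / 10000 for k in range(10000)]
MAX_ERR = max(abs(approx(TABLE, WIDTH, LO, x) - 1.0 / x) for x in SAMPLES)
MAX_ERR_LSB = MAX_ERR / 2.0 ** -24

print(f"max error = {MAX_ERR:.3e} (about {MAX_ERR_LSB:.1f} float32 LSBs)")
```

With these assumed parameters the worst-case error stays within a few single-precision LSBs. Using more sub-intervals shrinks the error further at the cost of a larger coefficient table.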

The Ld / St unit is a load / store unit that processes load / store instructions.

On the GK110 chip, as shown in Figure 3-36, each SM has enough single-precision arithmetic units for 6 instructions, enough double-precision arithmetic units for 2 instructions, and enough SFUs and Ld / St units for 2 instructions (because these take 4 cycles to execute). The units are dynamically assigned to the instructions that are issued. This requires a crossbar to switch the connections between the register file and the execution units, but it is less wasteful than a structure in which every type of arithmetic unit is provided for each instruction dispatcher.

There are enough single-precision floating-point arithmetic units for 6 instructions, so an instruction issue bandwidth of 6 instructions/cycle is sufficient to keep them fully busy. The load / store units and SFUs come in sets of 16 each and, as noted earlier, execute an instruction over 4 cycles, so each type can process 4 instructions in parallel. Even so, an issue bandwidth of 1 instruction/cycle for each type is all that is needed.

In other words, if the four warp schedulers can issue eight instructions every cycle (6 + 1 + 1), they can keep supplying instructions to the 192 single-precision arithmetic units, 32 load / store units, and 32 SFUs without interruption.
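The balance described above can be checked with a little arithmetic. The sketch below uses only the figures quoted in the text; the variable names are made up for the example, and the even split of issue slots across schedulers is an assumption.

```python
WARP_SIZE = 32            # threads per warp, one arithmetic unit per thread
SP_UNITS = 192            # single-precision units per SM
WARP_SCHEDULERS = 4       # warp schedulers per SM
MAX_ISSUE_PER_CYCLE = 8   # instructions the schedulers can issue per cycle

# Each single-precision instruction occupies 32 units, so 192 units
# absorb 192 / 32 = 6 instructions per cycle.
SP_ISSUE = SP_UNITS // WARP_SIZE

# Per the text, load/store and SFU instructions each need
# 1 instruction/cycle of issue bandwidth.
LDST_ISSUE = 1
SFU_ISSUE = 1

REQUIRED_ISSUE = SP_ISSUE + LDST_ISSUE + SFU_ISSUE

# Assumption: the 8 issue slots are split evenly across the 4 schedulers.
DISPATCH_PER_SCHEDULER = MAX_ISSUE_PER_CYCLE // WARP_SCHEDULERS

print(f"required issue bandwidth: {REQUIRED_ISSUE} instructions/cycle")
print(f"available issue bandwidth: {MAX_ISSUE_PER_CYCLE} instructions/cycle")
print(f"instructions per scheduler per cycle: {DISPATCH_PER_SCHEDULER}")
```

The required and available bandwidths match, which is the balance the section title refers to.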

The DP Unit, which performs 64-bit double-precision arithmetic, is a special case. It requires 64-bit operands, but a register file port is only 32 bits wide. A 64-bit value is therefore stored across two consecutive registers, such as registers 0 and 1. The double-precision arithmetic unit appears to use two register ports for each operand in order to read and write such a register pair. Because the register ports are used twice over in this way, a double-precision arithmetic instruction cannot be issued together with another instruction through the two dispatchers.
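How a 64-bit value can live in a pair of 32-bit registers can be illustrated in software. This is a minimal sketch, not a description of the hardware: the helper names `write_double` and `read_double` are hypothetical, and placing the low word in the even-numbered register is an assumption made for the example.

```python
import struct


def write_double(regs, even_index, value):
    """Split a 64-bit double into two 32-bit words and store them as a pair."""
    low, high = struct.unpack("<II", struct.pack("<d", value))
    regs[even_index] = low        # assumed: low word in the even register
    regs[even_index + 1] = high   # assumed: high word in the odd register


def read_double(regs, even_index):
    """Read both halves of the pair and reassemble the 64-bit double."""
    raw = struct.pack("<II", regs[even_index], regs[even_index + 1])
    return struct.unpack("<d", raw)[0]


# A tiny register file of 32-bit entries; registers 0 and 1 hold one double.
REGS = [0] * 8
write_double(REGS, 0, 3.141592653589793)
VALUE = read_double(REGS, 0)

print([hex(r) for r in REGS[:2]], VALUE)
```

Each operand access touches two 32-bit entries, which is why a double-precision instruction consumes twice the register port bandwidth of a single-precision one.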

The register file is thought to have 16 port groups, matching the groups of 16 single-precision arithmetic units, Ld / St units, and SFUs. Since an SM has 65,536 entries in total, the register file may be configured as 16 arrays of 4096 entries x 32 bits. However, to support multiply-accumulate operations, each array needs three read ports and one write port, so it is not a simple single-port SRAM array.
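The register file geometry above follows from simple division. The sketch below works through it with the numbers quoted in the text; the variable names are made up for the example, and the 16-array organization is the text's conjecture rather than a confirmed fact.

```python
TOTAL_ENTRIES = 65536     # 32-bit register entries per SM
BITS_PER_ENTRY = 32
ARRAYS = 16               # conjectured number of arrays (one per port group)

# Entries held by each array if the file is split evenly.
ENTRIES_PER_ARRAY = TOTAL_ENTRIES // ARRAYS

# Total register file capacity per SM in KiB.
TOTAL_KIB = TOTAL_ENTRIES * BITS_PER_ENTRY // 8 // 1024

# A multiply-accumulate d = a * b + c reads three operands and writes one,
# which gives the port count each array has to support.
READ_PORTS = len(("a", "b", "c"))
WRITE_PORTS = len(("d",))

print(f"{ARRAYS} arrays x {ENTRIES_PER_ARRAY} entries x {BITS_PER_ENTRY} bits")
print(f"capacity: {TOTAL_KIB} KiB per SM")
print(f"ports per array: {READ_PORTS} read, {WRITE_PORTS} write")
```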