Read the block diagram and processing flow of Mali-T880

Mali-T880 GPU block diagram and processing flow

The Mali GPU connects all of its units to a common network. The network is scalable: its bandwidth increases as the number of shader cores increases.

The L2 cache can consist of up to two slices of up to 512KB. The number of slices actually fitted and the capacity of the slices depend on the requirements of the product.

In the Mali-T880, all units are connected to a common network. The network bandwidth is designed to scale with the number of cores.

The T880’s tiler is a fixed-function unit that builds the list of polygons that fall within each tile, and it supports the hierarchical tiling described above. The tiler can process one triangle per cycle.

The tiler builds a list of the tiles that each triangle output by the geometry shader falls into. Throughput is 1 triangle per cycle.
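The tiling step can be illustrated with a minimal Python sketch. It is not ARM's implementation: it uses a single flat grid rather than hierarchical tiling, it bins by bounding box rather than exact coverage, and the 16-pixel tile size and the function name `bin_triangles` are assumptions made for illustration.

```python
import math

TILE_SIZE = 16  # assumed tile edge in pixels, for illustration only


def bin_triangles(triangles, width, height, tile=TILE_SIZE):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box touches it."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Convert the bounding box to tile coordinates and clamp it to the screen.
        tx0 = max(math.floor(min(xs)) // tile, 0)
        tx1 = min(math.floor(max(xs)) // tile, tiles_x - 1)
        ty0 = max(math.floor(min(ys)) // tile, 0)
        ty1 = min(math.floor(max(ys)) // tile, tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


triangles = [
    [(1, 1), (5, 1), (1, 5)],        # fits inside a single tile
    [(10, 10), (40, 10), (10, 20)],  # spans several tiles
]
bins = bin_triangles(triangles, width=64, height=32)
```

Bounding-box binning is conservative: a triangle may be listed for a tile it does not actually cover, which costs some wasted work later but never drops a visible triangle.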

Each shader core has two μTLBs that perform address translation. On a μTLB miss, a signal is sent to the MMU, which walks the page table to find the required entry and stores it in the μTLB.

Each shader core has two μTLBs. The Mali-T880 is equipped with an MMU that supports the 64-bit address space of ARMv8. On a μTLB miss, a table walk is performed in hardware to read the required page table entry and supply it to the μTLB.
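The miss-and-refill behavior can be modeled with a small Python sketch. Everything here is an assumption chosen for illustration: the 4KB page size, the four-entry capacity, the FIFO replacement policy, and the flat dictionary standing in for the multi-level page tables that the hardware actually walks.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class MicroTLB:
    """Toy model of a small TLB that is refilled from a page table on a miss."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # stand-in for the page tables walked by the MMU
        self.capacity = capacity      # assumed number of entries
        self.entries = {}             # virtual page -> physical page, in insertion order
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self.entries:
            # Miss: the "MMU" looks up the entry and supplies it to the TLB.
            self.misses += 1
            ppage = self.page_table[vpage]
            if len(self.entries) >= self.capacity:
                # Evict the oldest entry (FIFO is an assumption of this sketch).
                self.entries.pop(next(iter(self.entries)))
            self.entries[vpage] = ppage
        return self.entries[vpage] * PAGE_SIZE + offset


tlb = MicroTLB({0: 7, 1: 3})
first = tlb.translate(0x10)   # miss, entry for page 0 is loaded
second = tlb.translate(0x20)  # hit, same page
third = tlb.translate(PAGE_SIZE + 5)  # miss, entry for page 1 is loaded
```

Only the first access to each page pays for the table walk; later accesses to the same page are served from the μTLB.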

The job manager is a fixed function block that reads the job description from memory and distributes the job to multiple shader cores, taking into account the dependencies between the jobs.

The job manager reads the jobs supplied by the driver from memory and distributes the work to the shader cores. In doing so, it checks the dependencies between jobs.
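Dependency-aware job distribution can be sketched in Python as follows. This is a simplified model, not the hardware's algorithm: the job names, the round-robin core assignment, and the assumption that a job is finished as soon as it is issued are all hypothetical.

```python
from collections import deque


def dispatch_jobs(jobs, num_cores):
    """jobs maps a job name to the set of jobs it depends on.

    Returns a list of (job, core) pairs in issue order.
    """
    remaining = {name: set(deps) for name, deps in jobs.items()}
    dependents = {name: [] for name in jobs}
    for name, deps in jobs.items():
        for dep in deps:
            dependents[dep].append(name)

    # Jobs with no outstanding dependencies can be issued immediately.
    ready = deque(sorted(name for name, deps in remaining.items() if not deps))
    schedule = []
    next_core = 0
    while ready:
        job = ready.popleft()
        schedule.append((job, next_core))
        next_core = (next_core + 1) % num_cores  # round-robin, an assumption
        # Simplification: treat the job as complete once issued.
        for child in dependents[job]:
            remaining[child].discard(job)
            if not remaining[child]:
                ready.append(child)
    if len(schedule) != len(jobs):
        raise ValueError("dependency cycle detected")
    return schedule


jobs = {"vertex": set(), "tiling": {"vertex"}, "fragment": {"tiling"}}
schedule = dispatch_jobs(jobs, num_cores=2)
```

A job is issued only after every job it depends on has been issued, so in this example fragment work never starts before tiling.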

Shader core block diagram and processing flow

The shader core has a structure that ARM calls “Tri-Pipe”, with three types of pipelines: ALU for arithmetic, load/store for memory access, and texture processing. Each ALU contains a 128-bit wide SIMD unit that can perform four operations in parallel on 32-bit values, plus one scalar unit, for a total of five 32-bit arithmetic units per ALU.

Since each Mali-T880 shader core has three ALUs, a maximum configuration of 16 shader cores provides 240 arithmetic units per chip (16 cores × 3 ALUs × 5 units). NVIDIA’s Tegra X1 has 256 arithmetic units, so a full 16-core Mali-T880 has roughly comparable computing power to the Tegra X1.
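The unit count follows directly from the figures in the text, as this short Python calculation shows. The constant and function names are chosen here for illustration.

```python
SIMD_LANES_32BIT = 128 // 32  # four 32-bit lanes in a 128-bit SIMD unit
SCALAR_UNITS = 1              # one scalar unit per ALU
UNITS_PER_ALU = SIMD_LANES_32BIT + SCALAR_UNITS
ALUS_PER_CORE = 3             # Mali-T880 shader core
MAX_CORES = 16                # largest configuration described in the text


def arithmetic_units(cores=MAX_CORES):
    """Total 32-bit arithmetic units for a given number of shader cores."""
    return cores * ALUS_PER_CORE * UNITS_PER_ALU


total = arithmetic_units()      # 16 * 3 * 5
quad_core = arithmetic_units(4)  # a smaller, hypothetical configuration
```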

The shader core uses an architecture called Tri-Pipe, which has three pipelines for arithmetic, load/store, and texturing. The ALU unit includes a 128-bit SIMD vector arithmetic unit and a scalar arithmetic unit. The shader core of the T880 is equipped with three ALU units to improve computing performance.

Threads that perform vertex shading are created in a block called the “Compute Thread Creator”, and their instructions are sent to the execution units of the Tri-Pipe. The execution units read the vertex coordinates and attributes from memory, perform the coordinate transformation, and write the result to memory via the L1 cache. This L1 cache is coherent across all shader cores and supports atomic access.

Threads that perform vertex shading are created by the Compute Thread Creator and executed by the Tri-Pipe. The processing result is written to the tile data area.

The fragment shader processes and draws the list of primitives (triangles in this case) created by the tiler. The block labeled “Triangle setup” reads the data of the triangle to be drawn from memory and sends it to the Rasterizer, which converts it to pixels. Then a block called Early-Z determines whether each pixel is visible or is hidden behind a pixel that has already been drawn. If the pixel is visible, a drawing thread is created and executed by the Tri-Pipe.

The Tri-Pipe uses information such as texture mapping, the light sources, and the orientation of the surface to calculate pixel colors, and writes the results to a tile buffer. When all the drawing data for one tile has been processed, the contents of the tile buffer are written to memory.

Fragment processing takes the tiler’s output as input, decomposes each triangle into pixels, uses the Z-buffer to discard hidden pixels, and creates a drawing thread for each visible pixel. The drawing thread calculates the color of the pixel by applying textures and calculating the reflection of light. When all the data in one tile has been processed, the data in the tile buffer is written to memory.
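The Early-Z step and the tile buffer can be sketched in Python as follows. This is a rough model under stated assumptions: fragments arrive already rasterized as `(x, y, depth, color)` tuples, a smaller depth value means closer to the viewer, the tile is 16 by 16 pixels, and shading is reduced to copying the color. The function name `process_tile` is hypothetical.

```python
def process_tile(fragments, tile=16):
    """Depth-test fragments for one tile and return (color buffer, threads created)."""
    depth = [[float("inf")] * tile for _ in range(tile)]  # per-pixel Z-buffer
    color = [[None] * tile for _ in range(tile)]          # tile buffer
    threads = 0
    for x, y, z, c in fragments:
        if z >= depth[y][x]:
            continue  # Early-Z: hidden behind an already drawn pixel, no thread created
        depth[y][x] = z
        threads += 1      # a drawing thread would be created here
        color[y][x] = c   # stand-in for texturing and lighting
    return color, threads


fragments = [
    (0, 0, 0.5, "red"),
    (0, 0, 0.8, "blue"),   # behind the red fragment, rejected
    (1, 0, 0.3, "green"),
    (0, 0, 0.2, "white"),  # in front of the red fragment, overwrites it
]
tile_buffer, threads = process_tile(fragments)
```

Rejecting hidden fragments before a thread is created saves shading work, and keeping the color and depth data in the tile buffer means external memory is written only once per tile, when the tile is finished.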

The arithmetic part of the Midgard architecture has a SIMD + VLIW structure. The T880 has three 128-bit wide SIMD units and two scalar units. The SIMD units can execute 4 operations for FP32, 8 operations for FP16, or 16 operations for INT8 in parallel.
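The per-type parallelism follows from dividing the 128-bit SIMD width by the element width, as this small Python helper illustrates. The function name `simd_lanes` is chosen here for illustration.

```python
def simd_lanes(element_bits, register_bits=128):
    """Number of elements a SIMD unit of the given width processes in parallel."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


lanes = {name: simd_lanes(bits) for name, bits in
         [("FP32", 32), ("FP16", 16), ("INT8", 8)]}
```

Halving the element width doubles the number of operations per instruction, which is why narrower data types are attractive where their precision is sufficient.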

The unit as a whole is a VLIW that can execute vector multiplication and scalar addition, or vector multiplication, scalar multiplication, and load/store instructions, in parallel. However, instead of all units running in lockstep as in a conventional VLIW, each unit seems to run a separate thread.

The ALU has three 128-bit SIMD adders and a 128-bit SIMD multiplier, plus a 32-bit adder and multiplier. The SIMD units can be split into 8-bit, 16-bit, or 32-bit lanes. The arithmetic units are pipelined as shown in this figure.

(The next installment will be posted on September 30.)