How to detect an error

Error-detecting codes

To detect an error, the data is encoded so that valid data obeys a rule that corrupted data breaks. For example, one rule is to append 1 bit to the original data and choose the value of that extra bit so that the total number of “1” bits is even.

When this data is written to memory and read back, an odd number of “1” bits means that some bit has been inverted. This method is called a parity check. If 2 bits are inverted, the number of “1” bits returns to an even number and the error is missed, but a 1-bit error is always detected.
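The parity rule above can be sketched in a few lines of Python. This is a minimal illustration only; the function names `add_even_parity` and `parity_ok` and the 8-bit sample value are hypothetical and chosen for this example.

```python
def add_even_parity(data_bits):
    """Append one bit so that the total number of 1 bits is even."""
    parity = sum(data_bits) % 2
    return data_bits + [parity]


def parity_ok(word):
    """A word passes the check when it contains an even number of 1 bits."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1, 0, 0, 1, 0])

one_flip = word.copy()
one_flip[3] ^= 1          # a single inverted bit

two_flips = one_flip.copy()
two_flips[5] ^= 1         # a second inverted bit

print(parity_ok(word))       # True:  stored data is consistent
print(parity_ok(one_flip))   # False: the 1-bit error is detected
print(parity_ok(two_flips))  # True:  the 2-bit error is missed
```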

A more powerful detection method is the CRC (Cyclic Redundancy Check). If the original data has 72 bits, think of the first bit as the coefficient of $x^{71}$, the next bit as the coefficient of $x^{70}$, and so on down to the coefficients of $x^{1}$ and $x^{0}$. As shown in Figure 3.11, in the case of the CRC-8-ATM used in GDDR5, the polynomial whose coefficients are the data bits is divided by $x^{8} + x^{2} + x + 1$, and the remainder is appended as the check bits. Because each coefficient is a binary value of 0 or 1, the arithmetic is done modulo 2, so $0 - 1$ is calculated as 1.

Figure 3.11 CRC calculation. The left end of the input data is the coefficient of $x^{71}$, followed by the coefficient of $x^{70}$; only some of the coefficients are shown. Subtraction is done by XORing the values digit by digit. At each following step, the divisor is aligned so that its highest “1” sits under the highest “1” of the remainder, and is subtracted again.
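The long division of Figure 3.11 can be sketched in Python as follows. This is a sketch under stated assumptions, not the GDDR5 implementation: the name `crc8_atm` is hypothetical, the first list element is taken as the coefficient of $x^{71}$, the remainder starts at 0, and the data is followed by 8 zero bits before dividing (the usual CRC convention, which leaves room for the 8 check bits). How bus lanes and burst positions map onto the 72 bits is not modeled, and the 72-bit sample pattern is arbitrary.

```python
POLY = 0b100000111  # x^8 + x^2 + x + 1 (CRC-8-ATM divisor)


def crc8_atm(bits):
    """Return the 8-bit remainder for a list of 0/1 values, highest power first.

    Assumption: 8 zero bits are appended to the data before dividing.
    """
    remainder = 0
    for bit in bits + [0] * 8:
        remainder = (remainder << 1) | bit   # bring down the next digit
        if remainder & 0x100:                # highest "1" lines up with divisor
            remainder ^= POLY                # subtraction modulo 2 is XOR
    return remainder


# Arbitrary 72-bit test pattern (9 bytes, most significant bit first).
data = [(byte >> (7 - i)) & 1 for byte in b"123456789" for i in range(8)]
check = crc8_atm(data)
check_bits = [(check >> (7 - i)) & 1 for i in range(8)]

# Data followed by its check bits divides evenly by the divisor.
print(crc8_atm(data + check_bits) == 0)

# Inverting two data bits changes the remainder, so the error is detected.
corrupted = data.copy()
corrupted[10] ^= 1
corrupted[40] ^= 1
print(crc8_atm(corrupted) != check)
```

Unlike the single parity bit, the two inverted bits here do not cancel each other out, because the remainder depends on the positions of the “1” bits and not only on their count.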

GDDR5 DRAM uses CRC-8-ATM to calculate the check bits over 72 bits of data: each 8 bits of data are extended to 9 bits by adding the DBI signal, which indicates that the values on the data bus are inverted, and these 9 bits are collected over a burst of Burst Length = 8. Then, after the data has been sent, the 8 check bits are sent serially on the EDC signal line. This CRC can detect 2-bit errors.

The receiving side calculates the check bits in the same way. If they match the check bits sent on the EDC signal line, it judges that there is no error; if they do not match, it judges that an error has occurred. In the case of an error, the error is corrected by requesting the sender to resend the data.
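The check-and-resend flow described above can be sketched as follows. This is a toy model, not the GDDR5 protocol: `crc8`, `transfer`, and `make_flaky_channel` are hypothetical names, the channel simply inverts one bit on its first few uses, the check bits themselves are assumed to arrive intact, and the actual EDC signaling and retry mechanism are not modeled.

```python
def crc8(data: bytes) -> int:
    """Byte-wise CRC-8 with divisor x^8 + x^2 + x + 1, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def make_flaky_channel(bad_attempts):
    """Hypothetical channel that corrupts the payload on its first uses."""
    state = {"calls": 0}

    def channel(payload, check):
        state["calls"] += 1
        if state["calls"] <= bad_attempts:
            corrupted = bytearray(payload)
            corrupted[0] ^= 0x01             # invert one bit in transit
            return bytes(corrupted), check
        return payload, check

    return channel


def transfer(payload, channel, max_attempts=3):
    """Send payload, recompute the CRC on receipt, resend on a mismatch."""
    for attempt in range(1, max_attempts + 1):
        received, check = channel(payload, crc8(payload))
        if crc8(received) == check:          # check bits match: accept
            return received, attempt
    raise RuntimeError("transfer failed after %d attempts" % max_attempts)


received, attempts = transfer(b"123456789", make_flaky_channel(bad_attempts=1))
print(received, attempts)
```

Note that correction here relies entirely on the sender still holding a good copy of the data; the CRC itself only reports that something went wrong.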

However, error detection and correction by this CRC covers only signal transmission errors between the GPU and the GDDR5 DRAM; it has no ability to detect data that has been corrupted inside the memory. The JEDEC standard allows the CRC check to be supported for both reads and writes, but supporting every function is not mandatory. For example, Micron’s GDDR5 DRAM document states that CRC is supported only on the read side.

In principle, if there were memory to store the check bits, the CRC could be calculated at read time to detect data corrupted in the memory. However, even if an error is detected at read time, the GPU cannot have the correct data resent, so the error cannot be corrected. (To be able to resend the data for a read from any address, the GPU would have to hold all the data stored in the memory, in which case no external memory chip would be needed in the first place.)

Error-correcting codes

Assume that at most 1 bit is in error. When 64-bit data is received, there are 65 possible cases: no error, an error in bit 0, an error in bit 1, … an error in bit 63. If the received data does not show which of these 65 cases has occurred, the error cannot be corrected. Check bits must be added to make this distinction possible.

An error can also occur in the check bits themselves, so if the number of data bits is D and the number of check bits is C, a necessary condition is that $2^{C} > D + C$ holds: the C check bits must distinguish D + C possible error positions plus the no-error case. It can also be shown that, whenever this condition is satisfied, a code that identifies the bit in which the error occurred can be constructed.
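As a quick check of this condition, the smallest C for a given D can be found by trying values in turn. The function name `min_check_bits` is hypothetical, and the data widths in the loop are arbitrary examples.

```python
def min_check_bits(d):
    """Smallest C such that 2**C > D + C (single-bit error correction)."""
    c = 0
    while 2 ** c <= d + c:
        c += 1
    return c


for d in (8, 16, 32, 64, 128):
    print(d, min_check_bits(d))
# 8 4
# 16 5
# 32 6
# 64 7
# 128 8
```

Doubling the data width costs only one more check bit, so the relative overhead shrinks as the protected word gets wider.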

When the number of data bits D is 64, this condition is satisfied by adding 7 check bits. However, if a 2-bit error occurs, this code concludes that some other bit is in error and makes an erroneous correction. For this reason, it is common to add 1 more check bit and use a code that cannot correct a 2-bit error but can detect it. Such a code is called a Single bit Error Correction Double bit Error Detection (SECDED) code.
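One well-known way to build such a code is a Hamming code extended with an overall parity bit. The sketch below is an illustration under that assumption and is not taken from any particular memory controller; the names `secded_encode` and `secded_decode`, the bit layout (check bits at power-of-two positions, overall parity at index 0), and the sample data are all choices made for this example.

```python
def secded_encode(data_bits):
    """Encode data into an extended Hamming codeword (list of 0/1 values).

    Index 0 holds the overall parity bit, power-of-two indices hold the
    check bits, and the remaining indices hold the data bits in order.
    """
    d = len(data_bits)
    c = 0
    while 2 ** c <= d + c:                   # smallest C with 2^C > D + C
        c += 1
    n = d + c
    word = [0] * (n + 1)
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):                  # not a power of two: data bit
            word[pos] = next(bits)
    for i in range(c):
        p = 1 << i
        # Check bit p gives even parity over all positions with bit p set.
        word[p] = sum(word[pos] for pos in range(1, n + 1) if pos & p) % 2
    word[0] = sum(word) % 2                  # extra bit: overall even parity
    return word


def secded_decode(word):
    """Return (status, word); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, len(word)):
        if word[pos]:
            syndrome ^= pos                  # XOR of the positions of 1 bits
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        return "ok", word
    if overall_odd and syndrome < len(word):
        fixed = word.copy()
        fixed[syndrome] ^= 1                 # syndrome names the bad position
        return "corrected", fixed
    return "uncorrectable", word             # even parity, nonzero syndrome


data = [(i * i + i // 3) % 2 for i in range(64)]   # arbitrary 64-bit pattern
codeword = secded_encode(data)
print(len(codeword))                         # 72 bits: 64 data + 7 + 1

single = codeword.copy()
single[37] ^= 1
print(secded_decode(single)[0])              # corrected

double = single.copy()
double[50] ^= 1
print(secded_decode(double)[0])              # uncorrectable
```

With a 1-bit error the overall parity becomes odd and the syndrome gives the position to invert; with a 2-bit error the overall parity stays even while the syndrome is nonzero, which is how the extra check bit separates the two cases. Errors of 3 or more bits are outside what this code guarantees.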