When converting from a higher precision to lower precision data format, the mantissa is rounded to the nearest. If the value to round is tied, then it rounds to the nearest even value for the mantissa. For example, when converting from float32 to bfloat8, we want to round 23 bits of mantissa for float32 to 7 bits of mantissa for bfloat8. However, we also explicitly store the hidden bit of 1 for bfloat8, so we are really rounding to 6 bits total. Consider the following 23 bits of mantissa:
To get the 7 bits of mantissa for bfloat8, we want to keep 6 bits of the original 23-bit mantissa and store the additional hidden bit at the most significant bit (MSB). The least significant bit (LSB) of the 6-bit mantissa to keep is known as the guard bit, which we use to round to the nearest even (if there is a tie). In other implementations or literature, the MSB of the round value is also known as the round bit with the remaining bits denoted as the sticky bit(s), but the result is the same. In host code, the rounding is done with the following process:
To handle exponent sharing, the mantissa is first normalized prior to rounding if the exponent is different from the shared exponent. If there is an overflow in the mantissa when we round up, we do not recompute the max shared exponent and re-normalize across the 16 numbers. Instead, the mantissa is set to the max value (ie. all 1's). For the other block float formats, the same process applies but with the corresponding number of bits for the mantissa and round value.