Floating Point Math

Made simple
Nice link

The IEEE 754 standard defines 32-bit and 64-bit floating-point representations. The 32-bit (single-precision) format is, from high-order to low-order, a sign bit, an 8-bit exponent with a bias of 127, and 23 bits of mantissa. The 64-bit (double-precision) format is, in the same order, a sign bit, an 11-bit exponent with a bias of 1023, and 52 bits of mantissa. With the hidden bit (an implicit leading 1 that is not stored), normalized numbers have an effective precision of 24 and 53 bits, respectively.
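The 24-bit and 53-bit precision figures can be checked with a short experiment. This is a minimal sketch, assuming Python's built-in float is an IEEE double (true on common platforms); the helper name to_float32 is made up for illustration.

```python
import struct

def to_float32(x):
    """Round a value to the nearest single-precision float (hypothetical helper)."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]

# 24 bits of precision: 2**24 is exact in single precision,
# but 2**24 + 1 needs 25 significant bits and is rounded back to 2**24.
single_exact = to_float32(2**24) == 2**24
single_rounded = to_float32(2**24 + 1) == 2**24

# 53 bits of precision: the same boundary for doubles sits at 2**53.
double_exact = float(2**53) == 2**53
double_rounded = float(2**53 + 1) == float(2**53)

print(single_exact, single_rounded, double_exact, double_rounded)
```

All four flags should come out True: each format holds every integer up to 2 raised to its precision, and the next odd integer no longer fits.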

Single-precision format
Bits:   31   30-23      22-0
Field:  S    Exponent   Significand

Double-precision format
Bits:   63   62-52      51-0
Field:  S    Exponent   Significand
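The bit positions above translate directly into shifts and masks. The following is a rough sketch, not a reference implementation: the function names decode_single, decode_double and value_of_normalized are invented for this example, and the rebuild step only covers normalized numbers (not zeros, subnormals, infinities or NaNs).

```python
import struct

def decode_single(x):
    """Split x, stored as a 32-bit float, into (sign, exponent, significand) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30-23, still biased by 127
    significand = bits & 0x7FFFFF     # bits 22-0, hidden bit not included
    return sign, exponent, significand

def decode_double(x):
    """Split x, stored as a 64-bit float, into (sign, exponent, significand) fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                       # bit 63
    exponent = (bits >> 52) & 0x7FF         # bits 62-52, still biased by 1023
    significand = bits & ((1 << 52) - 1)    # bits 51-0, hidden bit not included
    return sign, exponent, significand

def value_of_normalized(sign, exponent, significand, bias, frac_bits):
    """Rebuild a normalized number: (-1)^sign * 1.fraction * 2^(exponent - bias)."""
    return (-1) ** sign * (1 + significand / 2**frac_bits) * 2.0 ** (exponent - bias)

# -6.5 is -1.625 * 2**2, so the stored exponent is 2 + bias in each format.
print(decode_single(-6.5))   # (1, 129, 5242880)
print(decode_double(-6.5))   # (1, 1025, 2814749767106560)
print(value_of_normalized(*decode_single(-6.5), bias=127, frac_bits=23))  # -6.5
```

The value -6.5 was picked because its fraction (0.625) fits in a few bits, so both formats store it exactly and the decoded fields are easy to check by hand.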
