Precision/Data Loss


Using fixed-point arithmetic can result in data loss due to overflow. Checking for overflow can seriously reduce the speed increase achieved through the use of fixed-point.

Addition and subtraction are generally safe. Assume a fixed-point number of N bits (integer and fractional part). If the largest possible number is added to itself, the computation is represented by the following equation:

$(2^N - 1) + (2^N - 1) = 2 \cdot 2^N - 2 = 2^{N+1} - 2$
$2^{N+1} - 2$ can be represented in N + 1 bits. If the range of values to be represented can fit in seven or fifteen bits, then addition will never overflow the data type being used. Otherwise, as in the angle representation example discussed in the body of the article, the range can be scaled to fit into 16 bits, so that the nature of the values represented allows overflow to be ignored.
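The bit-growth argument for addition can be checked with a short sketch. The type name `fix8_t` and both helper functions are hypothetical, introduced only for illustration; the code assumes an unsigned 8-bit fixed-point value, and the position of the binary point does not matter for addition.

```c
#include <stdint.h>

/* Hypothetical 8-bit unsigned fixed-point type (N = 8). */
typedef uint8_t fix8_t;

/* Add in a wider type so the full (N + 1)-bit sum is preserved. */
uint16_t fix8_add_wide(fix8_t a, fix8_t b)
{
    return (uint16_t)((uint16_t)a + (uint16_t)b);
}

/* Return 1 if a + b would not fit back into 8 bits, 0 otherwise. */
int fix8_add_overflows(fix8_t a, fix8_t b)
{
    return fix8_add_wide(a, b) > UINT8_MAX;
}
```

With values limited to seven bits (0 to 127), the largest sum is 254, which still fits in the 8-bit type, so the overflow check can be dropped.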

Multiplication requires some care. The algorithm used for multiplication computes an intermediate value which can be equal to the largest representable number squared. This multiplication is represented by this equation:

$(2^N - 1)(2^N - 1) = 2^{2N} - 2 \cdot 2^N + 1 = 2^{2N} - 2^{N+1} + 1$
This number requires 2N bits to be represented. If the fixed-point value uses close to 16 bits, the intermediate value will require 32 bits. An 8-bit fixed-point multiplication will require a 16-bit intermediate value. If the fixed-point number is represented with more than 16 bits, a special integer multiplication routine may be needed to compute the intermediate result, which is wider than 32 bits.
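A minimal sketch of a multiplication that widens its intermediate value, assuming an unsigned 16-bit format with 8 fractional bits. The names `ufix16_t`, `FRAC_BITS`, `ufix16_mul_wide`, and `ufix16_mul` are hypothetical and are not taken from the article.

```c
#include <stdint.h>

#define FRAC_BITS 8          /* assumed: 8 integer bits, 8 fractional bits */

typedef uint16_t ufix16_t;   /* hypothetical 16-bit unsigned fixed-point type */

/* Full 2N-bit product of two N-bit values; never overflows 32 bits. */
uint32_t ufix16_mul_wide(ufix16_t a, ufix16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* Fixed-point multiply: widen, multiply, then shift the binary point back.
   The final cast truncates if the true result is outside the 16-bit range. */
ufix16_t ufix16_mul(ufix16_t a, ufix16_t b)
{
    return (ufix16_t)(ufix16_mul_wide(a, b) >> FRAC_BITS);
}
```

For example, 1.5 is stored as 0x0180 and 2.0 as 0x0200; their wide product is 0x30000, and shifting right by 8 gives 0x0300, which is 3.0.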

Division has the same caveats as multiplication, since the dividend is left-shifted N bits before the division, producing an intermediate value of up to 2N bits. The same rules about size apply.
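Division can be sketched the same way. This example assumes the same hypothetical unsigned 16-bit format with 8 fractional bits, so the dividend is shifted left by the number of fractional bits (an assumption of this sketch; the shift can be as large as N when every bit is fractional). Returning the maximum value on division by zero is also an illustrative choice, not something the article specifies.

```c
#include <stdint.h>

#define FRAC_BITS 8          /* assumed: 8 integer bits, 8 fractional bits */

typedef uint16_t ufix16_t;   /* hypothetical 16-bit unsigned fixed-point type */

/* Fixed-point divide: widen the dividend, pre-shift it, then divide.
   The final cast truncates if the quotient is outside the 16-bit range. */
ufix16_t ufix16_div(ufix16_t a, ufix16_t b)
{
    uint32_t wide;

    if (b == 0) {
        return UINT16_MAX;   /* illustrative: saturate instead of trapping */
    }
    wide = (uint32_t)a << FRAC_BITS;   /* needs up to 24 bits here */
    return (ufix16_t)(wide / b);
}
```

For example, 3.0 (0x0300) divided by 2.0 (0x0200) shifts the dividend to 0x30000 and yields 0x0180, which is 1.5.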

The practical application of fixed-point arithmetic requires an understanding of the usage model of the fixed-point numbers. The best technique is to find a range of values that can be represented by an 8- or 16-bit data type and ensure that values stay within that range. Fixed-point numbers can then simply be cast to the next larger data type at the start of the multiplication and division routines.