Gene Sheppard has B.S. degrees in computer science and mathematics and is a software engineer for Solid State Systems, Inc. He has authored and designed software systems for business/inventory, graphics, and embedded systems, and has co-authored a paper on graphics which was published in the Journal of Pascal, Ada, & Modula-2. You may contact Gene at 2610 Whitehaven Dr., Marietta, GA 30064 (404) 439-0146.
Most numerical software is built with a high-level-language compiler which in turn uses mathematical algorithms to approximate functions such as sine and cosine. The operative word here is "approximate". The resulting values are the product of two different kinds of approximation: first, the hardware representation of a number is often approximate because the exact value cannot be contained in even the largest variable type (e.g., 1/3 doesn't have a finite binary representation); second, the software can iterate through a potentially endless approximation algorithm only a limited number of times if it is to achieve practical throughput. Thus, instead of returning exact values for complex functions, the math package returns values which its developer deems "close enough". Different developers, each with different audiences, reach different conclusions about how close is "close enough".
Programs developed in fields such as mathematical physics, numerical analysis, and scientific programming are sensitive to small differences in precision and accuracy. In these fields an understanding of the limits of floating-point arithmetic is imperative. This article will discuss topics germane to the evaluation of a system's numerical capabilities and present ways to help the reader evaluate a floating point package.
First, I will give some insight into how a math package may estimate the transcendental functions, to illustrate how problems can occur in numerical processing. Second, I will define the subtle differences between precision and accuracy. Finally, I will discuss code to evaluate the compiler's ability to handle numerical software.
Trig Approximations
Many trig approximations are based on the Taylor series. (The series is defined in Figure 1a and expanded to show the first few terms in Figure 1b.) Notice that the series never ends, so any practical approximation based on it must be terminated after a finite number of terms have been evaluated. Herein lies the problem: not all math packages terminate at the same point, thereby giving rise to dissimilar results under different compilers. The function Rn(X), known as the Lagrange form of the remainder, is used to determine where to end the series. (X is the parameter passed to the function and n is the number of iterations.) To estimate the sine function using the Taylor series to an accuracy of 0.00001, you would evaluate the remainder function until its result was less than 0.00001. If that required n to be 10, you would take the Taylor series out to 10 iterations to find your answer.
The developer's challenge is to evaluate only as many terms as necessary to supply the needed precision and accuracy. If he evaluates too many terms, his package will be unnecessarily slow; too few terms and the results won't be accurate or precise enough.
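This termination strategy can be sketched in C. The code below is my own illustration, not the algorithm any particular math package uses; the name taylor_sin and the tolerance parameter are hypothetical. It exploits the fact that for this alternating series the error after stopping is bounded by the first omitted term.

```c
#include <math.h>

/* Approximate sin(x) with the Taylor series, adding terms until the
 * magnitude of the last term added drops below the requested
 * tolerance (which then bounds the remainder for this series). */
double taylor_sin(double x, double tol)
{
    double term = x;   /* current term: (-1)^n * x^(2n+1) / (2n+1)! */
    double sum  = x;
    int    n    = 1;

    while (fabs(term) > tol) {
        /* each term is the previous one times -x^2 / ((2n)(2n+1)) */
        term *= -x * x / ((2.0 * n) * (2.0 * n + 1.0));
        sum  += term;
        n++;
    }
    return sum;
}
```

Tightening tol buys accuracy at the cost of more iterations, which is exactly the speed-versus-accuracy trade-off described above. (A real package would also reduce large arguments into a small range first; this sketch is only sensible for modest |x|.)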
Precision Vs. Accuracy
Precision refers to the number of digits used in the results of a numerical analysis and does not imply that all the digits are correct. "Accuracy", on the other hand, is a measure of the correctness of each digit in the result. For example, the statement "PI = 3.14" is accurate to three places. The statement "PI = 3.139" has more precision than the first, but it is not as accurate.
However, "PI = 3.14159265358979323846264338327950288" is as accurate as the first statement and has more precision than the second. Your software may format all output to 16 places, but if you don't validate the floating point routines, you'll never know how many of those digits are meaningful and how many are just spurious noise. The quality of numerical applications is also affected by the math package's dynamic range: the range between the largest and smallest values the package can handle.
For example, in a loop controlled by

    while (x > TEST)

if TEST has a value of 10^-6, then the floating point package with which x is being calculated must have a dynamic range extending down to at least 10^-6; otherwise the loop may never terminate. In a loop controlled by

    while (x != 0)

the dynamic range will determine how close to zero x must come for the loop to terminate.
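The effect can be demonstrated directly. The sketch below is my own illustration (the function name is hypothetical): it counts how many halvings a single-precision x survives before the first loop form finally fails.

```c
/* Count iterations of "while (x > test)" when x is halved each pass.
 * volatile forces the value to be stored in single precision between
 * iterations rather than kept in a wider register. */
int halvings_until_below(float start, float test)
{
    volatile float x = start;
    int n = 0;
    while (x > test) {
        x = x / 2.0f;
        n++;
    }
    return n;
}
```

Starting from 1.0f with test = 1e-6f this ends after 20 passes; had test been below the package's smallest representable positive value, the comparison could never fail by underflow alone and the loop's fate would rest entirely on the dynamic range.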
The Code
The four routines accompanying this article will provide useful information about almost any math library and, with little modification, even about hardware-based floating point support. When you run the routines you should carefully inspect the results of each iteration for precision, accuracy, and gross failures. When you have identified an iteration in which the calculation is breaking down, you must then examine the contents of various variables to determine whether the cause was a limitation in accuracy, precision, or dynamic range.
Was the problem caused by mixing precisions through inappropriate use of the printf function, or is the floating point package itself in error?
Gross failures are perhaps the easiest to identify. Each routine will take most math packages to their limits. If your system does not reach an error condition or overflow within 74 iterations, expand the loop termination count to a larger number.
Listing 1
Code in Listing 1 demonstrates the limitations of single precision. The variable f is single precision and the printf function is instructed to output in scientific notation. Everything is fine up to and including the 41st iteration in Figure 2, after which things seem to go awry. It is not immediately clear whether the math package is at fault. Closer inspection finds that the system's internal mechanism maintains a more precise value than can be printed. Calculators often follow the same practice; they have a limited display but maintain a more precise number internally. At the 42nd iteration, the system's internal mechanism begins to manifest its limitation: this system's single precision has a dynamic range from 10^-37 to 10^+38. Beginning at the 42nd iteration, garbage is produced until overflow errors begin to occur; by the 49th iteration, the system can no longer handle the small numbers.
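Listing 1 itself appears separately; a minimal probe in the same spirit (my own sketch, with a hypothetical function name) simply rides a single-precision value up by powers of ten and reports how many steps it survives before overflowing. On an IEEE 754 system, where FLT_MAX is about 3.4 * 10^38, the count lands near 39.

```c
#include <math.h>

/* Multiply a single-precision value by ten until it overflows to
 * infinity, returning how many steps it survived.  volatile keeps
 * the intermediate value in true single precision. */
int tens_until_overflow(void)
{
    volatile float f = 1.0f;
    int n = 0;
    while (!isinf(f)) {
        f = f * 10.0f;
        n++;
    }
    return n;
}
```

A count far from the high 30s would immediately suggest a non-IEEE representation, a float-as-double implementation, or a broken overflow path.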
Listing 2
Some compilers recognize the keyword double but implement double and float identically. The program in Listing 2 tests a system's double precision. The output starts with a smaller number than in Listing 1, and the variable f is declared as double. Comparing the output of Listing 1 and Listing 2 (Figure 2 and Figure 3) shows that this system really does distinguish between single and double precision: in Figure 2 the system gives up after the 48th iteration, while in Figure 3 the system does well on all 74 iterations. To find the breaking point of this system, I rewrote the loop to perform 310 iterations; it broke down on the 304th iteration.
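Whether double really is wider than float can be checked the same way in the other direction. The sketch below (mine, not Listing 2; the names are hypothetical) divides a value by ten until it underflows to exactly zero in each type. A compiler that quietly implements double as float would return equal counts; on an IEEE 754 system the two counts differ markedly.

```c
/* Divide by ten until the value underflows to exactly zero,
 * counting the steps.  volatile forces storage at the declared
 * precision between iterations. */
int float_underflow_steps(void)
{
    volatile float f = 1.0f;
    int n = 0;
    while (f != 0.0f) { f = f / 10.0f; n++; }
    return n;
}

int double_underflow_steps(void)
{
    volatile double d = 1.0;
    int n = 0;
    while (d != 0.0) { d = d / 10.0; n++; }
    return n;
}
```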
Listing 3
Listing 3 checks the SIN function for roundoff error and for subtle sensitivities to the way parameters are handled. This program is especially useful to C programmers because C converts all arguments passed to the transcendental functions to double, and roundoff error can become a problem if the programmer is not careful.
In some implementations of the sine approximation, the argument is squared before the range is checked, resulting in an underflow internal to the function and garbage for a return value. This program tests whether the SIN function returns its argument unchanged when the argument is very close to zero. The SIN function should return correct values over the entire dynamic range of the floating point operations.
The system used in my test seems to be in good health: neither the value returned by the SIN function nor the passed parameter shows signs of corruption.
Listing 4
This innocuous program is not as straightforward as it may seem. The program loops 25 times, computing roots of x twice and assigning the product of these roots to y. In other words, y is theoretically identical to x because it is the square of x's square root. The variables y and x are compared and printed whenever they are found to be unequal.
(Again, to avoid problems the argument must be of the appropriate type. C allows the calling routine to pass arguments of a different type than is declared in the called routine.)
The results, as indicated in Figure 5, are not as one might expect. Only seven of the 25 evaluations were found to be equal when using double precision variables. I evaluated this program using only single precision and found that y and x were equal.
This may indicate that in my system the square root function, the multiplication routine, or both cannot reproduce the results exactly.
Perspective
Anyone using or contemplating the development of numerical software should evaluate their system's numerical capabilities in order to gain confidence in their results. A starting point could be to adapt this code to test all of the functions in your math package. Keep in mind, though, that these tests aren't exhaustive, and that a solid math package doesn't guarantee solid results: even after your math package has passed muster, your own code can still introduce many subtle problems.
Additional Reading
Cheney, Ward and David Kincaid. Numerical Mathematics and Computing. Monterey, CA: Brooks/Cole Publishing Company, 1980.
Encyclopedia of Computer Science. New York: Van Nostrand Reinhold Company, 1976.
Pachner, Jaroslav. Handbook of Numerical Analysis Applications. New York: McGraw-Hill, Inc., 1984.
Press, William H. et al. Numerical Recipes: The Art of Scientific Computing. New York: Cambridge University Press, 1986.
Smith, W. Allen. Elementary Numerical Analysis. New York: Harper & Row, Publishers, 1979.