Features


Generating Source For <float.h>

Dr. Timothy Prince


Timothy Prince has worked 25 years in aerodynamic design and computational analysis. He first wrote software in BASIC in 1967 on a GE225 with KSR32 terminals. His first renovation of old FORTRAN came shortly thereafter, leading to more such projects, including seminars on adapting code to modern architectures. He received a B.A. in physics from Harvard and a Ph.D. in mechanical engineering from the University of Cincinnati. He can be contacted at 39 Harbor Hill, Grosse Pointe Farms, MI 48236.

The standard for ANSI C defines several header files which are not supported under many current compilers. Most of these header files may be created by reference to the library documentation of a given compiler and a good reference on ANSI C.

Two header files contain most of the definitions of the size of the arithmetic types supported by an ANSI C compiler. The integer types are described in <limits.h>, and the appropriate parameters may be looked up in the documentation of most systems. Floating point types are described in <float.h>. On BSD systems, similar territory is covered by <values.h>. Since this concept may be new to many C programmers, I will describe a C program which may be run to create <float.h> without depending on non-standard headers such as <values.h>, and some of the reasons why one might wish to use such a file.

Although the ANSI standard permits all manner of expressions in <float.h>, its utility is not severely compromised by restricting the values to constants. Additionally, if <float.h> values are constants, then in many cases they can be acted upon by integer comparisons and can be used to control conditional compilation. Technically, some of the example macros shown are non-portable.
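
For instance, assuming an implementation whose <float.h> defines DBL_DIG as a plain integer constant (the work_t name here is only illustrative), the precision parameters can steer a typedef at compile time:

```c
#include <float.h>

/* Select a working type by precision; legal only when DBL_DIG
   expands to an integer constant the preprocessor can evaluate. */
#if DBL_DIG >= 15
typedef double work_t;        /* double carries enough decimal digits */
#else
typedef long double work_t;   /* fall back to the widest type */
#endif
```

On a system where <float.h> hides the value behind a function call or a sizeof expression, the #if line becomes a constraint violation, which is exactly the restriction argued for here.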

The ANSI committee didn't restrict <float.h> parameters to constant expressions usable in #if directives. This allows the environment to change at runtime. For example, if the rounding mode could be changed by a function call, then FLT_ROUNDS would become a runtime value rather than a function only of the compiler options. Or, the committee may have meant to remove some of the architecture dependence from <float.h>, making it independent of the target hardware. It seems to me that such independence defeats the purpose of header files. An example similar to the unsatisfactory behavior permitted by ANSI C occurs in the BSD <values.h>, where DSIGNIF is defined in terms of sizeof(double), preventing the preprocessor from using DSIGNIF to control conditional compilation.

In my view, a compiler should be able to select appropriate header files, rather than having the header file select constants from a library which the compiler has chosen according to the options in effect. If we have a portable program which can put these parameters into the header file in a form which the preprocessor can use, we can overcome these restrictions.

Relationship To Floating Point Standards

Two IEEE draft floating point standards prescribe the system behaviors which are described in <float.h>. These are P754, for 32/64-bit floating point, and P854, for any width floating point. One might expect to be able to write <float.h> directly from the P754 standard. In practice, not only are there variations in whether long double differs from double, but also in whether the implementors choose to apply IEEE rounding in long double or double precision. The Cray, DEC, and IBM traditions, predating the IEEE standards activity, also dictate certain shortcuts which increase speed at the cost of accuracy. Fortunately, there are ways to test the arithmetic by running C code on the target system, so that <float.h> may be created without depending on the style of C support.

Computing <float.h> Parameters

The first characteristic to be determined is the radix, or base, for floating point arithmetic. IEEE P754 requires base two, with P854 permitting also base 10. Only with these bases is there a reasonable degree of associativity in computational arithmetic, so that the expression .5*(x+y) evaluates the same as x*.5+y*.5 for all values of x and y between 2*DBL_MIN and .5*DBL_MAX.

IBM-compatible mainframes, which use base 16 arithmetic, are the most common environment which uses a non-IEEE radix. Early versions of Aztec C used a radix of 256 so that normalizations could be performed by byte string moves, leading to a waste of up to a full byte out of each number.

The test code exploits the properties of floating point arithmetic to determine the limiting parameters needed in <float.h>. The radix is found by determining the smallest power of 2 which is not affected by adding 1 (due to the limitations of the precision). This is a large integer, such as 2^53 for IEEE P754 double. Then the next floating point value is greater by the radix.
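
A minimal sketch of this test (after the classic Malcolm algorithm; the function name is mine, and volatile is used to discourage constant folding and register-held extended precision):

```c
#include <float.h>

/* Find the floating point radix: grow a until the precision can no
   longer register the addition of 1, then measure the gap between
   a and the next representable value above it. */
int find_radix(void)
{
    volatile double a = 1.0, b = 1.0;
    while ((a + 1.0) - a == 1.0)   /* stop at the first power of 2 */
        a *= 2.0;                  /* too big to absorb a 1 exactly */
    while ((a + b) - a == 0.0)     /* step up to the next value */
        b += 1.0;
    return (int)((a + b) - a);     /* the gap is the radix */
}
```

On an IEEE P754 machine the first loop stops at 2^53 and the function returns 2; on an IBM hex mainframe it would return 16.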

Accuracy

The <float.h> parameters FLT_EPSILON, DBL_EPSILON, and LDBL_EPSILON are the difference between 1 and the next larger value of float, double, and long double, respectively. The corresponding parameters FLT_DIG, DBL_DIG, etc., give the approximate equivalent number of significant decimal digits. An example of the use of these would be conditional compilation to select the data type which should provide a required level of accuracy on various systems. The code to find these numbers resembles the code used to find the radix.
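
A sketch of the epsilon search for double, assuming a radix of 2 (the function name is mine; volatile forces the trial sum to be stored at double precision rather than held in a wider register):

```c
#include <float.h>

/* Halve eps until 1 + eps/2 collapses to 1; the last surviving
   eps is the gap between 1 and the next larger double. */
double find_epsilon(void)
{
    volatile double eps = 1.0, trial;
    for (;;) {
        trial = 1.0 + eps / 2.0;
        if (trial == 1.0)
            return eps;
        eps /= 2.0;
    }
}
```

On a non-binary radix this halving search would return a power of 2 no larger than the true epsilon, so the real test code must step by FLT_RADIX instead.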

Since long double is not supported by many compilers, the type register double is assumed to be equivalent for non-ANSI compilers. The code presented here has been tested successfully on a variety of systems, all of which satisfy the requirement that all register doubles within a loop remain in long double precision. Long double generally is not supported in log(), atof(), and printf().

Rounding

ANSI C provides for a parameter specifying the type of rounding behavior, defined in terms of whether rounding is performed in addition operations. The implicit assumption that this behavior does not depend on data type is not strictly true. For instance, double arithmetic on 8087 and 68881 systems may be rounded first to 64 bits and then to 53 bits of precision, causing slight errors. Certain older systems may have better rounding in float precision than in double or register double. By performing these tests in long double, we can use <float.h> as a guide to wringing the last bit of accuracy out of long double arithmetic. In other cases, performing intermediate steps in higher precision than the final destination will provide maximum accuracy.

There is no apparent concern in the ANSI C standard for the distinction between IEEE-style rounding, where numbers are rounded to the nearest even when they fall exactly on a rounding threshold, and DEC-style rounding, where the threshold numbers are rounded away from zero. Likewise, ANSI C does not care, as the IEEE standard does, whether one's or two's complement is used for floating point. Unlike <limits.h>, <float.h> does not tell whether the range of negative numbers may be larger than the range of positive numbers. To my knowledge, the most serious practical implication of ignoring these points in <float.h> is the inability to detect certain obsolete machines which switch from FLT_ROUNDS == 0 behavior for positive and small negative values to FLT_ROUNDS == 3 for larger negative numbers.

You will notice, by examining the C code, that an optimizing compiler could simplify the rounding test expressions to the point where they indicate that rounding is not performed. Aside from the test for subgrd, which certain K&R compilers may transform to the point where it cannot be passed, known compilers are close enough to ANSI C that this should not be a problem. We are interested in how the system executes a C program, so it is immaterial whether the compiler or the hardware causes incorrect rounding.

The addition and subtraction tests generate the expressions

1 + .5*LDBL_EPSILON*(1+LDBL_EPSILON),
1 + .5*LDBL_EPSILON*(1-LDBL_EPSILON),
1 - .5*LDBL_EPSILON*(1+LDBL_EPSILON)/FLT_RADIX,
1 - .5*LDBL_EPSILON*(1-LDBL_EPSILON)/FLT_RADIX,
and check to see how they are rounded. These tests cannot be passed unless the arithmetic is carried out exactly before rounding (as in the case of float arithmetic carried out in double precision) or a "sticky" bit is used to record the loss of bits shifted out. Similar tests with the _EPSILON removed could be used to distinguish IEEE from DEC-style rounding.
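
A sketch of the addition probe, in double rather than long double for the sake of portability (the function name is mine, and round-to-nearest IEEE double is assumed):

```c
#include <float.h>

/* Probe addition rounding near 1.  A value a hair above (below) the
   halfway point must round up (down); the (1 - e) case requires a
   sticky bit, because its low-order bits are shifted far below the
   guard position before the add. */
int add_rounds_to_nearest(void)
{
    volatile double e = DBL_EPSILON;
    volatile double up = 1.0 + 0.5 * e * (1.0 + e);  /* just above half an ulp */
    volatile double dn = 1.0 + 0.5 * e * (1.0 - e);  /* just below half an ulp */
    return up == 1.0 + e && dn == 1.0;
}
```

Both products are exactly representable in double, so any failure here is a failure of the addition, not of the setup arithmetic.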

Tests have been added to determine whether multiplication and division are rounded in a manner similar to addition. These tests check only whether the largest possible fraction of _EPSILON is lost in rounding each side of 1.

Many compilers choose to convert the vector by scalar division operation to an inversion outside the loop with multiplications inside, gaining speed at the cost of accuracy. The compiler may even be set up so that it passes the IEEE conformance tests which do not exercise vector operations.

Similar tests are available for sqrt(), but have not been included here because conditional compilations are not likely to be based on the correctness of sqrt().

Range

Most systems have different ranges for the various floating point types, with (perhaps total) loss of accuracy when numbers exceed the range. It is also useful to know the range of exponent values, as well as how much accuracy is possible, when setting up format strings for printf().

To test for the values of DBL_MIN et al., we start with (1+DBL_EPSILON...) and divide repeatedly by FLT_RADIX until a loss of accuracy occurs which prevents us from recovering the previous value by multiplying by FLT_RADIX. This sequence of operations finds the "minimum normalized value" for each data type, which is the smallest value for which the characteristics DBL_DIG... are valid. On some systems, this value is the smallest non-zero number available. If there is gradual underflow, the smallest non-zero number will be _EPSILON times this size. If these partially underflowed numbers are supported, as the IEEE standard demands, they may not (as on the 8087) be allowed as a divisor, so portable programs will not rely on them. In order to avoid complaints from log() when it gets a very small argument, we calculate the logarithm for the long double case from the loop count. Thus we need not depend on the availability of logl(). The extra runtime taken by this slow and simple code on a 68881 is insignificant for code which is to be run only a few times per compile.
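
The double case of that search might look like this sketch (the function name is mine; FLT_RADIX of 2 and non-trapping IEEE underflow are assumed):

```c
#include <float.h>

/* Divide (1 + eps) down by the radix; the epsilon "tag" bit survives
   every exact division until gradual (or hard) underflow destroys it.
   The previous value, with the tag divided back out, is the minimum
   normalized number. */
double find_dbl_min(void)
{
    volatile double x = 1.0 + DBL_EPSILON, prev;
    for (;;) {
        prev = x;
        x /= 2.0;
        if (x * 2.0 != prev)       /* accuracy was lost in the division */
            return prev / (1.0 + DBL_EPSILON);
    }
}
```

Starting from 1+DBL_EPSILON rather than 1 is what detects the first partially underflowed step; a bare power of two would divide down exactly until it vanished entirely.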

To find DBL_MAX... we try powers of FLT_RADIX until we get an overflow, indicated by reaching a value which does not increase when multiplied again by FLT_RADIX. We must be prepared for the program to abort, since that is the way DEC works. If the code keeps running, we go on and do the same for all the data types. On modern systems, there is a special bit pattern obtained at overflow, which printf() displays as "Inf"(inity). The code will detect the Inf result because it will not change as we try to back down to smaller magnitudes. Previous values are saved so that DBL_MAX may be displayed as the number obtained before Inf is obtained twice. The computations use negative numbers because some older systems used two's complement for floating point as well as for integers. On such systems the maximum negative number could be exactly a power of the radix, with the maximum positive number (which we are searching for) being slightly less.
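
A sketch of the overflow search for double, assuming IEEE non-trapping overflow to Inf (the function name is mine; on a DEC-style machine this loop would instead abort the program):

```c
#include <float.h>

/* Multiply up by the radix until the value stops changing -- on IEEE
   systems that means infinity has been reached, and the last finite
   value saved is the largest power of two.  Negative values are used
   so that a two's-complement machine searches the larger-magnitude
   side of the range. */
double find_dbl_max(void)
{
    volatile double x = -1.0, prev = -1.0;
    while (x * 2.0 != x) {         /* Inf * 2 == Inf stops the loop */
        prev = x;
        x *= 2.0;
    }
    /* widen that power of two to a full significand of radix digits */
    return -prev * (2.0 - DBL_EPSILON);
}
```

The final product 2^1023 * (2 - DBL_EPSILON) is exactly the IEEE DBL_MAX, so no second overflow occurs on the way out.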

Accuracy Of Conversion To And From Decimal

Compilers may have trouble digesting the MAX and MIN decimal values if they happen to convert to binary values outside the range. The IEEE standards specify ability to convert within .47 _EPSILON, but this is possible only when the printf() and scanf() functions use extended precision. For example, the IEEE-specified ability to convert to within .47 DBL_EPSILON is implemented in the 8087 by the use of long double, so that double conversions should be accurate. Although long double has over 19 significant digits, only 17 or 18 digits will be converted correctly to decimal. The number 17 complies with the IEEE standard, and the correct conversion of integers to 18 digits complies with COBOL compiler standards. On a machine which implements long double as the same as double, as many significant digits may be lost in printf() as there are in the exponent. The IEEE standard makes no demands on the ability to convert long double to decimal, so C libraries cannot be expected to support it without casting to double.

Several of the atof() or scanf() functions I have tried were inadequate, so one is included in the code. The results written into float.h are adjusted away from the limits according to the amount of error found when nearby numbers are formatted by sprintf() and checked by atof(). The code, of course, trusts the compiler to use an adequate scheme for reading the constants from <float.h>! The difficulty of precisely converting constants may be another reason for the ANSI committee's caution against expecting portable constants in <float.h>. The standard allows for <float.h> constants to be in hexadecimal or other non-standard form, so that compilers may read them without error.
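
The round-trip check can be sketched as follows; the roundtrips() helper is mine, and it leans on the library's sprintf() and atof() rather than on a replacement conversion routine:

```c
#include <stdio.h>
#include <stdlib.h>

/* Round-trip x through a decimal string of `digits` significant
   figures and report whether the original value comes back exactly.
   For IEEE double, DBL_DIG + 2 (17) digits should always suffice. */
int roundtrips(double x, int digits)
{
    char buf[64];
    sprintf(buf, "%.*e", digits - 1, x);   /* digits significant figures */
    return atof(buf) == x;
}
```

With only 6 digits a value such as 1/3 cannot survive the trip, which is the kind of loss the adjustment described above is guarding against.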

Conclusion

The ANSI C header file <float.h> may be obtained by running the test code on the target system (see Listing 3 and Listing 4 for sample output). The floating point arithmetic characteristics which we might wish to test when performing a conditional compilation may be pretested and summarized in <float.h> so that code may be written to work to best advantage on a variety of architectures. This is a good way to explain the use of roundabout expressions which are needed to overcome numerical difficulties.

References

Cody, W. J. et al. "A Proposed Radix- and Word-length-independent Standard for Floating-point Arithmetic," IEEE MICRO reprint, 1984. This reference sets out the P854 draft standard, with some explanations.

Stevenson, David et al. "A Proposed Standard for Binary Floating-Point Arithmetic," IEEE COMPUTER reprint, 1981.

Plauger, P.J., and Brodie, J. Standard C. Microsoft Press, 1989. This reference explains the reasons for the P754 standard.

Coding For Accuracy

Here is an example of how accuracy might be preserved for non-IEEE radix while allowing normal systems to use fast code:

#include <float.h>

#if FLT_RADIX != 2 && FLT_RADIX != 10
   y= ((b>=0)==(c>=0))? a*b+a*c : a*(b+c);
#else
   y = a*(b+c);
#endif
Of course, such code is more desirable when b and c are known to be of the same sign, so that the runtime conditional may be eliminated. Such attempts to retain accuracy that would otherwise be lost to non-standard arithmetic cost more time than the faster hardware gains.

A textbook case in which the effects of inaccurate division may be eliminated by careful coding is in the final calculation of a rational approximation for exp() (see Listing 1) .

The odd term is known to be much smaller than the even term, so the inaccuracy of the division is buried by performing the addition of .5 or 1 last, where addition is known to be more accurate. On non-pipelined systems, the second alternative may be as fast as the first.

Exploiting Range Information

This example shows how to use range parameters to determine whether the usual DEC and IBM style heroic measures need be taken when calculating the hypotenuse of a triangle:

float x,y;
double tmp1,tmp2;
#if DBL_MAX_10_EXP > 99 || DBL_MAX_10_EXP > FLT_MAX_10_EXP
   hyp= sqrt (x*(double)x+y*(double)y);
#else
   hyp= ((tmp1=fabs(x))>(tmp2=fabs(y)))?
      tmp1*sqrt((y/x)*(y/x)+1):
      tmp2*sqrt((x/y)*(x/y)+1);
#endif
You don't want to have your code abort in the middle of a three hour job because of an unrecoverable overflow or divide by zero, but you also want to avoid writing code which spends 20 percent of its CPU time calculating the hypotenuse. The IEEE precision extensions preclude this problem, but, with conditional compilation, we can still run the conservative method on the machines which require it.

Listing 2