Timothy Prince has a B.A. in physics from Harvard and a Ph.D. in mechanical engineering from the University of Cincinnati. He has 25 years of experience in aerodynamic design and computation. He can be contacted at 39 Harbor Hill Dr., Grosse Pointe Farms, MI 48236.
Coprocessor Problems
In a letter to The C Users Journal some time ago, Paul Wexler pointed out some confusing results that developed while dealing with dollars and cents on numeric coprocessors with extended double precision registers. He showed that the Microsoft C v5.1 and Turbo C compilers treated the code:
double d = 0.99;
int i = (d *= 100.0);

as if the second line were written:
register long double dtemp = (long double)d * 100.0;
d = dtemp;
int i = dtemp;

with the results d==99.0 and i==98! According to K&R or ANSI C, the original code should give the same results as:
d *= 100.0;
int i = d;

So the most popular C compilers conform neither to K&R nor to ANSI! Paul tried the same code on some other systems and got correct results. I tried it on the Sun 4.0 compiler and found that
f77 -f68881

gave the same non-K&R results as Paul had obtained with the Microsoft and Turbo compilers, but
f77 -f68881 -O2

generated a code expansion equivalent to:
i = d = (long double)d*100.0;

which gives the correct results. The value of 0.99 rounded to the IEEE standard double precision format happens to be slightly less than 99/100. (long double)d*100 is less than 99, but if rounded again to standard double precision, the result is exactly 99.
Normal Surprises In Floating Point
Most of us at some time have written something like
for(x=0.01; x!=1.0; x+=0.01) .....

Because there is no exact binary equivalent to 0.01, the conditional is always true. The code might work on a system without extended precision, but is likely to hang on 80X87 or 68881 processors. If the condition is changed to x<=1.0, the number of iterations will vary by 1 among various systems. Change the counter to integral values:
for(xx=1.; xx<=100.; ++xx) { x=xx*0.01; .... }

and it will run portably on all systems with correct floating point implementation, no matter what type is given for xx. On many systems, it will run faster if the types of x and xx are the same.
Try It On Your Own Machine
I carried Paul's example a bit further. In every case, the result for i (in his example) is either the same as the result j or the result k (when compilers take liberties with the standards), but the k may be less than i when varying precisions of floating point arithmetic are used. In single precision or PDP-11 style double precision, the value of 0.99 is greater than 99/100, and a series of other values must be tried to verify the adherence of the compiler to the standard.

C libraries should be able to convert output to the 17 or 18 digits of precision that are required to show the difference between the binary approximation and the original decimal value. Any binary floating point number may be expressed exactly in decimal, but there is no practical reason or means for going beyond the number of digits required to specify a unique binary value. Converting the other way, most fractional decimal values have no exact binary fractional equivalent. If properly programmed, the numeric coprocessors can perform the conversions reversibly to the extent demanded by the IEEE standards. In plain double precision there will be small deviations.
The inexact conversion from decimal to binary is sometimes taken as an argument for using decimal arithmetic in applications such as financial calculations. However, practical applications of this kind generally have a fixed number of fractional decimal places. It should make no difference to the result whether binary or decimal arithmetic is used.
You may wish to see the hexadecimal pattern of the binary floating point number. Since the %x conversion is not available for double, you must use unions or pointer casts. Converting the value byte by byte avoids some of the confusion introduced when the byte storage order is reversed for long or double. K&R compilers may have no unsigned char, so the 0xff mask is used to compensate. %lx conversion would be expected to work on float if sizeof(float)==sizeof(long) and byte orders are not of concern. When you run the code, look for the repeating patterns in the hex representation for evidence of rounding. When the pattern that is chopped off is greater than 0x80..., there should be upward rounding.
If your system does not have extended precision, double rounding (after divide and again after subtract) occasionally will give an incorrect rounding. This occurred only once in the hex output on my test using Z80 software floating point. The same effect reduces the number of different values that printf() can produce, making numbers appear to be exact numbers of cents when they aren't. This violates the IEEE standard but is normal without extended precision.
Why Do The Gurus Do This To Us?
In the design of the 8087 and 68881 numeric coprocessor families, the number of instructions required was reduced by choosing one data type (register long double) for the results of all floating point operations. This was a logical extension of the traditional C scheme of promoting all float arguments and expressions to double. In normal operation, it often gives an extra decimal digit of accuracy and avoids most accidents caused by exceeding the range provided for double, which was small enough to require special precautions on older hardware.

Since the coprocessors give an incentive in accuracy as well as speed for allocation of variables to registers, numerical results may be expected to differ on these machines from the values obtained on other architectures. In fact, by literal interpretation of the IEEE floating point standard, these coprocessors may give incorrect results in double precision. All operations are rounded first to 64 bits precision (plus a 15-bit exponent and a sign) and then to 53 bits precision (with 11-bit exponent and hidden bit allowing room for the sign).
In effect, the coprocessor rounding threshold varies from about 0.4998 to 0.5002 of the unit in the last place (ULP) of a double. Theoretically, this could cause the same kind of non-standard behavior, but the possibility is remote and overshadowed by effects of the magnitude of 0.25 ULP such as those we have been discussing. Typical Z80 floating point software routines have a rounding threshold which varies from 0.49 to 0.51, and still give excellent accuracy.
You'll hardly notice whether a value that is assigned and then used for subsequent calculation is rounded off, unless you push your luck. Meanwhile, with the benchmark wars continuing, the compiler writers are looking for every chance to maximize the use of register variables. Certain optimizing compilers, given Paul's example code, will perform all the calculations at compile time, leaving nothing but the printf calls, and no record of the sequence of operations used to obtain the constant results.
Are The Standards Correct Or Is My Compiler Right?
There might be a case for allowing compilers to ignore casts and assignments that reduce the width of expressions, but when I write code such as
int ix;
double x;
x -= (ix = ((x >= 0.0) ? 0.5 : -0.5) + x);

I expect x to come out such that x += ix would restore the original x. This wouldn't work if the double expression were allowed to leapfrog over the assignment to ix. When something from the integer family is involved, most of us would agree with the standard treatments. The Motorola coprocessors can evaluate this expression without reading data back from memory, so there is less likelihood of cheating to obtain faster code. On the Intel processors, the result that I intended could be produced by
register long double xtemp = RINT(x); /* old Intel FORTRAN function */
ix = xtemp;
x -= xtemp;

but the compiler can't be sure of this. RINT() was Intel's invention; it would round a long double to an integral value in accordance with IEEE standards. When various precisions of floating point are involved, opinions are not unanimous. In the matrix factorization code (3), I assigned the value that was to be used in immediately subsequent calculations to a local double variable before assigning it to a float array element, to save the time taken by compilers that would either read it back from memory or cast it back to double. If I knew that the compiler would generate fast code by leapfrogging an assignment (some do), I wouldn't write it that way.
Certain compilers will ignore the evaluation implied by an intermediate assignment if the assigned value is not actually used later on. I assume that this is contrary to the ANSI standard, but I'm sure that others, whose opinions count more, will assume differently. This is a gray area, since one must be prepared for a compiler to promote a variable to register long double, for instance, as long as it does so with reasonable consistency. If the variable is never used, by this logic, the compiler could choose to promote it to the extent that the implied cast is eliminated. Using the same name for another purpose later on doesn't count either, since renaming is an accepted technique for optimizing compilers.
On the other hand, I might want to round an expression off to a precision consistent with the numbers from which it came, to avoid problems analogous to Paul Wexler's. If the compiler were allowed to pass over assignments and casts, it couldn't be done. Besides, it might be difficult to remember if the compiler treated double differently from its treatment of int.
It's easier to write code that guarantees that the rounding will be passed over than to prevent the compiler from making a surprise interpretation. The standard definitions give more flexibility, control, and self-consistency in the language definition.
How To Pinch Pennies
Paul Wexler feels that C is inadequate to control the operations of numeric coprocessors. In particular, he suggests that he would like full control of the rounding mode. Naturally, he is dismayed by the occasional theft of 99.44% of a penny by the computational troll.

Several compilers supply system-dependent functions that allow the rounding mode to be changed. I suggest that there are valid reasons for their unpopularity. First, of course, they don't exist on many compilers and there isn't any accepted syntax, so you must accept lack of portability. Second, the idea of having to break your source code to insert such function calls, and to take the performance penalties involved in preventing compiler optimizations or code pipelining across such calls, is unappealing.
The need for a way to specify rounding to an integral value has been recognized in the definition of many languages and in the IEEE floating point standard. For instance, Pascal has the ROUND() function, which was adopted by FORTRAN as NINT(). At the same time, FORTRAN adopted the even more important rounding without type casting called ANINT(). If it were not for the disagreement between the IEEE and the language standards committees over the need for rounding to even in the half-way case, the situation would be satisfactory in these languages.
In C, the Pascal ROUND() can be simulated by a macro involving adding or subtracting 0.5 before applying (int). Situations remain where it is preferable not to use a cast, either for performance or to avoid narrowing the range.
Financial calculations are an example. The way to perform calculations down to the cent is to use integral numbers of cents and to round off to the nearest cent after each operation that could introduce a fractional value, including conversions from dollars to cents. If we always use double precision, numbers of cents up to 2^53 - 1, or nearly 16 digits (for standard IEEE double precision), may be calculated exactly.
The 8087 appears to have been intended for the integral value floating point method of decimal support, as it has an instruction for storing directly from register long double to an 18-digit decimal integer. In fact, if the atof() function makes use of the 8087, it must convert the input decimal dollar figure to cents internally before converting to binary and changing it back to dollars by multiplying by 0.01. We would have been better off to convert the decimal string to cents ourselves to begin with. Then we could use the "inexact" flag to verify that atof() performs an exact conversion.
The designers of C have been stubborn in not providing a standard floating round function. Many processors have one (e.g., the Intel FRNDINT), and the inability to have it compiled into the code when it is needed will hurt performance. There is no portable way to do it in C. The best that can be done is to use a macro with provisions for change to support specific machines. Rounding to an integer can be done portably, and conceivably, optimizing compilers could recognize code such as the ix example above and generate more efficient code when appropriate.
Rounding may be performed in IEEE double arithmetic by
#include <float.h>
#define ROUNDER 1./DBL_EPSILON
#define FRNDINT(x) ((((x) < ROUNDER) && ((x) > -ROUNDER)) \
    ? ((x) >= 0.0 ? ((x)+ROUNDER)-ROUNDER : ((x)-ROUNDER)+ROUNDER) : (x))

which normally would be simplified by omitting the cases for absolute value of x greater than ROUNDER. Without this precaution, larger numbers are rounded to multiples of 2. The constant ROUNDER has the value 2^52 for standard IEEE double precision.

This code will take over twice as long as the hardware round instruction. It's usually faster than the FORTRAN ANINT() functions, which are written to overrule the IEEE style rounding of most modern hardware designs. You will need to check that your compiler treats the parentheses in accordance with the draft ANSI standard. At least one vendor (Convex) has chosen to violate this feature of the language standards in search of better benchmark speed. The whole macro could be optimized away!
Lovers of the obscure will notice that a slight rearrangement permits all of the comparisons to be performed with integer instructions, if the programmer knows the formats actually used for int and double data. A compiler for a pipelined or multiple adder machine should produce code such as
register double temp1, temp2;
register int tmp1, tmp2;
temp1 = x + ROUNDER;
temp2 = x - ROUNDER;
tmp1 = x >= 0.0;
tmp2 = x < ROUNDER;
temp1 -= ROUNDER;
temp2 += ROUNDER;
tmp2 &= x > -ROUNDER;
temp1 = tmp1 ? temp1 : temp2;
temp1 = tmp2 ? temp1 : x;

which keeps the total elapsed machine time under control and does wonders for the megaflops rating of the code. If the number of stages in the pipeline times the number of adders is at least four, there is no incentive to try integer comparison.
Conclusion
Considering the amount I have had to say on a seemingly simple topic, one can sympathize with C programmers who avoid grappling with floating point. I have a little less sympathy with compiler and run time library writers who compound the programmer's problems by exceeding the legal limits of optimization, or who just don't think about the reliability issues.

Normal precautions in the use of floating point may help avoid problems with compilers that take shortcuts and deviate from the rounding behavior required by the language and hardware standards. Most such situations are recognizable and may be taken care of by explicit specification of rounding. Similar situations exist where code cannot be expected to be reliably portable without such precautions, even among compilers that adhere to the language standard. In C, a problem is presented by the lack of a portable way to specify rounding without truncating to int.
References
Ritchie, Dennis M., The C Programming Language Reference Manual, Bell Laboratories, Murray Hill, N.J., 1978.
Harbison, S. P., and Steele, G. L., C: A Reference Manual, Second Edition, Prentice-Hall, 1987.
Prince, T.C., "Efficient Matrix Coding in C," The C Users Journal, May 1989, p. 59.