Tim Prince has a B.A. in physics from Harvard and a Ph.D. in mechanical engineering from the University of Cincinnati. He has 25 years of experience in aerodynamic design and computational analysis. He can be reached at Box 16007, San Diego CA 92176.
Introduction
C compilers traditionally have had available a library of the usual math functions, in double precision only, declared in the header <math.h>. One of the aims of ANSI standardization was to make the choice between float (single precision) and double (double precision) available in a portable way. Standard C reserves a set of names for standard math library functions in float precision, although it doesn't require that these functions be present in the library. (Just add f to the conventional name, so double sqrt(double) becomes float sqrtf(float).) In order to use these functions in a portable way, source code should be provided for those functions which an application will use.
P.J. Plauger's book, The Standard C Library, includes all required double math functions. It has raised the standard of quality for these functions. Functions which meet the accuracy standards and take care of the many possible error conditions can be difficult to understand. In float precision, several of the functions can be written in a more straightforward way without losing accuracy.
The required double functions will work without any problems when invoked with float arguments, even when not within the scope of the prototypes declared in the Standard C headers. So the only reason for using special float versions is to gain speed or compactness of compiled code. Since float precision is marginal in many applications, these functions should retain as much accuracy as possible. Ideally they should provide correctly rounded results in float precision for all representable results.
The errno values ERANGE and EDOM need not be stored except in those cases where the C Standard requires them. The following functions are written with the expectation that IEEE special codes such as Infinity and Not a Number will be propagated according to normal rules, so that time need be spent on them only in those cases where bit pattern manipulations prevent them from propagating normally.
I hope that these float precision functions will provide an opportunity for improved understanding of math functions, although I have not sacrificed efficiency for readability. These functions have been written with a view to efficiency on typical Reduced Instruction Set (RISC) computers, particularly those with some pipelining of floating-point instructions, such as HP PA-RISC.
Inline Functions
Several of the math functions are best expressed using inline code. In particular, fabs and sqrt are supported as single instructions on most modern computers, in both float and double versions. Many C compilers do not employ these instructions, so an inefficient function call has to be made. The Gnu gcc compiler has a satisfactory way of dealing with simple inline functions, allowing optimum code sequences for the given architecture to be prescribed in <math.h>. A version of sqrtf is discussed below, and shortcut macros for modff and fmodf are included for use in a modified <math.h> (Listing 1). In my experience, none of these are very useful, but they are included for completeness, including satisfying the test suites.
Rather than deal with subnormals and the like in frexpf and ldexpf functions, it is more efficient on IEEE-compliant architectures to promote to double as a way of dealing with numbers outside the range of magnitudes FLT_MIN to FLT_MAX. The portability and modularity of the primary math functions could be improved by using these functions, but I chose to improve efficiency, as the prevailing IEEE floating-point architectures provide reasonable portability anyway.
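The promotion approach can be sketched as follows. The names my_frexpf and my_ldexpf are illustrative, not taken from the article's listings; the double library functions simply absorb all the subnormal and out-of-range handling:

```c
#include <math.h>

float my_frexpf(float x, int *exp)
{
    /* the double frexp sees float subnormals as ordinary
       normalized doubles, so no special cases remain here */
    return (float)frexp((double)x, exp);
}

float my_ldexpf(float x, int n)
{
    /* the double intermediate represents values outside
       [FLT_MIN, FLT_MAX] without trouble; the final cast
       rounds (or over/underflows) exactly once */
    return (float)ldexp((double)x, n);
}
```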
Internal Header "xmath.h"
In order to simplify things a little at the expense of VAX compatibility, I have modified the internal header file "xmath.h" introduced by Plauger for writing all the math functions. It now uses long instead of short int variables to perform bit manipulations on floating-point numbers. The following changes are made in his "parameter" header "yvals.h", which is included by "xmath.h":
#define _D0 0                     /* big-endian order */
#define _DOFF 20                  /* IEEE double union of longs */
#define _DSIGN 0x80000000         /* sign bit mask (avoid) */
#define _DBIAS 1022               /* exponent field of 0.5 */
#define _FOFF (FLT_MANT_DIG - 1)  /* exponent offset */

and these changes or additions to "xmath.h":
typedef union {                   /* VAX compatibility discarded */
    double _D;
    unsigned long _W[2];
} _Dvar;
#define _DMASK (~_DFRAC)
#define _DMAX ((1<<(31-_DOFF))-1) /* 15 for short */
#define _DNAN (0x80000000|_DMAX<<_DOFF \
    |1<<(_DOFF-1))                /* avoid this */
#define DSIGN(x) (((unsigned long *)&(x)) \
    [_D0]&_DSIGN)                 /* avoid this */
/* modified _Dtest (avoid) */
#define _Dtest(x) ((x)>DBL_MAX||(x)<-DBL_MAX \
    ?INF:(x)!=(x)?NAN:(x)!=0?FINITE:0)

I feel that more efficient and readable code generally results from writing out the special cases which must be treated, rather than generating Plauger's special codes INF, NAN, FINITE, etc. In most situations, negative values can be detected more efficiently in the obvious portable way by comparison with 0 than by testing the sign bit of a union.
atan2f
Listing 2 shows the code for atan2f. atanf is treated as a special case of atan2f, as this seldom costs any time and avoids duplication of code. The programmer can save time and accuracy by using atan2f rather than performing a division before calling atanf. In the absence of an inlined fabsf implementation, it is faster to copy the arguments to a union of float and long before comparing their magnitudes; otherwise there would not be sufficient reason to use such ugly code. This code works for any machine which uses sign-magnitude floating point and has the same sign bit position and bit order in float and long words.
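The magnitude comparison can be sketched like this. The union layout, the 32-bit word, and the mask value are assumptions about an IEEE single-precision machine (the article uses long, which was 32 bits on the systems of the day), and mag_ge is a hypothetical name:

```c
/* assumes 32-bit unsigned int, IEEE single format, and matching
   byte order between float and integer words */
typedef union { float f; unsigned int w; } fl_bits;

int mag_ge(float y, float x)   /* nonzero when |y| >= |x| */
{
    fl_bits a, b;
    a.f = y;
    b.f = x;
    /* with sign-magnitude floating point, unsigned comparison of
       the non-sign bits orders the magnitudes correctly */
    return (a.w & 0x7fffffffu) >= (b.w & 0x7fffffffu);
}
```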
Under UNIX, atan2(0.,0.) produces zero, according to well established tradition, partly as a result of the limitations of pre-IEEE floating-point architectures. The IEEE standard clearly dictates the result NaN, which follows without any special effort from this code.
The transformation
atan2(x,y) = π/2 - atan2(y,x)

(in the first quadrant) is used to reduce the range to the first or last octant. In float precision, no further reductions are necessary, as it is possible to use a single Chebyshev economized polynomial over the range -π/2 to π/2. This is a bit remarkable, as the corresponding Taylor series diverges at the ends of this interval. The gain in simplicity more than compensates for the number of terms.

When evaluating polynomials that have a leading first-order term with coefficient 1, there are two reasons for separating this term from the Horner evaluation of the rest of the polynomial, adding the first-order term last. First, if the terms in the series decrease in magnitude with increasing order, all the roundoff errors can be kept out beyond the least significant bit of the result, so the final addition produces a result which is usually correctly rounded. Second, the last multiplication can be computed in parallel with the higher-order terms, giving the effect of one less operation on parallel or pipelined architectures.
Rather than evaluating one long polynomial by Horner's method, I evaluate terms in pairs to take advantage of pipelined architectures, or those that can multiply and add in parallel. This involves one or two additional multiplications, but gains far more speed on a pipeline or parallel processor than it loses on a purely sequential processor. The additional roundoff errors are submerged into insignificance by performing the final operations in the order described in the preceding paragraph.
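The evaluation order described above can be sketched for a generic odd polynomial. The coefficients here are placeholders, not the article's atan2f constants; the point is the structure: two independent Horner chains over x^4 that a pipelined FPU can overlap, with the first-order term added last:

```c
/* evaluates x + c3*x^3 + c5*x^5 + c7*x^7 + c9*x^9 */
float poly_odd(float x, float c3, float c5, float c7, float c9)
{
    float x2 = x * x;
    float x4 = x2 * x2;
    /* two independent chains over x4, evaluated in parallel */
    float odd_pair  = c3 + x4 * c7;   /* c3 + c7*x^4 */
    float even_pair = c5 + x4 * c9;   /* c5 + c9*x^4 */
    /* x*(...) can overlap the higher-order work, and the
       final + x keeps roundoff beyond the lsb of the result */
    return x + x * (x2 * (odd_pair + x2 * even_pair));
}
```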
When the operands have been inverted, the final subtraction from π/2 should be performed in double precision to avoid losing a bit of accuracy. On typical Sun or HP systems, the compiler will generate efficient code for this sequence of operations only if the constant is fetched first, as shown.
One of the comparisons at the end is performed by integer arithmetic, since we already made integer copies of the arguments to get around the lack of an inline fabsf. Integer comparison is much faster on most processors, and there are no other operations to pipeline with the comparison in this position.
Rather than use entirely different approximations for acosf and asinf, these functions invoke sqrt and atan2f either indirectly or by macro expansion. If an inline sqrt is available, this method is as efficient as any, as well as being consistent and economizing code. The alternate choice
acosf(x) = π/2 - asinf(x)

can cost some accuracy.
cosf, sinf, and tanf
All three trigonometric functions can be calculated at once by the hidden function _Sinf. Using the half-angle tangent formulae, most of the calculations are shared by these functions, as shown in Listing 3. We start by scaling the argument by 1/(2π), after promoting to double. This is the first of several points where different methods would be required to preserve accuracy if we were not using extra precision. Many lines of code are required to subtract off the nearest integral part of the scaled argument, allowing for the various FLT_ROUNDS cases. If your system conforms to the IEEE standard and doesn't allow changing the rounding mode at run time, the compiler should be able to eliminate all but the FLT_ROUNDS == 1 case.
The formulae are based on the calculation of tan(x/2) by a rational approximation. The scaling by 0.5 and the original scaling by 1/(2π) are accounted for in the coefficients, which are scaled to make the smallest coefficient one.
After elimination of dead code, there are no conditional branches other than the one used to implement a "copysign" operation, and the code pipelines well. HP PA-RISC parallel multiply and add instructions are used effectively, particularly if the order of source code operations is optimized experimentally.
The three values cosf, sinf, and tanf are returned in a struct. If a masking macro is used to select the desired component, an optimizing compiler should be able to eliminate redundant calls to _Sinf, if you can convince the compiler that there are no side effects. Otherwise, you have to call _Sinf explicitly and keep the desired values to gain the potential benefit when you need two or more of the results for the same argument.
With a good optimizing compiler, or a gcc-like macro facility, you might choose to return the three values from which the various trig functions are calculated, using a macro to perform the final division. This would save some time where not all the return values are used, and might allow the division to run in parallel with code from the calling function. Calculation of the functions secant, cosecant, and cotangent could be optimized.
Certain compilers generate poor code when returning float values in a struct, storing first into a temporary memory location and then copying from there to the struct return location. Better results seem to be achieved by having the function return a pointer to an array of float values, but this clearly would require use of writable static storage in the library, an unsavory practice.
This source code can be used in a multiple-copy function, typically to generate sinf and cosf of two arguments at once. A little time is saved by pipelining the first divide instruction with the calculation of the second group of data. On more deeply pipelined architectures, multiple-copy versions or inline compilation could provide a significant benefit. One of the means for facilitating in-line compilation is an extended macro facility such as gcc provides. Like the FORTRAN inline statement functions, this facility can deal with side effects and reduce the optimization workload on the compiler.
coshf
Hyperbolic cosine could be written as a macro in terms of expf, but not quite in the obvious way. The macro would read
#define coshf(x) (expf(fabsf(x)-(float)M_LN2) \
    +.25f/expf(fabsf(x)-(float)M_LN2))

The reason for subtracting log(2) from the argument is to avoid premature overflow in expf, but this costs accuracy for small arguments. This range vs. accuracy problem is solved by using a hidden function _Expf (Listing 4) to calculate expf(x) in double precision, to increase accuracy and range beyond the limits of float precision, but not to full double range or accuracy.
The visible function of Listing 5 checks for NaN arguments because we don't always want to do this test in _Expf. The expression x != x tests for a NaN on IEEE-compliant hardware. (If the compiler doesn't optimize it away, that is. If the compiler is generating code for non-IEEE hardware, it may as well consider this as dead code.) _Expf returns the exponent ceil(log2(xx)), which allows us to determine whether there will be overflow or underflow and set errno accordingly, without having to test the final result.
The factors 0.5 are distributed so that the multiplication and division can be parallelized. We write the division first because we expect it to take longer and want it to start in the pipeline first.
expf
The visible expf function (Listing 6) is consistent with coshf. It too could be replaced by a macro, if we were not concerned with error reporting. _Expf passes data through a double pointer, because the function result is used to report the exponent field, for use in determining whether overflow or underflow will occur. double data is used because the typical use of _Expf is within the powf function, where the argument is obtained by arithmetic on the result of a log function. As exp strips a number of high-order bits off the mantissa of the argument and delivers them in the exponent of the result, it is useful to have at least 32 bits of precision in the argument, in accordance with IEEE extended single precision. The _Logf function supplies this precision.
Since the argument datum has been forced to memory already, the sign bit can be retrieved quickly and used as an index into an array of rounding constants. The nearest integer multiple of log(2) is subtracted from the argument, with the code written to maximize the opportunity for a peephole optimizer to shortcut the cast from double to int to double. Ideally, we would do something to protect against overflow in the cast to int in case the argument is too large, but it is unlikely to overflow an int if it originated as a logf result, with maximum possible value log(FLT_MAX).
The mantissa of the result is calculated by a simple rational approximation, with no precautions needed for accuracy other than carrying out beyond float precision. The exponent field is limited so that it is within the range of a valid double but well outside the range of finite float values. This limiting is done by integer instructions, which can run parallel with the floating-point operations.
The rational polynomial has four terms each in the numerator and denominator. Because of the symmetry

exp(-x) = 1/exp(x)

the numerator and denominator are the same except for a change of sign in the odd-order terms. Scaling the coefficients in the rational polynomial so that the smallest coefficient is one allows evaluation in the least number of operations. If the polynomial were evaluated in float precision, it would be necessary to scale so that the first-order coefficient becomes one, allowing a += operation to conserve accuracy; accuracy still would be degraded by the final division. For this reason, published versions of exp manipulate the formula so that the final operation becomes an add of 0.5. This takes longer on a pipelined processor, eating up the time we might have saved by not casting from float to double and back.
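The structure of this symmetric rational form can be sketched with generic coefficients. The constants below are the [3/3] Padé approximant of exp, scaled so the smallest coefficient is one; they illustrate the shape of the calculation, not the article's tuned constants, and are accurate only for a reduced argument with |x| no larger than about ln2/2:

```c
/* exp(x) ~ N(x)/N(-x) with N(x) = 120 + 60x + 12x^2 + x^3;
   the even-order and odd-order terms are computed once and
   combined with opposite signs for numerator and denominator */
double exp_reduced(double x)
{
    double x2   = x * x;
    double even = 120.0 + 12.0 * x2;   /* shared even-order terms */
    double odd  = x * (60.0 + x2);     /* odd terms flip sign below */
    return (even + odd) / (even - odd);
}
```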
The exponent scaling in _Logf and _Expf can use shortcuts. The sign is known to be positive, and there is no need to deal directly with overflow or underflow, since the range of double is much wider than that of float.
Positive and negative arguments are treated the same. When writing source code, you should try not to divide by expf, as a faster and sometimes more accurate result can be obtained by changing the sign of the argument and multiplying.
logf and log10f
The functions of Listing 7 invoke, normally by a masking macro, the function _Logf, which has a double argument and return value, as in traditional C. We can bypass the issue of subnormals by casting immediately to double. As I mentioned above, we need an extra three digits of precision in the result in certain contexts, just to avoid throwing away information for large or small arguments. We start out flipping back and forth between checking for error cases and working on the expected case of a finite positive argument. This is done for the benefit of architectures that have a long pipeline for floating-point comparison; no harm is done by calculating as if everything were normal while testing for exceptions.
The first complicated operation extracts the exponent field, determines what adjustment needs to be made to scale the argument as close as possible to 1.0, and performs the scaling. If the fractional part of the argument is less than sqrt(0.5), the scale factor is incremented. By masking off the fractional part and performing an integer compare, we avoid having to generate a floating-point version of the fractional part. The integer comparison uses only the high-order 21 bits of precision, as the boundary where the scaling changes doesn't need to be exact.
We hasten to get the divide into the pipeline before the final two comparisons for exceptions. Even in this situation, where it seems not to save any operations, calculating the 2*r term separately improves parallelism.
powf
In float precision, we slip by with a relatively simple powf function (Listing 8). Since the C language definition allows zero and negative numbers to be raised to integral powers by powf, we have to separate out these cases before taking a log. On the assumption that integral powers will usually be small enough that raising to a power by multiplication will be faster than logf and expf, the same method is used for all integral powers. The number of operations for raising to a power this way increases only logarithmically with the exponent. If the exponent is not integral, we calculate the result in a straightforward way, using _Logf and _Expf so that accuracy is preserved by keeping all the intermediate calculations in double precision. ERANGE is reported if the result is calculated to be Infinity or zero. The possibility of gradual underflow is assumed, so that non-zero numbers may be as small as
2^(FLT_MIN_EXP-FLT_MANT_DIG)

A function similar to _Expf using base 2 instead of base e may run slightly faster when used along with a base-2 _Logf. For accuracy and efficiency, it is best to avoid negative integral powers, because these require powf to invert the result. As with expf, for non-integral powers, division by powf should be avoided by changing the sign of the second argument. While it may seem annoying to have the optimum usage of powf change between integral and non-integral negative powers, powf should be avoided entirely for the usual small integral powers which are known at compile time.
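Raising to an integral power with a logarithmic operation count is binary exponentiation, which can be sketched as follows; ipowf is an illustrative name, and the final divide for negative powers is the costly case the text recommends avoiding:

```c
float ipowf(float x, int n)
{
    /* 0u - (unsigned)n negates safely even for INT_MIN */
    unsigned m = n < 0 ? 0u - (unsigned)n : (unsigned)n;
    float r = 1.0f;
    while (m) {            /* one squaring per exponent bit */
        if (m & 1u)
            r *= x;
        x *= x;
        m >>= 1;
    }
    /* a negative power costs one inversion at the end */
    return n < 0 ? 1.0f / r : r;
}
```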
sinhf
sinhf (Listing 9) is much like coshf, with one additional hurdle. If we try the obvious
0.5 * (exp(x) - exp(-x))

even in full double precision, accuracy is poor for small arguments because the leading terms cancel out after swamping the important ones. One solution found in the literature, and even in computer architectures such as the Intel 8087, is a function which calculates exp(x) - 1. The applications for such a function can be done faster or better some other way. Satisfactory rational polynomials for sinh exist, but there is no compensation for the extra time taken by division. A perfectly good Chebyshev economized polynomial can be found for the range of arguments where the exp approach doesn't work.
sqrtf
I might not even have written Listing 10 if my C compiler supported inline SQRT instructions. There is little reason to have sqrtf in addition to sqrt, and this function is an exception to the rule of float functions being more straightforward than the standard double ones. The greatest economization is in not needing to take care of arguments between zero and DBL_MIN. In order to satisfy all the Paranoia tests, the sqrt calculation has to be carried out nearly to full double precision before rounding to float. The special cases to be handled are negative arguments, which are invalid, and zero arguments, where the IEEE standard requires the result to be the identical flavor of zero. In the error case, the result should be a NaN. If the compiler generates correct code, one way to get a NaN is by casting a large double constant to float, generating Infinity, and multiplying by zero. Optimizing compilers should generate the NaN constant at compile time, possibly with a warning message.
If the argument is finite, we go through some gyrations to obtain an initial approximation by making linear curve fits over the intervals 0.5 to 1 and 1 to 2. The low-order bit of the original exponent selects the interval. The code evidently depends on the floating point representation being FLT_RADIX == 2.
With an initial approximation good to two significant figures, the accuracy can be improved by using Heron's approximation to double the number of significant figures as many times as needed. Since the basic algorithm gets hidden in the final code, I show it here as
sqrt = (operand/sqrt + sqrt) * 0.5

This formula goes back to antiquity, being credited to Heron; it can also be derived by the Newton-Raphson method, a direct application of basic calculus. Software sqrt functions intended for machines with floating-point arithmetic hardware always use an algorithm based on Heron's formula. Hardware implementations could use the long-division method to produce a correctly rounded result in less time than a plain divide.
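The iteration can be sketched directly. The fixed count of three passes assumes an initial guess good to about two significant figures, as in the text; each pass roughly doubles the number of correct figures. The function name is illustrative, and a real sqrtf would fold the iterations together rather than loop:

```c
double heron_sqrt(double x, double guess)
{
    double y = guess;   /* assumed good to ~2 significant figures */
    int i;
    for (i = 0; i < 3; i++)
        y = (x / y + y) * 0.5;   /* Heron's formula */
    /* accuracy roughly 2 -> 4 -> 8 -> 16 figures */
    return y;
}
```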
The initial approximation is fixed up with the correct exponent, and iterations are performed to get the final result. We save a little time by combining three iterations into one, rationalizing to trade two divides for seven multiplications. After taking parallelism into account, we find a net gain if divide takes three or more times as long as multiply. On typical RISC systems, divide takes at least five times as long as multiply, so it is worth avoiding.
The next to last approximation has to be within 0.5*FLT_EPSILON in order to obtain the correctly rounded result for sqrtf(1+FLT_EPSILON), as required by the IEEE Standard. In effect, this means that the final result has to be accurate to 2*FLT_DIG, or nearly full double precision, before casting to float. There is just enough extra range in IEEE double to prevent overflow in this formula, and enough extra precision that roundoff errors of several times DBL_EPSILON don't affect the result.
With one less iteration, the accuracy is within FLT_EPSILON, but the result occasionally rounds up when the IEEE standard requires it to round down. If the iterations are not widened to double, the result still rounds to one of the two nearest numbers representable in float precision, but frequently comes to the less accurate of the two.
Since a hardware SQRT should be simpler than a divide, it seems a pity that C tradition assumes that we don't need it, even when most modern architectures have it. Certain run time libraries have shortcut handcoded functions for fabs and sqrt, so that the time spent is not too great. These handcoded functions will not appear in profiling data collection.
tanhf
tanh, like sinh, has an accuracy cancellation problem for small arguments. Because the function is by nature a ratio of two functions which are best approximated in terms of rational approximations, no simple Chebyshev series is adequate. The approximation in Listing 11 for small arguments is found by plugging the approximation used in _Expf into the formula for tanh. In two places we have a choice between an addition and a multiplication, and choose the addition. In both places, (xx+xx) yields the same result as 2*xx, but the addition is faster on a RISC machine because the need for the constant 2.0 is eliminated. Also, the first add might be performed in parallel with a multiply on machines like the HP. In the second place, we could multiply the result of _Expf by itself if we hadn't doubled the argument. Since we are using double precision, to be rounded to float later on, we don't care about differences in range or roundoff.
Certain compilers for RISC computers automatically change (x+x) to 2*x or vice versa, depending on the decision of the compiler writer. This is done because older systems generally performed better with a different choice than the newer architectures favor, and it is more convenient to write *2 without having to think about optimization.
If the result of _Expf is large, we return the result 1.0. We could have saved some time by not invoking _Expf, but we assume that exceptional cases will be so infrequent that the time which they take can be ignored.
Testing
Plauger's tmath2.c and tmath3.c test drivers can be adapted by adding f to their names, making tmath2f.c and tmath3f.c. The tolerance, which is set in the existing code at 4*DBL_EPSILON, must be changed to one based on FLT_EPSILON. All tests should pass at 1*FLT_EPSILON. Likewise, the error of these approximations relative to the standard double functions should be less than FLT_EPSILON. You can also make tmath1f.c from tmath1.c, but using macro expansions results in many of the tests being essentially done at compile time. If they are done symbolically or in different precision, trying to use addition and subtraction of 1.0f/FLT_EPSILON to compute ceilf or floorf will not work.
The final group of macros in <math.h> (Listing 1) is unsafe against side effects. If you have a compiler such as gcc which provides means for making them safe, you will want to have special versions which are selected automatically for that compiler. The relevant gcc extension consists of the ability to have a new block which returns a value from within the macro, so that new local variables can be defined.
The results of these functions can be compared with the standard double functions, and should show a relative error less than FLT_EPSILON. They will not necessarily duplicate the results obtained by casting the double result to float; that could easily be off by 2*FLT_EPSILON.
The Paranoia series of tests for compliance with the IEEE floating-point Standard is particularly severe for sqrtf, as well as testing some extreme cases of powf. I used a variation on the original Paranoia FORTRAN, with the f2c translator, modified to use include files in normal C style, so that the float precision C functions are invoked.
Conclusion
These float math functions provide maximum possible accuracy in float precision with a significant improvement in speed and reduction in code complexity compared to the standard double versions. They should work interchangeably with equivalent functions which might be provided by other libraries. If questions of accuracy or speed arise with the library provided with a compiler, these functions can be tested to determine their effect on an application.
References
Stevenson, D., et al. "A Proposed Standard for Binary Floating-Point Arithmetic," IEEE Computer, 1981.

Harbison, S.P. and Steele, G.L., Jr. C: A Reference Manual, 3rd edition, Prentice-Hall, 1991.
Plauger, P.J. The Standard C Library, Prentice-Hall, 1992.
Sidebar: "Why So Many Math Functions?"