February 1996/Quad-Precision Math Functions, Part 1

Features

Quad-Precision Math Functions, Part 1

by Dr. Tim Prince

Introduction

Several implementations of ANSI C include a software implementation of long double with 113 bits precision. Standard math function libraries of this precision are not included in all of these distributions. Where they are present, they may not always be named in accordance with the rules of ANSI C. (The names should be sinl, expl, etc.) Or they may not have had the benefit of the extended development process to which standard double math libraries have been subjected.

The algorithms you should use for writing long double math functions are similar to those required for the standard double functions, except that small parts of the evaluation may be speeded up by narrowing the calculation to double. (By contrast, it seldom pays to perform parts of double calculations in float precision.) In my experience, these long double functions are peculiar in that more terms in series approximations are needed for best root-mean-square error than are required to minimize worst-case errors. The number of terms required in a series approximation is not as well defined as it is for the lower-precision versions of these functions.

IEEE Standard 854 (IEC 559) defines certain relationships between the characteristic parameters for the various floating-point data types. While Standard C doesn't require this form of arithmetic, many popular implementations today are (mostly) compatible with IEEE 854. For example, if long double has a representation distinct from double, it should have a sign bit, at least 14 bits in the exponent, and at least 64 bits precision. With 128 bits of storage, the standard can be met with a sign bit, 15 bits in the exponent, and 113 bits of precision. The extra bit is a "hidden bit" that is always 1, and hence need not be in the stored representation.

This definition of long double has been adopted by several system vendors, and is often termed "quad" precision, as it uses twice the storage of double precision. I present here a complete set of the Standard C math library functions in quad precision. All are written with series approximations for the precision LDBL_MANT_DIG = 113. If they are available anywhere else in the literature, I have yet to see them.

I tested the functions using a version of the "elefunt" (for elementary functions) test suite developed by Cody and Waite [1]. The suite was translated into Standard C for modern computer architectures. (See sidebar.) In some cases, I compare the resultant functions with those available from Hewlett Packard (HP).

The Functions asinl and acosl

These functions may be calculated in terms of the functions atan2l and sqrtl functions (see Listing 1) . In principle, you can use the expressions:
asinl(x) = atan2l(x, sqrtl(1 - x * x));
acosl(x) = atan2l(sqrtl(1 - x * x), x);
But in practice, the expression 1 - x * x should be avoided. The form (1 + x) * (1 - x) is more accurate when x is close to 1.

You can also write specialized versions of each of these functions that evaluate directly a series approximation for the function. According to the elefunt test suite, a specialized asinl function was able to reduce errors made by more than one bit compared with this substitution of atan2l. The specialized asinl function follows the organization suggested in Plauger's text, using a Chebyshev economized polynomial in place of the textbook rational approximation. However, the maximum absolute and root mean square relative errors were not improved by this means. In my experience, asinl would see so little use that it is not worth the trouble to write a specialized function. A version is included on this month's code disk but not shown here. A specialized function for acosl produced significantly worse root mean square errors. This may have come about through the use of the relationship:
acosl(x) = pi / 2 - asinl(x)
which will produce cancellation errors if precautions are not taken.

In my tests, sqrtl was written with Newton iterations. (Guess an answer, divide it into the number, and average the guess with the quotient. Each iteration doubles the number of bits of precision.) This method is slow and increases rounding errors slightly, compared to the results that would be expected with a correctly rounded sqrtl. However, elefunt reported better accuracy than was achieved when HP's FTN_QSQRT function was employed.

These substitutions of atan2l and sqrtl may be written as a macro, provided that precautions are taken against side effects from multiple evaluation of the argument. This is probably not worthwhile unless sqrtl is a machine instruction that can run in parallel with other code. Tests on an HP system produced slower running code with macro substitution.

The Functions atan2l and atanl

I took the algorithm for atan used in Unix 4.3BSD and extended it to quad precision with slight modifications. It uses a polynomial approximation for small values (|x| < 1/3) of the argument and uses a trigonometric rotation formula for rotations to cover the full range |atan (x)| < pi. Quad precision is not needed for comparisons or the two highest-order terms of the series, but more than quad precision is needed in the final addition of the formula:
atanl(y/x) = atanl(a) + atanl((y-x*a)/(x+y*a))
Constants such as pi are represented in more than long double precision, possibly up to long double plus double precision. Values of a are chosen to be multiples of 0.5 in order to minimize roundoff errors.

A factor FUDGE was inserted to compensate for remaining roundoff error. This factor may be adjusted by trial and error. The smaller piece of each constant in the extended precision scheme is on the order of 0.4 * DBL_EPSILON times the larger, so the FUDGE correction is on the order of 0.4 * LDBL_EPSILON of the final rotation. At best, it can bias the results to center the result band closer to the correct values.

I was originally rather stingy in selecting the number of terms in the series. My comparisons showed, however, that a smaller FUDGE factor was optimum when one more term was used. Elefunt tests showed smaller maximum errors but larger root-mean-square errors with the additional term.

Where type double, characterized by DBL_MANT_DIG==53, is sufficient, the FPCE extension type double_t is used. (See Thomas and Coonen [6].) This is more a matter of documentation, to show that we don't care about the exact precision, than of portability. The intended use of this FPCE definition is to facilitate effective programming of various system architectures with LDBL_MANT_DIG <= 64, where long double arithmetic might be faster than double.

I found coefficients for the series by fitting a Chebyshev economized polynomial to the function atan(x) / x. I fit the curves using a translation into bc (the Unix utility) of the code published by Press, Flannery et al. [5] . Interval boundaries have been shifted so that the reduced range is at most [-1/3, 1/3]. Just enough terms are taken so that the leading term rounds off to exactly 1 in long double. The series is written as x + x^3 * (....) rather than as x * (1 + x^2 * (...)) for improved accuracy. When evaluated this way, only the roundoff error from the final addition is significant, as the other roundoff errors are no more than 1/27 as large. According to the elefunt tests, correctly rounded results appear to be obtained on 60 per cent of the individual trials.

In general, there is a discontinuity of the order LDBL_EPSILON at the points where the change occurs from one series range to the next. For this reason, I prefer not to increase the number of intervals in which the function is evaluated, although the function could be speeded up that way. This discontinuity increases if the extra precision treatment of the pivot angle constants is dropped, although this shortcut does not greatly impact the elefunt tests.

Generally, atan2l(x,y) is more accurate than atanl(x / y), as can be seen by comparison of elefunt tests which use these two variations. In the present implementation, when the magnitude of x/y exceeds 1, atan2l uses y / x directly rather than 1 / (x / y), which takes more time and generates an additional roundoff error.

The Function coshl

In order to prevent premature overflow, this function is represented as:
coshl(x) = expl(fabsl(x) - logl(2))
           + 0.25 / expl(fabsl(x) - logl(2))
This formula occasionally introduces significant errors, so the more obvious expression
coshl(x) = 0.5 * expl(x) + 0.5 / expl(x)
is used over most of the range. Then, when the formula with logl(2) offset is used, the smaller term is negligible. The scheme shown by Plauger [4] , in which a hidden function _Exp returns an accurate value for 0.5 * exp(x), may be marginally better technically.

The Functions cosl and sinl

I extended Plauger's published code [4] for cos and sin to long double. A slightly different method is used to translate the argument into the first eight octants, and then into the first or last octant.

The problem of finding the difference between the argument and the nearest multiple of pi / 2 is complicated by the fact that the value INT_MAX is far too small compared to the largest odd integral value that can be represented in long double format. That value is 2 / LDBL_EPSILON - 1. The integral value must be found in long double arithmetic and, in general, the high-order bits must be masked (subtracted) off before it can be cast to int in order to test the remaining low-order bits.

For completeness, the code is shown for all of the usual FLT_ROUNDS modes. For architectures that support unnormalized long double constants, it is enough to add a constant with an exponent field the same as 1 / LDBL_EPSILON but value zero, without testing the sign of the argument. The code shown will fail when presented with arguments not less than pi / LDBL_EPSILON, but that is immaterial since there is at most one value represented for each segment of the curve fit in such a case. That situation may be considered a range error (though the C Standard doesn't say so).

Elefunt tests exercise the function around pi / sqrt(LDBL_EPSILON), where 32-bit integer arithmetic is insufficient for range reduction, but results should be correct to within sqrt(LDBL_EPSILON). Apparently, the idea of this test section is to see whether sequential values are produced. All that can be expected of the test beyond the limit of accurate range reduction is that the function value fall within acceptable limits.

I also show alternate code that makes use of HP's FTN_Q... function library, to show how built-in language support might be provided for these operations. Although awkward, the use of <float.h> allows the necessary operations to be carried out in a portable way on modern architectures. In the absence of builtin system support for this operation, the resulting code is reasonably efficient.

The quarter period pi / 2 has to be represented accurately to more than long double precision. This is done as consistently as possible with the method used in atan2l. Elefunt tests of range reduction require several bits beyond long double precision to produce satisfactory results.

The function evaluates Chebyshev economized polynomial approximations to sin(x) / x or cos(x), depending on the octant. Discontinuities of order LDBL_EPSILON occur at the joins, at odd multiples of pi / 4. These joins are not all at the same places as the ones in the atanl function.

Coming Soon

Next month, I'll show the remaining math functions in quad precision. I'll also summarize the results of testing these functions.

References

[1] William J. Cody, Jr. and William Waite. Software Manual for the Elementary Functions (Prentice-Hall, 1980).

[2] D. Stevenson, et al. "Proposed Standard for Binary Floating-Point Arithmetic," IEEE Computer, 1981.

[3] S.P. Harbison, and G.L. Steele, Jr. C, A Reference Manual, 3rd Edition (Prentice-Hall 1991).

[4] P.J. Plauger. The Standard C Library (Prentice-Hall, 1992).

[5] Press, Flannery et al. Numerical Recipes, The Art of Scientific Computing (Cambridge University Press, 1986).

[6] J. Thomas and J.T. Coonen, "An Introduction to Floating-Point C Extensions," C/C++ Users Journal, January: 1995, pp. 49-57.

Tim Prince has worked in turbomachinery aerodynamics for over 30 years. For the last five years, computational fluid dynamics has been a large part of his job. He still does some FORTRAN program maintenance and general UNIX work. In his spare time he tries to explain to his wife and son they probably know more about the software on their PC than he does.