Features


Double To (And From) String Conversion

Dr. Timothy Prince


Timothy Prince has a B.A. in physics from Harvard and a Ph.D. in mechanical engineering from the University of Cincinnati. He has 25 years of experience in aerodynamic design and computation. He can be contacted at 39 Harbor Hill Dr., Grosse Pointe Farms, MI 48236.

The articles by Banning, CUJ February 1988, and Asaf, CUJ September 1990, have stirred up some interest in the subject of floating point binary to ASCII string conversion. I investigated several runtime libraries and learned that this subject deserves more attention. I will try to answer some of the questions that arise concerning how much accuracy is possible without giving away too much in speed and complexity, while using code that doesn't depend excessively on specific architectures.

Applicable Standards

The IEEE floating point standards 754 and 854 are not implemented in any source code available to me, but one can improve many libraries without the complication of full adherence to the standard. There are performance problems in trying to adhere to the IEEE standard for string conversion, particularly on architectures that do not support long double, so few vendors' libraries meet the standard fully.

The most difficult aspect of meeting the IEEE standard is to convert correctly a sufficient number of digits to permit accurate conversion back to binary. For the usual IEEE double precision, this requires the equivalent of a "%.17g" format. The IEEE standard implies reversibility of conversion for numbers between 10^-10 and 10^+44. This is the range within which long double arithmetic is sufficient for accurate scaling to a 17-digit integer. With typical numeric coprocessors, only nine-decimal-digit integers and long double registers are available, so the practical upper limit of correct conversion is 10^+31. Fortunately, we can shift gears in the conversion algorithm and extend the lower limit downwards. Without extended precision, satisfying the standard is difficult, even in the range from 10^+8 to 10^+19.

Outside the range of correctly rounded conversion, the IEEE standard allows relative errors of 0.47*DBL_EPSILON beyond the correctly rounded value, where DBL_EPSILON is defined in <float.h>. Converting from binary to string and back to binary should produce a maximum relative error of DBL_EPSILON. Normal implementations of floating point arithmetic involve additional roundings, each introducing an error up to 0.5*DBL_EPSILON. With careful coding, violations of the IEEE standard can be held to a level of practical insignificance, without paying more than a 20 percent penalty in execution time or requiring the simulation of extended precision arithmetic.

As far as C is concerned, taking the statements of Kernighan&Ritchie and Harbison&Steele into account, the final digit in %f, %e, or %g formats should be the result of a rounding. %g formats for results less than 0.0001 should be presented in %e form in order to maintain precision. The maximum precision that should be available is not specified by the C references. The IEEE standards specifically leave the behavior up to the implementor beyond the number of digits required for reversible conversion.

Why Try For Accurate Conversion?

In my experience, there are several contexts in which accurately rounded decimal output improves the usefulness of application software. First, large tabulations of numbers are more easily read if they can be produced with the minimum necessary number of digits. If the numbers are not rounded correctly, an additional digit may be needed. Second, production of portable and legible software demands that constants be expressed in decimal rather than octal or hexadecimal notation. Since the IEEE standard requires the ability to convert from binary to character string and back to binary without error over a reasonably wide range, this should not give away any accuracy. Third, a common way of testing numerical software is to produce tables of numbers to be compared with a trusted standard. This approach requires proper rounding to form sensible conclusions. Last, and perhaps not least, transferring data between machines that lack binary compatibility requires full accuracy with a decimal medium of transfer.

String To Binary Conversion

ASCII to binary conversion seems less mysterious, even for floating point, as each digit character is easily converted to integer, multiplied by the appropriate power of 10, and added in to the result. Still, a number of software vendors have made errors in the standard function atof, such as stopping the conversion short of the number of characters needed, or failing to provide for more than two digits in the exponent. The natural way to sum the terms of the series, in decreasing order of magnitude, causes the roundoff error to grow. As there is no roundoff error in the conversion of integer values up to 1/DBL_EPSILON, precautions are not needed until the fractional digits are introduced.

Extended precision helps avoid roundoff errors. Otherwise, terms beyond the decimal point should include correction of the roundoff error. The recursive scheme shown works to eliminate error growth as long as small terms are added. The addition error of each term is corrected in the next step. Such schemes for accurate evaluation of a series pay a particularly large penalty on superscalar architectures, as there is no parallelism available in the error correction.

You can write the loop with a break to skip the useless last correction. This not only gets out of the loop quicker but also improves parallelism on pipelined architectures. The isdigit manipulations for the next loop iteration can be performed while the current loop iteration is finishing. Tests on pipelined architectures show no advantage in splitting the loop so that the order of addition may be reversed. The technique (practiced by pipelining compilers) of splitting the sum to increase parallelism may defeat the attempt to perform operations in the most accurate order.

Some Cycle Eaters From The Past

Under the GCOS operating system, the double to character string formatting functions made three or four linear searches through the table of powers of 10 for each number to be converted. By comparison, any reasonable scheme is an improvement.

Early versions of UNIX did not do the sensible thing of scaling numbers directly into the range where the desired characters can be split off. Rather, they would generate a long internal string of characters, possibly with many leading zeros. The string could be as long as DBL_DIG - DBL_MIN_10_EXP + 2 digits. Then the desired string of digits would be picked out of the middle, using the digit following to perform rounding. This was bad enough when the string could not be more than 55 characters. With IEEE double, the scheme became ridiculous, but there still are public domain examples of such code.

IBM mainframe software may generate a rounded string of 16 digits internally, then select a substring as specified by the format. The effort of rounding off to 16 digits probably helps to give maximum accuracy, but — supposing that 15 digits are requested — the last digit occasionally will be incorrect. So the effort of rounding, considerable on IBM System/370, has been wasted. There are also nasty surprises when a long double argument is passed to printf or log, possibly giving results that are correct in the first dozen digits only.

What Goes On Below printf

The subject of this article is double to character string conversion. That is covered by the C standard only with regard to the printf family of functions. Many libraries have a function such as ftoa, named by analogy to atof, but there is no standard. The BSD and UNIX System V libraries share entry points named ecvt, fcvt, and gcvt with adequate compatibility. Several 8086 family compilers use a similar, not necessarily compatible, system. Normally these invoke a lower level function such as cvt that differs somewhat between vendors, but typically is used directly by printf or the equivalent function in other languages.

In the code I show here, I have tried to strike a balance between the goals of simplicity and speed. Hardware dependent issues are ignored other than to assume that a long int can contain at least nine digits and that 18 significant digits will be enough. These assumptions hold for all cases where a double uses 64 bits or less and long contains at least 30 bits in addition to the sign bit.

Table Vs. The pow Function

The algorithm requires tables of powers of 10. Functionally equivalent code could be written using the pow function, assuming that it produces correctly rounded results for positive integral powers of 10. The use of a table should be faster, at the expense of requiring extra memory. It can also be more accurate, if the compiler uses software extended precision to generate IEEE-compliant binary values. On the other hand, a long double powl function would be more accurate, since powers of 10 greater than 10^+23 could be produced without roundoff error.

The scale factor for %g format can be found by computing the base-10 logarithm either by calling log10 or by searching a table. Profiling has shown that the time spent in log10 is insignificant, so it doesn't make sense to use more complicated code. Unfortunately, on machines with IEEE gradual underflow, it may be difficult to calculate log10 for numbers that have partially underflowed. You will probably have to fix log10 if you want this code to work properly with a library that has taken short cuts. Since the result of log10 may not agree exactly with the table, the code performs a comparison with the table values. They should bracket the argument, so the index is adjusted accordingly.

Many Ways To Divide It Up

On an architecture that does not support long int in hardware, one might prefer to convert the double to a character string by splitting one character off at a time. You can start at either end. Either approach allows the string conversion to be stopped when the necessary number of characters has been generated.

Starting at the high-order end allows all conversions in the range 10^-21 to 10^+23 to be performed using exact powers of 10 from the table. A loop back is required to carry the results of rounding the last digit up to the higher digits. There are quite a few operations involved, including a cast from double to int and from int to double, and a double add and multiply for each digit. This method is satisfactory when using extended precision and when a small loss in speed is acceptable. It doesn't require any of the less common operations, such as integer divide, which must be implemented in software on many architectures.

Starting at the low-order end avoids the loop back for carry on round, but it is even slower. This approach requires double divide, floor, cast to int, multiply, and subtract operations for each digit. floor is seldom available except as a library function call, and divide is not sufficiently accurate without extended precision. This is the slower and less accurate of the alternatives considered, although it is supported by the 68881 and 8087.

I chose to split the double into two long ints, with binary rounding on the low-order part. This allows maximum space to guard against roundoff errors, by allowing calculations involving the high- or low-order parts to expand into double or long double storage. The rare carry over from the rounding takes only one statement, with no looping. Converting the two long ints to character strings in one loop takes advantage of parallelism on certain modern processors. One of the compilers tested automatically unrolls the loop by three, so that only one conditional exit from the loop occurs per six characters generated. The disadvantage of this method is the need to convert 18 digits even when fewer are called for, with the surplus leading zeros discarded in the final scan.

The first group of digits is split off three different ways. For large numbers, the program splits off high-order digits and multiplies them back to be subtracted from the original number. This multiplication is relatively accurate, often exact. In cases where the program expects the high-order digits to be zero, it skips this step. For small numbers, the program splits off high-order digits in two steps so that the first scaling can be done by a smaller power of 10. This allows the use of exact powers of 10 down to 10^-18. For intermediate numbers, which don't fit the first two methods, the program splits off high-order digits in one step, always using exact powers of 10.

Four or more floating point operations are required when splitting off one digit at a time. With this approach, there is only one ldiv execution per digit, with the floating point operations being performed only once. Even on processors that simulate integer division by floating point operations, splitting two characters at a time off an integer is faster than splitting a single character from a double, when more than nine digits are required.

If all hardware facilities were available to the C programmer, there would be several architecture dependent ways to carry out the long int to string conversion. On an Intel 80x87 numeric processor, the scaled long double register can be converted directly to a packed decimal string. Other processors have "decimal adjust" or similar instructions to speed conversion from integer binary to packed decimal.

When writing a long int to string conversion in C, the new ldiv function can be used, although standard optimization techniques should allow a compiler to generate efficient code without it. This code will employ ldiv automatically if __STDC__ is defined. Without ldiv, you may have to experiment to determine whether the % operator is faster than a multiply and subtract.

Accuracy Of Results

The results appear to satisfy the IEEE standard between 10^-18 and 10^+31, when the calculations are performed in long double. Outside this range, errors may exceed the IEEE limits by 0.5*DBL_EPSILON, due to the inaccuracy of the powers of 10 table.

When extended precision is not used, mostly correct IEEE results are still obtained in the range 10^-4 to 10^+31. Between 10^+8 and 10^+22, there should be no errors. Outside this range, errors larger than the IEEE standard are frequent, but this seldom leads to an error of more than DBL_EPSILON when converting back to binary. Errors of similar or greater size occur in character string to double conversion, so there is little incentive to improve these results. These errors are within the uncertainty of normal calculation sequences using double arithmetic.

On the MIPS R3000 chip, multiplication of a subnormal by a large number does not generate a normal number. This code results in leading zeros (which are suppressed by the final loop) showing the reduced precision. The R3000 produces unsatisfactory results when attempting to perform conversions close to the points of abrupt underflow or overflow, although satisfactory results are obtained down to DBL_MIN*FLT_EPSILON.

Portability And Differences

In the gcvt code (Listing 2), leaving trailing zeros in place was simplest. It also helps in maintaining column alignments and expressing the precision, both of which are weak points in traditional UNIX libraries. As a compromise, one trailing zero is stripped here, because rounding up may cause the length of the string to increase, and trailing zeros may degrade the accuracy of conversion back to binary.

The code works as written on SGI, Sun, and Multiflow compilers. The only include file required that is not present on these systems is <float.h>. Under SunOS, you are not allowed to profile using functions that have standard library names. The code to initialize the table of powers of 10 is given both in C and FORTRAN (Listing 4 and Listing 5), to show that it need not be duplicated in a mixed language system. According to the manuals for IBM AIX, you would expect that the leading underscore would not be appended automatically by the FORTRAN compiler to the structure name dp10. I have made it external because it is so bulky and should not be repeated for other modules that might need it, such as atof.

In most libraries, the functions gcvt, ecvt, and fcvt are not invoked by the printf %g, %e, or %f formats, so there is no portable way to replace the library function with faster or more accurate code other than to execute these entry points directly to perform the conversion. Older BSD-based libraries did call cvt, typically with the last argument inverted from the System V definition.

Testing

The test driver cvttst.c in Listing 6 picks a few of the more difficult numbers to convert and converts them to character strings and back to binary for comparison with the original. When the test driver reports errors, the errors may have arisen in either conversion. I also supply a version of atof to demonstrate using the table of powers of 10 and to use in case the version supplied in the compiler library is faulty.

Running a test that converts all 2,098 numbers of the form 2^N, the results I obtained (user time only) are shown in Table 1.

The Sun compiler libraries appear to be designed to achieve IEEE-compliant conversion without considering speed, while the Silicon Graphics libraries achieve great speed at the expense of accuracy. The Sun libraries all appear to use software floating point, regardless of the numeric processor selection. Sun SPARC has no hardware integer multiply or divide. The code with digits split off the high end one at a time works around this handicap. The low-split code includes SunOS dependent coding to make appropriate changes in the rounding modes, thus achieving more speed than floor.

SGI gets accurate results for powers of 10 and for numbers between one and 10^+17. Most numbers between 10^-17 and one are converted as if the low-order bit were zero. Smaller numbers have errors on the high side up to 14 times the IEEE standard, and large numbers have errors on the low side up to eight times the standard. The SGI atof, however, gives excellent results. The Multiflow library uses fmod (a variation on the low-split approach), achieving less accuracy or speed than the other methods. atof in this library has no protection against errors or underflows. This machine achieves adequate performance with floating point simulation of integer divide and remaindering.

Conclusion

Several major software vendors have left great latitude for hackers who want to improve speed or accuracy of floating point to string conversion. In spite of the great variations in hardware capabilities, portable C code can achieve much of this improvement. The best performing machines in this study have a good match between architecture and C syntax.

I hope that this article will improve understanding of conversion between binary and decimal floating point, and maybe even help to correct the sloppiness associated with numerical work on many C implementations. I tried to cover enough of the possibilities to produce close to optimum speed and accuracy on various architectures, short of employing operations that are not directly accessible to C compilers.

References

Anonymous, A Proposed Standard for Binary Floating-Point Arithmetic, IEEE P854/85-2.25.

Kernighan, B, Ritchie, D, The C Programming Language, Prentice-Hall (2nd edition for ANSI), 1988.

Harbison, S, Steele, G, A C Reference Manual, Prentice-Hall (2nd edition for ANSI), 1989.

Plauger, P J, Brodie, J, Standard C, Microsoft Press, 1989.

Prince, T, "Generating Source for <float.h>," C Users Journal, June 1990, pp. 119-123.

Listing 1

Listing 3