May 1992/Standard C

Text to Numeric Conversions

P.J. Plauger

P.J. Plauger is senior editor of The C Users Journal. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee, WG14. His latest book is The Standard C Library, published by Prentice-Hall. You can reach him care of The C Users Journal or via Internet at pjp@plauger.uunet.uu.net.

Introduction
This is the second installment on the header <stdlib.h>. Last month, I discussed the header in general and two groups of functions in particular. Those functions perform sorting, searching, and simple integer arithmetic.
I continue this month with another group, the functions that convert between text strings and encoded numeric data. Here are a few general comments on how to use these functions:
atof — The call atof(s) is equivalent to strtod(s, NULL), except that atof is not obliged to store ERANGE in errno to report a range error. You also get no indication with atof of how many characters from the string participate in the conversion. Use strtod instead.
atoi — Replace atoi(s) with (int)strtol(s, NULL, 10). Then consider altering the second argument so that you can determine how many characters participated in the conversion.
atol — Replace atol(s) with strtol(s, NULL, 10). Then consider altering the second argument.
strtod — This is the function called by the scan functions, to convert a sequence of characters to an encoded value of type double. You can call strtod directly to avoid the overhead of the scan functions. That also lets you determine more precisely what part of the string argument participates in the conversion.
Note that the behavior of strtod can change among locales. The function effectively calls isspace to skip leading white-space. It also uses the decimal point defined for the current locale. Beyond that, the valid text patterns are essentially those of the integer and floating constants (with no suffixes) in C. For example, the following are all valid ways to represent the value 12: 12, +12., and .12e2. An implementation can also recognize additional patterns in other than the "C" locale.
strtol — This is the function called by the scan functions to convert a sequence of characters to an encoded value of type long. You can call strtol directly to avoid the overhead of the scan functions. That also lets you specify unusual bases and determine more precisely what part of the string argument participates in the conversion.
Note that the behavior of strtol can change among locales. The function effectively calls isspace to skip leading white-space. Beyond that, the valid text patterns are essentially those of the integer constants (with no suffixes), modified as needed for various bases. For example, the following are all valid ways to represent the value 12 (assuming the third argument to strtol specifies a base of zero): 12, +014, and 0xC. An implementation can also recognize additional patterns in other than the "C" locale.
strtoul — Use this function instead of strtol when you need a result of type unsigned long. The function strtoul reports a range error only if the converted magnitude is greater than ULONG_MAX, defined in <limits.h>. (Negating the value cannot cause overflow.) strtol, on the other hand, reports a range error if the converted value is less than LONG_MIN or greater than LONG_MAX, both defined in <limits.h>. The same text patterns apply as for strtol.
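To make the advice about replacing atoi concrete, here is a minimal sketch of a checked conversion built on strtol. The helper name parse_int and its return convention are my own assumptions for illustration, not part of the Standard C library. It rejects an empty conversion, trailing characters that did not participate, and any value that will not fit in an int.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not a library function): convert all of s
   to an int. Returns 0 on success, -1 on any failure. */
int parse_int(const char *s, int *out)
{
    char *end;
    long val;

    errno = 0;                  /* strtol stores ERANGE only on error */
    val = strtol(s, &end, 10);
    if (end == s)
        return -1;              /* no characters participated */
    if (*end != '\0')
        return -1;              /* trailing text was not converted */
    if (errno == ERANGE || val < INT_MIN || INT_MAX < val)
        return -1;              /* value does not fit in an int */
    *out = (int)val;
    return 0;
}
```

With this helper, parse_int("  42", &v) succeeds and stores 42, while parse_int("12abc", &v) fails. A call to atoi("12abc") would quietly return 12 and tell you nothing about the leftover characters.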

What the C Standard Says

7.10.1 String conversion functions
The functions atof, atoi, and atol need not affect the value of the integer expression errno on an error. If the value of the result cannot be represented, the behavior is undefined.

7.10.1.1 The atof function

Synopsis

#include <stdlib.h>
double atof(const char *nptr);

Description
The atof function converts the initial portion of the string pointed to by nptr to double representation. Except for the behavior on error, it is equivalent to

strtod(nptr, (char **)NULL)

Returns
The atof function returns the converted value.
Forward references: the strtod function (7.10.1.4).

7.10.1.2 The atoi function

Synopsis

#include <stdlib.h>
int atoi(const char *nptr);

Description
The atoi function converts the initial portion of the string pointed to by nptr to int representation. Except for the behavior on error, it is equivalent to

(int)strtol(nptr, (char **)NULL, 10)

Returns
The atoi function returns the converted value.
Forward references: the strtol function (7.10.1.5).

7.10.1.3 The atol function

Synopsis

#include <stdlib.h>
long int atol(const char *nptr);

Description
The atol function converts the initial portion of the string pointed to by nptr to long int representation. Except for the behavior on error, it is equivalent to

strtol(nptr, (char **)NULL, 10)

Returns
The atol function returns the converted value.
Forward references: the strtol function (7.10.1.5).

7.10.1.4 The strtod function

Synopsis

#include <stdlib.h>
double strtod(const char *nptr, char **endptr);

Description
The strtod function converts the initial portion of the string pointed to by nptr to double representation. First, it decomposes the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling a floating-point constant; and a final string of one or more unrecognized characters, including the terminating null character of the input string. Then, it attempts to convert the subject sequence to a floating-point number, and returns the result.
The expected form of the subject sequence is an optional plus or minus sign, then a nonempty sequence of digits optionally containing a decimal-point character, then an optional exponent part as defined in 6.1.3.1, but no floating suffix. The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is empty or consists entirely of white space, or if the first non-white-space character is other than a sign, a digit, or a decimal-point character.
If the subject sequence has the expected form, the sequence of characters starting with the first digit or the decimal-point character (whichever occurs first) is interpreted as a floating constant according to the rules of 6.1.3.1, except that the decimal-point character is used in place of a period, and that if neither an exponent part nor a decimal-point character appears, a decimal point is assumed to follow the last digit in the string. If the subject sequence begins with a minus sign, the value resulting from the conversion is negated. A pointer to the final string is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
In other than the "C" locale, additional implementation-defined subject sequence forms may be accepted.
If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of nptr is stored in the object pointed to by endptr, provided that endptr is not a null pointer.

Returns
The strtod function returns the converted value, if any. If no conversion could be performed, zero is returned. If the correct value is outside the range of representable values, plus or minus HUGE_VAL is returned (according to the sign of the value), and the value of the macro ERANGE is stored in errno. If the correct value would cause underflow, zero is returned and the value of the macro ERANGE is stored in errno.
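The return rules for strtod are easier to follow in use. What follows is a minimal sketch, not Standard text. The helper name parse_double and its return codes are assumptions made for illustration. It distinguishes the three outcomes described above: a successful conversion, no conversion at all, and a range error.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper (not a library function): convert the leading
   number in s. *rest receives the address of the first character
   that did not participate. Returns 0 on success, -1 if no
   conversion was performed, -2 on a range error. */
int parse_double(const char *s, double *out, const char **rest)
{
    char *end;
    double val;

    errno = 0;                  /* so a stale ERANGE is not misread */
    val = strtod(s, &end);
    *rest = end;
    if (end == s)
        return -1;              /* empty or malformed subject sequence */
    if (errno == ERANGE)
        return -2;              /* overflow or underflow was reported */
    *out = val;
    return 0;
}
```

In the "C" locale, the strings 12, +12., and .12e2 all convert to 12.0 through this helper. For the string 3.5kg, the value is 3.5 and *rest points at the k.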

7.10.1.5 The strtol function

Synopsis

#include <stdlib.h>
long int strtol(const char *nptr, char **endptr, int base);

Description
The strtol function converts the initial portion of the string pointed to by nptr to long int representation. First, it decomposes the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling an integer represented in some radix determined by the value of base, and a final string of one or more unrecognized characters, including the terminating null character of the input string. Then, it attempts to convert the subject sequence to an integer, and returns the result.
If the value of base is zero, the expected form of the subject sequence is that of an integer constant as described in 6.1.3.2, optionally preceded by a plus or minus sign, but not including an integer suffix. If the value of base is between 2 and 36, the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign, but not including an integer suffix. The letters from a (or A) through z (or Z) are ascribed the values 10 to 35; only letters whose ascribed values are less than that of base are permitted. If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.
The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is empty or consists entirely of white space, or if the first non-white-space character is other than a sign or a permissible letter or digit.
If the subject sequence has the expected form and the value of base is zero, the sequence of characters starting with the first digit is interpreted as an integer constant according to the rules of 6.1.3.2. If the subject sequence has the expected form and the value of base is between 2 and 36, it is used as the base for conversion, ascribing to each letter its value as given above. If the subject sequence begins with a minus sign, the value resulting from the conversion is negated. A pointer to the final string is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
In other than the "C" locale, additional implementation-defined subject sequence forms may be accepted.
If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of nptr is stored in the object pointed to by endptr, provided that endptr is not a null pointer.

Returns
The strtol function returns the converted value, if any. If no conversion could be performed, zero is returned. If the correct value is outside the range of representable values, LONG_MAX or LONG_MIN is returned (according to the sign of the value), and the value of the macro ERANGE is stored in errno.
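The effect of the base argument and of endptr can be seen with a small wrapper. This is a minimal sketch, assuming a helper name (scan_long) that is mine and not part of the library. It reports how many characters participated in the conversion, leading white space included.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not a library function): convert the leading
   integer in s using the given base. Returns the number of
   characters consumed, white space included; zero means that no
   conversion was performed and *out is left unchanged. */
size_t scan_long(const char *s, int base, long *out)
{
    char *end;
    long val = strtol(s, &end, base);

    if (end != s)
        *out = val;
    return (size_t)(end - s);
}
```

With a base of zero, the strings 12, +014, and 0xC all store 12. With a base of 36, zz stores 1295 (35 times 36, plus 35). With a base of 2, the string 12 stores 1 and the function returns 1, because the digit 2 is not permitted in that base and so ends the subject sequence.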

7.10.1.6 The strtoul function

Synopsis

#include <stdlib.h>
unsigned long int strtoul(const char *nptr, char **endptr, int base);

Description
The strtoul function converts the initial portion of the string pointed to by nptr to unsigned long int representation. First, it decomposes the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling an unsigned integer represented in some radix determined by the value of base, and a final string of one or more unrecognized characters, including the terminating null character of the input string. Then, it attempts to convert the subject sequence to an unsigned integer, and returns the result.
If the value of base is zero, the expected form of the subject sequence is that of an integer constant as described in 6.1.3.2, optionally preceded by a plus or minus sign, but not including an integer suffix. If the value of base is between 2 and 36, the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign, but not including an integer suffix. The letters from a (or A) through z (or Z) are ascribed the values 10 to 35; only letters whose ascribed values are less than that of base are permitted. If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.
The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is empty or consists entirely of white space, or if the first non-white-space character is other than a sign or a permissible letter or digit.
If the subject sequence has the expected form and the value of base is zero, the sequence of characters starting with the first digit is interpreted as an integer constant according to the rules of 6.1.3.2. If the subject sequence has the expected form and the value of base is between 2 and 36, it is used as the base for conversion, ascribing to each letter its value as given above. If the subject sequence begins with a minus sign, the value resulting from the conversion is negated. A pointer to the final string is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
In other than the "C" locale, additional implementation-defined subject sequence forms may be accepted.
If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of nptr is stored in the object pointed to by endptr, provided that endptr is not a null pointer.

Returns
The strtoul function returns the converted value, if any. If no conversion could be performed, zero is returned. If the correct value is outside the range of representable values, ULONG_MAX is returned, and the value of the macro ERANGE is stored in errno.
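One consequence of these rules surprises many programmers: strtoul accepts a minus sign and negates the result in unsigned arithmetic, so the string -1 converts to ULONG_MAX with no range error. Here is a minimal sketch of a stricter wrapper for callers who want to refuse such input. The name parse_ulong and its return convention are assumptions made for illustration.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper (not a library function): like strtoul, but
   rejects a leading minus sign. Returns 0 on success, -1 on failure. */
int parse_ulong(const char *s, int base, unsigned long *out)
{
    const char *p = s;
    char *end;
    unsigned long val;

    while (isspace((unsigned char)*p))
        ++p;                    /* the same white space strtoul skips */
    if (*p == '-')
        return -1;              /* refuse a value that would be negated */
    errno = 0;
    val = strtoul(p, &end, base);
    if (end == p || errno == ERANGE)
        return -1;              /* nothing converted, or out of range */
    *out = val;
    return 0;
}
```

Here parse_ulong("-1", 10, &u) fails, where a bare strtoul("-1", NULL, 10) returns ULONG_MAX and leaves errno alone.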

Implementing the Conversion Functions
Listing 1 shows the file xstoul.c. It defines the function _Stoul that performs all conversions from text string to encoded integer. The function has the same specifications as strtoul. I made it a separate function so that several masking macros defined in <stdlib.h> can call it directly as shown in Listing 2.
The three corresponding functions are equally simple. The first half of _Stoul determines the base and locates the most-significant digit. That involves stripping leading white-space, identifying any sign, and picking off any prefix such as 0X. The function then skips any leading zeros so that it can count the number of significant digits it converts. It converts all significant digits regardless of possible overflow. For unsigned long arithmetic, an overflow does not cause an exception.
_Stoul makes a coarse check for overflow by first inspecting the number of significant digits. This version assumes that an unsigned long occupies 32 bits. (Change the array ndigs if such integers are larger.) For each valid base, ndigs[base] is the number of digits at which overflow can occur. Thus, a shorter sequence cannot overflow and a longer sequence must. A sequence of the critical length requires further checking. Take away the last digit and see whether you get back the previously accumulated value (y). If not, an overflow occurred.
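The two-stage overflow test can be sketched as follows. This is not the code from Listing 1. The function names are hypothetical, and the sketch computes the critical digit count at run time instead of using a table like ndigs, so it does not assume a 32-bit unsigned long. It handles only a bare run of digits, with no white space, sign, or prefix.

```c
#include <ctype.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";

/* Number of digits needed to write ULONG_MAX in the given base.
   This plays the role the article assigns to ndigs[base]. */
static int critical_digits(int base)
{
    unsigned long v = ULONG_MAX;
    int n = 0;

    for (; v != 0; v /= (unsigned long)base)
        ++n;
    return n;
}

/* Hypothetical sketch: accumulate a run of digits (base 2 to 36).
   Sets *overflow nonzero if the true value exceeds ULONG_MAX. */
unsigned long accumulate(const char *s, int base, int *overflow)
{
    const char *start, *sd;
    unsigned long x = 0, y = 0;
    int dig = 0;
    ptrdiff_t n;

    while (*s == '0')
        ++s;                    /* leading zeros are not significant */
    start = s;
    for (; *s != '\0' && (sd = memchr(digits,
            tolower((unsigned char)*s), (size_t)base)) != NULL; ++s) {
        y = x;                  /* value before this digit */
        dig = (int)(sd - digits);
        x = x * base + dig;     /* unsigned arithmetic wraps silently */
    }
    n = (s - start) - critical_digits(base);
    if (n < 0)
        *overflow = 0;          /* too short to overflow */
    else if (0 < n)
        *overflow = 1;          /* too long not to overflow */
    else                        /* critical length: take away the last
                                   digit and compare against y */
        *overflow = x < (unsigned long)dig || (x - dig) / base != y;
    return x;
}
```

The final test works because only the last digit of a critical-length sequence can push the sum past ULONG_MAX. If it does, the wrapped value no longer divides back to y.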
Note the rare use of the type ptrdiff_t, defined in <stddef.h>. It ensures that n can hold the signed difference between two pointers. ptrdiff_t is not a completely safe type. An argument string with over 32,767 significant digits can fail to report overflow on a computer with 16-bit pointers. That is an unlikely occurrence, but it can happen. Still, it is tedious to write the test completely safely. I chose speed in this case over absolute safety.
Listing 3 shows the file strtol.c. It defines the function strtol that must report an overflow properly. Thus, it chases down any leading minus sign itself so that it can check the converted value as a long. Note that the function must call _Stoul with the original pointer. Should _Stoul find an invalid string, it must store that pointer at endptr. To point past any leading white-space would be misleading.
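The shape of that function can be sketched with the standard strtoul standing in for _Stoul. This is an illustration under stated assumptions, not the code from Listing 3. The name sketch_strtol is hypothetical, and saving and restoring errno is my own device for detecting a range error from strtoul without disturbing the caller's errno.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical sketch: strtol layered on strtoul. */
long sketch_strtol(const char *s, char **endptr, int base)
{
    const char *sc = s;
    unsigned long x, mag;
    int neg, saved;

    while (isspace((unsigned char)*sc))
        ++sc;                   /* look past white space for a sign */
    neg = (*sc == '-');

    saved = errno;
    errno = 0;
    x = strtoul(s, endptr, base);   /* original pointer, not sc */
    if (errno == ERANGE)
        return neg ? LONG_MIN : LONG_MAX;
    if (errno == 0)
        errno = saved;          /* no error: restore the old value */

    if (!neg) {
        if ((unsigned long)LONG_MAX < x) {
            errno = ERANGE;     /* fits unsigned long but not long */
            return LONG_MAX;
        }
        return (long)x;
    }
    mag = 0UL - x;              /* undo the negation strtoul applied */
    if (mag <= (unsigned long)LONG_MAX)
        return -(long)mag;
    if (mag - 1 == (unsigned long)LONG_MAX && LONG_MIN < -LONG_MAX)
        return LONG_MIN;        /* the extra two's-complement value */
    errno = ERANGE;
    return LONG_MIN;
}
```

Because strtoul receives s and not sc, an invalid string such as "  -x" leaves *endptr equal to s, white space and all.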
Floating-point conversions follow a similar pattern. The header <stdlib.h> defines the two macros
#define atof(s) _Stod(s, 0)
#define strtod(s, end) _Stod(s, end)
Thus, both functions simply call the common function _Stod to do all the work. In this case, atof enjoys the same thorough checking required of strtod.
Listing 4 shows the file xstod.c. It defines the function _Stod that performs all conversions from text string to encoded floating-point. It does so carefully, avoiding intermediate overflow and loss of precision.
The macro SIG_MAX, for example, represents a careful compromise. It limits the number of significant digits to 32. That is more than enough for the most precise representation supported by this implementation (about 20 decimal digits for 10-byte IEEE 754 long double). It is also well short of the largest integer that would cause an overflow on a conforming implementation (about 37 digits). The function takes similar care in accumulating any exponent. As a result, any floating-point overflow or underflow is handled safely in the function _Dtento, declared in "xmath.h" and shown in Listing 5.
The first half of _Stod checks syntax and accumulates significant fraction digits. It then converts eight digits at a time to an array of long. It converts these elements to double, from least-significant to most-significant, and scales each appropriately before adding it to the running sum. This sequence of operations is reasonably efficient and maintains precision.
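The grouping step can be sketched as follows. This is a simplified illustration, not the code from Listing 4. The function name is hypothetical, and the sketch handles only a bare run of decimal digits: no sign, no decimal point, no exponent. It ignores any digits beyond SIG_MAX, where a full implementation would adjust the exponent instead.

```c
#include <ctype.h>

#define SIG_MAX 32      /* digit limit described in the article */
#define CHUNK   8       /* 99999999 fits easily in a 32-bit long */

/* Hypothetical sketch: convert up to SIG_MAX leading decimal digits
   of s to double, eight digits at a time. */
double digits_to_double(const char *s)
{
    long lo[SIG_MAX / CHUNK];
    int nsig = 0, nlo = 0, i, take;
    double x = 0.0, fac = 1.0;

    while (nsig < SIG_MAX && isdigit((unsigned char)s[nsig]))
        ++nsig;                 /* count the digits to convert */

    /* The first group takes the odd digits so that every later
       group holds exactly CHUNK digits. */
    take = nsig % CHUNK;
    if (take == 0)
        take = CHUNK;
    for (i = 0; i < nsig; take = CHUNK) {
        long v = 0;

        for (; 0 < take; --take, ++i)
            v = v * 10 + (s[i] - '0');  /* exact integer arithmetic */
        lo[nlo++] = v;
    }

    /* Sum from least-significant group to most-significant,
       scaling each group by the appropriate power of 1e8. */
    for (i = nlo - 1; 0 <= i; --i) {
        x += fac * (double)lo[i];
        fac *= 1e8;
    }
    return x;
}
```

Each group is built with exact long arithmetic, so rounding can occur only in the few floating-point multiplies and adds at the end. Adding the smallest terms first keeps their contribution from being lost.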
Warren Yelsin, an honors student at the University of New South Wales, studied _Stod in some depth. He found that it gives the best internal approximation most of the time. When it fails, it gives the next higher value. So far, the cost of getting the best answer all the time appears prohibitive. But the payoff can be very high. I'll discuss this issue more at a later date.
This article is excerpted from P.J. Plauger, The Standard C Library, (Englewood Cliffs, N.J.: Prentice-Hall, 1992).