August 1994/Standard C/C++

Standard C/C++

Extractors

P.J. Plauger

P.J. Plauger is senior editor of C/C++ Users Journal. He is convenor of the ISO C standards committee, WG14, and active on the C++ committee, WG21. His latest books are The Standard C Library and Programming on Purpose (three volumes), all published by Prentice-Hall. You can reach him at pjp@plauger.com.
This is the second of two installments on the class istream, defined in the header <istream>. (See "Standard C: The Header <istream>," CUJ, July 1994.) Last month, I showed the definition of the class and described the member functions that perform unformatted input. I conclude this month, as promised, with a description of the member functions that perform formatted input. These functions are commonly known as extractors.
Extractors overload operator>>, as in:
float fl;
cin >> fl;
This expression extracts characters from the stream controlled by cin (the standard input stream). Extraction continues so long as characters continue to match the text pattern acceptable to the Standard C library function strtod. The entire sequence extracted by this rule must then be successfully converted, as if by calling strtod, and the resulting value must be representable as type float.
If all those conditions obtain, the extractor stores the converted value in fl and succeeds. Otherwise, the function reports failure, typically by setting ios::failbit in cin. In any event, all the characters extracted to make up the valid text field (or valid prefix for such a field) are irretrievably consumed.
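As a minimal sketch of the success and failure cases just described, the following assumes a present-day library with <sstream> rather than the 1994 draft, and tests failure through fail(). The helper extract_float and its self-check are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not from the column): extract one float from text.
// Returns true and stores the value in out only if the extractor succeeds.
bool extract_float(const std::string& text, float& out)
{
    std::istringstream in(text);
    float fl = 0.0f;
    in >> fl;                  // skips leading white space, then converts
    if (in.fail())             // failbit reports a failed extraction
        return false;
    out = fl;
    return true;
}

// Small self-check of the two cases.
void check_extract_float()
{
    float fl = 0.0f;
    assert(extract_float("  3.5 rest", fl) && fl == 3.5f);
    assert(!extract_float("abc", fl));   // no valid field, so failure
}
```

Calling check_extract_float() from main exercises both paths. Copying into out only after the test keeps the caller's variable untouched on failure, a choice made for this sketch rather than a statement about what any particular library stores.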
If all that machinery sounds rather like fscanf, it should. Extractors serve the same function for the Standard C++ library that the fscanf family serves for the Standard C library. You can, in fact, implement extractors by calls to fscanf. That turns out to be not always convenient, however:

The source of input for a stream controlled by an istream object is controlled in turn by a streambuf object. (See last month's installment.) You can't always relate that source directly to the sources accessible by fscanf or sscanf.

You can extract characters and store them in a buffer for processing by sscanf. But doing so requires a buffer that is arbitrarily large. Or you end up doing so much preprocessing of the text that you duplicate much of the work done by sscanf.
For these reasons, I present here a different implementation of the istream extractors. It sits reasonably well atop the Standard C library, but it makes no use of fscanf or its brethren. To provide an overview, Listing 1 is a repeat from last month. It shows one way to implement the header <istream>, which consists almost entirely of the definition of the class istream. My focus this month is all those overloads of operator>>, plus one or two "secret" member functions that work with them.

Character Extractors
Let's begin with the extractors that perform a minimum of interpretation. These extract one or more characters from the input stream and deliver them either to memory or to an output stream. The simplest of all is operator>>(char&), which extracts a single character, as shown in Listing 2. As usual, class istream also has unsigned char and signed char versions of the same extractor, which call on the plain char version to do the actual work.
Last month, I showed some very similar member functions. get(char&) has much the same external interface, but it is get() that has much the same internal structure as this particular extractor. Both depend on the pair of member functions ipfx(int) and isfx() to enforce various stream disciplines. And both depend on the macros _TRY_IO_BEGIN and _CATCH_IO_END to enforce the required discipline of exception handling.
The only difference is the argument to ipfx. The extractor uses the default zero value for the flag argument. That encourages ipfx to skip leading white space before extracting the actual data to be delivered, assuming the flag ios::skipws is set within the istream flag word. (It is set by default when the object is constructed.)
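The difference is easy to observe from the outside. This sketch, assuming a present-day library with <sstream>, reads the first character of the same text once with the extractor and once with get(char&). The two function names are hypothetical.

```cpp
#include <cassert>
#include <sstream>

// Formatted input: with skipws set (the default), leading white space
// is discarded before the character is delivered.
char first_by_extractor(const char* text)
{
    std::istringstream in(text);
    char c = '\0';
    in >> c;
    return c;
}

// Unformatted input: get(char&) delivers the very next character.
char first_by_get(const char* text)
{
    std::istringstream in(text);
    char c = '\0';
    in.get(c);
    return c;
}

void check_first_char()
{
    assert(first_by_extractor("  x") == 'x');
    assert(first_by_get("  x") == ' ');
}
```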
If you want to skip white space before invoking an extractor, regardless of the setting of ios::skipws, you can always use the manipulator ws. You will find it declared near the bottom of the header <istream>, in Listing 1. Listing 3 shows one way to write this function. If you write:
cin >> ws >> fl;
then white space is always skipped before fl is extracted. (These two listings have all the clues you need to figure out how ws works, if you like arcane C++ puzzles.)
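For readers who would rather see one possible answer, here is a hedged sketch. It assumes a present-day library, and skip_space is a hypothetical stand-in for ws written only with public member functions; it is not the code of Listing 3. The extractor that accepts a pointer to a function of this shape simply calls the function with the stream as argument.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>

// Hypothetical ws-like manipulator: discard white space, return the stream.
std::istream& skip_space(std::istream& in)
{
    while (std::isspace(in.peek()))   // peek() yields a character code or EOF
        in.get();
    return in;
}

// Reads one character with skipws cleared, optionally applying a manipulator.
char read_after(const char* text, std::istream& (*manip)(std::istream&))
{
    std::istringstream in(text);
    in.unsetf(std::ios_base::skipws);  // the extractor itself will not skip
    char c = '\0';
    if (manip != nullptr)
        in >> manip >> c;              // operator>> calls manip(in)
    else
        in >> c;
    return c;
}

void check_ws()
{
    assert(read_after("  x", nullptr) == ' ');
    assert(read_after("  x", std::ws) == 'x');
    assert(read_after("  x", skip_space) == 'x');
}
```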
Listing 4 shows the member function operator>>(char *). It extracts a sequence of characters from the input stream and stores it as a null-terminated string in the character array designated by the pointer argument. (Yes, there are three flavors, once again.)
This is one of the few extractors that uses the width value stored in the istream object. (If width() is zero, the width is taken as INT_MAX, as defined in <limits.h>.) Input stops with the first white-space character extracted (which is pushed back), or when the specified number of characters are stored in the array, counting the terminating null. Note that the width field is set to zero by this extractor, as is customary for functions that make use of the field.
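The effect of the width field can be sketched as follows, assuming a present-day library in which setw from <iomanip> sets the width on an input stream. The struct word_result and the function read_word are hypothetical. The sketch insists on a nonzero width because a zero width places no useful bound on a plain pointer target.

```cpp
#include <cassert>
#include <cstring>
#include <iomanip>
#include <ios>
#include <sstream>

// Hypothetical result type: the extracted word and the width seen afterward.
struct word_result {
    char text[8];
    std::streamsize width_after;
};

// Extracts one white-space delimited word, storing at most width - 1
// characters plus the terminating null.
word_result read_word(const char* input, int width)
{
    assert(1 <= width && width <= 8);   // must fit the 8-byte array
    std::istringstream in(input);
    word_result r = {};
    in >> std::setw(width) >> r.text;
    r.width_after = in.width();         // the extractor resets this to zero
    return r;
}

void check_read_word()
{
    word_result r = read_word("abcdef ghi", 4);
    assert(std::strcmp(r.text, "abc") == 0);
    assert(r.width_after == 0);
}
```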
Finally, Listing 5 shows the member function operator>>(streambuf&). It too strongly resembles one of the unformatted input functions I showed last month. You use it to copy the remainder of the input stream to the output stream controlled by the streambuf operand. The extractor version should be substantially faster because it doesn't have to check for delimiters. Thus, it can move whole blocks of characters at a time.
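A short usage sketch follows. It assumes a present-day library, where this extractor is declared to take a pointer to streambuf rather than the reference shown in the draft, so the operand is written out.rdbuf(). The function copy_remainder is a hypothetical name.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Consumes the first word, then copies whatever remains in the input
// to a string stream by way of its stream buffer.
std::string copy_remainder(const std::string& text)
{
    std::istringstream in(text);
    std::string first;
    in >> first;               // ordinary extraction of one word

    std::ostringstream out;
    in >> out.rdbuf();         // skips leading white space, then copies
    return out.str();
}

void check_copy_remainder()
{
    assert(copy_remainder("hello world again") == "world again");
}
```

Because this is a formatted input function, white space ahead of the copied text is skipped when skipws is set. The interior spaces are copied unchanged.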

Integer Extractors
Converting text fields to internal integer forms is rather more complicated. You certainly want existing library functions, such as strtol or strtoul, to do the hard work. But just setting up for one of these functions takes a bit of effort in its own right. Remember, nothing prevents a perverse user from generating an input stream with 5,000 leading zeros, followed by a perfectly reasonable integer. You don't want to blindly gather characters into a buffer as part of the extraction process.
Listing 6 shows the member function operator>>(long&), which extracts a long integer. It does indeed gather characters into a buffer, but not blindly. And the buffer has a bounded length, in this case the value of the macro _MAX_INT_DIG. For a typical machine with 32-bit longs, a value of 16 is plenty big enough to deal with a sign, prefix, and enough significant digits to ensure overflow if the value is indeed too large.
The work of gathering an integer field is carried out by the private member function _Getifld(char *), shown in Listing 7. It largely replicates the logic of fscanf, or even strtol, with an important difference or two. It compresses all leading zeros to a single digit. And it truncates a very big number at a value large enough to ensure overflow, as I indicated above. It then counts on strtol (or strtoul for the unsigned long extractor) to do the rest of the conversion.
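The two tricks can be sketched in a much reduced form. What follows is not Listing 7: it handles decimal fields only, uses public member functions, and every name in it (kMaxIntDig, gather_int_field, extract_long) is hypothetical. The buffer size of 32 is an assumption chosen so that a truncated field still overflows a 64-bit long.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>

const std::size_t kMaxIntDig = 32;   // assumed stand-in for _MAX_INT_DIG

// Gathers an optional sign and decimal digits into buf (kMaxIntDig bytes).
// Leading zeros collapse to one '0'; digits that do not fit are consumed
// but dropped. Returns the number of characters stored.
std::size_t gather_int_field(std::istream& in, char* buf)
{
    char* p = buf;
    char* const last = buf + kMaxIntDig - 1;   // room for the null
    int c = in.peek();
    if (c == '+' || c == '-') {
        *p++ = static_cast<char>(in.get());
        c = in.peek();
    }
    if (c == '0') {
        *p++ = '0';                            // one zero stands for all
        while (in.peek() == '0')
            in.get();
        c = in.peek();
    }
    while (std::isdigit(c)) {                  // isdigit(EOF) is false
        in.get();
        if (p < last)
            *p++ = static_cast<char>(c);
        c = in.peek();
    }
    *p = '\0';
    return static_cast<std::size_t>(p - buf);
}

// Converts the gathered field with strtol, which does the range check.
bool extract_long(std::istream& in, long& out)
{
    char buf[kMaxIntDig];
    in >> std::ws;
    gather_int_field(in, buf);
    char* end = buf;
    errno = 0;
    long value = std::strtol(buf, &end, 10);
    if (end == buf || *end != '\0' || errno == ERANGE) {
        in.setstate(std::ios_base::failbit);
        return false;
    }
    out = value;
    return true;
}

void check_extract_long()
{
    std::istringstream in("  0000000000000000000000000000000000000000042");
    long value = 0;
    assert(extract_long(in, value) && value == 42);
}
```

Since the first digit stored after the leading zero is nonzero, a field long enough to be truncated still represents a value too large for the target type, so strtol reports the overflow.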
All the other integer extractors make use of either the long or the unsigned long extractor. By way of example, Listing 8 shows the int extractor, operator>>(int&). Once it extracts a long, all it has to do is make a tighter range check before storing the converted value.
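The tighter range check looks roughly like this. The sketch is written as a free function over a present-day istream, not as the member function of Listing 8, and extract_int is a hypothetical name.

```cpp
#include <cassert>
#include <climits>
#include <istream>
#include <sstream>

// Extracts a long, then accepts it only if it is representable as int.
bool extract_int(std::istream& in, int& out)
{
    long wide = 0;
    if (!(in >> wide))                        // the long extractor does the work
        return false;
    if (wide < INT_MIN || INT_MAX < wide) {   // the tighter range check
        in.setstate(std::ios_base::failbit);
        return false;
    }
    out = static_cast<int>(wide);
    return true;
}

void check_extract_int()
{
    std::istringstream in("123");
    int value = 0;
    assert(extract_int(in, value) && value == 123);
}
```

On an implementation where int and long have the same range, the second test can never fire, but it costs little to leave in place.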
Extracting a pointer to void is a slightly different matter. It must, of course, work in concert with its corresponding inserter for pointer to void. Both should also convert an arbitrary representation for pointers, even if it is bigger than the largest integer. You can, and probably should, tailor pointer conversions for each implementation. What I show here is one way to write a pointer extractor that is both portable and robust (even if it doesn't always choose the most appropriate text representation for pointers).
Listing 9 shows the pointer to void extractor. The trick it uses is to store the pointer in a union, so that it overlaps an array of unsigned long. The extractor then extracts a series of integers separated by colons and stores them in the union as integers. The resultant pointer value is accessed from the union in the end. So long as the corresponding inserter does the reverse process, you can be sure that a pointer value you extract matches the earlier one you inserted.
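Here is a hedged sketch of the round trip. It is not Listing 9: it swaps the union for memcpy, since reading a union member other than the one last written is not well defined in present-day C++, and it assumes hexadecimal text for each integer. The names kWords, pointer_to_text, and text_to_pointer are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>

// Number of unsigned longs needed to cover one pointer.
const std::size_t kWords =
    (sizeof(void*) + sizeof(unsigned long) - 1) / sizeof(unsigned long);

// Inserter side: write the pointer as integers separated by colons.
std::string pointer_to_text(const void* p)
{
    unsigned long words[kWords] = {};
    std::memcpy(words, &p, sizeof p);
    std::ostringstream out;
    out << std::hex;
    for (std::size_t i = 0; i < kWords; ++i) {
        if (i != 0)
            out << ':';
        out << words[i];
    }
    return out.str();
}

// Extractor side: read the integers back and rebuild the pointer.
bool text_to_pointer(const std::string& text, void*& p)
{
    unsigned long words[kWords] = {};
    std::istringstream in(text);
    in >> std::hex;
    for (std::size_t i = 0; i < kWords; ++i) {
        if (i != 0) {
            char colon = '\0';
            in >> colon;
            if (!in || colon != ':')
                return false;
        }
        if (!(in >> words[i]))
            return false;
    }
    std::memcpy(&p, words, sizeof p);
    return true;
}

void check_pointer_round_trip()
{
    int target = 0;
    void* back = nullptr;
    assert(text_to_pointer(pointer_to_text(&target), back));
    assert(back == &target);
}
```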

Floating-Point Extractors
Converting text fields to internal floating-point forms is even more complicated. You really want existing library functions, such as strtod, to do the hard work. Sadly, the C Standard does not require a strtold, to convert the extra precision and range of a long double. Nor does it require a strtof, to perform the tighter range checking of a float. If you want to write the Standard C++ library in terms of the Standard C library, you're somewhat at a loss in this area.
I have provided both of these functions, with secret names, in my implementation of the Standard C library. I suspect that the next revision of the C Standard will make them mandatory. Meanwhile, you can cop out by using strtod in place of these missing functions. It is deficient in several ways, but it meets many needs.
Setting up for one of these floating-point conversion functions takes even more effort than for integer conversions. Our hypothetical perverse user now has several places to pad a numeric field with gratuitous zeros that don't alter the represented value.
Here is the easiest example of floating-point extractors. Listing 10 shows the member function operator>>(double&), which extracts a double. It also gathers characters into a buffer with a bounded length, in this case determined by the values of the macros _MAX_SIG_DIG (maximum significant digits) and _MAX_EXP_DIG (maximum exponent digits). For a typical machine with 80-bit long doubles, you're looking at 20 or so fraction digits and four exponent digits, plus the usual assortment of signs, decimal point, and exponent character. The buffer still need not be all that large.
The work of gathering a floating-point field is carried out by the private member function _Getffld(char *), shown in Listing 11. It plays many of the same tricks as its cousin, _Getifld(char *), to keep the buffer length small and bounded. It then counts on the function _Stod, which is much like strtod, to do the rest of the conversion. But _Stod takes an additional argument computed by _Getffld: a power-of-ten correction factor. The final converted value is what strtod produces from the compressed text field, times ten raised to the correction factor.
With a bit of messy logic, you can fold this factor into the text string you construct to feed to strtod. I chose instead to write a proper version of _Stod (and _Stof and _Stold), because my library calls these functions from fscanf as well. Listing 12 shows a cheap approximation to _Stod, to show you how it works. A proper version would do better error checking, work faster, preserve precision more carefully, and not cause any floating-point overflows or underflows. But all that machinery is too much to show here.
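The arithmetic behind the correction factor can be sketched in a few lines. This is not Listing 12. The function stod_scaled is a hypothetical stand-in, and like any cheap approximation it can lose precision or overflow in the multiplication.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Converts field with strtod, then scales by ten raised to pow10.
// endptr has the same meaning as for strtod.
double stod_scaled(const char* field, char** endptr, long pow10)
{
    double value = std::strtod(field, endptr);
    if (pow10 != 0 && value != 0.0)
        value *= std::pow(10.0, static_cast<double>(pow10));
    return value;
}

void check_stod_scaled()
{
    char* end = nullptr;
    // "25" with a correction of 2 stands for 2500.
    assert(std::fabs(stod_scaled("25", &end, 2) - 2500.0) < 1e-9);
    assert(*end == '\0');
}
```

Suppose the gathering function drops two trailing integer digits from a long field because they exceed the significant-digit limit. It would then pass a correction of 2, so that the shortened text still converts to a value of the right magnitude.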

Conclusion
You can, of course, also write your own extractors. It is commonplace, when designing a new class, to provide a tailored inserter at the very least. If reading values of the class makes sense, then it is good manners to provide an extractor as well. You might even want to write an extractor or two that are not associated with a specific class.
The best style for writing new extractors is to do so in terms of the member functions of class istream. If you must drop below this level and access the associated streambuf object directly, then by all means match the discipline followed in the extractors presented here. If you don't, then it's only a matter of time before you or one of your colleagues gets burned. Such is the blessing, and the curse, of reusable software.
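As an illustration of that style, here is a minimal sketch of an extractor for a hypothetical class Point, read as text of the form (x,y). It assumes a present-day library and is built entirely from existing extractors, so it never touches the streambuf directly.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical class, introduced only for this example.
struct Point {
    int x;
    int y;
};

// Extracts "(x,y)". White space between the pieces is skipped by the
// underlying extractors when skipws is set.
std::istream& operator>>(std::istream& in, Point& pt)
{
    char open = '\0', comma = '\0', close = '\0';
    int x = 0, y = 0;
    if (in >> open >> x >> comma >> y >> close) {
        if (open == '(' && comma == ',' && close == ')') {
            pt.x = x;                             // store only on success
            pt.y = y;
        } else {
            in.setstate(std::ios_base::failbit);  // report a bad field
        }
    }
    return in;
}

void check_point_extractor()
{
    std::istringstream in("( 3 , -4 )");
    Point pt = {0, 0};
    in >> pt;
    assert(!in.fail() && pt.x == 3 && pt.y == -4);
}
```

Reporting a malformed field through failbit, and leaving the target untouched, mirrors the behavior of the built-in extractors described at the start of this column.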