Columns


Standard C

Formatted Input

P.J. Plauger


P.J. Plauger has been a prolific programmer, textbook author, and software entrepreneur. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standard committee.

This is the fourth in a series of columns on input and output under Standard C. (See "Evolution of the C I/O Model," CUJ August '89, "Streams," CUJ October '89, and "Formatted Output," CUJ November '89.) The topic this month is how to perform formatted input. You can think of it as a natural, but not essential, companion to formatted output.

As I emphasized last month, you really must perform output somewhere in every program that you write. If the output is to be directly digestible by human beings, as is often the case, then you want the program to produce readable text. The formatted output functions help you produce readable text that reflects the values of encoded data in your program.

On the other hand, not all programs read input. Those that do can read data directly, using an assortment of standard library functions, and interpret it as they see fit. Converting small integers and text strings for internal consumption are both five-finger exercises that most C programmers perform easily. It is only when you must convert floating point values, or recognize a complex mix of data fields, that standard scanning functions begin to look attractive.

Even then the choice is not always clear. The usability of a program depends heavily on how tolerant it is of variations in user input. You as a programmer may not agree with the conventions enforced by the standard formatted input functions. You may not like the way they handle errors. In short, you are much more likely to want to roll your own input scanner than your own output formatter.

Obtaining formatted input is not simply the inverse of producing formatted output. With output, you know what you want the program to generate next and it does it. With input, however, you are more at the mercy of the person producing the input text. Your program must scan the input text for recognizable patterns, then parse it into separate fields. Only then can it determine what to do next.

Not only that, the input text may contain no recognizable pattern. You must then decide how to respond to such an "error." Do you print a nasty message and prompt for fresh input? Do you make an educated guess and bull ahead? Or do you abort the program? Various canned input scanners have tried all of these strategies. No one of them is appropriate for all cases.

It is no surprise, therefore, that the history of the formatted input functions in C is far more checkered than for the formatted output functions. Most implementations of C have long agreed on the basic properties of printf and its buddies. (A notable exception is the I/O library I originally wrote for the Whitesmiths C compiler. It nicely regularized the names of functions and format conversion specifications, but at a serious cost in compatibility. Eventually, we had to abandon our special dialect of I/O.) By contrast, scanf and its ilk have changed steadily over the years and have proliferated dialects.

Committee X3J11 spent considerable time sorting out the proper behavior of formatted input. Once we agreed on which input conversions to include in Standard C, we had to agree on exactly what they did. Implementations varied on the valid formats for numeric fields. They were all over the map on how to respond to invalid input. They seldom clarified how scanf interacts with ungetc and other I/O functions.

All these decisions had to be made in an atmosphere of general dissatisfaction. A vocal minority wanted major changes in the formatted input functions. An almost silent majority didn't want to be bothered with details about functions they considered useless at best, dangerous at worst. Given all these handicaps, I think X3J11 did rather a good job of clarifying the formatted input functions and making them useful.

After that introduction, I will rashly assume that you still care about the formatted input functions. The rest of this column discusses the scan functions, so called because they all have scan as part of their names. These are the functions that scan input text and convert text fields to encoded data. All are declared in the standard header <stdio.h>. To use the scan functions, you must know how to call them, how to specify conversion formats, and what conversions they will perform for you.

Calling Scan Functions

The Standard C library provides three different scan functions, declared as follows:

int fscanf(FILE *stream, const char *format, ...);
int scanf(const char *format, ...);
int sscanf(const char *src, const char *format, ...);
The function fscanf obtains characters from the stream stream. The function scanf obtains characters from the stream stdin. Both stop scanning input early if an attempt to obtain a character sets the end-of-file or error indicator for the stream. The function sscanf obtains characters from the null-terminated string beginning at src. It stops scanning input early if it encounters the terminating null character for the string.

Note that all of the functions accept a varying number of arguments, just like the print functions. And just like the print functions, you had better declare any scan functions before you use them by including <stdio.h>. Otherwise, some implementation may go crazy when you call your undeclared scan function.

All the functions accept a read-only format argument, which is a pointer to a null-terminated string. The format tells the function what additional arguments to expect, if any, and how to convert input fields to values to be stored. (A typical argument is a pointer to a data object that receives the converted value.) It also specifies any literal text or whitespace you want to match between converted fields. If scan formats sound remarkably like print formats, the resemblance is quite intentional. But there are also important differences. I will revisit formats in considerable detail later in this column.

All the functions return a count of the number of text fields converted to values that are stored. If any of the functions stops scanning early for one of the reasons cited above before it performs any conversion, however, it returns the value of the macro EOF (defined in the standard header <stdio.h>). Since EOF must have a negative value, you can easily distinguish it from any valid count, including zero. If the early stop comes after one or more conversions, the function returns the count of values stored so far. If you need to locate a stopping point more precisely, break your scan call into multiple calls.

A scan function can also stop scanning because it obtains a character that it is unprepared to deal with. In this case, the function returns the cumulative count of values converted and stored. You can determine the largest possible return value for any given call by counting all the conversions you specify in the format. The actual return value will be between zero and this maximum value, inclusive.
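To make the return values concrete, here is a minimal sketch using sscanf. The helper name scan_point and the two-integer format are assumptions chosen for illustration; neither comes from the Standard C library.

```c
#include <stdio.h>

/* Hypothetical helper: try to read two integers from text.
   Returns whatever sscanf returns, so the caller can see
   2 (both values stored), 1 or 0 (stopped at an unexpected
   character), or EOF (input ran out before any conversion). */
int scan_point(const char *text, int *x, int *y)
{
    return sscanf(text, "%d%d", x, y);
}
```

With this helper, the text "3 4" yields 2, "3 four" yields 1, "three" yields 0, and an empty string yields EOF.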

When either fscanf or scanf obtains such an unexpected character, it pushes it back to the input stream. (It also pushes back the first character beyond a valid field when it has to peek ahead to determine the end of the field.) How it does so is similar to calling the function ungetc. There is a very important difference, however. You cannot portably push back two characters to a stream with successive calls to ungetc (and no other intervening operations on the stream). You can portably follow an arbitrary call to a scan function with a call to ungetc for the same stream.

What this means effectively is that the one-character pushback limit imposed on ungetc is not compromised by calls to the scan functions. Either the implementation guarantees two or more characters of pushback to a stream or it provides separate machinery for the scan functions.
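The guarantee can be checked with a small experiment. The sketch below is only an illustration: the function name pushback_after_scan is hypothetical, and it assumes that tmpfile can create a scratch file on the host system.

```c
#include <stdio.h>

/* Sketch: scan a number from a temporary file, then push back one
   more character with ungetc. Returns 1 if the pushed-back character
   and the character that followed the number both come back in
   order, 0 if not, and -1 if no temporary file could be created. */
int pushback_after_scan(void)
{
    FILE *fp = tmpfile();
    int value = 0, ok = 0;

    if (fp == NULL)
        return -1;
    fputs("42abc", fp);
    rewind(fp);
    if (fscanf(fp, "%d", &value) == 1   /* peeks at 'a', pushes it back */
        && ungetc('#', fp) == '#'       /* our own one-character pushback */
        && getc(fp) == '#'
        && getc(fp) == 'a')
        ok = (value == 42);
    fclose(fp);
    return ok;
}
```

A result of 1 indicates that the pushback performed by fscanf did not use up the one character of pushback promised to ungetc.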

Note that the scan functions push back at most one character. Say, for example, that you try to convert the field 123EASY as a floating point value. The field is, of course, invalid. Even the subfield 123E is invalid, since the conversion requires at least one exponent digit. What happens is that the subfield 123E is consumed and the conversion fails. No value is stored and the scan function returns. The next character to read from the stream is A. This behavior matters most for floating point fields, which have the most ornate syntax. Other conversions can usually digest all the characters in the longest subfield that looks valid.
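You can observe this behavior on a given implementation with a helper along the following lines. It is a sketch under stated assumptions: the name scan_float_then_peek is hypothetical, tmpfile must be able to create a scratch file, and the helper reports what the library does rather than assuming an outcome, since a library that departs from the rule above will report something else.

```c
#include <stdio.h>

/* Sketch: write text to a temporary file, attempt one %f conversion,
   and report the next character left on the stream.
   Returns the fscanf result, or -2 if no temporary file is available.
   *next receives the next character read (or EOF). */
int scan_float_then_peek(const char *text, float *value, int *next)
{
    FILE *fp = tmpfile();
    int count;

    if (fp == NULL)
        return -2;
    fputs(text, fp);
    rewind(fp);
    count = fscanf(fp, "%f", value);
    *next = getc(fp);
    fclose(fp);
    return count;
}
```

For the text 123EASY, the rule described above predicts a return value of zero, with A as the next character.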

As a final point, the Standard C library does not provide any of the functions vfscanf, vscanf, or vsscanf. These are obvious analogs to the print functions vfprintf, vprintf, and vsprintf which I described last month. X3J11 simply felt that there was not enough call for such scan functions to require them of all implementations.

Writing Formats

Last month, I described the print formats as a mini programming language. The same is, of course, true of the scan formats. I also commented earlier that print and scan formats look remarkably alike. This should serve as both a comfort and a warning to you.

The comfort is that the print and scan functions are designed to work together. What you write to a text file with one program should be readable as a text file by another. Any values you represent in text by calling a print function should be reclaimable by calling a scan function. (At least they should be to good accuracy, over a reasonable range of values.) You would even like the print and scan formats to resemble each other closely.

Doug McIlroy, at AT&T Bell Laboratories, makes a stronger statement. He feels that any good formatted I/O package should let you write identical formats for print and scan function calls. A formatting language that is not symmetric, he feels, is deficient. I believe that Standard C comes close to achieving this goal. It is at least possible for you to write symmetric formats (those that read back what you wrote out). Be warned, however, that developing symmetry can take a bit of extra thought.

And here lies the danger. The fact remains that the print and scan format languages are different. Sometimes the apparent similarity is only superficial. You can write text with a print function call that does not scan as you might expect with a scan function call using the same format. Be particularly wary when you print text using conversions with no intervening whitespace. Be somewhat wary when you print adjacent whitespace in two successive print calls. The scan functions tend to run together fields that you think of as separate.
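A short experiment shows how adjacent conversions run together. The two function names below are hypothetical, and the 64-character buffer is an arbitrary size that is ample for two int values.

```c
#include <stdio.h>

/* Print a and b with no separator, then try to read them back.
   Returns 1 only if both values survive the round trip. */
int round_trip_joined(int a, int b)
{
    char text[64];
    int a2 = 0, b2 = 0;

    sprintf(text, "%d%d", a, b);
    return sscanf(text, "%d%d", &a2, &b2) == 2 && a2 == a && b2 == b;
}

/* Same, but with a space between the two conversions. */
int round_trip_spaced(int a, int b)
{
    char text[64];
    int a2 = 0, b2 = 0;

    sprintf(text, "%d %d", a, b);
    return sscanf(text, "%d %d", &a2, &b2) == 2 && a2 == a && b2 == b;
}
```

Printing 12 and 34 with no separator produces the text 1234, which scans back as a single field, so round_trip_joined(12, 34) reports failure. The spaced version reads back both values.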

The basic operation of the scan functions is, indeed, the same as for the print functions. Call a scan function and it scans the format string once from beginning to end. As it recognizes each component of the format string, it performs various operations. Most of these operations consume characters sequentially from a stream (fscanf or scanf) or from a string stored in memory (sscanf).

Many of these operations generate values that the scan function stores in various data objects that you specify with pointer arguments. Any such arguments must appear in the varying length argument list, in the order in which the format string calls for them. For example,

sscanf("thx 1138", "%s%2o%d", a, &b, &c);
stores the string "thx" in the char array a, the value 9 (octal eleven) in the unsigned int data object b, and the value 38 in the int data object c.
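Here is a compilable sketch of that call. The wrapper name scan_thx and the array size of 16 are assumptions for illustration. Note that the array a is passed by name, since it already converts to a pointer to char, and that b is an unsigned int, the type that the o conversion stores.

```c
#include <stdio.h>

/* Sketch of the call above. a must be large enough for the scanned
   string plus its terminating null; 16 is an arbitrary choice that
   is ample for this literal input. */
int scan_thx(char a[16], unsigned int *b, int *c)
{
    return sscanf("thx 1138", "%s%2o%d", a, b, c);
}
```

The call returns 3. The field width of 2 limits the octal field to the digits 11, which leaves 38 for the decimal conversion.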

It is up to you to ensure that the type of each actual argument pointer matches the type expected by the scan function. (The pointer must, of course, also point to a data object of the expected type.) Standard C has no way to check the types of additional arguments in a varying length argument list.
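As an illustration of matching pointer types to conversions, the sketch below uses the h and l size qualifiers of Standard C. The function name scan_mixed and the choice of fields are assumptions for this example only.

```c
#include <stdio.h>

/* Sketch: each pointer type matches its conversion specification.
   %hd stores through a short *, %ld through a long *,
   %f through a float *, and %lf through a double *. */
int scan_mixed(const char *text, short *s, long *l, float *f, double *d)
{
    return sscanf(text, "%hd%ld%f%lf", s, l, f, d);
}
```

Passing a pointer to double where %f expects a pointer to float, say, is undefined behavior that the compiler is not required to diagnose.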

Not every part of a format string calls for the conversion of a field and the consumption of an additional argument. In fact, only certain conversion specifications gobble arguments. Every conversion specification begins with the % escape character and matches one of the patterns described below. The scan functions treat everything else either as whitespace or as literal text.

Whitespace in a scan format, by the way, is whatever the standard library function isspace (declared in <ctype.h>) says it is. That can change if you call the function setlocale (declared in <locale.h>) before you call the scan function. Your program begins execution in the "C" locale, where whitespace is what you have learned to know and love.

A sequence of one or more whitespace characters in a scan format is treated as a single entity. It consumes an arbitrarily long sequence of whitespace characters from the input. (Again, whitespace is whatever the current locale says it is.) The whitespace in the format need not resemble the whitespace in the input in any way. The input can contain no whitespace. Whitespace in the format simply guarantees that the next input character (if any) is not a whitespace character.
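The following sketch contrasts a format that contains whitespace with one that does not. Both function names are hypothetical, and the comma-separated pair is just a convenient test case.

```c
#include <stdio.h>

/* Whitespace before the comma in the format: the input may have any
   amount of whitespace there, including none. */
int scan_pair_loose(const char *text, int *a, int *b)
{
    return sscanf(text, "%d ,%d", a, b);
}

/* No whitespace in the format: the comma must follow the first
   number immediately. */
int scan_pair_strict(const char *text, int *a, int *b)
{
    return sscanf(text, "%d,%d", a, b);
}
```

Both versions accept whitespace after the comma, because the d conversion skips leading whitespace on its own. Only the loose version accepts "1 ,2".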

Any character in the format that is not whitespace and not part of a conversion specification calls for a literal match. The next input character must match the format character. Otherwise, the scan function returns with the current count of converted values stored. A format that ends with a literal match can produce ambiguous results. You cannot determine from the return value whether the trailing match failed. Similarly, you cannot determine whether a literal match failed or a conversion that follows it. For these reasons, literal matches have only limited use in scan formats.
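One way to detect a failed trailing match, not covered above, is the n conversion of Standard C, which stores the number of characters consumed so far and does not add to the returned count. The sketch below relies on it; the function name scan_kilograms and the unit text are assumptions for illustration.

```c
#include <stdio.h>

/* Sketch: returns 1 only if text holds an integer followed by the
   literal text "kg". The %n conversion is reached only if the
   literal text matched, so end stays negative otherwise. */
int scan_kilograms(const char *text, int *value)
{
    int end = -1;

    if (sscanf(text, "%d kg%n", value, &end) != 1)
        return 0;
    return end >= 0;
}
```

Without the n conversion, both "10 kg" and "10 lb" produce a return value of 1 from sscanf, which is the ambiguity described above.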

For completeness, I should point out that a literal match can be any string of multibyte characters. Each sequence of literal text must begin and end in the initial shift state, if your target environment uses a state-dependent encoding for multibyte characters. I suspect, however, that you will have little need to match Kanji characters with scan formats in the next few years.

Conversion Specifications

A scan conversion specification differs from a print conversion specification in fundamental ways. You cannot write any of the print conversion flags and you cannot write a precision (following a decimal point). On the other hand, scan conversions have an assignment-suppression flag and a conversion specification called a scan set. Following the % you write three components. All but the last component is optional. In order:

The goal of each formatted input conversion is to determine the sequence of input characters that constitutes the field to convert. The scan function then converts the field, if possible, and stores the converted value in the data object designated by the next pointer argument. (If assignment is suppressed, no function argument is consumed.)
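Assignment suppression is easiest to see in a small example. In this sketch, the function name scan_third_field and the three-field layout are assumptions; the asterisk after the % is the assignment-suppression flag.

```c
#include <stdio.h>

/* Sketch: skip a word and a number, keep only the second number.
   The two suppressed conversions consume input but no arguments,
   and they do not add to the returned count. */
int scan_third_field(const char *text, int *value)
{
    return sscanf(text, "%*s%*d%d", value);
}
```

For the text "id 42 77" the call returns 1 and stores 77.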

Unless otherwise specified below, each conversion first skips arbitrary whitespace in the input. Skipping is just the same as for whitespace in the scan format. The conversion then matches a pattern against succeeding characters in the input to determine the conversion field. You can specify a field width to limit the size of the field. Otherwise, the field extends to the last character in the input that matches the pattern.
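Field widths let you split input that has no separators at all. The sketch below assumes an eight-digit date layout; the function name scan_date is hypothetical.

```c
#include <stdio.h>

/* Sketch: split an eight-digit date such as "19891201" into fields
   by giving each conversion a field width. */
int scan_date(const char *text, int *year, int *month, int *day)
{
    return sscanf(text, "%4d%2d%2d", year, month, day);
}
```

Without the widths, the first conversion would swallow all eight digits.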

The scan functions convert numeric fields by calling one of the standard library functions strtod, strtol, or strtoul (declared in <stdlib.h>). A numeric conversion field matches the longest pattern acceptable to the function it calls.
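If you do roll your own scanner, you can call these conversion functions directly. The following is a minimal sketch using strtol; the wrapper name parse_long and its interface are assumptions for illustration.

```c
#include <stdlib.h>

/* Sketch: convert a leading decimal integer with strtol and report
   where the conversion stopped. Returns 1 if any characters were
   converted, 0 otherwise. */
int parse_long(const char *text, long *value, const char **rest)
{
    char *end;

    *value = strtol(text, &end, 10);
    *rest = end;
    return end != text;
}
```

Unlike a scan function, strtol tells you exactly where it stopped, so you can decide for yourself how to treat the characters that follow.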

Scan Sets

A scan set behaves much like the s conversion specifier. It stores up to w characters, where w is the field width (default is the rest of the input), in the array of char pointed at by ptr, the corresponding pointer argument. It always stores a null character after any input.

It does not, however, skip leading whitespace. It also lets you specify what characters to consider as part of the field. You can specify all the characters to match, as in:

"%[0123456789abcdefABCDEF]"
which matches an arbitrary sequence of hexadecimal digits. Or you can specify all the characters that do not match, as in:

"%[^0123456789]"
which matches any characters other than digits.

If you want to include the right bracket (]) in the set of characters you specify, write it immediately after the opening [ (or [^). You cannot include the null character in the set of characters you specify.
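The sketch below exercises the scan sets described above. The three function names are hypothetical, and the field width of 15 is an added precaution, so that no call can store more than 16 characters, counting the terminating null.

```c
#include <stdio.h>

/* Each helper stores at most 15 characters plus a null in out,
   which must have room for 16 chars. Returns the sscanf result. */
int scan_hex_digits(const char *text, char out[16])
{
    return sscanf(text, "%15[0123456789abcdefABCDEF]", out);
}

int scan_non_digits(const char *text, char out[16])
{
    return sscanf(text, "%15[^0123456789]", out);
}

/* The ] written right after the opening [ is part of the set,
   so this set holds the two characters ] and [. */
int scan_brackets(const char *text, char out[16])
{
    return sscanf(text, "%15[][]", out);
}
```

Because a scan set does not skip leading whitespace, scan_hex_digits returns zero for the text " ff".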

Some implementations may let you specify a range of characters by using a minus sign (-). The list of hexadecimal digits, for example, can be written as:

"%[0-9abcdefABCDEF]"
or even, in some cases, as:

"%[0-9a-fA-F]"
Please note, however, that such usage is not universal. Avoid it in a program that you wish to keep maximally portable.