February 1990/Standard C

Columns

Standard C

Quiet Changes, Part I

P.J. Plauger

P.J. Plauger has been a prolific programmer, textbook author, and software entrepreneur. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standard committee.
A language standards committee can commit a variety of sins. It can eliminate existing features, so that existing programs that use them generate diagnostics with new translators. It can add lots of new features, so that existing programs trip over them and generate diagnostics. It can even redefine existing features, so that existing programs apparently misuse them and generate diagnostics.
All of these are nasty things to do. A committee that indulges in such sins had better be prepared to justify its actions. Discarded features must be arguably dangerous, or at least not worth the clutter they cause by remaining in the language. Added features must fill a real need and not add to the clutter. Changed features require the most justification of all, since they cause the greatest disturbance.
So long as changes cause diagnostics, however, you can live with them. Even if you have to convert half a million lines of existing C code, you know how to proceed. Stuff your code through the new translator and see where it gripes. For very common gripes, you can often contrive a global edit that will mechanically fix up the code. For the rest, you at least have your attention forcibly directed to the areas where you must manually intervene.
The worst sin of all for a language standards committee is to make a change that does not cause a diagnostic. You have a working program with your existing C translator. You upgrade to a standard C compiler and your program quietly recompiles. The only problem is, it behaves differently. That is a project manager's worst nightmare.
Even if you generally like the new behavior, you have a serious problem on your hands. That half a million lines of existing C code may change its behavior in only a handful of places.You cannot rashly assume that the new behavior is acceptable every place. (Probably it is not.) You need to locate every place and check the implications of the change.
Committee X3J11 dubbed such alterations "quiet changes." We blanched every time we faced the prospect of introducing one. We did our best to avoid them. Nevertheless, we occasionally found compelling reasons to adopt quiet changes along with various other subtle but noisy changes. So we made sure that we documented every quiet change we made in the Rationale that accompanies the Standard.
I discussed the most ambitious of these quiet changes last year. (See "Standard C Promotes Types According to Value Preserving Rules," CUJ August '88.) The rules for mixing signed and unsigned integer operands in an expression were, in the past, both subtle and varied. The Committee discussed the different approaches at length before choosing a particular set of "promotion" rules. I did my best to present all the arguments and to justify the choice we eventually made.
This column and the next endeavor to summarize all of the quiet changes made in Standard C. They may not affect you because there have been numerous dialects of C in past practice. (That's a principal reason for making a language standard, to eliminate dialects.)
We labeled something a quiet change if any significant dialect of C quietly changed meaning. The change may not affect your favorite dialect. Nevertheless, you should be aware of any possibility of a quiet change in C code. Who knows, you may already have a lurking problem in code moved from a different implementation of C.
In the explanations that follow, I have copied the description of each quiet change almost verbatim from the Rationale for Standard C. They appear in the same order as in the Rationale, which reflects the order of topics presented in the Standard.

The Quiet Changes
"Programs with character sequences such as ??! in string constants, character constants, or header names will now produce different results."
For example,

printf ("You said what??!\n");
quietly becomes

printf ("You said what|\n");
This is the result of introducing trigraphs. The committee felt a compelling need to provide a way to represent certain characters unavailable in EBCDIC or the invariant subset of ISO 646. (The characters are [\]^{/}~#.) The alternate forms had to be representable using just the common subset of characters. They also had to be usable within character constants, string literals, and header names. Since existing programs can conceivably contain an arbitrary sequence of characters in these places, we had no way to satisfy these basic requirements without introducing the possibility of a quiet change.
We settled on trigraphs, or three-character sequences, as a compromise. Digraphs might be easier to type, but were more likely to change the meaning of older programs. (C uses all of the characters in the subset, so even code outside quotes and headers is endangered.) Each trigraph begins with two question marks, to minimize the chance of a quiet change. It ends with a character from the subset that is designed more or less to suggest the replacement character.
Nobody pretends that ??< is a highly readable alternative to {. But then nobody prevents you from filtering your C code before you send it to a printer. (You might, for example, overstrike a left parenthesis and a minus sign to print a left brace instead of printing the actual trigraph.) Trigraphs serve the limited purpose of providing a minimal interchange standard for shipping C between countries. (Even the Danes, who are adamant that trigraphs are insufficient, have offered no alternative to their use within quotes and header names.)
"A program that depends upon internal identifiers matching only in the first (say) eight characters may change to one with distinct objects for each variant spelling of the identifier."
For example,

int get_stuff_DEF; f() { extern int get_stuff_REF; return (get_stuff_REF); }
A clever programmer may expect that all the names beginning with get_stuff refer to the same data object. That is no longer true.
There was widespread support for longer names in C. The eight-character significance limit inherited from Ritchie's original implementation is certainly inadequate. Worse, implementations differed on the treatment of "insignificant" characters in a name. (Is an implementation obliged to ignore the extra characters when comparing names? Or is it merely permitted to ignore them?) Further confusing the issue was the distinct, and more severe, limit on external names imposed by old-fashioned linkers.
The committee decided on a three-tiered limitation on names. First, any name can be as long as a logical line. An implementation can choose to inspect all characters when comparing names. Second, an implementation must inspect at least the first 31 characters. It can choose to look at no more than 31 characters. Finally, an implementation may require that external names differ in the first six characters, and ignore case distinctions.
These rules were adopted despite a few notorious cases cited of existing programs that would quietly change. It seems that some implementations ignore characters after the first eight. Some programmers have made a practice of intentionally punning by writing distinct names that are intended to compare equal. I don't recall the rationale for this practice and I don't care. The practice is sufficiently barbaric that it garners little sympathy, even if it can be the victim of a quiet change.
"A program relying on file scope rules may be valid under block scope rules but behave differently — for instance, if d_struct were defined as type float rather than struct data in the following example:"

typedef struct data d_struct { /* ... */ }; first() { extern d_struct func(); /* ... */ } second() { d_struct n = func(): }
(This example from the Rationale is not wonderful. I even had to fix a small bug in reproducing it here.)
At issue here is the clash between C as a block scoped language and C as a "flat" language with separately compiled modules. The former requires that names be forgotten at the end of the scope in which they are defined. The latter requires that external names be remembered and matched up across separate compilations.
Past implementations differ widely on the treatment of extern declarations within function bodies. Do such declarations percolate out, a block at a time, to file level so they can be matched up with any other file-level declarations for the same name? Or does each such declaration form a worm-hole out to the linker, with the worm-hole forgotten at the end of the block? Or does something even more bizarre occur?
The example above can give different results with different interpretations. In the first case, the declaration of func percolates out from the first function. It is then visible within the second function, so the assignment makes sense.
In the second case, the declaration of func goes out of scope at the end of the first function. The second function must assume that func is implicitly declared as an external function returning int. In this case, you get a diagnostic. But change the type definition to float, as the Rationale suggests, and you get a quiet (but erroneous) conversion across the assignment.
Like the previous issue on identifier lengths, here is a case where a quiet change is essentially unavoidable. Existing dialects differ too much for the standard to contain a common subset of behavior. What the committee chose, in fact, was the second behavior. C is a block structured language with holes blown in it.
A translator can diagnose conflicting external declarations within a translation unit. It can also elect not to do so, since this is a case of "undefined behavior." A linker can diagnose conflicts between separate compilations. It can also elect not to do so. In practice, most compilers and few linkers will choose to diagnose such conflicts.
"Unsuffixed integer constants may have different types. In K&R, unsuffixed decimal constants greater than INT_MAX, and unsuffixed octal or hexadecimal constants greater than UINT_MAX, are of type long."
For example, on an implementation where type int occupies 16 bits,

f(32768); /* argument now 16-bits */ i = OxFFFFF / -10; /* divide now unsigned */
This is part of the fallout of choosing value-preserving rules for promoting types in expressions (discussed later). The committee felt obliged to tidy up the typing rules for integer constants, to maintain a consistent philosophy toward preserving the expected value of a sub-expression.
Ritchie's original rules required that 32768 have type long on an implementation where type int occupies 16 bits. That led to occasional surprises, particularly when writing arguments on function calls. (There were no function prototypes in those days to fix up or diagnose improper argument types.) With value-preserving promotion rules, however, you get the expected result more often by making 32768 type unsigned int. And such a choice is more consistent with the basic philosophy of choosing the "cheapest" type that preserves the value of an expression.
Similarly, octal and hexadecimal integer constants are expected to be unsigned. It is silly for one to lose its unsignedness just because its value is too large to be represented as type int. Consistency requires that 0x10000 (on an implementation where type int occupies 16 bits) have type unsigned long instead of long.
In both cases, you can contrive programs that quietly change meaning with the change of typing rules for integer constants. The committee felt, however, that such programs were already at risk in being moved among existing dialects, which supported a variety of promotion rules.
"A constant of the form '\078' is valid, but now has different meaning. It now denotes a character constant whose value is the (implementation-defined) combination of the values of the two characters '\07' and '8'. In some implementations the old meaning is the character whose code is 078 == 64."
This is a consequence of now disallowing the digits 8 and 9 in octal escape sequences. Even the earliest C compilers have tolerated the practice, and more than a few programs have taken advantage of this tolerance. Nevertheless, the committee felt it was sufficiently barbarous that it had to be dropped. (The committee did not revoke the even more barbarous license to write 111l in place of 111L.)
"A constant of the form '\a' or '\x' now may have different meaning. The old meaning, if any, was implementation defined."
For example,

char letter = 'a'; if (letter == '\a') /* no longer same as 'a' */
The backslash is no longer ignored in front of an arbitrary letter. Worse, Standard C now gives special meaning to \a.
The committee felt obliged to add to the list of character escape sequences. The sequence \a stands for the "alert" character. In ASCII, it is the BEL code that rings the bell on old Teletype terminals and makes some sort of electronic beep on modern ones. The sequence \x signals the start of a hexadecimal escape sequence of arbitrary length.
Neither of these escape sequences was officially defined in the past. There was the general promise that a backslash before a character with no magic meaning simply stood for that character. (I had, in fact, written a number of strings that used \x as a place holder to be filled in. That was my tough luck.) Nevertheless, the addition could cause a quiet change.
"A string of the form "\078" is valid, but now has different meaning."
See above for the same issue with character constants. The only difference is that the string literal gets longer. Character constants pack all the character codes into a single int value, in an implementation-defined manner.
"A string of the form "\a" or "\x" now has different meaning."
See above for the same issue with character constants.
"It is neither required nor forbidden that identical string literals be represented by a single copy of the string in memory; a program depending upon either scheme may behave differently."
For example,

char *s = "abc"; ..... if (s != &"abc"[0]) printf("s has changed\n");
The printed message is correct only if both instances of "abc" become the same data object. This is not guaranteed in Standard C.
Here is another case where existing dialects of C were in conflict. Some dialects guarantee that identical string literals are represented by a single copy within a translation unit. Others guarantee that each string literal occupies distinct storage.
The committee chose to leave the choice up to the implementation. It is "unspecified," so the implementation need not document the choice or even be consistent in how it chooses. (Another example of unspecified behavior is the order in which a program evaluates multiple arguments on a function call.) Naturally, any program that depends on some particular behavior is likely to be disappointed by some conforming implementation.
"Expressions of the form x=-3 change meaning with the loss of the old-style assignment operators."
For example,

i =-3; /* now stores -3 */
It has been many years since UNIX C reversed the assigning operators. Where now you write -= you once wrote =- as in the example above. Programmers who are stingy or haphazard with spacing around operators got burned often enough that Ritchie switched C to match the Algol 68 convention. Nevertheless, a number of commercial C compilers retained the old forms for backward compatibility with early C code.
Disallowing the old forms can, of course, lead to all sorts of nasty puns. Those who didn't bite the bullet back in the seventies must do so now.

Intermission
That's about half of the quiet changes documented in the Rationale for the C standard. Tune in next month for the rest of the story.