P.J. Plauger has been a prolific programmer, textbook author, and software entrepreneur. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee.
Nothing is perfect. A document produced by a committee is certainly no exception. It is hardly surprising, therefore, that people have found much to criticize in the ANSI standard for C.
Most of the imperfections can be chalked up to political compromise. Some are existing practices that are too deeply entrenched to change, no matter how strong the current consensus against them. A few are simply things that the standards committee arguably got wrong and didn't fix. A few more are important additions that somehow never garnered enough concerted support to make it in.
Preprocessing, for example, was in the worst shape of any part of the C language. The committee did rather a good job of tidying up several messes in this area. Just defining the preprocessing phases more precisely was a major contribution. Still, there were a few botches and omissions.
I have been one of the strongest defenders of the ANSI C standard produced by committee X3J11. As an active participant, I saw the need for compromise and the need to retain backward compatibility even when it hurt. I also know intimately how much work went into producing the standard. If a few areas couldn't get cleaned up in time, so be it. The ANSI C standard is still one of the best language standards I have ever encountered.
Nevertheless, I am not blind to the shortcomings of the document we produced. We missed a number of opportunities to make the language better in small ways. We committed the sin of inconsistency more times than I care to admit. We left out all sorts of clever improvements to the C language. I have my own list of gripes about the C standard.
I figured that it was time for a change of pace in these pages. After a couple of years of explaining and defending the C standard, I plan to take a few potshots at it. What follows is a weakly ordered collection of observations. Each describes some way in which I feel the standard could have been better. For now, I confine my remarks to the language proper. I plan to devote considerable attention to the Standard C library in the months to come.
What Didn't Get Cleaned Up
We missed several opportunities to tidy up the language proper. Here are a few of them.

Historical usage prevented us from making floating literals type float by default. It makes more sense to add a suffix to get type double. Sadly, you must add an F to get the former, since C has traditionally considered floating literals to have type double.
Similarly, the committee had to back off from making string literals type array of const char. Too many existing programs have code such as
char *p = "abc";

which would require a cast to avoid a diagnostic. So string literals have the curious property of being semantically const (for a portable program) without having the type that goes with the semantics.

The French standards committee, AFNOR, wanted to put the null pointer constant NULL into the language. So did a few other people. It has the same slippery semantics that nil enjoys in Pascal, but without the same full language support. As a consequence, different implementations must define it as a macro in different ways. That invites its misuse, which in turn makes it harder to write portable programs.
Several people proposed various schemes for making enumerations more strongly typed. Most were too scary to adopt. The rest failed to garner enough support even for extended debate. What we ended up with is somewhat better than using preprocessor macros to name constants, but not much.
Each enumeration you write becomes a synonym for one of the integer types that promotes to type int. (An implementation can tailor the storage it uses to represent an enumeration.) As far as type checking goes, however, an enumeration constant or data object simply has an integer type. You can mix apples and oranges.
We talked more about making bitfields better, but in the end we didn't do much. What you want, at the very least, is the ability to declare the size of "storage unit" that you are carving up into bitfields. You want eight different base types, the signed and unsigned flavors of char, short, int, and long.
The standard provides only three base types, plain int, signed int, and unsigned int. The plain flavor has special meaning in this context (and only in this context). It lets the implementation define whether the component bitfields have values that are signed or unsigned. That wart was added to be nice to existing implementations, not to make bitfields any more usable.
We talked at great length about value preserving versus unsigned preserving arithmetic. (It is more fair to say that we fought tooth and nail.) Nevertheless, none of us tried to fix a closely related problem, the surprises that abound when you mix signed int and unsigned int operands.
C traditionally calls for the signed operand to be converted to unsigned, which is the type of the result. To get a sensible value in many cases, however, you should convert both to a slightly larger signed type. We shuddered to think what changing this rule might do to existing programs, so we left the problem alone. I wish we could have fixed it.
When I wrote my first C compiler many years ago, the first thing I found myself hating was the unrestricted goto statement. You can write a goto that transfers control into a block from somewhere outside. You can even jump to the statements controlled by if, else, while, and other flow-of-control keywords. What that does to code optimization is beyond belief. Either you despair of doing many optimizations or you write a much larger translator.
We discussed restricting goto statements on several occasions. What prompted us to leave them alone was the protests of an important constituency. More and more people write applications that generate C code to be compiled, as a sort of universal assembly language. A number of existing applications depend on the ability to write ugly goto statements that no human being need ever see. Were we to tidy up the semantics of control flow, we would require serious restructuring of these applications. With no little sadness, we left the goto alone.
There was one area that even our extensive cleaning could not rescue completely. It was simply too dirty. I refer to the whole business of declaring and naming external variables. The problem is that C must work with many existing assemblers and linkers built to ancient specifications. That severely limits the length of external names. The committee had no serious problem increasing internal names to 31 significant characters. But we balked at requiring more than the worst-case six characters (and single case of letters) required by the stupidest of existing linkers. Despite heated debate, the majority did not want to add to the difficulty of linking C with other languages.
Another aspect of this problem affects how you write multiple declarations for the same external variable. C programmers need reliable methods for ensuring that each variable has a definition, and that none has a multiple definition. Linkers vary all over the map in the kind of machinery they provide. As a consequence, C developed several dialects in this area. I believe the committee did an admirable job of embracing all these dialects and accommodating the varied linker technologies. It's too bad, however, that we couldn't just throw it all away and do it over properly.
What Went In Wrong
In some cases, what we added to the language proper wasn't exactly right.

We botched things a bit when we introduced preprocessing numbers. These are tokens that subsume all valid numeric C tokens. We defined them to clarify what intermediate forms can occur during preprocessing while you endeavor to paste together valid numeric C tokens. The only problem is, 0X12E+3 now looks like a single preprocessing number (which becomes an invalid numeric C token). In the past, most translators knew to parse it as a hexadecimal literal, a plus operator, and a decimal literal. We must now learn to be wary of hexadecimal literals that end in E.
The include directive had to compromise between two rather different implementation styles. One approach is to parse just enough of each C source line during preprocessing to decide what to do with the rest. In this case, angle brackets and double quotes parse as special delimiters within the include directive. The other approach is to parse every line into preprocessing tokens, then decide what to do. That makes it very exciting to parse directives such as
#include </*.h>

If you see that you are building an include directive soon enough, you know to ignore anything funny before the closing angle bracket. If you first tokenize and then look, you may decide that the /* signals the start of a comment.

The committee endeavored to describe preprocessing in such a way that either approach is acceptable. Sadly, the words were reworked several times by editors with conflicting views. I can't honestly report that the pre-tokenizers were well treated in the end. You can still pre-tokenize each line when parsing C, but you have to indulge in a few heroic measures to rescue include directives.
Another example also has to do with how you write declarations, but you can't blame any problems on existing linkers. The difficulties are purely internal to C.
I refer to the outrageous overloading of the storage class keywords. What you mean by static or extern (or by writing no storage class at all) can have three different meanings, depending upon where you write the declaration. And if another declaration for the same name is in scope, each of these meanings can change again. C has always been messy in this regard, but the committee made it even messier with one or two arbitrary decisions.
I have tried to tabulate the semantics of storage class keywords several different ways. (See, for example, "What's in a Name?" CUJ February 1988, and Standard C by P.J. Plauger and Jim Brodie, Microsoft Press, Redmond WA, 1989.) None of the presentations have a compelling logic, because the underlying machinery is not entirely logical. It could have been made much cleaner.
Another thing we got wrong was allowing the sizeof operator to accept an rvalue operand. I suspect most people who voted for the extension assumed you could make useful tests with it. For instance, you might think that sizeof (x+y) would tell you whether two floats are added in double precision on a particular implementation. Not so. The type of the expression is float even if the intermediate representation happens to be double.
The extension was worse than useless, however, because it caused trouble. People started asking all sorts of embarrassing questions about the types of various rvalues. And the committee started deciding answers all sorts of different ways. We now have the situation that sizeof 'a' can be larger than sizeof L'a' even though sizeof (char) is less than sizeof (wchar_t). Yuk.
There is only one other thing in the C language proper that I think we got really wrong: the semantics of pointers to constant data objects. What I wanted was a fairly serious promise. The data object pointed to by any pointer to const type should be truly constant, at least for a while. ("A while" should be from the time execution enters the function containing a reference using the pointer until the function returns.)
What this restriction provides is much of the semantics you need to safely parallelize C code automatically. What it evidently costs you is additional subtle compatibility problems with C++. At least that was the strongest argument I heard against the stronger semantics.
So we settled for a fairly wimpy position. All that a pointer to const assures you is that you can't alter the value stored in a data object by using that particular pointer. You can't optimize much, however, because some other agency might be changing the stored value.
I backed the addition of the notorious noalias type qualifier in large part because of the differences over pointers to const. I identified five or six desirable sets of semantics for accessing data objects. Three type qualifiers give you eight possibilities. When noalias got shot down, we had to settle for only four. They weren't the four I wanted.
What Didn't Get In
Lots of things didn't get into the language proper. Here are a few whose loss I lament.

Our failure to solve the non-ASCII character set problem still haunts us at the international level. We need alternate spellings of the operators and punctuators that use the more esoteric ASCII characters, since these are often recycled in ISO 646 or even absent in EBCDIC. Trigraphs such as ??< just don't cut it for readability. Sadly, the committee could never agree on a particular set of more readable operators.
All sorts of clever additions were suggested to make macros more powerful. Most I cheerfully helped beat down, but two failed suggestions I miss. One is for some form of conditional macro, such as
#define ptc(f,c) eq(f,stdin,putchar(c),putc(f,c))

If the first two arguments to eq match (after expansion) then the third is retained, otherwise the fourth. With recursion, you can write wondrous macro definitions.

The other thing I miss is some way to create character literals. You can now create a string literal from argument X by writing #X within a macro definition. It would be nice if you could create a character literal by some similar mechanism. Since the next obvious operator ## is already defined, however, that suggests a rather odious ### which few people could swallow. Dave Prosser suggested a rather nice notation, but not until well after the committee (and several implementations) got settled with the current one.
A typeof operator would also help make more powerful macros. It would let you declare temporary data objects having the same type as one of the arguments to a macro. You could then write a generic "swap" macro, as in:
#define swap(x, y) \
    { typeof (x) t; \
      t = (x); \
      (x) = (y); \
      (y) = t; }

Of course, swap can only take the place of a statement. It cannot yield a value. That's what you need to write a safe macro for, say, the maximum value of two arguments. Otherwise, it is hard to avoid evaluating an argument expression twice, side effects and all. To get temporaries inside a subexpression, you need some way to delimit a local scope. Several schemes were proposed; none was adopted.

A similar but somewhat different need is the ability to construct a structure on the fly. More than one existing implementation lets you write something like (struct complex){cos(th), sin(th)} within an expression. C is certainly a more attractive language, at least to some constituencies, with such expressive capabilities.
The last thing I really miss is some form of repetition counts within data initializers. The Whitesmiths C compiler let you write things like:
char pattern[1000] = { [100] '.', [800] 'X', [100] '.' };

which is much easier to type, and maintain, than spelling out all the data.

Beyond this point, my wish list dribbles off with items I find less important. Many of my customers loved the case ranges we added to Whitesmiths' C. Unnamed unions within structures can eliminate the need for dummy member names. Arbitrary rvalues in initializers for auto arrays and structures can have their uses. All of these features I can take or leave, however.
I would like to have seen arrays become first class objects in Standard C. Array assignment and functions returning arrays have always been expressible, despite what many people think. The advent of function prototypes gave us a way to pass functions as arguments. Nevertheless, the confusion surrounding arrays as lvalues in C is so widespread that even I must acknowledge the dangers. I remain a minority of one in this area, I fear, in being willing to face those dangers and fix array handling in Standard C.
Conclusion
Having said all this, I now feel moved to make a few disclaimers. First, I acknowledge that everyone has a list of grievances about the current C standard. I don't presume that my list is more important or (much) more wisely considered than all others. It just happens to be my list, and this is my soapbox.

Second, I do not feel ill used that my list of grievances is so long. I got plenty of opportunity to mouth off during the committee meetings. (Many witnesses can attest that I got more than my share of opportunities.) I felt well heard and was pleased to see any number of issues go the way I hoped.
Last and most important, I don't even want most of these grievances satisfied. (I argued against fixing many of them when they were debated.) I respect the need to satisfy diverse constituencies. If I got my way on many of these issues, I would feel duty bound to accept the strong desires of others in similar areas. I far prefer a compromise language with widespread support to one that meets my needs but alienates many others.
Even if I were the sole arbiter, I still would not make many of the changes I outlined here. Why? Because the language would be too different from the C we know and love. And it would get that much bigger for a questionable increase in value.
Standard C is essentially twice as big as the C described by Kernighan and Ritchie. Admittedly, complexity is hard to quantify, but I arrive at that number through three telling metrics. The size of the Whitesmiths C compiler doubled in lines of source by the time we achieved full compliance with Standard C. It also doubled in bytes of executable code. And the size of the reference manual that went with it doubled in pages. I believe Standard C is still intellectually manageable, but is beginning to strain the bounds of a "small" language.
Think how big the language would have gotten had committee X3J11 tried to please everyone. Or even just me.
Standard Finalized
The ANSI C standard has been adopted! The ANSI Board of Standards Review (BSR) voted unanimous approval at their December meeting of the draft developed by committee X3J11 and approved by X3.

BSR was meticulous in informing the complainant who had delayed progress of the standard for the past year. He was given a generous period of time to file a further protest with BSR. The time period expired, however, with no protests filed.
The official designation of the new C standard is ANSI X3.159-1989. It came in just under the wire, but it did earn a 198X designation.
ISO Update
The C standard commenced its six-month balloting period as a "draft international standard" (DIS) in December 1989. That is normally the final approval process before SC22 sends the draft on for mechanical review and adoption by ISO. It is widely understood, however, that both the United Kingdom and Denmark are determined to make changes in the C standard at the ISO level.

A meeting of the ISO C committee WG14 will be held in London in late May or early June 1990 to commence work on two "normative addenda." These were approved by the parent committee, SC22, at a recent meeting. One addendum is an attempt by the British to make the language of the standard more precise. The other is expected to add machinery for writing C source files more readably in European character sets.
Once these normative addenda are developed and approved by WG14, they must follow the same approval path through ISO as the standard developed by X3J11. It remains to be seen whether the DIS will be held up pending approval of the addenda. It also remains to be seen how much support exists within ISO for amending the ANSI C standard.