May 1996/The Learning C/C++urve

Columns

The Learning C/C++urve

Bobby Schmidt

The Why Files

Bobby takes a break from his column's routine to investigate a few of the mysteries lurking in C and C++.

Have you ever watched Mulder and Scully investigate a secret government agency crossbreeding Martians with Bigfoot's ghost, and thought, "That's not so mysterious and bizarre — let's see them diagnose a missing semicolon after a class definition." We programmers routinely encounter phenomena so baffling, dangers so unspeakable, diagnostics so incomprehensible, we dare not speak of them in polite company. Until now.

This month I start a sporadic but regular feature, investigating these mysteries and bringing their truths to light. Along with i Virtuosi di Redmond — my house band who also double as research staff — I allow select readers to pose the Tough Questions. Questions on quixotic aspects of C and C++ that almost nobody really understands and, let's face it, no person with a normal social life would ever want to understand. Strap on your five-point harnesses kids, and pay up your insurance premiums, for it's time to open ... The Why Files.

Nice Plot, Messy Ending

In a stunning coincidence perfectly suited to the Why Files, two readers from England, Francis B. and Willy S., have sent us the exact same letter:

Dear Virtuosi:

I am a playwright, and as such, I use characters pretty much non-stop all day. Not understanding the OS I bought as a child — just how does that pip command work, anyway? — I wrote my own C program to copy the characters of one play to another. While many of my plays are admittedly long, my program seems to feel they are of infinite length: with some compilers (but not all!) my character read/write loop never terminates. O, bid my program return!

The letters go on to include code, with the relevant fragment being
char c;
while ((c = getchar()) ! = EOF) 
    putchar(c);
This code may seem innocent, yet it contains several mysteries for C programmers. Such mysteries are the stuff of life for us at The Why Files, so let's investigate.

We first make Francis and Willy's code look less like C and more like a normal, readable programming language:
char c;
c = getchar();
while (c != EOF)
    {
    putchar(c);
    c = getchar();
    }
This duplicates the statement c=getchar(), but makes the logic more overt, especially for those not fluent in C's pig latin.

Next we tackle their assumptions. Like so many of us, Francis and Willy want desperately to believe what they read. In this case, they read getchar and putchar, thinking "forsooth, these routines get and put a char." Unfortunately, they instead should have read either stdio.h (the header declaring these functions), or the ANSI C spec, where they would have seen the prototypes
int getchar();
int putchar(int);
It would seem, then, that these routines could more properly be called getint and putint. Given this, our readers should have written their code more like
/* change from 'char' to 'int' */
int c;
c = getchar();
while (c != EOF)
    {
    putchar(c);
    c = getchar();
    }
The language in the spec is about as obtuse as that in Francis and/or Billy's plays, so i Virtuosi are not surprised the two missed this subtlety. But larger questions still remain: why do these routines get and put integers in the first place? And why did the original code work on some systems, but not on others? Sensitive viewers be warned: the truth may shock you.

The Ancient Reptile Brain in C

In the Cretaceous period of C programming (ca. 1980), just about everything in C was an int or an int wannabe. int was C's lingua franca, and many language features were oriented toward that type, features that (in hindsight) have no apparent logical connection to integers. As an example, consider characters, both in strings and individually. What is the data type of "blotto"? Logic would dictate the answer be char const *. But as Spock discovered in the first movie, logic alone is not enough: const had not evolved in the Cretaceous period, so to maintain backward compatibility with pre-ANSI code, the data type of "blotto" is char *. Now by implication, reckon the data type of 'b'. If you said the sensible thing — char — go back three spaces. The correct answer is, of all things, int[1] .

getchar and putchar are ginkgoes and coelacanths, living fossils from that earlier C age. But vintage alone does not explain these routines' manipulation of int. Truth is, even were these routines created from scratch today, we may still have compelling reason to use them with int. Why? Well, assume getchar is running on a system with eight-bit characters, and that this system renders all 256 such characters meaningful (e.g., MS-DOS with code pages). If all the possible values getchar returns are real characters, how can getchar tell the system there are no more characters? Possible answers include:

Pick one of these 256 characters, and designate it as the end-of-file character, just as the character code 10, typically designated by the token '\n', designates end-of-line. In my previous life as a VAX Pascal programmer, the character code 26 (control-Z) served as end-of-file. Unfortunately, while this may work fine for text files, the strategy breaks down for binary files (e.g., how can you know a control-Z is not really a byte of the binary file?).

Change getchar to something like getchar(char *, int *), so that it "returns" two values, one the character data, and the other a flag indicating end-of-file.

Have getchar set a global flag, much like ERRNO, to indicate end-of-file.

Don't have getchar indicate end-of-file status, but instead use the stdio.h routine feof.

getchar, of course, does none of these things. Instead, it picks a value outside the range of all conceivable characters, a 257th character in effect. How do you squeeze 257 characters into 8 bits? By making getchar return something larger than 8 bits, something like, oh say, an int.

This 257th character lives under the magic name EOF, a macro defined in stdio.h. In every implementation I've ever seen, EOF is defined as -1, although I can see little compelling reason for this. [It's handy for implementing the ctype functions by table lookup — pjp] According to the ANSI spec, any negative value should do. Logically, getchar now returns either

the next real character, or

EOF, which you interpret not as a character, but as a flag.

To use a hardware analogy, getchar multiplexes two kinds of information (character data and EOF signal) on the same channel. Dan Saks calls the union of characters and EOF "meta-characters," since they are conceptually characters (not integers, regardless of implementation), yet the total domain of values is larger than the set of characters. In fact, in the early days of our programming together, we had something like typedef int metachar as part of our canonical header [2] .

While I can't be certain, I speculate that the reason putchar accepts an int is because that's the type returned by getchar, allowing such classic C mischief as
putchar(getchar());
In more civilized code, as in our earlier example, having getchar and putchar use the same data type lets the same variable serve unaltered in two roles. If putchar truly put achar, we'd be faced with unfortunate (and non-portable) constructions such as
int c;
c = getchar();
while (c != EOF)
    {
    putchar((char) c); // ooo, ick, a cast
    c = getchar();
    }
char's Alien Abduction Experience

While all of this may offer semi-compelling rationale for getchar and putchar using int, it does nothing to explain why the original code works on some implementations and not others. One possible answer that should leap to mind is, "What if char and int are identical on a particular implementation?" And indeed, for such implementations, assuming EOF is safely within the bound of an int, the program works. But then, your mind may saunter to, "Hmmm, if these two types are the same, given int is signed, that means that ... that ... (the mind starts to cramp up here) that char must be signed!"

A signed character? What does that mean? If ANSI C underwrote Sesame Street, would we start seeing "today's episode is brought to you by the letter negative B?" These are characters, not blood types. And yet the amazing fact remains, characters in C (and of course C++, since we can't break that Cretaceous code) can be signed or unsigned — the choice is up to the implementation [3] .

Lets explore this a bit. Assume your implementation has the following common properties:

int is 16 bits long.

char is 8 bits long, but you don't know if it's signed or unsigned.

The execution character set is 7-bit ASCII, and you are reading text, so you won't manipulate characters with a high bit of 1.

EOF is defined as -1.

Further assume you have read to the end of stdin, so that the next call to getchar will return EOF. If your implementation uses unsigned characters, after execution of
char c = getchar();
how will the conditional
while (c != EOF)
evaluate?

Take Us to Your int

Having taken on the value of EOF (as returned by getchar), c now contains the bit pattern 0xFF (the original EOF 16-bit pattern 0xFFFF truncated to fit into an eight-bit char). When comparing an unsigned eight-bit char (c) and a signed 16-bit integer (EOF), C implicitly promotes c to a positive value stored in a signed 16-bit integer, so the synthetically 16-bit value remains 0xFF (or more accurately, OxOOFF). EOF is already a signed 16-bit integer, so its bit pattern remains unchanged (0xFFFF).

Once the dust settles from this conversion, the net result is a comparison between the bit patterns 0x00FF and 0xFFFF. These are not the same, so the original conditional
while (c != EOF)
is always true, resulting in an infinite loop.

Now we revisit the same scenario, only this time assuming char is signed. The compiler, upon encountering
while (c != EOF)
sees a signed eight-bit value compared to a signed 16-bit value. In this instance, the compiler widens the eight-bit negative value c to 16 bits. When ones- or twos-complement signed values are widened, the left-most bit (the sign bit) propagates left. In the case of 0xFF (the value contained in c), the leftmost bit is a 1. That bit propagates left, resulting in the 16-bit value 0xFFFF. The other operand, EOF, is already a 16-bit signed value (0xFFFF), so the compiler leaves it untouched. The comparison then becomes 0xFFFF against 0xFFFF. These match, so the loop correctly terminates.

You can easily run this experiment on your own system because not only can char be signed or unsigned (depending on implementation whim), but you can also explicitly specify the signedness of a character type. That is, you can declare three different characters as
char c;
signed char sc;
unsigned char uc;
On your system, take our original example, change the data type of c first to signed char, then to unsigned char, and see what happens.

To add a further wrinkle, C++ considers each of these three types distinct for overloading, so that you can declare
void f(char);
void f(signed char);
void f(unsigned char);
all within the same scope.

I have a great approach/avoidance response to characters having sign. Conceptually, characters hold pieces of text, which in my world is not signed. Unfortunately, in C char does more than hold text: it can also hold "small" integers, or conceptually signless bytes. If, for example, you want to create an array of positive integers, and you know all their values are containable within eight bits,
unsigned char array[100];
typically takes less space [4] than does
unsigned int array[100];
In my own work, I use char for text, period. In the unlikely event I want a small integer or byte, I use a typedef or, in C++, a class. I was weaned on Pascal and Ada, which consider characters essentially as a builtin enumerated type. While C and C++ don't, I still do, enforcing my own meta-semantics upon the language.

Scraps from Roswell

Diligent reader Eli Gur sympathizes with my quest for type correctness, and recommends Gimpel's Lint, claiming it actually catches use of non-bool operands in conditional expressions! For the scope of the code examples I show in this column, such a lint is probably overkill. Even so, I think I'll ring up Gimpel and try giving their product a test drive, assuming they have a Mac version.

Mac, you say? Yes, as promised/threatened, I bought a new Apple/Sun/whomever PowerBook 5300ce/117. Other than the confusion from having to operate three different GUIs now, I'm finding my initiation into the Mac world quite enjoyable. One nice side-effect is that I've gone upscale with my C++ implementation. The folks at MetroWerks have provided me their CodeWarrior Gold 8.0 C/C++ development system. MetroWerks apparently update their compilers frequently, so by the time you read this, I may be on a different version. o

Notes

[1] This is no longer true in C++, where character constants like 'b' are of type char. This distinction normally doesn't cause problems, although features such as sizeof('b'), overload resolution, and character arithmetic are affected.

[2] Since I wrote this, Dan told me that he, in turn, had cribbed the name metachar from Tom Plum's book C Programming Guidelines.

[3] Some compilers let you select the signedness of plain char, typically via compiler switch or #pragma. Although this may help simplify your code on a particular platform, any code assuming the signedness of char is non-portable.

[4] This space savings could come with a time penalty — array access tends to slow down when elements don't line up along the machine's natural word/register boundaries. If you program on a 16-bit machine, fetches of eight-bit array elements may extract a non-trivial time cost.

Bobby Schmidt is a freelance writer, teacher, consultant, and programmer. He is also a member of the ANSI/ISO C standards committee, an alumnus of Microsoft, and an original "associate" of (Dan) Saks & Associates. In other career incarnations, Bobby has been a pool hall operator, radio DJ, private investigator, and astronomer. You may summon him at 3543 167th Ct NE #BB-301, Redmond WA 98052; by phone at (206) 881-6990, or via Internet e-mail as rschmidt@netcom.com.