Departments


We Have Mail


Dear Mr. Plauger,

First, I would like to express my strong interest in the topic of your new column in the CUJ, tracking the evolving C++ standard library. Reading your recent book Programming on Purpose was also a pleasure, and sometimes enlightening, while in a number of other cases I had already come to much the same conclusions as you. (In fact, there are a few points where I substantially differ from what you write in your essays.)

But the main reason for writing to you as the Senior Editor of the CUJ is to bring up another topic. As I assume your mailbox is full and you don't want to read letters as long as this one without having at least some slight idea of what they want to tell you, here is my letter in brief: The first part details my criticism of the insecure techniques used by so many C programs to handle character strings. This applies to a number of programs printed in the CUJ, too. In the second part (see Listings 1 & 2) I present an amazingly simple string-handling solution I use in my own programs — no, no, not the (n+1)th full-blown string-handling library. Something very simple, I promise.

When I try to judge the overall quality of a program, I often glance through the listings to locate the instances of strcpy, strcat, etc. It alerts me to see these operations applied to fixed-length char arrays. It is not uncommon to find that the program doesn't check the array bounds but instead reserves space for some assumed "worst case." Programs that at least check the array bounds are protected against unusual input or malicious users, but they either suffer from arbitrary limitations or reserve an extreme amount of space, so that they tend to run into memory limitations sooner than necessary.

Some programs contain comments on why a seemingly arbitrary limit was chosen. Sometimes these comments reveal the limited experience of the author. As a recent example, take the Dec. '93 issue of CUJ, Listing 5, page 37. The comment that accompanies the definition of MAX_LENGTH states that it is sufficient to reserve 128 bytes for a file name, simply because it is supplied as a command-line argument and MS-DOS obviously limits the command line length to that magic number.

But suppose the program is useful, and suppose someone ports it to UNIX? It will run happily most of the time and fail some other day when you try it with a really long file name. (Please understand: I do not want to pick especially on the author of that particular program. He just had the bad luck that his article was the first at hand to illustrate my point. Besides, in this example there is enough DOS-specific file name and suffix handling that the critical limitation might well show up while porting it to some other environment.)

Because C has little support for reliable string processing built into the language proper and the library, many programmers seem to look for reasons to justify certain fixed sizes of char arrays used as "string variables." Here is another typical misconception: Just because the UNIX kernel evaluates only the first PATH_MAX characters of filenames in system calls, some programmers assume it is sufficient to define an array of this size. Subsequently they do not check the array bounds, which is of course deadly if any part of the file name is supplied from outside (by user input, environment variables, etc.).

And a third one: The existence of the miraculous constant BUFSIZ in <stdio.h> still makes a number of programmers believe it is safe to read user input with the function gets into an array of exactly this size. But this was already dangerous in the old days, when BUFSIZ was also wired into the UNIX terminal device driver and because of that happened to limit the amount of interactive input. Even in those ancient times input could have been redirected and read from a file with longer lines.
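To make the pitfall concrete, here is a small sketch (mine, not from the letter) of the criticized pattern next to a bounded alternative; gets has no way to learn the size of its argument, while fgets at least respects the bound you give it:

```c
#include <stdio.h>
#include <string.h>

static char line[BUFSIZ];

/* The dangerous pattern: any input line of BUFSIZ or more
 * characters overruns the array, no matter how BUFSIZ was chosen.
 *
 *     gets(line);              -- never safe
 *
 * A bounded read; long lines are split rather than overflowing: */
char *read_line(FILE *fp)
{
    return fgets(line, sizeof line, fp);
}
```

Note that fgets only contains the damage; a line longer than the array still arrives in pieces, so the caller must decide what to do with the remainder.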

Do you remember the "Internet Worm?" It was this sort of loophole through which that creature crawled from system to system. By writing beyond the end of a fixed-length char array in a critical program, the "worm" cleverly faked the return address of a function and finally gained access to privileges that made further intrusion possible.

So, if a program obviously does not care for defensive programming techniques but uses fixed-length char arrays for user-supplied input without checking the bounds, I lose trust in its other algorithms that I do not understand easily. Further, if some comments in the program show limited experience with different operating systems, I fear other, possibly harder to spot, pitfalls in porting the program. (Again, this is not meant as criticism of the author of the program mentioned as an example above. Maybe portability was never one of his design goals!)

To turn my criticism into something constructive, I want to explain the approach I use for char string handling. It demonstrates that a program that wants to avoid fixed-length char arrays needs neither a complicated string-handling library nor cumbersome counting and mallocing, with its good odds of "off-by-one" errors. The source files pstring.h and pstring.c (appended at the end of this mail) show my approach in detail, but I think you need not look at them to understand the following explanations.

The central function is pstr_x, which accepts a variable number of char pointers as arguments. After mallocing sufficient space for the summed-up lengths of all the strings to which the arguments point, it concatenates copies of these strings in the allocated space and returns a pointer to that space. There are macros called pstr_1, pstr_2, etc. that constitute a simple interface to pstr_x: they forward a fixed number of (macro) arguments to the function pstr_x and add a trailing null pointer that marks the end of the list. Note that these macros are not essential to my approach; they only make string operations in the source more readable.
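The listings themselves are not reproduced here, but a function with the behavior just described could look something like the following sketch (my reconstruction, not Mr. Weitzel's actual pstring.c); it makes two passes over the argument list, first to sum the lengths and then to copy:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate a null-pointer-terminated list of strings into
 * freshly malloc'd space.  Aborts on allocation failure, so
 * callers never need to test the returned pointer. */
char *pstr_x(const char *first, ...)
{
    va_list ap;
    const char *p;
    size_t len = 0;
    char *result;

    /* first pass: sum the lengths */
    va_start(ap, first);
    for (p = first; p != NULL; p = va_arg(ap, const char *))
        len += strlen(p);
    va_end(ap);

    result = malloc(len + 1);
    if (result == NULL) {
        fputs("pstr_x: out of memory\n", stderr);
        abort();
    }

    /* second pass: concatenate copies of the pieces */
    result[0] = '\0';
    va_start(ap, first);
    for (p = first; p != NULL; p = va_arg(ap, const char *))
        strcat(result, p);
    va_end(ap);
    return result;
}

/* The convenience macros merely append the terminating null pointer. */
#define pstr_1(a)          pstr_x((a), (const char *)NULL)
#define pstr_2(a, b)       pstr_x((a), (b), (const char *)NULL)
#define pstr_4(a, b, c, d) pstr_x((a), (b), (c), (d), (const char *)NULL)
```

The explicit `(const char *)NULL` cast in the macros matters: a bare `NULL` or `0` need not have pointer width when passed through `...`, so the cast keeps the sentinel portable.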

As a typical example, consider the case when a program needs to build a file name from a directory prefix, a separator, a user-supplied base name, and a suffix.

Everybody who has ever done this by direct calls to strcpy and cousins knows that concatenating four strings, two of which have unknown length, is not a trivial task if you want to guard against all the kinds of problems malicious users could create. At the least, the source code gets intermingled with strcpys, strcats, and the usual bean-counting to avoid writing beyond the end of an array. Surely it will not be as readable as the following fragment:

const char SEP[] = "/";
const char *home = getenv("HOME");
if (!home) home = ".";
...
{
    /* here comes the interesting part */
    char *filename = pstr_4(home, SEP, argv[1], ".xxx");
    ...
    ... fopen(filename, "r")...
    ...
    free(filename);
}
Considering readability, this is not so far from a C++ string class with an overloaded operator+. The only catch is to remember that the pstr_ function and macros return a pointer to space that should be freed once it is no longer used. But it is even possible to relieve the programmer of this burden in a number of cases, the above example included. The header "pstring.h" defines some more macros, and in the file pstring.c there is the helper function tmpstr; together they make the following possible:

const char SEP[] = "/";
const char *home = getenv("HOME");
if (!home) home = ".";
...
... /* here it comes */
... fopen(TMPSTR4(home, SEP, argv[1], ".xxx"), "r");
As you see, not only can the free now be omitted, but also the extra variable filename. The price to pay is understanding that the strings returned by the TMPSTR macros are meant for one-time use only. Pointers should not be initialized by some TMPSTR call, and in particular, calling a function with more than one argument initialized by TMPSTR calls will produce undefined results (and probably cause havoc). I chose uppercase names for that group of macros to make them stand out a bit more clearly and to alert the uninformed reader to their special nature.
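The letter does not show tmpstr itself, but its described properties (one-time use, trouble with two TMPSTR arguments in one call) are consistent with a single static slot that frees the previous temporary on every call. A guess at the mechanism:

```c
#include <stdlib.h>
#include <string.h>

/* One static slot for "temporary" strings: each call releases the
 * string handed out by the previous call.  This is exactly why a
 * TMPSTR result is a one-time good: the next TMPSTR use pulls the
 * rug out from under the last one. */
char *tmpstr(char *s)
{
    static char *last = NULL;
    free(last);
    last = s;
    return last;
}

/* A TMPSTR macro would then simply wrap the corresponding pstr_
 * macro from pstring.h, e.g.:
 *   #define TMPSTR4(a, b, c, d)  tmpstr(pstr_4((a), (b), (c), (d)))
 */
```

With this scheme, `f(TMPSTR2(a, b), TMPSTR2(c, d))` is indeed broken: whichever macro expands second frees the string produced by the first before f ever sees it.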

The function pstr_x resembles a generalized version of strdup, which was part of some older C libraries, e.g. in XENIX, but for unknown reasons did not make it into the ANSI/ISO C library. (Probably the committee could not agree whether to group it under <string.h> or <stdlib.h> — you should know better than I :-).) But there is also an important difference between strdup and its direct equivalent pstr_1. If the program runs out of memory, the latter will not return but will abort the program instead, while the former returns a null pointer and leaves error handling to the caller. Much of the ease of use of the pstr_ function and macros is due to the guarantee that they always return a valid pointer.

One may argue that this behavior (aborting the program) contradicts "The Spirit of C," which would rather be to return control to the caller of a function in case of problems. But especially for pstr_x (or more generally, when relatively small chunks are allocated from the free store) there is another strong argument in favor of aborting the program. The heap typically grows toward the program's stack, i.e., both live at opposite ends of the (data) address space and extend toward each other, so a program that runs into malloc failures in some situations will, with about equal probability, break down from stack overflow in others. So, even if such a program carefully cleans up after malloc fails, it does so for only half the number of crashes.

To put it in other words: whenever it is crucial for some application to remove work files, reset tty modes, etc., and there is the possibility that the program will run into memory limitations, it is necessary to look for alternate cleanup strategies (e.g., a parent/child process design), because watching out for malloc alone will not be sufficient. So, I think it is admissible that pstr_x simply aborts the program if there is no more space — especially as this substantially simplifies large parts of the caller's source. Also note that if char arrays of fixed size are used as an alternative to malloc for string processing, those arrays are either allocated on the stack (storage class auto) or reduce the program's data space permanently (storage class static). Since such arrays must be sized for the worst case, the program will typically run out of space much sooner, and probably with an uncatchable stack overflow, too.

An advantage of my simple approach over more complicated string-handling libraries is that you need not learn much to use it. This is because all strings are still char pointers and can be handled as usual, except for assignment. In that case, either care must be taken to deallocate previously assigned space (i.e., call free before assigning to the pointer holding the result of a call to the pstr_ function or macros), or alternatively the special function pstrcpy can be used.
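Again, the actual pstrcpy is in the appended listings rather than the letter, but a helper with the described job (freeing the previously assigned space before installing the new string) might plausibly look like this:

```c
#include <stdlib.h>
#include <string.h>

/* Release whatever the destination pointer held before and install
 * the new string.  Pointers managed this way should start life as
 * NULL so that the first free() is a harmless no-op. */
void pstrcpy(char **dest, char *newstr)
{
    free(*dest);
    *dest = newstr;
}
```

Usage would then be `pstrcpy(&name, pstr_2(dir, base));` where a bare `name = pstr_2(dir, base);` would have leaked the old value.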

In my own programs I've used this approach to string handling for some time now, and I'm very happy with it. Of course, one can do better and be more efficient with a more complicated string library. But such libraries exist, and hardly anybody uses them for "simple" tasks; this shows that a "low-overhead" solution is still of general interest. For an Intel 386/486, my approach adds little more than 200 bytes to the program's executable size (not subtracting what it may save by simplifying other parts). Though I would recommend putting the functions into a library as a module, if someone prefers not to do this, the source is small enough to be carried around with the application that uses it. It may well save more lines of source in other places than its own size, especially if you make a fair comparison based on the comfort and reliability it adds to an application.

If you think my approach to string handling may help other programmers to write more reliable code, you may publish the enclosed source files in the CUJ at an appropriate place.

Best Regards,

Martin Weitzel

P.S.: I'm looking forward to reading your next column about the C++ Library Working Group's progress.

There are many ways to skin this particular cat. I've used an approach much like yours on occasion. — pjp

P.J. Plauger

I was intrigued by the article by Greg Colvin, "Extending C for Object Oriented Programming" in the July 1993 issue. We are currently re-engineering our FORTRAN scientific apps, and self-training on C, so the idea of getting some of the benefits of OOP under C looks attractive. Further, the knowledge that what we do today under C can later be easily ported to C++ makes this scheme all the more interesting. My thought is to create a set of container classes under C for now, with the option of porting later to C++.

I wish to know if you've had any reader feedback on the article. Has anybody written to confirm or dispute the proposed methods? I recall a similar article by Ford (May 1992) that was subsequently commented on by a reader, who showed, I feel, a much better way of doing things.

If any reader has tried Colvin's techniques, I would appreciate hearing their experiences.

Scott Daniels
daniels@minga.pn.com
CIS: 73201, 670

Greg Colvin tells me that he's personally received quite a lot of feedback, most of it positive, on his article. — pjp

Mr. Plauger,

In the Feb. '94 issue of CUJ, you mentioned that work on revising the C Standard was about to commence. I'm not so sure that is a Good Thing, but since it's bound to happen anyway, I wonder if I could bend your ear a bit about just a few of what I consider to be the most fundamental weaknesses of the C language as we know it today.

The creators of the C language did a noble thing in trying to make it portable. Unfortunately, some of these efforts have had the opposite effect. At first glance, it would seem that portability would be enhanced by not explicitly defining the sizes of the integral data types (char, short, int, long). In practice, leaving these sizes to be defined by the compiler has been the primary reason that (for example) a lot of old UNIX code won't compile and run on PCs. (Much of this could have been avoided if only PC compiler writers had created 32-bit integers — there is no reason why ints in C can't be 32 bits, even if the processor is a "16-bit machine!") Even the issue of whether a char is signed is up to the compiler. I realize that this was partially addressed with the introduction of the signed keyword, but that's still not enough.

An awful lot of code has been written where the key data structures being manipulated are not defined in the code, but exist already in the operating system, the hardware, or in some other piece of (possibly non-C) software. C needs to have much better facilities for describing data structures that the programmer needs to manipulate, even when that same programmer is not the one who gets to decide on the format of those data structures. A big step in this direction would be to introduce new fundamental types, such as int8, uint8, int16, uint16, int32, uint32, int64, and uint64. (Hopefully we will steer clear of names such as "word," since that has different meanings on different machines.)

A common technique for making C code more portable has been to define these with typedefs in a header file, which is edited by someone who knows in advance what sizes are used by the target compiler. I'm sure many hours of debug time have been wasted in porting when that header file is incorrect. Also, not every programmer using this technique uses the same names, or defines them in a common header file, and these things can complicate the reuse of code written by others. It would help things a lot to have this nailed down by the language definition.
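The portability-header technique Mr. Pipkins describes is easy to illustrate; a typical hand-edited header of the kind he criticizes might read as follows (the particular mappings are only an example, assuming a compiler with 8-bit char, 16-bit short, and 32-bit int — exactly the kind of assumption a porter must re-verify):

```c
/* sizes.h -- edited once per port by someone who knows the
 * target compiler.  Every mapping below is a per-platform
 * assumption, which is precisely the maintenance burden the
 * letter argues the language definition should remove. */
typedef signed char    int8;
typedef unsigned char  uint8;
typedef short          int16;
typedef unsigned short uint16;
typedef int            int32;
typedef unsigned int   uint32;
```

(Years after this letter, C99 standardized essentially this idea in <stdint.h> as int8_t, uint32_t, and friends.)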

To continue further along this same line of thought, there is no way at present to use a C structure to describe an arbitrary existing data structure in some other piece of software. Part of this is because of the fundamental types not having standard sizes, and part of it is because the language has no means of specifying whether a structure is to be packed. A simple feature to specify this in the language could prevent an awful lot of frustration. Some compilers provide a pragma for this, but that's not supported by all compilers.

Some compilers actually provide a command-line switch to decide whether the structures are packed. What a horrible thought that the flip of one command-line switch could cause such disaster. And if that's not enough, think of the problems of trying to integrate two pieces of code where one assumes packed structures and the other assumes unpacked structures! Furthermore, there should be a way to specify the alignment of a structure, to explicitly force it to begin on a particular kind of boundary.
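The nonstandard pragma alluded to above is spelled "#pragma pack" by several common compilers (gcc and Microsoft C among them); a sketch, keeping in mind that both the syntax and its very existence are compiler-specific:

```c
#include <stddef.h>

#pragma pack(1)               /* nonstandard: request no padding */
struct wire_header {
    unsigned char  type;      /* offset 0 */
    unsigned short length;    /* offset 1 when packed; an unpacked
                               * compiler would pad it to offset 2 */
    unsigned int   id;        /* offset 3 when packed */
};
#pragma pack()                /* restore the default alignment */
```

Mixing code compiled with and without such packing, as the letter warns, silently changes every member offset after the first padded field, which is exactly why a per-structure, in-language specification would be safer than a command-line switch.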

I also think it is time that the C language should end its long denial stage with respect to Intel I/O ports. I'd like to see the definition of two or three pseudo-arrays to integrate this concept into the language. Could be something like io8[], io16[], and io32[], implemented such that io8[0xABCD] = 0x5A would cause code to be generated to output 0x5A to port 0xABCD, and info = io8[0xABCD] would cause code to be generated to input a value from that port. Of course, I'm aware that a C++ class can be defined to provide this kind of syntactic sugar, but efficiency is a key issue here. One of the goals is to allow the compiler to avoid a function call every time an I/O operation is performed.

Finally, I think that the definition of the . (structure member) operator should be expanded to replace the -> operator (which should still be kept for compatibility reasons). How many times have you changed a function parameter from a structure to a pointer to a structure, or vice versa, and then had to go through the source and change . to -> or vice versa? Happens a lot. Presently, there is no way to use the . operator with a pointer; it has no meaning. It should be a very simple matter to modify the compiler for this feature.

I don't want to give the impression that these are the only areas I'm dissatisfied with, or that I'm not concerned with the higher-level features of the language. But these things are just the bare essentials that should be there to allow the programmer to do the basic, fundamental chores of everyday systems programming.

Thanks for listening,

Jeff Pipkins
Jeff=Pipkins%FW=Util%Sys=Hou@bangate.compaq.com

Many people have strong opinions about how to "fix" Standard C. It will be interesting to see how the language survives all the fixing it is about to undergo these next few years. — pjp