Portability


Writing Standard Headers: The String Functions

Dan Saks


Dan Saks is the owner of Saks & Associates, which offers training and consulting in C and C++. He is a member of X3J11, the ANSI C committee. He has an M.S.E. in computer science from the University of Pennsylvania. You can write to him at 287 W. McCreight Ave., Springfield, OH 45504 or call (513) 324-3601.

In a recent letter to The C Users Journal, Phil Cogar of N.S.W. Australia complained that much of the C source code appearing in this and other programming journals contains references to headers such as <stdlib.h> that are not published along with the code. He observed that if your compiler provides these headers, then typing in the code and getting it to run is usually easy; without them, it may be impossible. He has a legitimate complaint, but as editor Robert Ward points out in his response, it's often impractical to publish the headers with the code. (See The C Users Journal, October 1989, p.138.)

To get the programs to run, you can write your own standard headers to go with your existing compiler and library. Although writing an entire Standard C library from scratch is a big chore, you can fill many of the gaps in an existing library by yourself in only a few days.

The Standard Headers

The fifteen headers specified by the Standard are summarized in Table 1. Most of them declare a set of related library functions, along with any macros and types needed to call them. A few headers don't contain any functions; they simply define useful macros and types that have nowhere else to go. Some macros and types appear in more than one header, but each function is declared only once.

Most compilers supply additional headers. For example, UNIX compilers add headers such as <direct.h>, <fcntl.h> and <process.h>. Many MS-DOS compilers supply some of the UNIX headers, along with others such as <bios.h>, <conio.h> and <dos.h>. None of these headers is covered by the C Standard. Some UNIX headers have been formalized by the IEEE 1003.1 POSIX Portable Operating System Standard, but many aren't covered by any non-proprietary standard. A C program using library headers other than those listed in Table 1 will not be portable to all Standard C implementations.

A program accesses the contents of a standard header by referencing the header in an include directive, such as

#include <stdio.h>
Headers are often referred to as "include files" because they are almost always implemented as source files with the same names. Other implementations are permitted, and so the Standard is careful not to refer to them as files. Nevertheless, "headers" and "include files" are generally understood to mean the same thing.

Determining What You Already Have

Before starting to fix your standard headers, you should look to see what you already have. Headers are usually easy to locate. For example, on UNIX systems the headers for cc are usually in /usr/include (see the subheading FILES on the manual page(s) for cc(1) in your UNIX manual). The default setup for Turbo C on MS-DOS places the headers in \turboc\include. Most MS-DOS compilers do something similar. The headers for DECUS C on my PDP-11 are in the same subdirectory as my compiler executables, which is a subdirectory with the logical name C:.

You should not be surprised to find that you already have several of the standard headers. The standard library is not pure invention; it's the result of an effort to "codify common existing practice." You will almost certainly find a version of <stdio.h> — the only standard header used by Kernighan and Ritchie in the first edition of The C Programming Language. <ctype.h> is also extremely common. Beyond that, it's hard to say just how many headers you're likely to find.

For example, the DECUS C compiler has only four of the standard headers: <ctype.h>, <setjmp.h>, <stdio.h>, and <time.h>. The UNIX 4.2 BSD compiler (cc) has these four, plus <assert.h>, <errno.h>, <math.h>, and <signal.h>. It also has <varargs.h>, which is very similar to <stdarg.h>. Turbo C 2.0, Microsoft C 5.1 and Zortech C 1.07 (all for MS-DOS) have every header except <locale.h>, but very few of the headers among all three compilers are exactly as they should be.

Where To Put New Headers

Before you start creating and modifying headers, you should think about where to put them. You can throw caution to the wind and put the new headers in the same directory as your existing ones (assuming you have the access rights), but then you run a serious risk that some of your old code won't work with the new headers. I recommend creating a directory for your new headers and reconfiguring your compiler environment to search this new directory before it searches the old one. You can then remove the new headers from the search whenever you need to. Compiler environments vary so much that I can't explain how to do this for everyone, but I will show you what I've done on a few different systems:

On UNIX 4.2 BSD:

I put the new headers in a subdirectory usr/include within my home directory (/u/dsaks). I wrote a shell script called cc that simply contains

/bin/cc -I/u/dsaks/usr/include $*
This script invokes the UNIX C compiler (in /bin) with the -I option. -I tells the compiler to search for include files in the named directory before searching in the standard places. The $* passes all the arguments to the cc script through to the C compiler.

I put this script in /u/dsaks/usr/bin, and added this directory name to my shell path variable. I made the script executable by using

chmod +x cc
This cc command compiles with the new headers. If I need to omit them, I simply rename the command with

mv cc cc.new
so the cc command reverts to the one in /bin (without -I).

On MS-DOS 3.0 and higher:

I put the original headers for Microsoft C and Quick C in \ms\include, and my new headers in \ms\usr\include. Both compilers support the -I option, so you can create a cc.bat command file like the UNIX shell script. However, Microsoft gives you an easier alternative. The Microsoft compilers use the INCLUDE environment variable to define the search path for include files. I use two different command files to configure the compiler environment. My msnew.bat uses

set INCLUDE=c:\ms\usr\include;c:\ms\include
to put the new headers in the search path, while msold.bat uses

set INCLUDE=c:\ms\include
to take them out.

Other MS-DOS compilers require slightly different approaches. Zortech's command line compiler, ZTC, uses the INCLUDE environment variable just like Microsoft C, but their integrated environment, ZED, gets its search path from a configuration file maintained by a utility called ZCONFIG. Borland's Turbo C lets you specify the search path in a file called TURBOC.CFG. Consult your compiler user's guide for details.

On RT-11 V5.0 and higher:

The DECUS C compiler has a built-in preprocessor that's virtually useless. Fortunately, the compiler is distributed with MP, a decent preprocessor from the UNIX User's Group. My compilation command files disable the built-in preprocessor (with the /M compiler switch) and use MP instead.

MP has a preset search path for include files. First it looks in the directory with the logical name LB:, then it looks in C:, and finally it looks in SY:. I put the original headers in a directory assigned to C: and the new headers in another directory assigned to LB:. I can remove the new headers from the search by deassigning LB:.

<string.h>

I'll begin with <string.h> because it's often missing and yet is easy to create. Once you have it, you'll use it frequently.

<string.h> (see Table 2) declares the string handling functions in the library. It also declares one macro, NULL, and one type, size_t, that are needed to use these functions.

There is no universal way to define NULL: you tailor the definition to your machine's architecture. The easiest way to obtain a definition for NULL is to steal one from <stdio.h>. If you can't find a definition there or in some other header, then you should probably use

#define NULL ((void *)0)
if your compiler supports the void * type, or

#define NULL ((char *)0)
if it doesn't. If you know that your pointers have the same size as type int, you can use simply

#define NULL 0
If the pointers on your machine have the same size as type long int, you can use

#define NULL 0L
I prefer to use the casts to determine the size of NULL. However, I suspect you'll find that one of the latter two forms is already used in your existing headers. Whichever form you choose, use it consistently.
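If you aren't sure which case applies to your machine, you can compare the sizes directly. What follows is only a sketch: suggest_null is a hypothetical helper, not part of any library, and it assumes that matching widths is all that matters, which ignores machines where pointers to different types have different sizes.

```c
/* Hypothetical helper: suggest a definition for NULL, given the widths
 * in bytes of a data pointer, an int, and a long int. */
const char *suggest_null(unsigned ptr_width, unsigned int_width,
                         unsigned long_width)
{
    if (ptr_width == int_width)
        return "#define NULL 0";
    if (ptr_width == long_width)
        return "#define NULL 0L";
    /* Neither integer type matches: fall back to a cast. */
    return "#define NULL ((char *)0)";
}
```

Calling suggest_null((unsigned)sizeof(char *), (unsigned)sizeof(int), (unsigned)sizeof(long)) and printing the result shows the suggestion for the compiler at hand.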

Most MS-DOS C compilers provide pointers in two different sizes, near and far. The headers in these compilers use conditional compilation to select the appropriate definition for NULL, something like

#ifdef _NEAR_POINTERS
#define NULL 0
#else
#define NULL 0L
#endif
If your <string.h> needs a definition like this, you should find it in one of your existing headers. (For more insight into the possible definitions for NULL, see "Doctor C's Pointers: The 'NULL' Macro and Null Pointers" by Rex Jaeschke in The C Users Journal, Sept/Oct, 1988.)

NULL is defined in several standard headers. The headers may be included in any order, and a given header may be included more than once, so you must ensure that the repeated definitions for NULL don't conflict with each other. Most implementations permit "benign" macro redefinitions (repeated definitions formed by identical sequences of tokens) as specified in the Standard. In this case, make all the definitions the same. If your preprocessor doesn't allow any redefinitions, you will have to put a "protective wrapper" around each one, as in

#ifndef NULL
#define NULL ((void *)0)
#endif
size_t is the type of the result of the sizeof operator. The Standard says that it should be an unsigned integral type, so use either

typedef unsigned size_t;
or

typedef unsigned long size_t;
You can select the appropriate definition using the program in Listing 1.
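A program of this kind might simply compare the width of the value that sizeof yields with the width of an unsigned int. The sketch below works under that assumption; it is not necessarily how Listing 1 makes the choice, and pick_size_t is a hypothetical name.

```c
/* Hypothetical helper: choose a typedef for size_t, given the width in
 * bytes of the result of sizeof and the width of an unsigned int. */
const char *pick_size_t(unsigned sizeof_width, unsigned unsigned_width)
{
    if (sizeof_width <= unsigned_width)
        return "typedef unsigned size_t;";
    /* The sizeof result is wider than unsigned, so use unsigned long. */
    return "typedef unsigned long size_t;";
}
```

To apply it, pass (unsigned)sizeof(sizeof(int)) and (unsigned)sizeof(unsigned), then print the string it returns.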

In many C implementations, sizeof yields a signed int value. You should still define size_t as unsigned, so that operations on objects of that type have the proper unsigned behavior. You can always use size_t to cast the possibly negative result of sizeof to its 'true' unsigned value, as in

if ((size_t)sizeof(something_big) > 0)
For more about size_t and sizeof, see "Doctor C's Pointers: Exploring the Subtle Side of the 'sizeof' Operator" by Rex Jaeschke in The C Users Journal, Feb., 1988 or see Rex's book, listed in References.

As with NULL, size_t appears in several standard headers. The Standard and many implementations do not allow typedef redefinitions (even "benign" ones) in the same scope, so you may need a protective wrapper around each definition. For example

#ifndef _SIZE_T_DEFINED
typedef unsigned size_t;
#define _SIZE_T_DEFINED
#endif
You don't have to use the name _SIZE_T_DEFINED. Any identifier beginning with an underscore followed by an upper-case letter or another underscore will do. The Standard reserves these names for the implementation of the compiler (of which the headers are part).
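To see the wrapper at work, the fragment below repeats the same guarded definition twice in one file, as would happen if a program included two headers that both define the type. The names count_t and COUNT_T_DEFINED are hypothetical stand-ins for size_t and _SIZE_T_DEFINED, chosen only so that the sketch can't collide with the compiler's own headers.

```c
/* First "header": defines the type and records that it has done so. */
#ifndef COUNT_T_DEFINED
typedef unsigned count_t;
#define COUNT_T_DEFINED
#endif

/* Second "header": the guard macro is already defined, so the
 * preprocessor skips this typedef and the compiler never sees it. */
#ifndef COUNT_T_DEFINED
typedef unsigned count_t;
#define COUNT_T_DEFINED
#endif

/* The type is available, having been defined exactly once. */
count_t add_counts(count_t a, count_t b)
{
    return a + b;
}
```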

Since benign macro redefinitions are usually allowed, you may be tempted to define size_t as

#define size_t unsigned
in order to eliminate the protective wrapper. I have seen this done in some "ANSI-conforming" compilers. Although you will probably never notice the difference, the macro definition is wrong because it changes the scope of size_t. Use the typedef.
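Here is a minimal sketch of one place where the scope difference can show up. A typedef name obeys the ordinary scope rules, so a declaration in an inner block may reuse the name for something else; a macro ignores block structure and would rewrite that declaration. The name count_t is a hypothetical stand-in for size_t.

```c
typedef unsigned count_t;   /* hypothetical stand-in for size_t */

double shadow_demo(void)
{
    count_t n = 3;              /* the file-scope typedef is in effect */
    {
        /* The inner declaration hides the typedef name in this block.
         * Had count_t been defined with "#define count_t unsigned",
         * this line would expand to "double unsigned = 1.5;" and
         * fail to compile. */
        double count_t = 1.5;
        return count_t + n;
    }
}
```

Few programs reuse a name this way, which is why the macro version so often goes unnoticed.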

And now for the functions. Most older C compilers don't support prototypes, so you might have to delete or "comment out" the parameter lists. Some functions return void *. If your compiler won't accept that type, use char *.

You will find that your library contains some, but not all, of the string functions. Sometimes you will find a standard C function under an archaic name. Many recent books on C have an appendix that details the functions in the standard library. (See references at the end of the article.) You should compare the functions in the standard library with the functions in your compiler's library to find as many matches as you can.

For example, some implementations use index instead of strchr. In this case, you could declare strchr as

char *index();
#define strchr(s, c) index(s, c)
but there is a hazard. If you forget that strchr is really index, and write another function called index, you will inadvertently redefine strchr. (This is an excellent way to test your debugging skills.) This macro definition should only be used as an interim fix until you add a compiled version of the missing function to the run-time library.
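A compiled replacement need not be elaborate. The following is a sketch of how such a function might look, not the code from any particular library. It is named my_strchr (a hypothetical name) so that it can be tried alongside a library that already has strchr, and it assumes a compiler that accepts prototypes and const.

```c
#include <stddef.h>   /* for NULL; on an older system, take it from <stdio.h> */

/* Return a pointer to the first occurrence of c (converted to char)
 * in the string s, or NULL if there is none. The terminating null
 * character counts as part of the string, so searching for '\0'
 * returns a pointer to the terminator. */
char *my_strchr(const char *s, int c)
{
    char ch = (char)c;

    for (;; ++s) {
        if (*s == ch)
            return (char *)s;
        if (*s == '\0')
            return NULL;
    }
}
```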

What about functions that are completely missing? Should you still put their declarations in <string.h>? The answer is a definite maybe.

Suppose that memchr is missing from your library. memchr returns a void *, but if you leave the declaration out of <string.h>, the compiler will assume it returns an int. When you compile

char *p, s[10];
p = memchr(s, 'x', 10);
you may get a spurious warning about an illegal pointer assignment, but compilation will continue. You won't know what's really happening until the linker reports that memchr is undefined. Under these circumstances, you should declare memchr in the header to eliminate the unnecessary warnings.

If you use a Lint-like program checker that can detect undeclared functions (or if your compiler has such an option), then don't declare functions that are missing from the library. When you reference a missing function, you will still get a meaningful error message, but won't have to wait for the linker to tell you what you already know.

Listing 2 shows the <string.h> that I use on UNIX 4.2 BSD. It includes some interim macro definitions for missing functions. The #ifndef ... #endif wrapper around the entire header prevents repeated compilation of the declarations if the header is included more than once. The wrapper isn't needed for protection since you can redeclare functions (provided all declarations in the same scope are the same), and everything else in the header is either benign or protected.

I added the wrapper to simplify debugging. While debugging macros, I sometimes look at the preprocessor output to verify the expansions. Eliminating redundant headers from preprocessor output makes it easier to read. The comment at the header's beginning is not in the wrapper so it still appears wherever the header is included, even if the rest of the header does not.

One final word of caution. In Listing 2, strlen is declared to return a size_t, even though strlen is actually defined in the library to return an int. On machines where a signed int to unsigned int conversion performs no transformation of the data (as on two's-complement machines), strlen returning a size_t is perfectly safe. On other machines, you should leave the declaration as

int strlen();
so that the compiler can recognize that

size_t n;
n = strlen(s);
involves a signed to unsigned conversion and generate the proper code. You should also cast the result of strlen to size_t whenever strlen is used in an expression with other ints, such as

if ((size_t)strlen(s) > 0)
This is the same technique used with sizeof when it returns an int.
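If remembering the cast at every call becomes a nuisance, one option is to confine it to a single small function. This is only a sketch of that idea: old_strlen is a hypothetical stand-in for a library strlen that returns int, and str_length is likewise a name made up for the example.

```c
#include <stddef.h>   /* for size_t */

/* Hypothetical stand-in for an old library strlen that returns int. */
int old_strlen(const char *s)
{
    int n = 0;

    while (s[n] != '\0')
        ++n;
    return n;
}

/* Perform the signed-to-unsigned conversion in one place, so callers
 * always work with a size_t. */
size_t str_length(const char *s)
{
    return (size_t)old_strlen(s);
}
```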

Conclusion

In this article I've tried to show why it's impossible to just publish a single portable version of the standard headers. The headers provide a portable definition of the Standard C environment, but they do it in a non-portable way.

Rather than writing the missing string functions in the library, I suggest you write the remaining standard headers. Doing so solves more portability problems and gives you the definitions you need to compile new library functions as you write them. In <string.h>, you've already seen many of the design problems, so most of the remaining work is simply determining what goes into the other headers.

References

Darnell, Peter and Margolis, Philip, Software Engineering in C (1988, Springer-Verlag).

Gardner, James, From C to C: An Introduction to ANSI Standard C (1989, Harcourt Brace Jovanovich).

Jaeschke, Rex, Portability and the C Language, (1989, Hayden Books).

Plauger, P.J. and Brodie, Jim, Standard C (1989, Microsoft Press).

Ritchie, Dennis and Kernighan, Brian, The C Programming Language, 2nd. ed. (1988, Prentice-Hall).