Columns


Standard C

Implementing <ctype.h>

P.J. Plauger


P.J. Plauger is senior editor of The C Users Journal. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee, WG14. His latest book is Standard C, which he co-authored with Jim Brodie. You can reach him at pjp@plauger.uunet.

I conclude a two-part presentation this month by discussing ways to implement the standard header <ctype.h> and the functions that go with it. Last month, I reproduced the relevant portion of the C standard. I also discussed the various classes of characters and how they change between implementations and between locales.

The actual code for any of these functions is fairly simple. What is hard to get right are a few subtleties that can affect portability, and the machinery for making the functions adapt to changing locales.

Size Of Character Set

The implementation presented here follows the traditional approach. A translation table captures the peculiarities of the target character set. Each of the functions uses its argument as an index into the table. The function tests the selected table element against a unique mask to determine whether the character is in the class in question.

A translation table makes sense only if it is not too large. How big it gets is a product of how many elements it contains and how big each element must be. Let us consider the number of elements first, since that is the dominant factor.

Standard C defines three "character" types — char, signed char, and unsigned char. All these types must be able to represent all the characters in the target character set. All these types must have representations with the same number of bits. All other types must have representations with some integer multiple of the bits in a character type. A character type is represented by at least eight bits.

The character classification functions each accept an argument of type int, but with a limited range of values. Any value that type unsigned char can represent is valid for an argument to one of these functions. Those values start at zero and go up to the value specified by the macro UCHAR_MAX (defined in the standard header <limits.h>).

One additional value is permitted — it is specified by the macro EOF (defined in the standard header <stdio.h>). Most sensible implementations give EOF the value -1. This implementation is no exception. So the number of elements in a translation table must be one more than the number of distinct values representable by a character type.

The vast majority of C implementations use exactly eight bits to represent a character type. That provides for a target character set with 256 members. Hence, a translation table must contain 257 elements. An implementation can, however, use more bits. C has been implemented with nine, ten, 16, and even 32 bits to represent character types. A translation table that must represent all the values in a 16-bit character is clearly unwieldy. It would contain 65,537 elements.
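The required table size follows directly from UCHAR_MAX. Here is a minimal sketch; the name NCTYPE is invented for this illustration, not taken from the listings:

```c
#include <limits.h>

/* One entry for each value an unsigned char can represent
   (0 through UCHAR_MAX), plus one more for EOF. */
#define NCTYPE (UCHAR_MAX + 2)
```

On the usual eight-bit implementation UCHAR_MAX is 255, so NCTYPE is 257; with 16-bit characters it would be 65,537.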

Using a translation table for the character classification functions thus carries with it an implicit implementation restriction. Characters should be represented with eight bits, or at worst nine or ten. The larger they get, the progressively sillier the choice of a translation table becomes.

Number Of Subclasses

Even with eight-bit characters, an implementation may not find a translation table always suitable. Why? Let's look at the size of each table element. Last month, I reproduced a diagram from Standard C, the book I wrote with Jim Brodie. It neatly summarizes the character classes and how the functions in <ctype.h> relate. Figure 1 is that same table, repeated here for ease of reference.

The table contains eight nodes, represented by rounded rectangles. Each of these nodes holds a proper subset of the basic C character set. Every class contains some combination of these nodes. That means that each character in the basic C character set is easily classified. Its entry in the translation table can have a single bit set. The bit tells you what node the character inhabits. You perform a bitwise AND between a mask and an eight-bit entry in the translation table to test its membership in any of the classes.

Let us say, for example, that the translation table is called ctype. Then we could define bits for the different nodes, and the isalpha macro, something like

#define ISLOWER 01
#define ISUPPER 02

extern const unsigned char
     ctype[];

#define isalpha(c) \
     (ctype[(c)] & (ISLOWER | ISUPPER))
(In an actual implementation, the names must be less readable to avoid collision with user-defined names.)

Eight is a nice number. If eight bits suffice to classify an arbitrary entry, then the translation table can always be an array of unsigned char as in the example shown earlier. If we need even one additional bit per entry, the size of the translation table can double. On an implementation with hundreds of kilobytes of storage, that is not so important. The difference between 257 and 514 bytes is hardly worth considering. On an implementation with just a few kilobytes of available storage, however, you might be stingier.

I observed that any character in the basic C character set can be classified by eight bits. But it is a rare implementation that has no additional characters. Figure 1 shows where an implementor can add characters to the classes. The double pluses mark the two functions that can recognize additional characters even in the "C" locale. The single pluses mark the four functions that can recognize additional characters only outside the "C" locale.

On the surface, it looks like we need an additional six bits in each translation table entry. Not so. Most of those pluses can be pushed to the right, as it were. Each of the functions iscntrl, islower, ispunct, and isupper has a node all its very own. We can simply mark extra characters for such a function the same way as for the characters in the private node.

Two problems remain, outside the "C" locale at least: isalpha can recognize additional letters that are neither upper nor lower case, and isspace can recognize additional white-space characters. Neither function has a node all its own.

For these two functions, you have no place to push additional characters. You must either rule out locales with funny letters and space, or you must make each element of the translation table big enough to hold ten classification bits. Alternatively, you can divide the functions into subgroups, each with its own smaller translation table. That is a hard guessing game to win, in general.

Another important implementation issue is deciding the number of locales you intend to support. If any chance exists that you may want to support locales with such alphabetic or space characters, plan ahead. Declare the translation table to have type array of short. If you are willing to rule out such latitude, however, you can save space by declaring the translation table to have type array of unsigned char. Since this implementation aims at maximum portability, it takes the former course.
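To make the ten-bit requirement concrete, here is one possible set of classification bits. The names and bit assignments are invented for this sketch and need not match the listings:

```c
#include <limits.h>

#define B_UP 0x001  /* upper case letter */
#define B_LO 0x002  /* lower case letter */
#define B_DI 0x004  /* decimal digit */
#define B_SP 0x008  /* white space */
#define B_PU 0x010  /* punctuation */
#define B_CN 0x020  /* control character */
#define B_XD 0x040  /* hexadecimal digit */
#define B_BL 0x080  /* the space character proper */
#define B_XA 0x100  /* extra letters, outside the "C" locale */
#define B_XS 0x200  /* extra white space, outside the "C" locale */

/* All ten bits together exceed what unsigned char must hold,
   but fit comfortably in a short, which is at least 16 bits. */
#define B_ALL (B_UP | B_LO | B_DI | B_SP | B_PU | B_CN \
               | B_XD | B_BL | B_XA | B_XS)
```

The last two bits are the ones that force the element type up from unsigned char to short.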

One subtle point should not be overlooked. I have consistently said that an eight-bit translation table should have elements of type unsigned char. Many implementations will tolerate type char, or even signed char, for the table element type. A few might conceivably generate slightly better code for one of these types, although I doubt it. You should always use an unsigned type, however, if you intend to perform bitwise operations and the sign bit can be set.

Why? Not all implementations represent integers in twos-complement. In other representations, converting a signed representation to an unsigned one is not so simple. Twos-complement leaves the bit pattern unchanged. Other representations can alter low-order bits when converting an apparently negative value. Performing a bitwise AND between a signed value and an unsigned mask can cause surprises.

Granted, twos-complement machines dominate the marketplace today. It is hard to avoid all coding practices that depend on twos-complement behavior to work right. But you can make your code more portable, and more reusable, if you avoid the obvious pitfalls.

You might even choose to implement the translation table as an array of unsigned short instead of array of short. Since we're using only the low-order ten bits, it shouldn't matter. It is more a question of aesthetics, or of having uniform style rules. I can't bring myself to be quite that pedantic, however.

Additional Assumptions

So far, I have assumed that characters are represented in eight bits (or not much more). I have also assumed that a program can afford to include a translation table of 514 bytes (or not much more). To show some real code, I must make at least three more assumptions.

The first additional assumption concerns the case mapping functions tolower and toupper. These functions differ from the others in this group. They don't simply classify their argument, but return a character that may differ from the argument character. I assume that they should be implemented with mapping tables similar to the translation table shared by all the other functions. If it makes sense to use a table to implement islower, then it makes sense to use another table for tolower and toupper.
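As a sketch of that idea, the following builds an ASCII case-mapping table at run time. A real library would use a const initialized array, and all the names here are invented:

```c
#include <limits.h>

static short lower_tab[UCHAR_MAX + 2];   /* entry [0] serves EOF */

/* Fill in an identity mapping, then overwrite the upper case
   letters so they map to their lower case forms (ASCII assumed,
   so 'A' through 'Z' are contiguous). */
static void build_lower(void)
{
    int c;

    lower_tab[0] = -1;                   /* tolower(EOF) yields EOF */
    for (c = 0; c <= UCHAR_MAX; ++c)
        lower_tab[c + 1] = (short)c;
    for (c = 'A'; c <= 'Z'; ++c)
        lower_tab[c + 1] = (short)(c - 'A' + 'a');
}

#define my_tolower(c) lower_tab[(c) + 1]
```

Characters outside the letters, such as digits and punctuation, pass through unchanged, just as the standard requires.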

The next additional assumption is that the target character set is ASCII. The letters stand for "American Standard Code for Information Interchange." While hardly universal, this encoding is widely used among modern computers. An international variant of ASCII also exists. ISO 646 has the same code values and much the same glyphs, or visible forms of the characters. Some of the punctuation in ASCII can be replaced with alternate glyphs in ISO 646, however. That is how Europeans can introduce accented characters, such as Å and ê, without going beyond seven-bit codes.

This implementation is compatible with any variant of ISO 646 where none of the punctuation characters are redefined as letters. It is easily changed to match other ISO 646 variants, however. Just change the pertinent table entries. The functions remain the same. You can also accommodate other character sets just as easily. IBM's EBCDIC (for "Extended Binary-Coded Decimal Interchange Code") also requires a simple change of table entries. Just be sure that your table entries agree with the character constants (such as 'a') produced by your C translator!

The final additional assumption is not so easily encapsulated. It concerns how the implementation deals with static storage in the library. In the simplest case, this is a nonproblem. The translator includes code from the Standard C library as needed. Once included in the program, library code behaves just like code supplied by the programmer.

An implementation that can run multiple programs, however, often benefits from having shared libraries. All the code for the Standard C library occupies a single place in computer memory. A C program linked to run in this environment transfers control to functions in the shared library, rather than including its own private copy of the library code. The obvious benefits are that each program is smaller and can link faster.

A not-so-obvious drawback appears when several functions need to share a writable static data object that is private to the library. You can't share the same data object between different programs. You want to allocate a unique version of each writable data object for each program and initialize it to its required starting value.

Sadly, no common method exists for performing this feat. Operating systems and linkers use ad hoc machinery to make shared libraries work at all. Some simply disallow writable statics. Others require you to invoke special machinery to set up and access writable statics. You must write your code in a special way.

The character classification functions need writable static data if they are to adapt to changing locales. One approach is to rewrite the tables when the locale changes. A better way that speeds changing locales is to alter pointers to point to different (read-only) tables. It also minimizes the amount of writable storage that might need special handling.
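A minimal sketch of the pointer approach, with invented names and empty tables standing in for real locale tables:

```c
#include <limits.h>

#define NCT (UCHAR_MAX + 2)

/* Two read-only tables (contents omitted here) and one writable
   pointer.  The pointer is biased past the EOF entry so that
   index -1 is valid. */
static const short ctab_C[NCT];          /* table for the "C" locale */
static const short ctab_other[NCT];      /* table for another locale */

static const short *ctype_ptr = &ctab_C[1];  /* the writable static */

/* Changing locale is a single pointer assignment -- no table
   entries get rewritten. */
static void set_ctype_table(const short *tab)
{
    ctype_ptr = &tab[1];
}
```

Only ctype_ptr needs writable storage; everything else can live in read-only memory shared among programs.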

This presentation ignores the potential problems associated with writable static data in the library. I minimize the use of writable statics as much as possible. I also call attention to every writable static data object that must be introduced. But I use no special notation for accessing such data.

The Code

The code for the functions declared in <ctype.h> is built around three translation tables. Three writable pointers at all times point to the tables corresponding to the current locale. Listing 1 contains the code for the standard header <ctype.h>. Note that every function has a corresponding macro.

I used fairly cryptic names for the macros that define the classification bits. That helps save space for the presentation. It also speeds the processing of standard headers in many implementations. I also elected to put the letters 'a' through 'f' in two classes at once. Each is both a lower case letter, _LO, and a hexadecimal digit, _XD. (The first six upper case letters are similarly in two classes.) Nothing is gained by keeping the subclasses absolutely disjoint, but this way the macros are slightly smaller and more readable.
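The overlap can be seen with a single table entry. The bit values and names below are invented for illustration:

```c
#define C_LO 0x02   /* lower case letter */
#define C_XD 0x40   /* hexadecimal digit */

/* The entry for 'a' carries both bits, so the same table element
   answers islower and isxdigit alike. */
static const short a_entry = C_LO | C_XD;

#define islower_(e)  ((e) & C_LO)
#define isxdigit_(e) ((e) & C_XD)
```

Keeping the subclasses disjoint would instead force isxdigit to test three bits, for no benefit.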

The code for the functions looks just like the macros. See Listing 2.
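The standard requires a real function behind every macro, since a program may suppress the macro or take the function's address. A toy sketch of the usual trick, with an invented name and a three-character table:

```c
/* A toy table and macro.  Entry [0] serves EOF (-1). */
#define T_LO 01
#define T_UP 02

static const short toy_tab[] = { 0, T_UP, T_LO, 0 };
static const short *toy_ptr = &toy_tab[1];

#define toy_isalpha(c) (toy_ptr[c] & (T_LO | T_UP))

/* Writing the name in parentheses defeats the function-like
   macro, so a function with the same name can be defined. */
int (toy_isalpha)(int c)
{
    return toy_isalpha(c);    /* this use does expand the macro */
}
```

The function body is literally the macro expansion, which is why the function code "looks just like" the macro.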

Finally, Listing 3 is the code for the three tables that the character classification functions use.