October 1990/Standard C

Columns

Standard C

Character Classification Functions

P.J. Plauger

P.J. Plauger has been a prolific programmer, textbook author, and software entrepreneur. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee, WG14. His latest book is Standard C, which he co-authored with Jim Brodie.
Last month, I began the long trek through the Standard C library. I discussed the header <assert.h>, how to use it and how it can be implemented. The next stop on the journey, in alphabetical order at least, is the header <ctype.h>.
Here is what the C standard has to say about this header:

4.3 Character Handling <ctype.h>
The header <ctype.h> declares several functions useful for testing and mapping characters. [Footnote: See "future library directions" (§4.13.2).] In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
The behavior of these functions is affected by the current locale. Those functions that have implementation-defined aspects only when not in the "C" locale are noted below.
The term printing character refers to a member of an implementation-defined set of characters, each of which occupies one printing position on a display device; the term control character refers to a member of an implementation-defined set of characters that are not printing characters. [Footnote: In an implementation that uses the seven-bit ASCII character set, the printing characters are those whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).]

Forward References: EOF (4.9.1), localization (4.4).

4.3.1 Character Testing Functions
The functions in this section return nonzero (true) if and only if the value of the argument c conforms to that in the description of the function.

4.3.1.1 The isalnum Function

Synopsis

#include <ctype.h>
int isalnum(int c);

Description
The isalnum function tests for any character for which isalpha or isdigit is true.

4.3.1.2 The isalpha Function

Synopsis

#include <ctype.h>
int isalpha(int c);

Description
The isalpha function tests for any character for which isupper or islower is true, or any character that is one of an implementation-defined set of characters for which none of iscntrl, isdigit, ispunct, or isspace is true. In the "C" locale, isalpha returns true only for the characters for which isupper or islower is true.

4.3.1.3 The iscntrl Function

Synopsis

#include <ctype.h>
int iscntrl(int c);

Description
The iscntrl function tests for any control character.

4.3.1.4 The isdigit Function

Synopsis

#include <ctype.h>
int isdigit(int c);

Description
The isdigit function tests for any decimal-digit character (as defined in §2.2.1).

4.3.1.5 The isgraph Function

Synopsis

#include <ctype.h>
int isgraph(int c);

Description
The isgraph function tests for any printing character except space (' ').

4.3.1.6 The islower Function

Synopsis

#include <ctype.h>
int islower(int c);

Description
The islower function tests for any character that is a lower-case letter or is one of an implementation-defined set of characters for which none of iscntrl, isdigit, ispunct, or isspace is true. In the "C" locale, islower returns true only for the characters defined as lower-case letters (as defined in §2.2.1).

4.3.1.7 The isprint Function

Synopsis

#include <ctype.h>
int isprint(int c);

Description
The isprint function tests for any printing character including space (' ').

4.3.1.8 The ispunct Function

Synopsis

#include <ctype.h>
int ispunct(int c);

Description
The ispunct function tests for any printing character that is neither space (' ') nor a character for which isalnum is true.

4.3.1.9 The isspace Function

Synopsis

#include <ctype.h>
int isspace(int c);

Description
The isspace function tests for any character that is a standard white-space character or is one of an implementation-defined set of characters for which isalnum is false. The standard white-space characters are the following: space (' '), form feed ('\f'), new-line ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v'). In the "C" locale, isspace returns true only for the standard white-space characters.

4.3.1.10 The isupper Function

Synopsis

#include <ctype.h>
int isupper(int c);

Description
The isupper function tests for any character that is an upper-case letter or is one of an implementation-defined set of characters for which none of iscntrl, isdigit, ispunct, or isspace is true. In the "C" locale, isupper returns true only for the characters defined as upper-case letters (as defined in §2.2.1).

4.3.1.11 The isxdigit Function

Synopsis

#include <ctype.h>
int isxdigit(int c);

Description
The isxdigit function tests for any hexadecimal-digit character (as defined in §3.1.3.2).

4.3.2 Character Case Mapping Functions

4.3.2.1 The tolower Function

Synopsis

#include <ctype.h>
int tolower(int c);

Description
The tolower function converts an upper-case letter to the corresponding lower-case letter.

Returns
If the argument is a character for which isupper is true and there is a corresponding character for which islower is true, the tolower function returns the corresponding character; otherwise the argument is returned unchanged.

4.3.2.2 The toupper Function

Synopsis

#include <ctype.h>
int toupper(int c);

Description
The toupper function converts a lower-case letter to the corresponding upper-case letter.

Returns
If the argument is a character for which islower is true and there is a corresponding character for which isupper is true, the toupper function returns the corresponding character; otherwise the argument is returned unchanged.

History
Character handling has been important since the earliest days of C. Many of us were attracted to the DEC PDP-11 because of its rich set of character manipulation instructions. When Ken Thompson moved UNIX to the PDP-11/20, he gave us a great vehicle for manipulating streams of characters in a uniform style. When C came along, it was only natural that we should use it to write programs preoccupied with walloping characters.
This was truly a new style of programming. C programs tended to be small and devoted to a single function. The tradition until then was to write huge monoliths that offered a spectrum of services. C programs read and wrote streams of human-readable characters. The tradition until then was to have programs communicate with each other via highly structured binary files. They spoke to people by producing paginated reports with embedded carriage controls.
Those of us who wrote character manipulation programs before C wrote mostly in assembly language. A few of us more daring souls used FORTRAN as well. That took dedication, however. FORTRAN had few facilities, and fewer standards, for trafficking in characters.
So the early toolsmiths writing in C under UNIX began developing idioms at a rapid rate. We often found ourselves sorting characters into different classes. To identify a letter, we wrote

if ('A' <= c && c <= 'Z' || 'a' <= c && c <= 'z') .....
To identify a digit, we wrote

if ('0' <= c && c <= '9') .....
And to identify white space, we wrote

if (c == ' ' || c == '\t' || c == '\n') .....
Pretty soon, our programs became thick with tests like this. Worse, some became thick with tests almost like this. Opinions differed on the best way to write a range test. Only a few diehards avoided the operators > and >= as religiously as I still do. You can contrive to write the same idiom a number of different ways. That slows comprehension and increases the chance for errors.
Opinions also differed on the makeup of certain character classes. White space has always suffered notorious variability. Should you lump vertical tabs in with horizontal tabs and spaces? If you include new lines (which are actually ASCII line feeds), should you also include carriage returns (which UNIX reserves for writing overstruck lines)? The easier it is to get tools to work together, the more you want them to agree on conventions.
The natural response was to introduce functions in place of these tests. That made them at once more readable and more uniform. The idioms for letter, digit and white space became

if (isalpha(c)) .....
and

if (isdigit(c)) .....
and

if (isspace(c)) .....
It wasn't long before a dozen-odd functions like these came into being. They soon found their way into the growing library of C support functions. More and more programs began to use them instead of reinventing their own idioms. The character classification functions were so useful, they seemed almost too good to be true.
They were. A typical text processing program might average three calls on these functions for every character from the input stream. The overhead of calling so many functions often dominated the execution time of the programs. That led some programmers to back off from using the standard functions that had evolved. It led others to develop a set of macros to take their place.
C programmers tend to like macros. They let you write code that is as readable as calling functions but is much more efficient. You just have to be ready for a few surprises:

The macro may expand into much more code than a function call, even if it executes faster than the function call. If your program expands the macro in many places, it can grow surprisingly larger.

The macro may expand to a subexpression that doesn't bind as tight as a function call. This is an unacceptable surprise and always has been. A liberal use of parentheses in the macro definition can eliminate such nonsense.

The macro may expand one of its arguments to code that is executed more than once or not at all. A macro argument with side effects will cause surprises. While some C programmers consider such surprises acceptable, modern practice avoids them. Only two Standard C library functions, getc and putc, permit such unsafe behavior.
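The last two surprises are easy to reproduce. The sketch below is only an illustration: the macros SQ_BAD and SQ_OK and the helper functions are hypothetical names, not part of any library. It shows how missing parentheses change the meaning of an expression, and how even the fully parenthesized form evaluates its argument twice.

```c
#include <assert.h>

/* Hypothetical macros, for illustration only. */
#define SQ_BAD(x) x * x         /* no parentheses: binds loosely */
#define SQ_OK(x) ((x) * (x))    /* binds as tight as a function call */

static int ncalls = 0;

/* Return 3 and count how often the call is evaluated. */
int next_value(void)
{
    ++ncalls;
    return 3;
}

/* Return how many times SQ_OK evaluates its argument. */
int count_evaluations(void)
{
    int square;

    ncalls = 0;
    square = SQ_OK(next_value());
    assert(square == 9);
    return ncalls;
}

/* Exercise both macros. */
void check_macros(void)
{
    assert(SQ_BAD(1 + 2) == 5);     /* expands to 1 + 2 * 1 + 2 */
    assert(SQ_OK(1 + 2) == 9);      /* expands to ((1 + 2) * (1 + 2)) */
    assert(count_evaluations() == 2);
}
```

Parentheses cure the binding problem but not the double evaluation. That is why an argument with side effects remains unsafe for macros of this shape.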
So the challenge in those early days was to produce a set of macros to replace the character classification functions. Because they were used a lot, they had to expand to compact code. They also had to be reasonably safe to use. What evolved was a set of macros that used one or more translation tables. Each macro took the form:

#define isxxx(c) (ctyptab[c] & XXXMASK)
The character c indexes into the translation table ctyptab. Different bits in each table entry characterize the index character. If any of the bits corresponding to the mask XXXMASK are set, the character is in the tested class. The macro expands to a compact expression that is nonzero for all the right arguments.
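Here is a minimal sketch of the table-driven approach, not the code of any particular library. The names my_ctyptab, MY_DIGIT, MY_SPACE, my_isdigit, my_isspace and my_ctype_init are all hypothetical. The sketch assumes the argument lies in the range 0 through UCHAR_MAX, makes no provision for EOF, and fills the table at run time only to keep the example short.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical class bits; real libraries choose their own. */
#define MY_DIGIT 0x01
#define MY_SPACE 0x02

/* One entry per value representable as unsigned char. */
static unsigned char my_ctyptab[UCHAR_MAX + 1];

/* Each macro evaluates its argument exactly once. */
#define my_isdigit(c) (my_ctyptab[c] & MY_DIGIT)
#define my_isspace(c) (my_ctyptab[c] & MY_SPACE)

/* Fill the table using the "C" locale definitions of the classes. */
void my_ctype_init(void)
{
    const char *p;

    for (p = "0123456789"; *p != '\0'; ++p)
        my_ctyptab[(unsigned char)*p] |= MY_DIGIT;
    for (p = " \f\n\r\t\v"; *p != '\0'; ++p)
        my_ctyptab[(unsigned char)*p] |= MY_SPACE;
}

/* Exercise the macros on a few characters. */
void check_table(void)
{
    my_ctype_init();
    assert(my_isdigit('7') && !my_isdigit('x'));
    assert(my_isspace('\t') && !my_isspace('7'));
}
```

Because the macro body mentions its argument only once, an argument with side effects is evaluated once, just as it would be by a function call.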
One drawback to this approach is that the macro generates bad code for all the wrong arguments. Expand it with an argument not in the expected range and it accesses storage outside the translation table. Depending on the implementation, the error can go undetected or it can terminate execution with a cryptic message.
On a machine that represents type char the same as signed char, this is a common error. The function call isprint(c) looks safe enough. But say c has type char and holds a value with the sign bit set. The argument will be a negative value almost certainly out of range for the function. Few programmers know to write the safer form isprint((unsigned char)c).
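A short sketch of the safer form in use follows. The function count_printable is a hypothetical helper, not a library function. It walks a string of plain char and casts each element before classifying it, so the argument stays in range whether char is signed or unsigned.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count the printing characters in s. The cast keeps the argument
   to isprint in the range 0..UCHAR_MAX even where char is signed. */
size_t count_printable(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s)
        if (isprint((unsigned char)*s))
            ++n;
    return n;
}

/* In the "C" locale, 'a', ' ' and 'b' print; '\n' does not. */
void check_count(void)
{
    assert(count_printable("a b\n") == 3);
}
```

Whether characters with the high bit set count as printing is up to the implementation and the locale. The cast only guarantees that the call is well defined.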
Nevertheless, translation tables remain the basis for many modern implementations of the character classification functions. They help the implementor provide efficient macros, even in the presence of multiple locales. And these functions remain important to the modern C programmer. You should use them wherever possible to sort characters into classes. They greatly increase your chances of having code that is both efficient and correct across varied character sets.

Character Classifications
Classifying characters is not as easy as it appears. First you have to understand the classes. Then you have to understand where all the common characters live within the class system. Then you have to decide where to tuck the less than common characters. Then you need some understanding of how everything changes when you move to an implementation with a different character set. Finally, you need to be aware of how the classes can change when the program switches out of the "C" locale.
Let's start at the beginning. The classes defined by the character classification functions are:
digit — one of the ten decimal digits '0' through '9'
hexadecimal digit — a digit or one of the first six letters of the alphabet in either case, 'a' through 'f' and 'A' through 'F'
lower case letter — one of the letters 'a' through 'z', plus possibly others outside the "C" locale
upper case letter — one of the letters 'A' through 'Z', plus possibly others outside the "C" locale
letter — one of the lower case or upper case letters, plus possibly others outside the "C" locale
alphanumeric — one of the letters or digits
graphic — a character that occupies one print position and is visible when written to a display device
punctuation — a graphic character that is not an alphanumeric, including at least the 29 such characters used to represent C source text
printable — a graphic character or the space character ' '
space — the space character ' ', one of the five standard motion control characters (form feed, newline, carriage return, horizontal tab, or vertical tab), plus possibly others outside the "C" locale
control — one of the five standard motion control characters, backspace, alert (or bell), plus possibly others.
Note that two of these classes are open-ended even in the "C" locale. An implementation can define any number of additional punctuation or control characters. In ASCII, for example, punctuation also includes characters such as @ and $. Control characters include all the codes between decimal 1 and 31, plus the delete character, whose code is 127.
If you find all these classes confusing, take heart. So do I. I need a diagram to sort them all out. Figure 1 (taken from P.J. Plauger and Jim Brodie, Standard C, Microsoft Press, 1989) shows how the character classification functions relate to each other.
The characters in the rounded rectangles are all the members of the basic C character set. These are the characters you use to represent an arbitrary C source file. The C standard requires that every target character set contain all of these characters. Every target character set must also contain the null character, whose code is zero.
I have added single and double plus signs under some of the function names. A single plus sign indicates that the function can represent additional characters outside the "C" locale. A double plus sign indicates that the function can represent additional characters even in the "C" locale.
A target character set can contain members that fall in none of these classes. The null character is best left out of all classes, for example. The same character must not, however, be added at more than one place in the diagram. If it is a lower case letter, it is of course also in several other classes by inheritance. But a character must not be considered both punctuation and control, for example.
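The relationships among the classes can also be checked mechanically. The following sketch assumes a conforming library and uses a hypothetical function name, classes_consistent. It tests every value representable as unsigned char against the definitions quoted earlier from the C standard.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Return 1 if the classes relate as the C standard describes for
   every value representable as unsigned char, else return 0. */
int classes_consistent(void)
{
    int c;

    for (c = 0; c <= UCHAR_MAX; ++c) {
        int alnum = isalnum(c) != 0;
        int graph = isgraph(c) != 0;
        int print = isprint(c) != 0;
        int punct = ispunct(c) != 0;

        if (alnum != (isalpha(c) || isdigit(c)))
            return 0;       /* alphanumeric is letter or digit */
        if (print != (graph || c == ' '))
            return 0;       /* printable is graphic or space */
        if (punct != (graph && !alnum))
            return 0;       /* punctuation is graphic, not alnum */
        if (punct && iscntrl(c))
            return 0;       /* never both punctuation and control */
        if (isdigit(c) && !isxdigit(c))
            return 0;       /* every digit is a hexadecimal digit */
    }
    return 1;
}

/* A program starts in the "C" locale, where all checks should hold. */
void check_classes(void)
{
    assert(classes_consistent());
}
```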
As you can see from the diagram, nearly all the functions can change behavior in a program that alters its locale. Only isdigit and isxdigit remain unchanged. If your code intends to process the local language, this is good news. The locale will alter islower, for example, to detect any additional lower case letter.
If your code endeavors to be locale independent, however, you must program more carefully. Supplement any tests you make with the character classification functions to weed out any extra characters that sneak in. Or get all your locale independent testing out of the way before your program changes out of the "C" locale.
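One way to supplement a classification test is sketched below. The helper is_basic_lower is a hypothetical name. It accepts a character only if islower approves it and it is also one of the 26 lower case letters of the basic C character set, so any extra letters a locale contributes are weeded out.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: nonzero only for the 26 lower case letters of
   the basic C character set, whatever the current locale. The
   argument must be representable as unsigned char or equal EOF. */
int is_basic_lower(int c)
{
    /* strchr also matches the terminating null, so rule out zero. */
    return c != '\0' && islower(c)
        && strchr("abcdefghijklmnopqrstuvwxyz", c) != NULL;
}

/* Exercise the helper on a few characters. */
void check_basic_lower(void)
{
    assert(is_basic_lower('q'));
    assert(!is_basic_lower('Q'));
    assert(!is_basic_lower('1'));
    assert(!is_basic_lower('\0'));
}
```

The explicit list of letters does the real work here, since it does not depend on the locale or on the letters having contiguous codes.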
If neither of these options is viable, you may have to revert part or all of the locale for a region of code. Begin the region with

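#include <locale.h>
#include <stdlib.h>
#include <string.h>
.....
/* Query the current setting; the next setlocale call may overwrite it. */
char *ls = setlocale(LC_CTYPE, NULL);
if (ls) {
    char *ss = malloc(strlen(ls) + 1);
    ls = ss ? strcpy(ss, ls) : NULL;    /* keep a private copy */
}
setlocale(LC_CTYPE, "C");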
And end the region with

if (ls) {
    setlocale(LC_CTYPE, ls);
    free(ls);
}
The copy is needed because any later call to setlocale may overwrite the string that an earlier call returned. If the region is large, or if the code will be maintained by others in the future, such a call is easy to overlook. Play it safe. You are better off making the code robust than saving a few microseconds.
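The same save-and-restore idea can be packaged as a pair of functions. This is a sketch under stated assumptions, not a library interface: the names ctype_enter_c and ctype_leave_c are hypothetical, and the sketch leaves the locale untouched if it cannot save the old setting.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: switch LC_CTYPE to "C" and return a private
   copy of the previous setting. Return NULL, leaving the locale
   unchanged, if the setting cannot be saved. */
char *ctype_enter_c(void)
{
    const char *cur = setlocale(LC_CTYPE, NULL);    /* query only */
    char *saved;

    if (cur == NULL)
        return NULL;
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return NULL;
    strcpy(saved, cur);         /* copy before the next setlocale call */
    setlocale(LC_CTYPE, "C");
    return saved;
}

/* Restore the setting saved by ctype_enter_c and free the copy. */
void ctype_leave_c(char *saved)
{
    if (saved != NULL) {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
}

/* A program starts in the "C" locale, so a round trip ends there. */
void check_round_trip(void)
{
    char *saved = ctype_enter_c();

    assert(saved != NULL);
    assert(strcmp(setlocale(LC_CTYPE, NULL), "C") == 0);
    ctype_leave_c(saved);
    assert(strcmp(setlocale(LC_CTYPE, NULL), "C") == 0);
}
```

A caller brackets the locale-independent region with ctype_enter_c() and ctype_leave_c(), passing the pointer returned by the first to the second.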
The important message is that Standard C introduces a new era. You can now write code more easily for cultures around the world, which is good. But you must now write code with more forethought. If it can end up in an international application, it may someday process characters undreamed of by early C programmers. Trust the character classification functions to contain the problem, to help you with it, and to delineate what can change.

Summary
I've reviewed the evolution of the character classification functions in the Standard C library. I've shown you how they relate to each other. And I've indicated how the functions can change between implementations and between locales.
Next month, I will discuss implementation issues for the functions and macros defined in <ctype.h>. I will also present code for the header and the functions. None of it is complex, but keeping it portable and adaptable to changing locales is a delicate matter. Stay tuned.