Columns


Standard C

Large Character Set Functions

P.J. Plauger


P.J. Plauger is senior editor of The C Users Journal. He is convenor of the ISO C standards committee, WG14, and active on the C++ committee, WG21. His latest books are The Standard C Library, published by Prentice-Hall, and ANSI and ISO Standard C (with Jim Brodie), published by Microsoft Press. You can reach him at pjp@plauger.com.

Introduction

This is the second in a series of columns on the facilities being added to Standard C. (See "Standard C: Large Character Set Support," C Users Journal, May 1993.) I spent the last column describing the motivation behind adding support for manipulating large character sets in C. I described the changes to existing functions in the Standard C library. I also introduced the new header <wchar.h> that declares all sorts of additional functions.

My goal this month is to show you many of those new functions. I do so in the context of how you might actually use them, should the need arise. One day, your enterprise might feel moved to sell code into Japan, China, or the Arab world. You might find an advantage in writing code that is truly international in its handling of text. Or you might want to manipulate text in numerous fonts and colors. If so, you'll find no more sophisticated machinery than what's already in Standard C, at least not in any other standardized programming language. The additions described here make for a particularly rich environment for manipulating large character sets.

I remind you that what I'm describing here is part of a "normative addendum" to the ISO C Standard. It is still subject to balloting within SC22, the parent committee of the C standards committee WG14. It may change in response to comments or criticism. But we've been reviewing it within WG14 for about three years now. I think it's mostly stable.

Traditional Character Classification

The existing header <ctype.h> declares a number of functions for classifying (one-byte) characters. They have proved their worth in many a C program over the past two decades. An important part of processing text involves classifying characters in various ways. Some of those classifications occur so often that they're worth capturing in standard library functions. Thus, the functions declared in <ctype.h> help you test quickly for characters in a number of common classes. These classes include lower-case letters, upper-case letters, digits, punctuation, and so forth.

Not all the classes are exactly the same across implementations. You can be sure that isdigit returns true (nonzero) only for the ten decimal digits. You can be sure that ispunct returns true for all the punctuation required to write a C program. ispunct also often returns true for such characters as the at sign @, but it doesn't have to. It might return true for additional characters that you don't expect. In short, Standard C says that the punctuation class is to some extent implementation defined.

That's not the end of it. When we put locales in Standard C, we allowed for additional changes to the character classification functions. A program starts out in the "C" locale, where the behavior is pretty traditional. You can call the function setlocale, however, to change to a new locale. If that locale specifies a new category LC_CTYPE, several of the functions declared in <ctype.h> can begin behaving differently. They can recognize additional punctuation, or lower-case letters, or a few other critters.
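For example, a program might adopt a named locale for character classification alone with a call like the one sketched below (the helper name is mine, and error handling is minimal):

```c
#include <locale.h>

/* Adopt a named locale for the LC_CTYPE category only, leaving
   the other locale categories untouched. Returns nonzero on
   success, zero if the implementation doesn't know the name. */
int use_ctype_locale(const char *name)
{
    return setlocale(LC_CTYPE, name) != NULL;
}
```

Pass "" as the name to adopt the user's native conventions, whatever those may be.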

The proper style these days for classifying characters is thus to use the <ctype.h> functions religiously. They are more likely to get the right answer regardless of execution character set and regardless of current locale.
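For instance, here is one way to count the letters in a string portably. The cast through unsigned char keeps the argument in the range the <ctype.h> functions require, whatever the execution character set:

```c
#include <ctype.h>
#include <stddef.h>

/* Count the alphabetic characters in a null-terminated string.
   The cast to unsigned char keeps the argument valid for isalpha
   even when plain char is signed. */
size_t count_letters(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (isalpha((unsigned char)*s))
            ++n;
    return n;
}
```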

Wide Character Classification

Now imagine the additional possibilities that come with a really large character set. You still want the old classifications. ASCII is a common subset of the various Kanji codes, for instance. And you still have occasion to parse numbers and names based on English-language rules. But you now face two additional considerations: you need classification functions that accept wide characters, and a character set that large invites classifications beyond the traditional ones. To meet the first need, the header <wchar.h> declares wide-character analogs of all the traditional classification functions:

int iswalnum(wint_t wc);
int iswalpha(wint_t wc);
int iswcntrl(wint_t wc);
int iswdigit(wint_t wc);
int iswgraph(wint_t wc);
int iswlower(wint_t wc);
int iswprint(wint_t wc);
int iswpunct(wint_t wc);
int iswspace(wint_t wc);
int iswupper(wint_t wc);
int iswxdigit(wint_t wc);
wint_t towlower(wint_t wc);
wint_t towupper(wint_t wc);
I described the new type wint_t last month. It represents all valid values of the wide-character type wchar_t, plus the value WEOF for wide-character end-of-file.

These functions essentially behave the same as their older cousins for characters that are also representable as single-byte characters. (The language in the normative addendum is twisty, but that's what it amounts to.) Some can also return true for additional wide characters, by an extension of the same latitude granted their older cousins.

The parallels break down in two cases, both involving the exclusion of whitespace. The older functions isgraph and ispunct are defined in terms of what other functions accept, minus the space character ' '. The newer functions iswgraph and iswpunct are defined in terms of what the analogous functions accept, minus the characters accepted by iswspace. That certainly includes the space character, but it might also include others, even single-byte white-space characters. We on WG14 couldn't resist the temptation to properly generalize these functions, even at the cost of some backward compatibility.
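For example, here is a sketch of skipping leading white space the generalized way, trusting iswspace rather than testing for the space character alone (the helper name is mine; note that today's libraries declare the isw* functions in <wctype.h>):

```c
#include <wchar.h>
#include <wctype.h>

/* Advance past any leading white space in a wide-character
   string, using iswspace rather than a test against L' ' so
   that locale-specific white space is skipped too. */
const wchar_t *skip_wspace(const wchar_t *s)
{
    while (iswspace((wint_t)*s))
        ++s;
    return s;
}
```

Whatever iswspace accepts here is exactly what iswgraph and iswpunct exclude.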

What this means in practice is really rather simple, despite my nit-picking descriptions. You can change your programs to handle large character sets mostly by changing is to isw and to to tow in the names of the classification functions, and by changing the types of the characters you classify from char (or int) to wchar_t (or wint_t).

If you run into problems, I've indicated a few subtle differences that can bite. Chances are, however, that you will have far more trouble getting the obvious code changes right than finding subtle changes in the definition of white space.

New Classifications

Look at all the ways we've found it convenient to classify one-byte characters over the years. Now imagine all the possibilities when you have a character set with thousands of elements. Actually, not even the Japanese could imagine all the new classifications they might want. The one thing they were sure of was that various people would want new ones.

So what they proposed was an open-ended set of new classifications. An implementation publishes a list of property names. These are just null-terminated strings like so many other names in C. (They could be multibyte strings, but who knows or cares? All you have to do is read them in printed form and be able to reproduce them as string literals in a program you write.)

To make use of one of these new classifications, you obtain its handle by calling the function wctype, declared in <wchar.h>, as in:

wctype_t hirigana = wctype("hirigana");
I described the type wctype_t briefly last month. All you need to know is that it is declared in <wchar.h> as some scalar type that you can use as an argument to iswctype (also declared in that header).
Listing 1, for example, shows a possible function that tests whether an entire wide-character string is hirigana. It assumes, of course, that "hirigana" is a legitimate classification name, as in the example above.

How many such classifications are there? That depends on the implementation. Every implementation must accept 11 names, corresponding to the 11 standard classification functions — "alpha", "alnum", "cntrl", etc. The promise is that an expression such as iswctype(wc, wctype("upper")) is entirely equivalent to iswupper(wc). An implementation may or may not accept additional names.
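For example, the two spellings below must agree for every wide character (the function name is my own, for illustration; modern libraries declare wctype and iswctype in <wctype.h>):

```c
#include <wchar.h>
#include <wctype.h>

/* Classify a wide character by name. With the name "upper",
   this must behave exactly like iswupper. */
int is_upper_by_name(wint_t wc)
{
    wctype_t upper = wctype("upper");
    return iswctype(wc, upper);
}
```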

The set of accepted classification names can even change with locale. A call to setlocale that changes the category LC_CTYPE can switch to a new set of character classifications. (It can even switch to a new encoding for wide characters, within limits, but that latitude is probably more theoretical than practical.)

As some of you may know, I have an implementation of the Standard C library that includes support for extensible locales. (See my book, The Standard C Library, Prentice Hall, 1992.) I am currently adding the facilities in <wchar.h> to that implementation. The new code supports added notation for locale files, so you can add an open-ended set of wide-character classifications to any locale.

My suspicion is that it will take several years for people to agree on the commonest classifications for each large character set. In the meantime, there's little use to be made of this added capability. Remember, it's not portable, and it's not likely to correspond to anything in existing code. Just bear it in mind for future use.

<string.h> Revisited

If you like manipulating null-terminated strings of char in C, you'll probably also like doing the same sorts of things with large character sets. So the header <wchar.h> declares analogs for all our old friends from the standard header <string.h> (Listing 2). Note the general substitution of wcs (Wide-Character String) for str (String), except that strstr became the more sensible wcsstr instead of the pedantic wcswcs. You might also note the lack of any mem analogs in wide-character land. So far, the committee has heard no strong plea for them. You can mostly use the older functions even with the larger character types.
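For example, you can copy an array of wchar_t with the existing memcpy simply by scaling the count (a sketch; the helper name is mine):

```c
#include <string.h>
#include <wchar.h>

/* Copy n wide characters (including any terminating null you
   count in n) using the existing memcpy. The byte count is the
   element count scaled by sizeof(wchar_t). */
void copy_wchars(wchar_t *dst, const wchar_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof(wchar_t));
}
```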

A quick glance reveals that the new functions differ from the older ones in a simple way. Where the functions declared in <string.h> have an argument or return value that involves type char, the newer functions instead use type wchar_t. Thus, you can convert an existing program that manipulates strings mostly by changing str to wcs in the function names and by changing the strings themselves from arrays of char to arrays of wchar_t.

WG14 mostly resisted the urge to "fix" the string functions, to ease this sort of migration.

A closer glance, however, reveals that one of the functions has indeed been fixed. The function strtok has always been notorious in the C library. It is the only string function that requires the use of static memory, to retain state information between calls. strtok certainly has plenty of company throughout the rest of the Standard C library. Still, it's a pity that this one function could have been kept pure with just a bit more work, yet traditionally was not.

So WG14 bowed to temptation in this area. The added argument wchar_t **ptr is the address of a pointer object that the caller must provide to the function wcstok. This pointer takes the place of the static memory used in the older strtok. By providing pointers to different objects, you can keep track of where you are in different wide-character strings. Thus, you can now parse multiple (wide character) strings at the same time. The cost is that you have more work to do in converting existing code.
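Here is a sketch of wcstok in use (the helper is my own invention). Because each scan carries its own state pointer, two such loops can be interleaved freely:

```c
#include <stddef.h>
#include <wchar.h>

/* Count the tokens in a wide-character string. The caller-supplied
   state pointer replaces strtok's static memory, so two scans over
   different strings never disturb each other. Note that wcstok
   writes null characters into s, just as strtok does. */
size_t count_wtokens(wchar_t *s, const wchar_t *delims)
{
    size_t n = 0;
    wchar_t *state;
    wchar_t *tok = wcstok(s, delims, &state);
    while (tok != NULL) {
        ++n;
        tok = wcstok(NULL, delims, &state);
    }
    return n;
}
```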

<time.h> Revisited

The header <time.h> has three functions that produce null-terminated character strings. Two are the traditional functions asctime and ctime. The third is the function invented by X3J11, strftime. You can do essentially everything with the new function, and then some, that you could do with the older ones. Hence, WG14 decided to provide a wide-character version of only the new one:

size_t wcsftime(wchar_t *s, size_t maxsize, const wchar_t *format,
   const struct tm *timeptr);
wcsftime differs from strftime in two ways: its format argument is a wide-character string instead of a multibyte string, and the text it generates in s is likewise a sequence of wide characters instead of multibyte characters.

Note that strftime treats its format argument as a multibyte string. Assuming the implementation is tidy enough, it can thus also generate well-formed multibyte strings. Hence, the existing C Standard already provides all the machinery needed to format times using large character sets.

WG14 nevertheless elected to add wcsftime. The idea is to eliminate wherever possible the need to represent any multibyte strings within a program. That in turn eliminates the need to convert back and forth between multibyte and wide character forms. And that, in the short run, makes it easier to convert programs to manipulating text represented with a large character set instead of the current single-byte sets. In the long run, that will also keep programs cleaner.

So to convert time strings to wide-character form, you first replace any calls to asctime and ctime with equivalent calls to strftime, then replace strftime with wcsftime and make its format argument and output buffer wide-character strings.
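A minimal sketch of wcsftime at work, formatting a date directly as wide characters (the function name and format string are my own choices):

```c
#include <time.h>
#include <wchar.h>

/* Format a broken-down time straight into a wide-character
   buffer; both the format and the result are wchar_t strings.
   Returns the number of wide characters generated, not counting
   the terminating null, or zero if the buffer is too small. */
size_t iso_date(wchar_t *buf, size_t n, const struct tm *tp)
{
    return wcsftime(buf, n, L"%Y-%m-%d", tp);
}
```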

<stdlib.h> Revisited

A handful of the functions declared in <stdlib.h> convert text strings to arithmetic representations — strtod, strtol, and strtoul. These three functions now have wide-character analogs:

double wcstod(const wchar_t *nptr, wchar_t **endptr);
long int wcstol(const wchar_t *nptr, wchar_t **endptr, int base);
unsigned long int wcstoul(const wchar_t *nptr, wchar_t **endptr, int base);
Again, the new versions don't add much new functionality. The characters they accept are all representable in single-byte form. (A possible exception is how an implementation chooses to represent special forms such as infinities and NaNs, but those are not likely to involve much Arabic or Kanji.) What you win, once again, is the ability to work purely with wide-character strings.
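A sketch of wcstol at work (the wrapper is mine). As with strtol, the endptr argument reports where the scan stopped:

```c
#include <stddef.h>
#include <wchar.h>

/* Parse a decimal long from a wide-character string, reporting
   how many wide characters were consumed via *consumed. */
long parse_long(const wchar_t *s, size_t *consumed)
{
    wchar_t *end;
    long v = wcstol(s, &end, 10);
    *consumed = (size_t)(end - s);
    return v;
}
```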

On the other hand, several other functions declared in <stdlib.h> have been supplemented considerably. These are the ones that help you walk along multibyte strings, or that convert between multibyte and wide-character forms. We added them because experience to date says we should. Each fills some gap we found in the minimal set of conversion functions in the current C Standard.

The first addition looks simple enough:

int wctob(wint_t wc);
This function takes a wide-character argument (or WEOF, as I discussed last month). It determines whether that wide character can be represented as a single-byte multibyte character in the initial shift state. If that is possible, the function returns the value of the single-byte representation. Otherwise, it returns EOF. (Note that the int return is really the "metacharacter" type used by fgetc and other functions, as I also discussed last month.)

What does wctob buy you? It is certainly convenient, for one thing. It's a nuisance to do the equivalent using existing functions:
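Here is a sketch of such a sequence as I might write it (the function name, buffer size, and error handling are my own): build a one-character wide string, convert it with wcstombs, and accept the result only if it occupies exactly one byte.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdio.h>    /* EOF */
#include <stdlib.h>   /* wcstombs */
#include <wchar.h>    /* wint_t, WEOF */

/* A roundabout stand-in for wctob: convert a one-character wide
   string and accept the result only if it converts to exactly
   one byte. */
int clumsy_wctob(wint_t wc)
{
    wchar_t ws[2];
    char buf[MB_LEN_MAX + 1];

    if (wc == WEOF)
        return EOF;
    ws[0] = (wchar_t)wc;
    ws[1] = L'\0';
    if (wcstombs(buf, ws, sizeof buf) == 1)
        return (unsigned char)buf[0];
    return EOF;
}
```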

But as much of a nuisance as this sequence is, it is still not guaranteed to give you what you want. The C Standard does not require that wcstombs generate the most economical string (although it probably will). It might have redundant shift codes at the start and/or end of the converted string. Thus, wctob is both essential and convenient. It even turns out to simplify the description of other functions.

Remembering Shift States

I noted earlier that strtok had the bad grace to retain static memory. Well, X3J11 had the bad grace to indulge that same unfortunate practice in some of the functions we added. Among these were the functions that convert back and forth between wide-character and multibyte encodings. We did so to keep the external interfaces simpler, figuring that the functions wouldn't be used that often. We were wrong.

So part of the proposed addition is a set of functions that do the same as existing conversion functions, but ask the caller to provide a pointer to the state memory. (Note the similarity to wcstok, described above.) The functions use the state memory to keep track of the current shift state, for those multibyte encodings that are state-dependent. You provide a pointer to an object of (nonarray) type mbstate_t, declared as usual in <wchar.h>.

The functions decide how best to use this storage on your behalf. You can initialize the mbstate_t object to zero and be sure that it represents the initial shift state. Or you can call any of several functions (described below) in such a way that they enter the initial shift state. You can also test such an object to see whether it represents the initial shift state, by calling the function:

int mbsinit(const mbstate_t *ps);
The functions that use this new state memory are analogous to existing functions, with an r (for Restartable) added in the middle of the name:

int mbrlen(const char *s, size_t n, mbstate_t *ps);
int mbrtowc(wchar_t *pwc, const char *s, size_t n, mbstate_t *ps);
int wcrtomb(char *s, wchar_t wc, mbstate_t *ps);
size_t mbsrtowcs(wchar_t *dst, const char **src, size_t len, mbstate_t *ps);
size_t wcsrtombs(char *dst, const wchar_t **src, size_t len, mbstate_t *ps);
You can convert from using the existing functions mostly by providing your own state objects, but even that is optional. All these functions supply their own internal state, just like the bad old days, if you ask them. The harder part of the conversion involves a number of small changes in behavior, which I won't bother to describe here. I suspect very few readers of this column today have invested much in code that fiddles with large character sets already.
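A minimal sketch of supplying your own state object (I use these functions as today's libraries declare them, with size_t returns). A zero-filled mbstate_t object is guaranteed to represent the initial shift state:

```c
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>   /* memset */
#include <wchar.h>    /* mbrtowc, mbstate_t */

/* Convert the first multibyte character of s to a wide character
   using caller-supplied state memory, instead of the hidden
   static memory that mbtowc uses. Returns the number of bytes
   consumed, as mbrtowc does. */
size_t first_wide(const char *s, wchar_t *pwc)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial shift state */
    return mbrtowc(pwc, s, MB_CUR_MAX, &state);
}
```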

Coming Soon

There is still one batch of added functions that I have yet to describe. They let you read and write new creatures called wide-character streams. That is the topic of next month's column.