Columns


Standard C

Large Character Set Support

P.J. Plauger


P.J. Plauger is senior editor of The C Users Journal. He is convenor of the ISO C standards committee, WG14, and active on the C++ committee, WG21. His latest books are The Standard C Library, published by Prentice-Hall, and ANSI and ISO Standard C (with Jim Brodie), published by Microsoft Press. You can reach him at pjp@plauger.com.

Introduction

Last month, I described the new machinery for processing Defect Reports — to interpret or correct the C Standard. (See "State of the Art: Formal Changes to C," C Users Journal, April 1993.) I also outlined the "normative addendum" that will add a number of features to Standard C. Those changes are now being balloted for approval. Some are designed to help programmers write C source code that is more readable. I described them last month. But most of the changes are designed to ease writing programs that manipulate large character sets, a topic of growing interest.

My goal this month is to start showing you the other, more extensive group of changes in the works for Standard C. Understand, they are still subject to modification. Member nations of ISO, and other interested parties, get several rounds of comments before we in WG14 freeze the normative addendum. I like to think, however, that this particular material is reasonably stable. It should give you a portent of how Standard C will look in another year or so.

Equally important, my goal is to suggest some ways you might actually use these new capabilities. To some extent, the conversion to using a large character set is simple. Just look for all the places where you manipulate text as type char. Replace them with objects and values of type wchar_t. Whatever the implementation chooses for a large character set, its codes must be representable as that type. But all sorts of subtle changes must occur as well. I'll endeavor to show you some of them.

The Lightweight Approach

I begin by emphasizing that many programs can already speak Japanese, as it were. Programs often just read and write text without interpretation. You can feed them strings containing single-character and multibyte sequences intermixed, and they don't mess them up. (If they do, you can often fix the programs with little effort. Typically, you just stop presuming that you can break a string of text arbitrarily between any two characters.)

A display or printer that does the right thing with these multibyte sequences may well show Kanji characters, just the way the user desires. Your program neither knows nor cares what happens to them outside its boundaries. Multibyte is usually the representation of choice outside programs. If you plan to do little or no manipulation of such text, it can be the representation of choice inside a program as well.

But let's say you have occasion to manipulate small amounts of text from a large character set within your program. You need to distinguish character boundaries so you can rearrange text or insert other characters. In that case, wide characters are the representation of choice inside the program. You want to deal, at least sometimes, in elements and arrays of type wchar_t.

Standard C already provides a minimal set of functions for converting back and forth between multibyte and wide character. You can also write wide-character constants, as L'x', and wide-character strings, as L"abc". Several people have demonstrated that you can do quite a bit of work with just what's already standardized. (See, for example, my book, The Standard C Library, Prentice Hall, 1992.)

One thing we chose not to add to the original C Standard, however, was the ability to read and write wide characters directly. A format string can contain multibyte characters as part of its literal text, but you cannot easily read and write data of type wchar_t. You have to convert wide characters to a multibyte string in a buffer before you print it out. Or you have to read a multibyte string into a buffer before you convert it to wide characters. That can be a nuisance.

Print and Scan Additions

So one of the changes in the normative addendum is to add capabilities to the existing print and scan functions. These are the functions declared in <stdio.h> with print or scan as part of their names. Each family of functions now recognizes two new conversion specifiers, %C and %S. For the print functions:

%C writes to the output the multibyte sequence corresponding to the wchar_t argument. A shift-dependent encoding begins and ends in the initial shift state. That can result in redundant shift codes in the output, but the print functions make no attempt to remove them.

%S writes to the output the multibyte sequence corresponding to the null-terminated wide-character string pointed to by the pointer to wchar_t argument. The same rules apply for shift codes as for %C. You can use a precision, as in %.5S, to limit the number of array elements converted.

In no event will either of these conversion specifiers produce a partial multibyte sequence. They may, however, report an error. A wide-character encoding may consider certain values of type wchar_t invalid. Try to convert one and the print functions will store the value of the macro EILSEQ in errno. That macro is now added to the header <errno.h>.

As usual, the behavior of the scan functions for the same conversion specifiers is similar to, but different from, that of the print functions:

%C reads from the input a multibyte sequence and converts it to a sequence of wide characters. It stores these characters in the array designated by the pointer to wchar_t argument. You can use the field width to specify a character count other than one, as in %5C. The input sequence must begin and end in the initial shift state (if states matter).

%S does the same job, but with two critical differences. The multibyte sequence to be converted is determined by first skipping leading whitespace, then consuming all input up to but not including the next whitespace. Here, whitespace has its old definition of being any single character for which isspace returns true in the current locale. (Yes, that can cause trouble with some multibyte encodings.) The resultant sequence must begin and end in the initial shift state. The other difference is that the scan functions store a terminating null after the sequence of converted wide characters.

A multibyte encoding may consider certain character sequences invalid. Try to convert one and the scan functions will store the value of the macro EILSEQ in errno. The print functions may generate no incomplete multibyte sequences, but the scan functions aren't nearly as tidy. They can leave the input partway through a multibyte character, or in an uncertain shift state, for all sorts of reasons. Remember, both families of functions are still essentially byte-at-a-time processors.

Staying Light

These additions to the print and scan functions were discussed before the C Standard was frozen, as I indicated earlier. We rejected them at the time because they offered only partial solutions to several problems.

The new "wide-character streams" solve such problems, but at a higher cost in machinery. I'll discuss them in a later installment.

So why did WG14 agree to add machinery which is known to be inadequate? Because, for many applications, it is good enough, thank you. People writing "internationalized" applications have discovered a whole spectrum of needs. Earlier, I described a class of programs that need essentially no special support for large character sets. Others can use what's already in the C Standard. The biggest nuisance that many practicing programmers keep reporting is this omitted ability to read and write wide characters directly. For a small cost in code added to the print and scan functions, you get enough new capability to help many programmers.

The Header <wchar.h>

But let's say you need to get more serious about large character sets. In that case, you need more machinery. The normative addendum provides lots more machinery when you include the new header <wchar.h>. Listing 1 shows a representative version of this header. The actual types used in the type definitions may vary among implementations, as can the value of the macro WEOF. I have simply made common choices.

As you can see, this header declares quite a few new functions. Most of the new names are weird enough, but they can still conceivably clash with names in existing C programs. That bodes ill for backward compatibility, a prime virtue in any update to a programming language standard.

WG14 ameliorated the problem by deviating slightly from past practice. The name of every current library function is reserved in the external name space whether or not you include the header that declares it. That lets an optimizing compiler look for function names such as sqrt with fewer fears that it's guessing wrong in treating them specially. But the names of functions declared in <wchar.h> are not similarly reserved. If you include the new header in any translation unit of a program, you have to look for conflicts. Otherwise, you can ignore the new stuff without fear.

You'll notice a handful of type definitions in the new header. The first three are old friends. size_t and wchar_t are, in fact, already declared in multiple headers. struct tm, up to now, has been declared only in the header <time.h>. It is needed here to declare the function wcsftime. Note, however, that in <wchar.h> it is declared as an incomplete structure type. You still have to include <time.h> if you want to complete the type declaration, so you can poke at its innards.

Wide Meta-Characters

Books on C seldom emphasize the point, but C has long supported a "meta-character" data type. It is implemented as type int with additional semantic constraints. Look, for example, at what the function fgetc returns. We're promised that the int return value is one of two things: the code for the next character from the stream, treated as an unsigned char and converted to int, or the value of the macro EOF, which reports end-of-file or a read error.

Further, we are assured that EOF is distinguishable from all valid character codes. The typical implementation, in fact, has eight-bit characters and sets EOF to -1. So valid values for a meta-character are in the closed interval [-1, 255]. We use meta-characters all the time. We read them with fgetc or one of its sisters, getc or getchar. We test them with the functions declared in <ctype.h>, such as isdigit or isalnum. And, provided we filter out stray EOFs, we can even write them with fputc or one of its sisters, putc or putchar. Pretty handy.

It is only natural that the world of wide characters should demand analogous machinery. Thus <wchar.h> defines a macro and a type that support programming with wide meta-characters: the type wint_t, an integer type that can represent any valid wide-character code plus at least one additional value, and the macro WEOF, which has type wint_t and plays the same role that EOF does for ordinary meta-characters.

Now you should be able to understand why so many of the functions are declared the way they are in <wchar.h>. They simply carry on the old, and demonstrably useful, C tradition of trafficking in meta-characters. Only now they're wide meta-characters.

There's one small subtlety I hate to let slide by without notice. The type wint_t is often represented with more bits than the type wchar_t. That gives the implementor lots of choices for the value of WEOF. But that need not be the case. It's perfectly valid for wint_t to be the same size as wchar_t. Then, the value of WEOF must be chosen from among the invalid wide-character codes. (And there better be at least one to choose from.)

So don't fall into the habit of thinking of wchar_t as some unsigned integer type. And don't assume that WEOF is -1, or any other negative value. You may one day be surprised. (If you want to write highly portable code, in fact, prepare for the possibility that types char and int are the same size. But that's a whole 'nother sermon.)

State Memories

One of the notorious shortcomings of the Standard C library is its dependence on private memory. A number of functions maintain static storage to remember various values between calls. Some storage is even shared among different functions. The drawbacks of private memory in library functions are well known: such functions are not reentrant, separate uses of the same function can interfere with each other, and the hidden state is a hazard in any program with interrupt handlers or multiple threads of control.

We knew these shortcomings when we developed the C Standard. Much as we wanted to fix many of the offending functions, we decided it was too troublesome to change them. Too much existing code would have to be rewritten.

Worse, we perpetuated this practice when we added several functions that manipulate large character sets. As they parse or generate multibyte strings, they may have to keep track of the current shift state. So the functions mblen, mbtowc, and wctomb have private memory. You initialize the memory to the initial shift state with one kind of call. You then progress through a multibyte string with other kinds of calls, and the memory tracks the shift state.

Our major reason for using private memory was to avoid adding yet another argument to each of these functions. That argument needs a special type definition as well. We didn't want all that semantic detail if people weren't going to parse multibyte strings all that much. After all, we've gotten by with strtok all these years, haven't we?

Well, it turns out we were wrong. Seems lots of people are manipulating multibyte strings these days. And they chafe at the shortcomings I outlined above. Thus the new type definition mbstate_t, and the new functions that use it. I'll describe them all next month.

Conclusion

You'll find one more new type definition in <wchar.h>. The type wctype_t serves a limited role. It describes the "handle" returned from the function wctype. You use the handle only in the wide-character testing function iswctype. The handle corresponds to one of an open-ended set of character classifications. What are the classifications and how can you make new ones? The answer is sufficiently complex that I must also defer it to next month.

It has taken me a whole installment just to lay the groundwork for understanding the large character set support being added to Standard C. A glance at Listing 1 shows that <wchar.h> declares a mess of functions. By now, you can probably guess what many of them do. Your guesses are probably even right in most cases. My next task is to reinforce your prejudices where you're right and show you the surprises elsewhere. Tune in next month.