P.J. Plauger, whose most recent book is The Standard C Library (Prentice Hall, 1992), is a member of ISO JTC1/SC22/WG14, the committee that standardizes C. He can be contacted at uunet!plauger!pjp.
C is the only standardized programming language that supports large character sets. That will not always be true. The Japanese have made their position clear to ISO, the international organization that standardizes programming languages. Several years ago, they announced their intention to veto any future language standards that do not contain similar support. Wisely, ISO passed a resolution endorsing the Japanese position.
C could have become the last standardized programming language that did not support large character sets. The Japanese were willing to exempt the C Standard because, at the time, it was very near completion. Many of us had already put five or more years into standardizing C. We were tired and ready to quit. But we didn't want to be the last of an old breed, not after all that work. Rather, we chose to do the extra work that made us the first of the new breed.
To do so, we had to bend our self-imposed rules a bit. Standard C is highly compatible with the C of Kernighan & Ritchie. We resisted rather well the numerous temptations to "fix" the language--particularly where such fixes would require existing C code to change. We did add a number of features. Some of the additions are changes to the language proper. Most are pure add-ons, such as new library functions. All of the additions, however new they may appear to many, were based on some form of prior art. Even function prototypes and type qualifiers (such as const) were derived from C++ and other dialects of C.
It was harder to find precedents for manipulating large character sets, at least the way we chose to do so. True, several companies have provided Kanji support libraries for a number of years. A few have permitted limited inclusion of Kanji characters within C source code itself. Nobody had chosen to be as ambitious as we felt we had to be. Like it or not, we had to be inventive.
We were equally inventive in adding "locales." That's the machinery we added to make the Europeans happy. They were just ahead of the Japanese in requesting that C be made more international. A locale summarizes many of the conventions of a given culture. Francophones want their dates spelled out in French. Accountants want negative numbers to print with a trailing "DB" instead of a leading minus. Dictionary writers want words to sort in a funny way. The locale machinery added to C is intended to support an open-ended set of such cultural conventions by defining and mixing locales.
If you are not conversant with locales and large-character support in C, don't fret. The material is so new that many experienced C programmers barely understand the basic concepts. More important, most C programmers don't need to care. At least not yet. The push for internationalization has just begun, and it may be years before your corner of the world will feel the impact.
But don't feel you can ignore this topic indefinitely. The international marketplace for software is growing fast. Whoever pays your salary will soon care very much about meeting that growing demand with an economy of effort. Standard C offers that economy better than any other programming language in use today. It behooves all professional programmers to understand the issues involved in this new field of "internationalization."
My focus in this article is primarily on support for large character sets. If you want to learn more about locales as well, see my book, The Standard C Library (Prentice Hall, 1992). It discusses the entire C library, but pays particular attention to features added for internationalization.
When Europeans talk about large character sets, they usually mean sets with extra language-specific characters. The 95 graphics defined in the U.S. form of ISO 646 (also known as ASCII) are not enough. Practically every European alphabet defines additional characters or accented versions of the English characters. In fact, I am told that only three languages in the world can get by with just the 26 letters of the common subset of ISO 646--English, Hawaiian, and Swahili.
Still, these extra characters number only in the dozens. You can throw in every known accented character in the European alphabets, the funny extra characters, Cyrillic, and Greek--and still fit all the graphics comfortably in a 256-character set. So the European notion of a large character set is one that uses all 8 bits of a byte. Forget any comfortable C-ish notions that all printable characters have positive values.
The Japanese face an entirely different problem. They inherited tens of thousands of Kanji characters from the Chinese. They also use several phonetic alphabets--Hiragana, Katakana, and Romaji (the Western alphabet). But they refuse to give up the compactness and delightful ambiguity of Kanji. They are not alone. The Chinese, Koreans, and Arabs likewise have huge alphabets that form an important part of their respective cultures. Nobody wants to stop using something that works well just because it's inconvenient for American software to process.
(In fairness to the Japanese, I should make an important observation here. They do not insist that new programming language standards support Kanji. They want them to support all large character sets, from all cultures around the world. That also helps with an internal political/technical problem in Japan. Several coding schemes are in common use for Kanji, just as both ASCII and EBCDIC are used in the U.S. The C Standard has always been general enough to accommodate both of the latter. It now also accommodates all the known ways to encode Kanji. And it allows for a variety of ways to encode the other large character sets of the world.)
Over the years, Japanese programmers have developed two distinct ways to augment text-processing software for large character sets. In the language of the C Standard, these are called "multibyte characters" and "wide characters." We included both in C because each has its uses. Naturally, that means we also had to include ways to convert between conventional, multibyte, and wide characters.
An old trick for expanding a character set is to give each code multiple meanings. The old Teletype Model 37, for example, could print both English and Greek characters. Send the terminal a "shift out" code (SO) and it began speaking Greek. A "b" printed as a beta, as I recall. Subsequent characters also printed funny until you sent a "shift in" code (SI). The terminal then reverted to more customary behavior; see Figure 1.
TEXT:    A [SO]b[SI]-ray is an electron.
DISPLAY: A beta-ray is an electron.
You get more mileage out of each character code this way, but at a price. How you interpret each code depends on what has gone before. You might assume, for example, that each sequence of characters begins in an "initial shift state." For our Model 37, that would be printing English characters. Most characters that follow are interpreted in this context to determine the "metacharacter" you really mean to designate. Some characters simply alter the current shift state. They specify no metacharacter at all, at least not by themselves. The Model 37 code may have (almost) doubled the number of characters you can represent, but it must maintain one bit of state information to determine each metacharacter.
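To make that bookkeeping concrete, here is a sketch of such a decoder in C. The function name is my own invention, and the SO/SI codes shown are the usual ASCII control values, not necessarily what any particular terminal used; buffers are assumed ample:

```c
#include <stddef.h>

#define SO '\016'   /* shift out: switch to the alternate (Greek) set */
#define SI '\017'   /* shift in: return to the initial shift state    */

/* Decode a byte stream into metacharacters. For each printing byte,
   record the byte itself and the shift state (0 initial, 1 shifted)
   that gives it meaning. Hypothetical helper for illustration only. */
size_t decode_shift(const char *src, char *out, int *state_of)
{
    int state = 0;                  /* the one bit of state memory */
    size_t n = 0;

    for (; *src != '\0'; ++src) {
        if (*src == SO) {
            state = 1;              /* alters state, no metacharacter */
        } else if (*src == SI) {
            state = 0;
        } else {
            out[n] = *src;
            state_of[n] = state;
            ++n;
        }
    }
    out[n] = '\0';
    return n;
}
```

Note that the shift codes themselves produce no output; they only change how the bytes that follow are interpreted.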
The Japanese JIS code takes this approach a step farther. In the initial shift state, each character defines a single metacharacter. ASCII is ASCII. You shift out to Kanji with the three-character sequence \33$B (ESC, dollar sign, capital B). In this state, each subsequent pair of characters determines a single metacharacter. Both the first and second characters of a Kanji pair must be in the range [0x21, 0x7e]. You shift in to ASCII with the three-character sequence \33(B; see Figure 2.
Some simple arithmetic tells you that you can specify nearly 10,000 distinct metacharacters with JIS. That's nowhere near all the Kanji characters--only the more popular ones are included. Still, it's worlds better than the mere 256 codes supported by a single 8-bit character. The price once more is added complexity. Parsing a JIS string takes work. It requires state memory just like the Model 37 code. And opportunities abound for making malformed strings.
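A sketch of that parsing work, as a hypothetical C helper that counts metacharacters and reports malformed strings:

```c
/* Count the metacharacters in a JIS-encoded string; return -1 for a
   malformed sequence. A hypothetical helper, not a library function. */
long jis_count(const char *s)
{
    int kanji = 0;                      /* one bit of shift state */
    long n = 0;

    while (*s != '\0') {
        if (s[0] == '\033') {           /* escapes alter state only */
            if (s[1] == '$' && s[2] == 'B') {
                kanji = 1;              /* shift out to Kanji */
                s += 3;
            } else if (s[1] == '(' && s[2] == 'B') {
                kanji = 0;              /* shift in to ASCII */
                s += 3;
            } else {
                return -1;              /* unrecognized escape */
            }
        } else if (kanji) {             /* pairs in [0x21, 0x7e] */
            unsigned char c1 = (unsigned char)s[0];
            unsigned char c2 = (unsigned char)s[1];
            if (c1 < 0x21 || c1 > 0x7e || c2 < 0x21 || c2 > 0x7e)
                return -1;
            s += 2;
            ++n;
        } else {                        /* ASCII is ASCII */
            ++s;
            ++n;
        }
    }
    return n;
}
```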
It is possible to eliminate the need for state memory. The Japanese Shift JIS code sets aside certain character codes to signal the start of a two-character sequence. A character in the range [0x81, 0x9f] or [0xe0, 0xfc] must be followed by a character in the range [0x40, 0xfc]. Together, these define a single Kanji metacharacter. Any other first character defines the metacharacter all by itself. (Again, ASCII is ASCII). See Figure 3.
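The stateless test for a Shift JIS lead byte is easy to capture in C. These helpers are illustrative, not part of any library:

```c
/* True if byte c begins a two-byte Shift JIS sequence. */
int sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9f) || (c >= 0xe0 && c <= 0xfc);
}

/* Count metacharacters in a Shift JIS string. No shift state is
   needed, but each lead byte consumes the byte after it. */
long sjis_count(const char *s)
{
    long n = 0;

    while (*s != '\0') {
        if (sjis_lead((unsigned char)s[0])) {
            unsigned char c2 = (unsigned char)s[1];
            if (c2 < 0x40 || c2 > 0xfc)
                return -1;          /* trail byte out of range */
            s += 2;
        } else {
            ++s;                    /* any other byte stands alone */
        }
        ++n;
    }
    return n;
}
```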
Extended UNIX Code is a variation on the same theme. It was contrived to simplify converting many UNIX utilities to process Kanji text. Essentially, any character with its sign bit set (in the range [0x80, 0xff]) is part of a two-character sequence. No shift state need be retained. But you still need to keep track of where you are within a multiple-character sequence.
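Following the simplified rule just described (both bytes of a pair have the high bit set), a hypothetical EUC walker looks like this:

```c
/* Count metacharacters in an EUC string, under the simplified rule
   that a byte with the high bit set pairs with the next byte, which
   must also have its high bit set. Hypothetical helper. */
long euc_count(const char *s)
{
    long n = 0;

    while (*s != '\0') {
        if ((unsigned char)s[0] >= 0x80) {
            if ((unsigned char)s[1] < 0x80)
                return -1;          /* broken pair */
            s += 2;
        } else {
            ++s;                    /* plain ASCII byte */
        }
        ++n;
    }
    return n;
}
```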
(I have studiously avoided using the obvious terms "byte" and "multibyte" in this description. I have been equally careful to distinguish between "characters" and "metacharacters." C has long had the rule that a character occupies a single byte. C often lives on machines where a byte consists of 8 bits. That has led to endemic confusion between the notions of character, byte, and octet of bits, and the confusion will not soon disappear.)
The reasons for using multibyte sequences should be obvious. We live in a world of character streams. Disks, diskettes, parallel ports, and serial ports all traffic in sequences of 8-bit bytes. To ignore this world would be foolish. A large character set must be representable as sequences of bytes.
Yet there are equally obvious drawbacks to multibyte sequences. You can't manipulate individual characters without a lot of parsing. You can't paste strings together without careful thought about shift states (in the general case). At the very least, you may have to introduce many redundant shift sequences to be on the safe side.
If you want to manipulate characters inside a program, it's easiest if they're all the same size. An alternate representation for large character sets has just this property. A wide character is an integer large enough to represent distinct codes for all the characters in the set. It can be type char, short, int, or long. Or it can be one of the unsigned versions of these types. Standard C provides the type definition wchar_t for the wide-character type. Include either of the headers <stddef.h> or <stdlib.h> to define this type.
Just as there are several multibyte encodings for Kanji, there are also several wide-character encodings. The more popular ones are easily derived from one of the multibyte encodings. Essentially, you cut and paste bits from the two characters in the multibyte representation to make the wide-character code; see Figure 4.
MULTIBYTE: is 1[0x8C][0x8E].
WIDE CHAR: ['i']['s'][' ']['1'][0x8C8E]['.']
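That cut-and-paste amounts to a shift and a mask. These two hypothetical helpers show the idea for a 16-bit wide code:

```c
/* Paste the two bytes of a Kanji pair into one 16-bit wide code,
   lead byte in the high half. Hypothetical helpers for illustration. */
unsigned int pair_to_wide(unsigned char c1, unsigned char c2)
{
    return ((unsigned int)c1 << 8) | c2;
}

/* And cut it back apart for the multibyte form. */
void wide_to_pair(unsigned int wc, unsigned char *c1, unsigned char *c2)
{
    *c1 = (unsigned char)(wc >> 8);     /* 0x8C8E -> 0x8C */
    *c2 = (unsigned char)(wc & 0xff);   /*        -> 0x8E */
}
```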
Wide-character encodings tend to be a private matter for each implementation. Imagine trying to exchange data between two different systems by shipping wide characters. First you must make sure that both implementations of C use the same number of bits to represent wchar_t. Then you have to worry about whether the byte orders are the same. In the general case, you have to transform the code values in some way. It's much easier simply to read and write a common multibyte code. Then you don't much care about the various internal forms for wide characters.
C programmers do care somewhat about wide-character codes. Code value 0, for example, must be reserved for the null wide character. Otherwise, wide-character strings are a nuisance to manipulate. And you want 'a' to have the same numeric value when converted to a wide character. In fact, any value you can store in an unsigned character should have the same numeric value when converted to a wide character. Otherwise, all sorts of subtle but nuisancy problems arise. The C Standard endorses no particular wide-character encoding, but it does impose a few restrictions on acceptable code sets.
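Those few restrictions can be spot-checked on any implementation. A minimal sketch:

```c
#include <stddef.h>                 /* defines wchar_t */

/* A quick sanity check of the restrictions: code 0 must be the null
   wide character, and the basic characters must keep their numeric
   values when widened. */
int codes_agree(void)
{
    return L'\0' == 0
        && (wchar_t)'a' == L'a'
        && (wchar_t)'0' == L'0';
}
```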
The C Standard imposes similar constraints on multibyte encodings, by the way. Code value 0 always stands for the null character. It can never appear as part of a longer character sequence representing some metacharacter. If the encoding has shift states, then the initial shift state is somewhat constrained. All the basic C characters (the ones you need to express a C source file) stand for themselves. Put another way, 'a' stands for lowercase "a" in the initial shift state. It is never the first character of a longer character sequence. Again, these few constraints let the C programmer use proven techniques to manipulate even multibyte strings. (For a discussion of the implications of wide-character on C++, see the accompanying text box, "So What About C++?".)
We added as little as possible to C to support multibyte and wide characters. In a C source file you can write a multibyte sequence in one of the following ways: in a comment, in a string literal, in a character constant, or in a header name. You can also write wide-character constants, such as L'X', and wide string literals, such as L"abc"; the translator converts their multibyte contents to wide-character form for you.
You can also write multibyte sequences in all the formats used by the library print and scan functions. That lets you intermix multibyte literal text with converted values on output. It also lets you match such text on formatted input to a limited degree. The problem with input comes, as usual, with shift sequences. They let you specify the same sequence of metacharacters many different ways. But the scan functions still match literal text character by character. That can lead to all sorts of unpleasant surprises for innocent users.
The only other addition to the C Standard is a handful of library functions. The header <stdlib.h> now declares the following functions: mblen, mbtowc, and wctomb, which inspect and convert individual multibyte characters, plus mbstowcs and wcstombs, which convert entire strings between their multibyte and wide-character forms.
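Two of those conversion functions, mbstowcs and wcstombs, are enough for a round trip through the wide representation. This sketch is mine; buffer sizes are assumed ample and the locale is left at its default:

```c
#include <stdlib.h>   /* mbstowcs, wcstombs, wchar_t, size_t */

/* Round-trip a short multibyte string through wide-character form.
   Returns 0 on success, -1 on a conversion failure. */
int round_trip(const char *mb, char *back, size_t size)
{
    wchar_t wide[64];
    size_t n = mbstowcs(wide, mb, 64);          /* widen */

    if (n == (size_t)-1)
        return -1;                              /* malformed multibyte text */
    if (wcstombs(back, wide, size) == (size_t)-1)
        return -1;                              /* unrepresentable wide code */
    return 0;
}
```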
All sorts of additional functions would be useful for manipulating large character sets: analogs of the string-handling functions for wide-character strings, wide-character versions of formatted input and output, and classification and mapping functions akin to those declared in <ctype.h>.
We figured right. The Japanese have proposed an extensive addition to the Standard C library. It includes all the functions outlined above. It also describes some of the subtler semantic issues in greater detail. I've glossed over many such issues here because of space limitations.
The ANSI C Standard was approved in 1989. ISO C followed in 1990. Normally, a language standard remains stable for at least five years before it gets revisited. You'd think the Japanese proposal had missed the boat, but thanks to an accident of ISO politics, that's not the case. For a variety of reasons, the ISO C committee has the charter to produce a "normative addendum" to the C Standard. It wasn't hard to convince the committee to include the Japanese proposal as part of that addendum.
The net result is that the C Standard will likely be changed within the next year or so. Essentially, that change will incorporate the Japanese extensions to large-character support. The extensions are confined to the library, and they are fairly pure. That means that existing C programs should not change meaning when these new functions are added. Your biggest worry will be whether any existing external names collide with the names of added functions. And that, as we all know, is a perennial problem with progress.
Now you know the basics of large-character set support in Standard C. What should you do about it? As I mentioned at the outset, you probably don't have to do much of anything right now. What you do in the near future depends on your expectations for the code you write.
If you believe your code will never care about large character sets, you can generally ignore them. We tried to contrive the C Standard so the cost is low for those who don't use large character sets. Even implementors can get off cheap. A C compiler for a small microprocessor can, for example, define wchar_t as type char. The five conversion functions then become trivial. The print and scan functions don't have to change. Your code can stay lean and mean.
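To see just how cheap the trivial case is, here is a sketch of mbtowc for an implementation whose wchar_t is plain char. The tiny_ names are hypothetical stand-ins, not the library functions themselves:

```c
#include <stddef.h>   /* size_t, NULL */

/* When wchar_t is plain char, mbtowc collapses to little more than
   a one-byte copy. Sketch only; the real function is locale-aware. */
typedef char tiny_wchar;

int tiny_mbtowc(tiny_wchar *pwc, const char *s, size_t n)
{
    if (s == NULL)
        return 0;                /* no shift states to report */
    if (n == 0)
        return -1;               /* nothing to examine */
    if (pwc != NULL)
        *pwc = *s;
    return *s != '\0';           /* 0 for the null character, else 1 */
}
```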
For many applications, a wiser approach is to make it multibyte tolerant. Remember that a multibyte string often looks like any other null-terminated string. You wouldn't second guess the structure of a filename in a portable program, would you? Then learn to be just as tolerant of text strings you read and write. They might one day be multibyte strings. If you don't try to chop them up or paste other characters in the middle, they will probably survive passage through your code. Who knows, your application may one day start speaking Japanese or Arabic.
Some applications must learn to be multibyte aware. You use the multibyte parsing functions religiously when manipulating strings. You probably want to adapt to the locale preferred by each user. (My book The Standard C Library contains complete code for manipulating locales and large character sets with varied encodings.) You may even want to use arrays of wide characters for manipulating some text.
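The canonical multibyte-aware loop steps with mblen, so it never chops a character sequence in the middle. A sketch:

```c
#include <stdlib.h>   /* mblen, MB_CUR_MAX */

/* Count the metacharacters in a multibyte string by stepping one
   whole character sequence at a time. Hypothetical helper. */
long mb_count(const char *s)
{
    long n = 0;
    int len;

    mblen(NULL, 0);                       /* discard any old shift state */
    while ((len = mblen(s, MB_CUR_MAX)) > 0) {
        s += len;                         /* skip the whole sequence */
        ++n;
    }
    return len == 0 ? n : -1;             /* -1 on malformed text */
}
```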
A few applications will have to be wide-character oriented. These work exclusively with wide characters instead of conventional characters. They convert to and from multibyte characters only when communicating with the outside world. Such applications really benefit from the additions to Standard C proposed by the Japanese. (I understand that Windows NT fits this description.)
My personal belief is that conventional character strings will not soon go away. They meet most of our needs, even when dealing with large character sets. But I also see a growing use of wide characters in the years to come. Internationalization is a major driving force, but it is not the only one. Remember that large character sets have uses well beyond Japanese word processors. They can also be handy for representing characters of different point sizes or colors in a typesetting package. Or they can represent musical notes of different pitches and durations. I leave other uses to your imagination.
So What About C++?

C++ is still being standardized jointly by ISO WG21 and ANSI X3J16. Upward compatibility with Standard C is a clearly stated goal. Thus, all the current support for large character sets has already been adopted as part of C++.
What to do with the proposed Japanese extensions is another matter. These include literally dozens of new functions to manipulate wide-character strings. All are direct analogs to the old C standbys for manipulating conventional character strings. To name just two examples, strlen begets wcslen and sprintf begets wcsprintf.
C++ provides function-name overloading. It is considered much better to overload one name than to introduce a trivial variant of that name. Thus, C++ may very well overload strlen for both character and wide-character arguments. Do the same for all those dozens of functions and you can see a real improvement.
At least one technical problem remains to be solved for this approach. In C, the wide-character type wchar_t is simply a synonym for some existing integer type. That might very well be char or int. So the two declarations
size_t strlen(const char *);
size_t strlen(const wchar_t *);
may be indistinguishable on some implementations. This does not make for portable code.
C++ must find some way to distinguish wchar_t from other integer types with the same representation. It must do so without severely compromising upward migration of C code. Several approaches can work, but the C++ standards committee has yet to choose one.
It is an open issue whether C++ includes the Japanese proposal as is. Even if it does, however, function overloading will almost certainly be provided as well.
--P.J.P.
Copyright © 1992, Dr. Dobb's Journal