P.J. Plauger, whose most recent book is The Standard C Library (Prentice Hall, 1992), is a member of ISO JTC1/SC22/WG14, the committee that standardizes C. He can be contacted at uunet!plauger!pjp.
C is the only standardized programming language that supports large character sets. That will not always be true. The Japanese have made their position clear to ISO, the international organization that standardizes programming languages. Several years ago, they announced their intention to veto any future language standards that do not contain similar support. Wisely, ISO passed a resolution endorsing the Japanese position.
C could have become the last standardized programming language that did not support large character sets. The Japanese were willing to exempt the C Standard because, at the time, it was very near completion. Many of us had already put five or more years into standardizing C. We were tired and ready to quit. But we didn't want to be the last of an old breed, not after all that work. Rather, we chose to do the extra work that made us the first of the new breed.
To do so, we had to bend our self-imposed rules a bit. Standard C is highly compatible with the C of Kernighan & Ritchie. We resisted rather well the numerous temptations to "fix" the language--particularly where such fixes would require existing C code to change. We did add a number of features. Some of the additions are changes to the language proper. Most are pure add-ons, such as new library functions. All of the additions, however new they may appear to many, were based on some form of prior art. Even function prototypes and type qualifiers (such as const) were derived from C++ and other dialects of C.
It was harder to find precedents for manipulating large character sets, at least the way we chose to do so. True, several companies have provided Kanji support libraries for a number of years. A few have permitted limited inclusion of Kanji characters within C source code itself. Nobody had chosen to be as ambitious as we felt we had to be. Like it or not, we had to be inventive.
We were equally inventive in adding "locales." That's the machinery we added to make the Europeans happy. They were just ahead of the Japanese in requesting that C be made more international. A locale summarizes many of the conventions of a given culture. Francophones want their dates spelled out in French. Accountants want negative numbers to print with a trailing "DB" instead of a leading minus. Dictionary writers want words to sort in a funny way. The locale machinery added to C is intended to support an open-ended set of such cultural conventions by defining and mixing locales.
If you are not conversant with locales and large-character support in C, don't fret. The material is so new that many experienced C programmers barely understand the basic concepts. More important, most C programmers don't need to care. At least not yet. The push for internationalization has just begun, and it may be years before your corner of the world will feel the impact.
But don't feel you can ignore this topic indefinitely. The international marketplace for software is growing fast. Whoever pays your salary will soon care very much about meeting that growing demand with an economy of effort. Standard C offers that economy better than any other programming language in use today. It behooves all professional programmers to understand the issues involved in this new field of "internationalization."
My focus in this article is primarily on support for large character sets. If you want to learn more about locales as well, see my book, The Standard C Library (Prentice Hall, 1992). It discusses the entire C library, but pays particular attention to features added for internationalization.
When Europeans talk about large character sets, they usually mean sets with extra language-specific characters. The 95 graphics defined in the U.S. form of ISO 646 (also known as ASCII) are not enough. Practically every European alphabet defines additional characters or accented versions of the English characters. In fact, I am told that only three languages in the world can get by with just the 26 letters of the common subset of ISO 646--English, Hawaiian, and Swahili.
Still, these extra characters number only in the dozens. You can throw in every known accented character in the European alphabets, the funny extra characters, Cyrillic, and Greek--and still fit all the graphics comfortably in a 256-character set. So the European notion of a large character set is one that uses all 8 bits of a byte. Forget any comfortable C-ish notions that all printable characters have positive values.
The Japanese face an entirely different problem. They inherited tens of thousands of Kanji characters from the Chinese. They also use several phonetic alphabets--Hiragana, Katakana, and Romaji (the Western alphabet). But they refuse to give up the compactness and delightful ambiguity of Kanji. They are not alone. The Chinese, Koreans, and Arabs likewise have huge alphabets that form an important part of their respective cultures. Nobody wants to stop using something that works well just because it's inconvenient for American software to process.
(In fairness to the Japanese, I should make an important observation here. They do not insist that new programming language standards support Kanji. They want them to support all large character sets, from all cultures around the world. That also helps with an internal political/technical problem in Japan. Several coding schemes are in common use for Kanji, just as both ASCII and EBCDIC are used in the U.S. The C Standard has always been general enough to accommodate both of the latter. It now also accommodates all the known ways to encode Kanji. And it allows for a variety of ways to encode the other large character sets of the world.)
Over the years, Japanese programmers have developed two distinct ways to augment text-processing software for large character sets. In the language of the C Standard, these are called "multibyte characters" and "wide characters." We included both in C because each has its uses. Naturally, that means we also had to include ways to convert between conventional, multibyte, and wide characters.
An old trick for expanding a character set is to give each code multiple meanings. The old Teletype Model 37, for example, could print both English and Greek characters. Send the terminal a "shift out" code (SO) and it began speaking Greek. A "b" printed as a beta, as I recall. Subsequent characters also printed funny until you sent a "shift in" code (SI). The terminal then reverted to more customary behavior; see Figure 1.
TEXT:    A [SO]b[SI]-ray is an electron.
DISPLAY: A beta-ray is an electron.
You get more mileage out of each character code this way, but at a price. How you interpret each code depends on what has gone before. You might assume, for example, that each sequence of characters begins in an "initial shift state." For our Model 37, that would be printing English characters. Most characters that follow are interpreted in this context to determine the "metacharacter" you really mean to designate. Some characters simply alter the current shift state. They specify no metacharacter at all, at least not by themselves. The Model 37 code may have (almost) doubled the number of characters you can represent, but it must maintain one bit of state information to determine each metacharacter.
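To make that bookkeeping concrete, here is a sketch of such a decoder in C. The function name is my own invention, and the SO/SI codes shown are the usual ASCII control values, not necessarily what any particular terminal used; buffers are assumed ample:

```c
#include <stddef.h>

#define SO '\016'   /* shift out: switch to the alternate (Greek) set */
#define SI '\017'   /* shift in: return to the initial shift state    */

/* Decode a byte stream into metacharacters. For each printing byte,
   record the byte itself and the shift state (0 initial, 1 shifted)
   that gives it meaning. Hypothetical helper for illustration only. */
size_t decode_shift(const char *src, char *out, int *state_of)
{
    int state = 0;                  /* the one bit of state memory */
    size_t n = 0;

    for (; *src != '\0'; ++src) {
        if (*src == SO) {
            state = 1;              /* alters state, no metacharacter */
        } else if (*src == SI) {
            state = 0;
        } else {
            out[n] = *src;
            state_of[n] = state;
            ++n;
        }
    }
    out[n] = '\0';
    return n;
}
```

Note that the shift codes themselves produce no output; they only change how the bytes that follow are interpreted.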
The Japanese JIS code takes this approach a step farther. In the initial shift state, each character defines a single metacharacter. ASCII is ASCII. You shift out to Kanji with the three-character sequence \33$B (ESC, dollar sign, capital B). In this state, each subsequent pair of characters determines a single metacharacter. Both the first and second characters of a Kanji pair must be in the range [0x21, 0x7e]. You shift in to ASCII with the three-character sequence \33(B; see Figure 2.
Some simple arithmetic tells you that you can specify nearly 10,000 distinct metacharacters with JIS. That's nowhere near all the Kanji characters--only the more popular ones are included. Still, it's worlds better than the mere 256 codes supported by a single 8-bit character. The price once more is added complexity. Parsing a JIS string takes work. It requires state memory just like the Model 37 code. And opportunities abound for making malformed strings.
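A sketch of that parsing work, as a hypothetical C helper that counts metacharacters and reports malformed strings:

```c
/* Count the metacharacters in a JIS-encoded string; return -1 for a
   malformed sequence. A hypothetical helper, not a library function. */
long jis_count(const char *s)
{
    int kanji = 0;                      /* one bit of shift state */
    long n = 0;

    while (*s != '\0') {
        if (s[0] == '\033') {           /* escapes alter state only */
            if (s[1] == '$' && s[2] == 'B') {
                kanji = 1;              /* shift out to Kanji */
                s += 3;
            } else if (s[1] == '(' && s[2] == 'B') {
                kanji = 0;              /* shift in to ASCII */
                s += 3;
            } else {
                return -1;              /* unrecognized escape */
            }
        } else if (kanji) {             /* pairs in [0x21, 0x7e] */
            unsigned char c1 = (unsigned char)s[0];
            unsigned char c2 = (unsigned char)s[1];
            if (c1 < 0x21 || c1 > 0x7e || c2 < 0x21 || c2 > 0x7e)
                return -1;
            s += 2;
            ++n;
        } else {                        /* ASCII is ASCII */
            ++s;
            ++n;
        }
    }
    return n;
}
```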
It is possible to eliminate the need for state memory. The Japanese Shift JIS code sets aside certain character codes to signal the start of a two-character sequence. A character in the range [0x81, 0x9f] or [0xe0, 0xfc] must be followed by a character in the range [0x40, 0xfc]. Together, these define a single Kanji metacharacter. Any other first character defines the metacharacter all by itself. (Again, ASCII is ASCII). See Figure 3.
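The stateless test for a Shift JIS lead byte is easy to capture in C. These helpers are illustrative, not part of any library:

```c
/* True if byte c begins a two-byte Shift JIS sequence. */
int sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9f) || (c >= 0xe0 && c <= 0xfc);
}

/* Count metacharacters in a Shift JIS string. No shift state is
   needed, but each lead byte consumes the byte after it. */
long sjis_count(const char *s)
{
    long n = 0;

    while (*s != '\0') {
        if (sjis_lead((unsigned char)s[0])) {
            unsigned char c2 = (unsigned char)s[1];
            if (c2 < 0x40 || c2 > 0xfc)
                return -1;          /* trail byte out of range */
            s += 2;
        } else {
            ++s;                    /* any other byte stands alone */
        }
        ++n;
    }
    return n;
}
```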
Extended UNIX Code is a variation on the same theme. It was contrived to simplify converting many UNIX utilities to process Kanji text. Essentially, any character with its sign bit set (in the range [0x80, 0xff]) is part of a two-character sequence. No shift state need be retained. But you still need to keep track of where you are within a multiple-character sequence.
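Following the simplified rule just described (both bytes of a pair have the high bit set), a hypothetical EUC walker looks like this:

```c
/* Count metacharacters in an EUC string, under the simplified rule
   that a byte with the high bit set pairs with the next byte, which
   must also have its high bit set. Hypothetical helper. */
long euc_count(const char *s)
{
    long n = 0;

    while (*s != '\0') {
        if ((unsigned char)s[0] >= 0x80) {
            if ((unsigned char)s[1] < 0x80)
                return -1;          /* broken pair */
            s += 2;
        } else {
            ++s;                    /* plain ASCII byte */
        }
        ++n;
    }
    return n;
}
```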
(I have studiously avoided using the obvious terms "byte" and "multibyte" in this description. I have been equally careful to distinguish between "characters" and "metacharacters." C has long had the rule that a character occupies a single byte. C often lives on machines where a byte consists of 8 bits. That has led to endemic confusion between the notions of character, byte, and octet of bits, and the confusion will not soon disappear.)
The reasons for using multibyte sequences should be obvious. We live in a world of character streams. Disks, diskettes, parallel ports, and serial ports all traffic in sequences of 8-bit bytes. To ignore this world would be foolish. A large character set must be representable as sequences of bytes.
Yet there are equally obvious drawbacks to multibyte sequences. You can't manipulate individual characters without a lot of parsing. You can't paste strings together without careful thought about shift states (in the general case). At the very least, you may have to introduce many redundant shift sequences to be on the safe side.
If you want to manipulate characters inside a program, it's easiest if they're all the same size. An alternate representation for large character sets has just this property. A wide character is an integer large enough to represent distinct codes for all the characters in the set. It can be type char, short, int, or long. Or it can be one of the unsigned versions of these types. Standard C provides the type definition wchar_t for the wide-character type. Include either of the headers <stddef.h> or <stdlib.h> to define this type.
Just as there are several multibyte encodings for Kanji, there are also several wide-character encodings. The more popular ones are easily derived from one of the multibyte encodings. Essentially, you cut and paste bits from the two characters in the multibyte representation to make the wide-character code; see Figure 4.
MULTIBYTE: is 1[0x8C][0x8E].
WIDE CHAR: ['i']['s'][' ']['1'][0x8C8E]['.']
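That cut-and-paste amounts to a shift and a mask. These two hypothetical helpers show the idea for a 16-bit wide code:

```c
/* Paste the two bytes of a Kanji pair into one 16-bit wide code,
   lead byte in the high half. Hypothetical helpers for illustration. */
unsigned int pair_to_wide(unsigned char c1, unsigned char c2)
{
    return ((unsigned int)c1 << 8) | c2;
}

/* And cut it back apart for the multibyte form. */
void wide_to_pair(unsigned int wc, unsigned char *c1, unsigned char *c2)
{
    *c1 = (unsigned char)(wc >> 8);     /* 0x8C8E -> 0x8C */
    *c2 = (unsigned char)(wc & 0xff);   /*        -> 0x8E */
}
```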
Wide-character encodings tend to be a private matter for each implementation. Imagine trying to exchange data between two different systems by shipping wide characters. First you must make sure that both implementations of C use the same number of bits to represent wchar_t. Then you have to worry about whether the byte orders are the same. In the general case, you have to transform the code values in some way. It's much easier simply to read and write a common multibyte code. Then you don't much care about the various internal forms for wide characters.
C programmers do care somewhat about wide-character codes. Code value 0, for example, must be reserved for the null wide character. Otherwise, wide-character strings are a nuisance to manipulate. And you want 'a' to have the same numeric value when converted to a wide character. In fact, any value you can store in an unsigned character should have the same numeric value when converted to a wide character. Otherwise, all sorts of subtle but nuisancy problems arise. The C Standard endorses no particular wide-character encoding, but it does impose a few restrictions on acceptable code sets.
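Those few restrictions can be spot-checked on any implementation. A minimal sketch:

```c
#include <stddef.h>                 /* defines wchar_t */

/* A quick sanity check of the restrictions: code 0 must be the null
   wide character, and the basic characters must keep their numeric
   values when widened. */
int codes_agree(void)
{
    return L'\0' == 0
        && (wchar_t)'a' == L'a'
        && (wchar_t)'0' == L'0';
}
```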
The C Standard imposes similar constraints on multibyte encodings, by the way. Code value 0 always stands for the null character. It can never appear as part of a longer character sequence representing some metacharacter. If the encoding has shift states, then the initial shift state is somewhat constrained. All the basic C characters (the ones you need to express a C source file) stand for themselves. Put another way, 'a' stands for lowercase "a" in the initial shift state. It is never the first character of a longer character sequence. Again, these few constraints let the C programmer use proven techniques to manipulate even multibyte strings. (For a discussion of the implications of wide-character on C++, see the accompanying text box, "So What About C++?".)
We added as little as possible to C to support multibyte and wide characters. In a C source file you can write a multibyte sequence in one of the following ways: in a comment, in a string literal, in a character constant, or in a header name. You can also write wide-character constants, such as L'X', and wide string literals, such as L"abc"; the translator converts their multibyte contents to wide-character form for you.
You can also write multibyte sequences in all the formats used by the library print and scan functions. That lets you intermix multibyte literal text with converted values on output. It also lets you match such text on formatted input to a limited degree. The problem with input comes, as usual, with shift sequences. They let you specify the same sequence of metacharacters many different ways. But the scan functions still match literal text character by character. That can lead to all sorts of unpleasant surprises for innocent users.
The only other addition to the C Standard is a handful of library functions. The header <stdlib.h> now declares the following functions: mblen, mbtowc, and wctomb, which inspect and convert individual multibyte characters, plus mbstowcs and wcstombs, which convert entire strings between their multibyte and wide-character forms.
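Two of those conversion functions, mbstowcs and wcstombs, are enough for a round trip through the wide representation. This sketch is mine; buffer sizes are assumed ample and the locale is left at its default:

```c
#include <stdlib.h>   /* mbstowcs, wcstombs, wchar_t, size_t */

/* Round-trip a short multibyte string through wide-character form.
   Returns 0 on success, -1 on a conversion failure. */
int round_trip(const char *mb, char *back, size_t size)
{
    wchar_t wide[64];
    size_t n = mbstowcs(wide, mb, 64);          /* widen */

    if (n == (size_t)-1)
        return -1;                              /* malformed multibyte text */
    if (wcstombs(back, wide, size) == (size_t)-1)
        return -1;                              /* unrepresentable wide code */
    return 0;
}
```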
All sorts of additional functions would be useful for manipulating large character sets: analogs of the string-handling functions for wide-character strings, wide-character versions of formatted input and output, and classification and mapping functions akin to those declared in <ctype.h>.
We figured right. The Japanese have proposed an extensive addition to the Standard C library. It includes all the functions outlined above. It also describes some of the subtler semantic issues in greater detail. I've glossed over many such issues here because of space limitations.
The ANSI C Standard was approved in 1989. ISO C followed in 1990. Normally, a language standard remains stable for at least five years before it gets revisited. You'd think the Japanese proposal had missed the boat, but thanks to an accident of ISO politics, that's not the case. For a variety of reasons, the ISO C committee has the charter to produce a "normative addendum" to the C Standard. It wasn't hard to convince the committee to include the Japanese proposal as part of that addendum.
The net result is that the C Standard will likely be changed within the next year or so. Essentially, that change will incorporate the Japanese extensions to large-character support. The extensions are confined to the library, and they are fairly pure. That means that existing C programs should not change meaning when these new functions are added. Your biggest worry will be whether any existing external names collide with the names of added functions. And that, as we all know, is a perennial problem with progress.
Now you know the basics of large-character set support in Standard C. What should you do about it? As I mentioned at the outset, you probably don't have to do much of anything right now. What you do in the near future depends on your expectations for the code you write.
If you believe your code will never care about large character sets, you can generally ignore them. We tried to contrive the C Standard so the cost is low for those who don't use large character sets. Even implementors can get off cheap. A C compiler for a small microprocessor can, for example, define wchar_t as type char. The five conversion functions then become trivial. The print and scan functions don't have to change. Your code can stay lean and mean.
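To see just how cheap the trivial case is, here is a sketch of mbtowc for an implementation whose wchar_t is plain char. The tiny_ names are hypothetical stand-ins, not the library functions themselves:

```c
#include <stddef.h>   /* size_t, NULL */

/* When wchar_t is plain char, mbtowc collapses to little more than
   a one-byte copy. Sketch only; the real function is locale-aware. */
typedef char tiny_wchar;

int tiny_mbtowc(tiny_wchar *pwc, const char *s, size_t n)
{
    if (s == NULL)
        return 0;                /* no shift states to report */
    if (n == 0)
        return -1;               /* nothing to examine */
    if (pwc != NULL)
        *pwc = *s;
    return *s != '\0';           /* 0 for the null character, else 1 */
}
```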
For many applications, a wiser approach is to make it multibyte tolerant. Remember that a multibyte string often looks like any other null-terminated string. You wouldn't second guess the structure of a filename in a portable program, would you? Then learn to be just as tolerant of text strings you read and write. They might one day be multibyte strings. If you don't try to chop them up or paste other characters in the middle, they will probably survive passage through your code. Who knows, your application may one day start speaking Japanese or Arabic.
Some applications must learn to be multibyte aware. You use the multibyte parsing functions religiously when manipulating strings. You probably want to adapt to the locale preferred by each user. (My book The Standard C Library contains complete code for manipulating locales and large character sets with varied encodings.) You may even want to use arrays of wide characters for manipulating some text.
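The canonical multibyte-aware loop steps with mblen, so it never chops a character sequence in the middle. A sketch:

```c
#include <stdlib.h>   /* mblen, MB_CUR_MAX */

/* Count the metacharacters in a multibyte string by stepping one
   whole character sequence at a time. Hypothetical helper. */
long mb_count(const char *s)
{
    long n = 0;
    int len;

    mblen(NULL, 0);                       /* discard any old shift state */
    while ((len = mblen(s, MB_CUR_MAX)) > 0) {
        s += len;                         /* skip the whole sequence */
        ++n;
    }
    return len == 0 ? n : -1;             /* -1 on malformed text */
}
```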
A few applications will have to be wide-character oriented. These work exclusively with wide characters instead of conventional characters. They convert to and from multibyte characters only when communicating with the outside world. Such applications really benefit from the additions to Standard C proposed by the Japanese. (I understand that Windows NT fits this description.)
My personal belief is that conventional character strings will not soon go away. They meet most of our needs, even when dealing with large character sets. But I also see a growing use of wide characters in the years to come. Internationalization is a major driving force, but it is not the only one. Remember that large character sets have uses well beyond Japanese word processors. They can also be handy for representing characters of different point sizes or colors in a typesetting package. Or they can represent musical notes of different pitches and durations. I leave other uses to your imagination.
So What About C++?

C++ is still being standardized jointly by ISO WG21 and ANSI X3J16. Upward compatibility with Standard C is a clearly stated goal. Thus, all the current support for large character sets has already been adopted as part of C++.
What to do with the proposed Japanese extensions is another matter. These include literally dozens of new functions to manipulate wide-character strings. All are direct analogs to the old C standbys for manipulating conventional character strings. To name just two examples, strlen begets wcslen and sprintf begets wcsprintf.
C++ provides function-name overloading. It is considered much better to overload one name than to introduce a trivial variant of that name. Thus, C++ may very well overload strlen for both character and wide-character arguments. Do the same for all those dozens of functions and you can see a real improvement.
At least one technical problem remains to be solved for this approach. In C, the wide-character type wchar_t is simply a synonym for some existing integer type. That might very well be char or int. So the two declarations
size_t strlen(const char *);
size_t strlen(const wchar_t *);
may be indistinguishable on some implementations. This does not make for portable code.
C++ must find some way to distinguish wchar_t from other integer types with the same representation. It must do so without severely compromising upward migration of C code. Several approaches can work, but the C++ standards committee has yet to choose one.
It is an open issue whether C++ includes the Japanese proposal as is. Even if it does, however, function overloading will almost certainly be provided as well.
--P.J.P.
Copyright © 1992, Dr. Dobb's Journal