October 1995/Internationalization: A Primer, Part 2

Features

Internationalization: A Primer, Part 2

Rex Jaeschke

Rex Jaeschke is an independent computer consultant, author, and seminar leader. He is chair of the ANSI C committee X3J11. Readers may contact Rex at 2051 Swans Neck Way, Reston, VA 22091 or via UUCP at rex@aussie.com.
This is the second in a two-part series covering software internationalization support available in Standard C.

Multibyte Characters
One of the earliest, and probably the biggest, markets for multibyte character support is in Japan. Therefore, examples in this section will be based on encoding schemes for Japanese text processing. An understanding of these schemes makes it easier to understand other schemes.
In Japan, a single text message can be composed of characters from four different writing systems: Kanji, hiragana, katakana, and Roman. Kanji number in the tens of thousands and are represented by pictures. Hiragana and katakana (collectively known as kana) are syllabaries each containing about 80 sounds which are also represented by pictures. The Roman characters include some 96 letters, digits, and punctuation marks. One-byte control characters may also be present. I/O of non-Roman characters is a complicated issue which is outside the scope of this discussion; we're only interested in internal representation.
The number of bytes required to represent different members of the same execution character set can vary. For example, some members might require one byte, some two, and others three or more. The representation of a multibyte character is determined by its encoding scheme, and more than one encoding scheme may be available for a given multibyte character set. The maximum number of bytes needed to represent a multibyte character, including shift state information, in any supported locale is available via the macro MB_LEN_MAX, defined in <limits.h>. The maximum number of bytes needed to represent a multibyte character, including shift state information, in the current locale is available via the macro MB_CUR_MAX, defined in <stdlib.h>.
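The difference between the two macros can be sketched as follows. MB_LEN_MAX is a compile-time constant and can size an array, while MB_CUR_MAX depends on the current locale and is best queried at run time. The names one_char, fits_in_static_buffer, and current_max are hypothetical helpers introduced only for this illustration.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* MB_CUR_MAX */

/* MB_LEN_MAX is a constant expression, so it can size a static array
   large enough for one multibyte character in any supported locale. */
static char one_char[MB_LEN_MAX];

/* MB_CUR_MAX reflects the current locale and need not be a constant,
   so it is compared at run time rather than used as an array bound. */
int fits_in_static_buffer(void)
{
    return (size_t)MB_CUR_MAX <= sizeof one_char;
}

/* Report the current locale's maximum character width in bytes. */
size_t current_max(void)
{
    return (size_t)MB_CUR_MAX;
}
```

In the default "C" locale, current_max() is typically 1. After a call to setlocale that selects a multibyte locale, it may be larger, but it should never exceed MB_LEN_MAX.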
Why have different widths for characters in the same character set? Why not simply make all characters the same width? These are important questions. A simple answer is, "To save space." If a text file contains many characters that can be represented each in a single byte, that file will be smaller than if all characters were stored as two- or four-byte characters. A common example of this would be program source files which, for the most part, contain Roman characters. Once in memory, however, it is easier to manipulate characters if they all have the same width. Such characters are known as wide characters and will be discussed in the next section.
There is no universally recognized multibyte encoding scheme for Japanese. However, three schemes are prominent: JIS, Shift-JIS, and EUC. (JIS is an abbreviation for "Japanese Industrial Standard" while EUC stands for "Extended Unix Code.") JIS is the simplest scheme. It uses seven bits of each byte and requires escape sequences to shift between one- and two-byte modes. Shift-JIS and EUC use eight bits of each byte, and the value of the first byte is used to determine whether a second byte follows.

JIS Encoding
JIS supports a number of standard Japanese character sets, some requiring one byte, others two. Table 1 shows some of these character sets and the shift sequences needed to establish them within a JIS-encoded file or string. The first four character sets are two-byte while the last two sets are one-byte. <ESC> represents the 'escape' character, having value 0x1B.
Consider the following character sequence:
The Kanji <ESC>$B...<ESC>(B spell 'Tokyo.'
The start of a sequence is in some initial shift state. Let's assume it is ASCII. Therefore, characters are assumed to be one-byte ASCII codes until the shift sequence <ESC>$B is seen. This switches us to two-byte mode as defined by JIS X 0208-1983. The shift sequence <ESC>(B switches us to ASCII mode which also happens to be the initial shift state. The sequence used in this example might well be a string literal in a source file, used in a call to printf or strcpy, for example.
Encoding schemes that use shift sequences are not very efficient for internal storage or processing. For example, the shift sequence for JIS X 0208-1990 requires six bytes. Frequent switching between character sets in a file or string could result in the number of bytes used in shift sequences exceeding that used to represent actual data!
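The overhead described above can be measured with a small scanner. This is a minimal sketch, not a full JIS decoder: it recognizes only the two shift sequences quoted in this article, <ESC>$B and <ESC>(B, and assumes the string starts in ASCII mode. The names jis_tally and jis_scan are hypothetical.

```c
#include <stddef.h>

/* Hypothetical tally of how the bytes of a JIS-encoded string are spent. */
struct jis_tally {
    size_t chars;        /* characters found */
    size_t data_bytes;   /* bytes that encode characters */
    size_t shift_bytes;  /* bytes spent on shift sequences */
};

/* Walk a null-terminated string, tracking the current shift state. */
struct jis_tally jis_scan(const char *s)
{
    struct jis_tally t = {0, 0, 0};
    int width = 1;   /* assumed initial shift state: one-byte ASCII */

    while (*s != '\0') {
        if (s[0] == 0x1B && s[1] == '$' && s[2] == 'B') {
            width = 2;               /* shift to two-byte mode */
            t.shift_bytes += 3;
            s += 3;
        } else if (s[0] == 0x1B && s[1] == '(' && s[2] == 'B') {
            width = 1;               /* shift back to ASCII */
            t.shift_bytes += 3;
            s += 3;
        } else if (width == 2 && s[1] == '\0') {
            break;                   /* truncated two-byte character */
        } else {
            t.chars++;
            t.data_bytes += (size_t)width;
            s += width;
        }
    }
    return t;
}
```

Comparing shift_bytes with data_bytes for a given string shows how much of it is shift overhead rather than character data.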
Consider the following character sequences:
Hello <ESC>$B<ESC>(Bthere.
Hello <ESC>(Bthere.
In the first case, we see one shift sequence immediately followed by another, making the first shift sequence redundant. In the second case, we see a shift to the state that is already current, again making the shift sequence redundant. Of course, there can be an arbitrary number of redundant shift sequences in a file or string, all of which take up space but contribute nothing useful.
Encodings containing shift sequences are used primarily as an external code, one that allows information interchange between a program and the outside world. For example, many communications paths are limited to seven bits. Therefore, sending electronic mail containing multibyte Japanese characters over such paths requires an encoding scheme that uses only seven bits of each byte.

Shift-JIS Encoding
Despite its name, Shift-JIS has nothing to do with shift sequences and states. Instead, each byte is inspected to see if it is a one-byte character or the first byte of a two-byte character. This is determined by reserving a set of byte values for certain purposes. For example:

Any byte having a value in the range 0x21-7E is assumed to be a one-byte ASCII/JIS-Roman character.

Any byte having a value in the range 0xA1-DF is assumed to be a one-byte half-width katakana character. (This character set will not be discussed further.)

Any byte having a value in the range 0x81-9F or 0xE0-EF is assumed to be the first byte of a two-byte character from the set JIS X 0208-1990. The second byte must have a value in the range 0x40-7E or 0x80-FC.
While this encoding is more compact than JIS, it cannot represent as many characters. In fact, Shift-JIS cannot represent any characters in the supplemental character set JIS X 0212-1990, which contains more than 6,000 characters. Shift-JIS was invented by Microsoft, and since Microsoft has been a significant player in the field of internationalization, this encoding scheme has become very popular.
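The Shift-JIS byte ranges listed above translate directly into a lead-byte classifier. The sketch below uses only the ranges given in this article. The function names sjis_char_len and sjis_valid_trail are hypothetical, and bytes outside the listed ranges (space and control characters, for example) are simply reported as not covered.

```c
/* Length in bytes of a Shift-JIS character whose first byte is `lead`,
   or 0 if the byte falls outside the ranges listed in the article. */
int sjis_char_len(unsigned char lead)
{
    if (lead >= 0x21 && lead <= 0x7E)
        return 1;                    /* ASCII/JIS-Roman */
    if (lead >= 0xA1 && lead <= 0xDF)
        return 1;                    /* half-width katakana */
    if ((lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xEF))
        return 2;                    /* first byte of a JIS X 0208 character */
    return 0;
}

/* Nonzero if `trail` is a legal second byte of a two-byte character. */
int sjis_valid_trail(unsigned char trail)
{
    return (trail >= 0x40 && trail <= 0x7E) ||
           (trail >= 0x80 && trail <= 0xFC);
}
```

The trail-byte range overlaps the one-byte ASCII range. A byte such as 0x5C can therefore be either a character in its own right or the second half of a two-byte character, depending on what precedes it.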

EUC Encoding
This encoding is much more extensible than Shift-JIS, since it allows for characters containing more than two bytes. The encoding scheme used for Japanese characters is as follows:

Any byte having a value in the range 0x21-7E is assumed to be a one-byte ASCII/JIS-Roman character.

Any byte having a value in the range 0xA1-FE is assumed to be the first byte of a two-byte character from the set JIS X 0208-1990. The second byte must also have a value in that range.

Any byte having the value 0x8E is assumed to be followed by a second byte with a value in the range 0xA1-DF, which represents a half-width katakana character.

Any byte having the value 0x8F is assumed to be followed by two more bytes with values in the range 0xA1-FE, which together represent a character from the set JIS X 0212-1990.
The last two cases involve a prefix byte with value 0x8E and 0x8F, respectively. These bytes are somewhat like shift sequences in that they introduce a change in subsequent byte interpretation. However, unlike the shift sequences in JIS which introduce a sequence, these prefix bytes must precede every multibyte character, not just the first in a sequence. (They do not "stick.") As such, each multibyte character encoded in this manner stands alone, and EUC is not considered to involve shift states.
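Because every EUC character stands alone, its length can be determined from the bytes at hand with no saved state. The sketch below follows the four rules listed above. The name euc_jp_char_len is hypothetical, and bytes the list does not cover (space and control characters, for example) are reported as 0.

```c
/* Length in bytes (1, 2, or 3) of the EUC-encoded Japanese character at p,
   or 0 if the bytes do not match any rule listed in the article.
   p must point into a null-terminated string. */
int euc_jp_char_len(const unsigned char *p)
{
    if (p[0] >= 0x21 && p[0] <= 0x7E)
        return 1;                                  /* ASCII/JIS-Roman */
    if (p[0] >= 0xA1 && p[0] <= 0xFE)
        return (p[1] >= 0xA1 && p[1] <= 0xFE) ? 2 : 0;   /* JIS X 0208 */
    if (p[0] == 0x8E)
        return (p[1] >= 0xA1 && p[1] <= 0xDF) ? 2 : 0;   /* half-width katakana */
    if (p[0] == 0x8F)
        return (p[1] >= 0xA1 && p[1] <= 0xFE &&
                p[2] >= 0xA1 && p[2] <= 0xFE) ? 3 : 0;   /* JIS X 0212 */
    return 0;
}
```

A terminating null byte fails every range test, so the short-circuit evaluation never reads past the end of a properly terminated string.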

Shift States
For encodings containing shift sequences, the functions mblen, mbtowc, and wctomb are required to maintain information about the current shift state. However, if the LC_CTYPE category is changed for a given locale, the current shift state can become indeterminate. The multibyte functions can be put into their initial shift state by a call with a null char * argument. For example:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int i;

/*1*/   i = mblen(NULL, 0);
/*2*/   printf("There %s a state-dependent encoding\n",
            (i != 0) ? "is" : "is not");
    return 0;
}
In case 1, we set mblen to its initial shift state by passing it a null pointer. The value returned is nonzero if the encoding scheme currently in use uses shift sequences, zero otherwise. The other two multibyte character functions, mbtowc and wctomb, behave likewise when given a null pointer argument. The string functions have no such reset call: mbstowcs assumes that the multibyte string it is to convert starts in the initial shift state, and the multibyte string that wcstombs creates begins in the initial shift state.

Wide Characters
As stated earlier, multibyte encoding provides an efficient way for moving characters around outside programs and between programs and the outside world. Once inside a program, it is easier and more efficient to deal with characters that have the same size and format. Wide characters serve this purpose.
Consider a filename string containing one or more directory paths, each separated by a slash. To find the actual filename, in a single-byte character string you can start at the back of the string. When you find the first separator you know where the filename proper starts. However, if the string contains multibyte characters, you must scan from the front so you don't inspect bytes out of context. On the other hand, if the string contains wide characters, you can still scan it backwards.
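For the multibyte case, the forward scan might look like the following sketch. It steps through the path one multibyte character at a time with mblen and remembers the position just past the most recent slash. The name mb_basename is hypothetical, and the separator is assumed to be a one-byte '/'.

```c
#include <stdlib.h>   /* mblen, MB_CUR_MAX */

/* Return a pointer to the filename proper within a multibyte path.
   Scans from the front so that no byte is inspected out of context. */
const char *mb_basename(const char *path)
{
    const char *name = path;
    int len;

    mblen(NULL, 0);   /* put mblen into its initial shift state */

    /* mblen returns 0 at the terminating null and -1 on an invalid
       character; either one ends the scan. */
    while ((len = mblen(path, MB_CUR_MAX)) > 0) {
        if (len == 1 && *path == '/')
            name = path + 1;     /* filename starts after this separator */
        path += len;
    }
    return name;
}
```

A byte equal to '/' that is part of a longer multibyte character is skipped, because mblen reports a length greater than one for that character.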
A number of wide character standards exist or are emerging. The most well-known ones that support characters larger than eight bits are ISO 10646.UCS-2, ISO 10646.UCS-4, and Unicode. The UCS-2 encoding is based on 16-bit characters and exactly matches Unicode. The UCS-4 encoding uses 32-bit characters.
Conceptually, we can think of wide character sets as being extended ASCII or EBCDIC; each unique character is assigned a distinct value. Since the size of a wide character is not universally fixed, the type synonym wchar_t was invented as the wide character type. This synonym is defined in <stddef.h> and <stdlib.h>. (In C++, wchar_t is a keyword.)
Standard C allows wide characters in character constants and string literals provided new notation is used. For example:

L'a' is a wide character constant having type wchar_t.

L"abc" is a wide character string having type wchar_t [4].
Each member of the basic execution and source character sets has a code value equal to its value when used as the lone character in an integer character constant. This requires that 'a' == L'a', for example, be true.
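The wide-character notation makes the backward filename scan mentioned earlier straightforward, since every element of the array is one complete character. The following is a minimal sketch. The names wide_basename and wide_abc_elements are hypothetical, and the separator is assumed to be L'/'.

```c
#include <stddef.h>   /* wchar_t, size_t */

/* Return a pointer to the filename proper within a wide-character path,
   scanning backward from the end of the string. */
const wchar_t *wide_basename(const wchar_t *path)
{
    const wchar_t *p = path;

    while (*p != L'\0')                      /* find the terminating null */
        ++p;
    while (p > path && p[-1] != L'/')        /* back up to the last slash */
        --p;
    return p;
}

/* Number of elements in L"abc", counting the terminating wide null. */
size_t wide_abc_elements(void)
{
    return sizeof L"abc" / sizeof(wchar_t);
}
```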

Multibyte Functions
Standard C includes five new functions to deal with interpreting and/or converting multibyte and wide characters. These "multibyte functions" are mblen, mbstowcs, mbtowc, wcstombs, and wctomb, all declared in <stdlib.h>.

The mblen Function
mblen reports the number of bytes in a given multibyte character. For example:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
/*1*/   printf("MB_CUR_MAX = %lu\n", (unsigned long)MB_CUR_MAX);
/*2*/   printf("i = %d\n", mblen("", MB_CUR_MAX));
/*3*/   printf("i = %d\n", mblen("ABCDEFGHIJ", MB_CUR_MAX));
    return 0;
}
The output produced on one system is:

MB_CUR_MAX = 1
i = 0
i = 1
The output from case 1 indicates that the program executed in single-byte mode. In case 2, the multibyte string is empty, so mblen returns zero. In case 3, the multibyte string contains a number of multibyte characters; however, only the first is processed, resulting in mblen returning one. If the first MB_CUR_MAX bytes do not form a valid multibyte character, mblen returns -1.
Since mblen might be called repeatedly on successive multibyte characters in a string, it must remember the current shift state. Therefore, this function is not reentrant.

The mbtowc Function
mbtowc converts a given multibyte character to its corresponding wide character. For example:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int i;
    wchar_t wc;

    printf("MB_CUR_MAX = %lu\n", (unsigned long)MB_CUR_MAX);

/*1*/   i = mbtowc(&wc, "", MB_CUR_MAX);
    printf("i = %d, wc = %#4lX\n", i, (unsigned long)wc);

/*2*/   i = mbtowc(&wc, "ABCD", MB_CUR_MAX);
    printf("i = %d, wc = %#4lX\n", i, (unsigned long)wc);
    return 0;
}
The output produced on one ASCII-based system is:

MB_CUR_MAX = 1
i = 0, wc = 0
i = 1, wc = 0X41
The value returned by mbtowc has the same meaning as for mblen; if the multibyte string is empty, zero is returned and the null character is converted to a wide null character which also has value zero. If the multibyte string contains a number of multibyte characters, only the first is processed, in this case, the letter A (ASCII value 41 hex). If the first MB_CUR_MAX bytes do not form a valid multibyte character, -1 is returned.
Since mbtowc might be called repeatedly on successive multibyte characters in a string, it must remember the current shift state. Therefore, this function is not reentrant.
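Calling mbtowc repeatedly on successive characters, as described above, might look like the following sketch. It counts the characters in a multibyte string and, unlike mbstowcs, can report the byte offset at which an invalid character was found. The name mb_count_chars and its bad_offset parameter are hypothetical.

```c
#include <stdlib.h>   /* mbtowc, MB_CUR_MAX */

/* Count the multibyte characters in s. Returns the count, or -1 if an
   invalid character is found, in which case *bad_offset (if bad_offset
   is not null) receives the byte offset of the offending character. */
long mb_count_chars(const char *s, size_t *bad_offset)
{
    const char *p = s;
    wchar_t wc;
    long count = 0;
    int len;

    mbtowc(NULL, NULL, 0);   /* start from the initial shift state */

    /* mbtowc returns 0 when it converts the terminating null. */
    while ((len = mbtowc(&wc, p, MB_CUR_MAX)) > 0) {
        ++count;
        p += len;
    }
    if (len < 0) {
        if (bad_offset != NULL)
            *bad_offset = (size_t)(p - s);
        return -1;
    }
    return count;
}
```

Because the shift state lives inside mbtowc, two such scans cannot be interleaved safely. Each must run to completion before the next begins.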

The wctomb Function
wctomb converts a given wide character to its corresponding multibyte character. For example:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int i, j;
    char string[10];

    printf("MB_CUR_MAX = %lu\n", (unsigned long)MB_CUR_MAX);

/*1*/   i = wctomb(string, L'A');
    printf("i = %d, string =", i);
    for (j = 0; j < i; ++j) {
        /* convert via unsigned char so %X receives an unsigned value */
        printf(" %#4X", (unsigned)(unsigned char)string[j]);
    }
    putchar('\n');
    return 0;
}
The output produced on one ASCII-based system was:

MB_CUR_MAX = 1
i = 1, string = 0X41
If the wide character value has no corresponding multibyte character, -1 is returned. Otherwise, the return value is the number of bytes in the resulting multibyte character.
Since wctomb might be called repeatedly to produce successive multibyte characters in a string, it must remember the current shift state. Therefore, this function is not reentrant.

The mbstowcs Function
mbstowcs converts a given multibyte character string to an array of corresponding wide characters. For example:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    size_t i, j;
    wchar_t wc[5] = {'?', '?', '?', '?', '?'};

    printf("MB_CUR_MAX = %lu\n", (unsigned long)MB_CUR_MAX);

/*1*/   i = mbstowcs(wc, "AB", 5);
    printf("i = %lu, wc = ", (unsigned long)i);
    for (j = 0; j < 5; ++j) {
        printf(" %#4lX", (unsigned long)wc[j]);
    }
    putchar('\n');

/*2*/   i = mbstowcs(wc, "ABCDE", 5);
    printf("i = %lu, wc = ", (unsigned long)i);
    for (j = 0; j < 5; ++j) {
        printf(" %#4lX", (unsigned long)wc[j]);
    }
    putchar('\n');
    return 0;
}
The output produced on one ASCII-based system is:

MB_CUR_MAX = 1
i = 2, wc = 0X41 0X42 0 0X3F 0X3F
i = 5, wc = 0X41 0X42 0X43 0X44 0X45
In case 1, the string contains only two multibyte characters yet five were promised. As a result, these two are converted to wide characters and the trailing null character is also converted. In case 2, the string contains five multibyte characters, all of which are converted. However, the trailing null is not processed.
The multibyte string is assumed to begin in its initial shift state. If an invalid multibyte character is encountered, the value (size_t)-1 is returned, otherwise the number of multibyte characters converted (not including any trailing null) is returned.
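Since mbstowcs leaves the array unterminated when the limit is reached, callers often wrap it so that the result is always a valid wide string. The following is one possible sketch. The wrapper name mbs_to_wcs_terminated is hypothetical, and reserving the last element for the null is a design choice rather than something the library requires.

```c
#include <stdlib.h>   /* mbstowcs, wchar_t, size_t */

/* Convert src into dst, which has room for n wide characters, and always
   leave dst null-terminated. Returns the number of wide characters stored
   (not counting the null), or (size_t)-1 on an invalid multibyte character
   or when n is zero. */
size_t mbs_to_wcs_terminated(wchar_t *dst, const char *src, size_t n)
{
    size_t converted;

    if (n == 0)
        return (size_t)-1;            /* no room even for the null */

    converted = mbstowcs(dst, src, n - 1);   /* reserve one element */
    if (converted == (size_t)-1) {
        dst[0] = L'\0';               /* leave an empty string on error */
        return converted;
    }
    dst[converted] = L'\0';           /* harmless if already terminated */
    return converted;
}
```

With a five-element array, the string "ABCDE" now yields four converted characters followed by a null, instead of five characters and no terminator.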

The wcstombs Function
wcstombs converts a given array of wide characters to a corresponding multibyte character string. For example:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    size_t i, j;
    wchar_t wcs1[5] = L"AB";
    wchar_t wcs2[] = L"ABCDEF";
    char string[] = {'?', '?', '?', '?', '?'};

    printf("MB_CUR_MAX = %lu\n", (unsigned long)MB_CUR_MAX);

/*1*/   i = wcstombs(string, wcs1, 5);
    printf("i = %lu, string = ", (unsigned long)i);
    for (j = 0; j < 5; ++j) {
        printf(" %#4lX", (unsigned long)(unsigned char)string[j]);
    }
    putchar('\n');

/*2*/   i = wcstombs(string, wcs2, 5);
    printf("i = %lu, string = ", (unsigned long)i);
    for (j = 0; j < 5; ++j) {
        printf(" %#4lX", (unsigned long)(unsigned char)string[j]);
    }
    putchar('\n');
    return 0;
}
The output produced on one ASCII-based system is:

MB_CUR_MAX = 1
i = 2, string = 0X41 0X42 0 0X3F 0X3F
i = 5, string = 0X41 0X42 0X43 0X44 0X45
In case 1, the array has only two wide characters, which is less than expected. As a result, these two are converted to multibyte characters and the trailing null character is also converted. In case 2, the wide string contains more characters than expected, so only five bytes' worth of multibyte characters are produced and no trailing null is appended.
The multibyte string produced begins in its initial shift state. If a wide character is encountered and it has no corresponding multibyte character, the value (size_t)-1 is returned. Otherwise the number of bytes written to the multibyte string (not including any trailing null) is returned.
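One way to avoid the truncation seen in case 2 is to size the output buffer from MB_CUR_MAX before calling wcstombs. The sketch below allocates (n + 1) * MB_CUR_MAX bytes for a wide string of n characters, on the assumption that no character, including the terminating null, needs more than MB_CUR_MAX bytes. The name wcs_to_mbs_alloc is hypothetical.

```c
#include <stdlib.h>   /* wcstombs, malloc, free, MB_CUR_MAX */

/* Convert a wide string to a newly allocated multibyte string.
   Returns NULL if allocation fails or if some wide character has no
   multibyte equivalent. The caller frees the result. */
char *wcs_to_mbs_alloc(const wchar_t *ws)
{
    size_t n = 0, size, written;
    char *buf;

    while (ws[n] != L'\0')            /* count the wide characters */
        ++n;

    size = (n + 1) * (size_t)MB_CUR_MAX;   /* worst case, null included */
    buf = malloc(size);
    if (buf == NULL)
        return NULL;

    written = wcstombs(buf, ws, size);
    if (written == (size_t)-1) {      /* unconvertible wide character */
        free(buf);
        return NULL;
    }
    buf[size - 1] = '\0';             /* defensive: guarantee termination */
    return buf;
}
```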

ISO C Amendment 1
This amendment resulted in the creation of several new headers and some additions to existing ones, as follows:

<errno.h> has the macro EILSEQ added to it.

<iso646.h> is created as a repository for a family of macros that expand to alternate spellings for source characters necessary for C programming, yet missing from the ISO 646 character set.

<wchar.h> contains some type synonyms, a tag, some macros, and many functions, which provide for extended multibyte and wide-character processing.

<wctype.h> contains some type synonyms, a macro, and many functions, which provide for wide-character classification and mapping.
Many of the new functions have names with spellings that parallel their single-byte counterparts. The complete list of new names and their parent headers is shown in Table 2.
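As an illustration of that parallel naming, the sketch below uses wcslen from <wchar.h> and iswalpha and towupper from <wctype.h>, the wide counterparts of strlen, isalpha, and toupper. The function name wide_upcase is hypothetical, and the results depend on the current locale's LC_CTYPE category.

```c
#include <wchar.h>    /* wcslen */
#include <wctype.h>   /* iswalpha, towupper */

/* Convert the letters of a wide string to upper case in place and
   return the number of alphabetic characters found. */
size_t wide_upcase(wchar_t *ws)
{
    size_t letters = 0;
    size_t n = wcslen(ws);
    size_t i;

    for (i = 0; i < n; ++i) {
        if (iswalpha((wint_t)ws[i])) {
            ++letters;
            ws[i] = (wchar_t)towupper((wint_t)ws[i]);
        }
    }
    return letters;
}
```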
Many of the functions defined in Amendment 1 are discussed in Madell, Parsons, and Abegg, listed in the attached reading list.

Recommended Reading
The following books and publications will likely be of use to C programmers interested in further information on internationalization:
ANSI/ISO 9899:1990 Programming Language C. American National Standards Institute, New York, NY. 1990, 219 pp.
ANSI/ISO 9899:1990 Programming Language C: Amendment 1. American National Standards Institute, New York, NY. 1995, 52 pp.
Digital Guide to Developing International Software. Digital Press, Bedford, MA. 1991, 381 pp., ISBN 55558-063-7.
Lunde, Ken. Understanding Japanese Information Processing. O'Reilly & Associates, Sebastopol, CA. 1993, 435 pp., ISBN 1-56592-043-0.
Madell, Tom, Clark Parsons, and John Abegg. Developing and Localizing International Software. Hewlett-Packard Professional Books/Prentice Hall, Englewood Cliffs, NJ. 1994, 150 pp., ISBN 0-13-300674-3.
Plauger, P.J., regular column in The C/C++ Users Journal. Miller Freeman, Inc., Lawrence, KS. 1990-1994.
Plauger, P.J., regular column in The Journal of C Language Translation. IECC, Cambridge, MA. 1990-1994, ISSN 1042-5721.
Gallmeister, Bill O., POSIX.4: Programming for the Real World. O'Reilly & Associates, Inc., Sebastopol, CA. 1995, ISBN 1-56592-074-0.
The Unicode Consortium. The Unicode Standard: Worldwide Character Encoding, Version 1.0, Volume 1. Addison-Wesley, Reading, MA. 1990, 682 pp., ISBN 0-201-56788-1.
The Unicode Consortium. The Unicode Standard: Worldwide Character Encoding, Version 1.0, Volume 2. Addison-Wesley, Reading, MA. 1992, 439 pp., ISBN 0-201-60845-6.
The Unicode Consortium. The Unicode Standard, Version 1.1. Changes from Version 1 in draft form from The Unicode Consortium. 1994.