C/C++ Contributing Editors


Standard C/C++: Multibyte Files

P. J. Plauger

The proper way to represent wide characters in a file is as a multibyte stream. But you sometimes need help to support the encoding you really want to use.


For the past couple of months, I've been describing some of the more advanced uses of the iostreams classes, as they are now specified in the C++ Standard. Last month, in particular, I discussed the business of reading and writing Unicode files. (See "Standard C/C++: Unicode Files," CUJ, April 1999.) I presented code for a couple of code-conversion locale facets that augment the Standard C++ library classes that read and write streams of wide characters. These facets let you read and write Unicode in a fairly popular format that dedicates two bytes to each character stored in the file. One of them even includes support for a clever trick that gets byte order right across architectures with different "byte sex," or "endian-ness."

But there's far more than one way to skin this particular cat. Some of those ways go back a couple of decades. They have properties that are more desirable than the relatively simple approach I discussed last month. And they are interesting in their own right. On the other hand, they're a tad trickier to get right in working code. That's why I presented the simpler approach first.

I'm talking about multibyte encodings. They were originally developed as a way to represent text using a large character set as a stream of more conventional eight-bit bytes. Disk storage, UARTs, and practically all other modern hardware traffic in eight-bit bytes these days. However much you may believe in large character sets, it's rather necessary to ship them about as byte sequences from time to time.

What I presented last month was indeed one form of multibyte encoding. It assumes that wide-character codes are manipulated inside the program as a sequence of values of type wchar_t. The actual encoding of each character is further assumed to follow the Unicode Standard, which uses 16 bits to encode each character. (For a more precise definition of Unicode, see the file ftp://ftp.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt available at the Unicode web site.) Out in the world, the 16-bit code becomes a sequence of two eight-bit bytes.

What is the order of these two bytes? That often depends on the machine that writes the stream. A "little-endian" machine finds it easier to write the less-significant byte first. A "big-endian" machine favors writing the bytes the other way around. So a reader has to know how such a multibyte stream was written to get it right.

One of the variants I described last month includes a way of making a file describe its implied byte order. The codes 0xfffe and 0xfeff are reserved as non-characters for this purpose. If a reader, reading in whatever byte order it chooses, sees the code 0xfeff, it knows to discard this code and keep doing what it's been doing. On the other hand, if the reader sees the code 0xfffe, it knows to discard this code and henceforth combine bytes in the opposite order from whatever it has been doing. Since 0xfffe is just 0xfeff byte swapped, the stream identifies itself to all future readers savvy enough to look for these markers.
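The marker trick can be sketched in a few lines. This is only an illustration, not the facet from last month's listing: the function name decode_two_byte is made up, the reader is assumed to start out combining bytes most significant first, and the roles of the two codes follow the Unicode convention (0xfeff is the mark as written, 0xfffe is its byte-swapped image).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes a two-byte-per-character stream into
// 16-bit codes, honoring byte-order marks wherever they appear.
std::vector<unsigned short> decode_two_byte(const std::string& bytes)
{
    std::vector<unsigned short> codes;
    bool swapped = false;  // false: first byte of each pair is most significant
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned first = static_cast<unsigned char>(bytes[i]);
        unsigned second = static_cast<unsigned char>(bytes[i + 1]);
        unsigned code = swapped ? (second << 8 | first)
                                : (first << 8 | second);
        if (code == 0xfeff)
            continue;            // mark read in the right order: discard it
        if (code == 0xfffe) {
            swapped = !swapped;  // mark read byte swapped: discard and flip
            continue;
        }
        codes.push_back(static_cast<unsigned short>(code));
    }
    return codes;
}
```

Fed a file written on either kind of machine, the same reader should recover the same codes, provided the writer put the mark at the front. A stray odd byte at the end is simply ignored here.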

Desirable Properties

This form of multibyte encoding is growing in popularity, thanks in large measure to Java and Windows CE. Whatever it may lack in other desirable properties, it more than makes up for in ubiquity. That's why last month I presented a couple of versions of template facet codecvt<wchar_t, char, mbstate_t> that work with Visual C++ to support such wide-character input and output.

But a really good multibyte encoding should aspire to quite a number of desirable properties. Here are a few that I can think of, just off the top of my head.

1. For ease of interchange across computer architectures, byte order should be precisely defined, regardless of the native byte order of the reading or writing machine.

2. For space economy, the commonest character codes should have the shortest multibyte encodings.

3. For ease of interconversion, the relationship between a wide-character code and its corresponding multibyte sequence should be a simple function, not a table lookup.

4. To permit a typical ASCII text sequence to pass unmodified as a multibyte sequence, each of the common ASCII characters should stand for itself.

5. To use a multibyte sequence as part of a null-terminated byte string (C-style string), no multibyte codes should include a null byte. The null character should be representable as a single null byte.

6. To permit reading and writing as part of a C-style text stream, none of the bytes in the set "\0\n\r\t\f\v" should appear as part of a larger multibyte sequence. Each such code should be a single-byte sequence with its usual ASCII meaning.

7. To permit multibyte sequences as file names, none of the bytes in the set "\0./\\" should appear as part of a larger multibyte sequence. Each such code should be a single-byte sequence with its usual ASCII meaning.

The simple two-byte encoding of Unicode characters satisfies only items (1) and (3). It loses on all the others.

But let's not be too quick to condemn this simple scheme. Other methods have fared only slightly better over the years. Here, for example, are the first few encodings I encountered.

Shift JIS

A byte code in the interval [0x81, 0x9f] or [0xe0, 0xfc] signals the first of a two-byte sequence. Any other code is a single character. The second byte code must be in the interval [0x40, 0xfc].

This is an early encoding used by the Japanese to represent a mixture of ASCII and a subset of Kanji, the very large set of Japanese characters derived from the equally large Chinese set (which the Chinese call Hanzi). I forget the details after all these years, but I believe this subset also includes the two phonetic alphabets the Japanese use, called Hiragana and Katakana.

In any event, Shift JIS has been widely used for decades as a way to transmit and store Japanese text. It was important enough that the C compiler I wrote back in the late 1970s was doctored up to accept such codes in comments and string literals. (My agents and good friends at Advanced Data Controls Corporation, or ADaC, did the work for us.) Japanese programmers like comments and error messages they can understand, just like the rest of us.

This encoding fails only item (7) outright. Some codes contain a trailing byte that looks like a backslash, which makes them unusable in filenames. Whether it satisfies (2) depends on the actual mix of ASCII and Kanji in a typical sequence. And whether it satisfies (3) depends on the internal codes used for wide characters. Of course, you can always jigger the internal codes to ease translation to and from Shift JIS, assuming those codes are private to your program. (For more on this issue, see Exercises 13.1 through 13.3, pages 384-385 of P.J. Plauger, The Standard C Library, Prentice-Hall, 1992.) But if they're not private, you can have a real translation problem on your hands. I could never get a clear answer from my Japanese friends about the standard code sets, if any exist, for wide-character codes, and how they might relate to Shift JIS.
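The Shift JIS rules are easy to put in code. The sketch below takes the byte ranges quoted above at face value; the function names are invented for illustration. Note that the trailing-byte range includes 0x5c, the ASCII backslash from the set in item (7), but neither the dot nor the forward slash.

```cpp
#include <cstddef>
#include <string>

// Ranges copied from the description above; names are illustrative only.
inline bool sjis_is_lead(unsigned char b)
{
    return (0x81 <= b && b <= 0x9f) || (0xe0 <= b && b <= 0xfc);
}

inline bool sjis_is_trail(unsigned char b)
{
    return 0x40 <= b && b <= 0xfc;
}

// Counts the characters in a Shift JIS byte string.
// Returns -1 if a leading byte lacks a valid trailing byte.
long sjis_length(const std::string& s)
{
    long n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (sjis_is_lead(b)) {
            if (i + 1 >= s.size()
                || !sjis_is_trail(static_cast<unsigned char>(s[i + 1])))
                return -1;
            ++i;  // skip the trailing byte
        }
    }
    return n;
}
```

The two-byte sequence 0x83 0x5c counts as one character here, yet a filename parser that knows nothing of Shift JIS sees a backslash in the middle of it.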

JIS

The three-byte sequence "\33$B" shifts to two-byte mode. The three-byte sequence "\33(B" shifts back to one-byte mode. In two-byte mode, both byte codes must be in the interval [0x21, 0x7e].

This is another early encoding used by the Japanese to represent a mixture of ASCII and a subset of Kanji. Rules exist for converting between Shift JIS and JIS, so I think the codes are highly interrelated, but once again my memory fails me about the details.

The advantage of JIS is that you have a much larger set of both single-byte and two-byte codes that you can intermix. A distinct shift sequence tells you when to change the encoding rules. The price you pay is that encodings are now state dependent. To read an arbitrary sequence, you must know what state you're currently in. Lose that state information and you get nonsense. Functions that encode and decode state-dependent sequences must have some way of storing state information between calls, so they are more complicated to write.
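Here is a rough sketch of what state dependence means in practice. It assumes that "\33$B" enters two-byte mode and "\33(B" returns to one-byte mode, as in ISO-2022-JP. The JisState type and jis_decode function are hypothetical; a real facet would tuck the state into an mbstate_t. For brevity, the sketch also assumes that no escape sequence or two-byte pair is split across calls.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical state object carried from one call to the next.
struct JisState {
    bool two_byte;
    JisState() : two_byte(false) {}
};

// Decodes one chunk of a JIS stream, updating st so that a later
// call can pick up where this one left off.
std::vector<unsigned> jis_decode(const std::string& chunk, JisState& st)
{
    std::vector<unsigned> out;
    std::size_t i = 0;
    while (i < chunk.size()) {
        if (chunk.compare(i, 3, "\33$B") == 0) {
            st.two_byte = true;    // shift to two-byte mode
            i += 3;
        } else if (chunk.compare(i, 3, "\33(B") == 0) {
            st.two_byte = false;   // shift back to one-byte mode
            i += 3;
        } else if (st.two_byte) {
            if (i + 1 >= chunk.size())
                break;             // dangling byte: give up
            unsigned hi = static_cast<unsigned char>(chunk[i]);
            unsigned lo = static_cast<unsigned char>(chunk[i + 1]);
            out.push_back(hi << 8 | lo);
            i += 2;
        } else {
            out.push_back(static_cast<unsigned char>(chunk[i]));
            ++i;
        }
    }
    return out;
}
```

Decode the bytes 0x30 0x21 with the state left in two-byte mode by an earlier call and you get one 16-bit code. Decode the same bytes with a fresh state and you get the two ASCII characters 0 and !, which is exactly the nonsense you'd expect.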

But the bottom line, for the purposes of this discussion, is that JIS has about the same set of virtues and vices as Shift JIS.

EUC

A byte code in the interval [0xa1, 0xfe] is the first of a two-byte sequence. The second byte code must be in the interval [0x80, 0xff].

I believe this encoding was introduced when Unix met the mysterious Orient. Note the more restricted range for second-byte codes. They carefully avoid slashes — not to mention backslashes, dots, and all other ASCII printable and control characters. Hence, you can use EUC to specify filenames. Since Unix systems traffic heavily in ASCII text, even in foreign lands, and since the Unix folks felt they could define whatever internal code set they chose for wide characters, EUC is an across-the-board winner.
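The payoff for filenames can be shown with a small sketch, using the byte ranges given above. Both function names are invented for illustration. The path splitter knows nothing about EUC, yet it cannot cut a two-byte character in half, because no byte of such a character can equal the slash.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a path on the '/' byte, with no knowledge of EUC at all.
std::vector<std::string> split_path(const std::string& path)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        std::size_t slash = path.find('/', start);
        parts.push_back(path.substr(start, slash - start));
        if (slash == std::string::npos)
            break;
        start = slash + 1;
    }
    return parts;
}

// True if s is a well-formed mix of single bytes and EUC pairs,
// using the ranges quoted above.
bool euc_well_formed(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (0xa1 <= b && b <= 0xfe) {  // first of a two-byte sequence
            if (i + 1 >= s.size()
                || static_cast<unsigned char>(s[i + 1]) < 0x80)
                return false;
            ++i;                       // skip the second byte
        }
    }
    return true;
}
```

Each component that split_path returns for a well-formed EUC path should itself pass euc_well_formed.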

Its only drawback is that Unicode seems to have won the popularity contest.

Enter FSS-UTF

A byte code in the interval [0x80, 0xbf] is never a leading byte. It supplies six bits to a multibyte encoding. A byte code in the interval [0xc0, 0xdf] is the first of a two-byte sequence. It supplies the most significant five bits of an eleven-bit code. A byte code in the interval [0xe0, 0xef] is the first of a three-byte sequence. It supplies four of 16 bits.
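In code, the rules for values up to 16 bits come out as a handful of shifts and masks. This is a minimal sketch assuming the standard UTF-8 bit layout, in which any code below 0x80 stands for itself as a single byte. The function name utf8_encode is made up, and the caller is trusted to pass a value no larger than 0xffff.

```cpp
#include <string>

// Hypothetical helper: encodes one 16-bit code as one to three bytes.
// Assumes code <= 0xffff.
std::string utf8_encode(unsigned code)
{
    std::string out;
    if (code < 0x80) {
        out += static_cast<char>(code);                         // 7 bits
    } else if (code < 0x800) {
        out += static_cast<char>(0xc0 | (code >> 6));           // top 5 of 11
        out += static_cast<char>(0x80 | (code & 0x3f));         // low 6
    } else {
        out += static_cast<char>(0xe0 | ((code >> 12) & 0x0f)); // top 4 of 16
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3f));  // middle 6
        out += static_cast<char>(0x80 | (code & 0x3f));         // low 6
    }
    return out;
}
```

For instance, the code 0xe9 becomes the two bytes 0xc3 0xa9, and the code 0x20ac becomes the three bytes 0xe2 0x82 0xac.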

Even Plan 9, the internal successor to Unix among the original Unix developers at Bell Laboratories, adopted Unicode as its standard character set. Fortunately for the rest of us, the same smart people who brought us Unix, C, and a host of other wonderful ideas also addressed the issue of representing Unicode sequences as multibyte sequences. I believe it was Dave Prosser and Ken Thompson who did the original clever design, but that's just hearsay. In any event, one of the offshoots of the later Unicode work was an encoding called FSS-UTF, which stands for "FileSystem Safe Universal Transfer Format," or some set of words close to those.

I described this encoding in detail several years ago in a sister magazine to CUJ. (See "State of the Art: Encoding Text," Embedded Systems Programming, May 1994.) It has all of the virtues of items (1) through (7) above. It can also be used to encode up to 32-bit values both efficiently and transparently. Even nicer, the codes used to encode up to 16-bit values form a tidy, self-consistent subset.
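The claim about item (3) holds in the reading direction too: turning bytes back into codes takes no tables. The following sketch assumes the standard UTF-8 bit layout and handles sequences of up to three bytes only. The name utf8_decode is hypothetical, and the function makes no attempt to reject overlong forms.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 sequences of up to three bytes,
// appending the resulting codes to out.
// Returns false on a malformed or unsupported sequence.
bool utf8_decode(const std::string& in, std::vector<unsigned>& out)
{
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned b = static_cast<unsigned char>(in[i++]);
        unsigned code;
        int more;  // number of continuation bytes still expected
        if (b < 0x80)      { code = b;        more = 0; }
        else if (b < 0xc0) return false;      // stray continuation byte
        else if (b < 0xe0) { code = b & 0x1f; more = 1; }
        else if (b < 0xf0) { code = b & 0x0f; more = 2; }
        else return false;                    // would exceed 16 bits
        for (; more > 0; --more) {
            if (i >= in.size())
                return false;                 // sequence cut short
            unsigned c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xc0) != 0x80)
                return false;                 // not a continuation byte
            code = code << 6 | (c & 0x3f);
        }
        out.push_back(code);
    }
    return true;
}
```

Because leading bytes and continuation bytes occupy disjoint ranges, the decoder can tell from any single byte what role it plays.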

But the greatest virtue of all, once again, is popularity. Seems the Java folks settled on FSS-UTF as the preferred way to encode Unicode sequences as byte sequences. They call it UTF8, for reasons I won't go into here, but it's essentially the same thing. Take a close look at the standard library class java.lang.String. The constructor String(byte[]) and the method getBytes (and their variants) all interconvert between Unicode and UTF8.

So this month, I extend my work of last month to include UTF8. Listing 1 shows yet another class derived from the facet codecvt<wchar_t, char, mbstate_t>. Such a facet is used by the iostreams class basic_filebuf<wchar_t, char_traits<wchar_t> > to convert between a stream of wchar_t values inside the program and a corresponding multibyte stream stored in a named file. Class Mycodecvt, shown here, converts between Unicode and UTF8.

Listing 2 shows the test code I used last month to exercise code converters. It has a little excess baggage for our purposes this time around, but it serves the same basic purpose. First, it shows how you connect up a facet to a wide stream buffer. Then it shows how to write a file that contains essentially all the different Unicode character codes. Then it shows how to read the same file back in and verify that all the codes make the round trip properly.
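For a feel of what is involved, here is a pared-down sketch of such a facet and of the hookup to a wide file stream. It is not the code from Listing 1 or Listing 2. The names Utf8Facet and write_wide are invented, the facet handles codes up to 0xffff only, it does not override do_length, and it relies on the override and noexcept keywords from later revisions of C++.

```cpp
#include <cstddef>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical stand-in for a Unicode/UTF8 code-conversion facet.
class Utf8Facet : public std::codecvt<wchar_t, char, std::mbstate_t> {
public:
    explicit Utf8Facet(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    // Wide characters to bytes.
    result do_out(std::mbstate_t&, const wchar_t* from,
                  const wchar_t* from_end, const wchar_t*& from_next,
                  char* to, char* to_end, char*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            unsigned long c = static_cast<unsigned long>(*from_next);
            if (c > 0xffff) return error;
            int n = c < 0x80 ? 1 : c < 0x800 ? 2 : 3;
            if (to_end - to_next < n) return partial;  // no room left
            if (n == 1) {
                *to_next++ = static_cast<char>(c);
            } else if (n == 2) {
                *to_next++ = static_cast<char>(0xc0 | (c >> 6));
                *to_next++ = static_cast<char>(0x80 | (c & 0x3f));
            } else {
                *to_next++ = static_cast<char>(0xe0 | (c >> 12));
                *to_next++ = static_cast<char>(0x80 | ((c >> 6) & 0x3f));
                *to_next++ = static_cast<char>(0x80 | (c & 0x3f));
            }
            ++from_next;
        }
        return ok;
    }

    // Bytes to wide characters.
    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (to_next == to_end) return partial;     // output full
            unsigned long b = static_cast<unsigned char>(*from_next);
            int n = b < 0x80 ? 1 : b < 0xc0 ? 0 : b < 0xe0 ? 2
                  : b < 0xf0 ? 3 : 0;
            if (n == 0) return error;                  // bad leading byte
            if (from_end - from_next < n) return partial;
            unsigned long c = n == 1 ? b : n == 2 ? (b & 0x1f) : (b & 0x0f);
            for (int i = 1; i < n; ++i) {
                unsigned long t = static_cast<unsigned char>(from_next[i]);
                if ((t & 0xc0) != 0x80) return error;
                c = c << 6 | (t & 0x3f);
            }
            *to_next++ = static_cast<wchar_t>(c);
            from_next += n;
        }
        return ok;
    }

    // The encoding has no shift state, so there is nothing to unshift.
    result do_unshift(std::mbstate_t&, char* to, char*,
                      char*& to_next) const override
    { to_next = to; return noconv; }
    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }  // variable width
    int do_max_length() const noexcept override { return 3; }
};

// Writes text to a named file through the facet. The locale takes
// ownership of the facet; imbue before open so the file buffer sees it.
bool write_wide(const char* name, const std::wstring& text)
{
    std::wofstream out;
    out.imbue(std::locale(std::locale::classic(), new Utf8Facet));
    out.open(name, std::ios::out | std::ios::binary);
    out << text;
    out.close();
    return !out.fail();
}
```

Reading works the same way with a std::wifstream: imbue the locale first, then open the file and extract wchar_t values as usual.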

I end with my usual disclaimer. I threw this code together in a hurry. It is far from a professional product. Use it, by all means, as a proof of concept and a starting point for code you write yourself. But if it doesn't do the whole job properly, please don't be too surprised.

P.J. Plauger is Senior Editor of C/C++ Users Journal and President of Dinkumware, Ltd. He is the author of the Standard C++ Library shipped with Microsoft's Visual C++, v5.0. For eight years, he served as convener of the ISO C standards committee, WG14. He remains active on the C++ committee, J16. His latest books are The Draft Standard C++ Library, Programming on Purpose (three volumes), and Standard C (with Jim Brodie), all published by Prentice-Hall. You can reach him at pjp@plauger.com.