Columns


Standard C

Wide Character Streams

P.J. Plauger


P.J. Plauger is senior editor of The C Users Journal. He is convenor of the ISO C standards committee, WG14, and active on the C++ committee, WG21. His latest books are The Standard C Library, published by Prentice-Hall, and ANSI and ISO Standard C (with Jim Brodie), published by Microsoft Press. You can reach him at pjp@plauger.com.

Introduction

This is the third (and last) in a series of columns on the facilities being added to Standard C. (See "State of the Art: Large Character Set Support," C Users Journal, May 1993 and "State of the Art: Large Character Set Functions," C Users Journal, June 1993.) By the time you read this, the first round of balloting should be complete on the "normative addendum" that adds these facilities. At least two more rounds must occur before they become a part of the C Standard.

I have so far described the basic library support for large character sets and the functions for manipulating wide characters and multibyte strings.

I conclude this month by describing the largest single group of those functions, the ones that support wide-character input and output.

I remind you again that what I'm describing here can still change in some details, as a result of comments during balloting. I expect, however, that those changes should be small.

Problems with Large Character Sets

Remember that the overall goal of these additions is to ease the handling of text expressed with a very large character set. Japanese, Koreans, Chinese, Arabs, and typesetters all face a similar problem. They regularly work with "alphabets" containing thousands, or even tens of thousands, of characters. And that presents a host of problems: how to type such characters, how to display them, how to represent them inside a program, and how to store them in files.

We learned a long time ago to divorce the first two issues from programs and programming languages. Operating systems change keystrokes into a stream of bytes for us. Similarly, they convert an output stream into suitable dots on a screen or printer. Yes, the separation breaks down from time to time, but it is an important ideal to keep striving toward. Put another way, MS-DOS and UNIX may have to worry about how to display Kanji or ten-point Times Roman — C and C++ should not.

We have learned more recently to represent characters inside a program as fixed-size integers. A 16-bit integer can support distinct codes for up to 65,535 characters (setting zero aside as a null terminator). A 32-bit integer can support billions of distinct codes. Such characters in C are called wide characters, as opposed to the traditional one-byte versions of type char (and its signed and unsigned varieties).

That's why the C Standard introduced the defined type wchar_t as a synonym for some integer type. Wide-character constants (written with an L prefix, as in L'x') have this type, and wide-character strings are arrays of this type. C++ has gone further, making wchar_t a distinct type, so you can overload functions on one-byte and wide-character arguments.

That's also why the normative addendum adds so many new functions to manipulate wide characters. As I discussed last month, it is relatively easy to convert programs that manipulate char data to deal with wchar_t data instead.

Having a fairly complete set of analogous functions can only help.

A number of code sets exist for wide characters, by the way. Even Kanji has more than one in popular use within Japan. True, the new ISO 10646 standard promises to unite all the character sets of the world. But the jury is still out on just how that "universal" code set will best be used as a wide-character representation inside computer programs.

Multibyte Encoding

Wide characters are not nearly as much at home in external data streams. It's an eight-bit-byte world out there. Chop a wide character into a sequence of bytes and you immediately face problems: some of those bytes can masquerade as ordinary ASCII codes (such as null or newline), byte order varies from machine to machine, and a world of existing software expects text to arrive one byte at a time.

For all these reasons, we have learned over the years to encode large character sets differently for data streams.

This alternative form is called a multibyte encoding. No, there is no one universal multibyte code (just as there is still no universal wide-character code). But the rules for making multibyte codes follow a few basic patterns.

In all cases, the number of bytes used to represent a character varies for different members of the set. Codes based on ASCII, for example, represent the most used ASCII characters unchanged, as one-byte sequences. (Files containing mostly ASCII code are hence economical of disk space.) Longer codes are signaled one of two ways, either by prefix codes or by shift encoding.

With a prefix code, you can tell just by looking at a byte whether it stands alone as a single-byte code or whether it is the first byte of a longer sequence. You may also have to look at bytes that follow to decide when the sequence stops. But the major virtue is that each code sequence defines itself in all contexts. At worst, you may have to know that a given byte marks the start of a new code sequence.

A popular Kanji encoding is Shift JIS. Any code in the interval [0x81, 0x9F] or [0xE0, 0xFC] is the first byte of a two-byte code. Any other first byte is a single-byte code. The second byte of a two-byte code must be in the interval [0x40, 0xFC].

With shift encoding, you have more flexibility. A shift code (sometimes called an escape sequence) signals a change of rules. Once the byte sequence enters an alternate shift sequence, it might, for example, group two bytes at a time to define each subsequent character. The first byte might look like an ASCII A, but in this context it is not treated as such. Only another shift code behaves differently. It can put the byte stream back in the initial shift state, where A is A once again.

Another popular Kanji encoding is JIS. The three-byte shift code "\33$B" shifts to two-byte mode. In that mode, both first and second bytes must be in the interval [0x21, 0x7E]. The three-byte shift code "\33(B" returns to the initial shift state. Still a better example is a typical word-processor file format. One code shifts text to italic (or bold), another shifts it back to normal.

Shift encoding exacts a price for its flexibility. To parse a multibyte string, you need even more context than for prefix codes. You have to know where you stand within a given character, as before. You also have to know the current shift state. The parsing job is that much harder. The prospects for getting out of sync are that much greater.
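Here is a sketch of what that parsing looks like for the JIS encoding just described. The function jis_count, and its recognition of only the two shift codes from the text, are illustrative assumptions:

```c
#include <string.h>

/* Count the characters in a JIS-encoded byte sequence, tracking the
   current shift state. Recognizes only the two shift codes described
   in the text: "\33$B" enters two-byte mode and "\33(B" returns to
   the initial shift state. */
static long jis_count(const char *s, size_t n)
{
    int two_byte = 0;    /* the shift state we must carry along */
    long count = 0;
    size_t i = 0;

    while (i < n) {
        if (s[i] == '\33' && i + 3 <= n) {
            if (memcmp(s + i, "\33$B", 3) == 0) {
                two_byte = 1;
                i += 3;
                continue;
            }
            if (memcmp(s + i, "\33(B", 3) == 0) {
                two_byte = 0;
                i += 3;
                continue;
            }
        }
        i += two_byte ? 2 : 1;    /* width depends on shift state */
        ++count;
    }
    return count;
}
```

Note that jis_count cannot start in the middle of a sequence: without knowing the shift state at that point, it has no way to group the bytes correctly. That is exactly the synchronization hazard described above.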

Multibyte Streams

So here is the quandary. You want to work with wide characters inside a program. You need to convert them to and from multibyte characters as you read and write I/O streams. Yet opportunities abound for mucking up the parse as you read streams. And opportunities abound for generating flawed sequences, or redundant shift codes at least, as you write streams.

Now consider the way a typical program reads and writes data. Occasionally, all reads occur in one place and/or all writes in another. More likely, reads and writes are sprinkled throughout the code. That makes it hard for a program to coordinate parsing a stream as multibyte characters, or generating characters in the proper shift state. The only thing in common to all reads or writes is the object of type FILE that controls the stream.

But operations on C streams are defined in terms of byte-at-a-time input or output. (The defining primitives are fgetc and fputc.) Streams know nothing about possible shift codes or any other multibyte structure. It is altogether too easy to create mayhem using the conventional C facilities to read and write multibyte streams.

The Japanese tried several ways to "fix" the existing I/O functions. They failed repeatedly. scanf, for example, is troublesome enough reading streams a byte at a time. Any attempt to make the function aware of multibyte syntax leads to parsing rules that are impossibly complex. printf is less troublesome, but still a nuisance. It has no obligation to produce shift codes and multibyte codes that are optimal, or even correct.

One approach (as always) is to add another layer of code. Define a "wide input stream" that reads bytes from another stream and assembles them into a stream of wide characters. Also define a "wide output stream" that accepts wide characters and turns them into multibyte sequences that it writes to another stream. Using them gives you more of the pure wide-character environment we now know is desirable within a program.
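Such a layer might look like the following sketch. The name wide_getc is hypothetical; mbrtowc is the restartable multibyte-to-wide conversion function supplied by the addendum, and the mbstate_t object is what carries the conversion (shift) state between calls:

```c
#include <stdio.h>
#include <wchar.h>

/* Read bytes from an ordinary byte stream and assemble them into
   wide characters, one per call. */
static wint_t wide_getc(FILE *fp, mbstate_t *ps)
{
    for (;;) {
        char b;
        wchar_t wc;
        int c = fgetc(fp);

        if (c == EOF)
            return WEOF;
        b = (char)c;
        switch (mbrtowc(&wc, &b, 1, ps)) {
        case (size_t)-2:          /* incomplete: need more bytes */
            break;
        case (size_t)-1:          /* invalid multibyte sequence */
            return WEOF;
        default:                  /* one complete wide character */
            return (wint_t)wc;
        }
    }
}
```

In the default "C" locale every byte converts to one wide character, so the layer is trivial there. With a state-dependent encoding, the mbstate_t object is what remembers the shift state from one call to the next.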

Sequential I/O works fine this way, but random access is rather more of a problem. You think you're positioning a stream at a given wide character. The actual underlying stream has to be positioned at the start of a given multibyte character. Not only that, you may have to restore the remembered shift state at the start of that multibyte character. The job is not impossible, but it is a nuisance.

One way to make the job of file positioning easier is to constrain it somewhat. The C Standard already lets an implementation do this with text files. You can find out where you currently are in a text file (ftell or fgetpos). If that query succeeds, you can later return the stream to that position (fseek or fsetpos). But you can't just position the file to read byte 17,239 next, as you can with a binary file.

So you impose a similar constraint on positioning within a wide-character stream. You can find out where you currently are in the wide-character stream, memorizing the position as for a text file. The only additional problem is that you may need additional bits to memorize a shift state. That's a real nuisance for ftell, which returns a long. Take one bit for the shift state and you can only memorize positions half as far into files. Need more bits for the shift state? The capability of ftell is more drastically reduced. You have no additional problems with fgetpos, however. It returns a value of type fpos_t, which an implementation can define as a structure containing all the fields, of whatever size, it might need.

Given a file position, you can later find your way back to it by calling fseek or fsetpos, as appropriate. You can also position a file at its start with equal ease. (The implementation may have to decree that all streams start in the initial shift state, a not unreasonable constraint.) To position a file at its end can be tougher, if it uses a state-dependent encoding. The only sure way to get the shift state right is to read the file from a known position to the end. (The implementation can always end files in a known shift state, but that makes it hard to avoid writing files containing redundant shift codes, a desirable if not mandatory property.)

Of course, you have to impose similar constraints on rewriting a file. You can't just drop a few wide characters in the middle of nowhere. It's too hard for an implementation to splice in an arbitrary multibyte sequence. But the same problem already exists for text files on many operating systems. They freeze record boundaries when you first write them. Rewriting the file cannot alter those boundaries. So as a general rule, once you rewrite part of a file, its contents from the end of what you just wrote to the old end of file are now in an undefined state.

Even with all these constraints, however, you can get a lot of work done. People have been writing C code for years that runs with highly structured text files. Imposing multibyte syntax is just more of the same, not an entirely new problem.

Wide-Character Streams

The proposed addendum to the C Standard introduces the concept of wide-character streams. These behave much like the extra layer of code I hypothesized above, with one important difference. The new functionality is essentially stuffed into existing FILE objects. You don't interpose a second stream to convert to and from streams of wide characters. Instead, you convince a conventional FILE object that it is mediating a wide-character stream instead of a conventional unstructured byte stream.

How do you convince a FILE that it is mediating a wide-character stream? By the operations you perform on it. The header <wchar.h> declares analogs for many of the functions declared in <stdio.h>. (See Listing 1.) Like their older cousins, each of these new functions operates on a stream, either implied or explicitly named as an argument. But the new functions read and write wide characters instead of one-byte characters.

Note that fopen does not change. The difference is that now a stream does not know, when first opened, what its orientation will be, whether it will become wide-oriented or byte-oriented. As soon as you call any of the functions in Listing 1 for that stream, it becomes wide-oriented. All subsequent reads, writes, and positionings will be in terms of wide characters. If you instead call any of the older analogs to these functions, the stream becomes byte-oriented. All subsequent operations will be in terms of single bytes, just like in the C library you know and love.

A few functions are ecumenical. You can, for example, call setvbuf or fgetpos for a stream and still make no commitment. You can also call these (and their related) functions for a stream of either orientation, once it is established. The one thing you cannot do is try to have it both ways. Mix calls of different orientations for the same open stream and you get undefined behavior. (That means that an implementation can do something sensible if the mood strikes it, but it doesn't have to.) The only way to alter the orientation of a stream, in fact, is to call freopen on the FILE object that mediates it.

WG14 chose this approach over adding a mode qualifier to fopen (and freopen, of course). One reason for doing so was simplicity. The other was to simplify the use of stdin and the other standard streams with wide orientation. If the first operation determines orientation, a standard stream can go either way with no additional considerations.

Whither C++

An important open issue is how best to include this new capability into C++. X3J16/WG21, the C++ standards committee, has already committed to tracking the C Standard as it evolves. You can be sure that all these new functions will become a part of the C++ library, just as the entire Standard C library already is.

But is that the best way to manipulate large character sets in C++? Probably not. Serious C++ programmers already disdain the use of the C I/O machinery. They favor the classes declared in <iostream.h>. (The continued presence of a trailing .h is currently being debated.) They like to use operator >> and operator << to do the actual input and output. Every new class they write also overloads these operators, if it makes sense at all to read or write objects of that class.

So can we expect iostreams to be further overloaded on wide characters? The decisions have not yet been made, but I'd be astonished if that didn't happen. I just worry a bit about complexity overload. C streams were already pretty complex. C++ iostreams add considerable complexity atop C streams. Wide streams add even more complexity within C streams. How well it all hangs together will be interesting to observe.

As always, I have to warn that all this stuff is new. We have very little experience with wide character manipulation in general. We have even less with the newly proposed wide-character functions I've described in the previous two columns. And we have next to no experience with the wide-character streams described here. We can only hope that it all will indeed simplify our lives as programmers in the years to come.