December 1997/Standard C/C++

Columns

Standard C/C++

P. J. Plauger

The Facet codecvt

Inside a Standard C++ program, you can now define streams of wchar_t, or even more exotic, elements. Something has to translate between those streams and what's really out there in files.

Introduction

When C was first born under Unix, reading and writing files was conceptually simple. And it still is — under Unix. The call write(fd, buf, n) writes n bytes from the char array buf to the output stream controlled by fd. If that stream is associated with a disk file, then those n bytes end up in the file, unchanged. Similarly, the call read(fd, buf, n) reads up to n bytes from the input stream controlled by fd. If that stream is associated with a disk file, then all the bytes read come from the file, unchanged.

A stream connected to a terminal is subject to a bit more interpretation. Write a newline (line feed) character '\n' and Unix may well send a carriage return '\r' and line feed to the terminal instead. Read the keyboard and Unix will typically turn a carriage return into a newline. Similar mappings may occur with other "character special" devices. But plain files and pipelines represent data just the same as it appears inside a C program.

As C migrated to other operating systems, however, life got a little messier. Few systems share the Unix convention for representing the end of a text line in a file. Many store a carriage return and line feed instead of just a line feed. Some store a carriage return. Some delimit lines with meta information separate from the sequence of bytes stored.

We pioneers soon settled on a simple but necessary convention. Open a file named X with the call fopen("X", "wb") and all writes to the file are "binary," or uninterpreted transparent transmissions. Omit the b in the mode string and writes are assumed to be to a text file. A newline character gets changed into whatever the underlying operating system considers appropriate to signal the end of a text line. Various other mappings may also occur on the way out. Similarly, the call fopen("X", "rb") reads a binary file, transmitting the bytes unchanged. Omit the b and reads are assumed to be from a text file.

As much as possible, the mappings that occur during reads are the inverse of those that occur on writes. But the mapping is seldom perfectly symmetric. Some patterns of data simply can't be stored transparently in an arbitrary text file under an arbitrary operating system. The C Standard spells out the constraints pretty precisely. Those of us who endeavor to write highly portable code have learned to live within them.

Enter wchar_t

Then along came large character sets. As first adopted in 1989, the C Standard said as little as possible about them. But it did introduce a few basic concepts. It defines wchar_t as an integer type suitable for representing all the codes of a large character set as "wide characters." It describes "multibyte sequences" as alternative encodings for elements of a large character set, using one or more bytes as needed for each code. And it adds a handful of functions to the traditional C library to convert between the two representations.

What that original C Standard did not do was supply any easy way to read and write text represented in a large character set. It was left up to you, the programmer, to decide how to store such text in a file, how to represent it inside the program, and how to convert as needed in between. You have that handful of conversion functions to help hide from the actual details of a particular encoding, but that's about all.

Amendment 1 to the C Standard addressed this deficiency. Adopted by ISO in 1995, it fleshes out the character-manipulation functions in the Standard C library considerably. Put simply, anything you can do with objects and sequences of type char, you can now also do with objects and sequences of type wchar_t. That parallelism even extended to the functions that read and write text, formatted or otherwise.

The call fputwc(wc, os), for example, nominally writes the wide character wc to the output stream os. But what it actually often does is convert wc to an equivalent multibyte encoding and write that sequence instead. The situation is very much like writing to a text file on a non-Unix system. The program pretends it is generating a sequence of wide characters, but what actually gets written may well be some different sequence of bytes more suitable for the needs of a given operating system.

Amendment 1 includes analogs to all the old standbys from the header <stdio.h>. The newer header <wchar.h> declares fwprintf and all its variants as well as fputwc. There you will also find fgetwc and friends, which read multibyte sequences as needed and compose them into wide characters for internal consumption. The upshot is that you can program as if the world consists only of wide characters, inside a program, independent of what goes on out in the files themselves. However ambitious, this is just an analytic continuation of the early Unix decision to make all line terminators look like newlines inside a C program.

And however ambitious, the facilities added with Amendment 1 are still minimalist in some important ways. Nothing is said about how to specify a given multibyte encoding when reading or writing a given file. If an implementation wants to provide a choice, it has to be inventive about how to specify the choice. Given the multiplicity of multibyte encodings in widespread use, this impediment to portability can be significant.

The saving grace here, as I have often pointed out, is that most programmers simply don't care. The vast majority get along fine without large character sets. Many implementations have only vestigial support for the Amendment 1 additions, if any at all. Those that do implement Amendment 1 usually hard-wire the library for a single encoding. In fact, the only general solution I know of is the one in the Dinkum C Library, which I wrote years ago and my company now licenses. I can't say that customers have expressed any great interest in this particular feature to date.

Fun With Facets

So now we come to the draft Standard C++ library. As we have seen in recent months, the committees standardizing C++ have not been at all shy about providing support for internationalization. (See "Standard C/C++: Introduction to Locales," CUJ, October 1997, and "Standard C/C++: The Facet ctype," CUJ, November 1997.) Each iostreams object is "imbued" with its own private locale object, so that a C++ program can work with numerous locales at the same time. Each locale consists of dozens of facets, each controlling a different aspect of a locale.

Last month, I described the facet ctype, which supplies locale-dependent member functions closely analogous to our old friends from the C header <ctype.h>. The topic for this month is the facet codecvt, also controlled by the locale category LC_CTYPE. It supplies member functions that convert between sequences of some element type E, in an internal stream, and some multibyte encoding in an external file or data stream.

More specifically, the facet is a template class:
template<class E, class To, class State>
    class codecvt;
Its member function out converts a sequence of E objects to a sequence of To objects. Its member function in converts the other way. These code-conversion functions typically operate as finite-state machines, maintaining their internal state between calls in an object of type State. That's codecvt in a nutshell. The rest is window dressing and optimizations.

The only customer for codecvt facets within the draft Standard C++ library proper is template class basic_filebuf, the templatized version of that old iostreams standby filebuf. An object of class basic_filebuf<E, Tr> manages one or two sequences of elements of type E, an input stream and an output stream. Class Tr specifies the "traits," or properties, associated with such a stream, such as how to designate end-of-file, or how to copy such a sequence.

You may recall that basic_filebuf<E, Tr> is derived publicly from basic_streambuf<E, Tr>, the generalized stream-buffer controller for the iostreams classes. A basic_filebuf<E, Tr> object lets you open a file by name, then control data streams to and from that file. It does so by overriding several protected virtual member functions inherited from the base class. The two of interest here are:

uflow, which obtains another input element of type E

overflow, which disposes of an output element of type E

The basic_filebuf<E, Tr> object uses the facet codecvt<E, char, Tr::state_type> to convert between E elements internally and sequences of char objects externally.

Just to give you a flavor for how code conversion occurs in real life, here is the essential code from uflow:
for (_State0 = _State,
     _Str->erase(); ; )
    {_E _X, *_D;
    const char *_S;
    int _C = fgetc(_File);
    if (_C == EOF)
        return (_Tr::eof());
    _Str->append(1, (char)_C);
    _State = _State0;
    switch (_Pcvt->in(_State,
        _Str->begin(), _Str->end(), _S,
        &_X, &_X + 1, _D))
    {case codecvt_base::partial:
        break;
    case codecvt_base::noconv:
        if (_Str->size() < sizeof (_E))
            break;
        memcpy(&_X, _Str->begin(), sizeof (_E));
    case codecvt_base::ok:    // can fall through
        return (_Tr::to_int_type(_X));
    default:
        return (_Tr::eof()); }}
I won't describe it in detail. All I'll say is that the loop reads more and more bytes until it has enough to determine an E element, or until it encounters an error.

Similarly, here is the essential code from overflow:
{const int _NC = 8;
const _E _X = _Tr::to_char_type(_C);
const _E *_S;
char *_D;
_Str->assign(_NC, '\0');
for (; ; )
    switch (_Pcvt->out(_State,
        &_X, &_X + 1, _S,
        _Str->begin(), _Str->end(), _D))
    {case codecvt_base::partial:
        _Str->append(_NC, '\0');
    case codecvt_base::ok:    // fall through
        {size_t _N = _D - _Str->begin();
        if (0 < _N && _N !=
            fwrite(_Str->begin(), _N, 1, _File))
            return (_Tr::eof());
        _Writef = true;
        if (_S != &_X)
            return (_C);
        break; }
    case codecvt_base::noconv:
        return (_Fputc(_X, _File) ? _C : _Tr::eof());
    default:
        return (_Tr::eof()); }}
In this case, the loop grows a buffer until it is big enough to hold all the bytes in the conversion for an E element, or until it encounters an error. It then writes out the bytes from a successful conversion.

It is an interesting exercise, once you read the more detailed descriptions below, to work through this code and see how it works. You can also see (an earlier version of) the code in situ in the Microsoft VC++ library.

Having given this very general description, I feel the need to make two important points. First, a typical cin or cout object does not use a basic_filebuf object to mediate input or output. None of this machinery comes into play.

Second, the overwhelming majority of basic_filebuf objects are specialized for elements of type char. In principle, such objects convert each element under the auspices of an object of type codecvt<char, char, mbstate_t>. In practice, however, this particular code conversion is always one-to-one, in both directions. A sensible implementation can and will check once, at construction time, whether such code conversion can be bypassed. It will subsequently do so, yielding a significant performance improvement.

Implementing codecvt

Listing 1 shows one way to write the base class codecvt_base, which serves as the base of all codecvt facets. All the draft C++ Standard requires is that this class define the enumeration result. But I have found it useful to move up a few of the member functions from template class codecvt.

I also show here all the details of implementing the bitmask type result. Since enumerations are now distinct types in C++, an implementation must supply overloads for the various logical operations.

Listing 2 shows the template class codecvt. It performs only trivial conversions as it stands. You have to override its virtual functions to do something interesting.

Listing 3 shows the explicit specialization codecvt<wchar_t, char, mbstate_t>. It is the one nontrivial code converter used within the draft Standard C++ library, by the "wide" iostreams objects.

Finally, Listing 4 shows the template class codecvt_byname. It lets you create a codecvt facet with properties consistent with some named locale "X". As I described briefly last month, in connection with template class ctype_byname, the resulting facet should behave the same as for a locale you establish by calling the Standard C library function setlocale(LC_CTYPE, "X").

Using codecvt

I conclude with the briefest of guides to using a codecvt facet. Most of the member types and functions are fairly obvious. All the action occurs in the protected virtual member functions. Here is a summary of those functions:
virtual bool do_always_noconv() const throw();
The protected virtual member function returns true only if every call to do_in or do_out returns noconv. The template version always returns true.
virtual int do_encoding() const throw();
The protected virtual member function returns:

-1, if the encoding of sequences of type to_type is state dependent

0, if the encoding involves sequences of varying lengths

n, if the encoding involves only sequences of length n
virtual result do_in(State state&,
    const To *first1, const To *last1, const To *next1,
    From *first2, From *last2, From *next2);
The protected virtual member function endeavors to convert the source sequence at [first1, last1) to a destination sequence that it stores within [first2, last2). It always stores in next1 a pointer to the first unconverted element in the source sequence, and it always stores in next2 a pointer to the first unaltered element in the destination sequence.

state must represent the initial conversion state at the beginning of a new source sequence. The function alters its stored value, as needed, to reflect the current state of a successful conversion. Its stored value is otherwise unspecified.

The function returns:

codecvt_base::error if the source sequence is ill formed

codecvt_base::noconv if the function performs no conversion

codecvt_base::ok if the conversion succeeds

codecvt_base::partial if the source is insufficient, or if the destination is not large enough, for the conversion to succeed

The template version always returns noconv.
virtual int do_length(State state&,
    From *first1, const From *last1,
        size_t len2) const throw();
The protected virtual member function effectively calls do_out(state, first1, last1, next1, buf, buf + len2, next2) for some buffer buf and pointer next2, then returns next2 - buf. (Thus, it is roughly analogous to the function mbrlen, declared in <cwchar>, at least when From is type char.)

The template version always returns the lesser of last1 - first1 and len2.
virtual int do_max_length() const throw();
The protected virtual member function returns the largest permissible value that can be returned by do_length(first1, last1, 1), for arbitrary valid values of first1 and last1. (Thus, it is roughly analogous to the macro MB_CUR_MAX, defined in cstdlib>, at least when From is type char.)

The template version always returns 1.
virtual result do_out(State state&,
    const From *first1, const From *last1,
    const From *next1,To *first2,
    To *last2, To *next2);
The protected virtual member function endeavors to convert the source sequence at [first1, last1) to a destination sequence that it stores within [first2, last2). It always stores in next1 a pointer to the first unconverted element in the source sequence, and it always stores in next2 a pointer to the first unaltered element in the destination sequence.

state must represent the initial conversion state at the beginning of a new source sequence. The function alters its stored value, as needed, to reflect the current state of a successful conversion. Its stored value is otherwise unspecified.

The function returns:

codecvt_base::error if the source sequence is ill formed

codecvt_base::noconv if the function performs no conversion

codecvt_base::ok if the conversion succeeds

codecvt_base::partial if the source is insufficient, or if the destination is not large enough, for the conversion to succeed

The template version always returns noconv.
virtual result do_unshift( State state&, To *first2, To *last2, To *next2);
The protected virtual member function endeavors to convert the source element From(0) to a destination sequence that it stores within [first2, last2), except for the terminating element To(0). It always stores in next2 a pointer to the first unaltered element in the destination sequence.

state must represent the initial conversion state at the beginning of a new source sequence. The function alters its stored value, as needed, to reflect the current state of a successful conversion. Typically, converting the source element From(0) leaves the current state in the initial conversion state.

The function returns:

codecvt_base::error if state represents an invalid state

codecvt_base::noconv if the function performs no conversion

codecvt_base::ok if the conversion succeeds

codecvt_base::partial if the destination is not large enough for the conversion to succeed

The template version always returns noconv.

Conclusion

I end with a word of caution. Chances are, you'll never have occasion to construct a codecvt object in anger. If you do, your best bet is to model closely some code that already works, such as the snippets from basic_filebuf shown above, on the same system. Template class codecvt is a pure invention, with no prior art. It was standardized before it was implemented, and rather sketchily at that. Interpretations doubtless vary. Don't be surprised if you find variations among implementations. o

P.J. Plauger is Senior Editor of C/C++ Users Journal and President of Dinkumware, Ltd. He is the author of the Standard C++ Library shipped with Microsoft's Visual C++, v5.0. For eight years, he served as convener of the ISO C standards committee, WG14. He remains active on the C++ committee, J16. His latest books are The Draft Standard C++ Library, Programming on Purpose (three volumes), and Standard C (with Jim Brodie), all published by Prentice-Hall. You can reach him at pjp@plauger.com.