C/C++ Contributing Editors


Standard C/C++: Last of the Facets

P. J. Plauger

Programming language support for internationalization has its limits, no matter how ambitious the library that comes with the compiler.


Introduction

We're coming down to the wire. For the past year, I've been describing the extensive facilities in the Standard C++ library for adapting code to different cultures. (See "Standard C/C++: Introduction to Locales," CUJ, October 1997, and every monthly installment since.) These facilities are basically an elaboration of the locale machinery built into Standard C for the past decade or so. The goal, in both C and C++ libraries, is to assist programmers in writing programs for an international marketplace. The number of programmers in search of such an assist is still quite small, but appears to be growing rapidly. If you are not (yet) among that number, you will probably be pleased to learn that this is the last installment in my year-long saga.

Bear with me while I review the journey. It began back in the mid 1980s, about the time that the ANSI C standards committee, then called X3J11, was wrapping up the specifications for Standard C. Or so we thought. A contingent of Europeans mentioned at one meeting, almost in passing, that they planned to press for changes in the ANSI C Standard before it was adopted by ISO as an international standard. Seems the Europeans felt that the existing draft was way too insular. It was full of presumptions that are peculiar to the American culture, with no easy way to work around them.

As an extreme example, the primary function for printing out a human-readable date is called asctime, declared in the header <time.h>. The asc prefix comes from ASCII, which is short for American Standard Code for Information Interchange. So even the name subtly begins with an American presumption. But the name is a minor matter compared to the behavior of the function. It invariably prints a date that looks like:

Fri Jul  4 10:30:00 1986

The weekday and month names are in (abbreviated) English, and the order of the fields is peculiar to American custom. Not even the Canadians, our closest cultural allies on the planet, appreciate this form in all contexts.

A C programmer circa 1986 had little help in printing a date in, say, French. You would have to start with the "broken down" time in a struct tm object and piece together the text form of the date yourself. No big deal. You need an array of weekday names, an array of month names, and a bit of facility in using sprintf. But it's a tedious job that you might have to do repeatedly, particularly if you program for a bank in Montreal. It's even more tedious if you have to deal in multiple languages, switching tables of names on the fly.
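The do-it-yourself approach just described might look something like the following minimal sketch. The helper name french_date and the two tables are illustrative assumptions, not part of any library, and the accents are left off the French names to keep the sketch encoding-neutral.

```cpp
#include <cassert>
#include <cstdio>
#include <ctime>
#include <string>

// French names, indexed like tm_wday (0 = Sunday) and tm_mon (0 = January).
// Accents are deliberately omitted in this sketch.
static const char *const kFrenchDays[7] = {
    "dimanche", "lundi", "mardi", "mercredi",
    "jeudi", "vendredi", "samedi"};
static const char *const kFrenchMonths[12] = {
    "janvier", "fevrier", "mars", "avril", "mai", "juin",
    "juillet", "aout", "septembre", "octobre", "novembre", "decembre"};

// Hypothetical helper: compose "weekday day month year" from a
// broken-down time, using the tables above.
std::string french_date(const std::tm &t)
{
    assert(0 <= t.tm_wday && t.tm_wday < 7);
    assert(0 <= t.tm_mon && t.tm_mon < 12);
    char buf[64];
    std::snprintf(buf, sizeof buf, "%s %d %s %d",
        kFrenchDays[t.tm_wday], t.tm_mday,
        kFrenchMonths[t.tm_mon], t.tm_year + 1900);
    return buf;
}
```

Given a struct tm describing Friday, July 4, 1986, french_date yields "vendredi 4 juillet 1986". Supporting a second language means a second pair of tables and some way to choose between them, which is exactly the tedium at issue.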

Several good reasons exist for adding a function to a standard library. The function may be hard to get right, or so simple it's easy to get wrong, or so heavily used it's important to optimize to a fare-thee-well. Or it can be a function that is likely to be of widespread utility. That was the case made for adding an alternative to asctime to the Standard C library — a function that helped you compose dates better tailored to different cultures around the world. The added function is called strftime, by the way. It is also declared in <time.h> along with "classic" asctime.
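For comparison, here is a small sketch of strftime at work. The wrapper format_date and its buffer size are assumptions made for illustration; only strftime itself comes from the library.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical wrapper: format a broken-down time under the control of
// a strftime format string. Returns an empty string if the result does
// not fit in the local buffer.
std::string format_date(const std::tm &t, const char *fmt)
{
    char buf[128];
    std::size_t n = std::strftime(buf, sizeof buf, fmt, &t);
    return std::string(buf, n);
}
```

In the "C" locale, calling format_date with the format "%A %d %B %Y" on a struct tm for July 4, 1986 produces "Friday 04 July 1986". The caller now chooses the order of the fields, and the names come from the current locale rather than from tables wired into the program.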

Adding Locales

But the detailed behavior of this new function is not germane to this essay. (If your stack of old magazines goes way back, see my column "Standard C" in the January through March 1993 issues of CUJ for more on this topic.) The point I want to make here is that X3J11 chose to respond to the European complaints. None of us liked the prospect of having ISO immediately rework the C Standard we'd already invested years in refining. If the addition of just a few more functions to the Standard C library would eliminate the perceived need for a different international standard, we were all for it.

X3J11 added a bit more than the function strftime, as part of this effort. The biggest single chunk was the header <locale.h>. It introduces the concept of a "locale," or collection of culture-specific information. By altering the current locale, you can alter the behavior of several of the functions in the Standard C library. An obvious example is strftime, which you might want to persuade to generate dates in French. But you might also want to convince strtod to accept a comma instead of a period as the decimal point, since many European nations favor that notation.

Support for locales is pretty sparing in Standard C. To their credit, the Europeans who lobbied for this addition were sensitive to the needs of others. They didn't want to require that a compiler for an eight-bit microcontroller have extensive support for locale switching. Equally, they wanted to assure programmers who had no interest in locales that they would incur no overheads in a program that doesn't use them.

Thus, locales are rather like named files in C — rules exist for naming, reading, and writing files, but an implementation doesn't have to succeed if a program tries to do anything with a file of some specific name. Similarly, the C Standard imposes very few rules on how to name locales, and says nothing about what locales have to be present — other than the default "C" locale in effect at program startup.
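Because only the "C" locale is guaranteed to exist, a careful program checks the result of setlocale and falls back when a name is not recognized. The following is a sketch under that assumption; the helper names select_locale and parse_number are hypothetical.

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <string>

// Try to switch the global C locale to `name`. If the implementation
// does not recognize the name, fall back to "C", the one locale the
// C Standard requires. Returns the name of the locale now in effect.
std::string select_locale(const char *name)
{
    const char *got = std::setlocale(LC_ALL, name);
    if (got == nullptr)
        got = std::setlocale(LC_ALL, "C");
    assert(got != nullptr);   // the "C" locale must be available
    return got;
}

// Parse a number using whatever decimal point the current locale
// specifies (a period in the "C" locale).
double parse_number(const char *text)
{
    return std::strtod(text, nullptr);
}
```

Which names succeed, if any besides "C", depends entirely on the implementation. Where a locale with a comma for the decimal point happens to be installed and selected, parse_number would accept "3,5" without any change to the calling code.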

Moreover, the Standard C library supports only one locale at a time, which globally affects all locale-sensitive functions. For a sophisticated application that wants to switch frequently among multiple locales, this is an annoying oversimplification. It isn't even good design style, violating fundamental principles of encapsulation as it does. But it was enough to keep many people happy in the late 1980s. And to this day, most C (and C++) programmers I deal with are quite content to ignore even what's there in the way of locale machinery.

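Locales at their best have only a modest impact on the Standard C library. A locale can be split into a handful of "categories." The required categories, as identified by their macro names in <locale.h>, are:

LC_COLLATE, which affects the behavior of strcoll and strxfrm
LC_CTYPE, which affects the character-classification functions in <ctype.h> and the multibyte functions
LC_MONETARY, which affects the monetary formatting information returned by localeconv
LC_NUMERIC, which affects the decimal point used by the formatted I/O and string-conversion functions
LC_TIME, which affects the behavior of strftime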

These are handy services, to be sure, but they hardly address all the needs of a programmer bent on writing code that adapts to all the cultures of the world. Study any nontrivial program you use on a regular basis. The good ones solve most problems of internationalization by avoiding them. You should never display words if a widely accepted symbol will do the job. Think of all those road signs in Europe. Even we Americans know that a red octagon means "stop" and an inverted yellow triangle means "yield," with no need for supporting text.

The locale support provided by Standard C comes into its own in a handful of places. A checkbook manager should not insist on displaying a dollar sign and two places after the decimal point, not if it wants to sell in Italy or Japan. Dates and monetary amounts are both widely used in displays. They are hard to avoid and very culture dependent. Here's where determined use of locale support, and a rich set of locales to choose among, can really help a software package.

Adding Messages

But problems remain. What if you really need to display some nontrivial message, something too sophisticated to approximate with those funny little international symbols? Look at those programs you use a lot once more and you'll see what I mean. Even DOS displays an occasional "Abort, Retry, Fail." Beyond a certain point, you have to deal with language dependencies.

The Posix people picked up where the C Standard left off. They added yet another locale category, called LC_MESSAGES. It controls the behavior of a handful of functions that let you read messages from a "message catalog." The idea is that you, the programmer, reduce the number of text messages in your program to an irreducible minimum. You place all these messages in a named catalog, which is available to the program when it runs.

The running program looks up each message by its index number, rather than using a message wired into the program. If you want your program to speak French one fine day, just translate the catalog, message by message, into French and run with that one instead. You can even use the setlocale machinery of Standard C to choose on the fly one of several different versions of a message catalog. You have a multilingual program.
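On a Posix system, the lookup by index might be sketched with catopen, catgets, and catclose from the header <nl_types.h>. The wrapper lookup_message is a hypothetical name, and the sketch assumes a Posix environment; it is not something Standard C or C++ provides.

```cpp
#include <nl_types.h>
#include <string>

// Hypothetical wrapper: fetch message number `msg` of set `set` from
// the named catalog. If the catalog cannot be opened, or the message
// is missing, return the default text `dflt` instead.
std::string lookup_message(const char *catalog, int set, int msg,
    const char *dflt)
{
    // NL_CAT_LOCALE asks catopen to honor the messages category
    // of the current locale when it searches for the catalog.
    nl_catd cat = catopen(catalog, NL_CAT_LOCALE);
    if (cat == (nl_catd)-1)
        return dflt;
    std::string text = catgets(cat, set, msg, dflt);
    catclose(cat);
    return text;
}
```

The default text doubles as a safety net. A program that cannot find its catalog still says something sensible, if only in the language of the programmer who wrote it.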

I've made the process sound much easier than it usually is in practice. Somebody has to translate every catalog of messages into every language you want to support. Each translation involves special skills that you have to go out and pay for. (How's your Estonian?) And translating brief messages is a peculiar skill that not all translators have mastered. You have to keep the lengths of all versions within certain bounds, or you risk messing up screen and report formats. Variable information appears at different places, probably in different orders, thanks to the rich variety of grammars that human languages exhibit.

And you can still fail to bridge a cultural gap, having done all of the above. Europeans are often put off by American jocularity and looseness. A dancing paper clip may be accepted as cute and reassuring by (some) Americans, but be taken as sophomoric in cultures that are more staid. If you've ever tried to get by in a foreign land with just a phrase book and determination, you know what I mean. There's more to communication than just the words.

I've pointed out all these obstacles primarily to keep matters in perspective. The C Standard added locale support for good and proper reasons. It added no more than it did also for good and proper reasons. Other organizations, such as Posix, have added to this machinery, also for good and proper reasons. Taken all together, the various components of locale support still solve only some of the many problems a programmer faces when preparing code for the international marketplace. Nevertheless, locale support is still a useful aid in an important and growing corner of the programming profession.

Locales In C++

The C++ Standard endeavors to do its share in supporting international programming. As I have described at length over the past year, it adds to the Standard C++ library some very sophisticated machinery for managing locales. For all its complexity, however, that machinery sits atop and depends on the locales of Standard C and, to a lesser extent, Posix. If you can't do it in C, you can't do it in C++. On the other hand, if you can do it in C, you can do it with much finer control in C++.

To summarize one last time: the Standard C++ library introduces the concept of a "locale object," which is an object of class locale, declared in the header <locale> (not to be confused with the Standard C header <locale.h>). Each locale object encapsulates a collection of culture-specific information just like a C locale. But you can have many locale objects alive at the same time in a C++ program, all usable independent of the single, global C locale that influences the behavior of the Standard C library.
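A small sketch may help show several locale objects coexisting. To avoid depending on any named locale, it builds one by replacing a single facet in the classic "C" locale. The class comma_punct and the helper format_with are hypothetical names invented for this illustration.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: behaves like the default numpunct<char>, except
// that it reports a comma as the decimal point.
class comma_punct : public std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Format `value` on a stream imbued with `loc`. The global C locale
// is neither consulted nor changed.
std::string format_with(const std::locale &loc, double value)
{
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}
```

With std::locale comma(std::locale::classic(), new comma_punct), the call format_with(comma, 3.5) yields "3,5" while format_with(std::locale::classic(), 3.5) still yields "3.5". Both locale objects are alive at once, and neither call disturbs setlocale.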

A locale object is essentially an immutable set of pointers to locale "facets." A facet is an object based on class locale::facet, with one or two additional behavioral constraints that I won't repeat here. The Standard C++ library itself defines over two dozen different facets, each of which supplies member functions that perform some set of locale-specific operations for you when you call it. For example, the facet ctype<char> is a specialization of template class ctype. (It is, in fact, an explicit specialization, but that's by the by.) It supplies a member function that tests whether a char object is, say, an upper-case letter. It also supplies a member function that converts an upper-case letter to its lower-case equivalent.
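As a brief illustration, the two ctype<char> operations just mentioned can be reached through use_facet. The wrapper names is_upper_in and to_lower_in are assumptions for this sketch; the facet members is and tolower are the library's own.

```cpp
#include <locale>

// Ask the ctype<char> facet of `loc` whether `c` is an upper-case letter.
bool is_upper_in(const std::locale &loc, char c)
{
    return std::use_facet<std::ctype<char> >(loc).is(
        std::ctype_base::upper, c);
}

// Convert `c` to lower case as that same facet sees fit. Characters
// with no lower-case equivalent come back unchanged.
char to_lower_in(const std::locale &loc, char c)
{
    return std::use_facet<std::ctype<char> >(loc).tolower(c);
}
```

In the classic locale, is_upper_in reports true for 'A' and false for 'a', and to_lower_in maps 'A' to 'a'. Pass a different locale object and the answers can change without touching the calling code.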

The default locale contains pointers to the 26 facets defined by the Standard C++ library. These facets are organized into categories just as in C. Specifically:

Category collate includes the facets:

collate<char>
collate<wchar_t>

Category ctype includes the facets:

ctype<char>
ctype<wchar_t>
codecvt<char, char, mbstate_t>
codecvt<wchar_t, char, mbstate_t>

Category monetary includes the facets:

moneypunct<char, false>
moneypunct<wchar_t, false>
moneypunct<char, true>
moneypunct<wchar_t, true>
money_get<char,
    istreambuf_iterator<char> >
money_get<wchar_t,
    istreambuf_iterator<wchar_t> >
money_put<char,
    ostreambuf_iterator<char> >
money_put<wchar_t,
    ostreambuf_iterator<wchar_t> >

Category numeric includes the facets:

num_get<char,
    istreambuf_iterator<char> >
num_get<wchar_t,
    istreambuf_iterator<wchar_t> >
num_put<char,
    ostreambuf_iterator<char> >
num_put<wchar_t,
    ostreambuf_iterator<wchar_t> >
numpunct<char>
numpunct<wchar_t>

Category time includes the facets:

time_get<char,
    istreambuf_iterator<char> >
time_get<wchar_t,
    istreambuf_iterator<wchar_t> >
time_put<char,
    ostreambuf_iterator<char> >
time_put<wchar_t,
    ostreambuf_iterator<wchar_t> >

And finally, category messages includes the facets:

messages<char>
messages<wchar_t>

Over the past year, I've described all but this last set of facets. Here, now, is the last of the story.

Template Class messages

Template class messages (yes, it should be message to match the naming rules of C) describes a family of facets that maintain message catalogs like those specified in the Posix Standard. The template parameter type specifies the representation of elements of the message. Thus messages<char> traffics in traditional single-byte text, while messages<wchar_t> traffics in wide characters.

Listing 1 shows the bare minimum implementation of this template class, along with the usual suspects that accompany most facets in the Standard C++ library. This particular version fails all the time, which is acceptable behavior on an implementation that doesn't support Posix message catalogs. On a richer implementation, this code calls on the Posix machinery to do all the hard work.

As usual, template class messages has a public interface that does nothing but call on protected virtuals to do the actual work. That way, you can derive your own class from a specialization of this template, then change just the virtual member functions you want to tinker with. Everything else, including the public interface, remains unchanged.
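From the caller's side, the public interface consists of the member functions open, get, and close. The following sketch shows one plausible way to use them; the helper fetch_message is a hypothetical name, and what catalogs actually exist is up to the implementation.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: look up message `msg` of set `set` in the
// catalog `name`, using the messages<char> facet of `loc`. Returns
// `dflt` if the catalog cannot be opened or the message is absent.
std::string fetch_message(const std::locale &loc, const std::string &name,
    int set, int msg, const std::string &dflt)
{
    typedef std::messages<char> Msgs;
    const Msgs &facet = std::use_facet<Msgs>(loc);
    Msgs::catalog cat = facet.open(name, loc);
    if (cat < 0)
        return dflt;   // no open catalog to read from
    std::string text = facet.get(cat, set, msg, dflt);
    facet.close(cat);
    return text;
}
```

Because the helper calls only the public members, it works unchanged with any class derived from messages<char> that a locale object happens to carry.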

Here is a brief description of the protected virtuals:

messages::do_open

virtual catalog
do_open(const string& name,
    const locale& loc) const;

The protected member function endeavors to open a message catalog whose name is name. It may make use of the locale loc in doing so. It returns a value that compares less than zero on failure. Otherwise, the returned value can be used as the first argument on a later call to get. It should in any case be used as the argument on a later call to close.

messages::do_get

virtual string_type
do_get(catalog cat, int set, int msg,
    const string_type& dflt) const;

The protected member function endeavors to obtain a message sequence from the message catalog cat. It may make use of set, msg, and dflt in doing so. It returns a copy of dflt on failure. Otherwise, it returns a copy of the specified message sequence.

messages::do_close

virtual void do_close(catalog cat) const;

The protected member function closes the message catalog cat, which must have been opened by an earlier call to do_open.

This specification leaves a lot to the imagination. I believe that was more or less intentional. I do know from experience that it is sufficient to tie into a working implementation of Posix message catalogs, so the spec is probably sufficient.
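To make the three virtuals concrete, here is a minimal sketch of a derived facet that serves a single catalog from memory rather than calling on Posix. Everything about it is an assumption made for illustration: the class name memory_messages, the catalog name "demo", the catalog value 0, and the one sample message.

```cpp
#include <cstddef>
#include <locale>
#include <map>
#include <string>
#include <utility>

// Hypothetical facet: one built-in catalog, named "demo", kept in a map
// keyed by (set, msg). Only the protected virtuals are overridden.
class memory_messages : public std::messages<char> {
public:
    explicit memory_messages(std::size_t refs = 0)
        : std::messages<char>(refs)
    {
        table_[std::make_pair(1, 1)] = "Bonjour, le monde";   // sample data
    }

protected:
    catalog do_open(const std::string &name,
        const std::locale &) const override
    {
        return name == "demo" ? 0 : -1;   // negative signals failure
    }

    string_type do_get(catalog cat, int set, int msg,
        const string_type &dflt) const override
    {
        if (cat != 0)
            return dflt;
        auto it = table_.find(std::make_pair(set, msg));
        return it == table_.end() ? dflt : it->second;
    }

    void do_close(catalog) const override {}   // nothing to release

private:
    std::map<std::pair<int, int>, std::string> table_;
};
```

Install the facet with std::locale loc(std::locale::classic(), new memory_messages) and obtain it with use_facet<messages<char> >(loc). Then open("demo", loc) returns a usable catalog, get(cat, 1, 1, "Hello, world") returns the French text, and a request for any other message returns the default.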

Conclusion

My comments about locale support in both C and C++ border on the cynical at times. To some extent, that reflects a degree of disillusionment on my part. I championed this stuff for years, in the interest of amity among nations. I've implemented it, in excruciating detail, in both my Standard C and Standard C++ libraries. (See my book, The Standard C Library, Prentice-Hall, 1992. Also study the headers shipped with Microsoft Visual C++ V4.2 and V5.0, particularly the ones with "loc" somewhere in their names.) Until very recently, hardly anyone seemed to care.

More recently, however, I am seeing ever-growing interest in this code. Others are writing articles on locale objects. Intelligent questions are popping up on the newsgroups, such as comp.lang.c++ and comp.std.c++. I get a steady stream of email, also asking ever more intelligent questions. I'd rather it were easier to avoid the overheads imposed on all C++ programs by the presence of locale objects. But I'm happy to have written the code. And I'm mostly glad the facilities are now part of Standard C++.

P.J. Plauger is Senior Editor of C/C++ Users Journal and President of Dinkumware, Ltd. He is the author of the Standard C++ Library shipped with Microsoft's Visual C++, v5.0. For eight years, he served as convener of the ISO C standards committee, WG14. He remains active on the C++ committee, J16. His latest books are The Draft Standard C++ Library, Programming on Purpose (three volumes), and Standard C (with Jim Brodie), all published by Prentice-Hall. You can reach him at pjp@plauger.com.