October 1997/Standard C/C++

Standard C/C++

P. J. Plauger

Introduction to Locales

If you thought locales were an esoteric aspect of the Standard C library, wait till you see what the draft C++ Standard has done with locale objects.

Introduction

The concept of a "locale" was introduced into C just over a decade ago, as part of the standardization process. Put simply, a locale is a collection of information about how to tailor certain common operations to meet the needs of a given culture. How to write dates and currency amounts are two of the more obvious components of a locale. The culture can be keyed to a nationality, such as Swiss, to a language, such as French, or to a profession, such as accounting. It can even be some intersection of all of the above, such as a locale tailored for Francophone accountants in Switzerland.

The idea is to make programs more widely useful by first internationalizing them. Eliminate culture-specific messages and conventions where possible, then make the code adaptable to different cultures where necessary. You can then localize an internationalized program by supplying a specific locale, for use by the adaptive code. The binding between locale and code might occur when the software goes into the box, ready for dealers' shelves in Geneva. Or it might be localized at run time when it is accessed on a computer running in Zurich from a client's site in Grenoble.

As you might guess from this extended example, the European members of ISO were the keenest advocates of locales. They live with mixed languages much more intimately than most Americans. ANSI X3J11, the committee that developed the C Standard, consisted predominately of Americans. Nevertheless, we were all keen to make the Europeans happy. The not-so-subtle threat was that the ISO C Standard would otherwise differ from the ANSI C Standard. After all the work we had put in, none of us wanted that.

Locales in C

The Europeans were gracious in their victory. They got together with a number of their American counterparts, at a meeting hosted by AT&T Bell Labs, and hammered out a minimalist specification for the machinery to be added. No changes were required to the C language proper. The Standard C library got a new header, called <locale.h>, which declares the functions localeconv and setlocale. About the only effect on the remainder of the library was the character to use for a decimal point, and whether the space character has any esoteric companions. Old standbys such as atof and printf care about such matters. So uniform was the support for locales within X3J11 that the addition was voted into the draft C Standard with a minimum of debate.

What made the selling job even easier was a simple promise. A program that cares not a whit about internationalization or localization can ignore locales completely. The committee defined a "C" locale, with all the library properties that C programmers had come to expect over the years. Every program begins executing in the "C" locale. Absent an explicit call to setlocale, nothing changes. Programmers can pretend locales don't exist, if they choose. And to date, the vast majority of programmers still do, as far as I can see.

Compiler vendors have a similar freedom. The minimum support for locales required by the C Standard is pretty minimal. The only other required locale besides the one called "C" is the native locale, called "". It can be the same as the "C" locale. A call to setlocale using any other name is entitled to fail. To this day, a typical C compiler targeted for programmers of embedded systems has just this minimum support for locales and nothing more.
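
As a small illustration of that minimum, the sketch below calls the two <locale.h> functions through their C++ header <clocale>. The helper names current_decimal_point, try_locale, and demo_minimum_locale_support are hypothetical, chosen only for this example. It assumes nothing beyond the "C" locale being available.

```cpp
#include <cassert>
#include <clocale>
#include <cstring>

// Returns the decimal-point string of whatever locale is in effect.
const char* current_decimal_point()
{
    return std::localeconv()->decimal_point;
}

// Tries to switch all categories to the named locale.
// setlocale returns a null pointer when the request cannot be honored.
bool try_locale(const char* name)
{
    return std::setlocale(LC_ALL, name) != 0;
}

// Hypothetical walk-through of the minimum guarantees.
void demo_minimum_locale_support()
{
    // The "C" locale is always available, with "." as its decimal point.
    assert(try_locale("C"));
    assert(std::strcmp(current_decimal_point(), ".") == 0);

    // The native locale "" may or may not differ from "C", and a
    // misconfigured host can make even this request fail in practice,
    // so nothing is asserted about it here.
    if (try_locale("")) {
        const char* native_point = current_decimal_point();
        (void)native_point;     // "." or "," or whatever the host says
    }

    // Restore the default so later code sees the familiar behavior.
    try_locale("C");
}
```

A program that never calls try_locale stays in the "C" locale throughout, which is exactly the promise described above.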

It's easy to vote to add something to a standard if you're allowed to ignore its presence afterward.

I should emphasize that real support for locales does indeed exist in the C marketplace. Companies like Microsoft, IBM, Hewlett-Packard — companies that sell compilers used internationally to develop hosted applications — typically provide locales tailored to their major marketplaces. Posix and X/Open standards flesh out more details, and extend the information captured in a locale. The people who care about internationalizing code really care, and the machinery added to Standard C is genuinely useful.

In fact, a small but vocal minority feels that the C Standard doesn't go nearly far enough. A locale in C is a global entity — one locale applies to the entire library. That's fine for a program that adapts once to a given locale and stays that way. It's not so fine for a server that may have to support a hundred distinct locales for as many connected terminals. A program can do a lot of global locale switching in such a situation.

Some of the supplemental standards I mentioned above provide for multiple locales in a single program. Each call to a locale-sensitive function has an added handle argument to specify which locale to use for that call. Without being too hifalutin about it, you can say that these standards have introduced simple locale objects into C programming.

Locales in C++

C++ is, of course, much more aggressively object oriented than C. The folks standardizing C++ are determined to promote principles of object-oriented design, even as they (generally) accept the need to retain backward compatibility with Standard C. You can imagine how reluctant these people were to base locale support in Standard C++ on a mechanism so primitive as a global locale in C.

Fool that I am, I nevertheless tried to convince ISO WG21 and ANSI J16 to keep things simple. Locales are little used, said I. Why add still more complex machinery when the existing stuff is mostly ignored? My particular concern was unavoidable library overheads. It's altogether too easy to mandate untried designs that pull in lots of code because it might be used, and locales are intimately intertwined with even the simplest forms of input and output. Let's play it safe, said I.

As you might guess, my viewpoint did not prevail. Instead, the committee went on to adopt an extremely ambitious, and inventive, mechanism for manipulating locales in the Standard C++ library. It provides for multiple locale objects in a single program, naturally enough. It also lets you extend locales in a remarkably open-ended way. And it stretches the C++ language to the extreme, using features voted into the draft that are only recently being implemented in commercial compilers. Indeed, I can't imagine a program so sophisticated in its use of locales that it would find the adopted machinery wanting in any way.

On the other hand, my worries about overheads seem to be well founded. Consider the classic minimalist program that simply prints "Hello, world," then exits. In the early days of Unix, you could write this program in a few lines of assembly code and get a program of a few dozen bytes for your efforts. Moving to C, and using a simple system-interface library function such as write, the program grows to hundreds of bytes. Do the same job by calling printf and you're in the low kilobytes.

The canonical way to print text in a C++ program is to insert a text string into the output stream cout. Early C++ libraries that I've used generate executables for "Hello, world" in the range of 10,000 to 20,000 bytes. My first cut at writing the full draft Standard C++ library, locale objects and all, puffed the same program up to 250,000 bytes. I felt pretty bad about that until I learned that my principal competitor was well behind me in writing the locale-handling code. When they finally got something working, "Hello, world" came out at 1.5 megabytes!

By that time, I had managed to whittle down my library to a point where "Hello, world" took a mere 50,000 bytes. (See my column, "State of the Art: Too Much of a Good Thing," Embedded Systems Programming, July 1996.) I still think this is horrendously large, even though my customers are now happy enough. But I can't imagine where to trim anything else, other than by switching to Embedded C++, given the current requirements of the draft C++ Standard. That's what happens when you standardize first and get field experience afterward.

Perhaps you can understand now why I have a love/hate relationship with the locale machinery in the draft Standard C++ library. On the one hand, it has presented a real technical challenge. I had to write quite a lot of code, debugging and fleshing out the sketchiest of specifications in the process. I got it working, validated, and tuned up pretty nicely. So I'm proud of both the process and the result. On the other hand, the mandated machinery clashes with my design aesthetics in a number of ways. I particularly dislike inflicting space or time overheads on programmers who never intend to use but a fraction of the power provided. It is certainly not what we used to call "the spirit of C."

The header <locale> nominally declares all the facilities provided for manipulating locale objects in the draft Standard C++ library. It is the last of the headers I characterize as "miscellaneous." These in turn are the last of the headers I describe as part of my long-running series on that library. But even as the last of the last, <locale> is far from the least in complexity. It will take some explaining to cover its extensive features. Fortunately, CUJ recently ran an overview of locales in C++. (See Angelika Langer and Klaus Kreft, "Internationalization Using Standard C++," CUJ, September 1997.) Reviewing that article will help you put some of this stuff in perspective.

Class locale

A locale object is, as you might expect, an object of class locale. Listing 1 shows one way to declare this class in the header <locale>. If you don't see much resembling cultural conventions in this code, don't worry. A locale object is really just a package for conveying pointers to such information. It doesn't do any of the work itself.

Locales have many aspects. My goal this month is just to describe the basic machinery of locale objects. I don't intend to explain everything in Listing 1 in this installment. I nevertheless supply the full class declaration, as usual, for completeness.

An object that conveys actual culture-specific information is called a facet. The nested class locale::facet is the base class for all facets. The draft Standard C++ library defines about two dozen facets. They carve up the locale categories of the Standard C library — such as LC_CTYPE and LC_NUMERIC — into finer slices. For example, the category LC_CTYPE is represented by four facets:
ctype<char>    ctype<wchar_t>
codecvt<wchar_t, char, mbstate_t>
codecvt<char, char, mbstate_t>
and the category LC_NUMERIC is represented by six facets:
numpunct<char>   numpunct<wchar_t>
num_get<char>    num_get<wchar_t>
num_put<char>    num_put<wchar_t>
You can add as many more types of facets as your heart desires, provided you follow a few simple coding rules, described below. The facets you define are not associated with any locale category, unless they are based on facets defined in the Standard C++ library. The implications of this restriction are minor. I'll explore them more in future installments, along with the behavior of the facets mentioned above.

You cannot declare an object of class locale::facet — it has only a protected constructor. You can declare an object of a class based on locale::facet, but you should create such an object only with operator new — locale objects maintain a reference count for each facet and feel obliged to delete a facet when its reference count decrements to zero. Class locale::facet supplies no copy constructor or assignment operator, to discourage any other uses of facets.

Effectively, a locale object is a list of pointers to facets, but the programmer can't see much of that representation. All the facets associated with a locale object are determined when that object is constructed. Locale objects are thereafter immutable until destroyed. In between, you can obtain only const references to the facets inside a locale object. Two template functions let you peer inside a locale object:
template<class Facet>
    bool has_facet(const locale& loc) throw();
lets you write an expression like:
has_facet< ctype<char> >(loc)
to determine whether the facet ctype<char> is represented inside the locale loc. And:
template<class Facet>
    const Facet& use_facet(const locale& loc);
lets you write a declaration like:
const ctype<char>& fac =
    use_facet< ctype<char> >(loc);
If the locale loc does not store a pointer to a facet of type ctype<char>, the function throws an object of class bad_cast.
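
The two template functions can be exercised as in the following sketch, written against the library as it was eventually standardized (namespace std, explicit template arguments). The names upper_in, Unused_facet, lookup_throws_bad_cast, and demo_facet_lookup are hypothetical. Unused_facet exists only so that there is a facet type the classic locale is certain not to contain; its static id member is the facet requirement the column covers under "Indexing Facets."

```cpp
#include <cassert>
#include <locale>
#include <typeinfo>

// Uppercases one character using the ctype<char> facet of a locale.
char upper_in(const std::locale& loc, char ch)
{
    if (!std::has_facet< std::ctype<char> >(loc))
        return ch;                  // facet absent: leave the character alone
    const std::ctype<char>& fac = std::use_facet< std::ctype<char> >(loc);
    return fac.toupper(ch);
}

// Hypothetical facet type that is never installed in any locale here.
struct Unused_facet : std::locale::facet {
    static std::locale::id id;
};
std::locale::id Unused_facet::id;

// Returns true if use_facet reports the missing facet by throwing bad_cast.
bool lookup_throws_bad_cast(const std::locale& loc)
{
    try {
        (void)std::use_facet<Unused_facet>(loc);
    } catch (const std::bad_cast&) {
        return true;
    }
    return false;
}

void demo_facet_lookup()
{
    const std::locale& loc = std::locale::classic();
    assert(upper_in(loc, 'a') == 'A');
    assert(!std::has_facet<Unused_facet>(loc));
    assert(lookup_throws_bad_cast(loc));
}
```

Checking has_facet first avoids the exception path when a missing facet is an expected case rather than an error.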

By the way, this notation for calling functions is fairly new. You need it when calling a template function whose arguments don't supply the type information needed to determine the template parameters. Only a few compilers currently support this notation. Existing implementations of locale objects need some sort of workaround, as I will describe in future installments.

Locale objects are also intended to be lightweight. A number of functions in the iostreams classes return locale objects by value. These functions are called often during formatted input and output. For example, the extractors in template class basic_istream, the modern descendant of the old iostreams standby istream, are obliged to skip white space according to locale-dependent rules. These rules are encapsulated in a facet of type ctype<char>, which is obtained from the locale object "imbued" in the basic_istream object. So practically every extraction involves a call to ios_base::getloc() to obtain a copy of the imbued locale object, as in:
const ctype<char>& fac =
    use_facet< ctype<char> >(ios_base::getloc());
Copying an array containing dozens of pointers is generally considered to be an unacceptable overhead, at least for an operation that occurs frequently. Thus, a locale object is best implemented as a handle — a pointer to a separate "implementation" object that stores the actual array of pointers. In the code in Listing 1, the nested class locale::_Locimp describes the implementation object. The sole member object in a locale object is a pointer _Ptr to an implementation object.
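
To see that extractors really do consult the imbued locale, consider the sketch below. It imbues a string stream with a locale whose ctype<char> facet also classifies a comma as white space. The class Comma_is_space and the functions second_field and demo_imbued_ctype are hypothetical names; the sketch assumes the table-driven ctype<char> specialization as it appears in the final standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: a ctype<char> whose table also marks ',' as white space.
class Comma_is_space : public std::ctype<char> {
public:
    explicit Comma_is_space(std::size_t refs = 0)
        : std::ctype<char>(make_table(), false, refs) {}
private:
    static const mask* make_table()
    {
        // Start from the classic table, then add one classification bit.
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        table[static_cast<unsigned char>(',')] |= space;
        return &table[0];
    }
};

// Extracts two fields and returns the second. operator>> skips white space
// by asking the ctype<char> facet of the stream's imbued locale.
std::string second_field(const std::string& text, bool commas_separate)
{
    std::istringstream in(text);
    if (commas_separate)
        in.imbue(std::locale(in.getloc(), new Comma_is_space));
    std::string first, second;
    in >> first >> second;
    return second;
}

void demo_imbued_ctype()
{
    assert(second_field("alpha,beta", true) == "beta");
    assert(second_field("alpha,beta", false).empty());
    assert(second_field("alpha beta", false) == "beta");
}
```

The facet is created with operator new and handed to a locale constructor, after which the locale owns it. Nothing in second_field deletes it.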

An object of class locale::_Locimp is also reference counted. (The base class locale::facet is used here solely for its ability to maintain a reference count.) Constructing a copy of a locale object merely involves copying the handle pointer and incrementing the reference count of the locale::_Locimp object that it designates. Assigning a new value involves decrementing the reference count for the existing implementation object and incrementing the reference count for the new implementation object. Destroying a locale object decrements the reference count. If, as a result of any decrement, the count goes to zero, the implementation object is deleted. For example, the destructor ~locale contains the interesting code:
~locale() throw()
    {if (_Ptr != 0)
        delete _Ptr->_Decref(); }
The member function locale::facet::_Decref returns the this pointer for its locale::facet object only when the reference count decrements to zero. Otherwise it returns a null pointer. I prefer this approach to having _Decref invoke the rather horrid expression delete this, an idiom of dubious safety.
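
The handle idiom just described can be sketched in isolation. The classes Impl and Handle below are hypothetical stand-ins for locale::_Locimp and locale, not the code from Listing 1; they keep only the counting logic, and they assume single-threaded use.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference-counted implementation object.
class Impl {
public:
    Impl() : refs_(1) {}
    void incref() { ++refs_; }
    // Returns this only when the count drops to zero, otherwise a null
    // pointer, so that the caller can write: delete p->decref();
    Impl* decref() { return --refs_ == 0 ? this : 0; }
    std::size_t refs() const { return refs_; }
private:
    std::size_t refs_;
};

// Hypothetical handle: its sole member is a pointer to the shared Impl.
class Handle {
public:
    Handle() : ptr_(new Impl) {}
    Handle(const Handle& other) : ptr_(other.ptr_) { ptr_->incref(); }
    Handle& operator=(const Handle& other)
    {
        other.ptr_->incref();       // increment first: safe for self-assignment
        delete ptr_->decref();      // deleting a null pointer does nothing
        ptr_ = other.ptr_;
        return *this;
    }
    ~Handle() { delete ptr_->decref(); }
    std::size_t use_count() const { return ptr_->refs(); }
private:
    Impl* ptr_;
};

void demo_handle_counts()
{
    Handle a;
    assert(a.use_count() == 1);
    {
        Handle b(a);                // copy shares the Impl
        assert(a.use_count() == 2);
    }                               // b destroyed, count drops back
    assert(a.use_count() == 1);
    Handle c;
    c = a;                          // c's original Impl is deleted here
    assert(a.use_count() == 2);
    assert(c.use_count() == 2);
}
```

Copying a Handle costs one pointer copy and one increment, however large the Impl may be, which is the property the iostreams code relies on.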

Indexing Facets

There's still a missing piece of the puzzle. So far, I have talked glibly about looking up facet pointers of a given type within (the implementation of) a locale object. The template functions has_facet and use_facet must each somehow map its template type parameter Facet into an index into _Fv, the array of pointers to facets stored in the implementation object. Now, C++ is a flexible language, but it doesn't let you subscript an array with a type, as in:
return _Ptr->_Fv[Facet];    // NOT REALLY
Draft Standard C++ does include facilities for run-time type identification (RTTI). You could probably generate a hash key from the text string returned by:
typeid(Facet).name()
The value returned by this function call is a pointer to a const null-terminated string that, in practice, differs for each distinct type in a program (though the form of the string is left to the implementation). But discriminating facets this way prevents you from deriving your own flavor of ctype<char>, as in:
class Myctype :
    public ctype<char> {.....};
The derived type would have a different name from its base type.

Instead, a locale facet has an added constraint. Besides being derived from class locale::facet, a facet fac must also define a publicly accessible static object whose name is id and whose type is locale::id. Moreover, this static object must be statically initialized to zero. You obtain a unique nonzero stored integer value from fac::id by casting it to type size_t. The first time you do so, operator size_t() obtains a unique index and stores it in the locale::id object. Subsequent casts return the same value.
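
One way to picture the lazy assignment of indices is the sketch below. Facet_id is a hypothetical stand-in for locale::id, and the two global objects stand in for the id members of two unrelated facets. It assumes single-threaded use and is not the code from Listing 1.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for locale::id. It is an aggregate so that
// "= {0}" at namespace scope is a static initialization.
struct Facet_id {
    std::size_t index;      // zero means "no index assigned yet"

    // The first conversion claims the next unused index and stores it.
    // Later conversions return the stored value.
    operator std::size_t()
    {
        static std::size_t last_issued = 0;
        if (index == 0)
            index = ++last_issued;
        return index;
    }
};

Facet_id ctype_like_id = {0};
Facet_id numpunct_like_id = {0};

void demo_lazy_ids()
{
    std::size_t a = ctype_like_id;
    std::size_t b = numpunct_like_id;
    assert(a != 0 && b != 0 && a != b);
    assert(static_cast<std::size_t>(ctype_like_id) == a);
}
```

Because the stored value starts at zero before any constructor could run, the scheme does not depend on the order in which static objects in different translation units are initialized.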

If your goal is to make a new version of an existing facet type, as with Myctype above, you simply inherit the existing definition of the static locale::id object. But if you want to make a brand new kind of facet, you must begin with the skeleton:
class Myfacet :
    public locale::facet {
public:
    static locale::id id;
    ..... };
// define static id
locale::id Myfacet::id(0);
The net result is that each distinct facet has an associated distinct value of the static member object id. Each implementation object stores an array of pointers to facets indexed by the values stored in the various id objects. The size of each array of pointers is determined when the implementation object is constructed. You will find similarly elastic arrays associated with the ios member functions iword and pword, dating back to the earliest versions of iostreams.
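
Fleshing out that skeleton, here is a sketch of a brand new facet and its installation in a locale. The class Units_facet and the functions units_of and demo_custom_facet are hypothetical. The sketch is written against the final standard library, where locale::id offers only a default constructor, so the static member is defined without the explicit zero shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical facet that carries the name of a system of units.
class Units_facet : public std::locale::facet {
public:
    static std::locale::id id;

    explicit Units_facet(const std::string& name, std::size_t refs = 0)
        : std::locale::facet(refs), name_(name) {}

    const std::string& name() const { return name_; }
private:
    std::string name_;
};

// define static id
std::locale::id Units_facet::id;

// Reports the units named by a locale, if it has the facet at all.
std::string units_of(const std::locale& loc)
{
    if (!std::has_facet<Units_facet>(loc))
        return "unknown";
    return std::use_facet<Units_facet>(loc).name();
}

void demo_custom_facet()
{
    const std::locale& classic = std::locale::classic();

    // The new locale is a copy of classic plus the added facet. The
    // facet is created with new and is owned by the locale afterward.
    std::locale metric(classic, new Units_facet("metric"));

    assert(units_of(metric) == "metric");
    assert(units_of(classic) == "unknown");

    // Copies of the locale share the same facet.
    std::locale copy(metric);
    assert(units_of(copy) == "metric");
}
```

Note that the classic locale is untouched. Adding a facet always produces a new locale object rather than altering an existing one.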

Building Locale Objects

Class locale defines the static member function locale::classic(), which returns a const reference to a locale object that corresponds to the "classic" "C" locale. This locale object is populated with the two dozen or so facets called for by the draft C++ Standard. All other locale objects in a program derive from this initial locale object either by the replacement of facets or the addition of new ones. That's what all those different locale constructors are for. So you can expect that the first couple dozen facet indices are taken up by library facets. And all locale objects have at least the standard assortment of facets.
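
Replacing a facet can be sketched as follows. The class Comma_point and the functions format_in and demo_replaced_facet are hypothetical names. The example derives a new locale from locale::classic() by substituting a numpunct<char> whose decimal point is a comma, using the library as finally standardized.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical replacement facet: same as numpunct<char> except that
// the decimal point is a comma.
class Comma_point : public std::numpunct<char> {
protected:
    virtual char do_decimal_point() const { return ','; }
};

// Formats a value using the conventions of the given locale.
std::string format_in(const std::locale& loc, double value)
{
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}

void demo_replaced_facet()
{
    const std::locale& classic = std::locale::classic();

    // Copy every facet from classic, then replace numpunct<char>.
    std::locale commas(classic, new Comma_point);

    assert(format_in(classic, 3.5) == "3.5");
    assert(format_in(commas, 3.5) == "3,5");

    // The derived locale still has the rest of the standard assortment.
    assert(std::has_facet< std::ctype<char> >(commas));
}
```

Because Comma_point inherits the static id of numpunct<char>, it lands in the same slot as the facet it replaces, so the inserters find it with an ordinary use_facet< numpunct<char> > lookup.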

Well, this is true in principle. The code I present here has a couple of interesting additions, however. The static member function locale::empty shown in Listing 1 is a nonstandard, but conforming, extension that creates interesting additional possibilities for fragmentary locales. And the version of use_facet in this implementation permits a kind of lazy evaluation of locale facets — the secret behind reducing 250,000 bytes of overhead to 50,000. I discuss both of these delicate topics later.

Sorry about all the teasers in this presentation, but locales in the draft Standard C++ library are a very large topic. It will take a while to walk through all those facets, describing how the library uses them and how they can be used or augmented. We have to begin somewhere. A good place to begin is to understand the basics of locale objects, facets, and how they interact in the most general terms.

P.J. Plauger is Senior Editor of C/C++ Users Journal and President of Dinkumware, Ltd. He is the author of the Standard C++ Library shipped with Microsoft's Visual C++, v5.0. For eight years, he served as convener of the ISO C standards committee, WG14. He remains active on the C++ committee, J16. His latest books are The Draft Standard C++ Library, Programming on Purpose (three volumes), and Standard C (with Jim Brodie), all published by Prentice-Hall. You can reach him at pjp@plauger.com.