

Internationalization Using Standard C++

Angelika Langer and Klaus Kreft

The draft Standard C++ Library provides powerful support for internationalizing code. You just have to learn how to use all that power.


Computer users all over the world prefer to interact with their systems using their own language and cultural conventions. Cultural differences affect, among other things, the display of monetary values, dates, and time. Just think of the way numeric values are formatted in different cultures: 1,000,000.00 in the US is 1.000.000,00 in Germany and 10,00,000.00 in Nepal. If you aim for high international acceptance of your products, you must build into your software the flexibility to adapt to varying requirements that stem from cultural differences.

Building into software the potential for worldwide use is called "internationalization," or I18N for short [1]. However it is spelled, internationalization is one of the major challenges of software development these days.

Traditionally, internationalization has been achieved with C libraries. The C Standard, as well as standards like POSIX and X/Open, defines locales and wide-character input and output for Standard C. Windows 95 and Windows NT have a C interface, too, the Win32 NLSAPI [2]. None of the Win32 NLSAPI interfaces matches any of the Standard C interfaces though, and locales are thread specific in Windows whereas they are provided per process in Unix. These are important differences. The concept and level of support, however, are equivalent. There is a common notion of locales, and the services provided cover a similar range of I18N problems.

The ISO/ANSI draft C++ Standard defines an extensible framework that facilitates internationalization of C++ programs in a portable manner. Its main elements are locales and facets. This article gives an overview of the locale framework and the standard facets defined by the draft C++ Standard.

Locales in C

As a reader of CUJ, you may already have some background in the internationalization services provided by the Standard C library. Let's start with a short recap of those services, and then build on existing knowledge to describe the C++ locales in terms of the C locale.

Internationalization requires that developers consciously design and implement their software, and avoid hard-coding information or rules that can be localized. For example, careful developers never assume specific conventions for formatting numeric or monetary values, or for displaying date and time, not even for comparing or sorting strings. For internationalization, all cultural and language dependencies need to be represented outside of source code in a kind of language table. Such a table is called a locale.

A locale in the Standard C library contains support for several problem domains. The information in a Standard C locale is composed of categories. Each of the categories represents a set of related information, as shown in Table 1 [3].

Inside a program, the C locale is represented by one or more global data structures. The Standard C library provides functions that use information from those global data structures to adapt their behavior to local conventions. Examples of these functions and the information they cover are:
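For instance, localeconv reports the numeric formatting symbols of the current global C locale. The following is a minimal sketch rather than a definitive implementation: the helper name c_decimal_point is hypothetical, and the only locale name assumed to exist on every platform is "C".

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Hypothetical helper: returns the decimal-point string that the global
// C locale uses once its LC_NUMERIC category is switched to `name`.
// Returns an empty string if the platform rejects that locale name.
std::string c_decimal_point(const char* name)
{
    assert(name != nullptr);

    // Remember the current setting so the global state can be restored.
    const char* current = std::setlocale(LC_NUMERIC, nullptr);
    const std::string saved = current ? current : "C";

    if (std::setlocale(LC_NUMERIC, name) == nullptr)
        return std::string();                    // unknown locale name

    const std::string point = std::localeconv()->decimal_point;

    std::setlocale(LC_NUMERIC, saved.c_str());   // restore the global locale
    return point;
}
```

Calling c_decimal_point("C") yields the period of the default C locale. Note how the helper has to save and restore global state just to ask a question about another locale.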

Locales in C++

In the draft Standard C++ library, locale categories are compartmentalized, and further divided, into a number of separate classes called facets. Each facet offers a set of internationalization services. For instance, the formatting of monetary values is encapsulated in the money_put<> facet. (Note the brackets — all facets are template classes.) A facet may also represent a collection of information about certain cultural and language dependencies. The rules and symbols for monetary information, for example, are contained in a facet called moneypunct<>.

The draft Standard C++ library also defines a class called locale. Unlike a C locale, which is a global data structure representing various culture and language dependencies, an object of class locale is an abstraction that manages facets. Basically, you can think of a C++ locale object as a container of facets, as illustrated in Figure 1.

The draft C++ Standard defines a number of standard facets. They provide services and information similar to those contained in the Standard C library. As we have seen, the C locale is composed of five or six categories of information. There are a comparable number of groups of standard facets:

As you might have noticed, the names of the standard facets obey certain naming rules. The "get" facets, like num_get and time_get, offer services for parsing. The "put" facets provide formatting services. The "punct" facets, like numpunct and moneypunct, represent rules and symbols.
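To make the role of a "punct" facet concrete, here is a small sketch that reads the symbols stored in a locale's numpunct<char> facet. It relies on use_facet and has_facet, which are described later in this article; the struct NumericSymbols and the function numeric_symbols are names made up for this illustration.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical aggregate holding the symbols kept by a numpunct<char> facet.
struct NumericSymbols {
    char decimal_point;
    char thousands_sep;
    std::string true_name;
    std::string false_name;
};

// Reads the numeric rules and symbols out of the given locale.
NumericSymbols numeric_symbols(const std::locale& loc)
{
    // numpunct<char> is a standard facet, so it is expected to be present.
    assert(std::has_facet<std::numpunct<char>>(loc));
    const std::numpunct<char>& punct = std::use_facet<std::numpunct<char>>(loc);

    NumericSymbols symbols;
    symbols.decimal_point = punct.decimal_point();
    symbols.thousands_sep = punct.thousands_sep();
    symbols.true_name     = punct.truename();
    symbols.false_name    = punct.falsename();
    return symbols;
}
```

For the classic locale, numeric_symbols(std::locale::classic()) reports '.' as the decimal point and "true" and "false" as the textual names of the boolean values.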

C++ Locale Objects

As you can see, the C++ locale class, along with the standard facets, offers services similar to locales in C. However, the semantics of the C++ locale are different from the semantics of the C locale. The Standard C locale is a global resource — there is only one locale for the entire application. This makes it hard to build an application that has to handle several locales at a time. The Standard C++ locale is a class. You can create multiple instances of class locale at will, so you can have as many locale objects as you need.

To explore this difference in further detail, let us see how locales can be used. It may well happen that you have to work with multiple locales. For example, if you have to implement an application for Switzerland, you might want to output messages in Italian, French, and German. As the C locale is a global data structure, you will have to switch locales several times.

Let's discuss an application that works with multiple locales. Suppose an application runs at a US company that ships products worldwide. It needs to print invoices to be sent to customers all over the world. Of course, the invoices need to be printed in the customer's native language. Suppose further that the application reads input (the product price list) in US English, and writes output (the invoice) in the customer's native language, say German. Since there is only one global locale in C that affects both input and output, the global locale must change between input and output operations. See Figure 2.

Here is the C code that corresponds to the previous example [4]:

float price;
for (;;)  // processing the German invoice
{
   setlocale(LC_ALL, "En_US");
   if (fscanf(priceFile, "%f", &price) != 1)
      break;  // no more prices to read
   // convert $ to DM
   setlocale(LC_ALL, "De_DE");
   fprintf(invoiceFile, "%f", price);
}

Multiple Locales

Using C++ locale objects dramatically simplifies the task of using multiple locales. Iostreams in the Standard C++ Library are internationalized, so each stream can be imbued with its own locale object. For example, the input stream can be imbued with an English locale object, and the output stream can be imbued with a German locale object. In this way, switching locales becomes unnecessary. See Figure 3.

Here is the C++ code corresponding to the previous example:

priceFile.imbue(locale("En_US"));
invoiceFile.imbue(locale("De_DE"));
float price;
while ( priceFile >> price )  // processing the German invoice
{
   // convert $ to DM
   invoiceFile << price;
}

With this toy example, switching locales might look like only a minor inconvenience. However, consider the need for multiple locales in an application with multiple threads of execution. Because all threads share one global locale in C, access to the global locale must be serialized by means of mutual exclusion. The resulting locking adds overhead and, for the most part, simply slows the program down.

Ideally, you want to keep locales completely independent of each other. Each component should have a locale of its own, which is unrelated to other locales in your program. This is what you have in C++. You can create an arbitrary number of independent, light-weight locale objects that you can attach to streams, and exchange between components, or pass around as function arguments.
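The invoice scenario can be sketched in a way that runs on any platform. Because locale names such as "De_DE" are not portable, the sketch below assumes a hand-written facet, GermanStylePunct, as a stand-in for a real German locale, and uses string streams in place of files. Both the facet and the function make_invoice are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet standing in for a German locale's numeric punctuation.
class GermanStylePunct : public std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }  // groups of 3
};

// Reads prices written in US notation, multiplies each by `rate`, and
// returns the converted prices, one per line, in German-style notation.
std::string make_invoice(const std::string& priceList, double rate)
{
    assert(rate > 0.0);

    std::istringstream priceFile(priceList);
    std::ostringstream invoiceFile;

    // Each stream carries its own locale; nothing global is switched.
    priceFile.imbue(std::locale::classic());
    invoiceFile.imbue(std::locale(std::locale::classic(), new GermanStylePunct));
    invoiceFile << std::fixed << std::setprecision(2);

    double price;
    while (priceFile >> price)                 // stops at end of input
        invoiceFile << price * rate << '\n';
    return invoiceFile.str();
}
```

With the input "1234.50 0.75" and a rate of 2.0, the function returns the lines "2.469,00" and "1,50". The locale object takes ownership of the facet created with new, so no explicit cleanup is required.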

The C locale and C++ locale objects are mostly unrelated. There is only one occasion when they affect each other: when you alter the global C++ locale object. You obtain a copy of this global locale object by calling the default constructor, locale().

The notion of a global C++ locale was added for all those users who do not want to bother with detailed control of internationalization. Rather, they can leave it to the various internationalized components to pick a sensible default locale. The global C++ locale is often used as this default locale. Iostreams objects, for example, take a snapshot of the global locale object, if you do not explicitly imbue a stream with a given locale object.

Altering the global locale object by calling locale::global(newloc) affects the global C locale as well. It results in a call to setlocale. When this happens, locale-sensitive C functions called from within a C++ program will see the same locale as that specified by the global C++ locale. There is no way to affect the global C++ locale, or any other locale object, from within a C program, however.
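The interplay between the global locale and streams can be sketched as follows. The facet CommaPunct and the function format_with_global are hypothetical; the sketch assumes only that locale::global returns the previously installed locale and that a newly constructed stream copies the global locale.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet that changes nothing but the decimal point.
class CommaPunct : public std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Formats `value` with a stream that is never imbued explicitly, while
// `loc` is installed as the global locale. The previous global locale is
// put back before the function returns.
std::string format_with_global(double value, const std::locale& loc)
{
    // global() installs the new locale and hands back the old one.
    const std::locale previous = std::locale::global(loc);

    std::ostringstream os;                  // snapshots the global locale
    assert(os.getloc() == std::locale());   // locale() copies the global one
    os << value;

    std::locale::global(previous);          // undo the change
    return os.str();
}
```

Passing a locale composed from the classic locale and a CommaPunct facet yields "1,5" for the value 1.5, while streams created after the function returns format with the original global locale again.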

Using Locales and Facets

Let us now explore how C++ locale objects and facets are used. Remember that a locale in C++ is a container of facets, and a facet is a set of internationalization services and information. The general pattern of usage is:

This sounds more complicated than it actually is, as you'll soon see. It points out, however, that the locale does not know anything about the facets' capabilities. The locale only maintains the facets. It registers them and makes them accessible on demand.

The locale itself, therefore, does not provide you with internationalization services. It only gives you access to the facets that provide the services. It is your task to understand which facets you need for which particular services. The advantage of separating maintenance from functionality is that a locale can maintain any kind of facet, not only the predefined standard facets from the C++ library, but also new facets that are added to the library for special purposes.

Class locale has numerous constructors. See Listing 1 for a comprehensive list. Basically they fall into three categories:

Here are a couple of constructors of class locale that allow creation of locales by composition:

class locale {
public:
  locale(const locale& other,
    const char* std_name, category);
  template <class Facet>
    locale(const locale& other, Facet* f);
  template <class Facet>
    locale(const locale& other, const locale& one);
  locale(const locale& other,
      const locale& one, category);
};

The following example uses the last of these constructors to show how you can construct a locale object as a copy of the classic locale object with the classic numeric facets replaced by the numeric facet objects taken from a German locale object.

locale loc(locale::classic(),
    locale("De_DE"), locale::numeric);

As mentioned earlier, the facets fall into categories. locale::numeric is the category that designates all numeric facets in a locale; it is the C++ counterpart of the C category LC_NUMERIC.

Note that some of the constructors are member templates, a relatively new language feature that is not yet supported by all compilers. (See the sidebar, "Locales in Practice.")

It's important to understand that locales are immutable objects. Once a locale object is created, it cannot be modified — no facets can be replaced after construction. This makes locales reliable and easy to use. You can safely pass them around between components.
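Composition and immutability can be seen together in a short sketch: building a new locale from an existing one leaves the original untouched. The facet GermanBoolNames and the two helper functions are hypothetical names used only for this illustration, and the facet stands in for a named German locale that may not be installed.

```cpp
#include <cassert>
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical numeric facet that only renames the boolean values.
class GermanBoolNames : public std::numpunct<char> {
protected:
    std::string do_truename() const override { return "wahr"; }
    std::string do_falsename() const override { return "falsch"; }
};

// Builds a locale equal to `base` except for the numeric category,
// whose facets are taken from `donor`.
std::locale with_numeric_from(const std::locale& base, const std::locale& donor)
{
    assert(std::has_facet<std::numpunct<char>>(donor));
    return std::locale(base, donor, std::locale::numeric);
}

// Writes a bool as text through a stream imbued with `loc`.
std::string format_bool(bool value, const std::locale& loc)
{
    std::ostringstream os;
    os.imbue(loc);
    os << std::boolalpha << value;
    return os.str();
}
```

A locale produced by with_numeric_from prints true as "wahr", whereas the base locale passed in continues to print "true".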

Copying a locale object is a cheap operation. You should have no hesitation about passing locale objects around by value. You may copy locale objects for composing new locale objects, as arguments to functions, etc.

Locales are implemented using reference counting and the handle-body idiom [5] . When a locale object is copied, only its handle is duplicated (a fast and inexpensive action). Figure 4 shows an overview of the locale architecture. A locale is a handle to a body that maintains a sequence of pointers to facets. The facets are reference-counted, too.
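One observable consequence of this design is that a copy of a locale hands out the very same facet objects as the original. The check below is a sketch of that observation; the function name same_numpunct is hypothetical.

```cpp
#include <cassert>
#include <locale>

// Returns true if both locales refer to one and the same numpunct<char>
// facet object, which is what sharing a body implies.
bool same_numpunct(const std::locale& a, const std::locale& b)
{
    assert(std::has_facet<std::numpunct<char>>(a));
    assert(std::has_facet<std::numpunct<char>>(b));
    return &std::use_facet<std::numpunct<char>>(a) ==
           &std::use_facet<std::numpunct<char>>(b);
}
```

For a locale and a copy of it, same_numpunct returns true and the two objects compare equal with operator==; no facet is duplicated by the copy.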

Accessing Facets

You can access the facet objects of a locale via two template functions, use_facet and has_facet:

template <class Facet>
    const Facet& use_facet(const locale&);
template <class Facet>
    bool has_facet(const locale&);

The function use_facet gives access to a facet by providing a constant reference to it. The function has_facet checks whether a certain facet is present in a given locale. The requested facet is specified via its type.

Note that both functions are template functions. The template parameter is the type of the facet to access in a locale. In other words, these functions can decide which facet object is meant from the facet's type alone. This works because a locale contains at most one instance of a given facet type. This kind of compile-time dispatch is a novel technique in C++. A discussion of this technique, and the design of the locale architecture, is beyond the scope of this article.

The code below demonstrates how these functions are used to get access to a facet and invoke an internationalization service. It illustrates the conversion service tolower from the ctype facet. All upper-case letters of a string read from the standard input stream are converted to lower-case letters and are written to the standard output stream:

string in;
cin >> in;
if (!in.empty())
  use_facet< ctype<char> >(locale()).
    tolower(&in[0], &in[0] + in.length());
cout << in;

The function template use_facet< ctype<char> > returns a constant reference to the locale's facet object. Then the facet object's member function tolower is called. It behaves much like the C function tolower applied to each element in the sequence: it converts all upper-case letters into lower-case letters [6]. The range is given as two character pointers into the string's own buffer, because tolower modifies the characters in place.

The syntax of the call:

use_facet< ctype<char> >(locale())

might look surprising to you. It is an example of explicit template argument specification, a language feature that is relatively new to C++. Template arguments of a function instantiated from a function template can either be explicitly specified in a call or be deduced from the function arguments. The explicit template argument specification is needed in the call to use_facet above because the compiler can deduce a template argument only if it appears in the type of one of the function arguments, and the facet type does not. (See the sidebar.)

Note that we do not store the reference to the facet, but just use the temporary reference returned by use_facet to call the desired member function of that facet. This is a safe way of using facets retrieved from a locale. If you keep the reference, you'll need to keep track of the object's lifetime and validity. The facet reference does indeed stay valid throughout the lifetime of its locale object, but when the locale goes out of scope, such references might become invalid. For this reason it is advisable to combine retrieval and invocation as shown in the example above, unless you have a real need to do things differently.
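The advice to combine retrieval and invocation can be packaged in a small helper. This is a sketch under the assumption that a function working on a copy of the string is acceptable; the name to_lower_copy is hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Returns a lower-case copy of `text`, converted according to the
// ctype<char> facet of `loc`.
std::string to_lower_copy(std::string text, const std::locale& loc)
{
    assert(std::has_facet<std::ctype<char>>(loc));
    if (text.empty())
        return text;                 // nothing to convert

    // Retrieve the facet and call it in a single expression, so that no
    // facet reference outlives this statement. tolower works in place on
    // the character range [first, last).
    std::use_facet<std::ctype<char>>(loc).tolower(&text[0],
                                                  &text[0] + text.size());
    return text;
}
```

Called with "Hello, World 42!" and the classic locale, the helper returns "hello, world 42!"; digits and punctuation pass through unchanged.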

Note also that we did not first call has_facet< ctype<char> > to check whether the locale has a ctype facet. In most situations, you do not have to check for the presence of a standard facet object like ctype<char>, because locale objects are created by composition. You start with the classic locale or a locale object constructed "by name" from a C locale's external representation. Because you can only add or replace facet objects in a locale object, you cannot compose a locale that lacks one of the standard facets. A call to has_facet is useful, however, when you expect that a certain non-standard facet object should be present in a locale object. The function use_facet throws a bad_cast exception if the requested facet is not present.
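Here is a sketch of how has_facet and the exception thrown by use_facet behave for a non-standard facet. The facet class UnitFacet and the functions has_unit and unit_label are invented for this illustration; the only requirements assumed for a user-defined facet are that it derives from locale::facet and declares a static member id of type locale::id.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>
#include <typeinfo>
#include <utility>

// Hypothetical non-standard facet that stores a unit label.
class UnitFacet : public std::locale::facet {
public:
    static std::locale::id id;   // identifies this facet type in a locale

    explicit UnitFacet(std::string label, std::size_t refs = 0)
        : std::locale::facet(refs), label_(std::move(label))
    {
        assert(!label_.empty());
    }
    const std::string& label() const { return label_; }

private:
    std::string label_;
};
std::locale::id UnitFacet::id;

// Checks for the facet without touching it.
bool has_unit(const std::locale& loc)
{
    return std::has_facet<UnitFacet>(loc);
}

// Returns the stored label, or `fallback` when use_facet reports that the
// locale does not contain a UnitFacet.
std::string unit_label(const std::locale& loc, const std::string& fallback)
{
    try {
        return std::use_facet<UnitFacet>(loc).label();
    } catch (const std::bad_cast&) {
        return fallback;
    }
}
```

The classic locale has no UnitFacet, so has_unit returns false and unit_label falls back; a locale composed as locale(locale::classic(), new UnitFacet("kg")) reports the label "kg".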

Locales and Iostreams

The standard iostreams are the prime example of an internationalized component that uses locales and facets. This enhancement of iostreams enables you to implement locale-sensitive standard I/O operations for your user-defined types. Each stream has a locale object attached. Attaching a locale to a stream is done at construction, or via the stream's imbue member function. If you do not explicitly imbue a locale, the stream uses a snapshot of the current global locale as a default, as we mentioned earlier.
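Because imbue returns the locale that was attached before, a stream's locale can be replaced for a single output operation and then put back. The sketch below assumes a hand-written facet, SpaceGrouping, in place of a named locale; it and the two function templates are hypothetical names.

```cpp
#include <cassert>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical facet that groups digits in threes, separated by spaces.
class SpaceGrouping : public std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ' '; }
    std::string do_grouping() const override { return "\3"; }
};

// Writes `value` to `os` under `loc`, then restores the stream's locale.
template <class T>
void print_with(std::ostream& os, const T& value, const std::locale& loc)
{
    const std::locale previous = os.imbue(loc);  // returns the old locale
    os << value;
    os.imbue(previous);
    assert(os.getloc() == previous);
}

// Convenience wrapper that formats a single value into a string.
template <class T>
std::string to_text(const T& value, const std::locale& loc)
{
    std::ostringstream os;
    print_with(os, value, loc);
    return os.str();
}
```

With a locale composed from the classic locale and a SpaceGrouping facet, to_text(1234567, loc) produces "1 234 567", while later output on the same stream is formatted with its original locale.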

Here is an example that demonstrates how one can use a stream's locale for printing a date. Let us assume we have a date object of type tm, which is the time structure defined in the Standard C library, and we want to print it. Let's also assume our program is supposed to run in a German-speaking canton of Switzerland. Hence, we attach a Swiss locale to the standard output stream. When we print the date we expect an output like "1. September 1989" or "01.09.89":

struct tm aDate = {};   // zero all fields first
aDate.tm_year = 89;     // years since 1900
aDate.tm_mon = 8;       // months since January (0-11)
aDate.tm_mday = 1;

cout.imbue(locale("De_CH"));
cout << aDate;

As there is no operator<< defined in the Standard C++ library for tm from the C library, we have to provide this inserter ourselves. The following code suggests a way this can be done. To keep it simple, the handling of exceptions thrown during the formatting is omitted:

template<class Ostream>
Ostream& operator<<(Ostream& os,
    const struct tm& date)
{
  typedef typename Ostream::char_type char_t;
  typedef typename Ostream::traits_type traits_t;
  typedef ostreambuf_iterator<char_t, traits_t> outIter_t;

  locale loc = os.getloc();

  const time_put<char_t,outIter_t>& fac =
     use_facet < time_put<char_t, outIter_t > > (loc);

  outIter_t nextpos = fac.put(os, os,
     os.fill(), &date, 'x');

  if (nextpos.failed())
    os.setstate(ios_base::badbit);

  return os;
}

There's a lot going on here. Let's discuss the interface of the shift operator first.

The code above shows a typical stream inserter. It takes a reference to an output stream and a constant reference to the object to be printed and returns a reference to the same stream. The inserter is a template function because the standard iostreams are templates; they take a character type and an associated traits type describing the character type as template arguments [7]. Our date inserter is parameterized on the stream type itself and obtains the character type and the traits type from the stream's nested types char_type and traits_type.

We need to get hold of the stream's locale object, because we want to use its time formatting facet for output of our date object. As you can see in the code above, the stream's locale object is obtained via the stream's member function getloc. We retrieve the time formatting facet from the locale via use_facet as in the earlier example. We then call the facet's member function put.

The put function does all the magic. It produces a character sequence that represents the equivalent of the date object, formatted according to culture-dependent rules and information. It then inserts the formatted output into the stream via an output iterator. Before we delve into the details of the put function, let us take a look at its return value.

The put function returns an output iterator that points to the position immediately after the last inserted character. The output iterator used here is an output stream buffer iterator. These are special-purpose iterators defined by the Standard C++ library that bypass the stream's formatting layer and write directly to the output stream's underlying stream buffer. Output stream-buffer iterators have a member function failed for error indication. So we can check for errors happening during the time formatting. If there was an error, we set the stream's state accordingly, by calling the stream's setstate function.

Let's return to the facet's formatting function put and see what arguments it takes. Here is the function's interface:

iter_type put(iter_type, ios_base&,
    char_type, const tm*, char)

iter_type and char_type stand for the types that were provided as template arguments when the facet class was instantiated. In this case, they are ostreambuf_iterator<charT, traits> and charT, where charT and traits are, in turn, the respective stream's template arguments.

Here is the actual call again:

nextpos = fac.put(os, os,
    os.fill(), &date, 'x');

Let's see what the arguments mean.

The first parameter is supposed to be an output iterator. We provide an iterator to the stream's underlying stream buffer. The reference os to the output stream is converted to an output iterator, because output stream buffer iterators have a constructor taking an output stream, of type basic_ostream<charT,traits>&.

The second parameter is of type ios_base&, which is one of the stream base classes. Class ios_base contains data for format control, which the facet object uses. We provide the output stream's ios_base subobject here, using the automatic cast from a reference to an output stream to a reference to its base class.

The third parameter is the fill character. It is used when the output has to be adjusted and extra characters have to be filled in. We pass on the stream's fill character, obtained by calling the stream's fill function.

The fourth parameter is a pointer to a time structure tm from the C library.

The fifth parameter is a format character as in the C function strftime. The x calls for the locale's appropriate date representation.
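To see these arguments at work without depending on an installed locale, the following sketch formats a date into a string. It uses the overload of time_put::put that accepts a whole strftime-style pattern instead of a single format character. The functions make_date and format_date are hypothetical helpers, and the pattern "%Y-%m-%d" is chosen because its output does not vary with the locale.

```cpp
#include <cassert>
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Builds a tm from calendar values, hiding tm's offset conventions.
std::tm make_date(int year, int month, int day)
{
    assert(month >= 1 && month <= 12);
    std::tm date = std::tm();     // zero-initialize every field
    date.tm_year = year - 1900;   // years since 1900
    date.tm_mon  = month - 1;     // months since January, 0..11
    date.tm_mday = day;
    return date;
}

// Formats `date` according to `pattern` with the time_put facet of `loc`.
// Returns an empty string if writing to the stream buffer failed.
std::string format_date(const std::tm& date, const std::string& pattern,
                        const std::locale& loc)
{
    typedef std::ostreambuf_iterator<char> OutIter;
    typedef std::time_put<char, OutIter> TimePut;

    std::ostringstream os;
    os.imbue(loc);

    const char* first = pattern.data();
    OutIter next = std::use_facet<TimePut>(loc).put(
        OutIter(os),      // where to write
        os,               // format control (ios_base subobject)
        os.fill(),        // fill character
        &date,            // what to format
        first, first + pattern.size());

    if (next.failed())
        os.setstate(std::ios_base::badbit);
    return os.fail() ? std::string() : os.str();
}
```

For example, format_date(make_date(1989, 9, 1), "%Y-%m-%d", std::locale::classic()) yields "1989-09-01". Passing a pattern of "%x" together with a named locale would give the locale's own date representation, provided that locale is available on the platform.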

As you can see from the example of a date inserter function, it is relatively easy to implement powerful, locale-sensitive I/O operations using standard iostreams and locale. It takes just a couple of lines of C++ code, once you understand the underlying machinery.

Summary

This article gives a brief overview of locales and facets, the components in the draft Standard C++ library for supporting the internationalization of C++ programs. The functionality of the standard facets contained in the Standard C++ library covers traditional C functionality. However, C++ allows multiple locales and overcomes the limitation of the single global locale that was imposed by C.

Naturally, this brief introduction to internationalization support in Standard C++ is far from being comprehensive. For instance, we concealed that locales and facets are designed as an open and extensible framework. A description of the framework's architecture and of techniques for extending the framework would fill another article.

Acknowledgements

This article is based on material we put together for a book on Standard C++ Iostreams and Locales to be published by Addison-Wesley-Longman in 1998. Part of the article was inspired by work Angelika Langer did for Rogue Wave Software, Inc. in 1996. We also want to thank Nathan Myers, who initially proposed locales and facets to the C++ standards committee. He patiently answered countless questions during the past months.

Notes

[1] Internationalization is such an ugly and long word that it is often abbreviated as "I18N," where "18" stands for the 18 characters between the first and last character of the word i(nternationalizatio)n.

[2] An excellent book in the Microsoft Programming Series is Developing International Software for Windows 95 and Windows NT, by Nadine Kano.

[3] The description is based on XPG4, which is the Native Language Support (NLS) defined by X/Open for the programming language C. ISO C also defines internationalization services to be contained in the C library. The respective ISO standard is ISO/IEC 9899 and its Amendment 1. The ISO C standard is identical to the POSIX standard for the programming language C. The internationalization services defined by ISO C are part of XPG4. However, XPG4 offers more services than ISO C.

[4] The example is oversimplified. One would certainly use the strfmon function for formatting monetary values like prices. Note also that the locale names, such as "En_US" and "De_DE", are not standardized. Each platform and operating system may have different naming patterns. The names used in the example use the X/Open notation. For instance, the equivalent to "De_DE" on a Microsoft platform would be "German_Germany".

[5] A good reference for an explanation of the handle-body idiom is Advanced C++ Programming Styles and Idioms, by James O. Coplien (Addison-Wesley, 1992), ISBN 0-201-54855-0.

[6] The function ctype<>::tolower is similar to the C function tolower, except that it takes two character pointers as arguments. You might have expected a call like tolower(in.begin(), in.end()) — using the begin and end iterators for the string. But that would make the code non-portable, because string::iterator is an implementation-defined type. It is not necessarily a pointer to the string's character type.

[7] The typical character-type arguments are the builtin types char and wchar_t. The typical traits type is the instantiation of the standard character-traits template char_traits<charT> provided by the draft Standard C++ library.

Angelika Langer works as an independent freelance trainer/consultant. Her main area of interest and expertise is object-oriented software development in C++ and Java. She can be reached at langer@camelot.de.
Klaus Kreft is a Senior Consultant at Siemens Nixdorf in Germany. He can be reached at klaus.kreft@mch.sni.de.