Features


Internationalization: A Primer

Rex Jaeschke


Rex Jaeschke is an independent computer consultant, author, and seminar leader. He is chair of the ANSI C committee X3J11. Readers may contact Rex at 2051 Swans Neck Way, Reston, VA 22091 or via UUCP at rex@aussie.com.

This is the first in a two-part series covering software internationalization support available in Standard C.

Introduction

Prior to the standardization of C, C programs executed in what amounted to a "USA-English" environment. For example, isupper returned true only for the Roman letters A-Z, the decimal-point character printed by printf was the period, and dates were formatted as mm/dd/yy.

As a result of standardization, the behavior of various standard library functions is now described in terms of the execution environment, which can be changed at run time. For example, the is* and to* functions in ctype.h can properly handle local characters, the printf and scanf families can deal with decimal points other than a period, and local date and time formats are permitted.

A major addition to C was the ability to allow characters in other alphabets to be used inside character constants, comments, header names, and string literals. (However, letters in identifiers are still restricted to the Roman alphabet, and arithmetic constants must use digits from the Arabic number system.) A handful of new functions were defined to deal with interpreting and/or converting multibyte and wide characters.

Soon after the initial ISO C Standard was published, work began on an addendum to that standard. The primary purpose of this addendum was to provide further support for multibyte and wide characters. This support was achieved via new headers, which contain macros, type definitions, and functions. The addendum is known officially as ISO C Amendment 1.

This article describes only the support Standard C provides for achieving internationalization. It does not attempt to describe a strategy for achieving internationalization.

Definition of Terms

Before we can discuss internationalization in detail or even look at programming examples, it is necessary to define a number of terms, many of which are interrelated.

Locales

Two locales are defined by Standard C: "C" and the implementation-defined native locale "". (An implementation that provides the bare minimum locale support will make the native locale "" the same as locale "C".) Any number of other locales are permitted, with arbitrary names.

At program startup, the default locale is "C". Changing one or more categories of a locale to something other than "C" requires a call to the function setlocale, declared in locale.h.

A Simple Example

Listing 1 shows a program that performs a number of locale-specific operations based on a user-specified locale.

Listing 2 shows several sets of input and their (possible) corresponding output. How they actually appear depends on the implementation.

In case 1, the locale name is entered and the terminating new-line discarded. In the implementation used to build the examples in this article (Microsoft's Windows NT), 'american', 'french', 'german', 'spanish', and 'swedish' are valid locale names.

Since the category macro LC_ALL is used, the call to setlocale in case 2 attempts to establish the user-specified locale as the current default for all locale categories. If this is not possible, the function returns a null pointer.

In case 3, the program calls getchar to read a single character from the input. Since isalpha is locale-specific, letters other than Roman A-Z and a-z can be accepted, as shown by the output from case 4.

The output from case 5 shows that printf is also locale-specific. In some countries, the comma is used as the decimal-point character.

The function strftime provides date and time information formatted in a locale-specific manner. As shown, the day, month, and year order and separators can vary from one country to another. Also, users can get day and month names in their native language.
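
The listings themselves are not reproduced here, so the following is only a minimal sketch of the kinds of operations the cases above describe; it is not the author's Listing 1. The function names use_locale, is_letter, format_amount, and format_date are hypothetical, and which locale names succeed depends entirely on the implementation.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <time.h>

/* Try to make `name` the current locale for every category.
   Returns 1 on success, 0 if the implementation rejects the name. */
int use_locale(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}

/* isalpha is locale-specific. The cast avoids undefined behavior
   for characters whose value is negative as a plain char. */
int is_letter(char c)
{
    return isalpha((unsigned char)c) != 0;
}

/* Format a value using the current locale's decimal-point character.
   buf must be large enough to hold the result. */
void format_amount(char *buf, double value)
{
    sprintf(buf, "%.2f", value);
}

/* Format a date using the current locale's date representation (%x).
   Returns the number of characters stored, or 0 if buf is too small. */
size_t format_date(char *buf, size_t size, const struct tm *when)
{
    return strftime(buf, size, "%x", when);
}
```

A caller would typically invoke use_locale once with the user-supplied name, report failure if it returns 0, and only then call the formatting functions.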

Approaches to Using Locales

You can use locales three different ways:

Locale Names

The spelling of a locale name is unspecified by Standard C; a locale name is simply an implementation-defined string. As such, it may contain multibyte characters. Consider the case of a locale designed to support data processing in Germany. A few possible spellings for that locale's name are "german", "deutsch", "deitser", "alemán", and "allemand", along with their uppercase equivalents or with leading capital letters. Which form programmers prefer or expect to see depends on their cultural background and native language.

Hard-coding locale names is probably a bad idea, particularly if portability is an issue. If not, you might question this advice. However, consider the case where your implementation does not support a locale that you need. If that implementation provides a way to add locales, the locale may come from another vendor or user. And its name may be out of your control. Of course, if that implementation does not provide a way to integrate third-party locales, you may have to move to one that does, and that is a port!

In any event, for the purposes of this discussion we will use a header called "locnames.h" which contains macros whose names are of the form LOC_*. For example:

#define LOC_American  "american"
#define LOC_Arabic    "arabic"
...
#define LOC_French    "french"
#define LOC_German    "german"
...
#define LOC_Spanish   "spanish"
#define LOC_Swiss     "swiss"

This user-defined and maintained header can contain conditional-compilation directives as necessary, to accommodate different locale name spellings.

Clearly, it is useful to hide the real locale name spelling from the programmer. However, when the value of a locale name macro is displayed directly, the user will see the underlying spelling. If that is undesirable, the name of the macro can be displayed instead, as shown in Listing 3.

The macro STR is defined in "locnames.h" using:

#define STR(x) #x

The possible outputs now become:

Established locale LOC_German
or

Can't establish locale LOC_German
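
Since Listing 3 is not reproduced here, the following is a minimal sketch of the idea rather than the author's code. The function name establish is hypothetical, and the two macros stand in for an excerpt of the "locnames.h" header described above.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical excerpt from "locnames.h". */
#define LOC_German "german"
#define STR(x) #x

/* Try to establish `name` for all categories and build a status
   message that shows `label` (the macro's spelling) instead of the
   underlying locale name. msg must be large enough for the result.
   Returns 1 if the locale was established, 0 otherwise. */
int establish(const char *label, const char *name, char *msg)
{
    int ok = setlocale(LC_ALL, name) != NULL;

    strcpy(msg, ok ? "Established locale " : "Can't establish locale ");
    strcat(msg, label);
    return ok;
}
```

A call such as establish(STR(LOC_German), LOC_German, msg) produces one of the two messages shown above. This works because the # operator stringizes its argument without macro-expanding it first. Note that STR has to be applied directly to the macro name: if LOC_German were first passed through another function-like macro's parameter, it would be expanded to "german" before STR ever saw it.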

Locale Categories

A locale can be established for all locale-specific operations or for just a subset, depending on the category specified in a call to setlocale. Standard C defines six category macros in locale.h: LC_ALL, LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, and LC_TIME.

These macros expand to integral constant expressions having distinct values. An implementation may define additional category macros provided their names begin with the characters LC_ followed by an uppercase letter. (The POSIX standard defines a category called LC_MESSAGES which allows the user to find out the spelling of the affirmative and negative response to a yes/no question.)

A mixed locale can be established by calling setlocale a number of times, once per category. For example:

setlocale(LC_COLLATE, LOC_German);
setlocale(LC_CTYPE, LOC_French);
setlocale(LC_NUMERIC, LOC_Italian);

[More recent standards based on C locales let you talk about groups of categories by ORing these constants, as in LC_COLLATE | LC_CTYPE. But the C Standard does not make clear that this is always permissible. — pjp]

The setlocale Function

When a call to setlocale succeeds, it returns a pointer to the string associated with the specified locale name and category combination. Let us call such a string a locale string. The format of a locale string is unspecified by Standard C. Listing 4 shows a program that displays several such locale strings.

The output produced on one implementation that supported a French locale was:

|C|
|French_France.850|
|C|
|LC_COLLATE=C;LC_CTYPE=French_France.850; LC_MONETARY=C;
   LC_NUMERIC=C;LC_TIME=C|

(The last printf output is broken after semicolons to fit column width.)

When setlocale is called with a locale name of NULL, the locale string for the specified category in the current locale is returned. The category is not changed. Since the default locale at program startup is "C", the output produced in cases 1 and 3 is not surprising. (It is also not guaranteed by the C Standard, however.)

The output produced in cases 2 and 4 is, of course, specific to the implementation. When case 4 is executed, a mixed locale has already been established. Therefore, the locale string returned in this case must somehow include a description of that mixed locale.
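
As a rough illustration of querying locale strings (a sketch only, not the author's Listing 4), the hypothetical functions show_category and show_mixed below display the strings between bars in the style of the output above. The exact strings printed are implementation-specific.

```c
#include <locale.h>
#include <stdio.h>

/* Print the locale string for one category between bars, without
   changing the locale. Returns 1 if a string was available. */
int show_category(const char *label, int category)
{
    const char *s = setlocale(category, NULL);

    printf("%-12s |%s|\n", label, s != NULL ? s : "(null)");
    return s != NULL;
}

/* Establish `name` for LC_CTYPE only, leaving the other categories
   alone, then display the resulting (possibly mixed) locale strings.
   Returns 1 if the LC_CTYPE change succeeded, 0 otherwise. */
int show_mixed(const char *name)
{
    int ok = setlocale(LC_CTYPE, name) != NULL;

    show_category("LC_CTYPE", LC_CTYPE);
    show_category("LC_NUMERIC", LC_NUMERIC);
    show_category("LC_ALL", LC_ALL);
    return ok;
}
```

Calling show_category before and after show_mixed makes it easy to see how an implementation encodes a mixed locale in the LC_ALL string.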

The locale string produced for a given locale name and category combination can be used as the locale name in a subsequent call to setlocale for the same category. For example, Listing 5 shows the function handle_dates, which temporarily switches to Spanish date/time processing and then restores the original mode.

Assuming setlocale and malloc don't fail, the output produced is:

Saved current LC_TIME
Established LOC_Spanish LC_TIME
Doing LOC_Spanish-specific date/time processing
Restored saved LC_TIME

The memory allocated for a locale string is managed by the library. Since it may be recycled in subsequent calls to setlocale, it is the user's responsibility to make a copy for later use, as shown in cases 2 and 3.

There is no way to identify which locale is currently established. Since the format of a locale string is unspecified, there is not even a guarantee that the locale strings produced by two calls to setlocale using the same arguments compare equal.
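
The save-and-restore pattern described in this section can be sketched as a pair of helper functions. This is an illustration under stated assumptions, not the author's Listing 5; the names save_locale and restore_locale are hypothetical.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'ed copy of the current locale string for
   `category`, or NULL on failure. The copy is needed because the
   library may reuse its own buffer on the next call to setlocale.
   The caller owns the copy. */
char *save_locale(int category)
{
    const char *current = setlocale(category, NULL);
    char *copy;

    if (current == NULL)
        return NULL;
    copy = malloc(strlen(current) + 1);
    if (copy != NULL)
        strcpy(copy, current);
    return copy;
}

/* Re-establish a locale string obtained from save_locale for the
   same category, then free it. Returns 1 on success, 0 on failure. */
int restore_locale(int category, char *saved)
{
    int ok = saved != NULL && setlocale(category, saved) != NULL;

    free(saved);
    return ok;
}
```

A function such as handle_dates would call save_locale(LC_TIME), switch to the Spanish locale with setlocale, do its date/time processing, and finish with restore_locale(LC_TIME, saved).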

The localeconv Function

The header locale.h contains a definition for the type struct lconv whose members provide access to information regarding the formatting of monetary and non-monetary numeric values. Standard C requires that the following members be defined:

The elements of grouping and mon_grouping are interpreted according to the following conventions:

The value of p_sign_posn and n_sign_posn is interpreted according to the following:

Any of the pointer members except decimal_point can point to an empty string, which indicates the value is not available in the current locale or is of zero length. In the "C" locale, the decimal-point character is a period and all other pointer members point to an empty string.

The members having type char contain nonnegative values. If a value of CHAR_MAX is present, it indicates that the real value is not available in the current locale. In the "C" locale, these members all have value CHAR_MAX.

The structure may contain other members. The ordering of members is unspecified.
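
To make the grouping members more concrete, here is a minimal sketch of how a program might apply a grouping string to a run of digits. It assumes the Standard C rules for grouping and mon_grouping: each element gives the number of digits in a group, working leftward from the decimal point; an element equal to CHAR_MAX means no further grouping is performed; and reaching the terminating zero means the previous element is reused for the remaining digits. The function name group_digits and its 128-byte working buffer are assumptions made for illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Insert `sep` into the digit string `digits` as directed by a
   grouping string in the style of struct lconv's grouping member.
   Returns 0 on success, -1 if the result does not fit. */
int group_digits(const char *digits, const char *grouping,
                 const char *sep, char *out, size_t outsize)
{
    char tmp[128];
    size_t pos = sizeof tmp;           /* tmp is filled from the right */
    size_t remaining = strlen(digits); /* digits not yet copied */
    size_t seplen = strlen(sep);
    const char *g = grouping;

    while (remaining > 0) {
        size_t take = remaining;       /* default: no further grouping */

        /* An empty grouping string or CHAR_MAX leaves the default. */
        if (*g != '\0' && *g != CHAR_MAX &&
            (size_t)(unsigned char)*g < remaining)
            take = (size_t)(unsigned char)*g;

        if (take > pos)
            return -1;
        pos -= take;
        memcpy(tmp + pos, digits + remaining - take, take);
        remaining -= take;

        if (remaining > 0) {
            if (seplen > pos)
                return -1;
            pos -= seplen;
            memcpy(tmp + pos, sep, seplen);
            if (g[1] != '\0')
                g++;                   /* else repeat the last group size */
        }
    }

    if (sizeof tmp - pos + 1 > outsize)
        return -1;
    memcpy(out, tmp + pos, sizeof tmp - pos);
    out[sizeof tmp - pos] = '\0';
    return 0;
}
```

In real use the grouping and sep arguments would come from localeconv()->grouping and localeconv()->thousands_sep (or their mon_ counterparts). With a grouping string of "\3" and a comma separator, "1234567" becomes "1,234,567"; with "\3\2" it becomes "12,34,567".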

A call to localeconv returns a pointer to a structure of type struct lconv, which contains values corresponding to the current locale. This structure may be overwritten by future calls to localeconv or by calls to setlocale that refer to the categories LC_ALL, LC_MONETARY, or LC_NUMERIC. (Note, however, that saving a copy of this structure is insufficient to save the complete numeric formatting description, since the strings pointed to by the char * members might also be overwritten by subsequent calls to those functions.)

Listing 6 shows a call to localeconv and the subsequent display of two of the structure members' contents.

The output produced is:

Locale: LOC_French
decimal_point:   ,
int_curr_symbol: FRF
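
Because the strings reached through struct lconv may be overwritten later, a program that needs the values after changing locale has to copy the strings themselves, not just the structure. The following is a minimal sketch of that idea for the same two members; it is not the author's Listing 6, and the names num_snapshot, copy_field, and take_snapshot, as well as the 16-byte field size, are assumptions.

```c
#include <locale.h>
#include <string.h>

/* A private snapshot of two lconv members. The strings are copied
   because the pointers in struct lconv may refer to storage that a
   later localeconv or setlocale call overwrites. */
struct num_snapshot {
    char decimal_point[16];
    char int_curr_symbol[16];
};

/* Copy src into a fixed-size field, truncating if necessary. */
static void copy_field(char *dst, size_t size, const char *src)
{
    strncpy(dst, src, size - 1);
    dst[size - 1] = '\0';
}

/* Capture the current locale's values for the two members. */
void take_snapshot(struct num_snapshot *snap)
{
    const struct lconv *lc = localeconv();

    copy_field(snap->decimal_point, sizeof snap->decimal_point,
               lc->decimal_point);
    copy_field(snap->int_curr_symbol, sizeof snap->int_curr_symbol,
               lc->int_curr_symbol);
}
```

In the "C" locale the snapshot holds "." and an empty string; in a French locale like the one above it would be expected to hold the comma and the international currency symbol that implementation supplies.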

The strcoll Function

This function works just like strcmp except that strcoll uses a collating sequence established via the LC_COLLATE category. Listing 7 shows an example program.

Two inputs and their (possible) corresponding outputs are:

Enter two strings: e ê
e < ê

Enter two strings: ê f 
ê < f
As shown by the outputs, ê sorts between e and f, just as the French would want.
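
A common use of strcoll is as the basis of a qsort comparison function, so that a list of strings is ordered the way users of the locale expect. The sketch below is an illustration only, not the author's Listing 7; the names by_collation and sort_in_locale are hypothetical, and the resulting order for non-ASCII characters depends on the locale the implementation provides.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparison function: orders two strings by the collating
   sequence of the current LC_COLLATE category. */
int by_collation(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Sort an array of strings in the collating order of locale `name`.
   LC_COLLATE is left set to `name` afterward.
   Returns 1 on success, 0 if the locale could not be established. */
int sort_in_locale(const char **words, size_t count, const char *name)
{
    if (setlocale(LC_COLLATE, name) == NULL)
        return 0;
    qsort(words, count, sizeof *words, by_collation);
    return 1;
}
```

Only the LC_COLLATE category is changed here, so character classification and numeric formatting continue to follow whatever locale was already in effect for those categories.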

More to Come

In the next installment I will deal with more internationalization issues, such as multibyte characters and ISO C Amendment 1.

Recommended Reading

The following books and publications will likely be of use to C programmers interested in further information on internationalization:

ANSI/ISO 9899:1990 Programming Language C. American National Standards Institute, New York, NY. 1990, 219 pp.

ANSI/ISO 9899:1990 Programming Language C: Amendment 1. American National Standards Institute, New York, NY. 1995, 52 pp.

Digital Guide to Developing International Software. Digital Press, Bedford, MA. 1991, 381 pp., ISBN 1-55558-063-7.

Lunde, Ken. Understanding Japanese Information Processing. O'Reilly & Associates, Sebastopol, CA. 1993, 435 pp., ISBN 1-56592-043-0.

Madell, Tom, Clark Parsons, and John Abegg. Developing and Localizing International Software. Hewlett-Packard Professional Books/Prentice Hall, Englewood Cliffs, NJ. 1994, 150 pp., ISBN 0-13-300674-3.

Plauger, P.J., regular column in The C/C++ Users Journal. R&D Publications, Lawrence, KS. 1990-1994.

Plauger, P.J., regular column in The Journal of C Language Translation. IECC, Cambridge, MA. 1990-1994, ISSN 1042-5721.

Gallmeister, Bill O., POSIX.4: Programming for the Real World. O'Reilly & Associates, Inc., Sebastopol, CA. 1995, ISBN 1-56592-074-0.

The Unicode Consortium. The Unicode Standard: Worldwide Character Encoding, Version 1.0, Volume 1. Addison-Wesley, Reading, MA. 1990, 682 pp., ISBN 0-201-56788-1.

The Unicode Consortium. The Unicode Standard: Worldwide Character Encoding, Version 1.0, Volume 2. Addison-Wesley, Reading, MA. 1992, 439 pp., ISBN 0-201-60845-6.

The Unicode Consortium. The Unicode Standard, Version 1.1. Changes from Version 1 in draft form from The Unicode Consortium. 1994.