September 1995/Internationalization: A Primer

Features

Internationalization: A Primer

Rex Jaeschke

Rex Jaeschke is an independent computer consultant, author, and seminar leader. He is chair of the ANSI C committee X3J11. Readers may contact Rex at 2051 Swans Neck Way, Reston, VA 22091 or via UUCP at rex@aussie.com.
This is the first in a two-part series covering software internationalization support available in Standard C.

Introduction
Prior to the standardization of C, C programs executed in what amounted to a "USA-English" environment. For example, isupper returned true only for the Roman letters A-Z, the decimal-point character printed by printf was the period, and dates were formatted as mm/dd/yy.
As a result of standardization, the behavior of various standard library functions is now described in terms of the execution environment, which can be changed at run time. For example, the is* and to* functions in ctype.h can properly handle local characters, the printf and scanf families can deal with decimal points other than a period, and local date and time formats are permitted.
A major addition to C was the ability to allow characters in other alphabets to be used inside character constants, comments, header names, and string literals. (However, letters in identifiers are still restricted to the Roman alphabet, and arithmetic constants must use digits from the Arabic number system.) A handful of new functions were defined to deal with interpreting and/or converting multibyte and wide characters.
Soon after the initial ISO C Standard was published, work began on an addendum to that standard. The primary purpose of this addendum was to provide further support for multibyte and wide characters. This support was achieved via new headers, which contain macros, type definitions, and functions. The addendum is known officially as ISO C Amendment 1.
This article describes only the support Standard C provides for achieving internationalization. It does not attempt to describe a strategy for achieving internationalization.

Definition of Terms
Before we can discuss internationalization in detail or even look at programming examples, it is necessary to define a number of terms, many of which are interrelated.

Byte — The unit of data storage large enough to hold any member of the basic execution character set.

Category — One of a number of components that, when combined, describe a locale.

char — An integral type whose range of values can represent distinct codes for all members of the basic execution character set. The sizeof operator produces the size of its operand, measured in bytes. Since sizeof(char) is one by definition, an object of type char occupies exactly one byte.

Character — A bit representation that fits in a byte. Each member of the basic source and basic execution character sets must fit into a byte.

Character constant — A sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.

Comment — A sequence of zero or more multibyte characters enclosed between /* and */.

Encoding scheme — A set of rules for parsing a stream of bytes into a group of coded characters. The meaning of a multibyte character can vary if the encoding scheme provides state-dependent encoding via shift sequences. Standard C requires that no encoding scheme have a byte with value zero as the second or subsequent byte of a multibyte character. This restriction allows many of the traditional string manipulation functions to be used transparently with strings containing multibyte characters. Encoding schemes for wide characters are quite simple: all characters have the same internal width and each character has a unique value. Examples of encoding schemes are: ASCII, EBCDIC, EUC, JIS, Shift-JIS, Unicode, 10646.UCS-2, and 10646.UCS-4.

Execution character set — The set of characters available for use during execution of a program. The basic execution character set must include all of the characters in the basic source character set, the null character, and control characters representing alert, backspace, carriage return, form feed, and new-line. The members of the extended execution character set are locale-specific and may be single-byte or multibyte characters.

Header name — A sequence of one or more multibyte characters that identifies the name of a header used as the subject of an #include preprocessor directive.

Locale — A set of conventions based on some nationality, culture, or language. A locale is made up of a set of categories.

Locale-specific behavior — Behavior pertaining to a particular locale. Examples include the decimal-point character and date and time formats.

Mixed locale — A composite locale comprised of a set of conventions representing two or more distinct nationalities, cultures, or languages. For example, a Swiss locale might include some German aspects as well as some French and/or Italian aspects.

Multibyte character — A character from the execution character set that may require more than a single byte for its representation. Examples include character sets that accommodate Japanese, Chinese, Korean, and Arabic alphabets. Multibyte characters can appear in character constants, comments, header names, and string literals.

Multibyte character string — A sequence of zero or more multibyte characters terminated by, and including, a null character.

Null character — A byte with all bits set to zero.

Null wide character — A wide character with all bits set to zero.

Shift sequence — A sequence of one or more single-byte characters that indicate a change in encoding. When a sequence of multibyte characters is being scanned, the detection of a shift sequence causes the characters following to be interpreted differently until the shift state is changed (possibly by restoring to the original state) or the end of the character sequence is reached. All comments, string literals, character constants, and header names are required to begin and end in their initial shift state. In the initial shift state, the single-byte characters you use to write a C program (from the basic source character set, described below) have their usual meaning — they do not alter the shift state. A redundant shift sequence is one that is followed immediately by another shift sequence or is one that switches to the (already) current mode.

Single-byte character — A character from the execution character set that can be represented in a single byte. Examples of character sets made up entirely of single-byte characters are ASCII and EBCDIC.

Source character set — The set of characters available for use in writing source code. The basic source character set must include the 52 uppercase and lowercase letters from the English alphabet, the digits 0-9, the graphic characters ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~, the space character, and control characters representing horizontal tab, vertical tab, and form feed. If any other characters are seen except inside character constants, comments, header names, and string literals, the behavior is undefined. The extended source character set may include other single-byte and/or multibyte characters.

String — A sequence of zero or more multibyte characters terminated by, and including, a null character.

String literal — A sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz".

wchar_t — An integral type whose range of values can represent distinct codes for all members of the largest extended execution character set specified among the supported locales. This is the type of a wide character. If all character sets used are single-byte, an implementation may define wchar_t as a synonym for char.

Wide character — An object capable of representing distinct codes for all members of the largest extended execution character set specified among the supported locales. A wide character has type wchar_t. Unlike a multibyte character, each wide character takes up the same amount of storage. One character set based on wide characters is Unicode, a subset of the code defined by ISO 10646. When represented as a wide character, the null character shall have the code value zero.

Wide character constant — A sequence of one or more multibyte characters enclosed in single-quotes and prefixed with L, as in L'x'.

Wide character string — A sequence of zero or more wide characters terminated by, and including, a null wide character.

Wide character string literal — A sequence of zero or more multibyte characters enclosed in double-quotes and prefixed with L, as in L"xyz".

Locales
Two locales are defined by Standard C: "C" and the implementation-defined native locale "". (An implementation that provides the bare minimum locale support will make the native locale "" the same as locale "C".) Any number of other locales are permitted, with arbitrary names.
At program startup, the default locale is "C". Changing one or more categories of a locale to something other than "C" requires a call to the function setlocale, declared in locale.h.

A Simple Example
Listing 1 shows a program that performs a number of locale-specific operations based on a user-specified locale.
Listing 2 shows several sets of input and their (possible) corresponding output. How they actually appear depends on the implementation.
In case 1, the locale name is entered and the terminating new-line discarded. In the implementation used to build the examples in this article (Microsoft's Windows NT), 'american', 'french', 'german', 'spanish', and 'swedish' are valid locale names.
Since the category macro LC_ALL is used, the call to setlocale in case 2 attempts to establish the user-specified locale as the current default for all locale categories. If this is not possible, the function returns a null pointer.
In case 3, the program calls getchar to read a single character from the input. Since isalpha is locale-specific, letters other than Roman A-Z and a-z can be accepted, as shown by the output from case 4.
The output from case 5 shows that printf is also locale-specific. In some countries, the comma is used as the decimal-point character.
The function strftime provides date and time information formatted in a locale-specific manner. As shown, the day, month, and year order and separators can vary from one country to another. Also, users can get day and month names in their native language.

Approaches to Using Locales
You can use locales three different ways:

Never call setlocale at all. By default, all processing will be done using locale "C", giving a "USA-English" mode of operation. This approach is simple; it's what we've had these past 20+ years.

At the start of main, call setlocale(LC_ALL, "") once to establish all categories to the native locale, whatever that may be. This approach is fairly straightforward; all locale-specific functions do their thing and you work in (presumably) your native environment. However, you should consider the case where a program built using this approach is run in an environment where the native locale is unknown. Will the program behave in a reasonable manner or did you hardcode something that really is locale-specific?

Use multiple locales by switching some or all categories back and forth from one locale to another. This approach requires more care since you must keep careful track of the current mode of each locale category as you go, making sure to not make unreasonable assumptions along the way. Another concern is whether all locales are known in advance to the programmer or whether one or more are supplied at run time.

Locale Names
The spelling of a locale name is unspecified by Standard C; a locale name is simply an implementation-defined string. As such, it may contain multibyte characters. Consider the case of a locale designed to support data processing in Germany. A few possible spellings for that locale's name are "german", "deutsch", "deitser", "alemán", and "allemand", along with their uppercase equivalents or with leading capital letters. It depends on the programmers' cultural background and native language as to which form they would prefer or expect to see or use.
Hard-coding locale names is probably a bad idea, particularly if portability is an issue. If not, you might question this advice. However, consider the case where your implementation does not support a locale that you need. If that implementation provides a way to add locales, the locale may come from another vendor or user. And its name may be out of your control. Of course, if that implementation does not provide a way to integrate third-party locales, you may have to move to one that does, and that is a port!
In any event, for the purposes of this discussion we will use a header called "locnames.h" which contains macros whose names are of the form LOC_*. For example:

#define LOC_American "american"
#define LOC_Arabic "arabic"
...
#define LOC_French "french"
#define LOC_German "german"
...
#define LOC_Spanish "spanish"
#define LOC_Swiss "swiss"
This user-defined and maintained header can contain conditional-compilation directives as necessary, to accommodate different locale name spellings.
Clearly, it is useful to hide the real locale name spelling from the programmer. However, when the value of a locale name macro is displayed directly, the user will see the underlying spelling. If that is undesirable, the name of the macro can be displayed instead, as shown in Listing 3.
The macro STR is defined in "locnames.h" using:
#define STR(x) #x
The possible outputs now become:
Established locale LOC_German
or
Can't establish locale LOC_German

Locale Categories
A locale can be established for all locale-specific operations or for just a subset, depending on the category specified in a call to setlocale. Standard C defines six category macros in locale.h:

LC_ALL — This category includes all other categories.

LC_COLLATE — This category affects the behavior of the strcoll and strxfrm functions.

LC_CTYPE — This category affects the behavior of the character handling functions with the exception of isdigit and isxdigit. It also affects the behavior of the multibyte functions.

LC_MONETARY — This category affects the monetary formatting information returned by the function localeconv.

LC_NUMERIC — This category affects the decimal-point character for the formatted I/O functions (such as printf and scanf) and the string conversion functions (such as atof and strtod), as well as the non-monetary formatting information returned by the function localeconv.

LC_TIME — This category affects the behavior of the strftime function.
These macros expand to integral constant expressions having distinct values. An implementation may define additional category macros provided their names begin with the characters LC_ followed by an uppercase letter. (The POSIX standard defines a category called LC_MESSAGES which allows the user to find out the spelling of the affirmative and negative response to a yes/no question.)
A mixed locale can be established by calling setlocale a number of times, once per category. For example:

setlocale(LC_COLLATE, LOC_German);
setlocale(LC_CTYPE, LOC_French);
setlocale(LC_NUMERIC, LOC_Italian);
[More recent standards based on C locales let you talk about groups of categories by ORing these constants, as in LC_COLLATE|LC_CTYPE. But the C Standard does not make clear that this is always permissible. — pjp]

The setlocale Function
When a call to setlocale succeeds, it returns a pointer to the string associated with the specified locale name and category combination. Let us call such a string a locale string. The format of a locale string is unspecified by Standard C. Listing 4 shows a program that displays several such locale strings.
The output produced on one implementation that supported a French locale was:
|C|
|French_France.850|
|C|
|LC_COLLATE=C;LC_CTYPE=French_France.850; LC_MONETARY=C;
   LC_NUMERIC=C;LC_TIME=C|
(The last printf output is broken after semicolons to fit column width.)
When setlocale is called with a locale name of NULL, the locale string for the specified category in the current locale is returned. The category is not changed. Since the default locale at program startup is "C", the output produced in cases 1 and 3 is not surprising. (It is also not guaranteed by the C Standard, however.)
The output produced in cases 2 and 4 is, of course, specific to the implementation. When case 4 is executed, a mixed locale has already been established. Therefore, the locale string returned in this case must somehow include a description of that mixed locale.
The locale string produced for a given locale name and category combination can be used as the locale name in a subsequent call to setlocale for the same category. For example, Listing 5 shows the function handle_dates, which temporarily switches to Spanish date/time processing and then restores the original mode.
Assuming setlocale and malloc don't fail, the output produced is:
Saved current LC_TIME
Established LOC_Spanish LC_TIME
Doing LOC_Spanish-specific date/time processing
Restored saved LC_TIME
The memory allocated for a locale string is managed by the library. Since it may be recycled in subsequent calls to setlocale, it is the user's responsibility to make a copy for later use, as shown in cases 2 and 3.
There is no way to identify which locale is currently established. Since the format of a locale string is unspecified, there is not even a guarantee that the locale strings produced by two calls to setlocale using the same arguments compare equal.

The localeconv Function
The header locale.h contains a definition for the type struct lconv whose members provide access to information regarding the formatting of monetary and non-monetary numeric values. Standard C requires that the following members be defined:

char *decimal_point — The decimal-point character used to format non-monetary quantities.

char *thousands_sep — The character used to separate groups of digits before the decimal-point character in formatted non-monetary quantities.

char *grouping — A string whose elements indicate the size of each group of digits in formatted non-monetary quantities.

char *int_curr_symbol — The international currency symbol applicable to the current locale. The first three characters contain the alphabetic international currency symbol. The fourth character is the character used to separate the international currency symbol from the monetary quantity. The fifth character is a null character.

char *currency_symbol — The local currency symbol applicable to the current locale.

char *mon_decimal_point — The decimal-point used to format monetary quantities.

char *mon_thousands_sep — The separator for groups of digits before the decimal-point character in formatted monetary quantities.

char *mon_grouping — A string whose elements indicate the size of each group of digits in formatted monetary quantities.

char *positive_sign — The string used to indicate a nonnegative-valued formatted monetary quantity.

char *negative_sign — The string used to indicate a negative-valued formatted monetary quantity.

char int_frac_digits — The number of fractional digits (those after the decimal-point) to be displayed in an internationally formatted monetary quantity.

char frac_digits — The number of fractional digits (those after the decimal-point) to be displayed in a locally formatted monetary quantity.

char p_cs_precedes — Set to 1 or 0 if the currency_symbol respectively precedes or succeeds the value for a nonnegative formatted monetary quantity.

char p_sep_by_space — Set to 1 or 0 if the currency_symbol respectively is or is not separated by a space from the value for a nonnegative formatted monetary quantity.

char n_cs_precedes — Set to 1 or 0 if the currency_symbol respectively precedes or succeeds the value for a negative formatted monetary quantity.

char n_sep_by_space — Set to 1 or 0 if the currency_symbol respectively is or is not separated by a space from the value for a negative formatted monetary quantity.

char p_sign_posn — Set to a value indicating the positioning of the positive_sign for a nonnegative formatted monetary quantity.

char n_sign_posn — Set to a value indicating the positioning of the negative_sign for a negative formatted monetary quantity.
The elements of grouping and mon_grouping are interpreted according to the following conventions:

CHAR_MAX — No further grouping is to be performed.

0 — The previous element is to be repeatedly used for the remainder of the digits.

other — The integer value is the number of digits that comprise the current group. The next element is examined to determine the size of the next group of digits before the current group.
The value of p_sign_posn and n_sign_posn is interpreted according to the following:

0 — Parentheses surround the quantity and currency_symbol.

1 — The sign string precedes the quantity and currency_symbol.

2 — The sign string succeeds the quantity and currency_symbol.

3 — The sign string immediately precedes the currency_symbol.

4 — The sign string immediately succeeds the currency_symbol.
Any of the pointer members except decimal_point can point to an empty string, which indicates the value is not available in the current locale or is of zero length. In the "C" locale, the decimal-point character is a period and all other pointer members point to an empty string.
The members having type char contain nonnegative values. If a value of CHAR_MAX is present, it indicates that the real value is not available in the current locale. In the "C" locale, these members all have value CHAR_MAX.
The structure may contain other members. The ordering of members is unspecified.
A call to localeconv returns a pointer to a structure of type struct lconv, which contains values corresponding to the current locale. This structure may be overwritten by future calls to localeconv or by calls to setlocale that refer to the categories LC_ALL, LC_MONETARY, or LC_NUMERIC. (Note, however, that saving a copy of this structure is insufficient to save the complete numeric formatting description, since the strings pointed to by the char * members might also be overwritten by subsequent calls to those functions.)
Listing 6 shows a call to localeconv and the subsequent display of two of the structure members' contents.
The output produced is:
Locale: LOC_French
decimal_point:   ,
int_curr_symbol: FRF

The strcoll Function
This function works just like strcmp except that strcoll uses a collating sequence established via the LC_COLLATE category. Listing 7 shows an example program.
Two inputs and their (possible) corresponding outputs are:
Enter two strings: e ê
e < ê

Enter two strings: ê f 
ê < f
As shown by the outputs, ê sorts between e and f, just as the French would want.

More to Come
In the next installment I will deal with more internationalization issues, such as multibyte characters and ISO C Amendment 1.

Recommended Reading
The following books and publications will likely be of use to C programmers interested in further information on internationalization:
ANSI/ISO 9899:1990 Programming Language C. American National Standards Institute, New York, NY. 1990, 219 pp.
ANSI/ISO 9899:1990 Programming Language C: Amendment 1. American National Standards Institute, New York, NY. 1995, 52 pp.
Digital Guide to Developing International Software. Digital Press, Bedford, MA. 1991, 381 pp., ISBN 1-55558-063-7.
Lunde, Ken. Understanding Japanese Information Processing. O'Reilly & Associates, Sebastopol, CA. 1993, 435 pp., ISBN 1-56592-043-0.
Madell, Tom, Clark Parsons, and John Abegg. Developing and Localizing International Software. Hewlett-Packard Professional Books/Prentice Hall, Englewood Cliffs, NJ. 1994, 150 pp., ISBN 0-13-300674-3.
Plauger, P.J., regular column in The C/C++ Users Journal. R&D Publications, Lawrence, KS. 1990-1994.
Plauger, P.J., regular column in The Journal of C Language Translation. IECC, Cambridge, MA. 1990-1994, ISSN 1042-5721.
Gallmeister, Bill O., POSIX.4: Programming for the Real World. O'Reilly & Associates, Inc., Sebastopol, CA. 1995, ISBN 1-56592-074-0.
The Unicode Consortium. The Unicode Standard: Worldwide Character Encoding, Version 1.0, Volume 1. Addison-Wesley, Reading, MA. 1990, 682 pp., ISBN 0-201-56788-1.
The Unicode Consortium. The Unicode Standard: Worldwide Character Encoding, Version 1.0, Volume 2. Addison-Wesley, Reading, MA. 1992, 439 pp., ISBN 0-201-60845-6.
The Unicode Consortium. The Unicode Standard, Version 1.1. Changes from Version 1 in draft form from The Unicode Consortium. 1994.