Columns


Standard C

The Header <locale.h>

P.J. Plauger


P.J. Plauger is senior editor of The C Users Journal. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee, WG14. His latest book is Standard C, which he co-authored with Jim Brodie. You can reach him at pjp@plauger.uunet.

History

The header <locale.h> is an invention of X3J11, the committee that developed the C standard. You will find little that resembles locales in earlier implementations of C. That stands at odds with the committee's stated purpose, "to codify existing practice." Nevertheless, those of us active within X3J11 at that time felt we were acting out of the best of motives: self defense.

This particular header popped up about five years after work began on the standard. At that time, many of us felt that the standard was essentially complete. We were simply putting a few finishing touches on a product in which we had invested five years of our lives. Resistance was mounting to change of any sort.

I recall mentioning a change that I would have liked. (I forget now just what the change was.) In the interest of speeding closure, however, I suggested that the committee not make the change. An attendee from the UK, Keith Winter, also expressed support for the change. But he, too, was willing to let it slide. After all, he said, it was just one of many changes that would have to be made to the ISO standard for C.

Silence.

After we collected our collective wits, several of us simultaneously uttered the moral equivalent of, "Say what?" Winter went on to explain calmly that a number of Europeans were unhappy with certain parts of the C standard being developed by X3J11. It was simply too American in several critical ways. They despaired of trying to educate us insular Yanks about the needs of the world marketplace. Rather, they were content to wait and fight their battles on a more congenial field. The Europeans took it for granted that an ISO standard for C must differ from the ANSI standard.

Many of us disagreed with that position. We felt it imperative that whatever standard ANSI developed had to be acceptable to the international community. We had seen the effects in the past of computer language standards that differed around the world. Our five years of effort would be in vain, we felt, if the final word on C came from a separate committee second-guessing all our decisions.

So we sighed a deep sigh and asked the Europeans to show us their shopping list of changes. Most of the items on the list dealt with ways to adapt C programs to different cultures. That is a much more obvious problem in a land of many languages and nations such as Europe. Americans enjoy the luxury of a single (official) language and a fairly simple alphabet.

AT&T Bell Labs went so far as to host a special meeting to deal with various issues of internationalization. (This is a big word that people are uttering more and more often. It seems to have no acceptable synonym that is any shorter. The techie solution is to introduce the barbarism I18N, pronounced EYE eighteen EN. The 18 stands for the number of letters omitted.) Out of that meeting came the proposal for adding locale support to Standard C. The machinery eventually adopted is remarkably close to the original proposal.

Adding locales to C had the desired effect. Many of the objections to ANSI C as an international standard were derailed. It cost X3J11 an extra year, by my estimation, to hammer out locales. And we probably spent yet another year dealing with residual gripes from the international community. (And WG14, the ISO C standard committee, is still working on additions to the existing Standard.) Nevertheless, we succeeded in producing a standard for C that is currently identical at both ANSI and ISO levels.

Before Locales

Writing adaptive code is not entirely new. An early form sprang up about fifteen years ago in the UNIX operating system. Folks got the idea of adding environment variables to the system call that launches new processes. (That service is called exec, or some variant thereof, in UNIX land.) Environment variables are an open-ended set of names, each of which identifies a null-terminated string that represents its value. You can add, alter, or delete environment variables in a process. Should that process launch another process, the environment variables are automatically copied into the image of the new guy.

The new process can simply ignore environment variables. It loses a few dozen, or a few hundred, bytes of storage that it might otherwise enjoy. Or it can look for certain environment variables and study their current values. A common variable is TZ, which provides information to the library date functions about the current time zone. If the value of TZ is, say, EST05EDT, the time functions know to label local standard time as EST and local daylight saving time as EDT. The local (standard) time zone is five hours behind UTC, known in the past as Greenwich Mean Time.

Environment variables have many uses. They are a great way to smuggle file names into an application program. It is almost always a bad idea to wire file names directly into a program. Prompting the user for file names is mostly a good idea, except for secret files about which the user should not have to be informed. Asking for such a file name on the command line that starts the program is somewhat better, but it can be a nuisance. It is a particular nuisance if several programs in a suite need access to the same file name. That's why it is often much nicer to set an environment variable to the file name once and for all in a script that starts a session. The file name is captured in one place, but is made available to a whole hierarchy of programs.

If you are at all literate about MS-DOS, you probably know that that system supports environment variables too. They are just one of many good ideas borrowed from past experience with UNIX. I have purchased several bits of commercial software that use environment variables to advantage. A common use is to locate special directories that contain support files or that are well-suited for hosting temporary files. But they have many other uses as well.

The Standard C library includes the function getenv. You will find it declared in the standard header <stdlib.h>. Call getenv with the name of an environment variable and it will return a pointer to its value string, if there is one. It is not considered an error to reference a variable that is not defined.
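
As a minimal sketch of that behavior, the helper below wraps getenv and supplies a fallback when the variable is not defined. The names env_or_default and show_time_zone are invented for this illustration; only getenv itself comes from the Standard C library.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return the value of an environment variable, or a fallback
 * string when the variable is not defined. getenv reports an
 * undefined variable by returning a null pointer, not an error. */
const char *env_or_default(const char *name, const char *fallback)
{
    const char *value = getenv(name);

    return value != NULL ? value : fallback;
}

/* Print the time zone setting, if there is one. */
void show_time_zone(void)
{
    printf("TZ = %s\n", env_or_default("TZ", "(not set)"));
}
```

A program should treat the returned string as read-only. The Standard says the program shall not modify it.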

Note, however, that the Standard does not include setenv, the usual companion to getenv. That is the common name for the function that lets you alter the values associated with environment variables. Simply put, committee X3J11 couldn't decide how to describe the semantics of setenv. They differ too much among various single-user and multiprocessing systems. So you can write portable code that reads environment variables, but you can't alter them in a standard way.

Why Locales?

What do locales provide that environment variables do not? In a word, structure. This is the era of object-oriented hoopla. So you can look on locales, if you wish, as object-oriented environment variables. A single locale provides information on many related parameters. The values are consistent for a given culture. You would have to pump dozens of reserved names into the name space for environment variables to transmit the same amount of information. And you run a greater risk that subsets of the information get altered inconsistently.

When I talk about a culture, by the way, I don't mean just a group that speaks a common language. People in the USA write dates as 7/4/1776 (Independence Day). The same day in the UK is written as 4/7/1776 (Thanksgiving Day). Even within the USA, practices can vary. Where we civilians might write a debit as $-123.45, an accountant may well prefer (123.45).

For this reason, and others, locales have substructure. You can set an entire locale, or you can alter one or more categories. Separate categories exist for controlling collation sequences, character classification, monetary formatting, other numeric formatting, and times. The header <locale.h> defines several macros with names such as LC_COLLATE and LC_TIME. Each expands to an integer value that you can use as the category argument to setlocale, the function that alters locales. An implementation can choose to provide additional categories as well. A program that uses such added categories will, of course, be less portable than one that does not.

The idea behind categories is that an application may wish to tailor its locale. It may want to print dates in the local language and by the formatting rules of that language. But it may still opt to use the dot for a decimal point even though speakers of that language customarily write a comma. Or the application may adapt completely to a given locale, then change the monetary category to match a worldwide corporate standard for expressing accounting information.
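
The first kind of tailoring might be sketched as follows. The function name is hypothetical. The empty string "" asks setlocale for the implementation's native locale, and which locale that turns out to be is up to the implementation.

```c
#include <locale.h>
#include <stddef.h>

/* Adopt the implementation's native locale for everything except
 * numeric formatting, which stays with the "C" locale so that the
 * decimal point remains a dot. Returns 1 on success, 0 if either
 * request could not be honored. */
int use_native_locale_with_c_numbers(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return 0;               /* native locale unavailable, nothing changed */
    if (setlocale(LC_NUMERIC, "C") == NULL)
        return 0;               /* unexpected: "C" must always exist */
    return 1;
}
```

The order matters. Setting LC_ALL after LC_NUMERIC would wipe out the override.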

Much of the information provided in a locale is purely informative. C has never treated currency amounts as a special data type. It is no surprise, therefore, that the Standard C library is unaffected by a change in the monetary category. On the other hand, some changes in locale very definitely affect how certain library functions behave. If a culture uses a comma for a decimal point, then the scanf family (and strtod) should accept commas and the printf family should produce commas in the proper places. That is indeed what happens.
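
One way to watch the numeric category at work is to compare what printf produces with what the locale advertises. The sketch below does that for the current locale. The function name is made up for the example, and it uses sprintf so the result can be inspected.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Report whether printf-style output uses the decimal point that
 * the current locale advertises. Formats 1.5 with one fractional
 * digit and looks for the locale's decimal-point string right
 * after the leading digit. Returns 1 if it is found there. */
int printf_follows_locale(void)
{
    char buf[32];
    const char *point = localeconv()->decimal_point;

    sprintf(buf, "%.1f", 1.5);
    return strncmp(buf + 1, point, strlen(point)) == 0;
}
```

In the "C" locale the buffer holds 1.5 and the function returns 1. In a locale whose decimal point is a comma, a conforming library should produce 1,5 and the function should still return 1.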

Here are all the places where library behavior changes with locale:

Using Locales

If you are half as nervous as I am, this litany of changes should scare you. How do you write portable code if large chunks of the Standard C library can change behavior underfoot? Can you ship code to Germany and know what isalpha will do when it runs there? If you mix your code with functions from another source, how much trouble can they cause? Each time your functions get control, you may be running in a different locale. How do you code under those conditions?

X3J11 anguished about such issues when we spelled out the behavior of locales. We recognized that many people don't want to be bothered with this machinery at all. Those folks should suffer little from the addition of locales. Others have only modest goals. They want to trade in the Americanisms wired into older C for conventions more in tune with their culture. Still others are ambitious. They want to write code that can be sold unchanged, in binary form, in numerous markets. That code must be very sophisticated about changing locales.

The simplest way to use locales is to ignore them. Every Standard C program starts up in the "C" locale. In this locale, the traditional library functions behave pretty much as they always have. islower returns a nonzero value only for the 26 lowercase letters of the English alphabet, for example. The decimal point is a dot. If your program never calls setlocale, none of this behavior can change.
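
A quick way to confirm the islower claim on a given implementation is to count the characters it accepts. This is only a sketch, and count_lowercase is a name invented here.

```c
#include <ctype.h>
#include <limits.h>

/* Count the characters for which islower is true, over every
 * value representable as unsigned char. */
int count_lowercase(void)
{
    int c, n = 0;

    for (c = 0; c <= UCHAR_MAX; ++c)
        if (islower(c))
            ++n;
    return n;
}
```

In a program that never calls setlocale, the count should be 26. After a change to the LC_CTYPE category it may well be larger.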

The next simplest way to use locales is to change once, just after program startup, and leave it at that. The C Standard requires no other locale names besides "C". But it does define a default locale designated by the empty string "". If your program executes

setlocale(LC_ALL, "")

it should shift to this default locale. Presumably, each implementation will devise a way to determine a default locale that pleases the locals. (An implementation that doesn't care a hoot about locales can make the default locale the same as the "C" locale, of course.)
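
A cautious program checks the result of that call. The sketch below, with a hypothetical function name, relies on two facts from the Standard: setlocale returns a null pointer and leaves the locale alone when it cannot honor a request, and a null pointer as the second argument merely queries the current locale.

```c
#include <locale.h>
#include <stddef.h>

/* Try to switch the whole program to the implementation's native
 * locale. Returns the name of the locale in effect afterward,
 * which is the old one if the switch failed. */
const char *adopt_native_locale(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        name = setlocale(LC_ALL, NULL);     /* query: still the old locale */
    return name;
}
```

Call it once near the top of main, before any code that formats or classifies text. The string it returns can be overwritten by the next call to setlocale, so copy it if you need to keep it.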

You must be more careful in using the library, once the locale can change on you. Some things get easier, such as displaying pretty dates or skipping the appropriate characters for white space. Other things get chancier, such as parsing strings with the functions in <ctype.h>. In a pinch, you can always revert part or all of the locale to the "C" locale, as in:

char *s1 = setlocale(LC_CTYPE, NULL);   /* query, don't change */
char *s2 = malloc(strlen(s1) + 1);
if (s2 == NULL)
    <despair>
strcpy(s2, s1);                         /* save the old name */
setlocale(LC_CTYPE, "C");
<use ctype functions safely>
setlocale(LC_CTYPE, s2);                /* restore the same category */
free(s2);

You must copy the locale string returned by the first call because a later call to setlocale, including the one that switches to the "C" locale, may overwrite it.
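
Filled out into a complete function, the idiom might look like the sketch below. It queries the current LC_CTYPE name by passing a null pointer as the locale argument, copies the name, switches to "C", and restores the saved category on the way out. The function name and the -1 error convention are choices made for this example.

```c
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Count the letters in s as classified by the "C" locale, no
 * matter which LC_CTYPE category is in effect on entry.
 * Returns -1 if the locale could not be saved or switched. */
int count_c_letters(const char *s)
{
    const char *current = setlocale(LC_CTYPE, NULL);  /* query only */
    char *saved;
    int n = 0;

    if (current == NULL)
        return -1;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);     /* copy before the next setlocale call */

    if (setlocale(LC_CTYPE, "C") == NULL) {
        free(saved);
        return -1;
    }
    for (; *s != '\0'; ++s)
        if (isalpha((unsigned char)*s))
            ++n;

    setlocale(LC_CTYPE, saved); /* restore the caller's category */
    free(saved);
    return n;
}
```

For example, count_c_letters("ab1 c!") returns 3.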

I won't go into more sophisticated manipulation of locales at this point. That must wait until we have covered some of the implementation issues raised so far. You will find that they are dizzying enough in their own right.

What the Standard Says

If your primary goal is to query the current locale, you need to read at least two chunks of the Standard. The first provides an introduction to the standard header <locale.h>:

4.4 Localization <locale.h>

The header <locale.h> declares two functions, one type, and defines several macros.

The type is

struct lconv
which contains members related to the formatting of numeric values. The structure shall contain at least the following members, in any order. The semantics of the members and their normal ranges is explained in 4.4.2.1. In the "C" locale, the members shall have the values specified in the comments.

char *decimal_point;      /* "." */
char *thousands_sep;      /* "" */
char *grouping;           /* "" */
char *int_curr_symbol;    /* "" */
char *currency_symbol;    /* "" */
char *mon_decimal_point;  /* "" */
char *mon_thousands_sep;  /* "" */
char *mon_grouping;       /* "" */
char *positive_sign;      /* "" */
char *negative_sign;      /* "" */
char int_frac_digits;     /* CHAR_MAX */
char frac_digits;         /* CHAR_MAX */
char p_cs_precedes;       /* CHAR_MAX */
char p_sep_by_space;      /* CHAR_MAX */
char n_cs_precedes;       /* CHAR_MAX */
char n_sep_by_space;      /* CHAR_MAX */
char p_sign_posn;         /* CHAR_MAX */
char n_sign_posn;         /* CHAR_MAX */
The macros defined are NULL (described in 4.1.5); and

LC_ALL
LC_COLLATE
LC_CTYPE
LC_MONETARY
LC_NUMERIC
LC_TIME
which expand to integral constant expressions with distinct values, suitable for use as the first argument to the setlocale function. Additional macro definitions, beginning with the characters LC_ and an upper-case letter (see footnote 100), may also be specified by the implementation. [end of excerpt]
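
The "C" locale values listed in the comments above can be spot-checked with localeconv, the second function that <locale.h> declares. What follows is a sketch, not an exhaustive test, and looks_like_c_locale is a name invented here.

```c
#include <limits.h>
#include <locale.h>

/* Return 1 if the current numeric and monetary conventions match
 * a sample of the values specified for the "C" locale: a dot for
 * the decimal point, empty strings for the separator and currency
 * symbol, and CHAR_MAX for two of the char members. */
int looks_like_c_locale(void)
{
    const struct lconv *lc = localeconv();

    return lc->decimal_point[0] == '.' && lc->decimal_point[1] == '\0'
        && lc->thousands_sep[0] == '\0'
        && lc->currency_symbol[0] == '\0'
        && lc->frac_digits == CHAR_MAX
        && lc->p_sign_posn == CHAR_MAX;
}
```

A program that has not yet called setlocale should see this function return 1.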

The second chunk you must read is the description of localeconv, the function that lets you query the current locale:

4.4.2 Numeric Formatting Convention Inquiry

4.4.2.1 The localeconv Function

Synopsis

#include <locale.h>
struct lconv *localeconv(void);

Description

The localeconv function sets the components of an object with type struct lconv with values appropriate for the formatting of numeric quantities (monetary and otherwise) according to the rules of the current locale.

The members of the structure with type char * are pointers to strings, any of which (except decimal_point) can point to "", to indicate that the value is not available in the current locale or is of zero length. The members with type char are non-negative numbers, any of which can be CHAR_MAX to indicate that the value is not available in the current locale. The members include the following:

char *decimal_point
The decimal-point character used to format non-monetary quantities.

char *thousands_sep
The character used to separate groups of digits before the decimal-point character in formatted non-monetary quantities.

char *grouping
A string whose elements indicate the size of each group of digits in formatted non-monetary quantities.

char *int_curr_symbol
The international currency symbol applicable to the current locale. The first three characters contain the alphabetic international currency symbol in accordance with those specified in ISO 4217 Codes for the Representation of Currency and Funds. The fourth character (immediately preceding the null character) is the character used to separate the international currency symbol from the monetary quantity.

char *currency_symbol
The local currency symbol applicable to the current locale.

char *mon_decimal_point
The decimal-point used to format monetary quantities.

char *mon_thousands_sep
The separator for groups of digits before the decimal-point in formatted monetary quantities.

char *mon_grouping
A string whose elements indicate the size of each group of digits in formatted monetary quantities.

char *positive_sign
The string used to indicate a non-negative-valued formatted monetary quantity.

char *negative_sign
The string used to indicate a negative-valued formatted monetary quantity.

char int_frac_digits
The number of fractional digits (those after the decimal-point) to be displayed in an internationally formatted monetary quantity.

char frac_digits
The number of fractional digits (those after the decimal-point) to be displayed in a formatted monetary quantity.

char p_cs_precedes
Set to 1 or 0 if the currency_symbol respectively precedes or succeeds the value for a non-negative formatted monetary quantity.

char p_sep_by_space
Set to 1 or 0 if the currency_symbol respectively is or is not separated by a space from the value for a non-negative formatted monetary quantity.

char n_cs_precedes
Set to 1 or 0 if the currency_symbol respectively precedes or succeeds the value for a negative formatted monetary quantity.

char n_sep_by_space
Set to 1 or 0 if the currency_symbol respectively is or is not separated by a space from the value for a negative formatted monetary quantity.

char p_sign_posn
Set to a value indicating the positioning of the positive_sign for a non-negative formatted monetary quantity.

char n_sign_posn
Set to a value indicating the positioning of the negative_sign for a negative formatted monetary quantity.

The elements of grouping and mon_grouping are interpreted according to the following:

CHAR_MAX No further grouping is to be performed.

0 The previous element is to be repeatedly used for the remainder of the digits.

other The integer value is the number of digits that comprise the current group. The next element is examined to determine the size of the next group of digits before the current group.

The value of p_sign_posn and n_sign_posn is interpreted according to the following:

0 Parentheses surround the quantity and currency_symbol.

1 The sign string precedes the quantity and currency_symbol.

2 The sign string succeeds the quantity and currency_symbol.

3 The sign string immediately precedes the currency_symbol.

4 The sign string immediately succeeds the currency_symbol.

The implementation shall behave as if no library function calls the localeconv function.

Returns

The localeconv function returns a pointer to the filled-in object. The structure pointed to by the return value shall not be modified by the program, but may be overwritten by a subsequent call to the localeconv function. In addition, calls to the setlocale function with categories LC_ALL, LC_MONETARY, or LC_NUMERIC may overwrite the contents of the structure.

Footnote:

100. See future library directions (4.13.3). [end of excerpt]
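
The rules for grouping and mon_grouping are easier to follow in code. The sketch below applies a grouping string to a string of digits, working from the right as the excerpt describes. The function name, its parameters, and the limit of 64 separators are assumptions made for this illustration. Nothing like it is part of the Standard C library.

```c
#include <limits.h>
#include <string.h>

/* Insert sep between groups of digits, as directed by a grouping
 * string in the style of struct lconv. digits holds only decimal
 * digits. Returns 0 on success, -1 if out is too small. */
int group_digits(const char *digits, const char *grouping,
                 const char *sep, char *out, size_t outsize)
{
    size_t len = strlen(digits);
    size_t seplen = strlen(sep);
    size_t left = len;      /* digits not yet assigned to a group */
    size_t breaks[64];      /* indexes in digits where a group starts */
    size_t nsep = 0;        /* number of separators needed */
    size_t size = 0;        /* size of the current group */
    size_t i, o = 0;

    /* Walk the grouping string, marking breaks from right to left. */
    while (*grouping != CHAR_MAX) {     /* CHAR_MAX: no further grouping */
        if (*grouping != 0)             /* 0: keep repeating previous size */
            size = (size_t)(unsigned char)*grouping++;
        if (size == 0 || size >= left || nsep == 64)
            break;                      /* nothing left to separate */
        left -= size;
        breaks[nsep++] = left;
    }

    if (len + nsep * seplen + 1 > outsize)
        return -1;
    for (i = 0; i < len; ++i) {
        /* The leftmost break was recorded last. */
        if (nsep > 0 && i == breaks[nsep - 1]) {
            memcpy(out + o, sep, seplen);
            o += seplen;
            --nsep;
        }
        out[o++] = digits[i];
    }
    out[o] = '\0';
    return 0;
}
```

With the grouping string "\3" and the separator ",", the digits 1234567 come out as 1,234,567. With "\3\2" they come out as 12,34,567. In the "C" locale, where grouping is the empty string, the digits pass through unchanged.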

Future Attractions

Next month, I will discuss ways to implement locales. Since there is little or no history in this area, I can be particularly inventive. That makes locales particularly interesting to tinkerers like me. Internationalization is becoming more important. That should make the topic of interest to many of you. As for the rest of you, at least you can see how much trouble your code will encounter when everyone starts altering locales underfoot.