P.J. Plauger is senior editor of The C Users Journal. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee, WG14. His latest book is Standard C, which he co-authored with Jim Brodie. You can reach him at pjp@plauger.uunet.
Introduction
Last month, I introduced the header <locale.h> and described its brief history. I showed how to adapt a program to the default locale, and how to partially revert behavior to the "C" locale when necessary. Now it's time to look at some implementation details.
The easiest part is the function localeconv. All it must do is return a pointer to a structure describing (parts of) the current locale. That structure has type struct lconv, which is defined in <locale.h>. Here are the easy parts of the implementation. Listing 1 shows the file locale.h. Listing 2 shows localeco.c. (The name is chopped to eight letters because of file naming restrictions on MS-DOS and other systems.) Packed in with localeconv are the structures holding the current and "C" locales.
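Seen from the caller's side, localeconv is just as simple. The following is a minimal sketch, not part of the listings; the function name decimal_point_is is hypothetical. It relies only on the decimal_point member that the C Standard requires in struct lconv.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Report whether the current locale writes numbers with the given
   decimal point, e.g. "." in the "C" locale.
   decimal_point_is is a hypothetical name used for illustration. */
int decimal_point_is(const char *s)
{
    const struct lconv *p = localeconv();   /* read only, never modify *p */

    assert(p != NULL && s != NULL);
    return strcmp(p->decimal_point, s) == 0;
}
```

A program that has not yet called setlocale runs in the "C" locale, so decimal_point_is(".") is nonzero at startup.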
I have defined additional fields in struct lconv, over and above those specified by the C Standard. One field points at the name of the locale. Another links these structures together into a list. (Initially, the "C" locale is the only entry on the list.) The remaining fields contain information that changes with a change in locale. The Standard C library I have written defines several such fields. Here, I show only the ones that I've discussed in earlier columns. You can see what is involved in controlling the tables used by the functions in <ctype.h>.
What the Standard Says
The setlocale function introduces many more implementation issues than localeconv. I showed what the Standard says about localeconv last month. Here is what it has to say about setlocale:
4.4.1 Locale Control
4.4.1.1 The setlocale Function
Synopsis
    #include <locale.h>
    char *setlocale(int category, const char *locale);
Description
The setlocale function selects the appropriate portion of the program's locale as specified by the category and locale arguments. setlocale can change or query the program's entire current locale or portions thereof. The value LC_ALL for category names the program's entire locale; the other values for category name only a portion of the program's locale. LC_COLLATE affects the behavior of the strcoll and strxfrm functions. LC_CTYPE affects the behavior of the character handling functions[101] and the multibyte functions. LC_MONETARY affects the monetary formatting information returned by localeconv. LC_NUMERIC affects the decimal-point character for the formatted input/output functions and the string conversion functions, as well as the non-monetary formatting information returned by the localeconv function. LC_TIME affects the behavior of the strftime function.
A value of "C" for locale specifies the minimal environment for C translation; a value of "" for locale specifies the implementation-defined native environment. Other implementation-defined strings may be passed as the second argument to setlocale.
At program startup, the equivalent of
    setlocale(LC_ALL, "C");
is executed.
The implementation shall behave as if no library function calls setlocale.
Returns
If a pointer to a string is given for locale and the selection can be honored, the setlocale function returns a pointer to the string associated with the specified category for the new locale. If the selection cannot be honored, setlocale returns a null pointer and the program's locale is not changed.
A null pointer for locale causes setlocale to return a pointer to the string associated with the category for the program's current locale; the program's locale is not changed.[102]
The pointer to string returned by setlocale is such that a subsequent call with that string value and its associated category will restore that part of the program's locale. The string pointed to shall not be modified by the program, but it may be overwritten by a subsequent call to setlocale.
Forward references: formatted input/output functions (4.9.6), the multibyte character functions (4.10.7), the multibyte string functions (4.10.8), string conversion functions (4.10.1), the strcoll function (4.11.4.3), the strftime function (4.12.3.5), the strxfrm function (4.11.4.5).
Footnotes:
101. The only functions in 4.3 whose behavior is not affected by the current locale are isdigit and isxdigit.
102. The implementation must arrange to encode in a string the various categories due to a heterogeneous locale when category has the value LC_ALL. [end of excerpt]
Implementing setlocale
setlocale clearly has a number of tasks to perform. It must determine what locales to switch to, based on the category and name you specify when you call the function. It must find locales already in memory, or read in newly specified locales from a file. (I describe the general case, of course. A minimal implementation can recognize only the "C" and "" locales, which can be the same.) And it must return a name that it can later use to restore the current locale.
The last task is one of the hardest because you can construct a mixed locale, containing categories from various locales. For example, you can write:
    #include <locale.h>
    #include <stdlib.h>
    #include <string.h>
    .....
    char *s1, *s2;
    setlocale(LC_ALL, "");          /* switch to the native locale */
    s1 = setlocale(LC_CTYPE, "C");  /* revert one category */
    if ((s2 = malloc(strlen(s1) + 1)) != NULL)
        strcpy(s2, s1);             /* keep a private copy of the name */
The first call switches to the native locale, which is some locale preferred by the local operating environment. The second call reverts one category to the "C" locale. You must make a copy of the string pointed to by s1 because intervening calls to setlocale might alter it. If you later make the call
    setlocale(LC_ALL, s2);
the locale reverts to its earlier mixed state.
setlocale must contrive a name that it can later use to reconstruct an arbitrary mix of categories. The C Standard doesn't say how to do this, or what the name looks like. It only says that an implementation must do it.
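From the program's point of view, the save-and-restore idiom can be wrapped in a pair of helpers. This is only a sketch: the names save_locale and restore_locale are hypothetical and appear nowhere in the listings. It uses the query form of setlocale (a null pointer for locale) described in the excerpt above.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd copy of the current name for category, or NULL.
   save_locale is a hypothetical helper. */
char *save_locale(int category)
{
    const char *cur = setlocale(category, NULL);   /* query only */
    char *copy;

    if (cur == NULL)
        return NULL;
    copy = malloc(strlen(cur) + 1);
    if (copy != NULL)
        strcpy(copy, cur);      /* later calls may overwrite cur */
    return copy;
}

/* Restore a name obtained from save_locale and free it.
   Returns nonzero if setlocale honored the request. */
int restore_locale(int category, char *saved)
{
    int ok;

    assert(saved != NULL);
    ok = setlocale(category, saved) != NULL;
    free(saved);
    return ok;
}
```

The copy matters for the reason given earlier: the string setlocale returns may be overwritten by the next call.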
The scheme I settled on was to paste qualifiers on a locale name if it contains mixed categories. Say, for example, that the base locale is "USA". That gives you American date formats, the English alphabet, and so on. But suppose an application adapts the monetary category to the special conventions of accounting, for which a locale called "acct" exists. The name that characterizes this mixed locale is "USA;monetary:acct".
I use semicolons to separate components of the mixed locale name. Within a component, a colon separates a category name from its locale name. The base locale has no category name qualifier. When setlocale constructs a name, it adds components only for categories that differ from the base locale.
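The naming rule can be sketched in a few lines. This is an illustration, not the code from setlocal.c. The function build_locale_name, the helper append, and the array layout are assumptions, and only the label "monetary" comes from the example above; the other category labels are guesses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NCATS 5

/* Category labels. Only "monetary" is attested; the rest are guesses. */
static const char *const cat_label[NCATS] = {
    "collate", "ctype", "monetary", "numeric", "time"
};

/* Append src to the string in buf; return 0 if it does not fit. */
static int append(char *buf, size_t size, const char *src)
{
    size_t used = strlen(buf), n = strlen(src);

    if (n >= size - used)
        return 0;
    memcpy(buf + used, src, n + 1);
    return 1;
}

/* Build "base;label:locale;..." in buf, adding a component only for
   each category whose locale differs from base.
   Returns buf, or NULL if buf is too small. */
char *build_locale_name(char *buf, size_t size, const char *base,
                        const char *const locale_of[NCATS])
{
    int i;

    assert(buf != NULL && size > 0 && base != NULL);
    buf[0] = '\0';
    if (!append(buf, size, base))
        return NULL;
    for (i = 0; i < NCATS; ++i) {
        if (strcmp(locale_of[i], base) == 0)
            continue;           /* same as base: no qualifier */
        if (!append(buf, size, ";") || !append(buf, size, cat_label[i])
            || !append(buf, size, ":") || !append(buf, size, locale_of[i]))
            return NULL;
    }
    return buf;
}
```

With base "USA" and every category set to "USA" except monetary, which is set to "acct", the sketch produces "USA;monetary:acct".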
Perhaps now you can understand some of the complexity of setlocale. Listing 3 shows the source file setlocal.c. Much of its logic is concerned with parsing a name to determine which locale to use for each category. Another big chunk of logic builds a name that setlocale can later digest. Everything else is small potatoes by comparison.
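To give a feel for the parsing half, here is a minimal sketch that answers one question: which locale does a mixed name assign to a given category? It is not the logic of Listing 3. The names locale_for_category and copy_span are hypothetical, and the sketch assumes the semicolon and colon syntax described above with no error checking of malformed names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the n characters at s into out as a string; 0 if they do not fit. */
static int copy_span(char *out, size_t size, const char *s, size_t n)
{
    if (n >= size)
        return 0;
    memcpy(out, s, n);
    out[n] = '\0';
    return 1;
}

/* Find which locale a mixed name such as "USA;monetary:acct" assigns
   to the category called label. Returns 1 on success, 0 if out is
   too small. */
int locale_for_category(const char *name, const char *label,
                        char *out, size_t size)
{
    size_t len = strcspn(name, ";");        /* base locale comes first */
    size_t lablen = strlen(label);
    const char *s = name + len;

    assert(out != NULL && size > 0);
    if (!copy_span(out, size, name, len))
        return 0;
    while (*s == ';') {                     /* walk the qualifiers */
        ++s;
        len = strcspn(s, ";");
        if (len > lablen && s[lablen] == ':'
            && strncmp(s, label, lablen) == 0)
            return copy_span(out, size, s + lablen + 1, len - lablen - 1);
        s += len;
    }
    return 1;                               /* no qualifier: base applies */
}
```

Given "USA;monetary:acct", the sketch yields "acct" for the label "monetary" and "USA" for any other label.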
To determine the native locale, I inspect the environment variable LOCALE. That strikes me as a reasonable channel for determining what locale to favor. It's akin to using the environment variable TZ to determine what time zone you're in. The environment variable is inspected at most once during program execution.
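A sketch of the inspect-once idea follows. The function name native_locale_name, the 32-character limit, and the fallback to "C" when LOCALE is unset or too long are all assumptions made for illustration; the listing may handle these cases differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Name of the native locale, read from the environment variable LOCALE
   on the first call only. Falling back to "C" is an assumption. */
const char *native_locale_name(void)
{
    static char name[32];           /* hypothetical size limit */
    static int inspected = 0;

    if (!inspected) {
        const char *s = getenv("LOCALE");

        if (s == NULL || s[0] == '\0' || strlen(s) >= sizeof name)
            s = "C";
        strcpy(name, s);            /* copy: getenv's string may not persist */
        inspected = 1;
    }
    assert(name[0] != '\0');
    return name;
}
```

The copy into static storage is deliberate. The C Standard lets a later call to getenv overwrite the string an earlier call returned, so caching the bare pointer would not be safe.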
You will also see code that copies information into the "C" locale on the first call to the function. I adopted that ruse to avoid a nasty snowball effect. It's easy enough to pile all the various locale-dependent tables into one structure. Do so, however, and you get the whole snowball regardless of how little of it you use. I felt it was better to have setlocale do a bit more work to avoid this problem. You don't want to drag in 10Kb of code when you use only isspace from the library.
I offloaded some of the work to an internal function called _Getloc. Listing 4 shows the file xgetloc.c, or at least most of it. This function determines whether a locale exists in memory. If a locale does not exist in memory, _Getloc should go looking for it. I have stubbed that code out for this presentation, because it takes a whole column just to describe how you can make your own locale files and read them in at runtime.
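The in-memory half of that search amounts to walking a linked list. The sketch below shows the idea with a stripped-down stand-in for the real structure: struct loc_entry, its member names, find_locale, and the two extra built-in entries are all hypothetical. The actual code chains struct lconv objects together.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the extended struct lconv: just a name and a link. */
struct loc_entry {
    const char *name;
    struct loc_entry *next;
};

/* Locales built into the library, chained from the "C" locale. */
static struct loc_entry acct_loc = { "acct", NULL };
static struct loc_entry usa_loc = { "USA", &acct_loc };
static struct loc_entry c_loc = { "C", &usa_loc };

/* Return the in-memory locale called name, or NULL if none exists. */
struct loc_entry *find_locale(const char *name)
{
    struct loc_entry *p;

    assert(name != NULL);
    for (p = &c_loc; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;    /* a full version would now try to read a locale file */
}
```

Adding another built-in locale in this sketch means declaring one more static entry and linking it into the chain.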
Listing 5 shows the file xsetloc.c. It contains the function _Setloc, which actually copies new information into the current locale. It also copies information out to the various bits of static data affected by changes in the locale. A call to setlocale drags in all this stuff. I don't know how to avoid this particular snowball. At least you can avoid it if you leave locales alone.
I have tested this code at least superficially. It should not contain major errors. Be wary of small gaffes, however.
Conclusion
What I have presented here is just the basic machinery you need to support locales. It is enough to let you build additional locales directly into the library. Just add static declarations of type struct lconv and initialize them as you see fit. Be sure to change _Clocale._Next to point at the list you add.
The real fun of locales is defining an open-ended set. To do that, you must be able to specify a locale without altering C code. I have developed considerable additional machinery that lets you do so. Next month, I will show you the code that reads locale files.