May 1991/Standard C

Standard C

Build Your Own Locales

P.J. Plauger

P.J. Plauger is senior editor of The C Users Journal. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee, WG14. His latest book is Standard C, which he co-authored with Jim Brodie. You can reach him at pjp@plauger.uunet.
For the past two months, I have been discussing the header <locale.h>. Last month, I discussed many of the details of implementing the two functions declared in this header — localeconv and setlocale. I showed the code needed to set up the "C" locale and save it when the locale changes. I also showed how setlocale picks and chooses categories from among the locales in memory.
None of that code lets you switch to new locales not already in memory. That takes considerably more effort. Few implementations of Standard C have so far gone to that effort. Most implementations simply let you switch between the "C" locale and the default locale "". (The default locale is usually the same as the "C" locale.)
I have written the code needed to support an open-ended set of locales. I simply excised any hooks into that code last month to keep the column to a manageable length. My goal this month is to describe how this additional code lets you define your own locales. I won't go into details about the code here — there's too much to present in this column. Instead, I focus on its external appearance and the services it provides.

The Approach
A locale should be easy to define. All sorts of people might have occasion to define part or all of a locale. Different groups may want to:

print dates and times in the local language, using the local conventions

change the decimal point character used for reading, converting, and writing floating-point values

specify the local currency format and symbols

specify peculiar collating sequences

add letters, punctuation, or control characters to the character classes defined by the functions in <ctype.h>

alter the encodings of multibyte characters and wide characters
I list these changes roughly in order of increasing sophistication. Almost anybody might want to change month and weekday names to a different language. A few might undertake to define a special collating sequence. Only the bravest would consider changing to a new multibyte-character encoding. (It might not agree with the string and character literals produced by the translator, for one thing.) Nevertheless, none of these operations should require a change in the Standard C library to pull off.
My goal, therefore, was to contrive a way that ordinary citizens can define a new locale and introduce it to a C program at runtime. The program must, of course, be one that calls setlocale under some circumstances. And the program must make use of the information altered by such a call. Given those obvious prerequisites, the Standard C library should assist program and user in agreeing on locale specifications.
My approach was to introduce two environment variables and a file format. The environment variables are:

LOCALE, which specifies the name of the default locale that is selected on a call such as setlocale(LC_ALL, "")

LOCFILE, which specifies the name of the text file to read if setlocale encounters a locale name not already represented in memory
The file format specifies how you prepare the text file so that it defines all of the additional locales you want to add.
A program called xxx might, for example, begin by executing the call setlocale(LC_ALL, "") as just shown. Under MS-DOS, you can invoke it from a batch file that looks like:

    set LOCFILE=c:\locales\mylocs.loc
    set LOCALE=USA
    xxx
That causes xxx to read the file c:\locales\mylocs.loc in search of a locale named USA. Assuming the program can find that locale and successfully read in its specification, the program xxx then executes with its behavior adapted to the USA locale.
Change USA to France in the batch script and the program searches out a different locale in the same file. Or you can change the file name specified by LOCFILE and always ask for the generic NATIVE locale. Both are sensible ways to tailor the default locale.
A more sophisticated program might use more than just the default locale. It could determine categories and the names of locales in various ways, then oblige setlocale to chase them down in the locale file. Conceivably, it could even rewrite the contents of the locale file while it is running, to build new locales on the fly. In any of these cases, you certainly want to defer binding locales to programs as late as possible.
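The program side of this arrangement needs nothing beyond the Standard C library. Here is a minimal sketch; the helper name decimal_point_for is hypothetical, and only a library like the one described here consults LOCALE and LOCFILE when it resolves the name "". Other implementations resolve that name by their own rules.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: switch every category to the named locale and
 * report the decimal point now in effect. Passing "" asks for the
 * default locale, which the library in this column takes from LOCALE. */
const char *decimal_point_for(const char *name)
{
    if (setlocale(LC_ALL, name) == NULL) {
        /* Unknown locale: setlocale changes nothing on failure, so
         * select the "C" locale explicitly as a known fallback. */
        setlocale(LC_ALL, "C");
    }
    return localeconv()->decimal_point;
}
```

A program such as xxx might call decimal_point_for("") once at startup. Falling back to "C" is a design choice of this sketch, not a requirement.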

Specifying A Locale
So what can you specify as part of a locale? The C Standard spells out the contents of the monetary and numeric categories in considerable detail. It suggests the information required to describe several other categories. And it permits an open-ended set of additional categories that do not affect the behavior of the Standard C library. I ignore the last category for the moment. Let's look at what is required to satisfy the C Standard.
A locale consists of an assortment of data types. Some are numeric values, some are strings, and some are tables of varying formats. We need to give each entity in a locale a distinct name. You use these names when you write the locale file to specify which entities you wish to redefine. For the members of struct lconv, I use the member name as the entity name within the locale file. In other cases, I had to invent entity names.
A locale file is organized into a sequence of text lines. You begin the definition of the USA locale, for example, with the line:

LOCALE USA
Each line that follows begins with a keyword from a predefined list. Use NOTE to begin a comment and SET to assign a value to an uppercase letter, as in:

    NOTE The following sets D(elta) to 'a'-'A'
    SET D 'a' - 'A'
You can then use D as a term in an expression.
If the keyword is an entity name, you specify its value on the remainder of the line. Some examples are:

    currency_symbol $
    int_curr_symbol "USD "
    frac_digits 2
The quotes around a string value are optional. You need them only if you want to include a space as part of the string. You can write a fairly ornate expression wherever a numeric value is required. I will describe expressions in detail later.
The initial values in each new locale match those in the "C" locale. That typically saves a lot of typing. All you really have to specify is what you want changed from the "C" locale. Write more only if you want more thorough documentation of a locale.
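To make the line format concrete, here is a rough sketch of how a reader might split one line into its keyword and value. It is not the code from my library. The function name split_line and its buffer conventions are assumptions made for illustration, and the sketch handles only the quoting rule described above (no escapes inside quotes, no line continuation).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split one locale-file line into its keyword and
 * the text of its value. Returns 0 on success, -1 on a malformed line
 * or a buffer that is too small. */
int split_line(const char *line, char *key, size_t keysz,
               char *val, size_t valsz)
{
    const char *p = line;
    const char *start;
    size_t len;

    while (isspace((unsigned char)*p)) ++p;
    start = p;
    while (*p != '\0' && !isspace((unsigned char)*p)) ++p;
    len = (size_t)(p - start);
    if (len == 0 || len >= keysz) return -1;
    memcpy(key, start, len);
    key[len] = '\0';

    while (isspace((unsigned char)*p)) ++p;
    if (*p == '"') {
        /* Quoted value: keep everything up to the closing quote,
         * including any spaces. */
        const char *close = strchr(++p, '"');
        if (close == NULL) return -1;
        len = (size_t)(close - p);
    } else {
        /* Unquoted value: take the rest of the line, minus trailing
         * white space. */
        len = strlen(p);
        while (len > 0 && isspace((unsigned char)p[len - 1])) --len;
    }
    if (len >= valsz) return -1;
    memcpy(val, p, len);
    val[len] = '\0';
    return 0;
}
```

Given the line int_curr_symbol "USD ", the sketch yields the keyword int_curr_symbol and the four-character value USD followed by a space.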

Numeric Values And Strings
You need to specify numeric values for some members of struct lconv. These include the LC_MONETARY information:

    frac_digits
    int_frac_digits
    n_cs_precedes
    n_sep_by_spaces
    n_sign_posn
    p_cs_precedes
    p_sep_by_spaces
    p_sign_posn
Each of these occupies a char field. A value of CHAR_MAX (defined in <limits.h>) indicates that no meaningful value is provided.
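A program that reads these members has to test for CHAR_MAX before using them. The following sketch shows one way to do so; the helper names are hypothetical. In the "C" locale, the C Standard requires every one of these char members to be CHAR_MAX.

```c
#include <limits.h>
#include <locale.h>

/* Hypothetical helper: nonzero if a char-sized lconv member carries a
 * meaningful value, zero if it holds the CHAR_MAX "not available" mark. */
int lconv_field_is_set(char field)
{
    return field != CHAR_MAX;
}

/* Report frac_digits for the current locale, or -1 if unspecified. */
int current_frac_digits(void)
{
    const struct lconv *lc = localeconv();

    return lconv_field_is_set(lc->frac_digits) ? lc->frac_digits : -1;
}
```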
The value of the macro MB_CUR_MAX can change with the LC_CTYPE category. I adopted the entity name:

mb_cur_max
for the char data object that holds the value of this macro.
You need to specify strings for some members of struct lconv. These include the LC_MONETARY information:

    currency_symbol
    int_curr_symbol
    mon_decimal_point
    mon_thousands_sep
    negative_sign
    positive_sign
and the LC_NUMERIC information:

    decimal_point
    thousands_sep
You need to specify numeric strings for some members of struct lconv. These include:

    grouping (LC_NUMERIC)
    mon_grouping (LC_MONETARY)
The value of each character specifies how many characters to group as you move to the left away from the decimal point. A value of zero terminates the string and causes the last grouping value to be repeated indefinitely. A value of CHAR_MAX terminates the string and specifies no additional grouping. To group digits by twos, by fives, and then by threes, for example, you want to create the string "\2\5\3". In the locale text file, however, you write:

mon_grouping 253
Each digit is replaced by its numeric value.
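The grouping rules are easier to follow in code. The sketch below, which is an illustration and not the library's own code, converts the locale-file form to the stored form and then applies a grouping string to a run of digits. Both function names are hypothetical, and the 128-byte working buffer is an arbitrary limit.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Convert the locale-file form ("253") into the form stored in struct
 * lconv ("\2\5\3"): each digit character becomes its numeric value. */
void digits_to_grouping(const char *text, char *out, size_t outsz)
{
    size_t i = 0;

    if (outsz == 0) return;
    while (i + 1 < outsz && text[i] >= '0' && text[i] <= '9') {
        out[i] = (char)(text[i] - '0');
        ++i;
    }
    out[i] = '\0';
}

/* Hypothetical helper: copy digits to out, inserting sep according to
 * grouping. Returns 0 on success, -1 if a buffer is too small. */
int group_digits(const char *digits, const char *grouping, char sep,
                 char *out, size_t outsz)
{
    char tmp[128];                 /* filled from right to left */
    size_t t = sizeof tmp;         /* index of the leftmost used slot */
    size_t n = strlen(digits);
    const char *g = grouping;
    int size = (*g != '\0' && *g != CHAR_MAX) ? *g : 0;  /* 0: no grouping */
    int count = 0;

    tmp[--t] = '\0';
    while (n > 0) {
        if (size > 0 && count == size) {
            if (t == 0) return -1;
            tmp[--t] = sep;
            count = 0;
            if (g[1] == CHAR_MAX)
                size = 0;          /* no further grouping */
            else if (g[1] != '\0')
                size = *++g;       /* move on to the next group size */
            /* otherwise the string ended: repeat the last size */
        }
        if (t == 0) return -1;
        tmp[--t] = digits[--n];
        ++count;
    }
    if (sizeof tmp - t > outsz) return -1;
    memcpy(out, tmp + t, sizeof tmp - t);
    return 0;
}
```

With the grouping "\2\5\3" and a comma as separator, the digits 1234567890123 come out as 123,456,78901,23. The rightmost group has two digits, the next has five, and the final size of three repeats.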

Time Information
I introduced a handful of additional strings to specify information for the LC_TIME category. Each of these is divided into fields. I couldn't imagine any character that would serve universally as a field delimiter. So I adopted the convention that the first character of the string delimits the start of the first field. The start of each subsequent field is delimited with that character. That way, you can choose a character that doesn't collide with any characters in the fields.
As an example, the am_pm entity specifies what the function strftime in <time.h> prints for the AM/PM indicator. A common definition for this string is :AM:PM. A colon delimits the start of each field.
Here are the LC_TIME entity names with some possible string values:

    am_pm :AM:PM
    days :Sunday:Monday:Tuesday\
    :Wednesday:Thursday:Friday\
    :Saturday
    dst_rules :032402:102702
    months :Jan:January:Feb:February\
    :Mar:March:Apr:April:May:May\
    :Jun:June:Jul:July:Aug:August\
    :Sep:September:Oct:October\
    :Nov:November:Dec:December
    time_zone :EST:EDT:+0300
Note the use of the backslash to continue lines, just as in C source code.
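The leading-delimiter convention makes field extraction simple. Here is a minimal sketch of a lookup routine, assuming the string has already had its continuation lines joined. The name time_field is hypothetical, and fields are numbered from zero.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy field n (counting from zero) of a string
 * such as ":AM:PM" into out. The first character of s is the field
 * delimiter. Returns 0 on success, -1 if the field does not exist or
 * does not fit. */
int time_field(const char *s, int n, char *out, size_t outsz)
{
    char delim;
    const char *p;
    const char *end;
    size_t len;

    if (s == NULL || s[0] == '\0' || n < 0 || outsz == 0) return -1;
    delim = s[0];
    p = s + 1;                     /* start of field 0 */
    while (n-- > 0) {
        p = strchr(p, delim);      /* find the start of the next field */
        if (p == NULL) return -1;
        ++p;
    }
    end = strchr(p, delim);
    len = (end != NULL) ? (size_t)(end - p) : strlen(p);
    if (len >= outsz) return -1;
    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}
```

For the am_pm string :AM:PM, field 0 is AM and field 1 is PM. Asking for field 2 fails.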
The third field of time_zone counts minutes from UTC (Greenwich Mean Time), not hours. That allows for the various time zones around the world that are not an integral number of hours away from UTC. If this string is empty, the time functions look for its contents in the environment variable TIMEZONE. If that variable is also absent, the functions then look for the widely-used environment variable TZ. That string takes the form EST05EDT, where the number in the middle counts hours west of UTC.
The string dst_rules is even more ornate. It takes one of two general forms:

(YY)MMDD+WHH (YY)MMDD-WHH
Here, YY in parentheses is the number of years since 1900, MM is the month number, DD is the day of the month, W is the number of weekdays past Sunday, and HH is the hour number in a 24-hour day. +W advances to the next such day of the week on or after the date MMDD in the year in question. -W backs up to the next previous such day of the week before the specified date. You can omit the fields that specify year, day of the week, and hour.
The fairly simple example above calls for Daylight Savings Time to begin on 24 March (MMDD = 0324) at 02:00 (HH = 02) and to end on 27 October at the same time. To switch on the last Sundays in March and October each year since 1990, write :(90)0401-002:1101-002. (Years before 1990 don't correct for Daylight Savings Time, by this set of rules.)
If you live below the Equator, the year begins in Daylight Savings Time. You can capture that nicety by adding a third reversal field, as in :0101:030202:100202. You can, in fact, write an arbitrary number of reversal dates throughout the year, each qualified by a starting year (YY) for the rule to take effect. You could thus capture the entire history of law governing Daylight Savings Time in a given state or country, if you choose.
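A sketch of a parser for a single reversal field may help pin down the notation. It is an illustration only, not the library's code. The structure dst_rule and both function names are hypothetical, the field is assumed to have been split off from its delimiter already, and an omitted hour is assumed to mean 00.

```c
#include <ctype.h>
#include <stddef.h>

struct dst_rule {
    int year;    /* years since 1900, or -1 if omitted */
    int month;   /* month number, from MM */
    int day;     /* day of the month, from DD */
    int dir;     /* +1 for +W, -1 for -W, 0 if omitted */
    int wday;    /* weekdays past Sunday */
    int hour;    /* hour of the day; 0 if omitted (an assumption) */
};

/* Read exactly two decimal digits at *pp and advance past them. */
static int two_digits(const char **pp, int *out)
{
    const char *p = *pp;

    if (!isdigit((unsigned char)p[0]) || !isdigit((unsigned char)p[1]))
        return -1;
    *out = (p[0] - '0') * 10 + (p[1] - '0');
    *pp = p + 2;
    return 0;
}

/* Parse one field of the form (YY)MMDD+WHH or (YY)MMDD-WHH, where the
 * year, weekday, and hour parts may be omitted. Returns 0 or -1. */
int parse_dst_rule(const char *s, struct dst_rule *r)
{
    r->year = -1;
    r->dir = 0;
    r->wday = 0;
    r->hour = 0;
    if (*s == '(') {
        int y = 0;

        ++s;
        if (!isdigit((unsigned char)*s)) return -1;
        while (isdigit((unsigned char)*s)) y = y * 10 + (*s++ - '0');
        if (*s++ != ')') return -1;
        r->year = y;
    }
    if (two_digits(&s, &r->month) != 0 || two_digits(&s, &r->day) != 0)
        return -1;
    if (*s == '+' || *s == '-') {
        r->dir = (*s++ == '+') ? 1 : -1;
        if (!isdigit((unsigned char)*s)) return -1;
        r->wday = *s++ - '0';
    }
    if (*s != '\0' && two_digits(&s, &r->hour) != 0) return -1;
    return (*s == '\0') ? 0 : -1;
}
```

The sketch checks only the shape of a field. It does not validate ranges such as month 13, and it does not compute the calendar date that a +W or -W adjustment selects.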

Tables
The functions declared in <ctype.h> are all organized around translation tables. (See Standard C, CUJ, October and November 1990.) Each is an array of 257 shorts that accepts subscripts in the interval [-1, 255]. In the locale file, you cannot alter the contents of element -1, which translates the value EOF (defined in <stdio.h>).
The entity names for these tables are:

    ctype
    tolower
    toupper
You initialize these tables an element at a time or a subrange at a time. Here, for example, is a complete specification for the tolower table, using ASCII characters plus the Swedish 'A':

    tolower[0 : 255] $@
    tolower['A' : 'Z'] $$ + 'a' - 'A'
    tolower['' ] ''
The special term $@ is the value of the index for each element in the subrange. (Read the term as "where it's at.") The special term $$ is the value of the previous contents of the table element. (Read the term as "what its value is.") Note that you can write a simple (single-character) character literal to specify its code value, and that you can add and subtract a sequence of terms. The first two lines are, of course, optional. You inherit them from the "C" locale.
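Here is a rough sketch of what the two subrange lines for tolower amount to in C. The function names and the flag parameters standing in for $$ and $@ are hypothetical, and the sketch assumes a character set such as ASCII in which 'A' through 'Z' are contiguous.

```c
#include <stdio.h>

#define TABLE_SIZE 257   /* subscripts -1 through 255 */

/* Hypothetical helper: apply one table line of the form
 *     name[lo : hi] [$$] [$@] constant
 * where use_prev stands for $$ and use_index stands for $@. tab points
 * at element 0, so tab[-1] is the EOF slot, which is never touched. */
void set_table_range(short *tab, int lo, int hi,
                     int use_prev, int use_index, int constant)
{
    int i;

    for (i = lo; i <= hi; ++i) {
        int value = constant;

        if (use_prev)  value += tab[i];   /* $$: previous contents */
        if (use_index) value += i;        /* $@: index of the element */
        tab[i] = (short)value;
    }
}

/* Build a tolower table from the two subrange lines of the example. */
void build_tolower(short storage[TABLE_SIZE])
{
    short *tab = storage + 1;             /* make subscript -1 valid */

    tab[-1] = EOF;                        /* EOF translates to itself */
    set_table_range(tab, 0, 255, 0, 1, 0);            /* [0 : 255] $@ */
    set_table_range(tab, 'A', 'Z', 1, 0, 'a' - 'A');  /* $$ + 'a' - 'A' */
}
```

The order of the two calls matters. The second line uses $$, so it depends on the identity mapping that the first line stores.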

State Machines
Several functions in this implementation of the Standard C library use state tables to define their behavior. That provides the maximum in flexibility with moderate performance. It also lets you specify the behavior of these functions in a locale using notation very similar to that for the <ctype.h> tables above. Here are the affected functions:

strcoll and strxfrm map a character string to another character string, to define a collating sequence.

mbtowc and mbstowcs map a multibyte string to a wide-character string.

wctomb and wcstombs map a wide-character string to a multibyte string.
You can specify up to 16 state tables for each of three entity names:

    collate
    mbtowc
    wctomb
I describe these tables in greater detail in conjunction with the functions that use them (in future columns). For now, I show only a simple example. Here is how you can write the specification for one of the simple state tables in the "C" locale. It makes any of the above functions perform a one-to-one mapping:

    mbtowc[0, 0] $0
    mbtowc[0, 1:255] $@ $F $I $O $0
The first line defines element zero of state table zero for mbtowc. It tells the function to consume a null element when it sees one, ending the translation. That automatically causes a null output element to be generated. The second line defines the remaining elements of state table zero. It tells the function that each of the character codes 1 to 255 maps to itself ($@). The function should Fold this mapped value into the accumulated value ($F), consume the Input ($I), and write the accumulated value as the Output ($O). The successor state is state zero ($0).
You can, of course, perform much more ambitious translations than this one.

Expressions
That's the list of entities you can specify in a locale. Now you can understand why certain funny terms can appear in expressions. An expression itself is simply a sequence of terms that get added together. The last example above shows that you can add terms simply by writing them one after the other. The plus signs are accepted in front of terms purely as a courtesy so that expressions read better.
You can write lots of different terms:

Decimal, octal, and hexadecimal numbers follow the usual rules of C literals. The sequences 10, 012, and 0xA all represent the decimal value ten.

A plus sign before a term is ignored. A minus sign negates the term that immediately follows.

Single quotes around a character yield the value of the character, just as for a character literal in C source code.

An uppercase letter has the value last set by a SET command. All such variables are set to zero at program startup.
In addition to these terms, a dollar sign is the first character of a two-character name that has a special meaning, as outlined below. Here are the special terms signaled by a leading dollar sign:

$$ — the current value stored in a table element.

$@ — the index of a table element. $$ and $@, if present, must precede any other terms in an expression.

$^ — the value of the macro CHAR_MAX.

[$a $b $f $n $r $t $v] — the values of the character escape sequences, in order, ['\a' '\b' '\f' '\n' '\r' '\t' '\v'].

[$A $C $D $H $L $M $P $S $U $W] — the character-classification bits used in the table ctype. These specify, in order: extra alphabetics, extra control characters, digits, hexadecimal digits, lowercase letters, motion-control characters, punctuation, space characters, uppercase letters, and extra white-space characters. (See the description of <ctype.h> in Standard C, CUJ, July 1990.)

[$0 $1 $2 $3 $4 $5 $6 $7] — the successor states 0 through 7 in a state-table element. (No symbols are provided for successor states 8 through 15.)

[$F $I $O $R] — the command bits used in a state-table element. These specify, in order: Fold translated value into the accumulated value, consume Input, produce Output, and Reverse bytes in the accumulated value.
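The core of these rules fits in a short routine. The sketch below is an illustration, not the evaluator from my library. The name eval_expr and its parameters are hypothetical. It handles numbers, signs, character literals, SET variables, and the terms $$, $@, and $^. It omits the remaining dollar terms and does not enforce the rule that $$ and $@ come first.

```c
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical evaluator: add up the terms in s. vars holds the values
 * of the SET variables A through Z, cur is the value of $$, and idx is
 * the value of $@. Returns 0 and stores the sum, or -1 on a bad term. */
int eval_expr(const char *s, const long vars[26], long cur, long idx,
              long *result)
{
    static const char letters[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    long sum = 0;

    for (;;) {
        int negate = 0;
        long term;
        const char *v;

        while (isspace((unsigned char)*s)) ++s;
        if (*s == '\0') break;
        while (*s == '+' || *s == '-') {   /* signs in front of a term */
            if (*s == '-') negate = !negate;
            ++s;
            while (isspace((unsigned char)*s)) ++s;
        }
        if (isdigit((unsigned char)*s)) {
            char *end;

            term = strtol(s, &end, 0);     /* base 0: 10, 012, and 0xA */
            s = end;
        } else if (s[0] == '\'' && s[1] != '\0' && s[2] == '\'') {
            term = (unsigned char)s[1];    /* simple character literal */
            s += 3;
        } else if (*s != '\0' && (v = strchr(letters, *s)) != NULL) {
            term = vars[v - letters];      /* variable set by SET */
            ++s;
        } else if (s[0] == '$' && s[1] == '$') {
            term = cur;
            s += 2;
        } else if (s[0] == '$' && s[1] == '@') {
            term = idx;
            s += 2;
        } else if (s[0] == '$' && s[1] == '^') {
            term = CHAR_MAX;
            s += 2;
        } else {
            return -1;
        }
        sum += negate ? -term : term;
    }
    *result = sum;
    return 0;
}
```

Evaluating $$ + 'a' - 'A' with $$ equal to 'A' yields 'a', and the three terms 10 012 0xA written one after the other sum to thirty.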

Conclusion
I conclude with an example of a complete locale. Figure 1 shows the USA locale with sensible values for all the fields in struct lconv. It makes no changes to the collating sequence or multibyte encoding specified in the "C" locale.
I can't say for certain that this scheme for specifying locale files is adequate. It seems to work well for a number of examples that I have contrived, but I have so far received little feedback from others. All I can say for now is that it's a start.