July 1993/Questions & Answers

Questions & Answers

Internationalization and Localization

Kenneth Pugh

Kenneth Pugh, a principal in Pugh-Killeen Associates, teaches C and C++ language courses for corporations. He is the author of C Language for Programmers and All on C, and was a member of the ANSI C committee. He also does custom C programming for communications, graphics, image databases, and hypertext. His address is 4201 University Dr., Suite 102, Durham, NC 27707. You may fax questions for Ken to (919) 489-5239. Ken is also reachable by email at Kpugh@dukemvs.ac.duke.edu (Internet) and on CompuServe at 70125,1142.
The questions in the past column regarding internationalization of programs have spawned a number of replies. These replies reveal that some software companies receive over half their revenue from overseas. You can serve this important market, even if you are not concerned with internationalization, by structuring your programs to separate strings from the code, thus making the program more maintainable.
I had a phone conversation with Mr. Shae Murphy of Network Dynamics in Williamsburg, VA. They produce an internationalization toolkit for C and C++. He sent me a white paper on internationalization, from which some of these remarks are extracted. I'll briefly describe how their toolkit works, as a basis for comparison with some other approaches.
The toolkit extracts strings from your text files and stores them in an external file. It replaces each string with a macro call that is translated to a pointer to a char. You make a call in your program to a function that initializes the pointers. The function loads the external file into dynamic memory. The external files can be appended to your executable program, so that you don't have lots of files to worry about. The strings can also be processed into a file to be included in a Microsoft Windows program's resource file.
Unlike the OSF internationalization files, the external file does not contain any string identifiers in it. The line number of each string in the file implicitly identifies the string. If you edit the file, you need to be sure to keep the lines in the same sequence.
The toolkit has a few limitations. It cannot handle implicitly concatenated strings, because each part is turned into a separate character pointer. This was a rather large problem for the code I ran it on. Ever since ANSI C introduced the feature, I have used it a fair bit to keep my code indented nicely. The total size of the extracted strings cannot exceed 64KB. However, multiple external files can be used to increase this number.
If you wanted to, you could simply use the string parsing program to replace strings in your source files with a call to your own function. Your function could perform the accessing of the strings in any manner you choose.
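As a minimal sketch of that idea, the strings could live in one table per language behind a single access function. The names message_text, set_language, and the message identifiers are hypothetical; they do not come from the toolkit, and the translations are only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical message identifiers; a real program might generate these
   with the string parsing program. */
enum message_id { MSG_FILE_NOT_FOUND, MSG_DISK_FULL, MSG_COUNT };

enum language { LANG_ENGLISH, LANG_FRENCH, LANG_COUNT };

/* One row per language, one column per message (illustrative text). */
static const char *const message_table[LANG_COUNT][MSG_COUNT] = {
    { "File not found",      "Disk full"    },
    { "Fichier introuvable", "Disque plein" }
};

static enum language current_language = LANG_ENGLISH;

/* Selects the row of the table used by later message_text calls. */
void set_language(enum language new_language)
{
    assert((int)new_language >= 0 && (int)new_language < LANG_COUNT);
    current_language = new_language;
}

/* Returns the text of a message in the currently selected language. */
const char *message_text(enum message_id id)
{
    assert((int)id >= 0 && (int)id < MSG_COUNT);
    return message_table[current_language][id];
}
```

A source line such as printf("Disk full\n") would then become printf("%s\n", message_text(MSG_DISK_FULL)). The table could just as well be loaded from an external file; only message_text would need to change.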
There are a couple of other methods for internationalizing a program. Each has its tradeoffs. Microsoft Windows has a language preprocessor, the FLPP. It scans the source code and places the strings into a database. A human translator changes the strings to the new language. Then the FLPP replaces the strings in the source file. There is no string processing overhead at runtime, but this approach requires a separate copy of the executable for each language. Two factors cut down the translation time: duplicate uses of the same string map to the same entry in the database, and the same string database can be used for multiple applications.
Another approach is that of OSF, which stores the strings in an external file, compiled into a compact form. OSF provides functions to access the strings. You have to write your program specifically to use these functions, although the string parsing program could help in converting a currently written program. There can be a bit of overhead in these function calls.
None of these approaches automatically guarantees that you will have an internationalized program. The length of the translated strings may vary dramatically. The English phrase "not found" translates into French as "fichier non trouvé". A simple "message popup" turns into "Nachrichtenüberlagerungsfenster" in German. In general, short strings (ten characters or fewer) may double in size, and longer strings (over 70 characters) may increase by 30 percent.
You need to plan for string sizes. If you are using a window system, your windows should be self-sizing. If you are simply using printf, then your sizing requirements may be non-existent. Menu and button string labels may need to be shortened.
During the translations, you need to watch for sentence order. Other languages place subjects, predicates, and modifiers in sundry orders. Format statements may need to be altered to place the values in the proper location for a particular sentence. If the sequence of the format specifiers changes, you will need to alter the calling sequence in your source code.
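One way to keep the calling sequence fixed while a translator reorders a sentence is to number the placeholders. The following is a minimal sketch under that assumption. The function format_message and its %1 to %9 placeholder syntax are hypothetical; they are not part of the ANSI library or of any toolkit mentioned here, and the sketch handles string arguments only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expands "%1".."%9" in format with the matching entry of arguments
   (1-based); "%%" produces a single '%'.  Returns 0 on success, or -1 if
   the result would not fit or a placeholder names a missing argument.
   On failure the contents of output are unspecified. */
int format_message(char *output, size_t output_size, const char *format,
                   const char *const arguments[], int argument_count)
{
    size_t used = 0;

    assert(output != NULL && format != NULL && output_size > 0);
    while (*format != '\0') {
        const char *piece = format;   /* default: copy one character */
        size_t piece_length = 1;

        /* the digits '0'..'9' are contiguous in every C character set */
        if (format[0] == '%' && format[1] >= '1' && format[1] <= '9') {
            int index = format[1] - '1';
            if (index >= argument_count)
                return -1;
            piece = arguments[index];
            piece_length = strlen(piece);
            format += 2;
        } else if (format[0] == '%' && format[1] == '%') {
            format += 2;              /* piece still points at one '%' */
        } else {
            format += 1;
        }
        if (used + piece_length >= output_size)
            return -1;
        memcpy(output + used, piece, piece_length);
        used += piece_length;
    }
    output[used] = '\0';
    return 0;
}
```

With the arguments fixed as a file name followed by a drive name, the English format "%1 not found on %2" and a reordered format such as "On %2, %1 is missing" can both be used without touching the call.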
I haven't discussed translations to languages which are not supported by eight-bit character sets. Conversion to these languages implies using a graphical interface to represent the characters. A program which uses printf for output will probably need a good deal of work for it to handle 16-bit wide characters. printf does not have a format specifier for wide characters.
Making your program international involves not only string translation, but also localization. Localization means conforming to the local culture. This includes using date and monetary formats specific to the particular market. The ANSI C library provides some features that aid in writing a localizable program. I'll describe a few of them here.
The setlocale function sets up a program to operate within a specific environment. The prototype is:
 char *setlocale(int category, const char *locale);
The category can be LC_ALL to change all features in the locale. The other options will be discussed shortly. The locale affects the way some other ANSI functions operate and some of the information that they return. The locale parameter is a string which identifies a particular locale. The only two strings that are universally supported are "C" and "", which represent, respectively, the minimal environment for translating C programs and an implementation-defined native locale.
This function provides the framework for altering the locale, without defining the specific values. The locale variable could be something like "English.USA" or "English.England". Compiler manufacturers may or may not provide other locales. If they do not, it is difficult to make use of this function, as different locales alter the operation of other ANSI functions, for which you do not have source.
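Because a given locale name may not be supported, it is worth checking the return value of setlocale. The sketch below assumes a hypothetical helper named select_locale that falls back to the "C" locale when the request fails; only setlocale itself is from the ANSI library.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Tries the preferred locale for every category.  If the implementation
   cannot honor the request, setlocale returns a null pointer and leaves
   the locale unchanged, so fall back to the always-available "C" locale.
   Returns the string setlocale reports for the locale now in effect. */
const char *select_locale(const char *preferred_locale)
{
    const char *locale_in_use;

    assert(preferred_locale != NULL);
    locale_in_use = setlocale(LC_ALL, preferred_locale);
    if (locale_in_use == NULL)
        locale_in_use = setlocale(LC_ALL, "C");
    return locale_in_use;
}
```

A program might call select_locale("") at startup to pick up the native locale, and pass a name such as "English.USA" only on implementations known to define it.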
The localeconv function returns a pointer to a structure of type lconv. This structure contains various values that are specific to the current locale. They include the decimal-point character, currency symbols (local and international), and information on how the currency symbol and positive and negative values are displayed. For example, in Norway, a positive currency amount is shown as "kr1.234,56" and a negative amount as "kr1.234,56-". Italian lira amounts are shown as "L.1.234" and "-L.1.234". The currency amounts are not automatically formatted for you. You will need to write or obtain a function that can use the information returned by localeconv to produce the correct output.
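The digit-grouping part of such a function might be sketched as follows. The name format_amount is hypothetical, and the sketch makes simplifying assumptions: it always groups digits in threes rather than consulting the grouping information in lconv, and it leaves the currency symbol and sign placement to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes whole_units and hundredths into output as grouped digits; for
   example 1234 and 56 with "." and "," give "1.234,56".
   Returns 0 on success, -1 if output is too small. */
int format_amount(char *output, size_t output_size,
                  unsigned long whole_units, unsigned int hundredths,
                  const char *thousands_separator, const char *decimal_point)
{
    char digits[32];
    size_t digit_count, separator_length, i, used = 0;

    assert(output != NULL && hundredths < 100);
    sprintf(digits, "%lu", whole_units);
    digit_count = strlen(digits);
    separator_length = strlen(thousands_separator);

    /* worst case: a separator after every digit, then the decimal point,
       two fraction digits, and the terminating null */
    if (digit_count * (1 + separator_length) + strlen(decimal_point) + 3
            > output_size)
        return -1;

    for (i = 0; i < digit_count; i++) {
        output[used++] = digits[i];
        /* separator when a nonzero multiple of three digits remains */
        if (i + 1 < digit_count && (digit_count - 1 - i) % 3 == 0) {
            memcpy(output + used, thousands_separator, separator_length);
            used += separator_length;
        }
    }
    sprintf(output + used, "%s%02u", decimal_point, hundredths);
    return 0;
}
```

In a real program the two separator strings would come from the mon_thousands_sep and mon_decimal_point members of the structure returned by localeconv, with currency_symbol and the sign-position members deciding what goes around the digits. In the "C" locale those strings are empty, so the defaults need some thought.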
You can modify particular parts of the locale by using different options for the first parameter. The LC_NUMERIC option affects the decimal-point character in other functions, such as printf and scanf, and the information returned by localeconv. LC_MONETARY alters the information returned by localeconv. LC_TIME affects the output of strftime. LC_COLLATE alters the operation of strcoll and strxfrm. LC_CTYPE changes the operation of the character functions.

Character Functions
The strftime function converts the values in a time structure (of type tm) to a string. Its prototype looks like:

size_t strftime ( char *output_string, size_t output_string_size, const char *format, const struct tm *p_time);
The format parameter may contain specifiers that are replaced in output_string with characters corresponding to the time. For example "%a" represents the abbreviated weekday name in the locale. "%X" is the locale's time representation.
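A short sketch of a call follows. The wrapper name format_long_date is hypothetical; strftime and the members of struct tm are from the ANSI library.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <time.h>

/* Formats a calendar date using the locale's full weekday and month
   names.  The caller supplies the weekday because strftime reads tm_wday
   rather than computing it.  Returns the number of characters written,
   or 0 if the result did not fit in output_string. */
size_t format_long_date(char *output_string, size_t output_string_size,
                        int year, int month, int day, int weekday)
{
    struct tm date;

    assert(output_string != NULL);
    memset(&date, 0, sizeof date);
    date.tm_year = year - 1900;   /* years since 1900 */
    date.tm_mon  = month - 1;     /* 0 = January */
    date.tm_mday = day;
    date.tm_wday = weekday;       /* 0 = Sunday */
    return strftime(output_string, output_string_size,
                    "%A, %d %B %Y", &date);
}
```

In the "C" locale, format_long_date(text, sizeof text, 1993, 7, 1, 4) produces "Thursday, 01 July 1993". After a successful setlocale(LC_TIME, ...) call for another locale, the same call would produce that locale's day and month names.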
strcoll and strxfrm are used to compare and translate strings that contain characters other than the standard seven-bit ASCII set. They may also be used to sequence ASCII characters in non-ASCII order. strcoll works like strcmp, but it uses character comparisons that are based on the locale. For example, a character with an accent mark, e.g. é, may be interpreted as greater than the non-accented character e and less than the character f. This comparison involves some sort of table lookup or other algorithmic method, rather than a simple subtraction of two chars. Using this in place of strcmp for qsort may increase the sort time substantially. Its prototype looks exactly like that of strcmp:

int strcoll(const char *string_1, const char *string_2);
To avoid the overhead of doing a translation every time you want to compare two strings, you can use strxfrm. This function transforms a string into a strcmp-comparable string, based on the locale. The prototype, in which the size is that of the output array, is:

size_t strxfrm (char *transformed_string, const char *input_string, size_t transformed_string_size);
This is a one-way process and the transformed_string is good only for comparing with another transformed_string. Since you would normally perform the transformation only once for each string, this function can make a sort faster at the expense of some memory and a little setup time.
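The transform-once approach might be sketched as below. The names keyed_string and sort_by_locale are hypothetical. The sketch relies on the ANSI rule that strxfrm may be given a null pointer when the size is zero, in which case it only reports the length the transformed string needs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Pairs each original string with its transformed sort key. */
struct keyed_string {
    const char *original;
    char *sort_key;               /* allocated by sort_by_locale */
};

static int compare_sort_keys(const void *first, const void *second)
{
    const struct keyed_string *a = first;
    const struct keyed_string *b = second;
    return strcmp(a->sort_key, b->sort_key);
}

/* Sorts strings into the collating order of the current locale.
   Returns 0 on success, -1 if memory runs out (array left unchanged). */
int sort_by_locale(const char *strings[], size_t count)
{
    struct keyed_string *entries;
    size_t i, made = 0;
    int status = -1;

    assert(strings != NULL);
    entries = malloc((count ? count : 1) * sizeof *entries);
    if (entries == NULL)
        return -1;
    for (i = 0; i < count; i++, made++) {
        /* null buffer and size 0: strxfrm just reports the key length */
        size_t key_size = strxfrm(NULL, strings[i], 0) + 1;
        entries[i].original = strings[i];
        entries[i].sort_key = malloc(key_size);
        if (entries[i].sort_key == NULL)
            goto cleanup;
        strxfrm(entries[i].sort_key, strings[i], key_size);
    }
    qsort(entries, count, sizeof *entries, compare_sort_keys);
    for (i = 0; i < count; i++)
        strings[i] = entries[i].original;
    status = 0;
cleanup:
    for (i = 0; i < made; i++)
        free(entries[i].sort_key);
    free(entries);
    return status;
}
```

Each string is transformed exactly once, and the comparison function called by qsort is a plain strcmp. Passing strcoll-based comparisons to qsort directly would be shorter but repeats the locale lookup on every comparison.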
The character functions, such as isalpha and isupper, return true or false values, depending on the value of the character passed to them. Regardless of whether you have ever considered going international, you should be using these functions. You may have coded if (c >= 'A' && c <= 'Z') to test for upper case characters. That is essentially non-portable to non-ASCII computers. Although only an estimated 0.1 percent of computers use a code other than ASCII, for completeness, consider the possibility. Additionally, using isalpha and other functions makes your code more readable and possibly faster. These functions use the locale information to determine the type of characters. So isalpha would return true for an accented letter such as é in a locale that defines it as alphabetic.
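As a small sketch of the portable form, with a hypothetical function name:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Counts the upper case letters in text according to the current locale.
   The cast to unsigned char matters: passing a negative char value
   (other than EOF) to isupper has undefined behavior. */
size_t count_upper_case(const char *text)
{
    size_t count = 0;

    assert(text != NULL);
    for (; *text != '\0'; text++) {
        if (isupper((unsigned char)*text))
            count++;
    }
    return count;
}
```

The cast is the detail most often missed. Accented characters in eight-bit sets have values above 127, which are negative on machines where plain char is signed.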

Function Pointers
I've had a number of questions in my classes regarding the use of function pointers. A function pointer is a variable whose content is the address of a function. The syntax for declaring a function pointer looks a bit odd the first time you see it:

int (*pointer_to_function)();
The parentheses surround the name of the variable, along with the asterisk. This declaration can be read from inside out. This statement declares pointer_to_function to be a variable. This variable is a pointer (because of the *). The pointer variable will be used to point to functions (because of the ()). The function that it points to returns an int. For example, suppose you had functions whose prototypes were:

int a_function(int one_parameter);
double another_function(int first_parameter, int second_parameter);
int still_another_function(int first_parameter, int second_parameter);
Remember that the name of a function all by itself represents the address of the function. You can code:

pointer_to_function = a_function;
or

pointer_to_function = still_another_function;
but you cannot code:

pointer_to_function = another_function;
since another_function returns a double. The ANSI standard suggests (and C++ requires) that when you declare a pointer to a function, you also specify the types of the parameters. For example:

int (*pointer_to_function_with_one_int)(int);
Now you can code:

pointer_to_function_with_one_int = a_function;
but not:

pointer_to_function_with_one_int = still_another_function;
since still_another_function has two int parameters. The types of the parameters of the function being pointed to must match exactly. If you had a function:

int a_fourth_function(double one_parameter);
you could not set:

pointer_to_function_with_one_int = a_fourth_function;
even though an int argument is automatically converted to a double if the function is called directly with its prototype in scope. You would need a variable declared as:

int (*pointer_to_function_with_double)(double);
and set

pointer_to_function_with_double = a_fourth_function;
You could use a cast to permit the previous assignment, as:

pointer_to_function_with_one_int = (int (*)(int)) a_fourth_function;
This cast changes the type of the value of a_fourth_function, which is a "pointer to function which expects a double and returns an int" to a "pointer to a function which expects one int and returns an int". You should not call a function using this pointer as it will be passed an int, but the actual function will expect a double. The cast of the function pointer does not alter the parameter types.
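By contrast, when the types match, no cast is needed and a call through the pointer is fully checked. The sketch below uses hypothetical function names and a typedef to keep the declarations readable.

```c
#include <assert.h>

/* A name for the type "pointer to function taking two ints and
   returning int". */
typedef int (*binary_operation)(int, int);

static int add_numbers(int first, int second)      { return first + second; }
static int multiply_numbers(int first, int second) { return first * second; }

/* The prototype in the pointer's type lets the compiler check both the
   assignment and every call made through the pointer. */
int apply_operation(binary_operation operation, int first, int second)
{
    assert(operation != 0);
    return operation(first, second);  /* same as (*operation)(first, second) */
}

/* Selects a function at run time; returns a null pointer for an unknown
   symbol. */
binary_operation find_operation(char symbol)
{
    switch (symbol) {
    case '+': return add_numbers;
    case '*': return multiply_numbers;
    default:  return 0;
    }
}
```

For example, apply_operation(find_operation('+'), 3, 4) returns 7.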
Given that background in pointers to functions, there are at least three common places where they are used in code. The first is a parameter to qsort. The qsort function sorts an array of values based upon a comparison function. Its prototype is:

void qsort(void *array, size_t number_elements, size_t size_of_element, int (*compare_function)( const void *, const void *));
It assumes that array points to the beginning of an array that is number_elements long, each element of which is size_of_element bytes. It sorts the array according to its internal algorithm, which is documented in a number of data structure books and in Knuth's books. When it needs to compare two elements to determine whether to swap them or not, it calls the function whose address you have passed it. This function is passed two addresses which point to the two elements which it is currently comparing. Your function needs to return a value less than zero, equal to zero, or greater than zero, depending on how the elements compare. For example:

int your_compare_function(int *first, int *second)
/* Quick compare - may not work for all int values */
    {
    return *first - *second;
    }

#define NUMBER_ELEMENTS 10
int int_array[NUMBER_ELEMENTS] = {5,3,4,6,7,1,2,3,4,7};

qsort(int_array, NUMBER_ELEMENTS, sizeof(int),
    (int (*)(const void *, const void *)) your_compare_function);
The cast is necessary to match the parameter types. Note that qsort is a generic sort: it has no way of knowing how to compare the elements of any particular kind of array. You supply the comparison function to make it specific to your array. The same comparison function can be used with the bsearch function to search a sorted array for a particular value of the same type. The cast of the function pointer does not change the parameter types. In this case, the parameters are pointers to data. It is possible that on some implementations the representation of a void * differs from that of an int *. To be strictly ANSI-compliant and absolutely portable, you have to declare your comparison function to take two const void * parameters and then cast them to const int *. The comparison function should look like:

int your_comparison_function(const void *first, const void *second)
    {
    return *(const int *)first - *(const int *)second;
    }
Note that regardless of what you may be comparing, your function header will look exactly the same. So only use this function to pass to qsort and bsearch and not as a general comparison function.
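Putting the pieces together, a sketch of sorting and then searching an int array might look like this. The wrapper names sort_ints and find_int are hypothetical. The comparison here uses relational operators instead of subtraction, which is a substitution on my part to avoid the overflow that the "quick compare" comment warns about.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison function with exactly the parameter types that qsort and
   bsearch expect, so the function pointer needs no cast. */
static int compare_ints(const void *first, const void *second)
{
    int a = *(const int *)first;
    int b = *(const int *)second;
    return (a > b) - (a < b);     /* -1, 0, or 1 without overflow */
}

void sort_ints(int *array, size_t number_elements)
{
    assert(array != NULL);
    qsort(array, number_elements, sizeof array[0], compare_ints);
}

/* Returns a pointer to an element equal to value, or NULL if there is
   none.  The array must already be sorted by compare_ints. */
int *find_int(const int *array, size_t number_elements, int value)
{
    assert(array != NULL);
    return bsearch(&value, array, number_elements, sizeof array[0],
                   compare_ints);
}
```

After sort_ints is applied to the ten-element array shown earlier, find_int locates the value 6 and returns a null pointer for the value 8, which is not in the array.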
The next area in which function pointers have been used extensively is with device drivers. I'll cover this in general, since they vary from operating system to operating system. A device driver responds to requests from the operating system for opens, reads, and writes. To install a device driver, you usually set up a system defined structure or array that looks like:

struct s_device_driver_interface
    {
    int (*open)(char *filename, char *mode);
    int (*read)(char *buffer, int count);
    int (*write)(char *buffer, int count);
    ....
    int (*close)(void);
    };
You would code your functions that performed each of the operations, and then pass those values to the operating system, along with some identification of the device. For example,

int my_device_open(char *filename, char *mode);
int my_device_read(char *buffer, int count);

struct s_device_driver_interface my_device =
    {my_device_open, my_device_read, ...};

install_device_driver("MY_DEVICE", &my_device);
When the user opens a device, the operating system determines which device it is from the name. If it is "MY_DEVICE", then it would call my_device_open.
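The operating system's side of this arrangement can be imagined roughly as follows. Everything here is a hypothetical stand-in written for illustration: the table, the reduced two-member interface structure, system_open, and this version of install_device_driver do not describe any particular operating system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced version of the interface structure: two operations only. */
struct s_device_driver_interface {
    int (*open)(char *filename, char *mode);
    int (*read)(char *buffer, int count);
};

#define MAX_DEVICES 8

static struct {
    const char *name;
    const struct s_device_driver_interface *driver;
} device_table[MAX_DEVICES];
static int device_count = 0;

/* Hypothetical stand-in for the system's installation call. */
int install_device_driver(const char *name,
                          const struct s_device_driver_interface *driver)
{
    assert(name != NULL && driver != NULL);
    if (device_count == MAX_DEVICES)
        return -1;
    device_table[device_count].name = name;
    device_table[device_count].driver = driver;
    device_count++;
    return 0;
}

/* What the system might do on an open request: find the device by name
   and call through the pointer its driver registered. */
int system_open(const char *device_name, char *filename, char *mode)
{
    int i;

    for (i = 0; i < device_count; i++) {
        if (strcmp(device_table[i].name, device_name) == 0)
            return device_table[i].driver->open(filename, mode);
    }
    return -1;                    /* no such device */
}

/* A trivial driver used to exercise the table. */
static int my_device_open(char *filename, char *mode)
{
    (void)filename; (void)mode;
    return 42;                    /* arbitrary handle for the illustration */
}
static int my_device_read(char *buffer, int count)
{
    (void)buffer;
    return count;
}
const struct s_device_driver_interface my_device =
    { my_device_open, my_device_read };
```

The system never needs to know the names of the driver's functions. It only stores and follows the addresses it was given.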
The third major place function pointers are used is in window programming. Both Microsoft Windows and X-Window have callback functions, though their granularity differs. When an event, such as a mouse click, occurs while the mouse cursor is in your window, the window system calls the function whose address you have passed. This is termed a callback function since the system will literally call back to your function. The system passes along some information as to what event occurred and additional details on some of the events. In a sense, your program is made of functions which respond to the user input, rather than following the more traditional procedural approach.
One major difference between the systems is how many callback functions are used. Microsoft Windows has a single callback per window (parent, child, or dialog box). Under X-Window, you can specify a callback for every object on the screen (pushbutton, radio button, etc.). The latter may be easier to comprehend, as it eliminates a large number of events being passed to a single callback. For those interested in the full treatment, you can read the thousands of manual pages on both Microsoft Windows and X-Window.
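The per-object style can be sketched in a few lines. None of the names below belong to Microsoft Windows or X-Window; the structures and functions are hypothetical and show only the mechanism of registering a function pointer and having it called back.

```c
#include <assert.h>
#include <stddef.h>

enum event_type { EVENT_MOUSE_CLICK, EVENT_KEY_PRESS };

struct event {
    enum event_type type;
    int x, y;                     /* cursor position for mouse events */
};

typedef void (*event_callback)(const struct event *event, void *user_data);

/* A screen object that remembers which function to call back. */
struct button {
    event_callback callback;
    void *user_data;              /* handed back unchanged to the callback */
};

void set_button_callback(struct button *button, event_callback callback,
                         void *user_data)
{
    assert(button != NULL);
    button->callback = callback;
    button->user_data = user_data;
}

/* The "window system" side: deliver an event to the object's callback. */
void dispatch_event(const struct button *button, const struct event *event)
{
    assert(button != NULL && event != NULL);
    if (button->callback != NULL)
        button->callback(event, button->user_data);
}

/* The application side: count the clicks on one button. */
void count_clicks(const struct event *event, void *user_data)
{
    if (event->type == EVENT_MOUSE_CLICK)
        ++*(int *)user_data;
}
```

With a single callback per window, count_clicks would instead sit inside one large function that switches on the event type and on the object involved.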

Naming Conventions
In previous columns, I have discussed using actual words for names of variables and functions. It appears that widespread use of abbreviations such as hwnd may affect the spellchecking ability for normal text. For example, a major manufacturer of a window software system has a CD-ROM describing many of its features. One of the articles contains the sentence, "It also means that using the same command that invoked it again, you mght [sic] become reentrant and can really mes [sic] up your application."