March 1999/Portable Run-Time Multilanguage Support with Lingua

Portable Run-Time Multilanguage Support with Lingua

Anneke Sicherer-Roetman

Message text is the bane of any multilanguage program, but you can get a lot of help with it from this remarkably simple package.

Introduction

Lingua is a small package that I wrote a few years ago and still use on a regular basis. I designed the Lingua package to help C programmers develop applications that must be released in multiple languages. With Lingua, the end user gets an executable and one data file for every supported language. The data file contains all user interface text in the application.

Since a language can be chosen at run time, you needn't use different executables for different languages. You need only compile one executable and make a text data file for every supported language. It's easy to add support for yet another language later and then send only the new data file to interested customers. Lingua was written in pure ANSI C and is thus completely portable.

Lingua Package Overview

The package consists of a utility, Lingua, and a module, ui_text.c. You must compile ui_text.c and link it with your application program. This module is the same for any language and is very small (about 4 K when compiled with Borland C++).

You need to prepare a text file for every supported language, according to a specified format. Lingua then encrypts the text file so users cannot alter or corrupt the file. The encrypted text files are shipped with the application. One of them is loaded by the module ui_text.c at run time so that the application's user interface texts are all in the desired language. It is also possible to switch languages at run time.

How to Use Lingua

Do not use any literal strings in your source code; instead, use symbolic identifiers, or mnemonics. For example, instead of using "Press any key", use ANYKEY. You can write your application however you like, so long as you observe this main rule and put an #include "ui_text.h" statement at the start of every source file in which these mnemonics appear.

Create and maintain a text source file with all literal text for the program written in your first language of choice. This file can have any name, but Lingua can use only files with the .txt extension. The lines in this text file should all begin (at the first position!) with the mnemonic that will be used in the application. Following the mnemonic is the literal string that should be substituted at run time. Listing 1 shows an example of a Lingua text source file.

Comment lines must begin with #. In literal strings, the underscore character (_) signifies a leading or trailing space, and a single dash character (-) denotes an empty string. If you want a character other than _ to signify a leading or trailing space, you can change it using the #SPACE directive.

String arrays, in which not every item has its own mnemonic, are represented by a [ directly after the mnemonic, followed by one or more lines starting with only a [. Multi-line strings are represented by a / directly after the mnemonic, followed by one or more lines starting with only a /.
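The conventions for _ and - can be illustrated with a small decoder. This is only a sketch, not Lingua's own code: the function name decode_literal is hypothetical, and it assumes that only leading and trailing runs of the marker character become spaces while markers inside the string are left alone. The marker parameter stands in for whatever character #SPACE selects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Lingua: decode one literal string
   from a text source line into out (capacity out_size).
   - A lone "-" becomes the empty string.
   - Leading and trailing runs of the marker character become spaces;
     marker characters inside the string are kept (an assumption).
   Returns 0 on success, -1 if out is too small. */
int decode_literal(const char *src, char marker, char *out, size_t out_size)
{
    size_t len, i, lead = 0, trail = 0;

    assert(src != NULL && out != NULL);
    len = strlen(src);
    if (len + 1 > out_size)
        return -1;

    if (strcmp(src, "-") == 0) {          /* single dash: empty string */
        out[0] = '\0';
        return 0;
    }

    while (lead < len && src[lead] == marker)
        lead++;
    while (trail < len - lead && src[len - 1 - trail] == marker)
        trail++;

    for (i = 0; i < len; i++)
        out[i] = (i < lead || i >= len - trail) ? ' ' : src[i];
    out[len] = '\0';
    return 0;
}
```

With '_' as the marker, the literal "Press any key_" decodes to "Press any key " (note the trailing space), and "-" decodes to an empty string.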

By default, Lingua loads all the text at once into memory when you call the function ui_loadtext. This is typically done at the start of an application. You release the text from memory at the end of the application by calling ui_unloadtext. If desired, you can specify that part or all of the text be read from the file only when necessary. To specify this behavior, use the directive #FILE in the text file (see Listing 1). All text after this directive will be loaded only when necessary. All text before #FILE will be loaded as soon as ui_loadtext is called and will remain resident in memory. You could place all frequently used strings before #FILE and seldom-used or very long strings after #FILE. Note that if #FILE is used, the text data file will remain open between ui_loadtext and ui_unloadtext, which is otherwise not the case. Also note that #FILE may occur only once in the text file.

Beware of one snag when using the #FILE directive: every mnemonic defined after #FILE results in a pointer to the same file buffer. If you use two such strings in one statement, the latter overwrites the former. This leads to problems with statements like:
printf("%s%s\n",TENTH,ELEVEN);
The above will result in printing "eleven" twice. Instead, use:
printf("%s",TENTH);
printf("%s\n",ELEVEN);
Apart from this snag, the use of Lingua is totally transparent so long as, at run time, you do not reference any mnemonic before ui_loadtext is called.

Updating the Data File

Whenever any mnemonic in the text file has been added, edited, or deleted, run the utility Lingua by issuing a command such as:
lingua francais
where francais is the name of the text file. Then recompile your program's source code (just as you would had you edited literal strings in the source code). If you change only the literal text in the text file, you need only run Lingua; you needn't recompile your program's source code in this case.

You can give an optional version number or other string as a second argument for the Lingua utility. Use this same version number as the second parameter in the call to ui_loadtext. This can be a useful feature because it enables ui_loadtext to reject a file that was built for a previous version of your application program.

Lingua creates two files: ui_text.h and an .etf (Encrypted Text File). The .etf file must be shipped with the application. The header file ui_text.h must be included in all source files that contain calls to ui_loadtext and ui_unloadtext, or that contain mnemonics. Since this header is not language dependent, no recompilation for another language is necessary. It contains only macros for the mnemonics, which are language independent. Listing 2 shows a sample header.

When your application is finished (when is an application ever finished?), copy the original .txt file and change the literal strings in the copy to another language. Warning: do not change the mnemonics that begin each line! When you run Lingua on this text file, a new .etf file is created (and also ui_text.h, but this one is identical to the one from the original language).

Inside the Application Program

In the application program, start with a call to ui_loadtext (in the module ui_text.c). End the application program with a call to ui_unloadtext. It is also possible to load another language into the program after completing a call to ui_unloadtext. Simply call ui_loadtext again with a new language file argument.

Lingua also writes a checksum at the start of the .etf file. ui_loadtext returns 1 if the file was read and decoded successfully and if the computed checksum is equal to the one in the file; if not, it returns zero. You can then handle an error condition in your application (but in what language?).
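One possible answer to the "in what language?" question is to try a fallback language file before giving up, and to keep one hard-coded message for the case where nothing loads. The sketch below is an illustration only. The function-pointer type reflects an assumed shape for ui_loadtext (file name, version string, 1 on success, 0 on failure) inferred from the description above, so check the generated ui_text.h for the real prototype. The names load_with_fallback and fake_load, and the file names used, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed shape of ui_loadtext: returns 1 on success, 0 on failure. */
typedef int (*load_fn)(const char *file, const char *version);

/* Stand-in loader for demonstration only: pretends that just one
   language file, built for version "1.0", is installed. */
int fake_load(const char *file, const char *version)
{
    return strcmp(file, "english") == 0 && strcmp(version, "1.0") == 0;
}

/* Hypothetical helper: try the preferred language file, then a fallback.
   Returns the name of the file that loaded, or NULL if both failed. */
const char *load_with_fallback(load_fn load, const char *preferred,
                               const char *fallback, const char *version)
{
    assert(load != NULL && preferred != NULL && fallback != NULL);
    if (load(preferred, version) == 1)
        return preferred;
    if (load(fallback, version) == 1)
        return fallback;
    return NULL;   /* caller reports the error with a hard-coded message */
}
```

In a real program, the first argument would be ui_loadtext itself, provided its prototype matches the assumed one. Only the final "could not load any language file" message then needs to live in the executable as a literal string.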

Listing 3 shows an example application that uses the text from Listing 1.

The module ui_text.c uses the fopen family of file functions. For programmers who do not like to use these, all file handling in ui_text.c is done via macros defined in an include file, files.inc, that can be changed to contain other file functions.

The Lingua package source code is available in electronic form and also includes makefiles for Unix and DOS/Windows. This source archive also contains a shareware Borland C++ Builder component that can be dropped on the main form of an application to make a BCB application multilingual.

How Lingua Works

Listing 4 shows the source code of the module ui_text.c. The include files lingua.h and files.inc can be seen in Listing 5 and Listing 6 respectively. (The source code for the Lingua utility is not shown here, but is available on the CUJ ftp site. See p. 3 for downloading instructions.)

For simplicity, this discussion begins with only the handling of the strings that are loaded permanently into memory.

The Lingua utility starts by writing some declarations in the header file ui_text.h. It then reads the input file twice. During the first pass, Lingua writes the mnemonics and their sequential numbers in the text source file to ui_text.h in the form of #define statements that connect each mnemonic with an element of a global array ui_text (defined in ui_text.c):
#define SECOND  ui_text[2]
#define THIRD   (ui_text+3)
The first line denotes a single string, the second line a string array.
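To see why the two forms differ, it helps to look at the types they produce. In the sketch below, the table contents and the accessor third_item are made up for illustration, and the assumption is that the array behind THIRD has two items. Only the two #define lines follow the generated header quoted above.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the table that ui_loadtext fills at run time; the
   strings are invented for this example. */
static char *demo_table[] = { "zero", "one", "two", "three-a", "three-b" };
char **ui_text = demo_table;

/* Same shape as the generated defines. */
#define SECOND  ui_text[2]      /* single string: a char *              */
#define THIRD   (ui_text+3)     /* string array: a char ** into table   */

/* Hypothetical accessor: THIRD[i] indexes from table slot 3 onward. */
const char *third_item(int i)
{
    assert(i >= 0 && i < 2);
    return THIRD[i];
}
```

Because THIRD expands to a parenthesized pointer expression, it can be subscripted like any array of strings, while SECOND can be passed directly wherever a string is expected.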

Lingua also determines the corresponding strings' lengths and computes an offset for each string. These offsets are written to the .etf file. During the second pass, the strings themselves are encrypted and written to the .etf file. Finally, Lingua computes a checksum and writes it to the .etf file together with a total count of strings and characters.
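The offset computation can be sketched in a few lines. This is not the utility's actual code: the function name compute_offsets is hypothetical, and it assumes that each string is stored with its terminating NUL and that offsets are counted from the start of the string area.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: offsets[i] receives the position of string i
   within the string area, assuming strings are packed back to back,
   each followed by a NUL. Returns the total size of the area. */
unsigned long compute_offsets(const char *const *strings, size_t count,
                              unsigned long *offsets)
{
    unsigned long pos = 0;
    size_t i;

    assert(strings != NULL && offsets != NULL);
    for (i = 0; i < count; i++) {
        offsets[i] = pos;
        pos += (unsigned long)strlen(strings[i]) + 1;  /* +1 for NUL */
    }
    return pos;
}
```

For the strings "yes", "no", and "maybe", this yields offsets 0, 4, and 7 and a total of 13 characters.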

Lingua thus generates an encrypted text file and a header file ui_text.h. When you include ui_text.h in your source files, the preprocessor replaces the mnemonics in the source code with the corresponding ui_text[n] expressions. ui_text[n] is the n-th element of a pointer table; that element points into a character buffer, which I call the string area.

The ui_loadtext function in the ui_text.c module reads the counts and allocates memory for a string area (char *ui_textbuffer) and a table of pointers into this area (char **ui_text, declared globally). It then reads the offsets, calculates the pointers, puts these in the table, and then reads the strings and puts these in the string area. Now all ui_text pointers point to the corresponding literal strings. A checksum is also computed and compared with the one stored in the .etf file. The ui_unloadtext function releases all used memory.
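The step that turns offsets into pointers is simple enough to show in isolation. The sketch below leaves out file reading, decryption, and the checksum, and assumes the string area already holds NUL-terminated strings. The function name build_pointer_table is hypothetical and does not appear in ui_text.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: given a string area holding count
   NUL-terminated strings and the offset of each one, allocate and
   fill a table of pointers into the area. Returns NULL if the
   allocation fails. The caller releases the table with free(). */
char **build_pointer_table(char *area, const unsigned long *offsets,
                           size_t count)
{
    char **table;
    size_t i;

    assert(area != NULL && offsets != NULL);
    table = malloc(count * sizeof *table);
    if (table == NULL)
        return NULL;
    for (i = 0; i < count; i++)
        table[i] = area + offsets[i];   /* pointer = base + offset */
    return table;
}
```

Only the table and the string area need to be allocated, whatever the number of strings, which keeps loading and unloading cheap.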

For the strings that appear after the #FILE directive, it's a bit more complicated. Single strings and string arrays are handled differently here.

The generated #define statements in ui_text.h look like this:
#define TWELVE   ui_filetext(3)
#define FOURTEEN (ui_filearray(6))
The first line denotes a single string, the second line a string array.

Here too, the offsets are computed from the string lengths and written to the .etf file. Also during the second pass, the strings are encrypted and written to the .etf file. For arrays, the size of the array is also written to the .etf file prior to writing the strings of the array.

ui_loadtext reads the counts and allocates memory for a table of offsets into the .etf file (unsigned long *ui_file). It then reads the offsets and puts these in the table. It leaves the .etf file open.

The parameter n of the function ui_filetext(n) gives the index into the offset array ui_file. The string at that offset is then read from the file into a dynamically allocated character buffer char *ui_filebuffer. If the previously requested string was at the same file offset as the newly requested one, no re-reading is done. The address of ui_filebuffer is passed to the program as the return value of ui_filetext.
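The read-on-demand idea, including the check that avoids re-reading the same string, might look roughly like this. It is a simplified sketch under stated assumptions: strings in the file are unencrypted and NUL-terminated (the real .etf format is encrypted), and the struct lazy_text, its fields, and the function lazy_get are hypothetical names, loosely modeled on ui_file and ui_filebuffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical state for reading strings on demand. */
typedef struct {
    FILE *fp;                      /* open data file                     */
    const unsigned long *offsets;  /* file offset of each string         */
    char *buffer;                  /* most recently read string, or NULL */
    unsigned long cached_off;      /* offset of the string in buffer     */
    int valid;                     /* nonzero once buffer holds a string */
    unsigned reads;                /* count of real file reads (demo)    */
} lazy_text;

/* Return string n, reading it from the file unless the previous request
   was for the same file offset. Returns NULL on error. */
char *lazy_get(lazy_text *lt, size_t n)
{
    unsigned long off;
    size_t len = 0, cap = 16;
    char *buf;
    int c;

    assert(lt != NULL && lt->fp != NULL && lt->offsets != NULL);
    off = lt->offsets[n];
    if (lt->valid && lt->cached_off == off)
        return lt->buffer;                 /* same offset: no re-read */

    if (fseek(lt->fp, (long)off, SEEK_SET) != 0)
        return NULL;
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    while ((c = fgetc(lt->fp)) != EOF && c != '\0') {
        if (len + 1 >= cap) {              /* keep room for the NUL */
            char *tmp;
            cap *= 2;
            tmp = realloc(buf, cap);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
        }
        buf[len++] = (char)c;
    }
    buf[len] = '\0';

    free(lt->buffer);                      /* previous string is discarded */
    lt->buffer = buf;
    lt->cached_off = off;
    lt->valid = 1;
    lt->reads++;
    return lt->buffer;
}
```

Note that the previous buffer is freed on every new read, so any pointer returned by an earlier call becomes invalid as soon as a different string is requested.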

For arrays in the .etf file, the function ui_filearray(n) is used instead. First, the size of the array is read from the file at the offset found at ui_file[n]. Then, the strings at this offset are read from the file into the dynamically allocated character buffer ui_filebuffer. At the same time, a pointer array char **ui_fileptr is allocated and filled with pointers into ui_filebuffer. The return value of ui_filearray is the address of ui_fileptr. The ui_unloadtext function releases all used memory and closes the .etf file.

From this discussion, the reason for the above-mentioned snag in using the #FILE directive should be clear. Only one string or string array from the file is present in memory at any one time. So, never use two different strings from the file in the same expression; copy the first into a temporary buffer before requesting the second. Apart from this, the usage of Lingua is completely transparent.
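The single-buffer behavior, and one way around it, can be reproduced without any file at all. In this sketch, demo_filetext is a made-up stand-in for ui_filetext that overwrites one shared buffer on every call, and join_two is a hypothetical helper that makes a private copy of the first string before asking for the second.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for ui_filetext: every call overwrites one shared buffer,
   as described for strings placed after #FILE. */
static char shared[32];

const char *demo_filetext(int n)
{
    static const char *const words[] = { "tenth", "eleven" };
    assert(n >= 0 && n < 2);
    strcpy(shared, words[n]);
    return shared;
}

/* Hypothetical workaround: copy the first string to a private buffer
   before requesting the second. Returns 0 on success, -1 if out is
   too small for the joined text. */
int join_two(int a, int b, char *out, size_t out_size)
{
    char first[sizeof shared];
    const char *second;

    assert(out != NULL);
    strcpy(first, demo_filetext(a));   /* private copy survives next call */
    second = demo_filetext(b);         /* overwrites the shared buffer    */
    if (strlen(first) + strlen(second) + 1 > out_size)
        return -1;
    strcpy(out, first);
    strcat(out, second);
    return 0;
}
```

Holding on to the two returned pointers is not enough, since both point at the same buffer; the contents must be copied.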

Conclusion

I have written many applications with Lingua. I prefer its simplicity, efficiency, and complete portability over other more elaborate schemes, and certainly over methods that force you to distribute different executables for different languages. Maintenance is easy and lends itself to code management systems and project makefiles because the source file is a normal text file and Lingua is a command-line utility.

I'm interested in hearing about how you've used Lingua in your applications, and welcome any improvements or suggestions you may have. The next obvious step is to rewrite Lingua to use Unicode instead of single-byte characters, which should not be too difficult. If you want to do it, please let me know about that as well.

Anneke Sicherer-Roetman started her career as a research chemist and holds a doctorate in chemistry. She then became a Unix systems manager and a programmer, and practiced both occupations for more than ten years. Recently she became a software engineer in a company that produces software and instrumentation for behavioral research (Windows 95/NT platform). She has also run a small educational software company together with her husband for 12 years. She has been coding in C for 12 years and in C++ for 6 years. She can be reached at sicherer@sichemsoft.nl or a.sicherer@noldus.nl.