The Perl Journal May 2003
Designing software for a worldwide audience involves two processes: internationalization (often abbreviated "i18n," because of the 18 letters between "i" and "n") and localization (abbreviated "l10n" for the same reason). Internationalization is an engineering process: It means building your application so that it can support multiple languages, date/currency formats, and local customs without deep structural changes in the code. Localization is the process of implementing your internationalized application for various locales. It is during the localization process, for example, that text translation takes place. Today, web-application developers have perhaps the greatest need to localize their software. Often, web-application user interfaces are text-based, increasing the need for translation and other localization efforts.
For proprietary applications, localization typically has been done as a prerequisite for competing in a foreign market. That implies that if the localization cost exceeded estimated profit in a given locale, the company would not localize its application at all; and it would be difficult (and maybe illegal) for users to do it themselves without the source code. If a vendor did not design its software with a good i18n framework in mind, well, international users were just out of luck.
Fortunately, the case is much simpler and more rewarding with open-source applications. As with proprietary applications, the first few versions are often designed with only one locale in mind, but open-source apps can be internationalized at any time by anyone.
In this article, I'll describe techniques to make l10n straightforward. While I focus on web-based applications written in Perl, the principles should also apply to other languages and application types.
Web pages come in two different flavors: static pages that don't change until they are manually updated, and dynamic pages that can change for each viewer based on various factors. These two types of pages are often referred to as "web documents" and "web applications," respectively.
However, even static pages may have multiple representations: different people may prefer different languages, styles, or media (for example, an auditory representation instead of a visual one). Part of the Web's strength is its ability to let clients negotiate with the server and determine the most preferred representation.
For example, consider my hypothetical homepage http://www.autrijus.org/index.html, written in Chinese (Figure 1). Assume that one day, I decide to translate it for my English-speaking friends (Figure 2).
At this point, many web sites would decide to offer a language-selection page to let visitors pick their favorite language, as in Figure 3. For both nontechnical users and automated programs, this page is confusing, redundant, and irritating. Besides demanding an extra search-and-click for each visit, it creates a considerable amount of difficulty for web-agent programmers, as they now have to parse the page and follow the correct link, which is a highly error-prone thing to do.
Of course, it is better if everybody can see their preferred language automatically. Thankfully, the content negotiation feature in HTTP/1.1 addresses this problem quite neatly. Content negotiation is defined as "the process of selecting the best representation for a given response when there are multiple representations available" (http://www.w3.org/Protocols/rfc2616/rfc2616-sec12.html).
Under this scheme, browsers always send an Accept-Language request-header field, which specifies one or more preferred language codes. For example, "zh-tw, en-us, en" would mean "Traditional Chinese, American English, or English, in this order."
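To illustrate what a server-side program must do with this header, here is a hand-rolled sketch of my own (not part of any module): it splits the language tags and orders them by their optional quality values. Real negotiation also handles wildcards ("*") and more whitespace variants.

```perl
use strict;
use warnings;

# Parse an Accept-Language header into language tags, most
# preferred first.  Tags without an explicit q-value default to 1.
sub preferred_languages {
    my $header = shift;
    my @langs;
    for my $part (split /\s*,\s*/, $header) {
        my ($tag, $q) =
            $part =~ /^([A-Za-z]+(?:-[A-Za-z]+)*)(?:\s*;\s*q=([\d.]+))?/;
        next unless defined $tag;
        push @langs, [ lc $tag, defined $q ? $q : 1 ];
    }
    # Highest quality value first; ties keep the header's order
    return map { $_->[0] } sort { $b->[1] <=> $a->[1] } @langs;
}

my @prefs = preferred_languages('zh-tw, en-us;q=0.8, en;q=0.5');
print "@prefs\n";    # zh-tw en-us en
```

A CGI script would feed this function $ENV{HTTP_ACCEPT_LANGUAGE}.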
Upon receiving this information, the web server is responsible for presenting the request content in the most preferred language. Different web servers may implement this process differently; under Apache (the most popular web server), a technique called "MultiViews" is widely used.
Using MultiViews, I save the English version as index.html.en (note the extra file extension), then put the following line into Apache's configuration file (httpd.conf or .htaccess):
Options +MultiViews
After that, Apache examines all requests to http://www.autrijus.org/index.html to see if the client prefers "en" in its Accept-Language request-header field. People who prefer English see the English page; others see the original index.html page.
This technique allows gradual introduction of new localized versions of the same documents, so my international friends can contribute more languages over time: index.html.fr for French, index.html.he for Hebrew, and so on.
Since much of the international online population speaks only English and one other (native) language, most of the contributed versions would be translated from English, not Chinese. But because both versions represent the same content, that is not a problem.
Or is it? What if I go back to update the original, Chinese page?
It's impossible to get my French and Hebrew friends to translate from Chinese. Clearly, I must use English as the base version. The same reasoning also applies to most free software projects, even if the principal developers do not speak English natively.
Moreover, even if it is merely a change to the background color (for example, <body bgcolor=gold>), I still need to modify all translated pages, to keep the layout consistent.
Now, if both the layout and contents are changed, things quickly become very complicated. Since the old HTML tags are gone, my translator friends must work from scratch every time. Unless all of them are HTML wizards, errors and conflicts will surely arise. If there are 20 regularly updated pages in my personal site, then pretty soon, I will run out of translators, or even out of friends. Clearly, what's needed is a way to automate the process of generating localized pages.
To prepare web applications for localization, you must find a way to separate data from code as much as possible.
As the long-established web-development language of choice, Perl offers a wide variety of modules and toolkits for web site construction. The most popular one is probably CGI.pm, which has been merged into the core Perl release since 1997. Example 1 is a code snippet that uses it to automatically generate translated pages.
Unlike the HTML pages, this program enforces data/code separation via CGI.pm's HTML-related routines. Tags (<html>, for instance) now become function calls (start_html()), and text is turned into Perl strings. Therefore, when the localized version is written out to the corresponding static page (index.html.zh_tw, index.html.en, and so on), the HTML layout is always identical for each of the four languages listed.
The sub _ function is responsible for localizing any text into the current $language, by passing the language and text strings to a hypothetical some_function(). some_function() is known as the "localization framework."
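A minimal sketch of such a wrapper might look like this, with a plain nested hash (and hypothetical sample strings) standing in for some_function() and its lexicon; a real framework would load the data from Msgcat, Gettext, or Maketext files.

```perl
use strict;
use warnings;

# Hypothetical lexicon data standing in for some_function()'s
# backing store.
my %lexicon = (
    fr => { 'Hello!' => 'Bonjour !' },
    de => { 'Hello!' => 'Hallo!'    },
);
my $language = 'de';

sub _ {
    my $text = shift;
    # Fall back to the original string when no translation exists
    return $lexicon{$language}{$text} || $text;
}

my $greeting = _('Hello!');
print "$greeting\n";    # Hallo!
```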
After writing the code in Example 1, it is a simple matter to grep for all strings inside _(...) within the code, extract them into a lexicon, and ask translators to complete this lexicon in other languages. Here, lexicon means a set of things that you know how to say in another language: sometimes single words like "Cancel," but usually whole phrases such as "Do you want to overwrite?" or "5 files found." Strings in a lexicon are like entries in a traveler's pocket phrasebook, sometimes with blanks to fill in, as in Figure 4.
Ideally, the translator should focus solely on this lexicon, instead of peeking at HTML files or the source code. But here's the rub: Different localization frameworks use different lexicon formats, so you have to choose the framework that best suits the project.
To implement the some_function() in Example 1, you need a library to manipulate lexicon files, look up the corresponding strings in it, and maybe incrementally extract new strings for insertion into the lexicon. These abilities are collectively provided by a localization framework.
From my observation, frameworks mostly differ in their idea about how lexicons should be structured. Here, I discuss the Perl interfaces for three such frameworks, starting with Msgcat.
As one of the earliest l10n frameworks and as part of XPG3/XPG4 standards, Msgcat enjoys ubiquity on all UNIX platforms. It represents the first-generation paradigm of lexicons: Treat entries as numbered strings in an array (a.k.a. a message catalog). This approach is straightforward to implement, needs little memory, and is fast to look up. The resource files used in Windows programming and other platforms use basically the same idea.
For each page or source file, Msgcat requires us to make a lexicon file for each language, as in Example 2.
Example 2 contains the German translation for the text strings within index.html, which is represented by a unique "set number" of 7. Once you finish building the lexicons for all pages, the gencat utility is used to generate the binary lexicon:
% gencat nls/de.cat nls/de/*.m
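The .m source files compiled above follow this general shape (the strings here are hypothetical illustrations, not Example 2's actual entries); each message is keyed only by its number:

```
$quote "
$set 7 Strings for index.html
1 "Autrijus' Haus"
2 "Willkommen!"
```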
It is best to imagine the internals of the binary lexicon as a two-dimensional array, as in Figure 5.
To read from the lexicon file, you use the Perl module Locale::Msgcat (available from CPAN) and implement the sub _() function (Example 3). Only the msg_id matters here; the string "Autrijus.House" is only used as an optional fallback when the lookup fails, as well as to improve the program's readability.
Because set_id and msg_id must both be unique and immutable, future revisions may only delete entries, and never reassign the number to represent other strings. This characteristic makes revisions very costly. For this reason, one should consider using Msgcat only if the lexicon is very stable.
Another shortcoming of Msgcat is the plurality problem. Consider the code snippet printf(_(8, "%d files were deleted."), $files);. This is obviously incorrect when $files == 1, and "%d file(s) were deleted" is grammatically invalid as well. Hence, you are often forced to use two entries:
printf(($files == 1) ? _(8, "%d file was deleted.") : _(9, "%d files were deleted."), $files);
This is still not satisfactory, however, because it is English-specific. French, for example, uses the singular with $files == 0, and Slavic languages have three or four plural forms. Trying to retrofit those languages onto the Msgcat infrastructure is often a futile exercise.
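To see why, consider Russian, where the plural form depends on the number's last digits. A hand-rolled selector of my own along these lines is exactly what a flat, numbered catalog has no place to express (Gettext later encodes the same rule in a PO file's Plural-Forms header):

```perl
use strict;
use warnings;

# Return 0, 1, or 2: which of Russian's three plural forms to use.
sub ru_plural_form {
    my $n = shift;
    my ($mod10, $mod100) = ($n % 10, $n % 100);
    return 0 if $mod10 == 1 && $mod100 != 11;       # 1, 21, 31...
    return 1 if $mod10 >= 2 && $mod10 <= 4
             && ($mod100 < 12 || $mod100 > 14);     # 2-4, 22-24...
    return 2;                                       # 0, 5-20, 25-30...
}

print join(' ', map { ru_plural_form($_) } 1, 2, 5, 11, 21), "\n";
# 0 1 2 2 0
```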
Due to the various problems of Msgcat, the GNU Project developed its own implementation of the UniForum Gettext interface in 1995. This implementation, written by Ulrich Drepper, has since become the de facto l10n framework for C-based free software projects, and has been widely adopted by C++, Tcl, and Python programmers.
Instead of requiring one lexicon for each source file, Gettext maintains a single lexicon (called a PO file) for each language of the entire project. For example, the German lexicon de.po for the homepage would look like Example 4.
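For illustration, a PO entry pairs each original string (msgid) with its translation (msgstr); the strings below are hypothetical stand-ins, not Example 4's actual contents:

```
#: index.html:3
msgid "Welcome!"
msgstr "Willkommen!"

#: index.html:7
msgid "My hobbies"
msgstr "Meine Hobbys"
```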
The #: lines are automatically generated from the source file by the program xgettext, which can extract strings inside invocations of gettext(), and sort them out into a lexicon. Now, we may run msgfmt to compile the binary lexicon locale/de/LC_MESSAGES/web.mo from po/de.po:
% msgfmt -o locale/de/LC_MESSAGES/web.mo po/de.po
You can then access the binary lexicon using Locale::gettext from CPAN, as in Example 5. Recent versions (glibc 2.2+) of Gettext also introduced the ngettext("%d file", "%d files", $files) syntax. Unfortunately, Locale::gettext does not support that interface yet.
Also, Gettext lexicons support multiline strings, as well as reordering via printf and sprintf (Figure 6). Finally, GNU Gettext comes with a very complete tool chain (msgattrib, msgcmp, msgconv, msgexec, msgfmt, msgcat, msgcomm...), which greatly simplifies the process of merging, updating, and managing lexicon files.
First written in 1998 by Sean Burke, the Locale::Maketext module was revamped in May 2001 and is included in the Perl 5.8 core. Unlike the function-based interface of Msgcat and Gettext, its basic design is object oriented, with Locale::Maketext as an abstract base class from which a project class is derived. The project class (with a name like MyApp::L10N) is, in turn, the base class for all the language classes in the project (which may have names like MyApp::L10N::it, MyApp::L10N::fr, and the like).
A language class is really a Perl module containing a %Lexicon hash as class data, which contains strings in the native language (usually English) as keys, and localized strings as values. The language class may also contain some methods that are useful for interpreting phrases in the lexicon, or otherwise dealing with text in that language. Example 6 illustrates Locale::Maketext's use.
Under its square-bracket notation, translators can make use of various language-specific functions inside their translated strings. Example 6 demonstrates the built-in plural and quantifier support; for languages with other kinds of plural-form characteristics, it is a simple matter of implementing a corresponding quant() function. Ordinals and time formats are easy to add, too.
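As a compact illustration of this class layout (the class and lexicon strings here are hypothetical; Perl 5.8's core Locale::Maketext is assumed), a project class and two language classes can live in a single file:

```perl
use strict;
use warnings;
use Locale::Maketext;

package MyApp::L10N;                  # the project's base class
use base 'Locale::Maketext';

package MyApp::L10N::en;              # English language class
use base 'MyApp::L10N';
our %Lexicon = ( '_AUTO' => 1 );      # pass unknown keys through as-is

package MyApp::L10N::de;              # German language class
use base 'MyApp::L10N';
our %Lexicon = (
    'Found [quant,_1,file].' => '[quant,_1,Datei,Dateien] gefunden.',
);

package main;
my $en = MyApp::L10N->get_handle('en') or die "No English lexicon";
my $de = MyApp::L10N->get_handle('de') or die "No German lexicon";
print $en->maketext('Found [quant,_1,file].', 3), "\n";  # Found 3 files.
print $de->maketext('Found [quant,_1,file].', 1), "\n";  # 1 Datei gefunden.
```

The [quant,_1,Datei,Dateien] call shows a translator supplying explicit singular and plural forms where adding "s" would be wrong.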
Each language class may also implement an ->encoding() method to describe the encoding of its lexicons, which may be linked with Encode for transcoding purposes. Language families are also inheritable and subclassable: missing entries in fr_ca.pm (Canadian French) would fall back to fr.pm (Generic French).
The handy built-in method ->get_handle(), used with no arguments, magically detects HTTP, POSIX, and Win32 locale settings in CGI, mod_perl, or from the command line; it spares you from parsing those settings manually.
However, Locale::Maketext is not without problems. The most serious issue is its lack of a toolchain such as GNU Gettext's. Locale::Maketext classes are full-fledged Perl modules and, as such, can have arbitrarily obscure syntactic structure. This makes writing a toolchain targeting Locale::Maketext classes all but impossible. For the same reason, there are also few text editors that can support it as well as Emacs PO Mode for Gettext.
Finally, since different projects may use different styles to write the language class, the translator must know some basic Perl syntax.
Irritated by the irregularity of Locale::Maketext lexicons, I implemented my own lexicon format for my company's internal use in May 2002, and asked the perl-i18n mailing list for ideas and feedback. Jesse Vincent suggested: "Why not simply standardize on Gettext's PO File format?" So I implemented it to accept lexicons in various formats, handled by different lexicon back-end modules. Thus, Locale::Maketext::Lexicon was born.
The design goal was to combine the flexibility of Locale::Maketext's lexicon expression with standard formats supported by utilities designed for Gettext or Msgcat. It also supports the Tie interface, which comes in handy for accessing lexicons stored in relational databases or DBM files.
Figure 7 demonstrates a typical application using Locale::Maketext::Lexicon and the extended PO File syntax supported by the Gettext back end. Line 2 tells the current package main to inherit from Locale::Maketext, so it can acquire the get_handle method. Lines 5-8 build four language classes using a variety of lexicon formats and sources:
Lines 11-13 implement the ord method for each language subclass of the package main, which converts its argument to ordinal numbers (1st, 2nd, 3rd...) in that language. Two CPAN modules are used to handle English and French, while German and Chinese need only straightforward string interpolation.
Line 15 gets a language handle object for the current package. Because it did not specify the language argument, it automatically guesses the current locale by probing the HTTP_ACCEPT_LANGUAGE environment variable, POSIX setlocale() settings, or Win32::Locale on Windows. Line 16 sets up a simple wrapper function that passes all arguments to the handle's maketext method.
Finally, lines 18-19 print a message containing one string to be localized. The first argument, $hits, is passed to the ord method, and the second argument, $days, calls the built-in quant method; the [*...] notation is shorthand for the previously discussed [quant,...].
Lines 22-24 are a sample lexicon, in extended PO file format. In addition to ordered arguments via %1 and %2, it also supports %function(args...) in entries, which will be transformed to [function,args...]. Any %1, %2... sequences inside the args have their percent signs (%) replaced by underscores (_).
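An entry using this extension might read as follows (hypothetical strings of my own; the Gettext back end rewrites %quant(%1,...) into the bracket notation [quant,_1,...]):

```
msgid "Your search matched %quant(%1,file)."
msgstr "Ihre Suche fand %quant(%1,Datei,Dateien)."
```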
The localization process consists of these steps:
1. Assess the web site's templating system.
2. Choose a localization framework and hook it up.
3. Write a program to locate text strings in templates, and put filters around them.
4. Extract a test lexicon; fix obvious problems manually.
5. Locate text strings in the source code by hand; replace them with _(...) calls.
6. Extract another test lexicon and machine-translate it.
7. Try the localized version out; fix any remaining problems.
8. Extract the beta lexicon; mail it to your translator teams for review.
9. Fix problems reported by translators; extract the official lexicon and mail it out.
10. Periodically notify translators of new lexicon entries before each release.
Following these steps, you can manage an l10n project fairly easily, keep the translations up-to-date, and minimize errors.
Here are some tips for localizing web applications, and other software in general. First, always localize complete phrases rather than assembling sentences from fragments, since word order differs between languages:
_("Found ") . $files . _(" file(s)."); # Fragmented sentence - wrong!
sprintf(_("Found %s file(s)."), $files); # Complete (with sprintf)
_("Found [*,_1,file].", $files); # Complete (Locale::Maketext)
Also, comments in the lexicon can give translators context they would otherwise lack, as in this PO entry:

#: lib/RT/Transaction_Overlay.pm:579
#. ($field, $self->OldValue, $self->NewValue)
# Note that 'changed to' here means 'has been modified to...'.
msgid "%1 %2 changed to %3"
msgstr "%1 %2 cambiado a %3"
In non-English-speaking countries, localization efforts are often a prerequisite for participating in free software projects. These localization projects are principal places for community contributions, but such efforts are also historically time consuming and error prone, partly because of English-specific frameworks and rigid coding practices used by existing applications. The entry barrier for translators has been unnecessarily high.
On the other hand, the increasing internationalization of the Web makes it increasingly likely that the interfaces to web-based dynamic content services will be localized to two or more languages. For example, Sean Burke led enthusiastic users to localize the popular Apache::MP3 module, which powers homegrown Internet jukeboxes everywhere, to dozens of languages in 2002. Lincoln Stein, the module's author, was not involved with the project at all; all he needed to do was integrate the i18n patches and lexicons into the next release.
Free software projects are not abstractions filled with code, but rather depend on people caring enough to share code and give useful feedback to improve each other's code. Hence, it is my hope that techniques presented in this article will encourage programmers and eager users to actively internationalize existing applications, instead of passively translating for the relatively few applications with established i18n frameworks.
TPJ