C/C++ Contributing Editors


Standard C/C++: The Facet collate

P. J. Plauger

Comparing two character sequences is relatively easy, once you know which one should come first.


Introduction

What does it mean to compare two strings? To a C programmer, the answer is usually simple and obvious. A "string" is a null-terminated sequence of elements stored in an array of some character type. You call strcmp(s1, s2) (declared in string.h) to compare the string beginning at s1 to the string beginning at s2. The function compares successive elements until it finds a mismatch or until the corresponding elements are both null. The result of the comparison is just the signed difference between these elements, taken as unsigned values, as in (unsigned char)*s1 - (unsigned char)*s2.

The effect of this definition generally satisfies our intuitive notion of string ordering. Strings are equal only if they are of the same length and match element by element. If one string is a prefix of the other, it compares less. The principal creativity comes in determining which of two unequal strings comes earlier in an ordered sequence. The answer obviously depends on the choice of codes stored in the individual elements. It is well known that Standard C says nothing about the encoding of its source or execution character set. A change of the latter can lead to different results, for the same program, on different machines. Nevertheless, many of us are happy much of the time with this simple, and fairly fast, method of comparing two strings.

You can get a little more sophisticated, as a C programmer, and forego those terminating nulls. Then you might call memcmp(s1, s2, n) (also declared in string.h) to compare two sequences both of length n. There are no issues here about strings of different length, so two strings are equal only if all corresponding elements match. Otherwise, the ordering rule is as before, also depending on the encoding used for elements.

Standard C introduces the concept of locales, and with it the notion that some library behavior should change between locales to reflect differences among cultures. Remember that a "culture" is not necessarily a nationality. Librarians have their own culture, at least when it comes to ordering the entries in a card catalog. That culture differs in small ways from the way the Yellow Pages tribe orders names in the Manhattan telephone directory. And that tribe has slightly different ordering rules from Yellow Pages du Montreal (a name I just made up, I hope).

One of the goals of locales is to encapsulate ordering rules for character strings. So Standard C lets you call strxfrm(s1, x1, n) (also declared in string.h) to perform a culture-specific transformation of the string at s1 into the n-element array beginning at x1. If you similarly transform s2 to x2, then the call strcmp(x1, x2) tells you the ordering between s1 and s2 peculiar to that culture (locale). A simple transformation might be to force all letters to lower case. A more ambitious approach might be to omit all punctuation. The possibilities are open ended, and for a good reason. Since people first began scratching glyphs on clay tablets, they have invented more ways to order sequences of glyphs than anyone can count.

I should point out in passing that Standard C also lets you call strcoll(s1, s2) (also declared in string.h) to do all these operations at once. You use strxfrm to transform a string you may have to compare quite often — a key in a hash table, for example. You use strcoll as a more compact, but less efficient, alternative.

Standard C also added at least minimal support for large character sets. By "large" I mean sets with way more than the 256 distinct codes you can represent in a single byte. The typedef wchar_t (defined in several headers) can represent all values of the largest character set supported by an implementation. Typically, it is a synonym for a 16-bit integer type, such as short or unsigned short. Objects of this type are called "wide characters." Corresponding to each wide-character encoding is a way to write a sequence of wide characters as a sequence of bytes. Each wide-character code is represented as one or more bytes — thus the term "multibyte encoding" for this representation.

No provision was made in the original C Standard (ANSI 1989, ISO 1990) for culture-specific comparison of wide-character sequences. Remember, I said support was minimal. Presumably, an implementation could have strxfrm do sensible things with multibyte sequences, but the C Standard was profoundly silent on such matters. In any event, it was an unappealing prospect to first convert wide-character sequences to multibyte, then transform them further, just to apply a culture-specific ordering rule.

Amendment 1 to the C Standard greatly expanded support for large character sets. One of the additions was the pair of functions wcsxfrm and wcscoll (both declared in wchar.h), where the wcs stands for "wide-character string." Their behavior is entirely analogous to the older pair that operates on "narrow" (or single-byte) characters.

Enter Standard C++

Now we come to the C++ Standard. It incorporates the entire Standard C library, including Amendment 1 additions and then some. So you can certainly do in C++ all the string comparisons now possible in Standard C. What else could you possibly need?

Well, the C++ standards committee was not content to rely on the locale machinery built into the Standard C library. I tried to explain why nearly two years ago. (See "Standard C/C++: Introduction to Locales," CUJ, October 1997.) And ever since then, I've been describing the locale machinery invented for Standard C++. Put simply, a locale in C is a global entity that modifies the behavior of certain library functions. In C++, a locale is encapsulated inside an object of type locale (oddly enough). Its behavior is expressed through member functions of the dozens of "locale facets" designated by the locale object.

You can thus have multiple locales in a C++ program, with no need to switch the global locale among them. Each input or output stream, in fact, is "imbued" with its own locale object. Your program can read by French rules and write by German rules, if you choose. You can tailor your own locale objects with facets of all nations. You can even define brand new facets, and create locale objects that designate these new creatures. I've spent most of the past two years in these pages describing the facets supplied with the Standard C++ library. Every locale object designates at least these facets, which number in the dozens. I'm almost to the end of the list, but I still have one or two left to describe.

The facet of the month is template class collate. As you've no doubt guessed by now, it encapsulates locale-specific rules for comparing two character sequences. The template class has a single class parameter, which I usually write _E, for element type. Any locale object you create has the two facets collate<char>, for comparing single-byte characters, and collate<wchar_t>, for comparing wide characters.

Template class collate defines three public member functions. The first is an obvious analog to strxfrm:

string_type transform(const _E *_F,
    const _E *_L) const
    {return (do_transform(_F, _L)); }

Here, string_type is the appropriate flavor of template class basic_string (either string or wstring, for the standard facets) to store sequences of the element type _E. Rather than rely on a null terminator, this function instead takes a range of elements delimited by pointers, much like the iterators used throughout STL.

Note the usual practice of foisting the real work off on a protected "do" function. That lets you derive a class from collate<T> and override just the virtual functions you choose, without monkeying with the public interface to the class.

The second public member function is an obvious analog to strcoll:

int compare(const _E *_F1,
    const _E *_L1, const _E *_F2,
    const _E *_L2) const
    {return (do_compare(_F1, _L1,
        _F2, _L2)); }

It defines its operands with two pairs of pointers, (_F1, _L1) and (_F2, _L2).

Finally, template class collate defines the public member function:

long hash(const _E *_F,
    const _E *_L) const
    {return (do_hash(_F, _L)); }

The idea is that if you're going to be comparing strings by some special rule, you may well want to construct a hash table as well. Equality doesn't necessarily mean identity — not after an arbitrary transformation — so it's highly desirable that two "equal" strings hash to the same location in the table.

For completeness, here is a brief description of each of the virtual functions in template class collate. As in the other facets, they do all the interesting work:

collate::do_compare

virtual int
do_compare(const E *first1,
    const E *last1, const E *first2,
    const E *last2) const;

The protected virtual member function compares the sequence at [first1, last1) with the sequence at [first2, last2). It compares values by applying operator< between pairs of corresponding elements of type E. The first sequence compares less if it has the smaller element in the earliest unequal pair in the sequences, or if no unequal pairs exist but the first sequence is shorter.

If the first sequence compares less than the second sequence, the function returns -1. If the second sequence compares less, the function returns +1. Otherwise, the function returns zero.

collate::do_hash

virtual long do_hash(const E *first,
    const E *last) const;

The protected virtual member function returns an integer derived from the values of the elements in the sequence [first, last). Such a "hash" value can be useful, for example, in distributing sequences pseudo randomly across an array of lists.

collate::do_transform

virtual string_type
do_transform(const E *first,
    const E *last) const;

The protected virtual member function returns an object of class string_type whose controlled sequence is a copy of the sequence [first, last). If a class derived from collate<E> overrides do_compare, it should also override do_transform to match. Put simply, two transformed strings should yield the same result, when passed to collate::compare, that you would get from passing the untransformed strings to compare in the derived class.

Using collate

So how do you do use these facets? Well, first you have to get a handle on one. I've shown the idiom many times of late, but it doesn't hurt to show it yet again, because it's tricky. Let's say you want to compare two strings using the locale imbued in the standard input stream cin. You obtain a reference to the facet by writing:

const collate<char>& fac =
    use_facet< collate<char> >
        (cin.getloc());

This expression makes use of a fairly new aspect of the language. Template function use_facet is declared as:

template<class Facet>
    Facet& use_facet(locale);

Note that the template function has no parameters that help the translator determine the type of the parameter Facet. This was once invalid C++, but now the language has an escape clause. You can call use_facet<T>(arg) to tell the translator that the type of the parameter is T.

That's all well and good for up-to-date compilers, and there are a few that can now handle this notation. But to meet the needs of Visual C++ and other older (but still widely used) compilers, I had to supply an alternate declaration for the template function:

template<class Facet> Facet&
    use_facet(locale, Facet *);

With this declaration, the call above is rewritten as:

const collate<char>& fac =
    use_facet(cin.getloc(),
        (collate<char> *)0);

To keep the code as portable as possible, I introduce a macro. So what you really should write, during this period of transition, is:

const collate<char>& fac =
    _USE(cin.getloc(),
        collate<char> *);

and let the library supply the appropriate definition of the macro _USE for a given compiler.

So much for the hard part. Let us now compare two objects of class string called s1 and s2. To test for equality, you apparently can write:

if (fac.compare(s1.begin(), s1.end(),
    s2.begin(), s2.end()) == 0)
    <WRONG>

Sadly, this is not a safe thing to do. Between the time that collate was specified and when the draft C++ Standard was frozen, the specification changed for template class basic_string. It is no longer promised that a call such as s1.begin() returns a pointer to char. All containers in the library (and basic_string is more or less a container now, in the STL sense) have iterators that are implementation defined. There is no promised implicit conversion from such an iterator to a pointer. So the safer thing to write for the comparison above is:

const char *q1 = s1.c_str();
const char *q2 = s2.c_str();
if (fac.compare(q1, q1 + s1.length(),
    q2, q2 + s2.length()) == 0)
    <strings are equivalent>

Naturally, the same caveats apply to the other two member functions, transform and hash.

Things are not quite as bad as they may seem, however. The Standard C++ library provides yet another way to make use of facet collate. It seems that class locale defines an interesting member template version of operator(). Cognoscenti will recognize this operator as the source of all "function objects" (or "functoids" if you really like Newspeak). An object of a class that defines this member operator can be used wherever a function name or pointer can normally appear, followed by an appropriate argument list.

Listing 1 shows the definition of this template member function in all its unreadable glory. (All those underscores don't help, of course, but they're required to avoid user-defined macro names.) The upshot of this machinery is that you can write:

locale loc = cin.getloc();
if (loc(s1, s2) == 0)
    <strings are equivalent>

and compare your two strings as if the locale object loc were magically doing all the work for you. And in a real sense it is, by passing the request onto one of its facets.

Listing 2 shows one way to implement template class collate. As is my custom, I omit various magic bits of initialization code. It has to be tailored in nonstandard ways to interface to the underlying C locale machinery on each implementation. All you see of the interface here is the calls to two magic functions, _LStrxfrm and _Lstrcoll. They take the equally magic cookie _Coll, pumped full of locale-specific ordering information at construction time, and perform functions analogous to their similarly named cousins in the Standard C library.

And that's the thoroughly modern way to compare two strings in Standard C++.

P.J. Plauger is Senior Editor of C/C++ Users Journal and President of Dinkumware, Ltd. He is the author of the Standard C++ Library shipped with Microsoft's Visual C++, v5.0. For eight years, he served as convener of the ISO C standards committee, WG14. He remains active on the C++ committee, J16. His latest books are The Draft Standard C++ Library, Programming on Purpose (three volumes), and Standard C (with Jim Brodie), all published by Prentice-Hall. You can reach him at pjp@plauger.com.