September 2001/The Standard Librarian

C/C++ Contributing Editors

The Standard Librarian: I/O and Function Objects

Matt Austern

The Standard C++ library is crafted for extensibility, but doing it right can stymie the best of us. Matt combines function objects with the right "traits" to intelligently extend IOStreams.

The Standard C++ I/O library is extremely general, and it’s also user-extensible. Sometimes there’s a tension between those two aspects: when you write a new I/O component, it isn’t always easy to make sure that it’s able to accommodate all of the permitted variability. Getting used to a few simple tools can help.

After discussing alternatives, I'll show a simple and efficient method for finding characters in a stream, a method that works for arbitrary locales and even for arbitrary character types.

Character Classification

Suppose that you’re reading characters from an input stream buffer and that you’re looking for the start of a word: the first alphabetic character or letter.

In older programs, you might sometimes see a test that looks something like this:
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
  <do something>
This sort of test is a terrible idea. First, it isn’t portable. It makes unjustified assumptions about the character set: it only works if the upper- and lower-case letters are contiguous blocks. Some character sets (like ASCII) work that way, but others don’t. Second, this test is incompatible with internationalization. It assumes that there are only 26 letters, a through z. That’s true in English, but it’s a parochial assumption; other languages, even languages that use the Roman alphabet, have a different set of letters. In general, you can’t tell whether something is a letter unless you know what language you’re working in. (Is l a letter, or just a mathematical symbol? How about Å?)

In C, you’d be advised to write the test differently:
if (isalpha(c))
  <do something>
That’s better. The isalpha function is part of the Standard C library; it takes care of all of the details of the character set. It can even accommodate a form of internationalization. The Standard C library has a notion of a “current locale.” The behavior of isalpha, and various other C library functions, depends on which locale is currently active; you can switch to a new locale by calling the setlocale function. That’s still not very satisfactory, though, because the locale is a global variable. The C library gives you no good way to use two different locales at the same time, perhaps in different parts of your program; the two locales would interfere with each other.

The C++ library has a different mechanism for internationalization: a locale is an object, and each stream and each stream buffer has its own locale associated with it. A locale object is composed of facet objects [1]. The C++ equivalent of the C library’s isalpha is a member function of the ctype facet.

If you’re writing a new I/O component, you should never make assumptions about which locale is in use. You should not use C library functions like isalpha and toupper, because they might refer to the wrong locale. Instead, for full generality, you should fetch and use the locale that’s associated with the stream you’re using. Here is a correct way to find an alphabetic character when you’re reading from a streambuf:
std::basic_streambuf<char>* buf;
...
std::locale L(buf->getloc());
const std::ctype<char>& ct =
  std::use_facet<std::ctype<char> >(L);
std::istreambuf_iterator<char> f(buf);
std::istreambuf_iterator<char> l;

while (f != l) {
  if (ct.is(std::ctype_base::alpha, *f))
    <do something>
  ++f;
}
There are other ways that this loop could be written, which is why I said “a correct way” instead of “the correct way.” You could, for example, dispense with the explicit call to use_facet and write:
while (f != l) {
  if (std::isalpha(*f, L))
    <do something>
  ++f;
}
This looks a bit prettier, but actually it’s inferior to the first version. Writing std::isalpha(c, L) is exactly the same as writing:
std::use_facet<std::ctype<char> >(L).is(
  std::ctype_base::alpha, c)
It’s better to move use_facet outside the loop, instead of calling it on every iteration.
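As a minimal sketch of the hoisted-facet loop, the fragment above might be packaged as a template that works for any character type. The names skip_to_alpha and skipped_before_alpha are hypothetical, introduced only for illustration, and the wrapper assumes a narrow-character string stream.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <locale>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical helper: consumes characters from buf until the next one is
// alphabetic according to the buffer's own locale, and returns how many
// characters were skipped. If no letter is found, all input is consumed.
template <class charT, class Traits>
std::size_t skip_to_alpha(std::basic_streambuf<charT, Traits>* buf) {
    assert(buf != 0);  // precondition: a valid stream buffer
    const std::locale loc(buf->getloc());
    // Fetch the facet once, outside the loop.
    const std::ctype<charT>& ct = std::use_facet<std::ctype<charT> >(loc);
    std::istreambuf_iterator<charT, Traits> first(buf), last;
    std::size_t skipped = 0;
    while (first != last && !ct.is(std::ctype_base::alpha, *first)) {
        ++first;
        ++skipped;
    }
    return skipped;
}

// Hypothetical convenience wrapper for trying the helper on a narrow string.
inline std::size_t skipped_before_alpha(const std::string& text) {
    std::istringstream in(text);
    std::streambuf* buf = in.rdbuf();
    return skip_to_alpha(buf);
}
```

The facet reference stays valid for as long as the local locale copy exists, which is why the sketch keeps loc alive for the whole loop.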

It’s a bit frustrating, though, that there has to be an explicit loop at all. Doesn’t the standard library give us anything we can reuse? Almost, but not quite. The ctype facet has a promising-looking member function, scan_is, which finds the first character in a range with a specified character classification. Unfortunately, scan_is is useless here. It’s not a template: it only works on character pointers, and we’re dealing with a range of stream buffer iterators.
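For contrast, here is a small sketch of the one situation where scan_is does fit: the characters already sit in a contiguous array, so a pair of pointers is available. The function name first_alpha_index is hypothetical, and the default of the classic "C" locale is an assumption made only to keep the example self-contained.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: index of the first alphabetic character in s, as
// classified by loc, or s.size() if there is none.
inline std::size_t first_alpha_index(
        const std::string& s,
        const std::locale& loc = std::locale::classic()) {
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    const char* first = s.data();
    const char* last = first + s.size();
    // scan_is takes a pair of character pointers, not arbitrary iterators.
    const char* hit = ct.scan_is(std::ctype_base::alpha, first, last);
    return static_cast<std::size_t>(hit - first);
}
```

Nothing comparable is available for istreambuf_iterator, because a stream buffer does not expose its contents as a single pointer range.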

A more promising approach is the find_if algorithm. We should be able to find the first alphabetic character by writing std::find_if(f, l, <something>); the only question is what predicate to supply. One might try some combination of bind1st/bind2nd and mem_fun, but it doesn’t take much thought to realize that that can’t work: std::ctype<>::is has two arguments, and the mem_fun adaptors only go up to member functions of one argument. We need something else.

Characters and Character Traits

There’s another way in which the code sample I’ve shown isn’t as general as it might be. I hinted at it, by writing basic_streambuf<char> instead of streambuf. All of the standard streams and stream buffers are templates: they’re parameterized by character type. The standard library instantiates those templates for two predefined character types, char and wchar_t, and it defines typedefs for those instantiations: streambuf is the same as basic_streambuf<char>, wstring is the same as basic_string<wchar_t>, and so on. (The only important I/O class that’s not a template is std::locale. All of the predefined locale facets, however, are templates.)

If you’re writing an I/O component that you expect to be reused, you should work with basic_istream<charT>, not with istream; you should be prepared for charT to be char, or wchar_t, or some user-defined type you’ve never heard of. This is mostly a straightforward matter of writing your component as a template, but there is one complication.

Suppose that, instead of looking for a general class of characters, you’re looking for one specific character (a newline, for example). It would seem that this time you really could just reuse a standard algorithm:
std::basic_streambuf<charT>* buf;
charT c;
...
std::istreambuf_iterator<charT> f(buf);
std::istreambuf_iterator<charT> l;
f = std::find(f, l, c);
Unfortunately, this code is wrong. It will probably pass all of your tests, but it will fail when someone tries to do something unusual. It doesn’t take into account one last way in which the I/O library is parameterized: traits.

Standard I/O classes like basic_streambuf have two template parameters, not one. The second parameter has a default, which is why it was correct for me to omit it in the last code snippet, but users are permitted to supply it explicitly. The second parameter determines character traits, and it defaults to std::char_traits<charT>.

Traits in general are a basic technique of template-based programming, and character traits are an application of that technique to I/O. A character traits class parameterizes the fundamental properties of a specific character type. Every character traits class must satisfy a set of requirements given in Table 37 of the C++ Standard. Such a class contains a few nested typedefs, including char_type, the character type under discussion, and int_type, an integer type that’s associated with char_type. A traits class also includes a few static member functions (typically inline) that perform basic character operations: Traits::eq(x, y) compares x and y for equality; Traits::assign(x, y) assigns y to x; Traits::eof() returns the end-of-file indicator, a value of type int_type; Traits::to_int_type(c) and Traits::to_char_type(n) convert between char_type and int_type.

(Why do we have both char_type and int_type? For the same reason as in C! Some operations, such as basic_streambuf’s sbumpc and snextc member functions, have to be able to return every possible valid value that a Traits::char_type can hold and also have to be able to return a distinct end-of-file indicator. The return type of such functions is Traits::int_type. Once you’ve checked such a return value and verified that it’s not Traits::eof(), you can convert it to char_type.)
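The check-then-convert pattern described above might look like the following sketch. The names read_all and drain are hypothetical, and the comparison against end-of-file uses Traits::eq_int_type, the traits operation for comparing two int_type values.

```cpp
#include <cassert>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical helper: reads every remaining character from buf.
template <class charT, class Traits>
std::basic_string<charT, Traits>
read_all(std::basic_streambuf<charT, Traits>* buf) {
    assert(buf != 0);  // precondition: a valid stream buffer
    std::basic_string<charT, Traits> out;
    // sbumpc returns int_type so that it can also signal end-of-file.
    typename Traits::int_type n = buf->sbumpc();
    while (!Traits::eq_int_type(n, Traits::eof())) {
        // Only after ruling out eof is the conversion to char_type safe.
        out.push_back(Traits::to_char_type(n));
        n = buf->sbumpc();
    }
    return out;
}

// Hypothetical convenience wrapper for trying the helper on a narrow string.
inline std::string drain(const std::string& text) {
    std::istringstream in(text);
    std::streambuf* buf = in.rdbuf();
    return read_all(buf);
}
```

With the default narrow traits, a character such as '\xff' still round-trips, because its int_type form is distinct from the end-of-file value.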

This, finally, is why:
f = std::find(f, l, c);
is subtly wrong, or at least insufficiently general. A character traits class tells you how to compare characters for equality, and anyone who goes to the trouble of using a non-default character traits class probably has a reason for it. The std::find algorithm knows nothing about traits; it just uses operator==. If you use std::find, you’re throwing away the information that the traits class supplies.
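To make the difference concrete, here is a sketch built around a hypothetical case-insensitive traits class. The names ci_char_traits, ci_string, find_with_operator, and find_with_traits are all illustrative assumptions. Case folding goes through the classic locale's ctype facet purely to keep the example short.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical traits class: treats characters as equal if they match
// after upper-casing in the classic "C" locale.
struct ci_char_traits : std::char_traits<char> {
    static char fold(char c) {
        return std::use_facet<std::ctype<char> >(
                   std::locale::classic()).toupper(c);
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }
};

typedef std::basic_string<char, ci_char_traits> ci_string;

// std::find compares with operator== and never consults the traits class.
inline std::size_t find_with_operator(const ci_string& s, char c) {
    return static_cast<std::size_t>(
        std::find(s.begin(), s.end(), c) - s.begin());
}

// An explicit loop that asks the traits class how to compare characters.
inline std::size_t find_with_traits(const ci_string& s, char c) {
    ci_string::const_iterator it = s.begin();
    while (it != s.end() && !ci_char_traits::eq(*it, c))
        ++it;
    return static_cast<std::size_t>(it - s.begin());
}
```

Searching "hello World" for a lowercase 'w' illustrates the problem: the operator== search runs off the end, while the traits-based search stops at the capital W.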

Whenever you’re working with a class that’s parameterized by character traits, you should use the supplied traits class for all character comparisons and all character assignments. The predefined I/O components follow this rule, and your I/O components should do the same if you intend them for reuse.

We can’t use std::find, but we can use std::find_if — again, if we have the appropriate predicate.

Function Objects

In both cases, the main problem is one of packaging: the Standard C++ library provides the tests that we need; it just doesn’t provide them in the form that std::find_if needs. The solution is straightforward: write a few small function object adaptors that transform one interface into another.

Listing 1 contains a predicate that uses std::ctype<> for character classification, and Listing 2 contains a collection of predicates based on character traits. Using these function objects, we can find an alphabetic character by writing:
f = std::find_if(f, l,
        is_char_class<charT>(L,
            std::ctype_base::alpha));
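Listing 1 is not reproduced here, so the following is only a sketch of what a predicate with this constructor signature might look like. The class body is an assumption rather than the article's actual listing, and rest_from_first_alpha is a hypothetical wrapper for trying it out.

```cpp
#include <algorithm>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Sketch of a classification predicate usable with std::find_if.
template <class charT>
class is_char_class {
public:
    typedef charT argument_type;
    typedef bool result_type;

    is_char_class(const std::locale& loc, std::ctype_base::mask m)
        : loc_(loc),
          ct_(&std::use_facet<std::ctype<charT> >(loc_)),
          mask_(m) {}

    bool operator()(charT c) const { return ct_->is(mask_, c); }

private:
    std::locale loc_;              // keeps the facet alive across copies
    const std::ctype<charT>* ct_;  // fetched once, in the constructor
    std::ctype_base::mask mask_;
};

// Hypothetical wrapper: returns the input from the first letter onward.
inline std::string rest_from_first_alpha(const std::string& text) {
    std::istringstream in(text);
    std::istreambuf_iterator<char> f(in.rdbuf()), l;
    f = std::find_if(f, l,
            is_char_class<char>(in.getloc(), std::ctype_base::alpha));
    return std::string(f, l);
}
```

Storing a copy of the locale alongside the facet pointer is a design choice in this sketch: algorithms copy their predicates freely, and each copy then keeps the facet it points to valid.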
We can find a specific character c by writing:
f = std::find_if(f, l,
        std::bind2nd(
            traits_eq<std::char_traits<char> >(),
            'c'));
In addition to traits_eq<>, Listing 2 also contains two function objects that work with characters in their int_type form. This is a less common need, first, because you only need to worry about int_type if you’re using a basic_streambuf directly (streambuf iterators return char_type values, not int_type values), and second, because the only real purpose of int_type is to return a value that might be end-of-file. I’ve provided traits_int_eq<>, which compares two int_type values, mostly for completeness. I use it less often than I use is_eof<>, a predicate that takes a single int_type value and returns true if that value is the end-of-file indicator.
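Listing 2 is likewise not reproduced here. The sketch below shows one plausible shape for traits_eq<> and is_eof<>; the bodies are assumptions, not the article's code. Because std::bind2nd was deprecated in C++11 and removed in C++17, the sketch also adds a hypothetical unary variant, traits_eq_to<>, that stores the character to compare against, along with a hypothetical offset_of helper to exercise it.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Binary predicate: compares two characters the way the traits class says.
template <class Traits>
struct traits_eq {
    typedef typename Traits::char_type char_type;
    bool operator()(const char_type& x, const char_type& y) const {
        return Traits::eq(x, y);
    }
};

// Hypothetical unary variant with the second character bound in advance.
template <class Traits>
class traits_eq_to {
public:
    typedef typename Traits::char_type char_type;
    explicit traits_eq_to(const char_type& c) : c_(c) {}
    bool operator()(const char_type& x) const { return Traits::eq(x, c_); }
private:
    char_type c_;
};

// Unary predicate on int_type: true for the end-of-file indicator.
template <class Traits>
struct is_eof {
    bool operator()(const typename Traits::int_type& n) const {
        return Traits::eq_int_type(n, Traits::eof());
    }
};

// Hypothetical helper: offset of the first character equal to c under the
// string's own traits, or s.size() if there is none.
template <class charT, class Traits>
std::size_t offset_of(const std::basic_string<charT, Traits>& s, charT c) {
    return static_cast<std::size_t>(
        std::find_if(s.begin(), s.end(), traits_eq_to<Traits>(c))
        - s.begin());
}
```

Taking the traits class from the string's own type, rather than naming std::char_traits explicitly, keeps the comparison consistent with whatever traits the caller chose.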

Conclusion

The predicates in Listings 1 and 2 are trivial, but they’re useful nonetheless. Without them, or something like them, it’s a nuisance to combine locales and character traits with generic algorithms. With them you can easily write code that’s correct even in the presence of unusual character types and locales, instead of code that’s only almost correct. You have to observe two simple rules:

Always use the ctype facet for character classification. Don’t assume that characters work anything like they do in ASCII.

Always use the traits class’s member functions for character comparison, assignment, and conversion between char_type and int_type. Don’t assume that you can use operator==, operator<, or operator=.

These rules may not seem important today, but as internationalization and alternate character types become more common, they’re likely to matter more in the future.

Implementers of the standard library often use function objects similar to the ones in Listings 1 and 2 when implementing such library components as basic_istream’s member functions. If you want to write anything like those member functions, you’ll need them too.

Note

[1] Matt Austern. “The Standard Librarian: Defining a Facet,” C/C++ Users Journal C++ Experts Forum, June 2001, <www.cuj.com/experts/1906/austern.htm>.

Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committee’s library working group. He works at AT&T Labs — Research and can be contacted at austern@research.att.com.