The Standard C++ library is crafted for extensibility, but doing it right can stymie the best of us. Matt combines function objects with the right "traits" to intelligently extend IOStreams.
The Standard C++ I/O library is extremely general, and it's also user-extensible. Sometimes there's a tension between those two aspects: when you write a new I/O component, it isn't always easy to make sure that it's able to accommodate all of the permitted variability. Getting used to a few simple tools can help.
After discussing alternatives, I'll show a simple and efficient method for finding characters in a stream, a method that works for arbitrary locales and even for arbitrary character types.
Character Classification
Suppose that you're reading characters from an input stream buffer and that you're looking for the start of the word: the first alphabetic character or letter.
In older programs, you might sometimes see a test that looks something like this:
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
      <do something>

This sort of test is a terrible idea. First, it isn't portable. It makes unjustified assumptions about the character set: it only works if the upper- and lower-case letters are contiguous blocks. Some character sets (like ASCII) work that way, but others don't. Second, this test is incompatible with internationalization. It assumes that there are only 26 letters, a through z. That's true in English, but it's a parochial assumption; other languages, even languages that use the Roman alphabet, have a different set of letters. In general, you can't tell whether something is a letter unless you know what language you're working in. (Is l a letter, or just a mathematical symbol? How about Å?)
In C, you'd be advised to write the test differently:

    if (isalpha(c))
      <do something>

That's better. The isalpha function is part of the Standard C library; it takes care of all of the details of the character set. It can even accommodate a form of internationalization. The Standard C library has a notion of a current locale. The behavior of isalpha, and various other C library functions, depends on which locale is currently active; you can switch to a new locale by calling the setlocale function. That's still not very satisfactory, though, because the locale is a global variable. The C library gives you no good way to use two different locales at the same time, perhaps in different parts of your program; the two locales would interfere with each other.
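To make the limitation concrete, here is a minimal sketch of the C-style approach. The helper name count_alpha_c is hypothetical. The cast to unsigned char is there because isalpha is only defined for values representable as unsigned char (or EOF), which a plain char with the high bit set may not be.

```cpp
#include <cctype>   // std::isalpha
#include <cstddef>  // std::size_t

// Hypothetical helper: counts the characters in a NUL-terminated string that
// the *current global C locale* classifies as alphabetic.
std::size_t count_alpha_c(const char* s)
{
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        // isalpha takes an int whose value must be representable as
        // unsigned char (or be EOF), so convert before calling it.
        if (std::isalpha(static_cast<unsigned char>(*s)))
            ++n;
    }
    return n;
}
```

Notice that the function has no locale parameter: its answer depends on whatever the most recent call to setlocale selected, anywhere in the program.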
The C++ library has a different mechanism for internationalization: a locale is an object, and each stream and each stream buffer has its own locale associated with it. A locale object is composed of facet objects [1]. The C++ equivalent of the C library's isalpha is a member function of the ctype facet.
If you're writing a new I/O component, you should never make assumptions about which locale is in use. You should not use C library functions like isalpha and toupper, because they might refer to the wrong locale. Instead, for full generality, you should fetch and use the locale that's associated with the stream you're using. Here is a correct way to find an alphabetic character when you're reading from a streambuf:
    std::basic_streambuf<char>* buf;
    ...
    std::locale L(buf->getloc());   // buf is a pointer, so use ->
    // use_facet returns a reference to a const facet
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(L);
    std::istreambuf_iterator<char> f(buf);
    std::istreambuf_iterator<char> l;   // end-of-stream iterator
    while (f != l) {
      if (ct.is(std::ctype_base::alpha, *f))
        <do something>
      ++f;
    }

There are other ways that this loop could be written, which is why I said a correct way instead of the correct way. You could, for example, dispense with the explicit call to use_facet and write:
    while (f != l) {
      if (std::isalpha(*f, L))
        <do something>
      ++f;
    }

This looks a bit prettier, but it's actually inferior to the first version. For a character c, writing std::isalpha(c, L) is exactly the same as writing:

    std::use_facet<std::ctype<char> >(L).is(std::ctype_base::alpha, c)

It's better to move use_facet outside the loop, instead of calling it on every iteration.
It's a bit frustrating, though, that there has to be an explicit loop at all. Doesn't the standard library give us anything we can reuse? Almost, but not quite. The ctype facet has a promising-looking member function, scan_is, which finds the first character in a range with a specified character classification. Unfortunately, scan_is is completely useless for us. It's not a template: it only works on character pointers, and we're dealing with a range of iterators.
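For comparison, here is a small sketch of what scan_is can do when the characters really are in a contiguous buffer. The wrapper name first_alpha_in_buffer is hypothetical; scan_is itself takes a mask and a pair of const char pointers and returns a pointer.

```cpp
#include <locale>  // std::locale, std::ctype, std::use_facet

// Hypothetical helper: returns a pointer to the first alphabetic character in
// [low, high), or high if there is none, according to locale loc.
const char* first_alpha_in_buffer(const char* low, const char* high,
                                  const std::locale& loc)
{
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    // scan_is accepts only pointers into an array of characters, which is
    // why it cannot be handed a pair of istreambuf_iterators.
    return ct.scan_is(std::ctype_base::alpha, low, high);
}
```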
A more promising approach is the find_if algorithm. We should be able to find the first alphabetic character by writing std::find_if(f, l, <something>); the only question is what predicate to supply. One might try some combination of bind1st/bind2nd and mem_fun, but it doesn't take much thought to realize that that can't work: std::ctype<>::is has two arguments, and the mem_fun adaptors only go up to member functions of one argument. We need something else.
Characters and Character Traits
There's another way in which the code sample I've shown isn't as general as it might be. I hinted at it, by writing basic_streambuf<char> instead of streambuf. All of the standard streams and stream buffers are templates: they're parameterized by character type. The standard library instantiates those templates for two predefined character types, char and wchar_t, and it defines typedefs for those instantiations: streambuf is the same as basic_streambuf<char>, wstring is the same as basic_string<wchar_t>, and so on. (The only important I/O class that's not a template is std::locale. All of the predefined locale facets, however, are templates.)
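The typedef relationships can be checked at compile time. This is only a sketch, and it assumes a C++11 or later compiler for static_assert and std::is_same; the constant name typedefs_match is hypothetical.

```cpp
#include <iosfwd>       // declarations of the stream templates and typedefs
#include <string>       // std::wstring, std::basic_string
#include <type_traits>  // std::is_same (C++11)

// True when each typedef names exactly the instantiation described above.
constexpr bool typedefs_match =
    std::is_same<std::streambuf, std::basic_streambuf<char> >::value &&
    std::is_same<std::wstring, std::basic_string<wchar_t> >::value &&
    std::is_same<std::wistream, std::basic_istream<wchar_t> >::value;

static_assert(typedefs_match,
              "the standard typedefs are plain template instantiations");
```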
If you're writing an I/O component that you expect to be reused, you should work with basic_istream<charT>, not with istream; you should be prepared for charT to be char, or wchar_t, or some user-defined type you've never heard of. This is mostly a straightforward matter of writing your component as a template, but there is one complication.
Suppose that, instead of looking for a general class of characters, you're looking for one specific character (a newline, for example). It would seem that this time you really could just reuse a standard algorithm:
    std::basic_streambuf<charT>* buf;
    charT c;
    ...
    std::istreambuf_iterator<charT> f(buf);
    std::istreambuf_iterator<charT> l;
    f = std::find(f, l, c);

Unfortunately, this code is wrong. It will probably pass all of your tests, but it will fail when someone tries to do something unusual. It doesn't take into account one last way in which the I/O library is parameterized: traits.
Standard I/O classes like basic_streambuf have two template parameters, not one. The second parameter has a default, which is why it was correct for me to omit it in the last code snippet, but users are permitted to supply it explicitly. The second parameter determines character traits, and it defaults to std::char_traits<charT>.
Traits in general are a basic technique of template-based programming, and character traits are an application of that technique to I/O. A character traits class parameterizes the fundamental properties of a specific character type. Every character traits class must satisfy a set of requirements given in Table 37 of the C++ Standard. Such a class contains a few nested typedefs, including char_type, the character type under discussion, and int_type, an integer type that's associated with char_type. A traits class also includes a few static member functions (typically inline) that perform basic character operations: Traits::eq(x, y) compares x and y for equality; Traits::assign(x, y) assigns y to x; Traits::eof() returns an end-of-file character; Traits::to_int_type(c) and Traits::to_char_type(n) convert between char_type and int_type.
(Why do we have both char_type and int_type? For the same reason as in C! Some operations, such as basic_streambuf's sbumpc and snextc member functions, have to be able to return every possible valid value that a Traits::char_type can hold and also have to be able to return a distinct end-of-file indicator. The return type of such functions is Traits::int_type. Once you've checked such a return value and verified that it's not Traits::eof(), you can convert it to char_type.)
This, finally, is why:
    f = std::find(f, l, c);

is subtly wrong, or at least insufficiently general. A character traits class tells you how to compare characters for equality, and anyone who goes to the trouble of using a non-default character traits class probably has a reason for it. The std::find algorithm knows nothing about traits; it just uses operator==. If you use std::find, you're throwing away the information that the traits class supplies.
Whenever you're working with a class that's parameterized by character traits, you should use the supplied traits class for all character comparisons and all character assignments. The predefined I/O components follow this rule, and your I/O components should do the same if you intend them for reuse.
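As an illustration of why the distinction matters, consider a traits class whose notion of equality differs from operator==. The class below is a hypothetical example, not something from the standard library: it folds case using the classic "C" locale. A production-quality version would also need to override compare and find so that they agree with eq and lt; they are left out here to keep the sketch short.

```cpp
#include <locale>  // std::toupper(charT, const std::locale&)
#include <string>  // std::char_traits

// Hypothetical traits class: identical to std::char_traits<char> except that
// eq and lt ignore case, as defined by the classic "C" locale.
struct ci_char_traits : std::char_traits<char>
{
    static char fold(char c)
    { return std::toupper(c, std::locale::classic()); }

    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }
};
```

With this class, 'a' == 'A' is false while ci_char_traits::eq('a', 'A') is true, so an algorithm that relies on operator== would miss matches that a stream parameterized by ci_char_traits is meant to accept.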
We can't use std::find, but we can use std::find_if again, if we have the appropriate predicate.
Function Objects
In both cases, the main problem is one of packaging: the Standard C++ library provides the tests that we need; it just doesn't provide them in the form that std::find_if needs. The solution is simple: write some simple function object adaptors to transform one interface into another.
Listing 1 contains a predicate that uses std::ctype<> for character classification, and Listing 2 contains a collection of predicates based on character traits. Using these function objects, we can find an alphabetic character by writing:
    f = std::find_if(f, l,
                     is_char_class<charT>(L, std::ctype_base::alpha));

We can find a specific character c by writing:

    f = std::find_if(f, l,
                     std::bind2nd(traits_eq<std::char_traits<charT> >(), c));

In addition to traits_eq<>, Listing 2 also contains two function objects that work with characters in their int_type form. This is a less common need, first, because you only need to worry about int_type if you're using a basic_streambuf directly (streambuf iterators return char_type values, not int_type values), and second, because the only real purpose of int_type is to return a value that might be end-of-file. I've provided traits_int_eq<>, which compares two int_type values, mostly for completeness. I use it less often than I use is_eof<>, a predicate that takes a single int_type value and returns true if that value is the end-of-file indicator.
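The traits-based predicates are equally small. The following is a sketch of how they might be written, not the article's Listing 2: the three names come from the text above, but the bodies and the nested typedefs are assumptions. The typedefs are the ones std::bind2nd looks for in an adaptable binary function.

```cpp
#include <string>  // std::char_traits

// Binary predicate: compares two char_type values the way Traits says to.
template <class Traits>
struct traits_eq
{
    typedef typename Traits::char_type first_argument_type;
    typedef typename Traits::char_type second_argument_type;
    typedef bool                       result_type;

    bool operator()(const first_argument_type& x,
                    const second_argument_type& y) const
    { return Traits::eq(x, y); }
};

// Binary predicate: compares two int_type values.
template <class Traits>
struct traits_int_eq
{
    typedef typename Traits::int_type first_argument_type;
    typedef typename Traits::int_type second_argument_type;
    typedef bool                      result_type;

    bool operator()(const first_argument_type& x,
                    const second_argument_type& y) const
    { return Traits::eq_int_type(x, y); }
};

// Unary predicate: true if an int_type value is the end-of-file indicator.
template <class Traits>
struct is_eof
{
    typedef typename Traits::int_type argument_type;
    typedef bool                      result_type;

    bool operator()(const argument_type& n) const
    { return Traits::eq_int_type(n, Traits::eof()); }
};
```

One caveat for readers on newer compilers: std::bind2nd was deprecated in C++11 and removed in C++17, so there the second argument of traits_eq has to be bound some other way, for example with a lambda.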
Conclusion
The predicates in Listings 1 and 2 are trivial, but they're useful nonetheless. Without them, or something like them, it's a nuisance to combine locales and character traits with generic algorithms. With them you can easily write code that's correct even in the presence of unusual character types and locales, instead of code that's only almost correct. You have to observe two simple rules:
- Always use the ctype facet for character classification. Don't assume that characters work anything like they do in ASCII.
- Always use the traits class's member functions for character comparison, assignment, and conversion between char_type and int_type. Don't assume that you can use operator==, operator<, or operator=.
These rules may not seem important today, but, as internationalization and alternate character types become increasingly important, they're likely to be important in the future.
Implementers of the standard library often use function objects similar to the ones in Listings 1 and 2 when implementing such library components as basic_istream's member functions. If you want to write anything like those member functions, you'll need them too.
Note
[1] Matt Austern. The Standard Librarian: Defining a Facet, C/C++ Users Journal C++ Experts Forum, June 2001, <www.cuj.com/experts/1906/austern.htm>.
Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committees library working group. He works at AT&T Labs Research and can be contacted at austern@research.att.com.