Generic Programming and the C++ STL

Focusing on data representation and algorithms

Dan Zigmond

Dan, a principal of Avatar Software Inc., can be contacted at djz@avasoft.com.

The C++ object-oriented programming model focuses so heavily on specifying interfaces for encapsulation that programmers sometimes gloss over the meat of the software-development process--data representation and algorithms. Once a class interface has been specified, the details of its implementation are often said to be "irrelevant" to the outside world. There is some truth to this, but only in a limited sense. Anyone who actually uses the class cares whether or not it has been implemented correctly and efficiently, and whether or not the internal algorithms and data representations chosen by the class designer were appropriate.

Generic programming provides an alternative to this black-box approach. Where object-oriented programming attempts to take abstract "objects" and give them real-world representations, generic programming tries to do the same with algorithms. It does this by focusing on two questions:

How can you represent efficient algorithms independently of any particular data-representation scheme?

How can you provide an interface to a diverse set of data-representation strategies that gives us the flexibility to choose appropriate representations, but also facilitates the development of more-abstract algorithms?

Until recently, C++ didn't directly address either problem. The assumption was that programming techniques that worked in C would also work in C++. Once the class hierarchy was laid out, C++ simply became a "better C" in terms of the tools it provided for implementing the class.

With the C++-standard committee's adoption of Alexander Stepanov and Meng Lee's Standard Template Library (STL), however, everything has changed. (For background information on STL, see Al Stevens' "C Programming" column, DDJ, March and April 1995.) STL addresses both questions by introducing the "iterator"--a generalization of a standard C pointer that represents a position in an abstract data structure in the same way that a pointer represents a position in memory. There are several types of iterators in STL, each of which defines a distinct subset of the operations we normally associate with pointers. At the very least, all iterators support some sort of incrementing and dereferencing operations.

STL uses iterators to support generic programming in the following way: First, all generic algorithms use iterators to access data structures, rather than interacting with data structures directly. Secondly, all generic data structures provide the most advanced type of iterator they can implement efficiently. Thus, iterators serve as the interface to both generic algorithms and generic data structures, allowing you to mix and match the two at will. In Example 1(a), for instance, the STL find algorithm will return a pointer to the first element of numbers containing the number 37. This same algorithm will also work if you switch from a C array to an STL list, as in Example 1(b).

The Lexicon Example

To illustrate how the STL works, I'll present a filter program called "Lexicon" that takes ASCII text and outputs an alphabetized list of all the unique words in that text, ignoring case and punctuation. The program starts by reading data and storing it in a structure you can work with, and ends by printing the results. In the middle, it must convert each word to a standard form in which both case and punctuation are ignored, remove duplicate words from the data structure, and sort the words alphabetically. One of the interesting things about this problem is that, at first glance, even the ordering of these middle steps is unclear.

Reading the words into a data structure is easy. You start by using an STL vector to store your data; see Example 2. The first line creates a vector of strings named "words." The copy function (like find) is among the most basic algorithms of the STL. It takes three iterators: a start, end, and destination. If the start and end are the same, the algorithm does nothing. If not, it begins by copying the data stored at the start iterator to the memory location indicated by the destination iterator. Then both start and destination are incremented by one. If the start iterator still does not equal the end iterator, the process repeats.

This seems simple enough, but the actual code in Example 2 looks somewhat more complicated because you're copying from a C++ istream, not from another vector or sequence. The istream_iterator class allows you to treat an istream as a sequence by providing an iterator into it. The constructor istream_iterator< string, ptrdiff_t >(cin) builds an iterator pointing to the next string to be read from cin. The default constructor istream_iterator< string, ptrdiff_t >() builds an iterator effectively pointing to EOF. STL guarantees that the first iterator will equal the second once you've read everything there is to read from cin, so this is the standard idiom for reading everything from a string into an STL container.

The third iterator is also a bit complicated. Normally, iterators operate in overwrite mode. When you iterate through a sequence like a vector with a standard iterator, you pass over its existing elements rather than creating new elements, just as you'd expect. If you replaced this third iterator with something more intuitive, say, words.begin(), the copy algorithm would try to copy all the strings from cin on top of the existing elements of vector words. But because words is empty, you would quickly run out of space.

Note that inserter (Example 2) is a simple way of creating a new type of iterator, called an "insertion iterator." The first argument specifies the container you're using, and the second element describes where in the container you want to start inserting. Unlike standard iterators, the insertion iterator that is returned will insert new elements when incremented, rather than passing over existing elements. This is what you want when creating a new sequence based on an existing sequence of unknown size.

At this point, you need to decide what to do next. For starters, you can't remove the duplicates before putting all the words in a standard form. Furthermore, if you sort the words as they are, you'll just have to sort them again later. Consequently, standardizing seems like the most logical next step. Luckily, transforming all the elements of your vector into a standardized form is straightforward. A simple standard form for your words would be all lowercase with no punctuation. Suppose you have a function standardize() that does just that; it takes a string argument and returns that same string in all lowercase letters and with punctuation stripped out. All of your strings can then be standardized with a single call to the transform algorithm; see Example 3(a). This example traverses a sequence from words.begin() to words.end(), calling your standardize function on each element and placing the result of each call into the sequence starting at words.begin(). Note that you are not limited to transforming a sequence in place; you can use any forward iterator as the third argument. This usage is common for in-place transformations, but you can use the full generality of transform to combine your first two steps into one; see Example 3(b).

To wrap up the Lexicon filter example, you could remove duplicates first, following the reasoning that it's better to sort a shorter sequence than a longer one. However, Example 3(c) presents an alternative. The first line simply sorts the vector alphabetically from start to finish, using the STL's implementation of quicksort. The second line calls unique, an algorithm that compacts a sorted sequence by removing adjacent duplicates, then returns an iterator pointing just past the last element in the compacted sequence. The order of these two operations is crucial: Because removing duplicates in an unsorted sequence is an extremely expensive operation, unique only works if the sequence has been sorted so that all duplicates are adjacent. Although sorting a shorter sequence is faster than sorting a longer one, removing duplicates from an unsorted sequence is so slow that it's impractical. Finally, the third statement copies all the relevant elements to an output iterator constructed based on cout. The second argument to this constructor means that a newline will be added after each element written to this iterator.

The only thing left is to define standardize. The exact implementation of such a function depends on the string class you use. The STL-based class I use (from Modena Software's STL++ package) lends itself to a fairly simple definition. In Example 4, the first two lines remove all punctuation. Specifically, remove_if will compact the string by removing all characters for which the standard-library function ispunct returns True; the erase member function actually trims the trailing characters from the string. The third line of the function transforms the string in place by calling the standard C-library function tolower on each character.

Another Approach

Listing One is the code for the Lexicon example. The program's run-time performance is fairly good. Reading the data from standard input, transforming it, and storing it in our vector is an O(n) operation. Quicksort completes in O(nlogn) time on average. The unique algorithm and the final copy to cout are both O(n) operations. In short, the algorithm makes three complete passes through the data and does one efficient sort.

Still, it seems there may be a better way. If you could somehow maintain the internal data structure in alphabetical order throughout, you could simply discard duplicates as they are read rather than storing, sorting, and using unique to weed them out later. One way to do this is to use the STL set class instead of a vector. Example 5(a) shows the set-required changes to the first two lines of Example 2.

What difference does this small change make? The first line creates a set of strings that is sorted with respect to the functor less< string >, which means "sorted alphabetically." Sets are used in the STL primarily to facilitate the fast lookup of their keys; for that reason, they are always stored in sorted order. This also makes it efficient to check for duplicates when a new element is inserted, which is important because STL sets are defined to have no duplicate elements. Using sets instead of vectors, the transform algorithm can be used exactly as before, but with profoundly different results--duplicates are automatically discarded and the entire data structure is maintained in sorted order. The program is therefore considerably simplified, since all that's left to do is output the result; see Example 5(b). Likewise, the other steps (sort and unique) have become redundant. (Listing Two is the complete code for this version.)

The difference in speed between the approach illustrated in Example 2 and that of Example 5 is modest. On my machine, the set implementation is consistently about 13 percent faster than the original on large pieces of text. The set-based algorithm is much simpler--it has only one O(n) output pass, plus the original insertions--but inserting into a red-black tree is a much more expensive operation than appending on the end of a vector. Each lookup into a red-black tree requires O(logn) steps. If the new word is a duplicate, the "insertion" stops there; if not, a new node must be created and the tree adjusted so as to remain balanced within the strict rules of red-black trees. The time required to construct the set of words is therefore difficult to calculate because it depends on the degree of duplication with the word list, and each insertion becomes progressively more costly as the size of the set increases. (For more information on red-black trees, see "Red-Black Trees," by Bruce Schneier, DDJ, April 1992.)

Why Use the STL?

What has the STL bought you in these examples? The easiest way to answer this question is to look at all the things you didn't have to do:

Write your own dynamic data structures. Although an unbounded vector class is not very difficult to write, it is nontrivial, and having to write one from scratch would certainly have more than doubled the development time. A red-black tree is much harder; I've seen programmers wrestle with the balancing rules for a week and still not be 100 percent sure they are right.

Write your own sorting and searching routines. Many of us have written quicksorts and tree lookups before. This time, someone's done that for us.

Pay much extra for changing data structures and algorithms midstream.

The first point could be made about almost any class library--most already provide data structures such as vector and set that attend to memory management. Likewise, the second point could also be made about most libraries.

It's the third point, however, that I want to emphasize. My understanding of the problem evolved as I worked out the two solutions presented here. Although the algorithms are functionally equivalent and look similar at the highest level, the actual steps taken underneath are quite different. One version takes several passes along a simple, linear structure; the other painstakingly builds a complicated structure that, when fully constructed, essentially solves the problem for you. Yet when switching from one to the other, you only had to change one, and delete two, lines of code.

When I implemented the set version of the algorithm, I intuitively expected it to be faster than the vector implementation. It wasn't as much of an improvement as I thought, and the memory-usage calculations were discouraging because of all the extra pointers the red-black tree needs to maintain. Yet the price I paid for a little experimentation was small enough that it encouraged me to look for a better way. In doing so, I learned something about red-black trees and the diminishing returns of clever optimization when you start with solid, efficient data structures and algorithms in the first place. I also developed two programs with very different run-time characteristics, each of which might be a good solution under certain circumstances. Without the STL, the experience would have been different. For instance, once I had written a dynamic vector class and quicksort algorithm from scratch, it's unlikely I would have been willing to take the time to write a completely different program based on balanced trees. In the end, my program would likely have represented my first guess as to the best solution, because the cost of change would have been too great to warrant any experimentation.

Generic programming tools such as STL let us take a more evolutionary and experimental approach to programming. In this way, they can be a healthy complement (or even an antidote) to standard object-oriented design and programming techniques. Where classes encourage us to define interfaces in advance, STL allows us to experiment with the implementation until we get it right, gradually refining our understanding of problems and the trade-offs inherent in their solutions.

For More Information

The original reference implementation of the STL developed by Stepanov and Lee is available free-of-charge from HP Labs. You can download it from the directory /stl of butler.hpl.hp.com. Unfortunately, most compilers can't yet handle STL's sophisticated template operations (Borland C++ 4.5 is an exception). Two commercial STL implementations are available that work on a wider variety of platforms: STL++ from Modena Software (Los Gatos, CA) and STL<ToolKit> from ObjectSpace (Dallas, TX). The latter is unique in that it supports virtually all major UNIX C++ compilers, including those based on cfront. Compiler vendors will probably start bundling the STL with their products soon; it is already bundled with Metrowerks' CodeWarrior 6 for the Macintosh. Likewise, Symantec has announced upcoming STL support for its Macintosh C++ environments.

Information about STL is also available on the World Wide Web (http://www.cs.rpi.edu/~musser/stl.html) from David Musser, one of the original generic-programming researchers. The Modena string class used in this article is also available at this site, along with code samples and documentation.

Example 1: (a) STL find algorithm; (b) switching from a C array to an STL list.

(a) int numbers[ 100 ];
     ....
    int* i = find( numbers, numbers + 100, 37 );
(b) list< int > numbers;
     ....
    list< int >::iterator i = 
     find( numbers.begin(), numbers.end(), 37 );

Example 2: Using an STL vector to store data.

vector< string > words;
copy( istream_iterator< string, ptrdiff_t >( cin ),
      istream_iterator< string, ptrdiff_t >(),
      inserter( words, words.end() ) );

Example 3: (a) Standardizing all of the strings with a single call to the transform algorithm; (b) using the full generality of transform to combine the two steps into one; (c) sorting and removing duplicates.

(a) transform( words.begin(), words.end(), words.begin(),
                                           standardize );
(b) transform( istream_iterator< string, ptrdiff_t >( cin ), 
           istream_iterator< string, ptrdiff_t >(),
           inserter( words, words.end() ),
          standardize )
(c) sort( words.begin(), words.end() );
    vector< string >::iterator i = 
    unique( words.begin(), words.end() );  
    copy( words.begin(), i, 
        ostream_iterator< string >( cout, "\n" ) );

Example 4: Using a string class to define standardize.

string standardize( string s ) {
  string:iterator i = remove_if( s.begin(), s.end(), ispunct );
  s.erase( i, s.end() );
  transform( s.begin(), s.end(), s.begin(), tolower );
  return s;
}

Example 5: (a) Required changes to Example 2 for discarding duplicates; (b) outputting the result.

(a) set< string, less< string > > words;
    transform( istream_iterator< string, ptrdiff_t >( cin ), 
           istream_iterator< string, ptrdiff_t >(),
           inserter( words, words.end() ),
          standardize );
(b) copy( words.begin(), words.end(), 
        ostream_iterator< string >( cout, "\n" ) );

Listing One

#include <set.h>
#include <mstring.h>
#include <algo.h>
#include <ctype.h>
// Return a copy of the string in "standard" form (lowercase, no punctuation)
string standardize( string s ) {
  string::iterator i = remove_if( s.begin(), s.end(), ispunct );
  s.erase( i, s.end() );
  transform( s.begin(), s.end(), s.begin(), tolower );
  return s;
}
// Filter a text file into an alphabetzied list of unique words contained 
// in that file, ignoring case and punctuation.
int main( int argc, char** ) {
  if ( argc != 1 ) throw("usage: lexicon\n");
  
  set< string, less< string > > words;
  transform( istream_iterator< string, ptrdiff_t >( cin ), 
         istream_iterator< string, ptrdiff_t >(),
         inserter( words, words.end() ),
         standardize );
  copy( words.begin(), words.end(), ostream_iterator< string >( cout, "\n" ) );
  return( 0 );
}

Listing Two

#include <vector.h>
#include <mstring.h>
#include <algo.h>
#include <ctype.h>
// Return a copy of the string in "standard" form (lowercase, no punctuation)
string standardize( string s ) {
  string::iterator i = remove_if( s.begin(), s.end(), ispunct );
  s.erase( i, s.end() );
  transform( s.begin(), s.end(), s.begin(), tolower );
  return s;
}
// Filter a text file into an alphabetzied list of unique words contained 
// in that file, ignoring case and punctuation.
int main( int argc, char** ) {
  if ( argc != 1 ) throw("usage: lexicon2\n");
  vector< string > words;
  transform( istream_iterator< string, ptrdiff_t >( cin ), 
         istream_iterator< string, ptrdiff_t >(),
         inserter( words, words.end() ),
         standardize );
  sort( words.begin(), words.end() );
  vector< string >::iterator i = unique( words.begin(), words.end() );
  copy( words.begin(), i, ostream_iterator< string >( cout, "\n" ) );
  return( 0 );
}