Software Tools


You Can Count On It

Beman Dawes


Beman Dawes has been an independent software consultant for fifteen years, specializing in libraries to support geographic applications. He is a voting member of X3J16, the ANSI C++ Standards Committee. You can reach him at (804) 787-8228 or via the Internet at beman@dawes.win.net.

Introduction

The Standard Template Library (STL), developed by Alex Stepanov and Meng Lee of Hewlett-Packard, became part of the draft C++ Standard in July, 1994, when the joint ANSI/ISO C++ Standards Committee voted to add the STL to the Standard C++ library. The STL includes generic algorithms, containers, iterators, and other goodies. It can be extended with user-defined components. Even though the STL proposal surfaced uncomfortably late in the C++ standardization process, committee members felt that the long-term benefits of a much more powerflibrary overwhelmed any short-term committee discomfort.

The Standard Template Library is hot technology. Initial skepticism on the C++ committee has turned to outright admiration. Vendors are announcing and shipping STL-related products. Well known authors are writing STL articles and books.

Meanwhile, back in the everyday world where I write software for a living, a ten-year old C command-line utility program called Count needed an update. Count simply reads a file and keeps a frequency count of each value that appears in a specified field. At end of file, the values and frequency counts are displayed on the standard output ordered by value. Two limitations in the old C code needed fixing — poor performance on sorted input and lack of frequency-ordered output.

A rewrite of Count seemed like a good project to gain practical STL programming experience. This article examines the development of a C++ version of Count using the STL.

STL Implementations

HP has made a free implementation of the STL available via FTP from butler.hpl.hp.com. The file to download is stl/stl.zip for PC users, or stl/sharfile.Z for the UNIX crowd. A PostScript document describing the STL is included.

The STL can also be obtained from Modena Software (1-800-MODENA-1) in a product called STL++, which includes a tutorial and sample programs. Modena's implementation is a repackaging of the HP distribution, together with several non-STL classes. [ObjectSpace (1-800-OBJECT-1) has also been advertising their STL <ToolKit> lately. — pjp]

A word of caution here. Some of the details of the the STL are bound to change as the draft C++ Standard matures. For example, the names of header files and member functions may change. Furthermore, the STL is a torture test for template compilation, so a certain amount of working around compiler limitations is to be expected. The code in this article compiles and runs under Borland C++ 4.5 using the HP STL distribution of February 7, 1995. Before compiling, be sure to comment out the two-argument forms of the min and max templates in algobase.h, as described in the HP readme.old file.

Count1

The final version of Count has lots of messy code to deal with command-line arguments and optional features, so I begin with a simple program, Count1, and develop the final program as a series of stepwise refinements. The Count1 program just builds a count of each string value read from the standard input, then displays the counts and strings when end of file is reached. The complete program is given in Listing 1.

The STL map class template provides an ordered associative container for unique keys, and so is just what is needed to hold the field values and counts. The field value is the association key. The standard library string class is ideal for the keys since it can handle all the details of memory management. A long holds the count associated with each key.

One of the Standard C++ library innovations introduced by the STL is the specification of time complexity for operations. Insertion, the most time-consuming operation Count1 needs to do on the map, is required to be of logarithmic (or better) complexity. The practical implications of this requirement are that Count's speed should not degrade badly if the input happens to be in some particular order, and that Count's code should be portable to other implementations of the STL without fear of poor performance.

The first interesting line of code in Count1 is a type definition:

typedef map< string, long,
           less<string> > map_t;
This defines map_t to be an STL map container with string keys and long values. Providing a typedef with a short name is a coding convention used to avoid repeating the lengthy map type declaration elsewhere in the program.

The map uses less<string>, a templatized function object class, to order the container entries. The map's less<string> template argument isn't needed in theory since the default for the argument is less<Key> where Key is the first template argument. In practice most compilers don't support default template arguments yet, so the argument is required for the time being.

Once main begins execution, the program defines ct_map as the container that will be used to hold the frequency counts, and str as the string that will be used to hold input:

map_t ct_map;
string str;
The input string is read, inserted into the ct_map container, and the count incremented in the following loop:

while ( cin >> str) {
   ct_map.insert(
      map_t::value_type(str, 0L) );
   ++ct_map[str]; // increment the count
   }
The argument for the insert function deserves a bit of explanation. STL containers like the map class provide a number of type definitions. Nesting a type definition inside a class effectively creates type members similar to the more familiar data and function members. In other words, type definitions allow the map class to export certain type names. The map class definition contains the type definition:

typedef pair<const Key, T> value_type;
where Key and T are the first and second template arguments to map, and pair is an STL template class that holds a pair of objects. Thus the insert argument:

map_t::value_type(str, 0L)
has the effect of creating a pair<string, long> with the appropriate values for insertion into the map.

Since an STL map container's insert function has no effect if the key already exists, the insert is harmless if the program has already seen the string. If the string is not already present, the count inserted will be zero, so that the count increment works correctly in either case.

Once end of file is reached, the only thing remaining is to display the results:

for ( map_t::const_iterator it = it != ct_map.end(); ++it )
    { cout << setw(7) << (*it).second
          << "-"     << (*it).first <<
          endl;
    }
Iterators are pointer-like objects used to access the contents of containers. Each STL container supplies member functions that return the iterator range for the container. begin returns an iterator that designates the first element. end returns a sentinel iterator designating the end of the range. Both const and non-const iterator type definitions are supplied by the map class.

In this case, the display for statement defines a const_iterator named it to use in stepping through the container. (This idiom for processing a range of iterators is essentially universal throughout the STL.) Dereferencing a map_t::const_iterator yields a map_t::value_type, which in turn is defined as an STL pair class template, and so allows access to the pair members first and second which contain the key and the count respectively.

While iterators are pointer-like, they may not actually be pointers. The draft C++ Standard requires that STL iterators support operator* but not necessarily operator->, so (*it).XXX, rather than it->XXX, is used to access the container contents. operator-> on iterators is not supported because it would not work for iterators to built-in types, such as int.

The explanation of iterators given here is far from complete, since this article is intended to be introductory rather than exhaustive. Luckily, iterators are easy to use even without an exhaustive explanation.

Test Results

Count1

When Count1 is given the input:

how much wood could a woodchuck chuck
if a woodchuck could chuck wood
the program displays:

   2 - a
   2 - chuck
   2 - could
   1 - how
   1 - if
   1 - much
   2 - wood
   2 - woodchuck

Count2

The second program, Count2 (see Listing 2) , refines Count1 by adding a layer of indirection in two places. Otherwise, Count1 and Count2 have the same functionality. The first indirection is to add a type called long_t which stores a long value and has a default constructor that initializes the value to zero:

struct long_t {
   long v;
   long_t() : v(0) {}
   };
This type is used instead of a long for map_t, allowing ct_map insertion to be simplified:

typedef map< string, long_t,
           less<string> > map_t;
.....
while ( cin >> str ) ++ct_map[str].v;
This works because the STL map container's operator[] is specified to insert the pair ( key, T() ) if the key is not found. The type T must have a default constructor which sets the value to zero, and that's why long_t rather than just plain long is required. The C++ standards committee has now specified that T() generate a zero value for built-in types, so long_t will no longer be needed once compilers catch up to the standard.

Count2's second refinement is to add a function to display the results:

static inline int
show_m( const map_t::value_type& x )
    { cout << setw(7) << x.second.v
          << "-"    << x.first <<
          endl;
    }
This allows the replacement of the display for loop with a simpler call to the STL for_each algorithm. STL algorithms are implemented as function templates. The for_each algorithm calls the show_m function with each item in an STL iterator range:

for_each( ct_map.begin(), ct_map.end(),
   ptr_fun( show_m ) );
As is typical of STL algorithms, the first two arguments to the for_each function supply iterators which define the range the algorithm is to be applied to.

STL algorithms that need to call a function don't directly pass the function address as an argument. Instead they pass function objects which provide the desired functionality via an operator() member function.

Passing STL function objects can be more efficient than passing a function address. The reason for this efficiency is that the function object's code gets inlined directly into the calling code. To understand this better, look at a possible implementation of the STL for_each function:

template <class InputIterator,
        class Function>
   inline void
   for_each(InputIterator first,
           InputIterator last,
           Function func)
      {
      while (first != last)
         func(*first++);
      }
The call to func is compiled as an inline call to func::operator(). This allows the template code to be completely generic, yet is just as efficient as if hand tailored for a specific data type.

It takes a bit more boilerplate to define a function object than to define a function. The STL ptr_fun function template is used in the Count2 for_each call to painlessly generate the necessary function object boilerplate.

Incidentally, STL algorithms work on iterator ranges, and STL iterator range requirements are met by pointers to regular C++ arrays as well as iterators to more exotic containers. Thus for_each and other STL algorithms work fine on C++ arrays, using regular C++ pointers as the iterators.

Count3

Count3 (Listing 3) refines Count2 by displaying the results ordered by frequency count, in addition to the display ordered by string value. The strategy is to create the map as before, and at end of file copy the map contents to an STL vector container which is easily sorted. The vector is then sorted and the results displayed.

The first addition defines a couple of types:

struct ct_t {
   string str;
   long_t count;
   };
typedef vector< ct_t > vec_t;
Like the earlier map_t type definition, vec_t improves program readability by providing a short synonym for the longer type declaration.

It might appear that it would be better to define ct_t as map_t::value_type via a type definition, but this won't work because the first member of the underlying pair<> type is const, and this prevents the vector template from instantiating.

Count3 uses two additional functions to specify descending vector sort order and to display the vector after sorting:

static inline bool order( const ct_t& x, const ct_t& y )
    { return x.count.v > y.count.v; }

static inline int show_v( const ct_t& x )
    { cout << setw(7) << x.count.v
          <<"-" <<x.str << endl; }
The code in main is unchanged from Count2 until after the display of the map. Then the vector is defined, sized, and loaded with the contents of the map container:

vec_t ct_vec;

ct_vec.reserve(ct_map.size() );

for ( map_t::iterator it = ct_map.begin(); it != ct_map.end(); ++it ) {
    ct_t ct;
    ct.str = (*it).first;
    ct.count = (*it).second;
    ct_vec.push_back( ct ); //append
}
The call to ct_vec.reserve is not strictly necessary since vectors resize automatically as needed. But reserve improves performance by eliminating unnecessary reallocations.

There are several possible ways to copy the contents of the map to the vector. STL containers like vector include constructors which automatically insert an iterator range, but this requires member templates, a feature not implemented by most current compilers. The for loop actually used in Count3 takes a few more lines of code, but is straightforward and doesn't require an advanced compiler.

Once the vector is loaded, sorting is simple using one of the STL sort algorithms:

stable_sort( ct_vec. begin(), ct_vec.end(),
   ptr_fun( order ) );
The current implementation of stable_sort has an apparent bug which causes an occasional program crash, so the actual code in Listing 3 uses the STL sort function instead.

The final Count3 refinement is to display the sorted results:

cout << "Ordered by count:" << endl;
for_each(ct_vec.begin(), ct_vec.end(),
   ptr_fun( show_v ) );
Here is what the count ordered output looks like for the test file:

Ordered by count:
   2 - a
   2 - chuck
   2 - could
   2 - wood
   2 - woodchuck
   1 - how
   1 - if
   1 - much

Count

The final Count program (Listing 4) adds a lot of code to handle command-line arguments, extract the field to be counted from input lines, and deal with errors. Yet the critical code that actually does the counting is almost unchanged from the Count3 program.

The STL provided exactly the functionality needed for Count, and does so in very few lines of code. Design, coding, and testing all proceeded smoothly. I did get sidetracked several times trying out different approaches, such as for_each functions versus for statements, but these diversions tended to be educational rather than frustrating, since each approach really did work.

Conclusion

The project to create a C++ version of Count encountered no serious difficulty in applying the STL to a real program. Although the STL can be initially intimidating because it requires learning a number of unfamiliar concepts and mastering the details of lots of library components, when I actually started to write code, I found the STL interfaces to be easy to use. I made a few coding errors, but these were all detected at compile time. The compilation error messages were easily fixed, and then each program ran correctly on the first test.

One of the really nice aspects of the STL is that its interfaces are rich enough that there are several ways to attack any given task. This richness makes the STL more comfortable to use than libraries which are so brittle that users must do things a certain way or not at all.

It is hard to fully appreciate the power and usefulness of the STL just by reading about it. So download a copy (or buy a commercial version) and try it out on your own projects. Almost all C++ programmers are going to want to learn at least the basics of this important addition to Standard C++.

References

Alexander Stepanov and Meng Lee. The Standard Template Library, Hewlett-Packard Laboratories, Palo Alto, February 7, 1995. (Included in FTP distribution.)