C/C++ Contributing Editors


The Standard Librarian: Searching in the Standard Library

Matt Austern

The genius as well as the oversights in the design of the Standard C++ library surface in a simple discussion of its linear and binary search algorithms.


If you’re storing a collection of values, probably one of the reasons is so you can look things up in it. Searching for a specific item in a collection is an important problem; whole books have been written about it. Not surprisingly, the Standard C++ library provides many different techniques for searching.

In the C++ library, the usual way of referring to a collection is through a range of iterators. The range can be written as [first, last), where *first is the first value in the range and where last points just beyond the last value. In this notation, here’s how we can think of the general problem: given a range and a criterion, find the first iterator that points to a value satisfying the criterion. Because we write ranges in an asymmetrical form, we can avoid special cases: a search that fails can return last, and a range with zero elements can be written as [i, i).

Linear Search and Its Variants

The simplest form of searching is linear search, or, as Knuth calls it, sequential search: looking at each value in turn, and checking to see if it’s the one we’re looking for. If we have N elements in the range, then in the worst case we have to make N comparisons.

The standard library provides several versions of linear search; the two most important versions are find (which takes a range and a value x and looks for a value that’s equal to x) and find_if (which takes a range and a predicate p and looks for a value that satisfies p). Other versions of linear search are find_first_of (which takes two ranges, [first1, last1) and [first2, last2) and finds the first element in [first1, last1) that’s equal to any one of the values in the second range) and adjacent_find (which takes a single range and finds the first element that’s equal to its successor).

Suppose, for example, that v is a vector of int. You can find the first zero by writing:

vector<int>::iterator i =
  find(v.begin(), v.end(), 0);

You can find the first nonzero element by writing:

vector<int>::iterator i =
  find_if(v.begin(), v.end(),
          not1(bind2nd(equal_to<int>(),
          0)));

You can look for the first small prime by writing:

A[4] = { 2, 3, 5, 7 };
vector<int>::iterator i =
  find_first_of(v.begin(), v.end(),
                A+0, A+4);

You can find the first pair of duplicates by writing:

vector<int>::iterator i =
  adjacent_find(v.begin(), v.end());

There aren’t any separate versions that search backward through a range, but that’s because you don’t need them: you can get the same effect with a simple iterator adaptor. To find the last zero in v instead of the first, for example, all you have to write is:

vector<int>::reverse_iterator i =
  find(v.rbegin(), v.rend(), 0);

Linear search is a simple algorithm, and its implementation may not seem to present any important issues. Many books (including mine) show a simple implementation of std::find:

template <class InIter, class T>
InIter find(InIter first, InIter last,
            const T& val)
{
  while (first != last && !
    (*first == val))
    ++first;
  return first;
}

This is indeed a faithful implementation of the linear search algorithm, and of the C++ Standard’s requirements; the first template parameter’s name, InIter, means that this template argument need only satisfy the very weak Input Iterator requirements [1]. It might seem so simple that you might as well just write it out in your code by hand. Still, there’s one annoying problem: this implementation isn’t as efficient as it ought to be. The loop condition is complicated, requiring two tests for every element that gets examined. Conditional branches are expensive, and loops with complicated tests don’t get optimized as well as loops with simple ones.

One answer to this problem, used in some standard library implementations [2], is to “unroll” the loop, examining four elements per iteration. This is a fairly complicated solution, because find then has to deal with the leftover elements (ranges don’t always come in multiples of four!), and also because it requires that find dispatch on iterator category — unrolling is sensible only for a range of Random Access Iterators, so it’s still necessary to use the old implementation for the general case. Still, unrolling is effective: it reduces the number of tests per element from 2 to 1.25. It’s a technique that a standard library implementer can use without changing any interfaces.

You’ll see a different answer in the usual books about algorithms. The reason we need two tests for each element is that, if we get to the end of the range without finding what we’re looking for, we have to recognize that we’ve failed. But what if we happen to know that the element we’re looking for is always there and that the search will never fail? In that case, testing for the end of the range would be redundant; there wouldn’t be any reason for the search algorithm to keep track of the end of the range in the first place. Instead of std::find, we could write the linear search algorithm as follows:

template <class InIter, class T>
InIter
unguarded_find(InIter first, 
  const T& val)
{
  while (!(*first==val))
    ++first;
}

The version of linear search in Knuth [3] is more like unguarded_find than like std::find. Note that unguarded_find is not part of the C++ Standard. It’s more dangerous than find, and less general. You can only use it when you are certain there is an element that’s equal to val — which usually means you’ve put that element there yourself, as a sentinel at the end of the range. Using a sentinel isn’t always practical. (What if you’re searching through a read-only range?) But when it is practical, unguarded_find is simpler and faster than anything in the standard library.

Binary Search

Linear search is simple, and, for short ranges, it’s the best way to search for values. As ranges get longer and longer, however, it stops being a reasonable solution. I was recently reminded of this while working with XSLT. My XSLT script included a line that looked something like this:

<x:value-of select="/defs/data[@key=$r]"/>

The XSLT engine I ran this under must have used linear search. I was searching through a list, and performing this search for every item in a list. My script was O(N2), and it took minutes to run.

If you’re looking through a fully general range, you can’t do better than linear search. You have to examine every element, because otherwise the element you’re looking for might be the one you didn’t look at. But if you require that the range be structured in some way, you can do better.

You can, for example, require that the range be sorted. If you have a sorted range, you can use an improved version of linear search (once you get to an element that’s larger than the one you’re looking for, you don’t have to walk all the way to the end to know that the search has failed), but, even better, you can use binary search. By looking at an element in the middle of the range, you can tell whether the element you’re looking for is in the first or the second half; applying this division recursively, you can find the value you’re looking for without having to look at all the rest of the elements. Linear search requires O(N) comparisons, and binary search requires only O(log N).

The standard library includes four different versions of binary search: lower_bound, upper_bound, equal_range, and binary_search. All of them have the same signature: they take a range, an element to search for, and, optionally, a comparison function. The range must be sorted according to that function; if you don’t provide it, it must be sorted according to the ordinary < operator. So, for example, if you are searching through a range of ints arranged in ascending order, you could use the default. If you are searching through a range of ints arranged in descending order, you could pass an std::greater<int> as the comparison function.

Of the four binary search functions, the least useful one is the one with the most straightforward name: binary_search. All it returns is a simple yes or no: true if the value is somewhere in the range, false if it isn’t. But that information isn’t very useful by itself; I’ve never had occasion to use binary_search. If the value you’re searching for is there, you’ll probably want to know where it is; if it isn’t, you’ll often want to know where it would have been.

There are several different questions you might want to ask about a value’s location, which is why there are several different useful versions of binary search. The distinction becomes important when you have a range with several copies of the same element. Suppose, for example, that you have an array of ints, and you use both lower_bound and upper_bound to look for a value:

int A[10] = 
  { 1, 2, 3, 5, 5, 5, 5, 7, 8, 9 };

int* first = 
  std::lower_bound(A+0, A+10, 5);
int* last  = 
  std::upper_bound(A+0, A+10, 5);

The names first and last suggest the difference: lower_bound returns the first occurrence of the value you’re searching for (in this case, &A[3]), and upper_bound returns an iterator that’s just past the last occurrence of the value you’re searching for (in this case, &A[7]). If you search for a value that isn’t present, you’ll get the position where it would have been if it had been there. Using the same A as before, we can write:

int* first = 
  std::lower_bound(A+0, A+10, 6);
int* last  = 
  std::upper_bound(A+0, A+10, 6);

Both first and last will be equal to &A[7], because that’s the only place where a 6 could be inserted without violating the ordering.

In practice, of course you would never see a call to lower_bound followed immediately by a call to upper_bound. If you need both of those pieces of information, that’s where the last binary search algorithm comes in: equal_range returns a pair, the first element of which is the value that lower_bound would return and the second element of which is upper_bound’s return value.

Up until now I’ve been slightly sloppy in my discussion: I’ve been saying that lower_bound and upper_bound find a value, without saying exactly what that means. If you write

iterator i = 
  std::lower_bound(first, last, x);

and the search succeeds, are you guaranteed that *i is equal to x? Not necessarily! Neither lower_bound nor upper_bound ever tests for equality. They use the comparison function that you provide: either operator< or some other comparison function that acts like “less than.” What it means for the search to succeed is that *i is not less than x and x is not less than *i.

This may seem like a picky distinction, but it’s not. Suppose that you have some complicated record with many fields, and you’re using one of those fields as the sort key (a person’s last name, say). Two records may have the same key, in which case neither will be less than the other, even if all the other fields are different.

Once you start thinking about records and keys, another question about binary search becomes very natural: can you use binary search functions to look up a record by key? Just to be concrete, suppose we define a struct X as:

struct X {
  int id;
  ... // other fields
};

Suppose we have a vector<X>, sorted by the elements’ ids. How can you use binary search to find an X with a specified id, such as 148?

One answer is to create a dummy X object with the desired id and use the dummy in the binary search:

X dummy;
dummy.id = 148;
vector<X>::iterator
   = lower_bound(v.begin(), v.end(),
                 dummy,
                 X_compare);

For the moment, this is the most reliable method. It’s what you should use if you’re concerned about maximum portability. On the other hand, it’s not very elegant. You have to create a full X object, even though you aren’t using anything but a single field; the other fields have to be initialized to default or arbitrary values. That initialization might be inconvenient, expensive, or sometimes even impossible.

Is it possible to pass the id to lower_bound directly, perhaps by passing lower_bound a heterogeneous comparison function that takes an X and an id? That question doesn’t have a simple answer. The C++ Standard isn’t completely clear about whether such heterogeneous comparisons are allowed; in my opinion, the most natural reading of the Standard is that they aren’t. In practice today, heterogeneous comparisons work on some implementations but not on others. On the other hand, the C++ standardization committee considers this a defect, and in the future the Standard will probably be changed to make it unambiguous that heterogeneous comparisons are allowed [4].

Conclusion

The C++ library provides other forms of searching than the ones I’ve discussed here. With both find and lower_bound, the search is restricted to a single element, but the standard library also provides search, which looks for an entire subrange. You can, for example, use search to look for a word in a string:

std::string the = "the";
std::string::iterator i
  = std::search(s.begin(), s.end(),
                the.begin(), the.end());

The return value, i, will point to the beginning of the first place that "the" appears in s — or, as usual, the return value will be s.end() if "the" doesn’t appear anywhere. There’s also a variant that searches from the end instead of the beginning:

std::find_end(s.begin(), s.end(),
              the.begin(), the.end());

returns an iterator pointing to the beginning of the last occurrence of "the", rather than the first. (If you think it’s odd that a reverse variant of search is called find_end, rather than search_end, you’re not alone.)

Searching can be encapsulated in data structures. Most obviously, the standard library’s associative containers, set, multiset, map, and multimap, were designed specifically for efficient lookup by key [5]. The library’s string class also provides a host of member functions for searching: find, rfind, find_first_of, find_last_of, find_first_not_of, and find_last_not_of. My advice is to avoid them. I find these special member functions hard to remember because they have so many forms, and because their interface differs from the rest of the library; in any case, they don’t give you any functionality that you don’t already have from find, find_if, and search.

But if you still think you see some major omissions, you’re right! I haven’t mentioned hash tables, because the standard library has no hash tables. I mentioned subrange matching with search, but surely that’s just a special case of pattern matching — and the standard library has no regular expression search or anything like it.

The C++ standardization committee has just started thinking about extensions to the standard library, and both hash tables and regular expressions are obvious candidates for a future version of the Standard. If you see something that the standard library is missing, and if you would like to submit a proposal, now is when you should start getting it ready.

Notes

[1] See Table 72 in the C++ Standard. Some of the other search algorithms, which I discuss later, rely on the stronger Forward Iterator requirements.

[2] See, for example, <www.sgi.com/tech/stl>.

[3] See “Algorithm Q,” in §6.1 of D. E. Knuth, The Art of Computer Programming, vol. 2, Sorting and Searching, Second Edition (Addison-Wesley, 1998).

[4] See <http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/lwg-active.html#270>. Dave Abrahams had the insight that enabled the proposed resolution to this issue. He pointed out that it’s possible to think of binary searches not in terms of sorting and comparisons, but in terms of partitioning: we’re given a range with the property that all elements before a certain point satisfy a condition and all elements after it fail to satisfy the condition, and we’re looking for the transition point.

[5] But these containers aren’t the most efficient choice as often as one might think. See my earlier column “Why You Shouldn’t Use set — and What You Should Use Instead,” C++ Report, April 2000.

Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committee’s library working group. He works at AT&T Labs — Research and can be contacted at austern@research.att.com.