The genius as well as the oversights in the design of the Standard C++ library surface in a simple discussion of its linear and binary search algorithms.
If youre storing a collection of values, probably one of the reasons is so you can look things up in it. Searching for a specific item in a collection is an important problem; whole books have been written about it. Not surprisingly, the Standard C++ library provides many different techniques for searching.
In the C++ library, the usual way of referring to a collection is through a range of iterators. The range can be written as [first, last), where *first is the first value in the range and where last points just beyond the last value. In this notation, heres how we can think of the general problem: given a range and a criterion, find the first iterator that points to a value satisfying the criterion. Because we write ranges in an asymmetrical form, we can avoid special cases: a search that fails can return last, and a range with zero elements can be written as [i, i).
Linear Search and Its Variants
The simplest form of searching is linear search, or, as Knuth calls it, sequential search: looking at each value in turn, and checking to see if its the one were looking for. If we have N elements in the range, then in the worst case we have to make N comparisons.
The standard library provides several versions of linear search; the two most important versions are find (which takes a range and a value x and looks for a value thats equal to x) and find_if (which takes a range and a predicate p and looks for a value that satisfies p). Other versions of linear search are find_first_of (which takes two ranges, [first1, last1) and [first2, last2) and finds the first element in [first1, last1) thats equal to any one of the values in the second range) and adjacent_find (which takes a single range and finds the first element thats equal to its successor).
Suppose, for example, that v is a vector of int. You can find the first zero by writing:
vector<int>::iterator i = find(v.begin(), v.end(), 0);You can find the first nonzero element by writing:
vector<int>::iterator i = find_if(v.begin(), v.end(), not1(bind2nd(equal_to<int>(), 0)));You can look for the first small prime by writing:
A[4] = { 2, 3, 5, 7 }; vector<int>::iterator i = find_first_of(v.begin(), v.end(), A+0, A+4);You can find the first pair of duplicates by writing:
vector<int>::iterator i = adjacent_find(v.begin(), v.end());There arent any separate versions that search backward through a range, but thats because you dont need them: you can get the same effect with a simple iterator adaptor. To find the last zero in v instead of the first, for example, all you have to write is:
vector<int>::reverse_iterator i = find(v.rbegin(), v.rend(), 0);Linear search is a simple algorithm, and its implementation may not seem to present any important issues. Many books (including mine) show a simple implementation of std::find:
template <class InIter, class T> InIter find(InIter first, InIter last, const T& val) { while (first != last && ! (*first == val)) ++first; return first; }This is indeed a faithful implementation of the linear search algorithm, and of the C++ Standards requirements; the first template parameters name, InIter, means that this template argument need only satisfy the very weak Input Iterator requirements [1]. It might seem so simple that you might as well just write it out in your code by hand. Still, theres one annoying problem: this implementation isnt as efficient as it ought to be. The loop condition is complicated, requiring two tests for every element that gets examined. Conditional branches are expensive, and loops with complicated tests dont get optimized as well as loops with simple ones.
One answer to this problem, used in some standard library implementations [2], is to unroll the loop, examining four elements per iteration. This is a fairly complicated solution, because find then has to deal with the leftover elements (ranges dont always come in multiples of four!), and also because it requires that find dispatch on iterator category unrolling is sensible only for a range of Random Access Iterators, so its still necessary to use the old implementation for the general case. Still, unrolling is effective: it reduces the number of tests per element from 2 to 1.25. Its a technique that a standard library implementer can use without changing any interfaces.
Youll see a different answer in the usual books about algorithms. The reason we need two tests for each element is that, if we get to the end of the range without finding what were looking for, we have to recognize that weve failed. But what if we happen to know that the element were looking for is always there and that the search will never fail? In that case, testing for the end of the range would be redundant; there wouldnt be any reason for the search algorithm to keep track of the end of the range in the first place. Instead of std::find, we could write the linear search algorithm as follows:
template <class InIter, class T> InIter unguarded_find(InIter first, const T& val) { while (!(*first==val)) ++first; }The version of linear search in Knuth [3] is more like unguarded_find than like std::find. Note that unguarded_find is not part of the C++ Standard. Its more dangerous than find, and less general. You can only use it when you are certain there is an element thats equal to val which usually means youve put that element there yourself, as a sentinel at the end of the range. Using a sentinel isnt always practical. (What if youre searching through a read-only range?) But when it is practical, unguarded_find is simpler and faster than anything in the standard library.
Binary Search
Linear search is simple, and, for short ranges, its the best way to search for values. As ranges get longer and longer, however, it stops being a reasonable solution. I was recently reminded of this while working with XSLT. My XSLT script included a line that looked something like this:
<x:value-of select="/defs/data[@key=$r]"/>The XSLT engine I ran this under must have used linear search. I was searching through a list, and performing this search for every item in a list. My script was O(N2), and it took minutes to run.
If youre looking through a fully general range, you cant do better than linear search. You have to examine every element, because otherwise the element youre looking for might be the one you didnt look at. But if you require that the range be structured in some way, you can do better.
You can, for example, require that the range be sorted. If you have a sorted range, you can use an improved version of linear search (once you get to an element thats larger than the one youre looking for, you dont have to walk all the way to the end to know that the search has failed), but, even better, you can use binary search. By looking at an element in the middle of the range, you can tell whether the element youre looking for is in the first or the second half; applying this division recursively, you can find the value youre looking for without having to look at all the rest of the elements. Linear search requires O(N) comparisons, and binary search requires only O(log N).
The standard library includes four different versions of binary search: lower_bound, upper_bound, equal_range, and binary_search. All of them have the same signature: they take a range, an element to search for, and, optionally, a comparison function. The range must be sorted according to that function; if you dont provide it, it must be sorted according to the ordinary < operator. So, for example, if you are searching through a range of ints arranged in ascending order, you could use the default. If you are searching through a range of ints arranged in descending order, you could pass an std::greater<int> as the comparison function.
Of the four binary search functions, the least useful one is the one with the most straightforward name: binary_search. All it returns is a simple yes or no: true if the value is somewhere in the range, false if it isnt. But that information isnt very useful by itself; Ive never had occasion to use binary_search. If the value youre searching for is there, youll probably want to know where it is; if it isnt, youll often want to know where it would have been.
There are several different questions you might want to ask about a values location, which is why there are several different useful versions of binary search. The distinction becomes important when you have a range with several copies of the same element. Suppose, for example, that you have an array of ints, and you use both lower_bound and upper_bound to look for a value:
int A[10] = { 1, 2, 3, 5, 5, 5, 5, 7, 8, 9 }; int* first = std::lower_bound(A+0, A+10, 5); int* last = std::upper_bound(A+0, A+10, 5);The names first and last suggest the difference: lower_bound returns the first occurrence of the value youre searching for (in this case, &A[3]), and upper_bound returns an iterator thats just past the last occurrence of the value youre searching for (in this case, &A[7]). If you search for a value that isnt present, youll get the position where it would have been if it had been there. Using the same A as before, we can write:
int* first = std::lower_bound(A+0, A+10, 6); int* last = std::upper_bound(A+0, A+10, 6);Both first and last will be equal to &A[7], because thats the only place where a 6 could be inserted without violating the ordering.
In practice, of course you would never see a call to lower_bound followed immediately by a call to upper_bound. If you need both of those pieces of information, thats where the last binary search algorithm comes in: equal_range returns a pair, the first element of which is the value that lower_bound would return and the second element of which is upper_bounds return value.
Up until now Ive been slightly sloppy in my discussion: Ive been saying that lower_bound and upper_bound find a value, without saying exactly what that means. If you write
iterator i = std::lower_bound(first, last, x);and the search succeeds, are you guaranteed that *i is equal to x? Not necessarily! Neither lower_bound nor upper_bound ever tests for equality. They use the comparison function that you provide: either operator< or some other comparison function that acts like less than. What it means for the search to succeed is that *i is not less than x and x is not less than *i.
This may seem like a picky distinction, but its not. Suppose that you have some complicated record with many fields, and youre using one of those fields as the sort key (a persons last name, say). Two records may have the same key, in which case neither will be less than the other, even if all the other fields are different.
Once you start thinking about records and keys, another question about binary search becomes very natural: can you use binary search functions to look up a record by key? Just to be concrete, suppose we define a struct X as:
struct X { int id; ... // other fields };Suppose we have a vector<X>, sorted by the elements ids. How can you use binary search to find an X with a specified id, such as 148?
One answer is to create a dummy X object with the desired id and use the dummy in the binary search:
X dummy; dummy.id = 148; vector<X>::iterator = lower_bound(v.begin(), v.end(), dummy, X_compare);For the moment, this is the most reliable method. Its what you should use if youre concerned about maximum portability. On the other hand, its not very elegant. You have to create a full X object, even though you arent using anything but a single field; the other fields have to be initialized to default or arbitrary values. That initialization might be inconvenient, expensive, or sometimes even impossible.
Is it possible to pass the id to lower_bound directly, perhaps by passing lower_bound a heterogeneous comparison function that takes an X and an id? That question doesnt have a simple answer. The C++ Standard isnt completely clear about whether such heterogeneous comparisons are allowed; in my opinion, the most natural reading of the Standard is that they arent. In practice today, heterogeneous comparisons work on some implementations but not on others. On the other hand, the C++ standardization committee considers this a defect, and in the future the Standard will probably be changed to make it unambiguous that heterogeneous comparisons are allowed [4].
Conclusion
The C++ library provides other forms of searching than the ones Ive discussed here. With both find and lower_bound, the search is restricted to a single element, but the standard library also provides search, which looks for an entire subrange. You can, for example, use search to look for a word in a string:
std::string the = "the"; std::string::iterator i = std::search(s.begin(), s.end(), the.begin(), the.end());The return value, i, will point to the beginning of the first place that "the" appears in s or, as usual, the return value will be s.end() if "the" doesnt appear anywhere. Theres also a variant that searches from the end instead of the beginning:
std::find_end(s.begin(), s.end(), the.begin(), the.end());returns an iterator pointing to the beginning of the last occurrence of "the", rather than the first. (If you think its odd that a reverse variant of search is called find_end, rather than search_end, youre not alone.)
Searching can be encapsulated in data structures. Most obviously, the standard librarys associative containers, set, multiset, map, and multimap, were designed specifically for efficient lookup by key [5]. The librarys string class also provides a host of member functions for searching: find, rfind, find_first_of, find_last_of, find_first_not_of, and find_last_not_of. My advice is to avoid them. I find these special member functions hard to remember because they have so many forms, and because their interface differs from the rest of the library; in any case, they dont give you any functionality that you dont already have from find, find_if, and search.
But if you still think you see some major omissions, youre right! I havent mentioned hash tables, because the standard library has no hash tables. I mentioned subrange matching with search, but surely thats just a special case of pattern matching and the standard library has no regular expression search or anything like it.
The C++ standardization committee has just started thinking about extensions to the standard library, and both hash tables and regular expressions are obvious candidates for a future version of the Standard. If you see something that the standard library is missing, and if you would like to submit a proposal, now is when you should start getting it ready.
Notes
[1] See Table 72 in the C++ Standard. Some of the other search algorithms, which I discuss later, rely on the stronger Forward Iterator requirements.
[2] See, for example, <www.sgi.com/tech/stl>.
[3] See Algorithm Q, in §6.1 of D. E. Knuth, The Art of Computer Programming, vol. 2, Sorting and Searching, Second Edition (Addison-Wesley, 1998).
[4] See <http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/lwg-active.html#270>. Dave Abrahams had the insight that enabled the proposed resolution to this issue. He pointed out that its possible to think of binary searches not in terms of sorting and comparisons, but in terms of partitioning: were given a range with the property that all elements before a certain point satisfy a condition and all elements after it fail to satisfy the condition, and were looking for the transition point.
[5] But these containers arent the most efficient choice as often as one might think. See my earlier column Why You Shouldnt Use set and What You Should Use Instead, C++ Report, April 2000.
Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committees library working group. He works at AT&T Labs Research and can be contacted at austern@research.att.com.