Dr. Dobb's Journal April 2000
Caches are wonderful. My trusty old desktop computer runs at 400 MHz, or a 2.5-nanosecond cycle time. It has 128 MB of RAM, with a 70-nanosecond access time that is a factor of almost 30 slower than the cycle time. Two caches bridge that enormous chasm -- 32 KB of Level 1 (L1) cache that can be accessed in a single cycle, and 512 KB of Level 2 (L2) cache accessed in a few cycles. The system therefore combines the huge size and cheap price of the RAM with the blazing speed of the L1 cache. Usually.
Caching often works well, but sometimes fails utterly. In this column, I'll examine why that happens, and what you can do about it. I'll tackle this big-systems issue by concentrating on one tiny programming problem -- binary search. When I ignored the effects of caching, a simple experiment I conducted mispredicted the run time of a binary search by a factor of four. When I viewed caching properly, I was able to speed up a related binary search by a factor of two. I'll start by reviewing binary search, and then move to a simple program for experimenting on its run time.
I'm thinking of a number between 1 and 1000; you guess it. 500? Too low. 750? Too high. And so the game goes, until you get my number in about 10 guesses.
The optimal strategy for this game uses binary search, a textbook divide-and-conquer algorithm. Start with a known range, and use each probe to divide the range in half. Stop when you find the item, or when the range becomes empty.
Exercise 1. Show that a binary search will take at most 1+log2n probes to locate a given value among n inputs.
Fancy applications of a binary search examine complex structures, pruning a fixed portion of the search space with each probe. The most common use of a binary search in programs, though, is to look up an element in a sorted array.
Exercise 2. Write code to determine whether the sorted array x[0..n-1] contains the target value t.
Solution 2 (at the end of this column) gives the simplest approach I know for binary search. It is easy to implement and relatively fast; I have used similar code in many programs. Unfortunately, if the array contains several copies of the target, the code might return any one of them.
The rest of this column uses a binary search that, when the array contains many copies of the target, always returns the first occurrence, or the place where the target would be if it isn't present. The code uses the two integer indices l and u (for "lower" and "upper") to maintain the loop invariant shown in Figure 1. Each probe changes one of the two values, and the loop continues until the indices meet.
The binary search operates on objects of type DataType. Listing One defines the type as:
typedef int DataType;
The sorted array x to be searched has n elements:
DataType x[MAXN];
int n;
The binary search has this specification:
int binarysearch(DataType t, int lo, int hi)
/* pre: lo <= hi and x[lo..hi] is sorted */
/* post: x[result-1] < t and x[result] >= t */
The postcondition assumes that the array has the two fictitious elements x[-1] = -∞ and x[n] = ∞.
Exercise 3. How would you add logical terms to the postcondition to avoid using the fictitious elements?
This function implements the specification using the invariant in the previous section:
int binarysearch(DataType t, int lo, int hi)
{ int l, u, m;
l = lo - 1;
u = hi + 1;
while (l+1 != u) {
m = (l + u) / 2;
if (x[m] < t)
l = m;
else
u = m;
}
return u;
}
Exercise 4. How would you convince yourself that this delicate code is correct?
Exercise 1 observes that a binary search in an array of n elements will use about log2n steps. How will that translate into processor time?
That question was easy to answer in the old days. A timing program reads a value for n, then fills the array with n values in order:
for (i = 0; i < n; i++)
x[i] = 10*i;
The measurement code starts a clock, searches for each value in turn, and computes the run time.
start = clock();
for (i = 0; i < n; i++) {
t = x[i];
pos = binarysearch(t, 0, n-1);
assert(pos == 0 || x[pos-1] < t);
assert(pos == n || x[pos] >= t);
}
clicks = clock() - start;
nanosecs_per_search = 1e9*clicks
/ ((float) CLOCKS_PER_SEC*n);
The assert statements ensure that the function meets its specification. The real code in fact iterates the main loop a number of times to get large enough clock values when n is small.
Before microprocessors had caches, such simple code was very useful. Running the experiment on eight evenly spaced values of n led to the well-behaved graph of 8086 run times in Figure 2. The data points all lie quite close to the line at 67+31log2n microseconds. That was a useful engineering approximation: Most searches took about that much time.
This experiment measures the cost of successful searches that find the values for which they search.
Exercise 5. How would you measure the average cost of unsuccessful searches that do not find their target values?
When I ran the experiment on several modern computers, I got similar lines, but with much smaller constants. But what do those times mean on a modern machine? Are they accurate predictors of the cost of typical searches?
If searches look for each element in order, the elements used by one binary search are almost exactly the elements used by the search before it. (On average, only two elements are different.) Those elements are usually in the L1 cache, and are therefore cheap to access.
Unfortunately, most real-world searches don't take place in rigid order. To model such searches, the next experiment will search for every element in the input vector in a random order.
Exercise 6. Modify the timing scaffolding to perform the searches in a random order.
Figure 3 shows the results of the scrambled experiment on an AMD K6-2 processor. The ordered experiment gives the same straight line that was observed on a machine without caches. The scrambled times clearly show the effect of caching: They have almost the same cost as the ordered times through n=8000 (when the 32,000 bytes of 4-byte integers barely squeeze into the 32-KB L1 cache). At higher values, though, the data spills out into the L2 cache and finally into RAM, and the cost of each access rises from dozens to hundreds of nanoseconds.
Exercise 7. Download the program and measure it on your machine. Can you spot the boundaries between L1, L2, and RAM?
When the data fits in the L1 cache, the ordered and scrambled searches cost about the same. For larger instances, the ordered searches are much cheaper. Like many difficulties, this problem can also be viewed as an opportunity.
Exercise 8. When is it possible to order queries to speed up a series of searches?
I was recently working on a program in which a sorted array contained many copies of some values. The bottleneck in the program boiled down to searching for the values equal to an input, and choosing one of them at random.
My first approach to the problem started with a binary search for the first value, and then sequentially scanned through subsequent equal values to choose one at random:
first = [binary search for first instance of t]
i = 0
while (x[first+i] == t)
    if (rand() % ++i == 0)
        choice = first + i - 1
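A runnable version of that first approach might look like the following sketch; the function name choose_random_equal and the bounds check on first+i are my additions, not the column's code. The key trick is reservoir sampling: the ith equal value replaces the current choice with probability 1/i, so every duplicate is equally likely to win.

```c
#include <stdlib.h>

/* Given the index `first` of the first occurrence of t in the sorted
   array x[0..n-1], scan the run of equal values and return the index
   of a uniformly chosen one (reservoir sampling). */
int choose_random_equal(const int *x, int n, int first, int t)
{   int i = 0, choice = first;
    while (first + i < n && x[first + i] == t) {
        if (rand() % ++i == 0)      /* keep index first+i-1 with prob 1/i */
            choice = first + i - 1;
    }
    return choice;
}
```

The scan costs O(m) for m duplicates, which is exactly the sequential-search bottleneck the column goes on to remove.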
The program took more time than I expected, and I finally found that for my input data, the sequential search took the program from O(n log n) to O(n2) operations. I therefore replaced the sequential search with a second binary search:
first = [binary search for t using < ]
last = [binary search for t using <=] - 1
choice = random integer in first..last
The first binary search uses the less-than operator (exactly as in the binarysearch function previously described) to find the first instance of t in the array. The second binary search changes that operator to less-than-or-equal to find the first value greater than t; subtracting one gives the last instance of t. This change restored the O(n log n) performance, and sped the program up by a factor of 10 on reasonable inputs.
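The pair of searches can be sketched as one parameterized function; the name search and its strict flag are illustrative stand-ins, not the column's binarysearch variants.

```c
#include <stdlib.h>

/* strict=1 compares with <, finding the first index with x[m] >= t;
   strict=0 compares with <=, finding the first index with x[m] > t.
   The run of values equal to t is then first..last = lower..upper-1. */
int search(const int *x, int n, int t, int strict)
{   int l = -1, u = n, m;
    while (l + 1 != u) {
        m = (l + u) / 2;
        if (strict ? x[m] < t : x[m] <= t)
            l = m;
        else
            u = m;
    }
    return u;
}

/* Usage: pick a random occurrence of t in sorted x[0..n-1],
   assuming t is present. */
int random_occurrence(const int *x, int n, int t)
{   int first = search(x, n, t, 1);
    int last  = search(x, n, t, 0) - 1;
    return first + rand() % (last - first + 1);
}
```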
Exercise 9. Suppose that m of the n values in the array are equal to t. The first method (binary search then sequential search) uses about m+log2n comparisons, while the second method (two binary searches) uses about 2 log2n comparisons. Can you find a method that uses about log2n+2log2m comparisons?
When I implemented my program, I stumbled across a caching problem that gave me another factor of two.
When I first wrote the pair of binary searches, I used pseudocode like this:
first = binarysearch(t, x, 0, n-1, < )
last = binarysearch(t, x, first, n-1, <=) - 1
Without thinking, I used the result of the first search as a lower bound for the second search. I'll call this approach a "warm start" to contrast it with a "cold start," which always uses zero as a lower bound for the second search (by replacing the parameter first in the second call with zero). I used a warm start out of force of habit; I've learned to avoid paying for the same ground twice. But is that the right choice for today's processors?
Exercise 10. How many comparisons will the warm start use compared to the cold start? What effect will each approach have on caching?
Solution 10 shows that, under reasonable assumptions, the warm start method should use about one or two fewer comparisons than the cold start. Figure 4 illustrates the run time of the two methods on an AMD K6-2. When the data fits entirely in the L1 cache, the warm start is indeed faster (by about the one or two predicted comparisons). When the data spills out of the cache, though, the cold start grows a little, while the warm start grows a lot. Why is the cold start more cache-friendly than the warm start?
When the cold start performs the second binary search, it always uses the same starting comparison as the first search (the middle element of the array). The next comparisons are almost always the same, and so on until the final few comparisons. The cold start uses 2log2n comparisons, but only accesses about log2n distinct elements. In the cold start, caching makes you a "buy one, get one free" offer: as long as the second search follows closely on the first, its elements are already in the cache.
The warm start, on the other hand, usually accesses a new element in its first comparison, and typically accesses new elements all the way down the search. So even though it uses one less comparison, it accesses about 2log2n distinct elements, or twice as many. The slope of the right part of the warm start curve in Figure 4 is almost exactly twice the slope of the cold start curve. In the warm start, clever algorithm design makes the offer, "Buy one, get two comparisons off the price of the second." Unfortunately, that negates the benefit of caching.
This column illustrates several important points about the young field of "cache-conscious code tuning."
Measure Times on Appropriate Inputs. Some inputs display enough locality to make them particularly efficient for caching; others don't. Be sure your test inputs accurately model the inputs your code will face in production.
Look for the Knees. On the cacheless memories of my programming youth, run times were often straight lines to asymptopia, like the data in Figure 2. On machines with caches, though, run times usually have sharp bends like those in Figures 3, 4, and 5. Graph the run time of your program, and hunt for those twists. What is fast on one side of the curve may be slow on the other.
More Old Work May Be Cheaper Than Less New Work. Out of habit, I used a warm start in my original pair of binary searches. When I thought hard about caching, I tried the cold start by changing one variable in the source code, and saw a speedup of close to a factor of two.
Machine-Independent Caching. To tweak the last cycle of performance out of key inner loops, it is sometimes necessary to labor over each machine instruction. A basic awareness of caching is often sufficient to gain substantial speedups across a wide variety of machines.
I came across the problems described in this column as I was working on my book Programming Pearls, Second Edition (Addison-Wesley, 2000). The themes of this column run throughout the book; the index contains 14 entries for "binary search" and 10 entries for "cache sensitive code." The web site http://www.programmingpearls.com/ gives additional references to these topics, especially to Anthony LaMarca's important work on caches and algorithms.
2. This code searches the sorted array x[0..n-1] for the target value t:
lo = 0; hi = n-1
loop
    if lo > hi
        return "not found"
    mid = (lo + hi) / 2
    case
        x[mid] <  t: lo = mid + 1
        x[mid] == t: return "found at " mid
        x[mid] >  t: hi = mid - 1
The invariant of the loop is that if t is anywhere in the array, then it will be in x[lo..hi].
3. The strengthened conditions are shown in the scaffolding code on page 111.
6. To scramble the searches, the permutation vector
int p[MAXN];
is initialized to index to the elements in order:
for (i = 0; i < n; i++) p[i] = i;
This code scrambles the elements to be a random permutation:
for (i = n-1; i > 0; i -- ) {
j = rand() % (i + 1);
t = p[i]; p[i] = p[j]; p[j] = t;
}
The new experiment chooses as the search value not the ith element in the array, but rather the ith scrambled element:
t = x[p[i]];
7. Figure 5 shows the run time of a binary search on scrambled input data on a Pentium II processor with 32,000 bytes of L1 data cache and 512,000 bytes of L2 cache. The straight lines are drawn between the endpoints in the three regions, and seem to represent the data accurately. If all the queries are known in advance, sorting them could induce locality and thereby speed up the caching. If the queries are known in advance, though, a series of binary searches is probably not the most efficient algorithm for the job. (On other problems, however, I have ordered queries to induce locality and thereby sped up programs by a factor of four.)
10. Assume that every array index is equally likely to be the value of first. For the n/2 values of first less than the median, a warm start reduces the search cost by at most one comparison. For the n/4 values between the median and the 75th percentile, the cost is reduced by at most two comparisons. The next n/8 values save at most three comparisons, and so forth. The series 1/2 + 2/4 + 3/8 + ... sums to 2, so the average search cost is reduced by at most that amount. A similar analysis shows that the search cost is reduced by at least one. To find the exact answer, mathematically inclined readers might try using Stirling's approximation to
log2(n!) = log2 n + log2(n-1) + log2(n-2) + ... + log2 1
DDJ
/* bsearch.c -- time binary search algorithms for cache effects
From Dr. Dobb's Journal, April 2000
Input lines: algnum n numtests
Output lines: algnum n numtests scrambled clicks nanosecs_per_elem
See timedriver for algnum codes
bsearch gives scrambled tests; bsearch -o gives ordered
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MAXN 10000000
typedef int DataType;
DataType x[MAXN];
int n;
/* Alg 1: From Programming Pearls, 2nd Ed., Column 8 */
int binarysearch1(DataType t, int lo, int hi)
/* return least u s.t. x[u] >= t */
{ int l, u, m;
l = lo - 1;
u = hi + 1;
while (l+1 != u) {
m = (l + u) / 2;
if (x[m] < t)
l = m;
else
u = m;
}
return u;
}
/* Algs 2 and 3: Return range of equal values */
typedef struct range {
int lo, hi;
} Range;
int warmstart = 0; /* 1 to change start of second binary search */
Range binarysearch2(DataType t, int lo, int hi)
/* pre: sorted(0, n-1)
post: bsl = min l s.t. x[l] >= t
bsu = max u s.t. x[u] <= t
*/
{ int l, u, m;
Range answer;
l = lo - 1;
u = hi + 1;
while (l+1 != u) {
m = (l + u) / 2;
if (x[m] < t)
l = m;
else
u = m;
}
answer.lo = u;
l = lo - 1; /* cold start */
if (warmstart)
l = answer.lo;
u = hi + 1;
while (l+1 != u) {
m = (l + u) / 2;
if (x[m] <= t)
l = m;
else
u = m;
}
answer.hi = l;
return answer;
}
/* Timing */
int p[MAXN]; /* permutation vector */
void scramble(int n)
{ int i, j;
DataType t;
for (i = n-1; i > 0; i--) { /* combine two 15-bit rand values */
j = (int)((32768L*(rand() & 32767) + (rand() & 32767)) % (i + 1));
t = p[i]; p[i] = p[j]; p[j] = t;
}
}
#define assert(v) { if ((v) == 0) \
printf(" binarysearch bug i=%d n=%d\n", i, n); }
int main(int argc, char *argv[])
{ int i, algnum, numtests, test, start, clicks;
int pos, b, e, wantscramble;
Range answer;
DataType t;
wantscramble = 1;
if (argc > 1 && argv[1][0] == '-' && argv[1][1] == 'o')
/* "ordered" searches */
wantscramble = 0;
while (scanf("%d %d %d", &algnum, &n, &numtests) != EOF) {
x[0] = 0;
for (i = 1; i < n; i++)
x[i] = x[i-1] + rand() % 2;
for (i = 0; i < n; i++)
p[i] = i;
if (wantscramble)
scramble(n);
start = clock();
for (test = 0; test < numtests; test++) {
for (i = 0; i < n; i++) {
t = x[p[i]];
switch (algnum) {
case 1:
pos = binarysearch1(t, 0, n-1);
assert(pos == 0 || x[pos-1] < t);
assert(pos == n || x[pos] >= t);
break;
case 2:
case 3:
warmstart = 0;
if (algnum == 3)
warmstart = 1;
answer = binarysearch2(t, 0, n-1);
b = answer.lo;
e = answer.hi;
assert(b < 0 || x[b] >= t);
assert(b < 1 || x[b-1] < t);
assert(e > n-1 || x[e] <= t);
assert(e > n-2 || x[e+1] > t);
break;
}
}
}
clicks = clock() - start;
printf("%d\t%d\t%d\t%d\t%d\t%g\n",
algnum, n, numtests, wantscramble, clicks,
1e9*clicks/((float) CLOCKS_PER_SEC*n*numtests));
}
return 0;
}