Code Tuning in Context

By Jon Bentley

Jon is a Member of Technical Staff at Bell Labs. He can be contacted at jlb@research.bell-labs.com.

Some people spend hours tweaking the very last bit of horsepower out of their automobile engines. Others adjust strings until their violins sing sweetly, practice golf swings, or finely polish web pages. I, too, enjoy tuning: I love to tinker with code to make it run faster.

When I started programming, code tuning was easy: You found the innermost loop, got rid of the expensive parts, and your program ran faster. Nowadays, though, I'm no longer surprised when I tune code and find that it is not substantially faster, and is sometimes slower.

What has changed? Lots of things. Computer architectures have more accelerators, and compilers perform more optimizations. This column studies code tuning in several different contexts. I'll start with a family of four related algorithms, apply four tuneups to them, and measure the results on various hardware and compilers.

Background

Last month's column presented a sequence of four algorithms to solve the Traveling Salesman Problem (TSP), and analyzed the run times of each.

Algorithm 1 was the simplest code for the job: It enumerated all n! permutations of the cities and saved the shortest tour. Since it employed n distance calculations to find the cost of each tour, it performed a total of n×n! distance calculations.

Algorithm 2 was a minor variant of Algorithm 1. Fixing a starting city required changing one parameter in one call from n to (n-1); this reduced the number of tours examined from n! to (n-1)!, and the total run time to n×(n-1)! or n! distance calculations.

Algorithm 3 reduced the number of distance calculations by keeping a partial sum of the tour distance. The total number of distance calculations was approximately (1+e)×(n-1)!, where e ≈ 2.71828... is the base of natural logarithms.

Algorithm 4 pruned the search by returning whenever the partial sum exceeded the length of the smallest tour yet observed. Experiments showed that it is much faster than Algorithm 3, but its run time was very dependent on the particular input data.

These algorithms are fine candidates for tuning: Each is a small loop that is likely to consume many cycles in an application. I'll tune all four in parallel for the sake of experiments; in a real application, you would tune only the fastest algorithm.

Tuneup B: Precompute Distances

The critical operation in TSP algorithms is often the distance function:

Dist geomdist(int i, int j)
{   return (Dist) (sqrt(sqr(c[i].x-c[j].x)
                      + sqr(c[i].y-c[j].y)));
}

A profile showed that Algorithm 1 spent 16.6 percent of its time in geomdist, 64.6 percent in sqrt, and 5.7 percent in this sqr function:

float sqr(float x)
{   return x*x;
}

Together, these three functions account for about 87 percent of the time used by the program.

Exercise 1. How can you reduce the time the program spends in the geomdist function?

You could tune the function in several ways: You could write the sqr function inline, eliminate common subexpressions, or use a special-purpose square root. I will, instead, leave its cost unchanged, but almost eliminate its time by precomputing all n² possible distances and storing them in an array:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        distarr[i][j] = geomdist(i, j);

This approach trades space for time. The distarr table uses n² words of storage, but you can now compute the distance from point i to point j with an array access:

#define dist(i, j) distarr[i][j]
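Pulling the pieces together, here is a minimal self-contained sketch of this tuneup; the MAXN bound, the struct layout, and the initdist name are illustrative choices of mine, not taken from tsp2.c:

    #include <assert.h>
    #include <math.h>

    #define MAXN 100                      /* illustrative bound, not from tsp2.c */
    typedef double Dist;

    struct point { float x, y; };
    struct point c[MAXN];                 /* city coordinates */
    Dist distarr[MAXN][MAXN];             /* precomputed distances */
    int n;                                /* number of cities */

    float sqr(float x) { return x * x; }

    Dist geomdist(int i, int j)
    {
        return (Dist) sqrt(sqr(c[i].x - c[j].x) + sqr(c[i].y - c[j].y));
    }

    /* Tuneup B: pay for all n*n distance computations once, up front. */
    void initdist(void)
    {
        int i, j;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                distarr[i][j] = geomdist(i, j);
    }

    /* Afterward, a "call" to dist is just an array access. */
    #define dist(i, j) distarr[i][j]

On a 3-4-5 right triangle of cities, for instance, dist(0, 1) returns 5.0 after initdist without touching sqrt again.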

Exercise 2. How will this tuneup change the run times of the various algorithms?

I'll call the original code for each algorithm (which calls the function) "Tuneup A," and the modified code (which accesses the array) "Tuneup B." Thus Algorithm 1A (implemented in tsp2.c; available electronically, see "Resource Center," page 5) is the original Algorithm 1, Algorithm 1B incorporates the array, and so on for the three other algorithms. Table 1 shows before and after run times on a 200-MHz Pentium Pro. Because the later algorithms are so much faster than the earlier ones, I had to increase the input size n on the various rows to get accurate run times. In every case, though, Tuneup B represents a substantial performance increase. Furthermore, this is not the kind of speedup that hardware or system software could have provided automatically.

Lesson: Tuneups beyond the reach of hardware accelerators and optimizing compilers still play a vital role in software performance.

The rightmost column in Table 1 shows the speedup ratio. Algorithm 1B is about 5.5 times faster than the original Algorithm 1A: That "dumb" algorithm spent most of its time performing the critical distance calculations, and very little time in overhead. Algorithm 2 uses the same code as Algorithm 1, with just a single parameter changed. "Smart" Algorithms 3 and 4 add overhead (for maintaining partial sums and pruning the search) to reduce distance computations, so they are less affected by the dramatic reduction in the cost of distance calculations.

Lesson: Tuneups to critical operations have the most impact on "dumb" algorithms with little overhead.

This tuneup successfully added space to reduce run time. That technique was a sure bet in the old days, but it can mean a slowdown on modern machines when the larger data structure overflows the caches.

Lesson: Measure the effect of tuneups to ensure they indeed reduce the run time.

Exercise 3. Write a program to determine experimentally the run times of the key operations in these algorithms, such as loops, array accesses, and arithmetic operations.

Tuneup C: Change Arithmetic

These algorithms must keep sums of potentially large numbers. I've been bitten in the past by arithmetic precision (or lack thereof), so I instinctively made the type of distances the largest possible:

typedef double Dist;

In the past, changing the double to float, long, or short would easily have given a huge speedup.

Exercise 4. Estimate the result of changing to various data types on your machine.

There's a little more to the job than changing one typedef. The first task is to scale distances to use the precision available in our chosen word lengths:

return (Dist) (DistFACT * sqrt(...));

Exercise 5. What are appropriate scale values? What other changes need to be made to the program? How would you incorporate them into a program?

Algorithms 1C through 4C incorporate different kinds of arithmetic. Table 2 presents experiments on Algorithm 3C (which has an intermediate amount of overhead) with n=9 on a decade's worth of Intel processors. The first six lines used the same 16-bit executable, while the bottom line used a 32-bit executable. Old hardware was pretty fast on shorts and longs, and much slower on floats and doubles. In recent years, though, computer architects have used massive transistor counts and wide data paths to make floating-point operations run almost as fast as their integer cousins.

Can modern programmers ignore ancient hardware? Some indeed can: They know their programs will run only on particular machines. Others certainly cannot: Their programs run today on small processors (in cell phones, joysticks, or even legacy computers), and who knows where they'll be used tomorrow?

Lesson: Old hardware was relatively slow for floats and doubles. Modern (large) processors operate on such variables as quickly as on ints and shorts.

For the remainder of this column, I'll focus on modern machines, and reckon Tuneup C as a blind alley. Keep it in mind, though, the next time you work on a small processor.

Tuneups to Reduce Overhead

The innermost loop of search1 currently is:

for (i = m-1; i >= 0; i--) {
    swap(i, m-1);
    search1(m-1);
    swap(i, m-1);
}

This consists of two calls to the swap function and a recursive call to itself; Algorithms 2 through 4 have similar inner loops.
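For context, here is a minimal sketch of a recursive search built around that loop. The base case shown merely counts complete permutations; the article's tsp2.c would instead evaluate the tour's length at that point, so the tours counter is my stand-in:

    #include <assert.h>

    #define MAXN 16
    int p[MAXN];                 /* current permutation of city indices */
    long tours;                  /* stand-in for "evaluate this tour"   */

    void swap(int i, int j)
    {
        int t = p[i];
        p[i] = p[j];
        p[j] = t;
    }

    void search1(int m)
    {
        int i;
        if (m == 0) {            /* a complete permutation is in p */
            tours++;             /* real code would sum tour distances */
            return;
        }
        for (i = m-1; i >= 0; i--) {
            swap(i, m-1);
            search1(m-1);
            swap(i, m-1);
        }
    }

Calling search1(n) visits all n! permutations of p[0..n-1], and the paired swaps guarantee that p is restored to its original order when the search returns.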

In the old days, a sure speedup could be gained by writing the swap function inline, and removing the overhead of function calls. This process results in Tuneup D:

for (i = m-1; i >= 0; i--) {
    t = p[m-1];
    p[m-1] = p[i];
    p[i] = t;
    search1d(m-1);
    t = p[i];
    p[i] = p[m-1];
    p[m-1] = t;
}

Tuneup E pares a few additional lines of fat from the innermost loop by moving two assignments out of the loop and removing one redundant assignment:

t = p[m-1];
for (i = m-1; i >= 0; i--) {
    p[m-1] = p[i];
    p[i] = t;
    search1e(m-1);
    p[i] = p[m-1];
}
p[m-1] = t;
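When hand-tuning like this, it is worth checking that the rewritten loop still enumerates every permutation and restores p on exit. A minimal sketch of such a check, again with a counting base case of my own standing in for tour evaluation:

    #include <assert.h>

    #define MAXN 16
    int p[MAXN];
    long count1e;                /* permutations visited */

    void search1e(int m)
    {
        int i, t;
        if (m == 0) {            /* stand-in base case: count the tour */
            count1e++;
            return;
        }
        t = p[m-1];
        for (i = m-1; i >= 0; i--) {
            p[m-1] = p[i];
            p[i] = t;
            search1e(m-1);
            p[i] = p[m-1];
        }
        p[m-1] = t;
    }

For n = 5 this should visit exactly 5! = 120 permutations and leave p exactly as it found it; if either property fails, the "tuneup" has changed the program's meaning, not just its speed.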

The resulting program spends the lion's share of its run time in this function. Table 3 presents a profile of Algorithm 4E with n=12.

Code tuners face two nightmare profiles. When the profile is entirely flat, and each function accounts for just 1 or 2 percent of the run time, it is hard to know where to start. This profile is at the other end of the spectrum: All the run time is in one function, and I've already squeezed that function as hard as I can.

Results of the Tuneups

I now have five tuneups across four algorithms, and I'm ready to examine their interactions in detail. I'll start with a simple study of compiler optimizations using versions of Algorithm 4 with n=12; see Table 4.

With the single exception of Algorithm 4A on the Pentium Pro, optimization is a win.

Lesson: Compiler-optimized code is sometimes dramatically faster than unoptimized code. Compiler optimization may, however, slow down a program.

The compiler-optimized versions of Algorithms 4D and 4E were slower than (simpler) Algorithm 4C on the MIPS processor, yet faster on the Pentium Pro.

Lesson: Code tuning may interact badly with compiler optimizations, and increase the resulting run time.

All of the other experiments in this column have been conducted with optimization enabled.

Our next experiment runs the complete set of algorithms and tuneups (with the exception of dead-end Tuneup C) on a 200-MHz Pentium Pro; see Table 5.

The result of Tuneup B was previously discussed. Tuneup D gave a speedup of 13 to 30 percent (but recall that it caused a slowdown on a MIPS R10000). Tuneup E made little difference on Algorithms 1 and 2 (where run time was still dominated by summing distances), but gave a 12 percent speedup on Algorithms 3 and 4.

Lesson: Fine-tuning inner loops can still yield 10 percent here and 30 percent there, but sometimes slows the code.

The overall speedup, from Tuneup A to Tuneup E, ranged from a factor of 6.6 down to 3.6.

Our final experiment (see Table 6) runs all Tuneups on Algorithm 4, across three Intel processors.

Lesson: Tuning code on old processors was predictable and could yield speedups of orders of magnitude.

Lesson: Tuning code on modern processors is less predictable and yields smaller, but still important, speedups.

Exercise 6. Perform additional experiments changing variables such as compiler, operating system, optimization level, and so on.

Principles

Back in the old days, when code was easy to tune, supermarket shopping was pretty easy, too. I would walk through the store, putting items into my basket, keeping a running total as I went. I sometimes snapped up a few items on "two-for-one" sale. When I came to the cashier, I handed over a coupon or two, and if the total price was more than a few percent off my estimate, I would ask to see the receipt.

Modern marketing techniques have changed all that. I now go to the store armed with more coupons. I can swipe my frequent-shopper card across a machine at the front of the store to get just-in-time coupons. I try to take advantage of "buy five of these, get two of those free" sales when I can. When I present my card to the checker, I get extra discounts on items already scanned, custom-printed coupons for future use, and credit towards a free turkey if I rack up enough sales before the next holiday. The process results in lower prices (I suppose), but I no longer understand my shopping trips.

What marketing techniques have done to food shopping, computer architects and compiler writers have done to code tuning. Pipelines and other instruction-level parallelism offer many instructions for the price of one (except when hazards jeopardize the deal). Multilevel caches usually yield on-chip speed at DRAM prices. Optimizing compilers manipulate your code at no additional charge. These devices automate many of the tasks that used to fall to code tuners, and they usually do a pretty fine job. But some code-tuning tasks remain, and we have to keep a much larger context in mind as we set to work.

Solutions to Selected Exercises

Exercise 3. Brian Kernighan, Chris Van Wyk, and I described "An Elementary C Cost Model" in the February 1991 issue of UNIX Review. When I ran a 50-line C program on the 200-MHz Pentium Pro, it produced this estimate of the run times for various operations:

Operation               Nanosecs
Null Loop
  {}                        36
Int Operations
  i1++                      -5
  i1 = i3++                 -1
  i1 = i2 + i3               3
Control Structures
  if (i == 5) i1++           0
  while (i < 0) i1++         3
Array Operations
  p[i] = i                  40
  i2 = p[i]                 20
  swap(0, i)               160
Square Root
  f1 = sqrt(f2)            888

The first line says that a null loop requires about 36 nanoseconds per operation. The second through fourth lines show the costs of arithmetic operations on integers.

Lesson: Adding instructions to a small inner loop can make it run faster.

The array operations are more substantial, and the square roots are relatively expensive.
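A crude harness in the spirit of that cost model might time each operation through a function pointer. The indirect call adds overhead that is much larger than the operations themselves, so the figures are only rough; the TRIALS count and the helper names here are my own, not from the UNIX Review program:

    #include <assert.h>
    #include <time.h>

    #define TRIALS 1000000L

    volatile int i1, i2, i3;     /* volatile keeps the ops from being optimized away */

    static void null_op(void) { }
    static void int_add(void) { i1 = i2 + i3; }

    /* Average nanoseconds per call of body() over TRIALS iterations. */
    static double ns_per_op(void (*body)(void))
    {
        long i;
        clock_t start = clock();
        for (i = 0; i < TRIALS; i++)
            body();
        return 1e9 * (double)(clock() - start) / CLOCKS_PER_SEC / TRIALS;
    }

To estimate an operation alone, subtract the null_op figure from its measurement; clock() resolution and compiler optimization can still distort such micro-measurements, which is exactly why the article's table needs careful reading.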

Exercise 5. This code sequence assumes that precisely one of the variables SHORT, LONG, FLOAT, and DOUBLE is nonzero, and that variable signifies the type to use:

#if SHORT
typedef short Dist;
#define INF 30000
#define DistFACT 100
#elif LONG
typedef long Dist;
#define INF 1000000000
#define DistFACT 100000
#elif FLOAT || DOUBLE
typedef float Dist;
#define INF ((Dist) 1e35)
#define DistFACT 1.0
#endif

Scaling and rounding may slightly change the output sequence, but the tour distance should be very close to the true distance.
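To make the scaling concrete, here is a small illustrative sketch: with a DistFACT of 100, a short holds distances to two decimal places. The rounding term is my addition (truncation, as in the cast shown earlier, doubles the worst-case error), and the helper names are hypothetical:

    #include <assert.h>
    #include <math.h>

    typedef short Dist;
    #define DistFACT 100         /* two decimal places of precision */

    /* Scale a true (double) distance into the integer Dist type. */
    Dist scaledist(double d)
    {
        return (Dist)(DistFACT * d + 0.5);   /* round, don't truncate */
    }

    /* Recover an approximation of the true distance. */
    double unscale(Dist d)
    {
        return (double)d / DistFACT;
    }

With this scheme each individual distance is off by at most half a unit in the last scaled place, which is why the summed tour length stays very close to the true distance.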

DDJ


Copyright © 1999, Dr. Dobb's Journal