C/C++ Contributing Editors


Standard C/C++: A Better Sort

P. J. Plauger

Sorts keep getting smarter, but not always as smart as they seem.


Introduction

Working in a small systems software company is a mixed blessing. You get to write a lot of interesting software, which is good. But you usually have to write to "interesting" specifications beyond your control, such as international standards or popular operating environments, which is not always fun. You get to see your code appear in a lot of different venues, which is good. But you then have to deal with a highly varied population of users, which can be a challenge. And you usually have only a handful of competitors for any given product, which is good. But you have to quickly match them feature for feature to stay even in the software arms race, which can be tiring.

A consequence of all these factors is that every month looks different. Pete Becker bounces between building better DLLs under VC++ and tuning his Java incremental garbage collector. At the same time, I am configuring our Standard C++ library for a major new customer and tidying up documentation. Mostly we thrive on such variety. But sometimes the mix shifts too much to the mundane. For me, in particular, the selling, packaging, and customer support can eat up days or weeks at a stretch. I need to go write code, or just rewrite some to make it better, even if the project is not on the top ten list of priorities.

And that's how I ended up reworking all our sort code the past few weeks.

You can look on this installment as the fourth in a series of updates from the field. Regular readers of this column know that I spent years describing all aspects of the Standard C++ library, and the Standard C library before that. In recent months, I've revisited several of the hairier parts of the code as I've learned better ways to do what the C++ Standard says. (See "Standard C/C++: A Better Deque," CUJ, June 1999; "Standard C/C++: A Better Red-Black Tree," CUJ, July 1999; and "Standard C/C++: A Better List," CUJ, August 1999.)

You can also look on this installment as the latest evidence of my long-standing preoccupation with sorting. Brian Kernighan and I presented sort algorithms in several books dating back over the past quarter century or so. Each succeeding version was mildly less naive than the one that preceded it. I have also written columns in other Miller Freeman magazines over the past dozen years or so. (See "Programming on Purpose: Order out of Chaos," Computer Language, March 1987; "Programming on Purpose: All Sorts of Sorts," Computer Language, January 1992; and "State of the Art: Out of Sorts," Embedded Systems Programming, December 1998.)

But to be completely honest, I view this presentation as a direct response, if rather belated, to a review of sort functions that appeared in these pages over four years ago. (See K.B. Williams, "Testing Sort Functions," CUJ, July 1995.) That article by Williams showed just how naive was the code for qsort in my book The Standard C Library (Prentice-Hall, 1992). He ranked it dead last among a dozen candidates, including qsort functions from several popular C compilers. I've been wanting to improve that version of qsort ever since.

I haven't let the world of sorting pass me by completely. The STL (Standard Template Library) includes some pretty sophisticated template functions for sorting, among many other things. STL became important to me when it was incorporated whole hog into the draft C++ Standard in mid 1995. I promptly got my hands on the freely distributed STL code from Hewlett-Packard (bless their hearts) and began working my way through it. I described the sorting template functions several years ago. (See "Standard C/C++: Ordering Algorithms," CUJ, October 1996.)

My first rewrite was pretty close to the approach used by Alex Stepanov and Meng Lee at H-P, naturally enough. In places, I barely knew what I was doing with all this clever new template technology. But over time, I got brave enough to recast the code more and more. I also began to add improvements in response to suggestions, criticism, and bug reports. As recently as last month, I thought it was pretty good. Until I made it better.

So my goal this month is to put the state of the art of sorting in perspective. I begin with a brief overview of the goals and problems of sorting and end with some nice new code. Along the way, I reveal a few surprises that I encountered while measuring the effects of various techniques.

Defining Order

How you sort depends strongly on what you know about what you're sorting. First and foremost, you must know what it means to put a sequence in order. You need an ordering rule that answers the musical question, "If item X precedes item Y in the sequence, are they properly ordered?" You can write this ordering rule as a predicate function pred(X, Y). But for readability, I'll use operator<(X, Y) (speaking C++) to show how the ordering rule works.

Assuming we want a sequence whose elements are in increasing order, it is certainly true that the subsequence {X Y} is properly ordered if X < Y. But that's not the whole story. What if the elements are equal? If X == Y, then Y == X had also better be true, or "order" has a pretty slippery meaning. So it shouldn't matter which way 'round X and Y appear in the sequence. It's ordered either way.

You don't have to define operator==(X, Y) to test for equality, by the way. For "order" to have a sensible meaning, we should be able to write an equivalent test using just operator<(X, Y). Specifically !(X < Y) && !(Y < X) should be true if X and Y are equal. (Even "equal" is a misleading term, since the elements don't have to be identical to compare equal. It is better to say that two elements have "equivalent ordering" for some predicate.)

But all this is a red herring. What we really want to say is that {X Y} is ordered if X <= Y. No, you don't have to define operator<=(X, Y) either. You can write the equivalent form !(Y < X). In plain English, two elements are ordered if they aren't unequivocally out of order. You can sure get a lot of mileage out of operator<(X, Y) if you work at it. And STL certainly does. Practically anywhere that an order must be defined between elements of a given type, STL lets you define just operator<(X, Y) (or some predicate that behaves much like less-than tests). It determines equivalent ordering, less/equal ordering, and anything else it needs from just this test.

As an aside, I should observe that qsort is not nearly as austere. Like its buddy bsearch, it requires that you supply a pointer to a comparison function declared something like:

typedef int _Cmpfun(const void *left, const void *right);

The void pointers are there for generality. Neither qsort nor bsearch knows what kind of elements they're working on. So they pass around object pointers that can be converted (back) to whatever type is required within an actual comparison function.

A comparison function is expected to return a value less than zero if the element at left is less than the one at right. It returns zero if the two have equivalent ordering. And it returns a value greater than zero if the element at left is greater than the one at right. We now know this is overkill, but that was less obvious when these functions were defined over a quarter century ago.

Accessibility and Complexity

Given an ordering rule and two elements stored in memory, we now know how to determine whether they're in order. The next question is how to find elements and rearrange them. What if you want to sort a million records of varying length stored on a disk file? Or worse, like in the good old days, what if the data is stored on one or more tapes? Well, there's all sorts of technology for dealing with these problems of limited accessibility. It falls under the rubric of external sorts. Generally, it involves some strategy for reading in chunks of data, ordering them by an internal sort, then merging all the pieces to the final destination. It is a big topic.

Here we will focus just on internal sorts, where all the elements are stored in a sequence that supports random-access iterators, for template class sort, or in an array, for qsort. That's hard enough. We also assume, in C++, that you can create temporary copies of elements and assign elements freely among each other. That rules out objects of type auto_ptr<T>, by the way, because ownership passes among copies in such a stylized way. An assignment in C is always a bitwise copy, so memcpy or memmove can always do the job. One of the arguments to qsort is the length of each element in bytes, for just this purpose.

So now we have a fairly precise specification. We want to sort, in place, a sequence of elements of some known length N. We can access element I of the sequence at will (in constant time, presumably). We can test two elements to see if they're in order. And we can rearrange elements by performing a series of assignments. What else is there to say?

Plenty, it turns out. There are oodles of strategies for performing internal sorts, each with a different mix of advantages and disadvantages. Some take advantage of known structure in the values being compared. Some make good use of auxiliary storage. These are seldom useful in a general-purpose sort function, for obvious reasons. But even if you dismiss all these, you still have quite a few choices. Here are three proven contenders:

An insertion sort makes up in simplicity, for small sequences, what it lacks in sophistication. Basically, you take an element at a time and compare it right to left against previously ordered elements. When you find out where it fits, you shove the larger elements one place to the right and drop it in the resultant hole. The time complexity is horrendous — it grows as the square of the number of elements to be ordered. But for a couple of dozen elements, max, it can be hard to beat.

A heap sort makes up in reliability what it lacks in adaptability. The term "heap" here does not refer to allocatable storage but to a weakly ordered data structure whose first item is always the largest. To make a heap and sort it takes on the order of N*log(N) operations (comparisons and exchanges). That's about as good a time complexity as you can get, but the multiplier is pretty large and fixed. It does the same amount of wheel spinning no matter what data you ask it to sort.

A quick sort makes up in average performance what it lacks in reliability. It is a recursive, divide-and-conquer technique. For each subsequence, you determine a "pivot" value and partition the sequence about the pivot. In the simpler case, you place to the left all elements less than the pivot and all others to the right. Recurse on each of these subintervals until they're too small to be anything but sorted. A more ornate approach allows for a "fat pivot" — a middle region whose elements all equal the pivot value. The middle region does not have to be revisited, of course.

Time complexity for quick sort can be as bad as N², for an unfortunate choice of pivot value or a perversely ordered initial sequence. It can be as good as N for a sequence all of whose terms are equal, at least if you use a fat pivot. For random data, however, quick sort tends to have the same time complexity as heap sort, but with a much smaller multiplier. It is thus generally the algorithm of choice for sorting large sequences, particularly if nothing is known in advance about any preexisting order.

And that's why I chose an unadorned quick sort as the mainspring of qsort when I wrote The Standard C Library. I didn't reckon on the likes of K.B. Williams, however. He contrived a set of tests that go hunting for places where quick sort might demonstrate N² behavior. Since I had considered none of them, the version I wrote got caught out practically every time. On purely random data, my qsort was up there among the best of them, but that was only one of about a dozen tests. Add up the execution times of all the cases and the perverse ones rule. My code had a simply horrible figure of merit by this reckoning.

N² Insurance

Quick sort behaves much like the little girl in the brief poem by Henry Wadsworth Longfellow:

There was a little girl, she had a little curl
Right in the middle of her forehead;
And when she was good, she was very, very good,
And when she was bad, she was horrid.

Since it is very, very good so much of the time, quick sort is well worth cultivating. But it is also important to remember its failure modes, and to avoid them wherever possible. Over the years, I've run across various forms of what I call "N² insurance" — techniques that keep quick sort from degenerating to N² space and/or time complexity for some types of initial order. Here are a few forms of such insurance.

Choosing the pivot value is pretty important, to begin with. I originally chose the simple technique of picking the last element in the subsequence. For a sequence already in sort (or in reverse sort), however, this is a horrible choice. The partitions repeatedly end up as sizes 1 and N-1. That's a guaranteed recipe for N2 behavior. Choosing either end has the same effect.

You can't work too hard at picking a pivot, or the pivot calculations will come to dominate the time complexity. A cheap thing to do, however, is to just pick the middle element as the pivot value. That avoids the common problems with preordered data, but it can still be suckered. Thus, many implementations choose the median of the first, middle, and last elements instead. Indeed, the STL code I originally got did just that. The weakness with this approach appears if you take a sorted sequence and append a new smallest element. The median calculation repeatedly chooses the beginning of a subsequence. N² again. The fix here is to go ahead and sort these three elements in place. The pivot value moves to the middle, and extreme values percolate out to the proper ends. Much safer.

Another important safeguard is to recurse with care. Each partitioning ends with recursive calls to sort the two subintervals. Such "tail recursion" is easily changed to a single recursive call for one interval and a loop in the same function invocation to handle the other interval. (Some versions avoid recursion altogether by using a stack, but my measurements indicate that this is actually more expensive. Modern architectures have optimized function calls considerably.) It is important to recurse on the smaller interval, and loop on the larger. Otherwise, when the algorithm approaches N² in time complexity it also approaches N² in the depth of function calls. Stack overflow is not far behind, for a large enough sequence. But with proper tending, the code should never recurse deeper than 20 levels to sort 1,000,000 elements.

The latest form of insurance was developed recently by Dave Musser, one of the original architects of STL. He got the bright idea of counting the number of levels the code has recursed. If the depth count gets much larger than log₂(N), it's a sign that N² behavior is brewing. In that case, it's probably safer to switch to a heap sort for the problematic subinterval. Better safe than sorry. Musser calls this introspective approach an "intro sort."

Several techniques also exist to make sorting just plain go faster. A common one is to choose different algorithms depending on the size of the (sub)sequence. Switching to an insertion sort for small intervals is a common speedup, with very measurable results. I've seen cutover points at eight and 16 elements, but surprisingly my own measurements show that 32 elements is generally better. That probably reflects the particular mix of techniques I settled on, however.

Generating a fat pivot seems to be a no-brainer, at first blush. For a sequence of all equal elements, the fat-pivot quick sort works in linear time. This case is actually rather worse than average if you don't form a fat pivot, interestingly enough. But to my surprise, I discovered that the fat-pivot approach is somewhat slower for random data. It takes maybe 20 per cent more time to sort 1,000,000 random elements. Given the big win you get for sorting data with lots of equal elements, however, I think the tradeoff favors the fat pivot.

Finally, you can just plain optimize how you move data about. One way to optimize STL, for example, is to specialize template functions such as copy and copy_backward for "plain old data structures" (or PODs, in C++ jargon). If you can do a bitwise copy, rather than a call to some assignment operator function, you can unleash all the power of good old memmove. Even without this optimization, you can speed up insertion sorts by performing rotations instead of repeated swaps or assignments. Of course, qsort already deals in block moves, but the rotate optimization seems to help it too.

Latest and Greatest

After a couple weeks of tinkering, I settled on the following scheme for state-of-the-art sorting. The basic algorithm is a quick sort with a fat pivot. It computes its pivot by sorting in place the first, middle, and last elements. It switches to insertion sorts for small intervals, and to heap sorts if it recurses too much. Insertion sorts perform rotations instead of swaps or assignments.

I did encounter one last surprise while tuning this approach. Intro sort is based on the notion that log₂(N) recursive calls is a reasonable budget for quick sort. Beyond that quota, you'd better switch pronto to a heap sort. I found that a bit too conservative. The code actually switches to heap sort way too often, just sorting random data. It slows down. By increasing the quota by 50 per cent, I found that random data sorts as fast as ever, but the insurance policy is still there to kick in well before N² behavior ever comes to dominate. And, of course, the other techniques reduce the number of occasions where that is likely to happen.

Listing 1 shows the latest and greatest version of template function sort and friends. I omit the code for make_heap and sort_heap because it is unchanged from before. I also omit the variants that take a predicate argument, to be used in place of operator<(X, Y). It is largely the same code. This version runs about twice as fast as its predecessor for random data. It is way faster for data with about any sort of initial order. And it should never approach N² behavior, even if you can contrive a sufficiently perverse initial sequence.

Listing 2 shows the latest and greatest version of the C function qsort. It includes all the heap sort stuff as well, since you won't otherwise find such code in the Standard C library. I find it remarkably compact, given all that it does. It too is up there with the best of them when it comes to sorting random data. And it shares with template function sort much the same virtues. Unlike its distant precursor from The Standard C Library, it wasn't born yesterday.

P.J. Plauger is Senior Editor of C/C++ Users Journal and President of Dinkumware, Ltd. He is the author of the Standard C++ Library shipped with Microsoft's Visual C++, v5.0. For eight years, he served as convener of the ISO C standards committee, WG14. He remains active on the C++ committee, J16. His latest books are The Draft Standard C++ Library, Programming on Purpose (three volumes), and Standard C (with Jim Brodie), all published by Prentice-Hall. You can reach him at pjp@plauger.com.