January 1991/Some Tips For QuickSort

Features

Some Tips For QuickSort

Joe Celko

Joe Celko is a college teacher and consultant based in Los Angeles, as well as a member of the ANSI X3H2 Database Standards Committee. He wrote a biweekly column on software design in Information Systems News for four years and has taught structured methods in industry and colleges.
Mark Nelson's article on QuickSort (CUJ, August 1990)[1] examined tuning an algorithm to fit a particular need. In this article, I will supply you with more tricks to improve QuickSoft performance.
The choice of the pivot (also called the key value) is the most important factor in the performance of each partition in the sort. The statistical distribution of the data to be sorted is the most important factor in picking the pivot.
Another helpful point not mentioned in previous QuickSort articles is that the pivot value does not have to be an actual value in the array being sorted. The code changes a little bit, but the result is that everything in the left partition is less than the pivot, and everything in the right partition is greater. Now you can construct a pivot from a sample of the values in the partition without actually looking for an occurrence of it.
If you have a uniform probability distribution in the data, as you would sorting on a unique key, then using the first array element of each partition (the first naive sort given in Nelson's article) will work as well or better than any fancier method.
Using the middle position of the partition as the pivot also works well in practice since many files already have some order to them from previous sorting. The code for the middle point, pivot = (first + last)/2, will be optimized to a shift operation instead of a divide by most C compilers.
If you have a normal probability distribution (the so-called bell curve) with any skew (tilt to one side) in the data, then the "median of three" algorithm for picking a pivot will work much better. Using three values reduces comparisons by 14 percent, but more than three does not seem to give much improvement.
QuickSort is a non-stable sort. A stable sort preserves the original order in the file for records with duplicate sort keys. Stable sorts are nice in one sense because they can be applied one after the other. If I need to sort a file by counties for one report, and by counties within states for another report, and I use a stable sort, then I can run the first report followed by a stable sort on the states alone for the second report.
Unfortunately, stable sorts tend to be slower than non-stable sorts. On the bright side, you can modify the QuickSort to be a stable sort. You do this by scanning from left to right, partitioning the records into two separate files in their original order and then appending them together in order. The internal sorting phase is also done with a stable sort.
Recursion is not the enemy. William Hutchison of Exton, PA collects versions of QuickSort written in C to use for compiler and hardware testing on a standard set of data files. He reports in private correspondence that he found that recursive versions of QuickSort often ran faster than non-recursive versions of the same variation. Modern stack hardware makes an important difference.
Using a different sort for partitions smaller than some size M is a good trick, since many sorts run faster than QuickSort for small arrays. An even better way is to ignore the small subfiles during the first phase of the sort, then go back through the entire files and sort it with a procedure tuned for handling M or fewer records. You save the overhead of invoking the small partition sort over and over while the QuickSort is running. The best size for M is 9 or 10, but anything between 6 and 15 will work about as well, according to analysis [2].
Insertion sort is a good choice, but you can also use the Bose-Nelson sort. This algorithm does not sort directly, but it generates the minimum number of swap pairs for a fixed size array. A swap pair is a pair of array elements that are to be compared and then swapped if they are out of order [3].
Another trick is to see if a partition is already sorted and to then leave it alone. If the file is already in near sorted order, this can save a lot of time. You do this by adding Boolean flags to the left and right pointer control loops, which are set when an out-of-place element is scanned.
If equal keys are present, then Nelson's first program will keep moving the high and low pointers so that all the duplicates are left in their proper position in the middle. Surprisingly, analysis shows that it is better to stop when the pointers are equal to the pivot element's position.
Wainwright [2] measured the performance of several modifications of QuickSort on the same sets of test and got some interesting results. The best performance seems to be from what he called Bsort — a QuickSort with a check for sorted order in the partitions.

References
[1] Nelson, Mark, "Writing Your Own QuickSort," The C Users Journal, Aug 1990, pp 63-73.
[2] Wainwright, Roger L., "A Class Of Sorting Algorithms Based On QuickSort," Communications of the ACM, Apr 1985, Vol 28, No 4, pp 396-402.
[3] Celko, Joe, "Bose-Nelson Sort," Dr. Dobb's Journal, September 1985.
Hoare, C. A. R., "Algorithm 64: QuickSort," Communications of the ACM, April 1961, Vol 4, No 7, p 321.
Sedgewick, Robert, Implementing QuickSort Programs, Communications of the ACM, Oct 1978, Vol 21, No 10 pp 847-857.
Sedgewick, Robert, "QuickSort With Equal Keys," SIAM Journal of Computing, Jun 1977, Vol 6, No 2, pp 240-257.
vanEmdan. M.N., "Algorithm 402: Increasing The Efficiency Of QuickSort," Communications of the ACM, Nov 1970, Vol 13, No 11, pp 693-694.