Timothy Prince has worked 25 years in aerodynamic design and computational analysis. He first wrote software in BASIC in 1967 on a GE225 with KSR32 terminals. His first renovation of old FORTRAN came shortly thereafter, leading to more such projects, including seminars on adapting code to modern architectures. He received a B.A. in physics from Harvard and a Ph.D. in mechanical engineering from the University of Cincinnati. He can be contacted at 39 Harbor Hill, Grosse Pointe Farms, MI 48236.
A great deal has been written about the quick sort, but without much clear guidance on useful working versions. In this article, I combine some of the better features of previously published versions of the classic quick sort and look into some of the issues affecting quality of compiled code.
I started this round of re-examination of sorts with the thought that I might uncover some ideas which would essentially optimize the code for a wide variety of machines. Certain modern computers, although their hardware design appears useful for sorting as well as numerical analysis, fail to achieve sorting performance in proportion to their speed in other tasks.
You probably assume that sorting algorithms can be differentiated by how efficiently they capitalize on inherent strengths of various computers' architectures, and to some extent, this is true. Nevertheless, the basic quick sort appears to perform well in a wide variety of architectural environments. Only in the outer control structures does the programmer face decisions which affect the ability of pipelining compilers to generate efficient code.
Choice Of Method
For problems where sort time is not a dominant factor, the shell sort is often used. It represents a good compromise between speed and code size. Assuming that the compiler doesn't stumble over the slightly more complicated inner loops of the shell sort, it will be faster on a certain range of problems, particularly in comparison with quick sorts written with old-fashioned FORTRAN DO loops.

The quick sort algorithm is basically recursive, but that doesn't necessarily make a recursive implementation more desirable or clearer. The non-recursive version requires that the internal stack space be dimensioned according to the maximum expected problem size; this will almost always result in a smaller stack than a recursive version would use.
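To see why the explicit stack stays small, note that if the smaller partition is always processed next and only the larger is pushed, pending segments shrink by at least half per level. The sketch below, with illustrative names not taken from Listing 1, computes the worst-case entry count under that assumption:

```c
#include <assert.h>

/* Stack depth for an iterative quick sort: if the smaller partition is
   processed first and the larger pushed, the number of pending segments
   never exceeds floor(log2(N)).  Two entries (low and high bound) are
   stored per segment.  MAXN and STKSIZ are illustrative values. */
#define MAXN   1000000UL          /* maximum expected problem size   */
#define STKSIZ (2 * 20)           /* 2 * ceil(log2(MAXN)) entries    */

static unsigned stack_entries_needed(unsigned long n)
{
    unsigned d = 0;
    while (n > 1) { n >>= 1; ++d; }   /* d = floor(log2(n)) */
    return 2 * d;                     /* low and high bound per segment */
}
```

A recursive version, by contrast, pays a full activation record per level rather than two array slots.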
Some authors, including Plum, propose a general-purpose sort function where the programmer supplies pointers to basic comparison and swapping functions. I think that setting up a sorted vector of pointers to the data, which may be of any type defined by a typedef of ARRTYPE, is a more generally useful approach. Since only pointers are swapped, and the swaps are naturally mixed in with other code, calls to a separate swapping function aren't needed.
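The pointer-vector idea can be illustrated with a simple insertion sort standing in for the quick sort's swap logic; the struct layout and names here are assumptions for illustration, not the ARRTYPE of Listing 1:

```c
/* Sort pointers to the data rather than the data itself; only the
   pointers move, the records stay put.  ARRTYPE here is a hypothetical
   record with a numeric key. */
typedef struct { double key; char payload[64]; } ARRTYPE;

static void ptr_insertion_sort(ARRTYPE **p, int n)
{
    int i, j;
    for (i = 1; i < n; i++) {
        ARRTYPE *t = p[i];                    /* pointer, not record, in hand */
        for (j = i; j > 0 && p[j-1]->key > t->key; j--)
            p[j] = p[j-1];                    /* shift pointers, not payloads */
        p[j] = t;
    }
}
```

Because only pointers move, the cost of a swap is independent of the record size, and the original array can still be traversed in input order afterward.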
Bear in mind that it may pay to use different data types for some sorts. The key is determining whether the sorts will produce the same order. For example, rather than sorting positive IEEE floats, try sorting on longs (assuming the byte order is the same), though the code will be less readable and portable. Similarly, rather than sort 64-bit positive integers, consider performing the sort on IEEE doubles. Remember, though, that negative numbers will probably sort in reverse order, since IEEE floating point uses a sign-magnitude representation, while int uses two's complement.
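A small demonstration of the equivalence, sketched with `memcpy` rather than pointer punning and a fixed-width `int32_t` in place of the article's `long` (an assumption that `float` is 32-bit IEEE):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit IEEE float's bit pattern as a signed 32-bit
   integer.  For non-negative floats, integer order matches floating-
   point order; negative floats compare in reverse because IEEE uses a
   sign-magnitude encoding while the integer is two's complement.
   Assumes sizeof(float) == 4 and matching byte order. */
static int32_t float_bits(float f)
{
    int32_t k;
    memcpy(&k, &f, sizeof k);   /* well-defined, unlike pointer punning */
    return k;
}
```

On hardware without fast floating-point compares, sorting on `float_bits` keys instead of the floats themselves can pay off, as long as the reversed ordering of negatives is accounted for.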
The quick sort is deceptively difficult to write from scratch. Perhaps the variety of published versions should be taken as a warning. The only optional code in Listing 1 is the special section that applies to segments of length 2. Sorts for these short segments can be finished off faster by direct attack than by any of the sorting methods suitable for longer segments. Because quick sorts typically generate many of these short segments, a special treatment is worthwhile.
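The length-2 special case amounts to one comparison and at most one pointer swap; a minimal sketch, with names that are illustrative rather than taken from Listing 1:

```c
/* Finish a segment of exactly two elements directly: one compare and
   at most one pointer swap.  p points at the two-element span of the
   pointer vector; int keys stand in for the application's ARRTYPE. */
static void sort2(int **p)
{
    if (*p[0] > *p[1]) {                       /* compare the keyed data */
        int *t = p[0];  p[0] = p[1];  p[1] = t;  /* swap the pointers    */
    }
}
```

Dispatching to this directly when a popped segment has length 2 avoids the full pivot-and-partition machinery for a case the quick sort generates constantly.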
Some quick sorts will occasionally allow the searches to extend all the way to the boundaries of the data rather than restricting them to the remaining section of the segment currently being analyzed. These algorithms must check each iteration of the loop against the boundary values. Depending on the system, the extra test can more than double the time spent in the inner loops. This probably accounts for observations that heap sorts seem quicker, especially for moderate amounts of data.
Recommended Method
Consider Listing 1, iqsort.c. The downward search, which occurs first, can be guaranteed to stop within one count of the known limit simply by copying the pivot element to the beginning of the segment before starting. By doing this, you don't force the processor to test continually against limits, which will generally save time on the simpler processors.

The closest parallel for an upward search requires that a termination element be placed beyond the actual data. This will cause the search to end if there are no remaining data to be moved. You can accomplish this by copying the pointer to the pivotal element, which itself will be chosen the first time through the loop. This requires that you dimension the pointer array one larger than you would otherwise.
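The sentinel idea can be sketched in isolation. This downward-scan fragment works on a plain int array rather than the pointer vector of Listing 1, and the names are illustrative:

```c
/* Sentinel scan: place the pivot value at the low end of the segment
   so the downward search cannot run off it.  The loop then needs no
   per-iteration bounds test, only the data comparison. */
static int scan_down(int *a, int lo, int hi, int pivot)
{
    a[lo] = pivot;                /* sentinel guarantees termination */
    while (a[hi] > pivot)         /* no hi >= lo check needed        */
        --hi;
    return hi;                    /* rightmost position not above pivot */
}
```

The upward scan mirrors this with the pivot pointer stored one slot past the data, which is why the pointer array is dimensioned one larger.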
Another method of enhancing a quick sort is to copy the pointer vector element to a register variable so that a duplicate memory read won't be needed after the loop ends. This is an optimization that certain compilers can perform; others do not perform automatic register allocation across loop exits or a break. With today's compilers it shouldn't make much difference whether the local variables are declared register or static.
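The explicit form of that optimization looks like this sketch; the function and variable names are hypothetical:

```c
/* Cache the element in a local so the value found by the scan need not
   be re-read from memory after the loop breaks.  Some compilers hoist
   this automatically; others don't keep register allocations live
   across a break, so the explicit local helps them. */
static int find_first_ge(const int *a, int n, int pivot, int *out)
{
    int i, v = 0;
    for (i = 0; i < n; i++) {
        v = a[i];             /* register-candidate local          */
        if (v >= pivot)
            break;            /* v remains valid after the break   */
    }
    *out = v;                 /* no duplicate memory read of a[i]  */
    return i;
}
```
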
I recommend using the ?: operator in the stack-pushing section because it permits pipelining compilers to generate more compact code. Non-pipelining compilers, even if they don't generate optimum code in this section, shouldn't slow the code significantly, because the inner loops execute on the order of n times as many operations.
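One common shape for that section, sketched with illustrative names: the ?: operator selects the bounds of the larger subsegment to push, so the smaller one is processed next (which also bounds the stack depth):

```c
/* Push the bounds of the larger subsegment onto the explicit stack;
   the caller continues with the smaller one.  [lo..j] and [i..hi] are
   the two subsegments left by partitioning.  Returns the new stack
   pointer.  A sketch, not the code of Listing 1. */
static int push_larger(int lo, int j, int i, int hi, int *stack, int sp)
{
    int lsize = j - lo, rsize = hi - i;
    stack[sp++] = (lsize > rsize) ? lo : i;   /* larger segment's low  */
    stack[sp++] = (lsize > rsize) ? j  : hi;  /* larger segment's high */
    return sp;
}
```

Written with ?: rather than an if/else, both stores can be scheduled without a branch on machines whose compilers turn the conditional into a select.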
Note how I've set up the loops so that the outer loop test executes only when the stack pointer is decremented, not when the test is unnecessary. This technique also permits more effective pipelining of the code. The pipelining compilers I've examined generate excellent code. Since the inner loops are so short and simple, effective pipelining would require overlapping pointer and data fetches, without waiting for one iteration to finish before the next begins. This opens up the possibility of using uninitialized pointers unless the pointer array is extended enough to cover the pipeline length.
Conclusion
I don't believe that sorts have been used effectively in engineering and scientific applications. Nor do I believe that compiler writers have had the incentive to consider how to generate efficient code for simple search loops. Still, sorting has always been a job which computers performed more efficiently than people, and there are a lot of applications where data won't necessarily always be analyzed in either the original input order or in order by a single parameter. Sorted indices aren't hard to build and they don't need to be slow. This should encourage you to write software that lets end-users perform better data analysis, by allowing them to quickly view data sorted in a variety of fashions.
References
Knuth, Donald. The Art of Computer Programming, Vol. 3. Addison-Wesley, 1973.
Plum, Thomas. Reliable Data Structures in C. Plum Hall, 1985.
Sedgewick, Robert. Algorithms. Addison-Wesley, 1984. Good treatment of sorts and searches; coverage of many of the other topics is weaker, but the book is easier to read than most such texts.