December 1990/Coding For Superscalar Architectures

Features

Coding For Superscalar Architectures

Dr. Timothy Prince

Timothy Prince has a B.A. in physics from Harvard and a Ph.D. in mechanical engineering from the University of Cincinnati. He has 25 years of experience in aerodynamic design and computation. He can be contacted at 39 Harbor Hill Dr., Grosse Pointe Farms, MI 48236.
C has become the fundamental programming language for machines of new architectures. By supporting C and porting some operating system from the UNIX family, a hardware vendor can help customers put a new architecture to use quickly enough to justify its development. Moreover, most applications far outlive the hardware on which they were based. Support in C for multiple architectures has become a necessity.
Much C code in the past has been written in a somewhat portable style, but with only a few architectures in mind. That can limit how well the code can be compiled in a way that produces good performance. The new pipelined "superscalar" computers present some opportunities for code performance improvement that have not been widely known. These new computers also increase the importance of certain techniques that have been discussed in the past but not often exploited. Most of these techniques can be used without degrading performance on other architectures, thus making a gain in performance portability.
Generation of efficient code on a variety of architectures first received attention with the advent of supercomputers that provide architectural support for operations on vectors. With improved compiler technology and increased competition among vendors, the definition of vectorizability has widened considerably. Vectorizability is one of the desirable features of portably efficient code.

Combining Loops With Breaks
Loops should be arranged to minimize conditional branching. Whenever possible, conditional blocks within a loop should end with break.

for(;;) { ... if (...) { ... break; } ... }
This kind of structure enables compilers to pipeline together the sections of code that lead to further iteration.
Combining as many operations as possible into each block of code reduces the fraction of execution time spent on looping control or conditionals. This is particularly important for typical RISC machines. They are designed to carry out several independent operations nearly concurrently. Modern compilers should not be inhibited by loops that contain conditional exits.
There are often many opportunities to combine loops. If the operations in the second loop don't depend on later operations of the first loop, consecutive loops with identical control structure can be combined. Even if consecutive loops differ by 1 in iteration count, they can still be combined. Consider the following:

for (i=0; i<n; ++i) ... for (i=1; i<n; ++i) ...
If the operations in the two loops can be intermixed, the loops can be combined as:

if (n>0) for (i:0; ; ) { ... if (++i>=n) break; ... }
If the second loop runs from 0 to n-2, the two loops can still be combined as:

if (n>0) for (i=0; ; ++i) { ... if (i>=n-1) break; ... }
Combining loops is most beneficial when it keeps the pipelines full. Also, when the same data are used in both sections, it allows a reduction in memory bus traffic by taking increased advantage of registers.

Increasing Available Parallelism
C programmers have tended to be more aware than others of techniques to reduce page faults in virtual memory systems. Two important methods are grouping related functions together in one source file and traversing large data arrays sequentially. With processor speed increasing faster than disk speed, these considerations are as important as ever. On the other hand, C programmers have not been as familiar with compilers that need to overlap operations or to perform them in a different order from the way they appear in program statements.
Most of those who have evaluated polynomials have heard of Horner's rule:

poly(x) = a0+x*(a1+x*(a2+x*(a3+....)))
which minimizes the number of multiplications required. This can pose an obstacle to the performance of pipelined machines. It prevents any attempt at "superscalar" performance, where more than one operation is performed at a time. Opportunities for "fine-grained" parallel computation are obtained by writing expressions in forms such as:

#define SQ(x) (x)*(x) poly(x) = a0+x*a1+SQ(x)*(a2+x*a3) +SQ(SQ(x))* (a4+x*a5+SQ(x)*(a6+x*a7))
Ideally, the inner multiplications involving a3 and a7 can be started, followed by the calculation of SQ(X) and the remaining inner multiplications. As each result comes out of the multiplier pipeline, it feeds into the adder pipeline. The first pair of results from addition are multiplied by SQ(x) and added to the other pair. By this time, the factor SQ(SQ(x)) has been calculated, so the final multiplication and addition can proceed in sequential fashion. With certain compilers, additional brackets are required to obtain the desired order of evaluation. These additional brackets are ignored by compilers that conform to K&R. It may be necessary to dictate the order of operations by writing, for example:

q= a4+x*a5; p= a0+x*a1; p += SQ(x)*(a2+x*a3); p += SQ(SQ(x))*(q+SQ(x)*(a6+x*a7));
This form needs at least four floating point registers for efficient evaluation on a scalar architecture. Two registers suffice for Horner's method. Still, the proposed evaluation scheme will not run out of registers on any popular architecture. For code clarity, it is useful to have a compiler that is able to determine the most effective order of evaluation without requiring the programmer to pay such attention to coding details.
Roundoff errors vary with the order of evaluation. While Horner's method is most accurate when the terms decrease in magnitude with increasing order, the suggested order has given satisfactory results with a wider variety of cases.
Compared with a Horner polynomial, this revised form of the expression has two additional multiplications. Superscalar operation calculates eight terms in less time than is typically required for four Horner terms on a machine with a three-stage pipeline. On superscalar machines, approximating math functions with polynomials of up to nine terms should be more satisfactory than using table lookup methods.
Frequently, expressions such as

max(0, min(x, 1))
are required. This expression contains two comparisons, which we would like to have pipelined. Code such as

x>0 ? min(x, 1) : 0
may be almost twice as fast. But if the context is such that other independent operations will absorb the time anyway, the first version may be preferred to conserve registers. Many other obvious cases occur where an expression can be written so as to give the compiler freedom to choose the fastest order of evaluation on the target machine.

Unrolling Loops
Numerical problems often involve many small searches. Particularly on pipelined machines, the arrays are likely not to be large enough to benefit from conversion to binary search. With a pipelined architecture, a linear search loop needs to be unrolled to gain full performance. Several current compilers do this automatically, giving the equivalent of:

for (i = -1; ; ) { if (x[++i] < goal) break; if (x[++i] < goal) break; if (x[++i] < goal) break; ... }
The unrolling required for a given fraction of peak performance is proportional to the total number of pipeline stages required for memory fetch and comparison. As typical pipelined architectures use four stages for memory access and three for floating point comparison, unrolling six times would produce 50% of peak performance. This unrolling would more than triple the speed with less than double the number of instructions in the loop body. Evidently, if unrolling is required for full performance on many problems, an unrolling compiler is needed to avoid the requirement for changes in the source code to accommodate various architectures.
Unrolling compilers form a logical bridge to the style of operation needed on vector machines. Coding styles currently in favor for vector machines are often also suitable for scalar and superscalar machines.

Recursion, True or Apparent
The greatest weakness of vector architectures is their unsuitability for loops of a recursive nature. In the most common cases, the recursion is only apparent. The loop can be altered in several ways to make it vectorizable. Apparent recursion usually poses no problem for superscalar architectures.
When the recursion is real, there may be ways to group the terms so that superscalar performance is available. For example, Clenshaw's algorithm for evaluation of a Chebyshev polynomial involves the loop:

for (dd=0, j=m; - -j>=1; ) { d=y*2*(sv = d) +c [j] -dd; dd=sv; } return y*d-dd+c[0]*.5;
Assuming that the loop must execute at least once, and removing an unnecessary subtraction from the first trip, the loop may be reorganized to permit the array elements to be fetched before they are needed:

for (dd=c[j=m-1]; ; ) { d=y*2*(sv=d) +dd; dd=c[- -j]; if (j==0) return y*d-sv+dd*.5; dd -= sv; }
A pipelining compiler should generate code that requires only the time of one multiplication and one add per loop iteration. Compilers for non-pipelined machines also are able to generate efficient code for the second form.
Recursion limits this loop to 50% greater than scalar performance. Merging such loops with other loops is a useful route to improved superscalar performance — current vector compilers should be able to split off vectorizable portions when necessary. Multiple evaluations of such an expression can be vectorized by looping across the independent cases inside the recursion loop. With unrolling, this technique also improves performance of superscalar machines.
Clenshaw's algorithm was chosen for this example because it is a straightforward illustration of the points to be made. Other more common situations, such as fitting spline curves by solution of tridiagonal systems, share many of these characteristics.
"Pump priming" involves copying enough vector elements to registers before the loop starts so as to eliminate delays in arithmetic pipelines. It is usually beneficial for pipelined machines. It does, however, introduce apparent recursion, which inhibits vectorization and reduces the feasible amount of unrolling. So pump priming is best restricted to situations where there is true recursion, or where there is no desire to maintain performance on a vector machine. Pump priming was performed by compilers for architectures developed by Floating Point Systems and Control Data Corporation. Current unrolling compilers do not consider pump priming as an alternate strategy. It is most useful when recursion or shortness of the loop prevent achieving full performance by loop unrolling.
Apparent recursion can be eliminated by creating a vector that would not otherwise be required. It can also be eliminated by putting appropriate additional elements at the beginning or end of a vector. Such elimination can allow a superscalar machine to keep its pipelines full without duplicating operations. The following simple example arises in problems involving finite differencing in cylindrical coordinates. The apparently recursive form is:

for (im1=n, i=1; i<= n; im1=i++) flux[i]=q[i]-q[im1];
The vectorized form is:

for (q[0]=q[n], i=1; i<=n; ++i) flux[i]=q[i]-q[i-1];
The C compiler needs access to the declarations of the arrays, or the code must contain special directives, to assure that the arrays do not overlap. Unlike other languages, C does not give any assurance that arrays passed into a function as arguments do not overlap.

C Branching Syntax
In C, the conditional expression:

if (a && b)
has the same meaning as:

if (a) if(b)
In many cases, the same result will be obtained from the expression:

if (a & b)
The last expression leaves no doubt that a and b can be evaluated in parallel. A similar relationship holds with the | |and | operators. When there are no side effects, such as function calls, the code with bitwise operators may be more efficient.
Since there is no penalty in generated code for writing expressions such as:

a!= 0 && b != 0
there is little excuse for the shorthand:

a && b
Replacing && with & can change the meaning of the second form but not the first. That makes the shorthand more conducive to errors.
Using the bitwise operators permits the compiler to substitute a logical operation for a branch. Many modern architectures have efficient instruction sets for evaluating relational expressions without branching. Source code that does not require additional branches can profit from such architectures.

Conclusion
Coding for greater available parallelism improves the performance of software on superscalar and vector architectures. In most cases, there is little or no penalty on any current architecture. Awareness of techniques such as unrolling, with the use of appropriate software development tools, can improve performance with minimal change to current applications. In a few cases, adopting a coding style that differs from common practice can help compilers for the new architectures perform better code generation.