Timothy Prince has a B.A. in physics from Harvard and a Ph.D. in mechanical engineering from the University of Cincinnati. He has 25 years of experience in aerodynamic design and computation. He can be contacted at 452 Palmitas St., Solana Beach, CA 92075.
Pipelined architectures, including vector and superscalar designs, obtain their speed in part by working on independent calculations in pipeline fashion. Most computationally intensive applications present enough opportunities for parallel or pipelined operation to make worthwhile increases in speed. The normal use of scalar math functions, which produce a single result per call, poses an obstacle to superscalar performance. These functions can be organized to increase the internal opportunities for parallelism compared with optimum scalar-processor code, but performance remains far short of the potential unless more parallelism is exploited by working on more than one math function result at a time. A further reason for obtaining multiple results is that the overhead of calling functions which take less than 10 microseconds becomes excessive.
Soon after the introduction of vector computers, vector math functions were introduced to provide vector performance in calculations involving such functions. With pipelined or superscalar processors, vector functions may be effective, but functions which calculate a small number of copies per call are more versatile. Since a typical RISC architecture employs a four-stage pipeline, functions which calculate two or four copies should be enough to maximize performance.
Vector chunk math functions may be used whether or not your compiler makes specific provision for them. I will show actual examples of the C code of such functions. The functions sin_4 (four sins), cosf_sinf_2 (two pairs of float sin and cos), powf_2 (two float pows), and tan_2 (two tans) are chosen for their proven usefulness and because they illustrate the points which I want to make.
Vector vs Superscalar Function Calls
On a vector architecture, vector math functions naturally are performed on argument vectors, and normal vector performance is not approached except on long vectors. These architectures perform well when 50 or more functions are to be calculated at a time. Suppose we wanted to integrate a function involving sin and cos by Simpson's rule, producing a loop such as
    for (i = 2; i < n; i += 2) {
        yint += q[i-2]*sin(t[i-2]) + 4*q[i-1]*sin(t[i-1]) + q[i]*sin(t[i]);
        xint += q[i-2]*cos(t[i-2]) + 4*q[i-1]*cos(t[i-1]) + q[i]*cos(t[i]);
    }

which involves n evaluations of sin and cos. A vector compiler would start out by setting up the six vectors made up of the three sin and cos evaluations from each copy of the loop body. Almost a third of these evaluations would be duplicates, since the vector of sin(t[i-2]) is the same as the vector of sin(t[i]) except that sin(t[0]) and sin(t[n-1]) are not repeated. Each of these vectors has length (n+1)/2, so n would have to be around 100 before good vector performance could be achieved.

In order to approach the performance potential of a vector architecture, we would have to rewrite the code to store the q[]*sin() and q[]*cos() intermediate results in vectors in a preliminary loop, and then add the appropriate values in another loop. Even after 20 years of vector compiler development, this is more analysis than any compiler can do without human assistance. Most reasonable attempts to improve the performance of this loop for a scalar architecture will prevent vectorization, and changes to improve vector performance will reduce scalar or superscalar performance. Although vector compilers now deal well with sum reductions such as this loop, this is done in effect by splitting the vectors again into six to ten shorter vectors, making a vector architecture less than fully effective for this type of application.
Unrolling compilers can eliminate most of the duplicate operations by combining the common subexpressions over several iterations of the loop. Each additional loop iteration will require an additional pair of sins and coss. Compilers have been available (e.g. Multiflow) which would detect this situation automatically and build in a call such as cosf_sinf_2(t[0],t[1]) which returns cosf and sinf of both arguments, a total of four results from one function call. Such a function is a good match to the architecture of a superscalar processor. Even if your compiler does not perform the transformation automatically, manual rewriting is not unduly burdensome and need not detract from performance on scalar processors. A change as simple as
    ty = q[0]*sinf(t[0]);
    tx = q[0]*cosf(t[0]);
    for (i = 2; i < n; i += 2) {
        temp = cosf_sinf_2(t[i-1], t[i]);
        yint += ty + q[i]*temp.sin2 + 4*q[i-1]*temp.sin1;
        xint += tx + q[i]*temp.cos2 + 4*q[i-1]*temp.cos1;
        tx = q[i]*temp.cos2;
        ty = q[i]*temp.sin2;
    }

should produce most of the advertised performance of any non-vector machine.
Vector and Vector Chunk Function Coding Style
Multiple copy or vector chunk math functions, like vector code, need to be written without conditionals which actually cause a transfer of control. In a scalar trig function, it would usually be worthwhile to test the argument to find out whether it needs to be translated into the primary range. In a vector chunk function, the full range reduction should be performed whether it is needed or not, so that all of the code for the function can be compiled as one basic block.

Transfer of control (branching) gives the compiler a choice of undesirable consequences. Either the pipelines must be allowed to empty, reducing performance to scalar levels until they are refilled, or trace scheduling must be used to fill the pipeline with future operations along the preferred path of execution. Each branch can cause generation of another trace, and the length of compiled code may grow exponentially with the number of branches. The speed gained by keeping the pipelines full may be canceled by increased paging.
A great deal of progress has been made in architecture and compilers in recent years, so that many simple conditional selections can be performed without a transfer of control, where this is necessary to keep a processor producing results at rated speed. This requires calculation of both alternative results, followed by instructions which select the correct one. There is a good correspondence with the syntax of the ?: operator in C, although the compiler should not be totally dependent on the programmer choosing to use ?:. Existing compilers do not vectorize if..else. All of the operations in the sin, cos, tan, exp, and log functions can and should be written in vectorizable form, even when the overall scheme is to favor superscalar execution.
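On such a machine, a sign selection of the kind used later in the sin functions can be sketched as follows (the names are illustrative, not from the listings):

```c
/* Conditional sign selection written with ?: so that the compiler can
   emit a select/conditional-move instruction instead of a branch.
   Both alternatives are computed; pm is the multiple of pi removed
   during range reduction. */
double select_sign(double poly, int pm)
{
    double neg = -poly;            /* alternative result, always computed */
    return (pm & 1) ? neg : poly;  /* selection, no transfer of control */
}
```

Both paths through the expression are straight-line code, so the function compiles to a single basic block on architectures with a select operation.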
Vector chunk functions do not fit well with the <errno.h> system for error reporting. The best that can be done is to report that ERANGE or EDOM exceptions have been raised for one of the arguments or results. Vector functions give an even hazier indication of trouble.
Some Nuts and Bolts of Machine Dependence
In some of the examples, the sign of a float or double is tested by assuming that it is in the same position as the sign of an int which shares the same storage. This is done either because it is faster or because it reduces register thrashing on certain machines. Generally, in these functions, there is an imbalance of double over integer arithmetic, and integer operations can be treated as a free resource whenever float operations are being performed at the same time.

This code will work as is on most modern architectures which use the same byte order for double and int. On VAX-compatible architectures, the sign of a double falls in place with the sign of a short at the same address, apparently as a result of the PDP-11 ancestry.
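A minimal sketch of such a sign test, assuming 32-bit ints and IEEE doubles, and using compiler byte-order macros where available (this is machine-dependent by design):

```c
/* Machine-dependent sign test: reads the sign of a double through an
   int occupying the same storage, so the test can run in the integer
   pipeline while the double registers stay busy. Which word holds
   the sign depends on byte order; this sketch covers the common
   cases via predefined compiler macros. */
union double_bits { double d; int i[2]; };

int double_is_negative(double x)
{
    union double_bits u;
    u.d = x;
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return u.i[1] < 0;    /* sign bit is in the high-address word */
#else
    return u.i[0] < 0;    /* big-endian: sign word comes first */
#endif
}
```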
A few architectures also suffer from the functions' excess of divide and multiply operations over add and subtract, so, in the examples, addition is sometimes used instead of multiplication by 2. In these examples, this occurs where there are plenty of pipeline slots open; in other cases, one would not want to prevent an optimizing compiler from converting a multiplication by 2 into a ldexp operation.
As the conditionals which would be required for architectures not conforming to IEEE P754 standards would clutter up the code, I have simply put in #error preprocessor directives, which are ignored by non-ANSI compilers because they are indented.
Since a good pipelining compiler will give priority to finishing up the expressions which are placed first in the code, the later copies of the functions tend to fall behind. This may be aggravated by compilers which give priority to loading constants into registers long before they can be used. The way to compensate is to write the earlier copies so as to minimize use of registers and leave more empty pipeline slots which can be filled up by arithmetic from the later copies.
The later copies are written for more available parallelism at the expense of register usage, so that, when the calculation of the earlier results has finished, the pipelines can still be kept full nearly to the end of the function. This can lead to somewhat greater round off errors in the later copies, in the double functions. In the float functions, use of double arithmetic eliminates the effect of order of operations on accuracy.
Systems which are unable to perform simple conditional selections without branching may require sign changes to be performed by xoring the sign bit. To avoid branching, errno may be left alone or set by
    errno = ERANGE & (-(relational expression));

which sets it to zero or to ERANGE. This is contrary to the normal requirement that errno never be set to zero, but may be a satisfactory compromise.
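The trick can be packaged as a sketch (the function name and the limit test are my own illustration):

```c
#include <errno.h>

/* Branch-free errno update: a relational expression yields 0 or 1,
   so negating it gives an all-zeros or all-ones mask, and the and
   with ERANGE stores either 0 or ERANGE without a branch. */
void set_errno_if_over(double x, double limit)
{
    errno = ERANGE & (-(x > limit));
}
```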
Calculation of Coefficients
Listing 1 shows a bc program for calculation of coefficients for sin, as used in sin_4.c. Running it with double arithmetic in C will produce the same results up through at least 10 digits. Because bc uses fixed point arithmetic, it needs extra fractional digits for sin, more than are needed for most problems. The same program will work if the t function is replaced by a(x)/x, with appropriate changes in the interval. The coefficients for log base 2, used in powf_2.c, are calculated by having the t function evaluate the appropriate Taylor-Maclaurin series. With overnight runs, bc can calculate coefficients up to 50 significant digits. These Chebyshef subroutines are adaptations of those given by Press, Flannery et al (1).
Multiple Copy sin
Listing 2 shows the four-copy sin function. The code which performs range reduction, by subtracting off the nearest multiple of pi, uses a rint function, but takes advantage of the fact that dividing by pi does not change the sign. It assumes that addition is performed in the highest available precision, which may be more than double. rint is not covered by standards, and its result may depend on rounding mode, so it does not solve the portability problem. Use of long double precision in these operations is highly desirable, but of little value unless a true long double value of pi is available. long double should prevent degradation of accuracy for arguments up to pi*10^(LDBL_DIG-DBL_DIG).

The sign of the argument is ored into the rounding constant in order not to tie up as many double registers, so that the operations on subsequent copies will not be delayed. This procedure avoids branching on processors which do not have a select operation. Portability at the expense of speed can be obtained using expressions such as
    pm = (int)(x/pi + (x > 0 ? .5 : -.5));

or
    pm = (int)(x/pi - .5 + (x > 0));

since, if fabs(x/pi) exceeds INT_MAX, there probably aren't more than three digits of significance left. Since FORTRAN and Pascal have round-double-to-integer syntax, certain processors (e.g. MIPS) have implemented it as a single instruction, which is not used by C compilers.

The integer overflow situation is reported as errno=ERANGE, without distinguishing which of the four arguments caused it. Non-portable code for testing the exponent field to identify this situation is used because, on the system where the code was tested, there weren't enough double registers to squeeze in any more standard arithmetic without stretching the code out by 30%. There are ways to test whether pm is odd without ever casting to int, so that range errors are avoided out to pi/DBL_EPSILON, but it's not worth the trouble.
Covering the whole interval from -pi/2 to pi/2 with a single curve fit avoids conditional branches which are particularly troublesome for vector or vector chunk coding. An eight term Chebyshef-economized polynomial is just sufficient to hold the errors to 1 unit-in-the-last-place with DBL_MANT_DIG = 53, in the absence of other approximations. Putting the interval end points where the function has zero slope helps prevent round off error from introducing discontinuities.
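The shape of such a single-interval evaluation can be sketched as follows; Taylor coefficients are used here as a stand-in, since the economized coefficients come from Listing 1, and they hold the error near 1e-11 over the interval, a little worse than an economized fit of the same degree:

```c
#include <math.h>

/* Single polynomial sin on [-pi/2, pi/2], Horner style, no branches.
   Taylor coefficients through x^15 stand in for the article's
   Chebyshef-economized set. */
double poly_sin(double x)
{
    double z = x*x;
    return x*(1.0 + z*(-1.0/6 + z*(1.0/120 + z*(-1.0/5040
            + z*(1.0/362880 + z*(-1.0/39916800
            + z*(1.0/6227020800.0 + z*(-1.0/1307674368000.0))))))));
}
```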
Horner polynomial evaluations are begun before the sign of the result has been determined, leaving the sign switching to be performed when the compiler finds the necessary pipeline slots. The third and fourth polynomial evaluations will lag well behind the first and second, so the third and fourth Horner polynomials are split in two so that the pipelines can be kept fuller after the earlier polynomial evaluations are complete. This adds two multiplications and one possibly significant round off error in each of the third and fourth results.
The fourth copy differs from the third only in that the code is written with parentheses to force the final additions to occur in the most parallel (but not most accurate) order. The dummy multiply by 1 is needed to force K&R compilers to honor the parentheses, but has no effect in ANSI syntax. Since similar techniques are used to a greater extent in scalar math libraries for superscalar processors, these less accurate results are likely to be closer to the scalar results.
This function should achieve a megaflop rating better than the LINPACK rating on many processors, which is unusual effectiveness for such complex code. One of the ways it could be used would be to combine calculation of unrelated sins and coss, using the relationship
    #define cos(x) sin(PI/2-(x))

as needed. A similar tactic should pay off on vector architectures, in which the various arguments are copied to a temporary vector so that the vector sin function can be used.

Effective pipelining of this function appears to require more than 16 double registers, along with special efforts to perform as many calculations as possible in int registers. Examination of the output of an early MIPS compiler showed that it was able to economize on the size of generated code by setting the constants only once. Like many RISC architectures, MIPS has immediate constants available only to initialize registers, not to participate directly in floating point operations. This may not leave enough registers available for extensive pipelining.
Optimization for reduction of length of generated code prior to scheduling of operations is less well correlated with execution speed on pipelined than on scalar processors. The MIPS software does not report the number of empty pipeline stages. The compiler for the original Multiflow 7/200 compiles this code in 96 major instruction cycles and obtains a superscalar speedup factor of 4. Only six of these instructions are empty, all occurring after the first copy result is complete. sin_4 on the Multiflow is twice as fast as their library sin, giving four results in 12 microseconds. On the Silicon Graphics 4D/25, both sin_4 and the library sin take about four microseconds per result.
Listing 3 shows a test driver to compare the results of sin_4 with sin. While many compilers allow passing a double to a function which receives it as a union, other compilers push a union on the stack in a different order from a plain double. It is safer to make sin_4 copy the arguments into its unions. On one of the compilers tested, the generated code is the same either way.
The Chebyshef fit of Listing 1 can be changed to use sin_4, after changing from bc to C syntax. The order of Chebyshef fit may as well be a multiple of 4. The accuracy of math function approximations, such as the functions discussed in this article, can be tested by fitting Chebyshef polynomials and comparing the coefficients with those obtained by a higher accuracy calculation in the same interval.
Multiple Copy Float cos and sin
Listing 4 shows a function to calculate cos and sin of two arguments in float precision. Since it uses rational polynomial approximations, there is more built-in opportunity for parallelism than in a Horner polynomial, and two such functions are enough to fill a four-stage pipeline at the peak stages. Without prototypes, the only way to pass float arguments without widening to double is by unions. With prototyping, it would be better to pass float arguments and copy them to unions inside the function.

One multiply can be eliminated from the critical path by scaling the arguments to multiples of pi/2 and adjusting the polynomial coefficients accordingly. The division by 2 of the half-angle formulae is buried in the coefficients, so the range reduction maps the arguments into the range -2 to +2. Adding and subtracting 4/LDBL_EPSILON produces a number which is rounded to the nearest multiple of 4. As long as promotion to IEEE double is used, so that no precautions against underflow are needed, there would be no problem in changing the scaling so that the code could start off
    tn = x1.flt/2/PI - rint(x1.flt/2/PI)

in case that could be calculated more efficiently. The choice of scale was influenced by the desire to maintain accuracy if base 16 arithmetic is used.

Scaling the arguments would produce an additional round off error if the calculations were performed in float precision, but double is almost mandatory anyway, as it prevents degradation of accuracy for arguments up to 2pi*10^(DBL_DIG-FLT_DIG). A warning such as storing a value into errno could be provided when larger arguments arrive, but this is not clearly a failure meriting the ERANGE label unless the argument becomes so large that the rint code won't work. Basing the errno calculation on values which are calculated anyway minimizes the use of additional registers.
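The add-and-subtract rounding trick can be sketched in isolation. Here a 6/DBL_EPSILON constant serves both signs in one step, where the listing instead ors the sign of the argument into a 4/LDBL_EPSILON constant; the trick assumes round-to-nearest double arithmetic with no extended-precision evaluation:

```c
#include <float.h>

/* Round to the nearest multiple of 4 by adding and subtracting a
   large constant. At magnitude 6/DBL_EPSILON (1.5*2^54) the spacing
   of representable doubles is exactly 4 for both signs of x, so the
   addition rounds away everything below a multiple of 4. volatile
   keeps the compiler from cancelling the two operations
   algebraically. */
double round_to_multiple_of_4(double x)
{
    volatile double t = x + 6.0/DBL_EPSILON;
    return t - 6.0/DBL_EPSILON;
}
```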
The first rational polynomial is calculated Horner style, and the last attempts to catch up by calculating all terms individually, at the cost of one additional multiplication. The scheme of eliminating one of the coefficients by choice of scale allows the numerator to get a head start so that the final multiplication can be performed without delaying the division. The compiler may have to be forced into performing the first add in the denominator without waiting until the last term has been calculated. Certain compilers insist on converting the repeated divisions into multiplications, which is no problem when the operations have been promoted in precision.
Vector Chunk float pow
The pow function in C is expected to embody two entirely different types of operation. In order for it to be vectorizable, or to obtain good vector chunk performance enhancement with current compilers, it has to be restricted to the cases of positive base, where it can be replaced in effect by
    #define pow(x,y) exp(log(x)*y)

This could be done with a top-level powf_2 which determines whether both pairs of arguments are of one type, and, if so, invokes an appropriate vector chunk function. The usual test is whether y1 and y2 are changed by casting to int and back to float. It doesn't hurt much to use the log treatment anyway, unless x is negative. If the argument pairs cannot be processed by the same algorithm, it would have been more efficient not to have tried to treat them as a vector chunk at all.

The function of Listing 5 does not take care of the negative base case, which is OK according to ANSI standards if it is called as the implementation of the FORTRAN real exponentiation operator. I use it in this form in time-marching aerodynamics codes, where it gets executed millions of times.
Promotion to double is really needed only in the sections involving addition of the integer exponent to the base range log2 up to the splitting of the exp2 argument into an integer plus or minus a fraction, and then only when the result is far from 1. The somewhat complicated system for splitting the base into modified frexp form works quickly and accurately without widening on a system without gradual underflow. On architectures such as VAX which use a different byte order for float and int, the unions and constants are different.
If gradual underflow is to be supported without widening the precision, it will require special case treatment. To reduce degradation of accuracy if widening is not used for addition of the integer and fraction parts of the log2 function, log2 should be split into a power of 2 plus a smaller term. This leads to complicated code which may require branching, thus defeating attempts to gain pipelined performance.
Evaluation of log2(x2)*y2 is speeded up by grouping the terms in pairs. The calculation log2(x2)*y1 then becomes a bottleneck until the multiplication by y1 is distributed onto the two groups, one of which consists of the three-term Horner polynomial. Multiplication of y1 by the integer exponent is performed well before it is needed.
Making such detailed adjustments for a given system is possible only with readable assembly language which displays the final scheduling of the pipelined operations, and is helped greatly by static profiling which gives the effective clock count for each block of generated code. Since we try to write these functions so that there is only one code block, and there are few memory accesses which could introduce bus delays, the speed will not depend on data and there should be no question what effect each change has on speed.
In order to make the ROUND macro work the same under K&R syntax as it would in ANSI C, dummy multiplications by 1 are introduced. Otherwise it is a matter of luck whether a K&R compiler will generate the required code, although the left associativity of the + and - operators should produce a preference for left to right evaluation. From an algebraic point of view, ROUND would do nothing, and AI techniques could conceivably allow a compiler to know this. The peculiar syntax of K&R which requires such multiplications by 1 makes it semi-obligatory for optimizing compilers to eliminate the redundant operation, unless compiling for an architecture which may generate faster code with alternating multiplication and addition.
If the compiler is unable to generate efficient code for the max and min macros, it would be better to perform the ldexp operations on doubles and hope that the extra range of double will take care of over and underflows.
Multiple Copy tan
The tan_2 function (Listing 6) requires the least non-portable coding for optimum results, but it illustrates optimizations which have not appeared in the functions discussed above.

Range reduction consists simply of subtracting the nearest multiple of pi, and there is no advantage in playing games with unions. The comparisons start into the pipeline first and are completed before the divides, which may have been converted to multiplications by the compiler.
The remainder of the calculation consists of evaluation of a rational polynomial. On a machine with a divide which pipelines at the same intervals as the other operations, straightforward Horner evaluation of the numerators and denominators might work as well as anything. On the processor for which this code was tuned, a divide operation delays the pipeline, but does not affect addition. For this reason, the order of operations is set up to push all of the multiplications into the pipeline before the divides, as well as to start the first division as soon as possible.
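The shape of such a rational evaluation can be sketched with a low-order continued-fraction approximation; these are not the listing's tuned coefficients, and a production version would need more terms for full double accuracy:

```c
/* Rational approximation to tan on the reduced range, derived from
   the continued fraction x/(1 - x^2/(3 - x^2/(5 - x^2/7))). The
   numerator and denominator are independent, so both can enter the
   pipeline before the single divide starts. */
double tan_rational(double x)
{
    double z = x*x;
    double num = x*(105.0 - 10.0*z);   /* can run in parallel with den */
    double den = 105.0 + z*(z - 45.0);
    return num/den;
}
```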
The terms of the numerator and denominator are grouped so that operations are always ready to start into the pipeline, and to take advantage of architectures which have separate pipelines for multiplication and addition.
In the denominator of the first copy, the high order terms must be added in order of increasing complexity. This could be done with parentheses with an ANSI compiler. Possibly better accuracy could be obtained by adding the high order term last. Order is not dictated in the low order terms of the first copy, so as to avoid delaying the second copy. In the numerator of the second copy, the final multiplications are distributed so that they may be performed before the final addition, in order to get the multiplications out of the way early, as well as to allow the second calculation to begin to catch up with the first.
Summary
Much of this article will appeal only to those who like to tweak code for another 20% in performance. I have concentrated on the points where architectural dependencies pop up and tried to show where their impact can be reduced for relatively small performance penalties. Math library functions are probably the closest thing to applications where non-portable code is appropriate. This is the reason for these functions (at least the scalar versions) being defined in the C standard, so that they need not be carried as part of an application.

The extent to which special vector chunk functions should be used to perform math library operations in groups may be questioned. An application which uses these functions probably should provide an alternative header file which will cause them to be replaced with standard functions. Compilers are most likely to begin to incorporate such functions automatically if they produce benefits on the standard benchmarks.
Scalar math functions can detract from the performance of superscalar processors. The techniques shown enable superscalar performance to be obtained in the evaluation of grouped math library functions. In typical applications, the percentage of execution time spent in math functions can be reduced in comparison with a scalar processor.
References
John Ellis, Bulldog: A Compiler for VLIW Architectures, MIT Press, 1985 (avoidance of code explosion, trace-scheduled pipelined code).
Press, Flannery, Teukolsky, Vetterling, Numerical Recipes in C, Cambridge University Press, 1988, for Chebyshef analysis.
John Palmer, Stephen Morse, The 8087 Primer, Wiley, 1984 for some math function concepts.
P. J. Plauger, Jim Brodie, Standard C, Microsoft Press, 1989 for the simplest satisfactory explanation of float.h, limits.h, math.h.
T. Prince, "Generating Source for <float.h>," The C Users Journal, V8N6, June 1990.
Listing 6: tan_2.c