Features


Performance Tuning a Complex FFT

Tim Prince


Tim Prince has a B.A. in physics from Harvard and a Ph.D. in mechanical engineering from the University of Cincinnati. He has 25 years of experience in aerodynamic design and computational analysis. He can be reached at Box 3224, Rancho Santa Fe, CA 92075.

Introduction

Much of the current lore on how to program in C and other languages for best execution speed is based on older architectures, such as the PDP, VAX, and 8088, and on early compilers. Modern compilers for RISC architectures must do a good job of automatic register allocation and instruction scheduling or lose the benchmark wars. While much excellent literature on algorithms is available, published code often falls short on performance. In this article, I show step-by-step performance enhancements for a commonly used algorithm, the complex fast Fourier transform. The algorithm was chosen as much for being a typical example of resolving performance bottlenecks as for its practical interest.

Basic Code

The starting point is the complex FFT code given by Press et al. (1988), slightly modified for compatibility with their earlier FORTRAN version. (The original is credited to N. Brenner.) The data are single precision (float); the overhead of trig function evaluation is minimized by calling sin only in the outer loop, in double arithmetic, to keep error accumulation from degrading the results.

The modifications consist of passing arguments by reference, using zero-based arrays, and not using shift operators when multiplication is meant, as modern compilers make these conversions internally.

Each version has been tested using the test drivers published by the same authors. Timing tests are performed with clock() on batches of 1,000 transforms of 1,024 data points each. The published version took 4.6 milliseconds per 1,024-point transform on an HP730.

Precision Conversions Outside Inner Loop

The code was originally written for K&R compilers, some of which did not narrow the tempr and tempi results to float precision, because they promoted all floating-point arithmetic to double. Declaring all local variables double produces slightly better accuracy than the original code, at least with Standard C compilers. If more than seven significant digits are required, everything, including the data[] arrays, must be typed double.

Placing (float) casts on the double variables in the inner loop permits the compiler to move these data conversions to the pre-loop setup code block, which helps on architectures that do not perform mixed-precision operations efficiently. This change does not reduce the accuracy significantly. The 1,024-point transform now takes 3.85 milliseconds.
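The inner loop now looks something like this (a sketch, with the loop lifted into an illustrative function wrapper rather than left inline as in the published code):

```c
#include <stddef.h>
#include <math.h>

/* One butterfly pass with (float) casts on the double twiddle
 * factors.  Written this way, the compiler can hoist the two
 * double-to-float conversions into the pre-loop setup block instead
 * of mixing precisions on every iteration.  (The function wrapper
 * and its name are illustrative.) */
void butterfly_pass(float data[], size_t n, size_t m,
                    size_t mmax, size_t istep,
                    double wr, double wi)
{
    size_t i, j;
    float tempr, tempi;
    for (i = m; i < n; i += istep) {
        j = i + mmax;
        tempr = (float)wr * data[j]     - (float)wi * data[j + 1];
        tempi = (float)wr * data[j + 1] + (float)wi * data[j];
        data[j]     = data[i]     - tempr;
        data[j + 1] = data[i + 1] - tempi;
        data[i]     += tempr;
        data[i + 1] += tempi;
    }
}
```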

Help Compiler Optimize Data Fetches

Sufficient information is contained in the code to prove that j is always greater than i+1 wherever the data arrays are accessed, so a hypothetical optimizing compiler could determine that the j subscripts cannot refer to the same data as the i subscripts. Existing compilers do not perform such an analysis. In the original version, the generated code allowed for the possibility that modifying data[j] and data[j+1] (which are stored first) might change data[i] and data[i+1]. This prevents keeping data[i] and data[i+1] in registers. Most compilers do take advantage of the non-overlap of pointers that differ by a known constant.

The cure is simple — copy the data into a local register-eligible variable that is not affected by assignments to array elements. With this change, the compiler is free to generate code that fetches the data sooner. Earlier data fetches reduce the impact of cache misses, although this should not be a factor in the present application. Now the same FFT is performed in 3.6 milliseconds.
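With the local copies, the butterfly loop reads along these lines (again a sketch in an illustrative wrapper):

```c
#include <stddef.h>
#include <math.h>

/* Butterfly pass with data[i] and data[i+1] copied into
 * register-eligible locals before anything is stored.  The compiler
 * no longer has to assume that the stores through j invalidate the
 * values read through i, so the fetches can be scheduled early.
 * (The function wrapper and its name are illustrative.) */
void butterfly_pass2(float data[], size_t n, size_t m,
                     size_t mmax, size_t istep,
                     double wr, double wi)
{
    size_t i, j;
    float datai, datai1, tempr, tempi;
    for (i = m; i < n; i += istep) {
        j = i + mmax;
        datai  = data[i];            /* fetched before any store */
        datai1 = data[i + 1];
        tempr = (float)wr * data[j]     - (float)wi * data[j + 1];
        tempi = (float)wr * data[j + 1] + (float)wi * data[j];
        data[j]     = datai  - tempr;
        data[j + 1] = datai1 - tempi;
        data[i]     = datai  + tempr;
        data[i + 1] = datai1 + tempi;
    }
}
```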

Simplify Use of Trig Functions

As the published code calculates cos(theta) from sin(theta/2), and sin(theta/2) is a value that will be used in the next outer loop iteration, rearrangement allows just one computation of sin per outer loop iteration. The published code, rather than adding 1 in this calculation of cos, delays the addition to the end of the recursion formula in the middle loop. This is a standard technique for preservation of accuracy, but it is unnecessary since the recursion formulae are already coded in higher precision than the inner loop.

With cos calculated in the outer loop, the benchmark timing is 3.5 milliseconds. Numerical results were not affected by this substitution. By tinkering with the assembly code, it is possible to get this down to 3.3 milliseconds: because the cache hit rate is higher than normal, certain load instructions can be delayed until just before an instruction requires the result of the previous one, so that all instructions are given two cycles to execute. The final source code is shown in Listing 1.

Pipeline Priming (Partial Unrolling)

All of the modifications discussed previously should be helpful or at least neutral on any modern computer. We have gained 30% in speed on the HP computer, without doing anything specific for that architecture. As most instructions on the HP take more than one instruction clock cycle to execute, the compiler attempts to schedule instructions so that operations can overlap. In several places, the generated code does not have the optimum number of instructions between an instruction that changes the contents of a register and one that uses the new contents of the register.

Because of the sequential nature of the code, only two of the four multiplication operations are executed in parallel with addition operations. Two of the last four add and subtract operations can be written as two-operand operations without introducing an extra register copy; these are eligible to execute in parallel with a multiply on the HP. To get more parallelism, and to allow more cycles for completion of each instruction, you need loop unrolling. The unrolling techniques performed by many current compilers increase the time required to start up the loop. That is unsatisfactory here, since the average inner loop count for an FFT of order 1,024 is only five.

You can prime the pump, so to speak, by rearranging the code to permit pipelines to fill sooner. As the code normally is written, the only operations that are ready to be performed at the top of the loop are data fetches from memory to register. You need enough data in registers to permit the floating-point arithmetic pipelines to start immediately. On the HP architecture, you need to perform the first two multiplications also at the end of the previous iteration of the loop, so that they may be performed in parallel with addition or subtraction operations. This keeps the pipelines full closer to the end of the loop body.

One copy of the first part of the loop body is placed ahead of the iterated part of the loop, and another at its end. The remainder of the original loop body is copied after the iterated part. All of the data used in the new loop body are copied into registers before any results are stored into the data arrays, to work around the lack of compiler analysis for data overlap. Changing the way j is calculated for the next iteration saves calculating it in the post-loop body, but makes no difference otherwise. With the conditions arranged to run one fewer iteration of the loop, you end up performing exactly the same job, except that there are two copies of all of the loop body code and, in effect, one iteration of the loop is always performed. The unrolled inner loop is shown in Listing 2.
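The arrangement can be sketched as follows. This is a reconstruction in the spirit of Listing 2, not the listing itself; the function wrapper and the pre-narrowed float twiddle arguments are illustrative:

```c
#include <stddef.h>
#include <math.h>

/* Software-pipelined butterfly loop: the loads and multiplies for
 * iteration k+1 are issued in the same pass as the adds and stores
 * of iteration k, so the floating-point pipelines stay full across
 * the loop back edge.  Assumes at least one iteration (always true
 * in the FFT).  (Wrapper and names are illustrative.) */
void butterfly_unrolled(float data[], size_t n, size_t m,
                        size_t mmax, size_t istep,
                        float wr, float wi)
{
    size_t i = m, j = m + mmax, i2, j2;
    /* prologue: first half of iteration 0 (fetch and multiply) */
    float di  = data[i], di1 = data[i + 1];
    float tempr = wr * data[j]     - wi * data[j + 1];
    float tempi = wr * data[j + 1] + wi * data[j];
    float di_n, di1_n, tr_n, ti_n;
    while (i + istep < n) {
        i2 = i + istep;
        j2 = i2 + mmax;
        /* first half of the next iteration, all fetched into
         * registers before any store below can alias it */
        di_n  = data[i2];
        di1_n = data[i2 + 1];
        tr_n = wr * data[j2]     - wi * data[j2 + 1];
        ti_n = wr * data[j2 + 1] + wi * data[j2];
        /* second half of the current iteration (add and store) */
        data[j]     = di  - tempr;
        data[j + 1] = di1 - tempi;
        data[i]     = di  + tempr;
        data[i + 1] = di1 + tempi;
        i = i2;  j = j2;
        di = di_n;  di1 = di1_n;
        tempr = tr_n;  tempi = ti_n;
    }
    /* epilogue: second half of the final iteration */
    data[j]     = di  - tempr;
    data[j + 1] = di1 - tempi;
    data[i]     = di  + tempr;
    data[i + 1] = di1 + tempi;
}
```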

If it were necessary to allow for a zero iteration case, the whole unrolled inner loop could be placed within an if block. In this application, the original inner loop could have been written as a do-while, to emphasize the fact that at least one iteration is always taken. Although this saves a couple of instructions, there is no measurable change in performance. Current compilers transform a while(){} loop into if() do{} while() when that will improve performance.

The code will perform nearly as well if the inner loop is allowed to iterate once more, so that the duplicate code for the latter part of the original loop is not needed. The data array must be made large enough to avoid picking up data outside the array. Zero values will do in the extended part of the array. Values that generate underflows or exceptions will take a long time even though they do not affect the results.

For the FFT algorithm, arrangements similar to Listing 2 are useful on a range of architectures. The benchmark now executes in 3.1 milliseconds, a 45% improvement over the original code. The effect of unrolling will be greater on an architecture that takes more advantage of parallelism than the HP 7x0 series. While the code generated by the HP compiler does not appear to be optimum, no significant additional performance has been achieved by tinkering with it.

Conclusion

A published implementation of the complex FFT can be sped up 30% by simple source code changes that eliminate unnecessary data conversions and redundant memory accesses. Execution speed can be increased another 12% by appropriate source code unrolling techniques on a typical pipelined RISC workstation. These performance increases are larger when the cache hit rate is reduced. The unrolling technique shown performs well even with a small number of loop iterations.

References

Press, Flannery, Teukolsky, and Vetterling. 1988. Numerical Recipes in C. Cambridge University Press.

Vetterling, Teukolsky, Press, and Flannery. 1985. Numerical Recipes Example Book (FORTRAN). Cambridge University Press.