September 1993/Guidelines for Signal Processing Applications in C

Features

Guidelines for Signal Processing Applications in C

Adrian Freed

Adrian is developing a real-time sound synthesis environment in C for the SGI Indigo. He directs software and systems development at the Center for New Music and Audio Technologies, a research center of the music department at the University of California, Berkeley. He may be reached at 1750 Arch Street, Berkeley, CA 94709, (510) 643 9990. e-mail:adrian@cnmat.berkeley.edu.
Programming signal processing algorithms in C requires more than the usual amount of care in the areas of efficiency and clarity. Success and failure of time critical applications such as modems and sound processing may depend on the efficiency of its signal processing techniques. Clarity allows the code to be maintained and developed by programmers who lack the mathematical skills and knowledge originally required to develop the application. While developing signal processing code in C for a real-time musical sound synthesis application, I learned that efficiency can be achieved without sacrificing clarity, by careful programming and use of an optimizing compiler for a modern processor.

Clarity and Efficiency
Programmers previously used the coding style found in the audio signal processing library function in Listing 1 to achieve performance with early, low quality C compilers. Listing 1 takes advantage of C's ability to express low-level operations to optimize the performance of the code for a particular compiler and machine. The clearer version in Listing 2 could have been the starting point for these optimizations. It uses no register variables, pointers, or temporary variables and makes no special case for vectors of contiguous elements. Will it be much slower because of the five integer multiplies introduced per iteration? Will it be slower without the special case that allows compilers to optimize the post increment? Using the MIPS compiler for the R3000 processor, I found that the simpler version runs slightly faster than the hand-optimized version by eliminating the test for the special case of increments by one.
Listing 2 is certainly smaller and easier tounderstand than Listing 1. It also reflects another important measure of clarity for signal processing applications — that it should be easy to compare to the mathematical notation for the algorithm it implements.

The Right Algorithm
You can implement the Discrete Fourier Transform (DFT), a common signal processing function, directly from its mathematical definition (easy to compare to its mathematical notation). However, in practice programmers use the Fast Fourier Transfrom (FFT) to gain, usually, hundreds of times greater speed.
You must choose the right algorithm to achieve acceptable performance goals in signal processing applications. However, finding an appropriate algorithm may require some research work. Fortunately, many books on C programming for signal processing algorithms and numerical computations have been published in recent years. (See the Bibliography.) Each book places emphasis on different algorithms. Code quality and clarity vary considerably. For example, Numerical Recipes in C (Press et al. 1992) has a small descriptive section on FFTs with basic code, whereas Signal Processing in C (Reid and Passin 1992) is mostly devoted to theoretical discussions of FFTs. These books are at the expensive end of the price scale, so check their coverage of your area of interest before buying. A good college engineering or math library will have most of them.

The Guidelines
For the guidelines developed in this article, I start with recommendations for appropriate data types to represent signals, then I show you how to use ANSI C to control type promotion in arithmetic expressions. I introduce other techniques for writing quickly-computed expressions. Signal processing algorithms use vector and matrix operations, so I cover tips for efficient clear loops of arithmetic expressions. I conclude with memory layout and general compiler issues. Many of my recommendations apply to general C programming and numerical computation. Some, such as a warning about floating-point exception processing have particular significance for real-time signal processing applications.

Choosing and Controlling Data Types

For space and time efficiency, choose the smallest data type that can represent the range of values your application requires.
Generally, you choose data types to represent signals early in a project. A mistake here may require extensive changes later. Signals are functions of time so you need to choose data types to represent times as well as signal values. Often the choice is already made for you in the form of a standardized representation such as 16-bit two's complement integers for sound on compact disks, or perhaps unsigned eight-bit numbers for voice quality sound.
Your compiler's reference manual documents the sizes of the data types available to you. You can make these sizes available to a program by using the constants defined in the library header file <limits.h>. Table 1 contains an example of the information from a compiler for MIPS processors.
The three sizes of integers in Table 1 allow for efficient storage of samples of most signals sampled from the real world, including high quality sound and color images. However integers are not necessarily the best choice of type when it comes to processing those samples. Digital filters, for example, multiply incoming samples by small numbers, commonly between -2 and 2. Specialized chips for digital signal processing use a fractional arithmetic interpretation for binary numbers, where the binary point is near the most significant bit, rather than after the least significant bit, where it is for the C integer interpretation of binary numbers.
Although multiplication of two 32-bit integers on most processors yields the expected 64-bit resolution result, the multiplication operator in C yields an integer the same size as its operands — in this case 32 bits. This difficulty is being addressed by the ANSI X3J11.1 Numeric C committee who are considering a new data type, called fract, for this purpose.
Considerable effort has been expended developing interesting signal processing algorithms that minimize or avoid multiplication altogether. If you are lucky enough to be able to use such an algorithm, integers are a good choice. On most processors, addition of integers and multiplication by powers of 2 (arithmetic shifts) of integers are faster than these operations on floating-point numbers. On the other hand, most RISC processors and many recent CISC processors have faster floating-point multiplies than integer multiplies. This means that if you can't avoid multiplication, floating-point numbers are your best choice. The ability of floating-point numbers to represent a very wide range of values helps with the numerical stability of algorithms. Most processors implement floating-point arithmetic according to an IEEE standard, giving you a good chance to achieve consistent results as you move your programs from one processor to another.

Carefully apply features introduced with ANSI C to take advantage of the efficiency of single-precision floating point arithmetic.
Before the ANSI standard, C compilers promoted single-precision float types used in expressions to double precision. This rule penalizes expressions that only require single precision because most processors have slower double-precision calculations than single precision, and waste cycles converting from single to double precision. The rules for type promotion in expressions were changed in ANSI C so that types are promoted to the size of the larger operand for each operator.

Convert declarations to the new style defined by ANSI C.
For example:
float fast_sine(float angle);
rather than
float fast_sine(angle)
float angle;
Pre-ANSI C compilers often promoted arguments to functions defined with "old style" declarations to larger types. These early compilers promoted chars and shorts to ints. They promoted floats to doubles.

Use the ANSI C f or F suffix for single-precision floating-point constants.
For example,
3.141f
When the suffix is unavailable, use a cast. For example,
(float) 3.141
The constant 3.141 is of type double. Its use in an expression will cause other types to be promoted to double.

When single precision suffices use faster library routines, for example, fsin and fcos instead of sin and cos.
Standard C library functions such as sin and cos return double-precision results and expect a double-precision argument. ANSI C recommends (but doesn't yet require) single-precision floating-point versions of these routines. The single precision versions are often more than twice as fast as the double precision version.

Arranging Expressions to Maximize Performance

Fold constant expressions into a single value, if your compiler does not do this automatically.
For the expression sin(2pkt) most compilers will generate code to multiply k by t, p, and 2. So the following strategy, known as the constant folding optimization, will be more efficient:

#define TWOPI 6.2831853072 a = sin(TWOPI*k*t);

Consider strength reduction when it is measurably faster.
Processors implement division more slowly than multiplication. You might expect your compiler to perform operator strength reduction optimizations such as x/2.0 implemented as 0.5*x and 2.0* x as x+x. However, C compilers for superscaler and RISC compilers may not do strength-reduction optimizations to take advantage of processors able to perform an addition, multiplication, and division simultaneously. Consider the many possible optimizations of the expression 2.0xb[i] + a[i]/2.0. If you have a good compiler you can simply write the clearest expression and trust it to generate good code. If your compiler seems to be missing an opportunity, there may be some benefit in experimenting with strength-reduced versions, but consider carefully whether a loss in clarity is justified by the performance benefit you gain.

Measure changes in performance.
Often attempts to speed up code can actually slow it down because the changes deprived the compiler of opportunities to optimize. It is becoming increasingly hard to understand instruction timing issues in new processors. Many decisions are best left to the compiler. Looking at the assembly language code output of a compiler can be useful to understand what the compiler is up to. Do not conclude that your code is faster on this basis alone. Bugs tend to creep in as code is tuned, so it pays to keep the untuned version around to ensure your faster code does the job.

Consider restrictive and faster implementations of C library functions.
Mathematical library routines are designed to achieve good performance without compromising accuracy for a wide range of input values. So they can work over a much larger domain than an application may require. They must also maintain the errno variable for exception handling.
Remember that considerable effort has been expended on efficient implementations if most of the C library functions. The hard work to do better ones for specific applications may be difficult to justify (See Tim Prince, "Tuning Up Math Functions", CUJ December 1992).

Avoid computations which may lead to expensive exception processing.

Disable expensive exception processing if numerical robustness is not required.
Certain rare events that occur during arithmetic calculations, such as division by zero, require special additional processing. This is usually processor specific, but in the case of IEEE 754 floating point, this processing is formally defined. Reasoning that these situations happen rarely, many processor designers have chosen not to implement these operations efficiently in hardware. Instead, they use a special kind of self-generated processor interrupt, an exception. Exceptions result in an expensive operating-system context switch. In cases like floating-point underflow, strict adherence to the IEEE standard requires expensive exception handling. Many processors and operating systems provide mechanisms to avoid the performance penalty imposed by strict adherence to the IEEE standard.
I have learned the value of these recommendations on exception processing when implementing recursive digital filters. After filtering many thousands of samples successfully at the rate of 44,100Hz, my sound synthesis application slowed to a very audible crawl. What happened was that the recursive nature of the filter resulted in smaller and smaller values being subtracted from each other until floating-point underflow exceptions occurred. Their purpose was to use intricate number representations to maintain the highest possible precision at great computational expense. In my application these values were very close to 0. Certainly they were indistinguishable to the ear from 0. Since I was unable to disable this exception mechanism, I followed my recommendation to avoid using computations that lead to expensive exception handling (see Listing 3) . I added a routine to clamp the filter state variables to 0 as they approach 0.

Use register sparingly for the most important variables.
These days C compilers do a good job choosing which variables to put into registers. On the rare occasions when your compiler is letting you down, you may be able to guide it by careful use of the register storage-class specifier. Unfortunately there is no portable way of doing this as compilers are free to ignore or interpret this keyword in their own way. Your compiler's reference manual should describe its interpretation of the register storage-class specifier. Many compilers ignore register declarations altogether. Some provide an option to ignore them or use them.

Use volatile to defeat a compilers unfruitful register assignments.
The number of registers available varies from processor to processor. If you specify use of more registers than are available, your compiler may not necessarily choose the most important ones. However. your compiler may provide a mechanism for further control of register assignment — the volatile type qualifier introduced with ANSI C. Variables qualified this way will not usually be assigned to registers.

Use for loops and an iteration variable instead of while.
The heart of most signal processing functions is the repeated application of some basic arithmetic operations over vectors or matrices of signal data. In C the looping constructs for and while accomplish this basic task. You have considerable freedom to express looping functions in different styles, as the examples (paraphrase from real signal processing code) in Listing 4 illustrate. They all perform the addition of two non-zero length vectors.
For the interesting values of count greater than 0, these functions all perform the same operation. For values less than or equal to 0 they do not all perform the same way. It is unreasonable to expect a compiler to generate the same code for each function. Which function will be the fastest? For many compilers the first example is the fastest. Compiler developers cannot optimize every way to express loops. Perhaps because of the tradition of optimizing Fortran do loops, many compilers generate good code for loops with array indices which are functions of loop iteration count variables such as the variable i above.

Reference arrays with iteration variables instead of pointers.
I know of one exception to this rule — the IBM C compiler for the RS/6000 does a better job with pointers than arrays. So you may want to experiment with your compiler and perhaps use macros to parameterize loops for these vector functions.

Increase the probability of fruitful register assignments by avoiding side effects and narrowing scope of variables so side effects may be checked for by the compiler
You might expect your compilers to put the variable count in the following function into a register:
int count;
loopindex()
{
   int i;
   for(i=0; i < count; ++i)
      x(i*i);
}
Since the function x(int m) may modify count, the compiler cannot generate code to put a copy of count in a register at the start of the loop. If you know that there are no such side effects you can help the compiler draw this conclusion by assigning the variables in the loop to local variables:
int count;
loopindex()
{
   int i;
   int n = count;
   
   for(i=0; i < n; ++i)
      x(i*i);
}
Memory Layout
Modern computer systems, whether they are based on RISC, CISC, or DSP architecture, are designed to take advantage of locality of memory references. Register variables and data cache memory are used to take advantage of temporal locality — the tendency for most references to be to variables which were most recently referenced. Instruction caches and queues are designed to take advantage of temporal and spatial locality of instruction streams — the tendency for instructions streams to contain contiguous instructions or repetitions of short sequences from loops. Cache lines and page mode memory chips are used to take advantage of spatial locality of data and instructions.
The following guidelines apply to any computer system, but the playoff you observe in your applications depends on the processor and memory design of the system you are targeting.

Consider replacing the standard memory allocator by one that aligns data to memory page, and cache line boundaries.
A good implementation of the standard storage allocation functions will align data to memory page and cache line boundaries. If your system does not have a good allocator, consider writing your own. In addition to addressing these memory layout issues, you can also considerably improve the performance over standard library functions, as they are designed for generality.

Economize your use of precious limited register files and cache memory by avoiding larger data types than you need.

double variables commonly use twice the register and cache space of float variables. Use them only when the precision they offer is required.
Many signal processing algorithms require repeated use of a small number of common values. The FFT, for example, requires use of the sine and cosine functions at equally-spaced values from 0 to 2p. This is a good opportunity to use tables to avoid the expense of the C library sin and cos routines. These values are used very often in the inner loops of the FFT, so you can expect that processors able to store them in fast cache memory will perform faster than those that can't. Unfortunately, as the size of the transform required increases, so does the size of the sine and cosine tables. Eventually, for large FFTs, the processor will be unable to store the tables in its limited cache memory. At this point the FFT routine that uses tables may actually be slower than one that calls the C library functions. Consider that man workstations are able to perform dozens of arithmetic operations in the time it takes to access values in main memory.

Examine time/space tradeoffs carefully. Sometimes using more memory results in poor performance.

Use vector strides to gain flexibility in memory layout of vectors without complicating vector library functions.
With C you have considerable control over spatial locality of data in memory, since arrays are implemented in contiguous memory locations and successive elements in structures are required to be implemented contiguously in memory. A good way to use this control is to put variables that are commonly used together into a C structure. A common example of an opportunity to do this in signal processing applications is the use of complex numbers. A reference to the real element of a complex number consisting of two float structure elements commonly results in the imaginary element being brought into cache memory concurrently, speeding up a subsequent access to the complex element of the structure. Place variables that are commonly used together into contiguous memory locations, as in

typedef struct { float r, i; } complex ;
This recommendation is not always as easy to apply as the above complex number example suggests. Consider ways to represent vectors of complex numbers. Two possibilities suggest themselves: an array of structures containing the real and imaginary elements or two separate arrays. For algorithms that operate on the real and imaginary components separately the latter structure will be faster. For algorithms that operate on the components together the former will be faster. Why not simply use both data structures? This would mean providing different flavors for every function which operates on these vectors. Happily there is an old trick that addresses this problem neatly: the vector stride. You saw an example of this in Listing 2: the incr_in1, incr__in2, incr_out variables. The idea is to give vector functions the stride or distance between successive elements of a vector along with the address of the first element of the vector. A library of vector functions can work equally well with complex numbers with contiguous real and imaginary elements (by using a stride of 2) or two vectors (by using a stride of 1). Sometimes vectors need to be interpreted backwards. This is easily achieved by passing vector library functions a negative stride.

Consult your compiler manual for control over code generation.
Modern C compilers offer a range of optimization and other controls over the code-generation process. Compilers often provide support for traditional C and ANSI C. If you choose options for traditional C (perhaps to port old code) remember that this might introduce many needless type conversions and expensive double-precision arithmetic.

Use compiler optimization after code has been debugged. Options for generating code for symbolic debugging usually suppress optimization.

Turn off profiling options that affect performance. Profiling code is another potential source of inefficiency.

Review your code for forgotten remnants of debugging or profiling. Forgotten debugging or profiling code you have written may slow things down.
Finally, there is one common pitfall to avoid: breaking working code while changing it to make it more efficient. Most programmers have experienced the punishment for not following these recommendations I quote from Kernighan and Plauger's book on programming style:

Make it right before you make it faster.

Keep it right when you make it faster.

Make it clear before you make it faster.

Bibliography
Baker, Louis. 1992. C Mathematical Function Handbook. New York, NY: McGraw Hill.
Bentley, Jon Louis. 1982. Writing Efficient Programs. New York, NY: Prentice Hall.
David, Ruth A. and Samuel D. Stearns. 1992. Signal Processing Algorithms Using Fortran and C New York, NY: Prentice-Hall.
Press, William H. et al. 1992. Numerical Recipes in C, Second Edition. Cambridge, MA: Cambridge University Press.
Prince, Timothy. December 1992. "Tuning Up Math Functions," The C Users Journal. Lawrence, KS: R&D Publications.
Reid, Chistopher E. and Thomas B. Passin. Signal Processing in C. New York, NY: John Wiley & Sons.
Macdonald, T. 1991 "C for Numeric Computing," The Journal of Supercomputing, pgs. 5, 31-48. Boston, MA: Kluwer Academic Pulishers.
Jervis, Robert. August 1992. "Numerical Extensions to C," (Numerical C Extensions Group status report). Dr. Dobbs Journal. San Mateo, CA: M&T Publishing, Inc.
Kernighan, B. W. and P. J. Plauger. 1978. The Elements of Programming Style. New York, NY: McGraw-Hill.
Leary, Kevin W. and William Waddington. 1990. "DSP/C: A Standard High Level Language for DSP and Numeric Processing," Proceedings of the ICASSP.
Macdonald, T. 1991. "C for Numeric Computing," The Journal of Supercomputing. pgs 5, 31-48. Boston, MA: Kluwer Academic Publishers.
Nakamura, Shoichiro. 1992. Applied Numerical Methods in C. New York, NY: Prentice Hall.
Plum, Thomas and Jim Brodie. 1985. Efficient C. Plum Hall Inc.
Thompson, William J. 1992. Computing for Scientists and Engineers: A Workbook for Analysis, Numerics and Applications. New York, NY: John Wiley and Sons.