Dr. Dobb's Journal, September 2005
New tools, more comprehensive software architectures, and a wider range of available algorithms have made it much easier to develop embedded systems based on digital signal processors (DSPs). Nevertheless, DSP system developers still have to optimize software because the challenge of developing DSP embedded systems lies in making the most of limited resources: performance, memory, and power. As a result, developers spend a considerable amount of time working to obtain the right balance of code size and execution speed for their particular system requirements.
Unfortunately, the process of optimizing software for DSP embedded systems has remained hit or miss, with software developers often following ad hoc approaches that may or may not achieve the best solution. A more effective optimization process does, however, exist. Developers who treat optimization as a concern throughout the entire development flow, rather than as an isolated stage, often achieve better results. In addition, those who become familiar with the algorithms, architectures, and compilers underlying their DSP systems are better positioned to take advantage of software techniques that can help them get the most out of their hardware. While the process of optimizing embedded DSP systems is not always easy, it can be accomplished successfully through a disciplined approach.
Although efficiency goals differ in every system, optimization normally focuses on program execution speed, memory usage, and power consumption. These indices are not independent because what affects one usually affects the others in some way, and all have some effect on system cost. Optimization is usually thought of as beginning after the code is written and functionally verified, but this is a short-sighted view. Choosing the right processor and algorithm at the beginning of the design cycle can make much more difference in performance than any amount of code-tweaking on a poorly chosen platform. A disciplined approach to DSP code optimization follows the guidelines discussed below.
DSPs are a specialized type of microprocessor that handles certain types of algorithms extremely efficiently for real-time signal processing. Knowing how the hardware and compiler function together can help you concentrate on the 20 percent of code that performs 80 percent of the work, giving you greater leverage on program efficiency as you modify your code.
Typical DSP algorithms used in real-time signal processing involve sums of products (SOPs) based on mathematical summations. To handle repetitive SOP operations as fast as possible, DSP data paths perform multiply and accumulate (MAC) operations in a single cycle. High-end DSPs are also characterized by multiple data and address buses that keep data flowing with minimal latency and wait states. Sometimes algorithms can be structured to compute more efficiently on DSP architectures while still obtaining the same result. A classic example is the use of fast Fourier transforms (FFTs) instead of discrete Fourier transforms (DFTs), a substitution that gains greater cycle count efficiency as iterations increase.
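The SOP kernel described above is essentially a dot product; a generic C sketch (the function name and types are illustrative, not from the article) shows the MAC operation a DSP data path executes once per cycle:

```c
#include <stdint.h>

/* Sum of products: one multiply-accumulate (MAC) per iteration,
   which a DSP data path completes in a single cycle. */
int32_t dot_product(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc = 0;                       /* accumulator register */
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];       /* multiply and accumulate */
    return acc;
}
```

On a DSP with dual data buses, both `x[i]` and `h[i]` can be fetched in the same cycle that the previous product is accumulated, which is why this loop shape maps so well onto the hardware.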
Relatively recent introductions in high-performance DSP architectures include two-level on-chip caches, sophisticated direct memory access (DMA) controllers, and multiple function units with very long instruction words (VLIWs) for increased parallelism. DSP architectures specifically aimed at portable systems tend to implement less parallelism, instead adding more features for control of power consumption. While DSP compilers are capable of handling these architectural elements, they may not automatically achieve the most efficient solution in terms of performance. Today, compiled C and C++ code is highly efficient, but judiciously tweaking source code can often achieve faster throughput or more compact code.
Compilers perform optimization techniques that are both machine dependent and machine independent.
A particularly important generic optimization technique is loop unrolling. Like other loop techniques, unrolling can be extremely significant in optimizing DSP algorithms, which normally spend the bulk of program execution time in loop iterations. Loop unrolling is simply rewriting the loop's body more than once to reduce the number of loop iterations or eliminate the loop altogether. For instance, the loop in Example 1(a) can be unrolled four times as in Example 1(b).
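Since the listings for Example 1 are not reproduced here, the transformation can be sketched as follows (the function names are illustrative, and the unrolled version assumes the trip count is a multiple of four):

```c
/* Example 1(a) pattern: the original loop, one branch per element. */
void add_loop(int *sum, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        sum[i] = a[i] + b[i];
}

/* Example 1(b) pattern: unrolled four times, one branch per four
   elements.  Assumes n % 4 == 0. */
void add_unrolled(int *sum, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        sum[i]     = a[i]     + b[i];
        sum[i + 1] = a[i + 1] + b[i + 1];
        sum[i + 2] = a[i + 2] + b[i + 2];
        sum[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

When the trip count is not a multiple of the unroll factor, a short cleanup loop handles the remaining elements.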
Because the loop branch logic is overhead, reducing the number of times the loop has to execute reduces the overhead and makes the loop body, the important part of the structure, run faster. The tradeoff in loop unrolling is that it requires more instruction code, and more registers and cache space for data. The latter factor puts a natural limit to the number of times you can unroll a loop because if your code stores values that exceed the number of registers available, or if it needs enough data to cause cache spills, memory accesses will be required, slowing execution.
The area where your knowledge can make the greatest difference in optimizing code performance is in those transformations that depend on the underlying machine architecture. Generally, machine-dependent optimizations fall into three groups:
Special features, which you implement by selecting machine-specific instructions in the source code. Because DSPs are number crunchers by nature, their assembly instruction sets may include some instructions for mathematical operations used often in algorithms. However, compilers may not be able to implement these algorithm-specific instructions efficiently. In these cases, the programmer can force their use through specific assembly language instructions known as "intrinsics." The C compiler for TI's TMS320C5x DSPs, for instance, recognizes intrinsics for functions such as absolute value, normalization, rounding, and saturated add.
The last of these, saturation, clamps the sum of two values at the largest representable result instead of letting it overflow, and is performed by many DSP applications during video and image processing. When it is manually written, a saturated add (sadd) requires many steps; see Example 2(a). Compiling this code generates 14 assembly instructions before the return. On the other hand, the intrinsic in Example 2(b) compiles to only three instructions before returning.
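The manual version can be reconstructed along the lines of Example 2(a) (a portable 32-bit sketch, not TI's exact listing):

```c
#include <stdint.h>

/* Example 2(a) pattern: saturated add written out by hand.
   The result clamps at INT32_MAX/INT32_MIN instead of wrapping. */
int32_t saturated_add(int32_t a, int32_t b)
{
    /* Unsigned arithmetic gives well-defined wraparound. */
    int32_t result = (int32_t)((uint32_t)a + (uint32_t)b);
    if (a > 0 && b > 0 && result < 0)
        result = INT32_MAX;        /* positive overflow: clamp high */
    else if (a < 0 && b < 0 && result >= 0)
        result = INT32_MIN;        /* negative overflow: clamp low */
    return result;
}
```

With the TI compiler, the intrinsic form of Example 2(b) reduces to a call such as `result = _sadd(a, b);`, which the compiler maps directly onto the hardware's saturating add.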
One benefit of intrinsics is that they are automatically inlined, so they do not waste the overhead of calling a function. Also, because the compiler is aware of the intrinsic contents, it may be able to perform some additional optimizations that it cannot perform on an assembly-language function of which it is ignorant.
Latency techniques, which schedule the order of operations for most efficient execution. Assembly operations require different lengths of time due to memory-access delays and differences in parallel function units. When the compiler predicts these conditions, it can force the processor to wait by inserting no operation (NOP) cycles into the instruction stream. Alternatively, various instruction schedulers can decrease running time by reordering some of the operations in the compiled code. One of these schedulers implements software pipelining, which distributes different iterations of a loop to different parallel function units so that they execute simultaneously instead of sequentially.
Figure 3 schematically shows the body of a loop in five sections (A-E) that has been software pipelined to execute n iterations (A1-An, and so on) through four different function units. The opening and closing sections, which load and clear the pipeline, are referred to as the "prolog" and "epilog." The central section, when all of the function units are operating, is the pipeline kernel. Generally, the more iterations in the loop, the more efficient the pipeline, because the kernel is much longer than the relatively inefficient prolog and epilog. A pipelined loop with many iterations is more efficient in overhead cycle counts than the same loop unrolled, which is in turn more efficient than the original loop. In code size, the opposite order of efficiency applies because unrolling repeats instructions, and software pipelining does so even more.
Software pipelining inherently improves execution speed, but it may not be possible if unnecessary dependencies exist. In Example 3, the two input arrays may or may not overlap in memory. The compiler must assume that there is an overlap and that a given load or store must be finished before the next one can be initiated. If this source is compiled for serial execution on a C6x DSP, the complete loop requires 26 clock cycles to execute, including 20 NOP cycles. With a little parallelism, the compiled loop requires 10 cycles to execute, including five NOP cycles.
However, if you know that the inputs do not overlap, you can tell the compiler that the arrays do not alias by declaring the input1 and input2 pointers with the restrict keyword. With that declaration in place, the compiler can software pipeline the loop efficiently because two values can be fetched from different memory locations simultaneously. Within the pipeline kernel, the NOPs are eliminated and the total loop cycle count drops to two:
void example2(float *out, float * restrict input1, float * restrict input2)
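The article shows only the signature; a reconstruction of the Example 3 loop pattern follows (the element-count parameter and the multiply in the body are assumptions for illustration):

```c
/* With input1 and input2 declared restrict, the compiler knows the
   loads cannot alias the store through out, so it may fetch from both
   arrays in the same cycle and software pipeline the loop. */
void example2(float *out, const float * restrict input1,
              const float * restrict input2, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = input1[i] * input2[i];
}
```

The restrict qualifier is a promise from the programmer; if the arrays actually do overlap, the pipelined code silently computes wrong results, so the declaration should only be added when non-overlap is guaranteed by the caller.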
The complexity of high-end DSPs means that it is no longer feasible to fully hand-code algorithms in assembly language. Even so, you can still achieve considerable performance enhancement if you understand the DSP compiler and architecture.
Resource allocation, which determines the values that should reside in registers at each point in the program. Registers are the fastest locations in the memory hierarchy. Making optimum use of the registers is the job of the register allocator, which takes the instructions generated by the instruction scheduler and arranges data and variables to avoid spilling them from the registers into memory, where they take longer to access. Of course, sometimes there is simply too much data for the register allocator to avoid spills, and at these times you may be able to help by reshaping the source code.
One way to reduce or eliminate register spilling is to break larger loops into smaller loops. When a loop has the shape as in Example 4(a), it may be possible to rewrite it in the form in Example 4(b). The register allocator treats each loop independently, increasing the possibility of finding a suitable allocation for each of the functions. Eliminating inner loops from embedded loop structures, whenever possible, can also free up registers and promote the chances of software pipelining.
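The listings for Example 4 are not reproduced here; the loop-splitting transformation they illustrate can be sketched as follows (the loop bodies stand in for the "functions" the text mentions):

```c
/* Example 4(a) pattern: one loop whose body keeps many values live
   at once, which can force the register allocator to spill. */
void fused(int *x, int *y, int n)
{
    for (int i = 0; i < n; i++) {
        x[i] = x[i] * 3 + 1;     /* first computation */
        y[i] = y[i] * 5 - 2;     /* second computation */
    }
}

/* Example 4(b) pattern: the same work split into two smaller loops.
   The allocator treats each loop independently, so each needs fewer
   registers at any one time. */
void split(int *x, int *y, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] * 3 + 1;
    for (int i = 0; i < n; i++)
        y[i] = y[i] * 5 - 2;
}
```

The tradeoff is an extra pass over the data, so the split only pays off when it actually eliminates spills or enables software pipelining.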
Some of the techniques used to optimize execution speed can lead to inflated code and larger memory requirements. The compiler's profiler can create a number of cycle count/code size options, giving you the choice of where you want to optimize. Figure 4 shows a screen from a compiler with a range of speed/memory optimization choices indicated along the curve of dots. The best balance is normally found in the lower left corner, where little of either code compactness or performance is sacrificed to benefit the other.
Programmers have other techniques available as well to make code more compact. For instance, some data structures make more sense for real-time applications than others do. Arrays of structures in the familiar form:

struct A {
    short a;
    int   b;
    char  c;
} A[n];

waste bytes in storage because of alignment padding, with arrangements as shown in Figure 1.
They also require an extra load in statements such as:
A[i].a += 1;
On the other hand, pointer-to-pointer structures of arrays such as:
A->a[i]
are treated by the compiler as invariants and are mapped to memory in a thoroughly efficient manner, as in Figure 2.
In addition, statements such as:
ptr[i] += 1;
require only a single load inside a loop. In general, you should remember that compilers have a hard time dealing with complicated structures, so it is best to keep the simplest structures at the bottom of the pointer tree where they are easy to optimize and manage.
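The two layouts can be compared directly (field names follow the fragments above; the sizes discussed assume typical 32-bit alignment):

```c
#include <stddef.h>

#define N 100

/* Array of structures: padding after 'a' and 'c' wastes space, and
   every field access must first compute the element's base address. */
struct AoS {
    short a;
    int   b;
    char  c;
};
struct AoS aos[N];

/* Structure of arrays: each field array is contiguous with no
   per-element padding, and a pointer into one field array is a loop
   invariant the compiler can keep in a register. */
struct SoA {
    short a[N];
    int   b[N];
    char  c[N];
};
struct SoA soa;
```

On a typical 32-bit target, `sizeof(struct AoS)` is 12 rather than the 7 bytes of payload, so the array of structures occupies 1200 bytes against roughly 700 for the structure of arrays.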
Direct memory access (DMA) control is essential to the operation of high-performance DSPs, allowing data transfers to proceed with little or no intervention from the processor core. While DMA operates automatically, some degree of intervention from you can increase its efficiency. For instance, writing to bits in the DMA register can control staging data to and from off-chip memory, so that while one block of cache is being used by the DSP core, another block is being written out and/or filled with fresh data. These blocks can be very fine grained because DMA controllers are capable of keeping track of what is in many small areas of on-chip memory.
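The staging scheme described above is the classic ping-pong (double) buffer. A sketch follows; `dma_copy` here simulates the controller with memcpy, since the real register-level interface is device specific, and all names are illustrative:

```c
#include <string.h>

#define BLOCK 256

/* Two on-chip buffers: the core processes one while the "DMA"
   fills the other. */
static int ping[BLOCK], pong[BLOCK];

/* Stand-in for a real DMA transfer; on hardware this would program
   the controller and return immediately. */
static void dma_copy(int *dst, const int *src, int n)
{
    memcpy(dst, src, n * sizeof *dst);
}

long process_blocks(const int *ext_mem, int nblocks)
{
    long total = 0;
    int *work = ping, *fill = pong;

    dma_copy(work, ext_mem, BLOCK);           /* prime the first buffer */
    for (int blk = 0; blk < nblocks; blk++) {
        if (blk + 1 < nblocks)                /* stage the next block */
            dma_copy(fill, ext_mem + (blk + 1) * BLOCK, BLOCK);
        for (int i = 0; i < BLOCK; i++)       /* process current block */
            total += work[i];
        int *t = work; work = fill; fill = t; /* swap ping and pong */
    }
    return total;
}
```

In this simulation the copy completes before returning; a real implementation would wait on the controller's transfer-complete indication before swapping buffers.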
Modern DSPs are designed with power consumption in mind, especially those targeted at portable applications. Typical power conservation techniques include reducing voltages and frequencies whenever possible and disabling clocks and power rails from peripherals that are not in use. These are useful optimization resources for the developer, but the largest single factor in digital power consumption remains memory access. In general, you should try to keep the most-used code as close to the DSP core as possible in order to minimize power-expensive pin and bus traffic. Some of the same techniques that improve performance speed also help lower power consumption by keeping data access on-chip as much as possible. Increasing code efficiency by reducing the number of instructions executed also serves to lower power requirements.
Figure 5 shows the steps involved in the DSP code optimization process from beginning to end. Although the initial seven steps take place before optimization is traditionally considered to begin, they are nevertheless part of the process. From step 8 on, the process can and should end whenever the application efficiency goals are met. Each optimization step is followed by a test regression step to verify that the software still functions as it should.
You should note that optimizing compilers can make your code harder to understand. Optimizers reorder operations and eliminate variables, so it becomes difficult, if not impossible, to debug code once you are in the optimization stage. In addition, because the compiler has scheduled instructions carefully, you risk seriously impairing performance if you try to debug optimized code. The best strategy is to make your code work right first, then turn on optimization features one at a time, documenting results as you go.
This disciplined approach makes optimization workable in the increasingly complex world of high-performance DSP design. Optimization should not be an afterthought, but an inherent part of the development process from concept to completion. Developers who do their homework about the algorithms, architecture, and compiler are in a better position to take advantage of the DSP hardware and software for greater performance with less code and less power consumption. With today's multifunctional DSP embedded systems, it may not always be possible to achieve the perfect optimization, but a disciplined approach should always make it possible to achieve a satisfactory optimization.
DDJ