The Embedded C Extension To C: Part II

C/C++ Users Journal September, 2005

A small extension that has a big impact

By Marcel Beemster, Hans van Someren, Willem Wakker, and Walter Banks

Marcel, Hans, and Willem are software engineers for ACE Associated Compiler Experts. They can be contacted at marcel@ ace.nl, hvs@ace.nl, and willem@ ce.nl, respectively. Walter is the founder of Byte Craft Limited, and can be contacted at walter@ bytecraft.com.

Embedded C is an extension to the C language to support freestanding embedded processors in exploiting the multiple address space functionality, user-defined named address spaces, and direct access to processor and I/O registers. In the first installment of this two-part article, we examined the Embedded C's feature set. In this installment, we present an example use of the language extension.

Listing 1 is a FIR filter that includes the fixed-point and address space features of Embedded C. The first line defines coeff, an array of _Fract values located in the X memory. The first value of the (incomplete) initialization shows the notation of a _Fract value constant, with the r appended. On line 5, there is an example of an _Accum constant value, which has a k appended.

Line 3 gives the fir function-type declaration. The function returns a _Fract. The argument points to an array of input values of type _Fract that is located in Y memory. Line 5 defines an accumulator variable. Lines 6 to 8 define a loop of N iterations. The body of the loop is a single multiply-accumulate statement.

In line 7, there is an explicit conversion of the input value to the _Accum type. This is to make sure that the processor's MAC unit can be used. Without the explicit conversion, the rules of arithmetic in C specify that the intermediate result of the multiplication must be a _Fract type; it does not allow the partial result to be kept with a higher accuracy. However, the practice of a typical MAC unit implementation is that the partial result is kept with high precision. The explicit conversion makes the use of a MAC possible.

Also important in line 7 is the loading of the coefficient value and input value from two different memories, X and Y. This corresponds to the DSP's architecture and allows for maximum bandwidth. Line 9 makes an explicit saturating conversion from the _Accum type. It makes sure that the returned value gets the maximum or minimum _Fract value in case of overflow.

Alternative in C

To write a program in plain C that implements the semantics of the aforementioned example is possible, but would lead to an implementation that will not use the performance-improving features of the DSP target processor. Plain C has no awareness of multiple memories. This notion cannot be expressed, hence the implementation will not use the dual-input pipeline of the DSP architecture. Similarly, fixed-point operations and saturation need to be expressed in terms of shifts and conditional execution. Given the number of different ways in which this can be done, it is almost impossible for the compiler to translate these DSP-specific features in an optimal way.

Most embedded C compilers before Embedded C failed to give C the power to eliminate the need for assembly code. This lack of capability in part fueled the debate on the merits of C versus assembly implementation in high-volume applications. The Embedded C specifications provide a means to directly access architectural features of the processor, thereby giving developers the ability to describe any processor function in C and effectively eliminate the functional arguments for using assembly code.

Implementation, Performance

Given the Embedded C version of the example, an optimizing compiler can compile the loop into a single instruction running under zero-overhead loop control. This is as good as an assembly version of the FIR filter can be. Besides knowledge of Embedded C, the required compiler techniques are:

All of these capabilities are well within the reach of modern compilers.

To show the improvements that are possible due to the use of Embedded C, we present three case studies: measurement on individual features, NEC compiler results, and a rewritten loop from the MiBench benchmark. The measurements are done using two CoSy-based compilers for DSP-C. It is expected that the results directly translate to Embedded C performance.

Table 1 shows results for three experiments. Each set of results reports the number of instructions in the inner loop of the algorithm and the size (in bytes) of a whole procedure implementing the operation. Results marked "*" are optimal; they cannot be improved by assembly hand coding.

Saturation is very expensive when implemented in ISO C because of the conditional branching and large constants involved. In DSP-C, the compiler can use the implicit saturation implemented by the hardware.

The inner loop of the FIR filter (Listing 1) is translated to a single instruction when written in DSP-C. In ISO C, additional shifting is needed inside the inner loop, causing the algorithm to run at half the speed. The explicit saturation in ISO C is outside the loop, thus impacting the code size.

The array copy example uses memory spaces to allocate the input and output arrays in different memory spaces. Because of this, the compiler is able to compute that there are no data dependencies in the loop and, as a result, it can use software pipelining to produce a single instruction inner loop. In the ISO C version, the read/write operations must be strictly sequential because of possible overlap in the arrays. Again, this makes the code twice as slow.

Tables 2 and 3 present results of the commercial DSP-C implementation for the NEC mPD77016 processor. This compiler was made with the ACE CoSy compiler-development system. To make the comparison, four applications were selected that were originally implemented in DSP-C. These applications were then reimplemented for comparison using plain ISO C only. A single DSP-C-capable compiler was used in all comparisons. This is possible because a DSP-C compiler is also fully ISO C compliant.

Table 2 shows the improvements in code size when DSP-C is used. The first of the four applications is a piece of control code. It was included in the test to show the contrast between dealing with signal-processing code and control code.

In control code, there is typically little use for the DSP-C extensions. This is confirmed in the code size that is reduced by only 6 percent in the DSP-C variant. There is usually no signal-processing arithmetic done, but the multiple memory and circular array primitives are sometimes applicable.

For the three signal-processing applications SP1, SP2, and SP3, the code savings are substantial. Application SP1 is even reduced to a quarter of its original size, but this was a small application to start with, containing only DSP-specific code. For the other two, with more administrative code in them, the savings are still substantial.

These results show that Embedded C is not only important to achieve high performance, but additionally leads to significant code savings.

Table 3 shows the improvements in execution cycles for application runtime. For the control code, the improvement is minimal. The 6 percent code size improvement did not translate into a similar speed improvement; this is because the size improvements did not occur in one of the performance-critical loops of the control-code application.

The performance improvements for the signal-processing applications are remarkable. They demonstrate the tremendous effect that the specific features of DSPs can have when used to their full advantage. The key issue with DSPs is that everything must fit just right before the maximum benefit can be obtained from the tightly designed processor data path. In the case of SP1, a 10 times performance improvement is measured. As the performance improvements outweigh the code-size improvements, this shows that for the signal-processing applications, the actual hot spots of the programs are attacked by DSP-C.

Based on these figures and more extensive comparisons, it is impossible to claim that DSP-C will always lead to a specific factor of performance improvement for signal-processing applications. This factor depends on how well the application fits the processor architecture.

For the next example, we used a procedure from the MiBench embedded processor benchmarks. The selected code, written in ISO C, is from the telecomm/gsm directory, procedure Short_term_analysis_filtering. This procedure uses a number of macros and functions to implement fixed-point arithmetic. The inner loop does two-memory read operations and one-memory write, as well as two fixed-point multiplications and two fixed-point additions.

The compiler used is the same as with the plain C example previously presented. The target architecture is a 16-bit VLIW DSP that can do two-memory operations and a dual-MAC operation in a single cycle. The compiler translates the ISO C code by inlining all function calls in the code, thus obtaining maximal speed.

When rewriting the code in DSP-C, no more function calls remain. All operations can be expressed in DSP-C. This makes an enormous difference not only for the compiler, but also for the programmer: The clarity of the code is improved tremendously. While the original is tricky to understand because of the emulation of fixed point, the DSP-C version is absolutely clear in its intent. Table 4 presents the results. The columns ISO C and DSP-C show the huge impact of rewriting the code into DSP-C. For cycles, the number of clock cycles in the inner loop is reported. A stunning factor of 8 in performance improvement is reached. Examining the assembly code shows that this can be further improved. With some knowledge of the architecture and rules of DSP-C arithmetic, two _Accum conversions can be placed in the DSP-C code. This achieves the result in the last column, which is indeed optimal.

So what is the reason that such huge improvements can be achieved? Two factors contribute: In the original ISO C code, fixed-point arithmetic was encoded using macros. This makes the code relatively easy to maintain but is not necessarily optimal. Careful study and optimization of the code makes it possible to postpone saturation, to temporarily skip shifts, and move some expensive corrections out of the inner loop. Even in ISO C, improvements are possible in this way. The end result would, however, be complicated code that is extremely difficult to maintain and debug.

The second factor is that the fixed-point emulation in ISO C uses more variables. The architecture that was used has a rather small register file, just like most DSPs. For the given code, the compiler runs out of registers and starts to generate spill code. Luckily for this architecture, spilling and restoring is rather efficient, costing only a single clock cycle; otherwise, the results would have been even worse.

These figures are a proof of concept for Embedded C. They show that for real-world applications, Embedded C enables the use of the high-speed extensions that are found in modern DSPs. It is possible to effectively program DSPs in C!

Conclusion

Embedded C is a relatively small extension to the C language, but its impact on the programmability of embedded processors and DSPs, in particular, is enormous. Embedded C extensions make the specialized application-specific features of these processors available to application programmers in a high-level language.

Today it is still common to program these processors in assembly language. The industry cannot afford the increasing time-to-market that assembly programming of ever more complicated applications incurs. Moreover, the inherit dependency on a specific processor and limited number of highly specialized assembly programmers incurs great risk and paralyzes future developments.

Embedded C offers a practical solution with proven results. It is being adopted by more and more compiler developers. With the ratification of the ISO technical report, Embedded C has become the standard solution for high-level language programming of the many billions of embedded processors out in the field.