SPEED TRIALS: FIVE C'S COMPARED

FIVE Cs COMPARED

Richard Relph

Richard Relph is a technical editor for DDJ and the author of several previous articles on C compilers. Richard is a member of the ANSI X3J11 committee, and the chief technical officer of Embedded Performance Inc. He may be reached at EPI, 3707 Willams Rd., Ste. 2, San Jose, CA 95117.

C is a language renowned for its efficiency. This fact has not been lost on compiler vendors, who in their advertising claims are always using words such as faster and fastest. As the author of "Optimizing Compilers for C" in last year's C issue of DDJ, I am well aware of the difference between advertising claims and the quality of generated code. This year, instead of looking at the theoretical aspects, I decided to take a look at he optimizing C compilers that are actually available.

This article reviews five optimizing C compilers for MS-DOS-Microsoft C, Version 1.5; Datalight's Optimum-C, Version 3.12; and Computer Innovation C86+, version 1.10. They are all from vendors who make vast claims as to the efficiency of their compilers. I decided not to include MetaWare's High C compiler, not because of a lack of technical merit but because mu company now has a close business relationship with MetaWare and I didn't want any charges of bias.

With a previous review for DDJ, there'd been a few accusations form readers that I was supplying an overwhelming number of numbers. This review will not suffer that fate. Instead, I'll concentrate on usability issues, language conformance, and the quality of the generated code.

The Test Software

This year, my company splurged and purchased the C Torture Test from Austin Code Works for $20. (See Sidebar.) Unfortunately, it seemed evident from the test results that most of the C compiler vendors never bothered to invest the $20 to obtain this suite.

I adapted the test suite for ANSI compatibility. The final performance tests consisted of compiling, linking, and running both the modified C Torture Test and the Dhrystone benchmark. (As most DDJ readers are aware, the Dhrystone is a performance benchmark that times execution of an "average" set of instructions and is as much a benchmark of characters and integer performance as Whetstone is of floating-point performance. Around the time of this review, I was also attempting to port a very large source-level debugger (Third Eye's CDB) from Unix to the PC. As a result of these different exercises, I soon had an excellent feel for all the compilers.

I ran all the tests on a Compaq 386 (16-MHz) Model 130 with 3 Mbytes of memory. I used 1 Mbyte of memory as a RAM disk (using RAMDRIVE which comes with Microsoft Windows/386) and the other spare Mbyte as a disk cache (using SMARTDRV which also comes with Microsoft Windows/386). I believe that machines with this speed and memory are no longer exceptional and will soon be the norm for most developers. With one exception (mentioned later), the test results should be representative of the rankings for these compilers, regardless of the particular machine you use. If you're impatient for the results, you can grab a peek at Table 1, below

Table 1: Performance test results

                                           
Compile
Product                  Version       Link/Run       Dhrystone
                                       (sec)          (sec)

Microsoft C              5.1           129            10.4*
Walcom C                 6.0           187            10.6
Borland Turbo C          1.5           43             13.3
Datalight Optimum C      3.1           165            10.9
Computer Innovations     1.1           327            14.0
C86+

Time was reduced to 10.0 seconds with changed optimization arguments. See text.

Although I used a 386 machine for the tests, please note that I made no effort to have the compilers produce code specifically for the 386. The code was generated to run on any member of the Intel 8086/80286/80386 line. As the Dhrystone was used as the performance test, all the compilers were directed to generate code with the fastest execution time (instead of optimizing for space).

Common Problems

In the C Torture Test, there were a couple of constructs that gave more than one compiler problems. The first is the following sequence of code:

foo(a)
struct x {int y, z;};
struct x a;      {...}

This source code should declare a structure tag x and define a function foo taking an argument a with type struct x. Three of the five compilers reviewed here, however, would not compile this construct.

The second common problem has to do with large constants. The suite attempts to confirm that decimal, octal, and hex representations of the same value all yield, in fact, the same value. The suite tests this at many points in the number range-nearly all of which work fine, of course. The problems arose when the suite attempted to test compile-time constant overflow. in particular, the test suite attempts to create three numbers greater than or equal to 232 in each of the three bases. Several of the compilers refused to compile such constants at all, yielding fatal errors. I believe the appropriate response is to warn about such things but also to continue compilation.

The last area of common problems has to do with the new ANSI escape sequence \x, which is a lead-in to a hex constant. The problem is what should the compiler do if all that it sees is \x? Should it yield a 0, a lowercase x, or a compile-time error? The ANSI draft is mute. My preference, as both a programmer and as a member of the ANSI committee, is to have \x yield a lower-case x, as this is unambiguous and will break the least amount of existing code.

The highlights (if you can call them that) of the test suite output are shown as Examples 1-4, starting on page 24. Perhaps the most interesting thing to note from the test results is how the compilers differ in allocating registers. Watcom C's output, for example, shows that this compiler reserves four(!) registers for pointer variables.

Section s22                 returned 0.
Section s241                returned 0.
s243,er2 Section s243       returned 2. READ

In contrast, when I redirected the output to a file, all I got was the last line. Neither result amounts to an acceptable performance on the test suite. C86 + took 5 minutes and 27 seconds to compile, link, and run the validation suite and the Dhrystone. Dhrystone run time accounts for 14.0 seconds. This makes C86 + the slowest compiler producing the slowest code in this group.

Summary

Because the emphasis here was on the quality of the code produced by the compilers, I didn't address other aspects of the packages. All the compilers reviewed here came with adequate documentation as well as an editor and a make facility. None of the editors even vaguely tempted me to switch away from Epsilon. Regarding the make utilities, Microsoft's was the weakest of the bunch but was fairly easy to understand; the other vendors attempted more or less to mimic the standard (Unix) make. Again, I found nothing to make me switch from the tool I already use (in this case, NDMake from D.G. Kneller).

As you read earlier, none of the compilers emerged from the testing unscathed. As a result, my recommendations really depend on the type of work you want to do.

Example 1: Test suite results for Microsoft C, Version 5.1

8 bits in chars.
16 bits in ints.
16 bits in shorts.
32 bits in longs.
16 bits in unsigneds.
32 bits in floats.
64 bits in doubles

1. 192093e-007 is the least number that can be added to 1. (float).

2. 220446e-016 is the least number that can be added to 1. (double).

char alignment: 1
short alignment: 2
int alignment: 2
long alignment: 2
unsigned alignment: 2
float alignment: 2
double alignment: 2

No errors detected.

Example 2: Test suite results for Watcom C, Version 6.0

8 bits in chars. 16 bits in ints. 16 bits in shorts. 32 bits in longs. 16 bits in unsigneds. 32 bits in floats.
64 bits in doubles

1. 192093e-007 is the least number that can be added to 1. (float).

2. 220446e-016 is the least number that can be added to 1. (double).

sign extension in char assignments

0 registers assigned to char variables.
2 registers assigned to pointer variables.
2 registers assigned to int variables.

char alignment: 1
short alignment: 2
int alignment: 2
long alignment: 2
unsigned alignment: 2
float alignment: 2
double alignment: 2

Sign extension in fields
Be especially careful with 1-bit fields

No errors detected.

Example 3: Test suite results for Borland's Turbo C, Version 1.5

 8 bits in chars.
16 bits in ints.
16 bits in shorts.
32 bits in longs.
16 bits in unsigneds.
32 bits in floats.
64 bits in doubles

1. 192093e-007 is the least number that can be added to 1. (float).

2. 220446e-016 is the least number that can be added to 1. (double).

char alignment: 1
short alignment: 1
int alignment: 1
long alignment: 1
unsigned alignment: 1
float alignment: 1
double alignment: 1

Signed extension in fields
Be especially careful with 1-bit fields:

Failed.

Example 4: Test suite results for DataLight's Optimum-C, Version 3.12

8 bits in chars.
16 bits in ints.
16 bits in shorts.
32 bits in longs.
16 bits in unsigneds.
32 bits in floats.
64 bits in doubles

1. 192093e-007 is the least number that can be added to 1. (float).

2. 220446e-016 is the least number that can be added to 1. (double).

1 register assigned to char variables. Register count for pointer is unreliable. Register count for int is unreliable.

char alignment: 1
short alignment: 2
int alignment: 2
long alignment: 2
unsigned alignment: 2
float alignment: 2
double alignment: 2

Failed.

If I were programming for OS/2 and DOS, then Microsoft is the only choice, primarily because of its superior debugger, CodeView. But if I were to program for DOS only, I would have great difficulty deciding between Microsoft and Watcom. At EPI we have been using Microsoft but we are going to start using Watcom (at least in parallel). It's a close call, though, as Microsoft still doesn't have a decent make and Quick C is crippled by the lack of support for multiple memory models, as is Express C.

Watcom C bears special watching as an excellent all-round compiler, good for DOS work, OK for porting work, and better than most for embedded systems work. I really want this compiler to succeed because it is so close to being the absolute best in every way. Watcom C is as good as Microsoft C in every significant way and is better in a few, most notably in execution time.

If I were porting existing programs from Unix to DOS, then Borland Turbo C would be my choice, but just barely. If the program is known to be portable, then Turbo C offers the best chance of running it unchanged. But the lack of a debugger makes using Turbo C problematical if the program doesn't "just run" after compiling. With the debugger I saw at Borland, this reservation would go away completely.

As nice as the Datalight Optimum-C compiler is, however, you probably won't be able to find it on the shelves much longer because the marketing rights to Optimum-C have recently been licensed to a British company called Zortech. According to a Zortech spokesperson, that company will not market Optimum-C per se, but instead incorporate the compiler's optimization technology into its own compiler called Zortech C. Datalight customers can expect to get product support from Zortech.

Datalight's Optimum-C is an excellent compiler, and I do like it very much. You get source code for the run-time library and an outstanding compiler for very little money. But the package lacks a debugger and professional documentation. Optimum-C would be my first choice for embedded systems work, but still it is not ideal (missing are the necessary utilities for burning PROMs and "remote" debugging).

Computer Innovations' C86 +, regrettably, has nothing except the reputation of the company to recommend it. You do get source code for the library but this hardly makes up for the lack of a debugger. And there's no escaping the fact that C86 + was the slowest compiler of the' group, producing the slowest code.

Notes

1. Richard Relph, "Optimizing Compilers for C," DDJ (August 1987).

2. Richard Relph, et al., "Benchmarking C Compilers," DDJ (August 1986).

3. Brian W. Kernighan and Dennis M. Ritchie, The C Programming Language (Englewood Cliffs; NJ.: Prentice-Hall, 1978).

Optimizing Techniques

The term "optimization" refers to the techniques a compiler uses to produce code that is appreciably smaller and/or faster than code generated by comparable non-optimized compilers. Over the past year, C compiler manufacturers have been jumping on the optimization bandwagon as a means of both improving their product and of differentiating it from other compilers on dealers' shelves. Unfortunately, they don't go into much detail explaining exactly what they mean by optimization other than saying their product is "faster and better" than other available compilers.

There are several optimization techniques available to compiler developers of which one or more method may be implemented within an individual compiler. These techniques include register allocation, constant propagation, dead assignment elimination, dead code removal copy propagation, and common subexpression elimination. For a detailed description of each of these methods, see "Optimizing Compilers for C" by Richard Relph (DDJ, August, 1987). You may also want to refer to Compiler Principles, Techniques, and Tools (also known as "The Dragon Book") by A. Aho, J. Ullman, and R. Sethi (Addison-Wesley).

Of all these methods, register allocation is perhaps the most advanced and important optimization technique and a key aspect of approach used by developers such as Watcom, Datalight/Zortech, and others. With register allocation, the number of memory references in a program are kept to a minimum by keeping variables and temporary results in registers. This allows shorter instructions to be generated which results in smaller, faster code.

The compilation process always begins by translating source into some internal, intermediate level representation. Once an intermediate representation of a function has been built, algorithms can be used to perform live variable analysis. A variable is live at a point in the flow graph if its value at that point could be used at some other point in the flow graph. Intuitively, a variable is live if its value may be used "later on" in the function.

Live variable information is used to build a data structure called the conflict graph. Every variable in the function has a corresponding node in the conflict graph. A link between two nodes is present whenever two variables are live at the same point in the flow graph. These links indicate that the two variables may not share the same register. The method used to assign registers to variables is called a coloring algorithm since it is similar to the problem of coloring a map so that no two adjacent countries have the same color. Since a machine has a limited number of registers, the object is to color (assign a register to) as many nodes of the conflict graph as possible without assigning the same color (register) to any two adjacent nodes. When a node is left uncolored, the corresponding variable must be assigned a position in memory or on the stack.

Of course, allocating registers intelligently requires information about the use of the variables. A variable that is heavily used in a loop is a better candidate for a register than a variable that is only used twice at the beginning of a function. Analysis may be performed on the flow graph to determine the loop nesting depth of each basic block. The uses of each variable are found and a weight is assigned to each use. Higher weights are assigned to uses within one or more loops. These weights are added up and stored in the conflict graph. The coloring algorithm takes these weights into account when assigning registers.

Although optimization can occur over several ranges statement, block, function, module, and program the compilation process, the most advanced are global optimization techniques that occur over the range of an entire function rather than just a single statement. Global optimizations are applied to intermediate representations of functions that consist of instructions formed by a set of operands and one operator. These instruction sets are organized into sequences called blocks that have one label at the beginning and one jump or return at the end. Blocks are linked together to form the control flow of the function.

In some cases, it is convenient for compilers to let the register allocator introduce some anomalies that are removed later in a post-optimization phase. One post-optimization technique involves keeping track of the value of a register at any point in the block. The data structure used to keep track of this information is sometimes called a register scoreboard. Each instruction is examined and the scoreboard is updated to reflect the effect that the instruction has on the register. ff a register is stored in a memory location, the scoreboard remembers that the memory location and the register contain the same value. When a move instruction is encountered, the scoreboard is consulted to determine if the move is redundant. When a memory reference is found in an instruction, the scoreboard is checked to find out if a register contains the appropriate value, By carefully designing the register scoreboard, a significant gain can be achieved. -- eds.

Putting C Compilers Through the Wringer

All the major players in the C compiler marketplace claim to support the standard language as defined by Kernighan and Ritchie, but do they really? And if not, what are the discrepancies?

An outfit called The Austin Code Works (11100 Leafwood Ln., Austin, TX 78750; 512-258-0785) sells a test suite to find out. It's called the C Compiler Torture Test. For $20, you get a PC-compatible disk containing 23 C programs and a READ.ME file that tells you how to use them.

The Torture Test measures a compiler's compliance with the K&R "white book." In a total of some 6600 lines of code, the suite exercises the entire standard. Particularly stressed are language features likely to be used by advanced C programmers: complex constructs, cascading defines, pointer manipulation such as multiple indirection, and so forth.

Testing with the suite occurs on two levels: compile-time and run-time. In some cases, the compiler might choke on certain constructs. This is a clear indication of noncompliance. A clean compile means that the compiler accepted the source language, but it doesn't mean that the result is correct.

Consequently, after torturing the compiler, you run each test program to see that it does what is expected. For example, dereferencing a chain of pointers should ultimately lead to a known variable that the program can compare with a hard-coded value. When the result is wrong, the program issues an error message. The messages are cryptic, but the test programs are structured so that you can look up the error message in the source text and determine from comments what the problem is.

Since the Torture Test programs contain only vanilla K&R code (regardless of its level of difficulty), a perfectly complying compiler should compile them without a whimper and the resulting programs should run correctly. A failure at-either level of testing pinpoints lack of compliance with the standard.

The C Torture Test is a valuable resource to both compiler writers and C users. Although delivered on a PC diskette, the code can be ported to any platform and used as is. And at only $20, nobody can complain about the price. -- eds.