LETTERS

Optimum Performance?

Either Richard Relph's article on optimizing compilers for C (August 1987) has a slight error, or I'm going to have a lot of trouble with my code if I start to use one. I believe the example of common subexpression elimination he refers to as block range optimization is at least module range and could be program range.

Under the K & R definition of C, a two-dimensional array is not necessarily a continuous block of storage. An expression such as a[i][j] refers to the jth element of the block pointed to by a[i]. The element a is an array of pointers to appropriate objects. This permits arrays in which rows can have different sizes and in which rows can be switched by swapping pointers and permits arrays to be built up dynamically during execution. Consequently, the identification of d[i][j] with d [0][t0] where t0 = i*10 +j is not always correct.

Even if we can be assured that the dimensions of d will be 10X10 when copy() is called, if the storage of d was allocated one row at a time (as is often the case in my programs), there is no assurance that d[i][j] can be addressed as in the example. There is even no assurance that the address of d[i][j] will bear the same relationship to that of d[0][0] as s[i][j] will to s[0][0]. In the particular example given, that assurance is provided by the declarations of d and s. (Even that assurance could be compiler implementation dependent, though I know of no compilers for which it would not be correct.) But these were globals, declared outside the function, so the optimization indicated requires at least module-level information and in other cases might require program-level information.

Any optimizing compiler that streamlines multidimensional array addressing in this way will either have to incorporate information beyond the function being optimized or it will have to restrict this technique to certain arrays declared within the function.

Thanks for an interesting article. Keep up the good work.

Clyde Schechter

116 Pinehurst Ave., Apt. D-63

New York, NY 10033

Better (or Worse) Standard

I had mixed emotions after reading Richard Relph's description of the forthcoming ANSI C standard in the August 1987 issue. As a programmer working almost exclusively in C, I like the idea of replacing local dialects with a common language, and I like many of the proposed enhancements. I am disturbed, however, by the move to merge the library routines into the definition of the language. At present, the C language is unique in not including library routines as part of the specification, which I think is the right approach.

For example, the article mentions that library function names and macros are to be reserved words, to eliminate the possibility of programmers substituting their own functions for standard ones. I have often substituted such routines to good effect while debugging a program, however. Similarly, it can be useful to set breakpoints on routines such as strcpy(), which will be impossible if the compiler is free to substitute in-line code for my subroutine call.

Although I agree that the library functions should be standardized, making them part of the syntax itself hurts the language. The justification for this seems to be that it makes it easier to optimize the code, but it is a poor trade to take away capabilities in order to go easy on the compiler!

At present, I know what's going on when I call strcpy() in any compiler, and if I want to speed up my program, I can write in-line code myself. With ANSI C, programmers won't know how to write efficient code without knowing which compiler will be used to compile it. This improves portability? Perhaps the ANSI committee needed a few more members who buy compilers along with those who write them.

William F. Linke

286 Dunhams Corner Rd.

East Brunswick, NJ 08816

Richard Relph replies:

Mr. Linke's letter makes a key observation. The C library is now "standardized." Without this change, program portability is unobtainable. It should be noted, however, that library routines are not mandated for freestanding environments, such as embedded systems.

I have some difficulty with his final paragraph. He seems to be wanting both maximal speed and maximal portability, a desirable goal, I agree. The two are often at odds with one another, however. Perhaps a solid example would help here. Suppose I need to perform a string copy operation. I could either call strcpy or code it in-line. Mr. Linke's assumption is that, by coding it inline, he would get the best performance. This is not necessarily true, even with today's non-ANSI compilers. Most libraries implement strcpy in assembly language, using instructions that most C compilers cannot normally generate, such as REP MOVSB in the 8086. Therefore, the programmer already doesn't know what's most efficient. Mr. Linke further assumes that he "knows what's going on when I call strcpy() in any compiler." Is he aware that some existing compilers perform in-line expansion of key library functions? MetaWare and Microsoft's both do this, and I'm sure others will follow.

By reserving the library function names for the compiler, the compiler can do several things that would otherwise be impossible. First, the library itself can call library routines. If users were allowed to replace strcpy, for example, other library routines such as printf could not call it. This essentially means that for the normal case when programmers do not replace strcpy, but do call it, the library must include two copies, one with some strange name for the library itself and one for the user to call.

Second, the library can be granularized at other than the function level. Although it is nice to have a library that is function granular, some machines cannot support libraries with so many chunks." If the implementor thought that strcpy and memcpy made a reasonable pair and put them in a single module, and the programmer replaced strcpy but called memcpy, what could the linker do?

Last, the compiler can now replace some calls to functions with in-line assembly-language code, taking into account the actual parameters. This is optimal efficiency, allowing the insertion of the assembly-language version of strcpy in the code, saving the call and return overhead.

At the risk of being taken literally and in the extreme, let me say that leaving operations at as high a level as possible is always desirable. If can call strcpy, or write it in-line, I will pick the call-for efficiency, for maintenance, for clarity, and for standardization.

Finally, let me say that these compiler features are what people will probably start basing their buying decisions on. This is a net plus. People who want what Mr. Linke wants will probably be able to get it and people who don't will be able to get something else.

Dimensional Data Typed in Forth

This letter is an addendum to Eric Lundquist's comment (Letters, August 1987) on Do-While Jones' article "Dimensional Data Types" (in Ada). His astonishment at the lengthy code required to implement this simple idea in Ada is, of course, justified. As a theoretical physicist and accident consultant who uses dimensioned data every day of my life, I was equally astounded.

Experienced Forth programmers, on the other hand, must have smiled indulgently, murmured "What fools these mortals be!", and passed on. As a fairly recent convert to Forth (and a heavy user of this language for scientific computing), however, I cannot pass up this opportunity (we recent converts tend to be zealots) to exhibit how easily dimensioned data can be implemented in Forth. Perhaps Mr. Lundquist--who feelingly alludes to the advantages of assembly language will then be sufficiently piqued to give Forth a try.

Suppose you need to input distances in various units. For example, U.S., English, and Continental police reporting accidents might wish to use inches, feet, and yards, or centimeters and meters. Rather than writing different versions of a program, and making users worry about which one they are using, it is simpler to write one program and make unit conversions part of the grammar. Thus, for example. you might keep all internal lengths in millimeters and convert as follows:

: INCHES 254 10 */;
: FEET [ 254 12 * ] LITERAL 10 */;
: YARDS [ 254 36 * ] LITERAL 10 */;
: CENTIMETERS  10 *;
: METERS 1000 * ;

The usage would be:

10 FEET . <cr> 3048 ok

These are more definitions than necessary, of course. A better alternative is a defining word UNITS:

: UNITS  CREATE  SWAP , , DOES> D@ */;

Then you would have what appears in Example 1, page 12. This is an improvement, but Forth permits a simple addition that allows conversion back to the input units when outputting (see Example 2, page 12).

Obvious modifications include use of floating point (to prevent the unintentional truncations of integer arithmetic) and defining a machinecode sequence <DO-UNITS> to replace the run-time instructions following DOES> if greater speed is needed. (It probably isn't--these definitions are intended to I/O a relatively small set of measurements from the console or a file.)

Finally, let me put in a plug for HS/FORTH (for IBM PCs/clones): Jim Callahan's Harvard Softworks (P.O. Box 69, Springfield, OH 45066) has been extremely supportive with upgrades, fixes, technical advice, and so on. The version of Forth is extremely fast "as is" but comes with all sorts of goodies and tools for using the 8087 chip, string handling, windowing, graphics, sound, and so on. Although the documentation of several years ago was rather cryptic, it is now excellent and includes Kelly and Spies' textbook. I have never seen this product reviewed or mentioned and cannot imagine why, considering its excellence.

Example 1: Conversion program using the defining word UNITS

254  10                   UNITS INCHES
254  12     *  10   UNITS FEET
254  36     *  10    UNITS YARDS
10 1                      UNITS CENTIMETERS
1000 1                    UNITS METERS
\ Usage:
10 FEET      . <cr>       3048 ok
3 METERS     . <cr>       3000 ok
\................................ \ etc.

Example 2: Code to convert back to input units when outputting

VARIABLE   <AS>      0  <AS>  !
:  AS  -1  <AS>  !  ;
:  UNITS  CREATE  SWAP  ,  ,  DOES>  D@  <AS>  @
       IF  SWAP  THEN  */  0  <AS>  !  ;
BEHEAD'  <AS>   \ TO  MAKE  IT  LOCAL  FOR  SECURITY

\  UNIT  DEFINITIONS  REMAIN  THE  SAME.
\  Usage:
10 FEET         .  <cr>  3048  ok
3048  AS  FEET  .  <cr>  10  ok

Julian V. Noble

105 Powhatan Cir.

Charlottesville, VA 22901

Instruction Timings

I hate to be the bearer of bad news, but Tom Disque's "8088 Assembly-Language Programming Techniques" in the July 1987 issue is inaccurate. In several cases Disque presents instruction timings that are not consistent with his assumptions.

In the second paragraph, Disque says, "Please note that all timings are for the 8088 microprocessor and assume that the 4-byte 8088 prefetch queue is empty at the start of execution." Although this is a good assumption because the 8088 is so common and the prefetch queue is always empty except for a few relatively rare circumstances, the timings that Disque gives for his instruction sequences do not correspond to this statement.

Disque claims that the simple sequence:

mov ax, cs mov es, ax

takes four cycles to execute. If the prefetch queue is empty, however, the instructions must be fetched first. These two instructions are 4 bytes, and the 8088 requires four clock cycles per memory cycle. This alone comes to 4 X 4, or 16 clock cycles, just to fetch the instructions, never mind any additional execution time. If the prefetch queue were full, you would have a different story, but Disque assumed that it wasn't.

I suspect that Disque probably meant to do better than this because later he gives the example:

shl ax, 1
shl ax, 1
shl ax, 1
shl ax, 1

He says: "Faster (8 cycles [plus 8 more to fetch the two extra instructions]..." I'm not sure what to make of this. Either he meant to assume the prefetch queue was empty or that it was full. In any case, it would take 16 clock cycles to fetch two instructions, not 8 (4 clock cycles per byte times 2 bytes per instruction times 2 instructions).)

Let's examine how the 8088 really executes the previous sequence. First, it must fetch the first instruction--eight clocks. It then starts fetching the second instruction while processing the first one. Processing the instruction takes two clocks and fetching the second instruction takes eight. Because eight is more than two, you really have to wait eight clock cycles. At this point the cycle repeats. The total "execution" time is 32 clock cycles.

As you can see, it takes the 8088 much longer to fetch its instructions than to execute them, so the 8088's speed is proportional to the number of bytes it must transfer. This includes instruction bytes as well as data processed by a program. This also means that the 8088's 4-byte queue is almost always empty.

The few exceptions are multiply and divide instructions, which take longer to execute than to fetch. There are also a few esoteric instructions that take longer to execute than to fetch, but their use is so rare that you can forget about them.

Therefore, to time 8088 code that does not include multiply or divide instructions, all you have to do is add the number of bytes of instruction that must be fetched to the number of bytes of data transferred in memory. Once you have the total number of reads and writes to memory, you multiply by four clocks per fetch. You need no instruction timing tables or photographic memory. All you need to be able to do is count and multiply by 4 (two left shifts in binary).

This same concept is used to time Z80 code, except for one quirk. The Z80 takes four clocks for the first byte of an instruction fetch and three clocks to fetch the data.

What about the 8086? The 8086 can fetch instructions faster than the 8088 can but is still basically limited to how fast it can fetch instructions (and read/write to memory). The 80386 should finally provide enough memory width (32 bits) to make a prefetch queue worth while.

Disque redeems himself, however, with his Example 8, which demonstrates an excellent, fast, generalpurpose, memory move routine. The correction for odd address transfers shows good understanding of that aspect of the 8086 processor.

Good assembly-language programmers such as Disque are rare, but sometimes we all become victims of Intel's overly optimistic instruction timings. Disque's instruction timings sound like they came from Intel.

Wenham, MA 01984

Correction

I'd like to provide some additional information about the Async AppleTalk article published in the October 1987 issue of DDJ.

Async AppleTalk is based on the source code of Apple Computer's AppleTalk driver. The following notice was dropped from the article during the editing process: The AppleTalk protocols and computer programs are licensed from Apple Computer Inc. and AppleTalk, Macintosh, and Laserwriter are trademarks of Apple Computer Inc. Portions copyright (c) 1985 Apple Computer and copyright (c) 1987 the Trustees of Dartmouth College.

Also, the disk being distributed by DDJ does not contain the executable desk accessory. This, along with the source files and other debugging tools, is available from the authors. Please send a blank (800K) Macintosh disk in a mailer to me at the address listed below.

Dartmouth College

Async AppleTalk Distribution

Kiewit Computer Center

Hanover, NH 03755

Stonecutter Error

With reference to the cover of your September 1987 issue, the statement TAKAGUCK will never be executed because ZOK-OOT-UK always sets GLOOTBLEE false. The correct statement should be XOK-OOT-OK, which uses GLOOTBLEE as an error indicator. Please inform your stonecutter to proofread the algorithms before printing.

James Sturdevant

Minneapolis, MN

To List or Not to List

In response to your question about source listings in Running Light (November 1987), I personally feel a big value in DDJ is the source listings. I use them and I'm sure a lot of the silent majority do too. You're right about a typing drill, so I normally try to download instead to typing. I agree with posting them on CompuServe, and maybe the idea of a BBS run by DDJ would be a good addition. Most of the bigger PC publications have established one for their readers.

One different idea I would like to push is the use of Cauzin Softstrips from Cauzin Systems ([203] 573-0150) in each publication, like PC Magazine does. Cauzin readers are cheap ($200) and will pay for themselves to readers like me who are far removed from California. I thought you were using Softstrips at one time, and I'm not sure what happened. Anyway, it is a cheap alternative for avid software users like me.

I hope you will consider my suggestion. It would sure be a help to a lot of us.

Computer Science Dept.

Illinois State University

133 B Stevenson Hall

Normal, IL 61761

I have a few comments regarding your question about the source listings. Keep small (one or two pages) listings. Put long listings on a BBS and make them available on disk. CompuServe is no solution. PC Magazine, Byte and Computer Language all have great BBSs for downloading. Also, make sure all necessary include files are present for C programs. Most published C source code is useless because the author uses routines hidden in header files that are not provided.

2G 70 Park Terrace

East New York, NY 10034