PERSONAL SUPERCOMPUTING

Cray's ideas turn a PC into a virtual-memory 64-bit supercomputer

Ian Hirschsohn

Ian holds a BSc in Mechanical Engineering and an MS in Aerospace Engineering. He is the principal author of DISSPLA and cofounder of ISSCO. He can be reached at Integral Research, 249 S. Highway 101, Suite 270, Solana Beach, CA 92075.


For any number of reasons, scientific and engineering computing has historically been the domain of mini- and supercomputers. For starters, scientists and engineers were the early computer users, and mainframes were the first systems on the scene. Furthermore, the number-crunching needs of scientific applications require huge programs to manipulate and analyze vast amounts of data, and it takes big systems to provide the necessary horsepower. While microcomputers have made incredible inroads in virtually every field, they've come up short in satisfying the computing needs of scientists and engineers. However, recent advances in multiprocessing architectures may finally be starting to tip the scales in favor of PCs in the engineering arena.

You can now, for instance, assemble EISA 486-based PCs with plug-in, multiprocessor 64-bit RISC cards and gigabyte disks that provide you with the effective performance of conventional mini- or mainframe computers--and for well under $20,000. In effect, you can have on your desktop a "personal supercomputer" that performs number-crunching tasks formerly relegated to big systems. But even though you can build a multiple-processor personal supercomputer, you'll still find your work cut out for you if you try to download multi-megabyte mainframe applications. Why? Because DOS and UNIX, the dominant PC operating systems, were never oriented toward the mainframe environment. DOS has yet to tap 32-bit protected mode--let alone 64-bit RISC processors--and multiprocessor support is still in UNIX's future.

In short, mainframe applications involve more than just compiling existing programs with Lahey Fortran using a Phar Lap 32-bit DOS-Extender--many apps are so big that they require virtual memory. These applications are typically floating-point intensive. Consequently, the five- to six-digit slide-rule precision of 32-bit floating point produces results that are pure fantasy after a few iterations. Typically, they are also data intensive, demanding maximum bandwidth disk and device I/O with binary record-oriented data handling. (DOS and UNIX byte-stream I/O was not designed for that.) DOS and XENIX have yet to speak 9-track tape and IBM 3480 cassette. So how can you port mainframe data to a PC?

To execute mainframe, you have to think mainframe; and to think mainframe, you must be familiar with the ideas of Seymour Cray, the genius behind today's supercomputing and mainframe architectures. In this and future articles, I'll discuss a PC-based system called PORT that adheres to Cray's design principles, thereby enabling what I call "personal supercomputing." PORT is a software environment somewhat analogous to, say, DESQview plus Phar Lap. A key difference is that PORT is portable, so its host system and processor(s) are determined by the author of its CP/PP interface programs (discussed later). The computation- and I/O-intensive applications I will examine demand a 386/486 PC (or clone) "muscle machine" with 16+ Mbytes of RAM, a 200+ Mbyte hard disk, and a TI 340x0 graphics accelerator with one or more plug-in i860 cards--under DOS 4.0 or higher. i860 cards are available from CSPI, Microway, Alacron, and others--I used a Hyperspeed D860. In contrast, my next article will use a 386SX laptop "wimp" with 4 Mbytes of RAM and a 60+ Mbyte disk--sans i860.

Thanks to Cray and his approach, PORT is able to implement almost any multi-megabyte mainframe application on a PC with plug-in RISC coprocessor(s). PORT's ability has been demonstrated with seismic cross sections, real-time photographic-image manipulation, projected world maps, large printed-circuit-board autorouting, three-dimensional solid modeling of complex images, NC tooling, and other compute- and data-intensive applications.

PORT represents a "super DOS extender" that turns a 386/486 PC into a virtual-memory, 64-bit, mainframe-capable machine, while maintaining full compatibility with DOS and Windows. Any system which presumes to supplant the wealth of PC applications is a bit naive; PORT augments--not replaces--existing apps.

Supercomputing

Supercomputing conjures up visions of inverting massive matrices, fluid dynamics, and other esoteric applications. But many workaday applications defy the most loaded PCs and even workstations. Consider Figure 1(a), which shows an original photograph, and Figure 1(b), which has been electronically retouched by adding in the eye detail. This sort of retouch is routinely done on PCs (or on Macs, via Photoshop or Colorlab). The difference here is that the photograph was manipulated as an 8x8-inch, 1000-line-per-inch image at 24 bits per pixel. This represents 8000x8000x3 bytes, or 192 Mbytes, as a single frame in real time. By comparison, anything beyond a 10-Mbyte TIFF file tends to be impractically slow on a PC or Macintosh. Photographic-image processing is an extreme test of a system's hardware and software ability to handle massive I/O fast, yet there is minimal computation and negligible floating point.

Figure 2 shows a photograph of a seismic cross section. Seismic processing consumes more supercomputer dollars than any other single application. (It costs $30 million to drill a hole, whether it has oil or not. Saving just one dry hole pays for a Cray Y-MP.) Thousands of acoustic traces are mapped into geologic-depth cross sections, requiring extensive floating-point operations on the data to account for varying sonic velocity through different rock strata. Just the ability to read the 1- to 5-Mbyte individual 9-track field tape records is nontrivial. Geophysicists prefer to study the "big picture" on paper media, where they can scratch away with felt-tip pens. The hard copy is typically 4 feet wide by 10 to 20 feet long and is produced on 300-dpi electrostatic plotters. These plotters must be fed data continuously at full speed, otherwise they halt and leave ugly toner bands. Advanced Technologies' PC Micromax excels at seismic cross sections on a VGA screen, but the area covered is limited and it cannot handle the field tapes.

Supercomputing is not a different set of applications--it is simply a matter of scale. For example, Wordperfect is a capable tool for indexed manuscripts, but what if you have to cross-reference the Encyclopedia Britannica or the service manuals for a Boeing 747? All these examples are easily within the capability of a 386/486 PC with plug-in RISC card(s).

Thinking Cray

The DOS/UNIX/Windows trend is toward running as many concurrent tasks as possible on the same processor. Cray realized that no single processor can be all things to all tasks. (See the accompanying textbox entitled, "CDC 6600: Anatomy of a Supercomputer.") The structures of the ideal computation processor and the ideal I/O processor are at loggerheads. The 386/486, with its small register set and powerful interrupt handling, is an excellent I/O processor. A RISC processor such as the i860, with its 64 registers, four on-chip processors, and blistering pipelined floating point, is an ideal computation processor--but a poor I/O vehicle. (See the textbox entitled, "RISC: Rhetoric and Reason.")

In what I call a "PC+i860" (that is, a 386/486 PC with one or more i860 processors) we have the hardware model of a supercomputer that can adhere to the Cray philosophy: It can dedicate all the resources of a multiple-processor configuration to the single application--the diametric opposite of multitasking systems. The Cray elegance is that if the 386/486 and the i860(s) are all focused on the current application, managing them is simplified considerably.

In the Steps of the Masters

Although PORT has only recently been implemented on the PC, its predecessor--the Superset system--has been in active service since 1979. The Superset 48-bit system was developed starting in 1977 to move the DISSPLA graphics library off mainframes. (DISSPLA is a proprietary Fortran package formerly of ISSCO, now Computer Associates, and one of the most widely used on mainframes.) DISSPLA was too large for the 16-bit PDP-11 and Data General Nova--the prevalent minicomputers of the mid-seventies. Consequently, we designed a custom 48-bit RISC machine using AMD 2903 bit-slice processors, which required us to develop a complete virtual-memory Fortran system with all utilities and libraries from scratch. The 48-bit custom hardware was overshadowed by the workstations of the '80s, but Superset still uses it for its photo-retouch system, which outperforms competing products on standard workstations even today.

The hardware may be outdated, but the system's ability to handle multi-megabyte applications in a virtual-memory, multiple RISC environment is timely. In 1989 I set about converting the 1,000,000 or so lines of code to a machine-independent, virtual-memory, 64-bit version. The result was PORT.

Key to personal supercomputing is that PORT was developed expressly for mainframe applications from the outset, not as an afterthought. From 1967 to 1976, as part of ISSCO, we had to support DISSPLA on two dozen IBM, UNIVAC, DEC, CDC, Burroughs, and Honeywell mainframes with their diverse operating systems. As applications programmers, our criterion when we designed the 48-bit RISC system was to execute the application as efficiently as possible. The nuances of forks, environments, parents, and children just added to the overhead. Not being systems programmers, we simply copied the best features of each mainframe system: the architecture of the CDC 6600/6400/7600, the UNIVAC Exec 8 file system, the IBM OS/370 command syntax, Burroughs' Master Control Program and integrity checking, and others. PORT, therefore, represents an incarnation of the proven ideas of some of the finest minds, embodied in systems developed for machines where money was no object. (Developed, in other words, by PhDs, not MBAs.)

The Cray Approach Implemented

PORT religiously adheres to Cray's CDC 6600 schema. The 386/486 becomes the Peripheral Processor (PP) and the RISC coprocessor is turned into the Computation Processor (CP). All data transfers between the processors are memory-mapped through common mailboxes, as in the CDC 6600. The i860 card local memory becomes the shared memory.


To see how this all integrates, consider the Fortran code in Figure 3. The Fortran program executes entirely on the i860, together with all libraries and other utilities. The PRINT statement causes a call to the FORMAT I/O handler in the PORT kernel (a Dynamically Linked Library), which formats the string Value is 12345.678901 on the i860. The string is passed to the central PP interface, which composes a 5x64-bit-word mailbox containing the code for "Write Text To Screen" and a pointer to the string in i860 card memory. The i860 then sets a semaphore flag, or "rings the PP's bell" via an interrupt, depending on the i860 card. Upon detecting the semaphore or interrupt, the 386/486 PP copies the mailbox contents and the string into its own memory, then executes the I/O request and displays the line on the screen. Once the mailbox contents and string are transferred, the PP releases the i860, enabling the 386/486 and i860 to process in parallel.

Figure 3: Executing a Fortran program on the i860

        print 101,VALUE        with VALUE=12345.678901
  101   format ('Value is ',F12.6)
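
The handshake just described can be sketched in C. The mailbox layout, opcode value, and function names below are illustrative assumptions, not PORT's actual interface; the point is that the CP is released as soon as the PP has copied the request, not after the I/O completes.

```c
#include <stdint.h>

/* Hypothetical five-word mailbox in shared i860 card memory. */
typedef struct {
    volatile uint64_t opcode;   /* e.g. "Write Text To Screen"       */
    volatile uint64_t ptr;      /* address of string in card memory  */
    volatile uint64_t length;   /* bytes to transfer                 */
    volatile uint64_t status;   /* completion code from the PP       */
    volatile uint64_t flag;     /* semaphore: CP sets it, PP clears  */
} mailbox_t;

enum { WRITE_TEXT_TO_SCREEN = 1 };

/* CP side: post the request, then spin until the PP has copied it. */
void cp_post(mailbox_t *mb, uint64_t card_addr, uint64_t len)
{
    mb->opcode = WRITE_TEXT_TO_SCREEN;
    mb->ptr    = card_addr;
    mb->length = len;
    mb->flag   = 1;             /* "rings the PP's bell"             */
    while (mb->flag)            /* released when the PP has the      */
        ;                       /* mailbox, not when the I/O is done */
}

/* PP side: detect the semaphore, copy the mailbox to local memory,
 * and release the CP so both processors run in parallel. */
int pp_poll(mailbox_t *mb, mailbox_t *local)
{
    if (!mb->flag)
        return 0;               /* nothing pending                   */
    local->opcode = mb->opcode;
    local->ptr    = mb->ptr;
    local->length = mb->length;
    mb->flag = 0;               /* CP resumes; PP now does the I/O   */
    return 1;
}
```

On real hardware the spin-wait would be a semaphore test or an interrupt, depending on the i860 card, exactly as the text notes.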

While the i860 is computing, the PP is free to do other chores such as servicing RS-232 COM I/O, network transfer, disk/tape caching, and tape streaming. PORT turns all available PC extended memory into a disk/tape cache pool (a 31- to 95-Mbyte cache). Large cache is invaluable for streaming-tape I/O, electrostatic plotters, and other devices requiring a sustained, high-speed data flow.

PORT runs almost entirely on the i860, so the 386/486 is free to run DOS multitasking (including Windows) provided the 386/486 PP is available when needed.

RISC From Day One

Most RISC compilers originated as CISC compilers with code generators modified to output RISC code. This strategy tends to overwhelm the on-board RISC instruction cache, resulting in "cache thrash." (Again, see "RISC: Rhetoric and Reason.")

Designed for a custom bit-slice RISC board from the outset, the PORT Fortran/C compiler outputs metacode--not direct RISC instructions. The resident RISC program decodes the metacode and performs the requisite operations. This approach is tantamount to defining a custom Fortran/C instruction set, using the RISC instructions as programmable microcode. The entire i860 version of the decoder is only 18 Kbytes, so it rarely cache misses. Like CISC microcodes, the PORT decoder was carefully handcoded in RISC assembly to maximize performance. It plugs almost every free cycle, uses all of the 64 i860 registers, trips off memory references as many cycles ahead as possible, and uses few internal subroutines.

The metacode decoder turns the RISC processor into a "Fortran/C engine." With the front-end PP to field all I/O, the metacode has no I/O instructions whatsoever (just a "PP Interrupt and Wait"), thereby sidestepping the single most complicated section of CISC microcodes.

Although the metacode approach is superior, it is avoided by almost all RISC systems because of the perceived overhead of decoding each meta-instruction. PORT uses 15 to 20 i860 instructions to decode each meta-instruction--a heavy penalty for A = B. What is overlooked is that just one cache miss consumes the equivalent of eight to ten RISC instructions: In practice the efficiency of a handcoded decoder more than makes up for the overhead. Because the overhead to decode A = B is the same as for A = SIN(B), A = SQRT(B), and A(I,J) = B(J + 20)**I(L), it becomes apparent why it is not a dominating factor in actual applications.
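
As a sketch of what such a decoder does, the following C models a minimal metacode dispatch loop. The 64-bit instruction layout (8-bit opcode, three 13-bit operand fields) and the opcode values are hypothetical, chosen only to illustrate the technique; PORT's actual encoding is not published here.

```c
#include <stdint.h>

/* Hypothetical 64-bit meta-instruction: A = B op C. */
#define OPC(w) ((int)((w) >> 56))           /* 8-bit opcode        */
#define FA(w)  ((int)((w) >> 26) & 0x1FFF)  /* 13-bit destination  */
#define FB(w)  ((int)((w) >> 13) & 0x1FFF)  /* 13-bit operand B    */
#define FC(w)  ((int)(w)         & 0x1FFF)  /* 13-bit operand C    */

enum { M_ADD = 1, M_MUL = 2, M_HALT = 3 };

/* One resident copy of this loop serves every A = B op C in the
 * program, so the decoder stays in i/cache instead of thrashing it. */
void decode(const uint64_t *pc, double *frame)
{
    for (;;) {
        uint64_t w = *pc++;                 /* fetch one meta-word  */
        double b = frame[FB(w)];            /* operand fetch is part */
        double c = frame[FC(w)];            /* of the decode penalty */
        switch (OPC(w)) {
        case M_ADD: frame[FA(w)] = b + c; break;
        case M_MUL: frame[FA(w)] = b * c; break;
        case M_HALT:
        default:    return;
        }
    }
}
```

The decode cost is the same whether the opcode dispatches to an add or to a transcendental routine, which is why the per-instruction overhead fades in real applications.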

Also overlooked is that the sheer volume of RISC code generated by native RISC compilers is prohibitive unless many basic operations--divide, multiple subscripts, modulus, and others--are executed by subroutines. The call/return overhead (usually involving memory and often a stack) for this so-called "threaded code" is substantially greater than the register ops used to decode a meta-instruction. The efficiency of generating native RISC code is illusory--even with infinite cache. I believe that most system programmers are new to high-performance RISC; understandably, they still apply CISC methods and prejudices.

A key aspect of the metacode approach is its indispensability to multiprocessor operation. The decoder tests each meta-instruction for semaphore bits from ancillary processors (and the PP) as part of the decode sequence. Thus the multiprocessor handling is all under software control, simplifying it and providing a direct mechanism for the application to manage it. (Significantly, it enables PORT to be machine independent.)

Metacode Custom to Fortran/C

A detailed description of the PORT metacode is left to a subsequent article. Because it is central to the potential for supercomputer performance and to multiprocessing, I'll highlight its salient features.

All PORT metacode instructions are of the form: A = B op C. Examples of this form are A = B + C, I = J/K, if(B = C) go to A, and call A(Blist,Count). All meta-instructions are 64-bit words, with the identical format to speed decoding. Array references are integral operand modes. For example, A(J) = B(K,L)* C(M + N) is a single instruction. (Part of the power of the metacode approach is incorporating Fortran/C indirect addressing modes as native.)

Like the mainframes that PORT emulates, all operands are 64 bit, including integers, floating point, and pointers; strings are 64-bit aligned. The i860 and other RISC processors have a 64-bit memory path, so the time savings for shorter operands are minimal. On the other hand, 64 bits can pack an awful lot.

A feature of this metacode approach is that the decoder can choose a more efficient algorithm, depending on the value of the operands. For example, the i860 has no divide instruction, integer or floating point. An integer divide involves converting to floating point, iterating a Newton-Raphson approximation, and converting back to integer--just like the CDC 6600. (The i860 multiply is so fast, the time is not much more than a typical CISC IDIV.) Internally, the PORT decoder uses fewer steps and 32-bit register operations if the operands are found to be less than 32 bits, which is the usual case.
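
The divide-by-reciprocal technique can be sketched in C. The linear seed and four Newton-Raphson iterations below are textbook choices, not PORT's actual microcode (which could start from the i860's hardware reciprocal-seed instruction); the final correction loop handles rounding at the integer boundary.

```c
#include <math.h>
#include <stdlib.h>

/* Reciprocal by Newton-Raphson: r' = r * (2 - d*r) doubles the
 * number of correct digits on every pass, just as on the CDC 6600. */
static double recip(double d)
{
    int e;
    double m = frexp(d, &e);               /* d = m * 2^e, m in [0.5,1) */
    double r = 48.0/17.0 - 32.0/17.0 * m;  /* linear seed, ~6% error    */
    for (int i = 0; i < 4; i++)            /* 4 passes reach full        */
        r = r * (2.0 - m * r);             /* double precision           */
    return ldexp(r, -e);
}

/* Integer divide with no divide instruction: convert to floating
 * point, multiply by the reciprocal, convert back, and correct. */
long idiv(long a, long b)
{
    int neg = (a < 0) != (b < 0);
    double q = (double)labs(a) * recip((double)labs(b));
    long t = (long)(q + 0.5);
    while (t * labs(b) > labs(a)) t--;     /* guard against rounding */
    while ((t + 1) * labs(b) <= labs(a)) t++;
    return neg ? -t : t;                   /* truncate toward zero   */
}
```

A decoder can further shortcut this: if both operands fit in 32 bits (the usual case), fewer iterations and 32-bit register ops suffice, as the text describes.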

Most important to our focus on supercomputing, the PORT metacode defines many high-level operatives to be "direct instructions." (SQRT, LOG, SIN, ATAN2, B**I, B**C, type COMPLEX ops, and most intrinsics are meta-instructions.) Furthermore, block ops (copy, initialize, search, and checksum) are also direct meta-instructions, as are type CHARACTER ops. Current work on the metacode is largely focused on expanding the block meta-instructions. These include vector/matrix multiply, sort, vector scale+translate, and others.

The user can extend the metacode to incorporate operatives specific to his own application, such as 3-D transform, polar coordinates, map projections, specialized sorts, and even string search, ignoring spaces and case. Experience has shown that implementing such operatives in direct RISC assembly typically produces orders of magnitude performance improvement. These added intrinsics are referenced as if they were subroutines. For instance, CALL ARYMOV(A(J),N,B(K)) copies N 64-bit words from A(J) to B(K) as a direct meta-instruction.
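
A hypothetical ARYMOV-style block intrinsic might look like the following C. In PORT the body would be hand-coded i860 assembly dispatched as a single meta-instruction; the C here only models the semantics.

```c
#include <stdint.h>
#include <stddef.h>

/* Copy n 64-bit words from src to dst (non-overlapping), unrolled
 * by four to mirror the i860's four-word burst memory loads. */
void arymov(const uint64_t *src, size_t n, uint64_t *dst)
{
    while (n >= 4) {
        dst[0] = src[0]; dst[1] = src[1];
        dst[2] = src[2]; dst[3] = src[3];
        dst += 4; src += 4; n -= 4;
    }
    while (n--)                 /* remaining 0-3 words */
        *dst++ = *src++;
}
```

From Fortran this would be invoked as CALL ARYMOV(A(J),N,B(K)), with the call recognized by the compiler and emitted as one meta-instruction rather than a subroutine jump.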

Virtual Memory Without Virtual Memory

A shortcoming of Cray's model is its impracticality for traditional "virtual memory," which allows executing programs larger than real memory by swapping pages to and from disk--transparently to the program. This limits program size even on a Cray Y-MP. The obvious reason for avoiding virtual memory is the virtual-to-real address-translate overhead on every memory reference. Less obvious, but more serious, is that the real-memory pages end up scattered all over memory. Thus, real memory becomes fragmented. To use memory-mapped I/O between the peripheral processor and computation processors would require going through the same translate table. The bookkeeping, coherence protocol, and overhead make multiprocessor virtual memory a nightmare. To be practical and allow high-speed burst DMA, shared memory areas should be contiguous blocks.

PORT implements virtual memory by observing that 85 percent of memory references are local and don't require address translation in the first place. Measuring local addresses from the start of each subroutine takes care of these 85 percent. This leaves only four instances in which virtual memory is actually required--array/COMMON references, pointers, arguments, and call/returns. PORT takes care of these as part of the metacode decoder via software "microcode." The RISC overhead for this 15 percent of addresses is not severe, and much of it can be buried between memory references and in free cycles. As a software scheme, it is hardware-independent, which makes it portable.
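
A minimal sketch of this split, in C: local references are plain frame-relative offsets with no translation at all, while the minority of array/COMMON and pointer references pass through a software page table. The 32-Kbyte page matches the text; the table layout and page-fault stub are assumptions for illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT 15                     /* 32-Kbyte pages          */
#define PAGE_WORDS (1u << (PAGE_SHIFT - 3))  /* 4096 64-bit words    */

typedef struct {
    uint64_t *real_base[256];   /* virtual page -> resident frame    */
} vmem_t;

static uint64_t backing[4][PAGE_WORDS];   /* stand-in for disk pages */

static uint64_t *page_in(vmem_t *vm, unsigned page)
{
    (void)vm;                   /* real code would read 32 Kbytes    */
    return backing[page % 4];   /* from disk into a locked frame     */
}

/* The ~15% case: translate a word-granular virtual address.  This
 * runs as part of the decoder, only for arrays/COMMON/pointers.    */
uint64_t *translate(vmem_t *vm, uint64_t vaddr_words)
{
    unsigned page = (unsigned)(vaddr_words / PAGE_WORDS);
    unsigned off  = (unsigned)(vaddr_words % PAGE_WORDS);
    if (!vm->real_base[page])
        vm->real_base[page] = page_in(vm, page);
    return vm->real_base[page] + off;
}

/* The ~85% case: a local is just frame + 13-bit offset, no table.  */
static inline uint64_t *local_ref(uint64_t *frame, unsigned off13)
{
    return frame + (off13 & 0x1FFF);
}
```

Because whole 32-Kbyte pages are locked into contiguous real memory, other processors sharing the data can treat it as a flat block, which is what makes memory-mapped I/O between the PP and CP practical.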

Memory overhead being the bane of RISC, a significant feature of this scheme is that 85 percent of memory references use 13-bit fields. Thus A = B op C can be specified in a single 64-bit RISC word. This utilizes RISC cache more efficiently and speeds decoding, thereby improving performance.

Key to multiprocessor supercomputing is that the scheme uses massive pages: currently 32 Kbytes, soon to be 64 Kbytes. Fewer pages make it practical to exchange pages to form the requisite contiguous memory blocks and lock them in real memory. Thus other processors sharing the data perceive it as contiguous memory. Finally, PORT circumvents the chief limitation of the Cray model by being tailored specifically to Fortran/C.

Multiprocessing

Multiprocessing is commonly viewed as a collection of identical, self-sufficient processors on a common bus, each executing its own "thread." Such symmetrical organization requires the application to be broken into self-sufficient tasks, but this is not always possible. Even when it is feasible, breaking a massive program into free-standing threads, complete with all intercommunication, can be as much work as writing the application in the first place. The operating-system overhead on each processor can neutralize the performance benefits. Lastly, no matter how fast the bus, the rush-hour traffic jams tend to degrade the system. The processors invariably domino until all are waiting in line for the bus. Although symmetric multiprocessing is expounded in many articles, it has yet to see widespread commercial use.

In almost every application my colleagues and I have studied--CAD, seismic, image processing, 3-D modeling, and even editors and compilers--80 to 95 percent of the computation is concentrated in less than 5 percent of the code. Experience has shown that multiprocessing just one or two subroutines typically produces an order of magnitude performance improvement. In cases such as 3-D rendering, seismic wiggle-trace fills (Figure 2), Fast Fourier Transforms, RGB-to-CMYK transformation (Figure 1), critical-path routing, vector-to-raster conversion, and so on, transferring the data back and forth to symmetric processors across a bus can take more time than the processing itself. On a similar note, I/O overhead has proved the nemesis of array processors, causing them to fall into disfavor of late.

Based on the way most applications operate, PORT extends Cray's model to pragmatic multiprocessing. The multiple ancillary RISC processors have access to the memory of the RISC Computation Processor. This hardware configuration is widely available. For example, the Hyperspeed D860 PC/AT card has two i860s sharing a common memory pool, and multiple cards can be interconnected via 64-bit, memory-to-memory flat cable across the top of the cards. Mercury, DuPont, and CSPI have similar solutions for workstations. Thus the data resides in shared or commonly accessible memory, eliminating the need for bus transfers. (At the 1991 ACM Siggraph conference, Hyperspeed exhibited ten i860s in a PC popping up Mandelbrot fractals faster than a dedicated Cray Y-MP. At the 1992 NCGA conference, they demonstrated an eight-i860 PC ray tracing an image of 25 transparent spheres with 25 levels of reflection in roughly three seconds--about 400 Mflops!)

The body of the application executes in the Computation Processor, but PORT provides the mechanism for the application to access subprograms running in the ancillary RISC processors. The subprograms are typically a few hundred lines of critical RISC code; data is shared via COMMON block arrays and communication is via mailboxes. The application controls the sequencing and synchronization of the ancillary processors using calls to PORT-provided system subroutines. This hands-on, pragmatic approach has proved remarkably effective, and application programmers appreciate having the control. For example, in Figure 1 and Figure 2, the application typically uses two to four auxiliary RISC processors.
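
The control pattern can be sketched in C. The subroutine names (cp_start, cp_wait) and the mailbox fields are illustrative assumptions, not PORT's actual calls; the ancillary processor's resident loop is modeled here as an ordinary function.

```c
#include <stdint.h>

/* Hypothetical per-processor control block in shared memory. */
typedef struct {
    volatile int busy;          /* semaphore the ancillary CP clears */
    int kernel;                 /* which hand-coded routine to run   */
    double *data;               /* shared COMMON-block array         */
    int n;
} ancillary_t;

/* Application side: start kernel on an ancillary processor and
 * return immediately, leaving the main CP free to keep computing. */
void cp_start(ancillary_t *p, int kernel, double *data, int n)
{
    p->kernel = kernel;
    p->data   = data;
    p->n      = n;
    p->busy   = 1;              /* ancillary processor sees this     */
}

/* Application side: block only at the synchronization point. */
void cp_wait(ancillary_t *p)
{
    while (p->busy)
        ;
}

/* What the ancillary processor's resident loop does -- e.g. kernel
 * 1 scales a trace in place on data it shares with the main CP. */
void ancillary_run(ancillary_t *p)
{
    if (p->kernel == 1)
        for (int i = 0; i < p->n; i++)
            p->data[i] *= 2.0;
    p->busy = 0;                /* signal completion                 */
}
```

Because the data never crosses a bus, the cost of farming out a critical subroutine is just the semaphore handshake.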

Proof of the Pudding

Table 2 shows the performance of PORT on a 20-MHz PC with a plug-in 33-MHz i860 card vs. the HP 9000 series 720--today's superworkstation. The tests used were the DISSPLA User Manual sample plots, a set of large graphics examples running on dozens of mainframes/superminis and not slanted to any machine. DISSPLA is supplied by Computer Associates as CA-DISSPLA and the equivalent library under PORT as its Graphics Subroutine Library.


These results show that the PC+i860 under PORT can, in the case of DM7004, match the HP 720 at its full 50 MHz. Note that in the short plot cases involving minimal computation per vector where the HP 720 outperforms the PC+i860, the latter is bound by the speed of the 16-bit PC ISA bus. (The times are identical for 386/20 and 486/33 host PCs. Upcoming EISA i860 cards and the Hauppauge 4860 should eliminate this mini/micro bottleneck.)

Interestingly, according to Table 1, the PC+i860 is four to ten times slower than the HP 720, which in turn outperforms the Sparc and RS/6000. Arguably, the code for GSL has diverged from DISSPLA over the years and may be more efficient in many instances. On the other hand, PORT runs all 64 bit with software virtual memory and avoids a globally optimizing compiler. The bottom line is that both packages produce the identical output.

Table 1: RISC performance under popular benchmarks as provided by Personal Workstation (June 1991). Values are presented for rough comparison only because the performance on actual large-scale applications may be different for the RISC processors. (Higher numbers are faster.)

  Processor                             Dhrystone        Linpack
                                      2.0/2.1 with   Single    Double
                                        register    (32-bit)  (64-bit)
  --------------------------------------------------------------------

  CISC
  486/25 via DOS Extender (typical)      26,300        1.16      1.08
  486/33 via DOS Extender (typical)      34,000        1.50      1.40
  RISC
  i860/33 (Microway Number Smasher)      29,819        1.23      1.11
  SPARCstation SLC                       18,255        2.25      1.20
  Silicon Graphics Iris 25D              24,630        2.62      1.35
  Motorola 88000/25 (Everex 8825)        50,033        1.67      1.02
  MIPS 3000/33 (Magnum 3000/33)          56,012        6.48      4.80
  IBM RS/6000 (POWERstation 320)         45,454        8.15      7.29
  HP 9000 series (model 720, 50 MHz)     86,335        17.0      14.4

My purpose is not to present a horse race, but to vindicate the "obsolete" mainframe methodology of PORT. The PORT results are also more consistent with the spec-sheet timings for both processors.

These results illustrate the performance of a single i860 as the Computation Processor. Experience has shown that the introduction of ancillary RISC coprocessor(s) improves throughput so dramatically that there is no comparison. An analogous example is the performance of a Silicon Graphics Indigo rendering 3-D models via a MIPS 3000 coprocessor.

Supercomputing by Low Entropy

PORT strives to achieve performance by low entropy rather than brute force: It focuses on minimizing overhead and presenting RISC processors with the maximum information in the minimum bits. It takes the view that the application programmer can maximize resource use more effectively than a big-brother system. FoxPro 2.0, Turbo C, and Norton Backup also testify to the low-entropy approach.

To illustrate PORT's bit efficiency, the basic PORT system--including the extended Fortran/C compiler, linker, editor, virtual-memory manager, file manager, and all system libraries--packs onto a single 1.44-Mbyte floppy--the distillation of several hundred thousand lines of Fortran source code. For mainframe users, the entire GSL (DISSPLA) library, plus drivers for 150 devices and utilities, packs onto two 1.44-Mbyte floppies. (On average there are 1.2 meta-instructions per executable source statement.) The importance of bit efficiency is placed not on program exchange, but on reducing RISC memory access and thereby on performance.

We have seen the future, and it is multiprocessor RISC. Current systems will have to come to terms with this, sooner or later. The methodology incarnated into PORT is field proven and catholic. Whether PORT ever becomes a factor or not, hopefully it will help keep Microsoft and Sun honest. In a future article, I'll show that the PORT approach can be implemented on almost any platform, including UNIX and Sparc. Although the PORT approach works best with add-on RISC processor(s), I will show that it works surprisingly well in the single-processor environment of a stock 386SX laptop.


RISC: Rhetoric and Reason

Reduced Instruction Set Computers (RISC) are touted as the panacea of computing, promising one instruction per cycle and multi-megaflop floating point. Yet in many benchmarks, RISC performance isn't much better than Complex Instruction Set Computers (CISC) such as the 80386/486 and 680x0. Paradoxically, both views are correct. RISC is capable of phenomenal performance, but most current systems do not exploit its full potential. Almost all CISC processors are RISC internally: a RISC core driven by microcode burned into on-chip ROM. Explicit RISC processors, on the other hand, allow their RISC code to be loaded into on-chip RAM cache. Hence, any RISC strategy functionally similar to a CISC processor is unlikely to achieve a spectacular speed improvement, which explains why the 486 in 32-bit protected mode is vexing Sparc and other RISC processors (see Table 1).

The i860 incorporates four independent processors on the same silicon die: integer unit, floating-point adder, floating-point multiplier, and graphics unit. These processors can operate in parallel and can be fed one instruction each cycle. Thus, a 40-MHz i860 is theoretically capable of 80 Mflops. By definition, however, RISC instructions are primitive, and it takes many of them to perform the same function as a CISC instruction. For example, the 80x86 instruction ADD Value,10 requires the i860 sequence shown in Figure 4. Although the i860 needs five 32-bit instructions, the 5+1 cycles to execute them is roughly the same as the CISC instruction. Indeed, the cycle count for CISC processors largely represents the count of RISC microcode instructions.

Figure 4: Equivalent i860 instructions for the 80x86 instruction ADD Value,10

  ORH    Value_HI,r0,r3   Place upper and lower 16
  OR     Value_LO,r3,r3    bits of VALUE addr in r3
  LD.L   0 (r3),r4        Load [0+r3] into r4
  ADDS   10,r4,r4         r4 = r4 + 10
  ST.L   r4,0 (r3)        Store r4 in [0+r3]

The example in Figure 4 illustrates several of the features and failings of RISC. On the plus side, once register r4 is loaded, it can be manipulated at the rate of one operation per cycle. (The sequence can be modified to perform an Add, Shift, and Test as if it were a single custom instruction in almost the same time as a simple ADD. The i860 can perform a 64-bit, floating-point multiply in four cycles, a 32-bit in three cycles, and an add in two, so the example could include floating-point operations at a speed far beyond the 386/486. Furthermore, the units can be "pipelined"--with a new multiply/add initiated every cycle in single precision.)

The minus side is more subtle. The i860 is not telepathic; the instruction sequence must be loaded in instruction cache and ready to go. If it is not, the i860 "cache misses," requiring the RISC instructions to be loaded from memory. A cache miss is expensive. On any 40-MHz processor it costs at least three to four cycles, and on the i860 it is much worse. The i860 minimizes memory overhead by transferring four consecutive 64-bit words as a block. The address lines are loaded only on the first word, and the next three words load in half the time. The down side is that any cache miss costs eight to ten cycles. Thus unless the RISC program can reside in i/cache with few misses, the performance is usually lackluster.

The LD.L does not complete the load to r4--it merely initiates it. The ADDS waits until r4 is loaded before proceeding to add. Here, a data-cache miss again costs eight to ten cycles. If you can find eight to ten instructions to insert between the LD.L and ADDS, there is no wait. Therein lies part of RISC's power. A clever compiler could theoretically find eight to ten instructions to plug the wait. In real life, only the programmer understands his algorithm well enough to do the substantial reorganization needed. Globally optimizing compilers also have the nasty habit of reorganizing when no reorganizing is desired--often producing wrong code.

This example also alludes to why the traditional compiler approach of producing direct RISC object code is not the most effective strategy. It takes at least five times as many RISC instruction bytes to do the same thing as a CISC instruction, yet the real-estate expense of on-chip i/cache limits its size. The i860 has only 4 Kbytes i/cache (optimistically about 70 lines of Fortran or C). If you have a loop of 71 lines, it will cache miss every eight RISC instructions. If the loop has the typical numerous branches and calls, it cache misses almost every other RISC instruction ("cache thrash").

As evidence of these observations, Table 1 shows the performance of the 33-MHz i860 on the Dhrystone and Linpack benchmarks. According to these benchmarks the i860 is barely able to best a 25-MHz 486 and is supposedly 14 to 17 times slower than an HP 9000/720 in floating point. Even adjusting for the 50-MHz clock rate of the HP 720, it is supposedly an order of magnitude faster than the i860. Yet the timings of the i860's operations are comparable to those of the HP 720. Why? Because the HP 720 has 256 Kbytes of cache to the i860's 4 Kbytes. I contend the timings are measuring i860 cache thrash!

Bigger cache is not necessarily the solution. The new i860XP has 16 Kbytes of i/cache, but many large, compute-intensive applications have critical loops that easily exceed that if the compiler outputs RISC code. Even if the i/cache were infinite, it would take several cycles to load instructions from memory.

This explains why the i860 can do somersaults on small hand-coded assembly functions such as Fast Fourier Transforms and the CSPI array processor library, yet be a dog as a general-purpose processor.

The most desirable RISC strategy is to implement only one instance of RISC code for each high-level instruction (one instance of A = B+C, one of A = B*C, and so on, rather than a copy for every A = B+C in the program). This strategy loads i/cache only once. Such a "metacode" approach is a CISC! Interestingly, on the i860 these CISC meta-instructions, regarded as data, go into its 8 Kbytes of data cache, so that the four-word memory load acts as a "32-byte prefetch queue."

CDC 6600: Anatomy of a Supercomputer

The CDC 6600 was Seymour Cray's seminal supercomputer design and its architecture is fundamental to today's Cray machines and most superminis. Studying the CDC 6600 architecture enables us to focus on the salient supercomputer features without bogging down in later refinements.

The CDC 6600 revolutionized mainframe design in the early '60s by implementing a 60-bit CPU that had no I/O capability. Instead it was surrounded by ten Peripheral Processor Units (PPUs), the sole function of which was to service I/O requests. One PPU was devoted to the banks of tape drives, another to disks, and yet another to the operator console. Other PPUs were assigned to servicing the hundreds of user terminals, remote job-entry stations, card readers, and so on.

Also revolutionary was that PPU-to-CPU communication was memory mapped--the PPUs had direct access to the main memory of the CPU. The PPUs would load requests or even entire jobs into areas of main memory and cause the CPU to jump to the prepared areas. Thus there was none of the bus overhead or time-consuming ACK/NACK protocols of other, bus-oriented mainframe architectures.

Because the CPU was unencumbered with I/O constraints, its internal architecture could focus on expediting computation. The CDC 6600 had a blistering-fast floating-point unit, and even its 60-bit integer ops were far beyond any competitor. Today, 30 years later, the CDC 6600 is still a machine to beat.

The Chippewa Falls Operating System (CHOPS) also broke ground in operating-system design, delivering performance far beyond the bloated IBM OS/360 and other systems. (CHOPS was superseded by MACE, which later became KRONOS, then NOS.) The saving grace of the IBM System/360 was its virtual memory; the CDC 6600's memory allocation was claustrophobic. Although the efficacy of this low-entropy system and architecture was central to the design of PORT, the lessons of its memory limitations and lack of execution checks were also important. (The follow-on CDC 7600 was so unreliable that programs were often run twice to check the results.)


Copyright © 1992, Dr. Dobb's Journal