PERSONAL SUPERCOMPUTING: SEAMLESS PORTABILITY

A hardware-independent "virtual computer" is the key

Ian Hirschsohn

Ian holds a BSc in Mechanical Engineering and an MS in Aerospace Engineering. He is the principal author of DISSPLA and cofounder of ISSCO. He can be reached at Integral Research, 249 S. Highway 101, Suite 270, Solana Beach, CA 92075.


There's a misperception that if you write in C for UNIX, your code will be completely portable. But no matter how vanilla-flavored your Fortran or C, there's always something peculiar to each system that requires custom coding. It may be graphics, I/O, memory limitations, or some other dependency. Even if the source code is meticulously written to be 99 percent portable, the remaining 1 percent causes the most grief.

Through the process of porting the massive DISSPLA graphics package between different platforms, I became painfully aware of the costs and effort of transferring code. Consequently, this article addresses the concept of seamless portability, or the ability to transfer programs between different computers without relinking or recompiling the code. As in last month's article, I'll use the PORT system as evidence that this can be accomplished. (Recall that PORT is a software environment somewhat analogous to DESQview with the Phar Lap DOS-Extender.) While last month I looked at high-performance RISC systems and described PORT executing on a 386/486 PC with plug-in i860 RISC card(s), this month I'll describe PORT on a 386SX and examine its potential for other environments.

The Portability Equation

The effort of porting seems to increase exponentially with the number of platforms you are targeting--just two platforms means two copies of the source, two copies of the corrections, two copies of the corrections to the corrections, and so on. Even with rigorous bookkeeping, however, one of these fixes usually fails to be transferred to the other platform, or old versions of routines become linked with updated versions of others. The resulting bugs can take days to find. There are also bugs (like those that depend on transient memory contents) on one platform that can't be reproduced on others or those from one developer's code that show up in another programmer's work in team-developed, multi-megabyte programs.

From my experience, it's the last 5 percent of the bugs that take 95 percent of the entire conversion time. Murphy's Law is absolute in software porting and makes a mockery of even the most conservative timetable. Pundits not accustomed to life in the trenches may claim that rigorous diagnostics eliminate these bugs. While comprehensive test data and validation programs are indispensable, it is almost impossible to check every case in a program of any size. Finally, the most insidious bugs tend to occur at customer sites--with pathologic data on jobs due yesterday.

These are the tribulations I found with Fortran. C has the potential for even more interesting bugs: corrupted pointers, mismatched argument types, and uninitialized heap variables can while away a week or two. To add spice, the effects are often completely different from one platform to another--sometimes from one execution to another.


Practical Solutions

At the bit level, binary operations carried out on one processor can be emulated on just about any other. At the other end, applications are almost totally aloof from the nuances of computer architecture. Computer languages such as Pascal and Fortran (and, to some degree, C) are designed to be machine independent. Unfortunately, no program exists in a vacuum, and unless the system utilities are also identical, interaction with the program will be different on two platforms. UNIX is the closest candidate to a portable system, but no two implementations of UNIX (that I know of) are identical, even to the application software.

Assuming the hardware differences can be resolved, it remains to design a complete, portable system. But to be commercially viable, the system must first have acceptable performance, which means being competitive with native-code compilers and their I/O throughput. Secondly, it must be nonintrusive. (Compatibility with existing systems is a market reality.)

After years of designing device-independent graphics, we found that all graphics can be reduced to moves, draws, and fills. Distilling all axes, maps, curves, fonts, and complex features down to this simple set of primitives enabled us to support hundreds of diverse graphics devices. Each device had its own specific "device driver" to translate the primitives into device-specific commands. This strategy showed no limitation to either high-level features or use of the devices. We therefore asked ourselves whether application software could be reduced to "adds, multiplies, and divides." In other words, could the higher-level software be reduced to a set of efficient computation primitives that is machine independent, with a processor-dependent "device driver" for each platform? The answer was "yes," as PORT illustrates.


The Virtual Computer

As I pointed out last month, Seymour Cray's CDC 6600 architecture was the archetype for almost all supercomputers. Serendipitously for portability, it isolates the divergent needs of computation, I/O, and the host system. To capitalize on Cray's model, PORT views its host as a virtual computer via an architecture defined by PORT, not any specific hardware. Each target processor has a machine-specific interface program analogous to the graphics "device drivers" mentioned above. The virtual computer is divided into two fundamental processors, the computation processor (CP) and the peripheral processor (PP); see Figure 1. Like the CDC 6600, the CP does no I/O and the PP does no significant computation. The CP and the PP communicate with each other through a memory-mapped mailbox.
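To make the division concrete, here is a minimal sketch of the CP/PP mailbox protocol in modern Python terms. The service names and mailbox shape are illustrative assumptions, not PORT's actual interfaces; the point is only that the CP posts requests into shared memory and the PP honors them without doing any computation of its own.

```python
from collections import deque

class Mailbox:
    """Models the memory-mapped mailbox shared by the CP and PP."""
    def __init__(self):
        self.requests = deque()   # CP posts here
        self.replies = deque()    # PP answers here

class PP:
    """Peripheral processor: services I/O requests, does no computation."""
    def __init__(self, mailbox, disk):
        self.mailbox = mailbox
        self.disk = disk          # stand-in for the drive: page no. -> bytes

    def service_one(self):
        service, args = self.mailbox.requests.popleft()
        if service == "READ_PAGE":            # hypothetical service name
            self.mailbox.replies.append(self.disk[args])
        elif service == "WRITE_PAGE":
            page, data = args
            self.disk[page] = data
            self.mailbox.replies.append(b"OK")

# CP side: post a request into the mailbox, then let the PP honor it.
box = Mailbox()
pp = PP(box, disk={7: b"hello"})
box.requests.append(("READ_PAGE", 7))
pp.service_one()
assert box.replies.popleft() == b"hello"
```

In a real implementation the "mailbox" is a block of common memory polled by both sides, but the request/reply discipline is the same.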

The physical implementations of the CP and PP are transparent to PORT. In my previous article, the CP was implemented using the i860 RISC microprocessor and the PP via a 386/486 PC. In many PORT installations, the CP is implemented in 32-bit protected mode on the 386/486, and the PP uses 16-bit real mode on the same processor. The operation of PORT in the multiprocessor and single-processor environments is identical--the only difference is performance.

PORT with all options is almost one million lines of extended Fortran developed by a team of programmers over a ten-year period. It is a full-featured system with a vast array of utilities, debuggers, libraries, and services far beyond just a compiler plus environment. By comparison, the CP is about 6500 lines of assembly for the 386/486 (5000 for the i860), and the core of the 386/486 PP adds around another 12,000 lines of code. PORT is an open architecture, and the development of CP/PP versions for other platforms or even the PC is encouraged. Assembly language is not mandatory; a quick-and-dirty CP/PP can readily be coded in C (10,000 lines ballpark). The beauty of the CP/PP/PORT separation is that individual modules can later be optimized into assembly, one by one.

The PORT compiler, editor, linker, file management, virtual-memory system, libraries, graphics, and so on are all oblivious to the actual CP and PP implementations. Programs on one platform can be immediately executed on another without changing a single line of code, recompiling, or even relinking because the whole PORT system is aloof from the hardware. To transfer PORT to another platform, it is necessary only to write a CP and PP for it. For example, an i860 plug-in VME card to the Sun SPARC just needs a PP for Sun's UNIX. Likewise, a MIPS 4000 plug-in card to a 386 PC only needs a CP version.

The Metacode Approach

The PORT Fortran/C compiler reduces the source code to a machine-independent "metacode". Although there currently is only one compiler for PORT, nothing prevents the writing of other compilers (even for other languages). Last month I pointed out that the metacode is tuned to the needs of Fortran/C, but its machine-level requirements are generic, and the metacode is extensible. Any compiler that outputs the PORT metacode can coexist in PORT.

UNIX and PORT differ in one key respect. UNIX compilers output the native instruction set for each platform. In addition, each UNIX implementation is internally customized to the architecture of that platform. PORT produces a machine-independent instruction set and hardware-independent I/O protocols. The platform is transparent to the whole of PORT, not just to the application source code. Details of the PORT metacode will be described more fully in a subsequent article. Here, I'll describe the salient features of the metacode as they pertain to portability.

Each meta-instruction of the metacode is a 64-bit word specifying A = B op C. For example, A = B+C, A = B*C, if(B>C) go to A, and call A(Blist,Count). The indirect addressing modes are specific to higher-level languages rather than conveniences of the hardware designers. For instance, A(I) = B(J,K)**N(L+M) is a single meta-instruction with A(I), B(J,K), and N(L+M) intrinsic indirect address modes. PORT local addresses are relative to the start of the current subroutine instruction block or data block, not a base segment or other hardware artifice.
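The A = B op C shape can be sketched as follows. The field widths and opcode assignments here are hypothetical--the article does not publish PORT's actual word layout--but the sketch shows how a single 64-bit word can carry an operation plus its operand addresses, with the decode logic living in one place (the CP).

```python
# Hypothetical layout of one 64-bit meta-instruction word:
#   [ 8-bit opcode | 16-bit A addr | 16-bit B addr | 16-bit C addr | 8-bit mode ]

def encode(opcode, a, b, c, mode=0):
    return (opcode << 56) | (a << 40) | (b << 24) | (c << 8) | mode

def decode(word):
    return ((word >> 56) & 0xFF, (word >> 40) & 0xFFFF,
            (word >> 24) & 0xFFFF, (word >> 8) & 0xFFFF, word & 0xFF)

# Invented opcode table: 1 = add, 2 = multiply.
OPS = {1: lambda b, c: b + c, 2: lambda b, c: b * c}

def step(word, data):
    """Execute one A = B op C meta-instruction against a data block."""
    opcode, a, b, c, mode = decode(word)
    data[a] = OPS[opcode](data[b], data[c])

data = {10: 0, 11: 3, 12: 4}
step(encode(1, 10, 11, 12), data)   # A = B + C
assert data[10] == 7
```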

Space does not permit a full discussion of the PORT metacode as it relates to portability, but these examples hopefully provide a feel for the way PORT answers to the needs of higher-level languages rather than contorting the application software to fit the whims of the hardware designers. In some implementations, this forces the CP to twist through gyrations internally. (For example, a 64-bit integer has to be emulated using a double 32-bit integer on the 386/486.) Since each instruction is implemented in only one place in the CP, rather than scattered throughout the program, the overhead occurs in only one place.

Keeping I/O Simple

DOS and UNIX I/O internals are positively Byzantine. Not only does the DOS file-allocation table (FAT) result in two potential disk references for every actual reference (one for the FAT section), but corrupting a link in its chain can cause loss of disk data. You can't really fault DOS or UNIX too much--they were developed when fully loaded machines were a PC/XT with two floppy drives or a PDP-11 with a 20-Mbyte hard disk. Unfortunately, these systems still regard even gigabyte hard drives as oversized floppies.

I/O is the pacing factor in data-intensive applications. Give the device interface maximum flexibility, and it will reward you with an order-of-magnitude performance improvement. PORT I/O is oriented toward large hard disks and multi-megabyte files. The PP has just one disk-I/O service: Read or write a 32-Kbyte page. The file-management section of PORT divides the pages into directories and records. All the PP has to do is move a 32-Kbyte block. This simplicity extends to screen output, keyboard input, serial/parallel ports, tape I/O, and others. There is only one PP service to write a line of text to the screen, one to read a line from the keyboard, and so on.

In all, there are just 20 PP services covering device I/O, date/time, windows, graphics, and other requirements. Providing an interface to these PP services implements a PP on a new platform. The PORT CP presents each PP request as a 5x64-bit word block in common memory. The structure contains the service code along with any relevant parameters and addresses such as buffer locations. This simple mechanism is easier to port than interrupt protocols and message packets. The gyrations used by the PP program to honor a PP request are transparent to PORT. Whether it uses direct ROM BIOS, Int 21h services, Windows services, or UNIX APIs is entirely up to the PP implementor.
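A minimal sketch of the 5x64-bit request block, assuming a simple word layout (service code in the first word, parameters in the rest); the service code value is invented for illustration:

```python
import struct

def pack_request(service, *params):
    """Pack a PP request as a 5-word (5 x 64-bit) block: the service code
    followed by up to four parameters (buffer address, page number, ...)."""
    words = (service,) + params + (0,) * (4 - len(params))
    return struct.pack("<5Q", *words)   # 5 little-endian 64-bit words

def unpack_request(block):
    words = struct.unpack("<5Q", block)
    return words[0], words[1:]

READ_PAGE = 3                                 # hypothetical service code
block = pack_request(READ_PAGE, 0x1000, 42)   # buffer address, page number
service, params = unpack_request(block)
assert service == READ_PAGE and params[0] == 0x1000 and params[1] == 42
assert len(block) == 40   # 5 x 64-bit words = 40 bytes
```

Because the request is just a fixed-size block in common memory, porting the PP reduces to reading that block and translating the 20 service codes into whatever the host provides.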

High-Level Operatives

If the metacode simply implements low-level primitives such as add, subtract, and multiply, it will be demolished by native-code compilers. (This is what happened to UCSD Pascal.) The overhead to decode each meta-instruction becomes the pacing factor.

PORT's trick is to implement a rich suite of high-level operatives--SQRT, SIN, LOG, A**B, EXP, ACOS, and all other intrinsics are direct PORT meta-instructions. For example, TH=ATAN2(X,Y) is a single instruction. PORT extends this concept to other frequently used operatives. For instance, Y=ZZPOLY(COEFFS,X) is a direct PORT instruction that evaluates a polynomial expansion. Complex-number operations are also direct meta-instructions. Decode overhead is a small fraction of the execution time for high-level operatives.
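As an illustration of what a ZZPOLY-style operative does, here is the evaluation expressed via Horner's rule--a plausible strategy that keeps all intermediates in registers, though the article does not specify PORT's internal method or its coefficient ordering (highest-order first is assumed here):

```python
def zzpoly(coeffs, x):
    """Evaluate a polynomial by Horner's rule: one multiply and one add
    per coefficient, no procedure-call overhead per term.
    coeffs[0] is the highest-order coefficient (an assumed convention)."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc

# 2x^2 + 3x + 1 evaluated at x = 2 gives 15.
assert zzpoly([2.0, 3.0, 1.0], 2.0) == 15.0
```

Done as a single meta-instruction, the whole loop pays the decode overhead once, rather than once per arithmetic operation.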

Native-code compilers have the advantage on A=B, but they execute most high-level intrinsics such as A=TAN(B) via procedure calls, which carry a substantial stack push/pop overhead. Here the metacode has the advantage because the decode overhead for A=TAN(B) is the same as for A=B. A metacode enjoys a bonanza on floating-point functions like ZZPOLY, where the CP can make maximal use of the math coprocessor registers and have the 386 compute in parallel.

The metacode goes on the offensive in block operatives. Consider the statement CALL ARYMOV(A(I),100000,B(J)), which copies 100,000 64-bit words from A(I) to B(J) as a single meta-instruction. The CP employs the 386/486 instruction REP MOVSD, which is an order of magnitude faster than even a native-code Fortran DO loop or a C for loop. The PORT metacode provides operatives for block copy, initialize, search, checksum, and others. It also provides direct meta-instructions for all string operatives (copy, concatenate, search, and so on). The metacode is currently being extended to fast Fourier transforms, matrix multiply, vector scale/translate, and others.
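The block-operative idea, sketched in Python (the function name mirrors ARYMOV but the signature is illustrative): a slice assignment plays the role of REP MOVSD, one bulk operation instead of a per-element loop.

```python
def arymov(src, src_off, count, dst, dst_off):
    """Block copy like CALL ARYMOV(A(I),N,B(J)): a single bulk transfer
    (slice assignment here, REP MOVSD on the 386/486) rather than an
    element-by-element loop with per-iteration decode overhead."""
    dst[dst_off:dst_off + count] = src[src_off:src_off + count]

a = list(range(10))
b = [0] * 10
arymov(a, 2, 5, b, 1)       # copy a[2..6] into b[1..5]
assert b == [0, 2, 3, 4, 5, 6, 0, 0, 0, 0]
```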

Debugging Metacode

As mentioned, it's the last 5 percent of bugs that typically pace the entire software timetable. A key factor in the PORT metacode design was to incorporate the maximum number of checks possible. (I'll detail these checks in future articles.) Suffice to say that they include bounds checks on all array references, pointer validation, uninitialized variable checks, invalid floating-point numbers, incorrect loop limits, and invalid strings. These checks are active at all times, in all programs (including the PORT system itself) without exceptions.
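A toy model of such always-on checking, with invented fault messages (PORT performs these tests at the meta-instruction level, not in the application source, but the effect on a buggy program is the same):

```python
class CheckedArray:
    """Array with two of the always-on checks described in the text:
    bounds checking on every reference and an uninitialized-read trap."""
    _UNSET = object()   # sentinel marking a never-written cell

    def __init__(self, n):
        self._cells = [self._UNSET] * n

    def __setitem__(self, i, value):
        if not 0 <= i < len(self._cells):
            raise IndexError(f"bounds fault: index {i}")
        self._cells[i] = value

    def __getitem__(self, i):
        if not 0 <= i < len(self._cells):
            raise IndexError(f"bounds fault: index {i}")
        if self._cells[i] is self._UNSET:
            raise RuntimeError(f"uninitialized read at index {i}")
        return self._cells[i]

a = CheckedArray(3)
a[0] = 1.5
assert a[0] == 1.5
```

Because every reference is checked, a corrupted index or a forgotten initialization faults at the point of use instead of silently poisoning later results.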

I cannot overemphasize how invaluable these checks have been, both in software development and in wringing out new versions of the CP and PP. Invariably, the CP metacode decoder for a new platform has obscure bugs. The constant checking by subsequent meta-instructions ensures that corrupted results do not migrate far before a fault occurs. For a native-code compiler to output these checks on every instruction would make the executable image too unwieldy. Without these checks, however, nightmare bugs are a certainty. Most compilers have a debug option, but the worst bugs often occur in release versions of the code, and all too often they mysteriously disappear with debug active.

A significant feature of the CP/PP separation is that the PP itself can be an important debugger. When a serious error occurs under DOS or UNIX, the machine can hang, leaving only postmortem debugging as an option. If PORT goes off into the weeds, the PP is still alive on the host system and can probe even the most intimate level of PORT. This makes checking out PORT on a new platform much easier.

The Minimal Case

Last month I described the implementation of PORT on a 386/486 PC with multiple plug-in i860s. The emphasis there was on RISC-processor performance. Now let's examine PORT in an environment at the opposite end of the spectrum--a low-cost 386SX PC with four Mbytes of RAM, a math coprocessor, and a 60-Mbyte hard disk. The CP and PP are both executed by the 386SX. To maximize performance, the CP executes in 32-bit protected mode and turns the 3 Mbytes of extended memory into the common memory. The CP is simply a 45-Kbyte assembly program that reads 64-bit numbers from extended memory as 32-bit pairs and performs the operation specified by a bit field in each. Basically, the CP program just rattles pairs of 32-bit numbers around in extended memory. The CP itself does not have to reside in extended memory; residing in the lower 640K simplifies transfers between the CP and PP and eliminates the need for a DOS extender.

The PP is just a 16-bit real-mode assembly program that reads a 40-byte block from extended memory and calls on ROM BIOS and DOS interrupt services to execute the I/O request. Both the CP and PP are procedures in a PORT.EXE executable that runs in 200 Kbytes of lower memory. PORT takes over the extended DOS partition on the hard disk. If the primary and extended partitions are each allocated 30 Mbytes, then DOS occupies the lower half of the disk and PORT the upper half.

Because PORT has its own file management, it is not tied to the DOS Int 21h file services. Direct ROM BIOS 13h (or direct SCSI commands) are an order of magnitude faster. Not surprisingly, PORT's disk I/O is many times faster than that of DOS. The current PP for DOS even handles its own bad-track redirection. The PP doesn't just clean up a few sectors; it sweeps up a whole disk, track by track.

A direct benefit of the no-exception, 32-Kbyte disk block is that the DOS-based PP can implement highly efficient disk caching. If the PP finds more than eight Mbytes of extended memory, it turns the excess into a cache pool as a simple multiple of 32-Kbyte pages. Pages can be transferred from cache via the blisteringly fast 32-bit 386/486 REP MOVSD instruction. (In a dual-processor implementation, the PP caching proceeds in parallel with the CP computation.) Bear in mind that PORT utilizes virtual memory, so the amount of RAM available merely affects speed. Beyond eight Mbytes, the RAM tends to be wasted and is more profitably employed as cache, but the division is user modifiable.
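A sketch of such a pool of fixed 32-Kbyte pages with least-recently-used replacement (the eviction policy is an assumption; the article does not say which policy the PP uses):

```python
from collections import OrderedDict

PAGE_SIZE = 32 * 1024

class PageCache:
    """Cache pool of fixed 32-Kbyte pages, as the PP builds from spare
    extended memory. Least-recently-used pages are evicted first."""
    def __init__(self, pool_bytes, read_page):
        self.capacity = pool_bytes // PAGE_SIZE
        self.read_page = read_page      # falls back to the disk service
        self.pages = OrderedDict()      # insertion order tracks recency

    def get(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)     # mark recently used
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict LRU page
            self.pages[page_no] = self.read_page(page_no)
        return self.pages[page_no]

disk_reads = []
def fake_disk(n):
    disk_reads.append(n)                # record each real disk hit
    return bytes(PAGE_SIZE)

cache = PageCache(2 * PAGE_SIZE, fake_disk)
cache.get(1)
cache.get(1)
assert disk_reads == [1]   # second access served from cache
```

The uniform page size is what keeps the bookkeeping this simple: there are no variable-length allocations to manage, only whole pages to shuffle.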

Although PORT has its own file management, it provides subroutines and utilities to read, write, and manipulate DOS files. Of course, any use of this feature is nonportable. It does, however, make PORT fully compatible with network use and DOS-based applications. Frequently used files are generally copied from DOS files into their faster PORT equivalents. (As long as the application uses PORT files, it remains seamlessly portable.) The 16-bit real-mode implementation of the PP allows PORT to be 100 percent compatible with DOS, and you can move freely between the two by executing PORT.EXE.

Conclusion

You may feel PORT's obsession with 64 bits to be excessive, but already most RISC microprocessors are 64 bit, the 80387/486/Weitek math coprocessors target 64 bits--and there's no doubt that the 80586 and beyond will use 64 bits. Likewise, the use of 32-Kbyte (soon 64-Kbyte) disk blocks may seem excessive, but disk-transfer time is becoming insignificant compared to (mechanical) seek time. High-performance RISC processors are proliferating, and the metacode approach is ideal for realizing their potential--particularly with multiple RISC processors. (It's even rumored that the 80586 will provide RISC on-chip.)

UNIX has done much to legitimize portability, but each implementation retains a strong affinity to its platform. A hardware-independent "virtual computer" is critical to cost effectively porting multi-megabyte applications.

Bibliography

Amdahl, G.M. "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities." AFIPS Spring Joint Computer Conference Proceedings (Volume 30, 1967).

Bowles, K.L., S.D. Franklin, and D.J. Volper. Problem Solving Using UCSD Pascal. Berlin: Springer-Verlag, 1984.

Portability vs. Performance

Output of a metacode from a compiler is nothing new. Ken Bowles's UCSD Pascal generated a machine-independent P-code that was popular in the early '80s. Metacodes have been used for achieving machine independence, but past implementations had one big disadvantage--performance (or rather the lack of it). Last month, I showed how a metacode has an advantage on RISC processors. How does the PORT implementation stack up on a CISC processor?

Table 1 compares the Dhrystone and Linpack performance of 386/33 and 486/33 PCs vs. the IBM RS/6000, Sun SPARC SLC, Silicon Graphics Indigo, and the HP 9000 series 720 superworkstation. By those measures, the HP 720 is far ahead of the pack and supposedly an order of magnitude faster in floating point than the 486.

Table 1: RISC performance under popular benchmarks (Personal Workstation, June 1991). Higher numbers are faster.

                                      Dhrystone        Linpack
                                      2.0/2.1     Single     Double
                                      w/register  (32-bit)   (64-bit)
  -------------------------------------------------------------------

  CISC
  486/25 via DOS  extender (typical)  26,300      1.16       1.08
  486/33 via DOS  extender (typical)  34,000      1.50       1.40
  RISC
  i860/33 (Microway Number Smasher)   29,819      1.23       1.11
  SPARCstation SLC                    18,255      2.25       1.20
  Silicon Graphics Iris 25D           24,630      2.62       1.35
  Motorola 88000/25 (Everex 8825)     50,033      1.67       1.02
  MIPS 3000/33 (Magnum 3000/33)       56,012      6.48       4.80
  IBM RS/6000 (POWERstation 320)      45,454      8.15       7.29
  HP 9000 series (model 720, 50 MHz)  86,335      17.0       14.4

Table 2 compares PORT on a 386/33, 486/33, and PC+i860/33 vs. the HP 720 using the DISSPLA (and equivalent GSL) manual sample plots; see Figure 2. This is an extension of a similar table presented last month and shows that the 20-MHz 386 with the 33-MHz i860 under PORT is not far behind the 50-MHz HP 720. Note that the 486/33 is not outpaced by the order of magnitude predicted by the Dhrystone and Linpack results. The DISSPLA timings reflect the composite performance of the entire system, including large program execution, I/O service, and graphics output. The operative word is "composite"--popular benchmarks reflect the performance of a processor on a few tight loops in a vacuum, not the throughput of a real-world massive program. Native RISC code (on the RS/6000 and HP 720) has a tremendous advantage when it can iterate on small loops, but the DISSPLA sample plots reflect a normal program whose loops have frequent calls and branches.

Table 2: PC+i860 vs. HP 9000 series 720 using DISSPLA sample plots. Times are in seconds.

  CA-DISSPLA/                             PORT G.S.L.         CA-DISSPLA
  GSL Manual     Vectors  Filled    386     486     i860/33+  on HP 720
  reference no.           Polygons  33 MHz  33 MHz  386/20    50 MHz
  ----------------------------------------------------------------------

  DM3004/B31-3    3,373       0      18       8        6          3.5
  DM4003/B31-7    9,366     137      21      10        8          2.6
  DM7001/B31-11  18,526       0      28      13       10          4.2
  DM7004/B31-22  22,002     215     110      53       39         37.2
  DM8002/B31-27  15,215     161      55      26       19         13.0

It may seem incongruous that the 486 is not that much slower than the PC+i860 in the DISSPLA plots. I found that the times were identical for the Hyperspeed i860 card plugged into a 486/33 and a 386/20, indicating that the 2-Mbyte/second ISA bus was saturated. (Figure 2 is an example of dominant graphics I/O.) Keep in mind that the 486 is a high-performance RISC processor internally, with eight Kbytes of on-chip cache. A tight 32-bit program like the PORT CP, which heavily uses 32-bit registers, tends to exploit this RISC affinity.

I must emphasize that PORT's graphics subroutine library is a vastly rewritten version of DISSPLA and probably more efficient than CA-DISSPLA, but the two produce identical output. Furthermore, PORT operates entirely in 64 bits and uses software virtual memory. PORT checks array bounds on every array reference, tests for uninitialized variables on every arithmetic/compare instruction, and performs a host of other checks not executed by the HP 720.

Which system is faster is not the issue. The point is that the PORT metacode approach and CP/PP architecture hold their own. A "virtual computer" does not have to be a dog of an interpreter, as is widely believed.

Amdahl's Law

Actually, the performance of the PORT metacode approach can be predicted from Amdahl's equation. According to Amdahl, the average instruction time is Tav = F*Tf + (1-F)*Ts, where Tf is the time of the fastest instructions, Ts is the time of the slowest instructions, and F is the fraction of fast instructions. For a typical 486 native-code compiler, Tf is one microsecond on average for A=B, I=J+K, and other simple integer operations. Ts reflects the time for floating-point operations. Accounting for call and math-coprocessor overhead, 50 microseconds is a reasonable value for Ts. Even in a highly floating-point-intensive application we can reasonably assume 80 percent of the instructions are fast integer ops, so: Native Tav = 0.8*1 + (1-0.8)*50 = 10.8 microseconds.

Now assume that the PORT metacode is five times slower than the native compiler on integer and other fast ops (due to its decode overhead, 64-bit operands, and so on), but that it can reasonably shave 20 percent off the floating-point ops by eliminating stack overhead and using the coprocessor more efficiently. In this case, Tf equals 5 microseconds and Ts equals 40 microseconds, so: PORT Tav = 0.8*5 + (1-0.8)*40 = 12 microseconds. This is not far from the native compiler.

The key result of this exercise is that shaving just 20 percent off the slowest instructions can make an enormous difference. Amdahl's Law states that no matter how much you work on the fastest process, the slowest process ultimately dominates.
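The arithmetic above can be checked directly:

```python
def t_av(f_fast, t_fast, t_slow):
    """Amdahl-style average instruction time: a fraction f_fast of
    instructions takes t_fast, the remainder takes t_slow (microseconds)."""
    return f_fast * t_fast + (1 - f_fast) * t_slow

native = t_av(0.8, 1.0, 50.0)   # native-code compiler estimate
port = t_av(0.8, 5.0, 40.0)     # PORT metacode assumptions from the text
assert abs(native - 10.8) < 1e-9
assert abs(port - 12.0) < 1e-9
```

Note the asymmetry: quintupling the fast-op time only costs PORT about 10 percent overall, while a 20 percent saving on the slow ops nearly cancels it.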

What about compilers, editors, and other pure-integer programs? Here the block operatives, string operatives, and more efficient disk I/O can have an even greater impact. The savings are not a few percentage points, but orders of magnitude. (Our experience at SUPERSET was that in a key subroutine, replacing a loop or two with block operatives could make a whopping difference.) The DISSPLA timings indicate that the combination of all effects can substantially outweigh the decode overhead.

--I.H.


Copyright © 1992, Dr. Dobb's Journal