PERSONAL SUPERCOMPUTING: SEAMLESS PORTABILITY

A hardware-independent "virtual computer" is the key

Ian Hirschsohn

Ian holds a BSc in Mechanical Engineering and an MS in Aerospace Engineering. He is the principal author of DISSPLA and cofounder of ISSCO. He can be reached at Integral Research, 249 S. Highway 101, Suite 270, Solana Beach, CA 92075.


There's a misperception that if you write in C for UNIX, your code will be completely portable. But no matter how vanilla-flavored your Fortran or C, there's always something peculiar to each system that requires custom coding. It may be graphics, I/O, memory limitations, or some other dependency. Even if the source code is meticulously written to be 99 percent portable, the remaining 1 percent causes the most grief.

Through the process of porting the massive DISSPLA graphics package between different platforms, I became painfully aware of the costs and effort of transferring code. Consequently, this article addresses the concept of seamless portability, or the ability to transfer programs between different computers without relinking or recompiling the code. As in last month's article, I'll use the PORT system as evidence that this can be accomplished. (Recall that PORT is a software environment somewhat analogous to DESQview with the Phar Lap DOS-Extender.) While last month I looked at high-performance RISC systems and described PORT executing on a 386/486 PC with plug-in i860 RISC card(s), this month I'll describe PORT on a 386SX and examine its potential for other environments.

The Portability Equation

The effort of porting seems to increase exponentially with the number of platforms you are targeting--just two platforms means two copies of the source, two copies of the corrections, two copies of the corrections to the corrections, and so on. Even with rigorous bookkeeping, however, one of these fixes usually fails to be transferred to the other platform, or old versions of routines become linked with updated versions of others. The resulting bugs can take days to find. There are also bugs (like those that depend on transient memory contents) on one platform that can't be reproduced on others or those from one developer's code that show up in another programmer's work in team-developed, multi-megabyte programs.

From my experience, it's the last 5 percent of the bugs that take 95 percent of the entire conversion time. Murphy's Law is absolute in software porting and makes a mockery of even the most conservative timetable. Pundits not accustomed to life in the trenches may claim that rigorous diagnostics eliminate these bugs. While comprehensive test data and validation programs are indispensable, it is almost impossible to check every case in a program of any size. Finally, the most insidious bugs tend to occur at customer sites--with pathologic data on jobs due yesterday.

These are the tribulations I found with Fortran. C has the potential for even more interesting bugs: corrupted pointers, mismatched argument types, and uninitialized heap variables can while away a week or two. To add spice, the effects are often completely different from one platform to another--sometimes from one execution to another.


Practical Solutions

At the bit level, binary operations carried out on one processor can be emulated on just about any other. At the other end, applications are almost totally aloof from the nuances of computer architecture. Computer languages such as Pascal and Fortran (and, to some degree, C) are designed to be machine independent. Unfortunately, no program exists in a vacuum, and unless the system utilities are also identical, interaction with the program will be different on two platforms. UNIX is the closest candidate to a portable system, but no two implementations of UNIX (that I know of) are identical, even to the application software.

Assuming the hardware differences can be resolved, it remains to design a complete, portable system. But to be commercially viable, the system must first have acceptable performance, which means being competitive with native-code compilers and their I/O throughput. Secondly, it must be nonintrusive. (Compatibility with existing systems is a market reality.)

After years of designing device-independent graphics, we found that all graphics can be reduced to moves, draws, and fills. Distilling all axes, maps, curves, fonts, and complex features down to this simple set of primitives enabled us to support hundreds of diverse graphics devices. Each device had its own specific "device driver" to translate the primitives into device-specific commands. This strategy showed no limitation to either high-level features or use of the devices. We therefore asked ourselves whether application software could be reduced to "adds, multiplies, and divides." In other words, could the higher-level software be reduced to a set of efficient computation primitives that is machine independent, with a processor-dependent "device driver" for each platform? The answer was "yes," as PORT illustrates.


The Virtual Computer

As I pointed out last month, Seymour Cray's CDC 6600 architecture was the archetype for almost all supercomputers. Serendipitously for portability, it isolates the divergent needs of computation, I/O, and the host system. To capitalize on Cray's model, PORT views its host as a virtual computer via an architecture defined by PORT, not any specific hardware. Each target processor has a machine-specific interface program analogous to the graphics "device drivers" mentioned above. The virtual computer is divided into two fundamental processors, the computation processor (CP) and the peripheral processor (PP); see Figure 1. Like the CDC 6600, the CP does no I/O and the PP does no significant computation. The CP and the PP communicate with each other through a memory-mapped mailbox.
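To make the division concrete, here is a minimal sketch of the CP/PP mailbox protocol in modern Python terms. The service names and mailbox shape are illustrative assumptions, not PORT's actual interfaces; the point is only that the CP posts requests into shared memory and the PP honors them without doing any computation of its own.

```python
from collections import deque

class Mailbox:
    """Models the memory-mapped mailbox shared by the CP and PP."""
    def __init__(self):
        self.requests = deque()   # CP posts here
        self.replies = deque()    # PP answers here

class PP:
    """Peripheral processor: services I/O requests, does no computation."""
    def __init__(self, mailbox, disk):
        self.mailbox = mailbox
        self.disk = disk          # stand-in for the drive: page no. -> bytes

    def service_one(self):
        service, args = self.mailbox.requests.popleft()
        if service == "READ_PAGE":            # hypothetical service name
            self.mailbox.replies.append(self.disk[args])
        elif service == "WRITE_PAGE":
            page, data = args
            self.disk[page] = data
            self.mailbox.replies.append(b"OK")

# CP side: post a request into the mailbox, then let the PP honor it.
box = Mailbox()
pp = PP(box, disk={7: b"hello"})
box.requests.append(("READ_PAGE", 7))
pp.service_one()
assert box.replies.popleft() == b"hello"
```

In a real implementation the "mailbox" is a block of common memory polled by both sides, but the request/reply discipline is the same.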

The physical implementations of the CP and PP are transparent to PORT. In my previous article, the CP was implemented using the i860 RISC microprocessor and the PP via a 386/486 PC. In many PORT installations, the CP is implemented in 32-bit protected mode on the 386/486, and the PP uses 16-bit real mode on the same processor. The operation of PORT in the multiprocessor and single-processor environments is identical--the only difference is performance.

PORT with all options is almost one million lines of extended Fortran developed by a team of programmers over a ten-year period. It is a full-featured system with a vast array of utilities, debuggers, libraries, and services far beyond just a compiler plus environment. By comparison, the CP is about 6500 lines of assembly for the 386/486 (5000 for the i860), and the core of the 386/486 PP adds around another 12,000 lines of code. PORT is an open architecture, and the development of CP/PP versions for other platforms or even the PC is encouraged. Assembly language is not mandatory; a quick-and-dirty CP/PP can readily be coded in C (10,000 lines ballpark). The beauty of the CP/PP/PORT separation is that individual modules can later be optimized into assembly, one by one.

The PORT compiler, editor, linker, file management, virtual-memory system, libraries, graphics, and so on are all oblivious to the actual CP and PP implementations. Programs on one platform can be immediately executed on another without changing a single line of code, recompiling, or even relinking because the whole PORT system is aloof from the hardware. To transfer PORT to another platform, it is necessary only to write a CP and PP for it. For example, an i860 plug-in VME card to the Sun SPARC just needs a PP for Sun's UNIX. Likewise, a MIPS 4000 plug-in card to a 386 PC only needs a CP version.

The Metacode Approach

The PORT Fortran/C compiler reduces the source code to a machine-independent "metacode". Although there currently is only one compiler for PORT, nothing prevents the writing of other compilers (even for other languages). Last month I pointed out that the metacode is tuned to the needs of Fortran/C, but its machine-level requirements are generic, and the metacode is extensible. Any compiler that outputs the PORT metacode can coexist in PORT.

UNIX and PORT differ in one key respect. UNIX compilers output the native instruction set for each platform. In addition, each UNIX implementation is internally customized to the architecture of that platform. PORT produces a machine-independent instruction set and hardware-independent I/O protocols. The platform is transparent to the whole of PORT, not just to the application source code. Details of the PORT metacode will be described more fully in a subsequent article. Here, I'll describe the salient features of the metacode as they pertain to portability.

Each meta-instruction of the metacode is a 64-bit word specifying A = B op C. For example, A = B+C, A = B*C, if(B>C) go to A, and call A(Blist,Count). The indirect addressing modes are specific to higher-level languages rather than conveniences of the hardware designers. For instance, A(I) = B(J,K)**N(L+M) is a single meta-instruction with A(I), B(J,K), and N(L+M) intrinsic indirect address modes. PORT local addresses are relative to the start of the current subroutine instruction block or data block, not a base segment or other hardware artifice.
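The A = B op C shape can be sketched as follows. The field widths and opcode assignments here are hypothetical--the article does not publish PORT's actual word layout--but the sketch shows how a single 64-bit word can carry an operation plus its operand addresses, with the decode logic living in one place (the CP).

```python
# Hypothetical layout of one 64-bit meta-instruction word:
#   [ 8-bit opcode | 16-bit A addr | 16-bit B addr | 16-bit C addr | 8-bit mode ]

def encode(opcode, a, b, c, mode=0):
    return (opcode << 56) | (a << 40) | (b << 24) | (c << 8) | mode

def decode(word):
    return ((word >> 56) & 0xFF, (word >> 40) & 0xFFFF,
            (word >> 24) & 0xFFFF, (word >> 8) & 0xFFFF, word & 0xFF)

# Invented opcode table: 1 = add, 2 = multiply.
OPS = {1: lambda b, c: b + c, 2: lambda b, c: b * c}

def step(word, data):
    """Execute one A = B op C meta-instruction against a data block."""
    opcode, a, b, c, mode = decode(word)
    data[a] = OPS[opcode](data[b], data[c])

data = {10: 0, 11: 3, 12: 4}
step(encode(1, 10, 11, 12), data)   # A = B + C
assert data[10] == 7
```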

Space does not permit a full discussion of the PORT metacode as it relates to portability, but these examples hopefully provide a feel for the way PORT answers to the needs of higher-level languages rather than contorting the application software to fit the whims of the hardware designers. In some implementations, this forces the CP to twist through gyrations internally. (For example, a 64-bit integer has to be emulated using a double 32-bit integer on the 386/486.) Since each instruction is implemented in only one place in the CP, rather than scattered throughout the program, the overhead occurs in only one place.

Keeping I/O Simple

DOS and UNIX I/O internals are positively Byzantine. Not only does the DOS file-allocation table (FAT) result in two potential disk references for every actual reference (one for the FAT section), but corrupting a link in its chain can cause loss of disk data. You can't really fault DOS or UNIX too much--they were developed when fully loaded machines were a PC/XT with two floppy drives or a PDP-11 with a 20-Mbyte hard disk. Unfortunately, these systems still regard even gigabyte hard drives as oversized floppies.

I/O is the pacing factor in data-intensive applications. Give the device interface maximum flexibility, and it will reward you with an order-of-magnitude performance improvement. PORT I/O is oriented toward large hard disks and multi-megabyte files. The PP has just one disk-I/O service: Read or write a 32-Kbyte page. The file-management section of PORT divides the pages into directories and records. All the PP has to do is move a 32-Kbyte block. This simplicity extends to screen output, keyboard input, serial/parallel ports, tape I/O, and others. There is only one PP service to write a line of text to the screen, one to read a line from the keyboard, and so on.

In all, there are just 20 PP services covering device I/O, date/time, windows, graphics, and other requirements. Providing an interface to these PP services implements a PP on a new platform. The PORT CP presents each PP request as a 5x64-bit word block in common memory. The structure contains the service code along with any relevant parameters and addresses such as buffer locations. This simple mechanism is easier to port than interrupt protocols and message packets. The gyrations used by the PP program to honor a PP request are transparent to PORT. Whether it uses direct ROM BIOS, Int 21h services, Windows services, or UNIX APIs is entirely up to the PP implementor.
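A minimal sketch of the 5x64-bit request block, assuming a simple word layout (service code in the first word, parameters in the rest); the service code value is invented for illustration:

```python
import struct

def pack_request(service, *params):
    """Pack a PP request as a 5-word (5 x 64-bit) block: the service code
    followed by up to four parameters (buffer address, page number, ...)."""
    words = (service,) + params + (0,) * (4 - len(params))
    return struct.pack("<5Q", *words)   # 5 little-endian 64-bit words

def unpack_request(block):
    words = struct.unpack("<5Q", block)
    return words[0], words[1:]

READ_PAGE = 3                                 # hypothetical service code
block = pack_request(READ_PAGE, 0x1000, 42)   # buffer address, page number
service, params = unpack_request(block)
assert service == READ_PAGE and params[0] == 0x1000 and params[1] == 42
assert len(block) == 40   # 5 x 64-bit words = 40 bytes
```

Because the request is just a fixed-size block in common memory, porting the PP reduces to reading that block and translating the 20 service codes into whatever the host provides.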

High-Level Operatives

If the metacode simply implements low-level primitives such as add, subtract, and multiply, it will be demolished by native-code compilers. (This is what happened to UCSD Pascal.) The overhead to decode each meta-instruction becomes the pacing factor.

PORT's trick is to implement a rich suite of high-level operatives--SQRT, SIN, LOG, A**B, EXP, ACOS, and all other intrinsics are direct PORT meta-instructions. For example, TH=ATAN2(X,Y) is a single instruction. PORT extends this concept to other frequently used operatives. For instance, Y=ZZPOLY(COEFFS,X) is a direct PORT instruction that evaluates a polynomial expansion. Complex-number operations are also direct meta-instructions. Decode overhead is a small fraction of the execution time for high-level operatives.
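As an illustration of what a ZZPOLY-style operative does, here is the evaluation expressed via Horner's rule--a plausible strategy that keeps all intermediates in registers, though the article does not specify PORT's internal method or its coefficient ordering (highest-order first is assumed here):

```python
def zzpoly(coeffs, x):
    """Evaluate a polynomial by Horner's rule: one multiply and one add
    per coefficient, no procedure-call overhead per term.
    coeffs[0] is the highest-order coefficient (an assumed convention)."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc

# 2x^2 + 3x + 1 evaluated at x = 2 gives 15.
assert zzpoly([2.0, 3.0, 1.0], 2.0) == 15.0
```

Done as a single meta-instruction, the whole loop pays the decode overhead once, rather than once per arithmetic operation.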

Native-code compilers have the advantage on A=B, but they execute most high-level intrinsics such as A=TAN(B) via procedure calls, which carry a substantial stack push/pop overhead. Here the metacode has the advantage because the decode overhead for A=TAN(B) is the same as for A=B. A metacode enjoys a bonanza on floating-point functions like ZZPOLY, where the CP can make maximal use of the math coprocessor registers and have the 386 compute in parallel.

The metacode goes on the offensive in block operatives. Consider the statement CALL ARYMOV(A(I),100000,B(J)), which copies 100,000 64-bit words from A(I) to B(J) as a single meta-instruction. The CP employs the 386/486 instruction REP MOVSD, which is an order of magnitude faster than even a native-code Fortran DO loop or a C for loop. The PORT metacode provides operatives for block copy, initialize, search, checksum, and others. It also provides direct meta-instructions for all string operatives (copy, concatenate, search, and so on). The metacode is currently being extended to fast Fourier transforms, matrix multiply, vector scale/translate, and others.
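The block-operative idea, sketched in Python (the function name mirrors ARYMOV but the signature is illustrative): a slice assignment plays the role of REP MOVSD, one bulk operation instead of a per-element loop.

```python
def arymov(src, src_off, count, dst, dst_off):
    """Block copy like CALL ARYMOV(A(I),N,B(J)): a single bulk transfer
    (slice assignment here, REP MOVSD on the 386/486) rather than an
    element-by-element loop with per-iteration decode overhead."""
    dst[dst_off:dst_off + count] = src[src_off:src_off + count]

a = list(range(10))
b = [0] * 10
arymov(a, 2, 5, b, 1)       # copy a[2..6] into b[1..5]
assert b == [0, 2, 3, 4, 5, 6, 0, 0, 0, 0]
```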

Debugging Metacode

As mentioned, it's the last 5 percent of bugs that typically pace the entire software timetable. A key factor in the PORT metacode design was to incorporate the maximum number of checks possible. (I'll detail these checks in future articles.) Suffice to say that they include bounds checks on all array references, pointer validation, uninitialized variable checks, invalid floating-point numbers, incorrect loop limits, and invalid strings. These checks are active at all times, in all programs (including the PORT system itself) without exceptions.
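A toy model of such always-on checking, with invented fault messages (PORT performs these tests at the meta-instruction level, not in the application source, but the effect on a buggy program is the same):

```python
class CheckedArray:
    """Array with two of the always-on checks described in the text:
    bounds checking on every reference and an uninitialized-read trap."""
    _UNSET = object()   # sentinel marking a never-written cell

    def __init__(self, n):
        self._cells = [self._UNSET] * n

    def __setitem__(self, i, value):
        if not 0 <= i < len(self._cells):
            raise IndexError(f"bounds fault: index {i}")
        self._cells[i] = value

    def __getitem__(self, i):
        if not 0 <= i < len(self._cells):
            raise IndexError(f"bounds fault: index {i}")
        if self._cells[i] is self._UNSET:
            raise RuntimeError(f"uninitialized read at index {i}")
        return self._cells[i]

a = CheckedArray(3)
a[0] = 1.5
assert a[0] == 1.5
```

Because every reference is checked, a corrupted index or a forgotten initialization faults at the point of use instead of silently poisoning later results.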

I cannot overemphasize how invaluable these checks have been, both in software development and in wringing out new versions of the CP and PP. Invariably, the CP metacode decoder for a new platform has obscure bugs. The constant checking by subsequent meta-instructions ensures that corrupted results do not migrate far before a fault occurs. For a native-code compiler to output these checks on every instruction would make the executable image too unwieldy. Without these checks, however, nightmare bugs are a certainty. Most compilers have a debug option, but the worst bugs often occur in release versions of the code, and all too often they mysteriously disappear with debug active.

A significant feature of the CP/PP separation is that the PP itself can be an important debugger. When a serious error occurs under DOS or UNIX, the machine can hang, leaving only postmortem debugging as an option. If PORT goes off into the weeds, the PP is still alive on the host system and can probe even the most intimate level of PORT. This makes checking out PORT on a new platform much easier.

The Minimal Case

Last month I described the implementation of PORT on a 386/486 PC with multiple plug-in i860s. The emphasis there was on RISC-processor performance. Now let's examine PORT in an environment at the opposite end of the spectrum--a low-cost 386SX PC with four Mbytes of RAM, a math coprocessor, and a 60-Mbyte hard disk. The CP and PP are both executed by the 386SX. To maximize performance, the CP executes in 32-bit protected mode and turns the 3 Mbytes of extended memory into the common memory. The CP is simply a 45-Kbyte assembly program that reads 64-bit numbers from extended memory as 32-bit pairs and performs the operation specified by a bit field in each. Basically, the CP program just rattles pairs of 32-bit numbers around in extended memory. The CP itself does not have to reside in extended memory; residing in the lower 640K simplifies transfers between the CP and PP and eliminates the need for a DOS extender.

The PP is just a 16-bit real-mode assembly program that reads a 40-byte block from extended memory and calls on ROM BIOS and DOS interrupt services to execute the I/O request. Both the CP and PP are procedures in a PORT.EXE executable that runs in 200 Kbytes of lower memory. PORT takes over the extended DOS partition on the hard disk. If the primary and extended partitions are each allocated 30 Mbytes, then DOS occupies the lower half of the disk and PORT the upper half.

Because PORT has its own file management, it is not tied to the DOS Int 21h file services. Direct ROM BIOS 13h (or direct SCSI commands) are an order of magnitude faster. Not surprisingly, PORT's disk I/O is many times faster than that of DOS. The current PP for DOS even handles its own bad-track redirection. The PP doesn't just clean up a few sectors; it sweeps up a whole disk, track by track.

A direct benefit of the no-exception, 32-Kbyte disk block is that the DOS-based PP can implement highly efficient disk caching. If the PP finds more than eight Mbytes of extended memory, it turns the excess into a cache pool as a simple multiple of 32-Kbyte pages. Pages can be transferred from cache via the blisteringly fast 32-bit 386/486 REP MOVSD instruction. (In a dual-processor implementation, the PP caching proceeds in parallel with the CP computation.) Bear in mind that PORT utilizes virtual memory, so the amount of RAM available merely affects speed. Beyond eight Mbytes, the RAM tends to be wasted and is more profitably employed as cache, but the division is user modifiable.
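A sketch of such a pool of fixed 32-Kbyte pages with least-recently-used replacement (the eviction policy is an assumption; the article does not say which policy the PP uses):

```python
from collections import OrderedDict

PAGE_SIZE = 32 * 1024

class PageCache:
    """Cache pool of fixed 32-Kbyte pages, as the PP builds from spare
    extended memory. Least-recently-used pages are evicted first."""
    def __init__(self, pool_bytes, read_page):
        self.capacity = pool_bytes // PAGE_SIZE
        self.read_page = read_page      # falls back to the disk service
        self.pages = OrderedDict()      # insertion order tracks recency

    def get(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)     # mark recently used
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict LRU page
            self.pages[page_no] = self.read_page(page_no)
        return self.pages[page_no]

disk_reads = []
def fake_disk(n):
    disk_reads.append(n)                # record each real disk hit
    return bytes(PAGE_SIZE)

cache = PageCache(2 * PAGE_SIZE, fake_disk)
cache.get(1)
cache.get(1)
assert disk_reads == [1]   # second access served from cache
```

The uniform page size is what keeps the bookkeeping this simple: there are no variable-length allocations to manage, only whole pages to shuffle.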

Although PORT has its own file management, it provides subroutines and utilities to read, write, and manipulate DOS files. Of course, any use of this feature is nonportable. It does, however, make PORT fully compatible with network use and DOS-based applications. Frequently used files are generally copied from DOS files into their faster PORT equivalents. (As long as the application uses PORT files, it remains seamlessly portable.) The 16-bit real-mode implementation of the PP allows PORT to be 100 percent compatible with DOS, and you can move freely between the two by executing PORT.EXE.

Conclusion

You may feel PORT's obsession with 64 bits to be excessive, but already most RISC microprocessors are 64 bit, the 80387/486/Weitek math coprocessors target 64 bits--and there's no doubt that the 80586 and beyond will use 64 bits. Likewise, the use of 32-Kbyte (soon 64-Kbyte) disk blocks may seem excessive, but disk-transfer time is becoming insignificant compared to (mechanical) seek time. High-performance RISC processors are proliferating, and the metacode approach is ideal for realizing their potential--particularly with multiple RISC processors. (It's even rumored that the 80586 will provide RISC on-chip.)

UNIX has done much to legitimize portability, but each implementation retains a strong affinity to its platform. A hardware-independent "virtual computer" is critical to cost effectively porting multi-megabyte applications.

Bibliography

Amdahl, G.M. "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities." AFIPS Spring Joint Computer Conference Proceedings (Volume 30, 1967).

Bowles, K.L., S.D. Franklin, and D.J. Volper. Problem Solving Using UCSD Pascal. Berlin: Springer-Verlag, 1984.

Portability vs. Performance

Output of a metacode from a compiler is nothing new. Ken Bowles's UCSD Pascal generated a machine-independent P-code that was popular in the early '80s. Metacodes have been used for achieving machine independence, but past implementations had one big disadvantage--performance (or rather the lack of it). Last month, I showed how a metacode has an advantage on RISC processors. How does the PORT implementation stack up on a CISC processor?

Table 1 compares the Dhrystone and Linpack performance of 386/33 and 486/33 PCs vs. the IBM RS/6000, Sun SPARC SLC, Silicon Graphics Indigo, and the HP 9000 series 720 superworkstation. By those measures, the HP 720 is far ahead of the pack and supposedly an order of magnitude faster in floating point than the 486.

Table 1: RISC performance under popular benchmarks (Personal Workstation, June 1991). Higher numbers are faster.

                                      Dhrystone        Linpack
                                      2.0/2.1     Single     Double
                                      w/register  (32-bit)   (64-bit)
  -------------------------------------------------------------------

  CISC
  486/25 via DOS  extender (typical)  26,300      1.16       1.08
  486/33 via DOS  extender (typical)  34,000      1.50       1.40
  RISC
  i860/33 (Microway Number Smasher)   29,819      1.23       1.11
  SPARCstation SLC                    18,255      2.25       1.20
  Silicon Graphics Iris 25D           24,630      2.62       1.35
  Motorola 88000/25 (Everex 8825)     50,033      1.67       1.02
  MIPS 3000/33 (Magnum 3000/33)       56,012      6.48       4.80
  IBM RS/6000 (POWERstation 320)      45,454      8.15       7.29
  HP 9000 series (model 720, 50 MHz)  86,335      17.0       14.4

Table 2 compares PORT on a 386/33, 486/33, and PC+i860/33 vs. the HP 720 using the DISSPLA (and equivalent GSL) manual sample plots; see Figure 2. This is an extension of a similar table presented last month and shows that the 20-MHz 386 with the 33-MHz i860 under PORT is not far behind the 50-MHz HP 720. Note that the 486/33 is not outpaced by the order of magnitude predicted by the Dhrystone and Linpack results. The DISSPLA timings reflect the composite performance of the entire system, including large program execution, I/O service, and graphics output. The operative word is "composite"--popular benchmarks reflect the performance of a processor on a few tight loops in a vacuum, not the throughput of a real-world massive program. Native RISC code (on the RS/6000 and HP 720) has a tremendous advantage when it can iterate on small loops, but the DISSPLA sample plots reflect a normal program whose loops have frequent calls and branches.

Table 2: PC+i860 vs. HP 9000 series 720 using DISSPLA sample plots. Times are in seconds.

  CA-DISSPLA/                             PORT G.S.L.         CA-DISSPLA
  GSL Manual     Vectors  Filled    386     486     i860/33+  on HP 720
  reference no.           Polygons  33 MHz  33 MHz  386/20    50 MHz
  ----------------------------------------------------------------------

  DM3004/B31-3    3,373       0      18       8        6          3.5
  DM4003/B31-7    9,366     137      21      10        8          2.6
  DM7001/B31-11  18,526       0      28      13       10          4.2
  DM7004/B31-22  22,002     215     110      53       39         37.2
  DM8002/B31-27  15,215     161      55      26       19         13.0

It may seem incongruous that the 486 is not that much slower than the PC+i860 in the DISSPLA plots. I found that the times were identical for the Hyperspeed i860 card plugged into a 486/33 and a 386/20, indicating that the 2-Mbyte/second ISA bus was saturated. (Figure 2 is an example of dominant graphics I/O.) Keep in mind that the 486 is a high-performance RISC processor internally, with eight Kbytes of on-chip cache. A tight 32-bit program like the PORT CP, which heavily uses 32-bit registers, tends to exploit this RISC affinity.

I must emphasize that PORT's graphics subroutine library is a vastly rewritten version of DISSPLA and probably more efficient than CA-DISSPLA, but the two produce identical output. Furthermore, PORT operates entirely in 64 bits and uses software virtual memory. PORT checks array bounds on every array reference, tests for uninitialized variables on every arithmetic/compare instruction, and performs a host of other checks not executed by the HP 720.

Which system is faster is not the issue. The point is that the PORT metacode approach and CP/PP architecture hold their own. A "virtual computer" does not have to be a dog of an interpreter, as is widely believed.

Amdahl's Law

Actually, the performance of the PORT metacode approach can be predicted from Amdahl's equation. According to Amdahl, the average instruction time is Tav = F*Tf + (1-F)*Ts, where Tf is the time of the fastest instructions, Ts is the time of the slowest instructions, and F is the fraction of fast instructions. For a typical 486 native-code compiler, Tf is one microsecond on average for A=B, I=J+K, and other simple integer operations. Ts reflects the time for floating-point operations. Accounting for call and math-coprocessor overhead, 50 microseconds is a reasonable value for Ts. Even in a highly floating-point-intensive application we can reasonably assume 80 percent of the instructions are fast integer ops, so: Native Tav = 0.8*1 + (1-0.8)*50 = 10.8 microseconds.

Now assume that the PORT metacode is five times slower than the native compiler on integer and other fast ops (due to its decode overhead, 64-bit operands, and so on), but that it can reasonably shave 20 percent off the floating-point ops by eliminating stack overhead and using the coprocessor more efficiently. In this case, Tf equals 5 microseconds and Ts equals 40 microseconds, so: PORT Tav = 0.8*5 + (1-0.8)*40 = 12 microseconds. This is not far from the native compiler.

The key result of this exercise is that shaving just 20 percent off the slowest instructions can make an enormous difference. Amdahl's Law states that no matter how much you work on the fastest process, the slowest process ultimately dominates.
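The arithmetic above can be checked directly:

```python
def t_av(f_fast, t_fast, t_slow):
    """Amdahl-style average instruction time: a fraction f_fast of
    instructions takes t_fast, the remainder takes t_slow (microseconds)."""
    return f_fast * t_fast + (1 - f_fast) * t_slow

native = t_av(0.8, 1.0, 50.0)   # native-code compiler estimate
port = t_av(0.8, 5.0, 40.0)     # PORT metacode assumptions from the text
assert abs(native - 10.8) < 1e-9
assert abs(port - 12.0) < 1e-9
```

Note the asymmetry: quintupling the fast-op time only costs PORT about 10 percent overall, while a 20 percent saving on the slow ops nearly cancels it.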

What about compilers, editors, and other pure-integer programs? Here the block operatives, string operatives, and more efficient disk I/O can have an even greater impact. The savings are not a few percentage points, but orders of magnitude. (Our experience at SUPERSET was that in a key subroutine, replacing a loop or two with block operatives could make a whopping difference.) The DISSPLA timings indicate that the combination of all effects can substantially outweigh the decode overhead.

--I.H.


Copyright © 1992, Dr. Dobb's Journal