PERSONAL SUPERCOMPUTING VIRTUAL MEMORY, 64-BIT

Multi-megabyte applications on the PC

Ian Hirschsohn

Ian holds a BSc in Mechanical Engineering and an MS in Aerospace Engineering. He is the principal author of DISSPLA and cofounder of ISSCO. He can be reached at Integral Research, 249 S. Highway 101, Suite 270, Solana Beach, CA 92075.

In last month's installment, I introduced the concept of a "virtual computer" patterned on the Cray supercomputer model--an architecture defined by software rather than hardware. Using the PORT system (a PC-based environment that adheres to Cray's design principles, as described in my June 1992 article), I showed how the virtual computer concept can achieve surprising performance on PCs and exploit RISC processors, and do so in a seamlessly portable environment. This month, I'll discuss salient features of the PORT system itself, focusing on design decisions rather than implementation details. As you recall, PORT has a philosophy diametrically different from that of UNIX and OS/2. It unabashedly plagiarized the best features of a number of mainframe systems, especially Cray's CDC 6600. In short, PORT's architecture is more an incarnation of several proven systems than an untested fresh approach.

The key difference between PORT and most other systems is that PORT is totally dedicated to running multi-megabyte Fortran applications. (Although PORT was originally designed for Fortran, its Fortran/C compiler is being modified to accept C syntax.) For example, the file manager treats all directories as libraries (there is no need to MAKE libraries), the linker automatically resolves OBJ-type files from any specified directories, and FIND C=WORKSP searches OBJ files (ignoring all other types) for a COMMON block called WORKSP.

It's no secret that ANSI Fortran is unsuitable for systems programming. So rather than throw the baby out with the bathwater, we vastly extended Fortran (while preserving Fortran 77 compatibility) instead of creating some new language. PORT's Fortran/C incorporates most of the functions found in C--Boolean operations, shifts, bit-field operatives, a pointer type, and low-level I/O intrinsics--and has proven an effective systems-programming vehicle. The entire PORT system--including the compiler, linker, editor, virtual memory management, and system libraries (about 1 million lines)--is written in Fortran/C.

Virtual Memory Without Virtual Memory

Mainframe-grade Fortran applications often expect virtual memory, or the ability to run programs larger than available RAM by transparently swapping pieces to disk. Virtual memory has proved to have an even more important benefit: the ability to call massive programs from within one another. For example, the ability to call the editor from within your program, then the compiler within the editor, the file manager within the compiler, and so on. Your own program may eventually be totally swapped out to disk, but this is transparent.

Chaining multiple programs, which is often confused with multitasking, enables the development of gigantic applications. For example, if you can depend on calling the editor, there is no need to write editor functions into your program.

Being focused on Fortran/C, PORT achieves virtual memory simply by basing its page size on the longest practical subroutine (or function). As long as a subroutine never crosses a page, all variables and branches within that subroutine are local and don't need virtual-memory address translation. This idea is somewhat similar to the segment organization of the DOS huge model, except that PORT pages are a fixed length and can hold multiple subroutines. As I pointed out in my June article, keeping most references local is equivalent to about an 85 percent "cache hit ratio" right off the bat. PORT local addressing is relative to the start of the current subroutine instruction and data areas rather than a segment base address or other hardware-dependent artifice. The absolute RAM offset address to the subroutine is reset automatically upon any call or return, like the DOS huge model. Unlike the huge model, however, if the PORT metacode does not find the target page in RAM memory, it generates a page fault and the page is rolled in from disk (behind the programmer's back). Thus the program size is limited by the disk-staging area you assign, not the available RAM.

The current PORT page is 4096 64-bit words, or 32 Kbytes. (Support for 64-Kbyte pages will be available soon.) The instruction and data areas for each subroutine are generally kept in separate pages, so instruction pages don't have to be written back to disk. Measurements on numerous programs since 1979 have shown that the Fortran/C compiler generates 1.2 to 1.4 meta-instructions (64 bits each) per executable source statement. Hence a PORT page can accommodate a subroutine of up to about 3000 executable statements, or roughly 4500 typical lines. A single subroutine longer than that tends to be unwieldy and is best broken into smaller ones.

The linker packs as many subroutines as possible into an instruction/data page pair, which holds around 60 to 200 typical-length subroutines (32-Kbyte pages). If any one of those routines is called and the page pair is not in RAM, the remaining 59-199 ride in with it. The linker exploits this to improve performance by grouping routines that call each other into the same page pair and also provides user-specified grouping. Indeed, the linker gives precedence to grouping the branch of a call tree over page packing efficiency. Thus, a "miss" on one routine brings in many related routines on the same two disk references. By contrast, most virtual-memory systems go through a whole sequence of disk references before they develop a "working set." Disk I/O being a pacing factor in many virtual-memory applications, disk efficiency can outweigh processor performance.

More Details.

A key difference between the DOS huge model and PORT is that PORT permits arrays to cross pages: Arrays can be multi-megabyte with an overall limit of 32 Mbytes per program (310 Mbytes with 64-Kbyte pages). This may not be anywhere near the 4-gigabyte limit of 32-bit addressing, but in practice, a single program of even 5-10 Mbytes involves thousands of subroutines, and I have found it to be unwieldy to develop beyond a certain point. (Bear in mind that a megabyte amounts to about 100,000 lines of Fortran/C, accounting for data areas and typical arrays.) The 4-gigabyte number is a bit academic--simply dedicating a 4-gigabyte disk-staging area is beyond most PCs, workstations, and even mainframes (per user). Just think of the disk-swap time! In practice, since PORT programs can call one another, even 32 Mbytes per program has yet to prove a limitation. What is not academic is that 32-bit addressing wastes 1-2 bytes 99.99 percent of the time, hogging the processor's critical prefetch queue.

Virtual-memory addressing is only applied to access across pages. Since no subroutine can straddle pages, virtual-memory translation is only necessary for array references, call/return, and pointers. (COMMON blocks, strings, and other structures are presented as arrays to the metacode so there is no distinction at the metacode level.)

All 64 Bit

Like mainframes and supercomputers, PORT targets a 64-bit bus: Addressing is in terms of 64-bit words, and all instructions plus data items are processed as multiples of 64-bit words. A byte is regarded as just an 8-bit field, so the metacode changes a byte value by bringing in its 64-bit word, modifying the 8-bit field and writing out the whole word. (The cost of a byte and 64-bit word are identical on a 64-bit bus.) "All 64-bit" execution may be excessive for 80x86 PCs, but most plug-in RISC cards (such as i860-based cards) use a 64-bit bus locally. Boards like the Hyperspeed D860 allow multiple cards to interconnect via ribbon cables carrying a blistering fast 64-bit local bus.

Although the advantage of 64 bits for performance is clear, it has proved surprisingly beneficial to Fortran/C coding. All Fortran/C variables are 64-bit, or a multiple thereof. (INTEGER*2, REAL*4, and so on, are accepted but ignored.) Strings are 64-bit aligned and padded to a word boundary. Thus integers, floating point, pointers, and strings can all be freely intermixed. For example, INTEGER, REAL, CHARACTER, and POINTER type variables can be interchangeably passed through the same subroutine argument and EQUIVALENCEd to one another, contrary to both ANSI Fortran and C rules. Interchangeability simplifies dynamic record structures containing variable-length components or omitted items. 64-bit integers have proved invaluable for packing and manipulating multiple values simultaneously; for example, A=B moves eight bytes at a pop.

The Metacode

As mentioned in the previous articles, PORT defines its own machine-independent "instruction set" customized to Fortran/C. This metacode is executed via a processor-dependent decoder supplied for each platform. Each meta-instruction is a 64-bit word organized as shown in Figure 1. All meta-instructions have the form A=B op C. The tag for each operand A, B, and C determines the operand mode--151, J, X(I), Y(I,J), for instance. The text box "PORT's Metacode" describes metacode operation.

Table 1 lists all three-operand instructions (A=B op C). Because it's wasteful to use three operand slots for two-operand instructions (A = op B), these are grouped as instruction 25h using the value of C to distinguish them; see Table 2. Note how specific the instructions are to Fortran and C.

Table 1: Metacode three-operand instructions.

  Code (hex)  Mnemonic  Description   Code     Mnemonic  Description
                                      (hex)
  -------------------------------------------------------------------------

      00      --        Illegal op     27      CALL      Call subroutine.
                        code
                        (catches 0
                        as an
                        instr.).
      01      (+):i     Integer        28      AND       Logical AND.
                        ADD.
      02      (-):i     Integer        29      OR        Logical OR.
                        SUBTRACT.
      03      (*):i     Integer        2A      XOR       Logical XOR.
                        MULTIPLY.
      04      (/):i     Integer        2B      ANDF      Boolean AND.
                        DIVIDE.
      05      (//):i    Integer        2C      ORF       Boolean OR.
                        INCLUSIVE
                        DIVIDE.
      06      B**IC:i   Real           2D      XORF      Boolean XOR.
                        -to-integer
                        power.
      07      MOD:i     Integer        2E      SBYTE6    Replace 6-bit byte
                        remainder.                       (in array or
                                                         word).
      08      (=):i     Integer test   2F      LBYTE6    Extract 6-bit
                        EQUAL.                           byte.
      09      (<):i     Integer test   30      I_FIELD   Extract arbitrary
                        LESS THAN.                       field from word.
      0A      (>):i     Integer test   31      S_FIELD   Replace arbitrary
                        GREATER THAN.                    field in word.
      0B      (#):i     Integer test   32      ATAN2     Ratio arc tangent.
                        NOT EQUAL.
      0C      (<=):i    Integer test   33      SIN_COS   Compute both sine
                        LESS THAN OR                     and cosine of C.
                        EQUAL.
      0D      (=>):i    Integer test   34      shift_L   Boolean LEFT
                        GREATER THAN                     SHIFT.
                        OR EQUAL.
      0E      arith_L   Arithmetic     35      shift_R   Boolean RIGHT
                        LEFT shift                       SHIFT.
                        (preserve
                        sign).
      0F      arith_R   Arithmetic     36      StrngOP   Text op (copy,
                        RIGHT shift.                     search,
                                                         concatenate,
                                                         and so on).
      10      B**C:r    Real to power  37      ByteOP    Byte-string op
                        of a real.                       (nonspecific
                                                         StrngOP).
      11      (+):r     Real ADD.      38      Unpack8   Unpack string to
                                                         word-per-byte
                                                         array.
      12      (-):r     Real           39      Pack8     Pack eight
                        SUBTRACT.                        bits/word array
                                                         as string.
      13      (*):r     Real           3A      CALL_P    Call program on
                        MULTIPLY.                        another processor.
      14      (/):r     Real           3B      Load_P    Load program on
                        DIVIDE.                          another proc.
      15      POLYNML   Polynomial     3C      Reset_P   Restart program on
                        expansion                        another proc.
                        evaluate.
      16      INCADR    Increment      3D-3F             Not used.
                        POINTER.
      17      KCHECK    Test for       40      --        Illegal op code
                        operand                          (catches -0 as
                        error.                           instr.).
      18      (=):r     Real test      41      (+):c     Complex ADD
                        EQUAL                            (c/tests
                        (r/tests use                     use fuzz factor).
                        fuzz factor).
      19      (<):r     Real test      42      (-):c     Complex SUBTRACT.
                        LESS THAN.
      1A      (>):r     Real test      43      (*):c     Complex MULTIPLY.
                        GREATER THAN.
      1B      (#):r     Real test      44      (/):c     Complex DIVIDE.
                        NOT EQUAL.
      1C      (<=):r    Real test      45     B**IC:c    Complex to integer
                        LESS  THAN                       power.
                        OR EQUAL.
      1D      (=>):r    Real test      46      CMPLX     Real and imaginary
                        GREATER THAN                     to complex.
                        OR EQUAL.
      1E      SBYTE8    Replace        47      --        Not used.
                        8-bit byte
                        (in array
                        or word).
      1F      LBYTE8    Extract        48      (=):c     Complex test
                        8-bit byte.                      EQUAL.
      20      BreakPt   Breakpoint.    49      (<):c     Complex test LESS
                                                         THAN.
      21      GO TO*    List GO TO.    4A      (>):c     Complex test
                                                         GREATER THAN.
      22      jmp#0     Branch if      4B      (#):c     Complex test NOT
                        NONZERO.                         EQUAL.
      23      jmp=0     Branch if 0.   4C      (<=):c    Complex test LESS
                                                         THAN OR EQUAL.
      24      ARYMOV    Block op       4D      (=>):c    Complex test
                        (array                           GREATER THAN OR
                        search, init,                    EQUAL.
                        copy, and
                        so on).
      25      2op       Two-operand    4E-7F             Not used.
                        instructions
                        (see
                        Table 2).
      26      DO end    Loop test,
                        increment/
                        decrement,
                        and branch.

Table 2: Metacode two-operand instructions.

  Code (hex)  C Mnemonic  Description
  -------------------------------------------------------------

00          neg         Negate.
01          NOT         Logical invert.
02          NOTF        Boolean complement.
03          PP req      Flag peripheral processor and halt.
04          GO TO       Unconditional branch.
05          (:=)        Transfer one word.
06          RETURN      Subroutine RETURN.
07          Intrpt      Suppress interrupts (ignore PP flag).
08          ABS         Get absolute value.
09          INT         Truncate real to integer.
0A          FLOAT       Integer to real.
0B          SIGN        Assign sign of B to A.
0C          STATS       Get memory-error statistics.
0D          VADDR       Set POINTER to address.
0E          EXECUTE     Execute local procedure.
0F          MEMCHK      Force memory reference (test RAM).
10          EXCHNG      Swap contents of A and B.
11          IROUND      Round real to integer.
12          --          Unassigned.
13          SQRT        Square Root.
14          SIN         Sine.
15          COS         Cosine.
16          TAN         Tangent.
17          ArcSIN      Inverse sine.
18          ArcCOS      Inverse cos.
19          ArcTAN      Inverse tan.
1A          LOG         Log to base e.
1B          LOG10       Log to base 10.
1C          EXP         e to power B.
1D          EXP10       10 to power B.
1E          (+):i       Direct INTEGER ADD.
1F          (-):i       Direct INTEGER SUBTRACT.
20          (*):i       Direct INTEGER MULTIPLY.
21          (/):i       Direct INTEGER DIVIDE.
22          arith_L     Direct arithmetic LEFT SHIFT.
23          arith_R     Direct arithmetic RT SHIFT.
24          shift_L     Direct Boolean LEFT SHIFT.
25          shift_R     Direct Boolean RT SHIFT.
26          MOD:i       Direct remainder.
27          ORF         Direct Boolean OR.
28          ANDF        Direct Boolean AND.
29          XORF        Direct Boolean XOR.
2A          (+):r       Direct REAL ADD.
2B          (-):r       Direct REAL SUBTRACT.
2C          (*):r       Direct REAL MULTIPLY.
2D          (/):r       Direct REAL DIVIDE.
2E          REAL        Real part of complex.
2F          CMPLX       Real to complex.
30          CONJG       Complex conjugate.
31          SINH        Hyperbolic sine.
32          COSH        Hyperbolic cosine.
33          TANH        Hyperbolic tangent.

Since each meta-instruction needs to be interpreted, the decode time is critical. The metacode is therefore designed to be decoded as fast as possible. For example, even though the PORT page is 4096 words, needing only 12 bits, it is assigned a 16-bit field because most processors have instructions that can operate on 16 bits directly. Likewise, the 3-bit tags describing the contents of A, B, C are grouped as a 9-bit field, rather than being tacked onto their respective A, B, and C--allowing the decoder programmer to use a fast 512-way indexed branch to interpret all three tags in a single assembly instruction.

Without exception, all meta-instructions have an identical format. Furthermore, the metacode operand format is independent of the actual operation. (Both the 386/486 and i860 decoders interpret the operands before they even look at the Op Code.) This rigorous independence not only speeds the decoding on existing processors, but has an eye on future chips, which will incorporate multiple independent processors on the same silicon (Motorola's recently announced M88110 RISC multiprocessor, for instance). A context-free format lends itself to parallel decoding the A, B, and C operands and even to pre-interpreting instructions. Obsolescence being reality in computers, we have tried to think as far ahead as possible.

The Peripheral Processor

As previously mentioned, all I/O functions are handled by the peripheral processor (PP) program. Although designed to execute on a separate processor to the metacode computation processor (CP), I discussed how the PP coexists surprisingly well with the CP in a single-processor environment. Each PP action is specified via a 5x64-bit word structure deposited in the CP memory. Table 3 lists the current PP action codes. The PP is free to honor the requisite action in any way it chooses, thus enabling PORT to be implemented seamlessly in diverse host environments.

Table 3: Peripheral processor actions.

  Code  Action                Description
  -------------------------------------------------------------------------

  01    Output text line      Sends line to screen, COM or LPT port.
  02    Input text line       Gets line/char from keyboard or port.
  03    Set flag chars        Defines chars for line feed, prompt, and so
                               on.
  04    Set date/time         Resets current date and time.
  05    Get error stats       Gets error counts for disk and PP memory.
  06    Jumper ports          Redirects text-line output to specific ports.
  07    Get date/time         Retrieves current date and time.
  09    Read/write disk       Retrieves/replaces disk page.
  10    Discard interrupts    Clears any interrupt flag from PP to CP.
  11    Jump history          Activates/gets last 2048 branches, calls.
  12    Mag tape op           Reads/writes/positions 9-track, 8mm, 3480,
                               DAT.
  13    Set COM params        Sets serial-port baud rate, and so on.
  14    Generate tone         Sends tone sequence to speaker.
  16    Exec host program     Executes DOS(UNIX) based program.
  18    Set spooling          Activates/discards output to DOS file buffer.
  21    Set COM protocol      Sets port handshaking (XON/XOFF, RTS).
  24    Swap disk page        Exchanges page in memory with page on disk.
  25    High-speed I/O        Block I/O to IEEE 488, SCSI, and so on.
  27    Terminate PORT        Signs off PORT and returns to host system.
  28    Switching programs    Tells PP that PORT starting different
                               program.
  29    Host file op          Reads/writes/finds/copies/deletes DOS file.
  30    Screen graphics op    Outputs line/fill/block/color to screen.
  31    Window/mouse op       Creates/removes/retrieves/outputs a window.
  128   Fatal error           PORT detected internal error, sign off.

Again, all I/O data items lie on a 64-bit boundary and always have a length which is a multiple of 64 bits. (Strings are padded, as necessary.) This enables the PP implementor to utilize the maximum bandwidth available on the machine; for example, the PP lends itself readily to 32-bit EISA and Micro Channel implementation. By way of comparison, UNIX, OS/2, and DOS I/O are specified in terms of a byte offset and byte count, which requires a costly shift of the data block if the start byte does not fall on a 32-bit boundary.

To avoid the PP becoming involved in the virtual-memory process, the CP guarantees that no I/O block crosses a page boundary. As an example, standard Seismic SEG-B tapes have records of several megabytes apiece. The Fortran/C program can readily define a multi-megabyte array to hold the record, but the PP transfers these records as a sequence of one-page blocks. This places the onus on PORT and the CP to process any virtual-memory page faults on the huge array, between the record pages, thereby simplifying the PP implementation. The CP and PP are designed to execute in parallel. Even on a single 386/486 processor, the PP caches tape I/O using the DOS timer interrupt to service the drive concurrent with the CP program.

To illustrate the effectiveness of these ideas, PORT is able to copy a 2500-foot, 9-track, 6250-bpi Seismic SEG tape in 4.5 minutes between two streaming 125-ips drives using a Fortran program, on a stock 486/33 EISA PC (sans i860). Multiplying 2500 feet by 125 inches per second gives a theoretical lower limit of four minutes. By comparison, an oil company had run under UNIX on a Sun-compatible rated faster than a SPARC 4 and, after a year of custom device driver coding plus C conversion, was able to reduce the time from 12 to 6.5 minutes. The 4.5 minutes reflects a sustained data-transfer rate of 2x6250x125x(4/4.5)=1.3 Mbytes/sec (accounting for interrecord gaps) from Fortran. (At the last SEG show in November 1991, the PC was hidden under the table because the company felt nobody would believe a PC could do what it was doing.)

Debug Insecticide

True to the maxim that the last 5 percent of the bugs take 95 percent of the development time, a bug in the DISSPLA shading algorithm took us almost two years to find. It showed up on one mainframe, but not another. It even disappeared from execution to execution. Not even the most sophisticated debuggers were any help, because the program never failed when the debuggers were active. New DISSPLA releases were held up because every time we thought we had it, the bug popped up anew. (Sometimes we thought it depended on the phases of the moon.) It turned out to be a truncation from floating point to integer used to compute the index to a workspace array, whose contents depended on memory at the time the program was run. (I=INT(A) is highly processor dependent when A falls between 0.999... and 1.000...01.) Experiences like these on standard systems almost drove me into a different line of work, and I resolved that when we designed what is now PORT, debugging features would take priority over any performance considerations. The first time it was run under the SUPERSET precursor to PORT, the bug popped up--and was nailed.

The design of PORT assumes that nothing is working properly. For example, the file-management system attaches an internal software checksum to every record of every file. Whenever a file is copied on the disk or transferred to or from tape, this checksum is validated by the PORT software. We've even caught problems in the error-checking circuitry of the hardware!

The PORT metacode performs numerous checks on every instruction without exceptions, not even for the PORT system itself. These include:

Array-bounds checking on every array reference, including COMMON blocks, strings, and subroutine arguments.

Uninitialized-variable checks on all arithmetic and compare instructions (but not Boolean). The linker initializes memory to -0 (0 with sign bit set), which the metacode never generates on any arithmetic operation.

Invalid pointers. Pointers are 64-bit and carry the length of the original item together with a mandatory check pattern.

Unnormalized floating point. Usually an integer is passed through a real argument.

Field overflow (such as attempting to store a value greater than 255 in an 8-bit byte).

Field, or shift count, off either end of a 64-bit word.

Invalid text strings--strings not terminated by a null (0 byte).

Invalid loop limits (DO J=1,N with N<=0, for example).

These bugs tend to occur during execution, and no compiler "lint" can pick them up. The DISSPLA bug was detected by the uninitialized-variable check. In most cases, an uninitialized fault is just a nuisance, but every so often it stumbles across something really serious and pays for itself. The checks are conducted by the metacode decoder program (in the CP) as part of each meta-instruction's execution. As I pointed out in June, a RISC-based decoder can often bury the checks in free cycles so that they add little to the overhead.

The metacode approach provides other bonuses for code development. For example, the decoder can be requested to store the last 2048 branches (providing a software "logic analyzer" execution history). The decoder stores the count of meta-instructions executed by subroutine, furnishing a "profiler" that has proved invaluable for optimizing multi-megabyte applications. The PORT debugger can request the decoder to test for an event on every meta-instruction. Since the tests are embedded in the decoder, the debugger slows the program by just a few percent. Unlike debuggers that insert code, there are no loop-holes and no need to recompile. Basically, the metacode decoder represents modifiable "microcode" that can be programmed to do whatever you like.

Another subtle use of the modifiable microcode is the application of a "fuzz factor" to all floating-point compares (for example, IF(B=1.0) THEN succeeds if B=0.999...). The fuzz factor is user modifiable and circumvents the nightmare DISSPLA bug I described.

An important aspect of virtual memory is that the PORT debugger is available at all times on all programs, especially on release versions at user sites. The linker attaches a copy of its tables to the end of each executable image (no exceptions). These include the location, programmer name, date/time, and source installation of all the OBJ files.

Upon any fault (like an array bounds or bad pointer), the debugger swings into action, swaps the program back to disk, and loads the linker attached tables. Using this roadmap, the debugger can probe any program--even itself. The key difference between the PORT debugger and Codeview is that PORT can debug the program at the user site with the user's proprietary data in the exact state of its failure. (PORT doesn't know "memory dumps.") In multi-mega-byte apps developed by a team, the ability to pinpoint the subroutine version by date/time and the programmer responsible has proved worth its weight in gold.

Multiprocessor

The philosophy of PORT is to throw as many processors as possible at the task, not as many tasks as possible at the processor. Dedicating the entire system to the current application simplifies multi-processor handling, thereby improving throughput rather than wasting cycles in overhead. The elegant simplicity of Cray's memory-mapped CP/PP communication scheme proved easily extensible to multiple processors.

All processors are under the control of the CP, which in turn is under the control of your program. The program treats each processor as if it were executing a subroutine. To see how this works, consider using an auxiliary processor to do three-dimensional rendering; see Figure 2. The rendering is done by a call to RENDER_3D (OBJ_LIST, FRAME_BUF), for example, where OBJ_LIST is the list of solid-object coordinates, and FRAME_BUF is the target frame buffer. In a single-processor environment, RENDER_3D would just be a normal Fortran/C subroutine, and indeed, the linker expects a Fortran/C version. Prior to calling RENDER_3D, the program requests the CP to load RENDER_3D on processor 1 via a LOAD_PROCESSOR system call. If there is no processor 1, or no assembly version of RENDER_3D for it, the CP returns an error status to the user program. The program can then opt to terminate or proceed using the Fortran/C version. Given that everything is kosher, the metacode decoder on the CP intercepts the RENDER_3D call. Instead of executing the normal Fortran/C call, the decoder makes sure that the OBJ_LIST and FRAME_BUF arrays are loaded in RAM, passes their addresses to the RENDER_3D version in processor 1, and fires up processor 1. The decoder then executes a return to the user program as if it had executed the Fortran/C RENDER_3D call.

The user program executing in the CP and RENDER_3D in processor 1 proceed in parallel. When RENDER_3D is done, it sets a bit in a word tested by the CP prior to executing each meta-instruction (an interrupt without a hardware interrupt). A user-supplied interrupt-processing routine is then called. Alternatively, the program can just opt to poll a common status word it defines, to avoid interrupt handling. Upon detecting completion by processor 1, the program can send FRAME_BUF to the VGA and make another call to RENDER_3D or other routine loaded in processor 1. The bottom line is that your program is in control of all processors at all times.

Note that RENDER_3D has two versions: a high-speed, machine-dependent assembly version and a slower Fortran/C version. Thus the seamless portability of the program is preserved, albeit with the slower version. This feature is also useful for the orderly development of multiprocessor applications. If you go to the hassle of multiprocessor, speed is usually your criterion; PORT therefore focuses on running tight assembly modules, rather than a bureaucratic, symmetric, multithreaded C system. (Silicon Graphics workstations atest to the speed of a tightly coupled MIPS 3000 RISC processor rendering 3-D.)

Conclusion

The underlying PORT theme is to minimize entropy--to concentrate the information and focus on the objective. Bug checks increase entropy, but a program with a serious bug has infinite entropy.

In these three articles on "Personal Supercomputing," I've endeavored to introduce an alternative to the UNIX and OS/2 approach. My intent is not to disparage their multitasking philosophy, but to show that it is not gospel and that there is room for diverse system designs. PORT shows that by targeting a specific class of applications, it can overcome many of the limitations of the UNIX approach, for its constituency. Conversely, PORT is not the ideal vehicle for agenda managers, word processors, and multiple TSRs. The PORT design realizes this and endeavors to complement the host system rather than replace it. Nevertheless, technology has moved far beyond the PDP 11 and PC/XT. DOS, UNIX, and OS/2 must come to terms with high-speed, multiple RISC processors and 64-bit operation to realize the full potential of the next wave of technology.

PORT's Metacode

In examining the operation of the PORT metacode, let's begin with an example. Consider the Fortran/C source statement: Y(J)=X**11. Assume that Y is a local array with DIMENSION Y(100).

A=B**C is a single 64-bit meta-instruction; see Op Code 06 in Table 1. As shown in Figure 3, C is 11, B is the scalar X, and A the indirect reference to Y(J). The tags differentiate operand modes: C has tag 1 (immediate integer<64K), B tag 2 (local scalar,) and A tag 3 (indirect). Tag 1 uses the value of the 16-bit field itself (11). Tag 2 adds the field (257) to the start address of the subroutine data area giving the word address, so X at [d/area+257] has the 64-bit value 5.7. Tag 3 is similar to tag 2, except that the word points to the operand value, hence the value at [d/area+316] points to Y(J). All 64 bits as a single address is excessive, so the array base virtual address is in the upper half-word and the the lower 32 bits is for the length and index. In our example, Y starts at virtual page 5, word 862, and J has the value at [d/area+541], namely 25.

The metacode decoder first shifts out the 16-bit fields corresponding to A, B, and C. In the 386/486 version, the decoder then masks out the 9-bit combined-tags value in the upper 16 bits (1*64+2*8+3=83), uses this as an index to a table, and branches to entry 83. This simple branch decodes the tags 1,2,3 in one 386/486 instruction. Operands C and B are straight-forward (tags 1 and 2). For A, the decoder proceeds as with B, except that it now masks out the lower 16 bits and retrieves the 32-bit pair for J. The second 16-bit field in the indirect word has the length of Y, namely 100; if J is less than 1 or greater than 100 an array-bounds fault is produced. In the example, the decoder adds J-1=24 to the base v/address for Y (5:0862), giving 5:0886 as the location of Y(J). Page 5 is a virtual page, so the decoder references entry 5 of its page-translation table for the real-memory page address, adds offset 886, and obtains the real-memory word address for Y(J). Having the values for B and C plus the location of A, the decoder shifts out the Op Code and performs the operation listed in Table 1.

If entry 5 of the page-translate table is negative, it flags that Y(J) is rolled out to disk, and the decoder generates a page fault. If the value for either B or C is -0, it produces an uninitialized-variable fault. If the exponent in the upper 12 bits of B is 0, an unnormalized floating-point fault results. If A's exponent will be greater than 1023, a floating-point overflow fault is produced. These tests are each a matter of one or two instructions and add little to the overhead, but add a whole dimension of debugging power and implement virtual memory with simplicity.

The upper six bits of the packed indirect word determine whether A is Y(J), Y(I,J), Y(I+J), or a POINTER, and whether J is a variable or an immediate constant; for example: Y(5), Y(I,5), and Y(I+5). In the Y(I,J) and Y(I+J) cases, the index field points to a second level of indirection, for Y(I,J,K) a third level, and so on. All instances of Y(J) in the subroutine point to the same indirect word at offset 316, so there is only one entry for Y(J), not a copy for each source line at which it occurs (reducing data-area redundancy). If the array is longer than 16-bits (64 Kwords) or starts at any value other than 0 or 1 (for example, DIMENSION Y(-100:100)), the length field points to a word in page 0 containing the actual limits.

The metacode may appear daunting, but the bottom line is that it concentrates the maximum information in as few bits as possible, thereby minimizing the entropy. Using an 8088 PC/XT to execute it would be a joke, a 32-bit 386 is marginal; RISC processors with their one instruction per cycle and huge banks of registers are just what the doctor ordered. Most importantly, the mass of information presented to the RISC processor allows it to change the order of the logic, thereby plugging free cycles and memory-reference delays with useful RISC instructions. (As long as the decoder produces the result expected, the internal sequence is academic.)

--I.H.