Ian holds a BSc in Mechanical Engineering and an MS in Aerospace Engineering. He is the principal author of DISSPLA and cofounder of ISSCO. He can be reached at Integral Research, 249 S. Highway 101, Suite 270, Solana Beach, CA 92075.
In last month's installment, I introduced the concept of a "virtual computer" patterned on the Cray supercomputer model--an architecture defined by software rather than hardware. Using the PORT system (a PC-based environment that adheres to Cray's design principles, as described in my June 1992 article), I showed how the virtual computer concept can achieve surprising performance on PCs and exploit RISC processors, and do so in a seamlessly portable environment. This month, I'll discuss salient features of the PORT system itself, focusing on design decisions rather than implementation details. As you recall, PORT has a philosophy diametrically different from that of UNIX and OS/2. It unabashedly plagiarized the best features of a number of mainframe systems, especially Cray's CDC 6600. In short, PORT's architecture is more an incarnation of several proven systems than an untested fresh approach.
The key difference between PORT and most other systems is that PORT is totally dedicated to running multi-megabyte Fortran applications. (Although PORT was originally designed for Fortran, its Fortran/C compiler is being modified to accept C syntax.) For example, the file manager treats all directories as libraries (there is no need to MAKE libraries), the linker automatically resolves OBJ-type files from any specified directories, and FIND C=WORKSP searches OBJ files (ignoring all other types) for a COMMON block called WORKSP.
It's no secret that ANSI Fortran is unsuitable for systems programming. So rather than throw the baby out with the bathwater, we vastly extended Fortran (while preserving Fortran 77 compatibility) instead of creating some new language. PORT's Fortran/C incorporates most of the functions found in C--Boolean operations, shifts, bit-field operatives, a pointer type, and low-level I/O intrinsics--and has proven an effective systems-programming vehicle. The entire PORT system--including the compiler, linker, editor, virtual memory management, and system libraries (about 1 million lines)--is written in Fortran/C.
Mainframe-grade Fortran applications often expect virtual memory, or the ability to run programs larger than available RAM by transparently swapping pieces to disk. Virtual memory has proved to have an even more important benefit: the ability to call massive programs from within one another. For example, you can call the editor from within your program, then the compiler from within the editor, the file manager from within the compiler, and so on. Your own program may eventually be totally swapped out to disk, but this is transparent.
Chaining multiple programs, which is often confused with multitasking, enables the development of gigantic applications. For example, if you can depend on calling the editor, there is no need to write editor functions into your program.
Being focused on Fortran/C, PORT achieves virtual memory simply by basing its page size on the longest practical subroutine (or function). As long as a subroutine never crosses a page boundary, all variables and branches within that subroutine are local and don't need virtual-memory address translation. This idea is somewhat similar to the segment organization of the DOS huge model, except that PORT pages are a fixed length and can hold multiple subroutines. As I pointed out in my June article, keeping most references local is equivalent to about an 85 percent "cache hit ratio" right off the bat. PORT local addressing is relative to the start of the current subroutine's instruction and data areas rather than a segment base address or other hardware-dependent artifice. The absolute RAM offset address of the subroutine is reset automatically upon any call or return, as in the DOS huge model. Unlike the huge model, however, if the PORT metacode does not find the target page in RAM, it generates a page fault and the page is rolled in from disk (behind the programmer's back). Thus the program size is limited by the disk-staging area you assign, not the available RAM.
The current PORT page is 4096 64-bit words, or 32 Kbytes. (Support for 64-Kbyte pages will be available soon.) The instruction and data areas for each subroutine are generally kept in separate pages, so instruction pages don't have to be written back to disk. Measurements on numerous programs since 1979 have shown that the Fortran/C compiler generates 1.2 to 1.4 meta-instructions (64 bits each) per executable source statement. Hence a PORT page can accommodate a subroutine of up to about 3000 executable statements, or roughly 4500 typical lines. A single subroutine longer than that tends to be unwieldy and is best broken into smaller ones.
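The page-size arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation; the function names are mine, and the only inputs are the figures quoted in the text (4096 words per page, one 64-bit word per meta-instruction, 1.2 to 1.4 meta-instructions per statement).

```python
# Page-capacity arithmetic using the figures quoted in the text.
WORDS_PER_PAGE = 4096   # current PORT page size, in 64-bit words
BYTES_PER_WORD = 8      # one 64-bit word


def page_bytes(words=WORDS_PER_PAGE):
    """Size of one page in bytes."""
    return words * BYTES_PER_WORD


def max_statements(meta_per_statement, words=WORDS_PER_PAGE):
    """Executable statements that fit in one instruction page, given the
    measured number of meta-instructions (one word each) per statement."""
    return int(words / meta_per_statement)


if __name__ == "__main__":
    print(page_bytes())                 # page size in bytes
    for ratio in (1.2, 1.4):
        print(ratio, max_statements(ratio))
```

The two ratios bracket the "about 3000 executable statements" figure: roughly 3400 statements at 1.2 meta-instructions per statement and roughly 2900 at 1.4.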
The linker packs as many subroutines as possible into an instruction/data page pair, which holds around 60 to 200 typical-length subroutines (32-Kbyte pages). If any one of those routines is called and the page pair is not in RAM, the remaining 59-199 ride in with it. The linker exploits this to improve performance by grouping routines that call each other into the same page pair and also provides user-specified grouping. Indeed, the linker gives precedence to grouping the branch of a call tree over page packing efficiency. Thus, a "miss" on one routine brings in many related routines on the same two disk references. By contrast, most virtual-memory systems go through a whole sequence of disk references before they develop a "working set." Disk I/O being a pacing factor in many virtual-memory applications, disk efficiency can outweigh processor performance.
A key difference between the DOS huge model and PORT is that PORT permits arrays to cross pages: Arrays can be multi-megabyte with an overall limit of 32 Mbytes per program (310 Mbytes with 64-Kbyte pages). This may not be anywhere near the 4-gigabyte limit of 32-bit addressing, but in practice, a single program of even 5-10 Mbytes involves thousands of subroutines, and I have found it to be unwieldy to develop beyond a certain point. (Bear in mind that a megabyte amounts to about 100,000 lines of Fortran/C, accounting for data areas and typical arrays.) The 4-gigabyte number is a bit academic--simply dedicating a 4-gigabyte disk-staging area is beyond most PCs, workstations, and even mainframes (per user). Just think of the disk-swap time! In practice, since PORT programs can call one another, even 32 Mbytes per program has yet to prove a limitation. What is not academic is that 32-bit addressing wastes 1-2 bytes 99.99 percent of the time, hogging the processor's critical prefetch queue.
Virtual-memory addressing is only applied to access across pages. Since no subroutine can straddle pages, virtual-memory translation is only necessary for array references, call/return, and pointers. (COMMON blocks, strings, and other structures are presented as arrays to the metacode so there is no distinction at the metacode level.)
Like mainframes and supercomputers, PORT targets a 64-bit bus: Addressing is in terms of 64-bit words, and all instructions plus data items are processed as multiples of 64-bit words. A byte is regarded as just an 8-bit field, so the metacode changes a byte value by bringing in its 64-bit word, modifying the 8-bit field, and writing out the whole word. (On a 64-bit bus, accessing a byte costs the same as accessing a 64-bit word.) "All 64-bit" execution may be excessive for 80x86 PCs, but most plug-in RISC cards (such as i860-based cards) use a 64-bit bus locally. Boards like the Hyperspeed D860 allow multiple cards to interconnect via ribbon cables carrying a blisteringly fast 64-bit local bus.
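The read-modify-write treatment of a byte as an 8-bit field can be sketched as follows. This is an illustration in Python, not PORT code: the function names are hypothetical, and the byte numbering (byte 0 is the most significant byte of the word) is my assumption, since the text does not specify it.

```python
MASK64 = (1 << 64) - 1  # keep results within one 64-bit word


def load_byte8(word, index):
    """Extract byte `index` (0..7) from a 64-bit word.
    Assumption: byte 0 is the most significant byte."""
    shift = 8 * (7 - index)
    return (word >> shift) & 0xFF


def store_byte8(word, index, value):
    """Return a copy of `word` with byte `index` replaced by `value`.
    The whole word is read, one 8-bit field is changed, and the whole
    word is written back."""
    shift = 8 * (7 - index)
    cleared = word & ~(0xFF << shift) & MASK64
    return cleared | ((value & 0xFF) << shift)


if __name__ == "__main__":
    word = int.from_bytes(b"ABCDEFGH", "big")
    print(chr(load_byte8(word, 0)))
    print(store_byte8(word, 2, ord("x")).to_bytes(8, "big"))
```

The same shift-and-mask pattern would serve for the 6-bit and arbitrary-field operations listed among the meta-instructions, with a different field width.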
Although the advantage of 64 bits for performance is clear, it has proved surprisingly beneficial to Fortran/C coding. All Fortran/C variables are 64-bit, or a multiple thereof. (INTEGER*2, REAL*4, and so on, are accepted but ignored.) Strings are 64-bit aligned and padded to a word boundary. Thus integers, floating point, pointers, and strings can all be freely intermixed. For example, INTEGER, REAL, CHARACTER, and POINTER type variables can be interchangeably passed through the same subroutine argument and EQUIVALENCEd to one another, contrary to both ANSI Fortran and C rules. Interchangeability simplifies dynamic record structures containing variable-length components or omitted items. 64-bit integers have proved invaluable for packing and manipulating multiple values simultaneously; for example, A=B moves eight bytes at a pop.
As mentioned in the previous articles, PORT defines its own machine-independent "instruction set" customized to Fortran/C. This metacode is executed via a processor-dependent decoder supplied for each platform. Each meta-instruction is a 64-bit word organized as shown in Figure 1. All meta-instructions have the form A=B op C. The tag for each operand A, B, and C determines the operand mode--151, J, X(I), Y(I,J), for instance. The text box "PORT's Metacode" describes metacode operation.
Table 1 lists all three-operand instructions (A=B op C). Because it's wasteful to use three operand slots for two-operand instructions (A = op B), these are grouped as instruction 25h using the value of C to distinguish them; see Table 2. Note how specific the instructions are to Fortran and C.
Table 1: Three-operand meta-instructions (A=B op C).

Code (hex)  Mnemonic  Description
-------------------------------------------------------------------------
00          --        Illegal op code (catches 0 as an instr.).
01          (+):i     Integer ADD.
02          (-):i     Integer SUBTRACT.
03          (*):i     Integer MULTIPLY.
04          (/):i     Integer DIVIDE.
05          (//):i    Integer INCLUSIVE DIVIDE.
06          B**IC:i   Real-to-integer power.
07          MOD:i     Integer remainder.
08          (=):i     Integer test EQUAL.
09          (<):i     Integer test LESS THAN.
0A          (>):i     Integer test GREATER THAN.
0B          (#):i     Integer test NOT EQUAL.
0C          (<=):i    Integer test LESS THAN OR EQUAL.
0D          (=>):i    Integer test GREATER THAN OR EQUAL.
0E          arith_L   Arithmetic LEFT shift (preserve sign).
0F          arith_R   Arithmetic RIGHT shift.
10          B**C:r    Real to power of a real.
11          (+):r     Real ADD.
12          (-):r     Real SUBTRACT.
13          (*):r     Real MULTIPLY.
14          (/):r     Real DIVIDE.
15          POLYNML   Polynomial expansion evaluate.
16          INCADR    Increment POINTER.
17          KCHECK    Test for operand error.
18          (=):r     Real test EQUAL (r/tests use fuzz factor).
19          (<):r     Real test LESS THAN.
1A          (>):r     Real test GREATER THAN.
1B          (#):r     Real test NOT EQUAL.
1C          (<=):r    Real test LESS THAN OR EQUAL.
1D          (=>):r    Real test GREATER THAN OR EQUAL.
1E          SBYTE8    Replace 8-bit byte (in array or word).
1F          LBYTE8    Extract 8-bit byte.
20          BreakPt   Breakpoint.
21          GO TO*    List GO TO.
22          jmp#0     Branch if NONZERO.
23          jmp=0     Branch if 0.
24          ARYMOV    Block op (array search, init, copy, and so on).
25          2op       Two-operand instructions (see Table 2).
26          DO end    Loop test, increment/decrement, and branch.
27          CALL      Call subroutine.
28          AND       Logical AND.
29          OR        Logical OR.
2A          XOR       Logical XOR.
2B          ANDF      Boolean AND.
2C          ORF       Boolean OR.
2D          XORF      Boolean XOR.
2E          SBYTE6    Replace 6-bit byte (in array or word).
2F          LBYTE6    Extract 6-bit byte.
30          I_FIELD   Extract arbitrary field from word.
31          S_FIELD   Replace arbitrary field in word.
32          ATAN2     Ratio arc tangent.
33          SIN_COS   Compute both sine and cosine of C.
34          shift_L   Boolean LEFT SHIFT.
35          shift_R   Boolean RIGHT SHIFT.
36          StrngOP   Text op (copy, search, concatenate, and so on).
37          ByteOP    Byte-string op (nonspecific StrngOP).
38          Unpack8   Unpack string to word-per-byte array.
39          Pack8     Pack eight bits/word array as string.
3A          CALL_P    Call program on another processor.
3B          Load_P    Load program on another proc.
3C          Reset_P   Restart program on another proc.
3D-3F       --        Not used.
40          --        Illegal op code (catches -0 as instr.).
41          (+):c     Complex ADD (c/tests use fuzz factor).
42          (-):c     Complex SUBTRACT.
43          (*):c     Complex MULTIPLY.
44          (/):c     Complex DIVIDE.
45          B**IC:c   Complex to integer power.
46          CMPLX     Real and imaginary to complex.
47          --        Not used.
48          (=):c     Complex test EQUAL.
49          (<):c     Complex test LESS THAN.
4A          (>):c     Complex test GREATER THAN.
4B          (#):c     Complex test NOT EQUAL.
4C          (<=):c    Complex test LESS THAN OR EQUAL.
4D          (=>):c    Complex test GREATER THAN OR EQUAL.
4E-7F       --        Not used.
Table 2: Two-operand meta-instructions (instruction 25h), selected by the value of C.

C (hex)  Mnemonic  Description
-------------------------------------------------------------
00       neg       Negate.
01       NOT       Logical invert.
02       NOTF      Boolean complement.
03       PP req    Flag peripheral processor and halt.
04       GO TO     Unconditional branch.
05       (:=)      Transfer one word.
06       RETURN    Subroutine RETURN.
07       Intrpt    Suppress interrupts (ignore PP flag).
08       ABS       Get absolute value.
09       INT       Truncate real to integer.
0A       FLOAT     Integer to real.
0B       SIGN      Assign sign of B to A.
0C       STATS     Get memory-error statistics.
0D       VADDR     Set POINTER to address.
0E       EXECUTE   Execute local procedure.
0F       MEMCHK    Force memory reference (test RAM).
10       EXCHNG    Swap contents of A and B.
11       IROUND    Round real to integer.
12       --        Unassigned.
13       SQRT      Square root.
14       SIN       Sine.
15       COS       Cosine.
16       TAN       Tangent.
17       ArcSIN    Inverse sine.
18       ArcCOS    Inverse cos.
19       ArcTAN    Inverse tan.
1A       LOG       Log to base e.
1B       LOG10     Log to base 10.
1C       EXP       e to power B.
1D       EXP10     10 to power B.
1E       (+):i     Direct INTEGER ADD.
1F       (-):i     Direct INTEGER SUBTRACT.
20       (*):i     Direct INTEGER MULTIPLY.
21       (/):i     Direct INTEGER DIVIDE.
22       arith_L   Direct arithmetic LEFT SHIFT.
23       arith_R   Direct arithmetic RT SHIFT.
24       shift_L   Direct Boolean LEFT SHIFT.
25       shift_R   Direct Boolean RT SHIFT.
26       MOD:i     Direct remainder.
27       ORF       Direct Boolean OR.
28       ANDF      Direct Boolean AND.
29       XORF      Direct Boolean XOR.
2A       (+):r     Direct REAL ADD.
2B       (-):r     Direct REAL SUBTRACT.
2C       (*):r     Direct REAL MULTIPLY.
2D       (/):r     Direct REAL DIVIDE.
2E       REAL      Real part of complex.
2F       CMPLX     Real to complex.
30       CONJG     Complex conjugate.
31       SINH      Hyperbolic sine.
32       COSH      Hyperbolic cosine.
33       TANH      Hyperbolic tangent.
Since each meta-instruction needs to be interpreted, the decode time is critical. The metacode is therefore designed to be decoded as fast as possible. For example, even though the PORT page is 4096 words, so that an offset within a page needs only 12 bits, each operand is assigned a 16-bit field because most processors have instructions that can operate on 16 bits directly. Likewise, the 3-bit tags describing the contents of A, B, and C are grouped as a 9-bit field, rather than being tacked onto their respective A, B, and C fields--allowing the decoder programmer to use a fast 512-way indexed branch to interpret all three tags in a single assembly instruction.
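To make the field layout concrete, here is a minimal Python sketch of packing and unpacking one 64-bit meta-instruction and dispatching on the combined tags. The field widths (three 16-bit operands, a 9-bit tag group, 7 bits left for the op code) follow from the text. The exact bit positions and the order of the A, B, and C fields are my assumptions, although putting the op code in the top seven bits is consistent with Table 1, where codes 00 and 40 trap a word of 0 or -0 executed as an instruction. All function names are hypothetical.

```python
def encode(opcode, tags, a, b, c):
    """Pack one 64-bit meta-instruction.
    Assumed layout: opcode 63..57, tags 56..48, A 47..32, B 31..16, C 15..0."""
    assert 0 <= opcode < 128 and 0 <= tags < 512
    assert all(0 <= field < 65536 for field in (a, b, c))
    return (opcode << 57) | (tags << 48) | (a << 32) | (b << 16) | c


def decode(word):
    """Split a meta-instruction into (opcode, tags, a, b, c)."""
    c = word & 0xFFFF
    b = (word >> 16) & 0xFFFF
    a = (word >> 32) & 0xFFFF
    upper = (word >> 48) & 0xFFFF   # op code and tags share the top 16 bits
    return upper >> 9, upper & 0x1FF, a, b, c


def combine_tags(tag_c, tag_b, tag_a):
    """Three 3-bit tags become one 9-bit index (0..511)."""
    return tag_c * 64 + tag_b * 8 + tag_a


def not_sketched(a, b, c):
    raise NotImplementedError("operand-mode combination not sketched here")


# One table lookup stands in for the 512-way indexed branch.
DISPATCH = [not_sketched] * 512


def immediate_local_indirect(a, b, c):
    """Handler for C immediate (tag 1), B local scalar (tag 2), A indirect (tag 3)."""
    return {"A": ("indirect", a), "B": ("local scalar", b), "C": ("immediate", c)}


DISPATCH[combine_tags(1, 2, 3)] = immediate_local_indirect


def interpret_operands(word):
    """Decode the operands via the tag table before looking at the op code."""
    opcode, tags, a, b, c = decode(word)
    return opcode, DISPATCH[tags](a, b, c)


if __name__ == "__main__":
    word = encode(0x06, combine_tags(1, 2, 3), 316, 257, 11)
    print(hex(word), interpret_operands(word))
```

A real decoder would do the table branch in one machine instruction; the Python list lookup only mirrors the idea that a single index resolves all three operand modes at once.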
Without exception, all meta-instructions have an identical format. Furthermore, the metacode operand format is independent of the actual operation. (Both the 386/486 and i860 decoders interpret the operands before they even look at the Op Code.) This rigorous independence not only speeds the decoding on existing processors, but has an eye on future chips, which will incorporate multiple independent processors on the same silicon (Motorola's recently announced M88110 RISC multiprocessor, for instance). A context-free format lends itself to parallel decoding the A, B, and C operands and even to pre-interpreting instructions. Obsolescence being reality in computers, we have tried to think as far ahead as possible.
As previously mentioned, all I/O functions are handled by the peripheral processor (PP) program. Although the PP was designed to execute on a processor separate from the metacode computation processor (CP), I discussed how it coexists surprisingly well with the CP in a single-processor environment. Each PP action is specified via a 5x64-bit word structure deposited in the CP memory. Table 3 lists the current PP action codes. The PP is free to honor the requisite action in any way it chooses, thus enabling PORT to be implemented seamlessly in diverse host environments.
Code Action Description
-------------------------------------------------------------------------
01 Output text line Sends line to screen, COM or LPT port.
02 Input text line Gets line/char from keyboard or port.
03 Set flag chars Defines chars for line feed, prompt, and so
on.
04 Set date/time Resets current date and time.
05 Get error stats Gets error counts for disk and PP memory.
06 Jumper ports Redirects text-line output to specific ports.
07 Get date/time Retrieves current date and time.
09 Read/write disk Retrieves/replaces disk page.
10 Discard interrupts Clears any interrupt flag from PP to CP.
11 Jump history Activates/gets last 2048 branches, calls.
12 Mag tape op Reads/writes/positions 9-track, 8mm, 3480,
DAT.
13 Set COM params Sets serial-port baud rate, and so on.
14 Generate tone Sends tone sequence to speaker.
16 Exec host program Executes DOS(UNIX) based program.
18 Set spooling Activates/discards output to DOS file buffer.
21 Set COM protocol Sets port handshaking (XON/XOFF, RTS).
24 Swap disk page Exchanges page in memory with page on disk.
25 High-speed I/O Block I/O to IEEE 488, SCSI, and so on.
27 Terminate PORT Signs off PORT and returns to host system.
28 Switching programs Tells PP that PORT starting different
program.
29 Host file op Reads/writes/finds/copies/deletes DOS file.
30 Screen graphics op Outputs line/fill/block/color to screen.
31 Window/mouse op Creates/removes/retrieves/outputs a window.
128 Fatal error PORT detected internal error, sign off.
Again, all I/O data items lie on a 64-bit boundary and always have a length which is a multiple of 64 bits. (Strings are padded, as necessary.) This enables the PP implementor to utilize the maximum bandwidth available on the machine; for example, the PP lends itself readily to 32-bit EISA and Micro Channel implementation. By way of comparison, UNIX, OS/2, and DOS I/O are specified in terms of a byte offset and byte count, which requires a costly shift of the data block if the start byte does not fall on a 32-bit boundary.
To avoid the PP becoming involved in the virtual-memory process, the CP guarantees that no I/O block crosses a page boundary. As an example, standard Seismic SEG-B tapes have records of several megabytes apiece. The Fortran/C program can readily define a multi-megabyte array to hold the record, but the PP transfers these records as a sequence of one-page blocks. This places the onus on PORT and the CP to process any virtual-memory page faults on the huge array, between the record pages, thereby simplifying the PP implementation. The CP and PP are designed to execute in parallel. Even on a single 386/486 processor, the PP caches tape I/O using the DOS timer interrupt to service the drive concurrent with the CP program.
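The rule that no I/O block crosses a page boundary amounts to cutting a long transfer into page-sized pieces. The sketch below shows one way to do the cutting; it is not PORT's code. It assumes a linear word address equal to page number times 4096 plus the word offset, and the function name is hypothetical.

```python
PAGE_WORDS = 4096  # 64-bit words per PORT page


def page_blocks(start_word, length):
    """Split the word range [start_word, start_word + length) into
    (page, offset, count) pieces, none of which crosses a page boundary."""
    blocks = []
    word, remaining = start_word, length
    while remaining > 0:
        room = PAGE_WORDS - word % PAGE_WORDS   # words left in this page
        count = min(room, remaining)
        blocks.append((word // PAGE_WORDS, word % PAGE_WORDS, count))
        word += count
        remaining -= count
    return blocks


if __name__ == "__main__":
    # A 10,000-word record starting at page 5, word 862.
    for block in page_blocks(5 * PAGE_WORDS + 862, 10000):
        print(block)
```

Each piece can then be handed to the PP as an independent request, with any page faults on the array handled by the CP between pieces.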
To illustrate the effectiveness of these ideas, PORT is able to copy a 2500-foot, 9-track, 6250-bpi Seismic SEG tape in 4.5 minutes between two streaming 125-ips drives using a Fortran program, on a stock 486/33 EISA PC (sans i860). Dividing 2500 feet (30,000 inches) by 125 inches per second gives a theoretical lower limit of four minutes. By comparison, an oil company had run under UNIX on a Sun-compatible rated faster than a SPARC 4 and, after a year of custom device driver coding plus C conversion, was able to reduce the time from 12 to 6.5 minutes. The 4.5 minutes reflects a sustained data-transfer rate of 2x6250x125x(4/4.5)=1.3 Mbytes/sec (accounting for interrecord gaps) from Fortran. (At the last SEG show in November 1991, the PC was hidden under the table because the company felt nobody would believe a PC could do what it was doing.)
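The tape figures can be reproduced with a short calculation. This sketch uses only the numbers in the text; the function names are mine, and reading "Mbytes" as 2^20 bytes is my assumption (it is the reading under which the result rounds to 1.3).

```python
TAPE_FEET = 2500      # tape length
SPEED_IPS = 125       # drive speed, inches per second
DENSITY_BPI = 6250    # bytes per inch on a 9-track tape


def streaming_seconds(feet, ips):
    """Time for the tape to pass the head without stopping."""
    return feet * 12 / ips


def sustained_rate(density, ips, ideal_seconds, actual_seconds, drives=2):
    """Bytes per second moved across all drives, scaled by the fraction
    of the ideal streaming speed actually achieved."""
    return drives * density * ips * ideal_seconds / actual_seconds


if __name__ == "__main__":
    ideal = streaming_seconds(TAPE_FEET, SPEED_IPS)
    rate = sustained_rate(DENSITY_BPI, SPEED_IPS, ideal, 4.5 * 60)
    print(ideal / 60, "minutes minimum")
    print(round(rate / 2**20, 1), "Mbytes/sec")
```

The factor of two counts both drives, since every byte is read from one tape and written to the other.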
True to the maxim that the last 5 percent of the bugs take 95 percent of the development time, a bug in the DISSPLA shading algorithm took us almost two years to find. It showed up on one mainframe, but not another. It even disappeared from execution to execution. Not even the most sophisticated debuggers were any help, because the program never failed when the debuggers were active. New DISSPLA releases were held up because every time we thought we had it, the bug popped up anew. (Sometimes we thought it depended on the phases of the moon.) It turned out to be a truncation from floating point to integer used to compute the index to a workspace array, whose contents depended on memory at the time the program was run. (I=INT(A) is highly processor dependent when A falls between 0.999... and 1.000...01.) Experiences like these on standard systems almost drove me into a different line of work, and I resolved that when we designed what is now PORT, debugging features would take priority over any performance considerations. The first time it was run under the SUPERSET precursor to PORT, the bug popped up--and was nailed.
The design of PORT assumes that nothing is working properly. For example, the file-management system attaches an internal software checksum to every record of every file. Whenever a file is copied on the disk or transferred to or from tape, this checksum is validated by the PORT software. We've even caught problems in the error-checking circuitry of the hardware!
The PORT metacode performs numerous checks on every instruction without exceptions, not even for the PORT system itself. These include:
The metacode approach provides other bonuses for code development. For example, the decoder can be requested to store the last 2048 branches (providing a software "logic analyzer" execution history). The decoder stores the count of meta-instructions executed by subroutine, furnishing a "profiler" that has proved invaluable for optimizing multi-megabyte applications. The PORT debugger can request the decoder to test for an event on every meta-instruction. Since the tests are embedded in the decoder, the debugger slows the program by just a few percent. Unlike debuggers that insert code, there are no loopholes and no need to recompile. Basically, the metacode decoder represents modifiable "microcode" that can be programmed to do whatever you like.
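A branch history and a per-subroutine instruction count are both cheap to keep inside an interpreter loop. The following Python sketch shows the bookkeeping only; the class and method names are hypothetical, and how PORT's decoder actually stores this data is not described in the text.

```python
from collections import Counter, deque


class DecoderTrace:
    """Bookkeeping an interpreter could update as it runs."""

    def __init__(self, history=2048):
        # A bounded deque drops the oldest entry once it is full,
        # so it always holds the most recent `history` branches.
        self.branches = deque(maxlen=history)
        # Meta-instructions executed, keyed by subroutine name.
        self.counts = Counter()

    def on_instruction(self, subroutine):
        self.counts[subroutine] += 1

    def on_branch(self, source, target):
        self.branches.append((source, target))

    def hottest(self, n=3):
        """The n subroutines that executed the most meta-instructions."""
        return self.counts.most_common(n)


if __name__ == "__main__":
    trace = DecoderTrace()
    for i in range(3000):
        trace.on_branch(i, i + 1)
        trace.on_instruction("INNER" if i % 3 else "OUTER")
    print(len(trace.branches), trace.branches[0], trace.hottest())
```

Because the hooks sit in the decoder rather than in the program, the traced program needs no recompilation, which is the point made above.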
Another subtle use of the modifiable microcode is the application of a "fuzz factor" to all floating-point compares (for example, IF(B=1.0) THEN succeeds if B=0.999...). The fuzz factor is user modifiable and circumvents the nightmare DISSPLA bug I described.
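A fuzz-factor comparison can be sketched in a few lines. The relative-tolerance form and the default value below are my assumptions; the text says only that the factor is user modifiable and applies to all floating-point compares.

```python
def fuzzy_equal(b, c, fuzz=1e-12):
    """Equality test that tolerates a small relative difference.
    Assumption: the fuzz factor is relative to the larger operand."""
    return abs(b - c) <= fuzz * max(abs(b), abs(c))


if __name__ == "__main__":
    total = 0.1 + 0.2
    print(total == 0.3)              # exact compare
    print(fuzzy_equal(total, 0.3))   # compare with fuzz factor
```

Here the exact comparison fails because 0.1 + 0.2 is not representable as exactly 0.3 in binary floating point, while the fuzzy comparison succeeds.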
An important aspect of virtual memory is that the PORT debugger is available at all times on all programs, especially on release versions at user sites. The linker attaches a copy of its tables to the end of each executable image (no exceptions). These include the location, programmer name, date/time, and source installation of all the OBJ files.
Upon any fault (like an array bounds or bad pointer), the debugger swings into action, swaps the program back to disk, and loads the linker-attached tables. Using this roadmap, the debugger can probe any program--even itself. The key difference between the PORT debugger and CodeView is that PORT can debug the program at the user site with the user's proprietary data in the exact state of its failure. (PORT doesn't know "memory dumps.") In multi-megabyte apps developed by a team, the ability to pinpoint the subroutine version by date/time and the programmer responsible has proved worth its weight in gold.
The philosophy of PORT is to throw as many processors as possible at the task, not as many tasks as possible at the processor. Dedicating the entire system to the current application simplifies multi-processor handling, thereby improving throughput rather than wasting cycles in overhead. The elegant simplicity of Cray's memory-mapped CP/PP communication scheme proved easily extensible to multiple processors.
All processors are under the control of the CP, which in turn is under the control of your program. The program treats each processor as if it were executing a subroutine. To see how this works, consider using an auxiliary processor to do three-dimensional rendering; see Figure 2. The rendering is done by a call to RENDER_3D (OBJ_LIST, FRAME_BUF), for example, where OBJ_LIST is the list of solid-object coordinates, and FRAME_BUF is the target frame buffer. In a single-processor environment, RENDER_3D would just be a normal Fortran/C subroutine, and indeed, the linker expects a Fortran/C version. Prior to calling RENDER_3D, the program requests the CP to load RENDER_3D on processor 1 via a LOAD_PROCESSOR system call. If there is no processor 1, or no assembly version of RENDER_3D for it, the CP returns an error status to the user program. The program can then opt to terminate or proceed using the Fortran/C version. Given that everything is kosher, the metacode decoder on the CP intercepts the RENDER_3D call. Instead of executing the normal Fortran/C call, the decoder makes sure that the OBJ_LIST and FRAME_BUF arrays are loaded in RAM, passes their addresses to the RENDER_3D version in processor 1, and fires up processor 1. The decoder then executes a return to the user program as if it had executed the Fortran/C RENDER_3D call.
The user program executing in the CP and RENDER_3D in processor 1 proceed in parallel. When RENDER_3D is done, it sets a bit in a word tested by the CP prior to executing each meta-instruction (an interrupt without a hardware interrupt). A user-supplied interrupt-processing routine is then called. Alternatively, the program can just opt to poll a common status word it defines, to avoid interrupt handling. Upon detecting completion by processor 1, the program can send FRAME_BUF to the VGA and make another call to RENDER_3D or other routine loaded in processor 1. The bottom line is that your program is in control of all processors at all times.
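The control flow described here (load a fast version if possible, otherwise fall back to the portable one, then carry on and check for completion later) can be mimicked in Python with a thread standing in for processor 1. Everything in this sketch is illustrative: the class, the function names, and the use of threads are my assumptions, and the "fast" routine is simply the portable one reused.

```python
import threading


def render_3d_portable(obj_list, frame_buf):
    """Stand-in for the portable Fortran/C version, run on the CP."""
    frame_buf[:] = [sum(obj) for obj in obj_list]


class AuxProcessor:
    """Hypothetical auxiliary processor holding fast versions of routines."""

    def __init__(self, fast_routines):
        self.fast_routines = fast_routines
        self.loaded = None
        self.done = threading.Event()   # plays the role of the completion bit

    def load(self, name):
        """Mimics LOAD_PROCESSOR: returns False if no fast version exists."""
        self.loaded = self.fast_routines.get(name)
        return self.loaded is not None

    def start(self, *args):
        """Fire up the loaded routine and return at once."""
        self.done.clear()

        def run():
            self.loaded(*args)
            self.done.set()

        worker = threading.Thread(target=run)
        worker.start()
        return worker


def call_render(processor, obj_list, frame_buf):
    """Use the auxiliary processor when available, else the portable version."""
    if processor is not None and processor.load("RENDER_3D"):
        return processor.start(obj_list, frame_buf)
    render_3d_portable(obj_list, frame_buf)
    return None


if __name__ == "__main__":
    objects, frame = [(1, 2, 3), (4, 5, 6)], [0, 0]
    aux = AuxProcessor({"RENDER_3D": render_3d_portable})
    worker = call_render(aux, objects, frame)
    aux.done.wait()    # the program polls (here, waits on) the status flag
    worker.join()
    print(frame)
```

The caller's code is the same whether or not the auxiliary processor exists, which is the portability property the scheme is meant to preserve.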
Note that RENDER_3D has two versions: a high-speed, machine-dependent assembly version and a slower Fortran/C version. Thus the seamless portability of the program is preserved, albeit with the slower version. This feature is also useful for the orderly development of multiprocessor applications. If you go to the hassle of multiprocessing, speed is usually your criterion; PORT therefore focuses on running tight assembly modules, rather than a bureaucratic, symmetric, multithreaded C system. (Silicon Graphics workstations attest to the speed of a tightly coupled MIPS 3000 RISC processor rendering 3-D.)
The underlying PORT theme is to minimize entropy--to concentrate the information and focus on the objective. Bug checks increase entropy, but a program with a serious bug has infinite entropy.
In these three articles on "Personal Supercomputing," I've endeavored to introduce an alternative to the UNIX and OS/2 approach. My intent is not to disparage their multitasking philosophy, but to show that it is not gospel and that there is room for diverse system designs. PORT shows that by targeting a specific class of applications, a system can overcome many of the limitations of the UNIX approach, for its constituency. Conversely, PORT is not the ideal vehicle for agenda managers, word processors, and multiple TSRs. The PORT design realizes this and endeavors to complement the host system rather than replace it. Nevertheless, technology has moved far beyond the PDP-11 and PC/XT. DOS, UNIX, and OS/2 must come to terms with high-speed, multiple RISC processors and 64-bit operation to realize the full potential of the next wave of technology.
In examining the operation of the PORT metacode, let's begin with an example. Consider the Fortran/C source statement: Y(J)=X**11. Assume that Y is a local array with DIMENSION Y(100).
A=B**C is a single 64-bit meta-instruction; see Op Code 06 in Table 1. As shown in Figure 3, C is 11, B is the scalar X, and A the indirect reference to Y(J). The tags differentiate operand modes: C has tag 1 (immediate integer<64K), B tag 2 (local scalar), and A tag 3 (indirect). Tag 1 uses the value of the 16-bit field itself (11). Tag 2 adds the field (257) to the start address of the subroutine data area giving the word address, so X at [d/area+257] has the 64-bit value 5.7. Tag 3 is similar to tag 2, except that the word points to the operand value, hence the value at [d/area+316] points to Y(J). Using all 64 bits for a single address would be excessive, so the array base virtual address is in the upper half-word and the lower 32 bits hold the length and index. In our example, Y starts at virtual page 5, word 862, and J has the value at [d/area+541], namely 25.
The metacode decoder first shifts out the 16-bit fields corresponding to A, B, and C. In the 386/486 version, the decoder then masks out the 9-bit combined-tags value in the upper 16 bits (1*64+2*8+3=83), uses this as an index to a table, and branches to entry 83. This simple branch decodes the tags 1,2,3 in one 386/486 instruction. Operands C and B are straightforward (tags 1 and 2). For A, the decoder proceeds as with B, except that it now masks out the lower 16 bits and retrieves the 32-bit pair for J. The second 16-bit field in the indirect word has the length of Y, namely 100; if J is less than 1 or greater than 100, an array-bounds fault is produced. In the example, the decoder adds J-1=24 to the base v/address for Y (5:0862), giving 5:0886 as the location of Y(J). Page 5 is a virtual page, so the decoder references entry 5 of its page-translation table for the real-memory page address, adds offset 886, and obtains the real-memory word address for Y(J). Having the values for B and C plus the location of A, the decoder shifts out the Op Code and performs the operation listed in Table 1.
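The address arithmetic for Y(J) can be followed step by step in a small Python sketch. It reuses the example's numbers (base 5:0862, length 100, J=25). The function names, the exception classes, the carry into the next virtual page when an array runs past a page end, and the sample page-table value are all my assumptions. A negative table entry is treated as "rolled out to disk," as the sidebar goes on to describe.

```python
PAGE_WORDS = 4096  # 64-bit words per page


class ArrayBoundsFault(Exception):
    pass


class PageFault(Exception):
    pass


def element_address(base_page, base_word, length, index):
    """Virtual (page, word) of element `index` of a 1-based array."""
    if not 1 <= index <= length:
        raise ArrayBoundsFault(index)
    offset = base_word + (index - 1)
    # Assumption: an array that runs past the end of a page continues
    # in the next consecutive virtual page.
    return base_page + offset // PAGE_WORDS, offset % PAGE_WORDS


def real_address(page_table, virtual_page, word):
    """Translate a virtual (page, word) to a real-memory word address."""
    entry = page_table[virtual_page]
    if entry < 0:                   # page is on disk, not in RAM
        raise PageFault(virtual_page)
    return entry + word


if __name__ == "__main__":
    page, word = element_address(5, 862, 100, 25)
    print(page, word)                            # virtual location of Y(25)
    table = {5: 10 * PAGE_WORDS}                 # hypothetical real page address
    print(real_address(table, page, word))
```

Only this path (array references, call/return, and pointers) pays for the table lookup; references local to the subroutine skip it entirely.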
If entry 5 of the page-translate table is negative, it flags that Y(J) is rolled out to disk, and the decoder generates a page fault. If the value for either B or C is -0, it produces an uninitialized-variable fault. If the exponent in the upper 12 bits of B is 0, an unnormalized floating-point fault results. If A's exponent will be greater than 1023, a floating-point overflow fault is produced. These tests are each a matter of one or two instructions and add little to the overhead, but add a whole dimension of debugging power and implement virtual memory with simplicity.
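The operand checks on a real value can be illustrated with bit tests on a 64-bit floating-point word. This sketch assumes the IEEE 754 double layout (sign bit, 11-bit exponent, 52-bit fraction), which the text does not name but which matches the "upper 12 bits" and the 1023 exponent limit. It also assumes that a true +0 is exempt from the unnormalized test. The names are hypothetical.

```python
import math
import struct


class MetacodeFault(Exception):
    pass


NEG_ZERO_BITS = 1 << 63   # sign bit set, everything else zero


def bits_of(x):
    """The 64-bit pattern of a float (IEEE 754 double assumed)."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def check_operand(x):
    """Raise a fault for -0 (uninitialized) or a zero exponent field."""
    bits = bits_of(x)
    if bits == NEG_ZERO_BITS:
        raise MetacodeFault("uninitialized variable")
    exponent = (bits >> 52) & 0x7FF
    if exponent == 0 and bits != 0:   # assumption: +0 itself is allowed
        raise MetacodeFault("unnormalized floating point")
    return x


def real_to_integer_power(b, c):
    """B**C with the operand and overflow checks applied."""
    check_operand(b)
    try:
        a = b ** c
    except OverflowError:
        raise MetacodeFault("floating-point overflow") from None
    if math.isinf(a):
        raise MetacodeFault("floating-point overflow")
    return a


def fault_name(b, c):
    """Name of the fault raised by B**C, or None if it succeeds."""
    try:
        real_to_integer_power(b, c)
    except MetacodeFault as fault:
        return str(fault)
    return None


if __name__ == "__main__":
    print(real_to_integer_power(5.7, 11))
    print(fault_name(-0.0, 11), fault_name(5e-324, 2), fault_name(1e200, 11))
```

Initializing memory to the -0 pattern is what would let the first test catch a variable that was never assigned; each test is a compare or a mask, in line with the "one or two instructions" noted above.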
The upper six bits of the packed indirect word determine whether A is Y(J), Y(I,J), Y(I+J), or a POINTER, and whether J is a variable or an immediate constant; for example: Y(5), Y(I,5), and Y(I+5). In the Y(I,J) and Y(I+J) cases, the index field points to a second level of indirection, for Y(I,J,K) a third level, and so on. All instances of Y(J) in the subroutine point to the same indirect word at offset 316, so there is only one entry for Y(J), not a copy for each source line at which it occurs (reducing data-area redundancy). If the array is longer than a 16-bit length field allows (64 Kwords) or starts at any value other than 0 or 1 (for example, DIMENSION Y(-100:100)), the length field points to a word in page 0 containing the actual limits.
The metacode may appear daunting, but the bottom line is that it concentrates the maximum information in as few bits as possible, thereby minimizing the entropy. Using an 8088 PC/XT to execute it would be a joke, and a 32-bit 386 is marginal; RISC processors, with their one instruction per cycle and huge banks of registers, are just what the doctor ordered. Most importantly, the mass of information presented to the RISC processor allows it to change the order of the logic, thereby plugging free cycles and memory-reference delays with useful RISC instructions. (As long as the decoder produces the result expected, the internal sequence is academic.)
--I.H.
Copyright © 1992, Dr. Dobb's Journal