Intel's Upcoming 64-bit CPU Architecture

Dr. Dobb's Sourcebook January/February 1997

By Hal W. Hardenbergh

Hal is a hardware engineer who sometimes programs. He is the former editor of DTACK Grounded, and can be contacted at halwh@ddj.com.

Once upon a time, Windows NT was going to unify all computer Instruction Set Architectures (ISAs) under a common roof. How many of you remember when Bill Gates, back in the early '80s, publicly asserted that Microsoft's operating system would migrate towards UNIX? UNIX is the operating system that offered, and offers, support for multiple platforms. Bill has a world-class attention span, and NT is now fierce competition for UNIX.

But now that Microsoft has dropped NT support for the MIPS CPUs, we know it's not going to be a unifying agent. Neither Silicon Graphics (owner of MIPS) nor Apple Computer (main purveyor of PowerPC systems) like NT. Will NT support for PowerPCs be next to go? That would leave DEC's Alpha with its bare fanny hanging metaphorically in the cold breeze as the only non-x86 ISA supported by NT.

Microsoft wanted to force most, if not all, applications to run on multiplatform NT. That strategy has failed. First, most of the Windows 95 applications won't even run on x86 NT. The small subset of x86 shrink-wrapped code that'll run on NT hardly ever gets ported to a non-x86 platform, even though Microsoft would like us to believe that the process only involves recompiling the app's C source code. Yeah, right.

Bye-Bye Software Emulation

Now that Apple has announced future PowerPC boxes above a certain price will also contain Pentium CPUs, the pretense that x86 software could be productively run on PowerPC boxes using software emulation has been dropped. When the original announcement that PowerPCs could emulate x86 code at 486 speeds was made, I almost busted a gut laughing -- as did everyone else who lived through the original x86 emulation fiasco with the 68000-based Amiga.

Don't Emulate, Translate?

DEC had developed an alternative to straight emulation to help port VAX code to its Alpha microsystems. This involved native code translation. It had adapted this technique (dubbed "FX!32") to porting shrink-wrapped x86 apps to its Alpha systems. In reviewing FX!32, Windows Sources (September 1996) said:

We found that a program would generate an error message, and we'd exit the app, restart it, and get further along until another error occurred. Starting over again would take us further through the application, with this process being repeated over and over....

The obvious answer is for someone (DEC?) to perform the translation once, in advance, and provide the translated code to the Alpha system customer. This would make the owners (Lotus, for example) of the software deliriously happy. Especially when they start getting support calls for problems caused by imperfect translation. (Hey, if Lotus' spreadsheet, translated by DEC, doesn't run on your Alpha system, you call Lotus. Right?)

I believe that while translation can work, practical difficulties will get in the way. Translation hasn't worked up to now, else shrink-wrapped apps announced for the x86 on Monday would be announced for the PowerPC on Tuesday and for 68000-based Macs on Wednesday and for SGI's MIPS boxes on Thursday and...it ain't that easy, folks.

DEC has just released FX!32 benchmark data running BYTE's Windows NT integer benchmark. On a 500-MHz Alpha 21164-based system, the result was 48 percent faster than a 200-MHz Pentium Pro-based system. (FP was 21 percent slower.) (See "Most Significant Bits," Microprocessor Report, October 28, 1996.) Now go price a 500-MHz Alpha 21164 system.

The Problem Is...

Your corner software supermarket contains a huge assortment of shrink-wrapped software for x86 systems, one table for Macs, and nothing for any other CPU system. It's obvious that the vendors of those other systems, including PowerPC system vendors, are casting intensely envious eyes at all that x86 software.

Microsoft (naturally) wants its Office 97 suite to run on every desktop computer. Intel (naturally) would like to keep all that shrink-wrapped software for itself. We do not have unanimity within the "Wintel" camp.

Brute Force: One CPU per ISA

Apple's latest approach -- including an actual Pentium CPU in its PowerPC systems -- can obviously work, peripheral compatibility issues aside. It also obviously places its PowerPC systems at a severe price disadvantage versus Wintel systems. If every vendor of non-x86 desktop systems pursued this strategy, it wouldn't be long before Wintel's 90 percent market penetration jumped to 99 percent. (This process may eventuate no matter what. See "Whatever Happened to RISC PCs?", by Michael Slater, Computer Shopper, November 1996.)

Hardware-Assisted Emulation

While hardware assists will certainly make emulation run faster, the result will still be a lot slower than native code.

All non-x86 vendors of desktop CPUs have in the past couple of years announced that hardware support for x86 emulation would be built into their future CPUs. IBM, however, has canceled its "615" projects (yes, plural). No non-x86 CPU has yet been shipped by any vendor that has hardware assists for x86 emulation.

As other CPU vendors discover the effort required to provide a decent hardware assist for emulating Windows 95 protected-mode code, as well as how slow the result will nevertheless be, I think more non-x86 CPU vendors will follow IBM's lead.

Intel's New 64-bit ISA

You would think that Intel wouldn't want to support any other ISA than the x86. But there's the upcoming "Merced" (formerly P7) CPU in which Hewlett-Packard's Precision Architecture (PA), or at least HP, is somehow involved. ("Precision Architecture" is HP's name for its RISC ISA that replaced the old HP3000 minicomputer.)

I'm going to tell you, with a lot of help from John Wharton (jwharton@netcom .com), what the new ISA might be and how it can work with x86 libraries. John and I first discussed this a year or so ago. When I e-mailed him to verify our conversations, John volunteered considerable additional information, and allowed me to use it in this column.

John has been tracking this for a long while. Before Intel announced the Pentium, he wrote in a March 20, 1991 Microprocessor Report column:

My guess is that the hypothetical 686 would retain a 386/486 compatibility mode, just as those parts support 'virtual 8086' and 286-mode operation. Hopefully, though, the bulk of the 686 designers' creativity and transistor budget would be devoted to a new, unconstrained native execution mode, an architectural framework to end all architectures. The CPU would switch between the compatibility mode and native mode automatically upon entering or leaving the OS.

That column, nearly six years old, suggested that the native mode would provide 64-bit registers and operations throughout, eliminate some of the x86's more arcane addressing modes, and use variable-length instructions that wouldn't be x86 compatible. He also predicted vector math for accelerated graphics (read: MMX operand partitioning), a feature which was already in use on the i860 math accelerator.

John was off by a generation, but that's amazingly close to current predictions of the upcoming Merced. No, he didn't invent this stuff out of thin air. John's a technology analyst with Applications Research, a Palo Alto, California, design consulting firm. He's also the junior engineer who was the architect of the Intel 8051 microcontroller: He defined the programming model, instruction set, addressing modes, and instruction encoding formats. In all its variations, the 8051 family has an installed base of over a billion chips with sales still climbing; more than 200 million units in 1995, according to Dataquest. As a technology analyst, he goes to the industry technical forums and has discussions (on and off the record) with the CPU suppliers' project leaders (and others).

At the 1992 Microprocessor Forum, Intel's Pat Gelsinger asserted that Intel would someday be forced to move the x86 architecture to 64 bits, and when it did, it might introduce other architectural improvements, including (perhaps) a larger register set and more regular instruction encodings.

At the 1993 ISSCC conference, Fred Pollack, the Pentium Pro project manager, confirmed Gelsinger's earlier remarks. (Both comments were in response to John's questions from the floor.)

During a press interview following the 1995 ISSCC conference, Bob Colwell, Pentium Pro's chief architect, pointed out that essentially all of the Pentium Pro die was devoted to fast implementation of generic low-level operations, which are not one-to-one related to the vagaries of the x86 architecture. John said that Colwell asserted that "we could run PowerPC binaries on this chip if we wanted to" by redesigning the instruction decoders and microcode ROMs -- although he quickly clarified that Intel would have no reason to do so.

John is also a lecturer with the Computer Systems Lab at Stanford University. On October 25, 1995, when the Intel/HP deal was widely rumored to be based on VLIW (Very Long Instruction Word), HP's top VLIW researcher, Bob Rau, spoke to John's Stanford class. Afterward, Rau said he had nothing to do with the Intel deal, it was being handled by another HP division, and that none of his fellow VLIW researchers had ever had any dealings with Intel. Scratch VLIW from Merced.

During another Stanford talk toward the end of the 1996 spring semester, Albert Yu, Intel's senior vice president, said there would not be two versions of the upcoming CPU. Instead, the chip Intel sells HP will be identical in all respects to the chip Intel sells to its traditional customer base.

This brings us pretty much up to date with Intel's public pronouncements on its upcoming 64-bit CPU, code-named Merced. To what degree the new native ISA will or will not be related to HP's PA ISA is a closely held secret. So we don't know exactly what the new native mode will be. John pointed out how this can (should?) work: From examining the Pentium Pro microarchitecture, it is clear that the internal implementation has very little to do with the x86 ISA as we know it. The basic philosophy is to:

Prefetch an (architecture-irrelevant) instruction stream from on-chip Icache or external memory.
Decode the (architecture-specific) macroinstruction binaries into a series of (microarchitecture-specific) internal ops.
Schedule, dispatch, execute, and retain the results of the internal ops.

4. Retire the results produced by the internal ops in an order consistent with that of the original x86 instructions.

Except for the instruction decoders and microcode ROMs, all of these steps are essentially independent of the programming model and binary instruction encodings. So by simply incorporating two (or more) sets of decoders, and two (or more) sets of microcode ROMs, the same CPU could execute software for two (or more) ISAs, all at hardware speeds.

Horseback measurements of the Pentium Pro die photo indicate that the x86 instruction decoder, microcode, and microcode sequencer together occupy just 7 percent of the die area. A second decoder and sequencer would thus add 5 percent to the die area, or less, since a RISC-like ISA will be a lot less complicated to decode. Even with the other minor logic adjustments needed here and there to accommodate a new ISA, the die size (and manufacturing cost) would not be heavily impacted.

The Pentium and Pentium Pro already accommodate two (closely related) ISAs: 16-bit real mode and 32-bit protected mode. The operating system controls which ISA is running at a given time. The binaries are not identical (FP binary op codes, for example). In its simplest form, the new native mode is merely a third ISA to be selected by the (new) operating system. But there's a better way, one that doesn't abandon the huge existing x86 code base, especially DLLs and system calls.

The better way would call for defining new flags in the virtual memory paging system page tables; for instance, just as you can now set flags indicating that certain pages contain code, or data, or read-only data, or whatever; there could be new states to indicate whether a particular page contains x86 code or native code. Then, merely dispatching a user task on a different page, or calling a subroutine on one page from a different page, could cause the instruction decode mode to switch automatically as appropriate. Native applications could then "call" an x86 DLL. Similarly, existing x86 apps could potentially be accelerated by replacing a run-time graphics subroutine library with one that runs in native mode. This would allow a smooth transition from x86 code to mixed code to pure native code over a period of years.

Conclusion

John made a very good call six years back on variable-length instructions. The rest of the industry has only recently caught on to the fact that 32-bit fixed-length instructions produce bloated code that, with a sufficiently fast CPU core, actually runs more slowly due to instruction cache inefficiency and bus bandwidth limitations. Almost all of the latest "RISC" embedded CPUs use variable-length instructions for that reason. (The nonembedded RISC chips cannot use variable-length encoding without discarding all their existing software! Existing RISC desktop CPUs, once hailed as the latest and best, are locked into an obsolete, inferior instruction format!)

Look for Merced's native mode to use a 16-bit minimum instruction length, with longer instructions in 16-bit multiples. It takes two 32-bit instructions on a conventional RISC CPU to load a 32-bit constant. A single 48-bit instruction would be more efficient. And John thinks Merced may use an 80-bit instruction format to accommodate 64-bit constants.

This suggests that Merced's native architecture probably won't be binary compatible with Hewlett-Packard's Precision Architecture ISA. While this may be important to HP, it should be irrelevant to you readers. For you, Merced will be like a candy store to a six year old. It has lots of modern goodies: 32 general-purpose 64-bit integer registers, a general-purpose FP register set instead of the x86's FP stack.

Has anybody noticed that the x86 ISA had 85 percent of the desktop market three years ago and has 90 percent today? With Intel adding a modern ISA that's newer and better than its competitors (SPARC through PowerPC), that market share trend will continue, even accelerate. Why would you want to write code that can't run on Merced?

DDJ