Hal is a hardware engineer who sometimes programs. He is the former editor of DTACK and can be contacted through the DDJ offices.
I am not here to tell you how wonderful the massively parallel personal computer system you'll buy next year will be. Or how wonderful your next personal symmetrical multiprocessor will be. The thing is, your operating system and application programs--compilers, editors, spreadsheets, CAD, schematic entry, communications--are all scalar programs, and scalar programs don't work well on parallel processors. So let's get real. Let's discuss what improvements we can expect in our personal uniprocessor computer systems and what factors control those improvements.
We've benefitted from long-term hardware trends with well-established doubling rates. The predominant trend has been CPU performance doubling every two years since electronic computers were invented nearly a half century ago. The second most important trend (at least for you OOP programmers who have an eye on the new operating systems with voracious memory appetites) is memory capacity, which doubles every 1.5 years.
Unfortunately, nothing doubles forever; both of those trends have ended. See Figure 1: Those of you who are waiting for 64 Mbytes of DRAM (the minimum requirement of NT 2.0) to become cheap will have to wait a while longer. Intel's Gordon Moore thinks memory price/bit will start to go up as DRAM chip density increases.
Mainframe CPU performance, however, dropped off that trend years ago. In this article, I'll examine where CPU performance increases have been coming from and why (and when) microprocessors will also drop off that trend.
I once had a small company, Digital Acoustics, that made 6502-based products. When the 68000 came along, we started making attached processors, mostly for the Apple II. Based on considerable experience with these two processors, I (back then) estimated that an Apple II (or Commodore PET) had about 1/20th the performance of a 12.5-MHz 68000. The 12.5-MHz 68000 had about the same integer performance as the VAX 11/780, introduced in 1978. The 11/780 was nearly contemporaneous with the Apple II and PET, which appeared late in 1977.
The Apple II and PET rated 0.05 VUP (Vax Unit of Performance). Sixteen years later, a 66-MHz Pentium rates 64.5 VUPs, a 1290-fold increase over typical PCs in January 1978. (Contemporary 8080 machines had about the same performance as the Apple II.) The 6502 was introduced in August 1975 and the Pentium in March 1993--17.58 years later. Using just these two CPUs as data points,
we can calculate the doubling time as: dbltime = 17.58 years * LOG(2)/LOG(1290) = 1.70 years.
Hmm. That's faster than the commonly accepted CPU doubling time of two years. Suppose we compare 1978's 1-VUP VAX 11/780 (not a microprocessor) with 1993's 110.9-VUP DEC 200-MHz Alpha box: dbltime=15 years * LOG(2)/LOG(110.9)=2.21 years.
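The two doubling-time calculations above are easy to reproduce. Here is a minimal Python sketch that uses only the figures quoted in the text; the function and variable names are mine, chosen for illustration.

```python
import math

def doubling_time(years, ratio):
    """Years per doubling, given a total performance ratio over a span of years."""
    return years * math.log(2) / math.log(ratio)

# 6502 (August 1975, 0.05 VUP) to 66-MHz Pentium (March 1993, 64.5 VUPs)
pc_ratio = 64.5 / 0.05
pc_dbl = doubling_time(17.58, pc_ratio)

# VAX 11/780 (1978, 1 VUP) to 200-MHz Alpha (1993, 110.9 VUPs)
mini_dbl = doubling_time(15, 110.9)

print(f"PC: ratio {pc_ratio:.0f}, doubling time {pc_dbl:.2f} years")
print(f"Minicomputer: doubling time {mini_dbl:.2f} years")
```

The base of the logarithm doesn't matter, since only the ratio of the two logs is used.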
The fastest minicomputer in 1978 was 20 times faster than a PC; today's fastest minicomputer (a server, not a workstation) isn't even twice as fast as a Pentium-based PC. Despite incessant, contrary propaganda, PC performance has been rapidly gaining on RISC-based workstations; see Figure 7.
Industry legend would have you believe that a particular fast microprocessor owes its speed to the incredibly elegant and sophisticated arrangement of registers, ALUs (arithmetic logic units), and lately, caches that its rocket-scientist designers have built in. Actually, all of these techniques, including multiple-issue per clock, were pioneered on mainframes.
The term "superscalar" applies to a processor that can issue more than one instruction per clock. Purists insist that the term should be restricted to those CPUs, such as Pentium and SuperSparc, that have duplicate execution units (usually integer), and so can issue two integer instructions in the same clock. The PowerPC, HP PA7100, and i860 can issue more than one instruction of differing types in the same clock. Since most personal-computer application programs have a mix of 85 percent integer, 15 percent branch, and 0 percent floating-point instructions, the ability to issue multiple integer instructions in a single clock is highly desirable.
The anonymous workers who've advanced the state of the semiconductor lithographic art over the years are more important than the microprocessor-designer rocket scientists.
The job of a CPU's ALU is to examine a lot of data, manipulate some of it, and change a little of it. To do this, the ALU must have access to the data, which is located in DRAM. The ALU's ability to do its job is limited by the data path connecting the integer core to the DRAM, which is the infamous von Neumann bottleneck.
In the Apple II, the data path was 1 byte wide at a clock rate of 1 MHz, for a data-bus bandwidth of 1 Mbyte/second. Since a 6502 system seemingly comprised a simple CPU/ALU directly connected to the DRAM memory system, it was easy to spot and measure the bottleneck. In fact, the X and Y registers helped decouple the ALU from the DRAM.
The 66-MHz Pentium is 1290 times faster than the 6502-based Apple II. A faster data bus makes most of this performance increase possible. A 64-bit (8-byte) data path, plus a peak bus burst rate of 66 MHz, provides a 528-times-faster data bus at the Pentium data I/O pins. That accounts for most, but not all, of that 1290-times increase.
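The bus arithmetic can be checked with a few lines of Python. This is only a sketch of peak rates (bus width times clock), using the numbers given above; the helper name is an assumption of mine.

```python
def peak_bandwidth_mbytes_per_s(width_bytes, clock_mhz):
    """Peak data-bus bandwidth: bytes per transfer times million transfers per second."""
    return width_bytes * clock_mhz

apple_ii = peak_bandwidth_mbytes_per_s(1, 1)    # 1 byte wide at 1 MHz
pentium = peak_bandwidth_mbytes_per_s(8, 66)    # 64 bits wide, 66-MHz burst rate

bus_gain = pentium / apple_ii
# The part of the 1290-times speedup that the wider, faster bus doesn't cover
remaining = 1290 / bus_gain

print(f"Bus gain: {bus_gain:.0f}x; left to explain: {remaining:.2f}x")
```

The leftover factor of roughly 2.4 is what has to come from somewhere other than the data pins.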
To make higher performance possible, instruction traffic is offloaded from the DRAM data bus by caches. The on-chip primary cache reduces the Pentium data-bus traffic significantly, and the external secondary cache reduces the DRAM data-bus traffic even further (the exact reduction is program and cache-size dependent). So, DRAM data-bus bandwidth doesn't limit the performance of the 66-MHz Pentium. In fact, some Pentium systems have been introduced which use only a 32-bit data path from the secondary cache to the DRAM.
(Intel has announced upcoming 100-MHz DX2 and DX3 versions of the Pentium; running the secondary cache two or three times slower than the on-chip clock may impact the performance of those Pentium versions. The PC marketplace is marvelously efficient. We'll quickly learn what memory configuration works best at lowest cost. Witness the 486DX2/66's triumph over the 486DX/50.)
Figure 2 was presented by IBM's George Marr at CompCon '77. (I chose to delete the 1960--70 portion of the graph because the pre-Intel 4004 era isn't important here.) I've extended the curve, which represents the best commercial practice rather than the state of the art. It's proven remarkably accurate for mass-production microprocessors. (The "boutique" micro market, with production of dozens per month, doesn't represent commercial practice.)
This trend to smaller minimum features is due to constant improvements by anonymous workers in the semiconductor production-equipment industry. The benefits are available to anyone with deep pockets. This trend, by itself, is entirely responsible for increasing clock speeds. See Figure 3 (originally published in 1989 by Gelsinger et al). If Marr's (1977) feature size drops by half every 7.16 years, then the size of a given CMOS transistor will be four times smaller, and a given charge/discharge current will change the output from a 1 to a 0 four times faster, or twice as fast in 7.16/2=3.58 years. Remarkably, Gelsinger's (1989) microprocessor clock doubles every 3.58 years! (I added the grid and the computed doubling rates to Figure 3.)
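The scaling argument above can be written out as a small Python sketch. It assumes, as the text does, that switching speed is inversely proportional to transistor area, and that area goes as the square of the feature size. The function names are illustrative only.

```python
FEATURE_HALVING_YEARS = 7.16  # Marr's trend, per the text

def relative_feature_size(years):
    """Feature size relative to year zero, halving every 7.16 years."""
    return 0.5 ** (years / FEATURE_HALVING_YEARS)

def relative_clock(years):
    """Clock rate relative to year zero, assuming speed ~ 1 / feature_size**2."""
    return 1.0 / relative_feature_size(years) ** 2

clock_doubling_years = FEATURE_HALVING_YEARS / 2

print(f"Clock gain after 7.16 years: {relative_clock(7.16):.2f}x")
print(f"Clock gain after {clock_doubling_years:.2f} years: "
      f"{relative_clock(clock_doubling_years):.2f}x")
```

Under these assumptions the clock quadruples in one feature-halving period, which is the same as doubling every 3.58 years.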
Figure 4 shows the trend to larger die sizes. This is due mostly to the increasingly pure silicon provided to semiconductor makers. Again, anonymous workers provide this benefit to every semiconductor producer.
Smaller feature sizes and larger dies are together responsible for the increase in the number of transistors per microprocessor die. See Figure 5, originally published in 1986 by Intel's Myers et al. I've added recent microprocessors, slightly adjusted the trend lines to accommodate the new data, and provided computed doubling rates (both trend lines originally doubled every two years).
Figure 6 shows the trend of data-bus width. The 90.5-bit datum shown for the MIPS R4000 needs some explanation. That chip has a 64-bit data-bus path and a separate 128-bit data bus to the external secondary cache. I decided the effective bus width was the geometric mean of those two external buses. You disagree? Okay, you tell me the width of the R4000's data bus.
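For anyone who wants to check the 90.5-bit figure, here is the geometric mean in Python. The variable names are mine.

```python
import math

system_bus_bits = 64     # R4000 external data bus
cache_bus_bits = 128     # R4000 bus to the external secondary cache

# Geometric mean of the two external bus widths
effective_width = math.sqrt(system_bus_bits * cache_bus_bits)

print(f"Effective bus width: {effective_width:.1f} bits")
```

On a logarithmic width axis, the geometric mean lands exactly halfway between 64 and 128 bits, which is why it is a natural choice for a plot like Figure 6.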
Wide buses are so expensive that nobody uses them unless they're absolutely necessary. The fact that buses have been doubling in width every five years is a clear indication that ever-wider buses have been constantly needed.
More recent CPUs, such as the superscalar Pentium and SuperSparc, have very complex internal bus structures. Even the 486 gets its instructions from a 128-bit circular queue, not the internal cache. It's hard to identify the exact point that limits data transfer to and from the DRAM, thereby defining the effective data-bus width.
The easiest and cheapest way to run a microprocessor faster is to use more current to charge and discharge its parasitic capacities ("capacity" as in resistance and inductance). To get more current, lower-resistivity silicon is used, which simply means doping the silicon more heavily with impurities during the fabrication process. PC microprocessors once ran cool, but then the 486DX2/66 consumed 6 watts, and the 66-MHz Pentium 13 watts, 16 watts peak. The Pentium is Intel's last cool desktop engine. Look for the "Hexium" (aka P6) to consume 25 watts or more. This is a result of Intel's continuing drive to narrow the performance margin between PCs and high-performance server/workstations.
This is yet another (rather obvious) performance trick pioneered on mainframes. As I write this in late September, the most recent issue of IEEE Micro and the two latest issues of Electronic Engineering Times have articles on microprocessor thermal management, including liquid-cooling techniques for Pentiums.
Microprocessor electrical-power consumption is still far less than that consumed by your desktop CRT display. What implications does this have for future high-performance portable computers? Good question.
The availability of millions of transistors has made it possible for microprocessor designers to incorporate mainframe techniques developed a dozen years earlier on a single microprocessor chip. For a given transistor count (such as Pentium's 3.1 million) the job of chip designers is to choose an optimum mix of features that will extract the maximum performance from the new chip. Designers are constrained by the underlying instruction-set architecture; even the designers of DEC's Alpha and the new PowerPC are on their third or fourth iteration. (Ah, for the luxury of a blank piece of paper!)
Again, the job of modern microprocessor designers is less one of innovation and more one of optimization. Less risk-taking and more mistake-avoidance. They're all utterly dependent on the anonymous folk who drive the improvements in process technology, improvements that are equally available to all vendors. This hasn't always been clearly understood.
Which may explain why Sun Microsystems' Bill Joy, the noted UNIX software expert, announced in 1985 that Sun's SPARC offerings were going to double in performance every year for ten years; see Figure 7. The UNIX workstation industry, implying that Joy's prediction applied to all RISC CPUs, joyfully (pun intended) embraced that prediction. It never happened, of course. What's really hilarious is that until recently the RISC fanatics actually pretended that RISC performance was doubling every year, despite the complete absence of supporting data!
It's ironic that current industry perception of Sun's SPARC offerings is that the SPARC has fallen off the performance curve set by DEC's Alpha, HP's Snake, IBM's RS/6000, and MIPS's R4400. The two CPUs most demeaned by the RISC camp, Intel's x86 engines and Sun's SPARC series, are the two that lead their respective arenas in unit sales. Is there a lesson here?
Figure 7 also shows the microprocessor performance trend published in 1986 by Intel's Myers et al. Myers specifically noted the unavailability of a good performance metric at that time, which inhibited an accurate projection. What he needed was SPECint92.
Minicomputer CPUs were once built on large printed-circuit boards, using many integrated circuits. With the passage of time, the number of ICs has steadily dropped. With DEC's Alpha and HP's PA7100, the number is one--the CPU is a single chip. These are minicomputers: See how they are designed, manufactured, and marketed. They're also microprocessors by any reasonable definition. This would be worrisome if CPU performance of minicomputers and microprocessor-based PCs wasn't converging.
Soon, all minicomputers (or server/workstations) will use microprocessor CPUs. Trends of minicomputer and PC CPU performance will be the same for the same reasons, except that production quantities of minicomputers will continue to be much lower.
In order of importance, CPU performance is governed by feature size, data-bus width, and silicon purity (die size). We're about to hit a fundamental limit to data-bus width, one that was reached years ago in the mainframe world. After this event, CPU performance will be governed by feature size and silicon purity alone. I think we'll reach this limit around 1996, after which performance will double every three years, as shown in Figure 7.
All scalar computer programs, the ones we use on our personal computers, have branches every six instructions on average. Some instructions also have data dependencies, meaning an instruction cannot be executed until the results of previous instruction(s) are available. For this reason, there's a limit to the data-bus width that can be used, even if an infinitely wide data bus were available.
The next generation of Intel's and Sun Microsystems' desktop engines will have four integer execution units (Pentium and SuperSparc both have two). This is the upper limit of what is useful. Already, two-pipe CPUs issue only 1.5 instructions per clock even using an optimized compiler tuned to the innards of the CPU. Why put more than four integer pipes on a chip when the excess over four can only be used once in a blue moon? The first of the four-pipe CPUs will probably ship by 1996. It'll be fascinating to learn--and a real concern for all of us--how many instructions per clock those CPUs will be able to issue, on average.
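To see why extra pipes stop paying off, consider a toy model in Python. It is a deliberately crude sketch, not a model of any real CPU: instructions issue strictly in order, a group ends at the first instruction that needs a result produced in the same clock, and a branch always ends the group. The six-instruction program, the register names, and the function name are all invented for illustration.

```python
def cycles_to_issue(program, width):
    """Count clocks needed to issue a program in order, up to `width` per clock.

    Each instruction is (dest_register, source_registers, is_branch).
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()   # registers written by instructions issued this clock
        issued = 0
        while i < len(program) and issued < width:
            dest, sources, is_branch = program[i]
            if written & set(sources):
                break     # data dependency: wait for the next clock
            if dest is not None:
                written.add(dest)
            i += 1
            issued += 1
            if is_branch:
                break     # nothing issues behind a branch in this model
        cycles += 1
    return cycles

# One branch in six instructions, as in typical scalar code
program = [
    ("r1", ("r0",), False),        # r1 = load [r0]
    ("r2", ("r1",), False),        # r2 = r1 + 1       (needs r1)
    ("r3", ("r0",), False),        # r3 = r0 + 4
    ("r4", ("r3",), False),        # r4 = load [r3]    (needs r3)
    ("r5", ("r2", "r4"), False),   # r5 = r2 + r4      (needs r2 and r4)
    (None, ("r5",), True),         # branch on r5      (needs r5)
]

for width in (1, 2, 4, 8):
    cycles = cycles_to_issue(program, width)
    print(f"{width} pipes: {cycles} clocks, {len(program) / cycles:.2f} per clock")
```

With this particular program, going from one pipe to two saves a single clock, and going to four or eight saves nothing more. Real CPUs use tricks such as reordering and branch prediction that this model ignores, so treat the numbers as an illustration of the dependency problem only.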
I've bounced this idea off a couple of rocket scientists, er, microprocessor designers. (David Ditzel, Sun Microsystems' SPARC architecture maven, has a business card listing his job description as "Rocket Scientist.") They immediately began to speculate about how to handle the branch problem. I had to remind both of them that data dependencies were a serious problem when you want to issue a bunch of instructions all at the same time.
How'd you like to become famous? A celebrity, with a statue erected in your honor in the virtual-reality park of your choice? All you have to do is figure out how the scalar programs we use in our PCs can be efficiently used in a superscalar CPU which issues eight or more instructions at a time.
We're due to hit the wall on feature size in about ten years. It's not that we can't make chips with feature sizes under 0.25 micron, it's just that we can't, yet, make 30 million such chips a year. The known techniques for making such devices are horribly expensive for mass production. (Will handcraftsmen ultimately triumph over the automated production lines?) Besides, a feature that's 0.25 microns wide is only 700 silicon atoms wide. Who knows, efficient operating systems and efficient application programs may make a comeback. Soon, programmers won't be able to depend on the next-generation microprocessor to run their absurdities. But then, I'm prejudiced.
The best modern CPU benchmark is SPECint92, measured on computer systems (not CPUs) running compiled, public-domain application programs under UNIX. This benchmark is normalized to unity for the 11/780. An associated benchmark is SPECfp92, which is normalized to unity for the expensive FPA (Floating Point Accelerator) stunt box which could optionally be purchased with the 11/780. This article will concentrate on integer performance relative to the 11/780, called VUP (Vax Unit of Performance). If UNIX can be run, this is equal to SPECint92.
Memory capacity doubles every 1.5 years.
CPU performance doubles every 2 years.
Feature size halves every 7 years.
Data-bus width doubles every 5 years.
DRAM chip speed doubles every 7 years.
Microprocessors Hit the Performance Wall (Again)
Nick is chief scientist at Altera and can be contacted at 2610 Orchard Parkway, San Jose, CA 95134 or as nickt@altera.com.
Whap! Microprocessors just hit the performance wall. Their designers have done and tried everything. The latest n-way superscalar microprocessors have large on-chip, nonblocking, critical-word-first, write-back caches, register renaming, out-of-order execution, speculative prefetch, branch prediction, branch folding, on-the-fly decoding, operand forwarding, buffered writes, multiple execution units, and duck feathers. They have everything! We're finally there, we've hit the wall. This time the fundamental limit is the inherent parallelism in the instruction stream itself. You can't improve performance by issuing 12 instructions per clock if the inherent parallelism in the instruction stream is only four.
Whenever I look at a new microprocessor, I'm invariably impressed: The new design always has better performance than I thought possible. I look at the new design and think: "This time, microprocessors have hit the performance wall. There's no way to improve this design significantly, because there's no way to get past the $&whatever.it.is.this.time performance barrier." This has happened every couple of years since the original MC68000 design--and this year is no exception. Barriers in the past may have been pins, lead inductance, bus protocol, the critical path in the controller, or the critical path in the execution unit. This time it is the ultimate barrier: the inherent parallelism in the instruction stream. Or is it?
If parallelism in the instruction stream is the problem, let's quit using instructions. This may not be as silly as it sounds. The proof in the pudding is that several accelerator cards for the Macintosh improve graphics performance by intercepting QuickDraw commands and executing them in hardware.
Recent microprocessors include multiple-integer units, special-branch units, and separate floating-point units on the same chip. In the coming generations, we can add more execution units. It wouldn't do to build a special execution unit for each anticipated application--there are just too many. Suppose we add a large reconfigurable-logic unit (RLU), an array of logic functions with programmable interconnections. Each connection to or from a logic function is controlled by a memory bit which can be written by the CPU. Rather than running an MPEG subroutine, the CPU simply configures the RLU as an MPEG encoder (or decoder) by writing to the connection memory. Then the CPU routes the data through the hardware MPEG encoder. Need JPEG? Reconfigure the RLU. Need a special data filter? Reconfigure the RLU. Doing logic simulation? Build the logic in the RLU and run the test vectors through it. When the CPU intercepts a call to a subroutine for which there's a hardware algorithm, it pages the configuration from a ROM or disk to the RLU's configuration memory and passes the data or pointers to the data to the RLU.
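The dispatch step described above might look something like the following Python sketch. Everything here is hypothetical: the class, the function names, and the two "libraries" are invented for illustration, and a Python callable stands in for the connection-memory bits that a real RLU would load.

```python
class ReconfigurableLogicUnit:
    """Hypothetical stand-in for an RLU that holds one configuration at a time."""

    def __init__(self):
        self.loaded = None
        self.reconfigurations = 0
        self._function = None

    def configure(self, name, function):
        # Real hardware would page configuration bits in from ROM or disk;
        # here we just remember a callable, and skip the reload if the
        # requested configuration is already in place.
        if self.loaded != name:
            self.loaded = name
            self._function = function
            self.reconfigurations += 1

    def run(self, data):
        return self._function(data)


def call(name, data, rlu, hardware_library, software_library):
    """Intercept a subroutine call and use the RLU when a configuration exists."""
    if name in hardware_library:
        rlu.configure(name, hardware_library[name])
        return rlu.run(data)
    return software_library[name](data)   # fall back to ordinary code


# Made-up example routines
hardware_library = {"invert": lambda pixels: [255 - p for p in pixels]}
software_library = {"total": sum}

rlu = ReconfigurableLogicUnit()
print(call("invert", [0, 100, 255], rlu, hardware_library, software_library))
print(call("invert", [10, 20], rlu, hardware_library, software_library))
print(call("total", [1, 2, 3], rlu, hardware_library, software_library))
print("Reconfigurations:", rlu.reconfigurations)
```

Counting reconfigurations matters because loading a configuration is not free. A scheme like this pays off only when the data routed through each configuration is large compared with the cost of loading it.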
There's a lot of work to do to make this happen, but the payoff could be enormous. Logic simulation, for example, might be sped up by a thousand times or so. We haven't hit the last wall. Something like the RLU is on the other side of it. And when we get there, we'll look back and think it was obvious.
Copyright © 1994, Dr. Dobb's Journal
Nick Tredennick
The high-end workstation/server market is serviced by highly skilled craftsmen who produce computer systems by the dozens which are, in fact, faster than Pentium systems. In this market, price is no object--don't expect much change from $100,000 for your lovingly hand-tooled 200-MHz Alpha server.
The personal-computer market is serviced by numerous modern, high-speed automated production lines that produce millions of computer systems annually (nearly 30 million 486 systems in 1993). Price is very important. As I write this, I can drive down the street and buy a complete 486DX2/66-based no-name clone for $1288--two floppies, a 200-Mbyte hard drive, 4-Mbyte DRAM, VLB motherboard, 128K secondary cache, 14-inch color Super-VGA monitor. Workstation folk incessantly claim they can beat such a system on price/performance. They're wrong.
A parallel-processing computer can be regarded as one big, crude, superscalar CPU, with N integer pipes that execute instructions in the same clock. But these pipes are in separate ICs, so hardware logic can't test for data dependencies and branches. You really don't want to run scalar code on a parallel processor.
Figure 1: Historically, the predominant trends have been the doubling of CPU performance every 2 years and of memory capacity every 1.5 years.
Figure 2: The minimum feature size of microprocessors halves every 7.16 years. (Source: IBM's George Marr, CompCon '77.)
Figure 3: Microprocessor clock rate doubles every 3.58 years. (Source: Intel's Gelsinger et al.)
Figure 4: Microprocessor and DRAM die size. Die size doubles every 5 years.
Figure 5: Transistors per die double every 2 years. (Source: Intel's Myers et al.)
Figure 6: Data-bus width doubles every 5 years.
Figure 7: CPU performance trends.