COMPUTER SCIENCE AND THE MICROPROCESSOR

The battle for the desktop

Nick Tredennick

Nick did the logic design and microcode for the Motorola 68000 and IBM Micro/370 microprocessors and is the author of Microprocessor Logic Design (Digital Press, 1987). He can be contacted at 1625 Sunset Ridge Road, Los Gatos, CA 95030.


The invention of the integrated circuit in 1959 began a beneficial technology spiral in which, following Moore's Law, the possible number of transistors on an integrated circuit has been doubling every year. Three effects combine to sustain this trend: Chips grow, features shrink, and design techniques improve. The number of transistors on a chip goes up directly with increases in chip area--twice the area permits twice as many transistors. Circuits are formed on a chip by a complex process involving coating, etching, doping, and baking. The size of transistors and wires (features) is determined by the sophistication of the process--the better the process, the smaller the transistors and wires. If the width of a feature (wire or transistor) decreases by a factor of two, the number of transistors in a fixed area increases by a factor of four. Evolving implementation techniques improve the match between circuit requirements and the constraints of the semiconductor technology.
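The arithmetic of these two effects is easy to check; here is a minimal numeric sketch (the starting transistor count is a hypothetical round number, not a figure from any real chip):

```python
# Transistor-count scaling: a toy illustration of the two effects
# described above (chip-area growth and feature shrink).
# The base count of 10,000 is a hypothetical round number.

def transistors(base_count, area_factor, feature_shrink):
    """Transistor count scales linearly with chip area and with the
    square of the feature-size reduction."""
    return base_count * area_factor * feature_shrink ** 2

base = 10_000  # hypothetical starting design

# Twice the area permits twice as many transistors.
assert transistors(base, area_factor=2, feature_shrink=1) == 20_000

# Halving the feature width quadruples the count in a fixed area.
assert transistors(base, area_factor=1, feature_shrink=2) == 40_000

# Both together: an eightfold increase.
assert transistors(base, area_factor=2, feature_shrink=2) == 80_000
```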

Integrated circuits improved the design of electronic systems. Before the integrated circuit, electronic systems were built of resistors, capacitors, inductors, diodes, and transistors (or vacuum tubes). Integrated circuits replaced collections of transistors, diodes, resistors, and capacitors. Texas Instruments and others introduced logic families, like the 74xx series TTL devices. Designers partitioned electronic systems into available logic modules. Systems became smaller, cheaper, and more reliable since each logic module replaced many discrete components. Design with logic modules was popular, and soon module catalogs such as TI's The TTL Data Book for Design Engineers grew to include hundreds of different logic modules.

Custom-chip design was one alternative to designing with standard logic modules. You could partition your system into unique chips and get someone like Intel to build them. The design would be fewer chips than a design using standard modules, so the implementation might be cheaper and more reliable to manufacture. But custom chip designs were expensive, so manufacturing volumes would have to be high to amortize the development cost enough to make custom chips the right choice. In the late '60s, it looked as if desktop calculators might have a high enough volume to justify custom-chip design.

In September 1969 the Japanese company Busicom approached Intel with a proposal for a calculator design using seven custom chips. In October, Intel's Ted Hoff countered with a three-chip design based on the idea of building computer-like chips and programming the logic to perform the desired function. One of these chips, the 4004, became the first commercial microprocessor. The 4-bit 4004 CPU, which processed data a nibble at a time, contained an execution unit (registers, arithmetic unit, and connecting logic) and a control (which interprets instructions and directs actions of the execution unit). The microprocessor is connected to memory (to hold instructions and data) and input/output logic (to communicate with the outside world) to make a working system.

Intel introduced the 4004 commercially in 1971. Since then, the number of transistors in a microprocessor implementation has doubled every two years--so microprocessor implementations aren't keeping pace with Moore's Law. Figure 1 charts the introduction of Intel microprocessors since 1971. Each microprocessor part is plotted by year of introduction and number of transistors per processor, from the introduction of the 2,300-transistor 4004 in 1971 to the introduction of the three-million-transistor Pentium processor in 1993. The solid line plots transistors doubling every two years, starting in 1971.
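The solid line in Figure 1 is just this doubling rule applied from 1971; a quick sketch of the computation (note that by 1993 the trend line runs somewhat above the Pentium's actual three million transistors):

```python
# Transistors-per-chip trend line: doubling every two years,
# starting from the 2,300-transistor 4004 in 1971.

def trend(year, base_year=1971, base_count=2300, doubling_years=2):
    return base_count * 2 ** ((year - base_year) / doubling_years)

print(round(trend(1971)))  # 2300
print(round(trend(1993)))  # 4710400 -- about 4.7 million, a bit above
                           # the Pentium's actual three million
```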

Embedded Control

The first commercially available microprocessor wasn't invented as a natural consequence of evolution in computer design. Instead, it replaced custom logic in what became known as embedded-control applications: any use of a microprocessor other than as the central processing unit (CPU) of a computer system. Microprocessor-based logic was fair competition for logic based on custom-chip designs, but the majority of system designs employed standard TTL modules. If the microprocessor was to be a commercial success, it would have to compete with TTL modules in system designs. Market emphasis in embedded-control applications led to microprocessors designed to meet the requirements of low cost, adequate performance, simple, flexible bus protocols, and few pins.

The microprocessor had to be cheap to compete with the high-volume TTL modules it replaced. Performance of the programmed logic employing the microprocessor had to match the performance of the logic it replaced--not a big challenge. Simple, flexible bus protocols allowed the microprocessor to work with a variety of memory and peripheral chips on a common bus. Few pins meant cheaper packages. Package size was also driven by the desire to match the pin and row spacing of the TTL modules. The first microprocessors were expensive, so their first design wins were probably as alternatives to custom-chip designs rather than replacing standard logic module-based designs. As volumes grew and technology improved, microprocessors got better and cheaper, which opened up more application opportunities. Microprocessor design diversified into microprocessors, which contain just the CPU, and microcontrollers, which include memory or I/O logic (or both) on the same chip with the microprocessor. The embedded-control market grew from essentially zero in 1971 to an expected volume of almost two billion units (worldwide) in 1993.

Embedded-control applications fall into four market segments: zero cost, zero power, zero delay, and zero volume.

The zero-cost segment, to a first approximation, is 100 percent of the embedded-control market. Virtually all embedded-control applications are in high-volume, highly competitive, cost-sensitive consumer appliances: TVs, VCRs, toasters, blenders, washers, dryers, and microwave ovens. Component cost is usually the first and most important consideration in an embedded-control application. In a microwave oven, for example, minimizing component count and component cost is vastly more important than minimizing power dissipation or maximizing performance. What difference does it make whether the microprocessor dissipates 0.1 watt or 10 watts in a 1500-watt microwave oven? The wall outlet looks like an infinite power source to the microprocessor, and the microprocessor's power dissipation is inconsequential compared to the power dissipated by the oven. Furthermore, performance of even a bit-serial processor would be lightning fast compared to the glacial pace of human command inputs to the microwave.

The zero-cost segment, which accounts for almost all unit volume in embedded control, employs 4- and 8-bit microcontrollers. The first commercial 4-bit microprocessor began shipping in 1971. In 1993, shipping volume for 4-bit microcontrollers should exceed 800 million units with an average selling price of just under $1. The first commercial 8-bit microprocessor, the 8008 (also from Intel), began shipping in 1972, just one year after the introduction of the 4004. In 1993, shipping volume for 8-bit microcontrollers is expected to be over 1 billion units, with an average selling price below $4. Even though the 8-bit microprocessor followed the 4-bit microprocessor's introduction by less than a year, it wasn't until 1990 that shipping volumes for 8-bit microprocessors passed those of 4-bit microprocessors. This indicates the importance of low cost and the unimportance of absolute performance in most embedded-control applications. Microprocessor manufacturers competing for shares of the zero-cost segment must have high-volume, low-cost production.

Zero power, the next-largest segment of the embedded-control market, is mostly a special subset of the zero-cost segment: It includes applications for which dissipating zero power is more important than achieving zero cost. Zero power, to a first approximation, represents zero percent of the embedded-control market. Zero-power applications include items such as smoke detectors, remote controllers, and pocket calculators. We'd like to have these devices run entirely on weak ambient light or run for a few years on a single watch battery. Zero-power applications use the smallest, cheapest, slowest microprocessor consistent with the requirements of the application. Since most applications are consumer appliances, cost is still important. For most applications, 4- and 8-bit microprocessors are sufficient, but the emerging personal digital assistants (PDAs) probably require 16- and 32-bit microprocessors. Microprocessor manufacturers competing for shares of the zero-power segment must have efficient designs and good technology as well as high-volume, low-cost production.

Zero delay is the third-largest segment. It includes applications such as scanners, laser printers, and fax machines, for which performance is the most important consideration. For these applications, zero processing delay is more important than achieving zero cost. The market is competitive, so cost is still important. Zero delay is the primary segment for the 16- and 32-bit microprocessors. These are the high-end, embedded-control applications, as reflected by the expected 1993 average selling prices for 16- and 32-bit microcontrollers of just under $10.00 and just under $60.00, respectively. Unit volumes in the zero-cost segment are 20 times the unit volumes in the zero-delay segment, but the substantially higher average selling price of the 16- and 32-bit microcontrollers brings the dollar value of the zero-delay segment to about one-third the value of the zero-cost segment.

I thought I'd covered all the market segments with these three--until I talked to John Wharton. I explained the zero-cost, zero-power, and zero-delay segments to him and asked, "So what do you think?" He immediately replied, "You forgot the zero-volume segment." Indeed I had.

Zero volume is the market segment for applications with (essentially) zero volume, but which have some attraction for the manufacturer other than sales volume and profit. Intel built and delivered the 960MX microprocessor--at the time, Intel's fastest and most complex microprocessor--solely for the YF-22 Advanced Tactical Fighter. Since the crash of the single flying prototype of the YF-22, the volume looks as if it will actually be zero, but Intel could hardly have expected to sell more than a few thousand microprocessors for the YF-22 even in the best of circumstances. The visibility conferred by such a high-profile application made the design win desirable. The zero-volume segment is not sensitive to cost. All microprocessor manufacturers can compete for applications in the zero-volume segment. High-volume, low-cost production is not required.

The technology spiral fed the expansion of the microprocessor market. By any standard, the growth from introduction in 1971 to an expected market of close to two billion microprocessors in 1993 is phenomenal.

Enter the Personal Computer and Computer Architect

By 1974, the microprocessor had gotten cheap and common enough for the invention of microprocessor-based computer systems and the sale of these "personal computers" to individuals. Invention of the PC served to split the microprocessor market into two segments: embedded control and CPU. The two market segments have different requirements: Embedded control wants low cost, while CPUs want high performance. Most microprocessors go into embedded-control applications, but CPU applications have grown from essentially 0 percent of unit volumes in 1974 to an expected value of almost 2 percent in 1993. About 30 million microprocessors should ship as the computer system CPU in 1993. Since embedded-control applications have always represented 98 to 100 percent of unit volumes, manufacturers have traditionally ignored the CPU market segment. Microprocessor designs supported embedded-control requirements for low cost and adequate performance: If they also got used as CPUs, so much the better.

As computers advanced, so did the field of computer science. In the academic world, it progressed from a side interest within mathematics or electrical engineering departments to a separate field in its own right, bringing with it professionals in industry and academia building careers on computer-related topics.

The first computers were built of vacuum tubes and were huge, expensive electromechanical engines. Only a few large companies (like IBM) capable of making large business machines could build these "mainframe" computers. Designers of mainframe instruction sets and microarchitectures were rare and probably thought of themselves as engineers and programmers. After the invention of the transistor and the integrated circuit, computers got smaller and cheaper. More companies could build these smaller, cheaper "minicomputers." Designers of minicomputers were still fairly rare and also probably saw themselves as engineers and programmers. After the invention of the microprocessor, any company capable of building integrated circuits could design a computer instruction set. The number of instruction-set and microarchitecture designers reached critical mass: The designers began to think of themselves as "computer architects," and computer architecture became its own profession.

Invention of the computer architect brought with it an avalanche of experiments and publications as career computer researchers competed for the best positions in universities and industrial research organizations. But the study of computers is a weak science. When quantitative results had to be produced with pencil and paper, researchers spent considerable effort deciding which results were worth computing. The computer itself is the enemy of experiments in computer science: It produces quantitative results so readily that little thought need go into which results are worth producing. Also, the field of computer science is developing under intense commercial pressure, which further weakens experimental procedure. Researchers may have a financial interest in a point of view. There are few independent investigators.

RISC

In the late '70s and early '80s, investigators at universities and industrial research organizations noticed the mismatch between the implementation of microprocessors and the requirements of a CPU. Consequently, they invented RISC (reduced instruction set computers). Manufacturers were busy building microprocessors to compete with standard logic modules for embedded-control applications, since to a first approximation, embedded-control applications represented 100 percent of the market for microprocessors. (CPU applications were essentially 0 percent.) Microprocessors were designed for low-cost, adequate performance (relative to standard logic-module solutions), few pins, and leisurely bus protocols. Low cost was the most important feature.

But low cost isn't a major objective for the microprocessor in a computer system. The cost of the power supply, display, hard disk, printer, keyboard, chassis, and other components swamps the cost of the CPU. Performance is the major objective for a microprocessor used as a CPU. Designing for best absolute performance is so at odds with designing for lowest cost that researchers investigating the design of microprocessors for CPU applications found room for improvement over microprocessors designed for embedded control. Early papers proposing RISC cited no fewer than 16 factors contributing to enormous gains in reported performance, among them: simplified instruction set, overlapped register windows, large register set, simplified addressing, high-level language user interface, advanced compiler technology, delayed branch, advanced procedure calls, single-cycle execution, simplified implementation, quicker time to market, better design procedures, better design tools, on-chip cache, wider external buses, and load/store architecture. Wider, faster external buses, which increased bandwidth to memory by a factor of six to ten, probably made the biggest contribution to reported performance improvements.

Twelve years of subsequent investigation have not clarified or isolated the contribution of any of these changes to increases in reported performance. Instead, a pseudotechnical debate of epic proportions ensued, pitting RISC (all that is good) against CISC (complex instruction set computers, or all that is bad). The real issue had nothing to do with RISC or CISC. The real issue has always been microprocessors with different design objectives. Manufacturers supported designs for volume shipment--embedded-control applications. RISC advocates supported designs for CPU applications. Microprocessors for embedded control emphasized low cost. Microprocessors for CPU applications emphasized performance.

The Battle for the Desktop

In a coincidence with unfortunate consequences, IBM introduced its personal computer in 1981--just as researchers were inventing RISC. Sales of the IBM PC took off, forever locking RISC out of the volume market in personal computers. The invention of RISC merely split the CPU market segment into PCs and workstations, in the same way the invention of the PC had split the microprocessor market into CPUs and embedded controllers. In 1993, unit volumes will be approximately 2 billion embedded controllers, 30 million personal computer CPUs, and half a million workstation CPUs.

Even though CPU applications represent less than 2 percent of microprocessor shipments, CPU designs are the glamour topic in microprocessor design. High-end microprocessor designs are the focus of conferences, trade press, technical publications, popular interest, and research. Ever since IBM selected the lowly 8088 as the CPU in its PC, it has been an intolerable affront to computer architects that the 8088 and its descendants can't be displaced on the desktop by any of the plethora of clearly superior RISC architectures. Since the advent of RISC, computer architects have produced many microprocessors more suited to CPU applications than the 80x86 architecture. Every microprocessor architecture announced since the invention of the acronym has been labeled RISC. Applications for RISC CPUs have grown from zero in 1981 to domination of the half-million-unit workstation market in 1993. In the meantime, the PC market has grown to about 30 million units a year. Although there are other personal computers, IBM-compatible PCs have about 90 percent of the market and Apple about 10 percent with the 680x0-based Macintosh. The Motorola 680x0 family is another old CISC architecture, so the PC market belongs exclusively to the old, ugly CISC architectures.

Won't the superior performance of RISC-based workstations help them displace CISC-based PCs on the desktop? RISC advocates and the trade press have been predicting for years that sales of RISC-based computers would take off very soon and begin eating into IBM-compatible PC volumes. Folklore has emerged to explain why workstations will soon begin to displace PCs. The biggest advantages for workstations are in performance, price/performance, hardware, and new developments. The biggest advantages for PCs are availability, applications, and the installed base. Folklore suggests that cost and price are about the same for workstations and PCs.

Workstation Advantages

The killer advantage workstations are thought to have is in absolute performance or in price/performance. "Once users get their hands on $&your.favorite.workstation and see its blazing speed, sales of $&your.favorite.workstation will surge as users switch from the PC." That's the theory, anyway. I think it's wrong.

Leaving aside the question of whether there's a significant difference in performance, price/performance and absolute price have more influence on the choice of a PC or workstation than absolute performance. Comparing the workstation with the best price/performance to a fully configured, top-of-the-line, list-price IBM or Compaq system is a mistake. It may be that workstations have a giant advantage in price/performance at that workstation price (and, perhaps, almost every other workstation price). I don't know, and I don't think it matters. The only price point that matters is the lowest workstation price, because it's the only price at which workstations and PCs can compete for the same customer. The relevant comparison is price/performance of the cheapest workstation compared to a similarly priced IBM-compatible PC clone. At the lowest workstation price, PCs have better price/performance.

Workstations are supposed to have an enormous advantage in hardware: architecture, implementation, technology, and time to market. The story goes something like this:

All workstations use RISC microprocessors. RISC has inherent architecture advantages over CISC. RISC implementations are cheaper and faster, and they get to market quicker. Since product cycles are shorter, new ideas can be implemented sooner and more product generations can be introduced in a fixed time. Also, shorter design times mean RISC uses better technology (or gets equivalent technology to the field sooner).

Advantages in architecture are unproven and probably swamped by the effects of operating systems, compilers, assemblers, languages, and system design. The latest high-end microprocessors contain up to 3 million transistors. They were all--RISC and CISC--complicated and difficult to design. Transistor budgets for next-generation microprocessors will be 3 to 10 million; they will use similar technology and have similar implementations. They'll all be complicated and difficult to design. In a ten-million-transistor design, instruction-set architecture offers no significant shortcut to implementation.

PC Advantages

The killer advantages for the PC are software applications and the 100-million-unit installed base. Applications for the PC are plentiful and cheap. The price/performance advantage of a workstation would have to be gargantuan to overcome the inertia of the installed base. PC owners can count on finding cheap applications to suit their needs, and they can count on cheap, regular hardware, software, and operating-system upgrades. PCs are also readily available. You can get whatever PC configuration you want today at your local computer store for a competitive price. If you're willing to wait a day or so, you can get the same PC for even less through mail-order. Availability, applications, cost, and the installed base--that's a lot to overcome.

Cost and Price

Folklore says cost and price for workstations can be about the same as for PCs. Cost is how much the manufacturer pays to make a workstation or PC. Price is how much you and I have to pay to get one. In an ideal manufacturer's market, price might be five or six times cost. In an ideal consumer market, price might be only slightly above cost. The PC market is a consumer market. Workstations will have to have consumer pricing to compete for PC customers. Let's assume PCs and the low-end workstations being designed to compete with them have similar features and use common components (power supplies, glue logic, hard disks, floppy drives, displays, keyboards, and the like). Assume differences in volume discounts for PC and workstation manufacturers are small (so workstation and PC manufacturers are paying about the same for their components). But there's a difference in CPUs between PCs and workstations. PCs are based (mostly) on 80x86 microprocessors, and workstations are based on RISC microprocessors. Is there a difference in cost or price for the CPU?

It costs about $65.00 to build a current high-end microprocessor. It costs about $600.00 to process a six-inch wafer that will yield 12 to 14 good chips; that's about $45.00 per working microprocessor. Add $10.00 to package the chip and $10.00 more to test it, and you get $65.00 per CPU. All the high-end microprocessors are about the same size, so there shouldn't be a significant difference in cost to build. But cost to build isn't the whole story--figuring out what to build and drawing up the plans (designing the microprocessor) can be significant. Development cost for a high-end microprocessor runs between $30 and $100 million. If you spend $50 million designing a microprocessor and sell only 50, you'll have to charge more than $1 million for each just to recover your costs. If you can sell 50 million parts, you only have to charge $66.00 to recover your costs. Figure 2 plots cost per part against parts shipped for design costs of $30 to $100 million (assuming $65.00 per CPU in fixed manufacturing cost).
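The amortization arithmetic behind Figure 2 reduces to a one-line formula; a sketch using the numbers above:

```python
# Cost per part = fixed manufacturing cost + amortized development cost,
# the relationship plotted in Figure 2.

def cost_per_part(units_shipped, development_cost, manufacturing_cost=65.00):
    return manufacturing_cost + development_cost / units_shipped

# $50 million in development spread over 50 parts: over $1 million each.
print(cost_per_part(50, 50e6))    # 1000065.0

# The same $50 million spread over 50 million parts: $66.00 each.
print(cost_per_part(50e6, 50e6))  # 66.0
```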

The workstation market is fragmented. SPARC from Sun, MIPS from Silicon Graphics, POWER from IBM, PA (Precision Architecture) from Hewlett Packard, 881x0 from Motorola, Alpha from DEC, and Clipper from Intergraph all compete for shares of the half-million-unit workstation market. There are about 20 manufacturers making high-end RISC microprocessors for workstations. Currently, only Intel makes high-end microprocessors for IBM-compatible PCs. Costs are clearly not equal. If you have to do a new design every two years to keep up with everyone else, you'll be getting a share of either a million workstations (the number shipped in two years) or 60 million PCs. If you're one of the 20 manufacturers making a microprocessor for workstations, your share of the market is likely to put you well over on the left side of the curve in Figure 2. Amortizing the development cost over your share of the workstation market will drive the price of the chip--manufacturing cost won't be a big factor. If you're Intel, you'll be operating well off the right side of the chart--amortized development cost won't be a big factor in setting the price of the chip. Intel is currently shipping four to five million high-end microprocessors per quarter. RISC microprocessors cost a lot because high development cost must be amortized over low workstation shipping volumes.

Costs for workstations and PCs are not equal. Workstations cost more because development cost for the CPU must be amortized over significantly lower volumes. Lower manufacturing volumes also lead to smaller discounts on other component purchases. The more sophisticated workstation systems (cache, fast memory, special I/O) inherently cost more than the relatively unsophisticated PC. Workstations have traditionally been sold through more expensive distribution channels than the PC, which also contributes to higher cost.

Without even considering software, which is probably the most important determinant, I think the battle for the desktop is over. The PC is the winner--it is grabbing applications from the workstation market. The workstation market has been maintaining volume by pressing for ever higher performance and capturing new applications. We've reached the steady state. Workstations will continue to push for higher performance and specialized markets where they can command the higher prices they require. The PC will continue to chase the workstations out of their old market segments.

More Details.

Software

PC software is cheaper than workstation software. PC software is the stuff everyone needs: word processors, editors, communications, spreadsheets. Workstation software is specialized software designed for a particular market: chip design, visualization, timing analysis. Software is in the same situation as the CPU. For PC software, high volumes mean low amortized development cost, so manuals and distribution probably dominate the cost. For workstation software, low volumes and complex applications mean high amortized development cost.

The major applications for the PC have already been written. If you're a PC software developer, do you have a chance of developing a new word processor and capturing the market? Not likely. So what's Microsoft doing with all their programmers? Over the past ten years, they've been working frantically on major high-volume applications for the PC. Now they're done. About all they need is two programmers and 30 documentation people per application to crank out the annual updates to Word, Excel, PowerPoint, and so on. So what are the other 6000 employees doing? When the applications that sell ten million copies are done, they'll work on applications that sell one million copies. When the applications that sell one million copies are done, they'll work on applications that sell 100,000 copies. When the applications that sell 100,000 copies are done, Microsoft will start laying off programmers. It's my guess that programmers at Microsoft are now working on applications which will sell between 100,000 and one million copies. Engineering-design applications, the traditional market for workstations, will be converted to the PC by Microsoft programmers just before the big layoffs begin.

Conclusion

The battle for the desktop has reached a steady state: The PC is eating into traditional workstation applications at about the same rate that workstations, with ever-higher performance CPUs and more complex systems, are finding new applications. This, however, leads to problems for the RISC CPU manufacturers because workstation volumes are too low for the chip manufacturers to recover their development costs. The technology spiral is driving them to more complex CPUs with correspondingly higher development cost while their market is staying about the same size. This has caused the RISC CPU manufacturers to rediscover the embedded-control market. RISC CPU manufacturers have begun a major marketing campaign to capture embedded-control applications. After competing for a few years for shares of a half-million-unit market, it must look as if there's room for everyone in a market of two billion. There isn't.

Most of the market volume for embedded control belongs to the 4- and 8-bit microprocessors. That's the zero-cost segment. The zero-power and zero-delay segments are also cost sensitive, and belong to companies with high-volume, low-cost manufacturing. The zero-delay, embedded-control segment uses some high-end microprocessors, but the average selling price of a 32-bit CPU for embedded control is $65.00--about the same as the manufacturing cost for a RISC CPU. There's no way to recover development cost if the base price is the same as the manufacturing cost. That leaves the zero-volume segment. RISC CPUs are capturing high-profile applications in the zero-volume segment. The problem with the zero-volume segment is, as its name implies, that there's not enough volume to recover development cost.

PC sales are stalled at 30 million a year, workstation sales are stalled at half a million a year, and the ancient CISC CPUs own the embedded-control market. The news is all bad for the makers of RISC CPUs. That's too bad, because it's fate and has nothing to do with the intrinsic value of the product (not that the intrinsic value is well known, given the state of computer science--but that's another story). It's all tied up in the technology spiral, the invention of the microprocessor, and the timing of the invention of the personal computer, the computer architect, and RISC. If you're a RISC advocate and this news has depressed you, here's something to make you feel better: Perhaps the technology spiral will come back to bite even the CISC microprocessors. The microprocessor was invented for embedded control: It displaced modules with a programmed logic solution. Perhaps reconfigurable or even self-configuring logic will displace the microprocessor in embedded-control applications. After all, the microprocessor is only an interim solution. Shouldn't those applications have self-configuring logic modules?

Microprocessor Implementations

First-generation microprocessors, typified by the Motorola 6800, didn't use pipelining. They didn't have to be fast for simple embedded-control applications and, since integrated-circuit technology was new, chip area for transistors was expensive. Early microprocessors used simple control and a simple interface to external memory. The microprocessor fetched the instruction, decoded it, and then executed it. When the microprocessor finished the first instruction, it started on the second, and so on--no pipelining, a simple controller. This execution model is shown in Figure 3.

The bottleneck in this simple, nonpipelined design is the controller. The external bus is only used every third cycle for the fetch, unless the execute cycle reads or writes an operand. The instruction decoder is only used every third cycle. And the execution unit is only used every third cycle. The pipeline can't stall since there isn't one.
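
The one-instruction-at-a-time model is easy to make concrete. Here's a toy cycle-by-cycle sketch (my own illustration, not from the article; it assumes each phase takes exactly one cycle) that shows why every resource sits idle two cycles out of three:

```python
# Toy model of a nonpipelined CPU: fetch, decode, and execute run
# strictly in sequence, one phase per cycle, with no overlap.
def run_nonpipelined(n_instructions):
    trace = []  # (cycle, phase, instruction)
    cycle = 0
    for i in range(1, n_instructions + 1):
        for phase in ("fetch", "decode", "execute"):
            cycle += 1
            trace.append((cycle, phase, i))
    return trace

trace = run_nonpipelined(3)
# The bus, decoder, and execution unit are each busy only one cycle
# in three: instruction i is fetched on cycle 3*i - 2.
fetch_cycles = [c for c, p, i in trace if p == "fetch"]
print(fetch_cycles)  # -> [1, 4, 7]
```

Three instructions take nine cycles; the decoder and execution unit are each used on exactly three of them.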

In the late '70s, the next-generation microprocessors, typified by Motorola's 68000, used a simple, three-stage pipeline called instruction overlap. As the first instruction is executed, the second instruction is decoded, and the third is fetched. This execution model is shown in Figure 4.

Instruction overlap makes better use of microprocessor resources than a nonpipelined version. The external bus, the instruction decoder, and the execution unit are all used on every cycle, unless there's a conflict for resources. It's possible for the processor to complete an instruction on every cycle. Fetch takes one cycle, decode takes one, and execute may take one to many cycles. If execute takes more than one cycle, the following instructions are held in the fetch and decode stages until the current instruction finishes execution. Only one instruction at a time is allowed to begin execution, so there are no operand conflicts. The execute stage and the fetch stage may contend for the external bus. In an add-memory-to-register instruction, for example, the execute stage will compute the operand address, read the memory operand, add the register and memory operands, and store the result in the register. If the memory-to-register add is instruction 1 in Figure 4, its execute phase would extend from cycle 3 through cycle 6, instruction 2 would be held in Decode, and instruction 3 would be held in Fetch. Instructions 2, 3, and 4 would begin Execute, Decode, and Fetch, respectively, in cycle 7.
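
The stall behavior of the add-memory-to-register example can be sketched with a few lines of code (a hypothetical illustration of the Figure 4 timing, assuming only the execute stage can take multiple cycles and ignoring bus contention):

```python
# Toy three-stage overlap pipeline (fetch/decode/execute).
# exec_latencies[i-1] = cycles instruction i spends in Execute.
# Only one instruction executes at a time; the fetch and decode
# stages hold their instructions while Execute is busy.
def overlap_pipeline(exec_latencies):
    exec_start = {}
    cycle_ready = 3  # instruction 1 reaches Execute in cycle 3
    for i, lat in enumerate(exec_latencies, start=1):
        exec_start[i] = cycle_ready
        cycle_ready += lat  # the next instruction waits this long
    return exec_start

# Instruction 1 is the memory-to-register add, executing in
# cycles 3-6; instructions 2 and 3 are single-cycle.
starts = overlap_pipeline([4, 1, 1])
print(starts)  # -> {1: 3, 2: 7, 3: 8}
```

With all single-cycle executes, instruction i begins Execute in cycle i + 2, one completion per cycle, just as the text describes.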

The first commercial RISC microprocessors introduced an extended pipeline. The extended pipeline split the execute phase into address calculation, operand access, execute, and write phases. Additional pipeline stages removed pipeline delays caused by resource conflicts such as contention for access to external memory. The extended-pipeline execution model is shown in Figure 5.

The extended pipeline still completes at most one instruction every cycle, but with the additional stages there are fewer delays due to resource contention. But there are costs. The four instructions past the decode stage have potential operand conflicts to resolve. Additional pipeline stages require additional resources to avoid conflicts. You can estimate resources by looking at Cycle 6 in the figure. Since Cycle 6 represents the theoretical steady-state instruction flow through the microprocessor, it should be able to accommodate any combination of six instructions without resource conflicts. The memory system, for example, must have at least two read ports and one write port (for Instruction 1 write, Instruction 3 read, and Instruction 6 fetch) to avoid access conflicts. There must also be more ports to the register file (for address, read, and write) and at least two arithmetic units (one for address calculation, and one for execute).
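
The port-counting argument can be made mechanical. This sketch (my own, assuming the worst case in which every in-flight instruction uses its stage's resource) tallies the demands the six instructions place on memory and the arithmetic units in the steady-state cycle:

```python
# In steady-state cycle 6 of the six-stage pipeline (fetch, decode,
# address, operand, execute, write), each stage holds one instruction.
stages_in_cycle_6 = {
    1: "write",    # instruction 1 writes its result to memory
    2: "execute",
    3: "operand",  # instruction 3 reads a memory operand
    4: "address",
    5: "decode",
    6: "fetch",    # instruction 6's fetch also reads memory
}
mem_reads  = sum(1 for s in stages_in_cycle_6.values() if s in ("operand", "fetch"))
mem_writes = sum(1 for s in stages_in_cycle_6.values() if s == "write")
alus       = sum(1 for s in stages_in_cycle_6.values() if s in ("address", "execute"))
print(mem_reads, mem_writes, alus)  # -> 2 1 2
```

Two memory read ports, one memory write port, and two arithmetic units, matching the counts in the text.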

While the Motorola 68040 uses a six-stage pipeline, there's nothing magical about it. Intel's 80486 and MIPS' R3000 are five-stage pipelines, and the newer MIPS R4000 is an eight-stage pipeline. (MIPS uses the pompous term "superpipeline" to describe their eight-stage pipeline.) The original Fujitsu SPARC gate array and the first custom Cypress SPARC use a four-stage pipeline. Increasing the number of stages in the pipeline reduces resource conflicts and may allow a faster clock. Throughput increases, but these pipelines still only complete one instruction per cycle, since they only issue one instruction per cycle.
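
The reason deeper single-issue pipelines don't raise peak throughput is simple arithmetic: with no stalls, n instructions take k + n - 1 cycles through a k-stage pipeline, so the per-instruction cost approaches one cycle for any depth. A two-line sketch (my own illustration) makes the point:

```python
# n instructions through a k-stage single-issue pipeline, no stalls:
# k cycles to fill the pipe, then one completion per cycle.
def cycles(k, n):
    return k + n - 1

# Depths mentioned in the text: SPARC (4), 80486/R3000 (5),
# 68040 (6), R4000 (8). Per-instruction cost converges on 1.0.
for k in (4, 5, 6, 8):
    print(k, cycles(k, 1000) / 1000)  # -> roughly 1.0 for every k
```

What a deeper pipeline buys is a shorter critical path per stage, and hence a faster clock, not more instructions per clock.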

A superscalar pipeline attempts to issue more than one instruction per clock. Intel's 80960CA, announced in 1989, was the first microprocessor with a superscalar pipeline. Figure 6 shows a six-stage pipeline capable of issuing two instructions per cycle.

Instructions 1 and 2 start in the same cycle, instructions 3 and 4 start in the same cycle, and so on. If we started three instructions per cycle, we could potentially complete three instructions per cycle. But look at the loaded pipeline represented by cycle 6 (as it was in the extended pipeline). The microprocessor is processing 12 instructions in each cycle. There's enormous potential for operand and address conflicts. The register file and memory system need at least four read ports and two write ports each. And there must be at least four arithmetic units (two for address calculation, and two for execute). Hardware resources for a superscalar pipeline are substantial and grow as more instructions can be issued simultaneously. One way to limit required resources is to restrict combinations of instructions permitted simultaneous issue. DEC's new 21064 Alpha microprocessor, for example, uses a seven-stage pipeline and can issue two instructions per cycle with some restrictions on pairs that can issue simultaneously. HP's PA 7100 can issue a floating-point instruction and an integer instruction simultaneously, but cannot issue two integer instructions during the same cycle. TI's SuperSPARC and Motorola's 88110 allow simultaneous issue of two integer instructions. Intel's Pentium and Motorola's 68060 will also sport superscalar pipelines.
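
Pairing restrictions like the PA 7100's are just a greedy check in the issue logic. This sketch models that rule as described above (one integer plus one floating-point instruction may issue together, never two integer ops); the instruction stream itself is made up for illustration:

```python
# Toy dual-issue logic with a PA 7100-style pairing restriction:
# an "int" and an "fp" instruction may issue in the same cycle,
# but two instructions of the same kind must issue one at a time.
def issue_pairs(kinds):
    groups, i = [], 0
    while i < len(kinds):
        if i + 1 < len(kinds) and {kinds[i], kinds[i + 1]} == {"int", "fp"}:
            groups.append((kinds[i], kinds[i + 1]))  # dual issue
            i += 2
        else:
            groups.append((kinds[i],))  # single issue
            i += 1
    return groups

stream = ["int", "fp", "int", "int", "fp", "fp"]
print(issue_pairs(stream))
# -> [('int', 'fp'), ('int',), ('int', 'fp'), ('fp',)]
```

Six instructions issue in four cycles instead of three: every restriction that saves ports and arithmetic units gives back some of the superscalar speedup.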

--N.T.


Copyright © 1993, Dr. Dobb's Journal