THE AM29000 AS AN EMBEDDED CONTROLLER

Programming a RISC processor

Bob Lowell

Bob is an engineer for Doctor Design and can be contacted at 5415 Oberlin Drive, San Diego, CA 92121.


One of the more interesting trends in embedded systems development is the proliferation of Reduced Instruction Set Computer (RISC) processors. This trend is especially evident in the laser printer market where there is a need to control the increasingly complex graphics interpretation tasks required by today's page description languages.

RISC chips, available in volume for under $50, boost graphics processing performance up to 20 times that of the Motorola 68000 used in the Hewlett-Packard LaserJet and Apple LaserWriter printers. Consequently, graphics that recently took five minutes or more to print on inexpensive printers can now print at an effectively instantaneous rate, limited only by the print engine speed. Advanced Micro Devices' Am29000 is a good chip for applications such as these because it can achieve very high performance without greatly increasing the hardware component cost of the printer controller board. That cost should be judged relative to boards currently designed to run Adobe Postscript or Hewlett-Packard's PCL5, the most popular page description languages today. The specific topics I discuss in this article relate to printer controller board design; however, the general concepts apply to just about any 29000-based embedded system project.

Unique Architectural Features of the Am29000

While many available RISC processors make somewhat similar feature and performance claims, the Am29000 has several architectural features that distinguish it from the others. The Am29000 designers focused their implementation on the paramount factor for peak performance in a RISC microprocessor: a high-bandwidth memory interface. The unique architectural features of the Am29000 that help achieve this high memory bandwidth, or transfer rate, are the separate instruction (I) and data operand (D) buses and the branch target cache. Like many of the other RISC chips available today, the Am29000 can execute all of its instructions in a single clock. Single-cycle execution is difficult to achieve, though, because it necessitates a single-cycle memory subsystem. Until recently, memory subsystems designed to provide data at this high rate were very expensive, typically employing static RAM chips. And some RISC processors other than the Am29000 are not tuned for inexpensive (slow) memories; they end up running below their peak rate of one instruction per clock because instructions and data cannot be transferred rapidly enough.

The Am29000's memory interface achieves the highest bandwidth possible from inexpensive DRAM and EPROM/mask ROM memories, making the execution rate of one instruction per clock realizable. The separate instruction and data bus interface on the Am29000, commonly referred to as the "three-bus interface," should be taken into account by any engineer evaluating a new microprocessor such as the Am29000 for use in a laser printer controller design.

Practical RISC Memory Design Techniques for a Laser Printer

RISC microprocessors (and some new CISC chips, too) have a bus interface designed to transfer one 32-bit data word every clock, which is necessary to maintain the peak execution rate of one instruction per clock. This is accomplished by sending out a single address for at least four sequential data words. After the first access to memory completes, optimized hardware on the board uses the stored address to access the subsequent data words at one clock per word, a technique commonly known as "bursting." The memory design techniques that allow single-clock accesses to sequential memory addresses depend on the type of memory used. Dynamic RAM can use page-mode cycles for sequential accesses at single-clock rates; at higher frequencies, "interleaving" is necessary to maintain this transfer rate. (Interleaving involves having multiple banks of memory with separate control signals. While one memory bank is accessed, one or more other banks are being prepared for subsequent accesses.) For ROM/EPROM memory, interleaving is always necessary to support single-clock accesses at all but the lowest operating frequencies.
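The bank-selection rule that makes interleaving work can be sketched in a few lines of C. This is an illustrative model only (the names NUM_BANKS and bank_of are mine, not part of any Am29000 design): sequential 32-bit word addresses rotate through the banks, so while one bank is driving the bus, the others are already in their access period.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS 4  /* four-way interleave; illustrative */

/* Map a byte address to its memory bank: sequential 32-bit words land in
   successive banks, so each bank only has to deliver every fourth word. */
static unsigned bank_of(uint32_t byte_addr)
{
    uint32_t word_addr = byte_addr >> 2;  /* 32-bit (4-byte) words */
    return word_addr % NUM_BANKS;
}
```

Because consecutive word addresses map to different banks, each bank gets NUM_BANKS clocks to complete its access while still sustaining one word per clock on the bus.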

Designing a single-cycle access instruction memory, generally with interleaved ROM or EPROM, is more practical than it may sound. Most Postscript or PCL5 controller boards use a large amount of nonvolatile memory, typically 1 to 2 megabytes, to store fonts and program routines. If the controller board manufacturer uses EPROMs for this memory, it turns out to be the highest-cost item on the board. If there are 16 1-Mbit EPROMs on a board, they can be interleaved as four memory banks, enabling a 2-Mbyte instruction/font memory with single-clock access capability in a cost-effective way. The EPROMs are already part of the design, whether you support single-cycle access or not (open up an HP Series III or a Postscript printer if you're not convinced); the logic to implement interleaving is minuscule in cost by comparison. The argument that interleaving adds little to the controller board cost still holds when you go to high-volume production: the EPROMs are replaced by lower-cost mask ROMs, and the discrete logic that controls the interleaving is usually put into an Application-Specific Integrated Circuit (ASIC). At worst, a few extra memory chips and the board space they require are what it costs to interleave. The Am29000 was designed with this in mind. No high-performance laser printer controller design using the Am29000 should exclude it.
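The arithmetic behind the 16-EPROM example can be checked with a short C sketch. The byte-wide (128K x 8) chip organization is an assumption for illustration; the constant and function names are mine.

```c
#include <assert.h>

/* The example configuration: 16 x 1-Mbit EPROMs on a 32-bit instruction bus. */
enum {
    CHIP_BITS  = 1024 * 1024,  /* 1 Mbit per EPROM                */
    CHIP_WIDTH = 8,            /* assumed x8 (byte-wide) parts    */
    BUS_WIDTH  = 32,           /* Am29000 instruction bus width   */
    NUM_CHIPS  = 16
};

static int  chips_per_bank(void) { return BUS_WIDTH / CHIP_WIDTH; }       /* chips forming one 32-bit bank */
static int  num_banks(void)      { return NUM_CHIPS / chips_per_bank(); } /* banks available to interleave */
static long total_bytes(void)    { return (long)NUM_CHIPS * CHIP_BITS / 8; }
```

With byte-wide parts, four chips form each 32-bit bank, the 16 chips yield four interleavable banks, and the total capacity works out to the 2 Mbytes the article describes.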

Cache Replacement Algorithm: Don't Cache What You Don't Need

Like other RISC microprocessors, the Am29000 has a cache memory that stores instructions for full-speed execution. This memory is only 512 bytes, enough space for just 128 instructions, so the choice of which instructions to cache must be made carefully for highest overall performance. The logic that determines when an instruction fetched from external memory will be stored in the cache for possible future use implements the "cache replacement algorithm." The Am29000's cache replacement algorithm is vastly different from those of other RISC microprocessors. So are the assumptions behind it.

Most RISC microprocessors cache all instruction accesses from main memory any time they're made, as long as they're not already in the cache. The assumption behind this is that the memory interface is too slow to keep up with the processor's peak execution rate in most cases. This is fine for tight loops that fit inside the small cache, but not very good as repetitive program segments become larger and less localized in memory. Cached instruction sequences are often overwritten by other instruction sequences before the processor gets around to needing them again.

The assumption behind the Am29000 cache replacement algorithm is that the memory interface has high enough bandwidth to maintain peak execution rates as long as program execution is sequential. Once the Am29000 has established an instruction burst, hardware on the board latches and increments the address and controls all bus cycles for the subsequent instruction fetches. Data operand accesses to external memory don't slow these fetches down because they occur on separate buses: data accesses use the address bus, which is freed up after the instruction burst is established, and the D bus, while instruction accesses occur over the local code address bus, driven by the latch/burst address counter, and the I bus. Because code and data travel over separate buses during sequential execution, the instruction prefetcher can run at full rate, and the peak execution rate is fairly easy to maintain, assuming each instruction access occurs in a single clock. There is no point in caching these instructions because the processor can read them the clock before they're needed, so instructions that can be accessed in a burst read are not cached.

When program execution branches, the Am29000 cannot maintain the peak transfer rate of one instruction per clock. This is because it must drive a new address out onto the bus, which must then be decoded, latched, and applied to the memory chips for their access period before an instruction can be read. There are several other more complex latency factors which may increase the amount of time it takes to fetch the first instruction after execution has branched. The Am29000 only caches the first four instructions fetched when a branch to an uncached address is taken. These four instructions are called a "branch target." Subsequent branches to a cached branch target will start fetching the instruction immediately past the branch target as the first instruction in the branch target executes from cache. This gives the board ample time to provide the Am29000 the first instruction it needs without slowing it down. What the cache stores are branch targets. Only 32 branch targets can reside in the cache at once. This is typically many more branch targets than would reside in a cache of the same size with a conventional replacement algorithm. AMD claims a cache hit rate (percentage of time needed instructions are executed from cache) of 65 percent for most software applications. This doesn't mean that 65 percent of the instructions execute at peak rate, as it would on most processors. It means that 65 percent of branches execute at peak rate. Sequential (nonbranch) instructions should execute very close to 100 percent of the peak rate.
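A simplified software model makes the branch-target-cache behavior concrete. The direct-mapped organization, entry layout, and function names below are illustrative assumptions of mine, not AMD's actual implementation; the model captures only the essential idea that a hit supplies the first four instructions from the cache while a miss caches them for next time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BTC_ENTRIES  32  /* 32 branch targets, per the article */
#define TARGET_WORDS 4   /* first four instructions at each target */

typedef struct {
    int      valid;
    uint32_t tag;                 /* branch target byte address */
    uint32_t insn[TARGET_WORDS];  /* the cached instructions */
} btc_entry;

static btc_entry btc[BTC_ENTRIES];

/* Direct-mapped index from the word address (an assumed mapping). */
static unsigned btc_index(uint32_t addr) { return (addr >> 2) % BTC_ENTRIES; }

/* Returns 1 on a hit (first four insns come from the cache); on a miss,
   caches this branch target from 'memory' and returns 0. */
static int btc_access(uint32_t branch_addr, const uint32_t *memory)
{
    btc_entry *e = &btc[btc_index(branch_addr)];
    if (e->valid && e->tag == branch_addr)
        return 1;
    e->valid = 1;
    e->tag   = branch_addr;
    memcpy(e->insn, memory + (branch_addr >> 2), sizeof e->insn);
    return 0;
}
```

The first branch to an address misses and fills an entry; any later branch to the same address hits, giving the memory system four clocks of cover to deliver the fifth instruction.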

Memory Latency: How Long Should it Be?

The cache replacement algorithm and the optimal instruction memory latency are intimately related. While the instruction memory should be capable of delivering one instruction per clock once a burst has been started, the initial instruction fetch cycle after a branch will take longer to complete, as described earlier. The branch target cache stores the first four instructions at the branch address, so four clocks sounds like the right amount of time for the maximum initial cycle length. It is, but it's a bit more complicated than that.

The Am29000 uses a technique called "delayed branching" to reduce the time it spends frozen doing nothing while it fetches an instruction at a branch target that isn't in the cache when the branch executes. A delayed branch actually executes the instruction immediately following the branch before transferring control to the branch address. The instruction following the branch is referred to as being in the "delay slot." While the Am29000 is executing the instruction in the delay slot, it starts the bus cycle to read in the instruction at the branch address. It would seem that the memory subsystem has the delay-slot clock plus the four clocks the Am29000 takes executing the instructions in the branch target cache to return the first instruction without slowing the system down. That would be five clocks. But to maintain peak execution rate, the first instruction fetched from memory must be decoded by the Am29000 in the same clock that the last (fourth) instruction is being executed from the branch target cache. So the initial access time, or latency, the instruction memory is designed to have should be held to four clocks.
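The clock budget in this argument reduces to simple arithmetic, sketched below in C. The constant and function names are mine; the numbers (one delay-slot clock, four cached-target clocks, one clock of decode overlap) come straight from the reasoning above.

```c
#include <assert.h>

#define DELAY_SLOT_CLOCKS    1  /* one instruction executes in the delay slot      */
#define CACHED_TARGET_CLOCKS 4  /* four instructions execute from the branch cache */
#define DECODE_OVERLAP       1  /* fetched insn must decode during the 4th clock   */

/* Maximum initial instruction-memory latency that causes no stall. */
static int max_initial_latency(void)
{
    return DELAY_SLOT_CLOCKS + CACHED_TARGET_CLOCKS - DECODE_OVERLAP;
}

/* Stall clocks added by a memory with the given initial latency. */
static int branch_stall(int initial_latency_clocks)
{
    int over = initial_latency_clocks - max_initial_latency();
    return over > 0 ? over : 0;
}
```

A four-clock instruction memory breaks even; each additional clock of initial latency costs one stall clock on every branch-target-cache hit.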

Don't Waste the Bus Cycle's First Clock Just Decoding Addresses

Many high-speed microprocessors, RISC or CISC, take nearly a full processor clock to drive a valid address onto the bus in the worst case. The control signals to the memory chips often cannot be asserted until the second clock of a bus cycle, which means that nonbursted accesses to memory, typically data accesses, may end up being one clock longer than necessary. The Am29000 address drivers help alleviate this: a 20-MHz Am29000 guarantees a valid address 16 nanoseconds from the beginning of the bus cycle, tested with an 80-picofarad load. To achieve this short delay, the Am29000 address driver circuits use a strong driver that's on during phase 1 of the first clock of a bus cycle in parallel with a weak driver that's on for the rest of the cycle.

It's common to start the memory cycle asynchronously when the system clock (SYSCLK) falls, starting phase 2. If this is done with a high-speed logic device, almost all of the second phase of the bus cycle's first clock can be included in the memory access cycle. By not waiting until the beginning of the second clock to start the memory cycle, you save 25 nanoseconds on a bus cycle to memory at 20 MHz. Those 25 nanoseconds may allow a memory cycle to run with one less processor waitstate; or they may allow using slower memory chips without adding processor waitstates.
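The 25-nanosecond figure is just half a clock period, as this small C sketch confirms (the function name is illustrative):

```c
#include <assert.h>

/* Nanoseconds reclaimed by starting the memory cycle at the SYSCLK falling
   edge (start of phase 2) instead of waiting for the next rising edge. */
static int phase2_gain_ns(int clock_mhz)
{
    int period_ns = 1000 / clock_mhz;  /* 50 ns at 20 MHz */
    return period_ns / 2;              /* one phase = half a clock */
}
```

At 20 MHz the gain is 25 ns; at 25 MHz it shrinks to 20 ns, which is one reason the timing margins get harder at higher clock rates.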

The ideal memory latency for data accesses wasn't mentioned in the previous discussion on memory latency because it's much more complicated than instruction memory latency. It's clearly less than four clocks. It depends on factors such as how well the compiler can schedule the Am29000 load and store instructions relative to when they're needed in execution. It's best to keep data memory latency as short as possible. The short address valid delay helps in this respect. The other significant advantage the Am29000 has is, again, the separate bus for data operand accesses. Consider for a moment the higher latency an internal data request from the processor would see if: 1. Instruction and data accesses shared the same bus to external memory; and 2. An instruction access was already in process when the data request was posted. To be fair, the competing chips use on-chip write buffering and load scheduling in the compiler to offset these problems somewhat, but a dedicated data operand bus is a better solution. Whether the graphics data to be processed is in an intermediate or final (bitmap) form, a low-latency data bus increases performance. Certain routines can be programmed for burst access if it's deemed optimal for their data. The Am29000 supports that, too.

Difficulty #1: Correctly Handling the Bus Invalid Signal

The Am29000 asserts a signal called Bus Invalid (BINV*) to indicate that a bus cycle it has started must be aborted. If you were designing something such as a workstation, an assertion of BINV might mean something else, but in a laser printer it can't usefully mean anything else. This signal comes out in phase 2 of the first clock of the bus cycle. For performance reasons, I advocate (and we have proven in a number of our designs) starting a cycle at the beginning of phase 2 of this first clock. This means the memory control signals (say, RAS) will already be asserted by the time BINV can reliably be sensed, so the designer must gracefully abort the cycle. If it's a cycle to DRAM, RAS cannot be pulled away immediately. Instead, the Bus Invalid signal must be latched so that the control logic remembers that this is an aborted cycle. The latched BINV signal is used to disable CAS, and RAS terminates when it would in a normal DRAM cycle without BINV asserted. For EPROMs, the cycle can be aborted as soon as convenient. It's often convenient to latch a signal like BINV in the PAL that generates the memory control signals or the state information for generating them. Unfortunately, the long setup times of inexpensive 15-nanosecond clocked PALs preclude their use for state machines in this application at 25-MHz operation and above.
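The DRAM abort sequence can be modeled as a tiny state machine in C. The struct and function names are mine, and the real BINV* is active low (modeled here as a simple flag); the model shows only the essential idea: latch the abort, gate CAS off, and let RAS time out normally.

```c
#include <assert.h>

/* Minimal model of a DRAM control sequence that can be aborted by BINV. */
typedef struct {
    int ras;           /* Row Address Strobe asserted    */
    int cas;           /* Column Address Strobe asserted */
    int binv_latched;  /* remembered abort               */
} dram_ctl;

/* Cycle start: RAS asserts before BINV can be reliably sensed. */
static void start_cycle(dram_ctl *c) { c->ras = 1; c->cas = 0; c->binv_latched = 0; }

/* Sample BINV once it's valid; latch it so later states remember the abort. */
static void sample_binv(dram_ctl *c, int binv_asserted)
{
    if (binv_asserted)
        c->binv_latched = 1;
}

/* CAS phase: the latched abort gates CAS off; RAS is left to time out. */
static void cas_phase(dram_ctl *c) { c->cas = !c->binv_latched; }

/* Normal end of cycle: RAS terminates when it would have anyway. */
static void end_cycle(dram_ctl *c) { c->ras = 0; c->cas = 0; }
```

Because CAS never asserts in an aborted cycle, no DRAM write or destructive read occurs, yet RAS timing (and therefore precharge) is undisturbed.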

Difficulty #2: The I/D Bus Contention Problem

In most cases, a practical design ends up having "swap buffers" connecting the I and D buses so that code can run in data memory and data accesses can be made from code memory. This runs contrary to the design philosophy of the Am29000 and reduces its performance by precluding parallel operation when the swap buffers are on. There are a number of situations where it's useful, however. Most printers take font or emulation cartridges in the same socket. The 2-Mbyte code/font memory advocated earlier in this article assumes data accesses will be made to the interleaved code memory to get outline font information. The fonts cannot practically be stored in chips separate from the code. In fact, they take up more space than the code. But because both PCL5 and Postscript employ font caching in DRAM, font data accesses to the code memory will be limited, and there is little performance penalty for putting them together.

Hooking the two buses up does cause problems, though. If the Am29000 is reading data on the D bus and it needs to write data out on the same bus, it waits one clock to start the write cycle. This gives the memory subsystem on the board time to turn off drivers from the read cycle and avoid contention. But if the Am29000 is executing out of data memory, say DRAM, and it also tries to write to that memory over the D bus, which is possible, a contention will result unless an extra set of buffers (beside the swap buffers) is used. This set of buffers goes between the D bus pins on the Am29000 and the D bus connections on the rest of the board.
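The Am29000's one-clock read-to-write turnaround on the D bus can be stated as a trivial C predicate (the names are illustrative; the rule itself is from the behavior described above):

```c
#include <assert.h>

typedef enum { BUS_IDLE, BUS_READ, BUS_WRITE } bus_op;

/* Dead clocks the Am29000 inserts between consecutive D-bus operations:
   one clock after a read before a write, so board drivers can turn off. */
static int turnaround_clocks(bus_op prev, bus_op next)
{
    return (prev == BUS_READ && next == BUS_WRITE) ? 1 : 0;
}
```

Note that this built-in turnaround only protects the D bus itself; it does not cover the execute-from-data-memory case described above, which is why the extra isolation buffers are needed.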


Copyright © 1992, Dr. Dobb's Journal