Richard, an engineer in Motorola's microcontroller division, can be contacted at Soja_Richard_LOCAL_at_DEVTOOLS @devmail.sps.mot.com.
From automotive engines and machine tools to robotics and stepper motors, a variety of embedded-control applications require high-resolution timing capabilities. Consequently, devices such as Motorola's Time Processor Unit (TPU), Hitachi's Intelligent Sub-Processor (ISP) programmable timer, and Texas Instruments' Programmable Acquisition and Control Timer (PACT), are becoming increasingly common for high-end control processors.
Motorola's TPU, available as an on-chip peripheral on a range of MC68HC16 and MC6833X microcontrollers, is an intelligent module that executes timing functions from its on-board microprogrammed ROM. It can therefore be viewed as a special-purpose coprocessor geared to performing operations on time events. It can run independent of the host CPU, but in practice, the TPU would be initialized by a CPU module, then allowed to run without intervention except under specific predetermined conditions, such as error handling or status updating. In this way, complex timing tasks can be executed by the TPU with minimal or even zero CPU overhead, making it a powerful tool for designing timing-oriented embedded systems.
In this article, I'll examine the TPU and how it is used in a typical timing-related application--an automotive engine-management system.
Of the more than two billion microcontroller units (MCUs) produced in 1995, more than 80 percent were used in embedded-control applications. The automotive industry, in particular, uses MCUs for such diverse tasks as automated braking systems, power-train control, instrumentation, airbag deployment, and navigation.
To reduce pollution from vehicle-exhaust emissions, the auto industry developed MCU-based, electronically controlled engine-management systems (EMS). In terms of bandwidth requirements, engine management is one of the most demanding tasks for auto electronics. However, most 8-bit (and even some 16-bit) MCUs don't have the bandwidth to support the multitude of complex tasks that satisfy exhaust-emission requirements, while still providing better driveability, improved fuel economy, diagnostic information for easier maintenance, and safety and monitoring features. Engine management is essentially the closed-loop control of the combustion process in each cylinder. A number of engine-related parameters must be monitored and controlled. Sensors measure air and coolant temperatures, air flow, throttle position, crankshaft position and speed, and engine knock. Electronically actuated valves are used to control fuel injection, airflow, and exhaust-gas recirculation, while power devices that drive transformer coils control the firing of spark plugs to ignite the air/fuel mixture. Because of stringent emissions requirements, the precision with which fuel and spark must be delivered has become crucial.
During the intake stroke of the basic four-stroke cycle of the internal combustion engine, the piston falls in the cylinder head and a precisely metered amount of fuel is sucked in along with a quantity of air. The ratio between air and fuel is kept as close as possible to the stoichiometric value of 14.7:1 to ensure correct and durable catalytic-converter operation. During the compression stroke, the piston rises and the fuel/air mixture is compressed until a precise position is reached. At that point, the mixture is ignited by a spark delivered by open circuiting a precharged coil to which a spark plug is connected. The spark point is critical. If it occurs too early, knocking will occur. Knock can damage the engine by overstressing the cylinder and piston during the compression cycle. If the spark occurs too late, engine power is reduced and exhaust emissions can dramatically increase. The precision of fuel delivery is becoming much more critical, too, as new exhaust legislation comes into force.
One of the biggest problems associated with spark and fuel delivery is engine dynamics. Typically, the EMS must operate over an engine speed of 30 to 8000 RPM--an almost 300:1 range. In addition, spark positioning must be accurate to better than 0.5 crankshaft degrees. This translates to a timing accuracy of 10 microseconds at 8000 RPM. Moreover, the system must cope with engine-speed changes of around 10,000 RPM/second and respond to asynchronous demands for changes in both fuel amounts and spark positioning.
Figure 1 is a simplified model of the partitioning among different functional elements in a typical fuel and spark-delivery system. The application software executes the high-level strategy that calculates and supplies output data to the timing system, based on the current operating conditions of the engine. This strategy should be independent of, and asynchronous to, fuel- and spark-delivery functions. In a single-processor implementation, this can be partially achieved with a combination of time- and position-based hardware and a fast, interrupt-driven process. The bandwidth assigned to the interrupt-driven tasks primarily depends on the complexity of the hardware. Simple hardware, such as real-time match-and-capture registers and a free-running counter, result in heavy CPU loading. More complex hardware to support both time- and angle-based events reduces the CPU overhead, but at the expense of increased die size. (There is even a patented solution describing a hardware implementation of the entire fuel and timing system.)
Removing CPU overhead associated with handling real-time interrupts is desirable--and not just from the raw-performance perspective. This technique makes modeling, simulation, and verification of the application strategy simpler because the indeterminate effects of real-time interrupts can be eliminated.
While a largely hardware solution is feasible to relieve the CPU of interrupts and bandwidth impact, there are a number of good reasons for a different approach--a coprocessor solution.
If you look at the timing and positional events that must be supported, it is clear that numerical processing on a virtually continuous basis is unavoidable. This is difficult to implement efficiently in hardware alone, as Figure 2(a), a typical fuel-delivery pulse, and Figure 2(b), a spark-timing pulse, show. The two are almost identical, but with an essential difference. In the fuel pulse, the injector delivers fuel while the pulse is high. Ideally, the injector is turned off at a specified angle referenced to the crankshaft position. Older EMSs turned on the injector at a given angle, delivering fuel for a predetermined time following the angle. The time-angle implementation is now favored by motor manufacturers as it reduces exhaust emissions. The spark pulse is also time-angle based. The angle defines the position at which the ignition coil is turned off and the spark occurs. The pulse time that precedes this (the dwell time) is the time it takes to charge the coil; it defines the energy available to fire the spark. There are two important consequences of the time-angle requirement:
Real-time events can be supported by a free-running counter. Angular position is generally maintained by counting a pulse train obtained from a toothed wheel attached to a crankshaft. The counter must be reset periodically--usually it is reset after one whole engine cycle--which corresponds to 720 crankshaft degrees (two crankshaft revolutions). In some EMSs, the reset mechanism is supplied by a separate single pulse (akin to the floppy disk index hole). This has the disadvantage of additional mechanical hardware with the potential for unreliability and extra cost. This type of reset mechanism also takes up a valuable extra pin on the MCU. The alternative (preferred) solution is to add or remove one or more teeth on the toothed wheel (which already provides the angular reference on the crankshaft); see Figures 3(a) and 3(b).
The extra (or missing) teeth are used as reference to synchronize the angle-based counter. On startup of the engine, the synchronizing point must be searched for. This process is fraught with difficulties caused by changes of engine speed while it is being turned by the starter motor.
The capabilities of high-volume manufacturing put a restriction on the number of teeth that can be reliably machined on the toothed wheel. I've come across configurations that range from 3 to 60 teeth; obviously, the more teeth the better. While most auto manufacturers are reluctant to publicly reveal details, the German auto-supplier Bosch does make its "60 teeth with 2 missing teeth" wheel generally available as an after-market product.
Sixty-teeth/crankshaft revolution means an edge event occurs every six degrees. This translates into one event every 125 microseconds at 8000 RPMs. To accurately maintain fuel and spark timing under dynamic engine-speed conditions, the system may be required, in the worst case, to update all its output pulses every 125 microseconds. This requirement would bring all existing 8-bit and most 16-bit MCUs to their knees.
A TPU is one solution to this timing problem.
As Figure 4 illustrates, the Motorola TPU is partitioned into five main functional blocks: host interface, scheduler, microengine, execution unit, and timer-channel hardware.
The host interface provides the communication path between the CPU and TPU. It contains the time-function control registers as well as registers to support testing and development of new microcoded functions. The control registers are used principally to associate a time function with an I/O pin, assign a priority to the function, and optionally establish the interrupt mechanism to the CPU. The test registers allow production testing of the internal TPU architecture, while the development registers permit users to examine, modify, single step, and set breakpoints on microcode during design and debug of new functions. In addition, there are 100 locations of 16-bit dual-ported RAM (parameter RAM) used to pass data or extra control information between the TPU and the CPU. The way parameter RAM is used depends on how the timing function has been microcoded.
The TPU's microengine is event driven, executing microcode as a result of one of four events passed from a hardware scheduler. The use of a hardware scheduler reduces the context-switching time to 10 system-clock cycles--500 nanoseconds with a 20-MHz MCU such as the MC68332. Moreover, during the context switch, the TPU can preload data from memory into an internal TPU register and conditionally select the microcode-execution address, based on a set of flags, without additional software overhead, as these operations are defined by hardware within the TPU. This helps minimize microcode size without losing functionality. The CPU enables the TPU's scheduler to recognize service-request events on selected channels.
Each microcode instruction is 32 bits and located either in micro-ROM embedded in the TPU module, or in a peripheral RAM module. RAM mode is used for development purposes and does not compromise execution speed or functionality, thus allowing new user-defined timing functions to be debugged without complex hardware-emulation systems.
The microinstruction bus is spread throughout the TPU--to the parameter-RAM interface, microengine, scheduler, execution unit, and timer hardware. The parallel distribution of the microinstruction bus means that many operations that act on different functional blocks of the TPU can be encoded into a single 32-bit microword. The TPU has five different microinstruction formats, and the instruction set is quite unorthogonal. Figure 5 shows the bit fields for Format 2. This microinstruction can theoretically perform a maximum of ten simultaneous operations: 16-bit arithmetic addition on two operands; shift the result; schedule a match event; specify the output-pin transition when the event occurs; clear two timer flags; clear a software flag; clear a general-purpose flag; force the output-pin state; generate an interrupt to the CPU; and terminate microcode execution.
Typically, two or three operations are executed per microinstruction, yielding a bandwidth of 20 million to 30 million operations per second.
Other instruction formats provide memory to register data moves, conditional and unconditional branching, and additional timer-hardware control. The instruction mix and scope is designed to handle the real-time events in the timer-hardware functional block rather than to provide a wide range of arithmetic or logical operations and addressing modes.
To optimize performance, each microinstruction is pipelined to program memory. An interesting consequence of this occurs when a branch instruction is taken. In a regular microcontroller, the prefetched instruction following the branch instruction would be flushed from the pipeline. The TPU's microengine and instruction set allows programmers to select whether or not the pipeline is flushed. Not flushing the pipeline means an extra instruction cycle is not wasted, but the resultant source code may be quite difficult to follow as the instruction following the branch is executed regardless of program flow!
This scenario makes structured programming almost impossible. There are other architectural factors that dictate TPU programming style. For instance, the TPU has no stack, and to maximize the flexibility of the timer hardware channels and I/O-pin configuration, each microcoded timing function is usually limited to storing all its variables in six local, and up to four global, parameter RAM locations. However, the microengine does support one level of subroutine using a Return Address Register (RAR). The address of the instruction following the call, (or the next one if the pipeline is not flushed!) is jammed into the RAR when the call is executed. Since there is no context associated with the RAR, it is possible to branch out of a subroutine rather than return normally with no impact on the RAR. This technique should be used with care and certainly not before analyzing the design. Because the TPU supports absolute and indirect (as well as channel-relative) addressing modes, it is possible for a timing function to "steal" parameter RAM, although this will force a restriction on the assignment of functions to channels. In custom applications, this may not be a problem; for generic microcontrollers such as the MC68332A and MC68HC16Y1A, their functions are designed to allow any timing function to be applied to any or all channels.
Another quirk in the TPU's architecture is related to the lack of a stack. To test the condition code flags updated by an arithmetic operation, the flags must be specifically latched by a programmable instruction. This allows multiple arithmetic operations to be executed in sequence, and only the one with condition code flags that are to be tested needs to latch them.
The execution unit consists of five general-purpose 16-bit registers, three 4-bit control registers, a 16-bit arithmetic unit and 1-bit shifter, plus status and control paths to the microengine. Two of the 16-bit registers provide memory access to the parameter RAM. Another is used to access the time-event registers. One of the control registers can provide loop control on any single instruction, or in a special mode, the register is used to synthesize a programmable-sized multiply operation. A 16x16 multiply will execute in 1.8 microseconds, including preloading the multiplicand and multiplier from memory.
The arithmetic unit is capable of addition, subtraction, negation, and logical shifting, and can generate certain 16-bit constants in hardware, which avoids having to allocate 16-bit storage in data memory. This is advantageous, as the TPU's program memory is limited to 512 microwords. The loop-control register can be used with the arithmetic unit's shifter to implement multiple bit-shifting operations. Two of the 16-bit registers also have data paths to the timer hardware.
The real-time work of the TPU is performed by 16 hardware-driven channels assigned to 16 I/O pins. All channels contain identical hardware, giving complete flexibility in the timing function allocated to any pin. Each channel is subdivided into three distinct blocks: an event register, pin-control hardware, and channel-control hardware (see Figure 6). An event register consists of a 16-bit compare and 16-bit capture register, and a 16-bit greater-than or equal-to comparator that can be synchronized to one of two free-running timebases. The pin-control hardware defines the operation of the channel in response to an external event or internal event. An external event occurs if an input pin changes logical level; an event is internal if a timeout, set up by microcode, occurs when the pin is configured as an input. Timeout events are also used to change the output-pin state in real time. The channel-control hardware consists of various flags set and cleared by a combination of hardware events and microcode execution. The flags associated with timer events are also routed to the scheduler as service requests.
Since all channels have parallel hardware implementations, multiple service requests can be recognized by the scheduler simultaneously. The key role of the scheduler is to prioritize the requests and grant each channel execution time. Execution time given to a channel is called a "time slot" and its key characteristic is that it cannot be interrupted by any other service request. This ensures channel coherency and reduces the hardware overhead of the microengine and execution unit. The length of the time slot is determined by the amount of microcode required to service the event on the channel. The TPU can therefore be viewed as a cooperative multitasking machine. The CPU can assign each TPU channel to one of three priority levels (high, middle, or low), based on the rate at which the channel will need to be serviced.
The scheduler cycles through the priority levels in the order shown in Figure 7 (in a round-robin manner). If there is no service request pending, the scheduler idles at the boundary of a time slot. If a single service request on a lower priority level occurs while the scheduler is idle, then a priority-passing mechanism causes the request to be serviced immediately. Another mechanism in the scheduler ensures that if simultaneous requests occur at the same priority level from multiple channels, then all will be serviced in sequence before another event from these channels is recognized. This means that channels with exceptionally high request rates cannot lock out any other from the scheduler, regardless of priority.
Because of its scheduling mechanism, the optimum way to design a TPU function is as a state machine, with each state executed as a result of a service request caused by an event. Apart from the events generated by the channel hardware, two types of software events can generate service requests to the scheduler. One originates from the host CPU, while the other is a link request from a microinstruction. The link request is a bit like a software interrupt and provides a mechanism for one timing function to influence another. This feature gives the TPU the capability to synthesize complex, interactive, time-related activities without CPU intervention.
A typical implementation of fuel and spark functions relies on the TPU being able to continuously measure engine speed and position. The actual rate at which these engine parameters can be measured is really dependent upon the number of teeth on the crankshaft toothed wheel. Every rising or falling edge of a tooth causes an event on a TPU input function. Assuming that the wheel has already achieved synchronization by detecting the missing or additional tooth, every subsequent rising or falling edge defines an absolute angular position, which is stored by the TPU as a position counter. Unfortunately, there are two problems with this scenario:
TPU solutions have been developed to solve the problems of restoring missing teeth and increasing the apparent resolution of the system. Before designing the individual timing functions, you will need to design the timing system. The following two approaches can be used to increase the apparent resolution of the angle counter:
s ahead of time, latency issues are no longer significant.Figure 9 is a state diagram of a typical spark-output function with a resolution of better than 0.5 degrees. Here, there are only two states. A state transition is caused by a link event from the input function or a match event from the output function itself. One state schedules a rising-edge match based on a combination of coherently collected data from the input function plus asynchronous application-program data. The second state schedules a falling-edge match--again based on data from the input function and application program. The purpose of link events is to ensure that the most recently available data is always used to schedule output pulses. Engine acceleration and deceleration can cause large errors in the timing of scheduled events if the states are executed only once per engine cycle, as would be the case if the links were not present.
To help debug microcode, the TPU has a number of scan lines to several internal registers. These are accessed through the CPU and can be used to display and modify microcode as well as observe the current state of the scheduler and microengine. (For more information on programming the TPU, see the accompanying text box entitled "TPU Microcoding Guidelines.")
Despite the apparent vagary of TPU microcode, reliable quality code can be developed if proper software-development practices are followed. There are a number of software life-cycle models that can be used. For instance, I've implemented a microcoded system using the classic V model (see Figure 10). It describes a process where requirements must be fully specified before the details of the design can be identified. Actual microcoding is only started once the detail design is complete. However, both test and user manuals can be produced in parallel with the coding process. Provided that resources are available, this helps reduce the completion time of the project.
Figure 1: Software and hardware partitioning.
Figure 2: (a) Fuel delivery pulse; (b) spark timing pulse.
Figure 3: (a) Wheel with additional tooth; (b) wheel with missing teeth.
Figure 4: Simplified TPU block diagram.
Figure 5: TPU Format 2 microinstruction.
Figure 7: TPU scheduler priority sequence.
Figure 6: TPU channel hardware.
Figure 8: Input-function state diagram.
Figure 9: Output-function state diagram.
Figure 10: Software life cycle (V model).
The authors, who are principals of Ash Ware Inc., can be contacted at http://www.ashware.com.
To make Time Processor Unit development easier, we've developed these guidelines:
Use the TPU appropriately. Because the TPU has conditional execution, load/store, and arithmetic capabilities, it is often mistaken for a CPU. Consider the TPU as a mega-PAL or ASIC replacement. Look for expensive hardware associated with waveform interpretation and generation where high resolution or fast control is required. Use the TPU to replace this type of circuitry. Do not rule out replacing hardware with multiple I/O pins. Microcode functions can control multiple pins, or multiple I/O pins can be controlled by a set of symbiotic functions.
Front load the microcoding process. One approach is to design the parameter-RAM interface first, break the function into states, map states to entry points, and then write microcode. Microcoding generally takes about four hours per instruction. A superior up-front design reduces the microcode that must be written and debugged.
Don't forget the CPU. The TPU is optimized for waveform processing. The CPU is superior for number crunching and conditional execution. Move functionality to the CPU wherever possible.
Fully utilize the entry table. The classic CPU-like approach is to have a single entry point followed by if statements. Although conceptually simple, this is not the best way to write microcode. The TPU stores entry point addresses in a table arranged by entry conditions. When used properly, a given set of entry conditions causes the function to be invoked at the core of your function (no if statements are required).
Untangle your design before you tangle your code. Microcode is flexible in terms of what processing is done within each different service routine. If your design requires many jumps and calls to reuse code, rearrange the design so previously reused code gets hit only once.
Avoid hardcoding that limits functions to certain channels. Allow channel initialization via parameter RAM.
Service routines should be kept short. There should also be little variance in thread length. It is sometimes best to split long threads into two shorter service routines.
Don't "run quiet." Beginners often take a CPU-like approach and poll pins and timers within a simple loop. This has been described as "quiet," possibly because the thread of execution does not move around much. The microsequencer should be made available to other channels once each channel's hardware has been armed. Allow the round-robin scheduler to decide which channel to service next. Good microcode should be "loud." Imagine the TPU clanging away as it opens and shuts a door each time a service routine is traversed.
Design proactive--not reactive--code. Low latency is achieved because the microsequencer prepares dedicated hardware to react to events. Thus, the microsequencer's role is preparatory and proactive. The bulk of the microcode should reflect this philosophy and be associated with the next event, not the last event. If microcode is getting complicated and unmanageable, move the functionality backward in time to a previous service routine.
Measure optimization. Sum the number of AU subinstructions and RAM-access subinstructions and divide by the total number of instructions. Highly optimized code tends to have large ratios. A ratio above 0.7 is excellent. To improve this ratio, look for "empty" RAM accesses and AU subinstructions. Usually, microcode can be rearranged to combine these subinstructions into a single instruction. Another measure is the ratio of conditional, call, and goto statements that use the pipeline flush to the total number of branch commands. Flushes are wasteful: This ratio will be less than 0.3 in well-designed code.
Get the right training and tools. Motorola's TPU microcoding class will pay for itself. Motorola provides a freeware assembler (http://freeware.aus.sps.mot.
com/freeweb/amcu_ndx.html), but the released version is inexpensive, well supported, and comes with good documentation. Because TPU microcode can be difficult to debug, a good development environment is a must. Ash Ware Inc. (our company) offers a TPU Simulator that allows viewing of virtually every aspect of the TPU (http://www.ashware.com). A complex test-vector capability keeps the test environment in lockstep with the TPU as the microcode is debugged. Motorola also has DOS- and Windows-based versions of a TPU debugger.