Programming TI's Multimedia Video Processor

Client/server programs for real-time video

William May

Bill is the principal software engineer for Minerva Systems, a developer of high-end MPEG encoders. He can be reached at bmay@minervasys.com.


Originally dubbed the MVP ("multimedia video processor"), the Texas Instruments TMS320C80 processor lets you implement video algorithms in software--in other words, video DSP. The MVP is a radical departure from TI's traditional approach to digital-signal processing, so knowing how to program TI's fixed- or floating-point DSPs won't help you much. From top to bottom, the MVP architecture is designed to achieve performance orders of magnitude greater than that of traditional DSPs.

Likewise, you'll usually need to extensively rework algorithms you are currently familiar with, although you'll occasionally find that the MVP gives life to algorithms long since discarded or forgotten. In short, to take full advantage of the MVP's power when programming real-time video algorithms, you'll usually need to develop new approaches.

In this article, I'll examine what it means to write software for real-time video. If you're familiar with programming in C on Intel and Motorola processors, the MVP will give you a glimpse of a strange, performance-driven world. For instance, you must be in complete control of how data is organized in memory and how it moves across various buses to be processed by the MVP. There is a premium on making every CPU cycle count, so high-level abstractions are very difficult to sustain. Still, the payoff is worth it, especially when you see what real-time video is really like.

Architecture Overview

CCIR-601 is the international standard for digital video. It is a single format that encompasses both NTSC video (the video standard for the U.S. and Japan, among others) and PAL (the European standard). In both cases, the total data rate for video is 27 MB/sec. One half of the data (13.5 MB/sec) represents luma, or gray-scale information; the other half represents chroma for two color channels (each of which is 6.75 MB/sec). However, not all the information in the video signal must be processed. In the NTSC variant of CCIR-601, the active video area in each frame is 720x486 pixels, at 29.97 frames/sec. Thus, you must process about 21 MB/sec to handle NTSC in real time. The rate for PAL is slightly lower. By comparison, processing high-quality digital audio requires just 88,200 samples/sec, less than 0.5 percent of the data rate for video.
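These rates follow directly from the numbers above; a quick check of the arithmetic (a sketch in C, with hypothetical function names):

```c
/* CCIR-601 data rates: 13.5 MB/sec of luma plus two chroma channels
   at 6.75 MB/sec each gives the 27 MB/sec total.  The NTSC active
   area (720x486 pixels at 29.97 frames/sec, 2 bytes/pixel average
   for 4:2:2 sampling) is what must actually be processed. */
double ccir601_total_rate(void)
{
    return 13.5e6 + 2 * 6.75e6;            /* 27 MB/sec */
}

double ntsc_active_rate(void)
{
    return 720.0 * 486.0 * 29.97 * 2.0;    /* about 21 MB/sec */
}
```

Reading captured video in and writing it back out doubles the active rate, which is where the 42 MB/sec figure comes from.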

From these simple calculations, two things are evident. First, a video processor and its surrounding hardware must be able to read and write data very fast--just reading in and writing out captured video requires a rate of 42 MB/sec. If multiple passes are needed or intermediate results must be stored, the requirements are that much greater. Second, the processor must sustain an enormous rate of computation.

An example of a simple image-processing algorithm is a 3x3 filter for edge detection. In a naive implementation, such a filter might require nine multiplications and eight additions per sample (usually the number is reduced due to symmetry in the filter). This comes out to about 360 million calculations per second, just for this very simple filter. In real applications, such a filter would be only one component of a much larger calculation.
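A naive implementation of such a filter is straightforward in C (a sketch with hypothetical names; edge pixels are skipped for brevity):

```c
/* Naive 3x3 filter over an 8-bit image: nine multiplications and
   eight additions per output sample.  Interior pixels only; the
   function and buffer names are hypothetical. */
void filter3x3(const unsigned char *src, short *dst,
               int width, int height, const short kernel[9])
{
    for (int i = 1; i < height - 1; i++) {
        for (int j = 1; j < width - 1; j++) {
            long sum = 0;
            for (int ki = -1; ki <= 1; ki++)
                for (int kj = -1; kj <= 1; kj++)
                    sum += kernel[(ki + 1) * 3 + (kj + 1)]
                         * src[(i + ki) * width + (j + kj)];
            dst[i * width + j] = (short)sum;
        }
    }
}
```

At 21 MB/sec of samples, the 17 operations per sample in this inner loop account for the roughly 360 million calculations per second mentioned above.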

This level of performance can be achieved by making a processor either run extremely fast or do a lot of work in each cycle. In general, I think the second approach is preferable. Making a processor faster often introduces new problems, such as the need for faster RAMs, faster buses, and the like to keep the processor busy. This gets expensive, and it can be very difficult to debug the hardware. However, if the processor is highly parallel, it can run at more "leisurely" clock rates, allowing the overall system to use standard components and connections that are more easily obtained, more easily debugged, and more reliable.

The MVP takes the second approach: parallelism throughout. The MVP actually contains six processors: a fairly standard RISC processor, an extremely sophisticated DMA engine, and four parallel processors (PPs). Figure 1 illustrates the chip architecture. There is also 50 KB of on-chip static RAM.

The first production release of the MVP supports a 50-MHz clock. However, because of the six processors, the MVP can perform a massive amount of work in each cycle: The MP can execute a RISC instruction and a floating-point operation, each of the four PPs can perform a multiply, an ALU operation, and two memory accesses, and the transfer controller can move eight bytes of data.

Of course, in real algorithms, it is almost never possible or useful to do all these operations in each cycle. However, the challenge of programming the MVP is to keep as many of these operations going as possible, for a given algorithm.

The Master Processor

The master processor (MP) is a straightforward RISC processor that is easily programmed in C; see Figure 2. The MP has a floating point unit (FPU) that runs in parallel with the RISC core. In many applications, the MP operates as a traffic cop. It responds to interrupts from I/O devices, handles communications with a host, allocates memory, and assigns work to the parallel processors. Many system configurations are possible, but in most, the MVP is used as a compute server, with a host telling the MP what to do and when. The MP in turn schedules the parallel processors to perform the specific tasks.

The FPU can be useful, even though it is not as fast as the parallel processors. For example, in many applications audio is handled by the FPU, while video is handled by the parallel processors.

TI supplies a multitasking kernel that runs on the MP. The kernel provides facilities for memory management, running multiple tasks, communications between tasks, interrupt handling, and so on. High-level control of an application usually resides in tasks running on the MP. These tasks then allocate memory or processing resources (the PPs) as needed, and communicate with other tasks or interrupt handlers. The kernel is fairly small, but provides the basis for easily developing many application- or hardware-specific features.

The Transfer Controller

Although it is not obvious at first, one of the keys to the MVP's performance is the transfer controller (TC). The TC (see Figure 3) is an intelligent DMA controller, so virtually all I/O can be preprogrammed. Once I/O is set up at the beginning of an algorithm, the processors typically spend all their time computing, with data being read and written in the background by the transfer controller.

Besides intelligence, the TC also has enormous bandwidth, assuming that memory is set up correctly. There is a 64-bit data path on both sides of the transfer controller, so on a 50-MHz MVP, it is possible to move 400 MB of data per second. In practice, there is usually at least one wait state, making typical performance half the maximum--200 MB/sec. Still, this is sufficient for many video-processing algorithms. In my experience, I/O is usually not the bottleneck when working with the MVP.

The I/O tasks that can be specified are flexible, ranging from simple requests (reading or writing single or multiple lines of data), to blocks of data (such as the 8x8 blocks in JPEG compression), to patches of data (for drawing text and lines). Addressing can use absolute addresses, offsets from a starting point, or ping-ponging between double buffers. Many graphics operations (fills and rectangular bit blits, for example) can be performed using only the TC.
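The ping-pong pattern is worth seeing in miniature. The sketch below (hypothetical names; plain memcpy stands in for the TC's background packet transfers) processes an image line by line through two pairs of buffers, the way a PP and the TC share on-chip RAM:

```c
#include <string.h>

/* Line-by-line processing with two pairs of line buffers
   (hypothetical names; assumes width <= 1024).  While the "PP"
   computes on one buffer, the "TC" -- played here by memcpy --
   fills the other and drains the previous result. */
void process_lines(const unsigned char *src, unsigned char *dst,
                   int width, int height)
{
    unsigned char bufIn[2][1024], bufOut[2][1024];   /* ping-pong pairs */

    memcpy(bufIn[0], src, width);                    /* prime buffer 0 */
    for (int line = 0; line < height; line++) {
        int cur = line & 1;                          /* buffer in use */
        int nxt = cur ^ 1;                           /* buffer being filled */
        if (line + 1 < height)                       /* background read-ahead */
            memcpy(bufIn[nxt], src + (line + 1) * width, width);
        for (int j = 0; j < width; j++)              /* the PP's inner loop */
            bufOut[cur][j] = (unsigned char)(255 - bufIn[cur][j]);
        memcpy(dst + line * width, bufOut[cur], width); /* background write-out */
    }
}
```

On the MVP the two memcpy calls cost the PP nothing; they run on the TC while the inner loop executes.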

The Parallel Processors

The PPs are the compute engines of the MVP. The first-generation MVP has four PPs, although the hardware design also supports one, two, or eight. The PPs are purposely not called DSPs, as their architecture is quite different: They are geared toward graphics and image-processing operations, whereas a DSP is targeted at processing one-dimensional data.

As Figure 4 and Figure 5 illustrate, each PP internally has four separate computation units--a multiplier, an ALU, and two addressing units. Each unit operates in parallel with the others, so that in each cycle (all instructions execute in a single cycle) a single PP can do multiplication, ALU operations, and two reads/writes (with accompanying effective address calculations and address-register updates).

The multiplier and the ALU can be split, so the multiplier can perform a single 16x16-->32-bit multiply in each cycle, or two 8x8-->16-bit multiplies. The multiplier also has a mode where it can calculate a rounded result (16x16-->16-bit rounded), scale the result, and swap one of the coefficients so that another rounded multiply can be performed on the next cycle.

The ALU does all the usual logical and arithmetic operations on data in registers. However, this ALU can operate in full 32-bit mode or be split to perform two 16-bit ("halfword split") operations or four 8-bit ("byte split") operations. In byte-split mode, the ALU performs four operations (add, subtract, And, Or, and so on) in each cycle. Special hardware and muxes in the ALU also allow rotating one of the inputs, generating masks, expanding bits, and storing status bits for each ALU split.
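To see what the byte split buys, here is what four carry-isolated 8-bit adds cost on an ordinary 32-bit processor (a sketch; the MVP's split ALU does this in a single cycle with no masking):

```c
#include <stdint.h>

/* Four 8-bit unsigned adds packed in 32-bit words, with carries kept
   from rippling between lanes -- the effect of the MVP's byte-split
   ALU, emulated here with masks. */
uint32_t add4u8(uint32_t a, uint32_t b)
{
    uint32_t low7  = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* low 7 bits per lane */
    uint32_t high1 = (a ^ b) & 0x80808080u;                 /* top bit, carry-free */
    return low7 ^ high1;
}
```

The PP gets the lane isolation (plus the rotation, mask generation, and per-split status capture) for free in hardware.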

These possibilities present a problem--it is difficult to design an instruction set that can specify such an array of operations and modes. TI resolved this by defining a basic set of operations that could be specified in a 64-bit instruction word. This basic 64-bit instruction word can specify a multiplication, a wide array of ALU operations, and two I/O operations for each cycle.

However, the PP hardware is too flexible to be expressed in even a 64-bit instruction. To provide access to the full functionality, there's the ealu ("extended ALU") instruction, which requires that the d0 register be used as an additional 32 bits of the instruction, bringing the total to 96 bits. This costs a potentially useful register, but opens up enormous flexibility. With ealu instructions, some additional multiplier modes and several additional ALU paths become available. The ealu instructions are often seen in tight loops (see the image-crossfade example developed shortly).

In addition to the multiplier and ALU, each PP supports fairly typical address-generation calculations (pre- and post-increment, indexed addressing) and hardware loop controllers. The loop controllers are nice because they can be nested for multidimensional processing, and since they automatically handle instruction-latency issues, they are also easy to use. One interesting use of the loop controllers is hardware branching on exceptions (such as numeric overflow or underflow). This allows a tight loop to operate very efficiently, without regard for exception handling; execution branches out to a special handler only when a problem occurs.

Programming the MVP

Developing an application on the MVP typically involves the following steps:

  1. Create a client and server task to invoke algorithm execution; see Listings One and Two, respectively. These tasks execute on the MP, communicate with a host, and allocate compute resources (the PPs) when needed. This is easy, since it is all written in C.
  2. Lay out memory usage on the PPs and program them to issue I/O requests to the transfer controller. This is a complex balancing act, where the effects of limited memory are balanced against the possibility of crossbar contention and complex addressing in inner loops. The geometry of an algorithm must be carefully analyzed. Sometimes, this is straightforward. Many image-processing algorithms can be processed along scan lines. A JPEG codec naturally works on 8x8 blocks. In other cases, there are multiple I/O options, and the option selected can dramatically affect overall performance. One class of algorithms that is difficult to program includes image rotations and warping. In the case of general warping (mesh warps) a small patch of the image may explode into a large patch, making memory and I/O management quite complex.
  3. Code the algorithm on the PP. Here again, there are many options and decisions. The first is the topology of the algorithm execution. In most cases, the image is subdivided in some way, and each PP executes the algorithm on its portion of the image. In this case, each PP is executing the same code. This is the simplest topology to code, since the PPs do not need to communicate with each other and the computational load is automatically balanced among them.

    This is the most common means of subdividing a task, but the MVP supports numerous others. A task can be pipelined, with each PP performing part of the task and moving the result to the next PP. Or one PP can start performing a task and send the result to the other three for completion, and so on. Of course, algorithms may also use the MP to perform part of the processing.

  4. Finally, code the algorithm itself. PP code is usually written in assembly language. Although TI supplies a C compiler for the PPs, it is not up to the task of using the PPs' resources efficiently.
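For the common case where each PP processes a horizontal strip of the image, the MP's address arithmetic looks something like the following sketch (hypothetical names; Listing Two does the equivalent work with the real argument-buffer structures):

```c
/* Partition an image into NUM_PPS horizontal strips, one per PP.
   All names here are hypothetical. */
#define NUM_PPS 4

typedef struct {
    const unsigned char *src1, *src2;   /* strip start in each source */
    unsigned char       *dst;           /* strip start in destination */
    int                  width, height; /* strip dimensions */
} StripArgs;

void partition(const unsigned char *src1, const unsigned char *src2,
               unsigned char *dst, int width, int height,
               StripArgs args[NUM_PPS])
{
    int strip = height / NUM_PPS;               /* whole lines per PP */
    for (int i = 0; i < NUM_PPS; i++) {
        long off = (long)i * strip * width;     /* byte offset of strip i */
        args[i].src1   = src1 + off;
        args[i].src2   = src2 + off;
        args[i].dst    = dst + off;
        args[i].width  = width;
        /* the last PP picks up any leftover lines */
        args[i].height = (i == NUM_PPS - 1) ? height - strip * i : strip;
    }
}
```

Because every strip has roughly the same number of lines, each PP does roughly the same amount of work, which is why this topology balances the load automatically.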

An Image-Crossfade Example

To illustrate how some of these pieces fit together in a real application, I'll develop an application that performs an image-crossfade operation. Two images are supplied as inputs, and a third is generated as the weighted average of the two inputs. By gradually changing the weight over time, a crossfade is performed. Example 1 implements this algorithm in C, while Listing Three presents the PP code, which shows the typical situation where I/O requests are set up based on the request passed down from the MP. The assembly language is what TI calls an "algebraic assembly language" and is much more expressive than an opcode-based assembly language. It also uses many C-like expressions. The PP assembler is responsible for translating this huge assortment of expressions into PP machine language.

In this case (refer to the section of Listing Three beginning with ; setup our loop, once through for each line), you see three packet-transfer requests being set up (two input, one output). Once in motion, the TC uses these packets to continually bring in data and write out processed data, without any further effort by the PP (except for making sure that new data has arrived when it is ready to process more data). Although it is not evident from the listing, the packet transfers support 3-D transfers--rows and columns, as well as separate counters, pitches, and offsets for each row and column. It is common to use the three dimensions to bring in blocks of data from memory, using the third dimension to ping-pong between on-chip buffers. Proper use of this mode eliminates contention between the TC and the PPs.

To set up the packet transfer, the PP needs to know where to find its data and put the result. The process is done via shared memory. The MP has presumably received a command from the host to crossfade two images (with a particular weight) and display the result. The MP calculates the addresses and offsets so that the source images are processed in four sections, and the results are written to a display buffer. Thus, each PP gets an argument buffer containing such information as pointers to the source images and result image, and the weight to use in the calculation. The MP puts this argument buffer in each PP's local memory.

Finally, the inner loop performs the actual calculations (refer in Listing Three to the code beginning with width = width >> 1;), where each of the four PPs is executing to do a crossfade operation. The "||" symbol means that instructions are executed in parallel. When you see this, the first instruction (without the "||") and following instructions (with "||") all execute in the same machine cycle.

C-like syntax is used in many places. For example, &* calculates an effective address. Instructions such as zero = &*(La_Image0 = dba + fBuff0); load an effective address into an address register. There are also special registers (which the assembler recognizes by the name "zero") that read as zero and discard writes.

Like most algorithms on the PP, there is little or no data management inside the inner loop. The address registers are loaded with pointers to the data, and loop execution begins. When this loop is invoked, the TC has already loaded the source data. While the loop is running, the TC is simultaneously writing out the results of the last call to the loop and bringing in source data for the next call. This allows the PP to concentrate completely on the computation.

Note that the PP initializes d0 to perform an ealu instruction. It then initializes the hardware loop-control registers (ls0, le0, lr0, and lctl) by loading the start and end instruction addresses in the loop, loading the loop counter, and setting a control register that enables looping ("|" is a logical Or, just like in C).

Next, I prime the loop by preloading some registers and starting some of the calculations. Once again, C-like syntax is used to represent pointer references and incrementing. The =uh suffix tells the PP to do unsigned halfword loads (that is, don't sign extend the data).

The first products are calculated. Fixed-point values are used for the weight and the data. Each multiply does two 8x8-->16-bit multiplies (=um means the multiplier does an unsigned split multiply). The results of two multiplies are later added using a halfword-split ALU. Using split multiplies requires ealu instructions, but ealu is also used to align the data the way I want for later processing. ("\\" is a register-rotate operator; positive rotates go left.)
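The underlying fixed-point arithmetic is easy to state in C (a sketch; blend and its 0..256 weight range are illustrative--a weight byte of 0x80, as in the 0x8080 value sent by Listing One, gives roughly a 50:50 fade):

```c
#include <stdint.h>

/* Weighted average of two 8-bit pixels: (w*x + (256-w)*y) >> 8,
   with w an integer weight in 0..256.  A hypothetical helper; the
   PP computes two of these per cycle with its split multiplies. */
uint8_t blend(uint8_t x, uint8_t y, unsigned w)
{
    unsigned p = w * x + (256 - w) * y;   /* two 8x8->16-bit multiplies */
    return (uint8_t)(p >> 8);             /* scale back to 8 bits */
}
```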

The inner loop itself is three instructions long. Each iteration calculates two pixel results. All reads and writes from on-chip memory, four multiplications, accumulation, and data alignment are done in these three cycles. Of course, all these operations must occur in the right sequence in order to achieve the right results. Condensing the operations into a tight loop while taking advantage of parallelism is the main task of programming such an inner loop.

The inner loop is pipelined, so more than two sets of pixels are actually processed in each loop. This is typical of efficient code on the MVP.

When all is done, the PP branches to the return address in the special register iprs. Due to instruction pipelining in the parallel processors, the two instructions following a branch are always executed. These two delay slots are often used for cleanup (as in this case).

Conclusion

So what is the MVP really capable of? TI claims the MVP can perform two billion operations per second. This raw statistic, while accurate in the literal sense, obscures more than it illuminates. Table 1 (developed by TI) lists benchmarks of some real algorithms, all of which except the JPEG codec are part of TI's tools distribution.

The bottom line is that the MVP really is a video DSP, but it is first generation, so it has its limitations--cycles, memory, and bandwidth. I look forward to seeing where TI (and its competitors) go with this technology over the next several years.

Figure 1: MVP architecture overview.

Figure 2: Master processor.

Figure 3: Transfer controller.

Figure 4: Parallel-processor overview. Lds=local destination/source bus, Gsrc=global source bus, Gdst=global destination bus, Repl=replicate hardware, A/S=align/sign-extend hardware.

Figure 5: Parallel-processor data unit.

Example 1: Implementing the image-crossfade algorithm in C

/* perform a step in a crossfade. w is the current "weight" */
for (i = 0; i < height; i++) {
    for (j = 0; j < width; j++) {
        *dst++ = w * *src1++ + (1-w) * *src2++;
    }
}
Table 1: MVP benchmarks (generated by TI, presumably on a 50-MHz MVP).
Benchmarks                          Processor Result
Dhrystones                          MP only   140,000
3-D graphics transforms             MP only   2.6 MB/sec
800x600 image (4:1:1 YCrCb)
 JPEG encode/decode                 4 PPs     42-59 ms
8x8 forward DCT (H.261 accuracy)    4 PPs     800,000/sec
3x3 median filter                   4 PPs     25 MB/sec
3x3 convolution (16-bit precision)  4 PPs     22 MB/sec
2x3 convolution (8-bit precision)   4 PPs     40 MB/sec

Listing One

/* client.c  --  C source code for simulating client task  */
#include <stddef.h>
#include <task.h>
#include "app.h"
#include "hwparams.h"
#include "main.h"
extern unsigned char image1[];
extern unsigned char image2[];
extern unsigned char image_out[];
/* Simulate client task that sends request messages to server tasks
 * that run on 340I MP, and receives reply messages from same. */
void DummyClient(void *arg)
{
    MSG_BODY    *msgBody;
    long        *pI;
    long        i;
    
    TaskOpenPort(PORTID_RECLAMATION);
    TaskOpenPort(PORTID_CLIENTREPLY);
    for (i = 0; i < 8; i++) {
        TaskReclaimMsg(TaskAllocMsg(40, PORTID_RECLAMATION));
    }
    /* keep sending messages to the server */
    for (;;) {
        msgBody = TaskReceiveMsg(PORTID_RECLAMATION);
        msgBody->opCode = REQUEST_CROSSFADE_IMAGE;
        pI = (long *)msgBody;
        pI[1] = 640;
        pI[2] = 240;
        pI[3] = 0x8080;             /* weight 50:50 crossfade */
        pI[4] = (long)image1;
        pI[5] = (long)image2;
        pI[6] = (long)image_out;
        TaskSendMsg(msgBody, PORTID_SERVER);
        /* Wait for next reply message to arrive from MVP. */
        if (!TaskWaitEvents(1L << EVENTNUM_AUXMSG)) {
            TaskYield(-1);
        }
        msgBody = TaskReceiveMsg(PORTID_CLIENTREPLY);    /* get msg */
    }
    TaskClosePort(PORTID_RECLAMATION);
}

Listing Two

/* Crossfade.c  --  C source code for crossfade server task  */
#include <stdlib.h>
#include <task.h>
#include "app.h"
#include "hwparams.h"
#include "main.h"
#include "mpppcmd.h"
#include "MemoryMapMP.h"
#define PPS_to_go   4
void setup_pps(SRVARG *arg, CrossfadeParams *sp, PPCMDBUF *cmdBufs[]);
/* A simple server task.  The single argument is a structure containing
 * all the persistent data needed to represent the state of the task from
 * one activation to the next.  (Between activations, a task maintains no
 * state information on the single, system stack.)  The value returned by
 * this function to the task scheduler is a long word containing up to 32
 * flags that indicate the set of events the task has selected to wait
 * on.  The task will be activated again when one of these events occurs.
 */
void CrossfadeServer(void *argument)
{
    SRVARG          *arg = (SRVARG *)argument;
    void            *msgBody;           /* save current request from client */
    PPCMDBUF        *cmdBufs[4];        /* current PP command buffer */
    long            opCode, portId, i, j;
    PPCMDBUF        *cmdBuf;
    PPINFO          *pp;
    CrossfadeParams *params;
    portId = TaskOpenPort(arg->portId);
    for (i = 0; i < PPS_to_go; i++) {
        pp = &(arg->pp[i]);
        /* Initialize the PPs that belong to this task. */
        cmdBuf = PpCmdBufInit(pp->ppNum, pp->program, 2);
        cmdBufs[i] = cmdBuf;
        PpCmdBufSetArgs(cmdBuf, (void *)(0x1000260 + (i << 12)));
        cmdBuf = PpCmdBufNext(cmdBuf);
        PpCmdBufSetArgs(cmdBuf, (void *)(0x10002C0 + (i << 12)));
    
        pp->semaId = TaskOpenSema(pp->semaId, 0);
        PpMsgIntSetSignal(pp->ppNum, 1, pp->semaId);
    }  
    for (i = 0; i < PPS_to_go; i++) {
        cmdBuf = cmdBufs[i];
        PpCmdBufSetFunc(cmdBuf, gPPCmd[PPCMDNUM_SETUP_FOR_CROSSFADE]);
        PpCmdBufIssue(cmdBuf);
        cmdBufs[i] = PpCmdBufNext(cmdBuf);
    }
    /* This task and its server PPs have now been initialized.  Await the
     * arrival of the first request message from a client.  Repeat the
     * loop below for each new request received from a client. */
    while(1) {
        msgBody = TaskReceiveMsg(portId);
        /* Begin processing new request message from client */
        switch (((MSG_BODY *)msgBody)->opCode) {
            case REQUEST_CROSSFADE_IMAGE:
                params = (CrossfadeParams *)&(((MSG_BODY *)msgBody)->filler[0]);
                setup_pps(arg, params, cmdBufs);
                /* Then return the request message as a reply message
                 * indicating completion of the request. */
                ((MSG_BODY *)msgBody)->opCode = REPLY_CROSSFADE_IMAGE;
                portId = TaskGetReplyPort(msgBody);
                TaskSendMsg(msgBody, portId);
                break;
            default:
                /* No error handling yet.  Just discard bad request and
                 * continue waiting for a valid request to arrive. */
                TaskReclaimMsg(msgBody);
                break;
        }
    }
}
/* setup the PPs and let them go */
void setup_pps(SRVARG *arg, CrossfadeParams *sp, PPCMDBUF *cmdBufs[])
{
    long        width, height, ratio;
    unsigned char   *src_1, *src_2, *dst;
    long            used_pps = PPS_to_go; 
    long            partial_height;
    long            partial_offset;
    long            i;
    PPCMDBUF        *cmdBuf;
    CrossfadeParams *argBuf;
    PPINFO          *pp;
    /* Load operands into PP's argument buffer. */
    width           = sp->Width;
    height          = sp->Height;
    ratio           = sp->Ratio;
    src_1           = sp->Src1Address;
    src_2           = sp->Src2Address;
    dst             = sp->DstAddress;
    partial_height  = sp->Height/PPS_to_go;
    partial_offset  = width * partial_height; /* offset partial images */
    for (i = 0; i < PPS_to_go; i++) {
        cmdBuf  = cmdBufs[i];
        argBuf  = (CrossfadeParams *)PpCmdBufGetArgs(cmdBuf);
        while (PpCmdBufBusy(cmdBuf)) {
            TaskWaitSema(arg->pp[i].semaId);
        }
        cmdBuf              = cmdBufs[i];
        argBuf              = (CrossfadeParams *)PpCmdBufGetArgs(cmdBuf);
        argBuf->Width       = width;
        argBuf->Height      = partial_height;
        argBuf->Ratio       = ratio;
        argBuf->Src1Address = (unsigned char *)(src_1 + i * partial_offset);
        argBuf->Src2Address = (unsigned char *)(src_2 + i * partial_offset);
        argBuf->DstAddress  = (unsigned char *)(dst + i * partial_offset);
        PpCmdBufSetFunc(cmdBuf, gPPCmd[PPCMDNUM_CROSSFADE]);
        PpCmdBufIssue(cmdBuf);
    }
    /* The task has been partitioned among PP's.  Now wait
     * until the busy ones finish, and update command buffer pointers. */
    for (i = 0; i < used_pps; i++) {
        cmdBuf = cmdBufs[i];
        if (PpCmdBufBusy(cmdBuf)) {
            TaskWaitSema(arg->pp[i].semaId);
        }
        cmdBufs[i] = PpCmdBufNext(cmdBuf);
    }
}

Listing Three

;------------------------------------------------------------------------
; Does a crossfade between two images, writing data out to a third
; image. The image pointers, sizes, and a crossfade multiplier
; in 1.15 format are specified.
;
        .include    "MemoryMapPP.h"
        .include    "PacketPP.h"
        .ptext
        .global _SetupForCrossfade
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; setup the PR stuff
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
_SetupForCrossfade:
Ga_Packet0      .set    a10
Ga_Packet1      .set    a11
Ga_Packet2      .set    a12
        x0 = 0x800;         delta between ping pong buffers
        x8 = 0x80000103;    Dimensioned PR, stop, src update, ping pong dst
        x9 = 0x00000301;    Dimensioned PR, no stop, dst update, ping pong src
        x10= 0x00000103;    Dimensioned PR, no stop, src update, ping pong dst
        ; initialize the input PRs
        zero = &*(Ga_Packet0 = dba + fPktReqAddressBase0);
        zero = &*(Ga_Packet1 = dba + fPktReqAddressBase1);
        zero = &*(Ga_Packet2 = dba + fPktReqAddressBase2);
        ; first input packet transfer
        *Ga_Packet0.tPR_Options         = x10; dimensioned PR, no stop
        *Ga_Packet0.tPR_Next            = Ga_Packet1;   
        *Ga_Packet0.fPR_SrcBBPitch      = zero;
        *Ga_Packet0.fPR_DstBBPitch      = zero;
        *Ga_Packet0.tPR_SrcCCount       = zero;
        *Ga_Packet0.tPR_DstCCount       = zero;
        *Ga_Packet0.fPR_DstCCPitch      = x0;
        *Ga_Packet0.fPR_SrcCCPitch      = zero;
        ; second input packet transfer
        *Ga_Packet1.tPR_Options         = x8;  dimensioned PR, stop
        *Ga_Packet1.tPR_Next            = zero; no next
        *Ga_Packet1.fPR_SrcBBPitch      = zero;
        *Ga_Packet1.fPR_DstBBPitch      = zero;
        *Ga_Packet1.tPR_SrcCCount       = zero;
        *Ga_Packet1.tPR_DstCCount       = zero;
        *Ga_Packet1.fPR_DstCCPitch      = x0;
        *Ga_Packet1.fPR_SrcCCPitch      = zero;
        ; write data back out to the destination
        *Ga_Packet2.tPR_Options         = x9; dimensioned PR, no stop
        *Ga_Packet2.tPR_Next            = Ga_Packet0;       
        *Ga_Packet2.fPR_SrcBBPitch      = zero;
        *Ga_Packet2.fPR_DstBBPitch      = zero;
        *Ga_Packet2.tPR_SrcCCount       = zero;
        *Ga_Packet2.tPR_DstCCount       = zero;
        *Ga_Packet2.fPR_SrcCCPitch      = x0;
        *Ga_Packet2.fPR_DstCCPitch      = zero;
        br = iprs;                          return to calling function
        nop;
        nop;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; do the actual crossfade. This part of the code handles I/O for the crossfade.
;; on entry: a9  points to our command packet
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        .align 512
        .global _CrossfadeImage
_CrossfadeImage:
ratio0          .set    d1
srcPtr1         .set    d2
srcPtr2         .set    d3
dst             .set    d4
width           .set    d5
height          .set    d6
La_Image0       .set    a2
Ga_Image1       .set    a8
La_Image2       .set    a3
Lx_Offset       .set    x0
Gx_Offset       .set    x10
Ga_Command      .set    a9;                     initialized by caller
        zero = &*(a0 = pba + fPP_reg_save0);
        *a0++ = a8;
        *a0++ = a9;
        sr = 0x2D;                              split ALU, halfword operations
        ; setup our loop, once through for each line
        height      = *Ga_Command.tHeight;
        height      = height - 1;               loop height + 1 times
        le0         = ipe + (end - $);
        ls0         = ipe + (start - $);
        lr0         = height;                   also sets loop counter
        lctl        = 0x9;                      enable le0, associate with lc0
        ; initialize some registers
        zero = &*(La_Image0 = dba + fBuff0);
        zero = &*(Ga_Image1 = dba + fBuff1);
        zero = &*(La_Image2 = dba + fBuff2);
        ; read in the command arguments
        width   = *Ga_Command.tWidth;           get the width
        ratio0  = *Ga_Command.tRatio;       
        srcPtr1 = *Ga_Command.tSrc1Address;
        srcPtr2 = *Ga_Command.tSrc2Address;
        dst     = *Ga_Command.tDstAddress;
        ; read in the first two lines of source images 
        zero = &*(Ga_Packet0 = dba + fPktReqAddressBase0);
        zero = &*(Ga_Packet1 = dba + fPktReqAddressBase1);
        zero = &*(Ga_Packet2 = dba + fPktReqAddressBase2);
        *Ga_Packet0.tPR_SrcStartAddress = srcPtr1;
        *Ga_Packet0.tPR_DstStartAddress = La_Image0;
        *Ga_Packet0.tPR_SrcBACount      = width;    a count is width:b is 0
        *Ga_Packet0.tPR_DstBACount      = width;    a count is width:b is 0
        *Ga_Packet0.fPR_SrcBBPitch      = width;
        *Ga_Packet0.fPR_DstBBPitch      = width;
        *Ga_Packet1.tPR_SrcStartAddress = srcPtr2;
        *Ga_Packet1.tPR_DstStartAddress = Ga_Image1;
        *Ga_Packet1.tPR_SrcBACount      = width;    a count is width:b is 0
        *Ga_Packet1.tPR_DstBACount      = width;    a count is width:b is 0
        *Ga_Packet1.fPR_SrcBBPitch      = width;
        *Ga_Packet1.fPR_DstBBPitch      = width;
        ; start reading in the first lines, auto increment source address
        *(pba + fPR_LinkedListStart)    = Ga_Packet0;
        comm = comm | 1\\28;                        issue a packet request  
        ; meanwhile start setting up output PR 
        *Ga_Packet2.tPR_SrcStartAddress = La_Image2;
        *Ga_Packet2.tPR_DstStartAddress = dst;
        *Ga_Packet2.tPR_SrcBACount      = width;    a count is width:b is 0
        *Ga_Packet2.tPR_DstBACount      = width;    a count is width:b is 0
        *Ga_Packet2.fPR_SrcBBPitch      = width;
        *Ga_Packet2.fPR_DstBBPitch      = width;
        ; wait until the first PR has completed
        zero = comm & 1\\29;                    keep testing the PR queued bit
poll0:  br = [nz] ipe + (poll0 - $);
        nop;
        zero = comm & 1\\29;                    keep testing the PR queued bit
        ; start the second
        *(pba + fPR_LinkedListStart)    = Ga_Packet1;
        comm = comm | 1\\28;                        issue a packet request  
        
start:
        ; now we can process a line of data 
        *--sp = iprs;                           save iprs
        Lx_Offset = 0x0;                        default offset 
        call = ipe + (DoProcessing - $);
        zero = lc0 & 0x1;                       odd or even line?
        Lx_Offset =[eq] 0x800;                  other DRAM for the even lines
        iprs = *sp++;                           restore iprs
        ; wait until the PR has completed
        zero = comm & 1\\29;                    keep testing the PR queued bit
poll1:  br = [nz] ipe + (poll1 - $);
        nop;
        zero = comm & 1\\29;                    keep testing the PR queued bit
        ; write data to the destination and read in two new lines of data 
        zero = &*(Ga_Packet2 = dba + fPktReqAddressBase2);
        *(pba + fPR_LinkedListStart) = Ga_Packet2;
end:    comm = comm | 1\\28;                    issue a packet request  
        sr = 0x36;                              no more split ALU
        zero = &*(a0 = pba + fPP_reg_save0);
        br = iprs;                              return to caller
        a8 = *a0++;                             restore a8 and a9
        a9 = *a0++;
    
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; actually do the crossfade on a line. the crossfade operation is: a*x+(1-a)*y
; The following registers are initialized on entry:
;   ratio0
;   Lx_Offset:  offset to the image buffers
; The ALU must be set up for halfword split operations
; the ratio0 register should be saved
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
DoProcessing:
src0    .set    d2
src1    .set    d3
ratio1  .set    d4
prod1   .set    d5
prod0   .set    d6
sum     .set    d7
        width = width >> 1;                we do two samples each iteration
        || *--sp = width;                  I need to reuse this reg later on!
        ratio1 =u 0xFFFF - ratio0;         for second source
        ; setup pointer into my buffers
        Gx_Offset = Lx_Offset;             need a global offset and a local offset
        zero = &*(La_Image0 = dba + fBuff0);
        zero = &*(La_Image0 += Lx_Offset);
        zero = &*(Ga_Image1 = dba + fBuff1);
        zero = &*(Ga_Image1 += Gx_Offset);
        zero = &*(La_Image2 = dba + fBuff2);
        zero = &*(La_Image2 += Lx_Offset);
        d0 = ROT_8;                         prepare for ealu
        le1 = ipe + (loop_e - $);           loop end
        lr1 = width - 2;                    # iterations (- 1)
        ls1 = ipe + (loop_s - $);           loop start
        lctl = lctl | 0xA0;                 enable looping, associate with lc 1
        ; init loop, prime the data buffers
        src1 =uh *Ga_Image1++;               -> s1   
        || src0 =uh *La_Image0++;            -> s0
        prod1 =um ratio1 * src1;             -> s1xxs1xx
        || sum = ealu(ROT_8: sum\\8);        s0xxs1xx -> xxs1xxs0
        prod0 =um ratio0 * src0;             -> s0xxs0xx
        || src1 = ealu(ROT_8: sum\\8);       dummy operation for split multiply
        sum =m prod0 + prod1;                   sum = s0 + s1 (16 bits each)
        src1 =uh *Ga_Image1++;                  -> s1   
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        ; the inner loop for the crossfade
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
loop_s: prod1 =um ratio1 * src1;             -> s1xxs1xx
        || sum = ealu(ROT_8: sum\\8);        s0xxs1xx -> xxs1xxs0
        || src0 =uh *La_Image0++;            -> s0
        prod0 =um ratio0 * src0;             -> s0xxs0xx
        || src1 = ealu(ROT_8: sum\\8);       dummy operation for split multiply
        || sum =ub2 sum;                     sum -> s1
        || *La_Image2++ =b sum;              s0 -> image
loop_e: sum =m prod0 + prod1;                sum = s0 + s1 (16 bits each)
        || src1 =uh *Ga_Image1++;            -> s1
        || *La_Image2++ =b sum;              s1 -> image
        br = iprs;                           return to caller
        width = *sp++;
        nop;


Copyright © 1995, Dr. Dobb's Journal