PARALLEL DSP FOR DESIGNING ADAPTIVE FILTERS

Paralleled DSP chips implement the filter; here's how to program them

Daniel Chen

Daniel is an engineer with Texas Instruments. He can be reached there at 12203 S.W. Freeway, Houston, TX 77477.

A new generation of advanced computer applications consists of programs whose required execution speeds exceed what a single processor can deliver. In such situations, designers must abandon the classic, single-processor, serial, Von Neumann computer architecture in favor of some form of parallel architecture. Today, it is possible to design a parallel system in which multiple processors are connected to work concurrently on different parts of a problem, dramatically increasing the speed at which instructions can be executed.

The conventional parallel architecture goes by the acronym SIMD, for Single Instruction, Multiple Data stream. A SIMD computer's instruction sequence is similar to that of a Von Neumann machine but the instructions are executed in parallel on multiple sets of data by multiple processors. For problems whose structure is unsuitable for SIMD techniques, designers can go to the MIMD (Multiple Instruction, Multiple Data stream) architecture.

A MIMD computer uses several independent instruction sequences, each acting on a separate data stream. MIMD offers greater parallelism and flexibility than SIMD but is more complex in terms of the synchronization needed among instructions.

SIMD and MIMD computers can be implemented with general-purpose microprocessor elements, but it is now possible to get higher performance by using special Digital Signal Processing (DSP) chips. For a number of high-performance, computationally intensive applications such as 3-D graphics, telecommunications, video conferencing, and neural networks, DSP techniques are preferable to those used with a conventional microprocessor. And when such chips are interconnected in parallel DSP (pDSP) architectures, programs execute much faster than on general-purpose microprocessors.

Two factors are driving the trend toward pDSP. One is that DSP algorithms are inherently suited to task partitioning, which means that paralleled processors can be assigned to individual tasks. The second is the dramatic increase in DSP chip performance coupled with sharply lower prices compared to when these devices were introduced ten years ago. The result is that pDSP is becoming an increasingly cost-effective approach to achieving very high performance.

In addition to advanced applications (graphics, telecom, and so on), traditional DSP problems can be solved by multiple processor implementations. For example, filtering, correlation, and Fast Fourier Transforms (FFTs) are all functions representable by a signal-flow graph. Such flow graphs identify lower-level functions and their parallel interactions. Any problem that can be symbolized in this manner is a candidate for parallel processing.

Practical pDSP architectures can be implemented using the Texas Instruments TMS320C40, the first DSP chip designed specifically for parallel processing. High-performance systems can be designed because a virtually unlimited number of C40s can be interconnected.

The TMS320C40 incorporates the on-chip hardware necessary to meet the three main requirements of parallel processing systems: efficient interprocessor communication, high information throughput, and a high-performance Central Processing Unit (CPU). These requirements are met through parallel communication ports for high-speed, direct--no glue logic--interprocessor communications, a multichannel DMA (Direct Memory Access) coprocessor for concurrent I/O and immense throughput, and a high-performance, 32-bit floating-point CPU (see Figure 1). Backing up the hardware features is a full range of software development tools specifically designed for parallel processing systems.

Canceling Echoes with Adaptive Filters

A variety of telecommunication system problems involve echo cancellation. These problems crop up in long-distance telephone voice communications, full-duplex voiceband data modems, and high-performance, "hands-free" audio-conferencing systems. In each case, practical echo-canceling circuitry is based on the principles of adaptive filtering. Recent advances in DSP devices such as the C40 have led to the design of all-digital echo cancelers for both desktop and large systems.

An adaptive filter, upon which echo-cancellation hardware is based, is one whose coefficients can be updated by an adaptive algorithm that optimizes the filter's response to suit a desired performance criterion. A filter's coefficients determine its characteristics and output function. In circuits such as echo cancelers, the coefficients required to produce a given output cannot be determined when the input signal is presented because the coefficients depend on changing line or transmission conditions. Thus there is a need for an adaptive filter that can alter its coefficients to match the electrical and physical environments of the phone line.

Figure 2 illustrates the basic form of an adaptive filter.

The filter consists of two distinct parts: a filter structure designed to perform the signal processing function and an adaptive algorithm for altering the coefficients to suit the environment. An incoming signal, x(n), is weighted in a digital filter to produce an output, y(n). The adaptive algorithm adjusts the weights in the filter structure to minimize the error, e(n), between the output, y(n), and the desired response of the filter, d(n).

In a real-time application such as echo cancellation (adaptive prediction, noise cancellation, and channel equalization are others), an adaptive filter implementation based on a programmable DSP device such as the C40 has many advantages over a hard-wired filter. Not only do DSP chips consume less power and space, they simplify manufacturing requirements. And the programmability feature provides flexibility for system and software upgrades.

Adaptive filters require an implementation that provides fast multiplication (parallel hardware multiplier), high-speed data flow (a pipelined architecture), and large storage capacity. The C40 meets these requirements because its CPU contains a 40-bit floating-point multiplier and transfers data at 320 Mbytes/second with a 40-nanosecond cycle time. Moreover, the chip has a total memory space of 16 gigabytes, with 8 Kbytes of RAM and 16 Kbytes of ROM packed onto its own silicon. A 512-byte program cache and boot code ROM round out the C40's on-chip memory.

The DMA coprocessor runs concurrently with the CPU to maximize processing speed and throughput. Six high-speed parallel communication ports offer bidirectional data transfer rates between C40s of 20 Mbytes/second. Each port has its own FIFO buffers and arbitration logic. And bandwidth is high because multiple communication ports can be connected between processors without the need for glue logic.

While adaptive filters can be implemented in a variety of ways, the C40-based design to be described here uses a transversal filter and the LMS (Least Mean Square) update algorithm. LMS is relatively simple to design and implement and is well suited for many applications. The transversal filter--also known as a tapped delay line--is an FIR (Finite Impulse Response) type which offers greater stability than IIR (Infinite Impulse Response) types.
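In the notation used in the header comments of the assembly listings, the transversal filter and its LMS update for an N-tap filter with step size mu can be summarized as follows. This is the standard form of the LMS recursion, restated here for reference; the symbols match those in Listing Six.

```latex
\begin{aligned}
y(n) &= \sum_{k=0}^{N-1} w(k)\,x(n-k) \\
e(n) &= d(n) - y(n) \\
w(k) &\leftarrow w(k) + \mu\,e(n)\,x(n-k), \qquad k = 0, 1, \ldots, N-1
\end{aligned}
```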

The general architecture of an adaptive filter used in an echo canceler system consists of four TMS320C40 DSP devices operating in parallel. One such LMS implementation is shown in Figure 3, where the leftmost C40 performs the operation of convolution and the remaining three processors handle the updating of filter coefficients.

Convolution here is the inner product of two vectors: at each sample time, the elements of the input vector (the most recent input samples) are multiplied by the corresponding elements of the weight vector (the filter coefficients) and the products are summed. This system also uses a central or global memory to store coefficient data, but this architecture is not optimal for the highest performance.
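As a minimal sketch of that per-sample computation, the following C function forms the sum of products of a weight vector and a delay line. The function name fir_output and the convention that x[0] holds the newest sample are assumptions made for illustration; they follow the pseudo C listings but are not taken from them.

```c
#include <stddef.h>

/* Filter output for one sample period: the inner product of the weight
 * vector w and the delay line x, where x[0] is the newest input sample.
 * (fir_output is a hypothetical name used only in this sketch.) */
float fir_output(const float *w, const float *x, size_t n_taps)
{
    float y = 0.0f;                 /* accumulator starts at zero */
    for (size_t i = 0; i < n_taps; i++)
        y += w[i] * x[i];           /* one multiply-accumulate per tap */
    return y;
}
```

On the C40 the same loop maps onto a repeated multiply/add instruction pair, as the assembly listings show.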

To improve filter performance, each C40 takes advantage of its captive on-chip memory, as illustrated in the parallel architecture of Figure 4. Unlike the circuit in Figure 3, each processor carries out the convolution task and also handles updating of filter coefficients.

Each C40's internal 8-Kbyte Static RAM (SRAM) stores the routine for executing the convolution function and the data for coefficient updates. The boot-code ROMs contain information that initializes the pointers and arrays within the chip. Initialization sets the addresses of the ports used for communication between C40s, as well as the addresses for storing filter weight information and the filter's output data. The type of data stored and its function within the filter are shown in the detailed assembly language code provided with this article.

The Adaptive Filter Implemented

Figure 5 shows the signal flow between the four interconnected C40 DSP devices that make up the adaptive filter. Lines of communication between the processors are illustrated by the lines with arrowheads. The C40 communication ports are used to send signals among the processors.

In this transversal filter, the system input signal or input vector to C40 #1 is denoted by x(n) and the system output signal is y(n). Along with x(n), the desired response of the filter, d(n), is fed to the input of C40 #1. The error signal, e(n), is developed in C40 #1 and distributed to the other C40s in the system. C40s #2, #3, and #4 develop their own output signals y2(n), y3(n), and y4(n), which are returned to C40 #1 to form the system output signal y(n).
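Splitting the N taps across the four processors partitions the filter sum into four partial sums, one per C40. Written in the notation of the listing headers, with N = N1 + N2 + N3 + N4, the decomposition looks like this (the expression for y4 is extended here by following the pattern of the first three):

```latex
\begin{aligned}
y(n)   &= \sum_{k=0}^{N-1} w(k)\,x(n-k) = y_1(n) + y_2(n) + y_3(n) + y_4(n) \\
y_1(n) &= \sum_{k=0}^{N_1-1} w(k)\,x(n-k) \\
y_2(n) &= \sum_{k=0}^{N_2-1} w(N_1+k)\,x(n-N_1-k) \\
y_3(n) &= \sum_{k=0}^{N_3-1} w(N_1+N_2+k)\,x(n-N_1-N_2-k) \\
y_4(n) &= \sum_{k=0}^{N_4-1} w(N_1+N_2+N_3+k)\,x(n-N_1-N_2-N_3-k)
\end{aligned}
```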

The function of DSP devices #2, #3, and #4 in Figure 5 is to make an output-signal calculation based on input filter weights, the error signal, and the input signal from the previous stage. That calculation for each DSP device is then sent to C40 #1, which returns the error signal e(n) to the devices. Each DSP (#2, #3, #4) then updates the filter weights and passes a new input signal to the following stage. The pseudo C code for executing these steps in DSPs #2, #3, and #4 is given in Listings Two, Three, and Four (page 74).

The basic procedure is to initialize the filter weights, compute the value of output signal y, receive an input x from the preceding stage, make an updated calculation of y, and pass that value back to C40 #1. When the error signal is received from C40 #1, the individual stage can update the filter weights.
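The procedure for one of the stages #2 through #4 can be sketched in plain C as two functions: one that accepts a sample and produces the partial output, and one that applies the weight update once the error arrives. This is a single-processor model under stated assumptions: the comm-port transfers are replaced by function arguments and return values, the error is assumed to arrive already multiplied by the step size (as the assembly for C40 #1 does), and stage_t, stage_output, and stage_update are hypothetical names.

```c
#include <stddef.h>

/* One stage (#2, #3, or #4), modeled without the comm ports.
 * stage_t and both function names are hypothetical. */
typedef struct {
    float *x;      /* delay line; x[0] is the newest sample in this stage */
    float *w;      /* filter weights for this stage's taps */
    size_t n;      /* number of taps handled by this stage */
} stage_t;

/* Accept a sample from the preceding stage and return the partial output. */
float stage_output(stage_t *s, float x_in)
{
    float y = 0.0f;
    s->x[0] = x_in;
    for (size_t i = 0; i < s->n; i++)
        y += s->w[i] * s->x[i];
    return y;
}

/* Apply the LMS update using mu*e, shift the delay line by one sample,
 * and return the oldest sample, which is passed to the following stage. */
float stage_update(stage_t *s, float mu_e)
{
    float x_out = s->x[s->n - 1];       /* oldest sample leaves the stage */
    for (size_t i = s->n; i-- > 0; ) {
        s->w[i] += mu_e * s->x[i];      /* w(k) = w(k) + mu*e(n)*x(n-k) */
        if (i > 0)
            s->x[i] = s->x[i - 1];      /* shift toward the older end */
    }
    return x_out;
}
```

The explicit shift stands in for the circular-buffer addressing that the assembly listings use; the result is the same, but the copy would be wasteful on real hardware.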

A more extensive set of computations is carried out in C40 #1, which not only calculates its own y output signal but receives the y outputs from stages #2, #3, and #4. These values must be summed together to form the total output signal y(n). This stage also computes the error signal, e, which is derived by subtracting the output y from the desired value d. The pseudo C code program for executing the functions of C40 #1 is given in Listing One (page 74).
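The combining step performed by C40 #1 is small enough to show in isolation. The sketch below takes the four partial outputs and the desired response as plain arguments in place of the comm-port reads; master_out_t and master_combine are hypothetical names introduced for illustration.

```c
/* Values produced by C40 #1 for one sample period.
 * master_out_t and master_combine are hypothetical names. */
typedef struct {
    float y;   /* system output y(n) */
    float e;   /* error signal e(n) = d(n) - y(n) */
} master_out_t;

master_out_t master_combine(float y1, float y2, float y3, float y4, float d)
{
    master_out_t out;
    out.y = y1 + y2 + y3 + y4;   /* sum the partial outputs of all four stages */
    out.e = d - out.y;           /* error relative to the desired response */
    return out;
}
```

In the assembly version, the error is additionally multiplied by the step size before it is sent to the other processors, so each of them saves one multiplication per sample.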

Listings Six through Nine (pages 74 to 77) respectively are the C40 assembly code versions of Listings One through Four. (Listing Five, page 74, is CONST.H, the file that sets up the constants for Listings Six through Nine.) In each listing, the program begins with an initialization routine to set the initial inputs and filter weights and to set the pointers for the communication ports.

The primary instructions for accomplishing these operations are LDI (Load Integer) and STI (Store Integer). The LDP instruction in Listing Six is an alternate form of LDI used to load the data-page pointer register. STF is the command to store a floating-point value in an internal memory location.

To perform the computation of the filter output signal in any stage (y1(n), y(n)), the assembly code uses the RPTBD command. This command allows a block of instructions to be repeated a number of times without incurring any penalty for looping.

The architecture of the C40 devices allows for the execution of parallel instructions, which simplifies programming and speeds execution. Thus, instructions MPYF3||SUBF3 and MPYF3||ADDF3 allow a floating-point multiplication and a floating-point subtraction, or a floating-point multiplication and a floating-point addition, to be carried out simultaneously. Together with the RPTBD command previously mentioned, these instructions allow the output values y1(n) and y(n) to be calculated on a continuous basis with updated data.
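The scheduling that the parallel multiply/add pair imposes on the filter loop can be mimicked in C. In the sketch below, each pass forms the product for the current tap while adding the product from the previous tap, and one extra addition after the loop picks up the final product, which mirrors the "include last result" step in the assembly. C cannot express the simultaneous issue of the two operations, so this shows only the ordering; pipelined_dot is a hypothetical name, and the function assumes at least one tap.

```c
#include <stddef.h>

/* Dot product ordered the way the MPYF3 || ADDF3 loop is scheduled.
 * pipelined_dot is a hypothetical name; n_taps must be at least 1. */
float pipelined_dot(const float *x, const float *w, size_t n_taps)
{
    float sum  = 0.0f;             /* accumulator, cleared before the loop */
    float prod = x[0] * w[0];      /* first multiply, paired with the clear */

    for (size_t i = 1; i < n_taps; i++) {
        float next = x[i] * w[i];  /* multiply for tap i ...               */
        sum += prod;               /* ... alongside the add for tap i - 1  */
        prod = next;
    }
    return sum + prod;             /* extra add: include the last product */
}
```

Because the addition always trails the multiplication by one tap, the loop body runs one time fewer than the number of taps, which is why the listings load the repeat counter with a value derived from the filter order rather than the order itself.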

To update the adaptive filter coefficients, a program using the block repeat instruction (RPTBD) and the parallel multiply commands provides a simple and concise routine. See the portion of the code, "Update weights w(n)" in Listings Six through Nine. The simplicity of the code is due to the powerful architecture of the C40 DSP devices. Note that the parallel command used in this subroutine is the MPYF3||STF, which permits a simultaneous multiplication and store of a floating-point value.

Programs for the C40 can be written in the ANSI C language and translated directly into the highly optimized assembly code used in this adaptive filter example. This is accomplished through the TI TMS320C40 Optimizing C Compiler, which allows C programs to be linked with assembly language routines, and allows for handcoding of time-critical routines directly in C40 Assembly language. The compiler conforms exactly to the ANSI C specification and contains a C-shell program to facilitate a one-step translation from C source code to executable code.

Also incorporated in the C40 is SPOX, a hardware-independent software base for a real-time DSP operating system. SPOX features a set of high-level C-callable software functions that are independent of the underlying hardware platform, thus insulating real-time DSP applications from numerous low-level system details. The SPOX operating system plays an integral role in application development, from the concept of new algorithms to integration of application software into production hardware.



[LISTING ONE]



/******* PSEUDO C CODE FOR CASCADE ADAPTIVE FILTER #1 *******/
/* Initialization */
    xptr = &x[0];
    wptr = &w[0];

    for (i=0;i<N1;i++){
    *xptr++ = 0.0;
    *wptr++ = 0.0;
    }
/*               N1-1
 *  Compute y1 = SUM w[i] * x[i]
 *               i=0
 */
    xptr = &x[0];
    wptr = &w[0];
    input(x);              /* input x from A/D converter */
    *xptr = x;
    input (d);             /* input d from A/D converter */

    y1 = 0.0;              /* clear the accumulator for this sample */
    for (i=0;i<N1;i++)
        y1 += *xptr++ * *wptr++;
/* Compute  y = y1 + y2 + y3 + y4 */
    receive(y2,y3,y4);         /* receive y2, y3, y4 from processors 2, 3, 4 */
    y = y1 + y2 + y3 + y4;
/* Compute error signal e */
    e = d - y;
    output(y);          /* output y to D/A converter */
    pass(e);            /* pass e to processor 2, 3, 4 */
/* Update filter weights w[] */
    xptr = &x[N1-1];
    wptr = &w[N1-1];
    pass (*xptr);       /* pass x(n-N1) to processor #2 */
    for (i=N1;i>0;i--){
        *wptr-- += mu * e * *xptr--;
        if (i > 1)
            *(xptr+1) = *xptr;  /* delay-line shift; the assembly uses a circular buffer */
    }

[LISTING TWO]



/******* PSEUDO C CODE FOR CASCADE ADAPTIVE FILTER #2 *******/
/* Initialization */
    xptr = &x[0];
    wptr = &w[0];
    for (i=0;i<N2;i++){
    *xptr++ = 0.0;
    *wptr++ = 0.0;
    }
/*               N2-1
 *  Compute y2 = SUM w[i] * x[i]
 *               i=0
 */
    xptr = &x[0];
    wptr = &w[0];
    receive(x);          /* receive x(n-N1) from processor #1 */
    *xptr = x;
    y2 = 0.0;            /* clear the accumulator for this sample */
    for (i=0;i<N2;i++)
        y2 += *xptr++ * *wptr++;
/* pass y2 and receive e */
    pass(y2);           /* pass y2 to processor #1 */
    receive(e);         /* receive e(n) from processor #1 */
/* Update filter weights w[] */
    xptr = &x[N2-1];
    wptr = &w[N2-1];
    pass (*xptr);       /* pass x(n-N1-N2) to processor #3 */
    for (i=N2;i>0;i--){
        *wptr-- += mu * e * *xptr--;
        if (i > 1)
            *(xptr+1) = *xptr;  /* delay-line shift; the assembly uses a circular buffer */
    }

[LISTING THREE]



/****** PSEUDO C CODE FOR CASCADE ADAPTIVE FILTER #3 ******/
/* Initialization */
    xptr = &x[0];
    wptr = &w[0];

    for (i=0;i<N3;i++){
    *xptr++ = 0.0;
    *wptr++ = 0.0;
    }
/*               N3-1
 *  Compute y3 = SUM w[i] * x[i]
 *               i=0
 */
    xptr = &x[0];
    wptr = &w[0];
    receive(x);          /* receive x(n-N1-N2) from processor #2 */
    *xptr = x;

    y3 = 0.0;            /* clear the accumulator for this sample */
    for (i=0;i<N3;i++)
        y3 += *xptr++ * *wptr++;
/* pass y3 and receive e */
    pass(y3);           /* pass y3 to processor #1 */
    receive(e);         /* receive e(n) from processor #1 */

/* Update filter weights w[] */
    xptr = &x[N3-1];
    wptr = &w[N3-1];
    pass (*xptr);       /* pass x(n-N1-N2-N3) to processor #4 */
    for (i=N3;i>0;i--){
        *wptr-- += mu * e * *xptr--;
        if (i > 1)
            *(xptr+1) = *xptr;  /* delay-line shift; the assembly uses a circular buffer */
    }

[LISTING FOUR]



/****** PSEUDO C CODE FOR CASCADE ADAPTIVE FILTER #4 ******/
/* Initialization */
    xptr = &x[0];
    wptr = &w[0];

    for (i=0;i<N4;i++){
    *xptr++ = 0.0;
    *wptr++ = 0.0;
    }
/*               N4-1
 *  Compute y4 = SUM w[i] * x[i]
 *               i=0
 */
    xptr = &x[0];
    wptr = &w[0];
    receive(x);          /* receive x(n-N1-N2-N3) from processor #3 */
    *xptr = x;

    y4 = 0.0;            /* clear the accumulator for this sample */
    for (i=0;i<N4;i++)
        y4 += *xptr++ * *wptr++;
/* pass y4 and receive e */
    pass(y4);           /* pass y4 to processor #1 */
    receive(e);         /* receive e(n) from processor #1 */

/* Update filter weights w[] */
    xptr = &x[N4-1];
    wptr = &w[N4-1];
    for (i=N4;i>0;i--){
        *wptr-- += mu * e * *xptr--;
        if (i > 1)
            *(xptr+1) = *xptr;  /* delay-line shift; the assembly uses a circular buffer */
    }

[LISTING FIVE]



**********************************************************************
*   CONST.H - This file sets up the constants for Cascade TMS320C40
*   Adaptive Filter programs: LMS1.ASM LMS2.ASM LMS3.ASM LMS4.ASM
**********************************************************************
order1      .set    N1             ; filter order for #1 C40
order2      .set    N2             ; filter order for #2 C40
order3      .set    N3             ; filter order for #3 C40
order4      .set    N4             ; filter order for #4 C40
mu      .set    0.01           ; step size
io_port     .set    0100081h           ; data I/O comm port addr for d, x, & y
C40_1_2     .set    0100041h           ; comm port address from #1 to #2 C40
C40_1_3     .set    0100051h           ; comm port address from #1 to #3 C40
C40_1_4     .set    0100061h           ; comm port address from #1 to #4 C40
C40_2_1     .set    0100071h           ; comm port address from #2 to #1 C40
C40_2_3     .set    0100061h           ; comm port address from #2 to #3 C40
C40_2_4     .set    0100051h           ; comm port address from #2 to #4 C40
C40_3_1     .set    0100081h           ; comm port address from #3 to #1 C40
C40_3_2     .set    0100071h           ; comm port address from #3 to #2 C40
C40_3_4     .set    0100061h           ; comm port address from #3 to #4 C40
C40_4_1     .set    0100071h           ; comm port address from #4 to #1 C40
C40_4_2     .set    0100081h           ; comm port address from #4 to #2 C40
C40_4_3     .set    0100091h           ; comm port address from #4 to #3 C40

[LISTING SIX]



******************************************************************
*    LMS1 :  Cascade TMS320C40 adaptive filter #1 Using Transversal
*        Structure and LMS Algorithm, Looped Code
*    Configuration:
*        d(n) --------------------------+
*                       |
*               e(n)        |+
*                 +-----<-----(SUM)
*                 |         |-
*             --------+--------     |
*        x(n) ----|Adaptive Filter|-----+--------> y(n)
*             -----------------
*         +--------<-------+-------<--------+-------<--------+
*         |        |y2(n)       |y3(n)       |y4(n)
*   y(n)<-+   |        |            |            |
*     |  +----+----+      +----+----+      +----+----+  +----+----+
*     +--|TMS320C40|x(n1) |TMS320C40|x(n2) |TMS320C40|x(n3) |TMS320C40|
*   x(n)---->|         |----->|     |----->|     |----->|     |
*     +->|   # 1   |      |   # 2   |      |   # 3   |  |   # 4   |
*     |  +----+----+      +----+----+      +----+----+  +----+----+
*   d(n)--+   |        |            |            |
*         e(n)|        |            |            |
*         +-------->-------+------->--------+------->--------+
*         where n1 = n-N1, n2 = n-N1-N2, and n3 = n-N1-N2-N3
*    Algorithm for processor #1:
*       N1-1
*   y1(n) = SUM w(k)*x(n-k)    k=0,1,2,...,N1-1
*       k=0
*   y(n) = y1(n) + y2(n) + y3(n) + y4(n)
*       e(n) = d(n) - y(n)
*   w(k) = w(k) + u*e(n)*x(n-k) k=0,1,2,...,N1-1
*   where filter order N = N1 + N2 + N3 + N4 and u is the step size mu.
**********************************************************************
        .include "const.h"         ; include the constant definition file
        .sect    "vector"
reset       .word     begin
;   Initialize pointers and arrays
;     xptr = &x[0];
;     wptr = &w[0];
;     for (i=0;i<N1;i++){
;     *xptr++ = 0.0;
;     *wptr++ = 0.0;
;     }
        .text
begin       .set    $
        LDP     @io_addr           ; set data page
        LDI     0,R2           ; R2 = 0
        LDF     0.0,R1         ; R1 = 0.0
        LDI     @io_addr,AR4       ; set pointer for data I/O
        LDI     @C40addr2,AR5      ; set pointer for #2 C40 comm port
        LDI     @C40addr3,AR6      ; set pointer for #3 C40 comm port
        LDI     @C40addr4,AR7      ; set pointer for #4 C40 comm port
        LDI     @xn_addr,AR0       ; set pointer for x[]
        LDI     @wn_addr,AR1       ; set pointer for w[]
        STI     R2,*-AR5(1)        ; enable #2 C40 comm port
        STI     R2,*-AR6(1)        ; enable #3 C40 comm port
        STI     R2,*-AR7(1)        ; enable #4 C40 comm port
        STF     R1,*+AR5(1)        ; start #2 C40
        RPTS    order1-1
        STF     R1,*AR0++(1)%      ; x[] = 0.
    ||  STF     R1,*AR1++(1)%      ; w[] = 0.
        LDI     order1,BK          ; set up circular buffer
input:
;   Compute filter output y1(n)
;     xptr = &x[0];
;     wptr = &w[0];
;     input(x);          /* input x from A/D converter */
;     input (d);          /* input d from A/D converter */
;     *xptr = x;
;     for (i=0;i<N1;i++)
;     y1 += *xptr++ * *wptr++;
        LDI     order1-2,RC
        RPTBD   filter
        LDF     *AR4,R6        ; input x(n)
        LDF     *AR4,R7        ; input d(n)
    ||  STF     R6,*AR0        ; insert x(n) to buffer
        MPYF3   *AR0++(1)%,*AR1++(1)%,R1
    ||  SUBF3   R2,R2,R2           ; R2 = 0.0
filter      MPYF3   *AR0++(1)%,*AR1++(1)%,R1
    ||  ADDF3   R1,R2,R2           ; y1(n) = w[].x[]
        ADDF    R1,R2          ; include last result
;   compute y(n) signals
;     receive(y2,y3,y4);    /* receive y2, y3, y4 from processors 2, 3, 4 */
;     y = y1 + y2 + y3 + y4;
        ADDF    *AR5,R2        ; add y2(n)
        ADDF    *AR6,R2        ; add y3(n)
        ADDF    *AR7,R2        ; add y4(n)
;   Compute error signal e(n)
;     e = d - y;
;     pass(e);            /* pass e to processor 2, 3, 4 */
        SUBF    R2,R7          ; e(n) = d(n) - y(n)
        MPYF    @u,R7          ; R7 = err = e(n) * u
;   Output y(n) signal and e(n)
;     output(y);          /* output y to D/A converter */
;     pass(e);            /* pass e to processor 2, 3, 4 */
        STF     R7,*+AR5(1)        ; send out e(n)
    ||  STF     R7,*+AR6(1)        ; send out e(n)
        STF     R2,*+AR4(1)        ; send out y(n)
    ||  STF     R7,*+AR7(1)        ; send out e(n)
;   Update weights w(n)
;     xptr = &x[N1-1];
;     wptr = &w[N1-1];
;     pass (*xptr);       /* pass x(n-N1) to processor #2 */
;     for (i=N1;i>0;i--){
;     *wptr-- += mu * e * *xptr--;
;     *(xptr+1) = *xptr;      /* delayed tap is implemented
;                    in circular buffer      */
;     }
        LDI     order1-3,RC        ; initialize repeat counter
        RPTBD   weight         ; do i = 0, N-3
        MPYF3   R7,*AR0++(1)%,R1   ; R1 = err * x(n)
        ADDF3   R1,*AR1,R2         ; R2 = wi(n) + err * x(n)
        NOP

        MPYF3   R7,*AR0++(1)%,R1   ; R1 = err * x(n-i-1)
    ||  STF     R2,*AR1++(1)%      ; update wi(n+1)
weight      ADDF3   R1,*AR1,R2         ; R2 = wi(n) + err * x(n-i)
        LDF     *AR0,R6
    ||  STF     R2,*AR1++(1)%      ; update wi(n+1)
        BD      input          ; delay branch
        MPYF3   R7,*AR0,R1         ; R1 = err * x(n-N+1)
    ||  STF     R6,*+AR5(1)        ; shift x(n-N) to #2 C40
        ADDF3   R1,*AR1,R2         ; R2 = wi(n-N+1) + err * x(n-N+1)
        STF     R2,*AR1++(1)%      ; update last w

;   Define constants
xn      .usect  "buffer",order1
wn      .usect  "coeffs",order1
        .data
io_addr     .word   io_port
C40addr2    .word   C40_1_2
C40addr3    .word   C40_1_3
C40addr4    .word   C40_1_4
xn_addr     .word   xn
wn_addr     .word   wn
u       .float  mu
        .end

[LISTING SEVEN]



******************************************************************
*    LMS2 :  Cascade TMS320C40 adaptive filter #2 Using Transversal
*        Structure and LMS Algorithm, Looped Code
*    Configuration:
*        d(n) --------------------------+
*                       |
*               e(n)        |+
*                 +-----<-----(SUM)
*                 |         |-
*             --------+--------     |
*        x(n) ----|Adaptive Filter|-----+--------> y(n)
*             -----------------
*         +--------<-------+-------<--------+-------<--------+
*         |        |y2(n)       |y3(n)       |y4(n)
*   y(n)<-+   |        |            |            |
*     |  +----+----+      +----+----+      +----+----+  +----+----+
*     +--|TMS320C40|x(n1) |TMS320C40|x(n2) |TMS320C40|x(n3) |TMS320C40|
*   x(n)---->|         |----->|     |----->|     |----->|     |
*     +->|   # 1   |      |   # 2   |      |   # 3   |  |   # 4   |
*     |  +----+----+      +----+----+      +----+----+  +----+----+
*   d(n)--+   |        |            |            |
*         e(n)|        |            |            |
*         +-------->-------+------->--------+------->--------+
*         where n1 = n-N1, n2 = n-N1-N2, and n3 = n-N1-N2-N3
*    Algorithm for processor #2:
*       N2-1
*   y2(n) = SUM w(N1+k)*x(n-N1-k)    k=0,1,2,...,N2-1
*       k=0
*   w(N1+k) = w(N1+k) + u*e(n)*x(n-N1-k) k=0,1,2,...,N2-1
*   where filter order N = N1 + N2 + N3 + N4 and u is the step size mu.
**********************************************************************
        .include "const.h"         ; include the constant definition file
        .sect   "vector"
reset       .word   begin
;   Initialize pointers and arrays
;     xptr = &x[0];
;     wptr = &w[0];
;     for (i=0;i<N2;i++){
;     *xptr++ = 0.0;
;     *wptr++ = 0.0;
;     }
        .text
begin       .set    $
        LDP     @C40addr1          ; set data page
        LDI     0,R2           ; R2 = 0
        LDF     0.0,R1         ; R1 = 0.0
        LDI     @C40addr1,AR5      ; set pointer for #1 C40 comm port
        LDI     @C40addr3,AR6      ; set pointer for #3 C40 comm port
        LDI     @C40addr4,AR7      ; set pointer for #4 C40 comm port
        LDI     @xn_addr,AR0       ; set pointer for x[]
        LDI     @wn_addr,AR1       ; set pointer for w[]
        STI     R2,*-AR6(1)        ; enable #3 C40 comm port
        STI     R2,*-AR5(1)        ; enable #1 C40 comm port
        STI     R2,*-AR7(1)        ; enable #4 C40 comm port
        STF     R1,*+AR6(1)        ; start #3 C40
        RPTS    order2-1
        STF     R1,*AR0++(1)%      ; x[] = 0.
    ||  STF     R1,*AR1++(1)%      ; w[] = 0.
        LDI     order2,BK       ; set up circular buffer
input:
;   Compute filter output y2(n)
;     xptr = &x[0];
;     wptr = &w[0];
;     receive(x);          /* receive x(n-N1) from processor #1 */
;     *xptr = x;
;     for (i=0;i<N2;i++)
;    y2 += *xptr++ * *wptr++;
        LDI     order2-2,RC
        RPTBD   filter
        LDF     *AR5,R6        ; input x(n)
        STF     R6,*AR0        ; insert x(n) to buffer
        MPYF3   *AR0++(1)%,*AR1++(1)%,R1
    ||  SUBF3   R2,R2,R2           ; R2 = 0.0
filter      MPYF3   *AR0++(1)%,*AR1++(1)%,R1
    ||  ADDF3   R1,R2,R2           ; y2(n) = w[].x[]
        ADDF    R1,R2          ; include last result
;   Output y2(n) signals
;     pass(y2);           /* pass y2 to processor #1 */
        STF     R2,*+AR5(1)        ; send y2(n) to #1 C40
;   Input error signal e(n)
;     receive(e);         /* receive e(n) from processor #1 */
        LDF     *AR5,R7        ; load e(n) from #1 C40
;   Update weights w(n)
;     xptr = &x[N2-1];
;     wptr = &w[N2-1];
;     pass (*xptr);       /* pass x(n-N1-N2) to processor #3 */
;     for (i=N2;i>0;i--){
;     *wptr-- += mu * e * *xptr--;
;     *(xptr+1) = *xptr;      /* delayed tap is implemented
;                    in circular buffer      */
;     }
;
        LDI     order2-3,RC        ; initialize repeat counter
        RPTBD   weight         ; do i = 0, N2-3
        MPYF3   R7,*AR0++(1)%,R1   ; R1 = err * x(n)
        ADDF3   R1,*AR1,R2         ; R2 = wi(n) + err * x(n)
        NOP

        MPYF3   R7,*AR0++(1)%,R1   ; R1 = err * x(n-i-1)
    ||  STF     R2,*AR1++(1)%      ; update wi(n+1)
weight      ADDF3   R1,*AR1,R2         ; R2 = wi(n) + err * x(n-i)

        LDF     *AR0,R6
    ||  STF     R2,*AR1++(1)%      ; update wi(n+1)
        BD      input          ; delay branch
        MPYF3   R7,*AR0,R1         ; R1 = err * x(n-N+1)
    ||  STF     R6,*+AR6(1)        ; shift x(n-N) to #3 C40
        ADDF3   R1,*AR1,R2         ; R2 = wi(n-N+1) + err * x(n-N+1)
        STF     R2,*AR1++(1)%      ; update last w

;   Define constants
xn      .usect  "buffer",order2
wn      .usect  "coeffs",order2
        .data
C40addr1    .word   C40_2_1
C40addr3    .word   C40_2_3
C40addr4    .word   C40_2_4
xn_addr     .word   xn
wn_addr     .word   wn
        .end

[LISTING EIGHT]



******************************************************************
*    LMS3 :  Cascade TMS320C40 adaptive filter #3 Using Transversal
*        Structure and LMS Algorithm, Looped Code
*    Configuration:
*        d(n) --------------------------+
*                       |
*               e(n)        |+
*                 +-----<-----(SUM)
*                 |         |-
*             --------+--------     |
*        x(n) ----|Adaptive Filter|-----+--------> y(n)
*             -----------------
*         +--------<-------+-------<--------+-------<--------+
*         |        |y2(n)       |y3(n)       |y4(n)
*   y(n)<-+   |        |            |            |
*     |  +----+----+      +----+----+      +----+----+  +----+----+
*     +--|TMS320C40|x(n1) |TMS320C40|x(n2) |TMS320C40|x(n3) |TMS320C40|
*   x(n)---->|         |----->|     |----->|     |----->|     |
*     +->|   # 1   |      |   # 2   |      |   # 3   |  |   # 4   |
*     |  +----+----+      +----+----+      +----+----+  +----+----+
*   d(n)--+   |        |            |            |
*         e(n)|        |            |            |
*         +-------->-------+------->--------+------->--------+
*         where n1 = n-N1, n2 = n-N1-N2, and n3 = n-N1-N2-N3
*    Algorithm for processor #3:
*       N3-1
*   y3(n) = SUM w(N1+N2+k)*x(n-N1-N2-k)    k=0,1,2,...,N3-1
*       k=0
*   w(N1+N2+k) = w(N1+N2+k) + u*e(n)*x(n-N1-N2-k) k=0,1,2,...,N3-1
*   where filter order N = N1 + N2 + N3 + N4 and u is the step size mu.
**********************************************************************
        .include "const.h"         ; include the constant definition file
        .sect    "vector"
reset       .word     begin
;   Initialize pointers and arrays
;     xptr = &x[0];
;     wptr = &w[0];
;     for (i=0;i<N3;i++){
;     *xptr++ = 0.0;
;     *wptr++ = 0.0;
;     }
        .text
begin       .set    $
        LDP     @C40addr1          ; set data page
        LDI     0,R2           ; R2 = 0
        LDF     0.0,R1         ; R1 = 0.0
        LDI     @C40addr1,AR5      ; set pointer for #1 C40 comm port
        LDI     @C40addr2,AR6      ; set pointer for #2 C40 comm port
        LDI     @C40addr4,AR7      ; set pointer for #4 C40 comm port
        LDI     @xn_addr,AR0       ; set pointer for x[]
        LDI     @wn_addr,AR1       ; set pointer for w[]
        STI     R2,*-AR7(1)        ; enable #4 C40 comm port
        STI     R2,*-AR6(1)        ; enable #2 C40 comm port
        STI     R2,*-AR5(1)        ; enable #1 C40 comm port
        STF     R1,*+AR7(1)        ; start #4 C40
        LDI     order3,BK          ; set up circular buffer before any % access
        RPTS    order3-1
        STF     R1,*AR0++(1)%      ; x[] = 0.
    ||  STF     R1,*AR1++(1)%      ; w[] = 0.
input:
;   Compute filter output y(n)
;     xptr = &x[0];
;     wptr = &w[0];
;     receive(x);          /* receive x(n-N1-N2) from processor #2 */
;     *xptr = x;
;     for (i=0;i<N3;i++)
;     y3 += *xptr++ * *wptr++;
        LDI     order3-2,RC
        RPTBD   filter
        LDF     *AR6,R6        ; receive x(n-N1-N2) from #2 C40
        STF     R6,*AR0        ; insert it into the circular buffer
        MPYF3   *AR0++(1)%,*AR1++(1)%,R1
    ||  SUBF3   R2,R2,R2           ; R2 = 0.0
filter      MPYF3   *AR0++(1)%,*AR1++(1)%,R1
    ||  ADDF3   R1,R2,R2           ; y3(n) = w[].x[]
        ADDF    R1,R2          ; include last result
;   Output y3(n) signal
;     pass(y3);           /* pass y3 to processor #1 */
        STF     R2,*+AR5(1)        ; send y3(n) to #1 C40
;   Input error signal e(n)
;     receive(e);         /* receive e(n) from processor #1 */
        LDF     *AR5,R7        ; load e(n) from #1 C40
;   Update weights w(n)
;     xptr = &x[N3-1];
;     wptr = &w[N3-1];
;     pass (*xptr);       /* pass x(n-N1-N2-N3) to processor #4 */
;     for (i=N3;i>0;i--){
;     *wptr-- += mu * e * *xptr--;
;     *(xptr+1) = *xptr;      /* delayed tap is implemented
;                    in circular buffer      */
;     }
;
        LDI     order3-3,RC        ; initialize repeat counter
        RPTBD   weight         ; do i = 0, N3-3
        MPYF3   R7,*AR0++(1)%,R1   ; R1 = err * x(n)
        ADDF3   R1,*AR1,R2         ; R2 = wi(n) + err * x(n)
        NOP

        MPYF3   R7,*AR0++(1)%,R1   ; R1 = err * x(n-i-1)
    ||  STF     R2,*AR1++(1)%      ; update wi(n+1)
weight      ADDF3   R1,*AR1,R2         ; R2 = wi(n) + err * x(n-i)

        LDF     *AR0,R6
    ||  STF     R2,*AR1++(1)%      ; update wi(n+1)
        BD      input          ; delay branch
        MPYF3   R7,*AR0,R1         ; R1 = err * x(n-N+1)
    ||  STF     R6,*+AR7(1)        ; shift x(n-N) to #4 C40
        ADDF3   R1,*AR1,R2         ; R2 = wi(n-N+1) + err * x(n-N+1)
        STF     R2,*AR1++(1)%      ; update last w

;   Define constants
xn      .usect  "buffer",order3
wn      .usect  "coeffs",order3
        .data
C40addr1    .word   C40_3_1
C40addr2    .word   C40_3_2
C40addr4    .word   C40_3_4
xn_addr     .word   xn
wn_addr     .word   wn
        .end

[LISTING NINE]



******************************************************************
*    LMS4 :  Cascade TMS320C40 adaptive filter #4 Using Transversal
*        Structure and LMS Algorithm, Looped Code
*  Configuration:
*        d(n) --------------------------+
*                                       |
*                             e(n)      |+
*                         +-----<-----(SUM)
*                         |             |-
*                 --------+--------     |
*        x(n) ----|Adaptive Filter|-----+--------> y(n)
*                 -----------------
*                  +--------<-------+-------<--------+-------<--------+
*                  |                |y2(n)           |y3(n)           |y4(n)
*    y(n)<-+       |                |                |                |
*          |  +----+----+      +----+----+      +----+----+      +----+----+
*          +--|TMS320C40|x(n1) |TMS320C40|x(n2) |TMS320C40|x(n3) |TMS320C40|
*    x(n)---->|         |----->|         |----->|         |----->|         |
*          +->|   # 1   |      |   # 2   |      |   # 3   |      |   # 4   |
*          |  +----+----+      +----+----+      +----+----+      +----+----+
*    d(n)--+       |                |                |                |
*              e(n)|                |                |                |
*                  +-------->-------+------->--------+------->--------+
*         where n1 = n-N1, n2 = n-N1-N2, and n3 = n-N1-N2-N3
*  Algorithm for processor #4:
*                N4-1
*        y4(n) = SUM w(N1+N2+N3+k)*x(n-N1-N2-N3-k)
*                k=0
*        w(N1+N2+N3+k) = w(N1+N2+N3+k) + u*e(n)*x(n-N1-N2-N3-k)   k=0,1,2,...,N4-1
*    where filter order N = N1 + N2 + N3 + N4 and u is the step size mu.
**********************************************************************
        .include "const.h"         ; include the constant definition file
        .sect    "vector"
reset       .word     begin
;   Initialize pointers and arrays
;     xptr = &x[0];
;     wptr = &w[0];
;     for (i=0;i<N4;i++){
;     *xptr++ = 0.0;
;     *wptr++ = 0.0;
;     }
        .text
begin       .set    $
        LDP     @C40addr1          ; set data page
        LDI     0,R2           ; R2 = 0
        LDF     0.0,R1         ; R1 = 0.0
        LDI     @C40addr1,AR5      ; set pointer for #1 C40 comm port
        LDI     @C40addr2,AR6      ; set pointer for #2 C40 comm port
        LDI     @C40addr3,AR7      ; set pointer for #3 C40 comm port
        LDI     @xn_addr,AR0       ; set pointer for x[]
        LDI     @wn_addr,AR1       ; set pointer for w[]
        STI     R2,*-AR5(1)        ; enable #1 C40 comm port
        STI     R2,*-AR6(1)        ; enable #2 C40 comm port
        STI     R2,*-AR7(1)        ; enable #3 C40 comm port
        LDI     order4,BK          ; set up circular buffer before any % access
        RPTS    order4-1
        STF     R1,*AR0++(1)%      ; x[] = 0.
    ||  STF     R1,*AR1++(1)%      ; w[] = 0.
input:
;   Compute filter output y(n)
;     xptr = &x[0];
;     wptr = &w[0];
;     receive(x);          /* receive x(n-N1-N2-N3) from processor #3 */
;     *xptr = x;
;     for (i=0;i<N4;i++)
;     y4 += *xptr++ * *wptr++;
        LDI     order4-2,RC
        RPTBD   filter
        LDF     *AR7,R6        ; receive x(n-N1-N2-N3) from #3 C40
        STF     R6,*AR0        ; insert it into the circular buffer
        MPYF3   *AR0++(1)%,*AR1++(1)%,R1
    ||  SUBF3   R2,R2,R2           ; R2 = 0.0
filter      MPYF3   *AR0++(1)%,*AR1++(1)%,R1
    ||  ADDF3   R1,R2,R2           ; y4(n) = w[].x[]
        ADDF    R1,R2          ; include last result
;   Output y4(n) signals
;     pass(y4);           /* pass y4 to processor #1 */
        STF     R2,*+AR5(1)        ; send y4(n) to #1 C40
;   Input error signal e(n)
;     receive(e);         /* receive e(n) from processor #1 */
        LDF     *AR5,R7        ; load e(n) from #1 C40
;   Update weights w(n)
;     xptr = &x[N4-1];
;     wptr = &w[N4-1];
;     for (i=N4;i>0;i--){
;     *wptr-- += mu * e * *xptr--;
;     *(xptr+1) = *xptr;      /* delayed tap is implemented
;                    in circular buffer      */
;     }
        LDI     order4-3,RC        ; initialize repeat counter
        RPTBD   weight         ; do i = 0, N4-3
        MPYF3   R7,*AR0++(1)%,R1   ; R1 = err * x(n)
        ADDF3   R1,*AR1,R2         ; R2 = wi(n) + err * x(n)
        NOP

        MPYF3   R7,*AR0++(1)%,R1   ; R1 = err * x(n-i-1)
    ||  STF     R2,*AR1++(1)%      ; update wi(n+1)
weight      ADDF3   R1,*AR1,R2         ; R2 = wi(n) + err * x(n-i)

        BD      input          ; delay branch
        MPYF3   R7,*AR0,R1         ; R1 = err * x(n-N+1)
    ||  STF     R2,*AR1++(1)%      ; update wi(n+1)
        ADDF3   R1,*AR1,R2         ; R2 = wi(n-N+1) + err * x(n-N+1)
        STF     R2,*AR1++(1)%      ; update last w

;   Define constants
xn      .usect  "buffer",order4
wn      .usect  "coeffs",order4
        .data
C40addr1    .word   C40_4_1
C40addr2    .word   C40_4_2
C40addr3    .word   C40_4_3
xn_addr     .word   xn
wn_addr     .word   wn
        .end
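
The partitioning used in the cascade listings can be checked on a host machine before committing it to assembly. The following is a minimal C sketch under stated assumptions, not a translation of the TMS320C40 code: the `segment` type and the function names are hypothetical, the delay line is shifted physically rather than kept in a circular buffer, and double precision is used instead of the C40's single-precision format. Each segment stands for one processor. It keeps its own taps, hands the sample that falls off its delay line to the next segment, contributes a partial output, and updates its weights from the common error e(n).

```c
#include <assert.h>
#include <stddef.h>

/* One processor's share of the filter (hypothetical type, not from the listings). */
typedef struct {
    size_t  n;   /* segment length: N1, N2, N3 or N4            */
    double *x;   /* delay line of n samples, x[0] is the newest */
    double *w;   /* the n weights owned by this processor       */
} segment;

/* Shift a new sample in and return the one that falls off the end,
   which is what the next processor in the chain receives. */
static double segment_shift(segment *s, double in)
{
    assert(s->n > 0);
    double out = s->x[s->n - 1];
    for (size_t k = s->n - 1; k > 0; k--)
        s->x[k] = s->x[k - 1];
    s->x[0] = in;
    return out;
}

/* Partial output y_i(n): dot product of this segment's weights and samples. */
static double segment_output(const segment *s)
{
    double y = 0.0;
    for (size_t k = 0; k < s->n; k++)
        y += s->w[k] * s->x[k];
    return y;
}

/* LMS update for this segment; mu_e is the step size times e(n). */
static void segment_update(segment *s, double mu_e)
{
    for (size_t k = 0; k < s->n; k++)
        s->w[k] += mu_e * s->x[k];
}

/* One sample period of the whole cascade. Returns e(n) = d(n) - y(n). */
double lms_cascade_step(segment *seg, size_t nseg,
                        double x, double d, double mu)
{
    double y = 0.0;
    for (size_t i = 0; i < nseg; i++) {
        x = segment_shift(&seg[i], x);   /* delayed sample moves down the chain */
        y += segment_output(&seg[i]);    /* y(n) is the sum of partial outputs  */
    }
    double e = d - y;                    /* formed on processor #1 in the listings */
    for (size_t i = 0; i < nseg; i++)
        segment_update(&seg[i], mu * e); /* e(n) is sent to every segment       */
    return e;
}
```

Calling `lms_cascade_step` with a single segment of length N1+N2+N3+N4 gives the unpartitioned filter, so running both forms on the same x(n) and d(n) and comparing the returned errors is a quick way to confirm that a chosen split leaves the algorithm unchanged, apart from rounding in the order of summation.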