NEURAL NETS AND NOISE FILTERING

Back-propagation is a powerful adaptive method for filtering out noise or identifying underlying signals

Casey Klimasauskas

Casimir C. "Casey" Klimasauskas is the president and founder of Neural-Ware, a developer of neural network tools. He can be reached at 103 Buck-skin Ct., Sewickley, PA 15143.

Neural networks are an information-processing technology inspired by studies of the brain and nervous system. Despite (or, perhaps, because of) their origins, certain types of neural networks have a strong founding in mathematics. One network type in particular, back-propagation, is a powerful adaptive technique for approximating relationships between several continuous valued inputs and one (or more) continuous valued output. This article discusses applications of back-propagation to filtering out noise, or, conversely, identifying fundamental underlying signals. An example, discussed later, is in isolating EKG signals taken from a noisy environment.

This article addresses the problem of developing a limited number of "feature detectors" that can account for the maximum amount of a signal (in a least-mean-square sense). In doing this, the noise is assumed to be random and of a higher frequency than the underlying signal. The approach taken here is to find a way of encoding or compressing the input data and then reexpanding it. The compression process eliminates portions of the input data, which represent small or nonrecurring features. By selecting the number of "encoders," you can vary the amount of detail retained in the transformation. Figure 1, page 31, illustrates this process. Key to solving this problem is the selection of a good set of feature detectors for the particular signal to be to filtered.

Notice that in Figure 1 the 11 inputs have been reduced at the output of the encoder stage to 3 outputs. If you were to represent each input as well as the output of each encoder as a 32-bit floating-point number, it would take 352 bits (11 x 32) to represent the input to the system. It would take 96 bits (3 x 32) to represent the output of the system. From this perspective, you have compressed the input data by a factor of 96/352. This form of data compression is also known as dimensionality reduction. You have reduced the number of dimensions from 11 to 3. For this system to work well, you must be sure that the following two interrelated criteria are met: 1. There must be some relationship between input variables; and 2. You must use an encoding system that can represent the relationship.

It may not be apparent at first that these two criteria are related, but consider the situation in which the input samples are from a constant-amplitude sine wave. In this case, you could encode the data exactly with a single "sine-wave" detector in which the output is related to the position in the cycle. On the other hand, trying to use another type of detector function may require several detectors to develop a "piece wise" approximation of the function. If there is no relationship between the input variables, dimensionality reduction often results in a series of fixed outputs representing the "average" value of a combination of inputs. In the case of noise filtering (where the inputs are a series of time-sequential samples), there is usually a high degree of correlation (or relationship) between adjacent time samples. Dimensionality reduction works very well in this situation.

With a little thought, it becomes apparent that if the signal you are working with has a noise component added to it, the contribution of the noise to the signal will tend to be random or nonstationary with respect to other larger features. As such, the noise will most likely be one of the features eliminated in the encoding process. If the noise component is of a relatively low frequency and accounts for a majority of the energy in the output signal (for example, 60-cycle hum), this process will detect the noise. Subtracting the output of the detector from the input signal removes the noise from the signal.

To approach the problem of feature detection, first consider a simpler problem of mapping from a single continuous input to a single continuous output. Assume that the input and output are related by some continuous differentiable function. Figure 2, this page, shows a plot of input values "X" and output values "Y."

A first attempt to approximate the relationship uses a linear relationship. This, too, is shown in Figure 2. The slope of this line can be determined using statistical techniques (linear regression) or by using adaptive signal-processing techniques (Bernard Widrow). Rather than using a straight line, you could have used a quadratic polynomial. Figure 3, this page, shows how a quadratic polynomial might fit the data shown. Both of these techniques can be generalized to two, three, or more inputs. Notice that regardless of the number of inputs, each feature detector has exactly one output.

For both of the cases just described, the decoded output is assumed to be identical to the output of the encoders. Consider a more complex situation in which you use a "sigmoidal" feature detector. Figure 4, page 34, shows an example of a sigmoid function. You might use this function to create two encoders as shown in Figures 4b and 4c. Now, if you use a decoder that takes the difference of the output of the two encoders, you have the function shown in Figure 4d. Using this method, you have been able to synthesize a "bump." By varying the ai0 term, you shift the transition point. Varying the ai1 parameter changes the slope at the transition. In this way, you could have created a "bump" of any size, any location, and independently vary the slope of each of its sides. By adding more "bumps" together, you can create arbitrarily complex functional relationships between one or more inputs and a single output.

Though bumps can be used to create arbitrarily complex functions, this may not always be the best method. At times, it may be better to use sine or other types of functions in the encoders. Using "sine" encoders is equivalent to Fourier decomposition of the inputs. Unlike the standard Fourier transform, this technique accounts for both the phase and amplitude of the sampled input.

At this point, you might observe that what has been discussed sounds like adaptive signal-processing, but with two stages instead of one. And you are right to an extent. An adaptive unit is identical to the "linear" encoder shown in Figure 2. If you were to take two linear encoders and try combining their outputs together in a linear combination, you can easily show that this is identical to a single linear element. The novel characteristics of the neural network approach is that the output of the weighted sum (linear combination of the inputs) is transformed using a non-linear function such as a sigmoid or sine function. These nonlinearities make the creation of multilevel systems possible and are responsible for several of the resulting characteristics.

Why use this approach? The two basic reasons are as follows. First, it is capable of retaining a level of detail greater than other techniques. Rather than simply ignoring small features, it can allow them to pass if they are of a stationary recurring nature. Second, the amount of detail retained by the filtering process can be varied by varying the number of elements in the hidden layer. From these perspectives, it is a technique providing additional flexibility and control when the engineer analyzes noisy data.

More Details.

The approach described thus far is to develop a limited number of feature detectors that encode a series of samples of an input signal. The output of these encoders is used to reconstruct all or a portion of the input sample. The use of a limited number of feature detectors ensures that some of the information (preferably the noise) will be removed in the reconstruction.

Putting the Theory Into Practice

Until now, the problem of how to compute the various coefficients used by a feature detector or encoder has been ignored. The process is a form of regression. The word "regression" comes from a study done in the early 1900s. In this study, the researchers attempted to find relationships between the height of parents and that of their children. They developed a statistical technique for fitting a straight line through the data (linear regression). The results of the study showed that tall parents had shorter children and that short parents had relatively taller children -- their heights were regressively correlated. The term regression analysis applies to any technique for fitting a curve to data.

The technique used to set the coefficients in a network is described as a set of differential equations that modify the coefficients in such a way as to reduce the mean-square-error of the output for the training set. These are derived by defining the error for a single pass through a training set (several presentations of input and expected, or desired, output) as the sum of the squares of the difference of the actual and expected outputs. This is called the square error.

The derivative of each weight (or coefficient) in the network is computed as a function of the error. For most cases, these differential equations do not have a closed-form solution. As such, the solution (or a solution) is derived by using numerical methods. The numerical techniques compute the direction to change the coefficients (gradient) and then change them slightly in that direction. The mathematics behind this are explained in detail in Chapter 8 of Parallel Distributed Processing by Rumelhart and McClelland (MIT Press 1986).

Listing One, page 96, presents a back-propagation network program that can solve the "exclusive OR" problem described in the Rummelhart-McClelland book. This program could also be extended to solve the noise filtering problem. If you want to extend the program in Listing One to solve the "noise filtering" problem, you must do three things:

1. Increase the number of processing elements in the input and output layers to 19 each and the number of processing elements in the hidden layer to 5, 7, 10, or 19.

2. Delete all of the existing connections and fully connect the input layer to the hidden layer and the hidden layer to the output layer. Likewise fully connect the bias to both the hidden and output layers.

3. Write a module to load a series of noise data into memory and present them to the network.

For the purposes of studying the noise filtering problem, I used a commercial neural network development environment (NeuralWorks Professional II by NeuralWare) for all of the examples that follow because neural network development systems provide the infrastructure needed to solve the problem. These systems provide the network building tools that make it easy to change the network design, interfaces to assist in setting up the problem, tools for investigating what is happening within the network itself, and a host of other "bookkeeping" functions. A good development environment also enables you to focus on the problem. Though most neural network types appear relatively straightforward from the standpoint of mathematics, many programmers find it difficult to debug this kind of numerical-oriented code. If a handcoded network fails to converge, the problem may lie either in the "code" or the network parameters. A neural network development environment solves the first of these problems and makes the other problems easier to test.

Figures 5a, page 42, and 5b , 5c, and 5d, page 44, show screen dumps of four tests of training a network to filter out noise. The examples use 5, 7, 10, and 19 elements in the hidden layer. As you would expect from the previous discussion, the more processing elements in the hidden layer, the more detail is preserved from the encoding process to decoding process.

The data displayed in the lower window of each of the examples is the raw input data. This is actually EKG data taken in a noisy environment (the noise is caused primarily by fluorescent lights). The filtered output data is displayed in the upper window. The networks used here each had 19 inputs and 19 outputs. The "filtered output" in the upper window is the output of the center processing element of the output layer. Experimentally, the outputs of the other processing elements in the output layer do a similar job of tracking the input.

So far, we have looked at the problem from the perspective of doing dimensionality reduction on 19 inputs to produce 19 outputs. The way we trained the network was to randomly select a block of 19 sequential samples from the raw data, and to use this as both the input and desired output of the network. This was typically repeated 300,000 to 600,000 times for each of the networks shown (overnight on a NEC Powermate III). Each time a block of inputs was applied to the network, the weights were slightly adjusted to make the actual network output closer to the desired output. Only the middle output element is used as the "filtered" output.

You might wonder what would happen if you were to have only one output element that was trained to reproduce the value of the middle input. Figure 6 , this page, shows the results of such a network with five elements in the hidden layer.

Closely examining the weights in the network, it becomes clear that the output is affected almost exclusively by the center input. All other inputs are ignored. Neural network techniques are sometimes smarter than you might expect! The use of additional processing elements in the output layer forces the feature detectors to come up with a more "general" way of encoding the inputs.

During the recall process, the raw data is shifted through a 19-element long shift register and the output computed with that set of inputs. The data is shifted one position and the process repeated.

The training process is reasonably repeatable with the data shown. We have tried it on a variety of other signals (manually constructed) with some interesting results. Depending on the type of signal used, the processing elements in the hidden layer should use a "sine" function rather than a sigmoid. As mentioned previously, this causes the network to do a form of Fourier decomposition and reconstruction.

To illustrate the process in an engineering application, Figure 1a shows the training or target signal for a medical diagnostic (hemoglobin red blood cell) device and the network output after different numbers of training cycles. For our data a few hundred training epochs resulted in better than a one percent difference between target and output. Figure 1b shows the end result of applying this network on a noisy set of data. For this study I used either two, five, or ten inputs along with ten hidden units; the presented results are for the case of five inputs.

Now to important considerations when putting neural network theory into practice. First, if we use a small input window (for instance, the number of inputs is one or two), we will have a fine sampling of our signal. As such, we will need a large number of hidden units to correctly encode the character of the signal. This in turn involves more data processing, which could prohibit real-time response without the use of parallel processors. Furthermore, with only two inputs, if one or both are noisy, the success of the network in distinguishing between signal and noise is diminished. Therefore, we might want to increase the size of the input window in order to maximize the network's noise handling capabilities. Remember, however, that we're also trying to maintain as fine a sampling as possible to maximize signal detail. The question then becomes: How many inputs and how many hidden units?

In general, the task is to find a balance between a number of variables (besides the number of units), which at the present are more or less determined empirically. In addition, we might decide to introduce feedback (networks of this type are sometimes called Jordan networks) from either input or output to further enhance the network's memory of the signal's time structure. As mentioned in the main article, many of these options can be explored more quickly using one of the commercially available network simulation packages. For my work, I used, and highly recommend, The Cognitron by Cognitive Software.

Figure 1c shows a final plot of our data representing the arithmetic difference between the two curves of Figure 1b. You can see this difference is relatively small, except for a number of spikes. Figure 1c is marked with two horizontal lines to indicate these large variance points. In fact, these spikes emerge just where we have a random noise pulse interposed as part of the data!

To conclude, the network distinguished the signal from the noise without using multiband filters, Fourier-Laplace transforms, and the like. You might even go further to conclude that artificial neural networks seem to have an almost human capability to process difficult, intrinsically nonlinear signal shapes.

_NEURAL NETS AND NOISE FILTERING_ by Casey Klimasauskas [LISTING ONE]



/* (backprop.c)  Back-propagation XOR example */

#include <stdio.h>

/************************************************************************
 *  Back-propagation Exclusive OR Program                               *
 ************************************************************************
 Written by Casimir C. Klimasauskas.  No copyright or other
 proprietary rights reserved.
   This program was compiled and tested using DataLight "C" version 3.14
   as follows: DLC -ms backprop.c
   NOTE:  the structure "_conn" uses an index for the PE source.  This was
   done (rather than a pointer to the processing element itself) to get
   around certain problems with handling circular references in structure
   definitions.
 */

#define MAXCONN     5           /* maximum number of conn */

typedef struct _conn {          /* connection to a PE */
    int          PESource;      /* index of PE source */
    float        ConnWt;        /* connection Wt */
    float        LastDelta;     /* last weight change */
} CONN;

typedef struct _pe {            /* processing element */
    float        Output;        /* PE output */
    float        Error;         /* Accumulated error */
    CONN         Conns[MAXCONN+1];  /* connections */
} PE;

/************************************************************************
 *      Network Topology                                                *
 ************************************************************************
   The following diagram shows how the processing elements have been
   connected to solve the exclusive "OR" problem.  This is taken from
   Volume 1 of "Parallel Distributed Processing" by Rummelhart and
   McClelland.

     +--------------------- PE5
     |                     / | \
     |                   /   |  \
     +-----------------/----PE4  \
     |                 |  /    \  |
     |                 | /      \ |
     |                 |/        \|
    PE1 - bias        PE2        PE3

*/
static PE pe1 = { 1.0, 0 };                               /* bias */
static PE pe2 = { 0 };                                    /* inputs */
static PE pe3 = { 0 };
static PE pe4 = { 0, 0,  1,0,0,  2,0,0,  3,0,0 };         /* hidden */
static PE pe5 = { 0, 0,  1,0,0,  2,0,0,  3,0,0,  4,0,0 }; /* output */

/* --- Processing Elements (for reference by number) ---- */
static PE *PEList[] = { (PE *)0, &pe1, &pe2, &pe3, &pe4, &pe5 };

/* --- Layer definitions --- */
static PE *LayIn[]  = { &pe2, &pe3, (PE *)0 };  /* input layer list */
static PE *LayMid[] = { &pe4, (PE *)0 };        /* hidden layer list */
static PE *LayOut[] = { &pe5, (PE *)0 };        /* output layer list */

/* --- Network List --- */
static PE **LayList[] = { &LayIn[0], &LayMid[0], &LayOut[0], (PE **)0 };

/************************************************************************
 *  Sigmoid() - Compute the sigmoid of a value                          *
 ************************************************************************
 */
double sigmoid( x )
    double   x;
    {
    double   r;         /* result */
    extern double exp();

    /* check special limiting cases to prevent overflow */
    if ( x < -10. ) r = 0.0;
    else if ( x > 10. ) r = 1.0;
    else        r = 1.0 / (1.0 + exp( -x ));

    return( r );
    }

/************************************************************************
 *  RRand() - Compute a random number in a range                        *
 ************************************************************************
 */
double RRand( low, high )
    double low, high;          /* low / high limits */
    {
    double r;               /* return result */
    extern int rand();          /* random number generator */

    r = (rand() / 32767.) * (high - low) + low;

    return( r );
    }

/************************************************************************
 *  RandWts()  - randomize all of the weights in a network              *
 ************************************************************************
 */
void RandWts( low, high, LLp )
    double    low, high;        /* low / high limits for random */
    PE         ***LLp;          /* layer list pointer */
    {
    PE      **PePP; /* PE Pointer */
    PE       *PeP;  /* PE itself */
    CONN     *WtP;  /* connection Pointer */

    for( ; (PePP = *LLp) != (PE **)0; LLp++ )       /* layer loop */
    for( ; (PeP = *PePP) != (PE *)0; PePP++ )       /* PE loop */
        for( WtP = &PeP->Conns[0]; WtP->PESource != 0; WtP++ )
        {
        WtP->ConnWt = RRand( low, high );
        WtP->LastDelta = 0.0;
        }
    return;
    }

/************************************************************************
 *  Recall() - Recall information from the network                      *
 ************************************************************************
 */
void Recall( ov, iv, LLp, rcf )
    double   *ov;           /* output vector */
    double   *iv;           /* input vector */
    PE     ***LLp;          /* layer list pointer */
    int       rcf;          /* "recall" mode flag (0=learn) */
    {
    PE      **PePP;         /* PE Pointer */
    PE      **LastPP;       /* last non-zero PE list pointer */
    PE       *PeP;          /* PE itself */
    CONN     *WtP;          /* connection Pointer */
    double    sum;          /* weighted sum */

   /* copy the input vector to the inputs of the network */
    for( PePP = *LLp++; (PeP = *PePP) != (PE *)0; PePP++ )
    PeP->Output = *iv++;

   /* compute the weighted sum and transform it */
    for( ; (PePP = *LLp) != (PE **)0; LLp++ )       /* layer loop */
    {
    LastPP = PePP;
    for( ; (PeP = *PePP) != (PE *)0; PePP++ )   /* PE's in a layer */
        {
        /* weighted sum of the inputs */
        sum = 0;
        for( WtP = &PeP->Conns[0]; WtP->PESource != 0; WtP++ )
        sum += WtP->ConnWt * PEList[ WtP->PESource ]->Output;

        /* transform it using a sigmoidal transfer function */
        PeP->Output = sigmoid( sum );

        /* if "learn" mode, set the error to zero */
        if ( rcf == 0 ) PeP->Error  = 0.0;      /* (for learning) */
        }
    }

/* copy the results to the output array */
if ( rcf != 0 )     /* only if not learning */
    {
    for( ; (PeP = *LastPP) != (PE *)0; LastPP++ )
        *ov++ = PeP->Output;
    }
    return;
    }

/************************************************************************
 *  Learn() - "learn" an association                                    *
 ************************************************************************
 */
double Learn( ov, iv, LLp, alpha, eta )
    double   *ov;           /* output vector */
    double   *iv;           /* input vector */
    PE     ***LLp;          /* layer list pointer */
    double    alpha;        /* learning rate */
    double    eta;          /* momentum */
    {
    double    MAErr;        /* Maximum Absolute error */
    double    rv;           /* work value */
    double    SigErr;       /* back-propagated error */
    PE     ***ALp;          /* alternate layer pointer */
    PE      **PePP;         /* PE Pointer */
    PE      **LastPP;       /* last non-zero PE list pointer */
    PE       *PeP;          /* PE itself */
    CONN     *WtP;          /* connection Pointer */
    extern double fabs();      /* absolute value */

    Recall( ov, iv, LLp, 0 );           /* perform a recall */

    /* find the output layer */
    for( ALp = LLp; ALp[1] != (PE **)0; ALp++ )
    ;

    /* compute the square error in the output */
    for( MAErr = 0.0, PePP = *ALp; (PeP = *PePP) != (PE *)0; PePP++ )
    {
    rv = *ov++ - PeP->Output;       /* output Error */
    PeP->Error = rv;
    if ( fabs(rv) > MAErr ) MAErr = fabs(rv);
    }

    /* back-propagate the error & update the weights */
    for( ; ALp > LLp; ALp-- )           /* layer loop */
    {
    PePP = *ALp;
    for( ; (PeP = *PePP) != (PE *)0; PePP++ )   /* PE's in a layer */
        {
        /* compute the error prior to the sigmoid function */
        SigErr = PeP->Output * (1.0 - PeP->Output) * PeP->Error;

        /* back-propagate the errors & adjust weights */
        for( WtP = &PeP->Conns[0]; WtP->PESource != 0; WtP++ )
        {
        PEList[ WtP->PESource ]->Error +=
            WtP->ConnWt * SigErr;
        rv = alpha * PEList[ WtP->PESource ]->Output * SigErr +
            eta * WtP->LastDelta;
        WtP->ConnWt += rv;
        WtP->LastDelta = rv;
        }
        }
    }
    return( MAErr );
    }

/************************************************************************
 *  Main() - main driver routine to train the network                   *
 ************************************************************************
 */
static double iv1[] = { 0.0, 0.0 }; static double ov1[] = { 0.0 };
static double iv2[] = { 1.0, 0.0 }; static double ov2[] = { 1.0 };
static double iv3[] = { 0.0, 1.0 }; static double ov3[] = { 1.0 };
static double iv4[] = { 1.0, 1.0 }; static double ov4[] = { 0.0 };

static double *ivp[] = { &iv1[0], &iv2[0], &iv3[0], &iv4[0] };
static double *ovp[] = { &ov1[0], &ov2[0], &ov3[0], &ov4[0] };

main()
    {
    int      wx;        /* work index */
    int      x;     /* index into samples array */
    double   r;     /* work value */
    double   MAErr;     /* maximum Absolute error */
    double   wo[sizeof(ivp)/sizeof(*ivp)];

    /* randomize the weights in the network */
    RandWts( -1.0, 1.0, &LayList[0] );

    MAErr = 0.0;
    for( wx = 0; ; wx++ )
    {
    x = wx % (sizeof(ivp)/sizeof(*ivp));
    if ( x == 0 && wx != 0 )
        {
        if ( (wx % 100) == 0 )
        printf( "Presentation %4d, Maximum Absolute Error = %.5f\n",
            wx, MAErr );
        if ( MAErr < .1 ) break;
        MAErr = 0.0;
        }

    r = Learn( ovp[x], ivp[x], &LayList[0], 0.9, 0.5 );
    if ( r > MAErr ) MAErr = r;
    }

    /* test the network */
    for( wx = 0; wx < (sizeof(ivp)/sizeof(*ivp)); wx++ )
    {
    Recall( wo, ivp[wx], &LayList[0], 1 );      /* perform a recall */
    printf( "Input: %.2f %.2f -> %.2f\n", ivp[wx][0], ivp[wx][1], wo[0] );
    }

    return;
    }

NEURAL NETS AND NOISE FILTERING

Back-propagation is a powerful adaptive method for filtering out noise or identifying underlying signals

Casey Klimasauskas

Putting the Theory Into Practice

Conclusion

Bibliography

Neural Networks for Signal Processing: A Case Study