Casimir C. "Casey" Klimasauskas is the president and founder of Neural-Ware, a developer of neural network tools. He can be reached at 103 Buck-skin Ct., Sewickley, PA 15143.
Neural networks are an information-processing technology inspired by studies of the brain and nervous system. Despite (or, perhaps, because of) their origins, certain types of neural networks have a strong founding in mathematics. One network type in particular, back-propagation, is a powerful adaptive technique for approximating relationships between several continuous valued inputs and one (or more) continuous valued output. This article discusses applications of back-propagation to filtering out noise, or, conversely, identifying fundamental underlying signals. An example, discussed later, is in isolating EKG signals taken from a noisy environment.
This article addresses the problem of developing a limited number of "feature detectors" that can account for the maximum amount of a signal (in a least-mean-square sense). In doing this, the noise is assumed to be random and of a higher frequency than the underlying signal. The approach taken here is to find a way of encoding or compressing the input data and then reexpanding it. The compression process eliminates portions of the input data, which represent small or nonrecurring features. By selecting the number of "encoders," you can vary the amount of detail retained in the transformation. Figure 1, page 31, illustrates this process. Key to solving this problem is the selection of a good set of feature detectors for the particular signal to be to filtered.
Notice that in Figure 1 the 11 inputs have been reduced at the output of the encoder stage to 3 outputs. If you were to represent each input as well as the output of each encoder as a 32-bit floating-point number, it would take 352 bits (11 x 32) to represent the input to the system. It would take 96 bits (3 x 32) to represent the output of the system. From this perspective, you have compressed the input data by a factor of 96/352. This form of data compression is also known as dimensionality reduction. You have reduced the number of dimensions from 11 to 3. For this system to work well, you must be sure that the following two interrelated criteria are met: 1. There must be some relationship between input variables; and 2. You must use an encoding system that can represent the relationship.
It may not be apparent at first that these two criteria are related, but consider the situation in which the input samples are from a constant-amplitude sine wave. In this case, you could encode the data exactly with a single "sine-wave" detector in which the output is related to the position in the cycle. On the other hand, trying to use another type of detector function may require several detectors to develop a "piece wise" approximation of the function. If there is no relationship between the input variables, dimensionality reduction often results in a series of fixed outputs representing the "average" value of a combination of inputs. In the case of noise filtering (where the inputs are a series of time-sequential samples), there is usually a high degree of correlation (or relationship) between adjacent time samples. Dimensionality reduction works very well in this situation.
With a little thought, it becomes apparent that if the signal you are working with has a noise component added to it, the contribution of the noise to the signal will tend to be random or nonstationary with respect to other larger features. As such, the noise will most likely be one of the features eliminated in the encoding process. If the noise component is of a relatively low frequency and accounts for a majority of the energy in the output signal (for example, 60-cycle hum), this process will detect the noise. Subtracting the output of the detector from the input signal removes the noise from the signal.
To approach the problem of feature detection, first consider a simpler problem of mapping from a single continuous input to a single continuous output. Assume that the input and output are related by some continuous differentiable function. Figure 2, this page, shows a plot of input values "X" and output values "Y."
A first attempt to approximate the relationship uses a linear relationship. This, too, is shown in Figure 2. The slope of this line can be determined using statistical techniques (linear regression) or by using adaptive signal-processing techniques (Bernard Widrow). Rather than using a straight line, you could have used a quadratic polynomial. Figure 3, this page, shows how a quadratic polynomial might fit the data shown. Both of these techniques can be generalized to two, three, or more inputs. Notice that regardless of the number of inputs, each feature detector has exactly one output.
For both of the cases just described, the decoded output is assumed to be identical to the output of the encoders. Consider a more complex situation in which you use a "sigmoidal" feature detector. Figure 4, page 34, shows an example of a sigmoid function. You might use this function to create two encoders as shown in Figures 4b and 4c. Now, if you use a decoder that takes the difference of the output of the two encoders, you have the function shown in Figure 4d. Using this method, you have been able to synthesize a "bump." By varying the ai0 term, you shift the transition point. Varying the ai1 parameter changes the slope at the transition. In this way, you could have created a "bump" of any size, any location, and independently vary the slope of each of its sides. By adding more "bumps" together, you can create arbitrarily complex functional relationships between one or more inputs and a single output.
Though bumps can be used to create arbitrarily complex functions, this may not always be the best method. At times, it may be better to use sine or other types of functions in the encoders. Using "sine" encoders is equivalent to Fourier decomposition of the inputs. Unlike the standard Fourier transform, this technique accounts for both the phase and amplitude of the sampled input.
At this point, you might observe that what has been discussed sounds like adaptive signal-processing, but with two stages instead of one. And you are right to an extent. An adaptive unit is identical to the "linear" encoder shown in Figure 2. If you were to take two linear encoders and try combining their outputs together in a linear combination, you can easily show that this is identical to a single linear element. The novel characteristics of the neural network approach is that the output of the weighted sum (linear combination of the inputs) is transformed using a non-linear function such as a sigmoid or sine function. These nonlinearities make the creation of multilevel systems possible and are responsible for several of the resulting characteristics.
Why use this approach? The two basic reasons are as follows. First, it is capable of retaining a level of detail greater than other techniques. Rather than simply ignoring small features, it can allow them to pass if they are of a stationary recurring nature. Second, the amount of detail retained by the filtering process can be varied by varying the number of elements in the hidden layer. From these perspectives, it is a technique providing additional flexibility and control when the engineer analyzes noisy data.
The approach described thus far is to develop a limited number of feature detectors that encode a series of samples of an input signal. The output of these encoders is used to reconstruct all or a portion of the input sample. The use of a limited number of feature detectors ensures that some of the information (preferably the noise) will be removed in the reconstruction.
Until now, the problem of how to compute the various coefficients used by a feature detector or encoder has been ignored. The process is a form of regression. The word "regression" comes from a study done in the early 1900s. In this study, the researchers attempted to find relationships between the height of parents and that of their children. They developed a statistical technique for fitting a straight line through the data (linear regression). The results of the study showed that tall parents had shorter children and that short parents had relatively taller children -- their heights were regressively correlated. The term regression analysis applies to any technique for fitting a curve to data.
The technique used to set the coefficients in a network is described as a set of differential equations that modify the coefficients in such a way as to reduce the mean-square-error of the output for the training set. These are derived by defining the error for a single pass through a training set (several presentations of input and expected, or desired, output) as the sum of the squares of the difference of the actual and expected outputs. This is called the square error.
The derivative of each weight (or coefficient) in the network is computed as a function of the error. For most cases, these differential equations do not have a closed-form solution. As such, the solution (or a solution) is derived by using numerical methods. The numerical techniques compute the direction to change the coefficients (gradient) and then change them slightly in that direction. The mathematics behind this are explained in detail in Chapter 8 of Parallel Distributed Processing by Rumelhart and McClelland (MIT Press 1986).
Listing One, page 96, presents a back-propagation network program that can solve the "exclusive OR" problem described in the Rummelhart-McClelland book. This program could also be extended to solve the noise filtering problem. If you want to extend the program in Listing One to solve the "noise filtering" problem, you must do three things:
For the purposes of studying the noise filtering problem, I used a commercial neural network development environment (NeuralWorks Professional II by NeuralWare) for all of the examples that follow because neural network development systems provide the infrastructure needed to solve the problem. These systems provide the network building tools that make it easy to change the network design, interfaces to assist in setting up the problem, tools for investigating what is happening within the network itself, and a host of other "bookkeeping" functions. A good development environment also enables you to focus on the problem. Though most neural network types appear relatively straightforward from the standpoint of mathematics, many programmers find it difficult to debug this kind of numerical-oriented code. If a handcoded network fails to converge, the problem may lie either in the "code" or the network parameters. A neural network development environment solves the first of these problems and makes the other problems easier to test.
Figures 5a, page 42, and 5b , 5c, and 5d, page 44, show screen dumps of four tests of training a network to filter out noise. The examples use 5, 7, 10, and 19 elements in the hidden layer. As you would expect from the previous discussion, the more processing elements in the hidden layer, the more detail is preserved from the encoding process to decoding process.
The data displayed in the lower window of each of the examples is the raw input data. This is actually EKG data taken in a noisy environment (the noise is caused primarily by fluorescent lights). The filtered output data is displayed in the upper window. The networks used here each had 19 inputs and 19 outputs. The "filtered output" in the upper window is the output of the center processing element of the output layer. Experimentally, the outputs of the other processing elements in the output layer do a similar job of tracking the input.
So far, we have looked at the problem from the perspective of doing dimensionality reduction on 19 inputs to produce 19 outputs. The way we trained the network was to randomly select a block of 19 sequential samples from the raw data, and to use this as both the input and desired output of the network. This was typically repeated 300,000 to 600,000 times for each of the networks shown (overnight on a NEC Powermate III). Each time a block of inputs was applied to the network, the weights were slightly adjusted to make the actual network output closer to the desired output. Only the middle output element is used as the "filtered" output.
You might wonder what would happen if you were to have only one output element that was trained to reproduce the value of the middle input. Figure 6 , this page, shows the results of such a network with five elements in the hidden layer.
Closely examining the weights in the network, it becomes clear that the output is affected almost exclusively by the center input. All other inputs are ignored. Neural network techniques are sometimes smarter than you might expect! The use of additional processing elements in the output layer forces the feature detectors to come up with a more "general" way of encoding the inputs.
During the recall process, the raw data is shifted through a 19-element long shift register and the output computed with that set of inputs. The data is shifted one position and the process repeated.
The training process is reasonably repeatable with the data shown. We have tried it on a variety of other signals (manually constructed) with some interesting results. Depending on the type of signal used, the processing elements in the hidden layer should use a "sine" function rather than a sigmoid. As mentioned previously, this causes the network to do a form of Fourier decomposition and reconstruction.
As a general technique, neural network-based techniques for noise filtering offer an interesting and potentially powerful approach. One of the particularly interesting characteristics of neural nets is the capability to develop an adaptive filtering technique that can be tuned to preserve varying degrees of detail. Significant experimentation remains to develop these techniques to the point where they are well characterized and understood.
D. Rumelhart and J. McClelland. Parallel Distributed Processing, Vol. 1. (Boston Mass.: MIT Press; 1986).
B. Widrow and S.D. Stearns. Adaptive Signal Processing (Englewood Cliffs, NJ: Prentice-Hall, 1985).
Steve Melnikof is an experimental particle physicist, who spends his working hours in a multitasking fashion. When he isn't designing experiments or consulting for GE, he can be found at home programming a Macintosh II to investigate neural networkarchitectures.
Artificial neural networks (ANNs) show great promise in providing solutions to signal processing or signal-noise reduction problems. In fact, these solutions might be seen as more natural or intuitive when compared with traditional techniques. As this note will indicate, however, there are a number of issues and considerations that have to be faced when applying theory to real-world situations. A few more words on how a neural network does signal processing might help to clarify this point.
Recall that back-propagation networks configure themselves in such a way as to establish a relationship between input and output variables that minimizes the error between the network generated output and the exact or target output (in the sense of least squares fitting). After minimizing the error or training the network, we can say that it has learned to input/output relation. What's amazing is that a network can and will produce an output close to and consistent with the target value, even though one or more of the inputs have been corrupted. If you still aren't impressed, remember that neural nets are nonlinear devices! Again, inputs can be noisy, but networks usually generalize well enough to produce an acceptable output. For example, we can train a network on the input sequence of (1.0 2.0 3.0 4.0 5.0) with a target of (6.0) and then input (1.0 3.0 3.2 4.1 4.9) to obtain (5.8).
Our working strategy is as follows: We pick an input sequence of data points and then proceed to train the network using (as a target) the next point in the signal sequence. We might, for instance, have a two input data and the sixth as a target; for example, every third datum is a target. This way, we encode the time structure of our signal directly into the network. Next, the network is trained on a set of input sequences that together form the shape (or representation) of a typical signal. Following this, we have a network to process the noisy data.
To illustrate the process in an engineering application, Figure 1a shows the training or target signal for a medical diagnostic (hemoglobin red blood cell) device and the network output after different numbers of training cycles. For our data a few hundred training epochs resulted in better than a one percent difference between target and output. Figure 1b shows the end result of applying this network on a noisy set of data. For this study I used either two, five, or ten inputs along with ten hidden units; the presented results are for the case of five inputs.
Now to important considerations when putting neural network theory into practice. First, if we use a small input window (for instance, the number of inputs is one or two), we will have a fine sampling of our signal. As such, we will need a large number of hidden units to correctly encode the character of the signal. This in turn involves more data processing, which could prohibit real-time response without the use of parallel processors. Furthermore, with only two inputs, if one or both are noisy, the success of the network in distinguishing between signal and noise is diminished. Therefore, we might want to increase the size of the input window in order to maximize the network's noise handling capabilities. Remember, however, that we're also trying to maintain as fine a sampling as possible to maximize signal detail. The question then becomes: How many inputs and how many hidden units?
In general, the task is to find a balance between a number of variables (besides the number of units), which at the present are more or less determined empirically. In addition, we might decide to introduce feedback (networks of this type are sometimes called Jordan networks) from either input or output to further enhance the network's memory of the signal's time structure. As mentioned in the main article, many of these options can be explored more quickly using one of the commercially available network simulation packages. For my work, I used, and highly recommend, The Cognitron by Cognitive Software.
Figure 1c shows a final plot of our data representing the arithmetic difference between the two curves of Figure 1b. You can see this difference is relatively small, except for a number of spikes. Figure 1c is marked with two horizontal lines to indicate these large variance points. In fact, these spikes emerge just where we have a random noise pulse interposed as part of the data!
To conclude, the network distinguished the signal from the noise without using multiband filters, Fourier-Laplace transforms, and the like. You might even go further to conclude that artificial neural networks seem to have an almost human capability to process difficult, intrinsically nonlinear signal shapes.
_NEURAL NETS AND NOISE FILTERING_
by Casey Klimasauskas
[LISTING ONE]
Copyright © 1989, Dr. Dobb's Journal
/* (backprop.c) Back-propagation XOR example */
#include <stdio.h>
/************************************************************************
* Back-propagation Exclusive OR Program *
************************************************************************
Written by Casimir C. Klimasauskas. No copyright or other
proprietary rights reserved.
This program was compiled and tested using DataLight "C" version 3.14
as follows: DLC -ms backprop.c
NOTE: the structure "_conn" uses an index for the PE source. This was
done (rather than a pointer to the processing element itself) to get
around certain problems with handling circular references in structure
definitions.
*/
#define MAXCONN 5 /* maximum number of conn */
typedef struct _conn { /* connection to a PE */
int PESource; /* index of PE source */
float ConnWt; /* connection Wt */
float LastDelta; /* last weight change */
} CONN;
typedef struct _pe { /* processing element */
float Output; /* PE output */
float Error; /* Accumulated error */
CONN Conns[MAXCONN+1]; /* connections */
} PE;
/************************************************************************
* Network Topology *
************************************************************************
The following diagram shows how the processing elements have been
connected to solve the exclusive "OR" problem. This is taken from
Volume 1 of "Parallel Distributed Processing" by Rummelhart and
McClelland.
+--------------------- PE5
| / | \
| / | \
+-----------------/----PE4 \
| | / \ |
| | / \ |
| |/ \|
PE1 - bias PE2 PE3
*/
static PE pe1 = { 1.0, 0 }; /* bias */
static PE pe2 = { 0 }; /* inputs */
static PE pe3 = { 0 };
static PE pe4 = { 0, 0, 1,0,0, 2,0,0, 3,0,0 }; /* hidden */
static PE pe5 = { 0, 0, 1,0,0, 2,0,0, 3,0,0, 4,0,0 }; /* output */
/* --- Processing Elements (for reference by number) ---- */
static PE *PEList[] = { (PE *)0, &pe1, &pe2, &pe3, &pe4, &pe5 };
/* --- Layer definitions --- */
static PE *LayIn[] = { &pe2, &pe3, (PE *)0 }; /* input layer list */
static PE *LayMid[] = { &pe4, (PE *)0 }; /* hidden layer list */
static PE *LayOut[] = { &pe5, (PE *)0 }; /* output layer list */
/* --- Network List --- */
static PE **LayList[] = { &LayIn[0], &LayMid[0], &LayOut[0], (PE **)0 };
/************************************************************************
* Sigmoid() - Compute the sigmoid of a value *
************************************************************************
*/
double sigmoid( x )
double x;
{
double r; /* result */
extern double exp();
/* check special limiting cases to prevent overflow */
if ( x < -10. ) r = 0.0;
else if ( x > 10. ) r = 1.0;
else r = 1.0 / (1.0 + exp( -x ));
return( r );
}
/************************************************************************
* RRand() - Compute a random number in a range *
************************************************************************
*/
double RRand( low, high )
double low, high; /* low / high limits */
{
double r; /* return result */
extern int rand(); /* random number generator */
r = (rand() / 32767.) * (high - low) + low;
return( r );
}
/************************************************************************
* RandWts() - randomize all of the weights in a network *
************************************************************************
*/
void RandWts( low, high, LLp )
double low, high; /* low / high limits for random */
PE ***LLp; /* layer list pointer */
{
PE **PePP; /* PE Pointer */
PE *PeP; /* PE itself */
CONN *WtP; /* connection Pointer */
for( ; (PePP = *LLp) != (PE **)0; LLp++ ) /* layer loop */
for( ; (PeP = *PePP) != (PE *)0; PePP++ ) /* PE loop */
for( WtP = &PeP->Conns[0]; WtP->PESource != 0; WtP++ )
{
WtP->ConnWt = RRand( low, high );
WtP->LastDelta = 0.0;
}
return;
}
/************************************************************************
* Recall() - Recall information from the network *
************************************************************************
*/
void Recall( ov, iv, LLp, rcf )
double *ov; /* output vector */
double *iv; /* input vector */
PE ***LLp; /* layer list pointer */
int rcf; /* "recall" mode flag (0=learn) */
{
PE **PePP; /* PE Pointer */
PE **LastPP; /* last non-zero PE list pointer */
PE *PeP; /* PE itself */
CONN *WtP; /* connection Pointer */
double sum; /* weighted sum */
/* copy the input vector to the inputs of the network */
for( PePP = *LLp++; (PeP = *PePP) != (PE *)0; PePP++ )
PeP->Output = *iv++;
/* compute the weighted sum and transform it */
for( ; (PePP = *LLp) != (PE **)0; LLp++ ) /* layer loop */
{
LastPP = PePP;
for( ; (PeP = *PePP) != (PE *)0; PePP++ ) /* PE's in a layer */
{
/* weighted sum of the inputs */
sum = 0;
for( WtP = &PeP->Conns[0]; WtP->PESource != 0; WtP++ )
sum += WtP->ConnWt * PEList[ WtP->PESource ]->Output;
/* transform it using a sigmoidal transfer function */
PeP->Output = sigmoid( sum );
/* if "learn" mode, set the error to zero */
if ( rcf == 0 ) PeP->Error = 0.0; /* (for learning) */
}
}
/* copy the results to the output array */
if ( rcf != 0 ) /* only if not learning */
{
for( ; (PeP = *LastPP) != (PE *)0; LastPP++ )
*ov++ = PeP->Output;
}
return;
}
/************************************************************************
* Learn() - "learn" an association *
************************************************************************
*/
double Learn( ov, iv, LLp, alpha, eta )
double *ov; /* output vector */
double *iv; /* input vector */
PE ***LLp; /* layer list pointer */
double alpha; /* learning rate */
double eta; /* momentum */
{
double MAErr; /* Maximum Absolute error */
double rv; /* work value */
double SigErr; /* back-propagated error */
PE ***ALp; /* alternate layer pointer */
PE **PePP; /* PE Pointer */
PE **LastPP; /* last non-zero PE list pointer */
PE *PeP; /* PE itself */
CONN *WtP; /* connection Pointer */
extern double fabs(); /* absolute value */
Recall( ov, iv, LLp, 0 ); /* perform a recall */
/* find the output layer */
for( ALp = LLp; ALp[1] != (PE **)0; ALp++ )
;
/* compute the square error in the output */
for( MAErr = 0.0, PePP = *ALp; (PeP = *PePP) != (PE *)0; PePP++ )
{
rv = *ov++ - PeP->Output; /* output Error */
PeP->Error = rv;
if ( fabs(rv) > MAErr ) MAErr = fabs(rv);
}
/* back-propagate the error & update the weights */
for( ; ALp > LLp; ALp-- ) /* layer loop */
{
PePP = *ALp;
for( ; (PeP = *PePP) != (PE *)0; PePP++ ) /* PE's in a layer */
{
/* compute the error prior to the sigmoid function */
SigErr = PeP->Output * (1.0 - PeP->Output) * PeP->Error;
/* back-propagate the errors & adjust weights */
for( WtP = &PeP->Conns[0]; WtP->PESource != 0; WtP++ )
{
PEList[ WtP->PESource ]->Error +=
WtP->ConnWt * SigErr;
rv = alpha * PEList[ WtP->PESource ]->Output * SigErr +
eta * WtP->LastDelta;
WtP->ConnWt += rv;
WtP->LastDelta = rv;
}
}
}
return( MAErr );
}
/************************************************************************
* Main() - main driver routine to train the network *
************************************************************************
*/
static double iv1[] = { 0.0, 0.0 }; static double ov1[] = { 0.0 };
static double iv2[] = { 1.0, 0.0 }; static double ov2[] = { 1.0 };
static double iv3[] = { 0.0, 1.0 }; static double ov3[] = { 1.0 };
static double iv4[] = { 1.0, 1.0 }; static double ov4[] = { 0.0 };
static double *ivp[] = { &iv1[0], &iv2[0], &iv3[0], &iv4[0] };
static double *ovp[] = { &ov1[0], &ov2[0], &ov3[0], &ov4[0] };
main()
{
int wx; /* work index */
int x; /* index into samples array */
double r; /* work value */
double MAErr; /* maximum Absolute error */
double wo[sizeof(ivp)/sizeof(*ivp)];
/* randomize the weights in the network */
RandWts( -1.0, 1.0, &LayList[0] );
MAErr = 0.0;
for( wx = 0; ; wx++ )
{
x = wx % (sizeof(ivp)/sizeof(*ivp));
if ( x == 0 && wx != 0 )
{
if ( (wx % 100) == 0 )
printf( "Presentation %4d, Maximum Absolute Error = %.5f\n",
wx, MAErr );
if ( MAErr < .1 ) break;
MAErr = 0.0;
}
r = Learn( ovp[x], ivp[x], &LayList[0], 0.9, 0.5 );
if ( r > MAErr ) MAErr = r;
}
/* test the network */
for( wx = 0; wx < (sizeof(ivp)/sizeof(*ivp)); wx++ )
{
Recall( wo, ivp[wx], &LayList[0], 1 ); /* perform a recall */
printf( "Input: %.2f %.2f -> %.2f\n", ivp[wx][0], ivp[wx][1], wo[0] );
}
return;
}