Signal Analysis via the Bootstrap

A resampling algorithm for error estimation

Gary McGrath

Gary has a BS in physics from the University of California at Irvine and a PhD from the University of Hawaii. He is currently doing research work at the Fermi National Accelerator Laboratory and can be reached at gary@master.ps.uci.edu.

Many quantities are natural variables to examine statistically but are useless as test statistics because their error isn't easily estimated. Still, many possible test statistics are easily computed. For example, computing the median value of a data set requires little more than sorting the data and finding the middle value. The second moment of a distribution is calculated by simply summing the square of the difference between each value and the mean value. These types of quantities are frequently used as some sort of decision criteria, but a single value itself does not allow rejection of a hypothesis. Instead, hypotheses are rejected based on confidence intervals; however, calculating the confidence interval for the median or moments is extremely difficult using conventional techniques. What we need is a calculable and accurate estimation of probability distributions for complex quantities.

Conventional error estimation requires assumptions about the shape of the distributions involved. Much effort has been dedicated to formulating more precise approximation methods for error estimation and propagation, but they inevitably rely on many assumptions. For example, if C=A+B and A and B are sets of measured data, the standard method for estimating the value and error of C is to calculate the mean values for A and B and use those values to calculate the mean value for C. Then, the error on [mu] _C is found with the formula in Example 1, where [sigma] ²_AB is the covariance between the two quantities and [sigma] ²_A and [sigma] ²_B are the variances. We use A and B to calculate the confidence interval for C, and the distribution of C is assumed to be Gaussian. These formulas are valid only under certain assumptions about the distributions of A and B. Furthermore, calculating the confidence interval with standard techniques asserts the probability distribution for C and uses the calculated mean and standard deviation.

Systematic errors resulting from bad assumptions are propagated throughout the calculation, leaving an ambiguous result. Accurate error propagation using conventional methods is a delicate process, sensitive to assumptions. If the true distribution is not approximated well by a Gaussian, but is instead skewed with a long tail, treating that distribution as a Gaussian induces a large systematic uncertainty in the final result. Many analyses are proven faulty due to bad assumptions and thereby cast doubt on all tenuous results. The significance required for a formidable result has increased throughout the years to a level that precludes many experiments.

Proper signal analysis requires a technique which minimizes assumptions. If the final result is calculated without assuming distribution functions or using crude approximation methods for propagating the error through an equation, the calculated significance is reliable. Such a method allows the analysis of tenuous signals because a borderline result lacks calculational uncertainties. It exactly states the probability of the result occurring randomly without many assumptions convolved into the calculation.

When Efron extended the Jackknife technique to form the Bootstrap technique (see "References"), he provided a method to easily propagate errors through complex calculations, resulting in an estimate of the error with few assumptions. The Bootstrap is a resampling algorithm that estimates the error on quantities by resampling the data in random ways. The astrophysics community adapted this technique to search for tenuous signals in multidimensional data, in which the mechanisms and local efficiencies are not well understood. Generating a background via the Bootstrap allows a sensitive analysis without assumptions convolved into the calculation. The measured data is assumed to be a discrete sampling of the background distribution and can thus be used to calculate the background. By using the data itself to calculate the background, local efficiencies and systematic uncertainties are directly incorporated into the calculation. This technique provides a way to analyze signals without assuming efficiency functions or probability distributions. When it's crucial to detect signals without many assumptions convolved into the calculation, the Bootstrap provides a sensitive technique.

The next example uses the Bootstrap to see a small signal from a star in a telescope. A simple telescope that measures nothing more than total light level sweeps past a star to detect whether or not it can see the light from the star; see Figure 2(a). The two measured values are the angle of the telescope ( [theta] ), measured from some reference point, and the light level in the telescope. The signal is detected by a correlation between the light level and the angle at which the star is located ( [theta] =0). A turbulent atmosphere further complicates the measurement by randomly fluctuating the light level. Table 1 shows the measured data.

The program in Listing Two (page 81) calculates the background distributions by randomizing the angle with respect to the light level. Doing so essentially averages the light levels over the different angles. Figure 3 shows the results. Notice that the data shows a measurement above the background at the angle of the star. The program indicates the random chance probability at [theta] =0 is 8 percent; thus, the star seems to be detected by this telescope with 92 percent confidence. The signal might appear stronger than estimated to the human eye, but that illustrates the problem with fitting functions to make a background. This result shows that there is a signal, given no assumptions. Although this example is useful to demonstrate the procedure, it trivializes the advantages of this technique.

Perhaps the greatest benefit from this type of analysis is the ease with which systematic biases are handled. Since the data includes biases, using the data to calculate the background includes the systematic biases, provided you choose to randomize about the correct variables. If there were a strong bias against a combination of two angles, it would not be preferable to randomize out that correlation. Instead, randomizing the set of those angles with respect to another parameter provides the needed calculation while including the bias in angles. To illustrate this point, the next example adds a bias into the data.

Now let's extend the previous discussion to account for the earth's rotation. Table 2 shows the angle of the earth [phi] , the angle of the telescope [theta] , and the light level measured by the telescope shown in Figure 2(b). In this example, there is a large bias to the measurement. When the data was taken, there happened to be a tree branch overhead shadowing telescope angles between -5 and +5 degrees. The experimenter looks at the distributions shown in Figure 4 and sees there is a bias, which leads him or her to randomize the earth's angle. Randomizing the earth's angle separately from the telescope angle and light level grouped together leaves the strong telescope-angle bias in the calculation. The program in Listing Three (page 81) calculates the background in this way.

Figure 5 shows the results of the analysis. The histogram represents the measured values, the plotted points show the mean values of the background distribution, and the error bars represent the standard deviation of the background distribution. The program also generates the random-chance probabilities for the measured variables. The star is seen at [alpha] =30 degrees with a 95 percent confidence.

Although the last example performed well, this technique is actually best for counting data. If the data in Table 2 is converted to counting data, both biases are handled optimally. For instance, treat the first light level of 23 to actually mean there are 23 photons collected; replace that one point with 23 values of [phi] =0 and [theta] = -10. After doing this for all the points, repeat the process with only two changes. First, instead of incrementing the background map by the light level associated with the [theta] value, increment it by one for each [phi] , [theta] pair. Second, calculate the mean as done previously, but set the =[sqrt] since this is counting data. Counting data is distributed in a Poisson distribution, which can be approximated by a Gaussian with a variance equal to the mean. Listing Four (page 82) converts the data from the previous example into counting data and then performs the suggested algorithms.

The results from this example handle the biases perfectly. Figure 6 shows the results. The random-chance probability for the observation is extremely small, P( [alpha] =30_) ~0. Once again, notice that the Bootstrap slightly overestimates the background because the signal is included. This final result does not assume that any local-angle distributions are uniform. Instead, it entirely accounts for the distributions with only the angle-to-angle interdependence removed. By doing so, it provides a formidable result relying on few assumptions.

Conclusions

The Bootstrap is the best possible choice for many signal-analysis situations. It is one of few ways to propagate errors through complex transformations with little difficulty, and it requires few assumptions. Its ease of use, accuracy, excellent handling of local biases, and O(n) run time make it an essential tool for contemporary signal analysis.

References

Efron, B. Annals of Statistics, 7, 1979.

----. The Jackknife, the Bootstrap and Other Resampling Plans, CBMS 38, 1982.

Flannery, B.P., et al. Numerical Recipes. Cambridge: Cambridge University Press, 1986.

Knuth, D. The Art of Computer Programming. Reading, MA: Addison-Wesley, 1973.

Example 1: Error on C is found with this formula.

Figure 1: Grade distribution (wide histogram) and Bootstrap-generated median-probability distribution (small histogram).

Figure 2: (a) The measured values are the angle of the telescope , measured from some reference point, and the light level in the telescope. The signal is detected by a correlation between the light level and the angle at which the star is located (=0); (b) large bias to the measurement.

Figure 3: Observed (histogram) and predicted (points) light levels vs. . The source is at =0 degrees.

Figure 4: (a) Light level vs. telescope angle; (b) light level vs. earth angle.

Figure 5: Observed (histogram) and predicted (points) light levels vs. . The source is at =30 degrees.

Figure 6: Observed (histogram) and predicted (points) light levels vs. . The source is at =30 degrees.

Table 1: Data from a telescope that measures total light level.

     Theta   Light Level
     -------------------
     -5      1.1
     -4      0.3
     -3      1.9
     -2      1.6
     -1      2.3
      0      3.3
      1      1.8
      2      1.0
      3      1.7
      4      1.6
      5      0.8

Table 2: The angle of the earth (_), the angle of the telescope (q), and the light level measured by the telescope shown in Figure 2(b).

     Phi    Theta    Light Level
     ---------------------------
     0      -10      23
     0      -5       11
     0       0       10
     0       5       10
     0       10      20
     10     -10      19
     10     -5        9
     10      0       13
     10      5       10
     10      10      24
     20     -10      20
     20     -5       11
     20      0       11
     20      5       13
     20      10      35
     30     -10      22
     30     -5       13
     30      0       17
     30      5       14
     30      10      25
     40     -10      34
     40     -5       13
     40      0        9
     40      5       10
     40      10      19
     50     -10      22
     50     -5       10
     50      0        8
     50      5       11
     50      10      25

[LISTING ONE]


#include<stdio.h>
#include<math.h>
#include<stdlib.h>

#define NumOfGrades 20
#define NumOfBins 20
#define TestMedian 50
#define TGA {45,67,73,27,44,96,49,38,51,63,11,87,67,46,76,57,53,61,57,64}
#define nBSLoops 1000000


main()
{
int i, j, index, numOver=0;
int grades[NumOfGrades]=TGA, fake[NumOfGrades], iseed = -534;
float sum=0, sumSq=0, median, fmedian, ave, sdev, prob1Tail, prob2Tail;
float hist[NumOfBins];
extern int cmp();
extern float ran1(), FindMedian();

    /*Make sure histogram starts at 0*/
    for( i=0; i<NumOfBins; i++ )
        hist[i]=0.0;
    /*First, find the true median*/
    median = FindMedian( grades );
    printf( "median=%f\n", median );

    for( i=0; i<nBSLoops; i++ )
    {
        /*Make a fake array by sampling the grades with replacement*/
        for( j=0; j<NumOfGrades; j++ )
        {
            index = (int)( (float)NumOfGrades * ran1( &iseed ) );
            fake[j] = grades[ index ];
        }
        /*Find the median for that fake set of grades*/
        fmedian = FindMedian( fake );
        /*Find the components needed later for mean and probability*/
        sum += fmedian;
        sumSq += fmedian*fmedian;
        if( fmedian >= TestMedian )
            numOver++;

        /*Build histogram*/
        index = (int)( (fmedian - 30.0)/2.0 );
        if( (index >= 0) && (index < NumOfBins) )
            hist[index]++;
    }
    /*Output the histogram for printing purposes*/
    printf( "median probability distribution\n" );
    for( i=0; i<NumOfBins; i++ )
          printf( "%d  %f\n", (2*i+31), ((float)hist[i]/(float)nBSLoops));
    /*The probability of random occurance*/
    prob1Tail = 1.0 - (float)(numOver)/(float)(nBSLoops);
    prob2Tail = 2.0*prob1Tail;
    printf( "1-tail-prob=%f  2-tail-prob=%f\n", prob1Tail, prob2Tail );
    /*Find the mean and stdev*/
    ave = sum/nBSLoops;
    sdev = sqrt( (sumSq-ave*ave*nBSLoops)/( (float)nBSLoops-1.0 ) );
    printf( "average=%f  sdev=%f\n", ave, sdev );

exit(0);
}
float FindMedian( int array[NumOfGrades] )
{
float median;
extern int cmp();
    qsort( array, NumOfGrades, sizeof(int), cmp );
    median = ( (float)(array[9]+array[10]) / 2.0 );
return( median );
}
int cmp( int *in1, int *in2 )
{
int flag;
    flag = *in1 - *in2;
return( flag );
}

[LISTING TWO]



#include<stdio.h>
#include<math.h>

#define NumOfAngles 11
#define TMA { 1.1, 0.3, 1.9, 1.6, 2.3, 3.3, 1.8, 1.0, 1.7, 1.6, 0.8 }
#define TAA { -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5 }
#define nBSLoops 1000000

main()
{
int i, j, index, numOver[NumOfAngles];
int iseed = -534;
int angle[NumOfAngles]=TAA;
float measure[NumOfAngles]=TMA, fmeasure[NumOfAngles]=TMA;
float sum[NumOfAngles], sumSq[NumOfAngles], mean[NumOfAngles], temp;
float sdev[NumOfAngles], prob1Tail[NumOfAngles], fm, tempMeasure;
extern float ran1();
    for( i=0; i<NumOfAngles; i++ )
    {
        sum[i] = 0;
        sumSq[i] = 0;
        numOver[i] = 0;
    }
    for( i=0; i<nBSLoops; i++ )
    {
        /*Make a fake array by performing Knuth shuffling*/
        j = NumOfAngles;
        do
        {
            index = (int)( (float)j * ran1( &iseed ) );
            tempMeasure = fmeasure[ index ];
            fmeasure[ index ] = fmeasure[ j ];
            fmeasure[ j ] = tempMeasure;
            j--;
        }
        while( j > 0 );
        /*Find the components needed later for mean and probability*/
        for( j=0; j<NumOfAngles; j++ )
        {
            fm = fmeasure[j];
            sum[j] += fm;
            sumSq[j] += fm*fm;
            if( fm >= measure[j] )
                numOver[j]++;
        }
    }
    for( i=0; i<NumOfAngles; i++ )
    {
        /*The probability of random occurance*/
        prob1Tail[i] = ((float) numOver[i])/nBSLoops;
        /*Find the mean and stdev*/
        mean[i] = sum[i]/nBSLoops;
        temp = (sumSq[i] - mean[i]*mean[i]*nBSLoops)/( nBSLoops-1.0 );
        sdev[i] = sqrt( temp );
        printf( "\nAt angle=%d\n", angle[i] );
        printf( "measured=%f  mean=%f  sdev=%f\n", measure[i],
                               mean[i], sdev[i] );
        printf( "Prob: 1t=%f\n", prob1Tail[i] );
    }
exit(0);
}

[LISTING THREE]



#include<stdio.h>
#include<math.h>

#define NumOfAngles 30
#define NumOfSMAngles 7
#define TEA {0,0,0,0,0,10,10,10,10,10,20,20,20,20,20,30,30,30,30,30,40,40,\
40,40,40,50,50,50,50,50}
#define TTA {-10,-5,0,5,10,-10,-5,0,5,10,-10,-5,0,5,10,-10,-5,0,5,10,\
-10,-5,0,5,-10,-5,0,5,10}
#define TLL {23,11,10,10,20,19,9,13,10,24,20,11,11,13,35,22,13,17,14,25,\
34,13,9,10,19,22,10,8,11,25}
#define nBSLoops 1000000

main()
{
int i, j, k, index, numOver[NumOfAngles], temp, alpha;
int iseed = -534;
int SkyMap[NumOfSMAngles], FakeSkyMap[NumOfSMAngles];
int EarthAngle[NumOfAngles]=TEA;
int TeleAngle[NumOfAngles]=TTA;
int LightLevel[NumOfAngles]=TLL;
float sum[NumOfSMAngles], sumSq[NumOfSMAngles], mean[NumOfSMAngles];
float sdev[NumOfSMAngles], prob1Tail[NumOfSMAngles], fsm;
extern float ran1();
    /*First, make sure certain arrays are zero*/
    for( i=0; i<NumOfSMAngles; i++ )
    {
        sum[i] = 0;
        sumSq[i] = 0;
        numOver[i] = 0;
        SkyMap[i] = 0;
    }
    /*Make the Measured SkyMap*/
    for( i=0; i<NumOfAngles; i++ )
    {
        alpha = (EarthAngle[i]+TeleAngle[i])/10+1;
        if( (alpha >= 0) && (alpha < NumOfSMAngles) )
            SkyMap[ alpha ] += LightLevel[i];
    }
    /*With the measured SkyMap in hand, begin generating background maps*/
    for( k=0; k<nBSLoops; k++ )
    {
        /*Make a fake EarthAngle array by performing Knuth shuffling*/
        j = NumOfAngles;
        do
        {
            index = (int)( (float)j * ran1( &iseed ) );
            temp = EarthAngle[ index ];
            EarthAngle[ index ] = EarthAngle[ j ];
            EarthAngle[ j ] = temp;
            j--;
        }
        while( j > 0 );
        /*Zero out the FakeSkyMap*/
        for( i=0; i<NumOfSMAngles; i++ )
            FakeSkyMap[i] = 0;
        /*Make the Fake SkyMap*/
        for( i=0; i<NumOfAngles; i++ )
        {
            alpha = (EarthAngle[i]+TeleAngle[i])/10+1;
            if( (alpha >= 0) && (alpha < NumOfSMAngles) )
                FakeSkyMap[ alpha ] += LightLevel[i];
        }
        /*Find the components needed later for mean and probability*/
        for( j=0; j<NumOfSMAngles; j++ )
        {
            fsm = (float)FakeSkyMap[j];
            sum[j] += fsm;
            sumSq[j] += fsm*fsm;
            if( FakeSkyMap[j] >= SkyMap[j] )
                numOver[j]++;
        }
    }
    for( i=0; i<NumOfSMAngles; i++ )
    {
        /*The probability of random occurance*/
        prob1Tail[i] = ((float) numOver[i])/(float)nBSLoops;
        /*Find the mean and stdev*/
        mean[i] = sum[i]/nBSLoops;
        temp = (sumSq[i] - mean[i]*mean[i]*nBSLoops)/( nBSLoops-1.0 );
        sdev[i] = sqrt( temp );
    }
    /*First, look at the observed distribution*/
    for( i=0; i<NumOfSMAngles; i++ )
        printf( "%d  %d\n", (10*(i-1)), SkyMap[i] );
    /*Then, the predicted*/
    for( i=0; i<NumOfSMAngles; i++ )
        printf( "%d  %f  %f\n", (10*(i-1)), mean[i], sdev[i] );
    /*Now, the random chance probability*/
    for( i=0; i<NumOfSMAngles; i++ )
        printf( "alpha:%d   Prob=%f\n", (10*(i-1)), prob1Tail[i] );
exit(0);
}

[LISTING FOUR]



#include<stdio.h>
#include<math.h>

#define NumOfAngles 30
#define NumOfSMAngles 7
#define TotalLightLevel 481
#define TEA {0,0,0,0,0,10,10,10,10,10,20,20,20,20,20,30,30,30,30,30,40,40,\
40,40,40,50,50,50,50,50}
#define TTA {-10,-5,0,5,10,-10,-5,0,5,10,-10,-5,0,5,10,-10,-5,0,5,10,\
-10,-5,0,5,-10,-5,0,5,10}
#define TLL {23,11,10,10,20,19,9,13,10,24,20,11,11,13,35,22,13,17,14,25,\
34,13,9,10,19,22,10,8,11,25}
#define nBSLoops 100000

main()
{
int i, j, k, index, numOver[NumOfAngles], temp, alpha;
int iseed = -534;
int SkyMap[NumOfSMAngles], FakeSkyMap[NumOfSMAngles];
int EarthAngle[NumOfAngles]=TEA, DEarthAngle[TotalLightLevel];
int TeleAngle[NumOfAngles]=TTA, DTeleAngle[TotalLightLevel];
int LightLevel[NumOfAngles]=TLL;
float sum[NumOfSMAngles], mean[NumOfSMAngles];
float prob1Tail[NumOfSMAngles];
extern float ran1();
    /*First, make sure certain arrays are zero*/
    for( i=0; i<NumOfSMAngles; i++ )
    {
        sum[i] = 0;
        sumSq[i] = 0;
        numOver[i] = 0;
        SkyMap[i] = 0;
    }
    /*Make the Measured SkyMap*/
    for( i=0; i<NumOfAngles; i++ )
    {
        alpha = (EarthAngle[i]+TeleAngle[i])/10+1;
        if( (alpha >= 0) && (alpha < NumOfSMAngles) )
            SkyMap[ alpha ] += LightLevel[i];
    }
    /*Make the discrete angle sets*/
    k = 0;
    for( i=0; i<NumOfAngles; i++ )
        for( j=0; j<LightLevel[i]; j++ )
        {
            DEarthAngle[k] = EarthAngle[i];
            DTeleAngle[k] = TeleAngle[i];
            k++;
        }
    /*With the measured SkyMap in hand, begin generating background maps*/
    for( k=0; k<nBSLoops; k++ )
    {
        /*Make a fake EarthAngle array by performing Knuth shuffling*/
        j = TotalLightLevel;
        do
        {
            index = (int)( (float)j * ran1( &iseed ) );
            temp = DEarthAngle[ index ];
            DEarthAngle[ index ] = DEarthAngle[ j ];
            DEarthAngle[ j ] = temp;
            j--;
        }
        while( j > 0 );
        /*Zero out the FakeSkyMap*/
        for( i=0; i<NumOfSMAngles; i++ )
            FakeSkyMap[i] = 0;
        /*Make the Fake SkyMap*/
        for( i=0; i<TotalLightLevel; i++ )
        {
            alpha = (DEarthAngle[i]+DTeleAngle[i])/10+1;
            if( (alpha >= 0) && (alpha < NumOfSMAngles) )
                FakeSkyMap[ alpha ]++;
        }
        /*Find the components needed later for mean and probability*/
        for( j=0; j<NumOfSMAngles; j++ )
        {
            sum[j] += (float)FakeSkyMap[j];
            if( FakeSkyMap[j] >= SkyMap[j] )
                numOver[j]++;
        }
    }
    for( i=0; i<NumOfSMAngles; i++ )
    {
        /*The probability of random occurance*/
        prob1Tail[i] = ((float) numOver[i])/(float)nBSLoops;

        /*Find the mean*/
        mean[i] = sum[i]/nBSLoops;
    }
    /*First, look at the observed distribution*/
    for( i=0; i<NumOfSMAngles; i++ )
        printf( "%d  %d\n", (10*(i-1)), SkyMap[i] );
    /*Then, the predicted*/
    for( i=0; i<NumOfSMAngles; i++ )
        printf( "%d  %f  %f\n", (10*(i-1)), mean[i], sqrt( mean[i] ) );
    /*Now, the random chance probability*/
    for( i=0; i<NumOfSMAngles; i++ )
        printf( "alpha:%d   Prob=%f\n", (10*(i-1)), prob1Tail[i] );

exit(0);
}
End Listings