Features


A Simple Linear Regression Class

David C. Swaim II

Even an operation as simple as linear least squares can benefit from encapsulation in a class.


I began my computer career years ago while I was a student studying physics in college. One of the earliest things I used a computer for was statistical analysis of experimental data. Because of this I developed an interest in numerical analysis. As fate would have it, I eventually transitioned from a physicist writing programs to a programmer working on engineering and scientific applications. During that transition my programming language of preference mutated through Basic and Pascal to C and finally C++. I still find some numerical methods useful in my work and have translated many an algorithm to C and C++. A numerical tool I frequently use is one to fit data to a linear or straight line curve.

Linear Regression

Linear regression analysis, sometimes called the method of least squares, involves fitting a best-fit straight line to a set of data points. Scientists and engineers are often faced with having to perform this type of analysis on experimentally gathered data. The analysis consists of determining the coefficients a and b in the equation y = a + bx, where y is the dependent variable and x is the independent variable. The coefficient a is the value of y when x is zero and is called the y intercept. The coefficient b is the slope of the line, or the amount y changes for a given change in x. The slope, sometimes called the gradient, determines the angle of the line when it is plotted on a graph.
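The coefficients can be written in closed form. Given n data points (x_i, y_i), the standard least-squares formulas are:

```latex
b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
a = \frac{\sum y_i - b \sum x_i}{n} = \bar{y} - b\,\bar{x}
```

where \(\bar{x}\) and \(\bar{y}\) are the means of the x and y values. Note that only a handful of running sums is needed to evaluate these expressions, a fact that shapes the class design described below.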

It is also nice to know how well the set of data fits the equation. There are several statistical coefficients for determining the "goodness" of the fit. These are the standard error of the estimate, the correlation coefficient, and the coefficient of determination. Of these the main one is the correlation coefficient, which is the square root of the coefficient of determination. It is sometimes referred to as the linear correlation coefficient. This coefficient has a value of one if the data is a perfect fit. If the data is perfectly random (no fit) the coefficient has a value of zero.
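In the same notation, the correlation coefficient can be computed directly from the running sums:

```latex
r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}
         {\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)
                \left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}
```

and the coefficient of determination is simply \(r^2\).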

Due to experimental fluctuations you will never get exactly zero or one, but some fraction in between. Given any correlation coefficient there is always a statistical chance that totally random data could give the same coefficient. Tables have been developed for determining this probability based on the number of data points used in the regression analysis. Reference 2 has such a table. For the purposes of this article it is sufficient to know two things: the closer the coefficient is to one, the better the fit; and the more data points used, the smaller the chance that a given coefficient could have come from purely random data.

For example, if an analysis using ten data points resulted in a correlation coefficient of 0.765, there is only a 0.01 (or one per cent) probability that this could have occurred from purely random data. If 100 data points were used in the analysis, there is only a 0.001 (or 0.1 per cent) probability that purely random data could produce a correlation coefficient greater than 0.327 (Reference 2). The moral of this story is that you should gather as much data as you can for the analysis.

The standard error of the estimate comes into play when you use the coefficients (a and b) to estimate a future point on the curve. There is a 68 per cent probability that the real value lies within one standard error of the value estimated by the regression analysis. So if the regression analysis predicts a y value of 55 and the standard error of the estimate is 0.55, there is a 68 per cent chance that the actual value is between 54.45 and 55.55. There is a 99 per cent probability that the real value lies within 2.58 times the standard error of the estimated value.

A C++ Class

As a "born again" object-oriented programmer, I have noticed that the published references on numerical programming, even though some of them are using C++, tend to treat numerical analysis strictly as a set of algorithms. Most of the code I have seen consists of sets of simple functions which do the analysis or part of it. Typically an entire program is devoted to just a single linear regression analysis on one set of data. When I was converting my old programs to C++ I approached my linear regression code in an object-oriented way. Of course this meant encapsulating the regression analysis into a C++ class. The linear regression class that resulted from my efforts turned out to be very simple.

The code in linreg.h, shown in Listing 1, contains the definition of my linear-regression analysis class. There are two classes defined in this header file. The first is a very simple class to define a two-dimensional point on a plane. This class, Point2D, is self-contained in the header file. There are more robust implementations of point classes (this one is extremely simple), but Point2D has enough functionality to serve the LinearRegression class. All I really wanted in this case was a class to group my x and y values into a logical pair.
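Listing 1 is not reproduced in this excerpt; the following is a minimal sketch of what such a point class might look like (the accessor names are assumptions, not the published listing):

```cpp
// A minimal two-dimensional point class: just enough to
// group an x value and a y value into a logical pair.
class Point2D
{
public:
    Point2D(double X = 0.0, double Y = 0.0) : x(X), y(Y) { }
    double getX() const { return x; }
    double getY() const { return y; }
private:
    double x, y;
};
```

The default arguments let a Point2D be created with no coordinates at all, which is convenient when declaring arrays of points.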

The main linear-regression class, LinearRegression, is the second one defined in Listing 1. This class accepts data points and calculates the coefficients for the straight line. All calculated coefficients, including the error and correlation coefficients, are available from member "get" functions. In addition, I overloaded the insertion operator to print the coefficients to a standard C++ output stream. I chose not to save the individual data points in the class, to keep it as simple as possible. All I keep up with are the sums needed to do the regression calculation. The calling function can keep track of the individual data points if it needs to.

The code in linreg.cpp, shown in Listing 2, is the implementation code for the LinearRegression class. I provided constructors for creating a LinearRegression instance using an array of Point2D data or a pair of arrays of x values and y values. The Point2D-array constructor doubles as the default constructor. You can use the haveData member function to determine if there is enough data in the instance to calculate the coefficients. You need at least three points for the math to work (considerably more for the statistics to be significant). The addXY and addPoint member functions are used to add individual data points to the LinearRegression object.

The Calculate member function actually calculates the coefficients. This function is a protected utility function. It is called by addXY each time a data point is added. The addPoint function simply calls addXY. The constructors also load their data through addXY, so all data enters the object by the same path. This ensures that the coefficients remain updated no matter how the data points enter the object.
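Listings 1 and 2 are not reproduced here, but the sums-only design described above can be sketched in a few dozen lines. Member names follow the text; the bodies are my assumptions about one reasonable implementation, not the published code, and the array constructors and overloaded insertion operator are omitted for brevity:

```cpp
#include <cmath>

// Sums-only linear regression: y = a + b*x.
// Individual points are not stored; only the running sums are,
// and Calculate() refreshes the coefficients after every addXY().
class LinearRegression
{
public:
    LinearRegression() { }

    virtual void addXY(double x, double y)
    {
        n++;
        sumX  += x;
        sumY  += y;
        sumXY += x * y;
        sumX2 += x * x;
        sumY2 += y * y;
        Calculate();
    }

    // Enough data to compute the coefficients and statistics?
    bool haveData() const { return n > 2; }
    long items()    const { return n; }

    virtual double getA() const { return a; }
    double getB() const { return b; }
    double getCoefCorrel() const { return coefC; }
    double getCoefDeterm() const { return coefC * coefC; }
    double getStdErrorEst() const { return stdError; }
    virtual double estimateY(double x) const { return a + b * x; }

protected:
    void Calculate()
    {
        if (!haveData()) return;
        double d = n * sumX2 - sumX * sumX;     // common denominator
        if (d == 0.0) return;                   // all x values identical
        b = (n * sumXY - sumX * sumY) / d;
        a = (sumY - b * sumX) / n;

        // Correlation coefficient, guarded against a flat y series.
        double dy = n * sumY2 - sumY * sumY;
        coefC = (dy > 0.0) ? (n * sumXY - sumX * sumY) / std::sqrt(d * dy)
                           : 0.0;

        // Standard error of the estimate, from the sums alone.
        double sse = sumY2 - a * sumY - b * sumXY;
        stdError = std::sqrt(sse > 0.0 ? sse / (n - 2) : 0.0);
    }

    long   n = 0;
    double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
    double a = 0, b = 0, coefC = 0, stdError = 0;
};
```

Feeding this sketch the exact line y = 1 + 2x, for instance, recovers a = 1, b = 2, a correlation coefficient of 1, and a standard error of 0.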

You can get the coefficients from the LinearRegression at any time using the "get" functions. One of the reasons to do curve fitting is so you can estimate the values of points you have not measured. The LinearRegression class provides the estimateY member function for this operation.

A simple test program utilizing the LinearRegression class is in lrtest.cpp, shown in Listing 3. It illustrates all three methods of instantiating the class using the same data. The program is extremely simple. The only critical information that the program must provide to the first two objects, in addition to the arrays, is the number of points represented in the arrays. If you don't know ahead of time how many points you will be loading, the best method is to create an empty instance and fill it one point at a time using addXY or addPoint. The object will keep track of the number of points it represents.

LinearRegression as a Base Class

Sometimes you run across situations where the data fits a curve that is still simple but not linear. An example of this is nuclear decay, which follows an exponential decay curve. For data analysis you need a regression analysis that uses the equation y = a * exp(b * x). You can convert this equation to a linear one by taking the logarithm of both sides to obtain ln(y) = ln(a) + b * x, where ln is the natural logarithm. The new equation is now linear (the logarithm of y varies linearly with x).
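The linearization step can be written out explicitly:

```latex
y = a e^{bx}
\;\Longrightarrow\;
\ln y = \ln a + b x
```

Regressing the pairs \((x_i, \ln y_i)\) with the ordinary linear machinery therefore yields the slope b directly, and an intercept \(a' = \ln a\), from which \(a = e^{a'}\).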

Using inheritance it is easy to create an ExponentialRegression class using the LinearRegression class as the base class. The code in expreg.h, Listing 4, illustrates how this may be done. The only member functions I had to override in this class were the addXY data entry function, the getA function, and the estimateY function. Note that the addXY function makes the appropriate modification to the data to make it linear and passes it on to the addXY function in the base class. The main point here is that the child class must modify the data going in to match the linear form of the equation, and it must modify the data coming out through the "get" functions so they are correct for the equation being modeled. The constructors for the derived class can be duplicates of the base class constructors or they could just invoke the base-class constructors. Because of this I did not include the implementation file for the derived class.
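Listing 4 is likewise not reproduced in this excerpt. The following self-contained sketch shows the overriding pattern the text describes, using a pared-down base class (the real LinearRegression has more members; the bodies here are my assumptions):

```cpp
#include <cmath>

// Pared-down base class: y = a + b*x, sums only.
class LinearRegression
{
public:
    virtual ~LinearRegression() { }
    virtual void addXY(double x, double y)
    {
        n++; sumX += x; sumY += y; sumXY += x * y; sumX2 += x * x;
        Calculate();
    }
    virtual double getA() const { return a; }
    double getB() const { return b; }
    virtual double estimateY(double x) const { return a + b * x; }
protected:
    void Calculate()
    {
        if (n < 3) return;
        double d = n * sumX2 - sumX * sumX;
        if (d == 0.0) return;
        b = (n * sumXY - sumX * sumY) / d;
        a = (sumY - b * sumX) / n;
    }
    long n = 0;
    double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0;
    double a = 0, b = 0;
};

// Exponential fit y = a * exp(b*x): linearize the data on the
// way in, transform the coefficients back on the way out.
class ExponentialRegression : public LinearRegression
{
public:
    void addXY(double x, double y) override
    {
        // Fit ln(y) = ln(a) + b*x, so hand (x, ln y) to the base class.
        LinearRegression::addXY(x, std::log(y));
    }
    double getA() const override
    {
        // The base class holds ln(a); undo the transform.
        return std::exp(LinearRegression::getA());
    }
    double estimateY(double x) const override
    {
        return getA() * std::exp(getB() * x);
    }
};
```

Because addXY, getA, and estimateY are virtual, the correct versions run even when an ExponentialRegression is manipulated through a LinearRegression pointer.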

I did some preparation in the LinearRegression class to make it work better as a base class. Since I might want to refer to the derived class objects using base-class pointers, I made some of the base class functions virtual. This ensures, for example, that the correct addXY member function will be executed. Remember, all data added to the class must go through the LinearRegression::addXY member function to ensure the coefficients stay updated at all times.

Decay.cpp, shown in Listing 5, illustrates the use of the ExponentialRegression class. Like the test program in lrtest.cpp, Listing 3, the program is short and very simple. Again I demonstrated more than one instance of the class. This time I used two different sets of data. The object of this program is to determine the decay constant so you can identify the isotope producing the radioactivity. In this case b is the decay constant. Once b is determined it is just a matter of looking it up in physical tables to determine the isotope responsible. As you can see, the ExponentialRegression class is just as easy to use as the LinearRegression class.

Conclusion

My main motivation for building these classes was to have an object-oriented set of regression-analysis tools. I was pleased to discover how easy it was to make the LinearRegression class and to derive other classes using the LinearRegression class as the base class. These simple classes have been very useful numerical-analysis tools for me. They are so simple that I have used them as illustrations in some of the object-oriented programming classes I teach. They make very good examples of C++ reuse at work.

References

1. William H. Press, et al. Numerical Recipes in C (Cambridge University Press, 1992).

2. Hugh D. Young. Statistical Treatment of Experimental Data (McGraw-Hill, 1962).

3. Milton Smith. A Simplified Guide to Statistics, 4th ed., (Holt Rinehart and Winston, 1970).

4. Raymond Annino and Richard Driver. Scientific and Engineering Applications with Personal Computers (John Wiley & Sons, 1986).

5. Lon Poole, et al. Some Common BASIC Programs (Osborne/McGraw-Hill, 1980).

David C. Swaim II has a B.S. degree in Physics, a professional degree in Mechanical Engineering, a master's degree in Physics, and a Ph.D. in Applied Physics. He is a senior information systems analyst for Southern Company Services in Atlanta, Georgia and also a member of the adjunct faculty at Gwinnett Technical Institute in Lawrenceville, Georgia, where he teaches C++ and object-oriented programming. His email address is David@Swaim.com.