Labeling the axes of a graph is not as easy as you might think.
In plotting and other graphics applications, generating an axis definition on the fly has always proved to be a difficult task. It is especially difficult when the data, the output device, and user preferences are unknown in advance. This article presents a function that helps to overcome some of these difficulties. In particular, the function possesses the following qualities:
- It never fails to return an axis definition for valid arguments.
- It uses a "data-driven" method of axis determination; that is, the axis is based solely on the range of data to be plotted, as opposed to an arbitrary definition of what constitutes a good scale.
- It makes only minor assumptions about the data and the end-user.
An axis can be completely defined by three items: the minimum value, the increment value for major tick marks, and the maximum value. However, many past attempts to define an axis-generating algorithm have seemingly failed to capture this simplicity. Typically, these algorithms have constrained the data to fit what the programmer believed was a good definition for an axis. Early examples of such algorithms would be the FORTRAN routines CHLON and SCALE, which use predetermined increment values. Recently, this journal presented a solution that built upon the same foundation laid by SCALE, but which exhibited an additional problem of failing to return an axis definition under certain conditions [1].
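To make the three-item definition concrete, here is a minimal sketch. The AxisDef structure and the MajorTicks helper are hypothetical names introduced only for illustration; they do not come from Listing 1. The sketch assumes that the span from minimum to maximum is a whole number of increments, and that a C++11 compiler is available (for std::lround).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical type: the three numbers that define an axis.
struct AxisDef {
    double min;  // value at the first major tick mark
    double inc;  // spacing between major tick marks
    double max;  // value at the last major tick mark
};

// Expand an axis definition into the values of its major tick marks.
// Each tick is computed as min + i * inc rather than by repeated
// addition, so rounding error does not accumulate along the axis.
std::vector<double> MajorTicks(const AxisDef& a)
{
    assert(a.inc > 0.0 && a.max >= a.min);
    // Assumption: (max - min) is a whole number of increments, so
    // rounding only absorbs floating-point noise in the division.
    const long steps = std::lround((a.max - a.min) / a.inc);
    std::vector<double> ticks;
    for (long i = 0; i <= steps; ++i)
        ticks.push_back(a.min + i * a.inc);
    return ticks;
}
```

For the definition 0 TO 100 BY 10, this yields eleven tick values, from 0 through 100.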
The problem with this type of solution is that it constrains future data to fit what the programmer has prescribed in advance to be a valid solution. Yet programmers cannot possibly anticipate every form of data to be encountered by their algorithms. As an alternative, I present a solution that exploits two common facts: most plotting scales are linear, and the decimal number system is widely accepted in modern mathematics and computing. Using these two concepts, it is possible to throw out hard-coded intervals and all their trappings. The resulting algorithm is data-driven: it uses the data to generate a solution that fits the data.
Scales and Number Systems
The algorithm presented here relies on the notion of linear scales. A scale is linear when the distance represented between any two adjacent points (tick marks) equals the distance represented between any other two adjacent points. The integer number line is an example of a linear scale: the distance represented between any two adjacent points is one. An example of a nonlinear scale is the logarithmic scale. The value represented at any point i varies with log(i), assuming i > 0. In a logarithmic scale, the distance represented between adjacent pairs of points is greatest at the low end of the scale. This distance shrinks as you move up the scale. Nonlinear scales have rather restricted but valuable uses. For example, if you plot data on a logarithmic scale, you can quickly tell if it follows an exponential distribution: it will yield a straight line. However, unless there are compelling reasons for using nonlinear scales, they should be avoided. They can fool human perception all too easily.
The function described here also assumes the use of the decimal place-value numbering system. Although most of us probably take the decimal system for granted, it is worth noting there are many cultures which do not [2]. The power of the decimal system can be expressed by the logarithm function: log10(x) = y. This equation says that for any positive real number x, there is a real number y such that 10^y = x. Suppose you have a set of data values you need to plot. Let x be proportional to the difference between the smallest and largest value. If an integer y can be found that is smaller than log10(x), then the number 10^y should make a good initial candidate for an increment value.
Moreover, 10^y will have a desirable property: it will be an increment value that people can readily cope with. That is, we are much better at estimating the values of observations on the scale {0, 10, 20, ..., 100} than we are on {0, 7.25, 14.50, 21.75, ..., 101.50}. While the latter scale completely contains the former, we are usually more successful at (and more confident about) locating points such as 20 or 25 on the first scale [3].
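A quick worked example may help here. The data values 3 and 76 are made up for illustration (they are not taken from Table 1), and x is assumed to be the range itself:

```latex
x = 76 - 3 = 73, \qquad \log_{10}(73) \approx 1.863
y = 1 < 1.863 \quad\Longrightarrow\quad 10^{y} = 10
```

The largest integer below log10(73) is 1, so 10 is the initial candidate increment for this data.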
The DefineAxis Function
The complete code for the axis definition function is presented in the file DefineAxis.cpp, Listing 1. This file contains one function, DefineAxis, whose arguments are three pointers to doubles: the minimum value of the data to plot, the maximum value of the data to plot, and the increment value for the scale. DefineAxis uses the first two arguments to determine the range of the data. It returns in these three arguments the minimum value of the scale, the maximum value of the scale, and the increment value to use for the major tick marks.
The code in Listing 1 was compiled using Borland C++ Builder Version 5.0, but it uses no constructs specific to this compiler, so it should be easy to compile on other systems. The only thing you must watch out for is the definition and implementation of floating-point functions, such as log10.
The function begins by defining local variables it will need to complete its task. Test_inc contains the function's temporary value for a calculated increment value. Test_min and Test_max are the respective minimum and maximum values of the scale, and Range is the range of the first two arguments you initially pass to the function.
The function first checks that the arguments are valid (the range must be non-negative) and for the special case where all values are the same or only one point is being plotted (the range equals zero). With these minor issues out of the way, the function determines a candidate increment value using the base 10 logarithm of the range divided by 10. (Recall that in the previous section I proposed using an x proportional to the range, as a starting place for the search.) The range must be divided by 10 in order to prevent the candidate increment value from being larger than the range. To see why, imagine for the moment that this line of code does not first divide the range by 10. Then the return value of log10(Range) will be some real number k, between the integers n and n+1; that is, n < k <= n+1. Since this implies 10^n < Range = 10^k <= 10^(n+1), then this line of code will return an increment value of 10^(n+1), because the ceiling function will round up the exponent k to the next integer value of n+1. In general, an increment value should be smaller (in magnitude) than the range.
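The effect of dividing by 10 can be seen by placing the two forms side by side. This is a minimal sketch, not the code from Listing 1: CandidateIncrement and UndividedIncrement are hypothetical names, and std::pow(10.0, ...) stands in for the pow10 function quoted in the conclusion, which not every compiler supplies.

```cpp
#include <cassert>
#include <cmath>

// Candidate increment as the article describes it: a power of ten
// chosen from the range divided by 10.
double CandidateIncrement(double range)
{
    // Zero and negative ranges are assumed to be handled before this step.
    assert(range > 0.0);
    return std::pow(10.0, std::ceil(std::log10(range / 10.0)));
}

// The same formula without the division, shown only for comparison.
double UndividedIncrement(double range)
{
    assert(range > 0.0);
    return std::pow(10.0, std::ceil(std::log10(range)));
}
```

For a range of 73, the first form gives an increment of 10, while the second gives 100, which is larger than the range itself.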
The next eight lines determine the maximum and minimum values of the scale so that they are separated by an integral multiple of the candidate increment. These few steps can be fundamentally important if the graphing system employs axis definitions of the form {minimum TO maximum BY increment}, as, for instance, the SAS Institute GRAPH module does. If the difference between the maximum and minimum scale values is not an integral multiple of the increment value, then the upper part of the scale will be truncated. The largest tick mark will fall at (minimum value+(n-1)*increment value) < maximum value, where n is the number of tick marks, and this could result in data not being displayed on your graph.
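The sketch below illustrates both the truncation problem and one way to avoid it. It is an assumption-laden illustration: it uses floor and ceil instead of the loops in Listing 1, and the names ScaleEnds, SnapToIncrement, and LastTick are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical type holding the two ends of a scale.
struct ScaleEnds {
    double min;
    double max;
};

// Widen the data limits outward to the nearest multiples of inc, so the
// scale spans a whole number of increments and contains all the data.
ScaleEnds SnapToIncrement(double dataMin, double dataMax, double inc)
{
    assert(inc > 0.0 && dataMax >= dataMin);
    ScaleEnds s;
    s.min = std::floor(dataMin / inc) * inc;
    s.max = std::ceil(dataMax / inc) * inc;
    return s;
}

// Value of the last tick mark drawn for {min TO max BY inc}: the largest
// min + k * inc that does not exceed max.
double LastTick(double min, double max, double inc)
{
    assert(inc > 0.0 && max >= min);
    return min + std::floor((max - min) / inc) * inc;
}
```

With illustrative data running from 3 to 76 and an increment of 10, the unadjusted definition 3 TO 76 BY 10 draws its last tick mark at 73, short of the largest data value. Snapping the ends first gives 0 TO 80 BY 10, whose last tick mark is 80.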
Note the code following the do loop that establishes the bottom value of the scale: it sets the bottom value of the scale to zero if it is less than 1E-10, which seems like a fairly reasonable limit. This step is needed because floating-point arithmetic cannot represent most decimal fractions exactly. For example, if the function were passed the values of 0.01 and 0.1 as the minimum and maximum data values, then the scale that this function would produce would be 1.0408341E-17 TO 0.1 BY 0.01. The extremely small bottom scale value is produced in the do loop, when Test_min = 0.01 and the code subtracts Test_inc = 0.01 from it. Everyone knows that the value should be zero, but the floating-point representation used by computers always seems to discover a couple of bits that the rest of the world was ignorant of. That is, the value of 0.01 was actually seen by the computer as 0.0100000000000000104083408558608, hence extremely small and bizarre values pop up when zero is expected. The if statement that follows the do loop checks for and corrects this problem. (If a user complains that setting the bottom of the scale to zero doesn't work because their data is measured that precisely, then you should strongly suggest that the data be re-scaled and a note to that effect placed in a footnote.)
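The correction itself amounts to a single comparison. The following is a minimal sketch rather than the code from Listing 1; SnapTinyToZero is a hypothetical name, and testing the magnitude of the value (instead of the signed value) is an assumption made here so that tiny negative residues are caught as well.

```cpp
#include <cassert>
#include <cmath>

// Treat a scale minimum whose magnitude is below tol as exactly zero.
// The 1E-10 default is the limit mentioned in the article; comparing
// the magnitude rather than the signed value is this sketch's assumption.
double SnapTinyToZero(double scaleMin, double tol = 1e-10)
{
    assert(tol > 0.0);
    return (std::fabs(scaleMin) < tol) ? 0.0 : scaleMin;
}
```

Passing in the residue quoted above, 1.0408341E-17, returns 0, while an ordinary value such as 0.01 is returned unchanged.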
The next six lines check that there are at least six major tick marks on the resulting scale. If not, then the increment value is divided in half. If it uses a smaller increment value, the function then checks whether it is possible to "tighten up" the scale by increasing the bottom value and decreasing the top value of the scale. The choice of six tick marks as a minimum number is based on my experience with graphs; it falls somewhere between arbitrary and empirical. It should be easy to change the function to produce a user-specified minimum number of tick marks.
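One way this step might look, with the minimum number of tick marks exposed as a parameter, is sketched below. The Scale structure, the EnsureMinTicks name, and the exact tightening conditions are assumptions made for illustration; Listing 1 may differ in detail. As in the description above, the increment is halved only once. The sketch assumes a C++11 compiler (for std::lround).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical type holding a complete axis definition.
struct Scale {
    double min;
    double max;
    double inc;
};

// If the scale has fewer than minTicks major tick marks, halve the
// increment once, then pull the ends inward by whole increments for as
// long as the data still fits between them.
Scale EnsureMinTicks(Scale s, double dataMin, double dataMax, int minTicks = 6)
{
    assert(s.inc > 0.0 && s.min <= dataMin && dataMax <= s.max);
    // Tick marks = intervals + 1; rounding absorbs floating-point noise.
    const long ticks = std::lround((s.max - s.min) / s.inc) + 1;
    if (ticks < minTicks) {
        s.inc /= 2.0;
        while (s.min + s.inc <= dataMin)  // raise the bottom of the scale
            s.min += s.inc;
        while (s.max - s.inc >= dataMax)  // lower the top of the scale
            s.max -= s.inc;
    }
    return s;
}
```

With illustrative data from 6 to 24 on a starting scale of 0 TO 30 BY 10 (four tick marks), the sketch halves the increment and tightens the scale to 5 TO 25 BY 5. That result has five tick marks, so a single halving does not always reach the requested minimum.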
The function wraps things up by passing the scale parameters (minimum, maximum, and increment) back to the caller, through the pointer arguments passed in.
Function Performance
Table 1 presents the results from this function given various input conditions. The first two columns show the minimum and maximum data to plot, followed by the range. The fourth through sixth columns are the axis definitions returned by the function: the scale minimum, scale maximum, and increment values. While the table does not cover all possible data ranges (which would be impossible), for the ones that it covers it shows that the returned axis definitions are easy to comprehend and completely encapsulate the range of the data.
Conclusion
This article has presented an alternative function for axis definition. This function determines an axis based on the range of the data to be plotted; it backs away from using arbitrary, predefined increment values. The function does not require maintenance when unexpected extremes are encountered; that is, you don't have to add new increment values to an array. Moreover, you don't have to worry about any ripple effects this may have in any other products that use this code. (Indeed, most of the function is nothing more than window dressing to the single line of code that calculates the candidate increment value: Test_inc = pow10(ceil(log10(Range/10))).) The function makes just two assumptions: that the scale is linear and that the data is measured using the decimal place-value system.
This function is not the only solution to the axis definition problem, but it attempts to solve an old problem using a deterministic and scientific approach. Another potential solution to this problem could exploit the center and spread of the data. That is, build an axis definition whose center approximates the median of the data, with major tick marks radiating out from this point that are approximately one standard deviation apart.
References
[1] Antonio Gómiz Bas. "Finding Neat Scales for Plotting," C/C++ Users Journal, March 2000.
[2] J.D. Barrow. Pi in the Sky (Clarendon Press, 1992).
[3] W.S. Cleveland. The Elements of Graphing Data (Wadsworth, 1985).
Michael Bramley is a consultant with over 10 years of experience in the software industry. He has degrees in computer science, history, and mathematics, and a master's in statistics. He is currently occupied as a statistical programmer/analyst and enjoys writing inter-process communication software.