Summations for Fast Counting

Efraim Berkovich

How do I sum thee? This author counts and illustrates the many, useful ways.

In computing, many problems require some sort of counting. That does not just mean straightforward problems such as counting the number of words in a document. There are many more elaborate counting situations, not to mention times when high performance in solving a problem is important. For these problems, a good counting algorithm lets you avoid time-consuming calculations and/or complicated coding.

When what you are counting remains relatively static, pre-computing the counting improves the program's response time. In the simplest case, suppose the program pre-computes the answer to every possible question and fetches the appropriate answer when asked. It's hard to get faster than that.

For many counting problems, however, the simple approach is not practical due to the large number of possible inputs. Calculating all of them may exceed a reasonable amount of computer memory.

Summation is one approach to solving counting problems that is based on the idea of pre-computing results. Suppose you have a function that tells you the count (of whatever you are counting) at each increment. Another way to represent the same data is as a running sum, effectively the integration of the count function. If the count function is non-negative, then the summation function lets you quickly find the count between any two points by subtraction alone.

An example of a counting problem that can be addressed by summation techniques is a "business calendar." A business calendar keeps track of business hours, such as business holidays, normal business hours, half-days, etc. The business calendar must be able to answer date measurement questions in terms of business time (as opposed to regular time). Sample questions might include: "What date is exactly 12 business days from October 10, 2002?" or "How many business hours are there between 9:30 AM August 3, 2002 and 3:00 PM August 6, 2002?"

This article discusses summation techniques for solving counting problems that have a non-negative count function. The business calendar example provides a good base for demonstrating solutions to questions of counting. The code in this article is written in Java, but the concepts apply to other programming languages.

Adding and Subtracting

The math behind the summation approach to counting is straightforward. Suppose that you have a non-negative counting function f(x) that is defined over a domain D (see Figure 1). Then, a summation function F(x) is defined over the same domain as being the sum of f(t) as t increases in D to x (see Figure 2). Basically, the summation function is like the integration of the counting function. Note that this summation function is a non-decreasing function (i.e., it either stays the same or increases as x increases).

There are two basic counting questions that can be addressed by manipulating the summation function F(x).

1. Subtraction: What is the total count of all the f(x) between x1 and x2?

2. Addition: What value of x2 makes the total count of all the f(x) functions between x1 and x2 exactly m?

These questions are inverse operations of each other; the first question is analogous to subtraction, and the second question is analogous to addition.

Translating into the specific example of the business calendar, the two questions become something like the following:

1. Subtraction: How many business days are there between date1 and date2?

2. Addition: What date is m business days from date1?

The first question (the subtraction question) has an answer that comes straight from the summation function definition. The total count between the domain values x1 and x2 is just F(x2) - F(x1). To see that this is so, consider the case where the summation is zero. If F(x1) is zero, then F(x2) will be the exact count at x2. So, the summation function can determine the count between any two points by using one subtraction.

Although the second question (the addition question) is an inverse of the subtraction question, determining the answer from the summation function is not a direct operation. In fact, to solve directly requires having the inverse of the summation function, defined by F-1(y) (see Figure 3).

In the subtraction question, the equation F(x2) - F(x1) = m determined the solution for the difference in the total count between x1 and x2. The two x values are known, and the unknown variable to solve for is m. In the addition question, the values of x1 and m are known, and you want to solve for the x2 variable. The original equation from the subtraction question can be restated as F(x2) = F(x1) + m. And with the inverse F-1(y) defined, the solution for x2 = F-1(F(x1) + m). Having the inverse of the summation function lets you determine the answer using one addition.

One thing to keep in mind when programming solutions to the addition question is that it may be possible to solve the problem without using the inverse F-1(y) directly. Some methods discussed later do just that. Rather than calculating the inverse, it is possible to stumble onto it by looking through the values of x until the desired value of F(x) shows up. For example, as a simple algorithm, you can just start at x1 and check the values F(x1 + 1), F(x1 + 2), etc. until you find where F(x2) = F(x1) + m. These searching approaches may be appropriate when a) searching through the summation function will not require too many operations, and b) creating data structures to handle the inverse is considered too complicated or undesirable for whatever reason. (Laziness is an acceptable excuse.)

Simple Summation in an Array

If you have a relatively small number of increments in the domain, then you can use an array to store the results of the summation. In the business calendar example, suppose that you are interested in dealing with time in the gradation of a day and that the business calendar itself spans only a few years. Then storing the summation results fits into an array on the order of a thousand elements long (see Figure 4).

Take a look at the code in the Java DayBusinessCalendar class (available for download at <www.cuj.com/code>). The class has an internal array for storing the summation function. Each slot in the array represents a day on the regular calendar. The slot holds the running summation total of the business days up until that day. When the class is instantiated, the constructor code builds the summation array using the list of business holidays passed as a parameter and a determination of the weekends.

The code design makes the assumption that performance in instantiating the class is not a high priority. In the class, the private function findCount is the analogue of the f(x) count function talked about above. The constructor uses that function to run through the calendar[] array, summing up as it goes. Clarity of the design takes higher priority than the speed of calculation here. With the DayBusinessCalendar created, you can call the add and diff public methods to answer the two basic counting questions.

The diff function works as expected in answering the subtraction question. The function subtracts the value of the summation function at one date from the value of the summation function at the other date. Each date maps to a location in the calendar[] array, so obtaining the value of the summation function is just a lookup in the array at the appropriate place. The function then subtracts one value from the other and returns the answer, yielding a computational complexity of O(1).

Having the inverse of the summation function would make solving the addition question straightforward. For this example, calculating the inverse is actually not a big deal. To do that, you make a new array with the slots corresponding to the number of business days elapsed. Each slot in the array holds the calendar date that corresponds to that number of business days. The example code builds the inverse in the constructor for the class and stores the values in the array called inverse[].

The inverse may provide non-unique values because there is no guarantee that the inverse of a function is a function. So, for those cases where there are non-unique values, the rule is to pick the smallest date. With the inverse mapping in an array, you do one addition to find the location containing the resulting date (see the add function). The computational complexity is O(1).

Although in this case making the inverse is easy, the code also contains two other alternatives, to illustrate the other available options. One alternative, shown in the add2 function, uses the existing calendar[] array and a binary search. The first step is, of course, to find the number of business days at the start date and then to add the desired number of business days to that. The binary search finds the value of x where the summation F(x) has the desired number of business days. This value is the date that is the given number of business days away from the initial date.

Using the binary search is possible because the calendar[] array is already in sorted order thanks to the non-decreasing summation function. The inverse is non-unique. So, after finding the x value of the desired F(x), the code checks backwards through the array to make sure to return the lowest possible x value for this F(x). Assuming that running backwards finds the lowest value within a few steps (i.e., the number of consecutive non-business days is not too great), then the computational complexity is on the order of the performance of the binary search, O(log(n)).

The second alternative in the example code is the add3 function. Once again, the goal is to find x where F(x) is a particular value. Instead of using the binary search, this approach estimates where the value might be and then searches one by one from there. The code generates the estimated place to look for the inverse from the density of business days in the calendar.

Suppose that there are about 5 business days for every 7 days. To estimate what date is 10 business days from now, you add 10*7/5 days to the current date. The code checks the value of the summation function at the estimated date. If the estimate is off, then running through the array in the appropriate direction will eventually hit the desired answer. And once again, since the inverse is non-unique, the code makes sure to find the smallest matching date. Assuming that the estimate is usually pretty close and that the number of consecutive non-business days is not too high, then the computational complexity becomes pretty much that of a single lookup, O(1).

As a twist, you can repeat the estimation (if the first estimate is too far off), using the previous estimate point as the new starting point. In fact, you can keep iterating the estimating procedure until you find the desired point. A discussion of the computational complexity of the iterative estimate search algorithm is, however, beyond the scope of this article.

Piecewise Summation

For an arbitrary count function, if you do not have another way to calculate or represent the summation function, then you have to store the summation result for each increment. When the number of increments becomes large, it is no longer practical to place the summation in an array with one-to-one correspondence between the domain and the range. In other words, the size of the array may be too large if you store the representation of the summation function directly (e.g., F(1)=1, F(2)=1, F(3)=2, etc). If that is so, you may be out of luck.

In many cases, it is possible to represent the summation function in a piecewise manner to reduce the amount of necessary storage. Consider the case of the business calendar where the time unit gradation is a millisecond. For large swaths of time, the summation function is either flat (when the time is non-business time) or steadily increasing with a slope of one (when the time is business time). So, storing just the points where the function transitions between flat and sloping is enough to describe the whole function.

In the BusinessCalendar class sample, a couple of arrays store the piecewise description of the summation function. The date[] array holds the dates corresponding to the x coordinates of the transition points. The sum[] array holds the summation results at each of these points (i.e., the value of F(x)). Both the arrays contain a list of non-decreasing values; they are, therefore, sorted. This feature allows using a binary search to find arbitrary points in either array. And, to simplify things later on, you can also make an isStart[] array that tells you whether a point is the first point of a sloping piece of the summation function (see Figure 5).

This sample class defines two private helper functions: findSum and findInverse. Each operates by first finding the location of the requested point in the appropriate array. If the point is in a sloping piece of the function, then the answer is the offset within the piece plus the value at the start point of the piece. Otherwise, the answer is just the value at the start of the piece. Armed with those functions, solving the addition and summation questions becomes straightforward (see downloadable code at <www.cuj.com/code>). It is just a matter of plugging in the values returned from findSum for F(x) and findInverse for F-1(y) into the equations discussed in the previous section. Since the helper functions rely on binary search, the computational complexity of answering either question is then O(log(n)).

Conclusion

The summation method is a tool you can use in developing high performance algorithms for solving counting problems. Although fast computers are cheap and plentiful, there may come a time when you need to build a website that needs to compute the business time for one million requests at once. When that time comes, you'll be ready.

About the Author

Efraim Berkovich is a software architect living in New York City. Most recently, Efraim has been consulting at a global financial company developing a financial planning web-based application. He has a BS in mathematics and an MS in electrical engineering. Efraim can be reached at berkovichnyc@hotmail.com.