Algorithms


Skip Lists

Frederick Hegeman


Frederick Hegeman is an amateur programmer and computer language hobbyist. He can be reached at P.O. Box 2368, Rapid City, SD 57709, telephone (605) 343-7014.

Maintaining some sort of ordered list must surely be one of the most common problems in programming. An ordered list can be a symbol table, a data dictionary or the distribution of periodic readings from some instrument. Maintaining an ordered list is also one of the most commonly studied, analyzed, refined, and seemingly well understood problems.

That's why running across a completely novel approach is a bit unnerving, like finding a clam with legs. William Pugh's article, "Skip Lists: A Probabilistic Alternative To Balanced Trees," in CACM (June 1990), presented a novel data structure with characteristics intermediate between those of simple linear linked lists and perfectly balanced binary trees. I know a clam with legs when I see one.

Skip Lists

Skip lists are linked lists with extra pointers that leapfrog, or skip over, intermediate nodes during search procedures. Because the extra pointers are assigned by consulting a random number generator, search times are independent of the ordering of the data input during list construction. Figure 1 diagrams a skip list that might be built by the test program with a command line of

sltest 7 4 2 n x

If you start searching along the pointer chain marked level one, you are obviously searching along a simple one-way linked list, with a NIL pointer marking the end of the list. Each key along the chain is examined in turn. If the key compares less than the search key, you look at the next node. If it compares equal, you have found the desired node. If it compares greater, the key you are searching for is not in the list. If you reach the NIL pointer, the key is not in the list.

The strategy for searching the skip list is similar. Find the highest level pointer in the list header and follow the pointer chain. If you reach a NIL pointer, decrease the level one step. If the level drops to zero, the key is not in the list; otherwise, continue along the new chain. If you reach a node with a greater key, you've gone too far and have to drop down one more level. If the level drops to zero, the key is not in the list; otherwise, continue along that chain.
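That descent can be sketched in C. The declarations and names below (NODE, LIST, search()) are stand-ins of my own for those in skiplist.h, with integer keys in place of the COMPARE function; this is a sketch of the idea, not Listing 2 itself:

```c
#include <stddef.h>

#define MAXLEVEL 16

typedef struct node {
    long key;
    struct node *pointers[1];   /* forward chains; sized at allocation */
} NODE;

typedef struct {
    int level;                  /* highest chain currently in use */
    NODE *pointers[MAXLEVEL];   /* header: one chain per level */
} LIST;

/* Follow the highest chain in the header; whenever the next node is
   NIL or its key compares greater, drop down one level. If the level
   drops below one, the key is not in the list. */
NODE *search(LIST *list, long key)
{
    NODE **chain = list->pointers;
    NODE *p;
    int lvl;

    for (lvl = list->level - 1; lvl >= 0; ) {
        p = chain[lvl];
        if (p == NULL || p->key > key)
            --lvl;                  /* gone too far: drop a level */
        else if (p->key < key)
            chain = p->pointers;    /* keep moving along this chain */
        else
            return p;               /* found it */
    }
    return NULL;
}
```

Note that nothing is special about the level-one case: a one-level skip list simply never takes the drop-down branch until the end.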

Pugh provides a great deal of analysis. I've taken a more informal approach. First, I build a skip list, and then I take a kind of snapshot of the way it looks and behaves.

Contents Of skiplist.h

Listing 1, skiplist.h, describes the node structure used to build the lists. In order to build the largest possible list, sizeof(NODE) has been kept artificially small, by omitting any data members. If you want to use skip lists in any real application, you will have to declare a member to hold or point to data. The structure is indifferent as to where you put the data member(s), as long as the dynamically sized array pointers is the last member of the structure.
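A node along those lines might be declared as below. The names (NODE, KEY, newnode()) are my own guesses at the shape of skiplist.h, not the listing itself; the one-element array trick for the dynamically sized tail is the point:

```c
#include <stdlib.h>

typedef long KEY;               /* integer keys keep comparisons cheap */

typedef struct node {
    KEY key;
    /* data members for a real application would go here */
    struct node *pointers[1];   /* dynamically sized; must come last */
} NODE;

/* Allocate a node with room for 'level' forward pointers. One pointer
   is already counted in sizeof(NODE), so only level - 1 more are
   added; calloc() leaves every chain NIL. */
NODE *newnode(int level, KEY key)
{
    NODE *n = calloc(1, sizeof(NODE) + (level - 1) * sizeof(NODE *));
    if (n != NULL)
        n->key = key;
    return n;
}
```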

Listing 1 also defines several important constants, such as the name of the comparison function COMPARE. As written, the functions depend on a COMPARE function that conforms to the strcmp()/memcmp() model (returning negative, zero, or positive). Integers are used as the keys in the test program, both to avoid using memory for storing character strings and to keep comparison times to the minimum.

Skip List Routines

Listing 2, skiplist.c, contains the actual list maintenance routines. search() works exactly as described above, returning a pointer to the desired node if successful, NIL if not. insert() and delete() are only slightly more complicated, since they must preserve the integrity of the pointer chains. In either case, we begin by recording, at each level, the last pointer that passes the indicated key. insert() returns immediately if the key is already in the list; delete() also records the pointer to the matching node. The number of comparisons performed in this step is limited by the maximum node level currently in the list. In the case of delete(), one additional comparison is necessary to determine whether the node actually exists. After fetching a new list node or deleting an existing one, the code reconstructs the pointer chains in a loop governed by the level of the affected node.
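An insert() built on that plan might look as follows. Everything here (the names, the LIST header type, the rand()-based randomlevel()) is a reconstruction of mine from the description above, not Listing 2 itself:

```c
#include <stdlib.h>

#define MAXLEVEL  16
#define PARTITION 4             /* the one-fourth ratio Pugh suggests */

typedef struct node {
    long key;
    struct node *pointers[1];   /* dynamically sized; must come last */
} NODE;

typedef struct {
    int level;
    NODE *pointers[MAXLEVEL];
} LIST;

/* Stand-ins for the listings' newnode() and randomlevel(). */
NODE *newnode(int level, long key)
{
    NODE *n = calloc(1, sizeof(NODE) + (level - 1) * sizeof(NODE *));
    if (n != NULL)
        n->key = key;
    return n;
}

int randomlevel(void)
{
    int newlevel = 1;
    while (newlevel < MAXLEVEL && rand() % MAXLEVEL < PARTITION)
        ++newlevel;
    return newlevel;
}

/* Descend as in search(), but record at each level the address of the
   last pointer that passes the key. Those recorded pointers are then
   redirected through the new node, chain by chain. */
NODE *insert(LIST *list, long key)
{
    NODE **update[MAXLEVEL];
    NODE **chain = list->pointers;
    NODE *p, *n;
    int lvl, newlevel;

    for (lvl = list->level - 1; lvl >= 0; --lvl) {
        while ((p = chain[lvl]) != NULL && p->key < key)
            chain = p->pointers;
        update[lvl] = &chain[lvl];      /* splice point for this level */
    }
    if ((p = *update[0]) != NULL && p->key == key)
        return p;                       /* already in the list */

    newlevel = randomlevel();
    if (newlevel > list->level) {       /* raise the header's level */
        for (lvl = list->level; lvl < newlevel; ++lvl)
            update[lvl] = &list->pointers[lvl];
        list->level = newlevel;
    }
    if ((n = newnode(newlevel, key)) == NULL)
        return NULL;
    for (lvl = 0; lvl < newlevel; ++lvl) {
        n->pointers[lvl] = *update[lvl];
        *update[lvl] = n;
    }
    return n;
}
```

delete() uses the same update array; instead of splicing a node in, it routes each recorded pointer around the doomed node before freeing it.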

The list always begins with a pseudo-member, or header, with space for MAXLEVEL pointers and provision for recording the current level of the list. newlist() ensures that you have a header node large enough to hold the maximum number of pointers that may be allotted. The call to calloc() will initialize each potential pointer to NIL. The initial level is set to one, which is the minimum. newnode() gets memory for a dynamically sized node and initializes the key member of the structure.

randomlevel() returns an integer from 1 through MAXLEVEL, based on the numbers chosen for DMAX (MAXLEVEL) and NMAX (PARTITION). The ratio NMAX/DMAX determines the distribution of node levels through the list. For example, if DMAX is 4 and NMAX is 2, the function slrandom() is called repeatedly with 4 as its argument, incrementing the local variable newlevel so long as the returned value is 0 or 1. Note that the entire series of consecutive values less than NMAX is consumed; the length of that series determines the level. Given a random number generator with reasonable performance, the distribution will approach the programmer-specified ideal, in which approximately one-half of all nodes have only one pointer, one-fourth have two pointers, one-eighth have three, and only one-sixteenth have the maximum number.
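The mechanism reduces to a few lines. Here rand() % DMAX stands in for slrandom(DMAX), and the names are mine rather than Listing 2's:

```c
#include <stdlib.h>

#define DMAX 4                  /* MAXLEVEL, for this illustration */
#define NMAX 2                  /* PARTITION */

/* Consume the run of consecutive draws that fall below NMAX; the
   length of that run, plus one, becomes the node's level, capped at
   DMAX. With NMAX/DMAX equal to one-half, each additional pointer is
   half as likely as the last. */
int randomlevel(void)
{
    int newlevel = 1;
    while (newlevel < DMAX && rand() % DMAX < NMAX)
        ++newlevel;
    return newlevel;
}
```

Tallying a few thousand calls reproduces the halving pattern described above, within ordinary sampling noise.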

Random Number Routines

The pseudo-random number routines are in Listing 3, slrandom.c. The functions slrand() and slsrand() and the constant SL_RAND_MAX mimic their ANSI standard library counterparts in <stdlib.h>. They are based on code published by the ANSI committee, using longs rather than unsigned longs to accommodate compilers that do not recognize unsigned long ints. I included them so you could reproduce results across compilers and machines. When you're tired of the test program, use the standard functions. The function slrandom() scales a pseudo-random number returned from the standard function to fit within a specified range. Your compiler libraries may already have something similar. slrandom() provides a good distribution of values within a range — other schemes may not.
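One reasonable way to do that scaling, sketched from the description rather than copied from Listing 3 (which may differ in detail), is to divide the generator's full range into n equal partitions rather than taking a remainder:

```c
#include <stdlib.h>

/* Scale a standard pseudo-random draw into 0..n-1. Unlike the naive
   rand() % n, which leans on the low-order bits and favors small
   values when RAND_MAX + 1 is not a multiple of n, this keeps the
   partitions as even as the generator allows. */
int slrandom(int n)
{
    return (int)((double)rand() / ((double)RAND_MAX + 1.0) * n);
}
```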

Test Routines

Listing 4, sltest.c, contains the code for a simple test program to build and examine some skip lists. Five parameters are entered on the command line as unsigned integers — the size of the list to build, the value of MAXLEVEL, the value of PARTITION, a seed for the pseudo-random number generator, and a flag to choose between random insertion of keys or an ordered insertion. None of these routines is very interesting on its own. A skip list of the indicated size, or of the largest size allowed by memory, is built as requested. Certain calculations are handled most easily by lookup tables, so the maximum size you can request is limited to 32,767, which will probably exceed the size of any list you can actually build in memory. Once the list is built, each element known to be in the list is searched for in its turn. I use the number of comparisons performed during a search as the measure of the length of the search path. This information is collected, organized, and output to stdout as shown in Figure 4.


When you look at Figure 4, start with the line "pointers per level." No node has more than seven pointers, far fewer than the maximum allowed. About three-fourths of all nodes have only one pointer. The number of nodes with two pointers is approximately one-fourth the number at level one, and likewise for levels three and four. A little calculation shows that 4,050 pointers (not counting 16 pointers in the header) are sufficient to maintain a skip list of 3,000 entries — 1.35 pointers per node. Since a binary tree requires two pointers per node, the savings is 1,950 pointers — most probably 3,900 or 7,800 bytes. In general, the average number of pointers per node will be 1/((MAXLEVEL - PARTITION)/MAXLEVEL), or equivalently MAXLEVEL/(MAXLEVEL - PARTITION), a constant independent of the length of the list. The expected figure here is 1.33 pointers per node. Unless the list is very short, you can make a reasonably close estimate of memory requirements even though you can't predict the structure of the list.
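That constant can be derived directly: every node gets one pointer, a second with probability p = PARTITION/MAXLEVEL, a third with probability p squared, and so on, a geometric series summing to 1/(1 - p). A throwaway function (mine, not part of the listings) makes the arithmetic concrete:

```c
/* Expected pointers per node: 1 + p + p^2 + ... = 1/(1 - p), where
   p = PARTITION/MAXLEVEL is the chance of each additional pointer.
   Equivalently, MAXLEVEL / (MAXLEVEL - PARTITION). */
double avgpointers(int maxlevel, int partition)
{
    return (double)maxlevel / (double)(maxlevel - partition);
}
```

With the ratio one-fourth the expectation is 4/3, the 1.33 pointers per node quoted above; the observed 1.35 is ordinary random variation.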

The line "comparisons per search" is less interesting in isolation than when changing the command line parameters. Note, however, that you would have built a degenerate tree with the same input sequence.

The simple graph is scaled so that the display will fit on a 24-line screen. Here, it shows a nice distribution of results about the mean. Very short lists can have distributions that are bi- or tri-modal, or even discontinuous; the random number routines simply don't cycle long enough to display any "random" behavior. The graph of comparisons over a reasonably long list sometimes starts looking less than bell-shaped. That tells you that one of the command line parameters needs to be adjusted to get better search properties.

I consider a tree to be perfect when it requires the minimum number of comparisons to find each node in the tree. This situation occurs when each level of the tree, except, perhaps, the last level, is full. A degenerate tree is the same as a linked list. The figures for either are easily calculated and presented for comparison.

Using sltest

Twiddling the knobs can illustrate a number of things. First, varying only the random number seed and the flag parameter should demonstrate that a skip list of any reasonable length is indifferent to the order in which it was built. Once you're satisfied that this is true, you will find that testing goes a little faster if you always request in-order insertion.

Second, the effect of varying the random number seed decreases as the size of the list increases. Building a series of short lists can seem chaotic, but only because the sequence of numbers generated never becomes long enough to become truly random.

Third, varying the size of the list while keeping the other parameters constant might show some seemingly paradoxical behavior — the mean number of comparisons goes down as the size of the list increases. It's only temporary. In general, the mean increases with the size of the list, but not always continuously. A strategic node is inserted, randomly of course, and many search paths are favorably affected. The list grows longer, and the mean begins to increase again. Pugh demonstrates that searching a skip list is O(log N) — as in searching a balanced tree. Intuitively, the cost of searching the list increases very slowly relative to the size of the list.

This is the case if MAXLEVEL and PARTITION are in the proper proportions — both to each other and to the size of the list. A little doodling around should demonstrate that a ratio greater than one-half makes no sense at all. One-half works well but uses an average of two pointers per node. Pugh suggests the ratio one-fourth. Empirically, one-fourth seems an ideal balance between performance and memory usage, but why this should be a magic number is beyond my understanding.

Tweaking the values for MAXLEVEL while maintaining the same ratio to PARTITION is less informative than it might appear. It is interesting to watch performance degrade as list size grows and/or MAXLEVEL decreases. Think of the distribution of pointers per node. If the distribution was exactly in our chosen ratio, we could determine the level where the expected number falls to one or below. Take a list of 32,768 elements. If the ratio were one-half, the level would be 15. If the ratio were one-fourth, the level would be nine. As MAXLEVEL starts to drop below expected levels, pointers begin to accumulate at the highest level, and search times increase. You might not have noticed how few nodes have been allotted at levels higher than strictly necessary, although I tried to hint at it previously. I would suggest that maintaining the ratio is more important than saving a few bytes allotted in the header, a few more that might or might not be needlessly allocated in one or two rogue nodes, and a negligible bit of stack space on insertion or deletion. A level of 16 will safely accommodate at least 65,536 nodes — I would think very hard about what kind of memory space that would require — and leaves more room for adjusting PARTITION, where you can preserve serious amounts of memory.

Varying the value of PARTITION against a constant MAXLEVEL has another important effect that might not be readily apparent. That ratio, together with list size and random chance, has a bearing on the maximum number of comparisons per search. It seems obvious that the maximum should increase as the size of the list increases. Changing the random number seed should cause the maximum to vary as different lists are built. Changing the ratio should affect the maximum as more or less leapfrogging gets done at higher levels. It might not be so obvious that as the ratio shrinks away from one-half, the maximum path length grows increasingly sensitive to changes in the random number seed — that is, to chance. At the limit of one-sixteenth, the variations can be striking.

Chance is the wildcard that I have been deliberately ignoring. There is always the chance that you might build a degenerate structure like one of those in the sidebar. You might as well get the worrying out of your system by using my random number routines and running

sltest 16 3 1 35 1

You may never see such a situation again. Why this should be so might be illustrated by going through the previous paragraph in reverse.

As the ratio approaches one-half, the volatility of the maximum search path lessens. If MAXLEVEL were chosen intelligently, or allowed to default, the maximum search path in any list may vary but should remain within a comfortable multiple of the mean. This conforms to Pugh's probabilistic analysis. Intuitively, too, as the mean increases very slowly in comparison to increases in list size, any constant multiple of the mean is also increasing very slowly. At the same time, the observed maximum is becoming relatively less sensitive to the random number seed. It behaves rather nicely as a multiple of the mean, and as a multiple of what we might guess to be the mean of the means, so to speak, of all the lists of that size that could be built. The result is a crude demonstration that the behavior becomes more regular as the size of the list increases.

This seems to me one of the most profound properties of skip lists. The probability of worst-case behavior decreases exactly as the consequences of such behavior become more severe. And, inversely, the probability of poor behavior is greatest exactly at the stage when it matters least. Build a list of 16 elements — the size of the list is limited; the random numbers don't look so random; the algorithms construct a degenerate list; it might take 16 comparisons to find a member of the list. So what? Build a list of 3,000 elements, and the chances can be less than one in one hundred million that a search will take, not 3,000 comparisons, but as many as 80 or 90. It is never likely to take 3,000.

Conclusions

The common choice for managing ordered lists of any real size has usually been binary trees. Binary trees are sensitive to the randomness of the keys from which they are built, so various schemes for building balanced trees have been developed to keep search times from getting out of hand. Search can be much faster, but insertion and deletion are much slower. How much slower? Baer and Schwab suggest that the number of searches per node should be at least three before you consider AVL trees over random trees. They also found that AVL trees perform better than other balancing algorithms if only insertions and searches are performed.

Now take as input some arbitrary text, break it down into individual words, record each distinct word, keep a count of how often each word appears in the input, then output the list of words in alphabetical order, together with each word's count of appearances. Sound familiar? Think of what could go wrong. Do you really need, or want, to use AVL trees?

Skip lists are simple and straightforward to implement. In-order traversal is an iterative procedure that operates in constant space, unlike recursive procedures applied to tree structures, which can overflow the stack when the tree is large or highly unbalanced. Insertion and deletion are faster than for AVL trees. Skip lists are space efficient and can yield good search performance using an average of 1.33 pointers per node. Although skip lists have terrible worst-case search performance, no predictable input sequence will create a degenerate structure, and no random sequence of insertions and deletions will create one with any significant probability. Performance is sensitive to the size of the list, the number of pointers allowed per node, and the fraction used to determine when a node is allotted an extra pointer — all of which are subject to tuning by the programmer.
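The constant-space claim is easy to see in code: the level-one chain already links every node in key order, so traversal is a plain loop. The declarations are stand-ins of my own for those in skiplist.h:

```c
#include <stddef.h>

#define MAXLEVEL 16

typedef struct node {
    long key;
    struct node *pointers[1];
} NODE;

typedef struct {
    int level;
    NODE *pointers[MAXLEVEL];
} LIST;

/* Walk the level-one chain, which visits every node in key order,
   and return the number of nodes seen. No recursion, no auxiliary
   stack; a real application would process each node here. */
long traverse(const LIST *list)
{
    long count = 0;
    const NODE *p;
    for (p = list->pointers[0]; p != NULL; p = p->pointers[0])
        ++count;
    return count;
}
```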

I cannot recommend Pugh's approach too highly because skip lists have so many desirable properties.

Bibliography

Pugh, William. "Skip Lists: A Probabilistic Alternative to Balanced Trees." Communications of the ACM 33(6), June 1990, pp. 668-676.

Baer, J.-L. and Schwab, B. "A Comparison of Tree-Balancing Algorithms." Communications of the ACM 20(5), May 1977, pp. 322-330.

Knuth, Donald E. The Art of Computer Programming, Vol. 1: Fundamental Algorithms. Reading, MA: Addison-Wesley, 1968. Knuth goes deeply into lists and trees, though not, of course, skip lists.