June 1991/Optimizing Binary Trees

Features

Optimizing Binary Trees

Bruce Terry

Bruce Terry is a computer programmer for Martek Inc. in Marmora, NJ with seven years experience programming microcomputers and three years experience using C. He can be reached at P.O. Box 325, Ocean View, NJ 08230.
Binary trees can store any set of dynamic data in sorted order. How efficiently a tree stores data depends largely on the order in which data elements are added to the tree.
Various methods have been developed to maintain balanced trees, most notably the AVL method. The problem with this type of method is that the simple code for adding a node to the tree is replaced by cumbersome and complex code. This added complexity negates the elegance and efficiency of binary trees, causing the programmer to stick with imbalanced trees.
The various tree-balancing algorithms share one thing in common. They attempt to maintain a balanced tree as the tree is being built. This article and code offer an alternative solution — any tree, no matter how imbalanced, can be rebuilt into its most optimal form.
There are several advantages to optimizing a binary tree after it has been built. First, adding a node to the tree is still simple and efficient. Second, no additional memory is required. For example, AVL trees require an additional four bytes of memory (assuming integers are used) per node to store the depth of the left and right subtrees of each node. For small data objects, this results in high overhead.

Optimizing A Binary Tree
When optimizing a binary tree, first rebuild it into its absolute worst-case form. The resulting tree will appear graphically the same as a singly linked list. Next, "fold" the tree in half, starting at the root. Then recursively fold each subtree with a length of three or more nodes. See Figure 1, Figure 2, Figure 3, and Figure 4.
The code accompanying this article is contained in three modules. The first, TREEOPT. C, contains the main module and uses the optimization techniques on a simple binary tree that stores integers. The second, SHOWTREE. C, contains code that will draw the tree on a CGA screen, to confirm that the tree is indeed optimized. The last module, OPTIMIZE.C, contains all the optimization functions. Although the first two modules contain code that is specific to Microsoft C, this module is 100 per cent compatible with ANSI C and requires no libraries. The file stdio.h is included only for the compiler's definition of NULL. With the new ANSI implementation of NULL now being ((void *)0) [possibly] rather than 0, this is not a trivial matter. Because it is the focus of this article, I describe only the last module in detail.
The function that interfaces with outside code is optimize. It is called with a pointer to the root of a binary tree to be optimized as its single parameter. The return value is a pointer to the same tree in its most optimal form. Both the pointer to the root and the return value are void pointers. This is the suitable data type to use because any tree may be optimized using this function. Internally, the optimization functions access the nodes in the binary tree using an "empty" tree node structure:
typedef struct tnode {
   struct tnode *left, *right;
} TREENODE;
This of course implies that the tree being optimized has its node structure organized such that the pointers to the left and right child nodes are the first two fields in the record. That is a small price to pay for a generic optimizing function that requires only a single parameter.
If the root passed to optimize is not empty, it is passed to worstcase, which is a recursive function that stores the root of the worst-case form of the tree in the pointer worstroot.worstcase is really a modified form of the standard recursive traversal of a binary tree found in most computer-science textbooks. It traverses the tree in ascending order, adding a node to the global variable worstroot on each node visitation. At the end of the traversal, worstroot will point to the root of the tree in its new worst-case form.

Folding The Tree
Now that the tree has been arranged into its worst-case form — with all nodes containing only a right-child node — it can be folded in half. A worst case tree is folded by first determining the depth d of the tree. Then, all nodes from the root up to and including the d/2 node are arranged such that each node becomes the left-child node of its right child. (The process is similar for worst-case trees with all nodes containing only a left-child node. In this case, the node becomes the right-child node of its left child.) If d is odd, then all nodes up to and including d/2+1 are arranged. This has the effect of placing all short branches to the outside of the tree when the folding function is used to fold any local worst-case trees (see below). The folding process is performed by fold, which receives a pointer to a local root and the direction to fold. It returns a pointer to the new local root of the folded branch.
After the tree has been folded, there will probably be local worst-case branches. To optimize these branches, the recursive function fixbranch is called once for the left subtree and once for the right subtree. fixbranch recursively folds all local worst-case branches that require folding. It receives the same type of parameters as fold and also returns a pointer to the new local root of the folded branch.
fixbranch also accounts for a special type of optimization that can be applied to a local branch that has exactly five nodes. In this situation, the normal folding process would leave the branch in a form such that the root would contain two child nodes, each containing only one child node. The shape of the tree is like an upside down "V". The tree in Figure 5 does not extend beyond two levels (level 0 being the first). However, there are only two integers, a 4 and a 6 that can be added to the tree without extending it to three levels. The tree in Figure 6 can have any integer less than 1 added without extending it to three levels. Similarly, the tree in Figure 7 can have any integer greater than 9 added without extending it to three levels. Obviously, these latter two forms are preferred. For local branches that have this special optimization performed, the shorter branch should always be arranged such that it hangs toward the outside of the tree, thereby allowing for the greatest range of numbers that do not extend the tree another level.

Implementations
Any situation that calls for a binary tree will benefit from this type of optimization. For example, a hash table would usually store collisions in binary trees. Using these optimizing functions, the program can wait until all or most of the data has been added to the table and then optimize each collision tree. Now data retrieval will be as fast as possible without the inefficiencies of imbalanced trees or the complexities of AVL trees.

References
[1] Feibel, Werner, Advanced Quick-C, McGraw-Hill, Inc., Berkeley, CA 1988.

Listing 1

Listing 2

Listing 3