April 1991/Doctor C's Pointers

Columns

Doctor C's Pointers®

Data Structures, Part I

Rex Jaeschke

Rex Jaeschke is an independent computer consultant, author and seminar leader. He participates in both ANSI and ISO C Standards meetings and is the editor of The Journal of C Language Translation, a quarterly publication aimed at implementors of C language translation tools. Readers are encouraged to submit column topics and suggestions to Rex at 2051 Swans Neck Way, Reston, VA 22091 or via UUCP at uunet!aussie!rex or aussie!rex@uunet.uu.net.
This month I embark on a new series on data structures. Probably the single biggest strength of C is its wealth of basic and derived data types — everything from simple scalars and one-dimensional arrays to arrays of pointers to functions returning pointers to arrays of objects.

Introduction
C is an expensive language to learn. In theory, it should only be pursued vigorously if the benefits significantly outweigh the costs. Unless you can master the extensive data typing capabilities C provides, you will not be able to completely understand or exploit the language. If you don't think you need many the data capabilities C provides, you should either use a much safer language with less power or learn more about the capabilities C has so you can see how you might use them.
When most people come to a new language, they think in terms of the capabilities of their old languages for quite a while. If their old language didn't support a given capability (for example, pointers to functions), they may not even know about that capability, or have any real idea why they would want to use it. So the more educated you can become about C's data structure capabilities, the better off you'll be. I hope this series will help.
I plan to cover the following topics:

simple arrays and array operations (such as insertion, deletion, and searching)

arrays of pointers versus multi-dimensional arrays

the use of malloc and friends to extend arrays

stacks, queues, deques, and linked lists (single, double, circular, and lattice)

recursion

pointers to functions and tables thereof

trees
This is an ambitious agenda, and no doubt there will be digressions along the way. However, I will try to adhere to something along these lines.
If you wish to read a little on your own, I recommend Donald E. Knuth's The Art of Computer Programming, Volume 1: Fundamental Algorithms. Addison-Wesley ISBN 0-201-03809-9. Chapter 2 (Information Structures) runs some 230 pages and is full of useful information on data structures. Although much of this was written more than 20 years ago, it is still quite relevant.

Simple Array Manipulations
The following, rather large, program defines an array of integers. It allows the user to manipulate the array either as a whole or by its individual elements. (I'll call them nodes.) The complete set of operations are:

Add a new node to the end

Change a user-selected node

Display a user-selected node

Insert a new node between nodes or before the first node

Remove a user-selected node

Search for first occurrence of a node value

Search for last occurrence of a node value

Report the number of nodes

Sort the nodes in ascending order

Show all entries in the table
<
The array is defined with only four nodes. However, by changing the macro MAX_NODES you can make the number anything you choose. (I chose four so I can easily fill the array and test the error checking for full arrays, etc.)
I use scanf to get user-supplied input. If you want robust error recovery on invalid input, scanf is difficult (if not impossible) to use. That is a separate issue, not directly relevant to the main theme. So I have chosen to ignore input validation for the most part. If you want, you can easily supply input that will break the program.
The commands are defined using macros, so you can change the characters used. You could also obtain input using a mouse-driven menu. Again, that is peripheral to the main topic.
Throughout the series, I will adapt this program to use different data structures behind the scenes. For example, in this first part I use a simple automatic one-dimensional array. Next, I use one allocated at runtime via malloc. Eventually, in a future installment, I will use a dynamically allocated linked list.
Listing 1 shows a program that allocates memory for an array ary:
int ary[MAX_NODES];
It permits the user to perform various operations on the whole array or on specific elements (or nodes) within it. They may, for example

add, remove, or change a node

display or sort the whole array
The program is designed as an introductory example of using data structures with C.
I will not discuss the code much because it is clearly laid out and the identifier names are self-explanatory. I tested the program by creating a text file of commands mixed with various expected and unexpected inputs. I then fed that to the program using command-line redirection. This is not only convenient for testing, but allows for a crude batch facility to process large amounts of production data acquired through other means.
You might take issue with my use of goto in the cases SEARCH_FNODE and SEARCH_LNODE. I find many programmers of the structured programming faith (of which I am one) taking their dislike for goto to the extreme. If you think goto should never be used, I suggest you think how to build a machine that does not have branch or jump instructions. Even Pascal, supposedly the ultimate structured language, has a goto.
The problem with goto is that programmers inevitably write code that branches to irresponsible places — that's why C's break and continue are so nice. The compiler works out where to go instead. In any event, I believe my goto solution as presented in Listing 1 is more elegant and efficient than any structured solution I've seen for the problem. My rule is, "Use goto only when absolutely necessary and then only in a forward direction in a local scope." End of sermon on goto.

Dynamically Allocated Arrays
The first version of the program defined ary as having automatic storage duration. This was an arbitrary choice — it could just as easily have been static for my purposes.
Standard C provides a family of routines to allocate memory at runtime. I will refer to this as dynamic memory allocation. (See my column "The Memory Management Library" in CUJ Vol. 8, No. 1.) With careful design, you can easily change from using an automatic or static object to a dynamic one. The only changes you need to make to the program are as follows:

#include <stdlib.h> main() { int *ary;/* ary is now a pointer, NOT an array */ ... ary = malloc(MAX_NODES * sizeof(int)); if (ary == NULL) { printf ("Cannot allocate memory for ary\n"); exit(2); }
You need the header stdlib.h to declare the memory allocation functions. Since the array does not exist at compile-time, ary is now simply a pointer to an int, not an array of int. By allocating the same amount of memory as before, but now at runtime, you can make ary point to the first element in the allocated array. Where ary was previously the name of an array and was converted to &ary[0] in all the right places, it is now the address of the first int in the allocated space. In short, ary can be used in exactly the same way as before. (This is possible since C permits a pointer expression, such as ary, to be arbitrarily subscripted to one level. Because a[i] is equivalent to *(a + i), the two subscripted forms are completely interchangeable.)

Variable Size Arrays
Using malloc is a step in the right direction, but the array is still limited to a fixed number of nodes. You can fix this by replacing the constant MAX_NODES with a variable max_nodes and using realloc to extend the array if it fills. Again, with the proper initial design, the changes needed are small and localized. For example:

int *ptr; int max_nodes = 1; ary = malloc(max_nodes * sizeof(int)); if (ary == NULL) { printf ("Cannot allocate initial memory" "for ary\n"); exit(2); }
By initializing max_nodes to 1, you start out allocating an array of only one element. Of course, if your application always used at least 20, for example, this initializer should be changed. There's no point extending the array by one node 20 times if you know up front what you're going to do.
The only other changes are to the cases ADD_NODE and INSERT_NODE. Instead of complaining when the table is full, these operations can now extend the array by one node, as follows:

if (nodes_in_use == max_nodes) { ptr = realloc(ary, (max_nodes + 1) * sizeof(int)); if (ptr == NULL) { printf("Cannot allocate new node\n"); break; } ary = ptr; ++max_nodes; }
Now you have a version that is limited only by the amount of memory available at runtime. Note though that realloc might not be able to allocate memory contiguously, in which case it has to go the more expensive route of allocating a new space, copying the old array contents to it, and freeing the old copy. realloc has another limitation. Say, for example, there are 1,000 bytes of memory available to begin with. Once the allocated array uses about 500 of these, there is insufficient memory left for realloc to make a copy before it frees the original. While only half of available memory is used, realloc might not be able to use the other half.
realloc lets you extend the array by one node with a very small amount of code, but it is probably not efficient over the long run. You might want to extend the array by a cluster of nodes (say 5 to 10 at a time). Or you might chose to use a linked list instead to avoid the need for realloc to copy any of the existing array.
In my realloc version, I chose not to shrink the array when a node was deleted. This was an arbitrary choice. You could argue that it would be good housekeeping to do so. You could also argue that by not freeing deleted nodes you can maintain a cache of available nodes when the next addition or insertion is needed.

Final Comments
All examples in this article work and are useful, but they probably have inefficiencies. For example, inserting and deleting nodes requires all subsequent nodes to be shuffled toward or away from the end, respectively. As the node count increases, these operations become increasingly expensive. You can avoid this expense by using linked lists. You will see in future installments that such lists have their own unique problems.