C/C++ Users Journal June, 2004
Maybe I'm in the minority these days, but I happen to love the C language. Second to Pascal (but only because of sentimental reasons), C is the most genial language ever developed. Even in this day of web development, I constantly seek out platforms and languages that closely relate to C (such as PHP, for instance).
The beauty of C is in its simplicityyou don't have to remember too many commands or keywords to get things done. Conversely, properly written C code is much easier to read and understand than many other languages, even though it is a concise language.
Working with other languages, however, at times you do find something that makes a developer's life easier. Take, for instance, the concept of the "array," supported by many recent typeless (or loosely typed) languages, such as Visual Basic, Perl, or PHP. On these platforms, arrays are collections of data, heterogeneous both in their values and in their keys. Thus, you are free to mix strings and integer valuesand associate them to keys of any type. Most of all, arrays are completely dynamic in natureyou are free to add, remove, and reorganize their contents at will.
The same isn't feasible in C, except on an ad hoc basis. The only way to handle dynamically scalable values in C is to manipulate pointers in a continuous way. Consequently, C programmers have to handle pointers in much the same way that explosive experts handle nitroglycerinas little as possible and with a lot of respect.
From the C point of view, an array is nothing more than a block of memory that can be addressed as a group of elements of a predetermined type (and, therefore, size). Each element of the array can be addressed using a numeric key, which is simply used to determine the position of the element within the array by multiplying its value by the size of each element.
This approach makes it possible to implement the mechanism in an efficient way, thus accomplishing C's main goal of simplicity and performance. On the flip side, it becomes difficult to use arrays that contain data of different types, and one has to keep a constant eye on the number of elements that the array effectively uses to avoid memory corruption.
Loosely typed languages have revolutionized the concept of the array and expanded it so that it consists of a simple set of typeless keys univocally associated to typeless values. The dimensions of the array automatically change in response to the addition of new elements or the deletion of existing ones. Moreover, each array can be traversed in either direction (that is, from the beginning to the end and vice versa), one element at a time, without the need to address each element explicitly. Finally, when an array is destroyed, all the data associated with it is also released as appropriate, depending on whether it is being used elsewhere.
Reproducing in C the functionality provided by the array-on-steroids is a relatively simple concept. After all, most of the loosely typed languages that implement these arrays are written in C to start with!
The real trick is in writing code capable of handling arrays in an efficient way. Because both keys and values are essentially typeless, there is no clear correlation between the former and the latter. This, in turn, makes it impossible to retrieve a value given its key by performing a simple multiplication, as C does with its built-in arrays.
The first step, therefore, consists of determining how the array library will treat its keys. Given that just about any scalar value can be used as a keyintegers, strings, doubles, and even entire areas of memorythe best course of action here is to simply consider the key an area of memory of known length. This way, regardless of what is used, the library can store the key and compare it against another key through a simple call to memcmp().
Next, the library needs a method for storing and retrieving the elements of the array. Obviously, the simplest possible method would be to store each array element in a contiguous block of memory, then scan them sequentially when a particular one has to be retrieved. Given that one of the library's goal is efficiency, however, this solution is not particularly appealing.
In my experience, different libraries use different methods when it comes to storing and retrieving elements. The more sophisticated ones, for example, use complexbut very efficientstructures such as balanced or semibalanced trees, which are optimized for read operations so that pulling elements out of an array is as fast as possible. This, however, presupposes a bias in favor of read operations that only works well in situations where the structure of an array doesn't change very often. If elements are continuously added or removed from the arrayor if arrays are continuously created, used, and discardedbinary trees become inefficient due to the inevitable rotations required to keep them balanced.
I prefer a simpler methodthe "bucket hash table," which is less elegant but provides a reasonably good level of performance with read and write operations. This type of hash table works by assigning two values to each key: a bucket and a hash.
Both the hash and the bucket are used to identify a key in a univocal (but not necessarily unique) way using a datum of fixed size. For example, a hash value of type unsigned long can be calculated by simply adding together all the bytes in the key. The goal of the hash is to provide a simple and quick mechanism to determine whether two keys are different: Comparing two longs and finding out that they are different is much faster than potentially having to rummage through a key several kilobytes long to reach the same conclusion.
Because a hashing algorithm is normally used to reduce the size of a key to a more manageable scale, it is impossible for it to guarantee that only one hash value exists for every possible key. Therefore, once two hash values are found to be equal, performing a binary comparison of their respective keys is the only way to know for sure that the latter are the same.
If the hashing algorithm is good enough to produce numbers that are evenly distributed across the available range for any particular key size, the chances of two different keys getting the same hash value are greatly reduced. For my array library, I have chosen a popular hashing algorithm written in 1978 by P.A. Larson (now a senior researcher for Microsoft) that's simple and still widely used after more than 20 years.
While a hashing algorithm makes the comparison of keys much quicker, the library still needs a mechanism for storing the array's data that will make locating a particular element fast and reasonably efficient. The hash in itself is quite useful if the search algorithm still has to go through each element sequentially whenever a search is performed, particularly if you consider that a search has to be performed every time a new value is added since array keys must be unique.
This is where the concept of buckets comes into place. A bucket is simply a container for a group of elements that possess the same hash value. By grouping the elements this way and creating a C array that contains a slot for each possible hash value, you can quickly fall back to narrowing down a search to just one of these groups, rather than having to search the entire array.
Clearly, for this technique to work, it is necessary that the hashing algorithm's result range (which, in this case, is called the "bucket size" of the hash table) be as limited as possible. If, for example, the library were to use Larson's algorithm, which returns an unsigned long integer, it would need an array of 232 elementsdefinitely not practical if you actually want your application to do something other than allocate a single array!
Instead of looking for another hashing algorithm, however, I opted for a simpler solution that is often used by other libraries similar to the one I'm presenting here. The result of Larson's algorithm is actually reduced to a range corresponding to the bucket size of the hash table through a simple modulus operation. This accomplishes two goals. First, any well-distributed hashing algorithm is more intensive than a simple modulus; by recycling the hash value that it calculates for each key, the library saves valuable time without compromising its efficiency.
Additionally, because the original range of Larson's algorithm is much greater than any practical bucket size, the latter can be changed at will, so that you can fine-tune the library's memory usage against its speed and efficiency.
Once the bucket table is in place, searching for an element is a rather trivial operation. As you can see from Listing 1, each element of the array is a structure that contains a number of items. To allow for easy manipulation, I've set up each bucket as a double-linked list. Once the bucket for a particular key is determined, the search algorithm simply retrieves the element that sits in that bucket and checks whether the two keys correspond; otherwise, it moves on to the next element in the list, pointed to by hash_next.
Adding a new element is a two-step operation. Since each key in the array is unique, whenever a request to add an element comes in, the library must first determine whether another element with the same key exists. If that is the case, then the old element's value is destroyed and replaced with the new one. If it isn't, then a new element is created and inserted in the proper place. This means that the array_add() function (see Listing 2) can also be used to replace the value of an element that already exists. It also means that, as previously mentioned, the search function must be as efficient as possible since the library has to call it every time a new element is added to an array.
Deleting an element is a bit more problematic. The operation of actually removing the element from the double-linked list is trivial. Therefore, I will not dwell too much on it, other than to note one particular. The use of a double-linked list makes it possible to delete an element of the array by simply obtaining that element. Once that operation is performed, removing it from the chain is very simple. With a normal linked list, it is necessary to traverse the list, all the while keeping a reference to the previous element so that its pointers can be properly adjusted. By using a double-linked list, the library can use a single search function that only returns a pointer to the proper array element for insertion, search, and deletion, thus reducing the amount of code (though at the cost of one extra pointer for each array element).
The real problem with deleting an array element is in how its value is destroyed. One of the library's requirements is to be able to properly dispose of an entire array in an automated fashion. Since all the element values are passed as pointers, it might be possible to destroy each of them with nothing more than a call to free(). However, it would be impossible to nest dynamic arrays since, as you can see in Listing 3, the array structure contains a number of internal memory blocks that must be allocated by the library when an array is created. A call to free(), in this case, would simply be the source of considerable memory leakage.
The right solution here depends on how you want your system to work. Scripting engines often implement some sort of reference count mechanism that keeps track of how variables are used and destroys them automatically when they are no longer referenced anywhere. This makes it possible to adopt a "set it and forget it" attitude toward variable management because the underlying engine takes care of all the tracking.
If you're writing a completely C-based application rather than a scripting engine, however, this may not be the best way to go about things. My preference is to just pass along a pointer to a function that can destroy each element at the time when the latter is added to the array (destroy_function in Listing 1). Thus, if an array needs to be destroyed, the library has a pointer to the function that is best suited for the task.
The final touch on the library is the ability traverse the contents of an array in the order in which they were added, either forward or backward. To accomplish this, I set up another double-linked list that is global to the array rather than restricted to each bucket.
With this second double-linked list in place, traversing the array is trivial. The first step consists of calling the array_first() function (Listing 4), which retrieves a pointer from the head of the list and returns its value to the caller. In addition, this function modifies the control parameter passed to it. This creates a sort of "state marker" for the traversing sequence, which can later be passed to either array_prev() or array_next() to navigate the array. This marker could be stored inside the array structure itself, thus eliminating the need for the extra parameter (but doing so would effectively restrict the library to handle only one traversing sequence per array). By forcing the caller to maintain the marker, it is possible to create as many traversing sequences as needed.
Even though the library only has one setting that directly influences performancethe bucket sizeits influence on performance is quite significant.
Clearly, the bigger the bucket size, the better the library will perform, at the expense, however, of memory consumption. On a 32-bit machine, each ELEMENT instance requires 36 bytes. Thus, a bucket size of 50 (ideal for arrays of up to about 10,000 elements) requires a minimum of 50×36=1800 bytes for each array.
On my computer, a 2.6-GHz Pentium 4, a program that uses this bucket size and creates an array with 10,000 elements then plucks one out at random, results in an execution time that is not detectable by my profiler (under 1/100th of a second).
Bumping up the size of the array by 10, however, causes the execution time to skyrocket to 6.37 seconds, or a factor of approximately 600. In this case, increasing the bucket size to 500 reduces the time required to run the program down to about one hundredth of a second again.
Unlike other structures, such as binary trees, the retrieval cost of a bucket hash table is dependent on the number of elements that an array contains. In my experience, most people do not use arrays bigger than 10,000 elements (or even 1000) in their applications and, therefore, a bucket size of 50 provides an efficient algorithm and a reasonable memory footprint.
In addition, keep in mind that this library allocates each element individually from the global heap as it is being added to an array. A better approach, which would result in better performance and less memory fragmentation, is to allocate a pool of elements at a time and use them as needed.
In its present form, the array library (available at http://www.cuj.com/code/) is quite powerful. However, its functionality can be improved based on your actual requirements. One possibility would be to change the way the key comparison is performed; for example, by allowing a call to array_init() to receive a pointer to an external function that performs a comparison between two keys and returns a result similar to strcmp() or memcmp(). This would make it possible to introduce semantic, rather than binary, comparison algorithms such as a caseless version of strcmp().
Another interesting enhancement could be a sorting function. It would significantly expand the capabilities of the libraryyou could now use it almost as an in-memory data table with automatic indexing; add a set of serialization functions, and you could easily read and write data to a stream or a file.