February 1995/Using Associative Arrays

Features

Using Associative Arrays

Michael McClung

Michael McClung is a senior staff engineer at Intecom Inc. He worked as a software contractor for 8 years before reentering the corporate work force. He received his Master of Science in Computer Science from the University of California at Lawrence Livermore Laboratory. His main interests include real-time systems, software development processes, and object-oriented programming. He can be reached at 214-447-8060 or mm@intecom.com.

Introduction
Associative arrays are well known in the software community. They are built into some languages such as AWK, Perl, and Smalltalk (as the data dictionary), and it is not unusual to find them in C++ class libraries since the language supports them well. You can easily find references to them by browsing through programming books. Yet, in my experience, associative arrays are not commonly used in C programming environments. This article defines associative arrays, shows where they can be useful, and presents an implementation in C.

Defining Associative Arrays
Interpretations may vary, but in this context the term associative array refers to an array that takes a user-defined type as an array index. Viewing all arrays as mapping functions, a conventional array is a function that maps only from an integer index to another type. For example
a[5]= 19.58 maps the integer 5 to the floating point number 19.58.
b[7]= "twelve" maps the integer 7 to the string "twelve".
c[9]='D' maps the integer 9 to the char 'D'.
Associative arrays remove the restriction of integer indexes, allowing arbitrary data types to directly map to other types, as in the following:
a["first"] = 19.59 maps the string "first" to the floating point number 19.58.
b[1.21]="twelve" maps the floating point number 1.21 to the string "twelve".
c[employee_rec]=&d_rec maps the structure employee_rec to the record pointer &d_rec.
The sparse array exemplifies another type of mapping problem that lends itself well to associative arrays. Suppose several points must be identified in a 10,000-meter cube. The simplest model is a three-dimensional array with each dimension ranging from zero to 10,000. Allocation of a conventional linear array far exceeds available memory. With an associative array, which does not depend on linear storage, the memory requirements are much smaller. The following code, using the associative array CUBEPOINT, requires only the locations used, one for each assignment.

CUBEPOINT(1,123,1564)=TRUE; CUBEPOINT(100,256,5647)=TRUE; CUBEPOINT(1000,6123,9156)=TRUE;
Associative arrays are an abstraction, not a particular data structure or algorithm, since many different implementations are possible.

Design Requirements
Since there are lots of ways to implement associative arrays, I've established some criteria to narrow down the choices. In this section I describe these criteria, which are efficiency, a generic implementation, and an intuitive interface.

An Efficient Algorithm
The algorithm must be efficient in both execution speed and memory usage. Two primary criteria are efficient insertions of new elements and fast searches for existing elements. In addition, I want neither the amount of storable data nor the form to be strictly limited. Two excellent algorithms in terms of efficiency are AVL trees and hashing. AVL trees, which maintain a balanced search tree, provide the best memory handling characteristics. Hashing, which usually involves a simpler algorithm, provides a search speed hard to match. In this article, I present a design that uses hashing, mainly because its ease of implementation makes the example code more clear.

A Generic Implementation
A practical implementation must be generic in the sense that a single module must handle arbitrary user-defined types and array sizes. I accomplish this in two ways. First, the definition of each array is self-contained in an array identifier. This encapsulation of the complete definition by an identifier allows independent arrays to coexist. Second, I categorize the index for each array definition as either a string type or raw binary type. Knowing how to read, compare, and copy only these two types, I can implement a generic hash algorithm. The user specifies a key's type (string or binary) when the array is created.

An Intuitive Interface
Associative array access should resemble conventional array access as much as possible to ease use and improve readability. The lack of language support for generic functions and operator overloading exacerbates the problem of creating a good interface. In part, I sidestep the interface requirement by pushing some of the responsibility onto the user. Wrapping the access function with a user macro provides a respectable, albeit inelegant, associative array interface.
Ideally, an associative array of file descriptor pointers, each indexed by file name, would be accessed as follows:

fd["filexy"]=filedesc;
Because C does not support this extension, the actual interface is more formidable:

*((FILE **)aa_addr(fd,"filexy")=filedesc;
However, by adding a macro wrapper, users can approximate the look of conventional arrays:

#define FD(n) (*((FILE **)aa_addr(fd.n)) . . . . FD("filexy")=filedesc;
Since the compiler's preprocessor does not allow dynamic macro generation, the user must define an access macro for each array definition. I explain these macros further in the examples.

The Code
Two source files contain the associative array code. Listing 1 shows the header file that must be included by the application. This file defines the associative array structure, declares each entry point, and provides macro definitions. The second file contains the module's source, which must be compiled and linked with the application. Throughout the description of the source, I use the term key to refer to the array index.
Listing 2 through Listing 7 break the module's source into functional parts, each separately explained below. For publication I have kept in-line documentation to a minimum, but the accompanying text should clarify the purpose of each code section.

Part 1: The Module Header
Listing 2 shows the module's header file, which defines macros and makes declarations for the rest of the module.

Part 2: Creating the Array
The aa_create function in Listing 3 creates an associative array. Creating an array entails allocating a structure (aa), loading the structure with all values describing the array characteristics, and returning the structure pointer. The calling function must save the structure pointer as an identifier for all other operations on the array. The caller should treat the identifier as an abstract data type, never directly accessing the structure. If an error occurs during creation, aa_create returns a null identifier.
The identifier structure, AA, contains eight fields that make up a complete definition. type indicates whether the key is a string type or binary type. (This distinction is required since string operations, like compare and copy, are different from binary memory operations.) key_size gives the key's size in bytes. The key_size parameter is meaningful only with binary keys since string sizes are automatically determined with the strlen function. data_size specifies the maximum size in bytes of each array element. Two internal tables of pointers, keys and data, point respectively to a key and its corresponding data element; these two tables maintain the actual contents of the associative array. max_elements indicates the size of the keys and data tables, the maximum number of elements the array can hold, barring memory limitations. The last field, hash_function, points to the hash function that operates on the key to produce the hash value. This field, usually not user-defined, is discussed further in Part 4.
Shown at the end of Listing 3 is the prime_size function used by aa_create and later by aa_resize. This function forces the array size to be prime to improve the efficiency of the hash algorithm (see Part 4).

Part 3: Accessing the Array
Once an associative array has been created with aa_create, programs can index it with the aa_addr function in Listing 4. aa_addr takes two parameters: an associative array identifier and a key. The key indexes the associative array to find the corresponding data element. When the data element is found, aa_addr returns its address to the calling function, which can directly store into or retrieve from the data area.
Since aa_addr adds new keys to the associative array, it also monitors the amount of free space left. If the array becomes too full — 75 percent in the current implementation — this function dynamically expands the size by 50 percent. Maintaining a sufficient number of empty slots in the array's hash tables insures an efficient search. I chose 75 percent as the critical number since studies show a rapid degradation in efficiency as the table load approaches 80 percent [1]. Other hash algorithms have better behavior when approaching capacity, but at a tradeoff of other factors, such as complexity. Part 6 describes the function that changes the array size.
After checking the array size, aa_addr calls hashindex to find the index corresponding to the specified key. Once found, the same index offsets into both the keys table and data table. If the keys table already contains the sought after key, aa_addr returns the corresponding data address, completing the function. If the key is not in the table, aa_addr must allocate memory, save the key, and then return the data address. Saving the key guarantees the same data address is returned for future accesses with identical keys.

Part 4: Hashing the Keys
The whole module pivots on the hashindex function shown in the second part of Listing 5. This function, along with the subordinate function, hash_function, implements the hash algorithm. This particular hash implementation is called Double Hashing with Linear Probe. (See the sidebar "Common Hashing Techniques" for an overview of hashing techniques.)
hashindex calls the hash function for the starting search location. hashindex then continues its search until it finds either a match for the input key or an empty slot. (An empty slot indicates the key is not in the table.) When the hashindex makes comparisons for matching keys it must consider whether a binary or string key is used (i.e., whether to use memcmp or strcmp). If the search was not successful on the first attempt, hashindex makes a second call to the hash function using a different size parameter to give the effect of a different hash calculation. If its search is still not successful, hashindex continues with a linear search until it finds either a matching key or an empty slot. On success, hashindex returns the index of the matching or empty slot.
A good hash function is critical to the efficiency of the hash algorithm. Unless the caller defines a custom hash function when the array is created, the program will use hash_function by default. The user may wish to define a custom hash function for one of two reasons. The first reason is for speed, since the function can be optimized for the specific key type. The second and most critical reason is to handle non-continuous keys. A non-continous key is a key that contains sections of non-data bytes.
An example of a non-continuous key is a record structure with multiple character strings; the data after each string's terminator is undefined, leaving gaps in the key. A more subtle example is a complex key type that the compiler breaks up to fit on word boundaries. This behavior is compiler dependent and should be checked if in doubt. (Dumping memory with the debugger is an easy way.) A non-continuous key can corrupt the calculation of the hash value by introducting random data. If the key is not guaranteed to be continuous, the caller must either provide a custom hash function or initialize the key before use. A custom function can bypass any compiler dependencies since it knows the structure of the key. Setting the complete key to zeros (as with memset(key, 0, sizeof(key)) also resolves the problem by initializing any memory gaps to a consistent value.
hash_function works well on both binary and string keys. The implementation sacrifices some clarity for the sake of efficiency since this function more than any other affects the speed of the module.
hash_function builds a shift table the first time it is invoked. The shift table helps guarantee that similar keys, such as "x-coordinate" and "y-coordinate," do not produce similar hash values. hash_function uses static table initialization since the size must be large enough to eliminate error checking in the hash calculation.
Two for loops calculate the hash value, the first for string keys and second for binary keys. Each loop takes the exclusive-OR of all bytes in the key, with each byte first being modified by a shift value. Only bytes up to the null terminator are used for string keys. The aim of both loops is to minimize machine cycles while getting a good hash number.
The last and essential part of creating the hash value is performing division with the modulo operator. Division completes the randomization of the hash number and bounds it to the size of the hash table. This is where a prime table size pays off; taking the modulo of a prime number gives a much better hash distribution.

Part 5: Retrieving the Keys
aa_keys, in Listing 6, returns all keys in an associative array. Because associative array indexes are not ordered, this function must provide a means of sequentially visiting every element in the array. The array identifier is the input parameter for aa_keys; it outputs an array of key pointers and the array's size. The output array is allocated dynamically so it must be freed by the caller, but individual key pointers should be considered constants since they point into the hash table. The keys are gathered by walking through every entry in the keys table. If the entry contains a valid key, neither null nor deleted, its pointer is saved in the output array.

Part 6: Resizing the Array
aa_resize, shown in Listing 7, expands or shrinks the size of the associative array. aa_resize is a module entry point and is also called internally from aa_addr when the array becomes too full. My original implementation of this module did not allow the array size to be changed, but I found myself wasting memory for worst-case allocations. The user still has the option of avoiding the resize overhead by allocating a large enough array at creation time.
No short-cut exists to change the size of a hash table since the size is an integral part of the hash function. When a hash table is resized, aa_resize must allocate complete new tables and rehash every valid key in the old table to a new index location. aa_resize moves the key and associated data pointers to the new locations in the new tables. If the new tables cannot be allocated or the requested size is too small, the function fails.

Part 7: Deleting and Testing Elements
Listing 8 contains two functions: aa_delete removes an element from the array, and aa_defined tests if a key is already in the array.
aa_delete takes two parameters, the array definition and the key to be deleted. The deletion frees both the key and the data memory. A flag, DELETED_KEY, is stored in the key pointer to prevent the slot from being reused. The pointer cannot appear empty since hashindex assumes the search is complete when an empty slot is found. If multiple keys hashed to the same index as the deleted key, those keys stored after the one deleted would never be reached since an empty slot would be found first. This aspect of key deletion is inherent to the hashing algorithm used.
aa_defined returns a true value if the specified key is already stored in the array. Users should call aa_defined to find out if the key they're looking for is in the table. The application should never examine the data returned from aa_addr to determine if a key is contained in the array, since aa_addr always adds new keys to the array.

Examples
Listing 9 and Listing 10 show two complete programs. These programs link with the associative array module and standard libraries, but are otherwise self contained. The first program counts the frequency of words read from standard input and prints a frequency list. The second program shows an associative array implementation of a sparse array.
The user-defined macros, WORD_COUNT and SPARSE, simplify access to the arrays. A simple rule to follow when creating these macros is to cast the return value of aa_addr to a pointer to the array's data type and encase this expression with the dereference operator. Dereferencing the returned address allows the macros to be transparently used with nearly all operators. The user may optionally use the macro, AA_ACCESS, to create the access macro. AA_ACCESS is defined for this purpose in the header file.

Conclusion
Associative arrays are a general purpose abstraction fitting a wide range of problems. Effectively incorporating higher abstractions into our programming solutions provides a path to lower cost and complexity. A few examples have shown that associative arrays do simplify programs. My own use of associative arrays has expanded as I've grown more aware of the range of possibilities.
I have used associative arrays in applications ranging from development tools running in batch mode to production code in real-time systems. Conventional arrays access data elements only through integer indexes, but associative arrays allow arbitrary types, such as strings, to index the array. This seemingly small but powerful addition in capability has made associative arrays a mainstay in many of my solutions to programming problems.

Bibliography
[1] Knuth, Donald E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 1973.