February 1995/Using Associative Arrays/Sidebar

Common Hashing Techniques

Hashing is an alternative approach to search algorithms that rely on key comparison. Hashing bases its search on a mathematical calculation performed on the search key. The mathematical calculation, known as the hash function, calculates the data location using the key as the function input.
As an example, assume a book is stored by ISBN number, such as ISBN 0-19-502402-8.
If the ISBN number is treated as an integer and a simple hash function is defined as f(ISBN)=ISBN modulo 500, the location of each book in a data table is determined by inputting its ISBN number into the function. Since the hash function indexes the data table, the function's range must correspond to the size of the table.
The above hashing function by itself has a flaw since nothing prevents the hash function from selecting the same location for two different books; duplicate indexes occur if two ISBN numbers differ by a multiple of 500. Duplicate hash values for different keys are known as collisions in hashing terminology. You can handle hash collisions one of two ways: 1) prevent the occurrence of collisions by using a perfect hash function, or 2) use additional processing to handle collisions when they occur.
Perfect hash functions, which produce no collisions, can only be derived when the number of keys is very small and the keys themselves are fixed, as in a compiler's list of key words. Since collisions cannot be avoided for most applications, writers of hash functions concentrate on reducing the frequency of collisions, by producing an even distribution of hash values.
Most hash function calculations are based either on division, as in the modulo operation above, or multiplication, in which selected digits, from the product of the index and a constant, are combined into a hash value.
When the key is not an integer, a hash function usually first reduces it to an integer form and then inputs it into a hash calculation. For example, a hash function for character strings might first exclusive-OR individual characters into a cumulative value, then input that value into a modulo operation.

Handling Collisions
Hash algorithms can be categorized by the method they use to handle collisions. Two common methods are described below.
Linear probe is the simplest method of handling collisions. If a key hashes to a location already containing another value, the linear probe method selects the next available empty location. In Figure 1, if 5960010 hashes to location 3, the linear probe performs a sequential search that scans the table until it finds an empty slot.
Another common hash algorithm handles collisions by associating a "bucket" with each hash value. Two keys colliding at the same location are stored in the same bucket (Figure 2) . To find a key, the hash function first identifies the correct bucket, then compares the residents of the bucket individually with the search key. The number of keys stored in any bucket is usually small enough such that a linked-list can adequately represent a bucket. A sequential search through the linked-list identifies the correct key.