A Simple Soundex Program

Jeff Rosen


Jeff Rosen is employed by Software One, Inc. in Columbus, Ohio. He has developed custom and packaged applications for PC UNIX-based systems since 1982. Jeff has developed menu and screen generators, as well as E-mail and print spooling utilities. He has a Bachelor of Science degree in Accounting and Information Systems from Bowling Green State University and a Master's degree in Business Administration from Capital University.

If you've dialed for directory assistance in recent years, you probably have not been asked to spell the name; the operator merely enters the name into the computer using a phonetic spelling. The computer uses the Soundex algorithm to search for names which sound similar to the name entered.

Soundex consists of a relatively simple set of rules, which assign a numeric value to each consonant based on the sound it makes. For example, in Soundex, 'C' and 'K' have the same numeric value.

By applying the rules to a name, the algorithm generates an alphanumeric code which may be used as a key for finding phonetically similar names in a database. For example, a search for the name Gardiner would also find entries for Gartner and Gardner.

Soundex finds matches based on similar sound rather than similar spelling. Thus, Soundex is especially useful when you have only an approximate spelling with which to begin a search, as in applications involving phone inquiries or order entry. To find the customer in your database, you would enter the customer's name using your best guess based on the sound of the name. If the customer's name were Mosier, you might enter Moser. Both of these spellings generate the same Soundex code. (For examples of similar-spelling algorithms see "Inexact Alphanumeric Comparisons," by Hans Zwakenberg, CUJ, May 1991, and "Approximate String Matching," by Thomas Phillips, CUJ, April 1994. In some situations, you may want to use both types of algorithms within an application.)

The Soundex Algorithm

Soundex codes are always four characters long. The first character of the code is the first letter of the name being coded. The remaining characters are numeric digits representing the remaining letters in the name. Table 1 shows the numeric digit assigned to each letter. Note that vowels and the letters 'h', 'w', and 'y' are ignored (except when they are the first character of the name).
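The letter-to-digit assignment can be sketched as a small lookup function. The digits below are the standard Soundex assignments (B, F, P, V = 1; C, G, J, K, Q, S, X, Z = 2; D, T = 3; L = 4; M, N = 5; R = 6). The function name soundex_digit and the use of '0' to mark an ignored letter are illustrative assumptions, not taken from Listing 1.

```c
#include <ctype.h>

/* Returns the Soundex digit ('1'..'6') for a letter, or '0' when the
   character is ignored (vowels, 'h', 'w', 'y') or is not a letter.
   The name and the '0' convention are assumptions of this sketch. */
char soundex_digit(int ch)
{
    /* One entry per letter, 'A' through 'Z'.  Indexing by ch - 'A'
       assumes a contiguous alphabet, as in ASCII. */
    static const char digits[] = "01230120022455012623010202";

    ch = toupper((unsigned char)ch);
    if (ch < 'A' || ch > 'Z')
        return '0';
    return digits[ch - 'A'];
}
```

With this table, soundex_digit('C') and soundex_digit('K') both return '2', which is why the two letters are interchangeable once they appear after the first character of a name.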

If a certain letter occurs more than once in succession, Soundex encodes only the first occurrence of that letter. If the name has too few letters to generate a full four-character code, Soundex pads the output code with zeros. Conversely, if the name is too long, Soundex stops encoding when the output length reaches four characters.
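The rules in the two paragraphs above can be sketched as a single function. This is a minimal illustration of the original (unmodified) rules as stated here, not a copy of Listing 1. The name soundex_encode, the CODE_LEN macro, and the choice to skip non-letters are assumptions.

```c
#include <ctype.h>

#define CODE_LEN 4   /* Soundex codes are always four characters */

/* Encodes name into out, which must hold at least CODE_LEN + 1 bytes.
   Hypothetical sketch; an empty name yields "0000". */
void soundex_encode(const char *name, char *out)
{
    /* Digits for 'A'..'Z'; '0' marks letters that are ignored.
       Indexing by ch - 'A' assumes a contiguous alphabet (ASCII). */
    static const char digits[] = "01230120022455012623010202";
    int len = 0;    /* characters written so far */
    int prev = 0;   /* previous letter seen, in upper case */

    for (; *name != '\0' && len < CODE_LEN; name++) {
        int ch = toupper((unsigned char)*name);

        if (ch < 'A' || ch > 'Z')
            continue;                        /* skip non-letters */
        if (len == 0)
            out[len++] = (char)ch;           /* first letter, upper-cased */
        else if (ch != prev && digits[ch - 'A'] != '0')
            out[len++] = digits[ch - 'A'];   /* new, non-ignored letter */
        prev = ch;
    }
    while (len < CODE_LEN)
        out[len++] = '0';                    /* pad short codes with zeros */
    out[len] = '\0';
}
```

Under these rules, Gardiner, Gartner, and Gardner all encode to G635, and Mosier and Moser both encode to M260.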

Some examples:

Jones      Soundex Code: J520  (N=5, S=2, pad with zero)
Mosbacher  Soundex Code: M212  (S=2, B=1, C=2, ignore rest of name)
Agganis    Soundex Code: A252  (G=2, N=5, S=2, ignore 2nd 'g')

I have found it useful to modify the Soundex algorithm slightly for my own applications. My revised version of Soundex encodes names that start with 'K' as if they started with 'C', and names that start with 'PH' or 'PF' as if they started with 'F'. These modifications produce identical codes for names such as Cook and Koch, or Fifer and Phifer. You may want to do your own fine-tuning, especially if your database contains many foreign names.
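One way to picture the modification is as a step that chooses the first character of the code before the rest of the name is encoded. The helper below is only a sketch: the name first_code_letter is hypothetical, and the decision to consume both letters of 'PH' or 'PF' is an assumption rather than something the text specifies.

```c
#include <ctype.h>

/* Stores the letter that begins the code in *first and returns the
   number of characters of name it consumed (0 for an empty name).
   Hypothetical helper illustrating the modified first-letter rules. */
int first_code_letter(const char *name, char *first)
{
    int c0, c1;

    if (name[0] == '\0')
        return 0;
    c0 = toupper((unsigned char)name[0]);
    c1 = toupper((unsigned char)name[1]);   /* may be the terminator */

    if (c0 == 'K') {                        /* Koch is coded like Coch */
        *first = 'C';
        return 1;
    }
    if (c0 == 'P' && (c1 == 'H' || c1 == 'F')) {
        *first = 'F';                       /* Phifer is coded like Fifer */
        return 2;                           /* assumption: skip both letters */
    }
    *first = (char)c0;                      /* all other names: unchanged */
    return 1;
}
```

For Koch the helper yields 'C' and consumes one character; for Phifer it yields 'F' and consumes two. Encoding then continues with the remaining letters as in the original algorithm.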

The Soundex Function

Listing 1 shows my implementation of Soundex. This code includes my modifications to the algorithm, but simply removing the line #define MODIFIED_VERSION generates code for the original algorithm.

This program takes one name as a command-line argument and displays the four-character Soundex code on the standard output.

The soundex function takes two arguments: a pointer to the name to be encoded, and a pointer to an output buffer. (I've made no attempt to optimize the soundex function, but you may find useful ways to improve on it.) The comments in the code explain each step of the process.
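As a rough illustration of the structure just described, the sketch below pairs a two-argument soundex function with a compile-time MODIFIED_VERSION switch and a small command-line driver. It is not Listing 1. The function bodies are assumptions, and the driver is written as an ordinary function, soundex_cli (a hypothetical name), which a program's main would simply call with argc and argv.

```c
#include <ctype.h>
#include <stdio.h>

#define MODIFIED_VERSION   /* remove this line for the original rules */

/* Encodes name into code, which must have room for 5 bytes. */
void soundex(const char *name, char *code)
{
    /* Digits for 'A'..'Z'; '0' marks ignored letters.  Assumes ASCII. */
    static const char digits[] = "01230120022455012623010202";
    int len = 0, prev = 0;

#ifdef MODIFIED_VERSION
    /* 'K' is coded as 'C'; 'PH' and 'PF' are coded as 'F'. */
    if (toupper((unsigned char)name[0]) == 'K') {
        code[len++] = 'C';
        prev = 'K';
        name++;
    } else if (toupper((unsigned char)name[0]) == 'P' &&
               (toupper((unsigned char)name[1]) == 'H' ||
                toupper((unsigned char)name[1]) == 'F')) {
        code[len++] = 'F';
        prev = toupper((unsigned char)name[1]);
        name += 2;          /* assumption: both letters are consumed */
    }
#endif
    for (; *name != '\0' && len < 4; name++) {
        int ch = toupper((unsigned char)*name);

        if (ch < 'A' || ch > 'Z')
            continue;                        /* skip non-letters */
        if (len == 0)
            code[len++] = (char)ch;          /* first letter, upper-cased */
        else if (ch != prev && digits[ch - 'A'] != '0')
            code[len++] = digits[ch - 'A'];  /* new, non-ignored letter */
        prev = ch;
    }
    while (len < 4)
        code[len++] = '0';                   /* pad with zeros */
    code[len] = '\0';
}

/* Driver: expects exactly one name argument and prints its code on
   standard output.  Returns 0 on success, 1 on a usage error. */
int soundex_cli(int argc, char *argv[])
{
    char code[5];

    if (argc != 2) {
        fprintf(stderr, "usage: %s name\n", argc > 0 ? argv[0] : "soundex");
        return 1;
    }
    soundex(argv[1], code);
    printf("%s\n", code);
    return 0;
}
```

With MODIFIED_VERSION defined, Cook and Koch both produce C200, and Fifer and Phifer both produce F160. With the definition removed, Koch produces K200 and Phifer produces P160.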

Conclusion

Soundex is a useful search tool for names whose sound is known but whose spelling is not. Soundex can also sometimes resolve minor differences in the spelling of names when searching for duplicates. Because the algorithm is simple, it's easy to fine-tune it to your specific application.