April 1994/Approximate String Matching

Features

Approximate String Matching

Thomas Phillips

Thomas Phillips, CCP, is a member of the ACM, IEEE:CS, ACL, SIAM, and AMTA. His interests include linguistics, computational linguistics, parallel processing, and cognitive science. He can be reached via e-mail at 72713.2100@compuserve.com.

Introduction
People aren't perfect. Unfortunately, computers often demand perfection, sometimes unjustly. One example is when the user must type in a word or a string of words to tell a program what to do. Suppose you have a database filled with names and addresses; you can write a program that allows a user to search by name, but if you're not careful the program will be too selective. Let's say the last name is Xavier, but the user thinks the name is spelled Zavier. If your program uses a simple strcmp, it won't find the name. What you need is a function to perform approximate matches. That's what freq_match does (see Listing 1) — it finds the closest match based on frequency distributions.

Matching via Frequency Distributions
The frequency distribution of a word shows how many times each letter occurs. For example, the frequency distribution of the word "LETTER" would be: L-1, E-2, T-2, R-1. If you find the frequency distributions of two different words, you can compare them for similarity in spelling.
The function freq_match uses the array freq_count[] to compare the distributions of two words. Each element in freq_count[] holds the distribution for its respective character. For example, the ASCII value for an 'a' is 97 (or 61H). If a word being analyzed contained three a's (like aardvark), then freq_count[97] would equal 3. (Note that freq_count[0] shouldn't really be used; you would only expect one NULL in a string.)

Comparing Two Distributions
Once freq_count[] is loaded with the distribution for one word (what I call the mask word), it's simple to calculate a number that tells how closely the mask word matches another word (the test word). I call this number the divergence. To start, let freq_count[] hold the frequency distribution of the mask word. Next, subtract from it the frequency distribution of the test word. Take the words "aardvark" and "apple," for example. The mask word is "aardvark," so

freq_count['a'] is 3, freq_count['r'] is 2, freq_count['d'] is 1, freq_count['v'] is 1, freq_count['k'] is 1, and all others are 0.
After the first letter of "apple" is processed, freq_count['a'] equals 2. This is because my algorithm subtracts while it processes the test word. After all the letters of the test word have been processed, freq_count[] will be as follows:

freq_count['a'] is 2, freq_count['r'] is 2, freq_count['d'] is 1, freq_count['v'] is 1, freq_count['k'] is 1, freq_count['p'] is -2, freq_count['l'] is -1, and freq_count['e'] is -1.
Finally, the algorithm adds up the absolute values of each of freq_count's elements. The elements not listed (such as freq_count['q']) are assumed to contain zeros. The total, or divergence, is 11. This huge divergence indicates that "aardvark" and "apple" are quite disimilar. However, had I compared "aardvark" with its misspelling, "ardvark," I would have arrived at a divergence of 1, indicating extreme similarity.

Using freq_match
Using the freq_match function requires two steps. First, the program must initialize the distribution array by passing it a valid mask word and an empty string for the test word. An example would be:

freq_match("aardvark", "");
The second parameter must be an empty string — find_match uses this empty string as a switch enabling initialization. Since freq_match stores the mask word's frequency distribution in a static array (freq_count[]), it only needs to load each unique mask word once. In subsequent calls to freq_match, the first parameter is ignored as long as the second parameter is not an empty string. The second parameter should contain a valid test word. For example, once "aardvark" has been loaded, you can test its divergence from "apple" by using this statement:

divergence = freq_match("", "apple");
The return value will be 11.
To use freq_match, call it for each word in your list of test words. Keep track of the minimum divergence and its corresponding test word. After you've gone through the entire list of words (or optionally stopping when you find a divergence of zero), accept the test word with the minimum divergence as the most similar word.

Anagrams and Other Problems
Frequency distributions are great for most approximate matches, unfortunately, they can't discriminate between a word and one of its anagrams. For example, "TAB" is an anagram for "BAT." The frequency distribution for either of these words is: T-1, A-1, B-1. One possible solution to this problem is to test the first or last letters of the mask and test words. If one or both match, then reduce the divergence by some constant number. If they don't match, increase the divergence. For example:

divergence+=(mask_word[0]==test_word[0]) ? ((divergence>0) ? -1 : = 0) : 1;
This strategy will slow down the matching process, though. In addition, it can introduce some of its own problems (like matching "BIG" with "BOG").
Should you throw away strcmp and use freq_match for all of your string matching needs? Obviously no, but there are places where freq_match() is very useful. freq_match works best when a mask word must be found in a relatively small list of test words (a few hundred or less). Typically, you would use this function in applications where human errors are also frequent — and that covers a lot of ground.

Features

Approximate String Matching

Thomas Phillips

Introduction

Matching via Frequency Distributions

Comparing Two Distributions

Using freq_match

Anagrams and Other Problems