May 1991/Inexact Alphanumeric Comparisons

Features

Inexact Alphanumeric Comparisons

Hans G. Zwakenberg

Hans G. Zwakenberg has been active in the software industry since 1984. He worked for several companies as a software developer and marketer. More recently he was appointed marketing manager by Borland International Paris Office for the Benelux area (Belgium, Netherlands and Luxemburg). Currently an independent consultant, he specializes in relational database development. Hans can be reached at Flora 13, 7422 LN Deventer, The Netherlands, phone (int+31) 5700 50683.
One feature that makes a database application much more useful to the end-user is the ability to perform inexact alphanumeric searches. You need only approximate knowledge of a keyword to retrieve the correct record. Moreover, a small typo in a keyword doesn't give you just a "record-not-found" error message. Instead, you get a number of records, from which you can select the right one. To include this type of graceful behavior into your application, you must determine to what extent two character strings differ. One measure comes from comparing the phonetics of both strings. I'll present another approach, using a simple metric known as the Levenstein Distance.

Algorithm
The Levenstein algorithm determines how many mutations are needed to transform string1 into string2. Three types of mutation are possible:

addition of a character (if string1 is shorter than string2)

deletion of a character (if string1 is longer than string2)

changing a character (if characters compared don't match)
Each of these mutations is assigned an individual penalty value. The Levenstein algorithm arrives at a final penalty value by adding all penalties for each required mutation. A final score of zero denotes two equal strings. The higher the score, the more the strings differ.

Implementation
To explain how the algorithm is implemented I'll walk through function l_distance in Listing 1 (ld.c). The first two lines define a couple of variables. The next two determine the length of the strings to be compared. The string length is assigned to variables r_len and f_len. To save memory space and improve execution speed, the maximum length is limited by the constant COMP_LEN. You may change this value as needed. Lower values speed execution and minimize memory requirements, but they also compare fewer characters in the strings.
The next section initializes distance, a two-dimensional integer array. You can visualize distance as a small spreadsheet. The top row is filled with penalty values for the addition mutation. The leftmost column is filled with penalty values for the deletion mutation. The nested loop that follows fills the distance array with the appropriate values. To continue the spreadsheet analogy, the bottom right cell contains the final score — the Levenstein Distance. This integer value is returned to the calling routine.
To produce a stand-alone program, remove the comment from the #define DEBUG directive, then recompile and link ld.c. This program serves a number of purposes. It enables you to test drive the routine and feed it different sets of strings. That shows you how fast the Levenstein Distance increases with increasingly differing strings. The stand-alone version dumps the content of distance (less the top row and left column). That lets you see how the final score is determined.
As a guideline, Levenstein Distances below 10 are acceptable, at least if you use the Dutch language and stick to the values I gave the variables addition, change, and deletion. You will need to experiment with different values for these three variables for each language you use. Just feed the function sets of strings you consider sufficiently alike. Note the final score for each set. If any set scores higher than 10, change the value of the three variables and test your set of strings again. You can use a higher threshold than 10, but be careful — soon you might end up with a program giving all database records as a possible answer to your question. Remember, the idea is to narrow down the number of possible answers from which the end-user must choose. Only increase the threshold your program uses as a last resort.

A Useful Filter
As a bonus I've included is_alike, which acts as a filter. (See Listing 2. ) All strings passing this filter are similar enough to the required string to be possibly correct. is_alike is an adaptation of l_distance, modeled after the like matching operator that gives Paradox queries so much flexibility. is_alike uses four steps to either reject or accept a string:
1. is_alike compares the first character of both strings. Uppercase and lowercase differences are ignored in this comparison and in any subsequent comparison. If first characters don't match, the string is rejected.
2. If first characters match, is_alike calculates the maximum allowable difference and stores it in threshold (see Listing 2) . The formula used is (1 + ((strlen(string1) +2) / 4)).
3. is_alike checks the length difference between the two strings. The absolute value of the length difference should be smaller than or equal to threshold. If the difference is larger than threshold, the string is rejected. If not, is_alike continues to the last step.
4. is_alike uses l_distance to calculate the Levenstein Distance. This value should be smaller than or equal to threshold. If so, the string passed the test and is accepted. If the calculated distance is larger than threshold, the string is rejected.
Note that is_alike is not a full emulation of the Paradox like operator. The Paradox like operator checks if a character in string1 exists in string2 and ignores letter positions within strings. Since Levenstein's algorithm does check on letter position, I compensated for this by lowering the penalty for a change and deletion mutation. If these adjustments aren't enough, you could also change the formula that calculates threshold. As with l_distance, experimenting with the various parameters will give you the behaviour you're after.

Possible Improvements
I didn't spend any time optimizing these routines, nor have I done any benchmarking. For execution speed, it is out of the question to call malloc to allocate distance each time the function gets called. So I declared the array global, with COMP_LEN limiting its size. Initially, in the nested for loop calculating the Levenstein Distance, each loop contained a call to strlen. These calls were replaced by references to r_len and f_len. At this point, I quit optimizing due to lack of time. Improvements are possible for both memory consumption and execution speed:

Memory usage could be reduced significantly, but at the cost of slower execution. Declaring distance to be a two-dimensional array of unsigned char would reduce its size by a factor of four, but the compiler would need to include extra code to type-cast integers to unsigned chars. This extra code will slow down the process.

Execution speed can be slightly increased by using pointers instead of indexes to access distance. Pointer arithmetic is faster than index arithmetic. Comparing strings with an average length of seven characters, you would access distance 64 times within the nested loop for each set of strings you want to compare. Switching to pointer arithmetic might pay off.
I would be grateful to hear from anyone implementing these suggestions or any other improvements. Better still, share your ideas by sending a letter to CUJ describing your implementation.

Conclusion
Levenstein's technique for determining to what extent strings differ is a useful algorithm to have in your bag of tricks. Instead of doing complex phonetic comparisons, it uses comparatively simple metrics to arrive at a conclusion. Levenstein's algorithm can perform inexact string comparisons during database searches. Alternatively, it can be used for input validation. For example, if you were to write a command processor, this algorithm could provide it with some tolerance for small typos.