Have your computer understand what you mean, not just what you say, with this effective string matching algorithm.
Introduction
I typed into my computer:
coyp a ban

and the computer replied:
bad command or file name

Isn't that the way it always is? Stupid computer, you say, and you retype the offending command. Sometimes you even type it correctly the next time. Any human would be able to figure out the intention; why can't these great machines? Why, in this day of spelling checkers, grammar checkers, and ever faster, ever more capable computers, can't they do something simple like figure out what I mean rather than what I say? The common response (or so beginners are told) is that computers are simple and just do what they are told; they can't figure out what you meant. Well, that shouldn't mean you have to be so limited. Even if you don't put artificial intelligence into your systems, you don't have to tolerate artificial stupidity everywhere.
This article describes an algorithm you can use to make your programs smarter about correcting typographical errors. You can't fix everything, but at least keywords that aren't too far from correct can be corrected automatically. I've hooked two algorithms together so that they can correct some common kinds of errors, and the result is compact enough to embed in other programs.
The code below uses a dictionary, a nearest neighbor search [1], and a simple routine that gives a distance measure for how different two strings are. Here's an example using the 300 most common words in English (plus a couple of their contractions) and a misquote of The Rolling Stones. If I type:
yo caint alwys ge wha yu wan

The algorithm replies:
you can't always get what you want

Not bad, but you will be able to get more from it by the time you are through. The code developed here is not a complete system, but an easily modified toolkit for attaching to your applications. A simple command processor could easily be written to sit in front of the system shell and correct misspelled commands.
Distance between Strings
There's a substantial literature on string matching. For my purpose, I needed approximate string matching, and I needed a metric for how different the strings are. For use with the nearest neighbor algorithm, the metric must obey the triangle inequality, which simply expresses the fact that the length of one side of a triangle cannot exceed the sum of the lengths of the other two sides. Many algorithms exist for comparing strings [2], but a common one is the Levenshtein edit distance [3]. It finds the minimum number of insertions, deletions, and replacements needed to convert one string into another, and the three operations can be given weights to change their relative importance.
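For comparison only, here is a minimal sketch of that classic dynamic-programming computation with all three operations given unit weight (the function name is mine; this is not part of the SD code developed below):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Classic Levenshtein edit distance by dynamic programming; d[i][j] is
// the cost of converting the first i characters of a into the first j
// characters of b. All three operations carry unit weight here.
int levenshtein(const std::string &a, const std::string &b)
{
    std::vector< std::vector<int> > d(a.size() + 1,
                                      std::vector<int>(b.size() + 1, 0));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = int(i);  // all deletions
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = int(j);  // all insertions
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
        {
            int repl = d[i-1][j-1] + (a[i-1] == b[j-1] ? 0 : 1);
            d[i][j] = std::min(repl, std::min(d[i-1][j] + 1,    // delete
                                              d[i][j-1] + 1));  // insert
        }
    return d[a.size()][b.size()];
}
```

Converting coyp to copy, for instance, costs two unit edits.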
Levenshtein edit distance is implemented by dynamic programming, and its computational complexity ranges from O(n²) to O(n³). I have chosen a simpler, less general algorithm with complexity O(n) [4]. (These measures are for nearly equal string lengths.) Like Levenshtein distance, this algorithm obeys the triangle inequality.
The SD (string difference) algorithm that I've used doesn't compute full edit distances, but it robustly compares strings and gives a bounded distance measure that obeys the triangle inequality. Matches are computed by a pair of passes through the strings, one in the forward direction and one in the reverse direction, with information on matching characters accumulated on each pass. In each direction, a match is scored with a value equal to the string length for the first character examined, one less for the next, and so on down to the last character, which gets a match weight of one. If the corresponding characters do not match, the match score is still increased if the character found in the first string has occurred in the second string more often than it has in the first; this gives some weight to misplaced characters. The match weight in this latter case is the length of the portion still remaining in the first string. The same logic is applied to the other string as well.
The other part of the SD algorithm is the computation of the normalizing factor. The match metric is divided by the sum of the maximum possible metric values of the forward and reverse passes (summed for each string). These maxima are determined by summing each character position in the pass times the weighting factor for that particular character.
All of the match values are weighted by factors that can differ between the forward and reverse passes, and a weight can be specified for each particular character. In this way the vowels, for instance, could be weighted down relative to the consonants, reflecting their secondary importance in English. The whole SD algorithm has been cast as a C++ template so that other string containers (such as wide-character strings) or locales can easily be accommodated.
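As an illustration only, here is a stripped-down sketch of the two-pass scoring just described, with equal weights for every character, a single scale factor for the reverse pass, and the misplaced-character credit omitted; the full CThetaMatch template in Listing 1 implements the complete algorithm:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>

// Simplified two-pass string difference: every character has weight 1,
// the reverse pass is scaled by revWeight, and the misplaced-character
// credit is omitted. Returns a normalized distance in [0, 1].
double sdDistance(const std::string &s1, const std::string &s2,
                  double revWeight)
{
    const std::string *strs[2] = { &s1, &s2 };
    double match = 0.0, maxMatch = 0.0;
    for (int k = 0; k < 2; ++k)
    {
        const std::string &a = *strs[k];
        const std::string &b = *strs[1 - k];
        const size_t n = a.size();
        const size_t m = std::min(n, b.size());
        for (size_t i = 0; i < m; ++i)
        {
            // Forward pass: the first character scores n, the last scores 1.
            if (a[i] == b[i])
                match += double(n - i);
            // Reverse pass: the last character scores n, scaled by revWeight.
            if (a[n - 1 - i] == b[b.size() - 1 - i])
                match += revWeight * double(n - i);
        }
        // Maximum possible score for this string over both passes.
        maxMatch += (1.0 + revWeight) * double(n * (n + 1)) / 2.0;
    }
    return 1.0 - match / maxMatch;
}
```

With revWeight set to 1.0, this sketch reproduces the equal-weight distances in the tables below (sdDistance("xxx", "xxy", 1.0) comes out to 0.33, for example), and 0.8 reproduces the second table.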
Examples of Distances between Strings
Here are some results from string matching, starting with a simple string. All characters are given equal weight.
String   Distance from xxx
xxx      0.00
xxy      0.33
xyx      0.33
yxx      0.33
xyy      0.67
yxy      0.67
yyx      0.67
yyy      1.00

The fairly intuitive result is that xxx is about 0.33 from xxy and twice as far from xyy. The value of 0.33 for comparing xxx to xxy corresponds to our intuition that they differ by 1/3. If no characters are common to the two strings, then the normalized distance is 1.00 (the maximum distance).
Next, you do the same set again, but now you make the reverse weight 0.8 times the weight for the forward pass. Here are the results for matching the string xxx to several other strings.
String   Distance from xxx
xxx      0.00
xxy      0.32
xyx      0.33
yxx      0.35
xyy      0.65
yxy      0.67
yyx      0.69
yyy      1.00

Because the weight for the forward scan is larger than the weight for the reverse scan, matches at the front of the strings count more than those at the rear. So xxx matches xyy better than it matches yxy or yyx. This weighting corresponds somewhat better to how we read words and misspellings (in English) than equal weights do.
So now let's look at some strings that differ in length. Using the same 1.0/0.8 weights, compare x to several other strings of x's.
String   Distance from x
x        0.00
xx       0.25
xxx      0.43
xyx      0.43
xxxx     0.55
xxxxx    0.63

x is most similar to xx, which makes a certain sense. The single x matches the first character in both the forward and reverse scans, and only one other character is left over. For xxx there are two characters left over, and so on. Surprisingly at first, the comparisons of x to xxx and to xyx give the same value.
Here are some comparisons to xx:
String   Distance from xx
xx       0.00
xxx      0.11
xxxx     0.23
x        0.25

Again you see a reasonable ordering of the matches: xx matches xxx better than it matches xxxx. All this makes sense, and it inspires confidence that searching can produce reasonable results. What does reasonable mean? Nearly matching words will have many characters in common, and the more of them in the same order, the better the match. Will you get some bad matches? Probably. Will you get some of the howlers that Microsoft Word produces? Probably not.
A Class to Implement String Distance
Two obvious ways to implement the SD algorithm present themselves: a C-style function that returns a value, or a C++ class. Since there is persistent data associated with the algorithm (the weights for matches in each direction), a class is the obvious choice. I have chosen to create a template, CThetaMatch, so that other locales and string types can be used without changing the code (not yet tested). Listing 1 shows the most important parts of the template; the complete code is available for download at <www.cuj.com/code>. The string distance is returned by the function call operator (operator()), so that a function object is available to the nearest neighbor search. The use of function objects in this type of application was discussed earlier in CUJ [5].
I included three constructors in the source code. The default constructor allows 256 characters, with reverse weights equal to 0.8 times the forward weights; it is ideal for use with the ASCII character set. A second constructor uses the same weights but allows the number of characters in the character set to be specified. Finally, a third constructor accepts two valarrays of type double containing the weights for the forward and reverse directions; they should be the same size as the character set to be examined.
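The weight tables for that third constructor might be built along these lines (a sketch; the helper name and the vowel down-weighting factor are mine, not part of CThetaMatch):

```cpp
#include <cassert>
#include <string>
#include <valarray>

// Build a per-character weight table of the kind the third constructor
// accepts: one entry per character code, with the vowels down-weighted
// relative to the consonants. Illustrative only.
std::valarray<double> makeWeights(double base, double vowelScale)
{
    std::valarray<double> w(base, 256);          // 256-character set
    const std::string vowels("aeiouAEIOU");
    for (size_t i = 0; i < vowels.size(); ++i)
        w[static_cast<unsigned char>(vowels[i])] = base * vowelScale;
    return w;
}
```

A forward table from makeWeights(1.0, 0.5) and a reverse table from makeWeights(0.8, 0.5) could then be handed to the constructor.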
The other important item in the SD class is, of course, operator(). It returns the string distance between two input strings, calculated using the weights specified in the constructor. Also included in CThetaMatch are routines for setting the weights for the forward and reverse comparison passes. One function, m_fnSetEnglishFrequencies, sets the forward and reverse weights so that the most common letters in English are the most important in matches and the least common are the least important. There are also routines that allow some compensation for missing characters, but that compensation makes CThetaMatch violate the triangle inequality, so it is turned off by default.
Nearest Neighbor Search
An implementation of the nearest neighbor search has already appeared in CUJ [1]. Since the distance measure was being replaced, the older version had to be modified to use a function object that returns the string distance, which required adding the function object's class to the template parameters. Only minor modifications were needed, principally passing the function object into the tree builder, the nearest neighbor function, and the search-within-a-sphere function.
Dictionary Lookup: Multiple Matches vs. Best Match
Listing 2 shows the heart of a small program (main2.cpp, available at <www.cuj.com/code>) that searches for best and nearly best matches in a dictionary loaded into a NearTree (Tree in the example). If the string being tested is found in the dictionary, then there usually isn't any reason to search further. But if it is not found, then searching with a radius somewhat larger than the distance from the input string to its nearest neighbor in the dictionary gives a list of possible candidates. If the closest is significantly nearer than the next nearest, then the closest is probably the correct one.
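The lookup logic itself is simple enough to show with a linear scan standing in for the NearTree (the names here are illustrative, not from Listing 2): report an exact hit immediately; otherwise return the nearest entry and collect every candidate within a chosen radius.

```cpp
#include <cassert>
#include <limits>
#include <string>
#include <vector>

// Linear-scan stand-in for the NearTree lookup. An exact hit
// short-circuits the search; otherwise the nearest entry is returned
// and every entry within `radius` is collected as a candidate.
template <typename DistFn>
std::string lookup(const std::vector<std::string> &dict,
                   const std::string &word, DistFn dist, double radius,
                   std::vector<std::string> &candidates)
{
    std::string best;
    double bestDist = std::numeric_limits<double>::max();
    for (size_t i = 0; i < dict.size(); ++i)
    {
        const double d = dist(dict[i], word);
        if (d == 0.0)                 // found in the dictionary
        {
            candidates.clear();
            return dict[i];
        }
        if (d < bestDist) { bestDist = d; best = dict[i]; }
        if (d <= radius) candidates.push_back(dict[i]);
    }
    return best;
}
```

Any string-distance function object can be plugged in for DistFn; if the best match is markedly nearer than the rest of the candidate list, it is probably the intended word.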
Other Implementation Issues and Limitations
Obvious improvements could be made in the source code. Changing locales and character sets might require some fine tuning of the templates. The counts of found characters and the weights are stored in valarrays; for large character sets, sparse array storage could reduce memory requirements. Another alternative would be to use just two scalars for the forward and reverse weights and not allow weights to be assigned character by character. Also not implemented here is case-independent matching.
While fast enough for modest dictionaries, the search could be sped up somewhat by changing the string representation from strings to char* (but computers do just keep getting faster).
References
[1] Larry Andrews. A Template for the Nearest Neighbor Problem, C/C++ Users Journal, November 2001.
[2] Steven Skiena. The Algorithm Design Manual (Springer-Verlag, 1997).
[3] Zvika Ben-Haim. <www.codeguru.com/cpp_mfc/EditDist.shtml>.
[4] Unknown. Article in Electronics, 1980.
[5] Qiang Liu. Implementing Reusable Mathematical Procedures Using C++, C/C++ Users Journal, June 2001.
Larry Andrews has a Ph.D. in chemistry from the University of Washington in Seattle. He has been a software developer and software QA manager for a number of years. He may be reached at andrewsL@ix.netcom.com.