Erick Otto has a degree in computer science and has been working in the customer support group for the 5ESS telephone switch since 1985. He has been programming in C since 1984 and in the last four years has been involved in programming a Customer Complaints Database for AT&T switching products. His particular interests are String/Pattern searching and techniques to make C programs run as fast as possible. He is currently working as supervisor of AT&T-NS local field support in Indonesia. He may be reached at eotto@lfsin.idgtw.attibr.att.com or com!att!attibr!idgtw!lfsin!eotto.
Introduction
This article introduces two algorithms that are very fast in finding strings in large text files. I have implemented them to perform string searching in a text database of over 120 Megabytes. I needed algorithms that were very fast and flexible, because the normal and still very much used one-by-one character comparison method was too slow to be practical. The algorithms I discuss here are about ten times faster on average than the "normal" UNIX grep.The first algorithm is actually the faster. It is based on a tuned Boyer and Moore algorithm [2] further developed by Hume and Sunday [3]. The second algorithm, called "shift-OR," is based on algorithms found by Beaza-Yates and Gonnet [1] and further developed by Wu and Manber. It provides more flexibility, including regular-expression searching without much loss of speed.
The Boyer and Moore Algorithm
The basics of the first algorithm were first found by Boyer and Moore in 1977 [2]. The original goal was to make as few character comparisons as possible in order to speed up the search. They found that a low number of character comparisons usually means that the program is faster. By using efficient programming, and testing various constructions of algorithms, a higher execution speed can be obtained.One of the differences from grep is that the algorithm makes use of information retrieved from the pattern itself. Knowing that a certain character does not occur in the pattern is useful information. One other difference from grep is that the characters are compared fight to left.
The character comparison starts at the last character of the pattern and the related position of that character in the text. (See Figure 1a. ) In other words, if we are positioned in the text at the character x, and the pattern we are looking for is today, there is no possibility that we can have a match at this position since x does not occur in the pattern.
In order to know which characters occur in the pattern and which do not, we build a table dist, which contains the distances of every character in the pattern calculated from the end. The characters that do not occur in the pattern are set to the length of the pattern. (See Figure 2. ) This is another difference from grep. We have some overhead in pre-processing the pattern.
Now we can look up the character x in the table and find that the distance is 5. This means we can advance our text pointer by 5. (See Figure 1b. ) The pointer into the pattern is untouched. We start the comparison again with the new pointer into the text.
Now consider another situation, where the text character is an a. We can advance only dist[a], or one character in the text. That way, the second character on the left of both the text and the pattern would be the character a. (See Figure 1c. )
Using this algorithm speeds up the search considerably because not every character in the text is compared with the pattern. It will at first only look for a starting position in the text to start the full pattern comparison.
Another nifty trick that is used at this point is to see if the character that occurs the least frequently in the text (the lfc, for short) is present where we expect it. If we assume that m is the lfc in the pattern memorandum, we can do a pre-check on the leftmost occurrence of the lfc in the text. See Figure 3. This figure also shows at what positions which elements of the algorithms are used. If the pre-test is satisfied, we perform a normal match, comparing every character in the pattern with every character in the text.
Finally the last basic step is to go forward in the text after we tested for a possible match. After extensive research performed by Hume and Sunday [3], the shift used is based on the first leftward re-occurrence of the last character of the pattern. If there is no such re-occurrence we forward by the pattern length.
This algorithm is very fast, but depends on the length of the pattern and the frequency of the characters in the pattern. Listing 1 shows my implementation of the Boyer and Moore search algorithm. Listing 3 shows a small program that will generate a frequency distribution derived from a file.
The Shift-OR Algorithm
The other algorithm that I present is based on binary operations instead of character comparisons. This is faster than grep because mostly binary operations are performed.In this method we track the state of the search by using binary operations. The state variable will tell us how many characters have matched to the left of the current position in the text. To do this we need to maintain a table that contains information obtained from the pattern. The position of every character that occurs in the pattern gets a zero set in the table entry for that character. (See Figure 4. )
The characters are counted from left to right and the bits are counted from right to left. Since the character e occurs twice in the pattern stereo, it gets two bits set to zero. Characters that do not occur in the pattern get all ones set in the table. The initial state gets set to all ones. To perform the comparison we first shift the state left one bit. (In C, this mean insert a zero.) We then bitwise-OR the table entry for that character in the text. It can be envisioned as a window, opening (zero) when comparing a certain position.
Suppose that the current character in the text is s. We first shift the state left one bit and bitwise-OR in T[s]. The result is the new state of the search, which has bit zero set to zero. This means that at the current position in the text there is one match of length one. (See Figure 5a. )
If the word stereo is in the text and we are positioned in the text at the second e, the state will be 1...101111. This means that there is one match of length five left of the current position. (See Figure 5b. )
The way to decide if we have found a match is to set another variable to the length of the pattern (stereo has length 6) in the appropriate way in this case to 1...000000. Now, if the state is less than the stop variable, we have found a match (1...0111111..1000000).
The flexibility of this algorithm comes from the fact that we can easily manipulate the character table T. If we set the second bit from the right to a zero for every entry, this means we don't care what the second character of the pattern is. This is the so-called wildcard search. And the nice thing about this is that it doesn't cost a penny more to perform.
We can apply the same logic to ranges of characters. We could for example set the second bit in table T for the characters t, p, and f to a zero. This would match the words stereo, spereo, and sfereo in the text. You could even make use of the information in the state to report partial matches.
This algorithm is slower than the previous one but does not depend on the length of the pattern or the frequency of the characters. It can also be used in a much more flexible way. Listing 2 shows my implementation of the shift-OR search algorithm.
Implementations
The code presented hides a few tricks I would like to mention. I have put a lot of comments in the code to make clear what is happening. First the speed of a program like this also depends on the I/O mechanism that is used. In line 20 in the first program (Listing 1) I therefore use the system-defined buffer size. This will reduce the number of I/O accesses. Usually, using buffers less than the system buffer size will cause unnecessary I/O.Further, in line 74 you will see a technique used to keep the number of tests inside a loop as small as possible. It usually saves a lot of execution time setting sentinels at the end or beginning of a buffer when searching for a character, instead of testing inside your loop for reaching the beginning or the end of your buffer.
In line 80 we make sure that we end the text search on a line boundary, to prevent the match routine from getting half words. The build routine does the preparation work for the pattern search. It initializes the distance table, finds the lowest frequency character, and sets the shift. In line 176 and 177 again you see the sentinel technique, although a little different than before. We need to put as many characters at the end of the buffer as the text pointer can be forwarded. (See Figure 3. )
The other two more interesting points are the code lines at line 191-193 and line 234. At line 191 a technique called loop unrolling is used. Loop unrolling prevents a lot of branching (for iterations through the loop). On modern pipelined machines, the code is usually faster since most probably all the instructions can be contained in the instruction cache. Note that when on line 191 k is zero, lines 192-193 will also return k as zero, which is a prerequisite of using loop unrolling in this case. On different machine architectures, unrolling less or more times can be benificial. You can play around with it.
Last, on line 234 we forward the text to the end of the line if we reported a match. Since we already printed the full line, why should we search for more occurences of the pattern? If you want to count the number of matches, this line should be removed!
This program can also be easily adapted to do case-insensitive searches. The pattern will have to be converted to lower case. All table entries for characters occuring in the pattern will have to be set to their distances, but also their upper case counterparts will have to be set. Besides that, lines 205 and 209 will have to be changed so that references to the text are translated to lower case. The speed of the program is hardly lowered if this is done with a translation table using something like:
char TR[MAXSYM]; /* build the case TRanslation table */ for(i=0; i < MAXSYM; i++) TR[i] = i; for(i='A'; i <= 'Z'; i++) TR[i] = i + 'a' - 'A';The shift-OR program (Listing 2) uses the same efficient programming techniques. Starting at line 13, the size of a "word" (unsigned int) is defined. This is needed since the number of bits in a word is also the limit for the maximum pattern length that can be used. Of course you can use two words (this will slow the algorithm down), but the function matchpat will have to be altered.The build function in line 80, does the pre-processing of the pattern. I have included a small subset of the normal regular-expression pattern matching. The build function can handle a wildcard (dot or .) and ranges such as [a-z]. I am sure more inventive ways can be found to implement this, but this one works fine. In line 242, you can see one limitation that this implementation has. The first character needs to be normal, not a regular expression metacharacter. This actually speeds things up. Finally, on line 250, you see the actual binary operations being performed.
References
[1] R. Baeza-Yates and G. H. Gonnet. "A New Approach to Text Searching," Communications of the ACM, Oct. 1992, Vol 35, No 10.[2] R.S. Boyer and J.S. Moore. "A Fast String Searching Algorithm," Communications of the ACM, Oct. 1977, Vol 20, No 10.
[3] A. Hume and D. Sunday. "Fast String Searching," Software, Practice and Experience, Fall 1991.