Features


Fast String Searching

John W. Ross


John Ross is a computational scientist in the University of Toronto's High Performance and Research Computing group. His interests include scientific programming on highly parallel and vector supercomputers. John can be reached at yjohn@utirc.utoronto.ca.

Introduction

String searching, or finding occurrences of a particular pattern of characters in a block of text, is a common task and a component of many applications, including text editing and document indexing. Although conceptually straightforward, there are a number of subtleties to the problem, especially if it is to be done quickly.

In this article I consider a couple of "intuitive" approaches to the problem and then present an algorithm that has been available in the literature for a number of years that is easy to code and extremely fast.

Brute-Force Method

In the string searching problem, we have a pattern of length m characters and we want to find all occurrences of it in a block of text of length n characters. Usually n is much larger than m and may consist of multiple megabytes of text.

The brute-force algorithm is the simplest approach. As everyone knows, it is very difficult to find a needle in a haystack, but to do so we could start by aligning the pattern and string so that their first characters coincide:

IN A HAYSTACK
NEEDLE
and then compare successive pairs of characters until a) we get a mismatch or b) we run out of characters in the pattern. If b) then we have found an occurrence of the pattern in the text string and we report it. In either event, we move the pattern one character to the right

IN A HAYSTACK
 NEEDLE
and start comparing characters again. We proceed in this manner until we have moved the pattern to within m characters of the end of the text block, i.e., to character n - m + 1 in the string:

IN A HAYSTACK
      NEEDLE
There is no point moving the pattern further to the right since this is the last possible position in which we could find a match. Eventually we conclude that there are no needles in this haystack.

This algorithm will find all occurrences of the pattern in the text block, but it is not terribly fast. We can make a minor improvement by noting that if we do find a match, we can slide the pattern along by m characters to the right before looking for the next match. So after:

A NEEDLE IN A HAYSTACK
  NEEDLE
we try:

A NEEDLE IN A HAYSTACK
        NEEDLE
The brute-force algorithm is quite easy to code. It is shown in
Listing 1, in function bfFind (for "brute force").

In testing various string searching algorithms, I used a common main program shown in Listing 2. It allocates a block of memory on the heap and then reads several copies of a text file into it. The various search algorithms are implemented in functions declared as:

int xxxFind(int n, char *txt, int m, char *pat);
where xxx is changed to reflect the type of algorithm. The function returns the number of matches.

In the tests reported in this article, I used a total text size of just over one megabyte. The characteristics of the patterns searched for are summarized in Table 1. On a MIPS R2000-based UNIX workstation, the brute-force algorithm obtained the performance shown in Table 2.

Locate First Character

It is generally more efficient to find a single character in a text string than to find a pattern — this operation may even be supported by the underlying hardware. A potentially better algorithm would be to locate the next occurrence of the first character of the pattern in the text block and only resort to character-by-character matching from that point. Provided that the character searching function is sufficiently rapid, this should improve our search time.

This algorithm is implemented in function fcFind (for "first character") in Listing 3. It uses the C library's strchr function to locate the first character. Sure enough, this cuts the search time down considerably — by about one half, as shown in Table 3. In fact, this is the algorithm often used by the C library function strstr. A search function can also be built using strstr. This is implemented in function ssFind, shown in Listing 4. Performance times of ssFind and fcFind were practically identical on the test machine.

A reflection on the nature of English text should suggest an improvement for this algorithm. If the first character of our search pattern is an E (the most common letter in English text) then we will have a lot of potential matches and will have to resort to character-by-character searching frequently, with many unsuccessful comparisons between the pattern and the letters following each E.

If however, the first character of the pattern is an infrequently occurring letter like X then we will have fewer potential matches and the search will proceed faster. It is not necessary to have the first character of the pattern be a low frequency letter. We can select the least frequently occurring character in the pattern and search based on it. The fcFind function is modified as shown in fclfFind (for "first character, low frequency") in Listing 5. Variable pos is the position number of the character in pat that occurs least frequently in English text.

To select this character optimally you should have a list of frequencies of occurrence of upper and lower case letters in common English text. Better performance could be obtained by calculating the character frequencies based on the actual block of text you are searching. The success of this strategy depends on the pattern. Longer patterns generally do better since they are more likely to contain a low frequency character.

Boyer-Moore Algorithm

To gain a really substantial improvement in speed, it is necessary to take a new approach. An algorithm discovered by R.S. Boyer and J.S. Moore [1] in 1977 introduces a couple of unexpected twists into the pattern matching method.

First of all, they compare characters in the pattern and the text string from right to left rather than the more "natural" left to right. The method also incorporates a pair of heuristics. The occurrence heuristic is obtained by realizing that when we find a mismatch between characters, we must shift the pattern so that the character in the text causing the mismatch aligns with the first occurrence of that letter in the pattern. If it does not occur in the pattern we can shift the pattern right by its entire length. This procedure is accomplished simply by setting up a table of shifts for each letter of the alphabet based on the pattern, and looking up the required shift for the mismatching character.

The other interesting idea embodied in the algorithm is expressed in the match heuristic. This notes that in comparing characters in the pattern and the text we may obtain a partial match before the mismatch occurs. If this sub-pattern occurs earlier in the pattern (i.e., further to the left) then we should shift the pattern right by the distance between the two sub-patterns. These shifts as well can be precomputed based solely on the pattern and the alphabet. However, the preprocessing step for the match heuristic is fairly complex.

Boyer-Moore-Horspool Algorithm

A couple of years later, Nigel Horspool published a simplification of the Boyer-Moore algorithm [2]. He pointed out that the match heuristic is used for optimizing a search when the pattern being searched for contains repetitive subpatterns, thereby avoiding the worst-case running time. Since patterns of this sort occur infrequently, he postulated that this part of the algorithm could be omitted entirely, simplifying the coding considerably. In a number of tests searching for patterns of various lengths, he found that indeed the full Boyer-Moore algorithm and his simplified version produced practically identical timings.

Horspool introduced a further refinement, known as the Boyer-Moore-Horspool (or BMH) algorithm, by noting that once we know for certain that the pattern matches or does not, any of the characters from the text can be used to address the heuristic table. The BMH algorithm addresses the occurrence with the character in the text corresponding to the last character of the pattern. Horspool found this algorithm faster than the "locate first character" algorithm for all but the shortest strings.

You might wonder how this approach compares to other string searching algorithms. Baeza-Yates [3] has surveyed a number of string searching algorithms — actually coding and testing them — including the variations on the Boyer-Moore algorithm and the brute-force method. His empirical results showed that the BMH algorithm was superior to the algorithms tested for random or English text on search patterns of three or more characters. (For patterns of length 2, he found an algorithm by Knuth, Morris, and Pratt to have a slight edge.)

Listing 6 shows function bmhFind, an efficient implementation of the Boyer-Moore-Horspool algorithm. Perhaps surprisingly, the code is extremely simple. To avoid treating special cases, the first character (element zero) of the pattern and the text are set to characters that do not occur in the pattern or the text, respectively. Assuming an alphabet size of 128 (standard ASCII text) gives us lots of potential candidates. I selected character 0xfe for the pattern and character 0xff for the text.

The algorithm searches for pattern pat[1...m] in text block txt[1...n] rather than beginning at element zero as is typical for C programs. The main program (see Listing 7) defines a pair of pointers, *txtp and *patp, that point to element 1 of the text and pattern, respectively. This allows us to address these text strings in the conventional manner in the main program.

Timings

Table 4 shows timings for the various algorithms: bf (brute-force), fclf (first character, low frequency) and bmh (Boyer-Moore-Horspool, not including table initialization). As you can see, in all cases the BMH algorithm performs best, though it has just a slight edge (probably not statistically significant), when searching for the short pattern. The BMH algorithm turns in much better times when searching for longer patterns, especially if they occur infrequently or not at all in the text block.

References

[1] R.S. Boyer and J.A. Moore. "A Fast String Searching Algorithm," Communications of the ACM, 20 (10), p. 762, 1977.

[2] R.N. Horspool. "Practical Fast Searching in Strings," Software Practice and Experience, 19, p. 501, 1980.

[3] R.A. Baeza-Yates. "Algorithms for String Searching: A Survey," SIGIR Forum, 23 (3,4), p. 34, 1989.