Full-Text Searching in Perl

Dr. Dobb's Journal January 1999

Fast, efficient, and surprisingly easy

By Tim Kientzle

Tim is the senior technical editor for DDJ. He can be reached at kientzle@ddj.com.

I've had to build several full-text search engines recently, and found that Perl works extremely well for this. Not only can Perl easily read and manipulate text files, Perl also has built-in support for fast B-Tree databases. In this article, I'll present some Perl programs that handle very rapid full-text searching. With a little work, you could easily use these programs to provide full-text search capability for a web site.

Standing on Your Head

When you want to view a text file, you name the file and get back all of the words in the file. To perform full-text searching, you have to do the opposite -- name a word and get a list of all of the files containing that word.

The most effective way to do this is to simply build a database whose keys are words. When you look up a word, you get a list of all the files containing that word. Once you have this database, Example 1 can perform full-text searches. With just a little modification, this program can produce HTML search results suitable for use on a web site.
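A minimal sketch of the kind of searcher Example 1 describes (Example 1 itself is not reproduced here). The layout matches the article's index: a word maps to a packed list of 16-bit file numbers, and `"\0"` plus a number maps to a file description. For a runnable demo, `%index` is a plain hash; the real searcher ties it to index.db with DB_File.

```perl
#!/usr/bin/perl
# Toy searcher over the article's index layout. The demo data below is
# invented; a real searcher would tie %index to index.db via DB_File.
use strict;

my %index = (
    "perl"        => pack("n*", 0, 1),   # "perl" appears in files 0 and 1
    "btree"       => pack("n*", 1),      # "btree" only in file 1
    pack("xn", 0) => "intro.html",       # file descriptions, keyed by
    pack("xn", 1) => "btree.html",       #   "\0" . 16-bit file number
);

my @query = ("Perl", "btree");
my %hits;                                  # file number => words matched
foreach my $word (sort map { lc } @query) {  # sorted order helps the DB cache
    next unless defined $index{$word};
    $hits{$_}++ foreach unpack("n*", $index{$word});
}
# Files matching the most query words come first
foreach my $f (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
    my $desc = $index{pack "xn", $f};
    print "$hits{$f} $desc\n";
}
```

Emitting HTML instead of plain lines in the final loop is all it takes to turn this into a web-site search result page.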

Although Example 1 is short, it is very fast. On a 266-MHz Pentium II, it takes about 0.2 seconds to perform a complete search against a large database (see Table 1). This time includes loading and initializing the Perl interpreter. Most of this speed comes simply from using the right disk access method -- B-Tree files are designed specifically for fast random access to data on disk. Even with the database on slow CD-ROM, searches take less than two seconds.

Conceptually, there are really two databases. The first maps each word to a list of 2-byte integers that will represent the files. Binary integers are the most compact way to store this information, and I don't expect to have more than 65,000 files anytime soon. The second database maps these numbers to file descriptions, which can include a title and a filename or URL. By storing the filename just once for each file, the total size of the index is reduced considerably.

To make the index easier to move around, I've collapsed the two databases into a single file. File number keys are stored with a leading zero byte, which prevents word entries and file entries from colliding.
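The key encoding just described is a one-liner in Perl: pack's "x" format emits a zero byte and "n" a 16-bit big-endian integer, so a file-entry key can never collide with a word key, which is plain text.

```perl
#!/usr/bin/perl
# The single-file key encoding in miniature: "\0" . 16-bit file number.
use strict;

my $fileKey = pack("xn", 42);               # zero byte plus 2-byte integer
print length($fileKey), " bytes, first byte ",
      ord(substr($fileKey, 0, 1)), "\n";    # 3 bytes, first byte 0

my ($fileNum) = unpack("xn", $fileKey);     # skip pad byte, read the number
print "file number $fileNum\n";             # file number 42
```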

Working with Berkeley DB

Perl makes it easy to access database files. The tie() function attaches a hash name to an on-disk database file. Perl supports several simple database formats through this mechanism. The one I chose is the "Berkeley DB" package, which supports B-Tree files on disk. Because the C library implementing this file format is freely available, most Perl interpreters support it. Plus, Berkeley DB supports any byte ordering, so you can move database files across platforms with impunity. Berkeley DB supports C, C++, Java, and Perl APIs, for both UNIX and Windows 95/NT. Although Berkeley DB 2.0 is available from Sleepycat Software (http://www.sleepycat.com/), most Perl implementations still rely upon previous versions because of licensing and API changes. The Perl code I present here, for instance, is based on Berkeley DB 1.85/1.86, available from Sleepycat and the Comprehensive Perl Archive Network (http://www.cpan.org/).

The tie() command in Example 1 is typical. The first three arguments are the hash to be bound, the database library to use, and the filename of the on-disk database. The next two arguments are file permissions, indicating whether you're going to read or write the database, and the file mode to be used if the database has to be created. The last argument is a reference to a special variable that specifies how this file should be opened. Example 2 shows how to set the cache and page sizes used by the DB library. (See the Perl documentation if you're unfamiliar with Perl references.)

The tie() call uses the contents of the $DB_BTREE variable when it calls the library to open a DB file, so you have to perform any such manipulations before you open the database file.
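A sketch of the kind of tuning Example 2 performs (Example 2 itself is not reproduced here, so the cache value is an assumption; the 32-KB page size is the one the article settles on later). Both fields must be set before tie() opens the file.

```perl
#!/usr/bin/perl
# Setting B-Tree tuning parameters through $DB_File::DB_BTREE before the
# database is opened. The cache size here is illustrative only.
use strict;
use DB_File;
use Fcntl;

$DB_File::DB_BTREE->{'psize'}     = 32 * 1024;    # 32-KB pages
$DB_File::DB_BTREE->{'cachesize'} = 1024 * 1024;  # 1-MB cache (assumption)

tie(my %index, 'DB_File', "tuned.db",
    O_RDWR | O_CREAT, 0644, $DB_File::DB_BTREE)
    or die "Cannot open tuned.db: $!";
$index{"hello"} = "world";                 # the hash now reads/writes disk
print "stored: $index{'hello'}\n";
untie(%index);
unlink("tuned.db");                        # clean up the demo file
```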

Building the Index Database

Perl's string-handling abilities make the basic indexing chore pretty easy. Listing One is indexer_basic.pl, a complete program to build the index. It uses Perl's find() function to visit every file in a directory tree. The IndexFile subroutine identifies HTML files by their extension, reads the entire file into a string in memory, strips the HTML tags, and gives the resulting text to IndexWords. If you wanted to add support for other file formats, it should be easy to do; all you need is a way to obtain raw text from the file. (For example, you might call the GhostScript programs to extract raw text from PostScript or PDF files.)

The IndexWords subroutine first uses Perl's split function to break the $words string into an array of separate words. It then uses Perl's grep command to discard certain nonwords and remove duplicates from the list. Finally, it simply iterates over the words, appending the file key to the database entry for each word.

Although the basic principles are the same, the indexer I currently use (indexer_big.pl, available electronically; see "Resource Center," page 5) looks quite different from Listing One. To index DDJ articles, for instance, I use a more complex scheme to generate key numbers because I want to map words to articles, as opposed to files. That is, since a single article might contain multiple HTML files, I use some special-purpose Perl coding to convert any HTML filename into the name of the top-level file for that article. This also requires some additional complexity elsewhere, since the name of the file on disk -- which is needed to read its contents -- is no longer the same as the name being entered into the index.

Indexing Tips

The most important step in indexing a file is converting a text string into a list of relevant words. In practice, this is much more complex than Listing One suggests. Consider, for example, that "OS/2" and "C++" are single words, while "and/or" and "index+offset" are each two words separated by a punctuation mark.

To handle this particular problem, my first call to split is very liberal about what constitutes a word; any of the previous examples would be considered a single word. I then call a separate function that looks for compounds (entries that consist of two or more parts separated by /, -, or +) and adds the component pieces to the index as well. After this, the aforementioned examples would result in all of the following being added to the index: "OS/2," "OS," "2," "C++," "C," "and/or," "and," "or," "index+offset," "index," and "offset."
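A sketch of that compound-handling pass. The function name is hypothetical (the article's own code is not shown); given a liberally split word, it returns the word itself plus any pieces separated by /, -, or +.

```perl
#!/usr/bin/perl
# Split compound "words" into their components while keeping the whole
# word, as described in the text. CompoundParts is a hypothetical name.
use strict;

sub CompoundParts {
    my ($word) = @_;
    return ($word) unless $word =~ m{[/+\-]};       # not a compound
    my @parts = grep { length } split(m{[/+\-]+}, $word);
    return ($word, @parts);                         # whole word plus pieces
}

print join(" ", CompoundParts("OS/2")), "\n";          # OS/2 OS 2
print join(" ", CompoundParts("index+offset")), "\n";  # index+offset index offset
print join(" ", CompoundParts("C++")), "\n";           # C++ C
```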

At several points in the indexing process, I examine the words being indexed to refine the final index. Some of these checks are generic. For example, I use a set of regular expressions to recognize and throw out certain nonwords, such as entries that consist only of punctuation and entries that contain only a single character.
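The article does not show its exact nonword patterns, so the two checks below are assumptions matching its description: single-character entries and entries consisting only of punctuation are thrown out.

```perl
#!/usr/bin/perl
# Generic nonword filtering, sketched from the description in the text.
use strict;

sub IsNonword {
    my ($w) = @_;
    return 1 if length($w) <= 1;          # single-character entries
    return 1 unless $w =~ /[a-zA-Z0-9]/;  # nothing but punctuation
    return 0;
}

my @kept = grep { !IsNonword($_) } ("perl", "x", "--", "B-Tree", "42");
print "@kept\n";   # perl B-Tree 42
```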

Other index improvements require long lists of words. For example, users should get the same result if they search for "analyze" (American spelling) or "analyse" (British). The only way to ensure this is to manually build a list of synonyms and use this list to make sure those words are entered both ways. My complete indexer program includes over 9000 lines of such information about words, including the synonym lists just described.

This information was put together the hard way. After my initial indexer was working, I dumped the initial list of some 140,000 "words" to a text file and edited it manually to build the aforementioned lists. The lists are included at the end of the Perl program and read into hash tables when the program starts up. There's still a lot of room for improvement here. If you're serious about building a good index, you might want to spend a few days reviewing and improving these tables of word knowledge.

In general, I've tried to err on the side of including too many words in the database. If you look at the contents of the final database, it's clear that there are a lot of useless entries -- nonwords that are unlikely to ever be used. Unfortunately, I've found no automatic way to purge them. In particular, any text corpus will have enough names, made-up terms, and special words that verifying words against a standard dictionary is a bad idea.

About B-Trees

Recall that a B-Tree file consists of pages stored on disk. Each page is either an internal page, which has keys and pointers to other pages, or a leaf page, which stores keys and their associated data.

To find a particular entry, you start with the top-level page, usually an internal page. You compare the desired key to each key on the page. Your key will fall between two of the keys on that page, which determines the next page number to check. You may walk through several internal pages before you arrive at a leaf page that contains the data you want. Figure 1 illustrates this process.

The depth of this tree depends on the size of the pages. If you increase the page size, you can reduce the depth of the tree, thereby reducing the number of disk accesses needed to retrieve an item, and speeding up your program. In my case, I found that a 32-KB page size was sufficient to produce a two-level tree; any item in my database can be found with only two disk accesses. Better yet, since the top page remains in the database cache, only the first database request requires more than a single disk access. This juggling of page size was especially important because I wanted to put the final database on a CD-ROM. CD-ROMs are exceptionally slow at seeking, which makes it especially important to minimize the number of pages that must be read, since each page read requires a seek.

Because a B-Tree stores items in sorted order, you can speed up multiple searches tremendously by doing your searches in sorted order. If your next search happens to land on the same page, you won't have to go to disk at all, since the required pages will already be in the cache.

Performance

To try out my indexing tools, I used ten years' worth of DDJ articles in HTML format as a testbed. Even with this large collection, search time was simply never an issue. Building the index, however, presented some challenges.

As Table 1 shows, the indexer in Listing One has a serious speed problem. It required over three hours to index my sample corpus. However, it used only 14 minutes of CPU time, which suggests that the program is spending most of its time waiting on disk I/O.

Every time you add a word to the database, you have to find that word in the database first, then write it back with an additional key. Since the replacement is longer, the update is likely to require moving data to accommodate it. In addition, the words in a single file probably span much of the database, so if you're indexing a large number of files, you quickly get to the point where you're skimming through several megabytes of the database for every file you index.

To address this, the current version of my indexer defines %wordCache to be an ordinary in-memory hash, and adds new word entries to that. After indexing 500 files, the indexer calls FlushWordCache (Listing Two) to update the disk database from the in-memory hash. With this change, the total running time dropped from over three hours to a reasonable 20 minutes. The CPU time was practically unchanged.
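The append side of that cache is the cheap half of the bargain: postings simply accumulate in the ordinary in-memory hash %wordCache, and FlushWordCache (Listing Two) later merges them into the tied database. CacheWord is a hypothetical helper name for illustration.

```perl
#!/usr/bin/perl
# In-memory posting cache, sketched from the description in the text.
# Appending to a plain hash avoids the read-modify-write disk traffic
# that made the basic indexer so slow.
use strict;

my %wordCache;

sub CacheWord {
    my ($word, $fileKey) = @_;
    $wordCache{$word} .= pack("n", $fileKey);   # cheap in-memory append
}

CacheWord("perl", 3);
CacheWord("perl", 7);
print join(",", unpack("n*", $wordCache{"perl"})), "\n";   # 3,7
```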

Refining the Front End

The searcher in Example 1 may be fast, but it's not the ultimate in user friendliness. It currently uses a simplistic ranking scheme that could be improved in many ways.

Of course, the most obvious deficiency of Example 1 is that it simply picks out all of the files that contain at least one word. With a little bit of effort, you could support Boolean expressions or AltaVista-style "+" and "-" expressions.
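One way to layer AND semantics (an AltaVista-style "+word") on top of the packed lists is to keep only the file numbers present in every required word's list. IntersectLists below is a hypothetical helper, not part of the article's listings; it reuses the same pack("n*", ...) representation as the index.

```perl
#!/usr/bin/perl
# Intersect several packed file-number lists: a building block for
# Boolean AND queries over the article's index format.
use strict;

sub IntersectLists {
    my @lists = @_;                     # each a pack("n*", ...) string
    my %count;
    foreach my $list (@lists) {
        my %seen;                       # count each file once per list
        $count{$_}++ foreach grep { !$seen{$_}++ } unpack("n*", $list);
    }
    my $needed = scalar @lists;         # must appear in every list
    return pack("n*", sort { $a <=> $b }
                      grep { $count{$_} == $needed } keys %count);
}

my $listA = pack("n*", 1, 2, 5);
my $listB = pack("n*", 2, 5, 9);
print join(",", unpack("n*", IntersectLists($listA, $listB))), "\n";  # 2,5
```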

Ranking and Boolean support are not independent decisions. If you use a more-powerful selection logic, then ranking may not be necessary, since users can quickly narrow their choice. If you use simpler selection logic, then effective ranking becomes more important. Also note that, because I'm using the Berkeley DB library, you could easily reimplement the front end in C or C++ if that's more convenient for you (for example, if you want to implement a yacc grammar to handle complex expressions).

Refining the Back End

The next big step is to make the indexing program more aware of the language it's working with. A program with a large built-in vocabulary and a knowledge of English grammar would be able to identify whether a word was being used as a noun, adjective, or verb. Generally, nouns and adjectives are much more important in an index than verbs. For example, users searching for skip lists would be much more interested in "use a skip list" (adjective, noun) than "skip the first and list the second" (verbs). Such a program might also be able to discriminate between similar-looking words -- compare "Windows have certain properties" to "Windows has certain properties." Such programs are hard to write, though, and require a large database of word knowledge to drive them.

If you're using this on a web site, it might be a good idea to log searches. By reviewing such a log, you can get a better idea what type of things people coming to your site are looking for. In particular, be certain to log searches that returned no results, since they might point out things that did not get properly indexed. A log of search requests is also a useful tool for planning future additions to your site.

Caveats

Although the Berkeley DB package is generally easy to use and quite fast, I have run across a few problems. One common process is to walk through the entries of index.db with Perl's each() command, and selectively modify certain items from the database. I've had several crashes when I've tried to modify items as I walked the database. Instead, I now walk through the database and accumulate a list of items to be modified, then modify those items in a separate pass. Also, I've found that Perl's string-concatenation assignment operator (.=) doesn't work correctly with tied hashes. In Listing One, I first read the database entry into a variable, append the data I want, then assign the result back into the database. Finally, I have had several crashes when the B-Tree file had very many pages (for example, a 13-MB database using 8-KB pages). Increasing the page size seemed to help in each case.
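The two-pass workaround looks like this in miniature. A plain hash stands in for the tied %index so the sketch runs anywhere; with a real tied database the point is the same: no writes while the each() walk is in progress.

```perl
#!/usr/bin/perl
# Two-pass modification pattern: collect keys during the each() walk,
# then modify in a separate pass. %index here is a plain hash standing
# in for the tied database.
use strict;

my %index = ("alpha" => "keep", "bad key" => "drop", "beta" => "keep");

# Pass 1: walk with each(), only recording what needs to change
my @doomed;
while (my ($key, $val) = each %index) {
    push @doomed, $key if $key =~ / /;   # example criterion
}
# Pass 2: modify only after the walk is finished
delete $index{$_} foreach @doomed;

print join(",", sort keys %index), "\n";   # alpha,beta
```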

These comments apply to Version 1.85 of Berkeley DB, which is commonly built into Perl distributions. Version 2.0 has recently been released and may have addressed these issues.

Moving Ahead

Really good indexing is hard. If you want a truly compact index that only contains the important terms organized in a clear fashion, be prepared to build the entire thing manually. The programs I've presented here can automatically build a functional index database, but, as with any endeavor, the more work you put into it, the better the result will be. The best commercial programs can reduce the amount of manual work required, but can never entirely eliminate it.

The complete set of Perl scripts that I currently use is available electronically. Be aware that, as I described in this article, you might need to do some customization for your particular application.

DDJ

Listing One

#!/usr/local/bin/perl
require 5;
use DB_File;    # Access DB databases
use Fcntl;      # Needed for above...
use File::Find; # Directory searching
undef $/; # Don't obey line boundaries
$currentKey = 0;

# Single database version:
#   Stores file entries in index.db as <NULL><binary file number>
#   The leading NULL prevents any word entries from colliding.
###########################################################################
unlink("index.db");                    # Delete old index.db
tie(%index,'DB_File',"index.db",       # Open new one
    O_RDWR | O_CREAT, 0644, $DB_File::DB_BTREE) ;
find(\&IndexFile,"articles");          # Index all of the files
untie(%index);                         # Close database
###########################################################################
sub IndexFile {
    if(!-f) { return; }
    if(/\.html?$/) {                   # Handle HTML files
        print "$File::Find::name\n";
        open(HTML_FILE,$_);
        my($text) = <HTML_FILE>;       # Read entire file
        close(HTML_FILE);
        $text =~ s/<[^>]*>//g;         # Strip out all HTML tags
        my($wordsIndexed) = &IndexWords($text,$currentKey);
        $index{pack "xn",$currentKey} = $File::Find::name;
        $currentKey++;
    }
}
###########################################################################
sub IndexWords {
    my($words, $fileKey) = @_;
    my(%worduniq);                     # for unique-ifying word list
    # Split text into array of words
    my(@words) = split(/[^a-zA-Z0-9\xc0-\xff\+\/\_]+/, lc $words);
    @words = grep { $worduniq{$_}++ == 0 }            # Remove duplicates
             grep { s/^[^a-zA-Z0-9\xc0-\xff]+//; $_ } # Strip leading punct
             grep { length > 1 }                      # Must be > 1 char long
             grep { /[a-zA-Z0-9\xc0-\xff]/ }          # Must have alphanumeric
             grep { !/[\x00-\x1f]/ }                  # No control chars
             @words;
    foreach (sort @words) {            # Add file key to each word
        my($a) = $index{$_};
        $a .= pack "n",$fileKey;
        $index{$_} = $a;
    }
    return scalar(@words);             # Return count of words indexed
}


Listing Two

# Flush temporary in-memory %wordCache to disk database %index
sub FlushWordCache {
    my($word,$entry);
    # Do merge in sorted order to improve cache response of on-disk DB
    foreach $word (sort keys %wordCache) {
        $entry = $wordCache{$word};
        if(defined $index{$word}) {
            my($codedList);
            $codedList = $index{$word};
            $entry = &MergeLists($codedList,$entry);
        }
        # Store merged list into database
        $index{$word} = $entry;
    }
    %wordCache = (); # Empty the holding queue
}
###########################################################################
sub MergeLists {
    my($list);
    foreach (@_) { $list .= $_; }              # append all the lists
    my(@unpackedList) = unpack("n*",$list);    # Unpack into integers
    my(%uniq); # sort and unique-ify
    @unpackedList = grep { $uniq{$_}++ == 0 }  # Remove duplicates
                    sort { $a <=> $b }         # Sort
                    @unpackedList;
    return pack("n*",@unpackedList);           # repack
}



Copyright © 1999, Dr. Dobb's Journal