Indexed Text Retrieval

By indexing every word, I can search on the logical AND/OR combination of any words I choose. For example, to find all of the hospitals in Ottawa, I just type itr hospital ottawa, where itr is the name of the ITR program, and hospital and ottawa are text keywords. The ITR program will print out all phone-number records that match the two keywords. (Note that this version of ITR is case insensitive.)

Implementation of ITR requires two separate programs--an indexing program and a lookup program. Listing One is the ITR.H file for the ITR program, while Listing Two is the main ITR retrieval program, ITR.C. Other relevant files (the ITR Indexer, a common library, and a makefile) are available electronically; see "Availability," page 3.

Indexing every word involves opening each input file, reading each line (station code and text), and parsing out the words. In my implementation, I skip anything that is not "A" through "Z". Therefore, for each word, I have the following information: filename (in this case, the area code and office code), position (in this case, the station code), and the word itself. I don't care about the word's relative position on the line--I just want it associated with a given DN. Then when I specify the search keys, I don't have to specify them in any particular order; "ottawa hospital" is equivalent to "hospital ottawa".

Managing this amount of data is in itself a challenge since this index is quite large.

The Alpha Tree

After the digit-tree approach succeeded, I decided to experiment with an alpha tree. Using this scheme to find the keys for "ottawa," I simply open the file "o/t/t/a/w/a.db" and find all of the keys. The keys are directly translatable into the telephone-database record. To preserve disk space, I had to make some decisions about the extent of the alpha tree. I first chose a maximum depth of five levels. Depending upon the data that you wish to index, this restriction may be drastically altered to suit your requirements.

To generate the index, I implemented a "bucket-sort" algorithm that works like this: During analysis, I take each DN and word and generate a 10-byte entry consisting of six bytes from the word (null padded) and four bytes representing the DN. Then, I append this entry to the file with the same name as the first character of the word. For example, the word "weather" is stripped to just five characters ("weath"), and, along with the key (6135141212), appended to a file called "w.1". The next word, "Number," is also stripped to five characters ("numbe"), and along with the same key, is appended to a file called "n.1".

Once all words in all files are processed, each ".1" file is opened and analyzed. If the number of records in the file warrants splitting it up (this is a tunable parameter in the source), a subdirectory is created and the process repeats, using the next character of the word as the basis for the filename. For example, when the "w.1" file is analyzed, the word "weath" is written to "w/e.1". This process continues until either five levels of directory structure have been generated, or the ".1" file is small enough not to split. Finally, when all of the files are split, the partial word is stripped out of the files since it is no longer required, making each record contain only the 4-byte (telephone number) key.

The final result is an alpha-tree structure on disk with files consisting of indexes representing phone numbers. This is similar to a classical hash function, except that the length of the hash is flexible, depending upon the number of entries. Table 1, which lists the files I have on my disk, shows that not as many words start with "WEK" as with "WELC". Interestingly, as the number of levels increases, so does the theoretical number of files. Table 2 summarizes the theoretical maximum number of files and the number actually contained in my database.

Retrieval Software

To access the stored information, the retriever takes the keywords passed on the command line and finds matching database records. If only one keyword is specified, the retriever opens the associated keyword file; for example, if the keyword is "welcome," the retriever opens "w/e/l/c.db". Then, for each entry in the keyword file, the retriever opens the corresponding database file, performing secondary matching on the database records as they go by. Any records that fully match are printed.

The reason for a secondary match is that ".db" files are unique only as far as their filenames go. For example, the w/e/l/c.db file contains keys for all words that start with "welc": "welcome," "welch," and the like.

In the case of two or more keywords, the retriever opens all specified keyword files and finds keys that are the same in all of them. (Since the key files are sorted by key, this is relatively simple.) Once keys are found that are the same in all files, the telephone-database file is opened and a secondary match is performed to check that the keywords are all present in the database entry.

The current version of the retriever (see Listing Two) provides only an AND function--all keywords listed on the command line must match for the record to be printed. It is simple to add an OR capability and partial-key matching.

Extending the ITR Concept

How can you extend this system to usefully index arbitrary bodies of text? Start by revisiting the indexer, which generates alpha-tree entries that contained the index. The index consists of a complete telephone number. You interpret the telephone number in two parts; together, the area code and office code make up the "filename," and the station code (the last four digits) is the "record position" within the specified filename.

By revising the indexer to generate 8-byte indexes--four bytes for the filename (as a file ID number, DOCNUM) and four for the word position within that file--arbitrary text can be indexed. In fact, the sample telephone-database indexer is a subset of this approach.

As an aside, you can't really store a complete telephone number into four bytes. I "cheated" by creating a table of area-code and office-code pairs, then stored a six-digit index in this table. For instance, my database has 8221 area-code/office-code pairs, which means that the index ranges from 0000010000 through to 0082219999--well within the range of four bytes. Some string manipulation is then performed on the decimal ASCII representation of the four bytes, and the area-code/office-code pair and station ID are split out or merged in, depending upon the operation being performed.

The interpretation of the 4-byte file position is left up to you. For certain types of database searches, the four bytes might represent the word's byte offset from the start of the file. For other types, it may represent the relative word number in the file. In the telephone-database example, it represents the record number.

Figure 1 illustrates a flexible approach to implementing an ITR program. In this scheme, any documents being added to the ITR database are assigned a document number in the DOCNUM database. This allows an association to be made between the filename (which can be arbitrarily long and complex; hypertext links, for example) and a 4-byte document number to be established.

Next, another database is established that allows the association between the document number and a document type (DOCTYPE). The document type is used by the front-end analysis scheme to choose an appropriate analyzer, and by the back-end matching scheme to choose a corresponding matcher.

To index a file once the DOCNUM and DOCTYPE database entries have been created, every word within the file is processed by the appropriate front-end analyzer; the document number, word position, and word itself are sent to the indexer. The indexer creates/updates the alpha-tree based upon the word, adding the document number and word position at the appropriate place.

To retrieve a document, a generic matcher is passed the list of keywords being searched for, and finds them in the alpha tree. Every time the generic matcher finds a match (based upon the alpha-tree information), it calls the specific matcher for that file type (from the DOCTYPE database) to verify the match. If the match is verified, the appropriate information is printed. Again, depending upon the exact requirements, there may be no need for a back-end matcher; for example, if the alpha-tree information is complete, rather than partial.

Much of the system's flexibility comes into play in the DOCTYPE database and analyzer and matcher. The analyzer and matcher operate as a pair, so it is entirely up to them to interpret the word position returned from the generic matcher. As an example, the DOCTYPE could indicate a PostScript, spreadsheet, word processor, or other type of arbitrary file or document.

Table 1: Sample list of files.

Filename     Size (Bytes)  Records

w/e/k.db       36            9
w/e/l/m.db     28            7
w/e/l/e.db     32            8
w/e/l/b.db   1060          265
w/e/l/c.db   2580          645
w/e/l/d/a.db   36            9

Table 2: Theoretical maximum number of files versus number actually observed in sample database. Consuming 141.2 MB of index for the 549-MB database.

Level  Maximum  Actual  Percentage

1           26      26      100.00
2          676     462       63.02
3       17 576   5 730       32.60
4      456 976  23 949        5.24
5   11 881 376  32 314       <0.01

Example 1: Contents of a typical digit-tree file.

1212 Weather Number, 24 hours / day, Ottawa Valley
1213 Liz and Brian's Pizza House, 2666 Temp Street, Ottawa
1443 Pay Phone, 55 Queensway Offramp

Figure 1: Implementing an ITR program. (a) INDEXER; (b) RETRIEVER.

Listing One

/* itr.h 
 * QNX 4
 * (C) Copyright 1993 by Robert Krten, all rights reserved.
 * This module contains the manifest constants and associated declarations
 * for the "ITR" program.
*/
/* manifest constants */
#define MaxDBLineLen        256 /* maximum length of database line */
#define NumKeys             10
#define KeyCacheSize        1024
#define MaxKeyWordLength    64
#define MaxFilenameLength   256
/* structure definitions */
typedef struct
{
    char    keyword [MaxKeyWordLength];
    char    fname [256];
    FILE    *fp;
    long    val;
    long    *cache;
    int     cachePtr;
    int     maxcache;
}   Key;
/* prototypes */
void        init (void);
void        ftrSearch (void);
void        findNextKey (int);
void        printKey (int);
void        printMatch (int);
void        initKeys (void);
int         databaseMatch (char *, char *);

Listing Two

/* main.c -- QNX 4, main.c shell version 0.004
 * (C) Copyright 1995 by Robert Krten, all rights reserved.
 * This module represents the main module for the itr retriever.
 * This program will retrieve text based on a full-text retrieval database.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <ctype.h>
#include "itr.h"
#include "lib.h"
char    *progname = "itr";
char    phoneDB [MaxFilenameLength];    /* database filename roots */
char    root [MaxFilenameLength];       /* base of itr database */
char    npanxx [MaxFilenameLength];     /* base of document # xlat */
char    *translation;                   /* allocated by read_translation */
int     numtrans;                       /* number of above */
int     longerThanIndex [NumKeys];      /* 1 if strlen (searchWord) > 5 */
long    topKey;                         /* highest of any */
long    maxKey;                         /* largest valid key */
int     numSearch;
Key     keys [NumKeys];                     
/* the main key structure */
main (argc, argv)
int       argc;
char    **argv;
{
    optproc (argc, argv);   /* get keys from command line */
    init ();                /* initialize search engine */
    itrSearch ();           /* search engine */
}
void
optproc (argc, argv)
int     argc;
char    **argv;
{
    int     i;
    numSearch = 0;
    for (i = 1; i < argc; i++) {
        if (numSearch >= NumKeys - 1) {
              fprintf (stderr, "%s:  Too many search keys, limit is %d\n",
                                                          progname, NumKeys);
              exit (1);
        }
        strcpy (keys [numSearch++].keyword, argv [i]);
    }
}
void
init ()
{
    char    *p;
    if ((p = getenv ("PHONEDB")) != NULL) {
    strcpy (phoneDB, p);
    } else {
        strcpy (phoneDB, "/data/telecom/phoneDB");
    }
    if (phoneDB [strlen (phoneDB) - 1] != '/') {
        strcat (phoneDB, "/");
    }
    sprintf (root, "%s1", phoneDB);
        sprintf (npanxx, "%s/.npanxx", root);
    read_translation (npanxx);           /* fetch the translation table */
    maxKey = numtrans * 10000 + 9999;    /* maximum key value */
}
/* prepareSearchFname -- generate alpha-tree pathnames corresponding to each 
 * keyword specified.
*/
void
prepareSearchFnames ()
{
    int     i;
    int     l;
    FILE    *fp;
    char    shortFnames [NumKeys][11];  /* /a/b/c/d/e\0 */
    for (i = 0; i < numSearch; i++) {
        /* generate flag array */
        longerThanIndex [i] = strlen (keys [i].keyword) > 5;
        /* trim search key */
        for (l = 0; l < 5; l++) {
           shortFnames [i][l * 2 + 0] = '/';
           shortFnames [i][l * 2 + 1] = tolower (keys [i].keyword [l]);
        }
        shortFnames [i][10] = 0;    /* terminate preventively */
        /* for short strings, the \0 that got stuck into shortFnames is
         * almost doing its job--there will be a trailing '/' which we
         * kill off here
        */
        if (shortFnames [i][strlen (shortFnames [i]) - 1] == '/') {
            shortFnames [i][strlen (shortFnames [i]) - 1] = 0;
        }
        /* now, see which ones are valid. If not valid, trim off two characters
         * at the end and try again. For short filenames, we (uselessly) trim 
         * off characters after the \0 -- this all comes out in the wash.
        */
        for (l = 4; l >= 0; l--) {
            sprintf (keys [i].fname, "%s%s.db", root, shortFnames [i]);
            if ((fp = fopen (keys [i].fname, "r")) != NULL) {
                break;
            }
            shortFnames [i][l * 2] = 0; /* kill off part of name */
    }
    if (fp != NULL) {
        fclose (fp);
    } else {
        fprintf (stderr, "%s:  couldn't find database for 
                                keyword %s!!!\n", progname, keys [i].keyword);
        exit (1);
    }
  }
}
/* itrSearch -- The search engine itself.
 * We change the keywords that we are searching for into actual alpha-tree
 * filenames (via preparseSearchFnames).  Then, we open all searchkey files,
 * and try to find keys that match.
*/
void
itrSearch ()
{
    int     i;
    int     change;
    int     allEqual;
    /* translate search strings into valid filenames */
    if (isdigit (keys [0].keyword [0])) {  /* kludge -- if it's a number */
                if (strlen (keys [0].keyword) != 10) {
                  fprintf (stderr, "%s: reverse number lookup requires a 
                                                  10 digit DN!\n", progname);
                  exit (1);
               }
               reverseLookup (keys [0].keyword);
               return;                                         /* done */
      }
      prepareSearchFnames ();
      initKeys ();
    /* initialize variables for search */
      for (i = 0; i < numSearch; i++) {
               keys [i].val = 0;
       }
       topKey = 0;             /* init, findNextKey uses it */
      /* now the real search algorithm begins. Initialize by searching first 
       * file, and assigning to 'topKey'
       */
       findNextKey (0);        /* find next (ie:  first) key for string zero */
       topKey = keys [0].val;
       while (topKey != -1) {  /* until EOF */
         /* advance all keys that are below topKey, until they are either at
          * topKey, or above. If a key goes above topKey, reassign topKey and 
          * continue. If a key is equal, skip until all keys are equal, or 
          * topKey has been adjusted. If all keys are equal, emit a match, and
          * advance any key. Ensure that key has in fact advanced!
          */
          change = 1;
          while (change) {
              change = 0;
              for (i = 0; i < numSearch; i++) {
                while (keys [i].val < topKey && keys [i].val != -1) {
                       findNextKey (i);
                       if (keys [i].val > topKey) {
                             topKey = keys [i].val;
                             change = 1;
                       }
                }
                if (keys [i].val == -1) {
                      return;
                }
           }
      }
      /* since there is no more change, ensure all keys are equal */
      allEqual = 1;
      for (i = 0; i < numSearch - 1; i++) {
            if (keys [i].val != keys [i + 1].val) {
                   allEqual = 0;
            }
      }
      if (!allEqual) {
              return;         /* Couldn't find a match */
      }
      printMatch (keys [0].val);    /* validate and then show matches */
      while (keys [0].val == topKey) {
        findNextKey (0);    /* advance to next search */
      }
      topKey = keys [0].val;
   }
}
/* reverseLookup -- prints database entry associated with the passed DN */
void
reverseLookup (dn)
char *dn;
{
     char    obuf [256];
     if (databaseMatch (dn, obuf)) {
             printf ("%s:  %s\n", dn, obuf);
        } else {
             printf ("No match\n");
    }
}
void
initKeys ()
{
    int     i;
    for (i = 0; i < numSearch; i++) {
        if ((keys [i].fp = fopen (keys [i].fname, "r")) == NULL) {
            fprintf (stderr, "%s:  couldn't open %s for r\n",
                     progname, keys [i].fname);
            exit (1);
        }
        if ((keys [i].cache = malloc (sizeof (long) * 
                                                     KeyCacheSize)) == NULL) {
                fprintf (stderr, "%s:  couldn't allocate a cache of %d
                            bytes!\n", progname, sizeof (long) * KeyCacheSize);
            exit (1);
        }
        keys [i].maxcache = fread (keys [i].cache, sizeof (long),
                           KeyCacheSize, keys [i].fp);
        if (keys [i].maxcache < KeyCacheSize) {
                fclose (keys [i].fp); /* close it, we've read it all */
                        keys [i].fp = NULL;  /* indicate closed */
                }
                keys [i].cachePtr = 0;          /* current position for read */
         }
}
/* findNextKey -- fetches next key that is greater than "topKey" for the given
 * keyword number, and handles the cache aspects too.
*/
void
findNextKey (k)
int k;              /* search string number */
{
     /* could be replaced with binary search for even greater speed */
     do {
            if (keys [k].cachePtr < keys [k].maxcache) {
                    keys [k].val = keys [k].cache [keys [k].cachePtr++]; 
                                                           /* "read" value */
            } else {        /* exceeded cache */
                  if (keys [k].fp == NULL) {  /* no more to read, done */
                      keys [k].val = -1; /* set EOF */
                  } else {
                     keys [k].maxcache = fread (keys [k].cache, sizeof (long), 
                                                    KeyCacheSize, keys [k].fp);
                           if (keys [k].maxcache < KeyCacheSize) {
                                 fclose (keys [k].fp);
                                 keys [k].fp = NULL;
                           }
                           keys [k].val = keys [k].cache [0];
                           keys [k].cachePtr = 1;
                     }
               }
       } while ((keys [k].val < topKey) && (keys [k].val != -1));
}
/* printMatch -- fetches matching record, and performs secondary matching
 *  validation on it.  Due to the algorithm used in the ITR program,
 *  it is possible that a "false match" will be generated (see article).
 *  The procedure ensures that no false matches are printed.
*/
void
printMatch (key)
int        key;
{
   char    number [256];           /* translated key -> number */
   char    result [256];           /* database record fetched */
   char    buf [16];               /* temp, key -> ASCII xlat */
   int         npanxx;             /* temp, NPA/NXX from key */
   int         station;            /* temp, station code from key */
   int         matching [NumKeys]; /* matching key matrix 1==match */
   char    word [256];             /* word buffer for 2ndary matches */
   char    *ptr;                   /* buffer pointer for fetching words */
   int         i;
   if (key == -1) {
             printf ("<EOF>\n");
   } else if (key > maxKey) {
             printf ("<INVALID %010ld>\n", key);
   } else {
           sprintf (buf, "%010ld", key);
           station = atoi (buf + 6);
           buf [6] = 0;
           npanxx = atoi (buf);
           sprintf (number, "%6.6s%04d", &translation [npanxx * 7], station);
           /* fetch complete database record into 'result' */
           if (databaseMatch (number, result)) {
                 /* now check 2ndary matches on all keys */
                 for (i = 0; i < numSearch; i++) {
                         matching [i] = 0;        /* initially, no matches */
                 }
                 ptr = result;
                 while (getWordLC (&ptr, word)) {
                         for (i = 0; i < numSearch; i++) {
                                if (!strcmpi (word, keys [i].keyword)) {
                                             matching [i] = 1;
                                }
                        }
                }
                /* see if all matched 2ndary search */
                for (i = 0; i < numSearch; i++) {
                        if (!matching [i]) {
                               break;
                        }
                }
                if (i == numSearch) {
                   printf ("%s %s\n", number, result); /* all matched, print */
                } /* else didn't match *all* keys! */
    } else {
        /* internal bug, couldn't find key in database */
                printf ("Can't match %s in database!!!\n", number);
        }
    }
}
static        char tmpbuf [MaxDBLineLen];
/* databaseMatch -- uses binary search with variable length lines to find 
 * the number in the text database.
*/
int
databaseMatch (number, result)
char    *number;                                /* international DN */
char    *result;                                /* looked-up name, or \0 */
{
    FILE    *fpr;                           /* db open for read */
    char    *ptr;
    int     len;
    long    top, bot;                       /* binary search values */
    long    mid;
    int     sts;
    createDigitTreeName (number, tmpbuf);
    if ((fpr = fopen (tmpbuf, "r")) == NULL) {
              return (0);
    }
    len = strlen (number);          /* how big is the number? */
    bot = 0;                        /* start of the file */
    fseek (fpr, 0L, 2);             /* get to end */
    top = ftell (fpr);              /* limit the search */
    while (top - bot > 512) {       /* while it's worthwhile */
          mid = (top + bot) / 2;
          fseek (fpr, mid, 0);
          fgets (tmpbuf,MaxDBLineLen, fpr); /* skip to end of damaged record */
          mid = ftell (fpr);
          fgets (tmpbuf, MaxDBLineLen, fpr);
          tmpbuf [strlen (tmpbuf) - 1] = 0;
          if (!(sts = strncmp (number + len - 4, tmpbuf, 4))) {
                 ptr = tmpbuf + 4;
                 while (*ptr && isspace (*ptr)) {
                         ptr++;
                 }
                 strcpy (result, ptr);
                 fclose (fpr);
                 return (1);
         } else {
                 if (sts < 0) {
                       top = mid;
                 } else {
                       bot = mid;
                 }
         }
  }
  fseek (fpr, bot, 0);                    /* rewind to last "bot" position */
  while (!feof (fpr) && fgets (tmpbuf, MaxDBLineLen, fpr) != NULL) {
         tmpbuf [strlen (tmpbuf) - 1] = 0;
         if (!(sts = strncmp (number + len - 4, tmpbuf, 4))) {
               ptr = tmpbuf + 4;
               while (*ptr && isspace (*ptr)) {
                       ptr++;
               }
               strcpy (result, ptr);
               fclose (fpr);
               return (1);
        }
        if (sts < 0) {                  /* exceeded number, must not be there */
               return (0);
        }
   }
   fclose (fpr);
   return (0);
}
/* converts "num" to a digit-tree "path" */
createDigitTreeName (num, path)
char      *num;
char      *path;
{
          int     len;
          char    *ptr;
          *path = 0;
          sprintf (path, "%s/", root);
          ptr = &path [strlen (path)];
          len = strlen (num);
          while (len > 4) {
               *ptr++ = *num++;
               *ptr++ = '/';
               *ptr = 0;
               len--;
         }
         --ptr;
         *ptr = 0;
}