ALGORITHM ALLEY

Searching for a Search Engine

Tom Swan

Selecting the right tool for the job is always important, whether you are a carpenter, mechanic, or programmer. Too often, however, programmers choose algorithms for the wrong reasons--selecting a Quick sort because they believe it's always the fastest (not true) or using a binary search because they heard it always makes the fewest comparisons when finding elements in a sorted array (also not true). Never choose an algorithm because of its popularity. Depending on your application's requirements, a less-well-known method may be faster or more efficient.

On the other hand, it's human nature to be taken in by claims of superiority, as I discovered while searching for a tool of another variety--I'm talking oil-filter wrenches, now, not algorithms. You see, I need to regularly change the oil and filter in the diesel engine on board my home and sailboat, but it took three tries to find a wrench that would properly unscrew the filter can. The first tool I purchased, the most popular design, came with a band of steel attached to a vice grip that dented the filter case with only minimal pressure. The next sported a plastic strap and the written promise that "one size fits all." Imagine the raw holding power of plastic on a greasy canister, and it's not hard to understand why this filter wrench of the future could never work as advertised. (Products like these make me question whether tool manufacturers ever try their own wares. I often wonder the same about software vendors.) Finally, while poking around in a mechanic's tool chest, I found a homemade pipe, fitted for a socket wrench, with a rough leather strap that grabbed the filter the first time. Later, I bought one from the mechanic. This just goes to show that you should never choose tools based on their popularity or advertising claims. It's often the unlikely junk in the bottom of the drawer that works best.

Fast Failures

The same is true of algorithms. For instance, in dusting off an old program that I use to prepare Pascal listings for publication, I wondered whether a binary search was the best way to look up entries in a sorted list of keywords--the critical code in this application that parses Pascal programs and converts keywords to lower case, optionally delimited for boldfacing in a word processor. I knew that a binary search makes only log N+1 comparisons, where N is the number of words in the array. Finding an entry in a list of 100 keywords, then, requires a maximum of six comparisons, which I wrongly assumed to be the best results I could expect.

To improve the program's speed, I considered using a hash function or a binary tree to search for keywords, but then I realized that, once again, I had been searching for a search engine for the wrong reasons. Most strings in a program listing are not keywords, so my program's speed was more dependent on how fast a word was not found than it was on the speed of a successful search. In other words, I needed a method that failed faster than the competition. Once I came to that realization, I found a way to boost my program's run-time speed by 20 percent. The algorithm that I chose, called a trie search--after "information reTRIEval"--is no faster on the average than a binary or hash function, but it requires a maximum of N comparisons--where N is the number of words beginning with the same letter--to determine that a word is not in the table. In practice, most failed searches take only one or two such comparisons--many take none--far better than required by a binary search, which tends to make the maximum number of comparisons for unrecognized words. By selecting the right tool for the job, taking into consideration the fact that most searches could be expected to fail, I increased my program's speed by using a less-popular, but better-suited, search algorithm.

A classic trie-search algorithm relies on a table arranged as illustrated in Figure 1. The figure shows only a portion of a complete table, indexed in the first column from A to Z. You could also index the table using other character sets--a standard ASCII trie table, for example, might have 127 rows. Each element in the index contains the number of another array that stores the table's words. The table entries might directly store data, or they could contain pointers--the exact format of the table depends on your program's requirements and the programming language you are using. A zero or null entry in a column indicates there are no words beginning with that letter. There are no Pascal keywords beginning with H, Y, or Z, so those entries are set to zero. (I'm using Borland Pascal's keywords here.)

As you can tell from Figure 1, a program can use a trie-search table to quickly determine whether a search argument is not a key word. In fact, no string comparisons at all are required for entries beginning with H, Y, or Z. Only one comparison is needed to find words beginning with B. To achieve the same results using a binary search requires up to six comparisons for negative searches of Borland Pascal's 57 keywords. In other applications with larger tables, you could extend the algorithm to use two or more tables indexed on a word's successive letters. The trick is to minimize the number of full comparisons required to find words in the table or to determine their absence. Once you've structured the table, the rest is easy.

One problem, however, is evident from Figure 1. Many table slots are empty, wasting space. To minimize memory use, you can instead construct the table as a sparse matrix, as illustrated in Figure 2. Now the first column becomes an array of pointers, each of which addresses a list of words beginning with the same letter. (The table could be compressed somewhat by deleting the first letter of each word.) Entries with no words are null. As in the classic table, you could extend the sparse matrix by building other indexes for subsequent letters in each word. Carrying that idea to the extreme reduces the trie table to a digital list--that is, a binary tree of letters, with paths forming the table's words. Small tables such as the one shown here, however, work just as well with a single-level index.

Example 1 is pseudocode for Algorithm #18, Trie Search. The algorithm simply looks up an input argument's first letter in the index, then searches the linked list for a match. Only a single exact-match string comparison is needed inside the inner loop--the key ingredient of this method's speed. A binary search requires alphabetical less-than or greater-than comparisons, further slowing searches for arguments not found.

Pascal Parser

Listings One, Two, and Three show the source code for my Pascal Parser, IDENT.PAS. SEARCH.PAS, the Pascal unit in Listing One (page 143), implements the trie-search algorithm. Keyword lists are composed of linked records of type ResWordRec. The global Index array corresponds to the Index column in Figure 2. Procedures AddList and AddWord build the trie-search tables--you can use these procedures to construct a trie-search engine for any list of words, but the words must be inserted in alphabetical order (see function Initialize). Function IsReserved determines whether a given word, passed as argument Ident, is a member of the table.

The other two listings, COMMON.PAS (Listing Two, page 143) and IDENT.PAS (Listing Three, page 143), use the trie-search engine to parse a Pascal listing. The program converts to lower case all keywords in a Pascal source file, and also optionally capitalizes all non-keywords (specify option--c). Use option--b to add <* and *> delimiters to keywords. The word begin, for example, is translated to <*begin*>. Use the --b option only on a copy of a source file--after conversion, the file will no longer compile. (You can restore the original text by deleting all instances of <* and *>.) I use WINWORD.MAC (Listing Four, page 145) in Word for Windows to convert delimited words to boldface after inserting a listing into a document. You could probably whip up a similar macro for other word processors.

Your Turn

Next month, more algorithms. Meanwhile, send your favorite algorithms and tools to me in care of DDJ--software tools, that is.

Figure 1: Classic trie-search table.

       [1]    [2]     [3]    [4]    [5]     [6]
[a]     2     and    begin   far    goto    xor
[b]     3     array    0     file     0      0
:.     --     asm     --     for     --     --
[f]     4     0       --   function  --     --
[g]     5     --      --      0      --     --
[h]     0     --      --      --     --     --
:.     --     --      --      --     --     --
[x]     6     --      --      --     --     --
[y]     0     --      --      --     --     --
[z]     0     --      --      --     --     --

Figure 2: Trie table converted to a sparse matrix.

Example 1: Pseudocode for Algorithm #18 (trie search).

input
     Arg: String;
var
     P: Pointer;
begin
     P = Index[Arg[1]];
     while(P <> nil) do
     begin
          if P^.Word = Arg then
               return True;
          P = P^.Next;
     end;
     return False;
end;

[LISTING ONE]


(* ----------------------------------------------------------- *(
** search.pas -- Search engine for IDENT program               **
** Trie search algorithm                                       **
** Copyright (c) 1994 by Tom Swan. All rights reserved.        **
)* ----------------------------------------------------------- *)

unit Search;
INTERFACE
uses Common;

{ Return true if Ident is a Turbo Pascal reserved word }
function IsReserved(Ident: IdentStr): Boolean;
IMPLEMENTATION
type
  ResWord = String[14];
  PResWordRec = ^ResWordRec;
  ResWordRec = record
    Word: ResWord;      { Reserved word string }
    Next: PResWordRec;  { List link field }
  end;

var
  Index: array['a' .. 'z'] of PResWordRec;
{ Add word W to list at P }
procedure AddList(var P: PResWordRec; var W: ResWord);
begin
  if (P <> nil) then
    AddList(P^.Next, W)
  else begin
    P := new(PResWordRec);
    if (P = nil) then
    begin
      Writeln('Out of memory');
      Halt;
    end;
    P^.Word := W;
    P^.Next := nil
  end
end;

{ Add word W to global Index }
procedure AddWord(W: ResWord);
begin
  if Length(W) = 0 then exit;
  AddList(Index[W[1]], W)
end;

{ Initialize search engine variables }
procedure Initialize;
var
  C: Char;  { Index[] array index }
begin
  for C := 'a' to 'z' do
    Index[C] := nil;
  AddWord('and');
  AddWord('array');
  AddWord('asm');
  AddWord('begin');
  AddWord('case');
  AddWord('const');
  AddWord('constructor');
  AddWord('destructor');
  AddWord('div');
  AddWord('do');
  AddWord('downto');
  AddWord('else');
  AddWord('end');
  AddWord('export');
  AddWord('exports');
  AddWord('far');
  AddWord('file');
  AddWord('for');
  AddWord('function');
  AddWord('goto');
  AddWord('if');
  AddWord('implementation');
  AddWord('in');
  AddWord('inherited');
  AddWord('inline');
  AddWord('interface');
  AddWord('label');
  AddWord('library');
  AddWord('mod');
  AddWord('near');
  AddWord('nil');
  AddWord('not');
  AddWord('object');
  AddWord('of');
  AddWord('or');
  AddWord('packed');
  AddWord('private');
  AddWord('procedure');
  AddWord('program');
  AddWord('public');
  AddWord('record');
  AddWord('repeat');
  AddWord('set');
  AddWord('shl');
  AddWord('shr');
  AddWord('string');
  AddWord('then');
  AddWord('to');
  AddWord('type');
  AddWord('unit');
  AddWord('until');
  AddWord('uses');
  AddWord('var');
  AddWord('virtual');
  AddWord('while');
  AddWord('with');
  AddWord('xor');
end;

{ Trie search algorithm }
function IsReserved(Ident: IdentStr): Boolean;
var
  P: PResWordRec;
begin
  IsReserved := false;
  if Length(Ident) = 0 then exit;
  DownCase(Ident);
  P := Index[Ident[1]];
  while(P <> nil) do
  begin
    if P^.Word = Ident then
    begin
      IsReserved := true;
      exit
    end;
    P := P^.Next
  end
end;

begin
  Initialize;
end.

[LISTING TWO]



(* ----------------------------------------------------------- *(
** common.pas -- Various constants, types, and subroutines     **
** Copyright (c) 1994 by Tom Swan. All rights reserved.        **
)* ----------------------------------------------------------- *)
unit Common;
INTERFACE
const
  identStrLen = 64;
  digitSet = ['0' .. '9'];
  upperSet = ['A' .. 'Z'];
  lowerSet = ['a' .. 'z'];
  alphaSet = upperSet + lowerSet;
  identSet = alphaSet + digitSet + ['_'];
type
  IdentStr = String[identStrLen];
{ Return lowercase equivalent of Ch }
function DnCase(Ch: Char): Char;
{ Convert all letters in identifier to lowercase }
procedure DownCase(var Ident: IdentStr);
IMPLEMENTATION
{ Return lowercase equivalent of Ch }
function DnCase(Ch: Char): Char;
begin
  if Ch in upperSet
    then Ch := Chr(Ord(Ch) + 32);
  DnCase := Ch
end;

{ Convert all letters in identifier to lowercase }
procedure DownCase(var Ident: IdentStr);
var
  I: Integer;

begin
  if Length(Ident) > 0 then
    for I := 1 to Length(Ident) do
      Ident[I] := DnCase(Ident[I])
end;

begin
end.

[LISTING THREE]



(* ------------------------------------------------------------*(
** ident.pas -- Convert key word identifiers in .PAS files.    **
** Converts key words in Pascal listings to lowercase, and     **
** marks them for bold facing. Words are marked using the      **
** symbols <* and *>. For example, <*begin*> is interpreted as **
** a bold faced "begin" key word. A word-processor macro could **
** search for all <* and *> symbols in the resulting file and  **
** replace these with bold face on and off commands.           **
** Copyright (c) 1994 by Tom Swan. All rights reserved.        **
)* ------------------------------------------------------------*)

{$X+}  { Enable "extended" syntax }
program Ident;
uses Dos, Common, Search;
const
  bakExt  = '.BAK';   { Backup file extension }
  tempExt = '.$$$';   { Temporary file extension }
type
  PString = ^String;
  PListRec = ^TListRec;
  TListRec = record
    Path: PString;
    Next: PListRec
  end;
  TState = (
    Reading, Chkcomment, Comment1, Comment2, Stopcomment,
    Stringing, Converting
  );
var
  FileSpec: ComStr;         { Files entered on command line }
  Root: PListRec;           { File name list root pointer }
  DelimitWords: Boolean;    { True to add <* and *> to reserved words }
  CapIdentifiers: Boolean;  { True to capitalize non-keywords }
{ Return copy of a string }
function NewStr(S: String): PString;
var
  P: PString;
begin
  GetMem(P, Length(S) + 1);
  if (P <> nil) then
    PString(P)^ := S;
  NewStr := P
end;
{ Return true if InF is successfully converted to OutF }
function ConvertIdents(var InF, OutF: Text): Boolean;
var
  Ch, PushedCh: Char;
  State: TState;
  Identifier : IdentStr;
  function GetCh(var C: Char): Char;
  begin
    if PushedCh <> #0 then
    begin
      C := PushedCh;
      PushedCh := #0
    end else
      Read(InF, C);
    if (C = #13) or (C = #10) then
    begin
      if (C = #13) then
        Writeln(OutF);  { Start new line }
      C := #0           { Ignore new line characters }
    end;
    GetCh := C
  end;
  procedure UngetCh(Ch: Char);
  begin
    PushedCh := Ch
  end;
  procedure PutCh(Ch: Char);
  begin
    if Ch <> #0 then
      Write(OutF, Ch)
  end;

begin
  PushedCh := #0;     { No pushed character }
  State := Reading;
  while not eof(InF) do
  begin
    GetCh(Ch);
    case State of
      Reading:
      begin
        case Ch of
          '('  : State := Chkcomment;
          '{'  : State := Comment1;
          '''' : State := Stringing;
        end;
        if Ch in alphaSet then
        begin
          UngetCh(Ch);
          State := Converting
        end else
          PutCh(Ch)
      end;
      Chkcomment:
        if Ch = '*' then
        begin
          PutCh(Ch);
          State := Comment2
        end else begin
          UngetCh(Ch);
          State := Reading
        end;
      Comment1:
      begin
        PutCh(Ch);
        if Ch = '}' then
          State := Reading
      end;

      Comment2:
      begin
        PutCh(Ch);
        if Ch = '*' then
          State := Stopcomment
      end;
      Stopcomment:
      begin
        PutCh(Ch);
        if Ch = ')' then
          State := Reading
        else
          State := Comment2;
      end;

      Stringing:
      begin
        PutCh(Ch);
        if Ch = '''' then
          State := Reading;
      end;

      Converting:
      begin
        Identifier := '';
        while Ch in identSet do
        begin
          Identifier := Identifier + Ch;
          Read(InF, Ch)  { Note: Don't call GetCh here! }
        end;
        if IsReserved(Identifier) then
        begin
          DownCase(Identifier);
          if DelimitWords then
            Identifier := '<*' + Identifier + '*>'
        end else
        if CapIdentifiers and (Length(Identifier) > 0) then
          Identifier[1] := UpCase(Identifier[1]);
        Write(OutF, Identifier);
        UngetCh(Ch);
        State := Reading
      end
    end
  end;
  if PushedCh <> #0 then  { Write possible pushed last char that }
    PutCh(Ch);            {  sets eof() to true. }
  ConvertIdents := true
end;

{ Convert one file specified in Path string }
procedure ConvertOneFile(Path: PathStr);
var
  Result: Integer;
  BakF, InF, OutF: Text;
  TempName, BakName: PathStr;
  Name: NameStr;
  Dir: DirStr;
  Ext: ExtStr;
begin
  Write(Path);
  Assign(InF, Path);
  {$i-} Reset(InF); {$i+}
  if IoResult <> 0 then
    Writeln(' **Error opening file')
  else begin
    FSplit(Path, Dir, Name, Ext);
    TempName := Dir + Name + tempExt;
    BakName := Dir + Name + bakExt;
    Assign(OutF, TempName);
    {$i-} Rewrite(OutF); {$i+}
    if IoResult <> 0 then
      Writeln(' **Error creating output file')
    else begin
      if ConvertIdents(InF, OutF) then
      begin
        Close(InF);
        Close(OutF);
        Assign(BakF, BakName);
        {$i-}
        Erase(BakF);
        Result := IoResult;      { Throw out IoResult }
        Rename(InF, BakName);
        Rename(OutF, Path);
        {$i+}
        if IoResult <> 0 then
          Writeln(' **Error renaming files')
        else
          Writeln(' done')
      end else
        Writeln(' **Error processing files')
    end
  end
end;

{ Convert files on global list at Root pointer }
procedure ConvertFiles(List: PListRec);
begin
  if List = nil then
    Writeln('No files specified')
  else
    while List <> nil do
    begin
      ConvertOneFile(List^.Path^);
      List := List^.Next
    end
end;

{ Add file path to list }
procedure ListFile(var List: PListRec; Path: PathStr);
var
  P: PListRec;
begin
  New(P);
  P^.Next := List;
  P^.Path := NewStr(Path);
  if P^.Path = nil then
    Dispose(P)
  else
    List := P
end;

{ Create list of file names from FileSpec string }
procedure ListFiles(var List: PListRec);
var
  Sr: SearchRec;        { Directory search record }
  L: Integer;           { Length of Dir string }
  OldDir: DirStr;       { Old directory upon entry to procedure }
  Path: PathStr;        { Expanded file specification with path info }
  Dir: DirStr;          { Directory component of Path }
  Name: NameStr;        { File name component of Path }
  Ext: ExtStr;          { File extension component of Path }
begin
  GetDir(0, OldDir);             { Save current path }
  Path := FExpand(FileSpec);     { Add path info to file spec }
  FSplit(Path, Dir, Name, Ext);  { Separate Path components }
  L := Length(Dir);              { Prepare to change directories }
  if L > 0 then
  begin
    if (Dir[L] = '\') and (L > 1) and (Dir[L - 1] <> ':') then
      Delete(Dir, L, 1); { Ensure that ChDir will work }
    ChDir(Dir)           { Change to location of file(s) }
  end;
  FindFirst(Path, 0, Sr);        { Start file name search }
  while DosError = 0 do          { Continue while files found }
  begin
    Path := FExpand(Sr.Name);    { Expand to full path name }
    ListFile(List, Path);        { Add path to list }
    FindNext(Sr)                 { Search for the next file }
  end;
  ChDir(OldDir)
end;

{ Display instructions }
procedure Instruct;
begin
  Writeln('Use -b option to surround reserved words with');
  Writeln('<* and *> for bold-facing in a word processor.');
  Writeln('Use -c option to capitalize non-keyword identifers.');
  Writeln;
  Writeln('WARNING: After conversion with -b, the listing will');
  Writeln('not compile. Use -b ONLY on a copy of original files.');
  Writeln;
  Writeln('ex. IDENT single.pas');
  Writeln('    IDENT -b one.pas two.pas');
  Writeln('    IDENT wild??.pas -b *.pas')
end;

{ Main program initializations }
procedure Initialize;
begin
  Writeln;
  Writeln('IDENT -- (C) 1994 by Tom Swan');
  Writeln('Converts Pascal reserved words to lowercase.');
  Writeln;
  Root := nil;              { File name list is empty }
  DelimitWords := false;    { Normally do not add <* and *> to words }
  CapIdentifiers := false   { Normally do not capitalize other idents }
end;
{ Main program block }
var
  I: Integer;
begin
  Initialize;
  if ParamCount = 0 then
    Instruct
  else for I := 1 to ParamCount do
  begin
    FileSpec := ParamStr(I);
    if (FileSpec = '-b') or (FileSpec = '-B') then
      DelimitWords := true
    else if (FileSpec = '-c') or (FileSpec = '-C') then
      CapIdentifiers := true
    else begin
      ListFiles(Root);
      ConvertFiles(Root)
    end
  end
end.

[LISTING FOUR]



Sub MAIN
StartOfDocument
EditFind .Find = "<*", .WholeWord = 0, .MatchCase = 0, .Direction = 1, \
While EditFindFound()
 EditClear
 EditFind .Find = "*>", .WholeWord = 0, .MatchCase = 0, .Direction = 1, \
 If Not EditFindFound() Then
  Stop
 End If
 EditClear
 WordLeft 1, 1
 Bold 1
 EditFind .Find = "<*", .WholeWord = 0, .MatchCase = 0, .Direction = 1, \
Wend
End Sub