Selecting the right tool for the job is always important, whether you are a carpenter, mechanic, or programmer. Too often, however, programmers choose algorithms for the wrong reasons--selecting a Quick sort because they believe it's always the fastest (not true) or using a binary search because they heard it always makes the fewest comparisons when finding elements in a sorted array (also not true). Never choose an algorithm because of its popularity. Depending on your application's requirements, a less-well-known method may be faster or more efficient.
On the other hand, it's human nature to be taken in by claims of superiority, as I discovered while searching for a tool of another variety--I'm talking oil-filter wrenches, now, not algorithms. You see, I need to regularly change the oil and filter in the diesel engine on board my home and sailboat, but it took three tries to find a wrench that would properly unscrew the filter can. The first tool I purchased, the most popular design, came with a band of steel attached to a vice grip that dented the filter case with only minimal pressure. The next sported a plastic strap and the written promise that "one size fits all." Imagine the raw holding power of plastic on a greasy canister, and it's not hard to understand why this filter wrench of the future could never work as advertised. (Products like these make me question whether tool manufacturers ever try their own wares. I often wonder the same about software vendors.) Finally, while poking around in a mechanic's tool chest, I found a homemade pipe, fitted for a socket wrench, with a rough leather strap that grabbed the filter the first time. Later, I bought one from the mechanic. This just goes to show that you should never choose tools based on their popularity or advertising claims. It's often the unlikely junk in the bottom of the drawer that works best.
The same is true of algorithms. For instance, in dusting off an old program that I use to prepare Pascal listings for publication, I wondered whether a binary search was the best way to look up entries in a sorted list of keywords--the critical code in this application that parses Pascal programs and converts keywords to lower case, optionally delimited for boldfacing in a word processor. I knew that a binary search makes only log N+1 comparisons, where N is the number of words in the array. Finding an entry in a list of 100 keywords, then, requires a maximum of six comparisons, which I wrongly assumed to be the best results I could expect.
To improve the program's speed, I considered using a hash function or a binary tree to search for keywords, but then I realized that, once again, I had been searching for a search engine for the wrong reasons. Most strings in a program listing are not keywords, so my program's speed was more dependent on how fast a word was not found than it was on the speed of a successful search. In other words, I needed a method that failed faster than the competition. Once I came to that realization, I found a way to boost my program's run-time speed by 20 percent. The algorithm that I chose, called a trie search--after "information reTRIEval"--is no faster on the average than a binary or hash function, but it requires a maximum of N comparisons--where N is the number of words beginning with the same letter--to determine that a word is not in the table. In practice, most failed searches take only one or two such comparisons--many take none--far better than required by a binary search, which tends to make the maximum number of comparisons for unrecognized words. By selecting the right tool for the job, taking into consideration the fact that most searches could be expected to fail, I increased my program's speed by using a less-popular, but better-suited, search algorithm.
A classic trie-search algorithm relies on a table arranged as illustrated in Figure 1. The figure shows only a portion of a complete table, indexed in the first column from A to Z. You could also index the table using other character sets--a standard ASCII trie table, for example, might have 127 rows. Each element in the index contains the number of another array that stores the table's words. The table entries might directly store data, or they could contain pointers--the exact format of the table depends on your program's requirements and the programming language you are using. A zero or null entry in a column indicates there are no words beginning with that letter. There are no Pascal keywords beginning with H, Y, or Z, so those entries are set to zero. (I'm using Borland Pascal's keywords here.)
As you can tell from Figure 1, a program can use a trie-search table to quickly determine whether a search argument is not a key word. In fact, no string comparisons at all are required for entries beginning with H, Y, or Z. Only one comparison is needed to find words beginning with B. To achieve the same results using a binary search requires up to six comparisons for negative searches of Borland Pascal's 57 keywords. In other applications with larger tables, you could extend the algorithm to use two or more tables indexed on a word's successive letters. The trick is to minimize the number of full comparisons required to find words in the table or to determine their absence. Once you've structured the table, the rest is easy.
One problem, however, is evident from Figure 1. Many table slots are empty, wasting space. To minimize memory use, you can instead construct the table as a sparse matrix, as illustrated in Figure 2. Now the first column becomes an array of pointers, each of which addresses a list of words beginning with the same letter. (The table could be compressed somewhat by deleting the first letter of each word.) Entries with no words are null. As in the classic table, you could extend the sparse matrix by building other indexes for subsequent letters in each word. Carrying that idea to the extreme reduces the trie table to a digital list--that is, a binary tree of letters, with paths forming the table's words. Small tables such as the one shown here, however, work just as well with a single-level index.
Example 1 is pseudocode for Algorithm #18, Trie Search. The algorithm simply looks up an input argument's first letter in the index, then searches the linked list for a match. Only a single exact-match string comparison is needed inside the inner loop--the key ingredient of this method's speed. A binary search requires alphabetical less-than or greater-than comparisons, further slowing searches for arguments not found.
Listings One, Two, and Three show the source code for my Pascal Parser, IDENT.PAS. SEARCH.PAS, the Pascal unit in Listing One (page 143), implements the trie-search algorithm. Keyword lists are composed of linked records of type ResWordRec. The global Index array corresponds to the Index column in Figure 2. Procedures AddList and AddWord build the trie-search tables--you can use these procedures to construct a trie-search engine for any list of words, but the words must be inserted in alphabetical order (see function Initialize). Function IsReserved determines whether a given word, passed as argument Ident, is a member of the table.
The other two listings, COMMON.PAS (Listing Two, page 143) and IDENT.PAS (Listing Three, page 143), use the trie-search engine to parse a Pascal listing. The program converts to lower case all keywords in a Pascal source file, and also optionally capitalizes all non-keywords (specify option--c). Use option--b to add <* and *> delimiters to keywords. The word begin, for example, is translated to <*begin*>. Use the --b option only on a copy of a source file--after conversion, the file will no longer compile. (You can restore the original text by deleting all instances of <* and *>.) I use WINWORD.MAC (Listing Four, page 145) in Word for Windows to convert delimited words to boldface after inserting a listing into a document. You could probably whip up a similar macro for other word processors.
Next month, more algorithms. Meanwhile, send your favorite algorithms and tools to me in care of DDJ--software tools, that is.
Copyright © 1994, Dr. Dobb's JournalFast Failures
Trie-Search Algorithm
Pascal Parser
Your Turn
Figure 1: Classic trie-search table.
[1] [2] [3] [4] [5] [6]
[a] 2 and begin far goto xor
[b] 3 array 0 file 0 0
:. -- asm -- for -- --
[f] 4 0 -- function -- --
[g] 5 -- -- 0 -- --
[h] 0 -- -- -- -- --
:. -- -- -- -- -- --
[x] 6 -- -- -- -- --
[y] 0 -- -- -- -- --
[z] 0 -- -- -- -- -- Figure 2: Trie table converted to a sparse matrix.
Example 1: Pseudocode for Algorithm #18 (trie search).
input
Arg: String;
var
P: Pointer;
begin
P = Index[Arg[1]];
while(P <> nil) do
begin
if P^.Word = Arg then
return True;
P = P^.Next;
end;
return False;
end;
[LISTING ONE]
(* ----------------------------------------------------------- *(
** search.pas -- Search engine for IDENT program **
** Trie search algorithm **
** Copyright (c) 1994 by Tom Swan. All rights reserved. **
)* ----------------------------------------------------------- *)
unit Search;
INTERFACE
uses Common;
{ Return true if Ident is a Turbo Pascal reserved word }
function IsReserved(Ident: IdentStr): Boolean;
IMPLEMENTATION
type
ResWord = String[14];
PResWordRec = ^ResWordRec;
ResWordRec = record
Word: ResWord; { Reserved word string }
Next: PResWordRec; { List link field }
end;
var
Index: array['a' .. 'z'] of PResWordRec;
{ Add word W to list at P }
procedure AddList(var P: PResWordRec; var W: ResWord);
begin
if (P <> nil) then
AddList(P^.Next, W)
else begin
P := new(PResWordRec);
if (P = nil) then
begin
Writeln('Out of memory');
Halt;
end;
P^.Word := W;
P^.Next := nil
end
end;
{ Add word W to global Index }
procedure AddWord(W: ResWord);
begin
if Length(W) = 0 then exit;
AddList(Index[W[1]], W)
end;
{ Initialize search engine variables }
procedure Initialize;
var
C: Char; { Index[] array index }
begin
for C := 'a' to 'z' do
Index[C] := nil;
AddWord('and');
AddWord('array');
AddWord('asm');
AddWord('begin');
AddWord('case');
AddWord('const');
AddWord('constructor');
AddWord('destructor');
AddWord('div');
AddWord('do');
AddWord('downto');
AddWord('else');
AddWord('end');
AddWord('export');
AddWord('exports');
AddWord('far');
AddWord('file');
AddWord('for');
AddWord('function');
AddWord('goto');
AddWord('if');
AddWord('implementation');
AddWord('in');
AddWord('inherited');
AddWord('inline');
AddWord('interface');
AddWord('label');
AddWord('library');
AddWord('mod');
AddWord('near');
AddWord('nil');
AddWord('not');
AddWord('object');
AddWord('of');
AddWord('or');
AddWord('packed');
AddWord('private');
AddWord('procedure');
AddWord('program');
AddWord('public');
AddWord('record');
AddWord('repeat');
AddWord('set');
AddWord('shl');
AddWord('shr');
AddWord('string');
AddWord('then');
AddWord('to');
AddWord('type');
AddWord('unit');
AddWord('until');
AddWord('uses');
AddWord('var');
AddWord('virtual');
AddWord('while');
AddWord('with');
AddWord('xor');
end;
{ Trie search algorithm }
function IsReserved(Ident: IdentStr): Boolean;
var
P: PResWordRec;
begin
IsReserved := false;
if Length(Ident) = 0 then exit;
DownCase(Ident);
P := Index[Ident[1]];
while(P <> nil) do
begin
if P^.Word = Ident then
begin
IsReserved := true;
exit
end;
P := P^.Next
end
end;
begin
Initialize;
end.
[LISTING TWO]
(* ----------------------------------------------------------- *(
** common.pas -- Various constants, types, and subroutines **
** Copyright (c) 1994 by Tom Swan. All rights reserved. **
)* ----------------------------------------------------------- *)
unit Common;
INTERFACE
const
identStrLen = 64;
digitSet = ['0' .. '9'];
upperSet = ['A' .. 'Z'];
lowerSet = ['a' .. 'z'];
alphaSet = upperSet + lowerSet;
identSet = alphaSet + digitSet + ['_'];
type
IdentStr = String[identStrLen];
{ Return lowercase equivalent of Ch }
function DnCase(Ch: Char): Char;
{ Convert all letters in identifier to lowercase }
procedure DownCase(var Ident: IdentStr);
IMPLEMENTATION
{ Return lowercase equivalent of Ch }
function DnCase(Ch: Char): Char;
begin
if Ch in upperSet
then Ch := Chr(Ord(Ch) + 32);
DnCase := Ch
end;
{ Convert all letters in identifier to lowercase }
procedure DownCase(var Ident: IdentStr);
var
I: Integer;
begin
if Length(Ident) > 0 then
for I := 1 to Length(Ident) do
Ident[I] := DnCase(Ident[I])
end;
begin
end.
[LISTING THREE]
(* ------------------------------------------------------------*(
** ident.pas -- Convert key word identifiers in .PAS files. **
** Converts key words in Pascal listings to lowercase, and **
** marks them for bold facing. Words are marked using the **
** symbols <* and *>. For example, <*begin*> is interpreted as **
** a bold faced "begin" key word. A word-processor macro could **
** search for all <* and *> symbols in the resulting file and **
** replace these with bold face on and off commands. **
** Copyright (c) 1994 by Tom Swan. All rights reserved. **
)* ------------------------------------------------------------*)
{$X+} { Enable "extended" syntax }
program Ident;
uses Dos, Common, Search;
const
bakExt = '.BAK'; { Backup file extension }
tempExt = '.$$$'; { Temporary file extension }
type
PString = ^String;
PListRec = ^TListRec;
TListRec = record
Path: PString;
Next: PListRec
end;
TState = (
Reading, Chkcomment, Comment1, Comment2, Stopcomment,
Stringing, Converting
);
var
FileSpec: ComStr; { Files entered on command line }
Root: PListRec; { File name list root pointer }
DelimitWords: Boolean; { True to add <* and *> to reserved words }
CapIdentifiers: Boolean; { True to capitalize non-keywords }
{ Return copy of a string }
function NewStr(S: String): PString;
var
P: PString;
begin
GetMem(P, Length(S) + 1);
if (P <> nil) then
PString(P)^ := S;
NewStr := P
end;
{ Return true if InF is successfully converted to OutF }
function ConvertIdents(var InF, OutF: Text): Boolean;
var
Ch, PushedCh: Char;
State: TState;
Identifier : IdentStr;
function GetCh(var C: Char): Char;
begin
if PushedCh <> #0 then
begin
C := PushedCh;
PushedCh := #0
end else
Read(InF, C);
if (C = #13) or (C = #10) then
begin
if (C = #13) then
Writeln(OutF); { Start new line }
C := #0 { Ignore new line characters }
end;
GetCh := C
end;
procedure UngetCh(Ch: Char);
begin
PushedCh := Ch
end;
procedure PutCh(Ch: Char);
begin
if Ch <> #0 then
Write(OutF, Ch)
end;
begin
PushedCh := #0; { No pushed character }
State := Reading;
while not eof(InF) do
begin
GetCh(Ch);
case State of
Reading:
begin
case Ch of
'(' : State := Chkcomment;
'{' : State := Comment1;
'''' : State := Stringing;
end;
if Ch in alphaSet then
begin
UngetCh(Ch);
State := Converting
end else
PutCh(Ch)
end;
Chkcomment:
if Ch = '*' then
begin
PutCh(Ch);
State := Comment2
end else begin
UngetCh(Ch);
State := Reading
end;
Comment1:
begin
PutCh(Ch);
if Ch = '}' then
State := Reading
end;
Comment2:
begin
PutCh(Ch);
if Ch = '*' then
State := Stopcomment
end;
Stopcomment:
begin
PutCh(Ch);
if Ch = ')' then
State := Reading
else
State := Comment2;
end;
Stringing:
begin
PutCh(Ch);
if Ch = '''' then
State := Reading;
end;
Converting:
begin
Identifier := '';
while Ch in identSet do
begin
Identifier := Identifier + Ch;
Read(InF, Ch) { Note: Don't call GetCh here! }
end;
if IsReserved(Identifier) then
begin
DownCase(Identifier);
if DelimitWords then
Identifier := '<*' + Identifier + '*>'
end else
if CapIdentifiers and (Length(Identifier) > 0) then
Identifier[1] := UpCase(Identifier[1]);
Write(OutF, Identifier);
UngetCh(Ch);
State := Reading
end
end
end;
if PushedCh <> #0 then { Write possible pushed last char that }
PutCh(Ch); { sets eof() to true. }
ConvertIdents := true
end;
{ Convert one file specified in Path string }
procedure ConvertOneFile(Path: PathStr);
var
Result: Integer;
BakF, InF, OutF: Text;
TempName, BakName: PathStr;
Name: NameStr;
Dir: DirStr;
Ext: ExtStr;
begin
Write(Path);
Assign(InF, Path);
{$i-} Reset(InF); {$i+}
if IoResult <> 0 then
Writeln(' **Error opening file')
else begin
FSplit(Path, Dir, Name, Ext);
TempName := Dir + Name + tempExt;
BakName := Dir + Name + bakExt;
Assign(OutF, TempName);
{$i-} Rewrite(OutF); {$i+}
if IoResult <> 0 then
Writeln(' **Error creating output file')
else begin
if ConvertIdents(InF, OutF) then
begin
Close(InF);
Close(OutF);
Assign(BakF, BakName);
{$i-}
Erase(BakF);
Result := IoResult; { Throw out IoResult }
Rename(InF, BakName);
Rename(OutF, Path);
{$i+}
if IoResult <> 0 then
Writeln(' **Error renaming files')
else
Writeln(' done')
end else
Writeln(' **Error processing files')
end
end
end;
{ Convert files on global list at Root pointer }
procedure ConvertFiles(List: PListRec);
begin
if List = nil then
Writeln('No files specified')
else
while List <> nil do
begin
ConvertOneFile(List^.Path^);
List := List^.Next
end
end;
{ Add file path to list }
procedure ListFile(var List: PListRec; Path: PathStr);
var
P: PListRec;
begin
New(P);
P^.Next := List;
P^.Path := NewStr(Path);
if P^.Path = nil then
Dispose(P)
else
List := P
end;
{ Create list of file names from FileSpec string }
procedure ListFiles(var List: PListRec);
var
Sr: SearchRec; { Directory search record }
L: Integer; { Length of Dir string }
OldDir: DirStr; { Old directory upon entry to procedure }
Path: PathStr; { Expanded file specification with path info }
Dir: DirStr; { Directory component of Path }
Name: NameStr; { File name component of Path }
Ext: ExtStr; { File extension component of Path }
begin
GetDir(0, OldDir); { Save current path }
Path := FExpand(FileSpec); { Add path info to file spec }
FSplit(Path, Dir, Name, Ext); { Separate Path components }
L := Length(Dir); { Prepare to change directories }
if L > 0 then
begin
if (Dir[L] = '\') and (L > 1) and (Dir[L - 1] <> ':') then
Delete(Dir, L, 1); { Ensure that ChDir will work }
ChDir(Dir) { Change to location of file(s) }
end;
FindFirst(Path, 0, Sr); { Start file name search }
while DosError = 0 do { Continue while files found }
begin
Path := FExpand(Sr.Name); { Expand to full path name }
ListFile(List, Path); { Add path to list }
FindNext(Sr) { Search for the next file }
end;
ChDir(OldDir)
end;
{ Display instructions }
procedure Instruct;
begin
Writeln('Use -b option to surround reserved words with');
Writeln('<* and *> for bold-facing in a word processor.');
Writeln('Use -c option to capitalize non-keyword identifers.');
Writeln;
Writeln('WARNING: After conversion with -b, the listing will');
Writeln('not compile. Use -b ONLY on a copy of original files.');
Writeln;
Writeln('ex. IDENT single.pas');
Writeln(' IDENT -b one.pas two.pas');
Writeln(' IDENT wild??.pas -b *.pas')
end;
{ Main program initializations }
procedure Initialize;
begin
Writeln;
Writeln('IDENT -- (C) 1994 by Tom Swan');
Writeln('Converts Pascal reserved words to lowercase.');
Writeln;
Root := nil; { File name list is empty }
DelimitWords := false; { Normally do not add <* and *> to words }
CapIdentifiers := false { Normally do not capitalize other idents }
end;
{ Main program block }
var
I: Integer;
begin
Initialize;
if ParamCount = 0 then
Instruct
else for I := 1 to ParamCount do
begin
FileSpec := ParamStr(I);
if (FileSpec = '-b') or (FileSpec = '-B') then
DelimitWords := true
else if (FileSpec = '-c') or (FileSpec = '-C') then
CapIdentifiers := true
else begin
ListFiles(Root);
ConvertFiles(Root)
end
end
end.
[LISTING FOUR]
Sub MAIN
StartOfDocument
EditFind .Find = "<*", .WholeWord = 0, .MatchCase = 0, .Direction = 1, \
While EditFindFound()
EditClear
EditFind .Find = "*>", .WholeWord = 0, .MatchCase = 0, .Direction = 1, \
If Not EditFindFound() Then
Stop
End If
EditClear
WordLeft 1, 1
Bold 1
EditFind .Find = "<*", .WholeWord = 0, .MatchCase = 0, .Direction = 1, \
Wend
End Sub