C PROGRAMMING

TEXTSRCH Connections

Al Stevens

This month we continues with the TEXTSRCH project, one that started two months ago. To build the entire project, you will need the express.c source file from December, the exinterp.c and textsrch.c files from January, and the several source files introduced this month.

TEXTSRCH is a text retrieval system that provides a concordance-like index into a text data base. You provide the files of text, and TEXTSRCH builds an inverted word index into the files. Later, when you go looking for some forgotten reference to something, you use the TEXTSRCH query language to compose a query, and TEXTSRCH finds and reports to you the files that match your search. I described the syntax of the query two months ago and developed its interpreter last month. This month we build index files and connect it all together.

TEXTSRCH consists of several widely used techniques. Besides its usefulness as a text management tool, it provides examples of hashing, binary trees, command-line parsing, expression parsing, infix and postfix expression notation, and expression interpreting.

Listing One, page 140, is textsrch.h, Version 3. We built the first version two months ago and added to it last month. Most of this month's additions are function prototypes for the code that follows, but there is one other notable addition. The MAXFILES variable will influence the size of your data base index. It specifies the maximum number of text files that a TEXTSRCH index can support. If you change this number, you will change the size of the bit map that the expression interpreter needs. Remember, the bit map has one bit per file in the data base. When the bit map size changes, so does the size of one of the index files. The file named "TEXT.NDX" includes, among other things, a copy of the bit map that corresponds to every unique word in the data base.

Before we get into the index code, let's look at two general-purpose functions that TEXTSRCH uses and that you might find useful in other projects.

Parsing the Command Line

Listing Two, page 140, is cmdline.c, a source file that contains the parse_cmdline function. The function processes a program's command-line parameters and is the only part of TEXTSRCH that is oriented to any specific architecture in that it works with MS-DOS and Turbo C. You can easily redo it for another compiler or a different architecture. Its purpose is to parse file name specifications from the command-line into discrete file names and to call a specified function to process each file. In addition, parse_cmdline will sense the presence of command-line option switches and set entries in an associated array to logical true and false states accordingly. TEXTSRCH does not use this last feature, but parse_cmdline supports it nonetheless.

When your program first begins, it calls parse_cmdline and passes the conventional argc and argv variables that are available as parameters to the main function. Your program also pass s the address of a 128-byte array of switches and the address of a function. The function, provided in your program, will process the files, one by one.

The option array contains the initial default setting for the command-line switches. This statement tests the array:

  
if (array('a'))   
...

If the command-line had +a on it, this test would be true. If the parameter was -a or if the default had a zero value in the 61st (ASCII assumed) position, the test would be false.

The function selects each file specification and calls the function addressed by the last parameter in the call to parse_cmdline. If you explicitly name files on the command-line, each of these is processed in turn. You can use wild cards or prefix a file name with the @ character to specify that the file names to be processed are recorded in a text file.

If you use wild cards, the files that match the ambiguous specification are processed in the order in which the directory scan finds them. I used the Turbo C findfirst and findnext functions for the directory scan. Microsoft C has similar but different ones named _dos_findfirst and _dos_findnext. Other compilers have their own versions of these functions or, at least, ways to call MS-DOS directly to use the MS-DOS functions that perform these operations. If you use a different operating system, then you must provide other ways to get the same result. If you use an operating system that does not support a directory scan or ambiguous file name specifications, then you must use the @ prefixed file of file names supported by parse_cmdline.

A Binary Tree

Listing Three, page 141, is bintree.c. It contains the functions to build and search a binary tree. The tree in question is used for what we call "noise" words, and I'll explain that feature soon, but first you need to consider the addtree and srchtree functions and binary trees in general.

A binary tree is a search data structure that facilitates fast searches of key values. Each entry in the tree contains the value to be searched and two pointers -- a pointer to the entries with values that collate higher than the current one, and a pointer to the entries that are lower. Even with a large number of entries, the number of compares needed to find a value is reasonably small.

The addtree function adds entries to the tree, which is empty at first. It allocates memory for the entry from the heap and then calls srchtree to see if the entry is already there or, if not, where it should be inserted.

After the tree is built, we use the srchtree function to tell us if a value is in the tree. In this implementation of the binary tree there are no other data values in an entry. We use the tree to determine the presence or absence of a value, nothing more.

Binary trees are most efficient when the entries are added to the tree in random sequence. If you were to build one from an ordered list of entries, the tree would have one long branch, and searches would be inefficient. There are tree balancing algorithms for binary trees, but we do not need one for our use here because we build the tree once, and our original order is random.

Noise Words

We use the binary tree for a list of noise words. Noise words are those words that are so common that we can assume that they appear in every file. The word "the" is a typical noise word. Because we can make that assumption, we can bypass such words when we build our index, thus saving index space. Likewise, we can bypass searching the index for such words during retrievals, thus saving time.

The build_noisewords function in bintree.c builds the binary tree from a text file that contains all the noise words. Listing Four, page 142, is noise.1st, a small file of noise words that I arbitrarily selected. You might want to review the list and modify it. The more words you can put into the noise word list, the more efficient your index will be.

Building the Index

Last month we stubbed the text search process by using the GREP utility program to process our queries. This month we toss that method aside and build an inverted index into our data base. To build the index we extract each word from each file in the data base. From the word we use a hashing algorithm to compute a random address. The random address points into SLOTS.NDX, a file of slots with one slot available for each possible random address. For this project we will assume a maximum of 64K words, so there will be 64K slots.

Each slot contains a long integer with the value -1 if the slot is not in use. If the slot is in use, it contains a character offset into a text index file, named TEXT.NDX. The offset points to the first character of a variable-length record in TEXT.NDX. The record contains the matching text word, a file bit map, and a record chain pointer.

The bit map in the TEXT.NDX record contains one bit for each text file in the data base. If a bit in the bit map is one, the word appears in the file that is associated with the bit.

The file named "FILELIST.NDX" contains the names of the text files in the data base. The bit offset in the bit map and the name offset in FILELIST.NDX are the same.

The chain pointer in a TEXT.NDX record manages those words that hash to common slots. When a word hashes to a slot that a previous word used, we have a collision. We will add a new record to the end of TEXT.NDX and write its character offset into the pointer in the record pointed to by the slot entry. This builds a chain. The slot points to the first word that hashed to the slot. The first word's record points to the second, and so on. Subsequent collisions lengthen the chain.

The code that builds and searches these files is contained in Listing Five, page 142, index.c.

Building the index is managed by the program that starts with Listing Six, page 144, bldindex.c. It calls the parse_cmdline function to extract file names from the specifications on the command line and to execute the index_file function for each file to be indexed. The index_file function opens the specified file and calls the extract_word function (explained next) until there are no more words in the file to be extracted. The srchtree function tells us if the word is a noise word recorded in the binary tree. If not, we add the word to the index by calling the addindex function, found in index.c.

Extracting Words

To build an index (and to build the noise word binary tree) we need a function that extracts each word from a text file and normalizes it into a form suitable for indexing. The extract_word function in text.c, Listing Seven, page 144, does just that. You pass it the FILE pointer of an open file and a character pointer where it will copy the next word, and it does the rest. The normalization process consists of skipping characters that are not considered to be text and then selecting all text characters up to the next nontext character. For my purposes, I select character groups that include alphabetic characters, digits, the underscore (_), the plus (+) sign, and the pound (#) sign. This allows me to index text words, references to C++, function names, and preprocessor directives. The extract_word function converts alphabetic characters to lowercase. Other applications will have different requirements, so the is-TEXT macro in text.c allows you to define your own select criteria.

The Hashing Algorithm

Hashing is when you compute a random number from a key argument. You can then use the number as a random address into a file where the information related to the argument exists. With hashing you start with the key, compute the address, and use the address to find the record, all in the blink of an eye. Or so it would seem. In practice there is a bit more to it.

We will hash a random number from each of the words in our text files and later from the words we use as arguments in our queries. The algorithm I chose is a loose adaptation of the one that appeared in Steve Heller's "Extensible Hashing" article in the November 1989 issue of DDJ. It adds the seven-bit ASCII value of each character in the word to the initially zero hash total and shifts the total seven bits to the left. This add-shift loop continues for each character in the word.

There are as many different hashing algorithms as there are data formats to be hashed. The idea is to compute a random number that is evenly distributed across a defined range of values. If you can have 100,000 inventory records in a data base, the algorithm must compute from the inventory control key a random number between 1 and 100,000. A common technique is to reduce the key to a numerical value and divide that value by the prime number closest to the range. The remainder of that division is used as the random number.

In our approach we have decided that the range goes to 64K. Therefore, the 16-bit integer that we compute from a text string should suffice as a random number. Further reduction should not be necessary. Our 64K limit approximates the number of different words we can expect to find in a text data base. Most works of prose will have far fewer than that number of unique words. Even if we build an index for some tome that exceeds that number, then our collision strategy ensures that every word has a place in the index.

The compute_hash function in index.c is our hashing algorithm.

TEXTSRCH Retrievals

We established our query language and interpreter in earlier installments. Now that we can build real indexes, we need to replace the stubs from last month with a real search process. Listing Eight, page 144, is search.c and it replaces last month's search.c stub program. The real thing is simpler than the stub because most of the work is now done in index.c. The search function calls srchtree to see if the word is in the noise word list. If so, the search function returns a bit map with all bits set to one because noise words are assumed to exist in all text files in the data base. If the word is not a noise word, the search function calls search_index in index.c to derive the bit map that says which files in the data base contain the word.

The search_index function calls compute_hash to develop the random address of the word's entry in SLOT.NDX. That entry contains the character offset to the first word that uses that slot. We read the offset and seek to that record location in TEXT.NDX. If the word in that record is the same as the one we are looking for, we have a hit. If not, and there are no further records chained to the slot, we have a miss. We navigate the chain if it exists and compare each word in it to the one we are searching for. If we get a hit, the bit map associated with the matching record is the one returned. Otherwise an all-zero bit map is returned.

The search process just described is for one word only. The expression interpreter from last month combines the searches for all the words in the expression with the Boolean operators and develops one bit map that represents the result of the full search.

The search.c file also contains the process_result function, which is no smarter than the one we stubbed last month. All it does is display on the console the name of the files that match the search criteria. What you will do with that list is up to you. See the discussion on Possible Enhancements for some thoughts on the matter.

Running TEXTSRCH

To build an index, get all your text files together and figure out how you are going to specify them on the command-line. If they are scattered throughout your subdirectories, you might want to make a list of them in a file. Use a text editor to make the file and put each file name on a separate line. These are the different commands for building an index:

bldindex *.txt
bldindex textfile.001 textfile.002 textfile.003 
bldindex @files.lst

You can mix these formats like this:

bldindex *.txt textfile.001 @files.lst

The file specifications can have paths.

If the files named SLOTS.NDX, TEXT.NDX, and FILELIST.NDX already exist, you will be adding to an existing index. Try not to add files that you already indexed because this practice results in unnecessary redundancy in your index files. If the files do not exist, you are building a new index.

Building an index takes a long time, so you might want to do it in small increments, a few files at a time. A power failure could make you start all over. The text files need to be accessible during the build phase, but can be elsewhere during retrievals. The retrieval process only identifies the file specifications as they existed when the index was built. Retrievals do not involve the text files themselves.

To perform a retrieval run the textsrch program from where the three .NDX files can be read. Enter query expressions as you did last month, and review the file lists that result. A null query expression terminates the program.

Possible Enhancements

TEXTSRCH has a lot of power and it has a lot of potential. As published here, it is a subset of a more powerful text processing system that I developed for engineering documentation applications. I use the version you see here with no enhancements for personal use, but there are several extensions that you might consider. Here are some of them.

Phrase Searches -- TEXTSRCH retrievals are based on word searches because the index is built from discrete words. You can approximate a phrase search by combining all the words in a phrase into a query expression with the AND Boolean operator. This retrieval would deliver the names of the files that contain all the words in the phrase, but it would not tell you if the phrase itself was in any, all, or none of the files. You could search each file for the specific phrase itself by using Boyer-Moore or a similar text-matching algorithm. This approach would require modification of the expression input function to recognize the difference between phrases and individual words. The process_result function must be extended to do the follow-up search of the text files themselves.

Wild Card Searches -- TEXTSRCH searches on a simple case-insensitive word pattern. You could add the "?" wild card by changing the search function in search.c to perform successive searches using every possible character where the wild card appeared.

Integrate with a Word Processor

The industrial system from which TEXTSRCH derives is integrated with a word processor. When the file list is presented to the user, he or she selects one, and the program fires up the word processor telling it to load the selected file and position itself to the first occurrence of the matching word or phrase. This implementation binds the two applications together and is beyond the scope of this column.

What Good Is It?

How might you use TEXTSRCH? I spend a lot of time reading and writing electronic mail and on-line service forum conversations. Sometimes I need to go back and find a message that mentioned something I want to recall. You know how it goes. Someone makes a comment about something that barely catches your attention. A few months later you are working on a new project and that comment would be helpful if only you could remember who made it and what they said. With TEXTSRCH indexes into my old mail I can usually find the messages that pertain to the subject in question.

I keep all my columns, articles, and book manuscripts stored away in TEXTSRCH data bases. That's right, I can't always remember what I wrote, when I wrote it, or where, and TEXTSRCH helps my failing memory (always the second thing to go).

I do another indexing trick with TEXTSRCH. I build dummy manuscripts that simulate the dozens of articles, manuals, and books I read every month. The dummy manuscripts contain nothing more than key words from the articles and books. The subsequent TEXTSRCH indexes help me find the hard copies where the material appears. Someday when books and articles are available in machine-readable media, or when I have a reliable OCR scanner, I'll be able to build the indexes directly and bypass the dummy manuscript process.

There is one other use that will be of interest to C programmers. If you put only the C keywords (int, long, typedef, and so on) into the noise word list and build an index from the source code for a huge project, you have in TEXTSRCH the beginnings of a handy cross reference of the functions and variables. It isn't worth the trouble if you are developing a new system, but if you have inherited one of those behemoth undocumented applications to maintain, TEXTSRCH might save you a lot of GREP time.

Book Review

Programming in C++ by Stephen C. Dewhurst and Kathy T. Stark. This book is a must for anyone who embarks on an intensive study of C++. It would help if you already had an exposure to C++, and knowledge of C is a prerequisite. I think I would have found this book a bit overwhelming a few months ago before I began to study C++. When read from the perspective of prior knowledge, however, this book is a gem. My one criticism is that it shares a characteristic found in most C++ and object-oriented literature. It uses fruit for examples. If I have to stomp through one more orchard of apple and orange objects, I fear I may be felled by the dreaded canker. To their credit, D&S don't dwell all that much on the ubiquitous citrus hierarchies, and other than for that one small lapse, their book is very good. To my surprise I found that the book has clear and concise explanations of procedural programming, top-down design, and structured programming, not OOP subjects at all. Most OOP books simply assume that you already know about these things. Well, you should, but if you need brief explanations of them written in a style that real people can understand, this book does the job.

And, bless them, D&S do not dwell on object-oriented programming as the be-all, end-all that we keep hearing about. No, they defer any detailed mention of OOP until Chapter 6, and then spend just a short amount of space with it. Skip that chapter and you might conclude from this book that C++ is simply an improved, extensible C. And that conclusion would not be far off the mark. Other parts of the book discuss those properties of C++ that combine to make it embraceable by the OOP ilk, but these presentations are not OOP-thumping paradigm evangelizations but rather mere explanations of the features of the paradigm as they are implemented in C++.

C and C++ will converge. Despite the minor differences that cause incompatibility between ANSI C and C++, a common language will evolve if only through usage. C++ brings too many improvements to C for that not to happen.

_C PROGRAMMING COLUMN_ by Al Stevens [LISTING ONE]



/* ----------- textsrch.h ---------- */

#define OK    0
#define ERROR !OK

#define MXTOKS 25           /* maximum number of tokens */
#define MAXFILES 64         /* maximum number of files  */
#define MAXWORDLEN 25       /* maximum word length      */
#define MAPSIZE MAXFILES/16 /* number of ints/map       */

#define SLOTINDEX  "slots.ndx"
#define TEXTINDEX  "text.ndx"
#define FILELIST   "filelist.ndx"

/* ---- the search decision bitmap (one bit per file) ---- */
struct bitmap {
    int map[MAPSIZE];
};

/* ------- the postfix expression structure -------- */
struct postfix  {
    char pfix;      /* tokens in postfix notation */
    char *pfixop;   /* operand strings            */
};

/* --------- the postfix stack ---------- */
extern struct postfix pftokens[];
extern int xp_offset;

/* --------- expression token values ---------- */
#define TERM     0
#define OPERAND 'O'
#define AND     '&'
#define OR      '|'
#define OPEN    '('
#define CLOSE   ')'
#define NOT     '!'
#define QUOTE   '"'

/* --------------- textsrch prototypes --------------- */
struct postfix *lexical_scan(char *expr);
struct bitmap exinterp(void);
struct bitmap search(char *word);
void process_result(struct bitmap);

/* -------------- binary tree prototypes ------------- */
void addtree(char *s);
int srchtree(char *s);
void delete_tree(void);
void build_noisewords(void);

/* ------------- command line prototypes ------------- */
void parse_cmdline(int,char **,char *,void (*)(char *));

/* ---------------- text prototypes ------------------ */
void extract_word(FILE *fp, char *s);

/* --------------- index prototypes ------------------ */
void open_database(void);
void init_database(void);
void close_database(void);
void addindex(char *word, int fileno);
void indexing(char *filename);
int getbit(struct bitmap *map1, int bit);
struct bitmap search_index(char *word);
char *text_filename(int fileno);





[LISTING TWO]


/* ----------- cmdline.c ----------- */

#include <stdio.h>
#include <dir.h>
#include <string.h>
#include "textsrch.h"

/*
 * Parse a command line:
 *  filename1 filename2 ... filenamen
 *  @filelist
 *  wild cards
 *  -+x option list (-a +b -c -xyz +pdq)
 *  any mix of the above
 */

/* ---- parse a command line for options and file names ---- */
void parse_cmdline(int argc, char *argv[], char *options,
                void (*func)(char *fn))
{
    char path[65];
    FILE *fp;
    while (argc-- > 1)  {
        switch (**++argv)   {
            case '/':
            case '+':
                /* -------- add an option --------- */
                while (options && *++*argv)
                    options[**argv] = 1;
                break;
            case '-':
                /* -------- remove an option --------- */
                while (options && *++*argv)
                    options[**argv] = 0;
                break;
            case '@':
                /* ----- a file of file path/names ----- */
                if ((fp = fopen(*argv+1, "rt")) != NULL)
                    while ((fgets(path, 65, fp)) != NULL)   {
                        path[strlen(path)-1] = '\0';
                        (*func)(path);
                    }
                break;
            default:
                /* ---- a file spec on the command line ---- */
                if (strchr(*argv, '*') || strchr(*argv, '?')) {
                    /* ------ an ambiguous file spec ------- */
                    struct ffblk ff;
                    char *cp;
                    int rtn;

                    /* ---- copy the ambiguous file spec ---- */
                    strcpy(path, *argv);

                    /* ---- find the filename part ---- */
                    if ((cp = strrchr(path, '\\')) == NULL)
                        if ((cp = strrchr(path, ':')) == NULL)
                            cp = path-1;
                    cp++;

                    /* ---- search for matches ---- */
                    rtn = findfirst(*argv, &ff, 0);
                    while (rtn == 0)    {
                        strcpy(cp, ff.ff_name);
                        (*func)(path);
                        rtn = findnext(&ff);
                    }
                }
                else
                    /* ----- an unambiguous file spec ----- */
                    (*func)(*argv);
                break;
        }
    }
}





[LISTING THREE]


/* ------------ bintree.c ---------- */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "textsrch.h"

struct bintree {
    struct bintree *lower;
    struct bintree *higher;
    char wd[1];
};

static struct bintree *first, *next;

/* ---------- add a string to a binary tree ----------- */
void addtree(char *s)
{
    struct bintree *tp;
    int intree;

    tp = malloc(sizeof (struct bintree) + strlen(s));
    if (tp == NULL)
        return;
    strcpy(tp->wd, s);
    tp->lower = tp->higher = NULL;
    if ((intree = srchtree(s)) != 0)    {
        if (first == NULL)
            first = tp;
        if (next != NULL)   {
            if (intree < 0)
                next->lower = tp;
            else
                next->higher = tp;
        }
    }
}

/* ------ Search a binary tree for a string.
    Return 0 if the string is in the tree.
    Return < 0 or > 0 if not.
    If not, next -> the node where insertion may occur. --- */

int srchtree(char *s)
{
    struct bintree *this;
    int an = -1;
    this = next = first;
    while (this != NULL)    {
        if ((an = strcmp(s, this->wd)) == 0)
            break;
        next = this;
        this = ((an < 0) ? this->lower : this->higher);
    }
    return an;
}

/* ------- delete the nodes of a branch of the tree ------ */
static void delete_nodes(struct bintree *nd)
{
    if (nd->lower)
        delete_nodes(nd->lower);
    if (nd->higher)
        delete_nodes(nd->higher);
    free(nd);
}

/* ------- free all memory allocated for the tree -------- */
void delete_tree(void)
{
    delete_nodes(first);
}

/* ------- build a binary tree of noise words -------- */
void build_noisewords(void)
{
    FILE *fp;
    char word[MAXWORDLEN+1];
    /* ---- open the noise word file ---- */
    if ((fp = fopen("NOISE.LST", "rt")) != NULL)    {
    /* ---- extract words and add them to the list ---- */
        while (!feof(fp))   {
            extract_word(fp, word);
            /* -------- search the noise word list -------- */
            addtree(word);
        }
        fclose(fp);
    }
}





[LISTING FOUR]


so very what he she her both there if above only it again
done our then from just my along left who may we all these
them us after once need through onto others can want where
would nor none with do here been was see own since off this
not will which itself that each to take down on you under at
their however the an as even have whose a said before or
nearly below are those possible alike begin out than having
two know when way often together by many thus whatever i his
past another though other entire in against am most taken
instead him because for might too rather few soon either near
about every until neither beyond your better must usually
several does such something using put more whether sent any
later like and now they also upon close be use of is should
could were into made further used let how enough its himself
known never become sure over next up among had causes has no
old some come but while me





[LISTING FIVE]


/* ----------- index.c ------------- */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "textsrch.h"

static void setbit(struct bitmap *map1, int bit);
static unsigned compute_hash(char *word);

static FILE *slots, *text, *flist;
static char path[65];
int file_count = 0;

/* ------ open or create the index files for building ------ */
void open_database(void)
{
    unsigned ctr = 0xffff;
    long empty = -1L;
    build_noisewords();
    if ((slots = fopen(SLOTINDEX, "r+b")) == NULL)
        slots = fopen(SLOTINDEX, "w+b");
    if (slots != NULL)  {
        if ((text = fopen(TEXTINDEX, "r+b")) == NULL)
            text = fopen(TEXTINDEX, "w+b");
        if (text != NULL)   {
            if ((flist = fopen(FILELIST, "rt")) != NULL)    {
                while (fread(path, sizeof path, 1, flist))
                    file_count++;
                fclose(flist);
                flist = fopen(FILELIST, "at");
                return;
            }
            /* ---- if the file list does not exist,
                we must be building a new data base ----- */
            if ((flist = fopen(FILELIST, "wt")) != NULL) {
                /* --- preset the slots to -1 values --- */
                printf("\nBuilding index. Please wait...\n");
                while (ctr--)   {
                    if ((ctr % 1000) == 0)
                        putchar('.');
                    fwrite(&empty, sizeof(long), 1, slots);
                }
                file_count = 0;
                return;
            }
            fclose(text);
        }
        fclose(slots);
    }
    printf("\nCannot establish index files");
    exit(1);
}

/* -------- open an index data base for retrieval ------- */
void init_database(void)
{
    build_noisewords();
    /* ---------- open all three index files ---------- */
    if ((slots = fopen(SLOTINDEX, "rb")) != NULL)   {
        if ((text = fopen(TEXTINDEX, "rb")) != NULL)    {
            if ((flist = fopen(FILELIST, "rt")) != NULL) {
                /* ------- count the text files ---------- */
                while (fread(path, sizeof path, 1, flist))
                    file_count++;
                printf("\n%d files in the data base.",
                            file_count);
                return;
            }
            fclose(text);
        }
        fclose(slots);
    }
    printf("\nCannot open Index Data Base");
    exit(1);
}

/* --------- close the index files ---------- */
void close_database(void)
{
    delete_tree();
    fclose(slots);
    fclose(text);
    fclose(flist);
}

/* --------- add a word to the index
        or add the file number to the existing word -------- */
void addindex(char *word, int fileno)
{
    long slotno;
    long hash;
    struct bitmap map1;
    long ptr = -1L;
    char wd[MAXWORDLEN+1];

    /* ------- compute a randon address from the word ------ */
    hash = compute_hash(word);
    hash *= sizeof(long);
    /* ------ read the random slot value -------- */
    fseek(slots, hash, SEEK_SET);
    fread(&slotno, sizeof(long), 1, slots);
    if (slotno == -1L)  {
        /* --- empty slot, add the word to the data base --- */
        fseek(text, 0L, SEEK_END);
        slotno = ftell(text);
        /* ----- set the bit map to this file only ----- */
        memset(&map1, 0, sizeof(struct bitmap));
        setbit(&map1, fileno);
        fwrite(&map1, sizeof(struct bitmap), 1, text);
        /* -- set the chain pointer to the terminal value -- */
        fwrite(&ptr, sizeof(long), 1, text);
        fwrite(word, strlen(word)+1, 1, text);
        /* --- insert the text address into the slot file --- */
        fseek(slots, hash, SEEK_SET);
        fwrite(&slotno, sizeof(long), 1, slots);
        return;
    }
    /* -------- the hashed slot is in use -------- */
    for (;;)    {
        /* --- point to the text index record ---- */
        fseek(text, slotno, SEEK_SET);
        fread(&map1, sizeof(struct bitmap), 1, text);
        fread(&ptr, sizeof(long), 1, text);
        fgets(wd, MAXWORDLEN+1, text);
        /* --- see if the entry matches the word --- */
        if (strcmp(wd, word) == 0)  {
            /* ---- the word matches this entry,
                    set this file's bit ---- */
            setbit(&map1, fileno);
            fseek(text, slotno, SEEK_SET);
            fwrite(&map1, sizeof(struct bitmap), 1, text);
            break;
        }
        /* ----- see if there is a chain ----- */
        if (ptr == -1)  {
            /* ------ end of the chain; add this word ---- */
            long newslotno;
            /* ---- to the end of the text index file ---- */
            fseek(text, 0L, SEEK_END);
            newslotno = ftell(text);
            /* ----- set the bit map to this file only ----- */
            memset(&map1, 0, sizeof(struct bitmap));
            setbit(&map1, fileno);
            fwrite(&map1, sizeof(struct bitmap), 1, text);
            fwrite(&ptr, sizeof(long), 1, text);
            fwrite(word, strlen(word)+1, 1, text);
            /* ----- chain the last entry to this one ----- */
            fseek(text, slotno+sizeof(struct bitmap), SEEK_SET);
            fwrite(&newslotno, sizeof(long), 1, text);
            break;
        }
        /* ------ the text record is chained
            (multiple words hash to the same slot) ---- */
        slotno = ptr;
    }
}

/* ------ search the data base for a match on a word ------ */
struct bitmap search_index(char *word)
{
    long slotno = -1;
    long hash;
    struct bitmap map1;
    char wd[MAXWORDLEN+1];

    /* ----- preset the bit map to all zeros ----- */
    memset(&map1, 0, sizeof(struct bitmap));
    /* ---- compute random slot address for the word ---- */
    hash = compute_hash(word);
    hash *= sizeof(long);
    fseek(slots, hash, SEEK_SET);
    fread(&slotno, sizeof(long), 1, slots);
    /* ------- navigate the chain -------- */
    while (slotno != -1)    {
        fseek(text, slotno, SEEK_SET);
        fread(&map1, sizeof(struct bitmap), 1, text);
        fread(&slotno, sizeof(long), 1, text);
        fgets(wd, MAXWORDLEN+1, text);
        if (strcmp(wd, word) == 0)
            /* ---- the word matches this entry ---- */
            break;
        memset(&map1, 0, sizeof(struct bitmap));
    }
    return map1;
}

/* ---- compute a random address from an ASCII string ---- */
static unsigned compute_hash(char *word)
{
    unsigned hash = 0;
    while (*word)
        hash = (hash << 7) + (hash >> 9) + *word++;
    return hash;
}

/* ------ sets a designated bit in the bit map ------ */
static void setbit(struct bitmap *map1, int bit)
{
    int off = bit / 16;
    int mask = 1 << (bit % 16);
    map1->map[off] |= mask;
}

/* ------ tests a designated bit in the bit map ------ */
int getbit(struct bitmap *map1, int bit)
{
    int off = bit / 16;
    int mask = 1 << (bit % 16);
    return map1->map[off] & mask;
}

/* --------- add a file name to the list ----------- */
void indexing(char *filename)
{
    /* ----- add the file path to the index file list ------ */
    memset(path, '\0', sizeof path);
    strcpy(path, filename);
    fwrite(path, sizeof path, 1, flist);
    file_count++;
}

/* ---- return the file name associated with an offset ---- */
char *text_filename(int fileno)
{
    fseek(flist, (long) (fileno * sizeof path), SEEK_SET);
    fread(path, sizeof path, 1, flist);
    return path;
}





[LISTING SIX]


/* -------------- bldindex.c --------------- */

#include <stdio.h>
#include <string.h>
#include "textsrch.h"

static void index_file(char *filename);

void main(int argc, char *argv[])
{
    open_database();
    printf("\nTextSrch: Building a TextSrch Index File");
    parse_cmdline(argc, argv, NULL, index_file);
    close_database();
}

static void index_file(char *filename)
{
    FILE *fp;
    char word[MAXWORDLEN+1];
    extern int file_count;
    int wdctr = 0;

    printf("\nIndexing %s", filename);
    indexing(filename);
    /* ---- open the object file ---- */
    if ((fp = fopen(filename, "rt")) == NULL)   {
        printf("\nError: No such file");
        return;
    }
    putchar('\n');
    /* ---- extract words and add them to the index ---- */
    while (!feof(fp))   {
        extract_word(fp, word);
        /* -------- search the noise word list -------- */
        if (srchtree(word) != 0)    {
            if ((++wdctr % 100) == 0)
                printf("\r%5d words", wdctr);
            /* ------- add the word to the index ------- */
            addindex(word, file_count-1);
        }
    }
    printf("\r%5d words", wdctr);
    fclose(fp);
}






[LISTING SEVEN]


/* ----------- text.c ------------ */

#include <stdio.h>
#include <ctype.h>
#include "textsrch.h"

#define isTEXT(c) (isalpha(c) || isdigit(c) || \
                   c=='+' || c=='#' || c=='_')

/* --------- extract a word from an input stream ----------- */
void extract_word(FILE *fp, char *s)
{
    int i, c;

    c = i = 0;
    while (!isTEXT(c))
        if ((c = fgetc(fp)) == EOF)
            break;
    while (isTEXT(c))   {
        if  (i++ < MAXWORDLEN)
            *s++ = tolower(c);
        if ((c = fgetc(fp)) == EOF)
            break;
    }
    *s = '\0';
}





[LISTING EIGHT]


/* ---------- search.c ----------- */

/*
 * the TEXTSRCH retrieval process
 */

#include <stdio.h>
#include <string.h>
#include "textsrch.h"

/* ---- process the result of a query expression search ---- */
void process_result(struct bitmap map1)
{
    int i;
    extern int file_count;
    for (i = 0; i < file_count; i++)
        if (getbit(&map1, i))
            printf("\n%s", text_filename(i));
}

/* ------- search the data base for a word match -------- */
struct bitmap search(char *word)
{
    struct bitmap map1;

    memset(&map1, 0xff, sizeof (struct bitmap));
    if (srchtree(word) != 0)
        map1 = search_index(word);
    return map1;
}