C PROGRAMMING

Building the Text Engine Database

Al Stevens

I just returned from London where Dr. Dobb's Journal hosted a one-day seminar for the British. Several editors and our venerated publisher, Peter Hutchinson, brought the good news from the colonies. We discussed the information superhighway, visual programming, interoperable component objects, and the state of C++, and gave a Newton demo. During Q&A at the end of my talk, an attendee asked if I thought that Borland would soon abandon OWL and embrace MFC. Figuring that the British audience would be unfamiliar with our sitcoms, I decided to steal a joke from "The Nannie." I told him that he had as much chance of seeing that as Tonya Harding had of being on a Wheaties box. No one laughed. Blank stares. Fortunately, my bomb was forgotten when, in a moment of inspiration, I announced that we would break for lunch. Later, I asked someone if they had ever heard of Tonya Harding. Of course they had, but they didn't know what a Wheaties box was.

Text-Engine Database

This month we continue the static text-search-engine project by building the database. The data analysis is complete. We've built the common word list and determined the hierarchical organization of the database. Next, we must convert data and build word indexes.

To refresh your memory: The project implements a fast search engine for the text of the King James version of the Bible. A Visual Basic front end accesses the text engine through a C-language DLL. The C engine can be compiled as a stand-alone DOS program with a stubbed user interface to test the engine. The text-conversion and database-build programs discussed this month are all written in C.

The Raw-Text Data

The raw-text data for this project comes from a public-domain copy of the King James version of the Bible distributed on diskette in 1987 by Thomas Cox of Easley, South Carolina. The text is organized into 98 ASCII text files, with each book of the Bible represented in one or more files. Cox split large books into multiple files so that users could read them with the word processors and text editors of the day. Each file's name identifies the book that it contains. For example:

...
EZRA.DOC
NEHEMIAH.DOC
ESTHER.DOC
JOB.001
JOB.002
PSALMS.001
PSALMS.002
PSALMS.003
...

Files that contain a complete book have .DOC file extensions. The others use .001, .002, and so on. Figure 1, from the book of Esther, shows how Cox organized the text. His choice of formats was fortunate. It lends itself well to parsing the individual documents (verses) of the database.

Book Names

Due to the highly structured organization of the data, the database does not need the book/chapter/verse tokens that Mr. Cox uses. The book names can be maintained in program tables. There are 66 books and two tables. Figure 2 shows a portion of the Visual Basic and C tables that identify the book names.

There are two tables because the engine is implemented two ways--first as a DOS stand-alone C program that tests the engine; then as a DLL that supports the Visual Basic front end. I'll discuss this approach in more detail next month.

Converting the Raw Text

Each book is organized into chapters numbered serially, beginning with chapter one. Each chapter is organized into verses, also organized serially beginning with the number one. The program can internally address documents by book/chapter/verse converted into document numbers. Therefore, the first step in building the database text consists of converting Cox's text into text files with verse text only, as Figure 3 shows.

Each verse is separated by a newline character. Each chapter is separated by an extra newline character. Each book is in a single ASCII text file named as shown here:

GENESIS.BBK
EXODUS.BBK
LEVITICU.BBK
...

The program that converts the raw text into the .BBK files is called BLD.C. (I am not publishing all the conversion source code in the magazine. The BLD.C program, for example, is unique to this particular database and would not apply to a text project of your own design. You can, however, get the program from a download or Careware source. See the end of this discussion for details.) The program reads a file named BOOKS.LST. I manually prepared the file with a text editor. It lists the files in Mr. Cox's database in the order that they appear in the Bible. BLD.C uses BOOKS.LST to identify the files in the raw-text database. A shareware version of the program included only the New Testament, and BOOKS.LST was modified to list only those books. Besides writing the converted files, BLD.C writes a test file named BOOKS.DOC, which lists the converted .BBK files. Later programs in the conversion task use this file to identify the .BBK files in their proper order. The BOOKS.DOC file and the .BBK text files are used to build both the index and the data files.

Building the Data Files

The data part of the database is built as two disk files, BIBLE.SQZ and BIBLE.BCV. The text part of the database is compressed and concatenated into BIBLE.SQZ, which contains a Huffman decompression tree followed by the compressed text. Each verse is a newline-terminated string. Genesis 1:1 is the first string; Revelation 22:21, the last. There are 31,102 strings in the database. Therefore, there are 31,102 documents.

The BIBLE.BCV file is a table of offsets into the text with one table entry per document. A document number from 1 to 31,102 provides an offset into the table. Each table entry consists of a byte and bit offset into the compressed text. (I adopted the same Huffman compression logic that I used in the D-Flat help database a while back.) Therefore, given a document number, the text engine can retrieve the text of a single verse by seeking to the byte offset in the text data and decompressing characters starting at the bit offset within that byte. The verse is retrieved when the engine has decompressed the terminating newline character.

The program called HUFFC.C reads BOOKS.DOC to see what .BBK text files to use in the database. The program builds two files, BIBLE.NDX, which is a temporary file with book, chapter, and verse and the offset table data, and BIBLE.SQZ, which contains the Huffman tree and compressed data. A program called BLDINDEX.C reads BIBLE.NDX and builds BIBLE.BCV, which is the database's byte/ bit offset table.

Building the BCVtable Source Code

The text engine, which we will discuss next month, addresses documents by document numbers 1 to 31,102. The user interface, however, uses book, chapter, and verse to identify documents. The engine must be able to convert from book, chapter, and verse to document number and back. Why back? When the user navigates forward or backward through the documents, the engine must report when the navigation changes to the next or previous book or chapter with proper wrapping of first or last chapter and verse numbers. To make these conversions, the engine needs to know how many chapters are in each book and how many verses are in each chapter. Rather than add a data table to the database, the engine uses an initialized, unsigned char array, named BCVtable. The array contains variable-length entries for each book. Each entry begins with an entry that records the number of chapters in that book followed by an entry for each chapter with the number of verses in the chapter. The engine uses this table to compute the document number from the book and chapter, and to compute the book, chapter, and verse from a document number.

A program named BLDTABLE.C reads the BIBLE.NDX file created by BLDINDEX.C and writes the BCVtable definition to stdout. One of the engine's source-code files includes a file named TABLE.C, and the conversion task builds that file by redirecting BLDTABLE's output to TABLE.C.

Extracting Words

The text engine uses a word index into the database. Given a word, the engine will deliver all the document numbers that contain that word. (If the word is from the common word list, determined from last month's analysis, the engine delivers all 31,102 documents in the database.) The index contributes to the engine's phrase and Boolean queries by searching each word in the query and logically combining the results of all those searches into a final document list.

Listing One, WORDS.C extracts all the words from the .BBK files into a file named WORDS.LST. Each record in WORDS.LST contains a document number and a word. WORDS.LST is, therefore, a really big file with one record for every word in the database, excepting words in the common list. The next step sorts that file into word, document number, and sequence. The programs SORTW.C and MERGEW.C do the sort.

Sorting Words

Always eager to recycle software, I used the sort utility program that I published in this column several years ago. That program was a general-purpose utility that sorted fixed-length records. Input parameters specified the record length and sort-field positions and lengths. The records in WORDS.LST are of variable length. The first field is the integer document number, and the second is a null-terminated string with the word. I had to modify the sort program to handle this format.

Sorting large files consists of two major passes. The first pass reads records into large memory buffers. When a buffer is full, the pass sorts the records, adds a terminal record, and writes the records to a work file. There are two work files. Each time the program writes a buffer, the program alternates work files. Each buffer is called a "sequence." SORTW.C is the first sort pass. It builds the two files and names them WORDS.WK0 and WORDS.WK1. Listing Two is SORTW.C. It is a throwback to the old, small-memory models of 16-bit compilers and has a relatively small sequence buffer. You could speed up the process and perhaps eliminate the second major pass by using a 32-bit compile and all the available memory in a contemporary machine.

The second major pass consists of some number of secondary merge passes that merge pairs of sequences from the two files into single sequences on a second pair of files. Each merge pass halves the number of sequences. When only one sequence remains, the file sort is completed and the remaining work file is the output. Listing Three, MERGEW.C is the merge program. It merges WORDS.WK0 and WORDS.WK1 into WORDS.WK3 and WORDS.WK4. Then it merges WORDS.WK3 and WORDS.WK4 into WORDS.WK0 and WORDS.WK2, repeating this operation until either only WORDS.WK0 or WORDS.WK2 has records. Then it renames that file WORDS.SRT, which is the input to the next step in building the index.

Building the Word Index

A program named BLDLIST.C (Listing Four) reads WORDS.SRT and builds three files, BIBLE.BNT, BIBLE.IDX, and BIBLE.DLS. These three files constitute the word index into the database. Given a word, they return a list of document numbers where that word appears.

BIBLE.BNT is an ordered array of offsets into a table of words. BIBLE.IDX is that table of words. The offset array is ordered so that the first offset points to the first word in the order of the index, which happens to be word order. "Aaron" is first, for example. The search engine uses BIBLE.BNT to perform a binary search of the words in BIBLE.IDX. The offsets in the array are of fixed length, facilitating the fast binary search. Each entry in BIBLE.IDX contains the null-terminated text of the word and an offset to a document list in BIBLE.DLS. When the search engine finds a matching word in BIBLE.IDX, it uses the offset to point to the document list in BIBLE.DLS. Each document list is a variable-length array of integers. The first integer is the number of documents in the list. The others are document numbers. The engine can use these document numbers to retrieve the text of the matching documents in the database.

The index architecture that I just described is fast and space efficient. It supports fast retrievals but would not readily support efficient word-document inserts and deletes. It is a good architecture for large, static databases.

Combining the Files into a Database

At this point, the database-build task has built five database files: BIBLE.BNT, BIBLE.IDX, BIBLE.DLS, BIBLE.BCV, and BIBLE.SQZ. It has also built TABLE.C to be used in the source-code build of the text engine. The five database files must now be combined into one file. The easiest way to do that is to use the DOS copy command, like this: copy /B bible.bnt+bible.idx+bible.dls+bible.bcv+bible.sqz /B bible.dat. This command concatenates the five files into one file named BIBLE.DAT, and that file is the database. We are not done with the others, however. We need to know their sizes. The DOS DIR command provides that information:

BIBLE.BNT 50,012
BIBLE.IDX 151,727
BIBLE.DLS 666,030
BIBLE.BCV 155,510
BIBLE.SQZ 2,276,832

The engine needs to know the size of each of the database components so that it can seek to them when it reads the one file named BIBLE.DAT. Figure 4 shows how the engine uses the lengths at compile time to compute the seek offsets.

The programs discussed this month build the text-engine database. Your text-engine projects will involve similar steps, and you might use versions of these programs and others adapted to the requirements for your database. Next month, we'll discuss the text engine itself, concentrating on the C code that implements the DLL and the DOS stand-alone test version.

Getting the Source Code and Database

The database and Visual Basic and C source code for the text engine are free. You can download them from the DDJ Forum on CompuServe and on the Internet by anonymous ftp; see "Availability," page 3.

There are several archived files containing the database (almost 1.5 Mbytes' worth), the Visual Basic front-end source code, the text-engine DLL source code, the database-build source code, and the Windows Help database. We discussed the database-build source code in this column.

If you cannot get to one of the online sources, send two high-density, 3.5-inch diskettes and a stamped, addressed mailer to me at Dr. Dobb's Journal, 411 Borel Avenue, San Mateo, CA 94402, and I'll send you a copy of the source code and database. It's free, but if you care to support my Careware charity, include a dollar for the Brevard County Food Bank. They support some of the needs of our hungry and homeless citizens.

Figure 1: Raw text.

EST 1:1  Now it came to pass in the days of Ahasuerus, (this is Ahasuerus which reigned, from India even unto Ethiopia, over an hundred and seven and twenty provinces:)
EST 1:2  That in those days, when the king Ahasuerus sat on the throne of his kingdom, which was in Shushan the palace,
EST 1:3  In the third year of his reign, he made a feast unto all his princes and his servants; the power of Persia and Media, the nobles and princes of the provinces, being before him:

Figure 2: Book-name tables.

Global BookNames(66) As String
BookNames(0) = "Genesis"
BookNames(1) = "Exodus"
BookNames(2) = "Leviticus"
BookNames(3) = "Numbers"
 ...
static char *BookName[] = {
    "Genesis",
    "Exodus",
    "Leviticus",
    "Numbers",
    ...
};

Figure 3: Converted text.

Now it came to pass in the days of Ahasuerus, (this is Ahasuerus which reigned, from India even unto Ethiopia, over an hundred and seven and twenty provinces:)\n
That in those days, when the king Ahasuerus sat on the throne of his kingdom, which was in Shushan the palace,\n
In the third year of his reign, he made a feast unto all his princes and his servants; the power of Persia and Media, the nobles and princes of the provinces, being before him:\n

Figure 4: File sizes and offsets.

#define BNTLENGTH 50012L
#define IDXLENGTH 151727L
#define DLSLENGTH 666030L
#define BCVLENGTH 155510L
#define IDXOFFSET BNTLENGTH
#define DLSOFFSET (IDXOFFSET+IDXLENGTH)
#define BCVOFFSET (DLSOFFSET+DLSLENGTH)
#define DATOFFSET (BCVOFFSET+BCVLENGTH)

Listing One


#include <stdio.h>
#include <dir.h>
#include <dos.h>
#include <process.h>
#include <ctype.h>
#include <string.h>
#include <stdlib.h>

#define DOS_VERSION

#include "htree.h"

int BinarySearch(char *word, char **cp, int wdct);

int isCommon(char *word);
unsigned int docno;
static FILE *fo;

static char *wds[500];
static int wdctr;

static int inList(char *word)
{
    int i;
    for (i = 0; i < wdctr; i++)
        if (strcmp(word, wds[i]) == 0)
            return 1;
    return 0;
}
static void AddList(char *word)
{
    wds[wdctr] = malloc(strlen(word)+1);
    strcpy(wds[wdctr++], word);
}
static void ClearList(void)
{
    int i;
    for (i = 0; i < wdctr; i++)
        free(wds[i]);
    wdctr = 0;
}
static void dowords(char *fn)
{

    char verse[600];
    char *cp, *wd;
    char word[30];
    FILE *fp = fopen(fn, "rt");
    if (fp != NULL)    {
        while (fgets(verse, 600, fp) != NULL)    {
            if (*verse == '\n')
                continue;
            docno++;
            cp = verse;
            while (*cp && *cp != '\n')    {
                wd = word;
                // extract a word
                while (!isalpha(*cp))    {
                    if (!*cp || *cp == '\n')
                        break;
                    cp++;
                }
                while (isalpha(*cp))    {
                    *wd++ = tolower(*cp);
                    cp++;
                }
                *wd = '\0';
                if (*word && *(word+1))    {
                    if (!isCommon(word))    {
                        if (!inList(word))    {
                            fwrite(&docno, sizeof docno, 1, fo);
                            fwrite(word, strlen(word)+1, 1, fo);
                            AddList(word);
                        }
                    }
                }
            }
            ClearList();
        }
        fclose(fp);
    }
}
void main()
{
    char fname[15];
    FILE *fp;
    fo = fopen("words.lst", "wb");
    if (fo != NULL)    {
        if ((fp = fopen("books.doc", "rt")) != NULL)    {
            while (fgets(fname, 15, fp) != NULL)    {
                printf(fname);
                fname[strlen(fname)-1] = '\0';
                dowords(fname);
            }
            fclose(fp);
        }
        fclose(fo);

    }
}


Listing Two


#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <process.h>

#define MAXWORDS 2048

char workfile[] = "words.wk0";
FILE *fo, *wi, *wo;

static int sequences;

struct sortword    {
    unsigned int docno;
    char *word;
} wds[MAXWORDS];

static int buffct;
static int wordct;

static int wdcmp(const void *c1, const void *c2)
{
    int rtn = strcmp(((struct sortword *)c1)->word,
        ((struct sortword *)c2)->word);
    if (rtn == 0)
        rtn = ((struct sortword *)c1)->docno -
            ((struct sortword *)c2)->docno;
    return rtn;
}
void bufferout(void)
{
    int i;
    unsigned int zero = 0;
    char eos[] = "{end of sequence}";
    qsort(wds, wordct, sizeof(struct sortword), wdcmp);
    fo = fopen(workfile, "ab");
    for (i = 0; i < wordct; i++)    {
        fwrite(&wds[i].docno, sizeof(unsigned int), 1, fo);
        fwrite(wds[i].word, strlen(wds[i].word)+1, 1, fo);
        free(wds[i].word);
    }
    fwrite(&zero, sizeof(unsigned int), 1, fo);
    fwrite(eos, sizeof eos, 1, fo);
    fclose(fo);
    printf("\r%d sequences    ", ++sequences);
    workfile[8] ^= 1;
    wordct = 0;

    buffct++;
}
void insertsort(struct sortword sw)
{
    if (wordct == MAXWORDS)
        bufferout();
    wds[wordct++] = sw;
}
void main()
{
    FILE *fp = fopen("words.lst", "rb");
    struct sortword sw;
    char word[50];
    int i, c;

    unlink(workfile);
    workfile[8] ^= 1;
    unlink(workfile);
    workfile[8] ^= 1;
    if (fp != NULL)    {
        while (!feof(fp))    {
            if (fread(&sw.docno,sizeof(unsigned int),1,fp) != 1)
                break;
            i = 0;
            while ((word[i++] = fgetc(fp)) != 0)
                if (feof(fp))
                    break;
            if ((sw.word = malloc(strlen(word)+1)) == NULL)    {
                fprintf(stderr, "OM!\a");
                exit(1);
            }
            strcpy(sw.word, word);
            insertsort(sw);
        }
        fclose(fp);
        if (wordct)
            bufferout();
        spawnl(P_WAIT, "mergew.exe", "mergew.exe", NULL);
    }
}


Listing Three


#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *fn[] = {
    "words.wk0",
    "words.wk1",
    "words.wk2",

    "words.wk3"
};
int sequences = 99;

void writeword(FILE *fo, unsigned *dn, char *wd)
{
    fwrite(dn, sizeof(unsigned int), 1, fo);
    fwrite(wd, strlen(wd)+1, 1, fo);
}
void readword(FILE *fp, unsigned *dn, char *wd)
{
    if (fread(dn, sizeof(unsigned int), 1, fp) == 1)    {
        int i = 0;
        while ((wd[i++] = fgetc(fp)) != 0)
            if (feof(fp) || i == 100)
                break;
        wd[i] = '\0';
    }
}
void merge(FILE *f0, FILE *f1, FILE *f2, FILE *f3)
{
    char wd0[101] = "", wd1[101] = "";
    unsigned dn0 = 0, dn1 = 0;
    int cr;
    FILE *fo = f2;

    sequences = 0;
    while (!feof(f0) || !feof(f1))    {
        cr = strcmp(wd0, wd1);
        if (cr == 0)
            cr = dn0 - dn1;
        if (cr == 0)    {
            if (*wd0)
                writeword(fo, &dn0, wd0);
            if (*wd0 == '{')    {
                sequences++;
                printf("\r%d sequences   ", sequences);
                // flip output files
                if (sequences & 1)
                    fo = f3;
                else 
                    fo = f2;
            }
            else if (*wd1)
                writeword(fo, &dn1, wd1);
            readword(f0, &dn0, wd0);
            readword(f1, &dn1, wd1);
        }
        else if (cr < 0)    {
            writeword(fo, &dn0, wd0);
            readword(f0, &dn0, wd0);
        }

        else    {
            writeword(fo, &dn1, wd1);
            readword(f1, &dn1, wd1);
        }
    }
}
void main()
{
    int pass = 0;
    FILE *f0, *f1, *f2, *f3;
    int in0 = 0, in1 = 1, out0 = 2, out1 = 3;
    while (sequences > 1)    {
        f0 = fopen(fn[in0], "rb");
        f1 = fopen(fn[in1], "rb");
        f2 = fopen(fn[out0], "wb");
        f3 = fopen(fn[out1], "wb");
        printf("\nPass %d: Merging %s and %s to %s and %s\n",
            ++pass, fn[in0], fn[in1], fn[out0], fn[out1]); 
        merge(f0, f1, f2, f3);
        printf("\r%d sequences   ", sequences);
        fclose(f0);
        fclose(f1);
        fclose(f2);
        fclose(f3);
        if (sequences > 1)    {
            in0 ^= 2;
            in1 ^= 2;
            out0 ^= 2;
            out1 ^= 2;
        }
    }
    remove(fn[in0]);
    remove(fn[in1]);
    remove(fn[out1]);
    rename(fn[out0], "words.srt");
}


Listing Four


// -------- bldlist.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static unsigned doclist[10000];
static unsigned docct;
static long wdct;

FILE *f1;
FILE *f2;
FILE *f3;

void dumpbuffer(char *word)
{
    if (docct)    {
        long offset = ftell(f1);
        // --- write binary tree file ( -> word list )
        fwrite(&offset, sizeof(long), 1, f3);
        // --- write word list file ( -> document list )
        offset = ftell(f2);
        fwrite(word, strlen(word)+1, 1, f1);
        fwrite(&offset, sizeof(long), 1, f1);
        // ---- write document list file
        fwrite(&docct, sizeof(unsigned), 1, f2);
        fwrite(doclist, sizeof(unsigned), docct, f2);

        docct = 0;
    }
    wdct++;
    if ((wdct % 100) == 0)
        printf("\r%ld words", wdct);
}
void main()
{
    unsigned docno;
    char word[101];
    char pword[101] = "";
    int len, dn;
    FILE *fp = fopen("words.srt", "rb");
    if (fp != NULL)    {
        f1 = fopen("bible.idx", "wb");
        f2 = fopen("bible.dls", "wb");
        f3 = fopen("bible.bnt", "wb");
        while (fread(&docno, sizeof(unsigned), 1, fp) == 1)   {
            int i = 0;
            while ((word[i++] = fgetc(fp)) != 0)
                if (i == 100 || feof(fp))
                    break;
            word[i] = '\0';
            if (strcmp(word, pword))    {
                // ---- finish off the old word
                dumpbuffer(pword);
                strcpy(pword, word);
            }
            doclist[docct++] = docno;
        }
        dumpbuffer(pword);
        fclose(f1);
        fclose(f2);
        fclose(f3);
        fclose(fp);
        printf("\r%ld words", wdct);
    }
}


Copyright © 1995, Dr. Dobb's Journal