Dr. Dobb's Sourcebook March/April 1997
A common problem with databases is the very large amount of space required to store them. Many database systems (such as the xBase model) require a complete record with a fixed allocation for each field. Since text fields have fixed sizes, each must be large enough to hold the longest possible string it might contain. Even when fields are optional and frequently blank, their full space must still be allocated. This can result in large records; with a large number of records, the total size of the database can become quite unwieldy.
In other databases, fields are of variable length, perhaps with a field marker to represent a missing field. These databases use less disk storage than traditional fixed-length-record databases, but ultimately there is a tradeoff between disk space and performance. For example, locating the 1481st record in a variable-length database involves more than simply multiplying the record length by 1481 (and usually adding some header-size value) to obtain a file offset. Updating records can be even more complex. An indexing scheme must be used to convert record numbers to file offsets.
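The contrast can be sketched in a few lines of C. The header size, record length, and index table here are hypothetical; the point is only that the fixed-length case is pure arithmetic, while the variable-length case needs a per-record index.

```c
/* With fixed-length records, a record number maps to a file offset by
   arithmetic alone; with variable-length records, a table of offsets
   (an index) must be consulted instead. HEADER_SIZE and RECORD_LEN
   are hypothetical values. */
#define HEADER_SIZE 32L
#define RECORD_LEN  220L

/* recno is 0-based, so the 1481st record is recno 1480. */
long fixed_offset(long recno)
{
    return HEADER_SIZE + recno * RECORD_LEN;
}

/* For variable-length records there is no formula; the offset must
   come from an index built when the file was written. */
long variable_offset(const long *offset_index, long recno)
{
    return offset_index[recno];
}
```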
In this article, I'll present tokenized databases, a scheme that allows conventional fixed-width-field representation of records but also affords many of the advantages of variable-width fields, plus data compression.
Of course, data compression (such as PKZip) can be used with any database, but compression is best achieved when applied to entire databases or significant subpieces (images or bitmaps, for instance). When applied record-by-record, the payoff is lower and the complexity higher. Essentially, you get the complexity of the variable-length-field databases plus the complexity of the compression/decompression.
Fixed-length fields present problems when multilevel indexing is required. For example, to sort on four 55-byte fields requires 220 bytes in each index record. The size of an index file for such a database could rival the size of the database itself. To make matters worse, an index like this cannot be done in most xBase systems, which retain the original dBase II constraint of 100 bytes in the key.
I have successfully used tokenized databases in several projects, most recently in a program called the Windows Developer's Assistant (WDA), an index of over 93,000 Win16 API-related entries in books, magazines, and manuals.
Because this technique is both space- and time-efficient, it is particularly well suited to palmtop-computer applications or read-only databases. In the WDA project, I used dBase IV 2.0 as the database engine. The basic dBase database is over 42 MB in size, while the compressed database is 2.2 MB with a token dictionary of 666 KB, keeping the entire database "disk footprint" under 3 MB. The PKZip version fits on a single 1.44-MB floppy.
Tokenized databases work best when there are, for any particular string, many occurrences of that string in the database. In the case of the WDA, the regular occurrence of terms is such that the multilevel (maximum of four) index for the 93,000 entries has only 14,000 unique strings. Tokenized databases also work well when the database is fundamentally static. To implement the technique, you assign an integer to each unique string in the database, then represent the database by records of these integers. Thus the moniker "tokenized database," named after the method in which a lexical analyzer or parser represents input strings by unique tokens.
Consider the record of four 55-byte fields containing the values in Figure 1(a), which might be represented in the compacted database by the record containing four 16-bit integers in Figure 1(b). Thus, the record is reduced from four 55-byte fields (220 bytes) to four 2-byte fields (8 bytes) plus the dictionary entries (46 bytes for NULL-terminated strings). The record is reduced from 220 to 54 bytes, a 4:1 savings. The overall savings are realized when the multiple occurrences of these words are taken into account. For example, this database has over 16,900 occurrences of the word "discussion;" represented as full 55-byte fields, this word alone would require over 929,500 bytes of storage in the database. If it were stored as a NULL-terminated string for each record (11 bytes for each record) it would require 185,900 bytes. Using tokenization, the value consumes 33,800 bytes plus 11 bytes for the dictionary entry. This is a 27.5:1 compression ratio for one (admittedly popular) string.
The reason this works particularly well for static databases is that the token values can be assigned in consecutive order based upon the desired sorting order. Thus, sorting or indexing requires only a comparison of the token values. With dynamic databases, adding new values, whether by adding new records or by editing fields, carries a penalty: To compare two tokens, the corresponding strings must first be looked up in the token dictionary before the string values can be compared. (Alternatively, you could reassign token numbers when insertions are done, or leave gaps: assign token numbers in multiples of 10, for example, and reassign them when a gap fills up, a technique known to all Basic programmers from the RENUMBER command.)
To tokenize the database, I created a list of all the unique words in all the fields, then sorted it. I then scanned this list and assigned to each entry an integer that represented the position of the entry in the list. This formed the token dictionary. I then read the text database sequentially (in its "natural order," that is, unindexed); for each text record I created an output "binary" record having integer fields. For each text field in the input record, I looked up the value in the field in the token-dictionary database and replaced the corresponding entry in the output record with the integer found in the token-dictionary database.
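The dictionary-building pass was dBase code in the real project, but the idea fits in a short C sketch. MAX_TOKENS and all names here are illustrative: unique words are collected, sorted, and each word's token is simply its position in the sorted list.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_TOKENS 16384

char *dict[MAX_TOKENS];
int dict_count = 0;

static int cmp(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Collect a field value if it has not been seen before.
   The linear scan is what makes the full pass O(n^2). */
void add_word(const char *s)
{
    int i;
    for (i = 0; i < dict_count; i++)
        if (strcmp(dict[i], s) == 0)
            return;                      /* already in the dictionary */
    dict[dict_count] = malloc(strlen(s) + 1);
    strcpy(dict[dict_count], s);
    dict_count++;
}

/* Sort the collected words; afterward, the token for dict[i] is i. */
void build_dictionary(void)
{
    qsort(dict, dict_count, sizeof(char *), cmp);
}
```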
The result was a binary database that was the tokenized version of the original, still in the original (unsorted) order. To sort it, I would seemingly have to compare the original strings in collating-sequence order. But because the tokens were assigned in collating-sequence order, comparing the integer fields is sufficient to produce a sorted database: if one integer is less than another, its corresponding string is lower in collating-sequence order than the string represented by the second integer.
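A multilevel sort key therefore reduces to a cascade of integer comparisons. The struct layout below is a sketch, not the actual record format:

```c
/* Because tokens were assigned in collating order, comparing two
   tokenized records compares the underlying strings. Four levels,
   as in the WDA index; the layout is illustrative. */
typedef struct {
    unsigned short level[4];     /* L1..L4 token values */
} BinRecord;

int rec_cmp(const BinRecord *a, const BinRecord *b)
{
    int i;
    for (i = 0; i < 4; i++) {
        if (a->level[i] < b->level[i]) return -1;
        if (a->level[i] > b->level[i]) return 1;
    }
    return 0;                    /* all four levels equal */
}
```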
With xBase systems, the integer fields are stored as 5-byte values, so the four-level sort (ignoring the other fields I index on, such as book ID, chapter, and page) requires only a 20-byte key. This is not only smaller and faster than the text comparison, but it is also well under the 100-byte key limit.
The resulting binary database requires 7.8 MB for 93,000 entries, and the resulting index is 5.9 MB. The token dictionary is another 2.4 MB. This requires 16.1 MB; although this is substantially less than the (unindexable) 42-MB raw-data file, it is still unacceptably large.
The next step was to make a pure binary image. In this image, I could reduce the 5-byte numeric fields to 2 bytes of pure binary integer. Keeping the token-dictionary database as an xBase file would require a full 55-byte field for each string, so I instead chose a representation of compacted strings, where each entry requires only the number of bytes needed plus the terminating NULL byte. In the example discussed here, I took advantage of an existing mechanism to do this: I encoded the token dictionary as a STRINGTABLE resource. I might have achieved better compression and possibly even faster access using more-sophisticated techniques, but this mechanism was cheaply available and quite adequate for the task.
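The compacted-string idea itself is simple, whatever mechanism stores it. In this sketch (sizes and names are illustrative), entries are stored back to back as NUL-terminated strings, with a table of offsets so a token number maps directly to its string:

```c
#include <string.h>

#define POOL_SIZE 65536

static char pool[POOL_SIZE];
static unsigned pool_used = 0;
static unsigned offsets[16384];
static int n_packed = 0;

/* Append a string to the pool; returns its token number. Each entry
   costs only the bytes needed plus the terminating NUL. */
int pack_string(const char *s)
{
    unsigned len = (unsigned)strlen(s) + 1;
    memcpy(pool + pool_used, s, len);
    offsets[n_packed] = pool_used;
    pool_used += len;
    return n_packed++;
}

/* Recover a string from its token number. */
const char *string_of(int token)
{
    return pool + offsets[token];
}
```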
Finally, I used the Windows resource mechanism to encode the actual database. Each record in the binary database generated a fixed-sized record in a user-defined resource. These resources were installed in resource-only DLLs and accessed via the LoadLibrary, FindResource, and LockResource calls.
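Getting at one chunk of the database then looks something like the following Win16 sketch. The DLL name and the user-defined resource type "WDADATA" are hypothetical, and error handling is omitted:

```c
#include <windows.h>

/* Win16 sketch: locate a data chunk in a resource-only DLL.
   "WDADB1.DLL" and "WDADATA" are illustrative names. */
void far *load_chunk(LPCSTR resname)
{
    HINSTANCE hLib  = LoadLibrary("WDADB1.DLL");
    HRSRC     hRes  = FindResource(hLib, resname, "WDADATA");
    HGLOBAL   hData = LoadResource(hLib, hRes);
    return LockResource(hData);
}
```

Because the resource mechanism does its own loading and discarding, this is also what makes the paging and caching "free," as noted below.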
In this particular application, I required both "print-as" and "sort-as" representations for each field. The print representation might be something like _export, but it should sort as export_. So each record actually has eight 55-byte fields to represent it: a field at each level (L1, L2, L3, and L4) and a corresponding "sort as" field (L1_AS, L2_AS, L3_AS, and L4_AS). In addition, each record has fields for book, chapter, and page, and a field of flag information. Ultimately, each record took 26 bytes. The count of 14,000 tokens for 93,000-plus records includes these sort-key strings as well. A typical resource looked like Figure 2.
Using resources to represent this data imposed serious limitations in the Win16 environment in which I was working. The first was that no individual resource could exceed 32,768 bytes. Therefore, I had to break the database up so that the resource script (.rc) file contained no resource larger than 32 KB; I chose 26 KB (1000 records) as the quantum for each chunk. The resources were thus named DATA00, DATA01, DATA02, and so on; see Figure 2. A sentinel record of all 0 values marked the end of each resource chunk. Because 0, which meant "blank," was not a legal value as a top-level entry, checking the L1_AS field for a 0 value was a sufficient test for the end of a chunk. This worked fine when I had 22,000 entries in the database. But when I got to about 38,000 entries, I hit another limitation -- the Win16 version of the "rc" compiler could not handle large files! Therefore, I had to break the database up further, into multiple files, so that the rc compiler could process them. This required occasionally rerunning the database with a new "number of files" parameter because the output was split evenly among n files. Even the STRINGTABLE had to be split up into three files because rc could not compile a 14,000-entry STRINGTABLE.
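The sentinel test itself is trivial; a portable sketch (the struct is a simplified stand-in for the real 26-byte record):

```c
/* Scan one resource chunk: records are consumed until the all-zero
   sentinel record. Checking L1_AS alone suffices because 0 ("blank")
   is never legal as a top-level entry. */
typedef struct {
    unsigned short l1_as, l2_as, l3_as, l4_as;  /* sort-as tokens */
    /* ... book, chapter, page, and flag fields omitted ... */
} Record;

int count_records(const Record *chunk)
{
    int n = 0;
    while (chunk[n].l1_as != 0)      /* sentinel ends the chunk */
        n++;
    return n;
}
```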
Finally, at around 88,000 entries, I hit a fatal limitation. I had added a set of flags to allow additional query filtering and, following standard good practice, used symbolic values. This triggered a new rc limitation: Apparently, the preprocessed text is kept in memory, and the Win16 rc compiler reports running out of memory as a spurious syntax error on some perfectly valid line. Working around this required breaking the files into smaller and smaller pieces. When the number of files approached 13, I realized this was a bad idea. Unfortunately, the entire program depended upon the resource representation, and I did not feel like rewriting it. So I wrote code in the dBase program to emit the binary .RES files directly; these files were incorporated into the resource-only DLLs that form the basis of the WDA database: five database DLLs, four KWIC-index DLLs, and three token-dictionary DLLs. Ultimately, I left the DLLs split up to this degree because of the requirements of putting them on distribution disks, rather than any other fundamental limitation.
The use of resource-only DLLs has other advantages. For example, the menu that lists all the books is a resource in one of the DLLs. This means that the program has no "hardwired" knowledge of the database, thus updated databases can be readily distributed (the intent was to distribute them on the Web). There is no need to implement any paging, caching, or record-blocking techniques; the resource mechanism in Windows handles this all quite nicely, reducing the programming effort.
To improve performance, I cache some parameters, such as the record number at which each letter of the alphabet changes. I also record the timestamp of the DLL that represents the resources. Whenever the program starts up, it checks the current DLL against the cached timestamp; if they are the same, the cached information is valid, but if they are different, the program takes an additional half-minute or so to recompute the necessary values. The result is a faster startup, plus automatic recognition of the changed DLL.
We all know that GUI is short for Graphical User Interface. But as developers, it is important that we recognize the value of the GDI -- no, not the Graphics Device Interface, but the Graphical Developer Interface. It is very difficult to debug a program that has tens of thousands of records and an error somewhere in the algorithm that shows up only sometimes. Reproducing such errors can be difficult, and breakpoints are not adequate when you may have to press F5 thousands of times before the condition you are looking for goes by. (Not to mention that there's a chance you'll miss it and have to start over.) I had such a bug. Consequently, I created (as I do in many of my programs) a GDI that let me debug at a higher level. By typing in various record numbers, token IDs, and the like, I could eventually zero in on the exact failure conditions. Then I could set a breakpoint, confident that I could trigger it immediately with the condition that caused the error. (It turned out to be a boundary condition dealing with the last entry in a resource block.) The GDI for the database is shown in Figure 3. After all, why should the end user be the only one to benefit from the power of the Windows graphical interface? This is a paradigm shift we should all make.
An important consideration about this sort of debugging technique: If it is good enough to write once, it is good enough to be a permanent part of the application. I rarely "condition out" such code. Instead, the choice is dynamic. There is a Debug menu item that contains many options, one of which is the database debug item. Normally, when the program starts up, it removes this menu from the menu bar before the menu bar is displayed. But if the .INI file (remember, this was a Win16 program; I'd use the Registry today) contained, in the [Debug] section, the option Menu=1, the menu is left visible. The impact of such features on field maintenance is substantial. Instead of being perplexed, your tech-support people can tell end users how to enable the debug menu, then use the debug options to reveal the internal state of the program, which can ultimately lead to identifying the problem (sometimes it is the end user doing something odd, such as deleting a critical file, or "hand editing" a data or configuration file and violating some constraint).
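The startup check is a few lines of Win16 code. In this sketch, the .INI file name and the menu's position on the menu bar are illustrative:

```c
/* Win16 sketch: remove the Debug menu at startup unless the .INI file
   contains [Debug] Menu=1. "wda.ini", hwndMain, and DEBUG_MENU_POS
   are hypothetical names. */
if (!GetPrivateProfileInt("Debug", "Menu", 0, "wda.ini")) {
    HMENU hMenu = GetMenu(hwndMain);
    DeleteMenu(hMenu, DEBUG_MENU_POS, MF_BYPOSITION);
    DrawMenuBar(hwndMain);      /* redraw the now-shortened menu bar */
}
```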
Though I have used these techniques many times over the years, this is actually my first use of them in a Windows environment. If you don't have resource compilers and operations such as LoadString, you have to implement these features by hand. In one application (about six years ago), I implemented a Least Recently Used (LRU) paging algorithm that kept the 30 (a #define value) most recent tokens in memory. There were no more than 3000 tokens in this dictionary, and the token file was always smaller than 64 KB, so 16-bit unsigned file offsets sufficed. The dictionary worked by scanning the token (text) file on disk, which was in token order, one token per line, and keeping the file offset of each line in a 6000-byte table. When token 282 was required, I first checked the LRU cache to see if 282 was in memory. If it was, I used it; if not, I went to the table, found the file offset of the 282nd entry, did an fseek to that offset, and read the line up to the CR/LF. If necessary, the oldest cache entry was purged to make room for the new entry. This takes only a few hundred lines of code to implement fully. If the database is so large that the index cannot be kept in memory, one of the many B-tree packages should suffice.
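A condensed sketch of that cache follows. The disk read is stubbed out (the real version did the fseek-and-read-a-line described above); the stub, names, and sizes are illustrative.

```c
#include <stdio.h>
#include <string.h>

#define CACHE_SIZE 30            /* the #define value mentioned above */
#define MAX_TOKEN_LEN 56

static struct {
    int token;                   /* -1 marks an empty slot */
    char text[MAX_TOKEN_LEN];
    unsigned long stamp;         /* larger = used more recently */
} cache[CACHE_SIZE];

static unsigned long tick = 0;
int disk_reads = 0;              /* counts cache misses */

/* Stand-in for seeking to the token's file offset and reading
   one line up to the CR/LF. */
static void fetch_from_disk(int token, char *out)
{
    sprintf(out, "token%d", token);
    disk_reads++;
}

void cache_init(void)
{
    int i;
    for (i = 0; i < CACHE_SIZE; i++)
        cache[i].token = -1;
}

const char *lookup(int token)
{
    int i, victim = 0;
    for (i = 0; i < CACHE_SIZE; i++)
        if (cache[i].token == token) {   /* cache hit */
            cache[i].stamp = ++tick;
            return cache[i].text;
        }
    for (i = 1; i < CACHE_SIZE; i++)     /* miss: evict oldest slot */
        if (cache[i].stamp < cache[victim].stamp)
            victim = i;
    fetch_from_disk(token, cache[victim].text);
    cache[victim].token = token;
    cache[victim].stamp = ++tick;
    return cache[victim].text;
}
```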
Like any engineering decision, this one involved tradeoffs. For example, it can be awkward to go from a text string to a token ID, because there is no index-by-string into the token dictionary. In such a case, a binary-search technique analogous to the in-memory bsearch of the C run-time library must be written. In WDA, there are three cases where I must go from an external string to a token value. The simplest possible way is to search the database linearly, first-to-last. This takes a long time, but I do it only for a certain number of predefined string values. Once I find the token IDs of these values, I cache them in the .INI file (in Win32, I would cache them in the Registry) and do not have to rescan the database unless the database itself changes. Therefore, a user interface to such a database should use drop-down lists or something similar, where you always know the integer-token values. A user interface based on, for example, handwriting input of a string, might cancel out some of the advantages of tokenization.
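Such a search is only a few lines of C. This sketch assumes the token dictionary is an array of strings already sorted in collating order, so the token of a string is its position:

```c
#include <string.h>

/* Binary search from a string to its token, analogous to the
   C run-time bsearch. dict must be sorted in collating order;
   returns -1 if the string is not in the dictionary. */
int find_token(char *const dict[], int dict_count, const char *s)
{
    int lo = 0, hi = dict_count - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strcmp(s, dict[mid]);
        if (c == 0) return mid;          /* token is the position */
        if (c < 0) hi = mid - 1;
        else       lo = mid + 1;
    }
    return -1;
}
```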
Inserting new entries can be painfully complex, or require additional time, and either penalty may be unacceptable.
Generally, I do not expend much effort making "efficient code" when efficiency is not an issue. However, even reasonable dBase code can run slowly when it is not optimized for performance. A complete run of the WDA database takes quite a few hours, as shown in the log file in Figure 4.
This application was developed in dBase IV 2.0. It takes about 2.5 hours to build the basic token-dictionary word list, because each word is looked up to see if it is already present (this has the properties of being O(n^2)). Once the list is built, it takes only 6 minutes to sort it and assign the tokens. It takes about 1.25 hours to create the binary database file from the text file, and about 7 minutes to index the binary database. It takes over 3 hours to write the binary resource files. Nearly 2 hours are required to build the KWIC index, and about half an hour to index its 250,000+ entries. It takes nearly 5 hours to write the KWIC resource files. Thus, a complete run takes about 16 hours on a 486DX2/50 (with a 9-ms SCSI drive), and it runs in its own 220-MB disk partition, consuming most of it. Could I write this to run faster? Probably. Is it worth it? No. I just start the run before I go to bed, and since I'm using Windows, I can continue later in the day with editing and other work unlikely to crash Windows. (Running CodeView and debugging potentially flaky applications is out.) Without multitasking, it runs in about 14 hours. This is acceptable because the task usually gets run about once a week.
With reasonable programming effort, the performance penalties for using tokenized databases are small, and often irrelevant. In many cases, tokenized databases make it possible to fit an application on a limited-resource machine where running an untokenized database would be impossible. Furthermore, such databases allow programs such as the Windows Developer's Assistant to maintain a reasonable disk footprint. A product that requires 60 MB+ of hard-drive space and many floppy disks to install is not as attractive to the end user as a product with identical functionality that requires under 4 MB of disk space and can be distributed on a single 1.44-MB floppy.
DDJ