Designing BitBox

Dr. Dobb's Sourcebook March/April 1997

Lean, mean, and flexible data management

By Lou Grinzo

Lou, a programmer, consultant, and technical writer, is the author of The Zen of Windows 95 Programming (Coriolis Books, 1995). He can reached 71055.1240@compuserve.com.

Data is a harsh mistress, to paraphrase Robert Heinlein. File formats, data formats, access methods, performance and security concerns, as well as distribution, installation, and maintenance issues are just some of the details that can quickly turn into nightmares when your application and its data encounter the real world. Such "opportunities for creativity" can be bewildering and frustrating, and worse, they can delay the release of your application and dramatically increase life cycle costs. My partner and I have been wrestling with these issues in various forms on consulting and contract programming jobs for years. Recently, we had to deal with a particularly demanding set of requirements that prompted me to use my toolbox of file-access routines as the foundation for a new database tool I call "BitBox." This project had more than its share of surprises. In fact, by the time I finished the basic design, I appreciated more than ever the old line about how anyone can build a bridge that will stand up, but it takes an engineer to build one that will just barely stand up.

Data is Hard

The project required us to write a Windows 3.1 application to be distributed via diskette to tens of thousands of customers. The eventual users of our program varied widely in computer literacy, from absolute beginners to gurus of the first order. This, coupled with the technical details of the project, suggested to us that we needed a custom solution, one that displayed the following characteristics:

Read-only access to about two dozen data files of wildly differing geometries. Because each file had a different record definition, there was no practical way we could have forced the information into appreciably fewer files.
Speed. Our application needed to read considerable amounts of data as the user interacted with it, so file access and processing had to be efficient to be "hidden" in screen-to-screen response time. This would be particularly difficult on the older PCs still used in corporate settings. Essentially, we had to support any system that could run Windows 3.1, such as a 386/16 with 4 MB of RAM.
Small size. We needed to distribute the application, help file, data files, numerous Windows metafiles, and an installation program on as few floppies as possible -- preferably one. Since adding even one floppy to the final package would have considerably increased the total cost of distribution, every byte had to pull its own weight.
Enough security to detect vandals and foil casual snoopers.
Enough flexibility to allow us to maintain the master copy of our data in Microsoft Access, our preferred tool. We weren't concerned about converting the data from Access to another format as part of our development process.

Because of the wide range of user-experience levels we had to accommodate, we decided not to use a DLL on the project, since we didn't want the added headache at installation time. Similarly, we didn't want all our data floating about as individual disk files, where they could be deleted, moved, renamed, edited, or otherwise mangled by well-meaning users. (I've seen external files subjected to just about every imaginable indignity at one time or another, and my programs and I have the scars to prove it.) We decided that our solution should also provide a robust and automatic way for the application to identify the set of data files, including a version number, since it seemed likely that there would be later, data-only updates to the program.

Since the hardware and software configurations of our users were another unknown, we wanted to place the minimum burden possible on the operating system. With all those data files, we wanted to avoid running out of file handles, of course, but we also wanted to avoid having to warp the program's logic to open only one or two files at a time. While constant file opens and closes probably would not have seriously impacted performance, we knew it wouldn't have helped matters. Similarly, we needed to ensure that our data was available at run time in exactly the format we wanted, without wasting cycles on data conversions.

Cross-platform support was not a high priority, but we still wanted our eventual solution to offer at least data-file-level portability across DOS, Win16, Win32, and OS/2 -- all using the same API. We also wanted a solution that could be used with various programming languages (even at the cost of using a dreaded DLL under Windows), since some clients insist that new code be written in a particular language or even with a particular compiler product.

The Cupboard was Bare

Given these constraints, it's probably no surprise that we didn't find an off-the-shelf solution. Still, it's worth briefly covering what we considered and rejected.

ODBC or the Borland Database Engine with files in a standard database format. This wasn't even feasible, as both solutions are far too large and complex. Including the redistributables for either product would have blown our disk budget. Also, we didn't want to inflict a second installation procedure and more system-management headaches on our users.
Export files from Access to dbf (xBase) format and read them with home-grown or commercially available code linked into our application. This seemed promising -- until we realized that dbf files, by definition, use only ASCII representations of numbers, not binary values. This would have required the application to do run-time data conversions, and worse, it would have increased the size of our files significantly. In numerous cases we were able to use 8- or 16-bit integers; consuming 10 bytes for each of those fields (for example, with a dbf "numeric" type), multiplied by the number of rows in the data files wasted too much space. Also, this solution would have required us to manage a separate disk file for each of our data files, and we still would not have had any security. We could have used the trick of creating character format fields in dbf files, then really storing binary numbers in those fields. We immediately rejected this solution as an "unnatural act" that would have been more trouble than it was worth in the long run.
Package the data files into the program's Windows resources. We briefly flirted with this approach, and even had a working prototype, but we quickly saw that it would have been too unwieldy to maintain in the long run. Also, it would have required us to redistribute the entire EXE and data to update just the data files.

The BitBox Design

Faced with the inevitable, I began work on what eventually became BitBox. The basic design of BitBox allows for or requires the following:

All files are "raw records" -- they have no headers, and each one is simply a collection of contiguous, fixed-length records in its own disk file.
A database compiler utility that assembles numerous individual data files into a single, self-identifying, (optionally) encoded, database. The database is tagged with the version number of the compiler utility (which specifies the file's format), plus a separate user-supplied version number, the file's 16-bit CRC value, the compilation date, an optional comment field, and several other important pieces of information. The file is "fingerprinted" in a way that makes it virtually impossible for another file to be mistaken for a BitBox database. Because the file contains a directory, it provides a valuable level of abstraction, so the application doesn't have to know anything about the format of the compiled database. One benefit of this design is the fact that our application can access a large number of files concurrently without concern for running out of operating-system file handles -- the entire database is accessed with a single handle.
The previous item, plus our decision to use a compiler, required us to create the database in a three-step process: Export it from Access to dbf files (a trivial operation); use a small, custom Borland Database Engine application that reads the dbf files and performs the desired data conversions and saves each file as raw records; and run the compiler to turn the set of raw files into a single BitBox database. This process was easily automated, and we considered it a worthwhile tradeoff, since it allowed us to manage our data in exactly the format and way we wanted with very little run-time overhead.
Data-access routines in the form of a Delphi library that is linked into the final application. This approach gave us a minimal distribution package: The EXE and compiled DAT file were all that were absolutely needed to run the program. We felt this simplicity was important, since it would allow the application's users to copy just these two files to another computer, including a laptop, and still have a fully functional copy without undue hassle or the need to run an installation program.
All access to individual records in the files is through cursors. The application can open many cursors on different files (or even the same file), interleave file access requests from different cursors, and so on. Because all access is record based, I found it economical to add support for features like reversing a cursor's read direction or repositioning a cursor by record number.

We also added direct support for Windows metafiles and bitmaps. This allowed us to pack these graphics into the main database with the "real" data, further simplifying packaging. We added custom routines that would load a specified file from the BitBox database and return it to the caller in the form of a Delphi TMetafile or TBitmap.

We Don't Need No Stinkin' Badges

With this many files, each with its own record definition, it was inevitable that the question of how to design the record-access routines became an issue. For this job, we used pairs of custom routines, each of which retrieved either the first or the next record from a specific file that matched certain criteria. The function that retrieved the first row in the file also initialized the cursor. For example, we knew that we would have to read records from file X where field A in each matching record was between two values determined at run time by the user. For this "query," our pair of functions looked like Example 1.

The returned value of these functions was a return code that indicated if the operation completed normally, and the min_A and max_A parameters defined the range of acceptable values for field A. The "cursor" type was a small structure that contained the state information for this query, such as the current record number, and was, in essence, a file handle. (This is a simplified example. In some cases, several pieces of data were supplied to define an acceptable record, and the GetFirst/GetNext routines had to examine more than one field.)

These functions would call a generic access routine to read the next record from the file, at which point we were using a "trusted interface." This always happens in such database applications. No matter how much effort is expended to make the interface between the application program and the database routines intelligent, there is always a point when you're using a generic routine to read data. Then, you're on your own; you're calling a function and passing it an untyped pointer to the buffer where it's to store the next record. There can be no compile-time or run-time validation of the operation. (Assuming you can spare the overhead, this limitation can be overcome with the use of persistent objects, but such solutions weren't viable for this project.) This was not a problem on our consulting job, but it did play an interesting role in the development of the commercial version of BitBox (which consists of C libraries and Pascal units).

From Custom Code to Product

Converting BitBox from a custom solution into a more generalized product was an even greater adventure. Some of the more interesting decisions and changes made along the way include:

Multiple database access. It was obvious that while many applications could be accommodated with a single BitBox database, it was also likely that, for some projects, this wouldn't be sufficient. It's not hard to imagine a project that includes a large amount of data, only a small part of which has to be updated after the initial release. By supporting several databases at run time, BitBox now gives programmers the option of packaging files to accommodate such maintenance needs.
Cursor manipulation and reference is now done via handles that are simply "magic tokens" as far as the application is concerned, much as Windows uses handles. I felt this change was necessary, since it kept all cursor state data out of the application program's reach and prevented unwanted experimentation. (One of Lou's Laws of Programming says that if you don't want a programmer to do something, make it impossible; warnings don't count.)
Direct support for importing dbf files, including memo/DBT files into the database compiler. Even though dbf files were not a viable option for our project, I knew that, in many cases, they could be, and directly supporting the lingua franca of database formats could be a nontrivial convenience.
Callback-based record filtering. I mentioned earlier that using a trusted interface had an interesting implication for BitBox, and this is it. Since I wasn't willing to add the complexity to BitBox to use object technology (which would have also sacrificed its cross-platform and cross-development-tool abilities), I knew that its read functions would always represent a trusted interface. As it turns out, once you embrace that design decision, callback functions become an extremely powerful and flexible way to filter records.

Callback Functions

Windows programmers are familiar with callback functions, as they're used to build a list of available fonts, for example, via the EnumFonts API. The BitBox implementation is very similar: When you define a cursor, you can optionally include a pointer to a function in your code. This function accepts a pointer to a record plus a 32-bit value set by your code (which can be anything that fits in 32 bits), and returns one of two predefined values indicating whether that record should be considered part of the result set. Whenever BitBox retrieves a record using that cursor, it relies on your callback function to act as a filter. This mechanism allows various functions, such as those that count the number of records in a file or the number of still-unread records, to operate in terms of either all physical records in the file or just in terms of that logical subset of records that meet the application's criteria. Callbacks give the application programmer the ultimate in flexibility -- the function can perform calculations using the record's fields and consult other resources (including other files and the user) in passing judgment on a record. Because callbacks are cursor specific, the application can open and use several cursors on the same file with different callback functions, or mix cursors that do and don't use callbacks.

Callback functions create an interesting possibility for optimization: You can use one to read part of a file and then change the behavior of the cursor by changing its callback to a different function (or none at all), or by setting a flag that will modify the behavior of the callback without changing the cursor's file position. This allows a programmer to exploit knowledge of the exact sequence of data in the program's files. For example, a program could use a callback that looked for components made of a particular material, and once the first one is found, the callback could be disabled, since the programmer knows that all the cast iron parts are at the end of the table. Once that sequence of data is found, there's no need to incur the overhead of filtering the remaining records -- they can simply be read sequentially until the end of the file is encountered. Combined with the ability to reverse the cursor's read direction, this feature becomes even more powerful -- a program can filter records to a certain point, such as the beginning of the cast iron parts, and then reverse the cursor direction and sequentially read the last few records pertaining to the parts made of aluminum, for example. Because callbacks place all control over filtering in the hands of the programmer on a per-record basis, they allow you to access data with almost any desired level of complexity and optimization.

Obviously, a program doesn't have to use callbacks to filter records from a file. You can write a subroutine that repeatedly reads the next unfiltered record until it either hits the end of the file or finds a matching record. But this would not generalize the concept of filtered records to include other BitBox functions, such as those that count records. Still, you have the option of using or not using callbacks, and at surprisingly little overhead.

Related to the previous item, it becomes very easy to do relational-database-style joins. A two-way join cursor, for example, would take a pointer to a callback function that accepts pointers to two records, one from each file. The logic in the callback would then check that the records have equal values in their join fields as well as meet whatever other criteria you desire. Again, callbacks are surprisingly flexible, as they allow you to accommodate necessary quirks in the data files. Perhaps one file uses an older numbering sequence for employee IDs, while another file uses a newer sequence. The callback function could be smart enough to take this into account and avoid false positives as well as false negatives. The more such examples you make up, the clearer it is that callbacks provide a flexibility and efficiency that canned databases and query languages can't match.

Given how often users must struggle with file placement details, I thought it would make sense to enhance BitBox's flexibility in this area, as well. When opening a database, the calling program can now specify whether the database is in the current directory or the directory that contains the main program's executable, regardless of whether it contains a fully qualified name, or whether BitBox should search several directories, similar to the way that Windows locates DLLs.

Summary

Designing and implementing BitBox was one of the most interesting, albeit at times frustrating, projects I've taken on. This article only provides a glimpse into the design and prototyping exercise; suffice it to say that along the way I explored dozens of dead ends and minor variations that "almost" worked. Still, I think BitBox is an excellent example of what can happen when you carefully, and grudgingly, select tradeoffs, such as requiring the database compilation step and relying on a trusted interface, then push the design as far as you can. In this case, the result was a flexible and efficient file-access package that solves many of the real-world problems of maintaining, distributing, and using data across the lifespan of an application.

DDJ