Designing BitBox

Dr. Dobb's Sourcebook March/April 1997

Lean, mean, and flexible data management

By Lou Grinzo

Lou, a programmer, consultant, and technical writer, is the author of The Zen of Windows 95 Programming (Coriolis Books, 1995). He can reached 71055.1240@compuserve.com.

Data is a harsh mistress, to paraphrase Robert Heinlein. File formats, data formats, access methods, performance and security concerns, as well as distribution, installation, and maintenance issues are just some of the details that can quickly turn into nightmares when your application and its data encounter the real world. Such "opportunities for creativity" can be bewildering and frustrating, and worse, they can delay the release of your application and dramatically increase life cycle costs. My partner and I have been wrestling with these issues in various forms on consulting and contract programming jobs for years. Recently, we had to deal with a particularly demanding set of requirements that prompted me to use my toolbox of file-access routines as the foundation for a new database tool I call "BitBox." This project had more than its share of surprises. In fact, by the time I finished the basic design, I appreciated more than ever the old line about how anyone can build a bridge that will stand up, but it takes an engineer to build one that will just barely stand up.

Data is Hard

The project required us to write a Windows 3.1 application to be distributed via diskette to tens of thousands of customers. The eventual users of our program varied widely in computer literacy, from absolute beginners to gurus of the first order. This, coupled with the technical details of the project, suggested to us that we needed a custom solution, one that displayed the following characteristics:

Because of the wide range of user-experience levels we had to accommodate, we decided not to use a DLL on the project, since we didn't want the added headache at installation time. Similarly, we didn't want all our data floating about as individual disk files, where they could be deleted, moved, renamed, edited, or otherwise mangled by well-meaning users. (I've seen external files subjected to just about every imaginable indignity at one time or another, and my programs and I have the scars to prove it.) We decided that our solution should also provide a robust and automatic way for the application to identify the set of data files, including a version number, since it seemed likely that there would be later, data-only updates to the program.

Since the hardware and software configurations of our users were another unknown, we wanted to place the minimum burden possible on the operating system. With all those data files, we wanted to avoid running out of file handles, of course, but we also wanted to avoid having to warp the program's logic to open only one or two files at a time. While constant file opens and closes probably would not have seriously impacted performance, we knew it wouldn't have helped matters. Similarly, we needed to ensure that our data was available at run time in exactly the format we wanted, without wasting cycles on data conversions.

Cross-platform support was not a high priority, but we still wanted our eventual solution to offer at least data-file-level portability across DOS, Win16, Win32, and OS/2 -- all using the same API. We also wanted a solution that could be used with various programming languages (even at the cost of using a dreaded DLL under Windows), since some clients insist that new code be written in a particular language or even with a particular compiler product.

The Cupboard was Bare

Given these constraints, it's probably no surprise that we didn't find an off-the-shelf solution. Still, it's worth briefly covering what we considered and rejected.

The BitBox Design

Faced with the inevitable, I began work on what eventually became BitBox. The basic design of BitBox allows for or requires the following:

We also added direct support for Windows metafiles and bitmaps. This allowed us to pack these graphics into the main database with the "real" data, further simplifying packaging. We added custom routines that would load a specified file from the BitBox database and return it to the caller in the form of a Delphi TMetafile or TBitmap.

We Don't Need No Stinkin' Badges

With this many files, each with its own record definition, it was inevitable that the question of how to design the record-access routines became an issue. For this job, we used pairs of custom routines, each of which retrieved either the first or the next record from a specific file that matched certain criteria. The function that retrieved the first row in the file also initialized the cursor. For example, we knew that we would have to read records from file X where field A in each matching record was between two values determined at run time by the user. For this "query," our pair of functions looked like Example 1.

The returned value of these functions was a return code that indicated if the operation completed normally, and the min_A and max_A parameters defined the range of acceptable values for field A. The "cursor" type was a small structure that contained the state information for this query, such as the current record number, and was, in essence, a file handle. (This is a simplified example. In some cases, several pieces of data were supplied to define an acceptable record, and the GetFirst/GetNext routines had to examine more than one field.)

These functions would call a generic access routine to read the next record from the file, at which point we were using a "trusted interface." This always happens in such database applications. No matter how much effort is expended to make the interface between the application program and the database routines intelligent, there is always a point when you're using a generic routine to read data. Then, you're on your own; you're calling a function and passing it an untyped pointer to the buffer where it's to store the next record. There can be no compile-time or run-time validation of the operation. (Assuming you can spare the overhead, this limitation can be overcome with the use of persistent objects, but such solutions weren't viable for this project.) This was not a problem on our consulting job, but it did play an interesting role in the development of the commercial version of BitBox (which consists of C libraries and Pascal units).

From Custom Code to Product

Converting BitBox from a custom solution into a more generalized product was an even greater adventure. Some of the more interesting decisions and changes made along the way include:

Callback Functions

Windows programmers are familiar with callback functions, as they're used to build a list of available fonts, for example, via the EnumFonts API. The BitBox implementation is very similar: When you define a cursor, you can optionally include a pointer to a function in your code. This function accepts a pointer to a record plus a 32-bit value set by your code (which can be anything that fits in 32 bits), and returns one of two predefined values indicating whether that record should be considered part of the result set. Whenever BitBox retrieves a record using that cursor, it relies on your callback function to act as a filter. This mechanism allows various functions, such as those that count the number of records in a file or the number of still-unread records, to operate in terms of either all physical records in the file or just in terms of that logical subset of records that meet the application's criteria. Callbacks give the application programmer the ultimate in flexibility -- the function can perform calculations using the record's fields and consult other resources (including other files and the user) in passing judgment on a record. Because callbacks are cursor specific, the application can open and use several cursors on the same file with different callback functions, or mix cursors that do and don't use callbacks.

Callback functions create an interesting possibility for optimization: You can use one to read part of a file and then change the behavior of the cursor by changing its callback to a different function (or none at all), or by setting a flag that will modify the behavior of the callback without changing the cursor's file position. This allows a programmer to exploit knowledge of the exact sequence of data in the program's files. For example, a program could use a callback that looked for components made of a particular material, and once the first one is found, the callback could be disabled, since the programmer knows that all the cast iron parts are at the end of the table. Once that sequence of data is found, there's no need to incur the overhead of filtering the remaining records -- they can simply be read sequentially until the end of the file is encountered. Combined with the ability to reverse the cursor's read direction, this feature becomes even more powerful -- a program can filter records to a certain point, such as the beginning of the cast iron parts, and then reverse the cursor direction and sequentially read the last few records pertaining to the parts made of aluminum, for example. Because callbacks place all control over filtering in the hands of the programmer on a per-record basis, they allow you to access data with almost any desired level of complexity and optimization.

Obviously, a program doesn't have to use callbacks to filter records from a file. You can write a subroutine that repeatedly reads the next unfiltered record until it either hits the end of the file or finds a matching record. But this would not generalize the concept of filtered records to include other BitBox functions, such as those that count records. Still, you have the option of using or not using callbacks, and at surprisingly little overhead.

Related to the previous item, it becomes very easy to do relational-database-style joins. A two-way join cursor, for example, would take a pointer to a callback function that accepts pointers to two records, one from each file. The logic in the callback would then check that the records have equal values in their join fields as well as meet whatever other criteria you desire. Again, callbacks are surprisingly flexible, as they allow you to accommodate necessary quirks in the data files. Perhaps one file uses an older numbering sequence for employee IDs, while another file uses a newer sequence. The callback function could be smart enough to take this into account and avoid false positives as well as false negatives. The more such examples you make up, the clearer it is that callbacks provide a flexibility and efficiency that canned databases and query languages can't match.

Given how often users must struggle with file placement details, I thought it would make sense to enhance BitBox's flexibility in this area, as well. When opening a database, the calling program can now specify whether the database is in the current directory or the directory that contains the main program's executable, regardless of whether it contains a fully qualified name, or whether BitBox should search several directories, similar to the way that Windows locates DLLs.

Summary

Designing and implementing BitBox was one of the most interesting, albeit at times frustrating, projects I've taken on. This article only provides a glimpse into the design and prototyping exercise; suffice it to say that along the way I explored dozens of dead ends and minor variations that "almost" worked. Still, I think BitBox is an excellent example of what can happen when you carefully, and grudgingly, select tradeoffs, such as requiring the database compilation step and relying on a trusted interface, then push the design as far as you can. In this case, the result was a flexible and efficient file-access package that solves many of the real-world problems of maintaining, distributing, and using data across the lifespan of an application.

DDJ