After spending the first half of his life in Hollywood, CA and the second in Boston, MA (where he happily discovered Thai restaurants), Leor Zolman now resides directly between those two cities in beautiful Lawrence, KS, where he has a tremendously enjoyable time hacking DOS and Xenix systems for CUJ (but really misses Thai restaurants.)
This is the final installment in a series of articles describing the design and implementation of a simple flat-file information manager in C. My goal for this series has been to provide a clear, general introduction to the usage of elementary C building blocks.
Last month I presented ASCII-based file I/O routines for reading and writing the Mini-Database system data records to disk. Those ASCII routines used the text-line-oriented library functions fprintf() and fscanf() to perform their data transfers. While ASCII data files are easily examined and debugged, they are generally less efficient than binary files. An application that manipulates ASCII data must perform format conversions on the data objects: from binary to ASCII on output to disk, and vice-versa when the data is read back from disk. While the performance penalty isn't really noticeable for small sets of data (such as after entering a few records into the Mini-Database), for larger amounts of data the time needed to perform these conversions can be prohibitive, especially when a larger percentage of the data is numeric.
Therefore, the subject this month is a set of disk routines for reading and writing the data files of our Mini-Database system in straight, binary format. Instead of formatting the data into ASCII text, these binary routines will simply dump the data from memory into disk files and back again with no translation whatsoever. The effective results, at least as far as a user of the Mini-Database may notice, will be the same as if the data were being translated to ASCII and back again to binary. To make any sense of the data files outside of the application, however, you would probably need a low-level debugging tool and some knowledge of the specifics of the target machine's data type characteristics.
The Modular Approach
The ASCII-based file I/O functions presented last month, read_db() and write_db(), were placed in a source file by themselves. Since the rest of the Mini-Database system code makes no assumptions about how read_db() and write_db() perform their tasks, we can simply substitute a new source file containing the binary versions for the source file containing last month's versions of these functions. You won't even have to re-compile the other modules; just compile the one new source module, and then link it in with the other existing object modules to get a complete, new executable.
Variations On A Theme
The Mini-Database (as presented up to now) has used a fixed-length array named recs to store the pointers to data records in memory. The maximum number of records that the system can handle has been fixed by the value of the MAX_RECS symbolic constant, defined in the header file. While this arrangement is reasonably efficient (because memory for the data records themselves is still obtained dynamically), we could improve the system further by somehow allowing even the memory used by the recs array to be allocated at run-time, based upon anticipated requirements.

When a binary format is used for representing the data on disk, we can tell exactly how much run-time memory will be required to hold an existing Mini-Database file saved in binary format just by finding the length of the saved binary file (note that this would not be true in the case of files saved in ASCII text format).
So how can we arrange to allocate only the amount of memory we'll really need for the array of record pointers, and no more? The answer lies in the effective use of C's powerful data-typing mechanisms; the technique is called dynamic array allocation.
Building And Using A Dynamic Array
Simple arrays in C are defined using notation such as
type-name identifier[size-constant];

The array dimension expression (size-constant, above) may be omitted in certain special instances, such as when the array is a formal parameter in a function definition (or in a prototype). Take, for example, a prototype for the common C library function strlen():
size_t strlen(const char str[]);

The dimension isn't necessary because in such cases the identifier str is acting purely as a pointer. There is no memory actually allocated except a single pointer's worth; that pointer becomes a place-holder for an existing (previously allocated) array whose address is passed to the function.

We can set up a dynamic array that works somewhat like the formal array parameter described above, but explicitly rather than implicitly. Listing 1 and Listing 2 show new versions of the header file and main program file for the Mini-Database system, enhanced to provide the option for a dynamically-allocated recs array (the original versions, limited to a statically-allocated array, were published in the April 1990 issue). Note that with the DYN_ARRAY symbol defined to FALSE, the program compiles as in the original version, with recs defined as a simple array of pointers (Listing 1, line 57). With DYN_ARRAY set to TRUE, on the other hand, recs is defined as "a pointer to an array of pointers to structures" (line 53).
Notice the empty square brackets! Since we now have a data type that is primarily a pointer, not an array, the compiler doesn't really care how many elements the array being pointed to may contain at runtime. We as programmers care very much, however, because we are going to be managing the memory for this "virtual array" explicitly. We will obtain memory for the dynamic array via malloc(), and assign the address of the memory to the recs variable. Once recs is initialized, the expression (*recs) will be valid when used as the base array expression in array subscripting operations.
Now the reason for the symbolic constant RECS (Listing 1, lines 54 and 58) begins to emerge: wherever the simple array name recs was used in the original (static array) version of the program, we now must substitute the more complex expression (*recs) to support dynamic array allocation. Rather than using awkward conditional blocks (e.g., #if. . .#else) every time an array operation appears, I've chosen to use RECS as the generalized expression for the array base, greatly reducing the number of places where code must be changed to support the dynamic arrays. In the few places where alternate coding was unavoidable, I've used conditional compilation based on the value of the DYN_ARRAY symbol.
Binary I/O Without Dynamic Allocation
First, I'll explain how the binary file I/O functions work in the case where DYN_ARRAY is FALSE, i.e., RECS is defined as the simple static-array name recs.

To write the currently active database to disk, the write_db() function (Listing 3, lines 74-115) is called with the filename as the only formal parameter. To safeguard against wiping out previously-stored versions of the data file in the event of a write error, a temporary filename is used to actually write the file. If the write operation completes without error, then the temporary file is renamed with the supplied filename parameter.
The first really interesting difference between the binary and text-based versions of the function is the way the fopen() library call is invoked. Instead of using a mode of "w", the binary version uses "wb". This mode prevents system-dependent automatic newline-conversions. With a mode of "w", newlines (represented in memory by a single ASCII linefeed [0x0a] character) are expanded into CR-LF pairs on certain systems (like MS-DOS) before being written to disk. With the "wb" mode, this conversion is never performed; therefore, "wb" is the mode that must be used to write binary data.
To perform the disk writes, we call upon the standard library function fwrite() within a for loop that iterates through the records pointed to by RECS. The fwrite() function has been cleverly designed to accept its data parameters in terms of logical records of whatever size is convenient to the programmer, as opposed to solely in terms of "bytes." Such organization easily supports the writing of fixed-length records with effective error checking, and all in a very portable manner.
Our data is organized by records, each of which is a structure of type record being pointed to by an entry in the RECS array. Listing 3, lines 96 and 97 show the fwrite() call for a single record. The first parameter is a pointer to the data, the second is the size of each logical record, the third is the number of such logical records to be written, and the final parameter is the file pointer.
Notice how the sizeof operator is used to obtain the size of the data record. This C "idiom" is far preferable to absolute byte counts embedded as numeric constants. Using sizeof renders the code immune to variations in data representation between different kinds of computers. There is no additional runtime overhead incurred either, because sizeof expressions are always evaluated at compile time.
fwrite() returns the number of logical records successfully written. The error checking is easy: if the return value does not match the value of the third parameter, there must have been a fatal error, in which case, we diagnose the error, clean up, and return. If the return value agrees, the write loop continues.
After all records have been written without incident, the old version of the disk file is removed (line 106) and we begin a while loop. This loop is another safeguard for the file-renaming process; if, for any reason, the temporary file cannot be renamed to the filename supplied, then the user is given the opportunity to enter another filename. This process continues until the rename() function succeeds.
This safety feature wasn't in my first draft, but during the testing process, I once created a new database and accidentally gave it an illegal filename. Since no filename checking was performed until write_db() was actually called, the program wouldn't let me save my data, and I had to start over! Safety checks like this while loop help reduce the chance of losing data, contributing to "user friendliness".
Reading a saved database file into memory with read_db() follows a similar pattern. Again, the letter "b" is appended to the mode string (yielding "rb") in order to turn off text conversions on input. The fread() standard library function is analogous to fwrite(), with one minor variation: we must be able to distinguish between error conditions and a normal end-of-file condition. If the ferror() function returns a logical true (non-zero) value after a call to fread(), then we know some kind of error has occurred. Otherwise, any return value less than the value of the third parameter indicates an end-of-file condition, with the number of records read equal to the return value. Lines 60-64 of Listing 3 perform the checks for error and end-of-file conditions. If both tests are passed, then lines 63-64 allocate memory for the record (checking for an out-of-memory condition), and the data is copied over from the temporary holding buffer recbuf using structure assignment (line 66).
Going Dynamic
Now let's examine how the system operates when DYN_ARRAY is set to TRUE. First a new symbolic constant, MAX_TO_ADD, and the previously discussed changes to the definitions of recs and RECS are added to the header file (Listing 1, lines 52-54).

MAX_TO_ADD controls the allocation of memory for database records above-and-beyond the existing memory requirements. When MAX_TO_ADD is 100, only 100 or fewer new records can be added to any database file within a single session (between opening and closing the file). If the MAX_TO_ADD limit is reached, the user must close the file and re-open it before adding additional records. We cannot "dynamically extend" the RECS array once it has been initially allocated, because the memory must be contiguous. (Remember, we're dealing with memory for the array of pointers to the data records, not the memory for data records. The data record memory can be allocated in record-sized chunks as necessary.)
Within the CREATE option of the main menu, the dynamic array mechanism requires some special setup (Listing 2, lines 74-81). The size for RECS is calculated as the product of the data record pointer size multiplied by MAX_TO_ADD (since the database will be initially empty). The required memory is obtained via malloc(), and the array pointer recs is initialized with the address of the allocated memory block. Note that this is one of only two locations in the program code (the second comes up shortly) where the variable recs is operated on directly; in all other references, recs appears only in conjunction with the indirection operator as specified in the definition of RECS (Listing 1, line 54).
The other section of code that must change to support the dynamic array is in the read_db() function. Before a saved database file is loaded into memory, we must know its size so the appropriate amount of memory can be allocated. The file size is obtained by calling the fseek() and ftell() standard library functions: after a data file is opened for binary input, we seek to the end of the file using fseek() (Listing 3, line 38). Then we call ftell() to get the byte-position of the end-of-file marker. That byte-position value is divided by the logical data record size to yield the number of records stored in the file, MAX_TO_ADD is added, and (finally!) that resulting value is scaled by the size of a record pointer to determine the appropriate quantity of memory to allocate (line 41). After all that, we must still call fseek() one more time to re-position the file pointer at the beginning of the file, in preparation for reading the data.
Note: there may be a library function included with your particular C implementation that simply returns the size of a file, eliminating the need to use fseek() and ftell() as shown. However, not all compiler vendors provide such a function. Use the technique shown here for maximum portability; this method is fairly efficient because file seeks are typically performed via fast, random-access system calls.
Exercises
As the system is now structured, all data for an open database file must be present in memory simultaneously. How might the system be enhanced to allow databases too large for available memory to be handled? This is not a trivial proposition, by the way.

For a first attempt, think about keeping only the current record in memory, and don't worry about how to sort the database.
When you have that much designed, create an indexing scheme based on an in-memory index array that represents the ordering of the records in the data file. When the file is stored, the array occupies the first block of file data (preceded by some parameters indicating the length of the index array, by necessity, and perhaps also the length of the data portion of the file for convenience). As individual records are added and deleted, the index array can be updated to reflect logical ordering.
The package could also be generalized by allowing the index array to be kept in its own data file and to grow larger than the existing main memory capacity. At that point, you'd have a basic framework similar to the scheme used by many full-blown relational database systems! But realistically, I'd recommend using one of those commercial systems over designing one yourself...unless you're really ready for a challenge.