Features


An Applied File I/O Tutorial: Text-Based Disk Routines

Leor Zolman


After spending the first half of his life in Hollywood, CA and the second in Boston, MA (where he happily discovered Thai restaurants), Leor Zolman now resides directly between those two cities in beautiful Lawrence, KS, where he has a tremendously enjoyable time hacking DOS and Xenix systems for CUJ (but really misses Thai restaurants.)

In the first two installments of this tutorial series, I presented the framework, user-interface and record-editing portions of a small, special-purpose database program. Now we arrive, finally, at the central point of the tutorial: the storage and retrieval of database records to and from the mass storage device.

Text Or Binary?

The fundamental design of the read_db() and write_db() functions depends on whether we store the data as human-readable ASCII text or as a straight (raw) binary image. If efficiency is a major concern, then binary mode is the right choice: the data will take up less space on disk and require the minimum amount of format translation during input and output operations. To make the routines easier to debug, however, the ASCII format is often more appropriate: storing the data in human-readable form allows for convenient visual inspection (using type or any text editor) after the data has been written to disk.

Other considerations also need to be taken into account when deciding upon a disk format. Will there ever be a need for the data to be read in by other applications, such as spreadsheets or full-blown database management systems? If so, the text format is probably the better choice.

Depending on which of these two approaches is chosen, a different set of standard library functions will be required. For the ASCII approach, the line-oriented formatted I/O functions fprintf() and fscanf() will do all the work. The binary approach will rely on the byte-oriented functions fread() and fwrite().
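To make the contrast between the two families concrete, here is a minimal sketch using a made-up two-field record. The struct sample type and the four helper names are assumptions invented for this illustration; they are not taken from the mini-database listings.

```c
#include <stdio.h>

/* Hypothetical two-field record, used only for this illustration. */
struct sample {
    int   id;
    float salary;
};

/* ASCII approach: one formatted, human-readable line per record. */
int put_text(FILE *fp, const struct sample *s)
{
    return fprintf(fp, "%d %.2f\n", s->id, s->salary) < 0 ? -1 : 0;
}

int get_text(FILE *fp, struct sample *s)
{
    return fscanf(fp, "%d %f", &s->id, &s->salary) == 2 ? 0 : -1;
}

/* Binary approach: the in-memory image is copied to and from disk as is. */
int put_binary(FILE *fp, const struct sample *s)
{
    return fwrite(s, sizeof *s, 1, fp) == 1 ? 0 : -1;
}

int get_binary(FILE *fp, struct sample *s)
{
    return fread(s, sizeof *s, 1, fp) == 1 ? 0 : -1;
}
```

All four helpers return 0 on success and -1 on failure. Note that the text version must spell out a conversion for every field, while the binary version needs only the size of the record; the price is a file that depends on the compiler's struct layout and cannot be inspected with a text editor.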

The Interface

Both versions of read_db() and write_db() interface with their calling routine (the main function) via the same set of external data:

RECS: the array of pointers to data records. The actual array name is recs, with RECS being a defined synonym for recs. This synonym will ease a future transition to dynamic array allocation.

n_recs: the number of records currently stored in RECS.

The only additional piece of information specified by the caller is the file name to be used for the operation.

write_db() does not return any value; either the data is written correctly, or the error is diagnosed within write_db() and the program aborts through a call to the error() function. See the Exercises section at the end of this article for more on write_db() error recovery.

read_db() returns the number of records loaded.
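The interface described above might be declared along the following lines. This is only a sketch: the capacity value, the incomplete struct record type, and the exact parameter types of the two prototypes are assumptions made for illustration, and the header in Listing 1 remains the authoritative version.

```c
#define MAX_RECS 1000           /* assumed capacity, for illustration only */

struct record;                  /* field layout is defined in the real header */

struct record *recs[MAX_RECS];  /* statically sized array of record pointers */
#define RECS recs               /* synonym; can later name a dynamic array */

int n_recs;                     /* number of records currently stored in RECS */
int max_recs;                   /* set to MAX_RECS by read_db() in this version */

/* Assumed prototypes: the caller supplies only the file name. */
int  read_db(char *filename);   /* returns the number of records loaded */
void write_db(char *filename);  /* aborts through error() on failure */
```

Because all other code refers to RECS rather than recs, replacing the static array with a dynamically allocated one later requires changing only the definition and the #define.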

For convenience, the header file for this mini-database system is reproduced in Listing 1.

The read_db() Function

Listing 2 shows the read_db() and write_db() functions written to work with ASCII data files.

read_db() begins by defining several control variables, along with a set of temporary variables for containing record field values in transition from disk to memory array storage.

The control variables are:

fp: the file pointer for buffered input

rec_no: the number of records read in so far

rp: a temporary record buffer pointer

nitems: the number of items read in each line of input

read_db() starts out by setting the max_recs variable to the maximum number of records that can be handled by the system (line 31). Since the recs array is defined statically to contain MAX_RECS elements, the value of max_recs will never change in this version; we'll make better use of max_recs later when adding dynamic array allocation to the system.

To open the input file, the fopen() library function is called with a mode of "r", which opens the file for reading in text mode. NULL (0) is returned if the file cannot be opened.

The text format chosen for the data consists of a single line of ASCII text (terminated by a newline) per record, with whitespace (one space character) between adjacent fields of a record and no other whitespace permitted.
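A minimal sketch of the open-and-check step follows. The helper name open_for_read() and the wording of the message are assumptions for this example; they do not appear in Listing 2.

```c
#include <stdio.h>

/* Open a data file for reading in text mode; NULL signals failure. */
FILE *open_for_read(const char *filename)
{
    FILE *fp = fopen(filename, "r");

    if (fp == NULL)
        fprintf(stderr, "Cannot open %s for reading\n", filename);
    return fp;
}
```

The caller must still test the returned pointer before using it, since every other stream function assumes a valid FILE pointer.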

The main read loop (lines 39-68) opens with a call to the fscanf function. The format specification contains one format conversion specifier of the appropriate type for each field of the record, and the remaining parameters are the locally-defined temporary variables. Because the parameters to the scanf family of input functions must all be pointers, the & operator must be used on the scalar parameters in order to generate pointers to those objects. The character array parameters (last and first) do not need the & operator applied to them, because array names used alone are equivalent to pointers to their first elements.

fscanf returns a value telling how many items were actually matched from the input, and we assign that value to the variable nitems. If the value is EOF, then a normal end-of-file has been reached and we fall out of the loop, close the file, and return with the total number of records loaded (rec_no, initialized to zero at the top of the function).

If nitems ends up with any value other than 7, then something unexpected was encountered in the input; a warning message is printed and the reading loop is exited.
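The shape of such a read loop can be sketched as follows. The seven field names and types here are invented for the example (the real ones are defined in Listing 1), and the field widths in the two string conversions are my own precaution against buffer overruns rather than something taken from Listing 2.

```c
#include <stdio.h>

#define N_FIELDS 7   /* items expected on each line */

/* Read records until EOF or a malformed line; return the number of
   complete records seen. Field names and types are hypothetical. */
int scan_records(FILE *fp)
{
    char  last[26], first[16];   /* arrays: passed without & */
    long  id;                    /* scalars: passed with &   */
    int   age, dept, years;
    float salary;
    int   nitems, rec_no = 0;

    for (;;) {
        nitems = fscanf(fp, "%25s %15s %ld %d %d %d %f",
                        last, first, &id, &age, &dept, &years, &salary);
        if (nitems == EOF)               /* normal end of file */
            break;
        if (nitems != N_FIELDS) {        /* short or malformed record */
            fprintf(stderr, "Warning: bad data after record %d\n", rec_no);
            break;
        }
        rec_no++;                        /* a real loop would store the data here */
    }
    return rec_no;
}
```

A line that ends early or contains a letter where a number belongs makes fscanf stop at the offending item and return a count smaller than 7, which is what the second test catches.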

Having avoided all the possible error conditions from the fscanf() call, we are ready to deal with a valid record's worth of new data. The first step is to get some memory in which to store the data; the alloc_rec() function (Listing 3, reproduced from the MDBUTIL.C listing shown in the April '90 issue) does this for us. alloc_rec() returns either a valid pointer to the needed block of memory, or NULL if the memory could not be allocated. This return value is assigned to the rp variable.

Using the rp pointer, each field value is copied into the memory block obtained from alloc_rec. Then the address of this memory block is installed in the RECS array and the record counter rec_no is incremented (line 67) for the next loop iteration.
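The allocate, copy and install sequence might look like the sketch below. Everything named demo_ here is a stand-in invented for this example: demo_alloc() plays the role of alloc_rec(), and the three-field struct demo_rec is a simplified, hypothetical record layout.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX 100

struct demo_rec {                 /* hypothetical, simplified layout */
    char last[26];
    char first[16];
    long id;
};

struct demo_rec *demo_recs[DEMO_MAX];
int demo_count;

/* Stand-in for alloc_rec(): returns a new record block, or NULL. */
struct demo_rec *demo_alloc(void)
{
    return malloc(sizeof(struct demo_rec));
}

/* Copy the temporaries into a fresh block and install it in the array.
   Returns 0 on success, -1 if the array is full or memory runs out. */
int demo_store(const char *last, const char *first, long id)
{
    struct demo_rec *rp;

    if (demo_count >= DEMO_MAX || (rp = demo_alloc()) == NULL)
        return -1;
    strncpy(rp->last, last, sizeof rp->last - 1);
    rp->last[sizeof rp->last - 1] = '\0';
    strncpy(rp->first, first, sizeof rp->first - 1);
    rp->first[sizeof rp->first - 1] = '\0';
    rp->id = id;
    demo_recs[demo_count++] = rp;   /* install, then advance the counter */
    return 0;
}
```

Checking the allocation result before touching rp is the important part; the copy and the install are only safe once a valid block is in hand.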

The write_db() function is set up much like read_db(). Since the fprintf() function takes values, not pointers, for its list of parameters to write, there is no need for & operators (in fact, using them would cause incorrect results). The return value from fprintf() is checked only for a negative value, which is how fprintf() is defined to report an output error.

The only additional feature of interest in write_db() is the use of a temporary file for writing the output text. This practice guards against losing both the results from the current session and the previously stored database file in the event of a catastrophic failure during the output writing process. Only if the temporary file is written without incident is the previous version of the file erased and the temporary file renamed. To be especially safe, a check is performed on the return value of the rename() call (lines 110-116) in case the new filename is not accepted; in this case, the user is given the opportunity to enter other names until the rename() call succeeds.
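The temporary-file technique can be sketched as follows. The function name safe_save(), its parameters, and the use of plain strings in place of database records are assumptions made to keep the example short. Unlike the version described above, this sketch simply reports a rename() failure to its caller instead of prompting for another name.

```c
#include <stdio.h>

/* Write n lines to a temporary file, then swap it in for the real file.
   Returns 0 on success, -1 on failure. The previous version of the file
   is removed only after the temporary file has been written and closed. */
int safe_save(const char *filename, const char *tmpname,
              const char *lines[], int n)
{
    FILE *fp = fopen(tmpname, "w");
    int i;

    if (fp == NULL)
        return -1;
    for (i = 0; i < n; i++) {
        if (fprintf(fp, "%s\n", lines[i]) < 0) {   /* negative means error */
            fclose(fp);
            remove(tmpname);
            return -1;
        }
    }
    if (fclose(fp) == EOF) {        /* buffered data may fail to flush here */
        remove(tmpname);
        return -1;
    }
    remove(filename);               /* fails harmlessly if no old version exists */
    if (rename(tmpname, filename) != 0)
        return -1;                  /* the new data is still intact in tmpname */
    return 0;
}
```

Checking fclose() matters as much as checking fprintf(), because with buffered output a full disk may not be detected until the final flush.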

Next time, I'll show how to implement the read_db() and write_db() functions using raw binary images instead of ASCII data.

Exercises

1. The %s format conversions used to read in the first and last name fields effectively prevent those fields from being able to contain space characters. What if a two-part first or last name needs to be represented, or a middle name? A way to extend the system to allow strings with embedded spaces would be useful, and this can be accomplished by changing the text format to require a delimiter character between each field item. A special form of format conversion may be used in place of "%s" to read such variable-length string fields: if the delimiter character chosen were the vertical bar, for example, then a scanf call to read a single line containing a variable-length string (terminated by a vertical bar character) followed by an integer would look like:

scanf("%[^|] |%d\n", string, &i);

The sub-sequence %[^|] tells scanf to match all characters up to but not including the first | character encountered. The final | in the format sequence says to then skip the | character in the input stream.
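To experiment with this conversion before changing the database code, a small test function built on sscanf can help. The name parse_delimited() and the 40-character buffer size are assumptions for this sketch, and the width of 39 in the conversion is my addition to keep a long field from overrunning the buffer.

```c
#include <stdio.h>

/* Parse "text with spaces|integer" from a string.
   name must have room for at least 40 characters.
   Returns 1 if both items were matched, 0 otherwise. */
int parse_delimited(const char *line, char *name, int *value)
{
    return sscanf(line, "%39[^|]|%d", name, value) == 2;
}
```

Given the line "Mary Ann|42", name receives "Mary Ann" (space included) and value receives 42. One limitation worth knowing: the %[ conversion must match at least one character, so an empty field such as "|42" makes the call fail.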

Modify the mini-database to allow spaces within the first name and last name fields, using this technique.

2. As written, if an error occurs during the main loop of a write_db() call it is possible to lose all data modified in the current session. Modify the program to recover gracefully from a file output error, so that the user has a chance to try again on, say, a different drive. Note that write_db() is currently called from both the SAVE and QUIT main menu options.