A Self-Extracting Archive for MS-DOS

P.J. LaBrocca


Pat LaBrocca is the author of ReCalc(TM), a set of rational expression calculators that never give answers (well, almost never), and run identically on PCs, Macintoshes and Apples. He has a BS and MA in Chemistry and teaches computer science at Peter Rouget Middle School 88 in Brooklyn, NY. You can contact him at plabrocc@nycenet.edu.

Introduction

An archive is a file that contains several other files. A self-extracting archive (SEA) is a file that doesn't need the original archiving software to release its contents. In this article I present a pair of utilities for creating SEAs under MS-DOS. The utilities are a barebones implementation: they omit features common to other archivers, such as compression and preservation of subdirectories. Once you understand how these utilities work, you may want to extend them, and I provide some references at the end of this article for that purpose.

The SEA Structure

My self-extracting archive consists of three parts. The first part is the set of files that have been archived, which I refer to as the component files. The second part, the extraction module, is the code embedded in the archive that reads the component files from the archive and writes them to new output files. Finally, a system of headers embedded in the archive tells the extraction module the names and lengths of the component files, and the extraction module uses that information to recreate the files. To avoid confusion with C's include header files and MS-DOS's program file headers, I refer to my headers as prefix headers, since each one prefixes a component file. Also, when I refer to the "archiver" I mean the utility that builds the archive; when I refer to the "SEA" I mean the archive itself.

The disk layout of a SEA containing n component files is shown in Figure 1.

A terminating suffix header follows the last component file. The suffix header indicates to the extraction module that there are no more component files.

Overview of SEA Process

The archiver starts by copying the extraction module to the archive. Then for each component file, the archiver constructs a prefix header and writes the header and the associated file to the archive. When the archiver runs out of files to archive, it writes the suffix header.

The SEA appears to MS-DOS as an .EXE file and executes like any other .EXE file. The SEA is actually an .EXE file with extra information appended to the end. The appended information does not interfere with the SEA's execution. (See the sidebar, "Piggybacking an .EXE File.")

When you run the SEA, the extraction module reads the prefix headers and recreates each of the files in turn. This process continues until the SEA reads the suffix header, at which time the extraction is complete.

The Archiver — Detailed Operation

The source code for the archiver is in arch.c (Listing 2). The extraction module and the archiver expect the same prefix headers, so I put the prefix header structure in a separate include file (see Listing 1). After verifying that the user has provided a list of files to archive, the archiver begins opening files. First, arch.exe tries to open the extraction module, extr.exe, to prepare to copy it to the archive. If arch.exe can't find the extraction module, it displays an error message and halts.

The default name for the SEA to be created is out.exe. arch.exe tries to open the output file with this name, this time for writing, and issues a message if the open fails. arch.exe opens out.exe as a binary file so that the library functions will not perform unwanted conversions on the data. For example, when fopen opens a file in text mode under MS-DOS, subsequent reads convert each carriage return/line feed pair to a single newline character. Since an archive must be able to hold any kind of file, arch.exe opens every file in binary mode, which guarantees the integrity of each byte.

If all goes well, arch.exe copies the extraction module to out.exe, and then closes it.
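The copy step can be sketched as a small helper. This is only an illustration, not the code from Listing 2: the name copy_stream and the 512-byte buffer are assumptions, and the caller is expected to have opened both streams in binary mode ("rb" and "wb").

```c
#include <stdio.h>

/* Hypothetical helper: copy everything remaining in `in` to `out`.
   Both streams should have been opened in binary mode so that no
   byte is translated along the way.
   Returns the number of bytes copied, or -1L on a read or write error. */
long copy_stream(FILE *in, FILE *out)
{
    char buf[512];
    size_t n;
    long total = 0L;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1L;             /* write error, e.g. disk full */
        total += (long)n;
    }
    return ferror(in) ? -1L : total;
}
```

Used on the extraction module, a call such as copy_stream(extr, out) would leave out.exe positioned exactly where the first prefix header belongs.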

Next, arch.exe attempts to open a file from the command line list. If arch.exe can't open the file, it displays a message indicating the name of the file, increments count, and forces a jump to the top of the loop. (In this case, the failure of a file to open may not indicate a problem. Wild-card expansion in the command line sometimes generates the name of a subdirectory, which, of course, can't be opened.)

Next, arch.exe calls fseek and ftell to determine the size of the opened file. The call

fseek( input, 0, SEEK_END );

moves the file position indicator (an index into the file, maintained through the FILE object) to one byte after the last byte in the file. A call to ftell returns the current file position. The combination

fseek( input, 0, SEEK_END );
header.filesize = ftell( input );

stores the file's size, in bytes, in header.filesize. Another call to fseek, this time with argument SEEK_SET,

fseek( input, 0, SEEK_SET );

repositions input's file position indicator to the beginning of the file.
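The three calls above can be wrapped in one function with error checking added. The function name file_size is an assumption made for this sketch; the technique relies on the stream having been opened in binary mode, where the value returned by ftell is a byte count.

```c
#include <stdio.h>

/* Hypothetical wrapper: return the size in bytes of a file opened in
   binary mode and leave the file position at the beginning.
   Returns -1L if any of the positioning calls fails. */
long file_size(FILE *fp)
{
    long size;

    if (fseek(fp, 0L, SEEK_END) != 0)   /* move just past the last byte */
        return -1L;
    size = ftell(fp);                   /* current position == length */
    if (size < 0L)
        return -1L;
    if (fseek(fp, 0L, SEEK_SET) != 0)   /* rewind for the copy that follows */
        return -1L;
    return size;
}
```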

Depending on how it is built, arch.exe may provide wildcard expansion of command line arguments. Most compilers provide an object module that can be linked into the program to provide this feature (see the sidebar "Wildcard Expansion"). Therefore, after arch.exe reads the command line, the command-line arguments may be file names with or without extensions, and with partial paths or full paths. The prefix header structure expects at most a file name plus extension, so arch.exe has to reduce each command line argument to that form.

Some C compilers provide a function that does just that job. Unfortunately, it's not a standard function. For example, Microsoft C provides _splitpath, which breaks a path into its component parts and stores them in strings. Zortech C++ 3.0 supplies the function filespecname, which returns a pointer to a string containing the file name plus extension. Instead of using a compiler-specific function, I created the function filename in Listing 2. filename is a stripped down version of _splitpath that extracts the file name plus extension from a path. arch.exe passes filename the path and a character buffer. filename scans the path in reverse order, by decrementing a pointer, and stops when it has a full file name. The call to filename completes the prefix header data structure.
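A reverse scan of this kind might look like the following. It is a sketch in the spirit of the description, not a copy of filename from Listing 2: the name base_name, the explicit buffer length, and the choice of '\\', '/' and ':' as stopping characters are all assumptions.

```c
#include <string.h>

/* Hypothetical stand-in for the article's filename function.
   Scans `path` backward from its end and stops at the first path
   separator or drive colon, then copies what follows into `buf`.
   At most buflen - 1 characters are copied; the result is always
   null-terminated when buflen > 0. Returns buf. */
char *base_name(const char *path, char *buf, size_t buflen)
{
    const char *p = path + strlen(path);

    /* Decrement the pointer until the character before it is a separator. */
    while (p > path && p[-1] != '\\' && p[-1] != '/' && p[-1] != ':')
        --p;

    if (buflen == 0)
        return buf;
    strncpy(buf, p, buflen - 1);
    buf[buflen - 1] = '\0';
    return buf;
}
```

With a 13-byte buffer, a path such as C:\SRC\ARCH.C reduces to ARCH.C, and a name with no path at all is copied unchanged.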

arch.exe calls fwrite to write the prefix header to out.exe. After writing, fwrite leaves the file position indicator just beyond the prefix header. After copying the current file to out.exe, arch.exe closes input in preparation for the next file and increments count.

When there are no more files to be archived, arch.exe writes one final header, with header.filesize set to -1L, to the output file. This suffix header serves as the end-of-archive mark for the extraction module and completes the SEA.
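Writing the prefix and suffix headers could be sketched as follows. The member names filename and filesize and the -1L end mark come from the article; the struct tag, the SEA_NAME_LEN macro, the member order, and both function names are assumptions, since Listing 1 is not reproduced here.

```c
#include <stdio.h>
#include <string.h>

#define SEA_NAME_LEN 13           /* assumed: 8.3 name plus terminating null */

struct prefix_header {            /* assumed layout; see sea.h for the real one */
    char filename[SEA_NAME_LEN];
    long filesize;
};

/* Write one header describing a component file. Returns 1 on success. */
int write_header(FILE *out, const char *name, long size)
{
    struct prefix_header header;

    memset(&header, 0, sizeof header);           /* no stray bytes in the archive */
    strncpy(header.filename, name, SEA_NAME_LEN - 1);
    header.filesize = size;
    return fwrite(&header, sizeof header, 1, out) == 1;
}

/* Write the end-of-archive mark: a header whose filesize is -1L. */
int write_suffix_header(FILE *out)
{
    return write_header(out, "", -1L);
}
```

Because the structure is written as raw bytes, the archiver and the extraction module must be built with the same compiler settings so that both agree on its size and padding.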

Extraction Module — Detailed Operation

Listing 3 contains the source code for the extraction module, extr.exe. extr.exe reads in prefix headers, and uses the information thus gleaned to recreate files.

The Magic Number

The extractor must know where the first prefix header starts, which means the extractor must know its own length. To get the size information into the extraction module, I needed to know the size of extr.exe before I compiled it. So I declared a long int, MagicNumber, and initialized it with a dummy value. Then I compiled and linked extr.c the usual way. I ran MS-DOS's DIR command to obtain extr.exe's file size and used this value to initialize MagicNumber. I had to recompile, of course, since I had edited the source code, but the size of extr.exe doesn't change, because only the value stored in MagicNumber is different, not the amount of code or data. Now MagicNumber tells the extraction module how big it is. (I use a batch file to automate keeping the value of MagicNumber synchronized with the size of extr.exe. See "Miscellaneous Implementation Notes" for some details.)

Command Line Processing

When the extraction module begins execution (just inside function main), it first checks for arguments on the command line. When argc equals 1, the default action, extraction, is performed. The only option extr.exe recognizes is -l(ist), which causes a list of archived files and their sizes to be sent to the standard output. If the user types an unknown option at the command line, the SEA displays a usage message and exits.
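The decision made at the top of main can be sketched as a small function. The enum, its constants, and the name parse_mode are hypothetical; treating any single argument that begins with -l as the list option, and anything else as an error, is an assumption about how Listing 3 behaves.

```c
/* Hypothetical result codes for the three possible outcomes. */
enum sea_mode { MODE_EXTRACT, MODE_LIST, MODE_USAGE };

/* Decide what the SEA should do from its command line.
   No arguments means extract; a single argument starting with "-l"
   means list; anything else asks for the usage message. */
enum sea_mode parse_mode(int argc, char *argv[])
{
    if (argc == 1)
        return MODE_EXTRACT;
    if (argc == 2 && argv[1][0] == '-' && argv[1][1] == 'l')
        return MODE_LIST;
    return MODE_USAGE;
}
```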

argv[0] contains the string used to invoke the extraction program, so the function call fopen(argv[0], "rb") opens the file that is currently executing. The program can open its own .EXE file from disk because the executing image is just a copy of the disk file. Using this technique to open the SEA allows you to rename out.exe to whatever you want.

Navigating the File

The program calls fseek with arguments SEEK_SET and MagicNumber to move the file position indicator just past the extraction module, to the beginning of the first prefix header. (Remember to adjust MagicNumber if you edit extr.c!)
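Opening the running file and skipping the extraction module might be combined as below. The function name open_payload is an assumption; in the SEA, self_path would be argv[0] and offset would be MagicNumber.

```c
#include <stdio.h>

/* Hypothetical helper: open the program's own .EXE file for binary
   reading and position it `offset` bytes from the start, where the
   first prefix header is expected. Returns NULL on failure. */
FILE *open_payload(const char *self_path, long offset)
{
    FILE *fp = fopen(self_path, "rb");

    if (fp == NULL)
        return NULL;
    if (fseek(fp, offset, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

If offset does not match the true size of the extraction module, the first read returns bytes from the wrong place, which is why MagicNumber has to be kept in step with extr.exe.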

In the while loop, fread reads in a prefix header. If it's the suffix header, there are no more files to extract, so the program exits the loop and closes input. Otherwise, the program attempts to create a file in the current directory using the string from the prefix header, header.filename. If a file with the same name already exists, the program overwrites it. The messages displayed along the way indicate progress. When the program has copied header.filesize bytes to the new file, it closes the new file, increments count, and starts the next iteration.
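Recreating one component file comes down to copying a known number of bytes and then stopping, so that the next read lands on the following prefix header. A minimal sketch, with the name copy_bytes and the buffer size as assumptions:

```c
#include <stdio.h>

/* Hypothetical helper: copy exactly `count` bytes from `in` to `out`.
   Returns 1 on success, 0 if the input ends early or a write fails. */
int copy_bytes(FILE *in, FILE *out, long count)
{
    char buf[512];

    while (count > 0L) {
        /* Never ask for more than remains in this component file. */
        size_t want = count < (long)sizeof buf ? (size_t)count : sizeof buf;
        size_t got = fread(buf, 1, want, in);

        if (got == 0)
            return 0;               /* archive truncated or read error */
        if (fwrite(buf, 1, got, out) != got)
            return 0;               /* write error, e.g. disk full */
        count -= (long)got;
    }
    return 1;
}
```

In the extraction loop, the count passed in would be header.filesize.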

The procedure for listing the component files is the same, except instead of copying a file, the program skips the file by calling

fseek( input, header.filesize, SEEK_CUR );
which moves the file position indicator header.filesize bytes forward from its current position, to the beginning of the next prefix header.
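The listing loop can be sketched as follows. As before, only the member names filename and filesize and the -1L end mark are taken from the article; the struct layout, the function name list_archive, the report format, and the extra check for a corrupt negative size are assumptions.

```c
#include <stdio.h>

#define SEA_NAME_LEN 13           /* assumed: 8.3 name plus terminating null */

struct prefix_header {            /* assumed layout; see sea.h for the real one */
    char filename[SEA_NAME_LEN];
    long filesize;
};

/* List the component files of an archive whose first prefix header
   begins `start` bytes into `sea`. Writes one line per file to
   `report`. Returns the number of files, or -1 if the archive is
   truncated or a header looks corrupt. */
int list_archive(FILE *sea, long start, FILE *report)
{
    struct prefix_header header;
    int count = 0;

    if (fseek(sea, start, SEEK_SET) != 0)
        return -1;
    for (;;) {
        if (fread(&header, sizeof header, 1, sea) != 1)
            return -1;                      /* ran out of data before the suffix */
        if (header.filesize == -1L)
            break;                          /* suffix header: end of archive */
        if (header.filesize < 0L)
            return -1;                      /* no other negative size is valid */
        header.filename[SEA_NAME_LEN - 1] = '\0';
        fprintf(report, "%-12s %10ld\n", header.filename, header.filesize);
        if (fseek(sea, header.filesize, SEEK_CUR) != 0)
            return -1;                      /* skip the file's contents */
        ++count;
    }
    return count;
}
```

Replacing the skip with a byte-for-byte copy into a newly created file turns the same loop into the extraction loop.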

Miscellaneous Implementation Notes

The prefix header is a structure declared in Listing 1, sea.h. The first member, filename, holds the file's name in an array of characters, as a C-style string. The array only needs to be thirteen bytes long in this implementation: eight characters for the base name, one for the dot, three for the extension, and one for the terminating null character. If you decide to store more than a base name, a dot, and an extension, adjust the array's size accordingly. A long int, filesize, contains the file's length.

If you edit extr.c in a way that changes the size of extr.exe, you must update MagicNumber and recompile the extraction module. I run a little batch file, REMAKE.BAT (Listing 5), from the makefile each time extr.exe gets rebuilt. It prints a message to the screen indicating whether MagicNumber equals the size of extr.exe. The batch file creates a temporary file composed of extr.c and a one-line directory listing. An awk program (Listing 6), called from the batch file, digs out the file size from the directory line and the value used to initialize MagicNumber, compares them, and prints a one-line report to the screen. (To keep the awk program simple, I put a space between MagicNumber's initializer and the semicolon.) I use MKS Awk, but other versions should work, too.

To use the archiver, copy arch.exe and extr.exe to a separate subdirectory on your system. arch.exe expects to find extr.exe in the same subdirectory. The files to be archived can exist in any subdirectory and on other drives. However, the SEA as currently implemented does not store subdirectory or drive information. Therefore, when you run the SEA, it will extract all files to the same subdirectory. This can be a problem if the archive contains duplicate file names from different subdirectories. The extractor will overwrite files with duplicate names. If you compiled with Microsoft C and linked with setargv.obj as described in the sidebar "Wildcard Expansion," you can use the usual MS-DOS wildcards, ? and *. Other compilers may or may not offer wildcard expansion as an option. The archiver produces a SEA named out.exe in the current directory. You can rename it to anything you want.

To add compression to the archiver, see "A Simple Data-Compression Technique" by Ed Ross in the October 1992 issue of The C Users Journal. He describes a method of run-length encoding. The source code is available on the CUJ code disk, or you can download it from one of the online sources listed at the end of the table of contents.

For an extensive introduction to methods of data compression in C, see The Data Compression Book by Mark Nelson, from M & T Books. Nelson presents explanations and detailed working versions of popular varieties of data compression. The final chapter contains a complete compression/archiving package, CARMAN.

Conclusion

The SEA and archiver I have described are very simple, but useful. Because of the SEA's simplicity, programmers should find it easy to modify for their own use. The SEA's straightforward structure also makes it useful as a learning tool.