April 1994/Record-Oriented Data Compression

Features

Record-Oriented Data Compression

John W. Ross

John W. Ross is a computational scientist in the University of Toronto's High Performance and Research Computing group. His interests include scientific programming on highly parallel and vector supercomputers. John can be reached at yjohn@utirc.utoronto.ca.

Introduction
Data compression programs are very useful but many of them have an annoying shortcoming; they work on a file in toto, compressing the whole file in one gulp. This shortcoming becomes especially significant when you are building a data-base program or file manager. You can use a data compressor to compress the resulting file, but once you do you no longer know where the record boundaries are. You have to uncompress the file to extract or insert records, so you lose some of the advantages of compression. A lot of similar applications can't take advantage of conventional data compressors for this reason.
We need a data compressor/decompressor that operates on the individual records of a file. Such a program would allow you to compress a record, hand it to the data base manager for storage, then later retrieve it and uncompress it. In this article I present a program that does just that.

Implementation
I've chosen Huffman encoding as the algorithm to compress/decompress data records (see the sidebar, "Two Popular Compression Algorithms," for a detailed discussion of my choice).
Since Huffman encoding relies on character frequencies, it works with practically any size data record. Of course, it is necessary to perform an initial pass over the entire data set (or at least, a representative data set) to determine the character frequencies. From that point on, as long as a data record has the same character frequency characteristics as this data set, the program will achieve good compression ratios.
I have implemented my compression scheme as three distinct entities. The first is a stand-alone program that reads a file and generates a Huffman encoding tree, which it then writes out to a file. The next is a function which compresses a record using the encoding tree. The function takes a pointer to a record to be compressed, and passes back a pointer to the compressed record. The third entity is a function which decompresses a record. I have also written a simple file management program to illustrate the application of these functions. I consider each of the units in turn.

Building the Encoding Tree
Before compressing data records a program must build a Huffman encoding tree from a sample of text similar to that which will be compressed. The encoding tree will be based on the character frequencies found in the file. Program bldtree (Listing 2) performs this operation.
hufftree.h (Listing 1) is an include file that defines the encoding tree structure. By defining it in terms of short integers I am able to keep the external storage for the tree to a minimum.
bldtree takes a single command-line argument, the name of the file to be analyzed. The first thing bldtree does is determine a frequency count of the characters in the file. Note that all characters are assigned a minimum count of 1, so that the compression routine will be able to generate a code for any character, even if it didn't appear in the file. The remainder of the program builds the encoding tree from the table of character frequencies. bldtree creates a file called htree.dat and writes the encoding tree to the file.

The Compression Function
The function compress and its ancillary functions, encode and emit, are shown in Listing 3. compress takes a pointer to a character buffer (containing the uncompressed data record) and the buffer length as arguments, and returns a pointer to another character buffer containing the compressed data record.
Note that the first thing compress does is put the length of the input data record into the first byte of the output record. compress stores this length in the first byte so the decompression function will know how many bytes it has to decode from the compressed record. Using one byte to store the length restricts the maximum input record length to 255 bytes, minus one byte to store that length — so the input record must contain at most 254 bytes of data. If you wanted to use larger records you could store the length as a short integer using the first two bytes of the output record. (This assumes that the input records are variable length. If they are fixed length, and the application knows that length then it is unnecessary to store the length in the record.)
For each character in the input buffer, compress calls function encode, which in turn uses the global struct ht, the encoding tree. The application program is responsible for defining storage for ht, and either reading it from an external file or constructing it. The final call to emit in compress pads out the final byte of the compressed record if necessary.

Decompression
To invert the compression process, an application program passes a compressed record to the function decode, (Listing 4) which expands the record and returns it in its original form along with its length in a separate buffer. (It should be obvious, but in case it isn't I want to stress that the encoding tree used to decompress the data records must be the same one that was used to compress them.)

An Example
The program called filer (Listing 5) shows how the compression routines may be applied. filer takes two command-line arguments: the name of an input file to be compressed and the name of an output file that will contain the compressed records. filer is designed to work with a plain text file. filer treats each line in the file, delineated by a newline (\n), as a record to be compressed. filer also writes an index file as it processes the text file so that individual compressed records may be retrieved later.
filer assumes that a file called htree.dat containing the Huffman encoding tree exists.
filer keeps track of the number of bytes in the input and output records and prints out the average compression ratio.
unfiler (Listing 6) illustrates the application of decompression. unfiler can read any record from the compressed file by number, but it's set up in this example to read all of the records in sequence.
To use these programs with a text file called misc. txt, you would type the following commands:
bldtree misc.txt
filer misc.txt misc.cmp
unfiler misc.cmp misc.unc
misc.cmp is the file containing the compressed records and misc.unc contains the decompressed versions of the records. misc.txt and misc.unc should be identical. These programs will compile and run under MS-DOS and most versions of UNIX.
To work with a file that does not consist of individual lines of text, it is simple to modify filer to read blocks of text or data rather than lines. Just replace the program lines that read:
    while (fgets(bufin, MAXBUF-1, in) != NULL)
    { inlen = strlen(bufin);
      ncharin += inlen;
with
    while ((inlen = fread(bufin, 1, 255, in)) > 0)
    { ncharin += inlen;
This modification will yield slightly better compression ratios. You don't have to modify unfiler to recover the output file.

Simple Encryption
This compression process also performs encryption as a side effect. The encryption key is the particular Huffman encoding tree in use when the data was compressed. Someone looking to decrypt your data would have to determine that it had been compressed using Huffman encoding and then would have to determine what sort of character frequency distribution had been used. While this kind of challenge probably wouldn't stop a professional code breaker, it might dissuade the casual snooper.

Summary
In this article I have shown how to adapt a common file compression algorithm, Huffman encoding, so that it may be applied to the individual data records in a file. This technique would be useful in database or file management applications in which parts of the file have to be stored or retrieved independently. The compression process also also serves as a means of encrypting the data.