December 1993/On-The-Fly Huffman Coding

Features

On-The-Fly Huffman Coding

Ludger Engbert

Ludger Engbert, economics graduate from the University of Geneva, Switzerland (1975), has been working as a self-employed management and software consultant since 1984. he has conducted a few dozen seminars in those fields, speaking English, French, German, and Spanish. He has been programming in C since 1988 or so, for business applications (accounting, financial analysis) and, more recently, to enjoy the pure beauty of logic, C language tools, and the like. He can be reached by phone (+49-6084-2367) or fax (+49-6084-2458), or by writing to: Engbert, Taunusstrasse 8, D-61389 Schmitten, Germany.
The February 1992 issue of The C Users Journal carried a code-heavy article on Huffman coding. (See Dwayne Phillips, "Data Compression Using Huffman Encoding," CUJ February 1992.) The theory of Huffman coding was well explained there. This article proposes two enhancements:
1) a new algorithm, which reduces the necessary code to a fraction of the above
2) on-the-fly compression/decompression, which eliminates the need for two passes through the input file during compression
The sidebar, "A Huffman Coding Algorithm," presents the new algorithm. Basically, it uses a "Huffman list" in place of the familiar "Huffman tree." One might use the proposed algorithm to build an encoding table along the lines proposed in the above-mentioned article. This encoding table would become part of the compressed file. It would enable the decoder to decode the compressed file.
Enter on-the-fly compression. Here the encoding table is continually being modified, with each character read from the original (uncompressed) file. The same "dynamic" encoding table is built during decompression, with each character decoded.
The dynamic encoding table works the same way as the familiar (static) encoding table except that it has one more code: the empty-node code. The encoder sends an empty-node code whenever a character read from the input file does not exist in the encoding table. Each empty-node code is then followed by the ASCII (or other convenient) definition of the "unknown" character in question. You can think of the Huffman tree as having one node or leaf which is not reserved for any particular character; hence the term "empty node."
The algorithm proposed in this article represents the empty-node code with a series of one bits. As the encoding table grows, the number of bits necessary to represent the empty node will grow. But as a typical text will generally not consist of more than some hundred characters, there are at the most about a hundred empty-node codes to send.
The source code for compression/decompression is provided in Listing 1 through Listing 7. Make sure to include the config.h file (Listing 8) . Use the config.h file as is to build the encoder; remove the line
#define COMPRESS
in the config.h file, to compile the decoder.
The proto.h file (Listing 9) mentioned in the source listings is a list of function prototypes, which you should build separately for encoder and decoder. All *.c files except decomp.c are needed to build the encoder. All *.c files except comp.c are needed to build the decoder. Some non-ANSI functions in the Turbo C library were used to restore date and time of decompressed files to the original date and time. The corresponding functions in main.c may simply be removed, if it is not necessary to timestamp decompressed files.
I tested the encoder/decoder on several C sources, and was able to reduce file sizes to about 60% of their original sizes. Important savings can be made by concatenating several files into one compressed file. The decoder accepts any number of files on the command line and builds one compressed file, which the decoder will again decompress into the original files. Further savings could be achieved here, using a single encoding table for all files, especially if files are small and are similar. As proposed, the encoder/decoder builds a new encoding table for each file.
Code was designed with a view to make maintenance as easy as possible. Each source file corresponds to more or less a datacum-functionality "object." For example, the file usage.c is in charge of the user (command line) interface. It handles command-line parsing and the "usage screen." When the command line syntax is modified or corresponding defaults are modified, only this file needs modifying. The file identity.c knows how to identify a compressed file; other modules therefore consult this module in the matter, even at the cost of some more code.
Figure 1 presents the function calling hierarchy in an organization-chart form (encoder only). Listing 1, bits.c, accepts bits from the encoder and writes them to disk, one byte at a time; or it reads bytes from the compressed file and passes them bit by bit to the decoder. Listing 2, comp.c, holds the main compression routine. The corresponding decompression routine is presented in Listing 7, decomp.c. The size of the original file is communicated to the decompressor, as an additional safeguard to make sure that the original file is restored. If the decompressor hits the end of the file before restoring the complete original file, decompression has failed.
Listing 3, identity.c, writes out a header to the decompressed file to identify it as such; or it verifies that the header is present on files to be decompressed. Listing 4, list.c, implements the Huffman list encoding/decoding algorithm. There is no true encoding/decoding table, as the bit-stream used for encoding is generated for each character that gets encoded! The code can be made a lot faster, if such a table is built, going back to some form of two-pass compression. But care should be taken, as it is not guaranteed that any fixed-length integer, even 32-bits wide, will be able to hold the code for the odd character which is found only twice in a big input file!
Listing 5, main.c, opens/closes files for compression/decompression and interacts with the user. The compressor and decompressor then simply read stdin and write to stdout. It is therefore mandatory to output error messages and comments to stderr. Listing 6, usage.c, is the file to modify if you want to change some elements of command-line syntax or the user interface.
Most of the code is devoted to operating system and user interfaces. The true compression/decompression is done in list.c and comp.c/decomp.c. These are the modules you would have to replace to plug in some other compression algorithm, such as dictionary-based compression. There are basically two compression algorithms: Huffman coding or dictionary-based coding. In dictionary-based coding frequently recurring sequences of characters are encoded as a separate character; generally such schemes end up using 10- to 15-bit codes to represent the new extended alphabet. The advantage of Huffman coding is that it hardly needs any memory and can therefore be used in real-time applications (electronic mail etc.).