April 1994/Record-Oriented Data Compression/Sidebar

Two Popular Compression Algorithms

Two general purpose data compression algorithms are currently in widespread use. These are the Lempel-Ziv-Welch (LZW) algorithm (Welch, Terry A., "A Technique for High Performance Data Compression," IEEE Computer, June, 1984.) and the Huffman encoding algorithm. Both are lossless compression methods, but rely on completely different compression mechanisms.
The LZW algorithm generally yields better results and does not require external tables. This algorithm replaces strings of characters with single codes which create a sort of expanded alphabet. LZW adds new strings to a string table, which is created dynamically as the input data stream is processed. As output LZW produces the codes representing strings in the input stream.
In the decompression phase the LZW algorithm reads the string codes, from which it can dynamically reconstruct the string table and output the original strings of characters.
The Huffman encoding algorithm is based on the observation that certain characters, especially in text, occur more frequently than others. Instead of allocating an 8-bit ASCII code to each character the Huffman algorithm uses variable-length codes for each character. Frequently occurring characters are represented by codes with fewer bits while rarely occurring characters are represented by codes with more bits.
Static Huffman encoding requires a table of probabilities of occurrence of each character before compression. Programs may derive this table from a pre-scan of the actual data if it is available, or from statistical observations of similar data — this works well for English text, for example. In any event, the program constructs an encoding tree from the probability information. This tree must be available in both the compression and decompression phases. Though the output consists of variable-length bit strings for each character, no end-of-character markers are required — the algorithm "knows" when it has reached the end of a code in the decompression phase.

Choosing an Algorithm for Compression
As mentioned above, the LZW algorithm generally produces better compression ratios but it is not suitable for record-oriented compression. The records involved may be quite short and may contain no repeated strings. Since the records may be processed independently (for example, inserting one record at random into a data base) LZW can't take advantage of the fact that the data base or file as a whole may have much repeated information in it. Since the LZW algorithm works by replacing repeated strings with shorter codes it would often provide unsatisfactory compression in this application.
Huffman encoding will work with individual records, but it's not without drawbacks. First, Huffman encoding requires an encoding tree be available during both compression and decompression phases. If you use an encoding tree that was constructed from data with a particular character frequency distribution to encode data that has a different character frequency distribution, your results will not be satisfactory. Second, the algorithm doesn't work well for data containing characters that occur with a roughly uniform frequency. Binary data is particularly bad — files such as a program executable don't compress — they will probably increase in size!
A third problem stems from the algorithm's conversion of input data as a string of bytes to output data as a string of bits. Since most computers can only address data units down to the byte level, this output string must be padded out to an integral number of bytes. Unfortunately, in the decompression phase the Huffman algorithm can't tell whether these extra few bits represent more characters or whether they are just padding. It is therefore necessary to inform the decompression program what the original record length was so it will know when to stop.
Having said all that, the algorithm generally works well for text data. For English text you can expect data to be compressed to about 60% of its original size. The algorithm also works well in situations where your data consists of numbers represented by ASCII digits — in this case you can get compressions closer to 40% of the original size.