PROGRAMMER'S BOOKSHELF

10 lbs. of Data in a 5-lb. Bag

Andrew Schulman

About a year ago, I purchased a highly regarded textbook on text compression and was pleased to find, among the mathematical formulas, some source code for a data-compression program. While typing in the code, I marvelled at how something so seemingly abstract as information theory could yield something as tangible as reduced hard-disk consumption, shorter transmission times, and lower online fees. The program was fairly small, and my speculations on the almost unreasonable effectiveness of information theory were put on hold while I compiled and ran the program.

Well, the resulting data-compression program turned out to have an interesting property: Much as a failed alchemist might turn gold into lead, it made all files larger; the accompanying decompression program (yeah, I typed that in too) made them smaller. You can probably appreciate the fact that I have since learned to be more cautious when purchasing books on this subject. Data compression really does have a crucial theoretical foundation, making a pleasant contrast to the ad hoc mess that exists in most other areas of computing. But as the compression program that makes everything larger shows, this theoretical foundation is not always exploited in the most practical or productive way.

Mark Nelson to the Rescue!

Nelson's The Data Compression Book now stands as the best all-around book on this subject, providing both the crucial mathematical foundations of information theory and genuine, working C code that actually compresses.

Nelson, vice president of development for Greenleaf Software, takes the reader through a variety of data compression techniques, following their historical progression. First, the book covers statistics-based techniques, showing the evolution from Huffman's 1952 landmark "method for the construction of minimum-redundancy codes," to the need for adaptive coding, to arithmetic coding, which answered the need for coding with a nonintegral number of bits. The next three chapters deal with dictionary-based compression, which began with Ziv and Lempel's universal-algorithm paper from 1977, and some variety of which is used in most contemporary compression software, such as PKZIP, LHArc, and the mighty Stacker. Next are two chapters on "lossy" compression of sound and graphics, without dedicated hardware. Each technique is demonstrated with a detailed walkthrough of working C code; the code is also provided on the two accompanying disks.

One of the key points to emerge from the book is the importance of the model used for compression. Data compression, Nelson points out, consists not of mere "coding," that is, of deciding how best to represent symbols based on the probability of their appearance in a stream of data, but, equally important, of "modeling," which is what determines the probability in the first place. A good example of what Nelson means by "model" is the difference between viewing a file as a collection of characters on the one hand, and viewing it as a collection of pairs or triplets of characters on the other.

As discussed in Kas Thomas's article, "Entropy" (DDJ, February 1991), the minimum number of bits required to represent a symbol, based on the probability of its appearance, is known as its entropy. But Nelson makes the important point that

...unlike the thermodynamic measure of entropy, we can use no absolute number for the information content of a given message. The problem is that when we calculate entropy, we use a number that gives us the probability of a given symbol. The probability figure we use is actually the probability for a given model, not an absolute number. If we change the model, the probability will change with it.... This seemingly unstable notion of a character's probability proves troublesome to many people. They prefer that a character have a fixed "true" probability.

But, in dictionary-based compression in particular, the model chosen is all-important. Another interesting point is that the history of data compression techniques relates closely to the hardware available at any given time. Abstract algorithms were discovered at just about the time it became practical to implement them, given typical memory availability and CPU speeds. The genesis of dictionary-based techniques in 1977 is a good example.

While walking through the entire history of data compression, Nelson takes the reader all the way to the present, covering the despised Unisys LZW patent, ARC, PKZIP, Yoshi's LHArc, Stac Electronics' QIC-122 compression standard, and so on. While some of this material has appeared previously in Nelson's DDJ articles, almost all of it is new.

For the many programs that come with the book, Nelson has built an extremely nice foundation library, into which he plugs the different data compression and decompression routines. At a time when you see so much repetitive code in programming books, with each program differing by only a few lines from the preceding one (books on Windows programming are particularly bad in this way), it's nice to see Nelson's book practicing "compression" at this conceptual level. This is truly "minimum redundancy coding!"

I was a little surprised, though, by the book's lack of emphasis on speed of compression and decompression. This seems an important subject. The run time for some of the programs could have been halved simply by using the C setvbuf() function to read and write more data at a time. Also, the BIT-IO module seems somewhat inefficient; running the compression programs under a profiler indicates that almost all time is spent here, rather than in the compression routines themselves. BIT-IO uses a "rack" in which bits are stored until a complete byte can be written out to a file; I suspect that writing out a 4-byte long instead would improve the performance. But it is probably unfair to complain about this, in what is really a wonderfully enjoyable book on programming. Even someone who isn't a C programmer will probably get something from this inside look at how programs such as PKZIP and Stacker operate.

Silicon Dreams

Nelson notes that "data compression is perhaps the fundamental expression of Information Theory." There is something paradoxical about information theory, because meaningless gibberish has greater entropy, that is, is worth more bits, than the relatively structured gibberish that I'm writing now. Information theory often comes under attack for neglecting "meaning." For example, the brilliant art critic Rudolf Arnheim, in his Entropy and Art, complains of "the information theorist, who persists in ignoring structure." But the proof of the pudding is in the eating: Completely random banging on keys does not compress as much as the text file I'm writing. The existence of data compression programs, and of efficient programs that not only detect but correct errors, shows that Claude Shannon was, to put it mildly, on to something.

There are, of course, many good books on the subject of information theory. Shannon's original paper from 1948, together with an essay by Warren Weaver, is still in print as The Mathematical Theory of Communication, and is amazingly readable and relevant after over 40 years.

A well-known popular account is John Pierce's Introduction to Information Theory: Symbols, Signals, and Noise. A really popular account is Jeremy Campbell's Grammatical Man: Information, Entropy, Language, and Life. The problem with these popular surveys is that they bite off more than they can chew: Much as critics of information theory demand that it somehow take into account "meaning," the authors of these popular works on information theory try to make it explain, in the course of 300 pages, life, the universe, and everything. This in spite of Shannon's own warning, in a 1956 article titled "The Bandwagon" (quoted in Campbell's Grammatical Man), that information theory "has perhaps ballooned to an importance beyond its actual accomplishments.... Seldom do more than a handful of nature's secrets give way at one time."

Robert Lucky's recent popular explanation of information theory, Silicon Dreams, does a wonderful job precisely because--like information theory itself--it does not try to explain everything. The mere fact that data can be compressed is, when you think about it, mysterious enough without also dragging in any possible connection between this and the ultimate heat death of the universe. In contrast to other books on information theory for a general audience, Lucky sticks solidly to subjects such as Shannon's theorem, the nature of text, text input and output, speech, speech processing, graphics, and graphics processing.

Those, in fact, are the key chapters in Lucky's book. For example, there is a solid 50-page chapter on text, which includes in-depth looks at the statistical properties of English and at text compression. The chapter on text input has a wonderful discussion of the nature of typing and of keyboards. (The persistence of the QWERTY keyboard against innovations such as Dvorak's might serve as a lesson for anyone who wants to replace an entrenched mess with a brand-spanking-new innovation.) Another 50-page chapter discusses speech processing, including coding, synthesis, and recognition. A chapter on picture processing looks in depth at the storage, generation, processing, and coding of graphics.

Most impressive in all this is that Lucky, executive director of research at AT&T Bell Laboratories, has managed to discuss all this in a book that could easily be understood by one's non-computer spouse. Basically, Silicon Dreams is a very readable survey, for a general audience, of signal processing, the Fourier transform, Lempel-Ziv, Huffman coding, linear predictive coding, and other topics you probably wouldn't expect to find discussed outside a for-professionals-only text. Anyone who wants to understand the theoretical foundation that makes possible software such as PKZIP and Stacker will want to read Silicon Dreams. Even professionals who already know about some of these topics will benefit from reading the book, because it will probably teach them some things they didn't already know, and it will certainly help put what they do know into a wider context.

Lucky's chapters on sound and graphics--like the code-intensive chapters on the subject in Nelson's data compression book--are important to anyone interested in multimedia. I particularly recommend Lucky's account of the failure of Picturephone, which seems to me to say something about multimedia today.

The Entropy Problem

As I've said, Lucky concentrates on the mysterious-enough hard facts of information theory, and leaves alone the more nebulous questions in which others have become mired. This includes the ultimate question, which is what, if anything, the entropy of information theory has to do with the entropy of Clausius's second law of thermodynamics. Of course, they may have nothing to do with each other; it is reported that John von Neumann suggested that Shannon call his measure "entropy," because "no one knows what entropy is, so in a debate you will always have the advantage."

In addition, Lucky notes that the striking similarity of the formulas involved may reflect little more than the fact that "any analysis of order and disorder would result in similar logarithmic equations." In any case, he writes, "such philosophical arguments might be made appropriately late at night over a bottle of wine." I've decided to take his advice literally. It's late at night, and I've been into the wine, so I feel fully equipped to talk about our final book this month, Maxwell's Demon. This is a collection of 25 papers on the link between information theory and the second law of thermodynamics, as they relate to a mythical being, Maxwell's demon, who uses observation alone to defy the second law. The general idea is that, if the type of entropy one might read about in a book on data compression is the same as the entropy that Clausius said tends to a maximum, then the demon is consistent with the second law.

Some of the flavor of these papers comes out in a remark in one of them, Charles Bennett's "Notes on the history of reversible computation," which refers to "the realization that one bit of information is somehow equivalent to k ln 2 units of entropy, or about 2.3 x 10^-24 cal/Kelvin." This is a remarkable realization!

Other papers included in Maxwell's Demon are Szilard's 1929 classic "On the decrease of entropy in a thermodynamic system by the intervention of intelligent beings," Landauer's "Computation: a fundamental physical view" and "Irreversibility and heat generation in the computing process," and Bennett's "The thermodynamics of computation."

Surprisingly, one viewpoint not well represented here is that of Edward Fredkin, who is concerned, not with the physical basis of computation, but with--get this--the computational basis of physical reality. The world, in other words, is a computer, and reality is digital, not analog. (I wonder what size hard disk it takes. Will it run Stacker?) Fredkin's reversible logic gates and billiard-ball computer are discussed in some of the papers in Maxwell's Demon, but for his wider views you might enjoy the biographical sketch of Fredkin (who was reportedly the model for the character Professor Steven Falken in the 1983 movie War Games) in Robert Wright's Three Scientists and Their Gods. It goes very nicely with the Chianti.


Copyright © 1992, Dr. Dobb's Journal