In much the same way that General Francisco Franco is still dead, volume 4 (Combinatorial Algorithms) of Donald Knuth's projected seven-volume Art of Computer Programming is still not out. Neither are volumes 5 (Syntactic Algorithms), 6 (Theory of Languages), or 7 (Compilers).
DDJ readers are probably familiar with the reason why: Knuth has been on a ten-year detour from the Art of Computer Programming, working in the field of computer typesetting (TeX) and typography (METAFONT). In addition to producing the TeX and METAFONT software itself, Knuth has used this software to produce an attractive five-volume series, Computers and Typesetting, that includes not only the documentation for TeX and METAFONT, but also their source code.
What DDJ readers may not be familiar with is that this source code is written in a programming language called WEB. But when I say "written," I really do mean written: the source code, and the written description of it, are one and the same. WEB is a language, quite similar to Pascal (there's also CWEB, which is quite similar to C), which makes it possible to merge the executable source for a system with its description. More importantly, it allows you to construct and present the source in an order which makes "psychological" sense. Using such a system, software "authors" really do become authors, concerned with writing and presenting code in a way that makes sense, not so much to the compiler, but to the reader.
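To give a flavor of this: in a WEB or CWEB source, prose and code alternate, and chunks of code are given descriptive names in angle brackets that other chunks refer to, so each piece can be presented wherever the exposition wants it. The fragment below is invented for illustration (it is not from Knuth's sources); a later "tangling" step assembles the named sections into compilable code.

```
@ Here we process the input a line at a time. The named section in
angle brackets is defined elsewhere in the document, at whatever
point makes it easiest to explain.

@<Process all lines of the current file@>=
while (fgets(buffer, BUF_SIZE, fp) != NULL) {
    @<Update the line, word, and character counts from |buffer|@>;
}
```

A companion "weaving" step typesets the same source, with the prose as running text and the code pretty-printed, which is how the Computers and Typesetting volumes were produced.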
Even if you're not interested in typesetting or typography, take a look sometime at Knuth's TeX: The Program (Volume B of Computers and Typesetting) and METAFONT: The Program (Volume D). Nowhere else will you find such detailed presentations of the entire source code--warts and all--for a large program. It's instructive to consider what life would be like if the source code for your favorite system were available as a WEB.
What inspired Knuth to issue these 600-page hardcover books of source code? Literate Programming, a recently issued collection of essays by Knuth and others from 1974 to 1989, contains a fascinating answer:
Tony Hoare provided a special impetus for WEB when he suggested in 1978 that I should publish my program for TeX. Since very few large-scale software systems were documented in the literature, he had been trying to promote the publication of well-written programs. Hoare's suggestion was actually rather terrifying to me, and I'm sure he knew that he was posing quite a challenge. As a professor of computer science, I was quite comfortable publishing papers about toy problems that could be polished up nicely and presented in an elegant manner; but I had no idea how to take a piece of real software, with all the compromises necessary to make it useful to a large class of people on a wide variety of systems, and to open it up to public scrutiny. How could a supposedly respectable academic, like me, reveal the way he actually writes large programs?
This same challenge faces anyone who has ever tried to write about software: Only small programs seem explicable in the course of an article or even a reasonably sized book. Yet genuine software tends not to be small, and generally seems to consist mostly of ugly distractions from whatever major points you're trying to make. Small programs are toys, and genuine programs seem impossible to explain in depth. Of course, one could present only a top-down view of a large program, ignoring the mass of details; but it is in these details that the program's true worth (certainly its monetary worth!) probably resides.
Knuth tackles this problem with the idea of programs as webs:
When I first began to work with the ideas that eventually became the WEB system, I thought that I would be designing a language for "top-down" programming, where a top-level description is given first and successively refined. On the other hand I knew that I often created major parts of programs in a "bottom-up" fashion, starting with the definitions of basic procedures and data structures and gradually building more and more powerful subroutines. I had the feeling that top-down and bottom-up were opposing methodologies: one more suitable for program exposition and the other more suitable for program creation.
But after gaining experience with WEB, I have come to realize that there is no need to choose once and for all between top-down and bottom-up because a program is best thought of as a web instead of as a tree....
When I'm writing a longish program ... I invariably have strong feelings about what part of the whole should be tackled next. For example, I'll come to a point where I need to define a major data structure and its conventions, before I'll feel happy about going further. My experiences have led me to believe that a person reading a program is, likewise, ready to comprehend it by learning its various parts in approximately the order in which it was written.... Sometimes the "correct" order is top-down, sometimes it is bottom-up, and sometimes it's a mixture; but always it's an order that makes sense on expository grounds.
Thus the WEB language allows a person to express programs in a "stream of consciousness" order.... the fact that there's no need to be hung up on the question of top-down versus bottom-up--since a programmer can now view a large program as a web, to be explored in a psychologically correct order--is perhaps the greatest lesson I have learned from my recent experiences.
In other words, there's a way to present genuine, large programs, give the reader an understanding of how the entire system fits together, and still not brush aside messy issues like error recovery, special cases, system dependencies, tweaks for performance, hacks, kludges, and all the other seemingly nonalgorithmic issues that make up the bulk of a genuine program. Such details are crucial to one's understanding. They can't be "black boxed."
This jumping around from top-down to bottom-up and back to top-down isn't just a matter of how source code gets presented, either. It cuts to the root of programming itself. As Knuth notes in an amazing essay from 1974 in this collection ("Structured Programming with go to Statements"):
I have felt for a long time that a talent for programming consists largely of the ability to switch readily from microscopic to macroscopic views of things, i.e., to change levels of abstraction fluently.
The name WEB of course comes straight from this notion of a program as a tangle of high-level and low-level issues.
Above all, what comes through here is the notion of writing a program as writing. TeX, METAFONT, and WEB all fit together as part of a grand attempt (apparently partially inspired by Arthur Koestler's The Ghost in the Machine) to break down the division between documents and programs, or at least between documentation and programs.
One of the key points is the need for a program and its documentation to be written by the same people. This sounds like an impossible luxury, but in many cases the awful state of documentation--and the inexplicable misfeatures of a program--are the result of the "doc" group not knowing the program they have been hired to describe, and the programmers not thinking about how one would actually describe or use the program.
What is missing is "the discipline of simultaneously writing and describing a program." "Manual writing provides an ideal incentive for system improvements, because you discover and remove glitches that you can't justify in print." Basically, if you can't explain it in a nonembarrassing way, then it shouldn't be in the product. "The designer of a new system must not only be the implementor and the first large-scale user; the designer should also write the first user manual." This is just a variation on the age-old theme that the best way to learn is to teach, that the best way to understand something is to have to explain it. Again, we see that writing software means writing.
Of course, where there is writing, there must be criticism. Not so much "criticism" in the sense of a "this is good, this is bad" review, but what in literary criticism is called "close reading." So it is fitting that one chapter of Literate Programming, a reprint from a column by Jon Bentley on WEB, contains first a program by Knuth and then a detailed criticism by Doug McIlroy (who ends up calling Knuth's sample program "a sort of industrial-strength Fabergé egg").
One of the most fascinating parts of Literate Programming is a 100-page section called "The Errors of TeX," which "describes the milieu of literate programming, by tracing the history of all changes made to TeX as that system evolved." This includes a complete error log for TeX from 1978 to 1991. Interestingly,
... if you ask me whether keeping such a log has helped me learn how to reduce the number of future errors, my answer has to be No. I kept a similar log for errors in METAFONT, and there was no perceivable reduction. I continue to make the same kinds of mistakes.
Oh well.
It seems unfortunate that so much of this book focuses on Pascal. However, Chapter 12 presents the word count (wc) program from UNIX, rewritten in CWEB to demonstrate "literate programming" in C. While wc is, of course, a very simple program, Knuth notes that his "fondest hope is that readers who look at Chapter 12 will realize how wonderful it would be if the entire UNIX system and its successors were written in the style of Chapter 12 or something similar."
Actually, it also made me wish that volumes 1-3 of the Art of Computer Programming had been written in the style of Chapter 12 or something similar, or at least in the style of something other than the MIX assembly language. At any rate, Literate Programming brings back what a pleasure reading source code can be.
Last month marked the official launch of our Dr. Dobb's Handprinting Recognition Contest, and your response has been exhilarating. Whether it's the opportunity for you to strut your recognition stuff, or the chance to win a Macintosh PowerBook 100, we're receiving new entries daily. But just in case this is the first you've heard about the contest, we'll briefly recap the details.
The contest began on June 15th, when the official version of the DDJ test framework, test data, and contest entry form were made available electronically and by mail. The deadline for submissions is September 15th, and the winner will be announced in our December 1992 issue.
We've built a platform-independent test harness that, in the most general case, allows you to plug in your recognizer and check the results. You can develop your recognizer on any platform on which the DDJ test harness runs; you don't need a pen computer or pen operating system to participate in the contest. The DDJ harness code assumes only the C standard library. But even though you can run the harness on any platform that has a C compiler, we can only test your code on the Macintosh or PC platforms. Assuming your code is portably written, this shouldn't be a problem.
The test harness is 200 lines of C code, and was written by Ron Avitzur to run as a batch process using the standard I/O library functions. This code has been tested on the Macintosh, SPARC, DOS, and Windows 3 platforms. On the PC, the code compiles with both Borland and Microsoft compilers.
You must send in both source code and an executable. Any other written commentary or documentation is also welcome. Source code is for publication and can be in C (or, on the PC, in any language that can be linked to the OBJ files of the DDJ test harness). Submissions will be judged primarily on recognition accuracy. Speed is a secondary consideration; third is the conciseness and elegance of your implementation. As mentioned, first prize is a Macintosh PowerBook 100, generously provided by Apple Computer.
A more complete description of how the harness works is included on page 60 of our July 1992 issue as well as with the contest entry information.
--editors
Copyright © 1992, Dr. Dobb's Journal