Features


Literate Programming in C and C++ using CWEB

Lee Wittenberg


Lee Wittenberg teaches Computer Science at Kean College of New Jersey. He is an active contributor to the Literate Programming electronic discussion group, and can be reached via the Internet at leew@pilot.njin.net.

Introduction

A quiet revolution is taking place in programming circles — a revolution called literate programming. In 1972, Edsger Dijkstra [1] wished for a "program written down as I can understand it, I want it written down as I would like to explain it to someone." Ten years later, Donald Knuth developed the original WEB system, coining the phrase "literate programming" in the process [2].

Literate programming is precisely what Dijkstra wanted: the ability to write a program as you would explain it to another human being, rather than as your compiler would have it. In the decade since Knuth introduced the idea, the number of literate programming tools has proliferated. In addition to the original WEB, FunnelWeb, FWEB, noweb, Nuweb, Spidery WEB, and WinWordWEB are all available and in active use. But the flagship of the WEB fleet is CWEB, a preprocessor that supports literate programming in both C and C++, and (with the recent publication of "The Stanford GraphBase" [4] and "The CWEB System of Structured Documentation" [5]) is slowly moving into the mainstream, both in industry and academia.

CWEB Program Format

A CWEB program, commonly called a web, is made up of sections, each of which contains a text part and a code part, or chunk. The chunks can be extracted to create a program for compilers, or the entire web can be typeset to be read by people. A single source file serves the divergent interests of human and machine.

Listing 1 is an example of a simple web. It shows how the traditional "hello world" program might look if written in CWEB. The sections are numbered, the code parts are cross-referenced by section number, and the identifiers and chunk names are indexed. Following Algol conventions, keywords are typeset in boldface, identifiers in italics, and strings in a typewriter font for easier reading.

More important to note is the program layout. Instead of putting the line #include <stdio.h> at the beginning of the program, I use < Header files needed by the program 3 > as a placeholder. At this stage of the game, I know I'll be using header files, but — in the best tradition of structured programming — I'm not yet ready to make a decision about which ones I will need, so I leave the details up in the air until I know more. I can do the same thing with the body of main. In other words, I organize the program for the human reader, not the compiler.

Each section of the web is a self-contained, self-explanatory piece of the program. Listing 1 amounts to overkill for such a tiny, well-known program, but Literate Programming is invaluable in dealing with larger programs, as Listing 4 will show.

Listing 2 is the source file as I actually typed it, which CWEB used to generate Listing 1. This source file is basically a C program with embedded "markup" strings to delineate the literate programming features. Sections are introduced by @* or @ plus space followed by the text part. The code part of a section begins with @c, or with a chunk name followed by = . Chunk names are enclosed in @< and @>. Vertical bars surround bits of code embedded in text.

Weaving and Tangling

Generating a human-readable document from a marked-up source file is called "weaving the web." The CWEAVE command generates the section numbers, cross-references, and index. It also typesets the text and code properly, with occasional hints from the programmer (the control codes @; and @# are examples of such hints).

The analogous procedure for generating a compiler-readable file is called tangling. Listing 3 shows the result of running the web in Listing 2 through CTANGLE. CTANGLE expands the @c chunk, recursively replacing chunk names with their definitions. If you look carefully, you can spot the classic program peeking through. CTANGLE inserts #line directives in the output file so the compiler and debugger can refer to the original web rather than the generated C file. Humans rarely need to look at tangled output, but for the sake of those who do, CTANGLE generates comments that link the tangled code to the numbered sections in the woven document.

Creating a CWEB Program

Since literate programs are meant to be read, I'm going to ask you to read through the program fragment in Listing 4. Before you do, however, I should warn you that CWEB provides a few notational conveniences that traditional C programmers often find disconcerting at first. CWEB replaces the symbols listed in Table 1 by mathematical equivalents. This substitution serves a dual purpose; it improves the readability of some of the more obscure operators, and provides a somewhat language-independent "publication language" for C. (WEB uses the same symbols for typesetting Pascal, and FWEB and Spidery WEB do likewise for the languages they support.) You can customize CWEB so it will typeset these symbols any way you like when you weave a web. However, you may find, as I do, that you prefer the "standard" CWEB symbols.

Listing 4 demonstrates many of the advantages of literate programming. You can read the web straight through, or use the cross-referencing to follow a single thread. In addition to identifiers, the index contains references to "portability problems," "possible improvements," and other potentially useful items. "System dependencies" is a common index entry in webs. Reference lists, acknowledgements, and other useful bits of information not normally found in program documentation are common in literate programs.

I wrote the program pretty much straight through, starting with section 1 through section 20. [Sections 10 through 20 have been omitted here to save space. The complete listings are available on this month's code disk, as well as the original webs (.w files) and PostScript versions of the woven webs — mb.] I added sections 21-24 as they became necessary, and CWEAVE generated section 25 automatically. I proceeded topdown, starting with the main program and using chunk names to implement a process of stepwise refinement, keeping each section small and easy to understand. Since chunk names are typographically distinct from the actual code, they are much easier to read than function or macro names.

For each code chunk, I thought about what it should do and how it should work, writing down these thoughts as the text part of a section. I then wrote the code part to meet these specifications. When you write a literate program, you usually write the code to match the documentation, rather than the other way around.

I also like to include explanations of why I did (or didn't) do things in a particular way. Observations that would be intrusive in conventional program comments fit quite well in a literate program. I can use all of the techniques of word processing — table of contents, footnotes, charts, graphs, etc. — to better explain my code.

Some of the benefits of literate programming as applied to Listing 4 do not show up in the final result. For example, I worked on this program over several months, leaving it alone for weeks at a time, when more pressing business called. I only had to read through the program once in order to pick up where I left off. It would have taken a full day, at least, with a non-literate program.

C++ and CWEB

Listing 5 is an example of a web that implements a C++ class. Since C++ syntax is more complex than that for C, you generally have to give CWEB more hints to get the prettyprinting correct, but the process is the same.

CWEB allows a programmer to develop a class interface and implementation in lock-step. For example, I define the Xstring constructor and destructor interfaces in section 5, and follow immediately with their implementation in sections 6-11 [sections 8 through 17 not shown — mb]. CTANGLE automatically extracts the class declaration in the <xstring.h> chunk to create the xstring.h file (because the chunk is named with @( instead of @<), so I don't have to worry about keeping information consistent across two separate files.

Producing Formatted Output

The CWEAVE program does not produce its formatted output directly. It produces a file that is then processed by the TeX (pronounced "Tekh") typesetting system. TeX is the typesetting system of choice for literate programming systems, for a variety of reasons:

1. TeX is readily available. Implementations exist for pretty much every computer in existence, and are usually available free of charge. I use the excellent emTeX implementation for MS-DOS, by Eberhard Mattes, which is freely distributable.

2. TeX is portable. It is designed to be as machine-independent as possible. Output generated by emTeX on my PC and that generated by, say, a UNIX implementation will be identical.

3. TeX is stable. Before a program can be certified as TeX, it must pass a rigorous test suite called the "trip test." Since TeX is not a commercial product, it is not subject to the whims of a manufacturer.

4. The quality of TeX output is significantly better than that of any commercial word processor or desktop publisher on the market today.

On the other hand, there is no reason a literate programming system has to use TeX. A number of non-TeX systems exist, and many are suitable for programming in C and C++.

Availability

Although it is not in the public domain, CWEB is freely distributable. The official distribution is via anonymous ftp from labrea.stanford.edu on the Internet. The distribution includes source for UNIX, MS-DOS, VMS, and Amiga distribution. This version is also available on this month's code disk and via the online sources listed on page 3.

TeX is available via ftp from the following sites:

Host

ftp.tex.ac.uk
ftp.dante.de
ftp.shsu.edu
After logging in, cd to directory tex-archive/systems, then cd to appropriate subdirectory (msdos, unix, mac, etc.) for machine-specific TeX materials. Complete TeX systems for Amiga, Atari, Macintosh, MS-DOS, OS/2, UNIX, VM/CMS, VMS, and Windows NT are available on the Prime Time TeXcetera 1-1 collection on CD-ROM, available from R&D Publications for $60.00 plus $3.50 shipping and handling. Contact:

R&D Publications
1601 W. 23rd St.
Lawrence,KS 66046
(913)-841-1631
michelle@rdpub.com

If you would like to find out more about literate programming, I recommend you subscribe to the LitProg discussion group on the Internet. All discussion is via electronic mail, so it should be possible to subscribe from CompuServe, or any other service that can send and receive Internet mail. To subscribe, send a message to LISTSERV@shsu.edu, and include

SUBSCRIBE LitProg "your name"
in the body of the message. If you prefer, you can subscribe to the comp.programming.literate newsgroup. All items posted to the newsgroup are automatically distributed to LitProg subscribers, and vice versa.

Conclusion

Literate programming is not for the beginner. The literate programmer must be proficient in at least two languages: a programming language, like C or C++, and a text formatting language, like TeX or a commercial word processor. Since the bulk of a web is explanation, rather than code, the programmer has to take the time to think through the program in order to be able to explain it. As in any programming methodology, time spent in thought "up front" translates into reduced debugging and easier maintenance. What literate programming adds to the mix is that the programmer's thoughts no longer disappear into thin air once the program is written; they are preserved in a web. The programmer who maintains the web has these thoughts as a foundation for future work. For the professional programmer, maintenance is everything. A literate program is a maintainable program.

But, above all, literate programming is fun. As you write your explanations, the code unfolds naturally, as if it had always been there and you just now happened to find it. The phenomenon of "code that seems to write itself" is common among literate programmers, as is the practice of passing programs around for comments. The tedium of rearranging bits of code to cater to a compiler is gone. In its place is a process of discovery.

Acknowledgements

I'd like to thank Danielle Bernstein, Leonard Ginsburg, and Jack Ryder, whose comments and criticism greatly improved this article.

References

[1] Edsger W. Dijkstra. Structured Programming (Academic Press, 1972), "Notes on structured programming," pp. 1-82.

[2] Donald E. Knuth. "Literate programming," The Computer Journal, May: 1984, pp. 97-111. Reprinted in Knuth's Literate Programming (Stanford University Center for the Study of Language and Information, 1992), pp. 99-135.

[3] Donald E. Knuth. Computers and Typesetting (Addison-Wesley, 1986), Volume A—"The TeXbook."

[4] Donald E. Knuth. The Stanford Graph-Base: A Platform for Combinatorial Computing. A collection of CWEB programs for generating and examining a wide variety of graphs and networks. (ACM Press, 1994).

[5] Donald E. Knuth and Silvio Levy. The CWEB System of Structured Documentation (Addison-Wesley, 1994). ISBN 0-201-57569-8.

Additional Information

Walsh, Norman. Making Tex Work. O'Reilly & Associates, 1994. ISBN 1-56592-051-1.

LitProg FAQ. FTP from the TeX archive sites listed in main article, directory help/LitProg- FAQ.