Norman Wilde has been programming since the early 1960s in everything from Fortran II to C++. He now teaches Software Engineering at the University of West Florida in Pensacola and works on tools for software maintenance and program understanding. You can reach him at wilde@cs.uwf.edu.
We all want the software we produce to be "correct." Software engineering textbooks imply that you produce correct software by starting from a precise specification and then writing code that exactly implements it.
But what can you do if a precise specification is simply not achievable? Often you have only examples of inputs and outputs to work from, with no way of knowing how complete they may turn out to be. This article shows how that familiar tool, the parser generator, can sometimes let you develop a "correct" program by systematic exploration of such examples.
A Sample Problem
Figure 1 shows a common programming problem. An existing program produces data in format 1. For some new purpose we now need the data in format 2. Unfortunately the details of format 1 are not available. How can a conversion program be written and documented with some semblance of clarity and precision?I recently faced this problem while developing a C++ code visualization tool set (Figure 2) . A company had loaned us a C++ analyzer which could read C++ source and produce data records containing attributes of the code. We were eager to use this analyzer with our visualization tools since it was clear that we could not develop one of our own without a major investment. Unfortunately, our code visualization tools needed input in the form of Prolog facts. The C++ analyzer's code attribute record format was known only from one page of documentation plus an example file of 620 KB! Yet the conversion program needed to be very precise and well documented since any bug in it would be almost impossible to track down later when the whole system was running.
Converting Attribute Records to Prolog Facts
Our solution was to use a parser generator for exploratory development. In fact, this method seems applicable to many problems in which inputs are complex and not completely described.A parser generator, such as yacc or AnaGram produces a program from a grammar (see the sidebar). The program's behavior can be precisely defined, both for inputs that obey the grammar and for inputs that do not. So the only difficulty is that you need a grammar that completely describes what might be encountered in the program's input. Once you've found such a grammar, the program that is generated from it will be completely capable of handling the input. Fortunately, such a grammar can be developed through controlled exploration using the error detection capabilities of the parser generator. This process is illustrated in Figure 3.
The C++ code analyzer produced code attribute records of several different types, one for each kind of object recognized in the C++ source. For example, when the analyzer found a declaration of a class called Animator on line 98 it would produce something like:
class <tab> name@Animator <tab> 1ine@98Each record consisted of a series of fields separated by tabs. The first field was a keyword saying what kind of object had been found. The remaining fields were of the form
<attribute name> @ <attribute value>So far so good. The format was fairly simple and, it turns out, had been designed to be read using a tool like awk. Unfortunately, the records were often far too long for our version of awk, and the attribute records sometimes contained things like:
name@_::clnt_ops::cl_call(void )which would need to be carefully teased apart to extract the information that we needed for our visualization tool. The exact form for each attribute was not documented and was certainly not immediately obvious! At a minimum, the name format allows loops (i.e. one or more structures of the form "xxx::...") which are difficult to express in awk but relatively easy to handle in a parser (see sidebar).
Exploratory Programming
At this point I decided that some kind of systematic exploratory process would be needed. The solution was to start with an initial rough grammar and "grow" the conversion program from it by systematic trial and error using the large example input file. Listing 1 shows an early version of the AnaGram grammar. The top levels looked like the following:
grammar -> record?..., eof record -> valid record, newline = cleanUp(); -> bad data, newline = cleanUp(); bad data -> error = wrError(PCB. line, PCB.column);The first production states that the input file is a sequence of records followed by end-of-file. The productions for record then say that a record is either a valid record or bad data followed by a newline. The code after the equals sign in the valid record alternative is a "reduction procedure" which the generated program will call after a valid record is identified. In this case the generated program just does some clean up to get ready for the next record.For trial and error to proceed quickly, handling the records that do not yet match the grammar is almost as important as handling the valid records. This is where the error handling features of a parser generator come in handy. AnaGram provides several ways to handle errors, but I used the simplest method, which it shares with yacc and other earlier tools.
The bad data alternative is an error trap to catch any record that does not match the grammar. The token called "error" is special to the parser generator. If a record does not match the grammar at any point, the generated parser will call the wrError function to write an error message. (As shown, AnaGram's parsers keep track of the line and column number so that these can be included in the message). The parser will then back up to the point where "error" appears in the grammar, that is, to the beginning of the record and then scan forward until it finds an input that could come after the error, in this case a newline. Thus, the offending record is discarded and processing can continue smoothly with the next record.
These top-level productions provided the shell within which the conversion program could evolve in a controlled way. I first made test files for each kind of record by taking a small random sample from the 620 Kb example file. I guessed at a grammar for the record type and debugged it using the sample data. I wrote a specification document as I went along to document exactly how the conversion program handled each possible input.
Writing a grammar from a sample of input is not always easy, but again the parser generator provides tools to help. If you do not specify how to handle some possible sequence of inputs the parser generator warns you of omissions or conflicts in your grammar. AnaGram also provides a "file trace" feature that lets you preview how your grammar would process an input file without generating and compiling the parser. You can see exactly how each token in the file would be recognized by the parser and detect errors early.
Listing 1 and Listing 2 show two stages in the evolution of the grammar. Listing 1 takes a first cut at the problem by simply treating all attribute values as strings. The generated program collects the characters of each attribute in a buffer, and releases them when the full attribute value is found. The program then uses the resulting string to construct an attribute object, which is added to the list of attributes for this record. This grammar at least let me parse the sample input files.
But we needed to get more structure out of complex attributes such as the name example given above. So in Listing 2, I redefined names to be ID values, and structured an ID value into a scoped name plus a name remainder. The scoped name is further decomposed into a sequence of terms separated by "::" markers. This action reorganized attributes into an object class hierarchy with StringAttribute and IdAttribute as two subclasses of the virtual base Attribute class. (Listing 4 shows the header for the C++ support code.)
When the generated program finds a string value or an identifier value it creates an attribute of the appropriate type and passes a pointer to it up to the attribute production. In that production, the calls to the atAdd function construct a linked list of attributes. When the full record is found, the call to putClass or to putInstanceVariable in the valid record production walks the attribute list and writes the Prolog facts needed. The call to the cleanUp function in the record production then frees memory so the program can start over with the next record.
The main function for the program is at the end of the syntax file. It simply opens the file where error messages will go and calls the function conv, which is the name AnaGram gives to the generated parser.
Feeling Brave
When I had completed the tests on the sample files, it was time to try the whole 620KB of the example. Not surprisingly, the error trapping code detected several additional cases, but I easily tracked these down. For example, I found that occasionally the C++ analyzer generated function names such as
??-expression-??::foo(void ) ??-type-??::foo(void )or
??-symbol-??::foo(void )when it encountered a call to a member function and was unsure of the class of the receiving object. So, as shown in Listing 3, I added three additional productions for scoped name term to the grammar and documented this special case in the specification. The trial grammar thus acted as a probe on the input to quickly detect unusual situations that would have been extremely tedious to find using any other method.Note that the structure of the syntax file changes very little as it evolves. You can add productions to handle complexities in one of the attribute types without affecting either the overall program structure or the details of other productions. At all stages, the grammar acts as an easily readable and maintainable specification of the generated program.
Of course Listing 2 is a simplified version of the real AnaGram syntax file, and shows just the class and instanceVariable record types and just the name@, file@, and line@ attributes. The complete program involved considerably more complications than those shown here, but the syntax file was still only 260 lines, plus 350 lines of support C++ code.
Conclusions
This disciplined exploratory programming has produced a program that is "correct" in quite a strong sense. Since the program is precisely defined by the grammar, I can state exactly what the program will do if the input obeys that grammar. At the same time I can confidently assert that if at any future time the input does not obey the grammar, the program will behave in a precisely predictable way, that is, that it will write an error message for the offending record and continue processing. There is almost no chance that unexpected input will cause a complex or subtle failure.I must anticipate that when our final C++ visualization tool is run on someone else's code, we will probably find that the C++ analyzer has a few tricks that we don't know about yet. Still, I can be confident that the error handling mechanism will be able to identify them.
The advantage of using a parser generator is not just that it gets jobs like this one done quickly, but rather that it creates a maintainable program whose behavior can be precisely known (a considerable virtue in any software). Parser generators should probably be much more widely used for many routine and not-so-routine programming tasks.