Parser Generators


A parser generator produces programs that read input that is described by a grammar and perform specified actions based on what they find in that input. The earliest application of parser generators was in compiler writing and the best-known parser generator is the early UNIX tool yacc ("yet another compiler compiler"). PC versions of yacc are available from MKS (519-884-2251) and Abraxas (503-244-5353). The examples in this article use the AnaGram grammar-based programming system from Jerome T. Holland (508-358-7968), which provides an integrated PC environment for developing, testing, and debugging grammars, as well as some very useful extensions to conventional parsing techniques.

Figure 3 shows how a parser generator works. The syntax file contains "productions," which describe the input that the generated program will handle, and "reduction procedures," which say what actions will be taken or output produced when each production is matched in the input.

Listing 3 is an example of part of a syntax file. The productions for a runtest statement give a good example of how the generated parser works. runtest statements look like the following:

runtest "r1" combining someClient theInvoice numItems someItem
{
   ...
}
that is, they consist of the keyword runtest, a run name in quotes, the key word combining, a list of the variables to be used in the run, and finally a C++ block that will be executed for each combination of the variables. The syntax file of Listing 3 declares a runStatement consisting of a runHeader, one or more spaces, and a testSpec. Then runHeader and testSpec are described on the following lines.

The production for runHeader shows that it starts with the runtest keyword, followed by spaces and the string with the run name (string is declared elsewhere in the syntax file). This production shows how a user-supplied reduction procedure called setRunName can manipulate the data collected during parsing. A local variable called n is declared to hold the semantic value of the string, in this case a character pointer to the string characters. When a complete runHeader has been identified, the parser will call setRunName (which is coded like any C or C++ function) and pass it this pointer. In this particular case, setRunName simply stores the string into a global variable for later use.

The production for testSpec is similar but illustrates more generally how reduction procedures can get the data they need. When a complete testSpec is identified, the makeComb reduction procedure is called (in this case the procedure writes all the code needed for one set of tests). The semantic values it uses are a pointer to the block and a pointer to the variable list; it also makes use of the global run-name string previously saved by setRunName.

The production for variableList illustrates recursion, which is very useful for lists of things, and also shows how typed semantic values are created. The variableList production has two rules, saying that a variableList can be either a single variable or a variableList followed by a variable. This is just a recursive way of saying that a variableList is one or more variables.

The single variable rule will be the first to be matched during parsing. When it is matched, the next call is to the listNew reduction procedure, which creates a new STRLIST (string list) and returns a pointer to it. The AnaGram-generated parser makes this pointer the semantic value of the variableList; note that the type of the pointer is declared in the first line of the variableList production so that normal C type checking can be enforced. When a second variable is encountered, the second rule will match, since there is now a variableList followed by a variable. Both are passed to the listAdd reduction procedure, which appends the new variable to the list and returns a pointer to the new list. When all variables have been encountered and the complete variableList has been built, the last pointer generated will be passed up to the testSpec production as variable vl.

Parser generators can thus be used for a wide range of problems in which a program's main purpose is to operate on structured input. Their big advantage is that the structure of the input is explicit in the syntax file and not distributed through the program, so that changes in input requirements tend to be far easier to accommodate.