July 1994/Dealing With Uncertain Inputs: Exploratory Software Engineering/Sidebar

Syntax Files and Parser Generators

A parser generator, such as AnaGram or yacc, is a program that writes programs (see Figure 4) . The parser generator reads a syntax file and produces a C or C++ parser file. Users can then compile the parser file, along with its support code, into an executable parser program.
The parser program will read input and try to match it against the grammar in the syntax file. For example, the lines:
grammar
-> record?..., eof
(void) record
-> valid record, newline = cleanUp();
-> bad data, newline = cleanUp();
in Listing 2 state that the whole input (caller grammar) is a sequence of records followed by an end-of-file. The record is defined to be a valid record or bad data followed by a newline. Later productions define valid record, and bad data, and so on. The process continues until the grammar gets down to a definition in terms of specific characters, such as the following:
name start char = 'a-z' + 'A-Z' + '_' + '~'
which states that a name start char can be any lower case or upper case alphabetic character, or the special characters '_', or '~'.
Productions can be recursive, such as:
attribute list
-> tab, attribute
-> attribute list, tab, attribute
which says that an attribute list is a tab, followed by a sequence of one or more attributes seperated by tabs.
As the parser program executes, it will match the input it reads against this grammar. When a production is matched, the program executes the reduction procedure listed after the '=' in the syntax file. Reduction procedures are just fragments of C or C++ code which are often calls to functions in the support code file. For example, the following fragment:
(char *) scoped name term
-> identifier:i = i;
.
.
(char *) identifier
-> name string = release();
(void) name string
-> name start char:c = collectFirst(c);
-> name string, name follow char:c = collect(c);
defines a name string (recursively) to be a name start char, possibly followed by any number of name follow chars. When the parser program finds the name start char, it assigns the char to the variable c and passes c to the function collectFirst, which empties a global input buffer and puts c at the beginning of the buffer. Then when each name follow char is found, it is passed to the collect function, which adds it to the end of the buffer. When no more name follow chars are found, the name string is complete and the generated parser will recognize that it now has a complete identifier, so it calls the release function. This function copies the buffer into a null-terminated string on the heap and returns a pointer to the string. The pointer is called the "semantic value" of the identifier, and is available in the scoped name term production as the variable i. Thus results can be passed from one production to another as needed.
Note that in AnaGram the semantic values are given C/C++ types so that the compiler can do type checking. Thus (char *) in the above example means that the result returned by release is expected to be a character pointer while (void) tells the compiler that collectFirst and collect do not return any value.
The best known parser generator comes from the UNIX world and is called yacc ("yet another compiler compiler") from its original application as an aid to compiler writers. A variety of parser generators are available both commercially and in the public domain.