November 1995/The Column that Needs a Name*

Columns

The Column that Needs a Name*

A Sensible Grammar Notation

Dan Saks

Dan Saks is the president of Saks & Associates, which offers consulting and training in C++ and C. He is secretary of the ANSI and ISO C++ committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield OH, 45504-4906, by phone at (513)324-3601, or electronically at dsaks@wittenberg.edu.

Introduction
First of all, I'd like to welcome my friend and colleague Bob Schmidt as our newest columnist. (He wants his readers to know him as Bobby, but I've always called him Bob, and he doesn't seem to object.) I first met Bob in 1982 when he was a student and I was a teacher at Wright State University in Dayton, Ohio. I believe his first course with me was CS 480 (Comparative Languages). I recall that we hit it off pretty quickly, and he went on to take several more courses from me over the next few years.
I left Wright State at the end of 1986 to work full-time on Saks & Associates. Bob graduated the following year and came to work for me. A couple of years later he moved to the Pacific northwest and beyond, but we've managed to stay in touch.
When I found out that the CUJ needed a new columnist to replace the dear departed Chuck Allison, I immediately thought of Bob and recommended him to the CUJ editors. Apparently Bob passed his audition, and now here he is. I've always thought Bob's writing showed an attention to (but not obsession with) detail and a sharp wit that made for informative, yet entertaining, reading. I've enjoyed reading his stuff for many years, and I hope you will, too.
Which brings me to the title of my column. I had been writing about C++ under the banner "Stepping Up to C++." I started over four years ago writing on introductory C++ topics, contrasting the C++ way of doing things with the C way (where they differ). Over the years, I've "stepped up" with the readership to increasingly more advanced topics, including the state of the burgeoning C++ standard. It's pretty clear now that this is no longer a column on for C++ beginners. Don't worry, we won't be abandoning introductory C++ topics; Bob will be handling them as he sees fit in his "Learning C/C++urve" column. But my column clearly needs a new name, and we invite you to suggest one. As an incentive, the winner (if any) will receive a free copy of the first, long-awaited CUJ CD-ROM (my editors tell me that it is due to ship in the first quarter of 1996) and a free C/C++ Users Journal t-shirt. Send your suggestions to cujed@rdpub.com, or write to C/C++ Users Journal, 1601 W. 23rd St., Ste. 200, Lawrence, KS 66046. OK, now down to business.

Grammar Notation
Last month, I waxed philosophically about the relationship between the syntactic structure of flow control statements and indenting style. (See "Stepping Up to C++: Syntax and Style," CUJ, October 1995.) Aside from trying to show that my way of indenting is best (he wrote with tongue firmly in cheek), I tried to make the case that if more C/C++ programmers had a genuine understanding of C/C++ syntax, we might actually have a more common view of style issues. Or, we might at least have a better idea of what we disagree about.
A solid understanding of programming language syntax holds other rewards. The language syntax (its structure) is closely related to its semantics (its meaning). Understanding the structure often clarifies its meaning. I believe this is true for C, and especially true for C++ because its syntax is so much more complicated than C's.
For example, many C and C++ programmers don't really understand the syntax of declarations and their associated semantics. While they have little trouble with simple declarations, their understanding starts to break down when confronted with declarations that have too many *s, &s, ()s, []s or const qualifiers. Sometimes "too many" is "any more than one."
It pains me to see programmers cope with compile-time error messages in declarations by haphazardly adding or removing *s, []s, and ()s until the messages (but not necessarily the errors) go away. While you might scrape by doing this is with C, it's hardly a formula for success with C++.
C++ compilers often have a tougher time than C compilers pinning down syntax errors in source programs. C++ offers you more ways to make mistakes, and so compilers have a harder time telling you what you did wrong. It really helps in deciphering error messages if you learn to think like the compiler. This means really understanding the syntax.
As we plumb the depths of C++ over the coming months, we will need a reasonably clear and concise notation for describing the syntax of C++. Most C and C++ literature, including the draft C++ standard uses one such grammar notation. Unfortunately, that notation leaves much to be desired. Consequently, I'm going to buck the trend and use a different one. This I do not do lightly. The grammar notation, and my reasons for preferring it, are my topic this month.

K & R's Grammar Notation
In a way, it's hard to fault C and C++ programmers for not understanding the language syntax. Although most C and C++ references use the same grammar notation, that notation is pretty hard to read.
Kernighan and Ritchie introduced this notation in their description of C [1], and it just ran rampant, at least within the C and Unix communities. Other descriptions of C, notably [2], adopted this inconvenient notation, as did the ANSI C committee. Stroustrup also used the notation to describe C++[3], and, most regrettably, so did the ANSI C++ committee.
At its simplest, K & R's notation is not bad at all. For example, the grammar rule for a C++ delete-expression is:
delete-expression:
    ::_opt delete cast-expression
This rule states that the C++ language construct called a delete-expression consists of an optional :: (the scope resolution operator), followed by the keyword 3, followed by a cast-expression. Grammar rules such as this are also known as productions, where the symbol on the first line of the rule is called the left-hand-side, and the sequence of symbols on the next line is called the right-hand-side. You can read a production P to mean that, whenever you see an occurrence of P's left-hand-side in a production (including P itself), you can replace that occurrence with the symbols on P's right-hand-side.
In reality, a delete-expression has more than one form. That is, it has more than one possible right-hand-side. The K & R notation simply lists each alternative right-hand-side on a separate line in the production, as in
delete-expression:
    ::_opt delete cast-expression
    ::_opt delete [ ] cast-expression
This rule states that a delete-expression is either an optional :: followed by the keyword delete and then a cast-expression, or it's an optional :: followed by the keyword delete and a [ and a ] and a cast-expression.
One problem with this notation is that it relies on different typefaces or typestyles to distinguish terminals from non-terminals and meta-symbols (operators of the notation itself). A terminal symbol is one that's part of the language that the grammar describes. For example, the keywords case and default are terminals. They never appear as the left-hand-side of a production, and so you can't expand them any further.
A non-terminal is a symbol that is a not a terminal. (How profound.) A non-terminal expands to a sequence of replacement symbols. Each non-terminal must appear as the left-hand-side of some production. For instance, constant-expression and statement are non-terminals. K & R's notation has only two meta-symbols, (1) the : that separates the left-hand-side from the right-hand-side, and (2) the subscript opt which appears after an optional symbol.
In most works, terminals appear in the same typeface as code examples. In the C++ draft document, that typeface is Courier. In this publication, it's Letter Gothic. The non-terminals usually appear in the same typeface as normal text, but in italics. This may be fine for books and journals, but not for source and header files, nor email messages. Most of us still maintain our source code as plain text (typically using ASCII or some other national variant of ISO 646). We also use plain text for email correspondence. When you try writing K & R-style grammar rules in plain text, you lose the typesetting information. You must do something else to convey the distinction between terminals and non-terminals, and to distinguish opt as a meta-symbol.

EBNF
The simplest solution to the K & R notation's typesetting problems is to use a different notation. I prefer a notation called EBNF (Extended Backus-Naur Form). Jensen and Wirth[4] popularized EBNF, and most literature on Pascal and Modula-2 uses it. EBNF solves the typesetting problems by putting the terminals in double quotes, and by enclosing optional items in [ ]. Applying these conventions to the production for delete-expression, you get:
delete-expression:
   [ "::" ] "delete" cast-expression
   [ "::" ] "delete" "[" "]" cast-expression
The K & R notation has another serious drawback: it uses the end of a line to separate alternatives. This presents formatting problems when the right-hand-side of a production is too long to fit on a single line. If you let the end of the right-hand-side spill over on to another line, readers likely will mistake the spill as a new alternative, rather than as a continuation of the previous one. As far as I know, you have only three choices: use a wider column, use a smaller typeface, or move part of the long right-hand-side to a separate production. Oh, there is a fourth choice: use the EBNF conventions.
EBNF rules are free format. That is, they treat newlines as ordinary whitespace (as any sensible notation should). EBNF uses | (vertical bar) as the separator between alternatives, and . (period) to terminate the entire production. Also, EBNF uses = (equal) rather than : (colon) to separate the left- from the right-hand-side. Applying these conventions to the production for delete-expression, we get:
delete-expression =
   [ "::" ] "delete" cast-expression |
   [ "::" ] "delete" "[" "]" cast-expression .
The two alternatives differ only in that the second one has the tokens [ and ] after the keyword delete, and the first does not. EBNF makes it easy to collapse the two right-hand-sides into one, as follows:
delete-expression =
   [ "::" ] "delete" [ "[" "]" ] cast-expression .
You can't do this nearly as easily using K & R's notation. The subscripted operator opt applies to only one symbol at a time, so if you try writing
delete-expression:
    ::_opt delete [ ]_opt cast-expression
then only the ] is optional. Writing
delete-expression:
    ::_opt delete [_opt ]_opt cast-expression
isn't right either, because this allows either
delete [ cast-expression
or
delete ] cast-expression
neither of which is legitimate C++.
K & R's notation does not allow parentheses as grouping operators, so you can't even write
delete-expression:
    ::_opt delete ( [ ] )_opt cast-expression
The best you can do is create a new non-terminal
brackets:
    [ ]
and then rewrite the delete-expression rule as:
delete-expression:
    ::_opt delete brackets_opt cast-expression
Sometimes extra productions, such as brackets above, add clarity. Often they don't, adding clutter instead.

Repetitive Constructs
The K & R notation has no direct notation for repetition, so it combines recursion with alternation to simulate repetition. For example, in discussing the syntax for flow-control statements last month, I presented productions for compound-statement:

compound-statement: { statement-seq_opt }
and for statement-seq (statement sequence):

statement-seq: statement statement-seq statement
The first alternative for statement-seq says that a statement-seq can be a single statement. The second alternative says that a statement-seq followed by a statement is itself a statement-seq. Since a statement-seq can be a single statement, then a statement followed by a statement is a statement-seq. Since two statements are a statement-seq, then so are three statements, and four statements, and so on.
EBNF has a more direct notation for repetition. EBNF uses { and } to enclose a sequence of symbols that may occur zero or more times. For example,

statement-seq = { statement } .
is EBNF specifying that a statement-seq is a sequence of zero or more statements. This is a slight change from the K & R-style production above, which states that a statement-seq is one or more statements. Hence, the correct EBNF is

statement-seq = statement { statement } .
Now we can merge this into the EBNF for compound-statement to obtain a single rule:

compound-statement = "{" [ statement { statement } ] "}" .
This says that a compound-statement is an optional sequence of one or more statements enclosed in braces. But, "an optional sequence of one or more" is just "a sequence of zero or more". Thus, we can simplify the rule to:

compound-statement = "{" { statement } "}" .
Most of the productions that define arithmetic operators are recursive. For example, the C++ draft standard defines expressions for multiplicative operators as:

multiplicative-expression: pm-expression multiplicative-expression * pm-expression multiplicative-expression / pm-expression multiplicative-expression % pm-expression
We can write this production more concisely using EBNF:

multiplicative-expression = pm-expression { ( "*" | "/" | "%" ) pm-expression } .
This right-hand-side uses parentheses to group the choice of multiplicative operators ( "*" or "/" or "%" ). Without the parentheses, the juxtaposition of symbols to form a sequence has higher precedence than the choice presented by the | operator. That is, without parentheses,

"*" | "/" | "%" pm-expression
specifies either a "*", or a "/", or a sequence consisting of a "%" followed by a pm-expression. With the parentheses,

( "*" | "/" | "%" ) pm-expression
means a sequence consisting of either a "*" or a "/" or a "%", and then a pm-expression.

Syntax Diagrams
While EBNF has notational advantages over K & R's notation, many people find EBNF only marginally more readable, preferring a graphical notation such as syntax diagrams. Jensen and Wirth[4] used syntax diagrams in tandem with EBNF to describe Pascal. At least two books use these diagrams to describe C: Plauger and Brodie[5] call them "railroad-track diagrams;" Darnell and Margolis[6] call them simply "railroad diagrams."
A syntax diagram is a graph (in the computer science sense) composed of boxes (nodes) and paths (arcs) connecting the boxes. Each box contains a single grammar symbol, either a terminal or a non-terminal. Some people (including me) use boxes with rounded ends for terminals and square ends for non-terminals. Others use different typefaces inside the boxes to distinguish terminals from non-terminals.
The paths between the boxes have arrows to indicate the direction of flow. All paths are one-way. Each graph flows generally from left to right, but any graph may contain loops. You can think of each diagram as defining a simple machine (an automaton) that generates the symbols in the boxes it passes through as it follows the arrows. Thus a syntax diagram defines a syntactic contruct by describing all the different sequences of symbols that the construct can generate.
For example, Figure 1 shows a syntax diagram for a delete-expression. It really is nothing more than a pictorial representation of the EBNF rule
delete-expression =
   [ "::" ] "delete" [ "[" "]" ] cast-expression .
They both say the same thing, namely, that a delete-expression is an optional :: followed by the keyword delete, followed by an optional pair of square brackets, followed by a cast-expression.
As a second example, Figure 2 shows a syntax diagram that graphically depicts the rule:
compound-statement =
   "{" { statement } "}" .
Syntax diagrams can be very vivid, but most of us don't yet have really good tools for editing the diagrams, splicing them into source code or pumping them through the 'Net. I know I don't, but if you think you have such tools, I'd like to know about them. Until I do, I'll stick with EBNF.

Lexical vs. Syntactic Rules
A syntax rule such as

delete-expression = [ "::" ] "delete" [ "[" "]" ] cast-expression .
describes a phrase, delete-expression, as a sequence of symbols, rather than as a sequence of individual characters. That is, the rule states that a delete-expression is an optional :: operator, followed by the keyword delete, and so on, rather than stating that it's an optional : followed by a second :, followed by zero or more whitespace characters, followed by a d, and then an e, and then an l, and so on.
The syntax rules treat symbols such as :: and delete as atomic units, called tokens or lexemes. You can freely place whitespace characters (spaces, tabs, newlines) between any tokens, but not inside a token. For example, as far as C++ is concerned

:: delete [ ] p
and

:: delete [ ] p
are the same delete-expression. However, putting a space anywhere inside the keyword delete splits it into two separate tokens, so that

:: del ete [ ] p
is not a delete-expression, but an error.
There are many situations where you need not use any whitespace to separate tokens. For example, the only spaces you really need in

int strcmp ( const char * s1 , const char * s2 ) ;
are the ones between int and strcmp and between const and char (both occurrences). Were you to omit those spaces, then intstrcmp would become a single identifier, as would constchar. C++ will never mistake strcmp(const or char*s2 as identifiers because it does not allow ( and * in identifiers.
Most language descriptions use the same notation to describe tokens that they use for phrases. A production that defines a token is commonly called a lexical rule. For example, the lexical EBNF for an octal-literal in C++ is

octal-literal = "0" { octal-digit } .
This says that an octal-literal is the character O followed by zero or more octal-digits, where

octal-digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" .
Since octal-literal is a token, you cannot add whitespace anywhere inside.
Too many otherwise good language references, including [2] and [6], lump all the rules for tokens and phrases together and expect you to know where intervening whitespace is allowed (in phrases) and where it is not (in tokens). How are you supposed to know? Well, you just are.
Other references, including the C and draft C++ Standards place the lexical rules and the phrase rules in separate sections, and leave it to you to appreciate the implications. [1] and [3] avoid all confusion by omitting the lexical rules entirely.
I have not yet seen anyone propose a notation, either for an EBNF-like notation or for syntax diagrams, that clearly distinguishes lexical rules from phrase rules. I think you just have to group them accordingly and remind people of the distinction. I will certainly try to do so as I use these notations in future columns.

References
[1] Brian Kernighan and Dennis Ritchie. The C Programming Language (Prentice-Hall, 1978).
[2] Samuel Harbison and Guy Steele. C: A Reference Manual, 2nd. ed (Prentice-Hall, 1984).
[3] Bjarne Stroustrup. The C++ Programming Language (Addison-Wesley, 1986).
[4] Kathleen Jensen and Niklaus Wirth. Pascal User Manual and Report (Springer-Verlag, 1974).
[5] P. J. Plauger and Jim Brodie. Standard C (Microsoft Press, 1992).
[6] Peter Darnell and Philip Margolis. Software Engineering in C (Springer-Verlag, 1988).