September 1996/C++ Theory and Practice

Columns

C++ Theory and Practice

Dan Saks

Abstract Declarators, Part 3

Sometimes looking at a problem a different way makes it all come into focus. Dan presents an alternative grammar to make elements of C++ syntax much easier to understand.
Copyright © 1996 by Dan Saks

A declarator is the central part of an object or function declaration. It includes a declarator-id (the name being declared), along with any operators that modify that declarator-id. For example, the declarator in
unsigned char *const p[N];
is *const p[N]. It declares that p (the declarator-id) has type "array with N elements of type const pointer" to something. The decl-specifiers (the stuff to the left of the declarator) tell us that the something is unsigned char.

An abstract declarator is a declarator without a declarator-id. Abstract declarators appear in numerous C++ constructs, probably most often in parameter declarations and cast expressions. They also appear in exception declarations, exception specifications, template argument lists, and sizeof expressions.

For example, a function declaration such as
char *memcmp(const void *,
             const void *,
             size_t);
has three parameter declarations, yet no parameter names. Therefore, each has an abstract declarator. In the first two parameter declarations, the abstract declarator is just *; each parameter's type is "pointer to const void". In the third parameter declaration, the abstract declarator is actually empty; the parameter's type is just size_t.

Abstract declarators appear in cast expressions as part of the type-id (the part of the cast that specifies the result type). For example, in the cast expression static_cast<D &>(b), the & is an abstract declarator specifying a reference type. The complete type-id D & specifies the type "reference to D".

For quite a few months now, I've been developing a program called decl that parses C++ declarations and translates correct ones into English. At present, decl recognizes declarators but not abstract declarators. In my last column (two months ago), I started augmenting the program to recognize abstract declarators, planning to finish it the following month. But, the software business being what it is, things didn't quite work out that way. This month, I'll just say I'll make some more progress.

decl's Current State

decl has two main parts: a scanner and a parser. The scanner reads characters from the input stream, filters out the whitespace characters, and partitions the remaining characters into tokens (identifiers, keywords, operators, and punctuation). The parser acquires tokens from the scanner, pieces them into declarations, rejects incorrect declarations, and translates correct ones into English.

Two months ago, I explained how recognizing abstract declarators complicates the parser by occasionally forcing it to look ahead more than one token to determine the next parsing action. (See "C++ Theory and Practice: Abstract Declarators, Part 1," CUJ, June 1996.) Last month, I gave the scanner the ability to look ahead and back up by more than one token, in preparation for adding abstract declarators to the parser. (See "C++ Theory and Practice: Abstract Declarators, Part 2," CUJ, July 1996.)

It's been several months since I presented the parser in its entirety (see "The Column That Needs a Name: Recovering from Parsing Errors," CUJ, April 1996). As I mentioned, that parser recognizes declarators, but not abstract declarators. It also recognizes function declarators, but does not allow parameters in them. This month, I will explain the remaining design issues for a parser that can recognize not only abstract declarators, but also function declarations with non-empty parameter lists.

Left Recursion vs. Iteration

One of the points I've been trying to make with these articles on parsing is that, once you really learn to read it, the C++ grammar can guide you in synthesizing or decomposing any C++ construct. I've been using the decl program to show how the grammar maps into a fairly simple method for recognizing C++ programs, called recursive-descent parsing. Many compilers actually use recursive-descent. Understanding the technique will give you greater insight into C++ syntax and the meaning of compiler diagnostics.

Before I add new code to parse abstract declarators and function parameter lists, I want to take another look at the grammar. Table 1 shows a grammar for those C++ declarations that decl already recognizes. The grammar lacks some details of complete C++ declarations, but it covers a very useful subset. It also indicates that the parentheses in a function-suffix may surround a parameter-list. Until now, this function-suffix has been just a place-holder; decl does not yet recognize function parameters.

As always, I've expressed the grammar in the EBNF notation, summarized in Table 2. This notation is not the same as the notation used in the draft C++ Standard. Rather, the draft uses the notation that K & R (Kernighan and Ritchie [1] ) used to describe C.

One of the problems with the K & R notation is that it has no direct notation for iteration (repetition). Grammars that use this notation must combine recursion with alternation (choices) to simulate iteration. Don't get me wrong; I like recursion, but not to the exclusion of iteration. Using recursion to describe a repetitive construct is not as clear as using iteration. And, as contradictory as this may seem, there are some recursive rules that you can't map directly into a recursive-descent parser. Permit me to explain.

With recursive-descent parsing, each production in the grammar maps (more-or-less) into a separate parsing function. In decl, the parser is a class, and each parsing function is a member of that class. For example, the grammar has a production
declarator =
    direct-declarator |
    ptr-operator declarator .
and so the parser has a member function called declarator. Each parsing function returns a string that spells out the meaning of what it just parsed.

The right-hand side of a typical production refers to the names of other productions. Each reference becomes a function call to the corresponding parsing function. For example, the parsing function for declarator includes a call to ptr_operator followed by a (recursive) call to declarator. The | (vertical bar) in a production indicates a choice. In this case, it indicates that a declarator can be either a direct-declarator, or a ptr-operator followed by another declarator.

The declarator parsing function appears in Listing 1. input is the scanner object. input.current() returns the current token, and input.current().kind() returns the current token's category. The function looks for a token that can begin a ptr-operator before calling ptr_operator. If it finds no such token, it blindly calls direct_declarator. I took a little liberty in translating the grammar into this function. Had I translated the grammar rigorously, the function would check for a token that can begin a direct-declarator before calling direct_declarator. However, I wrote declarator assuming that direct_declarator will complain if it is not happy with the token it sees.

The declarator parsing function uses recursion, and uses it properly. ptr-operators bind to a declarator from right to left, and recursion is an effective tool for reversing the order of the ptr-operators so they appear from left to right in the output.

For example, the declarator *&r means r is a reference to a pointer, not a pointer to a reference. Each call to declarator that begins with a ptr-operator parses that one ptr-operator and saves its translation. declarator calls itself to parse the remainder of the declarator (which may include more ptr-operators). When the recursive call returns, declarator appends the translation of the ptr-operator to the end translation of the declarator. Thus, the first-in-last-out nature of recursion reverses the order of the ptr-operators.

Now let's look at the grammar for direct-declarator. The draft C++ Standard uses recursion in this rule, too:
direct-declarator =
    declarator-id |
    "(" declarator ")" |
    direct-declarator ( array-suffix | function-suffix ) .
This rule has three primary alternatives. The first two alternatives are no problem. Each begins with a unique token that leads the parser toward that choice. However, the last alternative is a puzzler. It begins with a reference to direct-declarator, which is this very same rule. Thus, the tokens that lead the parser to this alternative are exactly the same as the tokens that lead to the other alternatives. The only way the parser can distinguish the third alternative from the first two is by looking ahead for an array-suffix or a function-suffix.

Now that the scanner provides unbounded lookahead, looking ahead should be easy, but in this case it isn't. Well, maybe the looking is easy but finding the right token is hard. An array-suffix begins with a left bracket and a function-suffix begins with a left parenthesis. The parser cannot simply look for the first occurrence of one of these tokens; it must look for the last occurrence that has not been found by a previous lookahead.

For example, consider the declarator x[10]. The first time the parser enters direct_declarator, it sees x as the current token. It must decide if it can simply accept x as a declarator-id, or call itself and then call array_suffix to parse a suffix. Since there is a suffix after x, direct_declarator does call itself. Each call to array_suffix parses exactly one suffix. Therefore, the recursive call to direct_declarator must parse just x, and leave the [10] for array_suffix.

Upon entering the recursive call to direct_declarator, the current token is still x. This call to direct_declarator can't tell whether it was called recursively or from another function such as declarator. It must decide for itself if it can just accept x as the declarator-id, or if it must call itself again. There is a suffix after x, suggesting that a recursive call is in order. However, that suffix, [10], has already been spoken for (by the previous call to direct_declarator), so this recursive call must not parse the [10]. Thus, when the parser looks ahead for a suffix, it must stop before reaching the [10].

Parsing a declarator such as x[10][20] gets even harder. In general, the parser must somehow exclude one more suffix from the lookahead on each recursive call to direct_declarator. This is no simple task.

The production for direct-declarator shown above is said to be left-recursive, because the recursive reference to direct-declarator appears as the leftmost symbol of an alternative on its right hand side. In contrast, the production for declarator shown earlier is right-recursive. Right-recursive rules translate pretty easily into recursive-descent parsing functions; left-recursive rules do not. Fortunately, you can often transform a left-recursive production into an iterative one that's much easier to parse. Let's do this for direct-declarator.

The left-recursion in direct-declarator suggests that the suffixes group from left to right. For example, in x[10][20], [20] is the suffix for x[10], and [10] is the suffix for x. Thus, x[10][20] is equivalent to (x[10])[20]. A parser can simply read the suffixes from left to right and pass their translations in that order to the output. Therefore, given x[10][20], then x is an "array with 10 elements of type array with 20 elements of type" something.

The first two alternatives in the production for direct-declarator indicate that a direct-declarator need not have any suffix at all. The third alternative indicates that a direct-declarator followed by one suffix is still a direct-declarator, and thus may be followed by two or more suffixes. The iterative rule for direct-declarator is therefore
direct-declarator =
    ( declarator-id | "(" declarator ")" )
        { array-suffix | function-suffix } .
This is the way it appears in Table 1. The parsing function for this production requires no lookahead at all.

Syntax vs. Semantics

Now let's look at the grammar rules for abstract declarators and parameter lists. A parameter-list is a comma-separated list of parameter declarations. The draft C++ Standard describes parameter-list using a left-recursive production:
parameter-list =
    [ parameter-list "," ] parameter-declaration .
Rewriting this using iteration yields a rule that's easier on the eyes as well as the parser:
parameter-list =
    parameter-declaration { "," parameter-declaration } .
The draft describes a parameter-declaration as:
parameter-declaration =
    decl-specifier-seq
        ( declarator | [ abstract-declarator ] ) .
This is similar to a simple-declaration, except that (1) a parameter-declaration can have at most one declarator (rather than a comma-separated list of declarators), and (2) the declarator in a parameter-declaration can be abstract.

Remember, an abstract declarator is just like an ordinary declarator, except it doesn't have a declarator-id. Thus, the productions for declarator and abstract-declarator should be nearly identical. Yet the draft C++ Standard opts to keep them as distinct productions, rather than combine them into one. This approach has both advantages and disadvantages.

There are places in the grammar, such as simple-declaration, that allow declarators but not abstract declarators. There is also one place, type-id, which allows abstract declarators but not ordinary (non-abstract) declarators. If you combine the productions for abstract declarators with declarators, it seems that you lose the ability to distinguish the places which allow only one kind of declarator.

On the other hand, ordinary declarators and abstract declarators are nearly identical, and describing them with different productions obscures their similarities. Table 3 shows the productions for declarator and abstract-declarator in the left-recursive form used in draft C++ Standard. As expected, the productions for declarator and direct-declarator are similar to the productions for abstract-declarator and direct-abstract-declarator, respectively. (The similarities become more apparent if you delete the word abstract everywhere.) However, considering that the only difference between a declarator and an abstract-declarator is the presence or absence of a declarator-id, it's surprising that the productions don't look more alike.

The other problem with using different rules for the different kinds of declarators is that it introduces parsing problems that require elaborate lookahead. Merging declarator and abstract- declarator into a single set of productions solves most of those problems.

As a first step in merging the productions, I removed the left- recursion and replaced it with iteration. The resulting productions appear in Table 4. I believe these productions make the similarities between declarators and abstract-declarators more evident. In fact, the only difference (other than the presence of abstract- in a bunch of places) is that the declarator-id is missing from a direct-abstract-declarator.

The productions in Table 4 still pose parsing problems that require looking ahead. A direct-abstract-declarator can begin with a left parenthesis, followed by an abstract-declarator and a right parenthesis. The parenthesized abstract-declarator is optional, so the parser can skip it and go on to look for an array-suffix or a function-suffix. However, a function-suffix also begins with a left parenthesis. Therefore, the parser must look at the next two tokens to distinguish a function-suffix from a parenthesized abstract-declarator.

For example, consider the abstract declarators (T &) and (T::*). The first declarator, (T &), is a function-suffix. The second, (T::*), is a parenthesized abstract-declarator describing a pointer-to-member. In parsing either of these declarators, the parser enters direct_abstract_declarator with the left parenthesis as the current token. If the parser looks at the next token after the parenthesis and sees a type name such as T, it still can't tell if it has a parenthesized abstract-declarator or a function-suffix. It must look at one more token. If that token is not a ::, then the parenthesis is the start of a function-suffix; otherwise, it's grouping around an abstract-declarator.

The lookahead problem I just described arises regardless of whether the grammar uses left-recursion or iteration. However, the iterative grammar in Table 4 has one problem that the left-recursive grammar in Table 3 does not. According to Table 4 , an abstract-declarator can be empty. According to Table 1, the parameter-list (between the parentheses) in a function-suffix can be omitted. Therefore, a direct-abstract-declarator that is just () could be either parentheses around an empty abstract-declarator, or a function-suffix with the parameter-list omitted. This is a syntactic ambiguity. That is, there are two different and apparently valid ways to parse the same sequence of tokens.

A careful (very careful) reading of the left-recursive grammar in Table 3 shows that it does not share this problem. According to Table 3, an abstract-declarator cannot be empty. Therefore, a direct-abstract-declarator that's just () must be a function-suffix.

I prefer the grammar in Table 4 . It's more readable and easier to map into a parser. The syntactic ambiguity it introduces is most unfortunate. Does this mean we must abandon it? I don't think so. Such ambiguities occur elsewhere in C++, and even in C. Both languages resolve the ambiguities by simply imposing additional semantic rules.

For example, according to C++ grammar, a function declaration such as
int f(const T);
has either an unnamed parameter of type const T, or a named parameter T of type const (and by default) int. The grammar is ambiguous. C++ resolves the conflict according to a semantic rule called the "maximal munch" rule. (See "The C++ Column That Needs a Name: Understanding C++ Declarations," CUJ, December 1995.) According to maximal munch, if T is a type name in the scope enclosing the function declaration, then the function has an unnamed parameter of type const T. Otherwise, it has a named parameter T of type const int.

The C++ standards committee recently discovered a syntactic ambiguity in the construct ~X(). According to the grammar, ~X() means either

~(X()), which applies the ~ operator to a temporary object of type X constructed with no arguments, or

this->~X(), which calls the destructor for type X applied to *this.

The committee opted to add a semantic rule favoring ~(X()).
I usually prefer an unambiguous grammar to an ambiguous one that resolves ambiguities by semantic rules, but I'm more concerned about the overall utility of the language description. I want the production for direct-abstract-declarator in iterative form, so I can combine it with direct-declarator, and I can't use the iterative rule for direct-abstract-declarator without some rule to specify the meaning of (). That rule is: a direct-abstract-declarator consisting of just () is always a function-suffix.

Now, at last, I can merge abstract-declarator with declarator to form a single set of productions. Table 5 shows the resulting simplified productions for parameter-declaration and declarator.

Merging abstract declarators with declarators introduces a problem with declarator-list (used in simple-declaration), which can accept declarators but not abstract declarators. The solution to the problem is a couple more semantic rules. Add the following production to the grammar:
concrete-declarator =
    declarator.
along with a rule explaining that a concrete-declarator must have a declarator-id. Also add:
abstract-declarator =
declarator.

along with a rule explaining that an abstract-declarator must not have a declarator-id.

Rules such as these already exist in both C++ and C. For example, they both have the production:
constant-expression =
    conditional-expression .
along with a slew of rules constraining the operands and operators that can appear in a constant-expression (so the compiler can evaluate the expression at compile time). The alternative is a lot of extra productions to describe constant-expression.

With these productions that distinguish abstract-declarators from concrete-declarators, the production for declarator-list becomes:
declarator-list =
    concrete-declarator
        { "," concrete-declarator } .
I'll show you the code for the revised parser next time. For those of you who can't wait, you can find it on CUJ's ftp site (see page 3 for ftp information).

Errata

In past articles, I presented versions of the decl grammar that defined decl-specifier-seq as:
decl-specifier-seq =
    { decl-specifier } .
This suggests that that the sequence can be empty. It cannot. The grammar in Listing 1 correctly shows that the sequence must contain at least one decl-specifier.

Thanks to Steven Lee (stlee@cisco.com) for bringing this to my attention.

Reference

[1] Brian Kernighan and Dennis Ritchie. The C Programming Language (Prentice Hall, 1978).

Dan Saks is the president of Saks &Associates, which offers consulting and training in C++ and C. He is secretary of the ANSI and ISO C++ committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield OH, 45504-4906, by phone at (513)324-3601 (the area code changes to 937 after September 1996), or electronically at dsaks@wittenberg.edu.