May 1996/C++ Theory and Practice

Columns

C++ Theory and Practice

Dan Saks

Perspectives on Grammars and Parsers

Parsing C++ declarations is easier than most people realize. Understanding how parsers work gives you the proper conceptual model for parsing a declaration "by hand."

Copyright © 1996 by Dan Saks

For the past several months, I've been probing pretty deeply into the grammatical structure and associated semantics of C++ declarations. I also presented a program that parses C++ declarations and translates them into English.

This month, I want to look back over the past few articles and try to put this discussion of grammars and parsers in some perspective. I suspect that it's pretty easy to lose sight of the big picture, especially since I might not have kept it in focus as well as I should have. Some e-mail I've recently received seems to confirm that suspicion. I'll use my reply to the following letter as the opportunity to refocus on the big picture before I add any more features to the program.

Date: 08 Feb 96 10:36:33 EST
From: Gregor Owen
<71121.625@compuserve.com>
Subject: Parsing C Declarations

I am enjoying your C parsing explorations in CUJ, but I wondered if you noticed how this looks? First you start off with an article promising to explain how to figure out C variable declarations, but then, that task apparently being beyond human capabilities, you describe a program that may be capable of doing it for us. A carping critic might suggest that man has met the C language, and man has lost. This certainly agrees with my day-to-day experience, where I regularly find it impossible to get the declarations to work.

For instance I was trying to make an array of functions const — so I wouldn't try writing into EPROM — and all I could do was
typedef void (*VFUNC)();

const VFUNC cwfuncs[LINES68] = {
    gotime1, gotime2, gotime3, gotime4
};
I spent the usual few minutes poking const into various places in the original arrangement, which (if I'm not mistaken) was something like void (*cwfuncs)[], and I then gave up. The next time I ran into a situation like this I didn't even try to insert the const; I just started with the typedef. I'm almost (but not absolutely) certain there's some "correct" way to do it without the typedef — but life is so short.

I guess what I'm railing about here is that maybe it's time for the C community to just get past denial and face it: it's a great language and everything, but the variable declarations are dysfunctional, they don't really work, and it's too late to fix them.

Best wishes,
J. G. Owen
Huntington Station, NY 11746

First, my articles have been about parsing C++ declarations, not just C declarations. I assume your concerns are directed at C++ as well as C. I never meant to convey the impression that parsing C or C++ is beyond human capabilities. Quite to the contrary, I think it's easier than most people realize. I'll explain why after presenting a little background.

The Lesson of the decl Program

Indeed, C++'s declaration structure is more complicated than we'd like it to be. But I don't think it's quite as bad as a lot of people make it out to be. The real problem is that most programmers somehow acquire an incomplete or incorrect conceptual model of how declarations work.

For example, more than one person has explained to me that they've used, and taught others to use, a fairly simple rule known as the "clockwise" rule, as explained by Andrew Binstock[1] :

Take any declaration, start with the innermost parentheses (in the absence of parentheses, start with the declared item's name), and work clockwise through the declaration going to the right first.

After a couple of simple examples, he gives this one:
char*c[];
Start with c, go clockwise to the right; [] is found: c is an array; keep going clockwise, * is found: c is an array of pointers; loop to char and c is an array of pointers to chars.

When circling through items in parentheses, always resolve the contents of parentheses before going outside of them.

The wording is not very precise, but as I understand it, the rule handles just about anything you can write in Classic C (as defined by Kernighan and Ritchie[2] ). However, it breaks down pretty soon after you introduce anything from C++, such as references or pointers to members. It even fails to handle const and volatile qualifiers (which are in Standard C as well as C++).

The primary problem with the rule is that it visits the pointer operators moving from left to right, when they really bind from right-to-left. When none of the pointers have cv-qualifiers (const and volatile qualifiers) the direction doesn't matter; any way you read it
char **p;
declares p as a "pointer to pointer to char". But if you add cv-qualifiers to the rightmost pointer operator, the direction most certainly does matter. For example,
char **const p;
declares p as a "const pointer to pointer to char". It's not clear what the clockwise rule thinks this is, but it's probably either a "pointer to const pointer to char" or maybe a "pointer to pointer to const char", neither of which is correct.

I'm not suggesting that you should not rely on simplified conceptual models. On the contrary, they're often valuable pedagogical tools. But the clockwise rule and other similar rules of thumb are not good models. The only reason it seems to be a simple rule is because the explanation leaves out some important details. By the time you fill in the missing details, it's not much simpler than the actual implementation.

In particular, the rule states you should "work clockwise through the declaration." It's not until you see an example that you discover that, when you wrap around to the left side, you're supposed to skip some stuff on the left. For example, when you wrap around to the left of
char *c[];
you skip over char and look at the *. Apparently, you only process the char after you finish with all the *s and []s.

The rule never really explains what part you skip over, or conversely, what part you skip to. How do you explain the concepts? Well, you usually start by giving them names. And I know just what to call them: the part you skip over is the decl-specifier-seq, and the part you skip to is the declarator. Where did I get those names? From the C++ grammar.

The only virtue (if you can call it that) of the clockwise rule is that the vague wording lets people pretend that they can avoid learning new terminology such as decl-specifier and declarator. This is very short-sighted. The best way to describe a C/C++ declaration is to say clearly that it has two main parts: a decl-specifier-seq followed by a declarator-list. And the clearest way to say that is with a grammar rule, such as
simple-declaration = decl-specifier-seq declarator-list.
The point I've been trying to make for the past few months is that, once you really learn to read the grammar, it can guide you in synthesizing or decomposing any C++ construct. I presented the decl program, not only to show you some useful C++ programming techniques and provide you with a useful tool, but to enhance your understanding of the C++ grammar. The program shows how the grammar maps into a simple method for recognizing input that you can apply in your head. If nothing else, this understanding should help you in deciphering diagnostic messages from C++ compilers.

The grammar for the C++ declarations recognized by my decl program appears in Table 1. As is my wont, the grammar is expressed in the EBNF (Extended Backus-Naur Form) notation, summarized in Table 2. (To see why I prefer EBNF to other notations, see "The Column that Needs a Name: A Sensible Grammar Notation," CUJ, November 1995.) The grammar omits some aspects of full C++ declarations, but it's still jam-packed with useful information.

For example, in my earlier indictment of the clockwise rule, I explained that the pointer operators in
char **p;
bind right-to-left and not left-to-right. You can infer this from the grammar. Look at the production for declarator:
declarator = direct-declarator | ptr-operatordeclarator.
This production says, among other things, that if you peel away a ptr-operator (a pointer operator such as *, &, T::*, or *const) from a declarator, the remainder of the declarator is itself a declarator. If you think of this production as a parsing function (which is what it is in the decl program), then this implies a recursive call on the declarator function. The declarator parsing function from the decl program appears in Listing 1.

Function calls are first-in-last-out. That is, the function called first finishes last. The inner call on declarator finishes the remainder of the declarator before it returns to let the outer call finish processing its ptr-operator. Every operator in the nested declarator binds more tightly to the declarator-id (the identifier at the core of the declarator) than does the outermost ptr-operator. Thus, **p is *(*p), not (*)*p (if that even makes any sense).

I hope the lesson of the decl program is not that declarations are too hard to decipher without a program, but that the deciphering process is actually pretty easy. You just need the right conceptual model, and that model is a parser driven by a grammar. Moreover, declarations are just one part of C++. There are lots of other parts that the grammar helps explain.

Say It Again, With Feeling

In my introductory article on C++ declarations ("Understanding C++ Declarations," CUJ, December 1995), I mentioned my dismay that many C++ programmers write declarations in the style of
const int* p;
int& r;
rather than
const int *p;
int &r;
That is, they use a space to join the ptr-operators * and & with decl-specifiers rather than with the declarator. I don't think I expressed my dismay strongly enough.

I really believe C++ programmers do themselves and each other a disservice when they write declarations in that style. Sure, the spacing makes no difference to the compiler, but putting the space after the * leaves many people with a false impression about the underlying structure of declarations. Understanding declarators is the key to understanding declarations, and breaking up declarators this way merely denies their existence.

All C++ compilers should come with the following warning on the box:

Warning: The C++ Standards Committee Secretary has determined that denying the existence of declarators can be hazardous to C++ programmer's mental health. It has also been known to cause defects in C++ programs.

Composing Declarations

Now, let's look at the particular declaration problem in Gregor Owen's letter, declaring "array of functions const". I think his solution:
typedef void (*VFUNC)();

const VFUNC cwfuncs[LINES68] = {
    gotime1, gotime2, gotime3, gotime4
};
is just fine. I don't think you can write it any more clearly, and it leaves little doubt about where the const qualifier belongs. But I understand his frustration with being unable to declare cwfuncs in a single declaration. So let's do it.

First, Owen speculates in his letter that the declaration (without const) might be of the form:
void (*cwfuncs)[];
However, this declares cwfuncs as "pointer to array of void". Rather than try to patch this declaration, I'll just start from scratch.

C and C++ do not allow an "array of function", but they do allow "array of pointer to function". Owen obviously knew this because he defined VFUNC as a "pointer to function" type. But it helps to state this explicitly before composing the declarations. Let's forget about adding const for the moment and just describe the type we want for cwfuncs. It is "array with N elements of type pointer to function with no parameters returning void". (N is easier to say than LINES68.)

First, cwfuncs[N] is "array with N elements". Next, *cwfuncs[N] is "array with N elements of type pointer". You don't need grouping parentheses, as in *(cwfuncs[N]) because [N] binds tighter than *.

Next, you add () to specify "function with no parameters". (In C, () means "no parameters.") The production for direct-declarator (Table 1) shows that you add a function suffix (such as () or ()const) after a declarator. However, if you write just *cwfuncs[N](), the () binds tighter than the *, so you get "array with N elements of function with no parameters returning pointer". This is not what we want and it's not valid C/C++. You need grouping parentheses, as in (*cwfuncs[N])(), to obtain "array with N elements of type pointer to function with no parameters".

That's it for the declarator. Just add the decl-specifier void and you're done:
void (*cwfuncs[N])();
Wasn't that easy?

Now, let's add the const. Making the array const means making the array elements const. That is, we want the type to be "array with N elements of type const pointer to function with no parameters returning void".

The grammar production for ptr-operator shows that the cv- qualifier goes immediately to the right of the * it qualifies. For example, the declarator *const p means p is a "const pointer". Writing const *p means p is a "pointer to const". We want "const pointer", so we insert the const into the declarator after the *. The correct declaration is
void (*const cwfuncs[N])();
Now, just to round out the picture, let's try adding const in a few other places. The declaration
void (const *cwfuncs[N])();
is a syntax error. Placing the const inside the parentheses suggests that const is part of the declarator, but a declarator cannot begin with const. Convince yourself by looking at the grammar.

Let's try const in another place. Writing
void const (*cwfuncs[N])();
moves the const from the declarator to the decl-specifiers; it becomes part of the function's return type. This declares cwfuncs as "array with N elements of type pointer to function with no parameters returning const void". Writing the decl-specifiers as const void produces the same result.

Are They Really Broken?

If you don't like the C/C++ declaration syntax, you're not alone. Even Bjarne Stroustrup, the designer of C++, wrote[3] :

The part of the C syntax I disliked most was the declaration syntax. Having both prefix and postfix declarator operators is the source of a fair amount of confusion.

I agree that it isn't pretty, but I can't agree with Gregor Owen that it doesn't work. It just makes us work harder than we want to.

By the way, Stroustrup[3] goes on to explain that he and others considered an alternative syntax that used postfix -> as the "pointer to" declarator operator. For example,
int v[10]->;
would declare v as "array with 10 elements of pointer to int", and
int p->[10];
would declare p as "pointer to array with 10 elements of int". He explains with regret that he never delivered a complete implementation. Now it really is too late to change the syntax.

One More Question

Another reader, Glenn Cole (gcole@netcom.com), wrote to point out that I presented an example I never really explained. In December 1995 I wrote:

For example, I find that I can't teach a C++ course at any level without pausing at some point to review the differences among C declarations such as
const char *x[N];
char const *y[N];
char *const z[N];
and then why
x[i] = z[i];
is valid, while
z[i] = x[i];
is not.

I believe you can infer the answer from the details I presented in the succeeding article ("Understanding C++ Declarators," CUJ, January 1996), but as long as I'm working through some examples, I should do this one.

First,
const char *x[N];
char const *y[N];
declare x and y with the same type, namely "array of N elements with type pointer to const char", whereas
char *const z[N];
declares z as "array of N elements with type const pointer to char".

Each element of z is a const pointer. Since z[i] is const for any i, z[i] is a non-modifiable lvalue — it cannot appear to the left of an assignment (but it can appear on the right). On the other hand, x[i] is non-const, so it can appear to the left of an assignment (as well as the right).

The assignment
x[i] = z[i];
copies a "const pointer to char" from z[i]. The value that leaves z[i] merely has type "pointer to char". How can that be? Consider the behavior of
const int k = 3;
int n;
...
n = k;
Even though the 3 sitting in k is const, the copy of that 3 that winds up in n is not const.

Thus
x[i] = z[i];
converts "pointer to char" to "pointer to const char". This conversion is perfectly safe. You can use x[i] to inspect, but not modify, the character addressed by z[i]. The conversion doesn't allow any accesses through x[i] that aren't already allowed through z[i], so it is indeed safe.

And the Winner is...

You may have noticed that this column no longer needs a name. The name "Theory and Practice" was suggested by Sam Ramirez (sam@hyperk.com). For his creative effort, he will receive a CUJ CD-ROM and a CUJ T-shirt, as well as the admiration of his peers. I think the column name aptly describes the nature of the topics that I've covered and will continue to cover.

References

[1] Andrew Binstock. "This Simple 'Clockwise' Rule Helps Decipher Complex Declarations," The C User's Group Newsletter, June: 1987.

[2] Brian Kernighan and Dennis Ritchie. The C Programming Language [1st ed.] (Prentice-Hall, 1978).

[3] Bjarne Stroustrup. The Design and Implementation of C++ (Addison-Wesley, 1994).

Dan Saks is the president of Saks &Associates, which offers consulting and training in C++ and C. He is secretary of the ANSI and ISO C++ committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield OH, 45504-4906, by phone at (513)324-3601, or electronically at dsaks@wittenberg.edu.