December 1995/Stepping Up to C++

Understanding C++ Declarations

Dan Saks

Dan Saks is the president of Saks & Associates, which offers consulting and training in C++ and C. He is secretary of the ANSI and ISO C++ committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield OH, 45504-4906, by phone at (513)324-3601, or electronically at dsaks@wittenberg.edu.
Last month, I looked at different notations for describing the syntactic structure of programming languages such as C and C++. (See "The Column That Needs a Name: A Sensible Grammar Notation," CUJ, November, 1995.) Along the way, I explained why I prefer EBNF (Extended Backus-Naur Form) to the notation employed in both the C Standard and the C++ Draft Standard. Now I will use EBNF to help explain one of the most misunderstood aspects of both C and C++ — the structure of declarations.
As I mentioned last month, I've found that few C and C++ programmers really understand the syntax of declarations and their associated semantics. While most programmers have little trouble creating and interpreting simple declarations, their understanding breaks down when confronted with declarations that have too many *s, &s, ( )s, []s or const qualifiers.
I'm not concerned about programmers' abilities to decipher overly-complicated declarations contrived for "see if you can figure this one out" puzzles. My concern is with their abilities to handle even mildly-complicated declarations that occur in the normal course of C++ (and even C) programming. For example, I find that I can't teach a C++ course at any level without pausing at some point to review the differences among C declarations such as
const char *x[N];
char const *y[N];
char *const z[N];
and then why

x[i] = z[i];
is valid, while

z[i] = x[i];
is not.
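The difference can be checked mechanically. What follows is only a sketch under stated assumptions: it relies on static_assert and <type_traits> from C++11, which postdate this column; the value of N is arbitrary; and z gets an empty initializer because an array of const pointers must be initialized.

```cpp
#include <type_traits>

const int N = 4;            // arbitrary bound, chosen only for illustration

const char *x[N];           // array of N pointers to const char
char const *y[N];           // same type as x: const and char may swap places
char *const z[N] = {};      // array of N const pointers to (non-const) char

// x and y are declared with exactly the same type.
static_assert(std::is_same<decltype(x), decltype(y)>::value,
              "x and y have the same type");

// x[i] = z[i] is valid: x[i] is a modifiable pointer, and a char *
// converts implicitly to a const char *.
static_assert(std::is_assignable<decltype((x[0])), decltype((z[0]))>::value,
              "x[i] = z[i] compiles");

// z[i] = x[i] is not valid: each z[i] is itself const.
static_assert(!std::is_assignable<decltype((z[0])), decltype((x[0]))>::value,
              "z[i] = x[i] does not compile");
```

Even if z[i] were not const, the second assignment would still fail, because converting a const char * to a char * would discard the qualifier.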
As if C declarations weren't rich enough by themselves, C++ thickens the stew by throwing in other ingredients such as reference types, qualified names, cv-qualified member functions, and pointers to members. Although a C++ declaration such as

const X::N *((X::*f)(void *, size_t) const);
isn't commonplace, it's not unreasonably complicated either. I've seen declarations such as this in production code and it behooves you to understand them.
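One way to read such a declaration is to rebuild it from typedefs and let the compiler confirm that the pieces match. The sketch below is illustrative only: the class X, its nested type N, and the member function lookup are hypothetical names invented here, and the checks assume a C++11 compiler.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical class, invented for this sketch.
struct X {
    struct N {};
    const N *lookup(void *, std::size_t) const;   // declared only, never called
};

// The declaration from the text (with size_t spelled std::size_t).
const X::N *((X::*f)(void *, std::size_t) const);

// The same type built up in two steps.
typedef const X::N *Result;                              // pointer to const X::N
typedef Result (X::*Alias)(void *, std::size_t) const;   // pointer to const member function

// f is a pointer to a const member function of X that takes
// (void *, size_t) and returns a pointer to const X::N.
static_assert(std::is_same<decltype(f), Alias>::value,
              "f has the type spelled out by the typedefs");
static_assert(std::is_same<decltype(f), decltype(&X::lookup)>::value,
              "f could point at X::lookup");
```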
C++ adds even more complexity by using minor variations of the declaration syntax in each of several different contexts: declaration-statements, member-declarations, new-expressions and parameter-declarations. At its simplest, the type specification in a new-expression looks just like the type specification in a declaration. However, new-expressions have special rules for more complicated types which invite little surprises. For example,

p1 = new char *[2];
p2 = new (char *)[2];
are both valid and have remarkably different meanings.
Mastering the structure of declarations is even more important in C++ than it is in C. C++ offers you more ways to make mistakes, and so compilers have a harder time telling you exactly what you did wrong. (Sometimes they can't even tell you approximately what you did wrong.) You'll be able to make more sense of error messages if you can parse a declaration properly.
Most of the confusion surrounding declarations stems from the part called a declarator. So that's where we'll focus our attention.

Type Specifiers vs. Declarators
Declarations in C++ take various forms. In addition to the basic forms inherited from C for declaring objects, constants, types, and functions, there are additional forms for templates, namespaces, and linkage specifications. However, none of these additional forms contain declarators, so I'll ignore them for the time being.
In C++, the non-terminal corresponding to object and function declarations is ironically called a simple-declaration. The EBNF production (syntax rule) for a simple-declaration is

simple-declaration = { decl-specifier } [ init-declarator-list ] ";".
In other words, a simple-declaration is a sequence of zero or more decl-specifiers followed by an optional init-declarator-list, followed by a semicolon. (I explained EBNF in detail last month. A summary of EBNF appears in Table 1.)
A decl-specifier can be any of a number of things. It can be a type-specifier (a keyword such as int or double, or an identifier that names a type). It can be a storage-class-specifier such as extern, static, or register. It can be a function specifier, such as inline or virtual. It can also be the keyword friend or typedef, which don't seem to fit into any category. For example, the sequence
const unsigned long int
is a decl-specifier sequence. The order doesn't matter, so
int long const unsigned
is also a decl-specifier sequence denoting the same type. Table 2 shows a partial grammar for a decl-specifier.
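The claim that order doesn't matter is easy to confirm. This is a minimal sketch assuming a C++11 compiler (for decltype and static_assert); the variable names a and b are invented for the illustration.

```cpp
#include <type_traits>

const unsigned long int a = 1;   // the conventional ordering
int long const unsigned b = 2;   // the same decl-specifiers, shuffled

// Both sequences denote the same type.
static_assert(std::is_same<decltype(a), decltype(b)>::value,
              "decl-specifier order does not change the type");
static_assert(std::is_same<decltype(a), const unsigned long>::value,
              "int is optional after long");
```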
An init-declarator-list is a sequence of one or more init-declarators separated by commas:
init-declarator-list =
   init-declarator { "," init-declarator } .
An init-declarator is a declarator followed by an optional initializer.
init-declarator =
   declarator [ initializer ] .
A declarator is the identifier being declared, along with all the *s, ( )s, []s and (in the case of C++) &s that modify that identifier. For example, *p is a declarator indicating that p is a pointer. A pointer to what? A pointer to the type specified by the decl-specifiers. For example,
const unsigned long int *p
declares p as a pointer to a const unsigned long int.
Thus an init-declarator is a declarator with an initializer. For example, in
unsigned int *p = &ui;
*p = &ui is the init-declarator.
You can string several declarators (with or without initializers) in a single declaration, as in
unsigned int *p, &r = u, n;
The declaration contains three declarators: (1) *p, (2) &r, and (3) n. The declaration declares that p is a pointer to an unsigned int, r is a reference to an unsigned int, and n is just an unsigned int.
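To see that each declarator picks up the shared decl-specifiers independently, here is a small sketch, assuming a C++11 compiler. The definition of u is added here so that the reference r has something to bind to.

```cpp
#include <type_traits>

unsigned int u = 0;            // added so that r has an object to refer to
unsigned int *p, &r = u, n;    // one decl-specifier sequence, three declarators

static_assert(std::is_same<decltype(p), unsigned int *>::value,
              "p is a pointer to unsigned int");
static_assert(std::is_same<decltype(r), unsigned int &>::value,
              "r is a reference to unsigned int");
static_assert(std::is_same<decltype(n), unsigned int>::value,
              "n is a plain unsigned int");
```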

Syntax and Style (Again)
This brings me to another point about the relationship between syntax and style. I'm not suggesting that cramming more than one declarator on a line is good style, but the syntax does allow it. I really don't think there's anything wrong with a declaration such as

int i, j, k;
or even

char *s1, *s2;
However, I can certainly understand why you might find

char c, *p, buf[BUFSIZ];
to be too busy, and want to split it up. Whatever you choose, you should understand that the *s, &s, []s and ()s are part of a declarator, not part of the decl-specifier sequence.
I bring this up because some programmers prefer to write

const char* p;
int& r;
rather than

const char *p;
int &r;
That is, they use a space to group the * and & with the decl-specifiers rather than with the declarator. This is no help. It leads to misunderstandings about the meaning of declarations. It's easy to misread

char* s1, s2;
as declaring s1 and s2 both with type "pointer to char." No matter how you space them, the *s and &s are part of the declarator, not the decl-specifier. Thus the declaration is really

char *s1, s2;
or, more explicitly,

char *s1;
char s2;
To their credit, many programmers who put the *s and &s with the decl-specifier consistently write only one declarator per declaration, and thus avoid this confusion.
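The misreading is easy to expose with a compile-time check. This is just a sketch, assuming a C++11 compiler; the spacing deliberately follows the style under discussion.

```cpp
#include <type_traits>

char* s1, s2;   // spaced as if both names were pointers

// Only s1 is a pointer; the * belongs to the first declarator alone.
static_assert(std::is_same<decltype(s1), char *>::value,
              "s1 is a pointer to char");
static_assert(std::is_same<decltype(s2), char>::value,
              "s2 is a plain char, not a pointer");
```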
I believe most C++ programmers put the *s and &s with the decl-specifiers because that's the way Bjarne Stroustrup (the inventor of C++) has always done it. This, in fact, is usually the way you can tell a "real" C++ programmer from a C hack using C++. "Real" C++ programmers do it Stroustrup's way. Too bad he does it wrong. (I guess you now know what that makes me.)
There is a third style, which is to place a space on both sides of a * or &, as in

const char * p;
int & r;
This is certainly reasonable. However, I would argue that if you lay out declarations this way, you should also put a space in expressions between a unary operator and its operand. For example, expressions such as

p = & x;
* s1 ++ = * s2 ++;
are consistent with the above declarations, while

p = &x;
*s1++ = *s2++;
are not. Then again, you might agree with Ralph Waldo Emerson that "a foolish consistency is the hobgoblin of little minds."

The "Maximal Munch" Rule
The decl-specifiers may include a type name. A type name may be a class name, an enumeration name, or a typedef name. Most programmers habitually write the type name as the last (rightmost) decl-specifier, but it need not be so. For example, you can write

static const T *p;
as

const T static *p;
or

T static const *p;
and they all mean the same thing.
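A quick way to convince yourself is to declare all three and compare the resulting types. The sketch assumes a C++11 compiler; T is given a hypothetical definition as a typedef for int, and the names p1, p2, and p3 are invented so that the three forms can coexist.

```cpp
#include <type_traits>

typedef int T;          // hypothetical definition of T, for illustration only

static const T *p1;     // type name last, as most programmers write it
const T static *p2;     // type name in the middle
T static const *p3;     // type name first

// All three names have the type "pointer to const T".
static_assert(std::is_same<decltype(p1), const T *>::value,
              "p1 is a pointer to const T");
static_assert(std::is_same<decltype(p1), decltype(p2)>::value,
              "p2 has the same type as p1");
static_assert(std::is_same<decltype(p1), decltype(p3)>::value,
              "p3 has the same type as p1");
```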
A declarator might not have any operators in it — it might be just an identifier. For example, in

static const size_t MAX = 1000;
the declarator is just MAX. If declarators always began with an operator such as * or &, it would be pretty easy to tell where the decl-specifiers ended and the first declarator began. But, since the decl-specifier sequence is just a stream of unordered keywords and identifiers, and the declarator might be just another identifier, finding the boundary is not always that easy. This also makes it harder for compilers to provide precise error messages for errors in declarations. Many compilers just announce they've found a "declaration syntax error."
Other aspects of the syntax combine to make error detection even harder. In C++, as in C, you can omit the type-specifier from the decl-specifier sequence, and the type defaults to int. For example, you can write

const MAX = 1000;
and the compiler assumes MAX is an int. This is called the "implicit int" rule.
The other complicating factor is that the init-declarator-list might be completely empty in some contexts. For example, consider the function declaration

int f(unsigned int u);
Here we have a declaration within a declarator. At the outermost level, int is the decl-specifier and

f(unsigned int u)
is the declarator. But f's declarator contains a declaration in which unsigned int is the decl-specifier sequence, and u is the declarator.
Since the function declaration does not include a function body, the compiler simply parses the formal parameter name in the declarator and then ignores it. Thus, you can omit the name. If the declarator is just an identifier, and you omit it, then you have a declaration which has no declarator at all; it's just a decl-specifier sequence. For example,

int f(unsigned int);
declares f as a function with a single (unnamed) parameter of type unsigned int.
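The parameter name plays no part in the function's type, which a short sketch can confirm. It assumes a C++11 compiler, and the second function is renamed g here only so that both declarations can be compared side by side.

```cpp
#include <type_traits>

int f(unsigned int u);   // parameter declared with a declarator (u)
int g(unsigned int);     // parameter declared with no declarator at all

// Both declarations yield the same function type.
static_assert(std::is_same<decltype(f), decltype(g)>::value,
              "the parameter name is not part of the type");
static_assert(std::is_same<decltype(g), int(unsigned int)>::value,
              "g takes one unsigned int and returns int");
```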
Given all these syntactic complications, how then can a C++ compiler tell where the decl-specifiers end and the declarators begin? It applies the "maximal munch" rule, stated in section 7.1 (page 96) of the ARM [1] as:
The longest sequence of decl-specifiers that could possibly be a type name is taken as the decl-specifiers of a declaration.
In other words, the parser will "munch" as many symbols as it can as part of the decl-specifiers in a declaration before it starts parsing the first declarator. (I looked up "munch" and it means "to chew steadily with a crunching sound.")
To see how this applies, consider the formal parameter list in
int f(const T);
Is T part of the decl-specifier sequence or is it the declarator? In other words, does f have an unnamed parameter of type const T, or a named parameter T of type const int? Well, it depends on the context of the declaration.
Suppose T has been defined as a type name, such as
typedef int T;
or
class T;
In this case, the compiler munches T as part of the decl-specifier sequence because that produces the longest possible sequence. The init-declarator-list is then empty. That is, f has an unnamed parameter of type const T.
If T has not been defined as a type name, then T could not possibly be part of the decl-specifier sequence. The longest sequence is just const (meaning const int implicitly), and T is the declarator. In this case, f has a named parameter T of type const int.
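The first of these two cases can be sketched as follows, assuming a C++11 compiler and a hypothetical empty class named T. The second case cannot be shown the same way: it depends on implicit int, which compilers conforming to the C++ standard as eventually published do not accept.

```cpp
#include <type_traits>

struct T {};        // hypothetical type; its presence makes T a type name

int f(const T);     // T is munched into the decl-specifiers: unnamed parameter

// The parameter's type is T (top-level const does not appear in the
// function type), so T was read as a type, not as a parameter name.
static_assert(std::is_same<decltype(f), int(T)>::value,
              "f takes an unnamed parameter of type const T");
static_assert(!std::is_same<decltype(f), int(int)>::value,
              "f does not take an int parameter named T");
```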
Late in 1993, the C++ standards committee passed a resolution which we all referred to as "banning implicit int." (That's also the way I described it a few months ago in "Stepping Up to C++: Other Assorted Changes, Part 3," CUJ, September, 1995.) The resolution modified the draft to say that:
Only in function-definitions and in function declarations for constructors, destructors, and type conversions can the decl-specifier-seq be omitted.
This says that you can no longer write function declarations that completely omit the return type specifier, except in constructors, destructors, and conversion operators. For instance, a function declaration such as
f();
will not be standard C++ (when there is a Standard). You must supply a return type, as in
int f();
This is the case for member as well as non-member functions (except for constructors, destructors, and conversion operators).
The "implicit int" ban also states that:
At least one type-specifier is required in a typedef declaration. At least one type-specifier is required in a function declaration unless it declares a constructor, destructor, or type conversion operator.
This says that
extern f();
will not be standard C++ because, even though extern is part of the decl-specifier sequence, it is not a type-specifier. You must write the declaration as
extern int f();
Unfortunately, the current C++ draft considers cv-qualifiers (the keywords const and volatile) to be type-specifiers. (See the production for type-specifier in Table 2.) Thus, the draft still allows implicit int in some declarations. For example, the draft still allows
const MAX = 1000;
which defines MAX as a const int. It also allows
int f(const T);
even when T is not a previously defined type name, in which parameter T has type const int.
Apparently the ban was not a complete ban. I'm not sure this is what the standards committees intended, but that's the way it is. Therefore, maximal munch is still very much alive, and we must all contend with it when we see vague error messages emanating from compilers. Sigh.
I'll continue next month with a detailed look at the syntax of declarators.

Syntax and Style, Once More
As P.J. Plauger warned me, my article on "Syntax and Style" (CUJ, October, 1995) generated more mail than any other article I've written. (All of it was electronic.) I was pleasantly surprised to find that even those who did not agree with all that I wrote still enjoyed the article.
My managing editor also forwarded me this interesting little note:
My flabber never so gasted! Dan Saks writes "Straker calls this the Allman style. Even though I haven't the foggiest notion where the name comes from, I'll use it anyway."
Is Dan being sarcastic, or is he genuinely unaware of _Eric_ Allman's many achievements, notably his work at UCB with BSD UNIX (esp. SENDMAIL and the [in]famous trapdoor that let in the giant worm) and, whence the eponymy, his long-running columns on C style/usage in UNIX Review?
PAX etc.
Stan Kelly-Bootle
Mill Valley, CA
skb@crl.com
I haven't the foggiest notion who this Stan Kelly-Bootle is, but I'd like to thank him for setting me straight. No, I wasn't being sarcastic, nor am I genuinely unaware of Eric Allman. I just never made the connection, and apparently none of my editors did either. Sorry Eric, no offense intended.
Another reader, Marty Leisner (leisner@sdsp.mc.xerox.com) confirmed the origin of the Allman style. He also explained that the Allman style for the switch statement is not as I explained. He showed me this snippet from Allman's sendmail program:

switch (up->udb_type)
{
  case UDB_DBFETCH:
    /* get the default case for this database */
    if (up->udb_default == NULL)
    {
I had suggested the code would look like:

switch (up->udb_type)
{
case UDB_DBFETCH:
    /* get the default case for this database */
    if (up->udb_default == NULL)
    {
The difference is that the "genuine" Allman code indents the case label two spaces from the enclosing braces. If this is indeed the style, then the rules for the Allman style are more complicated than I supposed.
Two readers wrote to tell me that they've gotten good results from the Free Software Foundation's indent program, a source code reformatter. I'll try to dig up a copy and let you know what I think.

Our T1 Lines are Still Open!
Last month I invited you to submit a new name for my column that reflects my new(?) focus on more advanced C++ topics. That invitation is still open. Send your suggestion(s) to cujed@rdpub.com. The winner (if any) will receive a free copy of the first ever CUJ CD-ROM, plus a free CUJ t-shirt.

Reference
[1] Margaret A. Ellis and Bjarne Stroustrup, The Annotated C++ Reference Manual (Addison-Wesley, 1990).