Columns


The Column that Needs a Name: Parsing C++ Declarations, Part 1

by Dan Saks


For the past two months, I've been discussing the structure of C++ declarations. (See "Understanding C++ Declarations," CUJ December, 1995 and January, 1996.) I continue this month by presenting a C++ program that parses (reads and verifies) C++ declarations. This program should help solidify your understanding of both the grammatical structure of C++ declarations and their corresponding semantics.

Actually, the program touches so many issues that I can't cover it all in one article. So this month's column focuses on the overall behavior of the program and the design of the character scanning functions.

What It Does

I call my declaration parsing program decl. I figured decl should do more than just read a declaration and say "yea" or "nay," so decl translates correct declarations into English. For example, given the input:

const char *a[10];

decl responds with:

a is ...
array with 10 elements of type...
pointer to ...
const char

Many of you have probably seen similar programs in popular C books such as [1], [2], and [3]. My decl program differs from these in that it handles not only C declarations, but also C++ declarations involving references and pointers to members. For example, given the input:

int T::*pm;

decl responds with:

pm is ...
pointer to member of T with type ...
int

Unlike the other programs, decl produces an English translation only if the input declaration is syntactically correct. decl checks rigorously for syntax errors, but some errors still slip through its grasp: it does not maintain a symbol table, so it can't really check for semantic errors.

For example, decl can detect that

const char const *p;

has a duplicate const qualifier, or that

int *;

does not declare anything (the declarator-id is missing). However, decl does not reject

int f[10]();

even though C++ does not allow arrays of functions. I'll elaborate on the distinction between syntactic and semantic errors next month.

The Grammar for decl

decl accepts object and function declarations that satisfy the form known in the draft C++ Standard as a simple-declaration. Table 1 shows an EBNF grammar for a simple-declaration. This grammar is a further simplification of the already simplified grammar appearing in Tables 2 and 3 of my previous column ("Understanding C++ Declarators," CUJ January, 1996). I'm using this decl program to illustrate how declarator operators combine with the type specifiers to yield the type of a declaration. Therefore, the grammar in Table 1 omits details, such as initializers, storage-class-specifiers, and function-specifiers, which do not contribute to the type of a declaration.

I've also exercised some editorial discretion and just omitted details from the grammar because I did not want to contend with the complications they would add to the program. For example, the grammar does not provide for either declarator-ids or type-names that are more complicated than a single identifier. It also restricts the constant-expression in an array-suffix to a single identifier or integer literal.

The grammar in Table 1 contains several non-terminals, such as identifier and integer-literal, that never appear on the left-hand side of a production in Table 1. These symbols represent tokens, and they are defined (by appearing on the left-hand side of a production) in the separate grammar in Table 2.

Not all the productions in Table 2 represent complete tokens. A production in Table 2 represents a token only if its left-hand side also appears on the right-hand side of some production in Table 1.

Thus, integer-literal, identifier, name, type-keyword, and cv-qualifier are tokens. The other productions, digit, letter, lowercase-letter, and uppercase-letter, define sets of characters used in composing tokens.

Flavors of Identifiers

The grammar for decl (Table 1) defines the non-terminal type-name as simply:

type-name =
    name .

This is a bit different from the draft C++ Standard, which defines:

type-name =
    class-name | enum-name | typedef-name .

The short explanation is that I made this change to keep my decl program reasonably simple. But there's more you should know.

The draft C++ Standard refers to non-terminals such as class-name, enum-name, and typedef-name as context-dependent keywords. The draft C++ Standard's grammar uses the different context-dependent keywords as if they were different kinds of tokens, but grammatically, they're usually just identifiers. That is, a class-name is usually just an identifier that's been previously declared in the current context as the name of a class. An enum-name is merely an identifier that's been previously declared in the current context as the name of an enumeration. And of course, a typedef-name is an identifier that's been declared in the current context as the name of a typedef. However, a typedef-name that names a class is also a class-name, and a typedef-name that names an enumeration is also an enum-name.

Even though they all look just like identifiers, the grammar must distinguish the context-dependent names from each other and from identifiers because C++ is not a context-free language. That is, a compiler can't parse a C++ program based solely on its syntactic structure; the compiler needs semantic information (such as whether an identifier already names a type) to make parsing decisions. The "maximal munch" rule is a case in point.

Recall that the decl-specifier-seq that begins a simple-declaration may contain keywords and/or a type-specifier, and they may appear in any order. The type-specifier might be a keyword, such as int, or it might be a context-dependent keyword such as a class-name which might, in turn, be an identifier. Then again, the type-specifier might be absent altogether, in which case it defaults to int (by the "implicit int" rule).

In short, the decl-specifier-seq is a variable number of keywords, identifiers, and possibly other stuff, all jumbled together. Since the declarator that follows the decl-specifiers might begin with an identifier, finding the end of the decl-specifier-seq can be a problem. C++ simplifies the problem somewhat by applying the "maximal munch" rule, which states that:

"The longest sequence of decl-specifiers that could possibly be a type name is taken as the decl-specifiers of a declaration."

That is, a C++ compiler will "munch" as many symbols as it can in an attempt to include them in the decl-specifier-seq. As soon as it finds something that can't be part of the decl-specifier-seq, the compiler shifts gears and starts parsing the first declarator.

Sometimes, a compiler can apply the maximal munch rule without using any semantic information. For example, given

int const N = 10;

a compiler can easily tell that N is not a decl-specifier because the decl-specifier-seq already has a type-specifier, namely int, and it can have at most one. (A multi-word combination such as unsigned int is still considered to be only one type-specifier.) After it stops munching the decl-specifiers, the compiler must look up N to make sure it's not already declared in the current scope, but the compiler doesn't need that semantic information from the lookup to know when to stop munching.

On the other hand, there are times when a compiler can apply the maximal munch rule only if it has semantic information from previous declarations to distinguish type-names from other identifiers. For example, given

const T = 10;

the compiler must look up T to determine if it can be part of the decl-specifier-seq. If the compiler finds that T is a type-name, then T is a decl-specifier. This leads to a syntax error, because the declarator is then missing. On the other hand, if the compiler finds that T is not declared in the current context, then it stops munching. The decl-specifier-seq is just const (with an implied int), and the declarator is T.
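The decision the compiler faces here can be sketched as a small predicate. The following is a hypothetical illustration, not code from decl; the names is_decl_specifier and type_names are my own, and real compilers handle far more cases (keywords such as int, multi-word type-specifiers, and so on):

```cpp
#include <set>
#include <string>

// Hypothetical sketch of the munching decision: should this token join
// the decl-specifier-seq? For an identifier like T, the answer depends
// on semantic information from previous declarations.
bool is_decl_specifier(const std::string &tok,
                       bool have_type_specifier,
                       const std::set<std::string> &type_names)
    {
    if (tok == "const" || tok == "volatile")
        return true;                    // cv-qualifiers always qualify
    if (have_type_specifier)
        return false;                   // at most one type-specifier allowed
    return type_names.count(tok) != 0;  // here, semantic lookup is required
    }
```

For "int const N = 10;", the second branch fires for N with no lookup needed; for "const T = 10;", the third branch shows why the compiler cannot decide without knowing whether T names a type.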

Now I can really explain why I reduced the production for type-name to just

type-name =
    name .

decl needs a way to distinguish type-names from other identifiers. However, I did not want to build a symbol table to collect semantic information. So I faked the missing semantic information using a simple lexical rule: names and identifiers have the same appearance, except that a name always begins with an uppercase letter, and an identifier always begins with a lowercase letter. For example, N, Name, and Type are names, but n, name, t, and type are identifiers. See the grammar in Table 2 for the formal description of this rule.
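This lexical rule is simple enough to express as a one-line predicate. The helper below is a hypothetical sketch, not code from the actual listings:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper implementing decl's lexical rule: a name begins
// with an uppercase letter; an identifier begins with a lowercase letter.
bool is_name(const std::string &text)
    {
    return !text.empty()
        && std::isupper(static_cast<unsigned char>(text[0])) != 0;
    }
```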

As it parses declarations, decl assumes that each name has already been declared with suitable semantics, and each identifier has not yet been declared. Thus, for example, decl thinks

const T;

is an error, because it sees T as a name, which it assumes is a type-name, and thus a decl-specifier. decl includes T in the decl-specifier-seq and then finds the declarator missing. However, decl accepts

const t;

because it regards t as an undeclared identifier. It stops munching the decl-specifiers, and interprets t as the declarator-id.

Designing the Scanner Interface

decl is implemented as two primary components, a scanner and a parser. The scanner reads characters from the input stream, filters out the whitespace characters, and partitions the remaining characters into tokens. The parser acquires tokens from the scanner, pieces them into declarations, and translates them into English.

decl is a typical filter, in the sense defined by Kernighan and Plauger [4]. A filter program has one input and one output, and performs a useful transformation on the data as it passes through. decl reads from standard input, writes to standard output, and turns C++ declarations into English as they pass through.

C++ filter programs can use either FILEs (from the standard header <stdio.h>) or streams (from <iostream>). FILEs tend to produce smaller and faster executables, but streams are safer and more extensible. I opted for the latter in this program.

The parser is essentially the main part of the program; it reads the input, translates it, and writes the output. Although the parser writes strings directly to an ostream (an output stream), it should not read directly from an istream (an input stream). Reading from an istream delivers characters one at a time, but the parser wants to see tokens. Therefore, the parser hands the istream to the scanner; and lets the scanner get the characters one at a time. The scanner turns the characters into tokens and delivers them to the parser. From the parser's perspective, the scanner is the input stream. Therefore, I designed the scanner as a class with a rudimentary, stream-like interface.

The scanner has a constructor

scanner(istream &);

which attaches a scanner to an istream. It also has a function

token get();

which returns the next token from the input stream as an object of type token. In addition, the scanner "remembers" the current token. The parser can get another copy of the current token without changing the state of the scanner by calling

token current() const;

A token object is simply a pair of values. Aside from the default constructor, the token class has only two explicitly-defined public member functions, for querying those values:

string text() const;

returns the token as a string object, and

category kind() const;

returns an enumeration that categorizes the token. This categorization requires some elaboration.

Typically, when a parser is analyzing a phrase, it doesn't care about the particular characters that make up a token, only that the token is a member of some category of tokens. For example, when the parser is munching a decl-specifier-seq, it wants to know if the next token is a type-keyword (such as int or float), a cv-qualifier (such as const), or a name. The textual image of the token is immaterial at this point. Therefore, as it gets a token, the scanner classifies that token into a category so that the category is readily available to the parser.

The scanner represents categories as an enumeration:

enum category
    {
    ...
    INT_LITERAL,
    IDENTIFIER,
    NAME,
    TYPE_KEYWORD,
    ...
    };

In addition to these categories, each operator gets its own category. Thus, there are categories for AMPERSAND, COMMA, LEFT_BRACKET, STAR (asterisk), and others. There is also a category, NO_MORE, for the end of input, and another category, NO_SUCH, for an unrecognizable token. ("Unrecognizable" is the politically correct way to say "erroneous.")

The function token::kind returns one of these enumeration values. A typical interaction between the parser and scanner looks something like:

token t = input.get();
if (t.kind() == NAME ||
    t.kind() == INT_LITERAL)
    ...

or

category c = input.current().kind();
if (c == AMPERSAND || c == STAR)
    ...

(input is the scanner object.)

My experience has been that the identifiers that name token categories often show up in other parts of a compiler as the names of other entities. For example, the scanner might use the identifier CONST for the token category of the const keyword, and the symbol table might use CONST as the encoding of a type attribute in a table entry. To avoid such name conflicts, I defined the enumeration type category as a public member of class token, so that all references to the type name or to a category value must have token:: as a qualifier. Thus, the "typical" interactions that I showed before actually look like:

token t = input.get();
if (t.kind() == token::NAME ||
    t.kind() == token::INT_LITERAL)
    ...

or

token::category tc =
    input.current().kind();
if (tc == token::AMPERSAND ||
    tc == token::STAR)
    ...

The class definitions for scanner and token appear together in the header scanner.h (Listing 1). The header shows a few other details I haven't yet discussed, so I'll discuss them now.

The first eight enumeration constants of type token::category have explicitly defined values. For example,

class token
    {
    ...
    enum category
        {
        AMPERSAND = '&',
        ...

defines token::AMPERSAND with the value of the character code for '&'. This is an old C language idiom (the p.c. way to say "hack") that makes it very easy for the scanner to convert a single-character operator into its corresponding token category. All you need to do is cast the character to type category. The final category, NO_VALUE, is for a token that's none of the above.

This mapping works only for tokens that are single-character operators. The scanner must use other means to map other tokens into categories. However the scanner does that mapping, it must ensure that the other token categories do not conflict with the single-character operator categories. Therefore, the enumeration definition defines the first non-operator category to have a value outside the range of character codes. I've seen many programs do this by defining

enum category
    {
    AMPERSAND = '&',
    ...
    NO_MORE = 256,
    ...

I prefer to use the expression CHAR_MAX+1 instead of 256. CHAR_MAX is the maximum value of type char, as defined in <limits.h>. I believe this expresses the design more clearly.

The remaining enumeration constants do not have explicitly defined values, so the compiler simply assigns them values in ascending order starting with NO_MORE+1.
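Putting these pieces together, the numbering scheme might be sketched as follows. The real enumeration lives in Listing 1 (not reproduced here); the abbreviated names and the cast-based classifier below are my own illustration:

```cpp
#include <climits>

// Sketch of the numbering scheme: operator categories equal their
// character codes, and non-operator categories start above CHAR_MAX,
// so the two ranges cannot collide.
enum category
    {
    AMPERSAND = '&',
    STAR      = '*',
    NO_MORE   = CHAR_MAX + 1,  // first non-operator category
    NO_SUCH,                   // NO_MORE + 1
    INT_LITERAL                // and so on, in ascending order
    };

// Converting a single-character operator to its category is just a cast:
category classify_operator(char c)
    {
    return static_cast<category>(c);
    }
```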

The token class has two explicitly-declared constructors: a public default constructor

token();

and a private constructor

token(category, const string &);

The definitions for these two constructors appear as inlines in scanner.h (Listing 1).

I provided the public default constructor so that the parser could declare token objects for later use, as in

token t;
...
t = input.get();

I also allowed the compiler to generate a public copy constructor and copy assignment operator, so the parser could copy tokens back and forth. (I "allowed" the compiler to generate the functions by not declaring them myself.)

I provided the other token constructor

token(category, const string &);

so the scanner could manufacture new token objects. For example, scanner::get can use

return token(token::NAME, text);

to return a token that represents a name.

In the previous example, text is presumably a string that holds the characters that compose the name. I did not think it was appropriate to bog down this constructor with internal checks to verify that the characters in the text string do indeed spell out a token that belongs in the category. After all, ensuring that the category matches the text is the scanner's job.

Since this constructor (the "general" constructor) has no consistency checks, it's not really safe for widespread use. In particular, I don't want the parser using it to generate arbitrary tokens. Thus, I declared the general constructor private to keep it away from the parser, but granted friendship to the scanner so it could still access the general constructor. The net effect is that the parser can create token objects, but only with the default category NO_VALUE, or it can copy tokens returned by the scanner.
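A minimal sketch of this arrangement, using simplified versions of both classes (the real definitions are in Listing 1), might look like:

```cpp
#include <string>

// Sketch only: the general constructor is private, and scanner is a
// friend, so only the scanner can manufacture arbitrary tokens.
class scanner;                       // forward declaration for friendship

class token
    {
    friend class scanner;            // scanner may use the general ctor
public:
    enum category { NO_VALUE, NAME /* , ... */ };
    token() : kind_(NO_VALUE) {}     // the parser may default-construct
    category kind() const { return kind_; }
    std::string text() const { return text_; }
private:
    token(category k, const std::string &t)   // the general constructor
        : kind_(k), text_(t) {}
    category kind_;
    std::string text_;
    };

class scanner
    {
public:
    token make_name(const std::string &text)
        {
        return token(token::NAME, text);      // allowed: scanner is a friend
        }
    };
```

The parser can still copy tokens freely, because the compiler-generated copy constructor and copy assignment operator remain public.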

In addition to classes scanner and token, scanner.h defines one global function:

const char *image(token::category);

which converts a value of type token::category to a null-terminated character string. In C++, enumeration constants do not automatically convert to some mnemonic representation when you write them to a stream. For example, using the definitions for operator<< in <iostream>

cout << token::AMPERSAND;

does not display token::AMPERSAND as &. Rather, the enumeration value promotes to int, and the decimal value of the character code for '&' appears in the output.

I often find a need to display enumeration values in some mnemonic form. Rather than overload operator<< to do the job, I prefer a more general approach inspired by the Ada language. In Ada, the expression e'image returns the string representation of enumeration value e. The image function declared above provides a similar capability for token::category. To display tc of type token::category in a readable form on cout, you simply write

cout << image(tc);

Though slightly less convenient for output than an operator<< specifically for token::category, the image function is much more flexible. You can use the result of image(tc) for many other operations besides output.

The final detail in the scanner's public interface is the function

token scanner::unget();

which puts the current token back in the input so that it will be returned by the next call to scanner::get. Parsers for really well-designed languages do not need unget, but C++ is not such a language. I'll elaborate on unget next month when I show you the parser for decl.
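Although the real implementation appears in Listing 2, a sketch of the pushback mechanism, using plain strings in place of token objects and a canned token stream in place of real scanning, might look like:

```cpp
#include <string>

// Sketch only: one-token pushback can be implemented by remembering the
// current token along with a flag saying whether it has been pushed back.
class scanner_sketch
    {
public:
    scanner_sketch() : ungotten_(false), position_(0) {}
    std::string get()
        {
        if (ungotten_)               // deliver the pushed-back token first
            {
            ungotten_ = false;
            return current_;
            }
        current_ = scan_next();      // otherwise scan a fresh token
        return current_;
        }
    std::string current() const { return current_; }
    std::string unget()              // push the current token back
        {
        ungotten_ = true;
        return current_;
        }
private:
    // Stub standing in for real character scanning.
    std::string scan_next()
        {
        static const char *tokens[] = { "const", "char", "*", "p", ";" };
        return tokens[position_++ % 5];
        }
    bool ungotten_;
    int position_;
    std::string current_;
    };
```

Note that this scheme supports only one token of pushback, which is all decl's parser needs.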

Listing 2 shows scanner.cpp, which defines the non-inline member functions and static data members for the scanner class. I will explain the implementation details next month.

Listing 3 presents scantest.cpp, a simple program that exercises parts of the scanner. (It does not test scanner::unget.) Simply compile and link scantest.cpp (Listing 3) with scanner.cpp (Listing 2) to produce an executable.

Life on the Bleeding Edge

All of the listings refer to C++ header files without using the .h extension, as prescribed by the C++ draft standard. For example, the listings use

#include <iostream>

rather than

#include <iostream.h>

Unfortunately, most current (early 1996 vintage) compilers and libraries still require the latter. If you want to compile my code on your compiler(s), you have two choices: (1) put the .h extensions back in the code, or (2) write your own headers with names that leave off the .h. I recommend (2). That's what I did.

For example, I wrote a file called "iostream" containing just

#include <iostream.h>

I placed the file in a directory, and altered my compiler configuration to look for headers in that directory as well as all the other usual places. My fervent hope is that I will not need to change my code when the .h extension goes out of vogue.

Writing the header for <string> can be a bit trickier. The standard string class is newer than the iostream headers, and so some compilers don't even supply a string class yet. If that's the case, you might try using Plauger's string class[5].

Those libraries that do provide a string class certainly can't put it in <string.h>; the C++ library already has a <string.h>, which contains the C library's str and mem functions. You may have to hunt for the string class header. For example, in Borland C++ 4.5, the string class is in <cstring.h>. In Watcom C/C++ 10.5, it's in <string.hpp>.

Another problem is that the spelling of the string class name varies across libraries. In the Draft Standard, and in the Borland 4.5 library, the name is spelled string (all in lower case). However, the Watcom 10.5 library spells it String (with the leading S in upper case). In that case, you should define string as a typedef alias for String. Listing 4 shows a string header that makes Watcom's string class look more like the Draft Standard's string class.

I'll show you the rest of the decl program next month.

References

[1] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language, 2nd. ed. (Prentice Hall, 1988).

[2] Clovis L. Tondo and Scott E. Gimpel. The C Answer Book, 2nd. ed. (Prentice Hall, 1989).

[3] Peter Van der Linden, Expert C Programming (SunSoft Press, 1994).

[4] Brian W. Kernighan and P.J. Plauger. Software Tools (Addison-Wesley, 1976).

[5] P.J. Plauger. The Draft Standard C++ Library (Prentice Hall, 1995).

Dan Saks is the president of Saks & Associates, which offers consulting and training in C++ and C. He is secretary of the ANSI and ISO C++ committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield OH, 45504-4906, by phone at (513)324-3601, or electronically at dsaks@wittenberg.edu.