STANDARD C++: A STATUS REPORT

Defining a language standard

Dan Saks

Dan is the founder of Saks & Associates. He serves as secretary of the ANSI and ISO C++ standards committees. He is also contributing editor for The C Users Journal and columnist for The C++ Report. He and Thomas Plum are coauthors of C++ Programming Guidelines, and codevelopers of Suite++: The Plum Hall Validation Suite for C++. You can reach him at 393 Leander Dr., Springfield, OH 45504, by phone at 513-324-3604, or at dsaks@wittenberg.edu.

The ANSI C++ technical committee, X3J16, has been working on a formal C++ standard for almost three years. But the committee has yet to release a draft standard to the public, and an official standard is still several years away. Nonetheless, the committee's decisions have already affected the C++ compilers you use today, and will certainly shape the compilers and libraries you use in the future.

In this article, I'll explain how the C++ language definition is changing as it evolves into a standard. I'll cover the committee's major technical decisions and describe various problems that are yet unsolved. I'll also look at the prospects for a standard C++ library.

An International Standard

X3J16 is an ANSI technical committee, but it isn't writing just the U.S. national standard for C++. X3J16 is working closely with the ISO C++ Working Group, WG21, to develop an international C++ standard. (The working group's full name is ISO/IEC JTC1/ SC22/WG21. See the text box entitled, "Who's Standardizing C++?" for a more detailed explanation of the standardization process.) International programming-language standards typically start as national (read "ANSI") standards. But U.S. programming-language standards reflect American natural language and culture. Although programmers around the world may tolerate programming languages with English keywords, many need to express parts of their programs, like string literals, in their native languages. Not unreasonably, many also want to write identifiers and comments in their own language.

But many natural languages, even those based on the Roman alphabet, use more than just the 26 characters in the English alphabet. European keyboards have letters with accents and umlauts (like à and ö, respectively), combination letters (Æ), and other characters. Japanese Kanji keyboards have hundreds of keys for composing thousands of different characters.

European computer systems usually make room for native-language characters by omitting certain punctuation characters. For example, Danish keyboards, displays, and printers replace the [ and ] with Æ and Å, respectively. The C program in Example 1(a) comes out on a Danish printer looking like Example 1(b).

Example 1: (a) C program using U.S. ASCII; (b) the same program using Danish ISO 646; (c) the same program using trigraphs; (d) the same program using the new digraphs.

  (a)
  int main(int argc, char *argv[])
  {
          if (argc < 1 || * argv[1] == '\0') return 0;
          printf("Hello, %s\n", argv[1]);
          return 0;
  }

  (b)
  int main(int argc, char *argvÆA)
  æ

          if (argc < 1 00 *argvÆ1A == '00') return 0;
          printf ("Hello, %s0n", argvÆ1A);
          return 0;

  å

  (c)
  int main(int argc, char *argv??(??))
  ??<

          if (argc < 1 ??!??! *argv??(1??) == '??/0') return 0;
          printf ("Hello, %s??/n", argv??(1??));
          return 0;
  ??>

  (d)
  int main(int argc, char *argv<::>)
  <%

          if (argc < 1 or *argv<:1:> == '??/0) return 0;
          printf ("Hello, %s??/n", argv[1]);
          return 0
  %>

In a serious attempt to accommodate the needs of C programmers in non-English speaking cultures, the ANSI C committee added the wide character type wchar_t, multibyte character literals, trigraphs, and locales. But their efforts fell a little short. ISO adopted ANSI C as the ISO standard, but the ISO C working group spent another few years preparing an addendum to the standard that, among other things, provides better support for linguistic and cultural variations.

Even before X3J16's first meeting in December 1989, SC22 expressed interest in an international standard for C++. However, some SC22 members were concerned that previous ANSI programming-language standards, including C, didn't meet the needs of the international community, even though those ANSI standards were adopted as ISO standards. Sympathizing with their concern, X3J16 decided to try to produce a standard that ISO would accept without change as an international standard.

At SC22's request, X3J16 wrote a project plan for SC22 to create WG21 to work with X3J16 in developing the C++ standard. Also, X3 changed X3J16's charter from type D (domestic standards development) to type I (international standards development). This means that X3J16 is developing the ISO C++ standard with the intent that it will also become the ANSI standard.

WG21 held its first meeting in June 1991. All X3J16 meetings since have been joint meetings with WG21. I call the joint committee "WG21+X3J16." We meet three times a year, typically in March, July, and November. Each meeting lasts four and one-half days.

The Working Paper

Programming-language standards aren't written from scratch; the committee starts with one or more "base documents." X3J16 selected two base documents:

The AT&T C++ 2.1 PRM (Product Reference Manual).

The ANSI C Standard.

Many committee members wanted to use the Annotated C++ Reference Manual (the ARM) as the first base document. The ARM is an updated version of the PRM sprinkled with annotations and commentary that elaborate and clarify the PRM. While the ARM includes complete chapters on templates (Chapter 14) and exception handling (Chapter 15), the PRM has only empty place holders. However, AT&T (which owns the copyrights for both the ARM and the PRM) gave X3J16 permission to use only the PRM (not the ARM) for the standard. Thus, although we often think of the ARM as the base document, strictly speaking, the annotations and commentary aren't included.

When the committee selected the base documents, there was no ISO C standard. We decided that when the ISO C standard becomes available it will be our third base document. The ISO C standard turned out to be the same as ANSI C, but ISO C will soon have an addendum that WG21+X3J16 will have to consider.

The current draft of the C++ standard has no formal standing. We don't even call it the "draft;" we call it the "Working Paper." The project editor used the PRM as the first version of the Working Paper, and has spliced parts of the C standard in as needed.

The editor produces a new Working Paper three times a year, about two months before each meeting. At each meeting, the committee approves the document as the basis for future work. Someday we'll approve the Working Paper as a draft and make it available for public comment. I don't know when that day will be.

More Details.

Templates

AT&T C++ 2.1 does not implement templates, so the PRM does not describe them. However, the ARM describes templates (although the description is labeled "commentary"). The committee added the ARM's chapter on templates (less the annotations) to the Working Paper. At the time, there weren't any commercially available compilers supporting templates. Now there are several. The remaining discussion of templates assumes you are familiar with basic template features. [Editor's Note: For more information on templates, see "Templates in C++" by Nicholas Wilt on page 29 of this issue.]

X3J16's Formal Syntax working group identified problems with the template syntax. All the problems stem from the choice of <and> as delimiters for template argument lists. A C++ parser might have trouble distinguishing when <and> are delimiters and when they are operators.

For example, the formal parameter of a template can be a type, as in:

  template <class T> class list;

or it can be a value, as in:

  template <int n> class buffer;.

For class buffer, the actual argument in a template instantiation can be any integer expression, as in:

  buffer<10>b1;

  buffer<2*BUFSIZ>b2;

It can even be an expression containing the <and> operators, as in:

  buffer<x>y>z> b3;.

A C++ parser must be prepared to look arbitrarily far ahead to determine that x>y>z is the template argument. The committee briefly considered using parentheses, braces, or brackets for template argument-list delimiters, but decided to stick with <and>. The Formal Syntax group has suggested alternative grammars for C++ that correct the problem, but the committee hasn't selected one yet. In the meantime, if you limit your template arguments to simple expressions, you shouldn't have any trouble with today's compilers.

Exceptions

As with templates, the PRM doesn't have a chapter on exceptions, but the ARM does. So the committee adopted the ARM's Chapter 15 on exception handling, less the annotations, of course. Because exception handling is not yet generally available, I'll take a moment to explain it.

Exception handling is a mechanism for responding to events that may disrupt the normal flow of a program. The C++ exception-handling mechanism is designed to handle synchronous exceptions, like resource-allocation failures or values out of range, rather than asynchronous events like device interrupts. The syntax for exception handling uses three new keywords -- catch, throw, and try. My stock example in Figure 1 shows how exceptions work.

Figure 1: Exception handling in C++.

  int f()
      {
      try
          {
          // the compound statement part
          int n = g();
          // ...
          return n;
          }
      catch (int x)      // a catch clause
          {
          cerr << "number" << x << " happened\n";
          return x;
          }
      catch (char *x)    // another catch clause
          {
          // respond in some other way ...
          return -1;
          }
      }
  int g()
      {
      return h();
      }

  int h()
      {
      if (something_wrong)
          throw 2;
      // keep going ...
      }

The entire body of function f is something called a try block. A try block consists of a compound statement followed by one or more catch clauses. The catch clauses handle exceptions that may occur while executing the compound statement. Executing a throw expression triggers ("throws") an exception. The throw may occur in the compound statement itself or in functions called from the compound statement. This particular try block in f calls g, which calls h, which conditionally throws an exception.

Throwing the exception terminates the execution of both h and g, and returns control to a catch clause in f. f has two catch clauses, only one of which can handle the exception. The program selects the catch clause by matching the type of operand in throw with the type of expression in catch. In Figure 1, the operand of the throw expression in h is of type int, so the first catch clause in f catches that exception.

Some of you may recognize that exception handling is similar to the functionality of the standard C setjmp and longjmp functions. However, exceptions are safer and more powerful than setjmp and longjmp. If either g or h declared local objects with destructors, throwing an exception in h invokes the destructors for those local objects as it "unwinds" the stack on the way back to the catch clause in f. longjmp merely discards intervening stack frames as it returns to the setjmp point without calling destructors, leaving resources used by local objects in an uncertain state. Also, C++ exceptions can throw objects of any type to a handler, but longjmp only transmits integer values.

The committee had little doubt about the need for exception handling in C++, but there was considerable debate about the underlying execution model. The question was whether exception handling should only support termination, as described in the ARM, or support resumption instead. With resumption, a catch clause can return control directly to the throw point after handling the exception. With termination, the only way you can return to the throw point is by repeating the normal flow of execution. Bjarne Stroustrup summarized the question as: Does throwing an exception mean "get out" or "get help?" The committee opted for simplicity and stayed with the termination model.

When we added exception handling, we also relaxed the rules for matching throw expressions with catch clauses to allow a wider combination of type matches. We also added a small section on access rules for thrown objects.

New Tokens and Keywords

As I explained previously, programmers in non-English cultures have an added burden programming in C++ because C++ uses punctuation characters that have been replaced by native-language characters. (C programmers have this same problem.)

C++ uses ASCII as its character set. ASCII is the U.S. variant of the ISO 646 standard character set. ISO 646 uses fewer character codes than ASCII. Each national variant of ISO 646 can use the unused codes for native-language characters or symbols. In ASCII, characters like {}[]|^~ occupy the unused ISO 646 codes, and C++ uses these characters heavily.

C++ programming on systems using national variants of ISO 646 might be easier if programmers could write C++ programs using only invariant ISO 646 character set, and avoid the troublesome characters.

Standard C's trigraphs don't offer a particularly readable solution to this problem. Trigraphs are three-character sequences that are alternative representations for the troublesome characters. For example, the trigraphs for [and] are ??(and ??), respectively. Using trigraphs, the C program in Example 1(a) looks like Example 1(c).

But trigraphs were never intended for humans to compose C source code. They were designed to aid mechanical translation into C.

Keld Simonsen, the Danish representative to the ISO C and C++ Working Groups, devised a set of digraphs (two-character symbols) and new keywords as alternate spellings for the offending C and C++ symbols, shown in Table 1. Using these symbols, Example 1(a) looks like Example 1(d). Notice that you still need the trigraphs inside the character and string literals.

Table 1: New digraphs and keywords.

  Existing      Alternate
  -----------------------

     [            <:
     ]            :>
     {            <%
     }            %>
     &            bitand
     &&           and
     I            bitor
     II           or
     ^            xor
     ~            compl
     &=           and_eq
     I=           or_eq
     ^=           xor_eq
     !            not
     !=           not_eq

The ISO C addendum specifies the identifiers in Table 1 as macros defined in a new standard header, iso646.h. Thus, you can continue using those identifiers as user-defined identifiers in C as long as you don't include iso646.h. However, the C++ Working Paper adds the identifiers to the set of reserved words. This means that, at some future date, you will not be able to use those identifiers as user-defined identifiers in any C++ program. Consider yourself warned.

Core Language Issues

New features to C++ draw a lot of attention, but the real work of the standards committee is ironing out the flaws in the language description. These flaws include inconsistencies, ambiguities, and minor omissions. Here are a few of the flaws in the ARM corrected by the Working Paper.

The name S in struct S {...}; is called a "tag name." In C, tags are not type names. That is, you cannot declare S x;. You must write it as struct S x;. In C, you can turn a tag into a type name using a typedef, as in typedef struct S S; and then you can write just S x;.

In C++, tags names are automatically both tag names and type names. For compatibility with C, C++ accepts (and ignores) typedefs that equate type names with tag names.

Some C programmers mimic the C++ behavior by always declaring their structs using typedef struct S {...} S;.

Other C programmers don't even bother with the tag name and write

typedef struct {...} S;.

Section 7.1.3 of the ARM states, "An unnamed class defined in a typedef gets its typedef name as its name. For example, typedef struct {/* ... */} S; // the struct is named S." But it's not clear whether a member function with the same name as the typedef is a constructor. In other words, given the code in Example 2(a), is A::A a constructor?

The committee decided that the answer is no. The commentary in the ARM makes it clear that this rule was only meant to give the class a name for linkage. So the committee changed the rule to say, "An unnamed class defined in a typedef gets its typedef name as its name for linkage purposes." Our intent is that the previous declaration should be equivalent to Example 2(b), making it clear that A::A can't be a constructor. We also intended that this class not have a destructor, but those words are not yet in the Working Paper.

Example 2: (a) An unnamed class defined in a typedef; (b) equivalent to 2(a) using a dummy typedef name.

  (a)
  typedef struct {
      A();
  } A;

  (b)

  struct dummy_name {
      A();
  };
  typedef struct dummy_name A;

Another problem in the ARM appears in Section 6.7. It says, "An auto variable constructed under a condition is destroyed under that condition and cannot be accessed outside that condition." In Example 3, you cannot access j after the first If statement, because either j was never created (if i is 0), or j has already been destroyed. This rule creates the only situation in C++ where the lifetime of a named object ends before it goes out of scope. That is, you can see j but you can't touch it.

Example 3: Case where an auto variable j is constructed within a conditional statement.

  if (i)
      for (int j = 0; j < 100; j++) {
          // ...
      }
  if (j != 100) // error: access outside condition
     // ...

The Working Paper eliminates this anomaly by changing the rules to say the statement in an If, If-Else (both branches), Switch, While, Do, or For statement implicitly defines a local scope. Example 4(a) is now equivalent to Example 4(b).

Example 4: (a) Under new rules, the statement in a conditional implicitly defines a local scope; (b) equivalent to Example 4(a) under the new rules.

  (a)
  if (i)
      for (int j = 0; j < 100; j++) {
           // ...
      }

  (b)
  if (i) {
      for (int j = 0; j < 100; j++) {
           // ...
      }
  }
  // j is no longer in scope

Name Lookup

The committee's Core Language working group has spent a great deal of time trying to pin down the rules for looking up names (identifiers) as they are referenced. The working group used the example in Figure 2 to illustrate the problem.

Figure 2: Name Lookup example.

   1: struct X {
   2:     static int i;
   3:     struct Y {
   4:         int i;
   5:         void f();
   6:     };
   7: };
   8: int i;
   9: void X::Y::f() {
  10:    i = 5;
  11: }

The question is, to which declaration of i does i=5; on line 10 refer? It could be the X::i on line 2, or X::Y::i on line 4, or the global ::i on line 8. The ARM doesn't say. The Core Language working group informally agreed that the answer is X::Y::i. They also agreed that if you comment out line 4, then the answer is X::i.

The committee has yet to formalize these rules, but it appears that they will be something like the following. To look up a name inside the definition of an out-of-line member function:

1. Look in the local scope.

2. For each class name from left to right in the explicitly qualified name of the member function, look in the scope of that class. That is, for X::Y::f, look in Y, then look in X.

3. Look in successive enclosing scopes, moving outward to file scope.

The committee has resolved other problems with friend declarations and the "rewriting" rule for inline member functions. But, there are still many other details to work out. For example, inline friend function declarations raise other unanswered questions. Consider the example in Figure 3, in which function X::f is defined as a friend inline in Y. Amazingly, the ARM doesn't prohibit this definition. So, which i does i = 5; on line 8 refer to? The ARM doesn't say, leaving each implementation to decide for itself. The committee is considering simply banning inline definition of friend functions to avoid this problem.

Figure 3: Inline friend-function example.

   1: int i;
   2: struct X {
   3:     void f();
   4: };
   5: struct Y {
   6:     static int i;
   7:     friend void X::f() {
   8:         i = 5;
   9:     }
  10: };

The Standard Library

Although many C++ users would like the C++ standard to include an extensive class library, that's not likely to happen. The job is just too big. WG21+ X3J16 has wisely limited itself to a few critical library components:

Language support. Functions and classes required for runtime support of the C++ language. This includes standard implementations for the free-store management functions defined in the header new.h: new, delete, and possibly set_new_handler. It also includes exception-handling support defined in a new header exception.h: functions terminate, set_terminate, unexpected, set_unexpected, and a class SUE all similar to those described in Section 15.6c of the ARM.

Input/Output. A simplified version of the iostream library distributed with AT&T's cfront 2.0, with additional support for wide characters, multibyte strings, and locales. The working specification does not use templates, but does use exception handling.

Standard C. The Standard C Library adapted to C++.

Strings. One or more classes to support variable-length strings. Look for strings of char (ordinary "narrow" characters) and wchar_t (wide characters).

Simple foundation classes. Classes like bit sets, bit strings, and a template for general dynamic arrays.

At present, the Working Paper only includes the C++ version of the Standard C Library. Some aspects of the C library don't mesh well with C++, so the C++ standard makes some minor adjustments.

For example, the C declaration for the strchr function in string.h opens a "hole" in the library's type safety. The C declaration for strchr is:

  char *strchr(const char *s, int c);

strchr returns the address of a character in the string addressed by s (or a null pointer). This means that strchr returns the address of a constant character as the address of a modifiable character. This allows accidents like that in Example 5, which tries modifying name even though it's declared constant. memchr, strpbrk, strrchr, and strstr share this problem.

Example 5: Using the C version of strchr, a program can accidentally alter a const char.

  const char name [] = "Nancy";
  char c;

  ...
  *strchr (name, c) = tolower(c);

The C++ library declares all of these functions as an overloaded pair with extern C++ linkage with self-consistent arguments and return types. For example, the C++ library declares strchr as in Example 6.

Example 6: The C++ library declares strchr as shown here.

  extern "C++" const char *strchr(const char *s, int c);
  extern "C++"       char *strchr(      char *s, int c);

WG21+X3J16 does most of its technical work through working groups that analyze technical issues and make recommendations for resolving them. The working groups are:

C Compatibility. Identifies conflicts between Standard C and C++, and attempts to distinguish those conflicts that are necessary and must be documented from those that might be eliminated.
Core Language. Studies problems in the syntax and semantics of the C++ language itself.
Environments. Considers issues in specifying the translation and execution environments for C++ programs, such as program startup and termination, translation limits, and hosted vs. freestanding implementations.
Extensions. Evaluates proposals for extensions to C++.
Formal Syntax. Identifies ambiguities, inconsistencies, and errors in the grammatical specification of the language.
Libraries. Drafts proposed specifications for components in the standard C++ library. (The C++ PRM contains no library specification.)

--D.S.