Bobby Schmidt is a freelance writer, teacher, consultant, and programmer. He is also an alumnus of Microsoft, and an original "associate" of (Dan) Saks & Associates. In other career incarnations, Bobby has been a pool hall operator, radio DJ, private investigator, and astronomer. You may summon him at 3543 167th Ct NE #BB-301, Redmond WA 98052; by phone at (206) 881-6990, or via Internet e-mail as rschmidt@netcom.com.
Introduction
This month, I finish my two-part overview of vocabulary largely cribbed from the C and C++ standards. As before, I introduce these terms both to explain practical problems you may encounter, and as a foundation for future columns. Now some of you may wonder if we will ever actually build on this foundation, thinking you are doomed to an eternity of C grammar school. Rest assured, the payoff starts next month; I explain more at the end of this column.
Translation Units
Last month, I made reference to translation units without really explaining the term. Roughly speaking, translation units are the textual "things" you feed to a compiler, one at a time. You may wonder why I don't call these source files; while I informally use these terms interchangeably, in reality a single translation unit may be composed of multiple source files. According to the C standard:
A C program need not all be translated at the same time. The text of the program is kept in units called source files in this standard. A source file together with all the headers and source files included via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion preprocessing directives, is called a translation unit.In other words, the compiler takes your original source file (with a name typically ending in .c or .cpp), recursively inserts the text of all #included source files (typically ending in .h), and removes text within skipped conditional compilation clauses (like #ifndef or #else). The result is the body of text actually translated as a unit by the compiler hence the name, translation unit. I'll develop an example to demonstrate this, but be warned that I'm taking some liberties here. I intend this example to give you a better gut feeling for translation units, not to be a strict tutorial on the exact phases of source file translation[1].
Assume you compile the C++ source file mundo.cpp, as shown in Listing 1. That file includes two other source files, hither.h and yon.h (Listing 2 and Listing 3) . The directive #include is a short-hand for textually inserting one source file in another; you could cut-and-paste the contents of hither.h and yon.h into mundo.cpp (as shown in Listing 4) , give that paste-up to the compiler, and achieve generally the same result as you would compiling the original mundo.cpp. Substituting included files is only part of the equation; the compiler also conceptually removes all text skipped by conditional directives. Listing 5 shows the resulting text after preprocessing conditionals are evaluated, and represents the final translation unit.
Note that while the declarations of hither and yon physically appear in the source files hither.h and yon.h, they logically end up in the translation unit arising from mundo.cpp. This is an important distinction between source files and translation units we humans demarcate and organize by source files, but the compiler really cares about the resulting translation units. That hither and yon are physically declared in a header, rather than physically in the mundo.cpp source, ultimately doesn't matter to the compiler; what does matter is that hither's declaration properly shows up in the translation unit, regardless of which source file actually contributes that declaration.
To see this in another light, change the definition of HERE to 0 in mundo.cpp, then mentally "recompile"; the resulting translation unit appears as Listing 6. You'll find that hither is no longer declared in the translation unit, even though its declaration still physically appears in hither.h the #if directive evaluates to zero, thereby filtering out the declaration's text. Again, the compiler does not "see" the original source files; it sees the resulting translation unit.
External and internal linkage are artifacts of translation units, which form encapsulation barriers. Internally linked objects (such as hither) are accessible to all entities within the barrier (that is, within the translation unit defining the object) and inaccessible to all entities outside the barrier (in other translation units). Externally linked object (like yon) are accessible across barriers (in all translation units). In C, judicious use of linkage lets you approximate C++'s public and private access specifiers for static members, an area I will mine in a future column series.
Also, note that I often use the terms translate and compile interchangeably, although strictly speaking compilation is neither necessary (an implementation may use run-time interpretation) nor sufficient (typically you must link compiled translation units to form an executable program).
Syntax vs. Semantics
Try foisting this on your C and C++ compilers:
int x y;Depending on which compiler and language I use, I get various error messages, including
'x' is followed by 'y' (missing ','?) unexpected token following declaration of 'x' syntax error : identifier 'y' (syntactic) unexpected symbol:'<IDENTIFIER>':y y: Couldn't understand this text as a declaration.What are these messages saying? And in particular, what is this business about "tokens" and "syntax errors"?
Tokens
Think of tokens as the smallest pieces of a source file that can be separated by whitespace without having the source's meaning change. As illustration, the code fragment
Wynken = *Blynken++ = "Nod";contains 8 tokens. If I add more whitespace, as in
Wynken = * Blynken ++ = "Nod" ;these tokens become more distinct. Note that I couldn't add whitespace at any other point without changing the source's interpretation by the compiler. For example, modifying the code to any of
Wy nken = *Blynken++ = "Nod"; Wynken = *Blynken++ = "Nod"; Wynken = *Blynken++ = "No d";also modifies what the code means to the compiler. In my original example, the line
int x y;contains the four tokens
int x y ;The error message
unexpected token following declaration of 'x'is really complaining about the token y following the token x.
Syntax
In the parlance of programming language translation, tokens are the smallest elements of a language's syntax. The term syntax[2] ultimately derives from Greek, and literally means "to join or arrange together." By analogy, tokens are to computer language syntax as words are to spoken language syntax.C and C++ recognize specific syntax patterns and rules specifying what categories of tokens can appear in what order. Collectively, these patterns and rules are called the language's grammar, again by analogy to spoken language. Programming languages typically define grammars either textually (as in the C and C++ standards) or graphically (in what are frequently called "railroad track" diagrams)[3].
In the original example, the error messages
syntax error : identifier 'y' (syntactic) unexpected symbol:'<IDENTIFIER>':yimply the compiler saw a token it wasn't expecting. That is, having seen
int xthe compiler (following the language's grammar) expects to see something like, or; or =, but not an identifier. In fact, the first original error message
'x' is followed by 'y' (missing ','?)is trying to be helpful, by suggesting a token that would make the syntax valid.
Semantics
Change the offending code line to
int x = y;Assuming this is the only line in the translation unit, compilers in either language will generate a diagnostic along the lines of
'y' : undeclared identifierNote the lack of complaint about wayward tokens and spurious syntax this code is valid according to both languages' grammar rules. Yet it still doesn't compile. Why?Simply put, while this code matches the expected form of C or C++, it doesn't have the proper meaning. The compiler knows that y is an identifier, but does not know what type of identifier. The compiler must learn that from y's declaration. Simply declaring y any arbitrary way, like
typedef int y; int x = y; /* error - 'y' can't be a type name here */won't work only specific meanings of y such as
int const y = 3; int x = y; /* aaah, much better */do the trick.A language's meaning is more formally known as its semantics, another term derived from Greek. When compiling a program, the translator first ensures the program meets the language's syntax rules, then matches every use of a program element against that element's meaning or semantics.
Correct syntax is a necessary (but not sufficient) condition for correct semantics; that is, a compiler can't tell if a program has the right meaning if it can't understand the program's structure. This explains why you can often fix the compiler errors in a code sequence, recompile, and get different errors in the same code the syntactic errors were blocking the compiler from analyzing the semantic errors.
That a program is correct semantically (i.e., that it compiles) does not guarantee the program will run as you expect, or that it will run the same under different environments (see last month's discussion of portability for more).
lvalue and rvalue
Try serving your C or C++ compiler this:
void f(void) { int const n; n = 5; }and it will likely volley back with something like
'n' : left-side of '=' must be modifiable lvalue.Just what does the compiler mean here? A common rule-of-thumb says "lvalues go to the left of =, rvalues go to the right." Unfortunately, in either language, code such as
int const Yin = 1; /* Yin is an lvalue */ int Yang; /* Yang is an lvalue */ Yin = Yang; /* error, an lvalue that can't go to the left of = ... */ Yang = Yin; /* ... and OK, an lvalue that can go to the right of = */betrays the fallacy of this simple definition.More correctly, an lvalue is an expression designating an object, while an rvalue is an expression designating a value. You can further think of an lvalue as an object's locator value, meaning that an lvalue (locator value) is a special case of an rvalue (general expression value).
This would seem to imply that anywhere an rvalue can appear, an lvalue (masquerading as an rvalue) can appear, but that implication is incorrect, as the following demonstrates:
int const n = 0; int m = 0; enum e { /* OK in both C and C++*/ e0 = 0, /* error in C, OK in C++ */ e1 = n, /* error in both C and C++ */ e2 = m };The complications come largely from the two types of lvalues, modifiable and non-modifiable. In the example above, n is a non-modifiable lvalue (one that cannot change), while m is a modifiable lvalue (one that can). The languages also differ on the semantics of const objects, leading to the split reaction towards e1 = n.Also bear in mind that not all =s are created, well, equal; some represent initialization, others assignment. What a particular = means affects what can appear to its left or right. All this leads to the more accurate rules:
Things can get a bit tricky with C++. For example, in C, a function call is always an rvalue expression; you can never write something like
- modifiable lvalues may always appear to the left, and may sometimes appear to the right, of =.
- non-modifiable (constant) lvalues may sometimes appear to the left, and may sometimes appear to the right, of =.
- rvalues may never appear to the left, and may always appear to the right, of =.
/* error; func() returns rvalue */ func() = 1;in C. In C++, a function call expression may be a modifiable lvalue, non-modifiable lvalue, or rvalue. Declaring the above function as
int &func();renders
/* OK; func() returns modifiable lvalue */ func() = 1;valid. Such behavior distinctions become vital when you design class member functions, operators especially.Finally, note lvalues and rvalues can appear in contexts with no = at all, as in
char s[10]; /* s is lvalue, "hello" is rvalue */ strcpy(s, "hello"); /Object Types
Last month I discussed object definition at length. Every defined object has an associated data type. Each data type, in turn, distills down to some built-in type implicitly defined by the language. The taxonomy of these built-in types differs somewhat between the two languages. I'll start with the C set[4], later mentioning the differences wrought by C++.
Builtin Types
C divides data types into two broad categories, scalar types and non-scalar types. Only scalars are builtin. The notion of scalar comes from mathematics, and designates a single quantity or value. Think of scalars as representing a single (possibly addressable) "thing." All builtin scalars are arithmetic types. Also derived from mathematics, arithmetic types generally hold single arithmetic values (although this is somewhat misleading, as mentioned below).C recognizes two classes of arithmetic types, floating-point types and integral types. Floating-point types are float, double, and long double. Integral types include all the familiar integer types (char, int, short, long, and their signed and unsigned variants), bitfields, and enumerations,
In a variation of C's scheme, C++
To me, bundling all these types as "integral" confuses type implementation with type model and use. All integral types are implemented as, and convertible to, some sort of integer; however some of the integral types, such as bool, do not seem conceptually related to integers in the mathematical sense. This muddling of builtin type implementation and use is less severe in C++, which forces stricter integral type-conversion rules and encourages wider use of derived types.
- adds the builtin integral type wchar_t (which exists as a typedef in C).
- adds the builtin integral type bool.
- considers enumerations as scalar but not integral.
Derived Types
That last term, derived types, bears a bit of examination. Some definitions, such as
char c; int i; float f;use builtin types directly, while others like
char *c; int i[5]; struct f { float x; };use types derived from builtin types. C, bizarrely enough, calls such types derived types. These types structures and unions in particular are the closest in C you can come to defining your own distinct new object type (C structures being the immediate ancestors of C++'s classes).Now, you may be thinking "what about typedef"? A common perception holds that typedef, as its name surely suggests, defines a new type. In reality, this keyword should probably be called "typealias" or some such, since it does not define a new type so much as a new alias or name for an existing type. As illustration, consider the C++ type definition
typedef char *string;If this were really a new type, then the initialization in
char *Romulus; string Remus = Romulus;would not compile (the types would be incompatible). Conversely, C++ would allow overloading on these types, so that the function declarations
void f(char *); void f(string);could both exist in the same scope.In truth, the first example with the initialization does compile, while the second example with the overloaded functions does not. It seems that, from the compiler's point-of-view, string is synonymous with char *, so that anywhere you use one, you can also use the other ... almost.
The problem comes from char * containing two different pieces of C and C++ grammar. char is part of the declaration specifier (the type), and * is part of the declarator (the name being declared), while string is just a declaration specifier[5]. To see this, change the earlier C++ definitions to
// // prepend 'const' qualifier to both // const char *Romulus; const string Remus = Romulus;Now the code no longer compiles. Why not? After all, doesn't the compiler just mentally substitute, giving the equivalent of
const char *Romulus; const char *Remus = Romulus; // compiler substituting // 'char *' for 'string'The short answer is "no." The compiler actually internally renders the original definitions as
const char * Romulus; char * const Remus = Romulus;which are not the same thing at all in the first, the pointed-to object holds constant, while in the second, the pointer itself holds constant.For another demonstration, collapse the original definitions to a single line, as in
string Romulus, Remus;The compiler behaves differently than it would if you wrote
char * Romulus, Remus;In the former, both Romulus and Remus really are of type char *, while in the latter, Romulus is char *, but Remus is just char.So, even though typedef does not introduce a new type, it does introduce a new type name a sometimes subtle distinction. Further, while both string and char * are semantically equivalent, they are syntactically distinct.
Expression Evaluation
Last month[6] I noted that
int n = 0; f(n++, n+1);is not strictly conforming the order of function argument evaluation is unspecified, meaning the resulting function parameters can take on different values on different implementations. This is a specific case of more general rules involving expression parsing and evaluation; these rules are, in my experience, among the most confusing aspects of C and C++ programming. I hold no illusion of doing this subject full justice here; instead, I touch on a few major bug-inducing points, leaving a full treatise to another day.In the code
int x = 0; int y; y = x++ + x++ + x++;what is the final value of y? Your immediate reaction may be 3, assuming the expression and side effects involving x are evaluated left-to-right before being assigned to y. This is the equivalent of
int expression = x; /* expression = 0 */ x++; /* x = 1 */ expression += x; /* expression = 1 */ x++; /* x = 2 */ expression += x; /* expression = 3 */ x++; /* x = 3 */ y = expression; /* y = 3 */But it turns out the compiler may evaluate this statement as if it were written
int expression = x; /* expression = 0 */ expression += x; /* expression = 0 */ expression += x; /* expression = 0 */ y = expression; /* y = 0 */ x++; /* x = 1 */ x++; /* x = 2 */ x++; /* x = 3 */with all the increments held off until the statement end. In both cases, the final value of x is the same, 3 but the value of x assigned to y is different. That difference comes from the order of expression evaluation relative to the order of side effects. This ordering derives from operator precedence, operator associativity, and the location of sequence points.
Operator Precedence
In the above example of
y = x++ + x++ + x++;the operators in question, from highest to lowest precedence, are
Precedence doesn't directly specify the order of expression evaluation it merely defines which parts of the expression bind to which operators. Put another way, precedence tells the compiler where to implicitly insert parentheses in expressions before evaluating them. In our example, precedence renders the code equivalent to
- ++ (post-increment)
- + (addition)
- = (assignment)
y = ( (x++) + (x++) + (x++) );Operator Associativity
Next, factor in associativity, which determines grouping order for operators having the same precedence. In this example, the operator + appears multiple times in the same grouping. According to the language semantic rules, addition is associative left-to-right. This yields another set of implicit parentheses, giving a final grouping of
y = ( ( (x++) + (x++) ) + (x++) );Because addition is mathematically commutative, an implementation (for optimization purposes) may actually choose to add these operands in any order, provided machine exception and overflow behavior doesn't differ between the abstract ordering defined by the language, and the actual ordering used by the implementation (i.e., everything nets out the way your program expects)[7].
Sequence Points
Even though the original statement is grouped as if were written as shown, the actual post-increments of x can occur anywhere along the way. Remember, the sub-expression x++ evaluates to, or "returns," the original value of x; the side effect of incrementing comes after this value is fetched. How far after is limited by the presence of the next sequence point. From the C standard (2.1.2.3 and 3.3):
At certain specified points in the execution sequence called sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluation shall have taken place.In more colloquial prose, here's what the standard's on about:Between the previous and next sequence point an object shall have its stored value modified at most once by the evaluation of an expression. Furthermore, the prior value shall be accessed only to determine the value to be stored.
Violating these last two points yields undefined program behavior.
- There are places where you are guaranteed all previous side effects have taken effect, and no subsequent side effects have started. These places are called sequence points.
- If you set an object's value, you can do so only once between sequence points.
- If you both get and set an object's value between sequence points, you must do your gets before your set.
Appendix B of the Standard enumerates C's sequence points; C++ generally adopts the C sequence points, although they may occur in new syntactic contexts. In our example, the only sequence point comes at the; following the statement. This implies that all side-effects (in this case, the three post-increments) must be evaluated by the time the statement ends; how far before the statement ends, and in what order, is left up to the implementation.
I think the big surprise here for many people is where sequence points don't occur. In particular, there is no sequence point at any of the assignment operators (=, +=, etc.), nor at the comma separating function call arguments (but confusingly enough, there is at the comma operator same token, different meanings).
C++ and Sequence Points
When you define your own operators in C++, trumping the builtin versions, you also trump those versions' sequence point nature. As an example, given the C++ definition
bool to_be;the statement
bool question = to_be || !to_be;contains a sequence point at the operator | |. Now suppose your compiler doesn't yet feature the bool builtin type, so you decide to roll your own bool class. Further suppose you define a function
bool operator||(bool const &, bool const &);so the code uses your | | instead of the language's builtin version. Now there is no longer a sequence point at | |, for you are no longer really using an operator you are instead using a function call. Rewriting the statement as
bool question = operator ||(to_be, !to_be);makes this more clear.This change in sequence point behavior can be a source of subtle bugs and misunderstanding, especially for programmers experienced in C but new to C++. I hold the opinion you should not be able to overload operators marking sequence points. C++ already prohibits overloading of certain operators (such as . and ? :), so this notion is not without precedent. Even though the language does not prevent you from overloading operators like | |, I strongly dissuade you from doing so. Your maintainers will thank you.
Conclusion
This survey of foundation vocabulary is far from exhaustive, but should serve as a decent common denominator. In future scrivings I'll define new vocabulary on the fly, appealing to earlier columns (both mine and others') as needed. The first of those scrivings comes next month, where I start evolving a user-defined type. The end result will be almost secondary; the real fun comes in the problems encountered with each trial implementation, the reasoning leading to the next implementation, and the interesting design tidbits gleaned along the way.
References and Miscellaneous Notes
[1] For the morbidly curious, phases of translation are covered in section 2.1.1.2 of the C standard, and section 2.1 of the C++ Working Paper.[2] I'm sure the homophonic nature of "syntax" and "sin tax" is pure coincidence, although the obtuse C and C++ grammars make me wonder.
[3] See Dan Saks' November 1995 CUJ article for an example of grammar and syntax diagrams.
[4] For a handy graphical summary of C's built-in type hierarchy, check out P.J. Plauger and Jim Brodie, Standard C (Microsoft Press, 1989).
[5] Small world ... Dan's article in this month's CUJ discusses declaration syntax.
[6] Thanks to alert reader D. Alan Payne for finding a subtle error in this example before we went to press last month.
[7] Take a gander at the C Standard, subclause 2.1.2.3, for a discussion of abstract vs. actual semantics.