Columns


The Learning C/C++urve

Living in Two Worlds

Bobby Schmidt


Bobby Schmidt is a freelance writer, teacher, consultant, and programmer. He is also an alumnus of Microsoft, and an original "associate" of (Dan) Saks & Associates. In other career incarnations, Bobby has been a pool hall operator, radio DJ, private investigator, and astronomer. You may summon him 3543 167th Ct NE #BB-301, Redmond WA 98052; by phone at (206) 881-6990; and via Internet e-mail as rschmid@netcom.com.

Welcome to my world — Jim Reeves

Introduction

G'day. Bobby Schmidt here, welcoming you to the inauguration of my virtual world, "The Learning C/C++urve," the newest monthly column in CUJ's lineup. My theme (well, more of a motif really), is living with both C and C++. I'm aiming this column generally at intermediate-to-advanced C programmers, and beginning-to-intermediate C++ programmers.

My archetypal reader programs in both languages. Because of legacy code, safety issues, and so on, such readers must still code in C; however, to keep current, they must also program in C++. This column helps balance these opposing time demands, taking advantage of language similarities whenever possible. Another column motif is "helping C programmers on their terms."

I have no intention of abandoning C, and will favor those aspects of C++ most like C. I find explaining C++ language and design concepts in C terms genuinely helps C programmers understand this new and at times daunting world.

I also find C++ programmers become seduced early on by the Siren songs of polymorphism, overloading, and the like, without truly mastering the language's underlying C nature. So emphasizing C++'s dominant C gene, I believe, benefits programmers of either language.

Informally, I have a four-fold charter, to wit:

Two large experiences have shaped my views of these languages. Since 1989, I have worked with Microsoft in both the U.S. and Australia, first as an employee and later as a contractor. But for much longer than that, since 1983 I believe, I've been acolyte and friend of Dan Saks. Dan was my most influential undergraduate programming instructor. If you read my code samples in this magazine, then read his, you'll no doubt experience déjà vû most significant aspects of my programming style derive from his. It is fitting I have inherited the beginning C++ audience Dan addressed in the early days of his column, "Stepping Up To C++." Whatever success I may have teaching you owes much to his success teaching me.

To start that teaching, I'll spend the first two columns reviewing some basic vocabulary, addressing practical issues you will encounter as a programmer. The C and C++ language standards define vocabulary for these issues; my colleagues and I will be using this vocabulary. This includes terms you've probably seen but may not be familiar with. I also define some personal nomenclature that should make future discussions simpler.

Turtles

If I asked you to define "word" for a particular computing architecture, you might come back with "the machine's naturally-sized addressable thing, composed of X bytes." Then I'd ask "what is a byte," and you could say "the smallest addressable thing, Y bits long". If I followed with "what is a bit," you'd say "the smallest piece of memory, capable of holding a 0 or 1." To which I'd ask "what is memory," and so on.

Anyone living with children recognizes this pattern of questioning: every definition you give introduces yet another thing requiring definition. There is no absolute end; after a time, one party gives up, deciding the definitions are enough. The question of how much terminology is enough — that is, the point at which all parties can say "we agree" — is a serious one for program language design and specification. The C++ committees have spent years seeking a balance between defining the language completely enough for compiler writers to implement it, and succinctly enough for programmers to compose in it. To find that balance, the committee members must somehow measure just how complete or precise a particular definition is.

A couple years back, Dan Saks introduced to me just such a metric: the turtle. First cousin to pinhead-dancing angels, the somewhat satirical notion of turtles actually derives from a Hindu creation myth, relating how the celestial tortoise (an incarnation of Vishnu) occupies the center of the world. As gods and demons churned the oceans, this tortoise acted as base of the mountain Mandara, and apparently by extension, the entire Earth[1]. You may reasonably ask what the tortoise stands on. According to a more contemporary myth[2], the tortoise stands on another tortoise, which in turn stands on another, and so on — in other words, "it's turtles all the way down." While this notion is suspect both logically and herpetologically, it serves a useful metaphor for situations — like the earlier what is a word dialog — involving infinite regression or precision. In such cases, engineering constraints often dictate a finite number of turtles be defined, in the hopes all parties will "get it" and not be bogged down in theoretically enthralling but practically insignificant detail. As we shall see later, resisting such turtle-itis can be difficult.

Standard C

By standard C, or just plain C, I mean the programming language defined in the 1989 ANSI C Standard[3] and its associated Rationale[4]. From here on, I will refer to the first two documents as the C Standard and the C Rationale, and the body that created them as the C committee. Where the programming language is implied, I'll simply use the Standard, the Rationale, and the Committee. The Standard describes not only the C language, but also the representation of C source and the behavior of C translators. In particular, under the aegis of Standard C exist various levels of conformance and portability behavior; their net effect is to allow the same source code, ostensibly written in Standard C, to compile and run differently under different environments. Because these varieties of Standard C can lead to unexpected program behavior and mysterious compiler error messages, I'd like to discuss them more fully.

C Portability Behavior

What happens if you subscript outside the bounds of an array? That is, given

int a[10];
what is the result of

a[10] = 1;
The C Standard says, colloquially, this is a tragic, unfortunate circumstance, quite regrettable actually, but not so heinous that we want to prescribe run-time mechanisms to catch it. Simply put, some C behavior (like bounds checking) cannot be completely verified at compile time, and is too expensive to verify at run time. Compared to languages like Ada or Pascal, C is relatively libertarian, relying on you to be self-policing. As a result, the ANSI committee purposely left many areas of the language fuzzily defined, behavior they felt they could not or should not specify for all implementations. This latitude allows implementors room for marketing differentiation, system-specific language extensions, and internationalization. In section 1.6, the standard describes four types of such implementation behavior. From least to most restrictive, they are:

Additionally, Appendix F of the Standard enumerates which specific program constructs fall into which behavior bin.

I summarize each behavior below, giving examples in the text and in the code listings.

Undefined Behavior

After execution of

#include <limits.h>
void f(void)
    {
    int x = INT_MAX;  /* largest signed int */
    x++;
    }
what is the value of x? Possible answers include:

In short, anything may happen — your program has broken the law, and the implementation may react however it wants. Such overflow conditions are an example of undefined behavior, described in the Standard as:

Undefined behavior — behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which the standard imposes no requirements. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating translation or execution (with the issuance of a diagnostic message).

Uh, right. Well, that's a little turgid, so let's try the Rationale:

Undefined behavior gives the implementor license not to catch certain program errors that are difficult to diagnose.

The rule here is caveat emptor, or perhaps caveat programmer. Undefined behavior is the lawless Wild West of C programming; the implementation exercises free will, doing or not doing whatever it chooses, suffering no obligation to tell you what those choices are. At translation time, the compiler may or may not flag such behavior. At run time, the execution environment may ignore the situation, hang the system, or anything in-between; further, the behavior may differ between each run. Undefined behavior combines the worst of all worlds: your program is definitely non-portable, and very likely in error, but the system doesn't have to tell you. Listing 1 shows a few C constructs yielding undefined translation or execution behavior.

Unspecified Behavior

In the function prototyped

void f(int a, int b);
and called by

int n = 0;
f(n++, n+1);
what are the resulting values of a and b in function f?

Your temptation may be to say a gets 0 (the value of n before ++ is evaluated) and b gets 2 (the value of n after both ++ and +1). Your answer assumes the compiler evaluates function arguments left-to-right. But in fact, the compiler may also choose to evaluate the arguments right-to-left, resulting in a still getting 0, but b getting 1. I suppose an implementation could use a random-number generator to determine the order, so that the same program could behave differently between successive runs. While this may seem far-fetched, the rise of parallel-processing architectures presents the possibility of such behavior. The order of function-argument evaluation is an example of unspecified behavior. Unspecified and undefined behaviors are much alike, with one key difference: While undefined behavior may result from erroneous C constructs, unspecified behavior arises from a correct program using correct data. Your program is right — you are just using some of those fuzzy (but correct) areas of the language.

Unspecified behavior gives the implementor less latitude than does undefined behavior. In particular, the implementation must translate the program — that is, it must catch the questionable behavior. Your program will compile and run, although you may not know exactly how it will behave. Although unspecified behavior is safer than undefined behavior — after all, your program is valid C — you still don't know which correct behavior will happen.

Implementation-defined Behavior

Assuming 16-bit integers and given

int b = -1; /* or 0xFFFF */
what is the value of b after

b >> 1;
The answer is probably -1, 0, INT_MAX/2 or INT_MAX, depending on how the high-bit is interpreted and propagated. This bitwise operation falls under the category of implementation-defined behavior. Such behavior is essentially unspecified behavior that is really specified. The implementor must spell out how the behavior actually manifests; in the example above, the compiler vendor would have to document how >> behaves. Such programs are still non-portable, but at least for any given implementation, you know what will happen.

I find the term implementation-defined a bit misleading, since all these behaviors are defined by the implementation. Better names might be implementation documented or implementation-specified. In actual practice, compiler vendors tend to document some undefined and unspecified behavior too, as a diagnostic aid to their compiler's users. More an act of enlightened self-interest than one of altruism, this documentation may increase market share, and certainly helps those vendors use their own tools for internal development (a process Microsoft calls "eating its own dog food"). Listing 2 shows examples of C implementation-defined behavior.

Locale-specific Behavior

We all know

islower('A');
returns zero (false) while

islower('a');
returns non-zero (true). But what about the non-English character ''? In this case, does

islower('');
return zero or non-zero? Okay, that one's a little surreal, so let's try the more realistic

islower('\xAB);
where 'xAB' is the encoding of some non-ASCII non-English 8-bit character. In this case, does islower return zero or non-zero? The answer depends on so-called locale-specific behavior. From the standard again:

Locale-specific behavior — behavior that depends on local conventions of nationality, culture, and language that each implementation shall document.

In the examples above, the result of islower for anything other than the minimal C character set is locale-specific. Such behavior is much like implementation-defined behavior, with a key difference: a strictly conforming program (described below) can use locale-specific features, thereby ruining such programs' chances of being unambiguously portable.

C Conformance

To encapsulate the above notions of portability behavior, and to describe how closely a program or translator implementation observes these behaviors, the Standard defines levels of conformance. Unfortunately, there is a bit of chicken-and-egg in the Standard's definitions of the terms, in that:

Here's a seeming case of turtles all the way down, for each term's definition references the other term. But wait, it gets worse, for you can have a conforming program compile and run one way on one conforming implementation, and compile and run another way on another conforming implementation. Everybody's conforming, yet the behavior is different, making you wonder just what this program is conforming to.

Happily, the standard rescues us with the notion of a strictly conforming program. More restrictive than a conforming program, a strictly conforming program uses only those parts of the language that the Standard fully defines and specifies. Such a program cannot use undefined, unspecified, or implementation defined behavior; nor can the program exceed minimum guaranteed implementation limits. The goal of strictly conforming programs is portability. Such programs generally behave the same on all conforming implementations, within the bounds of those behaviors the Standard specifies.

However, even this guarantee of portability is not iron-clad, for such programs can have different behaviors depending on locale, and conforming implementations can define their own locale-specific behavior. Moral: There is no such thing as a completely portable C program. As a rule, when I discuss Standard C, I am assuming language features contained in a strictly conforming program translated by a conforming implementation. That way, I know we are all singing from the same hymnal.

Standard C++

There is no C++ standard. The closest thing we have today is the evolving ANSI/ISO C++ draft standard[6] and its progenitor, the ARM[7], where the ARM is to C++ as K&R[8] was to C. If the history of ANSI C holds for C++, the draft's various incarnations, combined with the ARM, should be close enough to the final language for our needs. From now on, I will refer to the hybrid ARM/draft as the C++ Standard, and the language they describe as Standard C++, or simply C++. Over time, as the draft becomes more settled, I'll factor the ARM from the equation.

C++ Portability and Conformance

The C Standard defines C translator implementations in an absolute sense — a conforming implementation will differentiate all non-conforming programs from all conforming programs, compiling and running the latter. That's the idea, anyway. Trouble is, the infamous Halting Problem says such completeness is impossible: no implementation can successfully partition all correct and incorrect programs. Thus, the C Standard sets out a goal for implementations that is theoretically impossible to achieve. The C++ committees have editorially abandoned the notion of conformance, reckoning it unenforceable and unverifiable. No matter how many turtles down the Standard specified (i.e., how completely and precisely the language were defined), there would still be programs that slipped through the cracks (in this regard, the standard is much like traditional law).

Instead, C++ introduces the concept of a program being well formed, that is, a program conforming to the syntax and semantic rules of C++. But to me, this all implies 1) these rules are deterministic and complete, and 2) an implementation can tell when a program is well formed, all of which makes me wonder what those rules have achieved. It looks as if the committees still believe in strict conformance in their hearts, if not in their minds. I guess that looking down and seeing one or two turtles beneath them provides enough comfort. The abandonment of conformance affects C++'s notion of portability. C++ still uses C's terms (undefined behavior, unspecified behavior, and so on); however, rather than finding a theoretical foundation for such terms, the C++ Standard cites specific examples of each type of portability. In the end, while conformance and portability do not have the same official hold in C++ as they do in C, they are nonetheless useful concepts for discussion, and I will apply them freely to both languages. What may be impossible for language implementors to achieve in principle is still useful for language users in practice.

Object Declarations

Statements like

extern int errno;
char const string[] = "Doug Payne";
static struct s *p;
are all examples of C and C++ object (variable) declarations. A general treatise on data declarations could span several columns. I focus here on select aspects of such declarations, leaving the more general topic for future consideration.

Linkage

Not all object declarations refer to unique objects; you may declare a name that refers to the same object named in another declaration. Whether or not multiple declarations can refer to the same object is governed by the object's linkage. C recognizes three types of linkage:

Linkage in C++ is more complex. C++ observes linkage for such non-C entities as classes and templates. Also, while external linkage and no linkage survive reasonably intact, internal linkage may disappear as a side-effect of C++ namespaces. This change is part of a larger C++ philosophical shift away from source files as program building blocks. Unfortunately, discussion of this shift is beyond the scope of this article, and will have to wait for a later day.

Linkage rules are complex; an object's linkage is determined from an amalgam of the object's scope, and the presence (or lack) of linkage keywords. Linkage also typically dictates what symbols are visible to a linker and show up in the linker's map file. Listing 3 shows examples of these linkage permutations for both C and C++. All this can lead to subtle bugs. For example, if in C you define

char z;
at file scope in file1.c (giving z external linkage), declare

extern z; /* oops - no explicit type */
in file2.c, and reference z in file2.c, the following takes place:

These conflicting declarations are an example of C undefined behavior, and here you can see why. file2.c manipulates z as if it were an integer, not a character. Depending on your implementation, behavior may range from benign (if characters and integers have the same size and characteristics) to disastrous (if writing into z as an integer writes past z's actual storage into part of some other object). By the way, this bug could not occur in C++: the compiler would not automatically declare z as an int in file2.c because C++ requires all object declarations to identify the type of the object being declared. This demonstrates an advantage of compiling straight C code with a C++ compiler.

Duration

A declared object's duration determines where and for how long that object lives. C recognizes two classes of such duration:

Characteristically, C++ complicates C's recipe. C++ has more scopes, and combines those scopes in more complicated ways. Also, C++ usurps the keyword static for classes, to differentiate per-class from per-object members. One of the nastier bugs arising from duration comes from holding a pointer to an object when the pointer lives longer than the object being pointed to (another example of undefined behavior). This type of bug is shown by the following fragment:

char *f(void)
    {
    char c= 'a'; /* c has dynamic duration, and
                 probably lives on the stack */
    return &c; /* bad idea - returning pointer to
                storage that no longer exists after
                function returns */
    }

void main(void)
    {
    char *p;
    p = f();  /* p now contains a "random" address*/
    *p = 'b'; /* writing into invalid location
               all bets are off */
    }
Were c declared as

static char c ='a';
giving it static duration, the program would work correctly.

Listing 4 is the same as Listing 3, only annotated for duration. Note that, in practice, you will probably never use the keyword auto, and should seldom use the keyword register (optimizing compilers are typically far better than us at assigning registers).

Definition vs. Declaration

According to the C standard

A declaration specifies the interpretation and attributes of a set of identifiers. A declaration that also causes storage to be reserved for an object or function named by an identifier is a definition.

The C++ standard, paraphrased, says

A declaration introduces one or more names into a program and gives each name a meaning. A declaration is a definition unless it declares one of a cetain set of things (followed by a list of said things).

Note the difference: C considers definitions as exceptional declarations, while C++ considers declarations as exceptional definitions. I summarize these by saying a declaration "declares" or proclaims an object's existence and nature, while a definition also "defines," or creates and initializes, that object. Given all this, is the statement

int d;
a definition of d or a declaration of d? The short answer is "yes"; the real answer depends on context, including

As it encounters an object's declarations, C is on the lookout for one that can also serve as that object's definition. C considers such declarations tentative definitions, thinking to itself "if I don't see any future declaration that is unambiguously a definition, I'll treat this declaration as the real definition." C++ formalizes this behavior with its one definition rule. According to the ARM, this rule says there must be exactly one definition of every declared thing in a program. The draft has been more muddled in defining this rule, although in their last meeting, the committees tightened the rule to more closely hew to the ARM's intent. I read that intent to ideally mean that, in a well-formed program, an entity's single definition is unambiguous; its status as a definition is unaffected by its appearance relative to other declarations of the same entity, even across translation units.

Compare this to C, which may have multiple declarations that can be considered definitions, with the "winner" sometimes determined by declaration ordering or translator whim. The one definition rule is an attempt to accomodate C's type compatibility rules within C++'s, a thorny subject well beyond this column's scope. The rule is also part of the previously mentioned shift away from a conceptual basis in translation units. Listing 5 shows examples of C and C++ declarations and definitions.

Conclusion

I will conclude this vocabulary discussion next month, covering, among others, syntax vs. semantics, lvalues vs. rvalues, and built-in data types. Other subjects covered in coming months include

References

[1] Thanks to Margaret Bullock and Michele Hobart for their research.

[2] Stephen Hawking. A Brief History of Time (Bantam Books, 1988).

[3] American National Standard X3.159-1989, produced by the ANSI committee X3J11.

[4] Officially edited by Randy Hudson, this Rationale was, according to Dan Saks, actually written largely by our illustrious senior editor PJP.

[5] International Organization for Standards, which codified its variant of C after ANSI. Note that ANSI and ISO are co-creating the C++ standard.

[6] Working Paper, written by ANSI X3J16 and ISO WG21.

[7] Margaret A. Ellis and Bjarne Stroustrup. The Annotated Reference Manual. (Addison-Wesley, 1991.)

[8] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language (Prentice-Hall, 1978).