July 1996/The Learning C/C++urve

Columns

The Learning C/C++urve

Bobby Schmidt

C->C++ Mutations, Part 2

An empty parameter list in a function declaration doesn't mean the same thing in C++ as it does in C. Bobby shows you this and some other things you'll want to know when migrating code from C to C++.
Copyright © 1996 Robert H. Schmidt

This month I continue my series on C constructs that mutate behavior when migrated to C++. Last time I focused on the different type rules C and C++ apply to data pointers. Now I broaden that theme to include the mutated type rules for functions and pointers to functions. I start off examining function prototypes, and how they've evolved over the years. Fittingly, I catalyze this discussion as I did last month's, by examining our old friend malloc.

Function Prototypes in Classic C

In the technical sense of the word "prototype," K&R or Classic C[1 ] did not support prototypes. You declared a function to introduce its name, return type, and linkage, but did not specify in that declaration what number and kinds of parameters the function took. Harking back to last month's discussion of malloc, recall that Classic C declared that function as
char *malloc();
Whereas last month I dwelled upon the char * return type, this time around I look at malloc's parameter list. Were you to see this declaration in modern C code, you might think "here's a function that accepts no arguments and returns a pointer to char." Ignoring the return type for now, you know that malloc can't possibly take zero arguments; otherwise it wouldn't know how much storage to allocate.

In Classic C code, even though malloc was so-declared, you always had to call it with a single argument. Well, okay, you didn't have to. The compiler would have cheerfully let you proclaim
char *c = malloc();
or even
char *c = malloc(6.02e23, "phlogiston", 12, "H");
with nary a peep. That the code wouldn't run correctly wasn't the compiler's problem. This exemplifies Classic C's lack of function prototypes. You as programmer had to guess the correct parameter sequence; if you guessed wrong, the compiler could not help you.

Functions Prototypes in Standard C

I believe the single biggest advance of Standard C over Classic C was the introduction of function prototypes. Although I can find no formal definition of "prototype" in the C Standard, I can provide a loose definition: a prototype is a function declaration that also specifies the order and types of the function's parameters. In the above declaration
char *malloc();
the function has no prototype, for the declaration says nothing about the function's parameters. Contemporary C implementations, as we explored last month, declare malloc as
void *malloc(size_t);
Here that declaration is also a function prototype, telling us that malloc expects exactly one parameter of type size_t. If we compile
void *malloc(size_t);
char *c = malloc(6.02e23, "phlogiston", 12, 'H');
as Standard C, the compiler complains that our argument list does not match the prototype.

Were malloc a function that truly had no parameters, it would be declared as
void *malloc(void);
This does not mean that malloc accepts a single argument of type void, for as I mentioned in June, there can be no objects of type void. Instead, this is a special form of prototype, implying the function has no formal parameters.

Given all this, what would happen if malloc really were declared in Standard C as
void *malloc();
with no prototype? As it turns out, this still compiles. At most you may get an editorial from your compiler, asking if you really want a non-prototyped declaration. My MetroWerks compiler, in best Sergeant Schultz tradition, says nothing.

Perhaps more amazingly still, some Standard C compilers let you call a non-prototyped malloc with varying numbers of arguments:
void *malloc();

void f(void)
    {
    void *p;
    malloc();
    p = malloc(1);
    malloc(1, 2);
    }
Such calls exhibit undefined behavior, meaning a compiler might not diagnose them. Why aren't these calls banned outright? I imagine the thinking went something like this:

Some Classic C functions, like printf, could be called multiple times with different argument lists each time.

Since Classic C didn't support prototypes, there was no way a function could explicitly say that it could take variable-argument lists[2] . Therefore Classic C could not always require that a function be called the same each time.

The Standard C committees had a passionate desire not to break Classic C code.

Thankfully, the C Standard does offer this solace in section 6.9.3:

The use of function declarators with empty parentheses (not prototype-format parameter type declarators) is an obsolescent feature.

Translation: unprototyped functions will go away in a future version of Standard C. You have been warned.

While leaving malloc unprototyped is bad enough, what happens if you leave it completely undeclared? To test this, omit
void *malloc();
and compile just
void f(void)
    {
    void *p;
    malloc();
    p = malloc(1);
    malloc(1, 2);
    }
For compilers allowing this at all, you should find that the first and third calls compile, while the second call fails. Why?

In the absence of an explicit declaration, the compiler synthesizes an implicit declaration for you. In our case, the compiler, upon seeing a call to some previously-undeclared function malloc, internally generates the let's-pretend declaration
extern int malloc();
To verify this implicit declaration, change the example to use the equivalent explicit declaration:
extern int malloc();

void f(void)
    {
    void *p;
    malloc();
    p = malloc(1);
    malloc(1, 2);
    }
The results should be the same as above.

malloc is prototyped here to return an int. In the second call we assign that int to a void *. Standard C doesn't allow this assignment without a cast, leading to the compiler flag.

In the earlier version, where malloc was declared as
void *malloc();
the assignment to p was valid, allowing the second call to compile.

Function Prototypes in C++

Here's where the fun begins. Compile the valid C example
void *malloc();

void f(void)
    {
    void *p;
    malloc();       // OK
    p = malloc(1);  // Error
    malloc(1, 2);   // Error
    }
as C++ and, per usual, the story changes. Where all three calls compiled as C, only the first call compiles as C++, with MetroWerks proclaiming for the other two
function call does not match prototype
To what prototype does this message refer? After all, in the discussion of C we found that declarations like
void *malloc();
lacked a prototype.

As the compiler diagnostic suggests, with the change in language comes a change in the semantics of function declarations. In C, empty () say "I know nothing about this function's parameters." In C++, they say "this function takes no parameters," rendering
void *malloc();
tantamount to
void *malloc(void);
By calling malloc with arguments, we created a mismatch between the call and the prototype, leading to the compiler's message.

The implication here is that a C++ function declaration always acts as a C++ function prototype — a change from C. Further, while C lets you call a function without declaring it, C++ requires the declaration. Since a C++ function declaration always act as prototype, this means that C++ always requires a function prototype.

C++'s rigor concerning function prototyping is consistent with the language's general ethic of type safety and intentional programming; rather than trying to guess your intent as C would, C++ insists you make that intent overt.

Function Overloading

If we are adamant that the previous example compile as C++, there is another way: function overloading. In a departure from C, C++ lets you declare, in the same scope, multiple functions with the same name. If we change the example to
// overload #1
void *malloc();
// overload #2
void *malloc(int);
// overload #3
void *malloc(int, int);

void f(void)
    {
    void *p;
    malloc();      // OK; calls #1
    p = malloc(1); // OK; calls #2
    malloc(1, 2);  // OK; calls #3
    }
and compile, it works; each call selects the overload matching that call's arguments.

While this example is a tad hokey, it does introduce a technique for ferreting out C code bugs that no C translator can catch[3] . Suppose we have two translation units, with the first defined as
void f(unsigned x)
    {
    }
and the second as
void f(unsigned long);

int main(void)
    {
    f(0);
    return 0;
    }
This code successfully compiles and links as C. Whether or not it successfully runs depends on the implementation. The problem is that translation unit #1 thinks f takes an unsigned, while translation unit #2 thinks f takes an unsigned long. If unsigned and unsigned long are represented the same, the program may run fine. But if those types are different sizes, all bets are off: the program may well crash when f's implementation, expecting an unsigned, actually receives an unsigned long.

Why doesn't the compiler catch this? For the answer, I harken back to last November, when I introduced the notion of translation unit. Simply put, a translation unit is the single "thing" that a compiler translates in one chunk. The declarations of each translation unit are distinct from those in any other translation unit. C does not check across translation unit boundaries to ensure declarations are consistent.

Within translation unit #1, all uses of f are consistent, so the compiler is happy. Similarly, within translation unit #2, all uses of f are consistent. That those uses are inconsistent among different translation units is immaterial to the compiler.

Even though the compiler fails us, surely the linker should catch this. Sadly, no, for all the linker knows is

Translation unit #1 defines an externally linked object called f.

Translation unit #2 references an externally linked object called f.

An object with that referenced name exists in some translation unit being linked.

As all external references are satisfied, the linker is sated. The trouble here is that C does not guarantee typesafe linkage; that is, the linker matches names, but does not ensure those names represent objects of the same type.

Function Signatures

The punch line comes when we treat these translation units as C++. Both units successfully compile, but they fail to link. Unlike C, C++ does provide typesafe linkage. The C++ compiler tells the linker not only the name of all externally linked objects, but also the types of those objects. In this scenario, the linker thinks

Translation unit #1 defines an externally linked function f taking a single argument of type unsigned.

Translation unit #2 references an externally linked function f taking a single argument of type unsigned long.

I can't find an externally linked object with the name f taking a single argument of type unsigned long in any translation unit, so my link fails.

Given that the same linker executable can often link both C and C++ code, you may wonder how the linker can be so naive about C yet so savvy with C++. We can divine the answer from the linker's map file, which lists the canon of externally linked object names. For the C example, MetroWerks generates the (simplified) linker map entry
.f    file = "C1.c"
for the externally linked object f defined in the source file C1.c. This entry reinforces that the linker sees only the object names, not the object types.

You may very well have no linker map for the C++ example, since the program failed to link. Modifying the second translation unit into
void f(unsigned long) // define a 2nd 'f'
    {
    }

int main()
    {
    f(0);
    return 0;
    }
allows the program to compile and link as C++. The final executable has two externally linked functions called f, one from each translation unit. The relevant MetroWerks linker map entries are
.f__FUi    file = "C++1.c++"
.f__FUl    file = "C++2.c++"
Note that, where the C linker map had one entry, this map has two. Further, the C map showed the function name as simply f, while here the name is embellished. Such embellishment is called name decoration or, more colloquially, name mangling. This decoration encodes the function's so-called signature (name plus parameter list) into the final object name seen by the linker. From the linker's perspective, this code doesn't have an externally linked object called f; instead, the linker sees two externally linked objects called f__FUi and f__fUl.

The C++ Standard does not define rules for this name decoration, so compiler vendors are free to invent whatever scheme they want. If you are not compiling these examples with MetroWerks, I expect you will see decorated names differing from those I show here. While I don't know MetroWerks' scheme, I find decoding these names pretty easy: the leading f__ is the function name, while Ui and Ul stand for unsigned int and unsigned long, respectively.

Function overloading is an amazingly complex topic, one I'm sure we'll visit again. And while overloading does not explicitly exist to catch such errors in former C code, that it does catch them offers yet another reason to build C code as C++.

Pointers to Functions

Just as the rules for function declaration and prototyping have mutated from C to C++, so too have the rules for pointers to those functions. For instance, the code
int f();

void g()
    {
    int (*p)(long);
    p = f; /* OK in C, error in C++ */
    }
compiles as C, but not as C++. Let's examine this more closely.

In the assignment p = f, the type of p is "pointer to a function returning an int and taking a long". This implies that f must be either the same type as p, or of a type convertible to p's type. Okay, then what is the type of f? Your temptation may be to say "f is a function," but then, what would it mean to assign a function to a pointer? Fortunately, we are saved from such surrealism by section 6.2.2.1 of the C Standard:

Except when it is the operand of the sizeof operator or the unary & operator, a function designator with type "function returning type" is converted to an expression that has type "pointer to function returning type".

Thus, when used in the expression p = f, the name of the function f instantly converts to a pointer to that same function f, as if we had written the assignment
p = &f;
And what is the type of &f? Pointer to a function returning an int and taking ... well, the answer depends on the language used, and follows from what we've seen earlier. In C, the declaration int f() means f takes an unspecified set of arguments, while in C++ the same declaration means f takes exactly zero arguments.

As a result, the assignment from &f to p works in C. In fact, f may very well take a single long, speculation the language can neither confirm nor deny based on f's declaration. In C++ the assignment fails the stronger type checking, for f as declared cannot possibly accept a long. To make the code work in C++, we have to rewrite it as something like
int f(); // changed, no longer accepts 'long'

void g()
    {
    int (*p)();
    p = f; // OK in C and C++
    }
By the way, consider what happens if we make f and g members of some class c, as in
class c
    {
    int f();
    void g()
        {
        int (*p)();
        p = f;
        }
    };
Although the only change we've made is to move f and g from file scope to class scope, the code no longer compiles. Before you test this on your system, I encourage you to reason out what's changed and how to fix it[4] .

Function Pointer Variations

The equivalence relationship between the name of a function and a pointer to that function brings up some interesting variations. As illustration, let's consider the function qsort, declared in the Standard C library header stdlib.h as
void qsort(void *base, size_t nelem, size_t size,
           int (*cmp)(const void *e1, const void *e2));
The item of interest here is the last parameter cmp. The presence of the * before cmp says that cmp must be a pointer to something. The type of that something comes from the rest of the syntax: the trailing () implies pointer to function, while the leading int implies the pointed-to function returns an int. The parentheses around (*cmp) ensure the * binds to cmp, and not to int; I leave it as an exercise for the reader to ponder the meaning if those binding () were left off.

Our earlier equivalence relationship said the name of a function is tantamount to a pointer to that function. As cmp is a pointer, we should be able to replace that pointer with a function name. And, once we replace the pointer, we syntactically no longer need the *. It all nets out to the equivalent declaration
int cmp(const void *e1, const void *e2)
This declaration, substituted into qsort's prototype as
void qsort(void *base, size_t nelem, size_t size,
           int cmp(const void *e1, const void *e2));
compiles and works the same as the earlier (*cmp) version. Not only is this syntax cleaner, but it is also more consistent, matching the syntax for "normal" function declarations outside parameter lists.

We can use this same technique when calling functions. Within the body of qsort itself, we could call cmp as either
(*cmp)(e1, e2);
or
cmp(e1, e2);
As before, this syntax is cleaner, and is consistent with normal function calls. The (*cmp) call further demonstrates an interesting quirk of C and C++ semantic rules. Here's the gist:

cmp is a function.

From the equivalence relationship, cmp instantly converts to a pointer to that function.

*cmp dereferences that pointer, yielding the original function

because *cmp is a function, it instantly converts to a pointer to that function.

**cmp dereferences that pointer, yielding the original function

because **cmp is a function, it instantly converts to a pointer to that function.

and on and on, turtles all the way down, so that we could call cmp as
(*******cmp)(e1, e2);
Naturally this works with function names that aren't parameters, permitting such wonders as
#include <stdlib.h>

int main(void)
    {
    (*******malloc)(10);
    return 0;
    }
On The Road Again

Dan Saks and I held court at the Embedded Systems Conference in Boston the first week of April. As this marked my first trip to New England, I walked away with strong impressions: the forests, the architecture and age of buildings, the windy roads, the drivers on those windy roads.

I also got to talk with a number of conference attendees. While I often champion C++ over C, I can't escape the truth that some C++ constructs contain hidden space/time costs. Given this was a conference about embedded systems, those costs are particularly keen to many people I met; in fact, P.J. Plauger discussed this very issue in his keynote. Some of these costs, like object initialization or virtual functions, exist in C, just more overtly; others, like exceptions, are new and much more hidden. As I continue this column series exploring the migration of C code to C++, I will address these costs along the way.

Erratica

Diligent reader the first P.J. Plauger opines that, in my diatribe last month against Standard C's loose typing of void *, I may have been a tad harsh on the Standards committees. He emphasizes the committees knew exactly what they were doing, and that malloc was a driving force: they wanted malloc to be a generic memory allocator that worked without casting, meaning it had to return a pointer type that could silently convert to other pointer types.

Given that dynamically allocated C++ objects require construction, malloc in C++ is supplanted by new (as we discussed last month). Since its raison d'tre as a completely typeless pointer is gone, void * in C++ is free to employ more constrained type rules.

Diligent reader the second Alessandro Vesely suggests that, contrary to my assertion in March, main need not return an int. Reading his e-mail, I realized I hadn't made clear what I meant by "return." As Ale correctly points out, a program may legitimately terminate with a call to exit, so that control flow never reaches the return statement in main. From that point of view, main doesn't return anything, int or otherwise.

What I had meant was that main must be prototyped to return int, regardless of whether or not at run time it actually does. My understanding, based on section 5.1.2.2.1 of the C Standard, is that a conforming main must be prototyped as either
int main(void)
or
int main(int, char *[])
In wonderful serendipity, literally today as I'm writing, a fair amount of traffic has come over the C Standard Committees' e-mail reflector debating this very point. The consensus so far: Standard C requires an int return type on main, with some members suggesting stronger language in the Standard to reflect this. Stay tuned to this station for news of further developments as they occur.

Notes

[1] As always, "K&R C" refers to the C language spelled out by Brian W. Kernighan and Dennis M. Ritchie's The C Programming Language, Prentice-Hall, 1978.
[2] Standard C adds the ellipsis token ... to designate variable-argument lists. Take a gander at the printf declaration in stdio.h for an example.
[3] The idea for this technique comes from trumpeter extraordinaire and Microsoft weenie Sam Mann, who stumbled into it while porting someone else's C code to C++.
[4] Hint: two semantic properties of f are now incompatible with p.

Bobby Schmidt is a freelance writer, teacher, consultant, and programmer. He is also a member of the ANSI/ISO C standards committee, an alumnus of Microsoft, and an original "associate" of (Dan) Saks & Associates. In other career incarnations, Bobby has been a pool hall operator, radio DJ, private investigator, and astronomer. You may summon him at 3543 167th Ct NE #BB-301, Redmond WA 98052; by phone at (206) 881-6990, or via Internet e-mail as rschmidt@netcom.com.