June 1996/The Learning C/C++urve

Columns

The Learning C/C++urve

Bobby Schmidt

C->C++ Mutations, Part 1

So you're going to port that C code to C++, huh? That C++ compiler may not welcome your code with open arms. Bobby shows how to get on the compiler's good side, and explains why it seems to be so picky.
© 1996 Robert H. Schmidt

This month I start another fabulous column series, written especially for you C programmers who either are about to migrate your C code to C++, or already find yourselves maintaining the same code in both environments. That C++ is so-named is unfortunate, for it can lull you into thinking such code migration or maintenance is laughably simple. After all, isn't C++ just C with, well, a couple of pluses[1] ? Wasn't C++ designed to let programmers port C code straight with no changes? Or, put another way, isn't C++ a strict syntactic and semantic superset of C?

Faithful readers know I typically setup such apparently innocent and obvious questions as strawmen waiting to be flailed upon. And so it is here, for the name notwithstanding, C++ is not just C with some extra stuff. If you port a serious project from C to C++, I can just about guarantee you will encounter code that doesn't compile, or compiles but runs differently. Given that the same code presumably compiled and ran peachy as C, this behavior change can be most disturbing.

I call such changes in C on its way to C++ "mutations." If you think of C as C++'s progenitor, it's as if C++ was supposed to inherit from C intact, but instead ended up with some altered genes along the way. Like mutations in biology, some of these changes are beneficial, some mostly benign, and a few potentially hazardous. I find that, in the short term, most of these mutations are a real bother, requiring source changes that don't benefit the (former) C code immediately; after all, C code, especially code not designed to be ported in the first place, wasn't written to take advantage of C++ strengths. It may make assumptions that were valid in C but fail in C++. Once the port is done, however, and the code is massaged into true "native" C++ code — or, to continue the analogy, adapts to its new environment — the mutations become increasingly beneficial.

In this series, I will explore (to steal an insurance industry phrase) "reasonable and customary" C behavior that changes meaning in C++. These mutations I put into four categories:

Code compiles as C but not as C++.

Code compiles as C++, but runs differently.

Code runs correctly, but works counter to C++'s design strengths, and may not integrate well with later native C++ code.

Code was wrong all along, but C didn't catch it.

I start the series with a topic involving all these categories: dynamic memory allocation.

char * in C

I'll go out on a limb and assume that, somewhere during your long and illustrious C career, you've had cause to call the ubiquitous malloc or one of its brethren. While you may think of these routines as intrinsic to C, the canon of such runtime routines was up actually to the whim of the compiler implementor before ANSI ratified its treaty in 1989. It's true that runtime libraries were heavily influenced by prevailing UNIX practice, and over time most vendors adopted a common subset. But you ultimately had no guarantee any routine you used would actually be available on another platform.

In those heady days I call C's Cretaceous Period, we had a variety of ways to allocate memory from the heap, all of which boiled down to some variant of
char *i;
i = malloc(10);
where malloc was declared as
char *malloc();
without a full prototype. (Remember, pre-ANSI code typically didn't have function prototypes.) Note the return type of char *, often used by Cretaceous code to represent a "typeless" generic pointer. In our earlier example this worked, but in code like
int *i;
i = malloc(10);
we had a problem — we were assigning a char * to an int *. Although malloc gave us a chunk of memory that was properly aligned for interpretation as a block of int, the sad fact is we were still mixing char * woolens and int * linens. Of course, such code won't compile in today's ANSI world, but back then, it often would[2] . The proper solution, to the extent you could call casting a proper solution, was
int *i;
i = (int *) malloc(10);
Clearly, if we wanted to dynamically allocate objects of many different types, we'd end up with lots of casts. You don't find such a casting cornucopia in contemporary C code, thanks to a little ANSI C surrealism called void *.

void * in C

I call void * surreal because, where a char * points to a char, and an int * points to an int, a void * points to ... a void? Well, no. Unlike other pointer types, void * cannot literally point to what its name implies, since you cannot declare an object of type void[3] . In fact, void * is designed to point to anything and everything, whereas its name suggests it points to nothing! Ah, those wacky C committee members.

By adding void *, ANSI C circumvents the casting business described earlier. Along with a proper function prototype, the new malloc declaration becomes[4]
void *malloc(size_t);
and permits
int *i;
i = malloc(10);
without a cast. To quote section 6.2.2.2 of the C Standard:

A pointer to void may be converted to or from a pointer to any incomplete or object type. A pointer to any incomplete type or object type may be converted to a pointer to void and back again; the result shall compare equal to the original pointer.

This means that, in general, you can turn a void * into any other pointer, and any other pointer into a void *, and it all just works. I say "in general" because type qualifiers (const and volatile) complicate matters. You can't, for example, write
int const *i = 0;
void *v = i; /* error */
because that would let you change what i points to via v, and what i points to is supposed to remain constant. You can, however, go in the other direction, as in
void *v = 0;
int const *i = v; /* OK */
since that results in a "safer" conversion (the pointed-to object won't change via i); or you can change v as in
int const *i = 0;
void const *v = i; /* OK */
so both pointers have the same type qualifiers[5] .

With void *, all the Cretaceous casts go away, aesthetic order is restored, and all are happy ... until they port to C++.

void * in C++

Compile the previous valid C example
int *i;
i = malloc(10); // error!?
as C++, and you get a rude shock — the compiler objects to the malloc call, with MetroWerks claiming
illegal implicit conversion from 'void *' to 'int *'
If your project is large and makes prolific use of the heap, you will end up with a glut of such messages. What to do?

As an immediate fix, you need to add casts, so the above becomes
int *i;
i = (int *) malloc(10); // OK
Those of you with compilers sporting the newer RTTI casts can also write
int *i;
i = reinterpret_cast <int *>
       (malloc(10)); // OK
This all looks suspiciously like the pre-ANSI days, where malloc returned char *. Wasn't void * supposed to fix this? It's as if we've come full circle, to the point I'm advocating casts again. I generally abhor casting, so you may well wonder why I use it here. I have two reasons, really:

This is truly the only simple short-term solution to the problem.

It's what the compiler was doing in C anyway, except there the conversion was implicit, while here it is explicit.

Pointer Compatibility

This second reason in particular gives me comfort recommending a casting solution. In C, when you wrote
int *i;
i = malloc(10);
the compiler internally thought "I have no way of knowing if what's returned by malloc really points to an integer, but I trust the user, so I'll implicitly make the conversion." It's as if the compiler put the cast in for you.

C++, with its stronger type checking, refuses to make that assumption. C++'s general strategy is to implicitly convert from types with fewer constraints to types with more, but not the other way around. This explains the allowed conversion of void * to int * in
void *v;
int *i = v; // OK
i can be dereferenced and must point to an int, or may never be dereferenced at all; v has no choice, and can never be dereferenced. The conversion is "safe" since everything v can do with its data, i can do. Put perhaps more rigorously, v's domain of behavior is a subset of i's.

As another example, consider why
int const *ci = 0;
int *i = ci;
fails while
int *i =0;
int const *ci = i;
succeeds. The object referenced by ci has more constraints: not only must it point to an int, but that int cannot be changed. i's referenced object need only be an integer; whether or not it is ever changed is not part of the constraint. As before, everything ci can do with its data, i can do. ci's behavior is a subset of i's, so the conversion is safe.

Limitations of malloc

Adding these casts not only works in the short term, it actually works in the long term as well, at least for code ported from C. The behavior of the code in C++ is identical to that in C. However, when you start massaging the former C code into genuine C++, you will quickly find that malloc has limitations. malloc was designed to carve out a chunk of memory from the heap; it was never designed to initialize or construct an object within that memory. That chunk is just a cluster of bytes, with no type information (hence malloc's return type of void *). How that chunk is interpreted and initialized is up to the user.

For objects not requiring construction, this limitation has no net effect; C types ported to C++ neither have nor require explicit constructors, so malloc's behavior is no drama[6] . Once you add constructors to those former C types, malloc's behavior becomes a serious drama. Those objects' constructors must be called before those objects are used. Since malloc doesn't call constructors, and since by the time malloc hands you a pointer it's too late to call constructors, your objects are in limbo.

You could add a member called init or some such to these types. You would explicitly call this member after allocating the memory, as shown by
typedef struct
    {
    init();
    // presumably some data members here...
    } former_C_type;

void f()
    {
    //
    // Reserve enough storage to hold
    // a former_C_type object, and initialize
    // 'x' to point to that storage.
    //
    former_C_type *x = (former_C_type)
            malloc(sizeof(former_C_type));
    //
    // Initialize/construct that object
    // explicitly with the pseudo-constructor
    // 'init'.
    //
    x->init();
    }
This is still a hack[7] . Ideally, you want something that combines the action of malloc (reserving storage) and the object's constructor (initializing the reserved storage). C++ provides such a mechanism: the operator new.

new

I have no intention of launching into a general treatise on new, a subject meriting its own separate article. Take it on faith for now that, within your newly ported code, you want to turn all possible malloc calls to new calls, so that every instance of
some_type *x = malloc(sizeof(some_type));
becomes
some_type *x = new some_type;
In hand-waving pseudo-code, this is tantamount to
//
// Call 'new' heap manager to allocate storage
// for 'sizeof(some_type)' bytes, and set 'x'
// to point to that new storage.
//
some_type *x = allocate(sizeof(some_type));
//
// Call the 'some_type' default constructor,
// passing 'x' as the constructor's
// 'this' pointer.
//
some_type(x);
Certainly for instances of some_type that will eventually sport constructors, such a transformation from malloc to new is imperative (short of the init solution mentioned earlier). For types that will never have constructors, strictly speaking you don't have to use new. Even so, I encourage you to use new uniformly. Memory allocated by new must be deallocated by delete, while memory allocated by malloc must be deallocated by free. If you start mixing newed and malloced memory, you risk deallocating something the wrong way, leading to undefined and potentially disastrous program behavior. Also, you never know when code using a builtin type today will need to use a real class type tomorrow. If you already use new everywhere, your code becomes more adaptable to future changes.

Having read all that, know that some C code doesn't easily adapt to new. Consider a structure with a variable-size array member, as shown by the common C idiom
struct block
    {
    //
    // 'size' will hold actual # of bytes
    // in 'data'.
    //
    size_t size;
    //
    // 'data' has static size of 1, but at
    // runtime will actually reference a
    // larger piece of storage.
    //
    unsigned char data[1];
    };
This block is allocated and initialized by
//
// Arbitrarily set 'data' size to 100 bytes.
//
size_t const SIZE = 100;
//
// Allocate chunk big enough for'size' + all
// the 'data' elements.
//
block *b = (block *)
        malloc(sizeof(block) + SIZE - 1);
//
// Set 'size' member to actual # of bytes
// in 'data'.
//
b->size = SIZE;
//
// Set all 'data' bytes to 0xFF.
//
for (size_t d = 0; d < SIZE; ++d)
    b->data[d] = 0xFF;
If you substitute new for malloc, as in
size_t const SIZE = 100;
block *b = new block; // changed 'malloc' to 'new'
b->size = SIZE;
for (size_t d = 0; d < SIZE; ++d)
    b->data[d] = 0xFF; // KABOOM!
the code may well blow up on assignment to b->data[1]. Remember, new allocated sizeof(block) bytes, the combined size of block's data members. sizeof b->size is the same as sizeof(size_t), which on my system is four bytes. And sizeof b->data? One! Therefore, new always allocates five bytes. When you go to assign into the data array beyond element 0, there is nothing there to assign into.

This shows a critical difference between malloc and new: malloc needs to be explicitly told how many bytes to allocate, while new implicitly reckons the number of bytes from the size of the allocated object. For most cases, new's behavior is desirable and certainly less error-prone. Even in examples like this, all is not lost, for you can write your own version of new that takes an extra argument specifying the actual size of data:
block *b = new(SIZE) block; // extra 'new' arg
This new can allocate the proper number of bytes. Still, the block constructor is ignorant of these extra bytes, so to be complete, you also need to write a block constructor accepting a size argument, giving a complete allocation call of
block *b = new(SIZE) block(SIZE);
This statement tells new how many bytes to really allocate, and tells the block constructor how big that allocated chunk really is, in both cases trumping block's static type information. I leave it as an exercise for the student to instrument the appropriate class-specific new and constructor, subjects outside this article's purview.

Closing Chords

In general, ANSI C's type compatibility rules are (as they say in the jeans trade) a "relaxed fit" compared to C++'s. Depending on the vintage of C you are porting from, your compiler's rules may be more lax still. Don't be surprised if, upon porting, you find an explosion of type compatibility conflicts beyond even those I've covered here. While this explosion may seem an annoyance, consider that every instance of type incompatibility may be covering a potential design flaw in your code. C++ is not just being picky — it is showing you potentially unsafe or even fatal conversions that may have crippled your C code all along. You just weren't aware.

Even if you never intend to port your C code to C++, I seriously recommend you get hold of a good C++ compiler and run your code through it anyway. All the type conflicts it catches are real, and exist in C; the language simply chooses not to tell you. By seeing them flagged, you have a chance to fix design problems you didn't intend. C's acceptance of these questionable conversions does not somehow imbue them with virtue. It is, I think, a corollary of life: that something is legal doesn't necessarily mean it is right.

Erratica

Regarding my four-part Boolean series, Aaron Cavender noted that in the boolean constructor I used the C-style cast
(char) 1
He suggests that instead I should have used the more stylish RTTI cast
static_cast<char>(1)
At the time, my compilers didn't support this cast, but my new MetroWerks system does. Someone whose opinions I value (and to protect his anonymity I won't mention that it's Saks) thinks these casts are a bit superfluous. I'm not so sure ... I think they telegraph intent, and help demarcate ported C code that hasn't been properly mutated.

Now that I've started using them, I'm finding I actually like these casts, even if they are a bit verbose. So I'm taking Aaron's recommendation, and will use RTTI casts from now on. If your compiler does not support them, fear not: you can always turn an RTTI cast of
XXX_cast<type>(expression)
(where XXX is one of const, dynamic, reinterpret, or static) into the C-style cast
(type) expression
Concerning that same column series, Earl Chew found a bug in the acceptance test suite, as given in Listing 1 of my January 1996 column. Near the very end I wrote
switch (i)
where i is an int. I had actually meant to write
switch (b)
where b is a boolean.

Notes

[1] Those of you harboring less than charitable thoughts towards C++ might say it would be better to call the language C--, or some nefarious variation like --C--. Please direct such thoughts to Saks, since he's on that committee. I wish to maintain what modicum of journalistic purity I have left on the subject.

[2] It used to be worse. I remember compilers taking int *p = 3 without a complaint.

[3] In the parlance of ANSI C, void is an incomplete type that cannot be completed. You cannot create an object of incomplete type. Ergo, you cannot create an object of type void.

[4] size_t is the "return type" of operator sizeof. Think of size_t as being big enough to hold the size (in bytes) of any single object. This is for conforming code; extensions such as the infamous DOS/Win16 huge objects, whose sizes cannot be held by a size_t, are not conforming.

[5] We used to call these cv-qualifiers ("c" for const, "v" for volatile), but in my Dec 1995 ANSI/ISO C Draft Standard, which contains proposed changes to the language, there is a third qualifier called restrict. So rather than call them cvr-qualifiers, we are now using the more general term type qualifiers. This whole business of type qualifiers and how they affect type conversions is just crying out for — you guessed it — another article! Are these languages great job security or what?

[6] Yes, the compiler synthesizes construction for your former C structs, but the net result is a giant no-op, as if the synthesized constructor were empty. The struct is left with "random" values (local duration) or all zeroes (static duration), just as in C.

[7] Actually, an init member is quite useful in other contexts where so-called two-part construction is desirable. That discussion is beyond the scope of this article. I mention it only so you won't think such a technique is always a hack. There are places where it is arguably the best solution.

Bobby Schmidt is a freelance writer, teacher, consultant, and programmer. He is also a member of the ANSI/ISO C standards committee, an alumnus of Microsoft, and an original "associate" of (Dan) Saks & Associates. In other career incarnations, Bobby has been a pool hall operator, radio DJ, private investigator, and astronomer. You may summon him at 3543 167th Ct NE #BB-301, Redmond WA 98052; by phone at (206) 881-6990, or via Internet e-mail as rschmidt@netcom.com.