June 1997/The Learning C/C++urve

Columns

The Learning C/C++urve

Bobby Schmidt

Sargasso Sea

There's more than one way to write a constructor or encapsulate a string, and Bobby shows us quite a few of both.

Copyright © 1997 Robert H. Schmidt

As I warned in January when I started this series, I was willing to let the tides of fortune influence my specific direction. I find much of this month's column inspired by e-mail from Diligent Readers. Fortunately, the Hungarian Notation torrent has largely yielded to the more languorous currents of C encapsulation. I'll start with a few smaller nits first, before moving on to the Big Stuff.

A Free CD with Every Visit!

Good News: apparently reversing their earlier decision, those wacky ISO C++ people have put their Committee Draft (CD) 2 online at <http://www.setech.com/x3.html> and <http://www.maths.warwick.ac.uk/ c++/pub/>. For those of you who have already paid for a copy, at least it's a tax write off.

Bad News: Public comments on this CD closed on 18 March 1997, which is a couple months before you are reading — and literally the day I am writing — this column [1] .

Good News: Comments after 18 March are still eligible for the next public review, whenever that comes.

The X3 Files

On a related note, Dan Saks and I have long called the ANSI C and C++ committees X3J11 and X3J16. I recently learned X3 is changing its name to NCITS (National Committee for Information Technology Standards). What a hideous name — reminds me of a Clearasil commercial. I much prefer TOFKAX (The Organization Formerly Known As X3).

Apparently the folks in TOFKAX are still suffering a bit of identity crisis, since you can reach them at either <http://www.x3.org> or <http://www.ncits.org>. Interestingly, both web sites still refer to the C and C++ committees by their X3-prefixed names.

The Last (?) on Syntax Coloring

Andreas Arff suggests syntax coloring and Hungarian are facets of the same desire — to inform code with meta-information about that code's place in the programming ecosystem — and that I'm somewhat hypocritical in dismissing one but not the other. In part he writes

As you might have understood by now, if you have bothered to read this far, the question is how we (the programmers) get information from reading source code. Not whether your editor supports #if 0...#endif coloring or if you put an m_ in front of a member. Point is, if you flame hungarian you flame syntax coloring and vice versa, because you actually flame the concept of conveying information in the source code. Give it some thought.

My instinct says the two notions are different, but at the moment I don't have a cogent or compelling argument. I told Andreas I'd have to ponder the difference and get back to him. If you have a clear perspective on this, please send it in, and I'll consider it for a future column.

In March I mentioned ObjectMaster by AICUS as a syntax-coloring editor. Creed Erickson notes that AICUS sold ObjectMaster to Altura Software, Inc., and that you can find out more at <http://www.altura.com/om.html>.

For my work, I've settled on Bare Bones Software's BBEdit, which supports syntax coloring for C, C++, and Java, and integrates well with my Metrowerks CodeWarrior IDE. If you are on the Mac but aren't using this editor (or its freeware version BBEdit Lite), get thee to <http://www.barebones.com> and check it out.

Member Initializers

In the March column I suggested
c::c(int p_i) : i(p_i)
    {
    }
as an alternative to warts on data members. Diligent Readers Jonathan Lundquist, Zachary Schmidt, and Robert Mashlan suggest I missed a much simpler approach:
c::c(int i) : i(i)
    {
    }
This works because

The first i in i(i) must refer to a member.

The lookup rules for the second i try to match a parameter before

matching the member.

For trivial constructors, the example as written works. However, this method does not always generalize so well:
c::c(int i) : i(i)
    {
    if (errno != 0)
        // oops, which i gets set?
        i = errno;
    }
The same scope rules that make the second i in i(i) refer to the parameter also make i = errno refer to the parameter. Presuming you mean to set the data member i, you need a change like
c::c(int i) : i(i)
    {
    if (errno != 0)
        c::i = 0; // ok
    }
(By the way, this is a motivation for my preferred style of declaring by-value parameters const. Changing the declaration to c::c(const int i) lets the compiler catch this bug. Diligent Readers know that Dan Saks and I are in friendly disagreement on this point.)

This naming method can also lead to more subtle bugs. For example, if you have a class with data members i and j (declared in that order), and desire to initialize j from i's value, the constructor
c::c(int i) : i(i * 9 / 5 + 32), j(i)
    {
    }
won't work — j initializes from the parameter i instead [2]. To fix this, try alternatives such as
c::c(int i) : i(i * 9 / 5 + 32), j(this->i)
    {
    }
or
c::c(int i) : i(i * 9 / 5 + 32), j(c::i)
    {
    }
(Note the subtle difference between j(c::i), which works, and c::j(i), which won't compile.)

Finally, one of the readers cited above said he accidentally mistyped the parameter:
// oops, j should be i
c::c(int j) : i(i)
    {
    }
so that the member i initialized from itself.

Character Encoding

Jason Zions points out an ambiguity in my March comments on ASCII and ISO 646. I originally wrote

"Classic C was written from an American/ASCII perspective. Standard C and C++ genuflect towards larger target character sets, but essentially keep the ASCII basis for source characters ... More strictly speaking, the basis is the ISO 646 subset of ASCII, which excludes a handful of ASCII characters."

Upon rereading, I find this passage a little off the mark. What I really meant was that, given C/C++ trigraphs, C++ digraphs, and <iso646.h>, you can reduce all C/C++ source to the ISO 646 invariant subset of ASCII. So while C/C++ source still has an ASCII essence and mood, you can get by with a subset.

Joe Manns reminds me that Microsoft's CString class converts to LPCTSTR, which is Microsoft-speak for const TCHAR *. For ASCII applications, TCHAR maps to a single-byte character; for Unicode applications, TCHAR is double-byte. In March I wrote that CString turns into const char *, which is true only for the single-byte condition. Microsoft's MAPI uses TCHARs heavily, so I'll have plenty of opportunity to comment on them as this series continues.

A char Too Far

In February I offered
s->length = strlen(data);
s->data = malloc(s->length);
strcpy(s->data, data);
as part of a string pseudo-constructor. Two readers questioned the function's malloc call. First, Roland Schaeuble correctly notes that I really should have written the allocation line as
s->data = malloc(s->length + 1);
Otherwise, the string is a character short. This was a CS 101 mistake — my apologies.

Next, David Knoble suggests that I check the malloc return value for NULL. Unlike Roland's discovery, which is always a bug and is easily preventable, this one is less clear. Certainly there is a chance that malloc will fail here — but if does, what am I to do? Checking for NULL is a necessary but not sufficient condition; I must also act on that discovery.

Since my desire was to show the essence of object construction, and not how to detect, propagate, and recover from errors, I chose to ignore the possibility of malloc failure. Besides, if malloc failed on such a small object, the application's memory state would be so fragile I doubt reliable recovery would be possible.

Encapsulation Take 1

Now we're getting to the Big Stuff. In February I described a way to hide private implementation in a C translation unit while providing public interface in a C header. A couple of readers provided variations on that theme, and I'd like to explore those in detail this month. First, though, I want to correct a bug from that February example: change the definition of string_construct from
void string_construct
        (string *s, char const *data)
    {
    string_implementation_construct
            (s->private, data);
    }
to
void string_construct
        (string *s, char const *data)
    {
    s->private_ = malloc(sizeof
            (string_implementation));
    string_implementation_construct
            (s->private_, data);
    }
This change properly initializes private_ [3] .

In my original scheme, users see a structure string with a single abstracted member private_. Malcolm Keller Beyer III proposes a solution that removes even that single member:
/* our_string.h */

#include <stdlib.h>

typedef struct string string;

void string_construct(string *, const char *);
size_t string_length(string *);
(I've added a length function to make the example more complete.)

In this context, string is an incomplete type — you know its name, and that it is a struct, but you know none of its members. So while you can define pointers to string, you can't define objects of type string, take sizeof(string), or dereference a string pointer to directly access string members. That access is veiled in the new our_string.c implementation:
/* our_string.c */

#include "our_string.h"

#include <string.h>

typedef struct string
    {
    size_t length;
    char *data;
    }

string;

void string_construct
        (string *this, char const *data)
    {
    if (data == NULL)
        {
        this->length = 0;
        this->data = NULL;
        }
    else
        {
        this->length = strlen(data);
        this->data = malloc(this->length + 1);
        strcpy(this->data, data);
        }
    }

size_t string_length(string *this)
    {
    return this->length;
    }
Compared to my original method, all references to private_ and the string_implementation structure are now gone. This makes the source code more terse, and removes an extra pointer dereference.

Because users can't directly dereference string members in either Malcolm's method or mine, our_string.h can't provide pseudo-inline functions like
#define string_length(s) ((s)->length)
Even real inline functions would face this same problem, an artifact of our attempts to hide the implementation. C++ solves this problem by making the private members visible to everyone, but accessible only to (potentially inlined) member functions. Such capability comes at the cost of increased compilation frequency — every time a class's private member changes, every translation unit referencing that class must recompile.

While Malcolm's method offers some advantages, in my mind it suffers one great deficiency: it can't port to easily to C++. In contrast, you could turn my string into a class, add public members, and still hide the private members as I've shown:
class string
    {
public:
    string();
    size_t length();
    // ... and so on
private:
    void *private_;
    };
The public stuff stays public, while the private stuff stays truly private, still hidden within a translation unit. Dan Saks described such a class technique (dubbed Cheshire Cat) in the April 1994 CUJ.

Encapsulation Take 2

Jamie Blustein sent me a method adapted from one written by — small world! — P.J. Plauger [4] . I've turned Jamie's original example into a string type. I've also played around with a few variations that get increasingly close to C++ class member syntax.

Reflecting Jamie's notions, the new interface features a few new twists:
/* our_string.h */

#include <stdlib.h>

typedef struct string
    {
    char *data_;
    size_t length_;
    }
string;
struct string_interface_
    {
    void (* construct)
            (string * const, const char *);
    void (* destruct)(string * const);
    size_t (* length)(const string * const);
    };

extern const struct string_interface_ STRING;
First, I've removed the previous data-hiding mechanisms (private pointer and incomplete string type) to simplify the example; however, there's no organic reason why you couldn't re-apply them. Second, I've added a new function destruct to make the example more realistic. Third, instead of declaring a set of discrete extern functions, I've declared the single extern structure STRING containing pointers to those functions; the pointers call through to functions implemented (and hidden) in our_string.c.

our_string.c appears in Listing 1. Highlights:

All this pointers are declared as string * const, so that the value of this cannot be changed in a member function. Such usage mimics how C++ implicitly declares this in real member functions.

destruct forms a spin-pair with construct. Unlike a true C++ destructor, destruct is not called automatically — you must remember to explicitly tell a string object to self-destruct.

length operates on a pointer to a const string. This is analogous to length being a C++ const member function.

STRING is a const object, initialized from the functions declared privately within our_string.c.

In earlier versions, you constructed a string object via
string s;
string_construct(&s);
With this version, the construction changes to
string s;
STRING.construct(&s);
The call dereferences the pointer STRING.construct, transferring control and passing &s to the construct function hidden in our_string.c. The more complete example
#include "our_string.h"

int main(void)
    {
    string s;
    size_t n;
    STRING.construct(&s, "abc");
    n = STRING.length(&s);
    STRING.destruct(&s);
    return 0;
    }
shows all the STRING members in action.

Encapsulation Take 3

As written, STRING members look and act like static members of a real C++ class:
class STRING
    {
public:
    static void construct
            (string * const, const char *);
    // ...
    }

int main()
    {
    string s;
    STRING::construct(&s, "abc");
    // ...
    }
To move the syntax a bit closer to that of non-static members, make the function pointers part of each object's state. That is, where the current string objects contain only data, change them so they contain both data and functions. Then you can dereference the objects directly to get at the functions. To make this work, change the string definition to
typedef struct string string;

struct string
    {
    /* public */
    void (* destruct)(string * const);
    size_t (* length)(const string * const);
    /* private */
    char *data_;
    size_t length_;
    };
Because string's functions reference string before it's completely defined, you must forward-declare string as I've shown here. I've also added comments to show the conceptually public and private parts of string.

Finally, because something has to bind "real" functions to string's new function pointers, we continue to have a separate string construction function. For ease of explanation, I'll leave the construct function in the STRING object from the previous example:
struct string_interface_
    {
    void (* construct)
            (string * const, const char *);
    };
The most important change in our_string.c comes in that same construct function:
void construct(string * const this,
        const char *data)
    {
    this->destruct = destruct;
    this->length = length;
    /* rest same as before */
    }
Where the previous version bound destruct and length to a single static object STRING, this version binds them to every string object that gets constructed. Such binding allows the somewhat bizarre usage
string s;
STRING.construct(&s, "abc");
s.length(&s);
s.destruct(&s);
Encapsulation Take 4

Even though the notation has changed some, these functions still behave like static members, requiring an explicit this pointer. When I first pondered this example, I kept looking for an easy way to make string look more like a real C++ class — s.length() instead of s.length(&s).

After several false starts, I finally hit on a way, although it's neither easy nor portable: instead of calling the static functions directly, call a small per-instance code stub or proxy that somehow knows the current this pointer and calls the real static functions for you.

I got this idea from my experience with Windows "thunking," a hideous name popularized by Microsoft. Thunks translate calls from one semantic realm (say, 16-bit) to another (32-bit), often tailoring the call with some instance-specific information (in Win16, thunks could set the base memory address for a function's data [5] ). I also find this method reminiscent of C++ virtual functions, where control flows into a small runtime dispatch table which redirects control to the real functions.

The real trick/hack comes in constructing the proxy stubs. Because you can't directly malloc a new function, you instead must malloc a sequence of data bytes that your target system can interpret and execute as code.

Since I'm writing this on a Macintosh PowerBook, and don't know the first thing about the Mac's PowerPC architecture, I can't show you a real live example. I can, however, sketch out the idea.

For the sake of argument, assume that the 16-bit word sequence
0xE001 0x1234 0x8000 0x5678
executed as instructions on your target system means "push the value 0x1234 on the stack, then call the function at address 0x5678." If you substitute the value of your this pointer for 0x1234 and the address of function length for 0x5678, you instead get a sequence meaning "push this on the stack, then call length." Because construct will be the only function passed an explicit this, it must create the stubs. Integrating this scheme into construct yields Listing 2.

You must also change destruct to expunge these extra proxy stubs:
static void destruct(string * const this)
    {
    if (this->data_ != NULL)
        free(this->data_);
    free(this->destruct);
    free(this->length);
    }
lest you have memory leaks. These frees are about as scary as delete this in C++ — be sure not to reference the object after the destruct call.

Finally, because user code no longer passes parameters to destruct and length, change the string definition to
struct string
    {
    void (* destruct)(void);
    size_t (* length)(void);
    /* ... rest same as before */
    };
With this definition in place, you can now call these functions like real C++ object members:
#include "our_string.h"

int main(void)
    {
    string s;
    size_t n;
    STRING.construct(&s, "abc");
    n = s.length();
    s.destruct();
    return 0;
    }
This example makes a host of simplifying assumptions about the underlying processor architecture. Also, the proxies as written work only for member functions taking exactly one parameter (the now-implicit this pointer). I leave it as an exercise for the student to generalize the method to any arbitrary parameter list. While all this seems a lot of trouble just to fake C++ syntax, there is a deeper motivation: by extending the function dispatch motif, you can emulate inheritance and virtual functions in C.

On Deck

If you didn't see your e-mail mentioned here, take heart. Several of you brought up points I'm deferring until I cover related topics in later columns. I (eventually) answer all e-mail personally; I just don't always answer it in the column.

Next month, I'll ruminate more on protected-vs.-private, and start crafting the first foundation type in the MAPI project.

Notes

1. To make sense of such apparent ruptures in the space-time continuum, tune in the documentary series Star Trek Voyager, which features compelling case studies challenging the traditional boundaries of physics.

2. Don't let this example give you the vapors. I know that relying on member declaration order is hazardous in its own right; I present it only to demonstrate member/parameter name ambiguity.

3. Another CS 101 error. Yeesh, maybe it's time to take one of Saks' beginner courses.

4. P.J. Plauger. Programming on Purpose: Essays on Software Design (Prentice-Hall, 1993). Essay 22: Inherit It, pp. 201-2.

5. Raise your hand if you remember the dreaded "DS != SS" quagmire. o

Bobby Schmidt is a freelance writer, teacher, consultant, and programmer. He is also a member of the ANSI/ISO C standards committee, an alumnus of Microsoft, and an original "associate" of (Dan) Saks & Associates. In other career incarnations, Bobby has been a pool hall operator, radio DJ, private investigator, and astronomer. You may summon him at 14518 104th Ave NE Bothell WA 98011; by phone at +1-206-488-7696, or via Internet e-mail as rschmidt@netcom.com.