February 1997/The Learning C/C++urve

The Learning C/C++urve

Bobby Schmidt

Further Adventures in Abstraction

True or false? Has Bobby beaten boolean types to death? The real answer is more abstract than you might think at first.

Copyright © 1997 Robert H. Schmidt

This month I continue my broad overview of C abstraction techniques, ending with a preview of how we'll apply those techniques and extend them into the misty realm of C++.

Enumerations

Sherman, cast the Wayback machine to January 1996, the first column of my Boolean series, and the first trial Boolean implementation in that column:
typedef int boolean;

#define false 0
#define true 1
This attempt uses macros and type definitions, the two abstraction methods I touched on last month. Here the typedef names and implements the abstracted type boolean, while the #defines name and implement abstracted values of that abstracted type.

These three definitions have no innate linkage to one another. We must choose to conceptually bundle together the two values (false and true), and to bundle that pair in turn with the type (boolean). To me, part of the attraction of abstraction [1] is reducing your conscious enforcement of cohesion. With this solution, you must take three things unrelated (from the language's perspective) and consciously treat them as if they are related.

I would argue that a more abstracted (and certainly more functionally cohesive) solution would bundle the three elements together from both your perspective and the language's. In fact, such a solution exists, forming the third trial Boolean implementation in that same early '96 column series:
enum boolean
    {
    false,
    true
    };
Now all names are intrinsically connected. The enumeration simultaneously defines both the abstracted type name and its abstracted values.

Beyond cohesion, enumerations make some improvements (e.g., false and true obey scope) but don't fix everything. In particular, whereas you know the typedefed Boolean is the size of int, you have no such security with the enumeration Boolean. According to the C Standard, the implementation can make any enumeration the size of whatever integral type it sees fit.

Also, enumerations mix with their underlying integral implementations so easily that, as a type-safety mechanism, their usefulness is limited:
enum boolean
    {
    false,
    true
    };

int i = true;       /* OK */
enum boolean b = 2; /* OK */
Thus, enumerations are really no more absolutely abstracted than typedefs. In many contexts, you are still conscious of their underlying implementation. Their principal abstraction benefit, in my view, is the cohesion between enumeration type names and collateral enumeration values.
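A minimal sketch of both points follows. The enumerator names bool_false and bool_true and all three function names are placeholders invented for this illustration (the column's false and true are reserved words in newer C dialects, so they are avoided here). The size reported is whatever your implementation happens to pick.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names: bool_false/bool_true stand in for the
   column's false/true, which newer C dialects reserve. */
enum boolean
    {
    bool_false,
    bool_true
    };

/* Size of the enumeration type; the implementation chooses it. */
size_t boolean_size(void)
    {
    return sizeof(enum boolean);
    }

/* An enum object accepts a value that matches no enumerator. */
int store_two(void)
    {
    enum boolean b = 2;  /* no diagnostic required */
    return (int) b;
    }

/* An enumerator quietly becomes a plain int. */
int enumerator_as_int(void)
    {
    int i = bool_true;
    assert(i == 1);      /* enumerators count up from zero */
    return i;
    }
```

On typical desktop compilers boolean_size() matches sizeof(int), but nothing in the language promises that, which is exactly the insecurity described above.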

Arrays

Because C arrays so easily decay into pointers to their first elements, they are what I consider a fragile type. Arrays remind me of hydrogen peroxide (H₂O₂), which is just busting to decompose into water (H₂O) and oxygen (O₂) at the slightest provocation [2].

As an example, consider
typedef char string[100];

string s = "g'day"; /* OK */
char *p = s;        /* OK */
You could argue that the conversion from zero-terminated char array literal ("g'day") to string does not destroy the abstraction, since such literals embody C's notion of a conceptual string type. Put another way, the conversion moves a "string" from C's realm into a "string" in yours — there is no change of abstracted meaning.

I don't believe this same reasoning applies to strings turning themselves into char *s. This conversion not only betrays an abstracted entity's implementation as char-based, but reduces the abstraction to a non-abstraction (pointer).

The relationship between an abstracted string and a physical pointer is not necessarily unambiguous. You could presume the pointer refers to the first string element. You could presume it points to the last, allowing easier string concatenation. You could even presume the pointer references a copy of the string, and that modifying through the pointer does not change the original string.

Experienced C programmers, of course, will no doubt consider the first presumption self-evident. I would counter that it is only C's history of treating arrays as pointers that renders this at all evident. To those experienced in other languages, this relationship is not axiomatic.

I find array-to-pointer decay one of C's (and C++'s) greatest design weaknesses with respect to the type system. I am abundantly aware of the expressive power, especially of being "close to the machine," that such array/pointer synonyms offer. But I would counter that in large projects, the domain in which these synonyms prove necessary is often overshadowed by the domain in which they are unneeded or even detrimental [3].
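To make the fragility concrete, here is a small sketch built on the string typedef from above. The three function names are invented for the illustration. It shows that the decayed pointer addresses the first element, and that the array's size information does not survive the conversion.

```c
#include <assert.h>
#include <stddef.h>

typedef char string[100];

/* Returns 1 if the decayed pointer addresses element 0. */
int decays_to_first_element(void)
    {
    string s = "g'day";
    char *p = s;           /* array-to-pointer decay */
    return p == &s[0];
    }

/* Size of the array object itself. */
size_t size_as_array(void)
    {
    string s = "g'day";
    return sizeof s;       /* 100: still an array here */
    }

/* Size seen through the decayed pointer. */
size_t size_after_decay(void)
    {
    string s = "g'day";
    char *p = s;
    assert(p[0] == 'g');
    return sizeof p;       /* size of a pointer, not of the array */
    }
```

The same loss happens silently when a string is passed to a function, since an array parameter is adjusted to a pointer parameter.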

Structures

The only way you can well and truly create your own unique type in C is with a structure [4]:
struct node
    {
    void *data;
    struct node *next;
    };
Unfortunately, structures betray their struct-ness when used in declarations:
struct node x;
To make this type more abstracted, combine it with a typedef:
struct node
    {
    void *data;
    struct node *next;
    };
typedef struct node node;

struct node x; /* OK */
node y;        /* also OK */
Such structures achieve abstraction by not magically converting themselves into other types:

Unlike scalar types, structures don't convert to and from other scalar types.

Unlike enumerations, structures don't convert to and from integral types.

Unlike arrays, structures don't convert to pointer types.

In short, to paraphrase Gertrude Stein, a structure is a structure is a structure. It converts to nothing else, and is converted from nothing else. Even typecasts yield to this abstraction barrier:
struct s1
    {
    int i1;
    };

struct s2
    {
    int i2;
    };

struct s1 x1;
/* error */
struct s2 x2 = (struct s2) x1;
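Since the language offers no conversion, any bridge between two structure types has to be written out by hand. A minimal sketch, in which the helper name s2_from_s1 and the demo function are assumptions made for this example:

```c
#include <assert.h>

struct s1
    {
    int i1;
    };

struct s2
    {
    int i2;
    };

/* Hypothetical helper: the only route from s1 to s2 is to
   build an s2 explicitly, member by member. */
struct s2 s2_from_s1(struct s1 x)
    {
    struct s2 result;
    result.i2 = x.i1;
    return result;
    }

void conversion_demo(void)
    {
    struct s1 x1 = {42};
    struct s2 x2 = s2_from_s1(x1);
    assert(x2.i2 == 42);
    }
```

The conversion is now a named, visible act rather than something the compiler does behind your back.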
As a follow-up to our previous example, structures can increase the abstraction of arrays:
struct string
    {
    char data[100];
    };
typedef struct string string;
Now the earlier valid definitions
string s = "g'day"; /* error */
char *p = s;        /* error */
fail — string is no longer freely miscible with char *.
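What does still work is worth a quick look. The sketch below (function names invented for the illustration) shows that the wrapped array can be initialized with braces, assigned as a whole, and passed by value without losing its size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct string
    {
    char data[100];
    };
typedef struct string string;

/* Passed by value: the whole array travels with the structure. */
size_t size_inside_callee(string s)
    {
    return sizeof s;
    }

void wrapped_array_demo(void)
    {
    string a = {"g'day"};   /* braces required: a is a structure */
    string b;

    b = a;                  /* structure assignment copies the array */
    assert(strcmp(b.data, "g'day") == 0);
    assert(size_inside_callee(b) == sizeof(string));
    assert(sizeof(string) >= 100);
    }
```

Compare this with the bare array typedef, where neither assignment nor pass-by-value is available.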

As I see it, the principal abstraction "gotcha" with structures is that you can't access their implementation without using structure-specific syntax:
int main(void)
    {
    string x;
    strcpy(x.data, "fair dinkum");
    return 0;
    }
Here the token . (dot) says that x is a structure. So while structure types are abstracted with respect to other types, structure objects are not so abstracted.

Functions

To help abstract away a structure's implementation, replace direct structure member references with function calls that hide those references. For instance, you can change the last example to:
char *string_data(string *s)
    {
    return s->data;
    }

int main(void)
    {
    string x;
    strcpy(string_data(&x), "fair dinkum");
    return 0;
    }
or even
void string_copy(string *t,
                 char const *s)
    {
    strcpy(t->data, s);
    }

int main(void)
    {
    string x;
    string_copy(&x, "fair dinkum");
    return 0;
    }
Such function abstraction is especially valuable for initializing structure objects, both as a C-specific design technique and in anticipation of C++'s constructors. Not only does this technique promote abstraction, but it also allows structure objects to perform complex initialization. Rather than simply allowing you to set a variable to a certain bit pattern, construction can perform functions that are logically part of the object's construction but physically not part of the object at all.

Consider a variation of our string structure that hides a char array allocated from the heap and caches its current length:
struct string
    {
    size_t length;
    char *data;
    };

typedef struct string string;
To construct an empty string object requires something like
string x;

x.length = 0;
x.data = NULL;
To construct a string object from a string literal:
char const *initial_value = "wombat";

string y;

y.length = strlen(initial_value);
y.data = malloc(y.length + 1); /* +1 for the '\0' */
strcpy(y.data, initial_value);
This shows that fully constructing a string object requires you to call functions that are not part of that object, and to expose those calls to the whole world. Wrapping the calls in a constructor function yields the more abstracted
void string_construct
        (string *s, char const *data)
    {
    if (data == NULL)
        {
        s->length = 0;
        s->data = NULL;
        }
    else
        {
        s->length = strlen(data);
        /* +1 for the terminating '\0' */
        s->data = malloc(s->length + 1);
        strcpy(s->data, data);
        }
    }

int main(void)
    {
    string x;
    string y;
    string_construct(&x, NULL);
    string_construct(&y, "wombat");
    return 0;
    }
You can even combine this with macros, as I hinted last month:
#define STRING(s, data)\
        string s; \
        string_construct(&s, (data))

int main(void)
    {
    STRING(x, NULL);
    return 0;
    }
Unfortunately, because C (as standardized in 1990) requires all of a block's declarations to precede its statements, you can't use STRING multiple times in the same scope:
int main(void)
    {
    STRING(x, NULL);
    STRING(y, "wombat");
    return 0;
    }
This is tantamount to
int main(void)
    {
    string x;
    string_construct(&x, NULL);
    string y;
    string_construct(&y, "wombat");
    return 0;
    }
which is not valid C (but is valid C++). One possible workaround:
string string_construct(char const *data)
    {
    /* implementation left as an exercise
       for the student */
    }

#define STRING(s, data)\
    string s = string_construct(data)

int main(void)
    {
    STRING(x, NULL);
    STRING(y, "wombat");
    return 0;
    }
which expands to
int main(void)
    {
    string x = string_construct(NULL);
    string y = string_construct("wombat");
    return 0;
    }
Since C allows you to initialize structure objects:
string x = {0, NULL};
extending that notion of initialization to full-on construction is not a huge leap. The converse — adapting C++ destructors to C — is less simple, since C has no concept of "de-initialization" and no provision for automatically calling a destruction function when an object goes out of scope.

Nonetheless, by hiding an object's construction this way, you leave open the possibility of changing what it means to construct that object, without allowing that object's users to be (unduly) sensitive to the change.
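For readers who want something to compile, here is one possible answer to the exercise above, paired with a hand-called destruction function. This is a sketch under stated assumptions, not the column's own implementation: the body of string_construct is just one reasonable choice (it falls back to an empty string if allocation fails), and string_destruct and the demo function are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct string
    {
    size_t length;
    char *data;
    };
typedef struct string string;

/* Returns the constructed object by value. A NULL argument,
   or a failed allocation, yields the empty string. */
string string_construct(char const *data)
    {
    string s = {0, NULL};

    if (data != NULL)
        {
        size_t n = strlen(data);
        char *copy = malloc(n + 1);  /* +1 for the terminating '\0' */
        if (copy != NULL)
            {
            memcpy(copy, data, n + 1);
            s.length = n;
            s.data = copy;
            }
        }
    return s;
    }

/* Hypothetical counterpart; C never calls this automatically. */
void string_destruct(string *s)
    {
    free(s->data);      /* free(NULL) is a no-op */
    s->length = 0;
    s->data = NULL;
    }

void construct_destruct_demo(void)
    {
    string x = string_construct(NULL);
    string y = string_construct("wombat");

    assert(x.length == 0 && x.data == NULL);
    assert(y.length == 6 && strcmp(y.data, "wombat") == 0);

    string_destruct(&y);
    string_destruct(&x);
    assert(y.data == NULL);
    }
```

Every path out of the scope must remember to call string_destruct, which is precisely the bookkeeping that C++ destructors take over.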

Encapsulation Via Scope

One age-old encapsulation technique involves reducing a name's scope. In the example
int i = -1;

int main(void)
    {
    int j;
    while (i < 0)
        {
        int k;
        /* ... */
        }
    return i;
    }
i is not encapsulated at all (it is potentially accessible to the entire program), j is encapsulated some (accessible only to main), and k is encapsulated most of all (accessible only to the while-loop body).

While this example shows encapsulation within a function, you can also encapsulate within data:
int i = -1;

struct s
    {
    int j;
    };
As before, i is not encapsulated at all, while j is encapsulated within struct s, and is only accessible via the . or -> operators.

By limiting a name's scope, you rein in the domain over which that name can possibly be used. This not only promotes abstraction (by concealing conceptually hidden names from the largest possible audience), but reduces the chances of name collision (inadvertently using the same name in the same scope for multiple purposes).
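One familiar application of this idea, offered as an illustrative sketch rather than anything from the column itself: a counter that must outlive each call yet should be nameable by only one function. The names next_id and count are invented for the example.

```c
#include <assert.h>

/* count has block scope: no code outside next_id can name it,
   even though (being static) it persists between calls. */
int next_id(void)
    {
    static int count = 0;
    return ++count;
    }

void scope_demo(void)
    {
    int first = next_id();
    int second = next_id();
    assert(second == first + 1);
    }
```

Had count been declared outside the function, every other function in the file could read or clobber it.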

Encapsulation Via Translation Unit

Another time-honored encapsulation tactic keeps all names at global scope, but hides them from one another at translation time:
/* translation unit 1 */

int e;
static int s;

/* translation unit 2 */

extern int e;   /* same 'e' as in unit 1 */
static int s;   /* new 's' */
Each s has the same type (int) and the same scope (global), yet each is unique. Because they have internal linkage (i.e., are declared static at global scope), the names are not visible across translation unit boundaries. Contrast this to e, which has external linkage and is thus visible in both translation units.

To harness translation units for type abstraction:

Determine the functionally cohesive pieces that make up that type.

Move the type's "public" interface into a header file.

Move the type's "private" implementation into a separate C source file. Functions and data declared in the header must have external linkage. Other functions and data should be declared static.

Include the header everywhere the type's interface is needed, but compile the implementation source file exactly once and link it into the program.

One possible adaptation of this strategy to our string example yields:
/*
 *  our_string.h
 */

typedef struct string
    {
    void *private;
    }
string;

void string_construct(string *, char const *);

/*
 * our_string.c
 */

#include <stdlib.h>

#include "our_string.h"

typedef struct string_implementation
    {
    /* same as 'string' from previous example */
    }
string_implementation;

static void string_implementation_construct
        (string_implementation *s,
        char const *data)
    {
    /*  same as 'string_construct' from previous example */
    }

void string_construct
        (string *s, char const *data)
    {
    /* allocate the hidden implementation
       before constructing it */
    s->private = malloc(sizeof(string_implementation));
    if (s->private != NULL)
        string_implementation_construct
                (s->private, data);
    }
Points to ponder:

Users know only of string's name, and that it is a struct. The implementation members are hidden through layers of data indirection (void *private) and function indirection (string_construct).

The implementation members are known to our_string.c, which fleshes them out with a private hidden structure string_implementation.

The publicly viewable string_construct manipulates abstracted strings. That function calls a private hidden string_implementation_construct manipulating the "real" data behind the abstracted string.
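The pieces can be fleshed out roughly as follows. To keep the sketch compilable on its own, header and source are collapsed into one file, with comments marking where the split would fall. The member layout is borrowed from the earlier length/data example. The functions string_length and string_destruct, and the demo function, are hypothetical additions that do not appear in the listing above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* ---- would live in our_string.h ---- */

typedef struct string
    {
    void *private;
    }
string;

void string_construct(string *, char const *);
void string_destruct(string *);           /* hypothetical */
size_t string_length(string const *);     /* hypothetical */

/* ---- would live in our_string.c ---- */

typedef struct string_implementation
    {
    size_t length;
    char *data;
    }
string_implementation;

/* Internal linkage: invisible to other translation units. */
static void string_implementation_construct
        (string_implementation *s, char const *data)
    {
    s->length = 0;
    s->data = NULL;
    if (data != NULL)
        {
        size_t n = strlen(data);
        s->data = malloc(n + 1);   /* +1 for the terminating '\0' */
        if (s->data != NULL)
            {
            memcpy(s->data, data, n + 1);
            s->length = n;
            }
        }
    }

void string_construct(string *s, char const *data)
    {
    s->private = malloc(sizeof(string_implementation));
    if (s->private != NULL)
        string_implementation_construct(s->private, data);
    }

size_t string_length(string const *s)
    {
    string_implementation const *impl = s->private;
    return impl != NULL ? impl->length : 0;
    }

void string_destruct(string *s)
    {
    string_implementation *impl = s->private;
    if (impl != NULL)
        free(impl->data);
    free(impl);
    s->private = NULL;
    }

/* ---- client code: sees only the header's names ---- */

void opaque_demo(void)
    {
    string x;
    string_construct(&x, "wombat");
    assert(string_length(&x) == 6);
    string_destruct(&x);
    }
```

Client code never mentions length or data, so the implementation structure can change without touching any caller, at the cost of one extra allocation and one extra indirection per object.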

This technique brings us perilously close to a typical C++ class implementation, and concludes our survey of C abstraction fundamentals.

What's it All Mean, Alfie?

Highlights from these last two columns:

C does not explicitly endorse abstraction, but does permit it some. Typedefs, structures, and functions all put named wrappers around functionally-cohesive implementations.

Encapsulation is a means to the end of abstraction, putting a barrier between code pieces. Just as with abstraction, C tolerates but does not endorse encapsulation. Examples include local objects, structure members, and separate compilation.

You can't ever hide implementation completely. Some code at some level must be aware of how something is implemented. The trick is to minimize the amount of code that is aware.

On this last point, note that C++ still forces you to stay amazingly aware of underlying implementations:

Inline function bodies and private member declarations typically appear in header files, which can be included in other translation units, and thus are not textually hidden from users.

The implementations of baroque class hierarchies like Microsoft's MFC or Standard C++'s STL require careful dissection, so that you may understand the otherwise impenetrable error messages and correctly trump the appropriate type behavior.

Because C++'s object construction can involve a (theoretically) unbounded number of function calls (vs. C's, which simply allocates some possibly initialized bits), analyzing an object's space/time costs often requires knowing how that object is implemented.

And The Winner is ...

Having established a minimal set of C abstraction techniques, we are now ready to tackle a Real World Problem: an e-mail client that uses Microsoft's messaging interface (MAPI). Before you start fibrillating, know that I pick this example because

I have recent experience writing such a program.

It actually offers numerous opportunities to explore abstraction.

It encompasses an application that I presume all of you can relate to.

I promise, even if you are not a Windows wonk, you can still profit from this discussion. If you have @microsoft.com in your e-mail name, know that I am not looking to skewer your benefactor. I'm sure many other large interface sets suffer similar limitations; it's just that I happen to be familiar with this one.

Erratica

And finally, the letters keep pouring in! Next month I'll hit the virtual mailbag for more dialectics on Hungarian notation and related matters.

Notes

1. Sounds like something Don King would say, doesn't it?
2. In case you've ever wondered, this very fragility explains hydrogen peroxide bottles being brown. Light accelerates the decomposition. Dark bottles help preserve the peroxide's integrity.
3. Just ask anyone who's taken sizeof(array) where array is a parameter name, only to find array has decayed into a pointer.
4. I discuss structs explicitly, but the same notions apply to unions as well.

Bobby Schmidt is a freelance writer, teacher, consultant, and programmer. He is also a member of the ANSI/ISO C standards committee, an alumnus of Microsoft, and an original "associate" of (Dan) Saks & Associates. In other career incarnations, Bobby has been a pool hall operator, radio DJ, private investigator, and astronomer. You may summon him at 3543 167th Ct NE #BB-301, Redmond WA 98052; by phone at (206) 881-6990, or via Internet e-mail as rschmidt@netcom.com.