September 1998/The Learning C/C++urve

C/C++ Contributing Editors

The Learning C/C++urve: C/C++urve Ball

Bobby Schmidt

What's the difference between an array and a vector? Whatever you make it.

Copyright © 1998 Robert H. Schmidt

Life is change [1]. Starting with this December's CUJ issue, my column becomes a bona fide Q&A feature. As I solicit, answer, and publish so much reader mail anyway, this seems a natural progression. We may have an interactive online presence at some point, but I make no promises — yet.

Diligent Readers may think, "Yes yes, but y'all already have a Q&A columnist." Fear not! Pete Becker is not going away, but instead is starting up a new column that will show how to solve problems effectively using C and C++ (maybe even Java) and will address what Pete calls "low-level design" issues. We each get a chance to fly with new wings.

(I don't have a new column title yet, and before you ask, no I won't hold a contest. Look what happened to the last guy who tried that. Traumatized him so much he's going to have to take a sabbatical.)

My immediate task: wrapping up "The Learning C/C++urve" in the remaining three installments. I've accumulated quite a few odds-n-ends I've never had cause to publish, until now. I'll conclude my formal array discourse this month, leaving the next two months free for my column-topic clearance sale.

Array Length Revisited

In June I introduced templatized array references like
template<class T, ptrdiff_t N>
void f(T (&a)[N])
    {
    // 'a' is an array of length 'N'
    // containing type 'T' elements.
    }
While spelunking Usenet, I ran across a dead-obvious application of this technique, one efficiently converting an array to its length:
template<class Element,
    size_t Length>
inline
size_t lengthof(Element (&)[Length])
    {
    return Length;
    }
Note that the array object is not referenced directly, and hence is left unnamed — the template is merely a contrivance to extract the array parameter's length.

lengthof works as you'd expect:
int a[10];
size_t n = lengthof(a); // OK, sets 'n' to 10
and effectively replaces the idiomatic macro analogue:
#define LENGTHOF(a)\
    (sizeof(a) / sizeof(*a))
     
int a[10];
size_t n = LENGTHOF(a); // also sets 'n' to 10
lengthof solves LENGTHOF's biggest flaws: the lack of scope, the possibility of side effects (from evaluating a twice), and most importantly, the inability to distinguish pointers from arrays:
int *p = new int[10];

// runs, probably yields 1 or 2
size_t n = LENGTHOF(p);
This is another fine legacy of C's array/pointer symbiosis. Such errors are not possible with lengthof:
int *p = new int[10];

// error, 'p' doesn't match the
// function's parameter type
size_t n = lengthof(p);
I'm purposely declaring the length parameter as size_t and not ptrdiff_t as I did before [2]. This way I avoid casting problems, since the length type and the return type are the same, and reinforce the equivalence relationship between lengthof(a) and (sizeof(a) / sizeof(*a)).

Arrays vs. Vectors

At my Editor-in-Chief's suggestion, I checked out a long tortuous thread on news:comp.lang.c++. The topic: are Standard C++ vectors guaranteed to hold contiguous storage, and if so, should that storage always be user accessible? Such an arrangement would let vectors mix with older C-style functions and APIs expecting pointers to array elements:
vector<char> v;
// ...
strlen(v.pointer_to_internal_array());
vectors support random access via v[n] or the equivalent *(v.begin() + n), which is suspiciously like a[n] and *(a + n) for real arrays. However, v.begin() returns a vector::iterator, not a pointer, implying v.begin() + n doesn't use the same + operator as a + n.

Unless, of course, vector::iterator actually is a pointer. In that case, v.begin() must point to contiguous storage so that v.begin() + n works correctly. And that would allow
vector<char> v;
// ...
strlen(v.begin()); // OK, 'v.begin()'
                   // yields a 'char *'
Unfortunately, you have no guarantee that vector::iterator is a pointer. While real pointers make for an easy and simple implementation, they don't allow for reference counting, bounds checking, and other useful techniques. A vector's iterator may actually be a class object or some other distinctly non-pointer entity.

The Plot Thickens

Several postings on Usenet, and at least one prominent C++ book, imply that the storage in a vector is in fact always contiguous. Mr. C++ himself, Bjarne Stroustrup, had this to say in that same Usenet thread:

I suggested to the person who originally raised the issue of contiguous allocation of vector elements that he ask for clarification from the C++ standards committee through his national standards body. This has been done so that this issue will be officially settled in due course.

Until then, I will personally assume vector elements to be allocated contiguously despite the lack of a firm promise and despite the arguments brought forward to the effect that there should be no such guarantee.

If this assumption were true, then vector could potentially be rewritten, allowing portable access to its contiguous storage.

However, I can find no specific language in the C++ Standard supporting this assumption. Certainly a contiguous vector implementation makes sense, given a vector's operational semantics, And I personally can't think of a sensible alternative implementation matching those semantics. But making sense and being required are two different things [3].

My guess is the C++ committee never intended vectors to be guaranteed contiguous, even if every vendor implements them that way anyhow. But that's just a guess. As Bjarne hints, we'll have to wait for clarification of the committee's intent to be sure.

What If...

Just for fun, assume a vector's storage is always contiguous, and that vector::iterator is always a pointer:
vector<int> v1, v2;
// ...
int *p1 = v1.begin(); // OK
Now consider
v1.insert(v2.begin(), 1);
Is the value of p1 still valid?

Maybe, maybe not. If v1 has enough unused capacity to absorb the inserted values, everything's fine. But if v1 is too small, its storage must be relocated and expanded. In that case, the internal storage address would change, leaving p1 pointing to where the storage used to be.

This raises a more fundamental conceptual problem. Consider
int *p1a = v1.begin();
v1.insert(...); // may relocate 'v1' storage
int *p1b =v1.begin();
p1a and p1b both conceptually reference the beginning of the same container, yet may compare as different.

To get around this, vectors could support freezing, much like C++ string streams. Freezing would allow a non-const vector to temporarily act as if it were const:
v1.freeze(true);
int *p1 = v1.begin();
v1.insert(...);   // would fail for a frozen vector
                  // 'p1' is still valid
v1.freeze(false); // 'v1' is thawed
                  // 'p1' may become invalid
Alternatively, vectors could manufacture a contiguous block that mirrors or shadows the real internal storage, a dynamically allocated array filled on demand with the vector's real contents:
vector<int> v1;
// ...
int *p1 = v1.snapshot();
Here p1 references not the vector's real storage, but rather a clone of that storage guaranteed to be contiguous. This way, the vector can use whatever scheme it wants most of the time, creating a contiguous image only when explicitly required to.

The ownership of the storage referenced by p1 is unclear. Does the vector's destructor remove this storage, or is the user responsible? This is especially problematic for multiple references:
int *p1 = v1.pointer();
v1.insert(...);
int *p2 = v1.pointer();
The insertion may have forced a movement of internal storage. Presumably p1 references a copy of the old contents only. Does the vector automatically deallocate this (now out-of-date) block, invalidating p1? Or is the user responsible for deallocating all these blocks, and tracking which blocks contain which vintage of the vector's contents? And if the user must deallocate, is it by free, or delete [], or something else?

Finally, if vector is internally implemented as contiguous storage (a.k.a. an array), a vector's maximum size is bounded by an array's maximum size. For reasons I discussed in May's column, you're safest assuming arrays are limited to 65,535 bytes. This implies a single vector can portably hold at most 65,535 bytes.

That's bytes, not elements. If your vector is holding four-byte long values, it is limited to 16,383 elements. For users wanting large vectors, such a limit may be a dear price to pay for contiguous storage.

(Note that this is not a limitation specific to vectors. Any container implemented by a single array will be portably limited to 65,535 bytes.)

Alternatives to vector

Of course, vector doesn't have the features I cited above, and can't guarantee contiguous storage. But our Array class has held contiguous storage all along. Rather than make vector more array-like, perhaps we should make Array more vector-like.

The largest genetic difference between Array and vector is that the former is fixed-size while the latter can grow. Rather than add another member to Array (analogous to vector::insert), I'll instead modify Array::operator[] to automatically grow the Array when needed:
Array<int, 10> a;
a[9] = 9;   // OK, 'a' is big enough
a[10] = 10; // 'a' needs to grow to handle 11 elements
For such behavior to work, the internal array must be relocatable. This implies that the single fixed-size Array member array_ we've been using needs to be dynamically allocated. Array also needs to track the current length of this internal allocation; when a requested [] index exceeds this current length, the internal allocation must grow.

My candidate solution appears as Listing 1, and shows one more reason to support operator[] directly (vs. converting via operator T * and using the built-in [] operator). Of particular interest is how the two overloads of operator[] respond to an out-of-bounds index:

The writable (non-const) version silently grows the array to accommodate the new index.

The read-only (const) version doesn't grow the array, and assumes the index is in-bounds. A more robust solution would detect an errant index and notify the user.

As written, my solution is pretty naive. I've augmented the interface with the vector-esque member begin, but you'll almost certainly need others. As for implementation improvements, here's one hint:
Array<int, 10> a;
for (int i(0); i < 20; ++i)
    a[i] = i;
How many times is a's storage relocated in this loop? Can you reduce this number? And if so, how do you balance the need for maximum speed (fewer relocations) with the need for minimum space (fewer unused elements)?

Standard C++ containers face these same design problems. You can learn much by comparing and contrasting the different container specifications. Even if you never intend to use these containers, or decide to stick with C only, you'll find much to inform your own designs.

(Speaking of C, I haven't mentioned Variable Length Arrays, a.k.a. VLAs, in this discussion, principally because they don't help much. Like vectors, VLAs have their initial length set dynamically; but unlike vectors, VLAs can't grow after initialization. And of course, VLAs aren't supported by C++ anyway.)

Array Construction

When you declare a real C++ array such as
T a[10];
the element type T must have a default constructor available [4]. In this example, T::T() is called ten times, once for each element of a. If T's default constructor is unavailable:
class T
    {
private:
    T();
    };
the array definition
T a[10]; // error
won't compile [5]. This is true even if other constructors are available:
class T
    {
public:
    T(T const &);
    T(int);
    T(void *);
    };
     
T a[10]; // still an error
The same "feature" extends to the Standard C++ sequence containers (deque, list, and vector). Array follows this trend, thanks to
Array()
        :
        array_(new T[N]),
        length_(N)
    {
    }
where the expression new T[N] calls T's default constructor N times.

If you want to create an Array of elements that don't have default constructors, you can change Array's default constructor to something like
Array()
        :
        array_((T *)
        malloc(N * sizeof(T))),
        length_(N)
    {
    for (ptrdiff_t i(0); i < N; ++i)
        new(array_ + i) T(i);
    }
How this works:

malloc(N * sizeof(T)) allocates the space for the elements, but does not construct them. array_ holds the address of this space.

new(array_ + i) uses so-called "placement new." [6] Whereas plain new allocates memory before constructing an object, placement new does not; instead it uses its pointer argument (array_ + i here) as the address of memory within which to construct the object (that is, the "placement" or location of the constructed object).

T(i) calls the (non-default) T constructor accepting a ptrdiff_t argument.

The net result: a non-default T constructor gets called N times (once for each element) with the construction occurring within the space reserved by malloc. All this presupposes T has an appropriate non-default constructor available. As written, my solution also assumes malloc always succeeds.

The C++ language's rules impose the default-constructor requirement on array elements. Your class-implemented container objects, in contrast, can require whatever element constructor(s) you choose. By adapting the techniques I've shown the past few months, you can combine the efficiency and convenience of arrays with the flexibility of Standard C++ containers.

Trail's End

I may poke and prod at containers a bit more, but my long column arc is done. In the two months remaining before my big make-over, expect to see a grab bag of smaller, less connected items. (This fits with how I expect to run my Q&A anyway.)

Erratica

This time I have genuine spam, er, email from Scott Meyers:

I'm not quite sure why, but now nitpicking your column has become a habit :-) Tsk, tsk, and two demerits, but the type of abc is const char[4], not char[4] as you misinform your readers on p. 83 of the June CUJ. Get a clue or buy a vowel or something...

Scott

Forget vowels — I need to buy a qualifier. He's right, but didn't go far enough. Several of my code comments in that month's issue were wrong.

Standard C reckons string literals as arrays of non-const char. When I shifted from C to C++ in that column, I clearly failed to update the comments accordingly: the type of C++ string literals is array of const char, not array of char.

Everywhere you see a code line like
f("abc"); // OK, instantiates f(char (&)[4])
in that June column mentally substitute
f("abc"); // OK, instantiates f(char const (&)[4])
Be sure to make that a mental substitution, not a real one. Marking on your precious CUJs could reduce their potent collectible value.

Notes

[1] Of all people, my former wife's current husband most recently told me that, in reference to my current partner. Sounds like something from the Jerry Springer show.

[2] I wish this whole ptrdiff_t vs. size_t stuff would just go away. It's enough to make me miss Pascal, with which I did not know (or care) what size and sign my integers had.

[3] Ponder the implication of such contiguous storage requirements on bit-packed vector<bool>.

[4] Clearly this discussion assumes that T is some class or struct type.

[5] Metrowerks CodeWarrior Pro 3 on Mac OS shows a bug here (which I've reported), acting as if the default T constructor is still public.

[6] Dan Saks discusses placement new in his April 1997 CUJ column.

Bobby Schmidt is a freelance writer, teacher, consultant, and programmer. He is also an alumnus of Microsoft, a speaker at the Software Development and Embedded Systems Conferences, and an original "associate" of (Dan) Saks & Associates. In other career incarnations, Bobby has been a pool hall operator, radio DJ, private investigator, and astronomer. You may summon him on the Internet via rschmidt@netcom.com.