Columns


Questions & Answers

Pete Becker

Handling Variable-Size Objects

Pete deals with a couple of sticky problems dealing with object construction — mosty by avoiding the original problem.


To ask Pete a question about C or C++, send e-mail to petebecker@acm.org, use subject line: Questions and Answers, or write to Pete Becker, C/C++ Users Journal, 1601 W. 23rd St., Ste. 200, Lawrence, KS 66046.

Q

My question is about variable-sized objects, where it is possible that different instances would have different sizes. What are the side effects and pitfalls? I cannot see any way to make arrays of these objects work, so one must overload operator new[] and cause it to always fail. (One might be able to create a template class which implemented an array.) One would have to carefully implement the assignment operator, copy constructor, and new operator. I find it very unfortunate that the sizeof operator whose argument is an instance gives a value which is very different from what one's intuition would suggest. I would like to be able to overload the sizeof operator but that would also imply a change in its definition. I suspect that there are many things that I am missing and that your answer is probably "just don't do it."

— Eric Beals

A

I couldn't possibly say "just don't do it" without having more information. The real issue involved in deciding whether to use a particular coding technique is to understand the design problem to be solved and to decide what is the best technique for solving that problem. You're right to ask what the side effects and pitfalls of this approach are, because that's part of choosing the best solution. The other part is knowing what other techniques are applicable, and understanding their side effects and pitfalls. I can make a guess at the problem you're trying to solve, and I'll do that after taking a look at the problems that this technique can cause.

The C and C++ FAQs are always a good place to start when you have a question about programming in C or C++. They're available by ftp from these locations:

ftp://rtfm.mit.edu/pub/usenet-by-group/comp.lang.c/
ftp://rtfm.mit.edu/pub/usenet-by-group/comp.lang.c++/

The C++ FAQ is also accessible in browsable form at this site:

http://www.cerfnet.com/~mpcline/c++-faq-lite/

Here's what the C FAQ has to say about variable-sized objects:

2.6: I came across some code that declared a structure like this:

struct name {
  int namelen;
  char namestr[1];
 };

and then did some tricky allocation to make the namestr array act like it had several elements. Is this legal or portable?

A This technique is popular, although Dennis Ritchie has called it "unwarranted chumminess with the C implementation." An official interpretation has deemed that it is not strictly conforming with the C Standard. (A thorough treatment of the arguments surrounding the legality of the technique is beyond the scope of this list.) It does seem to be portable to all known implementations. (Compilers which check array bounds carefully might issue warnings.) Another possibility is to declare the variable-size element very large, rather than very small; in the case of the above example:

char namestr[MAXSIZE];

where MAXSIZE is larger than any name which will be stored. However, it looks like this technique is disallowed by a strict interpretation of the Standard as well. Furthermore, either of these "chummy" structures must be used with care, since the programmer knows more about their size than the compiler does. (In particular, they can generally only be manipulated via pointers.) References: Rationale Sec. 3.5.4.2.

The reference to "tricky allocation" simply means that when you use this struct you malloc enough space for the struct and the additional bytes that you need to hold your string, all in a single block, then pretend that namestr is actually an array that's large enough to hold your string. Like this:

struct name *nm =
    malloc( sizeof(struct name) + sd strlen(text) );
nm->namelen = strlen(text);
strcpy( nm->namestr, text );

(No, you don't have to add one to strlen(text) in the call to malloc; there's already an extra byte present, in namestr[1]). The assumption that this code makes is that you can access allocated memory that's beyond the end of the declared size of the struct. It's hard to imagine an implementation for which this would not work, and, as the FAQ says, this is, in practice, portable. Unless, of course, your compiler believes you when you say that namestr contains only one char, and tries to hold you to your word by disallowing access beyond the end of the one-byte array.

The usual C technique for copying structs with memcpy and sizeof won't work with objects of this type, because, as you point out, sizeof does not give the correct size of the object. It tells you what the compiler thinks the object's size is. Since you've misled the compiler, you have to handle figuring out the object's size by yourself. That's not hard to do, but it's different from the technique that you use with ordinary objects, and that makes it error prone.

Of course, if you create auto objects of this type they lose their special properties and become almost useless. The same limitation applies if you try to use them as members of another struct. That's what the FAQ means when it says "they can generally only be manipulated via pointers." In short, this type allows you to do things with it that don't make sense. That's not wholly inappropriate in C, with its tradition of type punning. Part of learning to program well in C is learning to handle objects that aren't quite what they appear to be.

C++ takes a somewhat different view of types. The encapsulation that's possible in C++ allows you to restrict the operations on a type to only those that make sense. So, for example, if you use this technique in C++, you should write this struct in a way that prevents creating auto objects with it. You'd do this by making its constructors private and providing static functions to create such objects on the free store. That also restricts the problems involved in copying, since you only give users of this class access to pointers and not to the objects themselves. Users can still use memcpy to try to copy these objects, but that's a rare technique in C++, and ought to raise a warning flag in any reader's mind. You also need to prevent users from deriving classes from this type, since, once again, it loses all its special properties when you derive from it. By properly encapsulating these objects you can make it hard for users to use them in inappropriate ways.

A type that can't be used to create an auto object and that can't be derived from is a bit peculiar, and ought to make you ask yourself "is this type really necessary"? I suspect that it's rarely needed. You can get exactly the same behavior in a struct by using a pointer:

struct name
{
int namelen;
char *namestr;
};

When you create an object of this type you must allocate space for the char array. That allocation can be cleanly encapsulated in a constructor, so your users don't need to remember to do it:

struct name
{
 name( const char *str )
 {
  namelen = strlen(str);
  namestr = new char[namelen+1];
  strcpy( namestr, str );
 }
 ~name()
 {
 delete [] str;
 }
private:
 int namelen;
 char *namestr;
};

Now we have a type that does not have the limitations of the earlier version. It can be used to create auto objects, and it can be used as a base class. By removing the limitations of the earlier version we've created a type that is much more useful.

This version does have a couple of drawbacks, however. Compared to the earlier version it requires more memory and it's a bit slower. The extra memory comes about because of the pointer. Both versions use the same number of bytes to hold namelen and the same number of bytes to hold the actual text. The new version contains a pointer that the earlier one did not. This extra memory makes a difference in application domains such as CAD, where there can be millions of these objects hanging around in memory. Reducing memory requirements by a few million bytes can significantly improve an application's performance by reducing or eliminating the need to swap data from main memory to disk.

The speed difference between these two versions arises from two sources: the need in the newer one to allocate space for the char array separate from the space for the object itself, and from the extra pointer indirection required to access the contents of the char array. If you can demonstrate that either of these operations is a significant bottleneck in your application, you may be justified in using the earlier version of this struct. However, before you do this, be sure to explore other ways of eliminating the bottleneck, such as customized memory management techniques for the char array.

In short, this technique involves defining a type that isn't really a type, because of the limitations you must impose on it in order to use it safely. We need techniques like this in our toolkits for those rare occasions when nothing else will work. There are almost always better ways to accomplish the same thing, however. You should not opt for clever coding techniques like this until you have exhausted the other possibilities, and can honestly say that this is the best approach for your application. I won't say "just don't do it," but I will say, "is this type really necessary?"

Q

I am trying to derive a new class from an existing class (Borland 4.0 class TDate) so it has an additional constructor. See the program listing that follows. I can't seem to find a way to do this. Is this possible or do I have to modify the original class description?

a) When inheriting from a base class, as shown below, is it necessary to redefine the base constructors if you want to use them ?

// date class derived from class TDate
#ifndef DATE
#define DATE
#include <classlib/date.h>
#include <string.h>
#include <stdlib.h>

class Date : public TDate {
public:
 Date (char * str);
 Date () : TDate(){}
 Date (DayTy day, YearTy year)
  : TDate(day, year){}
 Date (DayTy day, const char * month, YearTy year)
  : TDate(day, month, year){}
 Date (DayTy day, MonthTy month, YearTy year)
  : TDate(day, month, year){}
 Date (istream & is)
  : TDate(is){}
 Date (const TTime & time)
  : TDate(time){}
};
#endif DATE

b) Why does the compiler produce the error "Error DATE.CPP 23: Undefined symbol 'TDate::TDate()' in function Date::Date(char *)" in the code shown below?

Date::Date (char * str) : TDate() {
 DayTy d;
 MonthTy m;
 YearTy y;
 char buf[3], td[9];
 memset(buf, 0, 3);
 memset(td, 0, 9);
 strncpy(td, str, 8);
 if( !(strchr(td, '/') || strchr(td, '-'))) {
  m = atoi(strncpy(buf, td, 2));
  d = atoi(strncpy(buf, &td[2], 2));
  y = atoi(strcpy(buf, &td[4]));
  }
 else {
  m = atoi(strtok(td,"/-"));
  d = atoi(strtok(NULL,"/-"));
 y = atoi(strtok(NULL,"/-"));
  }
 TDate::TDate( d, m, y );
/*
 delete this;
 this = ::new TDate( d, m, y );
*/
}

— Bob Williams

A

When I get into trouble trying to implement something I often find that it's worthwhile to take a step back and try to solve a simpler problem. Usually the trouble I've run into is the result of getting confused by trying to do too many things at once. In this example, your code tries to convert text into a date at the same time that it is trying to initialize a base class. If you separate these two things your job becomes much easier. Let's begin by figuring out how to handle the conversion.

The class TDate has a constructor that takes an istream. This constructor reads input from a stream and converts it into the internal format used by TDate. Remember that streams aren't just files: there's a class strstream that talks to an array held in memory. We can use this form of stream to convert text into a date:

#include <iostream>
#include <strstream>
#include <classlib/date.h>

int main( int, char *argv[] )
{
 istrstream strm(argv[1]);
 TDate date(strm);
 cout << date << endl;
 return 0;
}

If you compile and run this program, it produces the following output:

C:>test 01/23/49
January 23, 1949

You don't want to have to remember to write two lines of code every time you need to do this conversion, so you should encapsulate the conversion in a function:

#include <iostream>
#include <strstream>
#include <classlib/date.h>

TDate make_date( const char *date_buf )
{
 istrstream strm(const_cast<char *>(date_buf));
 return TDate(strm);
}

int main(int, char *argv[])
{
 TDate date = make_date( argv[1] );
 cout << date << endl;
 return 0;
}

This version produces the same output as the previous one.

It looks a bit strange to cast date_buf into a pointer to modifiable char when we construct our istrstream, but that's what the iostream library used to require. That's been changed in the current ANSI/ISO working paper: the constructor now takes a const char *. It will take some time yet for implementations to catch up.

This function converts text to a TDate object, but it doesn't quite solve your conversion problem. Inspecting the code in the constructor you wrote shows that you want more flexibility in your conversions than this function provides. In particular, your constructor takes a text string in the form "MMDDYYYY" and converts it into a Date object. My version does not, because the TDate constructor that takes an istream requires delimiters between the month and day and between the day and year. If we try to pass an undelimited string such as "01231949" to this constructor, we get this:

C:>test 01231949
February 28, 2935093

Clearly, we need to make our conversion function a bit smarter to handle this format. That's not hard: all we need to do is check for delimiters. If there are delimiters in the string, we create a stream and construct a TDate from that stream, just as we did before. If not, we parse the string ourselves:

TDate make_date( const char *date_buf )
{
 if(  strchr( date_buf,'/' ) != 0 || strchr( date_buf, '-' != 0 ) )
  {
   // we have delimiters, so
   // can construct from stream
   istrstream strm(const_cast<char *>(date_buf));
   return TDate(strm);
  }

 // no delimiters, so we're on our own
 char temp[3];
 temp[0] = date_buf[0];
 temp[1] = date_buf[1];
 temp[2] = '\0';
 int month = atoi(temp);

 temp[0] = date_buf[2];
 temp[1] = date_buf[3];
 int day = atoi(temp);

 int year = atoi(date_buf+4);

 return TDate(day,month,year);
}

Now we get the correct result:

C:>test 01231949
January 23, 1949

Notice that I didn't use strcpy or strncpy in writing this conversion function. They're really not needed when you're only dealing with two characters at a time.

Now that we can convert a text array into a TDate, let's look at implementing the constructor for your Date class:

Date::Date(const char *str)
  : TDate(make_date(str)) {}

That's all it takes: we use the conversion function to create a TDate object, and initialize the base class using that object and the base class' copy constructor.

Hmm, I still haven't answered either of your questions, have I? What were they again? Oh, I remember: do you have to duplicate all of the base class constructors in the derived class? Yes, you do. The legalistic reason for this is simply that the language definition says that constructors are not inherited. There's a good practical reason for this rule. If constructors were inherited you'd have major coding headaches. For example:

class Base
{
public:
  Base( int );
};

class Derived : public Base
{
public:
  Derived()
   : text("Hello,world!")
   {
     cout << text << endl;
   }
private:
  const char *text;
};

Derived d(1);  // trouble...

The reason that the last line would produce trouble is that the constructor for Base does not, and cannot, initialize the text field in Derived. There is no constructor for Derived that takes an int, but the one in Base does. If the constructor from Base were inherited, this line would create an object with an uninitialized pointer. Attempting to display whatever this uninitialized pointer points to will probably not accomplish anything useful. So the rule is that constructors are not inherited, and as you did in your header, you must duplicate all of the constructor signatures that you want to use, forward their arguments down to the base class' constructor, and perform any additional initialization that your class needs.

You also asked about an error message that the compiler gave you. I don't have BC++ 4.0 any more, and 5.0 doesn't give this error. But there are a couple of things down in that end of the code that I'd like to comment on.

First, be careful with strncpy. If it stops copying because it reached the character count without seeing a terminating null character, it will not write a null character into the target buffer. For example:

char buf[4];
memset( buf, 'x', 4 );
strncpy( buf, "abcdefg", 3 );

After executing strncpy, buf will contain the characters a, b, c, and x. There will be no terminating null character, and any code after this call to strncpy that assumes a null terminator will not work correctly. My guess is that that's the reason for those memset calls at the beginning of your code: they made some problems go away. It's better to avoid problems like this, rather than patching them up afterwards. If you are dealing with null-terminated strings, use strlen to figure out whether the string fits, and if it doesn't, give an error message. Truncating the string, even if you add a terminating null, will rarely be the right way to handle an input error like this.

Second, you've been struggling to try to change the contents of the base class after it's been initialized. The line TDate::TDate(m,d,y); doesn't do this. All it does is create an unnamed object of type TDate, then destroy it.

The code that's commented out also doesn't work, which is probably why it's commented out. Using delete this is rarely a good idea. (Personally, I think it's always an indication of a design error, but some people disagree with me.) Remember, delete releases memory that was allocated from the free store. If the memory that the object occupies was not allocated on the free store, delete this will have various random effects. It will not work sensibly with an object on the stack.

The next line attempts to assign to this, which is not a valid operation. It used to be okay and it was the mechanism for changing the memory allocation scheme for a class. Assigning to this was never a very satisfactory solution to this problem, and was quickly replaced with class-specific operator new and operator delete. Some compilers might still accept it, but even if you have a compiler that allows it, don't use it.

As we saw earlier, the easiest way to initialize a base class is in the initializer list. You can write a function, as we did, to do an elaborate calculation, and simply pass the result to the appropriate constructor. If for some reason that's not a suitable solution, and you really do need to change the values in the base class from inside the constructor body, use the assignment operator. In your example, replace TDate::TDate(m,d,y); with

TDate temp(d,m,y);
TDate::operator=(temp);

That is, create a temporary object using the values you've computed, and assign the value of that object to the base class. If you prefer more terse statements, do it like this:

TDate::operator=(TDate(d,m,y));

Separating your task into two parts helped us improve the design. We simplified the constructor for your Date class by focusing on initialization in the constructor and pushing string parsing off into a separate function. We now have a function that parses a text string to produce a date, and if we need to, we can use that function, unchanged, in other code. This separation makes all of the code easier to understand, and makes it easier to solve the implementation problems that you ran into. o

Pete Becker is a consultant who specializes in teaching C and C++ to experienced programmers. He spent eight years in the C++ group at Borland international, both as a developer and manager. He is a member of the ANSI/ISO C++ standardization committee. He can be reached by e-mail at petebecker@acm.org.