August 1991/Copy-On-Write Objects For C++

Features

Copy-On-Write Objects For C++

Ronald G. White

Ronald G. White has an M.S. in computer science and 20 years of programming experience. He is the author of Tempo for Windows and founder of Tesuji Technology, where he does consulting and custom programming. He specializes in scientific, system, and graphical user interface programming. You may reach Ronald at Tesuji Technology, P.O. Box 4305, Boulder, CO 80306, or at 71500,3566 on CompuServe.
To learn C++, I began working on a large project that I hoped would force me to learn more than I could ever learn just from reading books. During this work, I came up against a problem that seemed ideal for a C++ solution. After several attempts, I came up with what seemed to be a reasonable solution. In this article, I present the problem and my solution. I hope that those of you who are also new to C++ can learn something from it and that those of you who are more sophisticated C++ programmers will take it as a challenge to show me how I could have done it better.
First, the problem: For the project on which I was working, I wanted to maintain multiple generations of many data objects, some of which were quite large. (The project is a program to play the game of Go. Each generation represents a move on the board, and the data objects contain information about the situation on the board. I wanted to maintain multiple generations so the player could undo some number of plays and get back to an earlier situation.) Each generation of objects might differ from the last by changes, additions, and deletions of only a small percentage of the objects. To keep the functions simple, I was copying all of the objects for each new generation. Consequently, there were often multiple copies of the same, unchanged object. As the complexity of this data increased, I realized that I was in danger of running out of memory.
Like many problems in writing software, there is more than one way to solve this problem. Virtual memory is one solution, but I didn't want to deal with the overhead associated with reading from or writing to disk (although I may eventually need to use virtual memory). I could keep track of which object was associated with which generations, but this adds complexity at too high a level in the code. I wanted objects that would only copy themselves when necessary. I think of these as Copy-On-Write, or COW, objects.
This idea comes from a scheme I had read about years ago for keeping multiple generations of data files while using up a minimum amount of disk space. In this scheme, each block in the data file is found via pointers in the header of the data file. A new generation of the file is made by duplicating the header but not the data blocks. Only when the data in a particular block gets changed and must be written back to disk is it necessary to make an actual copy of that data block and modify the header to point to this new block.
Instead of data blocks in a file, I wanted objects in memory that would only get copied when necessary and a solution in which the objects themselves would know when to split into two exact replicas. This proved to be more complicated than I had expected.
Suppose I have three generations of the object A. I haven't made any changes to A since it was originally created in generation one. If I have actually made copies of A, I would have the situation depicted in Diagram 1. I have three pointers, pA1, pA2, and pA3 pointing to three separate copies of the same object: A1, A2, and A3. On the other hand, if I had not made copies of A for each generation, I might have the situation in Diagram 2. The three pointers all point to the same object — the original A1.
Now suppose that in generation three I call a function of A that would change the data in A. For example, pA3->add(new_data). In Diagram 1, this is no problem. A1 and A2 will be unaffected. For the situation in Diagram 2, however, the object A1 would have to duplicate itself, creating a new object, A3, which could then safely be modified without affecting the version pointed at by pA1 and pA2. The problem arises because pA3 still points to A1 and the object A cannot access this pointer or easily change it.
My solution was to add a layer of indirection. As shown in Diagram 3, the generation pointers point to virtual objects vA1, vA2, and vA3. This virtual object is small, possibly no more than a pointer, and inexpensively copied. It points to the actual object, A1. All access to the actual object is through the virtual object, and the functions that use the generation pointer (e.g., pA1) need never know that the virtual object is not the actual object.
The virtual object knows which functions cause changes to the real object and has a way of checking if other virtual objects are pointing to the same real object (I use a counter in the real object). If both of these are true, the virtual object can duplicate the real object, change its pointer to point to the new copy (decrement the counter in the old copy), and proceed with the requested change.
Listing 1, cow.h, is the header that defines the two base classes. The first class, Obj_COW, is the base for the actual, or Copy-On-Write, objects. The second, Obj_Virt, serves as a base class for virtual objects.
The only data associated with Obj_COW is count. count keeps track of how many virtual objects are pointing at this real object. Obj_Virt is declared a friend class so it can manipulate the count and any other data stored in the actual object. A pure virtual function, dup, must be defined for each instance class. This function is used by Obj_Virt when an actual copy of the real object must be made.
The Obj_Virt base class is more complicated. The only data it maintains is po, a pointer to the real object, but it defines a number of functions. The void constructor sets po to NULL instead of creating a real object. If this constructor created an Obj_COW object, it would be the wrong type of object for the derived classes. Instead, the derived virtual object must be responsible for creating an instance of the derived real object. A copy constructor and copy operator are provided, but the derived virtual class must redefine these so the parameter type is correct. The default copy constructor and copy operator definitely cannot be used because they would only copy the virtual object, including the pointer, without updating the real object's count field.
The Obj_Virt's destructor calls the function release, which decrements the real object's count field and deletes the real object if the count is zero. The function set_ptr, which is called by the copy constructor and the copy operator, makes a virtual copy. It calls release in case the virtual object is currently pointing to a real object, sets the pointer to the real object, and increments the count field.
The function print is only provided here for debugging. The functions get_ptr and set_ptr are provided so derived classes do not have to access the pointer directly. The last function, split, is called by functions of the derived virtual class that intend to change the real data. It checks if the real object is being used by multiple virtual objects. If so, it calls the real object's dup function to get a real copy of the object and adjusts the count fields of both the original real object and the new real object.
Listing 2 gives an example of the use of the COW base classes. Foo_COW is the real class derived from Obj_COW. As data, it contains three integers, a, b, and c. The virtual class, Foo, is declared a friend class so it can access the data here. The Foo_COW constructor takes three integers and initializes the corresponding data fields. The dup function, which is required for any class derived from Obj_COW, creates a new Foo_COW object with the data values from the current object. (Note that dup must cast the pointer value it returns.)
The virtual class Foo is derived from Obj_Virt. Its constructor, which also takes three integers, creates a new Foo_COW object with the three integers and sets the pointer to point to this new Foo_COW object. The copy constructor casts its parameter so the Obj_Virt copy constructor can then be used. The copy operator is necessary, as mentioned, to avoid the default operator. The Foo class defines a private function, actual, to more easily access the real data object, Foo_COW. This is used in the functions get_a, get_b, and get_c to return the current data values. The function print is, as before, only for debugging. Finally, the function reset resets the data values. Because reset will change the real object, it calls the function split before doing so to ensure that it changes only its own copy of the real object.
The main function contains code that creates some Foo objects, manipulates them, and prints out their data. Listing 3 shows the output of running this test program (built using Turbo C++ and the small memory model). The three virtual objects all have different addresses, but, correctly, the real objects to which they point are sometimes the same. Also note that the use of the copy operator deletes the original real object and the reset function causes a new allocation of a real object that just happens to have the same address as the object deleted.
My solution to the original problem seems to be workable. I don't find it particularly elegant, however. The fact that there are two base classes, requiring two derived classes for every object type, seems overly complicated. It is also unfortunate that the derived classes must redefine many of the base class functions. I don't doubt that there are more sophisticated C++ programmers out there who could do a better job. If you're one of them, let me know how you would do it — I'm still learning.

Notes
The syntax if (0 == - -po->count), with the constant fIrst, in the function release, may seem odd to experienced C programmers. I write comparisons this way as a result of a suggestion I read in a CUJ article. It allows the compiler to catch the common error of mistakenly using = instead of ==.
In the definition of classes, I have put private data last instead of first. It makes more sense to me that someone looking at a header full of class definitions see the public information first. Often it would be better if they couldn't see the private information at all.
Experienced C++ programmers may wonder why I didn't redefine the -> operator for virtual objects so that it would return the address of the real object to deal with the indirection. I thought about this but could not figure out how to make it work so the real object was duplicated only when necessary. Perhaps one of you can write another article explaining how to make it work.