C/C++ Users Journal, January 2006

C++/CLI Serialization

Converting objects to a form that can persist on disk or be transmitted to other processes

By Rex Jaeschke

Rex Jaeschke is an independent consultant, author, and seminar leader. He serves as editor of the Standards for C++/CLI, CLI, and C#. Rex can be reached at http://www.RexJaeschke.com/.

Most useful applications depend on information of a more permanent nature than that generated during a single execution. For example, applications that access an inventory typically query (and possibly update) one or more related data files. The lives of such "master files" transcend that of the execution of any of the applications that use them. Other applications involve the communication of messages between separate programs, often referred to as client and server. While the life of a message is often much shorter than that of a database record, the creation of both involves the use of some data format external to the applications that manipulate them.

This month, we'll see how objects can be converted into some external form suitable for use in file storage or for transmission during interapplication communication. The process of converting to some external form is known as "serialization," while that of converting back again is known as "deserialization."

Introduction

Consider the example in Listing 1, which writes a number of values of a variety of object types to a new disk file, closes that file, and then reads those values back into memory again.

In case 1, we define a variable of type BinaryFormatter. Objects of this type allow the serialization and deserialization of any object or an entire graph of connected objects in some binary format. (We'll later see how alternate formats can be used.)

In case 2, we create a new file having the name shown. The suffix ".ser" has no special meaning; it's simply a local convention that signifies a serialized data file. Each of the cases 3a-3h results in an object being serialized to that file. In the case of the string, each character is written. In the case of the arrays, all elements are written. In the case of the DateTime, all data contained within that type and any associated dependencies are written. In the case of the primitive type values, they are first boxed and the corresponding objects are written. As such, Serialize need only be defined to accept an argument of type Object^.

We retrieve the serialized data by calling the function Deserialize, as shown in case 6a. Because that function returns a value of type Object^, we need to convert that to the appropriate type, which we do via a cast.

When executed on the date and time shown, the output in Figure 1 is produced.
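
Listing 1 is not reproduced here, so the following is only a minimal sketch of the same round trip, not the listing itself. It assumes a .NET Framework project compiled with /clr. The file name Sketch01.ser and the three sample values are chosen purely for illustration.

```cpp
// Sketch only: assumes compilation with /clr against the .NET Framework.
using namespace System;
using namespace System::IO;
using namespace System::Runtime::Serialization::Formatters::Binary;

int main()
{
    BinaryFormatter^ formatter = gcnew BinaryFormatter;

    // Write three values to a new file. The int and the DateTime are value
    // types, so they are boxed when passed as Object^.
    String^ text = "Hello";
    Stream^ file = File::Open("Sketch01.ser", FileMode::Create);
    formatter->Serialize(file, text);
    formatter->Serialize(file, 42);
    formatter->Serialize(file, DateTime::Now);
    file->Close();

    // Read the values back in the same order. Deserialize returns Object^,
    // so each result is cast (and, for value types, unboxed) explicitly.
    file = File::Open("Sketch01.ser", FileMode::Open);
    String^ s = safe_cast<String^>(formatter->Deserialize(file));
    int i = safe_cast<int>(formatter->Deserialize(file));
    DateTime dt = safe_cast<DateTime>(formatter->Deserialize(file));
    file->Close();

    Console::WriteLine("{0}, {1}, {2}", s, i, dt);
}
```

The values must be read back in the order in which they were written, because each call to Deserialize consumes the next serialized graph in the stream.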

Serializing Objects That Contain References

In the previous example, we wrote and read relatively simple types. How is an object that contains numerous handles to other objects serialized? Consider a dictionary of 20,000+ words, stored in a collection such that entries can be retrieved by key. The CLI Standard Library provides such a collection class, Hashtable (in namespace System::Collections), which is used in the example shown in Listing 2.

In case 1, we preallocate the Hashtable with an initial capacity of 21,000 entries. (This simply speeds up the process, as it does not require reallocation during the addition of a large number of entries.) Then we read in words, one per line, from a text file, and add each word to the Hashtable in case 2. The file "dictionary.txt" is available at http://www.cuj.com/code/. Note that, by definition, each entry in a Hashtable is composed of a key/value pair. However, because our key is the value, we use nullptr as the second argument.

Note that Hashtable key values must be unique, and that the type of any object used as a key must override System::Object's GetHashCode function (String does).

Once all of the words have been read in and added to the Hashtable, that Hashtable can be written out with one simple call to Serialize, as shown in case 4. In a separate application (see Listing 3), we read in this dictionary and perform lookups on it for each word provided by the user. An example of some input and the corresponding output from this application is shown in Figure 2.

The important lesson here is that we can serialize and deserialize an object of arbitrary size and complexity in a single function call.
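
To make the single-call point concrete, here is a small sketch that condenses the idea of Listings 2 and 3 into one program. It is not the article's code: the three sample words and the file name Words.ser stand in for dictionary.txt and the real output file, and a /clr build against the .NET Framework is assumed.

```cpp
// Sketch only: assumes compilation with /clr against the .NET Framework.
using namespace System;
using namespace System::IO;
using namespace System::Collections;
using namespace System::Runtime::Serialization::Formatters::Binary;

int main()
{
    // A tiny stand-in for the dictionary. The key is the word itself,
    // so the value is simply nullptr.
    Hashtable^ words = gcnew Hashtable(16);
    array<String^>^ sample = gcnew array<String^> {"apple", "banana", "cherry"};
    for each (String^ w in sample)
    {
        words->Add(w, nullptr);
    }

    // One call writes the whole table, including every key it holds.
    BinaryFormatter^ formatter = gcnew BinaryFormatter;
    Stream^ file = File::Open("Words.ser", FileMode::Create);
    formatter->Serialize(file, words);
    file->Close();

    // One call reads it back; the result is cast from Object^.
    file = File::Open("Words.ser", FileMode::Open);
    Hashtable^ restored = safe_cast<Hashtable^>(formatter->Deserialize(file));
    file->Close();

    // Look up one word that was stored and one that was not.
    String^ present = "banana";
    String^ absent = "durian";
    Console::WriteLine("{0} found: {1}", present, restored->ContainsKey(present));
    Console::WriteLine("{0} found: {1}", absent, restored->ContainsKey(absent));
}
```

Neither the writing nor the reading code needs to know how many entries the table holds; the formatter walks the whole graph on its own.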

Dealing with Multiple Handles

It seems obvious that when we pass a handle to an object to Serialize, a copy of the underlying object is written; however, is that what is really happening? What if we write out an object that contains multiple handles to some other object, or we call Serialize twice, each time giving it a handle to the same object? Do we really want multiple copies of the same object to be written? The application shown in Listing 4 demonstrates this process.

In this example, we wish to serialize objects of type Employee, a user-defined type defined in case 1. For this to work, we must attach the attribute Serializable to that type, as shown. If we attempt to serialize an object of a class type that is not marked with this attribute, an exception of type System::Runtime::Serialization::SerializationException is thrown. The output produced before serialization is shown in Figure 3.

Four graphs are serialized, one per call to Serialize. The first two graphs represent two different Employee objects while the third is really a reference to the second. The fourth graph is an array containing two elements, both of which refer to the first Employee object. The output shows these relationships. The output produced after deserialization is shown in Figure 4.

Note how the third Employee handle is no longer a handle to the second; instead of two Employee objects represented by three graphs, we have three Employees. Similarly, although list[0] and list[1] both refer to the same Employee object, that object is not the first one we got back.

We see then that when multiple graphs are serialized separately and they are interrelated, those relationships are not restored when those graphs are deserialized. However, relationships within any given graph are maintained.
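
The effect can be checked directly with Object::ReferenceEquals. The sketch below is an assumption-laden illustration rather than Listing 4: the type Worker is a hypothetical stand-in for Employee, the file name Handles.ser is arbitrary, and a /clr build against the .NET Framework is assumed.

```cpp
// Sketch only: assumes compilation with /clr against the .NET Framework.
using namespace System;
using namespace System::IO;
using namespace System::Runtime::Serialization::Formatters::Binary;

// Hypothetical stand-in for the article's Employee type.
[Serializable]
ref class Worker
{
public:
    String^ Name;
    Worker(String^ name) : Name(name) {}
};

int main()
{
    Worker^ w = gcnew Worker("Mary");
    array<Worker^>^ pair = gcnew array<Worker^> {w, w}; // one graph, two handles to w

    BinaryFormatter^ formatter = gcnew BinaryFormatter;
    Stream^ file = File::Open("Handles.ser", FileMode::Create);
    formatter->Serialize(file, w);      // graph 1
    formatter->Serialize(file, w);      // graph 2: the same object, written again
    formatter->Serialize(file, pair);   // graph 3: array whose elements share w
    file->Close();

    file = File::Open("Handles.ser", FileMode::Open);
    Worker^ a = safe_cast<Worker^>(formatter->Deserialize(file));
    Worker^ b = safe_cast<Worker^>(formatter->Deserialize(file));
    array<Worker^>^ p = safe_cast<array<Worker^>^>(formatter->Deserialize(file));
    file->Close();

    // Handles from separately serialized graphs: compared for identity.
    Console::WriteLine("a and b same object:       {0}", Object::ReferenceEquals(a, b));
    Console::WriteLine("p[0] and a same object:    {0}", Object::ReferenceEquals(p[0], a));
    // Handles from within one graph: compared for identity.
    Console::WriteLine("p[0] and p[1] same object: {0}", Object::ReferenceEquals(p[0], p[1]));
}
```

Following the behavior described above, the first two comparisons should report distinct objects, while the two array elements should still refer to a single object. If relationships between objects must survive, one option is to place those objects in a common container and serialize that container as one graph.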

Customized Serialization

By default, when an object is serialized, all of its nonstatic fields are written out and subsequently read back in during deserialization; however, for classes that also contain static fields, this can present a problem.

The example in Listing 5 uses class Point, which not only contains instance variables to track each Point's x- and y-coordinates, but also keeps track of the number of Points that have been created during this application's execution, in a static variable. This count is accessible via the property PointCount. For example, in Listing 5, four Points are created via explicit constructor calls, and then serialized to a disk file. When they are deserialized, four new Points are created, so the total point count is now eight, as shown by the output in Figure 5.

The output produced confirms that the point count field is being incremented correctly during deserialization. Now let's look at the Point class itself (see Listing 6, available online at http://www.cuj.com/code/).

Whenever a new Point is constructed using either of these public constructors, the point count field is incremented. A problem arises when we deserialize one or more Points, however. A call to Deserialize for a Point effectively creates a new Point object, but it does not cause either of these constructors to be called for that object. Specifically, PointCount will not be incremented automatically, even though the number of Points has increased by 1. We can override the default serialization and/or deserialization behavior by implementing the interface ISerializable (from System::Runtime::Serialization), as in case 1. This interface requires us to define a function called GetObjectData having a specific signature, and that function allows us to override the serialization process.

The purpose of the function GetObjectData is to populate a SerializationInfo object with the data needed to serialize an object of the type that implements it, in this case, Point. The name, value, and type information is provided to the function AddValue, which is overloaded on its second argument for each of the simple types and for Object^. The name string can be arbitrary, so long as it is distinct for each value being serialized in this type. (If a duplicate name is used, a SerializationException is thrown.)

The StreamingContext parameter is ignored in this example; it need only be used in special circumstances, and is not discussed further here.

To override the deserialization process, we must define another constructor, also with a particular signature. Note that this constructor is private. Because it is only ever called by the deserialization machinery, there is no need to make it more accessible.

In cases 8a and 8b, we call GetInt32 to restore the x- and y-coordinate values, respectively. (There is a Get* function for each primitive type as well as for type String.) In cases 9a and 9b, we call GetValue to restore the y- and x-coordinate values, respectively, again by name; this is simply an alternative approach. Finally, in case 10, we increment the counter for the Point being deserialized.

We've done all this work just to increment one counter. Is all that work necessary? Yes! We need the special constructor so we can replicate the code that is executed when the other "normal" constructors are called (in this case, the incrementing of the counter). However, this special constructor is not called unless the class implements the interface ISerializable, and once we do that, we have to provide a definition for GetObjectData as well.
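
Since Listing 6 is available only online, the following is a reduced sketch of the pattern, not the listing itself. The field names, the entry names "X" and "Y", and the use of a MemoryStream in place of a disk file are assumptions made for brevity; a /clr build against the .NET Framework is assumed as well.

```cpp
// Sketch only: assumes compilation with /clr against the .NET Framework.
using namespace System;
using namespace System::IO;
using namespace System::Runtime::Serialization;
using namespace System::Runtime::Serialization::Formatters::Binary;

[Serializable]
ref class Point : ISerializable
{
    int x;
    int y;
    static int pointCount;  // static fields start at zero

    // Special constructor: called only by the deserialization machinery.
    Point(SerializationInfo^ info, StreamingContext context)
    {
        x = info->GetInt32("X");                               // typed getter
        y = safe_cast<int>(info->GetValue("Y", int::typeid));  // general getter
        ++pointCount;   // replicate what the normal constructor does
    }

public:
    Point(int xValue, int yValue) : x(xValue), y(yValue)
    {
        ++pointCount;
    }

    static property int PointCount
    {
        int get() { return pointCount; }
    }

    // Called by the formatter during serialization; names must be unique.
    virtual void GetObjectData(SerializationInfo^ info, StreamingContext context)
    {
        info->AddValue("X", x);
        info->AddValue("Y", y);
    }
};

int main()
{
    Point^ p = gcnew Point(3, 4);

    BinaryFormatter^ formatter = gcnew BinaryFormatter;
    MemoryStream^ buffer = gcnew MemoryStream;
    formatter->Serialize(buffer, p);

    buffer->Position = 0;   // rewind before reading
    Point^ q = safe_cast<Point^>(formatter->Deserialize(buffer));

    // One Point was constructed explicitly and one was deserialized.
    Console::WriteLine("PointCount = {0}", Point::PointCount);
}
```

The names passed to AddValue and to the Get* functions are the only link between the two halves, so they must match exactly.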

Another example of augmenting the serialization and deserialization process involves the calculation of checksums. When the object is written out, extra checksum information can be appended to the output stream and then read back in later to test the integrity of the data.

An example of replacement involves the use of encryption. Instead of writing a field's value out directly, that value is first encrypted, with the decryption being applied when the value is read back in.

Identifying the Fields to be Serialized

The default serialization process causes all nonstatic fields to be written out and read back in; however, what if our class contains one or more fields whose values are of a temporary nature, with no utility beyond the current execution? How can we ensure that such a field's value will not be serialized? We achieve this via the attribute NonSerialized (see Listings 7 and 8, available online), which overrides the Serializable attribute that has been applied to the class as a whole. The output produced is shown in Figure 6.

The fields count and description are serialized, while temp1 and temp2 are not. When an object is restored via default deserialization, all fields having the NonSerialized attribute take on their default values, such as zero, false, or nullptr, depending on their type.
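
As Listings 7 and 8 are available only online, here is a small sketch of the same idea. The type Record, the field types, and the sample values are hypothetical; only the four field names are borrowed from the discussion above. A /clr build against the .NET Framework is assumed.

```cpp
// Sketch only: assumes compilation with /clr against the .NET Framework.
using namespace System;
using namespace System::IO;
using namespace System::Runtime::Serialization::Formatters::Binary;

// Hypothetical type; the field types are assumptions.
[Serializable]
ref class Record
{
public:
    int count;
    String^ description;
    [NonSerialized] int temp1;       // skipped by the formatter
    [NonSerialized] String^ temp2;   // skipped by the formatter
};

int main()
{
    Record^ r = gcnew Record;
    r->count = 5;
    r->description = "widgets";
    r->temp1 = 99;
    r->temp2 = "scratch";

    BinaryFormatter^ formatter = gcnew BinaryFormatter;
    MemoryStream^ buffer = gcnew MemoryStream;
    formatter->Serialize(buffer, r);

    buffer->Position = 0;   // rewind before reading
    Record^ copy = safe_cast<Record^>(formatter->Deserialize(buffer));

    // The serialized fields come back; the others hold default values.
    Console::WriteLine("count = {0}, description = {1}", copy->count, copy->description);
    Console::WriteLine("temp1 = {0}", copy->temp1);
    Console::WriteLine("temp2 is null: {0}", copy->temp2 == nullptr);
}
```

If a nonserialized field needs a meaningful value after deserialization, the class has to re-create it itself, for example in the special constructor described in the previous section.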

Serialization Format

In all the examples of serialization thus far, we've used the type BinaryFormatter, which stores the data in some unspecified format that is compact and that can be processed efficiently. However, other formats are possible. For example, a SOAP formatter can be used. SOAP (Simple Object Access Protocol) is "a simple, XML-based protocol for exchanging structured and type information on the Web. The protocol contains no application or transport semantics, which makes it highly modular and extensible." Other custom formatters can also be created.
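
Switching formats can be a small change, because both formatter types implement the interface IFormatter. The sketch below is an illustration under assumptions: the type Item and its values are hypothetical, the SOAP formatter is taken from the separate assembly System.Runtime.Serialization.Formatters.Soap.dll, and a /clr build against the .NET Framework is assumed.

```cpp
// Sketch only: assumes compilation with /clr against the .NET Framework.
#using <System.Runtime.Serialization.Formatters.Soap.dll>

using namespace System;
using namespace System::IO;
using namespace System::Runtime::Serialization;
using namespace System::Runtime::Serialization::Formatters::Soap;

// Hypothetical type used only for this illustration.
[Serializable]
ref class Item
{
public:
    int quantity;
    String^ name;
};

int main()
{
    Item^ item = gcnew Item;
    item->quantity = 12;
    item->name = "bolt";

    // Only this line would differ when using BinaryFormatter instead.
    IFormatter^ formatter = gcnew SoapFormatter;

    MemoryStream^ buffer = gcnew MemoryStream;
    formatter->Serialize(buffer, item);

    // The SOAP form is XML text, so it can be displayed directly.
    Console::WriteLine(Text::Encoding::UTF8->GetString(buffer->ToArray()));

    buffer->Position = 0;   // rewind before reading
    Item^ copy = safe_cast<Item^>(formatter->Deserialize(buffer));
    Console::WriteLine("{0} x {1}", copy->quantity, copy->name);
}
```

The XML output is readable and easier to inspect, at the cost of a larger serialized form than the binary format produces.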

Exercises

To reinforce the material we've covered, perform the following activities:

  1. Using Listing 7 (available online) as the basis, implement Pair without using the NonSerialized attribute, by overriding the serialization and deserialization processes instead.
  2. Listing 9 (available online) contains a generic vector class and application. The application serializes and deserializes Vectors of various types using the default process. Modify class Vector's definition such that elements containing the default value are never saved during serialization, by overriding the serialization and deserialization processes. For example, a Vector of 10,000 ints, most of whose values are zero, should take much less space when stored in some sort of compressed format. Experiment with arrays having varying levels of sparseness and element count, comparing the size of the resulting serialized file. Make sure you consider the possibility of a function serializing multiple Vectors, possibly with other object and/or fundamental types in between.
