XiMoL: An XML Stream Parser

C/C++ Users Journal September, 2004

An XML serialization library based on streams

By Cyril Godart and Florent Tournois

Cyril Godart is a quantitative researcher at BNP Paribas in London. Florent Tournois is a software architect at INSEE (Institut National de la Statistique et des Études Économiques). They can be contacted at cyril.godart@bnpparibas.com and florent.tournois@laposte.net, respectively.

XML had its first successes in fields related to the Internet and communications. As the heir of SGML, however, XML is also a metalanguage for describing data structures. Many programmers use XML as a language-neutral description of object-oriented data structures, mainly because of the wealth and power of its processing tools. It is also used in this context to marshal object states across all sorts of hardware and software layers. The process of dumping object states to a physical medium is called "serialization." Traditionally, XML serialization relies on one of two established APIs: DOM, a W3C specification, or SAX. In this article, we introduce a new approach that relies on standard C++ tools.

XiMoL is a freely available (http://ximol.sourceforge.net/) XML serialization library based on the concept of streams [4]. To this end, XiMoL uses the C++ iostream interfaces, thus inheriting from two standards: the W3C XML specification and the C++ Standard Library. The motivation of XiMoL is to help C++ programmers manipulate XML as standard C++ streams, which is more natural than the C++ DOM or SAX APIs.

For instance, assume you have already written the code of a C++ class XType:

class XType
{
  AType a;
  BType b;
  ... //member functions not shown
};

Then say you want to dump this class as an XML element represented by:

<X>
  <A>...</A>
  <B>...</B>
</X>

where the types AType and BType can be complex types. Further assume that the input/output code for the XiMoL-defined stream xstream has already been written for these classes. The declarations could look like:

xostream& operator<<(xostream &xos,    const AType &x);
xistream& operator>>(xistream &xis,          AType &x);
xostream& operator<<(xostream &xos,    const BType &x);
xistream& operator>>(xistream &xis,          BType &x);

Then all you have to do to obtain the input/output for the XType type is to set or read the correct tag for XType and recursively call the input/output operators for the embedded types.

xostream& operator<<(xostream &xos,    const XType& x)
{
  return xos << XML::STag("X")
             << x.a
             << x.b
             << XML::ETag();
}
xistream& operator>>(xistream &xis,   XType& x)
{
  return xis >> XML::STag("X")
             >> x.a
             >> x.b
             >> XML::ETag();
}

In this example, the text that is read or written is a well-formed XML document, except for the missing XML prolog. Strictly speaking, we are inputting and outputting well-formed fragments of XML. Also note that XiMoL requires that the XML closely reflect the memory layout of the C++ object. Finally, the format of the XML input/output could be provided by a validating schema or DTD, which is the case when dealing with XML/C++ data binding.

The XML::STag function takes care of the start tag and is implemented as a stream manipulator, similar in use to the standard std::setw [10]. This function has several overloads, some of which take care of XML attributes. The XML::ETag function takes care of the matching end tag. A wealth of functions, not used in this example, are provided by the library to perform operations such as inserting comments and dealing with CDATA sections, prologs, and the like. This deceptively simple mechanism is enough to write most of the code needed for a variety of XML inputs/outputs. Bear in mind that the strength of this code lies in the introduction of the XML-specialized class xstream, provided by the XiMoL library, and in the user-defined overloads of the << and >> operators, which are the standard mechanism in C++ for any kind of input/output on streams. If we had to compare this approach to the one described in Scott Moore's "Quick and Easy XML Creation with C++ Classes" [6], we would say that this is the major point where the two depart. This is not only syntactic sugar: the xstream class is actually a fully STL-compliant type of stream.
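To make the manipulator idea concrete, here is a minimal sketch built on a plain std::ostream. It is not XiMoL's implementation: MiniXmlOut, StartTag, EndTag, Point, and to_xml are hypothetical names invented for this illustration, and attributes, escaping, and input are left out. It only shows how a start-tag manipulator can record the element name so that the end-tag manipulator can close it without being told which tag to write.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for an XML output stream: wraps a std::ostream and
// remembers which elements are currently open.
class MiniXmlOut {
public:
    explicit MiniXmlOut(std::ostream& os) : os_(os) {}

    void open(const std::string& name) {
        os_ << '<' << name << '>';
        open_.push_back(name);
    }
    void close() {
        assert(!open_.empty() && "end tag without a matching start tag");
        os_ << "</" << open_.back() << '>';
        open_.pop_back();
    }
    std::ostream& raw() { return os_; }

private:
    std::ostream& os_;
    std::vector<std::string> open_;  // stack of open element names
};

// Manipulator-like objects (hypothetical, modeled on the usage in the text).
struct StartTag { std::string name; };
struct EndTag {};

inline MiniXmlOut& operator<<(MiniXmlOut& x, const StartTag& t) { x.open(t.name); return x; }
inline MiniXmlOut& operator<<(MiniXmlOut& x, EndTag) { x.close(); return x; }
inline MiniXmlOut& operator<<(MiniXmlOut& x, int v) { x.raw() << v; return x; }

// A user type serialized by chaining tags and member values.
struct Point { int a; int b; };

inline MiniXmlOut& operator<<(MiniXmlOut& x, const Point& p) {
    return x << StartTag{"X"}
             << StartTag{"A"} << p.a << EndTag{}
             << StartTag{"B"} << p.b << EndTag{}
             << EndTag{};
}

// Convenience helper: serialize a Point into a string.
inline std::string to_xml(const Point& p) {
    std::ostringstream buf;
    MiniXmlOut out(buf);
    out << p;
    return buf.str();
}
```

With these definitions, to_xml(Point{1, 2}) yields the fragment <X><A>1</A><B>2</B></X>. The stack of open names plays, in miniature, the role that a per-stream context would play in a real library.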

A Refinement

The nature of the XML serialization of C++ code is subtly different from the issue of reading and writing XML documents. For example, what should the XML output of this class be?

class YType
{
  XType x1,x2;
  ... //member functions not shown
};

As far as we're concerned, there are three different possibilities:

<Y>
  <X><A>...</A><B>...</B></X>
  <X><A>...</A><B>...</B></X>
</Y>

<Y>
  <X1><X><A>...</A><B>...</B></X></X1>
  <X2><X><A>...</A><B>...</B></X></X2>
</Y>

<Y>
  <X1><A>...</A><B>...</B></X1>
  <X2><A>...</A><B>...</B></X2>
</Y>

There is no right or wrong answer to the problem. The first option treats YType mainly as a container of two objects of type XType, where you rely on the order of the elements to fetch the information. The second and third cases show YType as an aggregation of two embedded objects, which happen to have the same C++ type. That means that the aggregated objects have a well-defined identity. They can be accessed individually, independent of any order. Solutions 2 and 3 are semantically equivalent, but the second solution requires more bytes. Also, the first two solutions could be implemented using the tools previously introduced; see Listings 1 and 2.

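The third solution cannot be written that way, because the operators for XType always write and expect the tag X. For this reason, we introduce the XML::OpenSTag function, which changes the behavior of the immediately following XML::STag manipulator. The code for the third solution can now be written as: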

xostream& operator<<(xostream &xos, const YType& y)
{
  return xos << XML::STag("Y")
             << XML::OpenSTag("X1")  << y.x1
             << XML::OpenSTag("X2")  << y.x2
             << XML::ETag();
}
xistream& operator>>(xistream &xis, YType& y)
{
  return xis >> XML::STag("Y")
             >> XML::OpenSTag("X1")  >> y.x1
             >> XML::OpenSTag("X2")  >> y.x2
             >> XML::ETag();
}
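One way such a "renaming" manipulator could work is sketched below on a plain std::ostream. This is an assumption about the mechanism, not XiMoL's actual code: MiniXmlOut, StartTag, OpenStartTag, EndTag, Inner, Outer, and to_xml are hypothetical names. The idea is that the renaming manipulator stores a pending name, and the next start tag writes that name instead of its own.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical XML output wrapper with a "pending rename" slot.
class MiniXmlOut {
public:
    explicit MiniXmlOut(std::ostream& os) : os_(os) {}

    // Remember a name that the next start tag must use instead of its own.
    void rename_next(const std::string& name) { pending_ = name; }

    void open(const std::string& name) {
        const std::string actual = pending_.empty() ? name : pending_;
        pending_.clear();  // the override applies to one start tag only
        os_ << '<' << actual << '>';
        open_.push_back(actual);
    }
    void close() {
        assert(!open_.empty() && "end tag without a matching start tag");
        os_ << "</" << open_.back() << '>';
        open_.pop_back();
    }
    std::ostream& raw() { return os_; }

private:
    std::ostream& os_;
    std::vector<std::string> open_;  // stack of open element names
    std::string pending_;            // empty means "no override"
};

struct StartTag { std::string name; };
struct OpenStartTag { std::string name; };
struct EndTag {};

inline MiniXmlOut& operator<<(MiniXmlOut& x, const StartTag& t) { x.open(t.name); return x; }
inline MiniXmlOut& operator<<(MiniXmlOut& x, const OpenStartTag& t) { x.rename_next(t.name); return x; }
inline MiniXmlOut& operator<<(MiniXmlOut& x, EndTag) { x.close(); return x; }
inline MiniXmlOut& operator<<(MiniXmlOut& x, int v) { x.raw() << v; return x; }

struct Inner { int a; int b; };
struct Outer { Inner x1; Inner x2; };

// Inner always asks for the tag "X"; it knows nothing about its container.
inline MiniXmlOut& operator<<(MiniXmlOut& x, const Inner& v) {
    return x << StartTag{"X"}
             << StartTag{"A"} << v.a << EndTag{}
             << StartTag{"B"} << v.b << EndTag{}
             << EndTag{};
}

// Outer renames the tag that each embedded Inner is about to write.
inline MiniXmlOut& operator<<(MiniXmlOut& x, const Outer& y) {
    return x << StartTag{"Y"}
             << OpenStartTag{"X1"} << y.x1
             << OpenStartTag{"X2"} << y.x2
             << EndTag{};
}

template <class T>
std::string to_xml(const T& value) {
    std::ostringstream buf;
    MiniXmlOut out(buf);
    out << value;
    return buf.str();
}
```

In this sketch the end tag needs no special handling: it closes whatever name was actually pushed on the stack, so the operator for the embedded type stays unchanged whether it is written on its own or inside a container.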

Implementing the xstream Class

You may wonder why we introduce a new stream class when the standard wstream class seems to be a good candidate for accommodating XML input/output, in particular because it can deal with Unicode and its encodings [11,14]. There are several reasons for the existence of this class, and they are best explained with an example. Consider parsing the user-defined entity reference &foo;, which could be defined in a DTD as:

<!ENTITY foo "foobar">

First, we need to store the definition of the entity in order to perform on-the-fly replacement of the entity reference &foo; by the corresponding value foobar. For that purpose, we introduce a context as a member of xstream. XiMoL does not limit the use of the context to buffering entity definitions. During a read/write operation, the context is also used to keep track of the list of elements and attributes that are higher in the hierarchy than the current element. It also stores the opening tag of the current element until the closing tag has been read/written. We anticipate that the context will be relied on for other higher-level mechanisms as XiMoL evolves.

For the actual replacement of the entity, the putback member function of wstream and its sputbackc counterpart in the stream buffer (std::basic_streambuf) seem to be natural candidates. Unfortunately, it turns out that this standard mechanism is insufficient in the context of XML. For example, the putback mechanism assumes that the characters that are put back are the same as the ones that were read, which is obviously not the case for an entity: the characters are replaced by the value of the entity. In addition, the number of characters that are put back could extend beyond the buffer end and, in this case, the standard stipulates that the behavior is undefined for some streams. To circumvent this limitation, all xstreams, regardless of the physical medium they represent, make use of a specific buffer dedicated to entity manipulations. Returning to the initial example, once read on the stream, &foo; is replaced on the xstream::buffer by its value foobar, with the read position just before the beginning of the value text. foobar is then parsed again, as specified by the W3C XML specification, and any embedded entity is then replaced.
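The dedicated-buffer idea can be sketched independently of XiMoL. The code below is an illustration under stated assumptions, not the library's xstream::buffer: EntityReader, expand_all, and the expansion limit are hypothetical. It works on narrow characters for brevity and ignores character references and predefined entities such as &amp;. The replacement text is pushed onto a private deque in front of the underlying stream, so it is read again and any nested reference is expanded in turn.

```cpp
#include <cstddef>
#include <deque>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical reader that expands "&name;" references using its own
// pushback buffer instead of the stream's putback area.
class EntityReader {
public:
    EntityReader(std::istream& in, std::map<std::string, std::string> entities)
        : in_(in), entities_(std::move(entities)) {}

    // Stores the next character of the expanded text in c.
    // Returns false at end of input.
    bool get(char& c) {
        for (;;) {
            if (!next_raw(c)) return false;
            if (c != '&') return true;

            // Collect the entity name up to the terminating ';'.
            std::string name;
            bool closed = false;
            for (char n; next_raw(n);) {
                if (n == ';') { closed = true; break; }
                name.push_back(n);
            }
            if (!closed) throw std::runtime_error("unterminated entity reference");

            const auto it = entities_.find(name);
            if (it == entities_.end())
                throw std::runtime_error("undefined entity: " + name);
            if (++expansions_ > kMaxExpansions)
                throw std::runtime_error("too many entity expansions");

            // Put the replacement text in front of everything still unread,
            // so that it is parsed again on the next loop iteration.
            pending_.insert(pending_.begin(), it->second.begin(), it->second.end());
        }
    }

private:
    // Reads from the private buffer first, then from the real stream.
    bool next_raw(char& c) {
        if (!pending_.empty()) {
            c = pending_.front();
            pending_.pop_front();
            return true;
        }
        return static_cast<bool>(in_.get(c));
    }

    static constexpr std::size_t kMaxExpansions = 1000;  // guards against recursive definitions

    std::istream& in_;
    std::map<std::string, std::string> entities_;
    std::deque<char> pending_;
    std::size_t expansions_ = 0;
};

// Convenience helper: expand every entity reference in a string.
inline std::string expand_all(const std::string& text,
                              const std::map<std::string, std::string>& entities) {
    std::istringstream in(text);
    EntityReader reader(in, entities);
    std::string out;
    for (char c; reader.get(c);) out.push_back(c);
    return out;
}
```

Because the deque can grow without bound, the replacement text may be longer than the reference it replaces, which is exactly the case the standard putback area does not cover.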

For the delicate question of character encoding, we rely on the GNU libiconv [3] library, which takes care of the conversion between a variety of character encodings, including the popular UTF-8 encoding of Unicode. To promote libiconv to a standard C++ facet, we hide it behind the std::codecvt<wchar_t, char, mbstate_t> facet [12]. The internal representation of XiMoL uses the standard std::wstring class, not the XMLCh* or DOMString introduced in the Character Model [2] and the Document Object Model [1]. The justification for this choice is purely empirical: We have not found any case where wstring was not up to the task.

Between DOM and SAX

XiMoL is also a paradigm shift from the DOM and SAX approaches. If DOM can be seen as a "whole document" approach and SAX as a "per-element" approach, XiMoL is in between: a sort of object-oriented approach to XML processing, with all the wealth of expressiveness that comes with it. With SAX, it shares the property that code must be written to read in the XML document and that XiMoL does not validate the document, at least not out of the box. But unlike SAX, it allows validation to be written, and unlike DOM, it allows partial validation of a document. For all these reasons, XiMoL is more than a C++ library, and the idea is not tied to any particular programming language. It is a new approach to XML processing.

References

[1] Document Object Model (DOM) Level 3 Core Specification, Version 1.0, W3C Working Draft 22 October 2002, http://www.w3.org/TR/DOM-Level-3-Core/.
[2] Character Model for the World Wide Web 1.0, W3C Working Draft 30 April 2002, http://www.w3.org/TR/charmod/.
[3] Free Software Foundation, http://www.gnu.org/software/libiconv/.
[4] Kernighan, Brian W. and Dennis M. Ritchie. The C Programming Language, Second Edition, Prentice Hall, 1988.
[5] Sobczak, Maciej. "An Iostream-Compatible Socket Wrapper," C/C++ Users Journal, December 2001.
[6] Moore, Scott. "Quick and Easy XML Creation with C++ Classes," C/C++ Users Journal, February 2001.
[7] Andrivet, Sebastien. "A Simple XML Parser," C/C++ Users Journal, July 1999.
[8] Georgescu, Cristian. "Code Generation Templates Using XML and XSL," C/C++ Users Journal, January 2002.
[9] Nicolls, Matt. "DocumentBuilder: An Alternative to Hard-Coded String Concatenation," C/C++ Users Journal: Java Solutions, June 2002.
[10] Langer, Angelika and Klaus Kreft. Standard C++ IOStreams and Locales: Advanced Programmer's Guide and Reference, Addison-Wesley, 2000.
[11] Plauger, P.J. "Standard C/C++: Unicode Files," C/C++ Users Journal, April 1999.
[12] Plauger, P.J. "Standard C/C++: The Facet codecvt," C/C++ Users Journal, December 1997.
[13] Plauger, P.J. "Standard C/C++: Introduction to Locales," C/C++ Users Journal, October 1997.
[14] Plauger, P.J. "Standard C/C++: Multibyte Files," C/C++ Users Journal, May 1999.
[15] Brand, Michael, Ronnie Maor, and Sasha Gontmakher. "XParam: A General-Purpose Serialization Framework for C++," C/C++ Users Journal, July 2002.