March 2003/Quick and Easy XML Creation with C++ Classes

Quick and Easy XML Creation with C++ Classes

Scott Moore

XML is a great communications format. Its simplicity belies its flexibility and expressiveness. Even better, the current technologies surrounding XML are powerful and have a lot of industry momentum. One of the oldest XML technologies is DOM (Document Object Model) [1], a standard, language-neutral API for manipulating XML as a tree of nodes. Several DOM implementations are freely available; Microsoft and Apache both provide one.

My company writes enterprise financial software, implemented as a three-tier system. Our middle tier is essentially a SOAP (Simple Object Access Protocol) [2] service running under MTS (Microsoft Transaction Server). Because SOAP is XML based, you can use existing XML technologies like DOM to create, parse, and modify SOAP messages. This is exactly how our middle tier reads and writes SOAP messages, using Microsoft's DOM implementation. Our SOAP server needs to respond to SOAP requests very quickly (like most servers), so anything that might improve performance warrants investigation. Because SOAP requests to our middleware are small, using DOM to parse and extract the information has not posed scalability problems. However, a few SOAP responses can get very large, with thousands of nodes, and the response time is sluggish at best.

Investigating further, I wrote a program to create both small and large XML documents using Microsoft's DOM v3.0 [3] and timed the performance. My initial benchmarks showed that doubling the number of nodes could more than triple the processing time. Fortunately, that's not always the case. By rewriting how the nodes were created and added to the document, I got it back down to roughly linear performance.

But even with the rewrite's better performance, it was still taking too long. I concluded that I needed to create the SOAP responses using a different XML technology. I had only two criteria: speed and ease of use. As far as speed goes, I needed something with roughly linear performance, but many times faster than DOM. Getting other developers to use it meant it couldn't be complicated or unwieldy, hence the ease-of-use criterion. Searching around the Internet didn't turn up anything, so I decided to write my own.

Since, ultimately, the XML needs to be in a character buffer to send over HTTP, I decided appending XML to a string would be the most efficient approach. A simple XML API on top of the string buffer would solve the ease-of-use problem, and the solution progressed from there.

Enter XMLStack

Unlike HTML, XML must be well formed. Among other things, every starting element tag must have a matching ending tag, and tags cannot overlap each other. I decided the best way to meet these criteria was to use a stack implementation. As element nodes are pushed onto the stack, the starting tag information is written to the XML buffer. Popping the element off the stack forces it to write the ending tag. Once I had that concept down, everything else fell into place. I decided to name my implementation XMLStack, with the objects modeled after their counterparts in DOM.
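The push/pop idea can be sketched in a few lines. This is not the XMLStack source; TinyXmlStack and its members are hypothetical names chosen for illustration, and the sketch handles element nodes only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical minimal writer: illustrates the push/pop idea only.
class TinyXmlStack {
public:
    // Pushing an element writes its start tag immediately.
    void pushElement(const std::string& name) {
        buffer_ += '<';
        buffer_ += name;
        buffer_ += '>';
        open_.push_back(name);
    }

    // Popping writes the matching end tag, so tags can never overlap.
    void pop() {
        assert(!open_.empty() && "pop() called on an empty stack");
        buffer_ += "</";
        buffer_ += open_.back();
        buffer_ += '>';
        open_.pop_back();
    }

    const std::string& xml() const { return buffer_; }

private:
    std::string buffer_;             // the XML produced so far
    std::vector<std::string> open_;  // names of the elements still open
};

// Builds <a><b></b><c></c></a>.
std::string buildNested() {
    TinyXmlStack doc;
    doc.pushElement("a");
    doc.pushElement("b");
    doc.pop();  // closes <b>
    doc.pushElement("c");
    doc.pop();  // closes <c>
    doc.pop();  // closes <a>
    return doc.xml();
}
```

Because an end tag is only ever written for the element on top of the stack, properly nested output follows from the data structure rather than from caller discipline.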

An example is worth a thousand words, so take a look at Listing 1. This code demonstrates how to create the XML in Figure 1 (formatted for clarity). Much like DOM's document object, the AXMLStackDocument object is the basis for the XML document. The document doesn't actually write any XML itself, but it does hold the XML buffer and creates all the node types that do the actual XML writing.

In Figure 1, the first node I need to create is an element node called <Book>. This is performed by calling pushElement("Book") on the document object. The <Book> element has two child elements called <Title> and <Author>. To create the <Title> element as a child of <Book>, call pushElement("Title") while <Book> is still on the stack. Now, this is where things start to get interesting. The <Title> element has a value of "Aerospace Engineering For Dummies", which gets placed between the starting and ending tags. The pushElement() method returns a reference to the element node just created, and that reference is the target of the assignment. Element nodes define an assignment operator (=) for strings and all C++ scalar types, thus providing a natural C++ syntax for assigning values. After assigning "Aerospace Engineering For Dummies", I pop() the <Title> element off the document's stack, which causes the ending tag </Title> to get written to the buffer.

Next, I push the <Author> element onto the stack. It's important to keep in mind when to pop elements off the stack. If I didn't previously pop <Title>, then <Author> would get created as a child of <Title>, which is incorrect. <Author> is a child of <Book>, so I pop <Title>, leaving only the <Book> element on top of the stack. Since <Author> has two child elements, I leave it on the stack until the <LastName> and <FirstName> elements are pushed and popped. Notice that for <LastName>, I called setValue() instead of using the assignment operator. You can use the two interchangeably. Actually, the assignment operator just turns around and calls setValue(). However, there's no performance penalty for using the assignment operator because it is inlined. Once <FirstName> is assigned the value "Bob", the only thing left to do is to pop the remaining elements off the stack. To get the XML in the string buffer, call the xml() method on the document.
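The walkthrough can be approximated with a small stand-in class. This is a sketch, not the real AXMLStackDocument: SketchDocument and its nested Element are hypothetical, escaping of reserved characters is left out, and the last name "Doe" is a placeholder because the text does not give the value used in Figure 1.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the document class; not the XMLStack source.
class SketchDocument {
public:
    // Handle to the element currently on top of the stack.
    class Element {
    public:
        // Writes the value between the start and end tags.
        // (Escaping of & and < is omitted in this sketch.)
        void setValue(const std::string& text) { doc_->buffer_ += text; }

        // The assignment operator simply forwards to setValue().
        Element& operator=(const std::string& text) {
            setValue(text);
            return *this;
        }

    private:
        friend class SketchDocument;
        explicit Element(SketchDocument* doc) : doc_(doc) {}
        SketchDocument* doc_;
    };

    SketchDocument() : top_(this) {}
    SketchDocument(const SketchDocument&) = delete;
    SketchDocument& operator=(const SketchDocument&) = delete;

    Element& pushElement(const std::string& name) {
        buffer_ += "<" + name + ">";
        open_.push_back(name);
        return top_;
    }

    void pop() {
        assert(!open_.empty());
        buffer_ += "</" + open_.back() + ">";
        open_.pop_back();
    }

    const std::string& xml() const { return buffer_; }

private:
    std::string buffer_;
    std::vector<std::string> open_;
    Element top_;
};

std::string buildBook() {
    SketchDocument doc;
    doc.pushElement("Book");
    doc.pushElement("Title") = "Aerospace Engineering For Dummies";
    doc.pop();                                    // </Title>
    doc.pushElement("Author");
    doc.pushElement("LastName").setValue("Doe");  // placeholder value
    doc.pop();                                    // </LastName>
    doc.pushElement("FirstName") = "Bob";
    doc.pop();                                    // </FirstName>
    doc.pop();                                    // </Author>
    doc.pop();                                    // </Book>
    return doc.xml();
}
```

Forgetting the first pop() in buildBook() would nest <Author> inside <Title>, which is exactly the mistake described above.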

Rather than reinvent the wheel, I decided to use std::basic_string as my character buffer. You may know std::basic_string by its more commonly used typedefs, std::string and std::wstring. Rather than hardcode a particular instantiation of basic_string, the XMLStack classes are templates, which accept the same arguments as basic_string. This means the XML buffer can either be ANSI or Unicode (or even some other encoding). Providing this flexibility did have a cost, however: encoding conversions. When writing to the XML buffer, some classes have hardcoded strings that get placed into the XML buffer. Since the code cannot assume a particular encoding, I created a class called ConvertEncoding that converts strings to the correct encoding. During testing, I noticed that this constant conversion was a performance drain. After restructuring the code, I almost completely eliminated this penalty by performing the conversion once in the constructor of the XMLStackDocument class. Now, all node classes just reference the properly encoded string from the document class. There are two predefined typedefs for the XMLStackDocument class: AXMLStackDocument for ANSI characters and WXMLStackDocument for Unicode characters.
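The convert-once idea might look roughly like the following. Everything here is an assumption made for illustration: fromAscii() stands in for the ConvertEncoding class, BasicSketchDocument is a hypothetical name, and the template takes only a character type rather than the full set of basic_string arguments.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper standing in for ConvertEncoding: widens a 7-bit
// ASCII literal into the target character type.
template <class Ch>
std::basic_string<Ch> fromAscii(const char* s) {
    std::basic_string<Ch> out;
    for (; *s != '\0'; ++s) {
        assert(static_cast<unsigned char>(*s) < 0x80);  // ASCII only
        out += static_cast<Ch>(*s);
    }
    return out;
}

template <class Ch>
class BasicSketchDocument {
public:
    typedef std::basic_string<Ch> string_type;

    // The fixed markup strings are converted once, here, instead of on
    // every write.
    BasicSketchDocument()
        : lt_(fromAscii<Ch>("<")),
          gt_(fromAscii<Ch>(">")),
          ltSlash_(fromAscii<Ch>("</")) {}

    void writeStartTag(const string_type& name) {
        buffer_ += lt_;
        buffer_ += name;
        buffer_ += gt_;
    }

    void writeEndTag(const string_type& name) {
        buffer_ += ltSlash_;
        buffer_ += name;
        buffer_ += gt_;
    }

    const string_type& xml() const { return buffer_; }

private:
    string_type lt_, gt_, ltSlash_;  // pre-converted markup
    string_type buffer_;
};

typedef BasicSketchDocument<char>    NarrowSketchDocument;
typedef BasicSketchDocument<wchar_t> WideSketchDocument;

std::string buildNarrow() {
    NarrowSketchDocument doc;
    doc.writeStartTag("Book");
    doc.writeEndTag("Book");
    return doc.xml();
}

std::wstring buildWide() {
    WideSketchDocument doc;
    doc.writeStartTag(L"Book");
    doc.writeEndTag(L"Book");
    return doc.xml();
}
```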

The next coding example, shown in Listing 2, creates the XML shown in Figure 2. This XML has a couple of new node types: processing instructions and attributes. Also, this time around, I will be working with Unicode instead of ANSI, so I'll use the predefined WXMLStackDocument typedef. Just like elements, processing instructions get pushed onto the stack and can be assigned a value. Unlike elements, processing instructions do not accept child nodes, so you must always pop them before creating other nodes. If you do try to push a node on top of a processing instruction, an XMLStackException will get thrown.
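One way to enforce the "no children" rule is to record, for each open node, whether it accepts children. The sketch below is an assumption about how such a check could work, not the XMLStack implementation: LeafAwareSketch is a hypothetical class, std::logic_error stands in for XMLStackException, and the processing instruction takes its data as a second argument for brevity.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

class LeafAwareSketch {
    // One open node: the text to write on pop, and whether it may have
    // child nodes.
    struct Entry {
        std::string closer;
        bool acceptsChildren;
    };

public:
    void pushElement(const std::string& name) {
        requireParentAcceptsChildren();
        buffer_ += "<" + name + ">";
        open_.push_back(Entry{"</" + name + ">", true});
    }

    void pushProcessingInstruction(const std::string& target,
                                   const std::string& data) {
        requireParentAcceptsChildren();
        buffer_ += "<?" + target + " " + data;
        open_.push_back(Entry{"?>", false});
    }

    void pop() {
        assert(!open_.empty());
        buffer_ += open_.back().closer;
        open_.pop_back();
    }

    // Pops until nothing is left on the stack.
    void popAll() {
        while (!open_.empty()) pop();
    }

    const std::string& xml() const { return buffer_; }

private:
    void requireParentAcceptsChildren() const {
        if (!open_.empty() && !open_.back().acceptsChildren)
            throw std::logic_error("node on top of the stack accepts no children");
    }

    std::string buffer_;
    std::vector<Entry> open_;
};

// Returns true if pushing onto an unpopped processing instruction throws.
bool pushOnLeafThrows() {
    LeafAwareSketch doc;
    doc.pushProcessingInstruction("xml", "version=\"1.0\"");
    try {
        doc.pushElement("Book");  // the instruction was never popped
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}

std::string buildWithPopAll() {
    LeafAwareSketch doc;
    doc.pushProcessingInstruction("xml", "version=\"1.0\"");
    doc.pop();
    doc.pushElement("Book");
    doc.pushElement("PubId");
    doc.popAll();  // closes <PubId>, then <Book>
    return doc.xml();
}
```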

Notice the <Book> element has an attribute called Status. Elements are the only nodes that support attributes. To create an attribute node, I call the document's top() method, which returns the node currently on top of the stack (<Book>). I then use the element's createAttribute() method to create a Status attribute and assign it the Unicode string "BackOrdered". Take a look at the line where I assign <PubId> the integer 736. As I mentioned before, the assignment operator is overloaded to accept all the C++ scalar types. It will automatically convert the integer 736 to a string and assign it to <PubId>. Skip a few lines down to the setValue(19.99, 2) method. This assigns a float or double to the node and accepts an optional second argument for the scale (number of digits to the right of the decimal point). The default is six. This is one case where you might want to skip the assignment operator (which uses the default value) if you need finer control of the string conversion. At the end of the code, I make a call to popAll() instead of several pop() calls. The popAll() method calls pop() until no more nodes remain on the document's stack.
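The numeric conversions can be illustrated with standard streams. The formatValue() overloads below are hypothetical free functions, not the XMLStack API, and the use of std::ostringstream is an assumption; the text does not say how the classes convert numbers internally.

```cpp
#include <cassert>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Formats a floating-point value with `scale` digits after the decimal
// point. The default of six mirrors the default described in the text.
std::string formatValue(double value, int scale = 6) {
    assert(scale >= 0);
    std::ostringstream out;
    out.imbue(std::locale::classic());  // always '.' as the decimal point
    out << std::fixed << std::setprecision(scale) << value;
    return out.str();
}

// Formats an integer as decimal digits.
std::string formatValue(int value) {
    std::ostringstream out;
    out.imbue(std::locale::classic());  // no digit grouping
    out << value;
    return out.str();
}
```

With these definitions, formatValue(736) yields "736", formatValue(19.99, 2) yields "19.99", and formatValue(19.99) yields "19.990000", which shows why an explicit scale is useful for prices.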

Namespace Support

In my final coding example (Listing 3 and Figure 3), I introduce the namespace [4] capabilities. Both elements and attributes can be placed into namespaces, which helps avoid name collisions when working with two or more XML documents. Namespaces are supported in the classes, but there isn't any validation performed on them. For instance, you can use a namespace prefix (e.g., "Prefix:ElementName") on an element name even if the namespace URI (Uniform Resource Identifier) for that prefix was never previously declared. No exception will be thrown. However, a namespace-aware XML parser will treat the undeclared prefix as an error and reject the resulting XML.

The XML in Figure 3 is a simple SOAP message to a hypothetical service, which returns the current time in a given time zone. SOAP makes heavy use of namespaces, so it's a good way to demonstrate this capability. In Listing 3, the first element I create is <Envelope>, with a namespace prefix of "SOAP-ENV" and a namespace URI of "http://schemas.xmlsoap.org/soap/envelope/". The namespace prefix is provided with the element name, separated by a colon. The second parameter to pushElement() is the namespace URI. Since the namespace URI is defined with the <Envelope> element, it will automatically get written in the starting tag as an attribute (e.g., xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"). The prefix "xmlns" declares a namespace. However, there are a couple of other attributes that need to get created, one of them being another namespace declaration ("http://www.time.gov/"). This is performed by calling createAttribute() twice on the <Envelope> element. The namespace declaration "xmlns:TS" is just treated as an attribute named "xmlns:TS" with a value of "http://www.time.gov/" for the namespace URI.

Next, I create an empty element called "SOAP-ENV:Header". Now comes the SOAP body or payload. After pushing the "SOAP-ENV:Body" element onto the stack, I create an element called "TS:GetCurrentTimeRequest". GetCurrentTimeRequest is a method of my hypothetical SOAP service that returns the current time for a time zone. It's also part of the namespace "http://www.time.gov/", which is why it's prefixed with "TS". However, unlike the way DOM works, I do not pass the namespace URI (http://www.time.gov/) as the second parameter because it's already been declared in an ancestor element. If I did pass it, it would write out the namespace declaration again in the starting tag (e.g., <TS:GetCurrentTimeRequest xmlns:TS="http://www.time.gov/">). This isn't actually a problem, but there's no need to do it, and omitting it keeps the resulting XML smaller. The document object does not validate or keep track of previous namespace declarations, which is why it would automatically write the namespace declaration out again. The only thing left to do is create the <TimeZone> element, assign it the value of "Eastern" and then pop all the remaining nodes off the stack.
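The rule "a URI passed with the element is written as a declaration, and nothing is tracked" can be sketched as a single function. startTag() is a hypothetical helper, not part of XMLStack; it closes the tag immediately (so it leaves no room for further attributes), and writing xmlns="..." for an unprefixed name is an assumption, since the text only covers the prefixed case.

```cpp
#include <cassert>
#include <string>

// Builds a start tag. If a namespace URI is supplied, a declaration for
// the element's prefix is written into the tag; no record is kept of
// earlier declarations, so passing the URI twice declares it twice.
std::string startTag(const std::string& qualifiedName,
                     const std::string& namespaceUri = std::string()) {
    assert(!qualifiedName.empty());
    std::string tag = "<" + qualifiedName;
    if (!namespaceUri.empty()) {
        const std::string::size_type colon = qualifiedName.find(':');
        tag += " xmlns";
        if (colon != std::string::npos) {
            tag += ':';
            tag.append(qualifiedName, 0, colon);  // the prefix
        }
        tag += "=\"" + namespaceUri + "\"";
    }
    tag += '>';
    return tag;
}
```

Calling startTag("SOAP-ENV:Envelope", "http://schemas.xmlsoap.org/soap/envelope/") produces the tag with its xmlns:SOAP-ENV declaration, while startTag("TS:GetCurrentTimeRequest") produces just <TS:GetCurrentTimeRequest> and relies on an ancestor having declared the TS prefix.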

Other Node Types

There are several other node types that weren't covered by the examples that I want to mention. If you want to place a comment in the XML, all you need to do is call pushComment("My comment") on the document object and it will insert the comment into the XML buffer surrounded by the appropriate delimiters (e.g., <!--My comment-->). Remember to pop it off the stack: comments don't accept child nodes, so pushing another node on top of one throws an XMLStackException.

Although I haven't formally introduced them, text nodes have been used behind the scenes in my sample code. Text nodes are just text that isn't markup (e.g., element tags). When assigning a value to an element node, the code actually pushes a text node on the stack, assigns the value to it, and pops it off. Text nodes also automatically convert the reserved characters & and < to their proper entity references [5] (&amp; and &lt; respectively). Text nodes are created by calling pushText("This & That") on the document object. The resulting XML will look like "This &amp; That".
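The conversion a text node performs amounts to a single pass over the string. escapeText() below is a hypothetical helper written for illustration; like the text nodes described above, it replaces only & and <.

```cpp
#include <cassert>
#include <string>

// Replaces the two characters that must always be escaped in XML text.
std::string escapeText(const std::string& text) {
    std::string out;
    out.reserve(text.size());  // at least this long; grows if escaping occurs
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        switch (text[i]) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            default:  out += text[i]; break;
        }
    }
    return out;
}

// Quick self-check using the example from the text.
void checkEscapeText() {
    assert(escapeText("This & That") == "This &amp; That");
    assert(escapeText("a < b") == "a &lt; b");
}
```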

Another text-like node is called a CDATA section. The difference between the two is that CDATA sections do not require you to escape the reserved characters & and <. This makes them ideal for text that includes markup. On the other hand, CDATA sections cannot contain other CDATA sections, so it's not the perfect solution. The problem is CDATA sections have the ending delimiter "]]>", and it cannot be escaped. If that delimiter can possibly appear in your text, then you will need to use a text node, or a combination of the two. CDATA sections are created by calling pushCDATASection("This & That"). The resulting XML will look like "<![CDATA[This & That]]>". Neither text nodes nor CDATA sections accept child nodes, so pop them before pushing other nodes onto the stack.
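The text does not say how XMLStack treats an embedded "]]>". One widely used workaround, shown here as an assumption rather than as the library's behavior, is to split the text so that the delimiter straddles two adjacent CDATA sections. The function name cdataSections() is hypothetical.

```cpp
#include <cassert>
#include <string>

// Wraps text in one or more CDATA sections, splitting wherever the text
// itself contains the end delimiter "]]>".
std::string cdataSections(const std::string& text) {
    const std::string end = "]]>";
    std::string out = "<![CDATA[";
    std::string::size_type pos = 0;
    for (;;) {
        const std::string::size_type hit = text.find(end, pos);
        if (hit == std::string::npos) {
            out.append(text, pos, std::string::npos);  // the remainder
            break;
        }
        // Keep "]]" in the current section and start a new one before ">".
        out.append(text, pos, hit - pos + 2);
        out += "]]><![CDATA[";
        pos = hit + 2;
    }
    out += "]]>";
    return out;
}

// Quick self-check: ordinary text needs only one section.
void checkCdataSections() {
    assert(cdataSections("This & That") == "<![CDATA[This & That]]>");
    assert(cdataSections("a]]>b") == "<![CDATA[a]]]]><![CDATA[>b]]>");
}
```

A parser reading the second result sees two sections whose contents, "a]]" and ">b", concatenate back to the original text.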

The last node provides a way to place raw text into the buffer. No character processing is performed on the string. It could be XML or maybe just text that you know doesn't need to be checked for reserved characters (checking for reserved characters does exact a small performance penalty). Another use for it could be to write an embedded DTD (Document Type Definition), since support for creating DTDs is not inherently provided. To use it, call pushRawXML(string, string). The first parameter is the text that gets written to the XML buffer when the node is pushed. The second parameter is the text that gets written when the node is popped. Either parameter can be an empty string, and the node does accept child nodes.
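A raw node reduces to "write one string on push, remember another for pop". The sketch below assumes that behavior; RawSketch and pushRaw() are hypothetical names, and the DTD fragment is an invented example of the embedded-DTD use mentioned above.

```cpp
#include <cassert>
#include <string>
#include <vector>

class RawSketch {
public:
    // Writes onPush verbatim now and remembers onPop for later.
    void pushRaw(const std::string& onPush, const std::string& onPop) {
        buffer_ += onPush;
        closers_.push_back(onPop);
    }

    // Writes whatever the node on top of the stack asked to be written.
    void pop() {
        assert(!closers_.empty());
        buffer_ += closers_.back();
        closers_.pop_back();
    }

    const std::string& xml() const { return buffer_; }

private:
    std::string buffer_;
    std::vector<std::string> closers_;
};

// Writes an internal DTD subset; the inner raw node is a child of the
// outer one and has nothing to write on pop.
std::string buildDoctype() {
    RawSketch doc;
    doc.pushRaw("<!DOCTYPE Book [", "]>");
    doc.pushRaw("<!ELEMENT Book (#PCDATA)>", "");
    doc.pop();
    doc.pop();
    return doc.xml();
}
```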

Performance

As I mentioned in the beginning, one of my criteria for the code is that it be fast. Microsoft's DOM implementation has an extension that will create an XML string from the nodes in DOM. On my PC, I benchmarked creating the XML in Figure 2 with version 3 of their DOM and my XMLStack classes (10,000 times in succession). The XMLStack classes were consistently five to six times faster. Creating very large documents (thousands and thousands of nodes) typically netted over three times the performance using my code. When creating XML in a server environment, this can make a significant difference in how well your server scales. Also, the XMLStack code to create Figure 2 is about half the length of the DOM code.

Error Checking

To be fair, Microsoft's code is also error checking the XML as the nodes are created, something my classes do not do, except in some limited cases. If the _DEBUG preprocessor symbol is defined, then I do perform some validation on element and attribute names. It's not comprehensive, but it does catch some of the common errors (whitespace or illegal characters in names).
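A debug-only name check of this kind could look like the following. It is a sketch under stated assumptions: looksLikeXmlName() and checkName() are hypothetical, std::invalid_argument stands in for whatever the classes actually throw, and the test is limited to ASCII, so it rejects some names that XML allows.

```cpp
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

// Deliberately incomplete, ASCII-only approximation of the XML name rules:
// a letter, '_' or ':' first, then letters, digits, '_', ':', '-' or '.'.
bool looksLikeXmlName(const std::string& name) {
    if (name.empty()) return false;
    const unsigned char first = static_cast<unsigned char>(name[0]);
    if (!(std::isalpha(first) || first == '_' || first == ':')) return false;
    for (std::string::size_type i = 1; i < name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (!(std::isalnum(c) || c == '_' || c == ':' || c == '-' || c == '.'))
            return false;  // whitespace and other illegal characters land here
    }
    return true;
}

// Validates only in debug builds, so release builds pay nothing.
void checkName(const std::string& name) {
#ifdef _DEBUG
    if (!looksLikeXmlName(name))
        throw std::invalid_argument("illegal XML name: " + name);
#else
    (void)name;  // validation compiled out
#endif
}
```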

Conclusion

DOM implementations provide a lot of flexibility in XML creation, manipulation, and searching. Sometimes, however, your XML requirements might be more basic, and programmatically creating XML as fast as possible becomes the priority. The XMLStack classes meet that need and can make you more productive at the same time. After all, writing less code to achieve the same result is usually a good thing.

References

[1] The DOM specification and related materials can be found at the W3C website: <www.w3c.org/DOM/>.

[2] Don Box et al. "Simple Object Access Protocol (SOAP) 1.1," May 2000, <www.w3.org/TR/SOAP/>.

[3] Microsoft's DOM (MSXML) can be downloaded at Microsoft's MSDN site: <http://msdn.microsoft.com/xml>.

[4] Various authors. "Namespaces in XML," January 1999, <www.w3.org/TR/1999/REC-xml-names-19990114/>.

[5] There are a few other reserved characters, such as >, that could also be escaped. However, escaping those characters is optional. For efficiency, text nodes only convert the absolute minimum required.

Download the Code

<moore.zip>

About the Author

Scott Moore is a senior developer at netDecide (Scott.Moore@netDecide.com). He enjoys learning new technologies and continually strives to write practical code. By his definition, practical code is bug free, performs well and falls somewhere between being perfect (e.g., academia) and getting something out the door. Scott has a BS in computer science from James Madison University.