Programmers have two systems for parsing an XML document: DOM (Document Object Model) and SAX (Simple API for XML). Parsers that support the first system read the whole document into a data structure in memory and then provide access to it using the W3C's DOM API. This approach requires that the whole document fit into memory and takes a little time while the parsing is done. Furthermore, the user then has to navigate the DOM tree to gain access to the data in the document.
SAX is an event-driven system in that the parser calls user-supplied event handlers as it encounters occurrences of various parts of the XML document, such as elements, text, and so on. The user code in the handlers must process the data as necessary.
This article describes an efficient way to parse an XML document, using Standard C++ library containers in conjunction with a SAX parser, resulting in fast de-serialization of data from an XML file directly to data structures held in memory.
You can achieve this goal through the use of a polymorphic Element class, which represents an individual XML element. When text from the XML document is passed to its put method, the class converts the text and stores it in whatever data structure the user chooses:
class Element
{
public:
    virtual void put(const std::string& s) const = 0;
    virtual ~Element() {}
};
Clearly this class is a base class, as evidenced by the pure virtual method put, and the virtual destructor, which ensures proper behavior if you delete objects of classes derived from this class via a base type pointer.
For each type of data you wish to parse from the XML, you need a new derived class with an appropriately typed data member pointing at a variable suitable to hold the data item. As the following sections describe, you can either derive this class explicitly, or you can implement the class through a class template.
For example, a class for type long would look like this:
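class LongElement : public Element
{
    long* ptr_;   // destination variable; may be null
public:
    LongElement(long* ptr) : ptr_(ptr) {}
    virtual void put(const std::string& text) const
    {
        if (ptr_)
        {
            // atol (declared in <cstdlib>) returns 0 if the text is not a number
            *ptr_ = atol(text.c_str());
        }
    }
};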
The disadvantage of this design is that you need to define a different derived class for each data type on which you want to operate.
A class template avoids this duplication, and the crucial factor is the design of its put method. Because each instantiation handles a different type of data, the function must be written so that you don't have to specialize the template for each type, which would negate the advantage of using a template. The ideal approach is to use a conversion function that is itself a template. Luckily, such functions exist in the standard iostream library, which handles the conversion of varied data types to and from character streams, such as the text you find in an XML document.
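As an illustration of the iostream conversion described above, here is a minimal sketch of a stand-alone conversion function template. The names from_text and from_text_demo are hypothetical and are not part of the article's code; the sketch only shows that a single function body can convert text to any type with a stream extraction operator.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the article's code): converts text to any
// type T that has a stream extraction operator.
// Returns false if the extraction fails.
template <typename T>
bool from_text(const std::string& text, T& out)
{
    std::istringstream stream(text);
    stream >> out;
    return !stream.fail();
}

// Small self-check showing the same function body handling several types.
void from_text_demo()
{
    int year = 0;
    double price = 0.0;
    assert(from_text("1935", year) && year == 1935);
    assert(from_text("2.5", price) && price == 2.5);
    assert(!from_text("abc", year));   // not a number: extraction fails
}
```

Unlike atol, the stream reports a failed conversion through its state, so the caller can tell malformed text apart from a genuine zero.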
The template version of the derived class looks like this:
template<typename T>
class ElementData : public Element
{
    T& ref_;
public:
    ElementData(T& item) : ref_(item) {}
    virtual void put(const std::string& s) const
    {
        std::istringstream stream(s);
        stream >> ref_;
    }
};
You can see that, like the non-templated version, this class overrides the put method to convert the XML text it is passed and copy the result into the data storage it was given in its constructor. However, this class can cope with any type for which a stream extraction operator has been defined.
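The following sketch exercises the template with three different types through the common Element interface. The function element_data_demo is a hypothetical name introduced only for illustration; the two classes are repeated from the article so that the example compiles on its own.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Base class and template repeated from the article.
class Element
{
public:
    virtual void put(const std::string& s) const = 0;
    virtual ~Element() {}
};

template <typename T>
class ElementData : public Element
{
    T& ref_;
public:
    ElementData(T& item) : ref_(item) {}
    virtual void put(const std::string& s) const
    {
        std::istringstream stream(s);
        stream >> ref_;
    }
};

// Hypothetical demo: fills three variables of different types through
// references to the base class.
void element_data_demo(int& year, double& height, std::string& name)
{
    ElementData<int> year_element(year);
    ElementData<double> height_element(height);
    ElementData<std::string> name_element(name);

    const Element& e1 = year_element;
    const Element& e2 = height_element;
    const Element& e3 = name_element;

    e1.put("1935");
    e2.put("1.5");
    e3.put("Elvis Aaron");   // operator>> stops at whitespace

    assert(year == 1935);
    assert(height == 1.5);
    assert(name == "Elvis");  // only the first word is stored
}
```

Note the last case: for std::string, the extraction operator reads a single whitespace-delimited word, so element text that contains spaces would need different handling, for example a specialization of put for strings.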
Remembering that the SAX parser will call a handler object for the start and end of each element and each piece of character data in the XML document, you need to arrange for the put method of the appropriate object to be called at the appropriate time with each piece of data.
The best way to do this is to use a look-up table that directs you to the relevant ElementData object for each piece of XML text. For this, you can use the look-up table supplied with the C++ Standard library: the std::map class template.
If you are familiar with std::map, you will recall that it takes two main template arguments: the type of the key into the map and the type of the item to be stored. In this case, these are std::string and a pointer to Element (the base class of ElementData) respectively. Declare a typedef to make your life easier:
typedef std::map<const std::string, const Element*> ElementMap_t;
And what do you use for the key of the map? Since you are mapping from each XML element to its data item, the key must identify the XML element in question. This means you should use its name as the key.
Consider this XML document:
<?xml version='1.0' ?>
<Person>
    <FirstName>Elvis</FirstName>
    <LastName>Presley</LastName>
    <DateOfBirth>
        <Year>1935</Year>
        <Month>1</Month>
        <Day>8</Day>
    </DateOfBirth>
</Person>
You want to store each individual data item in a separate variable, each with its own ElementData object with the template instantiated for the appropriate type:
std::string FirstName;
std::string LastName;
struct Date {
    int year, month, day;
};
Date dob;
ElementMap_t element_map;
// The map stores pointers, so each ElementData object is allocated with new.
element_map.insert(std::make_pair(
    "FirstName", new ElementData<std::string>(FirstName)));
element_map.insert(std::make_pair(
    "LastName", new ElementData<std::string>(LastName)));
element_map.insert(std::make_pair(
    "Year", new ElementData<int>(dob.year)));
element_map.insert(std::make_pair(
    "Month", new ElementData<int>(dob.month)));
element_map.insert(std::make_pair(
    "Day", new ElementData<int>(dob.day)));
Then you need a SAX parser with a handler that looks up each XML element in the map and calls the put method on the object it finds there, passing the character text from that element.
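The lookup step described above can be sketched without a parser. In this minimal example, dispatch and dispatch_demo are hypothetical names; dispatch stands in for what the handler does once it knows an element's name and text. The map type matches the article's ElementMap_t except that the redundant const on the key is dropped.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

class Element
{
public:
    virtual void put(const std::string& s) const = 0;
    virtual ~Element() {}
};

template <typename T>
class ElementData : public Element
{
    T& ref_;
public:
    ElementData(T& item) : ref_(item) {}
    virtual void put(const std::string& s) const
    {
        std::istringstream stream(s);
        stream >> ref_;
    }
};

typedef std::map<std::string, const Element*> ElementMap_t;

// Hypothetical stand-in for the handler's lookup step: find the element by
// name and, if it is registered, hand it the text. Returns true on a match.
bool dispatch(const ElementMap_t& map,
              const std::string& name,
              const std::string& text)
{
    ElementMap_t::const_iterator i = map.find(name);
    if (i == map.end())
        return false;
    i->second->put(text);
    return true;
}

// Registers two variables, then feeds in text as a handler would.
void dispatch_demo(std::string& first_name, int& year)
{
    ElementData<std::string> first_name_element(first_name);
    ElementData<int> year_element(year);

    ElementMap_t map;
    map["FirstName"] = &first_name_element;
    map["Year"] = &year_element;

    assert(dispatch(map, "FirstName", "Elvis"));
    assert(dispatch(map, "Year", "1935"));
    assert(!dispatch(map, "Unknown", "ignored"));  // unregistered: skipped
}
```

Elements that are not in the map are simply ignored, which lets the calling code pick out only the data it cares about.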
An element name alone may not be unique within a document, so a more robust key is the element's full path. XPath allows you to specify an element in an XML document using a hierarchical directory-path-like notation. The root of the document is represented by a slash, and each element name is appended, separated by more slashes.
Some examples from the document above:
/Person
/Person/FirstName
/Person/DateOfBirth/Month
XPath is much more expressive than this, but you can use this simple form of the notation to identify individual elements of the XML document.
The SAX handler is the piece that does all the work. The handler has a number of methods that are called by the parser as the XML document is processed. In this case, you are concerned with the beginning and end of XML elements, and with character text. You use the beginning and end element notifications to keep a record of where you are in the XML document, constructing an XPath string as you go along. This path to the current element is stored on a stack. Any character data for the active element is accumulated until you reach the end of the element. When you do reach the end of an element, you pop the top item off the stack so that the previous element's path becomes the active path.
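The path bookkeeping just described can be sketched in isolation. PathTracker and its member names are hypothetical; the class only mirrors the stack logic, using std::string names in place of the parser's own character type.

```cpp
#include <cassert>
#include <stack>
#include <string>

// Hypothetical helper that mirrors the handler's path stack.
class PathTracker
{
    std::stack<std::string> paths_;
public:
    // Start of an element: push the parent's path plus '/' plus the name.
    void enter(const std::string& name)
    {
        std::string path = paths_.empty() ? std::string() : paths_.top();
        path += '/';
        path += name;
        paths_.push(path);
    }

    // End of an element: the parent's path becomes the active path again.
    void leave()
    {
        if (!paths_.empty())
            paths_.pop();
    }

    // Path of the element currently being processed ("" outside the root).
    std::string current() const
    {
        return paths_.empty() ? std::string() : paths_.top();
    }
};

// Walks into <Person><DateOfBirth><Month> and back out one level.
std::string path_tracker_demo()
{
    PathTracker tracker;
    tracker.enter("Person");
    tracker.enter("DateOfBirth");
    tracker.enter("Month");
    assert(tracker.current() == "/Person/DateOfBirth/Month");
    tracker.leave();            // </Month>
    return tracker.current();   // "/Person/DateOfBirth"
}
```

Storing the full path at each level costs some string copying, but it means the end-of-element code never has to rebuild the path.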
Here is the declaration of the class:
class MySaxHandler : public HandlerBase
{
    const ElementMap_t& element_map_;
    std::stack<std::string> current_path_;
    std::ostringstream current_text_;
public:
    MySaxHandler(const ElementMap_t& map);
    void startElement(const XMLCh* const name,
                      AttributeList& atts);
    void endElement(const XMLCh* const name);
    void characters(const XMLCh* const text,
                    const unsigned int length);
};
The constructor simply initializes the object's member variable with a reference to your element map.
MySaxHandler::MySaxHandler(const ElementMap_t& map)
    : element_map_(map)
{
}
// Called at the start of each element: append "/name" to the parent's path
// and push the result as the new current path.
void MySaxHandler::startElement(
    const XMLCh* const name,
    AttributeList& atts)
{
    std::ostringstream this_path;
    if (!current_path_.empty())
        this_path << current_path_.top();
    this_path << '/';
    write_xml(this_path, name);
    current_path_.push(this_path.str());
}
The reason for using a stringstream rather than a simple string is so that you can take advantage of the ostream inserter function write_xml used above. In a moment, I will show how this handles the conversion of XMLCh Unicode characters to your local encoding. However, for reasons you shall soon see, this needs to be an explicit write_xml function rather than an overloaded operator<<.
Next, characters is called by the SAX parser for all textual element content. You simply maintain a stringstream and insert the new characters into it whenever you get some.
void MySaxHandler::characters(
    const XMLCh* const text,
    const unsigned int length)
{
    // length is not used here; write_xml treats text as a
    // null-terminated string
    write_xml(current_text_, text);
}
Finally, at the end of each element, endElement finds the element in question by looking up its XPath name in the map and then calls put to write the characters saved so far to the variable registered for that element:
void MySaxHandler::endElement(const XMLCh* const name)
{
    if (!current_path_.empty())
    {
        ElementMap_t::const_iterator
            i = element_map_.find(current_path_.top());
        if (i != element_map_.end())
        {
            i->second->put(current_text_.str());
        }
        current_path_.pop();
        current_text_.str("");
    }
}
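To see the three callbacks working together without installing Xerces, the following sketch replays by hand the events a SAX parser would raise for part of the Person document. MockHandler and replay_person_events are hypothetical names; MockHandler follows the same logic as MySaxHandler but takes std::string arguments instead of XMLCh pointers, so no transcoding is involved.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stack>
#include <string>

class Element
{
public:
    virtual void put(const std::string& s) const = 0;
    virtual ~Element() {}
};

template <typename T>
class ElementData : public Element
{
    T& ref_;
public:
    ElementData(T& item) : ref_(item) {}
    virtual void put(const std::string& s) const
    {
        std::istringstream stream(s);
        stream >> ref_;
    }
};

typedef std::map<std::string, const Element*> ElementMap_t;

// Hypothetical stand-in for MySaxHandler, using std::string throughout.
class MockHandler
{
    const ElementMap_t& element_map_;
    std::stack<std::string> current_path_;
    std::string current_text_;
public:
    explicit MockHandler(const ElementMap_t& map) : element_map_(map) {}

    void startElement(const std::string& name)
    {
        std::string parent =
            current_path_.empty() ? std::string() : current_path_.top();
        current_path_.push(parent + '/' + name);
    }

    void characters(const std::string& text) { current_text_ += text; }

    void endElement()
    {
        if (current_path_.empty())
            return;
        ElementMap_t::const_iterator i =
            element_map_.find(current_path_.top());
        if (i != element_map_.end())
            i->second->put(current_text_);
        current_path_.pop();
        current_text_.clear();
    }
};

struct Date { int year, month, day; };

// Replays parser events for <Person>, <FirstName> and <DateOfBirth>/<Year>.
void replay_person_events(std::string& first_name, Date& dob)
{
    ElementData<std::string> first_name_element(first_name);
    ElementData<int> year_element(dob.year);
    ElementMap_t map;
    map["/Person/FirstName"] = &first_name_element;
    map["/Person/DateOfBirth/Year"] = &year_element;

    MockHandler handler(map);
    handler.startElement("Person");
    handler.startElement("FirstName");
    handler.characters("Elvis");
    handler.endElement();          // </FirstName>
    handler.startElement("DateOfBirth");
    handler.startElement("Year");
    handler.characters("19");      // text may arrive in several pieces
    handler.characters("35");
    handler.endElement();          // </Year>
    handler.endElement();          // </DateOfBirth>
    handler.endElement();          // </Person>

    assert(first_name == "Elvis");
    assert(dob.year == 1935);
}
```

The split "19" / "35" delivery shows why the text is accumulated and only converted in endElement: a parser is free to report one element's character data in more than one call.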
The Xerces SAX parser represents characters using an XMLCh data type and passes strings by pointers to this character type. These XML characters are represented in a Unicode encoding, whereas to store them in standard strings you need them in the local encoding. Xerces provides a static XMLString::transcode function to perform this conversion. The conversion could be automated by building it into the insertion operator for the type XMLCh.
However, XMLCh is a typedef from short, which makes it difficult -- you can't overload based on a typedef because typedef does not create a new type, but simply an alias. Therefore short's standard inserter will be used by the compiler instead. To get around this problem, there are several alternatives: you could use a different function to insert in the stream (rather than operator<<) or explicitly translate the encoding before inserting in the stream.
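The typedef problem can be demonstrated in isolation. In this sketch, FakeXmlCh is a hypothetical stand-in for XMLCh, assumed here to be an unsigned short; the real typedef is defined by Xerces and depends on the platform. The static_assert requires C++11.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical stand-in for XMLCh (assumption: a 16-bit unsigned integer).
typedef unsigned short FakeXmlCh;

// A typedef is only an alias, so both names denote exactly the same type.
static_assert(std::is_same<FakeXmlCh, unsigned short>::value,
              "typedef does not introduce a distinct type");

// Streams one character value and returns what was written.
std::string stream_one(FakeXmlCh c)
{
    std::ostringstream out;
    out << c;   // selects the standard unsigned short inserter
    return out.str();
}

void typedef_demo()
{
    // 65 is the code for 'A', but the numeric inserter prints the digits.
    assert(stream_one(65) == "65");
}
```

Because the alias and the underlying type are indistinguishable to overload resolution, a named function such as write_xml is the simpler way to get character-aware output.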
Here is the function write_xml used earlier to transcode and write XML characters to a stream:
void write_xml(std::ostream& target,
               const XMLCh* s)
{
    char *p = XMLString::transcode(s);
    target << p;
    delete [] p;
}
To avoid the call to delete[], you could replace the char* with a smart pointer capable of holding and deleting an array (unlike std::auto_ptr). If target<<p could throw, for example, this smart pointer would be necessary to make the function exception-safe.
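One possible form of that smart pointer, assuming a C++11 compiler, is std::unique_ptr<char[]>, which releases its buffer with delete[]. The sketch below uses a hypothetical fake_transcode function in place of XMLString::transcode so that it compiles without Xerces; like the article's code, it assumes the returned buffer must be released with delete[].

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for XMLString::transcode: returns a new[]-allocated
// copy that the caller must release with delete[].
char* fake_transcode(const char* s)
{
    char* copy = new char[std::strlen(s) + 1];
    std::strcpy(copy, s);
    return copy;
}

// Exception-safe variant of write_xml, shown for plain char strings.
void write_text(std::ostream& target, const char* s)
{
    // unique_ptr<char[]> calls delete[] when it goes out of scope, even if
    // the insertion below throws.
    std::unique_ptr<char[]> p(fake_transcode(s));
    target << p.get();
}

std::string write_text_demo()
{
    std::ostringstream out;
    write_text(out, "Elvis");
    assert(out.str() == "Elvis");
    return out.str();
}
```

With the ownership held by a local object, the explicit delete[] disappears and there is no path through the function that leaks the buffer.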
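To make populating the map less verbose, a small function template can create the ElementData object and insert it under a given path:
template<typename T>
void AddElement(ElementMap_t& map,
                T& ref,
                const std::string& path)
{
    map.insert(std::make_pair(
        path, new ElementData<T>(ref)));
}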
In a production system, this function could even be called automatically by a program that gets the information from a metadata repository of some kind.
The final code looks like this:
char filename[] = "file.xml";
ElementMap_t element_map;
AddElement(element_map, FirstName, "/Person/FirstName");
AddElement(element_map, LastName, "/Person/LastName");
AddElement(element_map, dob.year, "/Person/DateOfBirth/Year");
AddElement(element_map, dob.month, "/Person/DateOfBirth/Month");
AddElement(element_map, dob.day, "/Person/DateOfBirth/Day");
MySaxHandler handler(element_map);
SAXParser parser;   // assumed: the Xerces SAXParser, not declared in the original listing
parser.setDocumentHandler(&handler);
parser.parse(filename);
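AddElement allocates each ElementData object with new, and the listings above never release them. A minimal sketch of one way to clean up follows; ClearElements and add_and_clear_demo are hypothetical names, and the map type matches the article's ElementMap_t except for the redundant const on the key.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>

class Element
{
public:
    virtual void put(const std::string& s) const = 0;
    virtual ~Element() {}
};

template <typename T>
class ElementData : public Element
{
    T& ref_;
public:
    ElementData(T& item) : ref_(item) {}
    virtual void put(const std::string& s) const
    {
        std::istringstream stream(s);
        stream >> ref_;
    }
};

typedef std::map<std::string, const Element*> ElementMap_t;

// Repeated from the article.
template <typename T>
void AddElement(ElementMap_t& map, T& ref, const std::string& path)
{
    map.insert(std::make_pair(path, new ElementData<T>(ref)));
}

// Hypothetical helper: deletes every Element the map owns, then empties it.
void ClearElements(ElementMap_t& map)
{
    for (ElementMap_t::iterator i = map.begin(); i != map.end(); ++i)
        delete i->second;   // virtual destructor makes this safe
    map.clear();
}

void add_and_clear_demo(int& year)
{
    ElementMap_t map;
    AddElement(map, year, "/Person/DateOfBirth/Year");
    map.find("/Person/DateOfBirth/Year")->second->put("1935");
    assert(year == 1935);
    ClearElements(map);
    assert(map.empty());
}
```

Call the cleanup function only after parsing has finished, since the handler holds a reference to the map and uses the stored pointers during the parse.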
Another improvement would be to build in support for multiple XML document types. Since the element map objects contain full XPath names for each element, the same map could be used, and it would continue to uniquely identify each element as it is discovered in any known XML document type.