Parsing XML

Dr. Dobb's Journal January 2001

Your own miniXML parser

By David Cox

David is a software engineer at Xerox and teaches computer science at the Rochester Institute of Technology. He can be reached at guffy@rochester.rr.com.

The XML specification describes how to create documents containing structured, hierarchical information. One of the promises of the XML specification is that these documents are easy to parse. Because I like doing things that are easy, I decided to create my own parser.

I started by downloading the XML specification from http://www.w3c.org/, then purchased a copy of Bob DuCharme's XML: The Annotated Specification (Prentice Hall, 2000). Like many specs, the XML one is short, to the point, and assumes a certain background. DuCharme's annotated spec explains in detail, and with examples, each section of the XML spec. It might also help to have a good general book on XML. However, I did not use one for this project.

I then decided to code the parser in C++ using Visual C++ Version 5. I also used the Standard Template Library (STL) for strings and various containers. With these tools, I began writing a parser that I call "miniXML." (The complete source code for miniXML is available electronically; see "Resource Center," page 5.)

Types of Parsers

XML documents are text documents — an array of characters. Another way of stating the same thing is to say that XML documents are one big string. The goal of an XML parser is to extract information from these strings. Two different ways of doing this have been developed.

One way is to use an event-driven approach. The parser begins reading the string and sends messages when certain events occur. For example, a message is sent when a start tag is encountered and another event when an end tag is reached. Programs that use these parsers have callback functions to process the events. When a message signals that a desired tag has been found, the program can examine the tag and its accompanying information in detail and act accordingly. The SAX parser is the model for this approach.

The second approach builds a tree to represent the XML document. Each tag of the document represents a node in the tree. Once built, a program can traverse the tree to process the document or to search for specific tags. Usually these trees reside in memory, limiting the size of XML documents that can be parsed by this approach. In contrast, event-driven parsers do not create a tree and can parse documents of any size.

The miniXML parser I present here uses the tree approach. While this limits its use to smaller documents, it is very fast. Many, if not most, applications use small documents to do things such as sending small XML documents over the Internet.

Canonical XML

miniXML works with canonical XML. This is the XML you are left with after an XML document is preprocessed. I liken this to a C compiler that first removes directives and macros, leaving only syntactically correct C code. With XML, preprocessing would include processing references to external files (external DTDs) and expansion of entity references. What is left over is still an XML document, but one that uses a simpler syntax. Figure 1 shows this simpler syntax (known as canonical XML). Strictly speaking, XML documents are created using Unicode. However, I wrote miniXML to work only with XML documents written in ASCII.

Is canonical XML useful? Absolutely. Example 1(a) is XML that conforms to this syntax. This is one continuous string, but for readability it can be displayed as Example 1(b). Many real-world applications generate XML documents using this simpler syntax. For this reason, miniXML parses documents that conform to canonical XML. For more information on canonical XML, see http://www.jclark.com/xml/canonxml.html.

Building Trees

miniXML works by building a tree with nodes that are Tag objects. These pointers allow a tag to maintain two lists — one of siblings and one of children. I use direct pointer manipulation rather than an STL container to maintain these two lists.

Each tag also contains a list of Attribute objects and a list of Content objects. For these I do use the STL. These lists are declared as:

typedef std::list<Attribute> AttributeList;

typedef std::list<Content> ContentList;

The code for building the tree is simple and uses techniques taught in any first-year data-structures course.

Lexical Analysis

In general, the goal of lexical analysis is to break a string of characters into a sequence of tokens. Tokens usually represent words separated by white space. However, at certain times, white space in an XML document is important. As a consequence, miniXML generates a sequence of tokens interspersed with strings of characters that include white space.

The Lexan class performs the role of lexical analyzer and provides two primary member functions. The first is called NextToken, which is called as needed to look for the next token in a string. The second is called GetCharData, which returns strings that can contain whitespace. These functions are called as needed to build the Tag tree.

GetCharData works by collecting every character it encounters into a string until it reaches either the end of the XML document or encounters a start tag. NextToken works by reading characters until it recognizes that the characters represent one of the types of tokens in Table 1.

The Parser

A Parser object requests tokens or character data from a Lexan object. As the tokens and character data are returned, the Parser builds the tag tree.

The Parser uses a recursive descent technique. The member function, Translate, starts the process by getting the root tag of the XML document. The function GetTag follows the syntax by parsing the start tag, the content, and finally the end tag with calls to StartTag(), Content(), and EndTag(), respectively.

While parsing content, Lexan could return a StartTag token, signaling the beginning of another tag. If so, Content() recursively calls GetTag and adds the returned tag to the current tag's content list. If Content() encounters a Comment token, the entire comment is added to the content list as a Comment object. If you look at the header file, Parser.h, you will notice that the Comment class is a subclass of Content. Comments are viewed by miniXML simply as a type of Content.

Otherwise, all the characters encountered are collected into a single Content object, and that string is added to the content list. At this point, Content() switches modes from getting the next token to getting character data. Rather than calling Lexan::NextToken, Content() calls Lexan::GetCharData(). As soon as another start tag is encountered, the parser switches back to collecting tokens.

Parsing start and end tags proceed similarly. A start tag is defined syntactically as a tag name followed by zero or more attributes between a < character and > character. Consequently, the StartTag() function is written simply as shown in Listing One. Match is a function that serves only to move the parser to the next token. From the parser's perspective it is telling Match, "I expect to match this token next. If it matches, great. Send it back to me. If it doesn't match, then there is trouble." Match() stores the next token in the class variable m_lookahead. Sometimes Match() is called to match a predefined token such as m_equal or m_leftStartAngle. The former is a token for the equals sign (=) and the latter is a token for the < character. In these cases, if m_lookahead matches the predefined token, the parser moves on to the next token by getting it and storing it in m_lookahead.

In other cases, Match() is called to match m_lookahead with itself. This is really a trick to get Match() to move to the next token. The parser is telling Match(), "Okay, I am done with this token, go get the next one." Match() is just one of those little functions that, without an explanation, leads to a lot of head scratching.

Finally, use of the parser relies on traversing the resulting Tag tree. The TagIterator class assists with this task. A TagIterator object requires that you identify the tag that is the root of the tree. Once this is done, you can call the member functions Begin() and Next() to move through the tree. Listing Two is a program to print each tag as you traverse the tree.

Conclusion

There are many applications for a small XML parser. I think that one of the more interesting uses is to specify a traditional Windows user interface in XML. Each tag would represent some element of the user interface and attributes would represent everything from font to size to position to color. A collection of tags would specify the entire interface for a window or dialog box, allowing an XML document to completely replace Windows resource files.

Other uses for XML, of course, abound. With small parsers such as miniXML, you can easily add these capabilities to your applications. Microsoft recently introduced the Simple Object Access Protocol (SOAP), a protocol that lets programs send XML over HTTP to invoke methods on remote objects. A small XML parser could serve as the engine for implementing this or a comparable (dare I say, homebrew) protocol. If your goal is something less extravagant, then you can easily use miniXML to send data messages formatted as XML over HTTP. By adding XML and HTTP capabilities to applications, software developers can begin to offer alternatives to traditional browsers that have significant value to their customers.

DDJ

Listing One

void Parser::StartTag(Tag * t)
{
    std::string buffer;
    Match(m_leftStartAngle);    // match "<"
    TagName(buffer);            // match tag name
    if (t) t->Name(buffer);     // save tag name
    if (m_lookahead != m_rightStartAngle)
    {
        AttributeList(t);       // match attributes
    }
    Match(m_rightStartAngle);   // match ">"
}
void Parser::EndTag()
{
    std::string tagName;
    Match(m_leftEndAngle);      // match "</"
    TagName(tagName);           // match tag name
    Match(m_rightEndAngle);     // match ">"
}

Back to Article

Listing Two

#include <parser.h>
#include <tagiterator.h>
#include <iostream>

void main(int argc, char *argv[]) {
    Parser p;
    ifstream xmlFile(argv[1]);
    p.Translate(xmlFile);
    Tag * root = p.GetRoot();
    TagIterator ti(root);
    Tag * next = ti.Begin();  // get the root of the tree
    while (next)
    {
        next->Visit(ti.Level());  // Visit prints the tag
        next = ti.Next();         // get the next tag in the tree
    }
    exit(0);
}

Back to Article