Dr. Dobb's Journal February 2001
Just as Perl became the duct tape for the Web, XML is becoming the duct tape for e-business. As a universal data format, XML glues together disparate e-business systems that, in the process of conducting everyday business, need to perform hundreds of transactions per second without outages or crashes. Such systems need XML processors that provide high performance with a small footprint. That's what SAX offers.
In this article, I'll describe SAX, then show how you can use it in Visual Basic applications via the Microsoft XML (MSXML) parser.
SAX stands for the "Simple API for XML." Developed collaboratively by members of the XML community (see http://www.megginson.com/), SAX is a simple, efficient, high-performance alternative to the Document Object Model (DOM), a widely supported XML API developed by the World Wide Web Consortium (W3C). Unlike DOM, SAX treats XML not as text or an object, but as a stream of events, such as the start of the document, the start of an element, a run of character data, the end of an element, and the end of the document.
Since an application receives these events as the parser reads the document, there is never a moment when the whole document is in memory. As a result, SAX delivers better performance and a smaller footprint for large files.
To use SAX, you create event handlers, plug them into the parser, and call the parser. After that, the parser calls the necessary event handlers while parsing XML documents. Figure 1 illustrates how this relationship works.
The SAX2 API currently has five event handler interfaces, each responsible for a separate group of events:
To use these interfaces, you just implement the ones you need within your code and register them with the parser.
Another interesting SAX feature is XMLFilter, which is both a reader and a group of event handlers. XMLFilter receives events from the reader, does some work based upon those events, and then passes selected events to the next handler. Hence, each event calls a number of event handlers chained one after another. This lets you break XML processing into logical components, such as check inventory, compute tax, or charge credit card, as in Figure 2.
In the following discussion, I'll implement a simple application that performs a keyword search based upon the component approach that SAX and XMLFilter provide. The complete Visual Basic source code for this application is available electronically; see "Resource Center," page 5. There you will find three different implementations of this task, showing three different ways to handle and connect SAX components.
To keep the code manageable, the application's task is to search an XML document (see Listing One) for those <book> elements whose title attribute contains the specified keyword, and then output a list of those books. Notice that in the XML document, most of a book's details are attributes of the <book> element. The one exception is the book's description, which forms the <book> element's content.
As far as coding goes, this application has three clear phases: reading the XML document, filtering the document's contents by keyword, and outputting the results. Each of these phases corresponds to a separate SAX component in the sample applications that are available electronically. (For these sample applications, you need the October 2000 release of the Microsoft XML Parser (MSXML) 3.0, available at http://msdn.microsoft.com/xml.)
The first thing you need to do is create the necessary SAX components: a reader, filter, and writer. For the reader, you can use the MSXML SAXXMLReader. You will need to implement the filter and writer components yourself, as described later.
Dim reader As New SAXXMLReader
Dim filter As New KeywordOnly
Dim writer As New PrintOut
Having created the components, you now need to set parameters and connect these components to each other:
TextBox1.text = ""
filter.searchKey = TextKey.text
Set filter.IVBSAXXMLFilter_parent = reader
Set filter.IVBSAXXMLReader_contentHandler = writer
Alternatively, you could set the parent and contentHandler properties using type casting. In fact, type casting is a more proper way of setting these properties because it really sets the properties of the filter interface instead of setting them via a pseudoproperty of an implemented class. This code uses type casting to set the parent property:
Dim chf As IVBSAXXMLFilter
Set chf = filter
Set chf.parent = reader
Electronically available source code (see "Resource Center," page 5) provides implementation for both direct setting of properties (the "complete filter" example) and use of type casting (the "complete filter type casting" example).
Type casting, though more correct, also requires more code: three lines to set the property, versus only one line with a pseudoproperty.
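The type-casting approach can be carried through the whole setup, not just the parent property. The following is a minimal sketch under stated assumptions, not the code from the downloadable samples: KeywordOnly and PrintOut are the article's classes, while RunSearch, its parameters, and the asFilter/asReader variable names are hypothetical. It assumes a project reference to Microsoft XML, v3.0.

```vb
' Sketch only. KeywordOnly and PrintOut are the article's own classes;
' RunSearch, keyword, url, asFilter, and asReader are hypothetical names.
Private Sub RunSearch(ByVal keyword As String, ByVal url As String)
    Dim reader As New MSXML2.SAXXMLReader
    Dim filter As New KeywordOnly
    Dim writer As New PrintOut
    Dim asFilter As MSXML2.IVBSAXXMLFilter
    Dim asReader As MSXML2.IVBSAXXMLReader

    filter.searchKey = keyword

    ' View the filter through its filter interface and attach the real reader
    Set asFilter = filter
    Set asFilter.parent = reader

    ' View the same object through its reader interface and attach the writer
    Set asReader = filter
    Set asReader.contentHandler = writer

    ' Parsing through the filter lets it forward the call to its parent
    asReader.parseURL url
End Sub
```

Because every property is reached through an interface variable, this version keeps working even if the implementing procedures in KeywordOnly are declared Private.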
Why do you need to set the parent property for the filter? Because of the filter's dual functionality: as a content handler for the reader, and reader for the writer. Setting the parent property not only sets the filter as the content handler for the reader, but also lets the filter represent itself as a proxy reader by redirecting its own calls, such as "parse()," to the real reader.
Once you have created all the objects, linked them together, and set the parameters, you can parse the document:
On Error GoTo Uh_Oh
filter.IVBSAXXMLReader_parseURL TextFile.text
Exit Sub
Uh_Oh:
TextBox1.text = TextBox1.text & _
"*** Error in XML file with the list of books *** "
MSXML implements its SAX interfaces based upon COM/OLE, and COM/OLE interfaces start with "I." So, MSXML added prefixes to the original SAX interface names: "I" for the high-performance COM interfaces used by C++, and "IVB" for the automatable interfaces used by Visual Basic and scripting.
Yet, MSXML uses no prefixes for its classes because those classes usually implement both interfaces (that is, there is no need for a separate "VB" prefix). A single coclass works for both C++ and Visual Basic.
However, there are two names for each of these coclasses. This naming has nothing to do with programming languages; it is a matter of versioning. For example, both C++ and Visual Basic can use either the SAXXMLReader or the SAXXMLReader30 coclass. The difference is that the SAXXMLReader coclass is version independent and SAXXMLReader30 is not. Version independence means that the SAXXMLReader coclass automatically uses the latest installed version of MSXML, while the version-dependent SAXXMLReader30 coclass continues to use only MSXML 3.0.
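In Visual Basic, the choice between the two coclasses comes down to which class name follows New. The sketch below assumes a project reference to Microsoft XML, v3.0; NewReader and its pinToVersion3 parameter are hypothetical names used only for illustration.

```vb
' Sketch only; NewReader is a hypothetical helper.
' Requires a project reference to Microsoft XML, v3.0.
Private Function NewReader(ByVal pinToVersion3 As Boolean) As MSXML2.IVBSAXXMLReader
    If pinToVersion3 Then
        ' Version-dependent coclass: stays on MSXML 3.0
        Set NewReader = New MSXML2.SAXXMLReader30
    Else
        ' Version-independent coclass
        Set NewReader = New MSXML2.SAXXMLReader
    End If
End Function
```

Either object can be held in an IVBSAXXMLReader variable, so the rest of the application does not need to know which coclass was created.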
The PrintOut class implements the content handler and prints everything out. To implement the content handler, open Visual Basic and create a class module named PrintOut.
Now, let Visual Basic know which interface you are going to implement: Implements IVBSAXContentHandler. Next, add dummy implementations for all the necessary methods. To do that, just select the interface from the Object list (shown in the red circle in Figure 3) and then select each of the methods, one-by-one, from the Procedure list (highlighted with a red arrow in Figure 3).
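For orientation, here is roughly what a class module looks like once every member of the interface has a stub. This is a sketch under stated assumptions rather than the article's PrintOut class: ElementCounter is a hypothetical class that only counts elements, and the procedure signatures are written as I understand Visual Basic to generate them from the MSXML 3.0 type library. The stubs your own IDE generates are the authoritative ones.

```vb
' Class module: ElementCounter (hypothetical name, not one of the samples).
' Requires a project reference to Microsoft XML, v3.0.
Option Explicit
Implements MSXML2.IVBSAXContentHandler

Public ElementCount As Long

Private Property Set IVBSAXContentHandler_documentLocator( _
        ByVal RHS As MSXML2.IVBSAXLocator)
End Property

Private Sub IVBSAXContentHandler_startDocument()
    ElementCount = 0            ' reset for each new document
End Sub

Private Sub IVBSAXContentHandler_endDocument()
End Sub

Private Sub IVBSAXContentHandler_startPrefixMapping(strPrefix As String, _
        strURI As String)
End Sub

Private Sub IVBSAXContentHandler_endPrefixMapping(strPrefix As String)
End Sub

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String, _
        ByVal attributes As MSXML2.IVBSAXAttributes)
    ElementCount = ElementCount + 1   ' the only event this class acts on
End Sub

Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String)
End Sub

Private Sub IVBSAXContentHandler_characters(strChars As String)
End Sub

Private Sub IVBSAXContentHandler_ignorableWhitespace(strChars As String)
End Sub

Private Sub IVBSAXContentHandler_processingInstruction(strTarget As String, _
        strData As String)
End Sub

Private Sub IVBSAXContentHandler_skippedEntity(strName As String)
End Sub

' Usage, from a form or standard module:
'   Dim reader As New MSXML2.SAXXMLReader
'   Dim counter As New ElementCounter
'   Set reader.contentHandler = counter
'   reader.parseURL "books.xml"        ' hypothetical file name
'   Debug.Print counter.ElementCount
```

Visual Basic refuses to compile a class that leaves any member of an implemented interface out, which is why even the events you do not care about need an empty procedure.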
You can now implement the output. In Listing Two, you direct the output to a text box on the form. This output happens only from the startElement and characters methods.
In the original SAX specification, looking up an attribute that is not present returns null. A Visual Basic String, however, cannot tell a null from an empty string. Thus, MSXML's implementation of this interface raises a trappable error instead of returning Null.
To trap and process this error, the example code uses the attrVal helper method. This helper method (function) also provides some small conveniences like a default value and a prefix/suffix.
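The real attrVal ships with the downloadable source, so the following is only a hypothetical reconstruction based on how Listing Two calls it: the attribute list, the attribute name, then an optional default, prefix, and suffix. The parameter names are my own assumptions, and the sketch assumes a project reference to Microsoft XML, v3.0.

```vb
' Hypothetical reconstruction of a helper like attrVal; the article's
' actual version may differ. Requires a reference to Microsoft XML, v3.0.
Private Function attrVal(ByVal attributes As MSXML2.IVBSAXAttributes, _
                         ByVal attrName As String, _
                         Optional ByVal defaultText As String = "", _
                         Optional ByVal prefix As String = "", _
                         Optional ByVal suffix As String = "") As String
    Dim s As String

    ' MSXML raises a trappable error when the attribute does not exist
    On Error GoTo notFound
    s = attributes.getValueFromQName(attrName)
    On Error GoTo 0

    ' Attribute found: wrap its value in the prefix and suffix
    attrVal = prefix & s & suffix
    Exit Function

notFound:
    ' Attribute missing: fall back to the default text, unwrapped
    attrVal = defaultText
End Function
```

With this shape, a call such as attrVal(attributes, "author", "", " by ", "") yields " by Peter J. Krebs" when the attribute exists and an empty string when it does not.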
The main function of the filter is to receive SAX events and pass the selected ones on. Sometimes, it may change the content of XML as well, but in this example it does not.
The filter should implement three interfaces (specifying a library is not necessary, but somewhat useful for readability):
Implements MSXML2.IVBSAXContentHandler
Implements MSXML2.IVBSAXXMLFilter
Implements MSXML2.IVBSAXXMLReader
The only property of the filter interface is parent; see Listing Three. While it does not matter here whether you make the setter and getter private or public, making them public puts them in the interfaces pop-up list that appears when you implement interfaces in Visual Basic.
You use similar code (see Listing Four) for the SAX reader property, contentHandler. You then set all other methods of the SAX reader interface to delegate the call to the parent (if there is a parent); see Listing Five. Also, you need to set the content handler to track elements, turning the pass-through flag on when an element contains the selected keyword and off again when that element ends; see Listing Six. Finally, you need to set all other content handler methods to simply pass the event to the next content handler (provided that handler exists); see Listing Seven.
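One detail of the keyword test in Listing Six is worth knowing: called with two arguments, InStr compares according to the module's Option Compare setting, which is binary (case-sensitive) unless you change it. If you would rather match regardless of case, a small helper along these lines is one option. TitleMatches is a hypothetical name, not part of the sample code.

```vb
' Sketch only; TitleMatches is a hypothetical helper.
' vbTextCompare makes the search case-insensitive regardless of Option Compare.
Private Function TitleMatches(ByVal title As String, _
                              ByVal keyword As String) As Boolean
    TitleMatches = (InStr(1, title, keyword, vbTextCompare) > 0)
End Function
```

The filter could then call TitleMatches(s, searchKey) in place of the InStr expression.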
There is a shortcut for implementing all filter/reader methods. This shortcut also helps to avoid type casting in the main program. For this shortcut, you implement only the content handler and add the contentHandler property directly to the KeywordOnly class (that is, 'Set reader.contentHandler = filter'). The code for such implementation is also available electronically as a "simple filter" example (see "Resource Center," page 5). However, the convenience of this shortcut comes at a price:
The same MSXML SAX parser may be used in C++, with even greater gains in performance, by using a separate set of interfaces (ISAXContentHandler, ISAXDTDHandler, and others). However, a complete description of this would require a separate article.
SAX is often a simple and efficient way to read XML. One of the most compelling reasons to use SAX is that you cannot afford the load on the system that comes with the DOM. For example, there is no way to use the DOM (at least not directly) if your documents are so huge (many megabytes in size) that they just don't fit into memory. Sometimes a single document fits, but there is not enough memory left to process several copies at the same time on multiple threads, which is the usual situation for web-based applications.
Another reason to choose SAX may not be necessity, but comparison. Suppose your Windows 2000 web server processes a high volume of average-sized XML messages. For MSXML, using SAX is normally several times faster than using the DOM. For relatively simple applications, the faster processing achieved by using SAX can translate into the ability to serve twice as many clients at the same time.
DDJ
Listing One
<?xml version="1.0"?>
<booklist>
  <book title="Building Microsoft Exchange Applications (Solution Developer Series)"
        author="Peter J. Krebs" ISBN="157231334X" instock="no">
    This book will guide programmers and non-programmers to create a
    professional mail-enabled or groupware application in less than a day
    with Microsoft Exchange.
  </book>
  ...
</booklist>

Listing Two
Private Sub IVBSAXContentHandler_startElement(...parameters skipped ...)
    Form1.TextBox1.text = Form1.TextBox1.text _
        & vbNewLine & attrVal(attributes, "title") _
        & attrVal(attributes, "author", "", " by ", "") _
        & vbNewLine & " -- " _
        ... other attributes ...
        & attrVal(attributes, "instock", " No inventory data.", _
                  ", In stock: ", ".")
End Sub

Private Sub IVBSAXContentHandler_characters(strChars As String)
    ' Append the element's text content (the book description)
    Form1.TextBox1.text = Form1.TextBox1.text & strChars
End Sub

Listing Three
Private parent As MSXML2.IVBSAXXMLReader

Public Property Set IVBSAXXMLFilter_parent _
        (ByVal RHS As MSXML2.IVBSAXXMLReader)
    Set parent = RHS
    Set RHS.contentHandler = Me    ' register this filter with the real reader
End Property

Public Property Get IVBSAXXMLFilter_parent() As MSXML2.IVBSAXXMLReader
    Set IVBSAXXMLFilter_parent = parent    ' object references need Set
End Property

Listing Four
Public ch As IVBSAXContentHandler

' Public, so that the main program can set the property directly
Public Property Set IVBSAXXMLReader_contentHandler _
        (ByVal RHS As MSXML2.IVBSAXContentHandler)
    Set ch = RHS
End Property

Public Property Get IVBSAXXMLReader_contentHandler() _
        As MSXML2.IVBSAXContentHandler
    Set IVBSAXXMLReader_contentHandler = ch    ' object references need Set
End Property

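Listing Five
Public Sub IVBSAXXMLReader_parseURL(ByVal strURL As String)
    ' Delegate to the real reader, if one has been attached.
    ' (IsEmpty tests Variants only; object variables are compared with Nothing.)
    If Not parent Is Nothing Then parent.parseURL strURL
End Sub
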
Listing Six
Private PutItOut As Boolean

Private Sub IVBSAXContentHandler_startElement(... attributes skipped...)
    Dim s As String
    If strQName = "book" Then
        ' MSXML raises a trappable error if the title attribute is missing
        On Error Resume Next
        s = attributes.getValueFromQName("title")
        If Err.Number = 0 Then
            If InStr(s, searchKey) > 0 Then PutItOut = True
        End If
        On Error GoTo 0
    End If
    ' Forward the event only while inside a matching <book> element
    If PutItOut And Not ch Is Nothing Then
        ch.startElement strNamespaceURI, strLocalName, strQName, attributes
    End If
End Sub

Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String)
    If PutItOut And Not ch Is Nothing Then
        ch.endElement strNamespaceURI, strLocalName, strQName
    End If
    ' Leaving the <book> element turns pass-through off again
    If strQName = "book" Then
        PutItOut = False
    End If
End Sub

Listing Seven
Private Sub IVBSAXContentHandler_characters(text As String)
    ' Pass the event on only inside a matching element and only if a handler exists
    If PutItOut And Not ch Is Nothing Then
        ch.characters text
    End If
End Sub