Binary XML

Dr. Dobb's Journal November, 2004

Taking on XML's bandwidth and memory constraints

By Oliver Goldman

Oliver is a software architect at Adobe Systems. He can be reached at goldman@ieee.org.

XML is so popular and widely deployed that it can be difficult not to choose it for encoding data. But that doesn't mean XML is well suited for every application. For one thing, it's too verbose for bandwidth- and memory- constrained applications. For another, it doesn't provide any facilities for randomly accessing or updating only portions of a document. And it doesn't deal well at all with binary data.

Consequently, "Binary XML" has become the catchphrase for the various proposals—and, in some cases, standards—designed to address these problems. Not all of these proposals actually involve binary data and none of them produce files that conform to the XML specification, but they do have a common theme—developers want a common way of interchanging data that doesn't have the same drawback as XML itself.

That, in turn, points back to the advantage the XML standard has over the various Binary XML proposals—there's only one of them. In this article, I examine some of the perceived weaknesses of XML—a subject on which there is also something short of agreement, some of the Binary XML proposals, and some notes on where all of this might be heading.

Strengths and Weaknesses

Perhaps the most important perceived strengths of XML are that:

It is easy to create, manipulate, and parse using a variety of programming languages and tools.
It is descriptive—an XML document generally gives its (human) reader some idea of its contents.

Contrast the latter with file formats such as JPEG, the bytes of which can only be understood with a copy of the JPEG specification at hand.

On the other hand, the most commonly claimed weaknesses of XML fall into two categories. Perhaps the most common relate to the use of XML on constrained devices such as cell phones. For these uses, XML is verbose, which is a problem for devices constrained by bandwidth and memory and it takes significant cycles to parse, which is a problem on power- constrained devices. The typical mobile device is constrained by bandwidth, memory, and power. Such uses demand the smallest, easiest-to-parse format.

The second category is at the opposite end of the spectrum where XML is used to encode data sets so large they tax any device. Examples often include images, video, and large databases. In these uses XML is also too verbose, driving up storage, memory, and bandwidth costs. And because XML doesn't support random access, it must instead be accessed from beginning to end; finding, say, the pixel at (317, 40) in frame 4517 can take an awfully long time.

Many of these weaknesses must be taken in context—what's too large or slow for one application might be acceptable for another. But these concerns cannot be dismissed with arguments about increasing CPU speed, available storage, or even better battery technology. The amount of data being handled continues to increase, pushing the envelope at both ends. Doubling the amount of battery power in a cell phone is always expensive, as is installing 2 terabytes of storage instead of one.

Compression

Sheer size is often the first problem to strike those using XML, and so compression is often the proposal for trimming things back down. The simplest approach is to repurpose a general-purpose compression algorithm. For example, SVG standardizes this approach by specifying that an SVG file may be stored either as an XML document or as a gzip-compressed version of the same; the two are treated interchangeably by consumers.

XMill (http://www.research.att.com/sw/tools/xmill/) is a compression tool designed specifically for XML that can achieve better results than more general techniques such as gzip. XMill accomplishes this by rearranging the data for purposes of compression, thus grouping values likely to be similar together where they can be further compressed. The notion is analogous to compressing a database by column, in which direction values are often similar, instead of by row.

The WAP Binary XML Content Format (WBXML; http://www.w3.org/1999/06/NOTE-wbxml-19990624/) was designed to provide a compact representation of XML to communicate with mobile devices over relatively slow wireless links. It uses a token-based compression scheme in which each element, attribute, and so forth is encoded as a small integer value. In general, these assignments are recorded in the encoded file, permitting it to be used with arbitrary XML. WBXML achieves greater efficiency when used with WML, however, by predefining some useful WML tokens.

Compression techniques have the advantage of being useful at both ends of the spectrum: They reduce the amount of data to be transmitted over low-bandwidth links and reduce the storage costs for large data sets. However, they also work at odds with some other desirable properties. Adding decompression requirements to power-constrained devices only consumes more power. It can also make random access more difficult to implement.

Binary Data

XML files can contain only text data, not binary data, which is to say each byte in the file must represent a valid character encoding (or a portion thereof) in the declared encoding for that particular file. However, many documents and data sets contain binary data, such as images, which would violate this restriction if embedded without modification.

To overcome this restriction, binary data is only embedded after first applying a reversible transformation that maps the binary data into text. Which scheme will work depends on which character set is in use. base64, which maps binary data into a small range of characters that is itself a subset of ASCII, is the most versatile and common. The results of these transformations are always bigger than the original binary data. The ratios vary by scheme and character encoding; as a point of reference, base64 with UTF-8 expands binary data by a ratio of 4:3. This expansion is a significant issue for documents containing large amounts of binary data.

The expansion specific to binary data can be avoided with an encoding that retains binary data in its original form while keeping the rest of the message in XML. This is the approach taken by the verbosely named SOAP Messages with Attachments (SWA; http://www.w3.org/TR/SOAP-attachments/), SOAP Message Transmission Optimization Mechanism (MTOM; http://www.w3.org/TR/soap12-mtom/), and XML Optimized Packaging (XOP; http://www.w3.org/TR/xop10/) proposals.

These proposals are themselves based on the MIME specification (http://www.letf.org/rfc/rfc2387.txt), which describes a mechanism for bundling a document with related components together into a single package and in such a way that their relationships, as defined by URLs, are preserved. Each part of the multipart package may contain essentially any kind of data—XML, binary, or otherwise. The SWA specification describes how a SOAP message can be constructed to use a multipart message as the payload instead of a single XML document. The root part of the multipart contains the XML data with the base64-encoded data removed and placed in a set of corresponding attachments. These attachments are referenced by the original XML. Moving binary data out of the XML and into an attachment avoids the base64 expansion and encoding/decoding overhead.

Once the data has been moved into the attachments, it is no longer properly part of the XML document itself. Thus, this works only if senders and receivers agree, in essence, to exchange multipart messages instead of XML documents. This, in turn, means the data can no longer be processed using basic XML tooling; some MIME knowledge is also required. In the XOP and MTOM models, the binary data remains logically part of the XML document, but binary data may be, as part of transmitting the SOAP message, temporarily unbase64 encoded and moved from the document into another part of a multipart message. The transmission mechanism doesn't know a priori which parts of the document are base64-encoded data but, since the base64 translation is reversible, it can simply reencode any portions of the document that appear to the output of the base64 transform.

A variation on this theme is to recode the data itself using XML markup. For example, an image could be encoded in XML as a series of RGB triples—<pixel r="1" g="0" b="0"/>. Such schemes induce much greater data expansion but have the advantage of making the contents of the data itself available to the XML toolkit of XPath, XQuery, XSLT, and so forth. The expansion issue can also be overcome with a compression-based scheme for reducing the size of the entire file: The compression algorithm in effect reduces most of the expansion cause by the transformation to text as it turns the text back into binary data. Of course, this comes at significant computational expense.

Model versus Syntax

XML simultaneously defines both a model and a syntax. The model consists of a set of hierarchically organized elements, each of which may contain character data and have a set of attributes associated with it. The syntax describes how those elements are delimited with angle brackets, the attributes with an equals sign and quotation marks, and so forth. These two aspects do not necessarily need to be defined together as they are in XML. For example, the Abstract Syntax Notation 1 (ASN.1; http://www.asn1.org/) Standard defines a sophisticated model consisting of data type, structures, and unions without defining any syntax, but with the intent of permitting more than one syntax to be defined. At least two such syntaxes exist: the Basic Encoding Rules (BER) and the Packed Encoding Rules (PER).

Separating syntax from model is often desirable because it offers just this kind of flexibility. The XML model has been separated from its syntax after the fact via the definition of the XML Information Set (Infoset; http://www.w3.org/TR/xml-Infoset/). Some newer W3C standards are defined with reference to the Infoset instead of to XML directly.

For some of the proposed binary XML standards, such as those based on gzip, the distinction is largely irrelevant because they simply operate as an additional transformation of the XML-defined syntax. This has the advantage of being generally straightforward and easy to transform to and from XML. However, it also depends on good compression algorithms and available CPU and power to run those algorithms and is limited only to addressing the size of the file. Other proposals approach the problem by providing an alternate encoding of the XML Information Set, much as there are alternate encodings of the ASN.1 model. Such proposals are, thus, free to change essentially anything about the encoding, including relaxing the restriction that it be text based, and can address properties such as random access and incremental update in addition to compactness.

Fast Web Services (http://java.sun.com/developer/technicalArticles/WebServices/ fastWS/) is a Sun Microsystems initiative for building faster web services implementations. Built on ITU standards, the implementation defines an ASN.1 grammar for the XML Information Set. This permits arbitrary XML documents to be encoded using one of the ASN.1-based encoding mechanisms; Fast Web Services uses PER, which is both highly compact and fast to encode and decode, but does not support either random access or incremental update.

Schemas

An XML schema is a formal declaration of a set of valid XML documents. They may be specified using the XML DTD syntax, W3C XML Schema, Relax NG, or other mechanisms. Schema mechanisms all provide information about the set of document that conform to them. Some schema definitions, such as described in W3C XML Schema, indicate which strings in the document encode numbers, dates, strings, and other types of data as well as which elements are legal and how they can be nested.

This additional information about valid documents allows for certain optimizations in the encoding of those files. For example, if a string value is known to represent a 32-bit floating-point value, it can be encoded in binary 32-bit representation, which takes 4 bytes, instead of as text, which could take 20 bytes. That binary representation can often be copied directly to or from memory instead of using printf()- or scanf()-style calls for translation. Of course, these optimizations are available only when a schema is available and all encoded documents are valid with respect to that schema. This limits the scenarios in which these schema-aware encodings can be applied. Nonetheless, they permit great efficiency. Sun's Fast Web Services project, in addition to the XML Information Set encoding described earlier, also provides a schema-aware encoding. Building on ITU standards, the implementation translates XML Schema definitions into ASN.1 and then (as before) uses PER to encode the resulting XML documents. Using a schema specific to the document provides for even more compact encodings than using the XML Information Set-based schema. By providing both, Fast Web Services provides great flexibility.

The Binary Format for MPEG-7 (BiM) is another schema-aware binary XML format, in this case designed to allow for efficient representation of the metadata defined by MPEG-7. BiM is not tied to any particular schema but does require that a schema be available. Unlike all of the previously described proposals, BiM also supports both random access and incremental update features.

W3C Activities

The W3C held a workshop in September 2003 to determine whether a working group with the purpose of creating a binary XML Standard should be chartered. The attendees concluded that the W3C should continue to investigate the area, but stopped short of recommending the chartering of a working group to create such a standard. The conference report includes the 37 position papers that were accepted for the conference, representing a wide range of use cases, requirements, and implementations.

The W3C XML Binary Characterization Working Group (XBC; http://www.w3.org/ XML/Group/Binary/) was formed in March to follow up on the workshop recommendation. Its goal over the next year is to document relevant use cases and related materials. These materials will be used to determine whether the W3C should proceed with developing a Standard.

Conclusion

XML's popularity has led to its widespread application—even in designs for which it is not well suited. This, in turn, has led to a variety of Binary XML proposals—some of which have been described here—for overcoming these limitations. Some of these proposals are ratified standards and seem likely to see significant adoption. For the moment, however, it does not appear that any of these proposals has sufficient support to achieve any wider acceptance.

DDJ