Adding Voice to XHTML

Dr. Dobb's Journal January, 2005

The X+V markup language is designed to do just that

By Gerald McCobb and Jeff Kusnitz

The authors are engineers in IBM's Pervasive Computing division. Gerald is IBM's secondary representative to the W3C Multimodal Interaction Working Group, and Jeff is IBM's representative to the VoiceXML Forum and the W3C Voice Browser Working Group. They can be contacted at mccobb@us.ibm.com and jk@us.ibm.com, respectively.

In the spring of 2002, the W3C Multimodal Interaction Working Group began work on a framework for a multimodal language Standard for the World Wide Web. The goal of this Standard was to enable the development of interoperable applications that can interact with users in a variety of ways. Voice and digital pen, in particular, are new modes of interaction that may soon become popular, especially on small devices where a mouse and keyboard are difficult to use.

The current web model assumes that users interact with applications visually, using a display, a keyboard, and a mouse.

However, a new multimodal web model for applications supports the familiar user-interaction techniques (clicking, typing, tapping) along with additional modes such as speech recognition or text-to-speech. In such cases, users can tap, type, and talk, as well as see and hear. For example, users of a multimodal browser might pick up a phone, dial into a portal, ask "are there any flights from Atlanta to San Francisco," and have a list of flights displayed on the phone. They could then select one of the flights with either a stylus or voice and have the details read over the phone.

Jointly developed by IBM, Motorola, and Opera ASA, XHTML+Voice (X+V) is a multimodal markup language that enables this kind of voice interaction with web applications.

The V in X+V is VoiceXML

VoiceXML 2.0 is a markup language for building voice applications using the web-programming model. Applications typically run over the telephone: Users interact by speaking and listening rather than with a keyboard, mouse, and display monitor. In other words, VoiceXML and a voice browser are to the telephony world what HTML and a visual browser are to the desktop PC world.

But VoiceXML is more than a language for building telephony applications. While it's true that most, if not all, current VoiceXML implementations and applications are targeted at mobile users, the language is not strictly tied to a telephony interface. At its core, VoiceXML is a dialog markup language; it lets you easily build complex dialogs between users and computers.
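To make the dialog idea concrete, here is a minimal standalone VoiceXML 2.0 document (a sketch; the grammar filename is borrowed from the sample application presented later in this article):

```xml
<?xml version="1.0"?>
<!-- A minimal standalone VoiceXML 2.0 dialog: prompt for a city,
     listen against a grammar, then confirm what was heard. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="askCity">
    <field name="city">
      <prompt>Which city would you like to fly to?</prompt>
      <grammar src="city.grxml"/>
    </field>
    <filled>
      <prompt>You said <value expr="city"/>.</prompt>
    </filled>
  </form>
</vxml>
```

The same form element, minus the enclosing vxml root, is what X+V embeds in an XHTML page.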

X+V adds a modularized subset of VoiceXML to XHTML, as well as XML Events and a small number of element and attribute extensions to both XHTML and VoiceXML. The VoiceXML modules are defined according to the XHTML modularization framework and contain logical and hierarchical groupings of VoiceXML elements. For example, the executable content module contains the child elements of the block, filled, catch, noinput, nomatch, error, and help elements. Table 1 lists the VoiceXML modules and elements supported by X+V.

X+V Adds XML Events

XML Events is a W3C Recommendation for attaching event listeners and handlers to XML nodes. XML Events has a listener element and a set of attributes—event, handler, observer, target, and so on. They give authors the ability to observe an event as it flows through a node according to the DOM level 2 event model (supported by almost all web browsers and required by X+V), and activate a handler in response to that event. For example, a click event can be observed on an HTML input node, and a VoiceXML form can be activated in response to the click.
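For instance, a listener along these lines (the element IDs here are hypothetical) attaches a VoiceXML form as the handler for a click on an input:

```xml
<!-- Hypothetical IDs: observe a click on the node with ID "btnSearch"
     and activate the VoiceXML form with ID "voiceSearch" in response. -->
<ev:listener ev:event="click" ev:observer="btnSearch"
             ev:handler="#voiceSearch"/>
```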

Whether HTML 4.01 intrinsic or VoiceXML events, all events in an X+V application flow through the XHTML tree. When users click on an HTML input, the click event flows from the html element to the input (the "capture" phase). Once it reaches the input node (the "target"), it flows back to the html element (the "bubble" phase). The VoiceXML events processed by an active voice dialog are emitted to the XHTML container according to this rule: The XHTML node that activated the voice dialog is also where all the VoiceXML events emitted by the voice dialog can be observed; that is, the XHTML node is the "target" of the VoiceXML event. For example, if a voice dialog is activated in response to a click event on an input node, then while the voice dialog is active, a VoiceXML help event can be observed on the input node after users say "help."

X+V supports all HTML 4.01 intrinsic events, but removes the "on" prefix; "onload" becomes "load," for instance. VoiceXML events, on the other hand, are distinguished by the "vxml" prefix, so that "help" becomes "vxmlhelp" when it flows through the XHTML tree. There is also the vxmldone event, which is emitted by the VoiceXML form when it has successfully finished running.
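For example, a declarative script handler in the style of Listing One could listen for the renamed help event (the IDs follow the sample application's naming and are assumptions here):

```xml
<!-- VoiceXML's "help" event arrives in the XHTML tree as "vxmlhelp".
     IDs "bd1" and "pg1" follow the sample application's naming. -->
<script type="text/javascript" declare="declare"
        ev:event="vxmlhelp" ev:observer="bd1">
  document.getElementById("pg1").innerHTML =
      "Say a departure city, for example Boston.";
</script>
```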

Prototype the Voice Interface

Voice dialogs are defined within X+V as VoiceXML forms with unique IDs. Because voice dialogs are encapsulated by VoiceXML forms, the voice interactions for a multimodal application spanning multiple web pages can be prototyped separately from the visual HTML. Call-flow diagrams show the structure of the voice dialog in terms of statements, prompts and input gathering, internal processing, and decisions and gotos for branching to VoiceXML forms. Table 2 describes the operation of each object in a call-flow diagram and its mapping to VoiceXML.

The conditional and goto branching activate successive VoiceXML forms according to the logic of the application. The branching is implemented by X+V as one or more VoiceXML return elements. For each event (such as "vxml.goto.P00010") emitted by the VoiceXML form to the XHTML tree, there must be a corresponding listener that observes the event and activates the next VoiceXML form in response to the event:

<ev:listener ev:event="vxml.goto.P00010" ev:observer="bd1" ev:handler="#P00010"/>

The XML Events listener activates the VoiceXML form with ID P00010 in response to the vxml.goto.P00010 event, observed on the XHTML node with ID bd1.

The structure of the voice dialog is diagrammed by connecting the call-flow objects together, according to the convention that the call flow proceeds from left to right. Figure 1, the call-flow diagram for the sample X+V application we present here, represents a single web page. The submit goto object is a branch to a voice dialog associated with the next web page retrieved by the browser after the current page is submitted.

Add a Voice Dialog to XHTML

Once the structure of the voice dialog has been prototyped, the call-flow objects can be mapped to their VoiceXML representations; see the last column in Table 2. The VoiceXML form(s) can then be placed in the head section of the XHTML page. The VoiceXML elements must be given an arbitrary prefix (followed by a colon) that identifies them as belonging to the VoiceXML namespace. The prefix for the VoiceXML namespace is specified by an xmlns attribute added to the html element:

xmlns:vxml="http://www.w3.org/2001/vxml"

The next step is to add the X+V sync elements for connecting the voice input results to the XHTML input controls. The sync element synchronizes VoiceXML fields and XHTML input controls. For example, if users update a VoiceXML field using voice, the synchronized XHTML input is also updated with the results stored in the field. Conversely, if users update an XHTML input using the keyboard, the synchronized VoiceXML field is updated with the contents of the XHTML input.

The sync element belongs to the X+V namespace. The specification of the prefix for the X+V namespace is provided by an xmlns attribute added to the html element:

xmlns:xv="http://www.voicexml.org/2002/xhtml+voice"

The sample X+V application we present here has two XHTML text inputs—txtBoxDeptCity for entering a departure city and txtBoxDestCity for an arrival city. The two text inputs are synchronized to the two VoiceXML fields, fieldDeptCity and fieldDestCity, by adding two X+V sync elements to the head section of the page:

<xv:sync xv:input="txtBoxDeptCity" xv:field="#fieldDeptCity"/>
<xv:sync xv:input="txtBoxDestCity" xv:field="#fieldDestCity"/>

The X+V field attribute references an ID placed on a VoiceXML field. Because VoiceXML does not include the id attribute with its field element, X+V adds the xv:id attribute that is referenced by the X+V field attribute:

<vxml:field xv:id="fieldDeptCity" name="fieldDeptCity">

All that is left is to activate the VoiceXML form in response to an XML Event. To activate the VoiceXML form in the sample X+V application when the page is loaded, the XML Events event and handler attributes are added to the XHTML body element:

<body id="bd1" ev:event="load" ev:handler="#runForm">

The load event observed on the body activates the VoiceXML form with ID runForm. The event and handler attributes are prefixed with an arbitrary label for the XML Events namespace, which is specified by an xmlns attribute added to the html element:

xmlns:ev="http://www.w3.org/2001/xml-events"

Listing One presents a sample X+V application with the VoiceXML form and X+V sync elements added to the XHTML head section. Also included are script handlers for the vxmlerror and vxmldone events. Listing Two is the city grammar for the two VoiceXML fields. There are only seven major U.S. cities that can be matched by the grammar, but more cities can easily be added.

Running the Sample X+V Application

Opera (http://www.opera.com/) and Access (http://www.access-us-inc.com/) both have X+V browsers, which run on a number of client platforms, including embedded Linux, Pocket PC, and Windows. A free trial of the IBM multimodal tools (http://www-306.ibm.com/software/pervasive/multimodal/) includes Windows versions of both the Opera and Access NetFront X+V browsers. Installed with the IBM multimodal toolkit are the embedded speech engines that the browsers need to run X+V applications.

The sample X+V application can be installed on any web server. Simply save the sample X+V markup in Listing One to a file called "sample-xv.mxml" in a directory that has been configured to deploy web applications. Also save the file containing the city grammar, "city.grxml," to the same directory. Next, configure the content type the web server issues for files with the ".mxml" extension to "application/xhtml+voice+xml." For grammar files with the extension ".grxml," the content type is "application/srgs+xml." It is also a good idea to add the ".mxml" extension to the "text/html" content type, so that browsers that don't understand X+V can still run the application without voice interaction.
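The exact configuration depends on the web server. On Apache httpd, for example, a fragment along these lines (a sketch, not tested against any particular release) would cover the two new content types:

```apache
# Serve X+V pages and SRGS grammars with the content types X+V browsers expect.
AddType application/xhtml+voice+xml .mxml
AddType application/srgs+xml .grxml
# A text/html fallback for non-X+V browsers would instead be arranged through
# content negotiation, since one extension cannot map to two types at once.
```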

Start either the Opera or NetFront X+V browser, navigate to the "sample-xv.mxml" page in the installed directory on the web server, and run the sample X+V application.

X+V Versus SALT

While the W3C Multimodal Interaction Working Group is developing a new multimodal language Standard, X+V and the Speech Application Language Tags (SALT) specification are vying for supremacy in the nascent multimodal application space. SALT can be compared to either VoiceXML or X+V because SALT can be used to develop either voice-only or multimodal applications. When compared to X+V as a multimodal markup language, SALT's generic and minimalist approach to adding voice interaction to web applications has several limitations. SALT dialogs are not modular, reusable across applications, scalable, or inherently safe.

A SALT application is not reusable because the SALT elements must be embedded in the application document itself. Every SALT application must be written from the ground up, with a multimodal dialog dedicated to the application. However, macros that preprocess SALT source and tools that automatically generate SALT source may alleviate this problem. SALT grammars, on the other hand, are reusable because they can be referenced outside the current application document.

The VoiceXML dialogs in X+V can be referenced in an external file and thereby can be reused by the XHTML containers. X+V also supports the VoiceXML subdialog element. A subdialog is a VoiceXML form called by another VoiceXML form. Because the subdialog can be called with parameters, it can be written to be very generic, and therefore, reusable. For example, VoiceXML dialogs asking for a date, time, and ZIP code could be placed in an external file and referenced by different multimodal applications.
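A hypothetical sketch of such a call: a form in the current page invokes a ZIP-code subdialog stored in an external file, passing the prompt text as a parameter (the file name, fragment ID, and parameter name are all invented for illustration):

```xml
<!-- Hypothetical: call a reusable ZIP-code form defined in
     common-dialogs.vxml, passing the prompt text as a parameter. -->
<vxml:subdialog name="zip" src="common-dialogs.vxml#getZipCode">
  <vxml:param name="promptText" expr="'What is your ZIP code?'"/>
</vxml:subdialog>
```

Because the prompt arrives as a parameter, the same getZipCode form can be shared by any application that needs a ZIP code, each with its own wording.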

SALT applications do not scale well. First, the multimodal dialog for a SALT application is typically written in a scripting language, such as JavaScript or JScript. As new features are added to the application, the multimodal dialog grows in complexity until the scripts ultimately become unmanageable. Second, because SALT is not tied to a specific event model, the events emitted by the SALT elements do not bubble. This means that for a generic event, such as "help" or "error," the author has to add an event listener to every single SALT element in the application.

According to the SALT 1.0 specification, the "point is to show that all possible user input and error events are caught and safely handled, so that the dialog is never left in a 'hanging' state." This implies that SALT developers must be careful or the application will not work properly, and all error events must be explicitly handled. VoiceXML dialogs, on the other hand, are inherently "safe": All error conditions are handled even if the author chooses not to handle them.

Conclusion

Because X+V's voice dialogs are modular, X+V lets you prototype the voice interface separately from the visual interface. This separation matters because the voice interface is just as important as the visual one; a multimodal application can also be run voice-only or hands-free. Separating the voice interface makes building an X+V multimodal application straightforward:

  1. Develop the call-flow of the voice interaction with the application.
  2. Create the VoiceXML dialogs (or reuse if possible) and XML Event listeners from the call-flow diagram (or generate them if a tool is available).
  3. Create the visual HTML elements of the application (or reuse if it is a legacy application).
  4. Drop the VoiceXML forms and XML Events listeners into one or more web pages of the application.
  5. Add the X+V <sync> tags to the web pages to synchronize VoiceXML field elements with the HTML visual controls.

X+V is built upon the latest W3C Recommendations for visual interaction (XHTML 1.1), authoring event listeners and handlers (XML Events), and voice interaction (VoiceXML 2.0). While both SALT and X+V make use of many of the same web Standards—Speech Synthesis Markup Language (SSML) and Speech Recognition Grammar Specification (SRGS), for instance—only X+V has charted a path that will continue to align with the W3C as its Standards evolve.

DDJ



Listing One
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head><title>Sample X+V Application</title>
    <!-- VoiceXML -->
    <vxml:form id="RunForm">
      <vxml:block>Enter departure and arrival cities</vxml:block>
      <vxml:field xv:id="fieldDeptCity" name="fieldDeptCity">
        <vxml:grammar src="city.grxml"/>
        <vxml:prompt>Where would you like to leave from?
        </vxml:prompt>
      </vxml:field>
      <vxml:field xv:id="fieldDestCity" name="fieldDestCity">
        <vxml:grammar src="city.grxml"/>
        <vxml:prompt>Where would you like to go?</vxml:prompt>
      </vxml:field>
      <vxml:catch event="nomatch noinput help">
          For example, say New York.
      </vxml:catch>
    </vxml:form>
    <!-- sync's -->
    <xv:sync xv:input="txtBoxDeptCity" xv:field="#fieldDeptCity"/>
    <xv:sync xv:input="txtBoxDestCity" xv:field="#fieldDestCity"/>
    <!-- scripts -->
    <script ev:event="vxmlerror" ev:observer="bd1" declare="declare">
    document.getElementById("pg1").innerHTML =
                      "Sorry, there was a speech error";
    </script>
    <script ev:event="vxmldone" ev:observer="bd1" declare="declare">
        document.getElementById("pg1").innerHTML = "Departure city: "
             + document.travelForm.txtBoxDeptCity.value
             + "  Arrival city: "
             + document.travelForm.txtBoxDestCity.value + ".";
    </script>
  </head>
  <body id="bd1" ev:event="load" ev:handler="#RunForm">
    <p id="pg1">Enter departure and arrival cities</p>
    <form name="travelForm" action=".">
      <input name="txtBoxDeptCity" type="text" />&nbsp;
      <input name="txtBoxDestCity" type="text" />
    </form>
  </body>
</html>


Listing Two
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                  "http://www.w3.org/TR/speech-grammar/grammar.dtd">
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar" 
         tag-format="semantics/1.0"
         mode="voice" xml:lang="en-US" root="uscity">
<rule id="california">
    <one-of>
        <item>
            <one-of>
                <item> Los Angeles </item>
                <item> L.A. </item>
            </one-of><tag><![CDATA[$="Los Angeles"]]></tag>
        </item>
        <item> San Francisco 
             <tag><![CDATA[$="San Francisco"]]></tag>
        </item>
    </one-of>
</rule>
<rule id="florida">
    <one-of>
        <item> Miami <tag><![CDATA[$="Miami"]]></tag></item>
    </one-of>
</rule>
<rule id="illinois">
    <one-of>
        <item>Chicago<tag><![CDATA[$="Chicago"]]></tag></item>
    </one-of>
</rule>
<rule id="massachusetts">
    <one-of>
        <item>Boston <tag><![CDATA[$="Boston"]]></tag></item>
    </one-of>
</rule>
<rule id="newyork">
    <one-of>
        <item> New York 
            <item repeat="0-1"> City 
               <tag><![CDATA[$="New York City"]]></tag>
            </item>
        </item>
    </one-of>
</rule>
<rule id="washingtonDC">
    <one-of>
        <item> Washington </item>
        <item>
            <item repeat="0-1"> Washington </item> D.C. 
        </item>
    </one-of><tag><![CDATA[$="District of Columbia"]]></tag>
</rule>
<rule id="uscity" scope="public">
    <one-of>
        <item>
          <ruleref uri="#california"/>
             <tag><![CDATA[ $= $california;]]></tag></item>
        <item><ruleref uri="#florida"/>
             <tag><![CDATA[ $= $florida;]]></tag></item>
        <item><ruleref uri="#illinois"/>
              <tag><![CDATA[ $= $illinois;]]></tag></item>
        <item><ruleref uri="#massachusetts"/>
              <tag><![CDATA[ $= $massachusetts;]]></tag></item>
        <item><ruleref uri="#newyork"/>
              <tag><![CDATA[ $= $newyork;]]></tag></item>
        <item><ruleref uri="#washingtonDC"/>
              <tag><![CDATA[ $= $washingtonDC;]]></tag></item>
    </one-of>
</rule>
</grammar>