October 20, 2004

VoiceXML promises voice-to-Web convergence

Author: Daniel Rubio

Users of virtually all of today's Web-based applications are constrained to interacting through a keyboard and screen. To break this paradigm, the World Wide Web Consortium (W3C), the governing body for Web-based standards, worked with industry to develop VoiceXML, a standard for interacting with Web-based systems through audio dialogs.

Given that voice-driven apparatuses -- mainly telephones -- are even more ubiquitous than computers, an integration language was a natural step in the W3C's work plans. The necessity of a separate initiative like VoiceXML, instead of simply widening the scope of another mature technology like XHTML, is due to the nature of audio.

While text-based interfaces are persistent -- what the user typed remains on screen -- voice interaction is transient: input is heard and then gone. Speech recognition is also not 100% unequivocal. Whereas a keystroke carries an exact meaning, a spoken syllable can have varying interpretations depending on the speaker, language, context, or even the ambient noise in which it's uttered.

Although VoiceXML is a central piece of the work being done on voice applications, separate initiatives have blossomed which complement it, giving way to the W3C Speech Interface Framework, composed of VoiceXML and the following works:

  • Speech Recognition Grammar Specification (SRGS) defines the grammars a recognizer uses to identify and interpret what the user actually says.
     
  • Speech Synthesis Markup Language (SSML) specifies the rendering of synthesized speech to the user, incorporating actual characteristics of spoken instructions given by the application to the end user, such as pitch, speed, or volume.
     
  • Semantic Interpretation for Speech Recognition (SISR) specifies how meaning is extracted, and possibly translated, from the raw output of a speech recognizer.
     
  • Call Control eXtensible Markup Language (CCXML) specifies telephony call control functions, such as transferring and conferencing calls.
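To give a flavor of one of these companion specifications, the following SSML fragment sketches how an application could control the characteristics of synthesized speech mentioned above; the prosody values and wording here are purely illustrative:

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your balance is
  <prosody rate="slow" volume="loud">
    five hundred dollars
  </prosody>
  <break time="500ms"/>
  Thank you for calling.
</speak>
```

A VoiceXML platform hands markup like this to its speech synthesizer, which renders the emphasized amount more slowly and loudly than the surrounding text.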

Using these voice-driven technologies in a Web-based application is a two-step process: developers must make provisions both on the client accessing the application and on the server brokering the request. On the server side there is currently a wealth of products that enable voice interaction, from vendors such as IBM and from niche players like BeVocal, which even offers hosting programs for those wishing to voice-enable applications.

On the client side you need a VoiceXML-based browser to access the information the server provides. A natural client candidate would be a mobile phone, but there are also several products that permit users to access voice applications from desktop PCs, including an open source VoiceXML browser named PublicVoiceXML.

The actual payload exchanged between server and client looks something like the following snippet:


<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu>
    <prompt>
      Say one of: <enumerate/>
    </prompt>
    <choice next="http://www.voicebank.com/checking.vxml">
      Checking
    </choice>
    <choice next="http://www.voicebank.com/savings.vxml">
      Savings
    </choice>
    <choice next="http://www.voicebank.com/investments.vxml">
      Investments
    </choice>
    <noinput>Please say one of <enumerate/></noinput>
  </menu>
</vxml>

This fragment is a simple menu which, when executed on a VoiceXML browser, asks the user to speak one of the options. Once the user answers, the browser requests the document associated with the chosen option, and the server responds with another VoiceXML document containing further instructions to be rendered on the browser.

While VoiceXML's syntax is extensive, the previous sample illustrates its tag-based form, which is verbose in the same way XHTML and WML are. The actual rendering of voice into this tag-based format falls to a speech recognizer with provisions for interpreting speech. However, it is the VoiceXML payload itself -- constantly exchanged between client and server -- that gives the language its appeal, since it contains a standard set of instructions describing possible workflow scenarios such as re-prompting, voice confidence levels, and other factors.
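To illustrate these workflow provisions, the hypothetical form below collects a spoken dollar amount, re-prompts on silence or an unrecognized utterance, and raises the minimum recognition confidence through the confidencelevel property; the URL and field name are illustrative:

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Reject recognition results below 60% confidence -->
  <property name="confidencelevel" value="0.6"/>
  <form id="transfer">
    <field name="amount" type="currency">
      <prompt>How much would you like to transfer?</prompt>
      <!-- Re-request on silence or an unintelligible answer -->
      <noinput>Sorry, I didn't hear you. <reprompt/></noinput>
      <nomatch>Sorry, I didn't understand that. <reprompt/></nomatch>
      <filled>
        <submit next="http://www.voicebank.com/transfer.vxml"
                namelist="amount"/>
      </filled>
    </field>
  </form>
</vxml>
```

Once the field is filled with a sufficiently confident result, the browser submits the value back to the server, which answers with the next VoiceXML document in the dialog.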

Although there are currently few Web systems that offer a VoiceXML interface, it's probably just a matter of time before the technology catches on in broader business areas, such as call centers, and ushers in a new wave of Web-enabled business systems.

Daniel Rubio is the principal consultant at Osmosis Latina, a firm specializing in enterprise software development, training, and consulting based in Mexico.

Category:

  • Web Development