Introduction to VoiceXML and Voice Web Architecture

© 2007 Ken Rehor. All Rights Reserved.

Ken Rehor


Session Overview

• Voice Web Architecture
– Components of a Voice Web Application
• Voice Standards
– W3C Speech Interface Framework
• VoiceXML
– Language features
– Execution model: Form Interpretation Algorithm (FIA)
• Application Design Techniques
– Static vs. dynamic VoiceXML
– Performance Considerations
• CCXML, VoiceXML and VoIP
• Application Deployment Models
• New Technologies
– Speaker Biometrics, Video, Multimodal, VoiceXML 3.0


Simplifying Voice Services programming

• Web-based architecture for interactive speech services
– Exploit web technologies to simplify voice service creation and deployment
– Enable consolidation of voice and web services
– Separate service logic from user interaction
• High-level programming languages
– Control speech and telephony resources in a uniform manner
– Shield application programmers from implementation details
  • No need to know ASR, TTS, or telephony APIs
– Create portable applications
  • Run on an enterprise system or in the telephone network
  • Run on a variety of platforms, ASR agnostic


Voice Web Application Architecture


Key Ideas

• Standard/common high-level language
– Designed for the task
• Leverage open, known technology
– Web protocols, servers, networks, development tools, expertise
• Phone number mapped to URL
– Phone number associated with URL of voice service


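Voice / Web Application Architecture

[Figure: a web browser and a VoiceXML browser (reached from any phone over the PSTN or VoIP) both use HTTP across the Internet or an intranet to reach the same application (web) server. The web browser retrieves <html> pages plus images, audio files, and scripts; the VoiceXML browser retrieves <vxml> documents plus grammars (<grxml>), audio files (.wav), and scripts. The application server provides application logic, content and data, transaction processing, and a database interface. Example web URL shown in the figure: http://www.verizonwireless.com/b2c/store/controller?item=planFirst&action=viewPhoneDetail&selectedPhoneId=1570]

Voice Application Architecture and Components

[Figure: a caller on the PSTN says "Customer service, please…" and hears "Welcome to Acme products…". The VoiceXML platform (VoiceXML interpreter and middleware over ASR, TTS, audio, DTMF, and telephony resources, plus OA&M) retrieves <vxml>, <grxml>, and .wav files from a web server using HTTP over the Internet or an intranet.]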



Application Backend Architecture

[Figure: the application (web) server delivers <vxml> documents, grammars, audio files, and scripts over HTTP across the Internet or an intranet. It provides application logic, content and data, transaction processing, and a database interface. Behind it, reached over an intranet or the Internet, are a database (content), a transaction server, and a web service.]


Components of a Voice Solution

• Traditional phone, VoIP phone, mobile phone, or multimodal device

• Telephone network
– Circuit-switched PSTN or packet-switched VoIP
– Connects caller’s telephone with Telephony Server
• Voice User Interface
– Dialog structure / flow
– Prompts: what the application says to the user
– Speech grammars: what the user can say
• Application logic that executes on an application server
– Web "back-end"
– Database, or database interface
• VoiceXML Server that executes dialogs
– Controls resources such as ASR, SIV, TTS, etc.
• Data network to connect application server and VoiceXML server


Inbound or Outbound calls

• VoiceXML application works the same for inbound and outbound calls

– Additional call progress detection generally required for outbound

• Simple protocol for initiating outbound calls
– No firm standards, but most vendors follow similar techniques
– HTTP, Web Services, etc.


Standards


Value of Open Standards

• Non-proprietary interfaces between components

• Allow choice of best components for the task

• User interface languages
– W3C Speech Interface Framework: VoiceXML, SRGS, SSML, SI
– W3C: HTML, XHTML, SMIL, X+V
– OMA: WAP
• Communication protocols
– W3C: CCXML for 3rd-party telephony call control
– W3C: HTTP, HTTPS, SOAP, WSDL
– IETF: SIP, MRCP, MSCP
– 3GPP: IMS
– ITU: T1, ISDN


Visual vs. Voice markup

Web app UI
• HTML
– Structure
– Layout
– Input declaration
– Transitions
• Images
• Audio files / streams
• Video
• Text
• Scripts

Voice Web app UI
• VoiceXML
– Structure
– Dialog flow
– Input declaration
– Transitions
• Audio files
• Video, Images
• Text (for TTS)
• Scripts


Protocols

Web applications
• HTTP, HTTPS
• RTP
• SOAP
• WSDL
• …

Voice Web applications
• HTTP, HTTPS
• RTP
• SOAP
• WSDL
• SIP
• …


Voice Standards Activities

• Speech Interface Framework

• Network protocols

– SIP, MRCP v2, etc.

• Platform Certification, Developer Certification, Speaker Biometrics, Architecture, Tools


Voice Application Standards

[Figure: standards used at each interface of a voice application deployment. Labels recoverable from the figure:
• Caller and phone network to VoIP gateway: T1 / E1, ISDN, SS7
• VoIP gateway to conference/media server (media mixer): SIP, RTP, RFC 2833; audio and DTMF
• CCXML browser: fetches the CCXML call control application and scripts over HTTP/HTTPS
• Telephony control interface: SIP, etc.
• Dialog control and media control interfaces: SIP, MSCP, etc. (SIP Netann, MSCML, MOML / MSML, MSCP, DMSP, MGCP, etc.; standards in progress)
• VoiceXML browser: fetches the VoiceXML application (VoiceXML 2.0, VoiceXML 2.1, ECMAScript 262), GRXML, scripts, and audio (G.711, WAV, .au, mp3, etc.) over HTTP/HTTPS
• MRCP client to ASR, TTS, and SIV servers: MRCP v1, MRCP v2, carrying GRXML and SSML
• SOAP]


W3C Speech Interface Framework


Voice Application Components

• Dialog – flow control of the inputs, outputs, next steps

• Input grammars
– Control input constraints for DTMF and speech recognition
• Output formatting
– Pronunciation, timing, sequencing


W3C Speech Interface Framework

• VoiceXML

• SRGS

• SSML

• Semantic Interpretation

• Pronunciation Lexicon

• Call Control

For more information, see: W3C Voice Browser Working Group http://www.w3.org/Voice/


Voice User Interface - Dialog

• W3C VoiceXML 2.0
– W3C Recommendation March 2004
– Widely implemented
  • Approximately four dozen platforms
  • Many service providers worldwide
– VoiceXML Forum certification program
  • Nearly two dozen certified platforms, more coming
• W3C VoiceXML 2.1
– Candidate Recommendation Sept 2006
– Test suite under development; Certification Program to follow
– Many platform vendors are implementing it
• W3C VoiceXML 3.0
– Early stages of development
– SCXML: state chart markup language designed as a controller for V3 and CCXML 2.0 ("Working Draft" Jan 2006)


User Interaction – Input / Output Control

• Input grammars: W3C SRGS 1.0
– W3C Recommendation
– Widely implemented
• Output formatting: W3C SSML 1.0
– W3C Recommendation
– Widely implemented, yet with little real support (most TTS engines ignore the SSML instructions)
• Semantic Interpretation for Speech Recognition: W3C SISR 1.0
– Nearing Candidate Recommendation
– Implementation gaining acceptance


W3C Speech Interface Framework: Semantic Interpretation


W3C Speech Recognition Grammar Specification

• Markup language to control input constraints
– Finite-state speech recognition
– DTMF recognition
• Two variations
– XML (GRXML)
– ABNF

• Version 1.0: W3C Recommendation – March 2004

• Implemented and supported by numerous vendors


GRXML ASR example

<grammar type="application/srgs+xml" root="r2" version="1.0">
  <rule id="r2" scope="public">
    <one-of>
      <item>coffee</item>
      <item>tea</item>
      <item>milk</item>
      <item>nothing</item>
    </one-of>
  </rule>
</grammar>


GRXML DTMF example

<?xml version="1.0"?>
<grammar mode="dtmf" version="1.0" root="pin"
         xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar http://www.w3.org/TR/speech-grammar/grammar.xsd">
  <rule id="digit">
    <one-of>
      <item> 0 </item>
      <item> 1 </item>
      <item> 2 </item>
      <item> 3 </item>
      <item> 4 </item>
      <item> 5 </item>
      <item> 6 </item>
      <item> 7 </item>
      <item> 8 </item>
      <item> 9 </item>
    </one-of>
  </rule>
  <!-- A PIN is four digits followed by the # key -->
  <rule id="pin" scope="public">
    <one-of>
      <item>
        <item repeat="4"><ruleref uri="#digit"/></item>
        #
      </item>
    </one-of>
  </rule>
</grammar>


W3C Speech Synthesis Markup Language

• Markup language to control spoken and audio output

• Version 1.0: W3C Recommendation – Sept 2004

• Implemented and supported by numerous vendors

• Version 1.1: under development
– Adds support for tonal languages
– First public Working Draft published January 2007


SSML Functions

• Audio output
– <audio>
• Text-to-Speech output
– Contained within SSML constructs
• Pronunciation controls
– <say-as>
  • interpret-as
  • format
  • detail
– <emphasis>
• Timing
– <break>


SSML Functions (cont’d)

• Spoken language
– xml:lang
• Prosody and Style: voice control
– <voice>
  • gender
  • age
  • name
• Prosody
– <prosody>
  • pitch
  • contour
  • range
  • rate
  • duration
  • volume


SSML Functions (cont’d)

• Sentence structure
– <p>
– <s>
• Phonetic pronunciation
– <phoneme>
• Modify text
– <sub> - substitute text
• Location identification
– <mark>
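The SSML functions listed on these slides can be combined in a single document. The following is a minimal sketch, not taken from the original deck: the prompt text, the file name welcome.wav, and the mark name are illustrative assumptions, and the interpret-as value "digits" is a commonly supported convention rather than something SSML 1.0 itself defines.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Audio output with fallback text if the (hypothetical) file cannot be played -->
  <audio src="welcome.wav">Welcome to Acme.</audio>

  <!-- Sentence structure, emphasis, and pronunciation control -->
  <p>
    <s>Thank you for calling <emphasis>Acme</emphasis> products.</s>
    <s>Your order number is <say-as interpret-as="digits">1570</say-as>.</s>
  </p>

  <!-- Timing -->
  <break time="500ms"/>

  <!-- Voice selection and prosody -->
  <voice gender="female">
    <prosody rate="slow" volume="loud">Please hold while we look that up.</prosody>
  </voice>

  <!-- Location marker, then text substitution -->
  <mark name="after_hold_message"/>
  This service follows <sub alias="World Wide Web Consortium">W3C</sub> standards.
</speak>
```

How much of this markup is honored depends on the TTS engine, so a test call on the target platform is the only reliable check.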


VoiceXML 2.x


VoiceXML Scope

• Human-machine interaction provided by voice response systems:
– Output
  • play audio files
  • produce synthesized speech
– Input
  • record spoken input
  • recognize spoken input
  • collect character input
– Control flow
– Telephony
  • transfer a user to another destination, such as a live agent
  • disconnect a user


VoiceXML Goals

• Separate user interaction from service logic
– Creates new possible business models
  • Service developer can be separate from telephony platform provider
• Enable service portability across implementation platforms
– Assume common set of platform capabilities
– Provide common language for:
  • Content providers, Tool providers, Platform providers
• Safely handle shared network-based applications
– Deterministic behavior
• Easy to build common types of applications
• Features to build complex types of applications
• Shield application authors from low-level platform-specific details
– Promotes portability, ease of service creation


VoiceXML 2.0 Basic Functions

• Input
– <field>, <menu>: recognition
– <record>: audio recording
• Output
– <prompt>: container for TTS or prerecorded audio
– <audio>: prerecorded audio
• Control Flow
– <if>, <else>, <elseif>: basic conditional logic
– <script>: complex scripts using ECMAScript
– <goto>: transition to another form item, dialog, or document
– <submit>: submit data to a web application
• Telephony
– <disconnect>
– <transfer>
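A small sketch showing several of the basic functions above working together. It is illustrative only: the URL http://example.com/save-message and all names are hypothetical, and the built-in "boolean" field type is assumed to be available on the platform (support for built-in grammars varies).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="leave_message">
    <!-- Input: recognize a yes/no answer -->
    <field name="wants_message" type="boolean">
      <prompt>Would you like to leave a message?</prompt>
    </field>

    <!-- Input: record audio, visited only if the caller said yes -->
    <record name="message" cond="wants_message" beep="true"
            maxtime="30s" finalsilence="3s">
      <prompt>Please speak after the beep.</prompt>
    </record>

    <!-- Control flow: branch on the collected values -->
    <block>
      <if cond="wants_message">
        <!-- Submit the recording to a (hypothetical) web application -->
        <submit next="http://example.com/save-message" namelist="message"
                method="post" enctype="multipart/form-data"/>
      <else/>
        <!-- Output, then telephony: say goodbye and hang up -->
        <prompt>Thank you for calling. Goodbye.</prompt>
        <disconnect/>
      </if>
    </block>
  </form>
</vxml>
```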


VoiceXML Execution Model

• Form Interpretation Algorithm: <form>
• Execution is synchronous (mostly)
– Disconnect events are handled (somewhat) asynchronously
• Audio is queued
– Played only when encountering a waiting state
• Processing is always in one of two states:
– Waiting for input in an input item
  • such as <field>, <record>, or <transfer>
– Transitioning between input items in response to an input
• Event-driven
– <catch>, <throw>: generalized event mechanism
– <nomatch>, <noinput>: short-hand user-input event handling
– <error>: short-hand error event handling
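The execution model above can be traced through a single form. This is a sketch under assumptions: the event name com.example.toomanyerrors, the URL, and the field names are invented for illustration, and the built-in "digits" type is assumed to be supported by the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="order_status">
    <!-- FIA select phase: the first form item whose variable is still
         undefined is visited. Its prompt is queued and is played when the
         interpreter reaches the waiting state (listening for input). -->
    <field name="order_number" type="digits">
      <prompt>Please say or key in your order number.</prompt>

      <!-- Short-hand user-input event handlers; count selects the
           handler used on repeated occurrences -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <nomatch count="1">Sorry, I did not get that. <reprompt/></nomatch>
      <nomatch count="3"><throw event="com.example.toomanyerrors"/></nomatch>

      <!-- FIA process phase: runs once the field variable is filled -->
      <filled>
        <prompt>Looking up order <value expr="order_number"/>.</prompt>
      </filled>
    </field>

    <!-- Visited next: a transitioning state that submits the result -->
    <block name="send">
      <submit next="http://example.com/order-status" namelist="order_number"/>
    </block>

    <!-- Generalized handler for the application-defined event thrown above -->
    <catch event="com.example.toomanyerrors">
      <prompt>We are having trouble understanding you. Please call back later.</prompt>
      <exit/>
    </catch>

    <!-- Short-hand for catch event="error" -->
    <error>
      <prompt>We are experiencing technical difficulties.</prompt>
      <exit/>
    </error>
  </form>
</vxml>
```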


Key Points

• Architecture leverages all things "internet"
– Languages, protocols, servers, developers, etc.
• Separation of concerns
– Application logic / database vs. telephony / speech resources
– Enables new business models
  • Voice ASP
  • Prepackaged applications
• URL (application) associated with phone number
– Calling party or Called party
– Share resources among many applications (Voice ASP)
• High-level languages, specific to domain / task
– Simplify development and maintenance


VoiceXML <form> and <field>

• <form>
– Dialog container
– "Form Interpretation Algorithm" (FIA) specifies default behavior
• <field>
– Collect input from caller
– <grammar> specifies input 'constraints'
• <prompt>
– Container for <audio> and text


Example: main.vxml
Note: Code simplified for demonstration purposes…

<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <field name="main_menu">
      <prompt>
        <audio src="welcome.wav">
          Welcome to Acme. You can choose sales, repair, or order status.
        </audio>
      </prompt>
      <grammar src="main_menu.grxml"/>
    </field>
    <block>
      <submit next="http://acme.com/route... " method="get"/>
    </block>
  </form>
</vxml>


User Input - Grammars

• Grammars can be speech or DTMF (touchtone)
– Both types can be active simultaneously
• Specified by SRGS
– XML grammars (aka GRXML) are the form VoiceXML platforms are required to support
– ABNF grammars are more concise but more complex to author

• Grammars may be specified inline or sourced externally

• External grammars are referenced by URI

• Multiple grammars may be active simultaneously.
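The points above can be seen in one field that has an external speech grammar and an inline DTMF grammar active at the same time. This is a sketch: the prompt wording and the key assignments are assumptions, and main_menu.grxml is the external grammar file used elsewhere in this deck.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="main">
    <field name="department">
      <prompt>Say sales, repair, or order status, or press 1, 2, or 3.</prompt>

      <!-- External speech grammar, referenced by URI -->
      <grammar src="main_menu.grxml" type="application/srgs+xml"/>

      <!-- Inline DTMF grammar, active at the same time as the speech grammar -->
      <grammar mode="dtmf" version="1.0" root="key">
        <rule id="key" scope="public">
          <one-of>
            <item>1</item>
            <item>2</item>
            <item>3</item>
          </one-of>
        </rule>
      </grammar>

      <filled>
        <prompt>You chose <value expr="department"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Without semantic interpretation tags, the DTMF path fills the field with the key pressed rather than a department name, so the application has to handle both kinds of value.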


Grammars can get very complicated: there are many ways to say the same thing…

Sales
  I'd like to place an order
  I need to talk to a salesman
Repair
  repair department
  service
  service department
  customer service
Order status
  where's my order?
  track my order
  track my shipment
  where the hell is my stuff?
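One way to cope with many phrasings is to map each group of alternatives to a single result value using semantic interpretation tags. The sketch below assumes a platform that supports the SISR "semantics/1.0" tag format; the returned values "sales", "repair", and "order_status" are illustrative choices, and only a few of the phrasings above are included.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" mode="voice" root="dept"
         tag-format="semantics/1.0">
  <rule id="dept" scope="public">
    <one-of>
      <!-- Every phrasing in this group returns "sales" -->
      <item>
        <one-of>
          <item>sales</item>
          <item>I'd like to place an order</item>
          <item>I need to talk to a salesman</item>
        </one-of>
        <tag>out = "sales";</tag>
      </item>
      <!-- Every phrasing in this group returns "repair" -->
      <item>
        <one-of>
          <item>repair</item>
          <item>repair department</item>
          <item>service department</item>
          <item>customer service</item>
        </one-of>
        <tag>out = "repair";</tag>
      </item>
      <!-- Every phrasing in this group returns "order_status" -->
      <item>
        <one-of>
          <item>order status</item>
          <item>where's my order</item>
          <item>track my order</item>
          <item>track my shipment</item>
        </one-of>
        <tag>out = "order_status";</tag>
      </item>
    </one-of>
  </rule>
</grammar>
```

The application then branches on three values instead of a dozen raw utterances.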


Basic GRXML grammar example: main_menu.grxml

<grammar … xml:lang="en-US" version="1.0">
  <rule id="dept" scope="public">
    <one-of>
      <item>sales</item>
      <item>repair</item>
      <item>order status</item>
    </one-of>
  </rule>
</grammar>


VoiceXML example – next step: sales.vxml

<form>
  <field name="sales_menu">
    <prompt>
      <audio src="sales_menu.wav">
        You've reached Acme's sales department. To place an order, say sales. To speak to an associate, say I'd like to speak to someone.
      </audio>
    </prompt>
    <grammar src="sales_menu.grxml"/>
  </field>
  <block>
    <submit next="http://acme.com/... " method="get"/>
  </block>
</form>


VoiceXML example with error handling: newmain.vxml

<form>
  <noinput> You must say something. </noinput>
</form>

newmain.vxml, adding <nomatch>:

<form>
  <noinput> You must say something. </noinput>
  <nomatch> I didn't understand you. Please try again. </nomatch>
</form>

newmain.vxml, adding <help>:

<form>
  <help> You can say sales, repair, or order status. </help>
  <noinput> You must say something. </noinput>
  <nomatch> I didn't understand you. Please try again. </nomatch>
</form>


Set platform features via <property>

• Input modes: type of input from a caller
– DTMF-only: <property name="inputmodes" value="dtmf"/>
– Voice-only: <property name="inputmodes" value="voice"/>
– Both: <property name="inputmodes" value="dtmf voice"/>
• Timeouts
– <property name="timeout" value="1450ms"/>
– <property name="termtimeout" value="2500ms"/>

...
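Properties follow scoping rules: a value set at document level applies everywhere unless a form or field overrides it. A minimal sketch with illustrative values, assuming the platform supports the built-in "digits" type:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Document-level defaults: accept speech and DTMF, wait 5 seconds for input -->
  <property name="inputmodes" value="dtmf voice"/>
  <property name="timeout" value="5s"/>

  <form id="pin_entry">
    <field name="pin" type="digits?length=4">
      <!-- Field-level overrides: keypad only while collecting the PIN -->
      <property name="inputmodes" value="dtmf"/>
      <property name="interdigittimeout" value="3s"/>
      <property name="termchar" value="#"/>
      <prompt>Please enter your four digit PIN.</prompt>
    </field>
  </form>
</vxml>
```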


Call processing: <transfer>

• Blind
– Go somewhere but don't return
• Bridge
– Add on another party, resume execution when done talking


• Blind transfer

<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>
  <transfer name="mycall" dest="tel:+1-555-123-4567">
  </transfer>
</form>


• Bridge transfer

<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>
  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
  </transfer>
</form>



• Bridge transfer with cancel feature

<form id="xfer">
  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
    <prompt> Say cancel at any time to disconnect this call. </prompt>
    <grammar src="cancel.grxml" type="application/srgs+xml"/>
  </transfer>
</form>



• Bridge transfer with result handling

<form id="xfer">
  <!-- Declare the variable before assigning to it -->
  <var name="mydur"/>
  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
    <prompt> Say cancel at any time to disconnect this call. </prompt>
    <grammar src="cancel.grxml" type="application/srgs+xml"/>
    <filled>
      <assign name="mydur" expr="mycall$.duration"/>
      <if cond="mycall == 'busy'">
        <prompt> Riley's line is busy. Try again later. </prompt>
      <elseif cond="mycall == 'noanswer'"/>
        <prompt> Riley didn't answer the phone. Please call back another time. </prompt>
      </if>
    </filled>
  </transfer>
</form>




• Bridge transfer with transfer audio and connect timeout

<form id="xfer">
  <!-- Declare the variable before assigning to it -->
  <var name="mydur"/>
  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true"
            transferaudio="music.wav" connecttimeout="60s">
    <prompt> Say cancel at any time to disconnect this call. </prompt>
    <grammar src="cancel.grxml" type="application/srgs+xml"/>
    <filled>
      <assign name="mydur" expr="mycall$.duration"/>
      <if cond="mycall == 'busy'">
        <prompt> Riley's line is busy. Try back later. </prompt>
      <elseif cond="mycall == 'noanswer'"/>
        <prompt> Riley didn't answer the phone. Please call back another time. </prompt>
      </if>
    </filled>
  </transfer>
</form>





New Features in VoiceXML 2.1

• Dynamically referencing grammars and scripts
– <grammar srcexpr="…"> <script srcexpr="…">
• Detect barge-in during prompt playback: enhance SSML 1.0 <mark>
– Add nameexpr attribute
– Add markname and marktime to application.lastresult$ object
• Fetch (XML) data without transition: <data>
– Uses read-only subset of DOM
• Dynamically concatenate prompts: <foreach>
– Iterate through ECMAScript array and execute content
• Record user’s utterance while attempting ASR
– recordutterance property
– Add shadow variables: recording, recordingsize, recordingduration
• Send data upon disconnect
– <disconnect namelist="…">
• Additional <transfer> types
– <transfer type="…" …/>
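Several of the 2.1 additions are shown together in the sketch below. It assumes a VoiceXML 2.1 platform; the grammar file naming scheme, the variable names, and the phone number are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <var name="region" expr="'east'"/>

  <!-- Keep a recording of what the caller said during recognition -->
  <property name="recordutterance" value="true"/>

  <form id="route">
    <var name="utterance_ms"/>

    <field name="department">
      <prompt>Which department would you like?</prompt>
      <!-- srcexpr: the grammar URI is computed at run time -->
      <grammar srcexpr="'departments_' + region + '.grxml'"
               type="application/srgs+xml"/>
      <filled>
        <!-- Shadow variable made available by recordutterance -->
        <assign name="utterance_ms" expr="department$.recordingduration"/>
      </filled>
    </field>

    <!-- type selects the transfer style: blind, bridge, or consultation -->
    <transfer name="agent_call" type="consultation" dest="tel:+1-555-123-4567">
      <filled>
        <if cond="agent_call == 'busy' || agent_call == 'noanswer'">
          <prompt>No one is available right now.</prompt>
        </if>
      </filled>
    </transfer>

    <block>
      <!-- namelist: variables returned to the interpreter context on disconnect -->
      <disconnect namelist="department utterance_ms"/>
    </block>
  </form>
</vxml>
```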


Dynamic Applications


VoiceXML Application Structure

• Static
– User experience is the same for everyone
  • Information doesn’t change frequently
  • No customization per user, time of day, etc.
  • Pages are created once and used many times
• Dynamic
– User experience is customized by:
  • User: e.g. my.yahoo.com, amazon.com (especially once you log in)
  • Situation: e.g. travel specials on expedia.com
– Data driven, e.g. inventory system, airline reservations
– Generated by a program at runtime
  • JSP, ASP
  • App servers such as BEA, IBM WebSphere, Oracle 9iAS


VoiceXML 2.1 and AJAX

• VoiceXML + ECMAScript + <data> + XML

• <data> element allows retrieval of arbitrary XML data without document transition

• Static VoiceXML document can fetch user-specific data at runtime

• Decouple presentation layer from business logic

• Performance improvements due to:
– Cache-able VoiceXML
– No need to generate entirely new pages for each dialog when only the content is new
– Less network traffic
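A sketch of the pattern described above: a static, cacheable VoiceXML document that pulls caller-specific XML at runtime. The URL http://example.com/orders and the assumed response format (an XML document containing one <status> element per order) are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <var name="caller_id" expr="session.connection.remote.uri"/>
  <var name="orders" expr="new Array()"/>

  <form id="order_summary">
    <block>
      <!-- Fetch XML data without leaving this document -->
      <data name="order_data" src="http://example.com/orders"
            namelist="caller_id"/>

      <!-- Walk the read-only DOM and collect the text of each status element -->
      <script><![CDATA[
        var nodes = order_data.documentElement.getElementsByTagName("status");
        for (var i = 0; i < nodes.length; i++) {
          orders.push(nodes.item(i).firstChild.data);
        }
      ]]></script>

      <!-- Build the prompt dynamically from the array -->
      <prompt>
        You have <value expr="orders.length"/> orders.
        <foreach item="status" array="orders">
          Order status: <value expr="status"/>. <break time="300ms"/>
        </foreach>
      </prompt>
    </block>
  </form>
</vxml>
```

Only the small XML payload changes per caller, so the VoiceXML page itself can be cached by the platform.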


Dynamic Application Considerations

Execution of VoiceXML is running a program on your server…

• Must guarantee quality of dynamically-generated VoiceXML documents and ASR grammars
– Catch parse errors, execution errors
– What does the caller hear if there is an error?
  • not “Could not parse VoiceXML document”
• Runtime performance
– Parse and interpretation time of large documents
– Inefficient scripts and speech grammars
• Security implications
– Exploit a bug in a particular implementation? Make free phone calls?
– Could there be a VoiceXML virus? Will all platforms protect against them?

Careful application design, testing, and monitoring are essential
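One common way to control what the caller hears on an error is an application root document with catch handlers that every leaf document inherits. The sketch below is illustrative: the file name app-root.vxml and the fallback URL are assumptions, and a leaf document would opt in with <vxml application="app-root.vxml" …>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- app-root.vxml: hypothetical application root document -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Catches fetch/parse failures and runtime errors raised in any
       leaf document that names this file as its application root -->
  <catch event="error.badfetch error.semantic error.unsupported">
    <log>Error caught: <value expr="_event"/></log>
    <prompt>
      We are sorry, we are having technical difficulties.
      Please hold while we connect you to an agent.
    </prompt>
    <!-- Absolute URL: relative URLs in an inherited handler resolve
         against the active leaf document, not this root document -->
    <goto next="http://example.com/voice/fallback.vxml"/>
  </catch>
</vxml>
```

Because a badly generated page typically surfaces as error.badfetch in the document that requested it, the handler still runs even though the new page never loaded.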


Dynamic Application Considerations

• A mix of different simultaneous applications means variable platform load and execution profile
– Parse time of VoiceXML document
– Fetching VoiceXML documents, grammars, audio from remote web servers
– Load Balancing
– How to protect platform from harmful application (intentional or otherwise)?
  • Max size of document
  • Max size of grammar
  • Complexity measurement of document or grammar (statically checked before execution?)

Platforms, networks, and applications must be carefully engineered


Performance Considerations


Load Balancing for Performance and Reliability

• CPU/memory utilization
– Grammar compilation
– ASR load
– TTS load
• Telephony Network
– Channel balancing
– Dead channel
• Incoming/Outgoing channel assignment / mix


Performance: Caching

• Fetched documents, grammars, audio files, streams

• Local or distributed cache?

• Effects of prefetching

• Where to cache generated grammars?
– Per system
– In-network

• Use external grammar compilation server?


Application Management


Application Monitoring and Maintenance

• Runtime logs
– Web / application server
– Voice server
– Call Detail Reporting
• Utterance recordings and logs
– Useful for grammar and dialog tuning
– Security of recordings may be an issue
– Disk space: full-call recordings may be prohibitively large

Usage data must be continually monitored to improve user experience


Operations, Administration, Maintenance, Provisioning

• System Monitoring
– Interfacing to existing Telco OSSs
– Web-based for ISP environment
• Provisioning
– Application, Customer
  • DN-URI mapping
– Telephony
  • Call origination/transfer
  • Max call timeout
  • Max number of concurrent calls
– Platform-specific VoiceXML features
  • ECMAScript allowed?
  • Telephony control allowed?
  • Max grammar size


Billing: Logging and Charging for usage of resources

• "Platform time"
– Usage of server resources
• Toll Free usage
– It's toll free, not free
• Transferred calls
– Inbound minutes
– Outbound minutes
– Network features, e.g. Network Redirect
• Outbound calls

Accurate billing information is a critical factor in application cost or profitability


Application Deployment Models

Build-your-own network vs. Outsourcing


Build vs. Outsource? Deployment Options Enable a Variety of Business Models

• Completely in-house
– Maintain complete control for security
– Development and deployment systems can be identical
• Outsourced VoiceXML/Telephony
– Large-scale distributed networks without major capital investment
– Grow quickly and incrementally
• Completely outsourced hosting
– All components and systems managed by 3rd party
• Packaged software
– VoiceXML application integrated with existing apps


Completely In-House

• Local control of all systems

• Voice server, app server, database can be on local network

• Development and deployment systems can be identical

• Physical security: in-house team “owns” it

• Failover, reliability, scalability must be locally managed

• Redundant power, networks, etc. are required


VoiceXML On-premises Deployment using TDM or VoIP carrier connection

[Figure: the PSTN connects over TDM (DS3, multiple PRI, etc.) or a VoIP "pipe" to a VoIP gateway, PBX, etc. in a co-location facility that also houses the VoiceXML browsers, ASR servers, web applications, and database. Cisco IPCC also appears in the figure.]



Outsourced VoiceXML / Telephony

• Telephony and VoiceXML servers outsourced to "Voice Service Provider" (VSP)

• Application remains in your data center(s)
– Geographically distributed
– May be dedicated to specific customers

• Many carrier-grade vendors to choose from


Outsourced VoiceXML / Telephony

[Figure: the Voice Service Provider's carrier-grade outsourcing facility houses the PSTN connection, VoIP gateway, VoiceXML browsers, and ASR servers. It connects over the Internet to a co-location facility housing the web applications and database. Cisco IPCC also appears in the figure.]

• Architecture is identical to in-house deployment
• Secure IP connection used between facilities



Advantages of Outsourcing to a VSP

• Choice of many vendors: one for all customers, or choose the best one for each customer

• Add capacity by adding multiple vendors

• No capital investment

• Pay-as-you-go pricing models

• Failover, reliability, scalability simplified

• Physical security of equipment and networks managed by VSP

• VPN or dedicated data connection to your backend systems


Distribute Load to Multiple VSPs

[Figure: the PSTN feeds several VSP facilities, each with VoiceXML browsers and ASR servers (Cisco IPCC appears with each). All connect over the Internet to the customer co-location facility hosting the web applications and database.]

Multiple co-lo facilities can be deployed for geographic redundancy and enhanced capacity.



Completely Outsourced

• Deploy hardware & software systems at customer-managed co-location facilities

• Deploy complete systems at co-location facilities managed by 3rd party

• Deploy pre-packaged VoiceXML application integrated with customer's call center (managed by customer)


Combination of In-house and Outsourced: Several ways to balance resources

• Primary in-house, with overflow or failover to a VSP
– Local control of resources
– Overflow to VSP during peak usage
– Backup for failover / disaster recovery
• In-house development, with primary deployment via VSP
– In-house development and trials
– “Push to the network” when ready to deploy


CCXML, VoiceXML, and VoIP

3rd-Party Call Control


Inbound call using TDM connections

[Figure: Caller → PSTN → VoiceXML Server]

• 1st-party call control: VoiceXML server handles call routing/setup/answer


Inbound call using VoIP (SIP and RTP)

[Figure: Customer → PSTN → VoIP Gateway → VoiceXML Server. Message labels between the gateway and the VoiceXML server: 1. INVITE, 2. RTP.]

• 1st-party call control: VoIP gateway routes call to VoiceXML server, which handles call routing/setup/answer


Why VoIP?

• Flexible network topology

• Simplified integration of voice dialog resources

• Vendor independence for network elements

• Separation of concerns: voice dialog resources vs. call control


Inbound Call using 3rd Party Call Control

[Figure: Caller → PSTN → VoIP Gateway, with a Call Routing Application and a VoiceXML Server behind the gateway. Message labels: 1. INVITE, 2. INVITE, 3. RTP.]

• 3rd party application handles call routing/setup/answer


Outbound call using 3rd Party Call Control

[Figure: Caller, PSTN, and VoIP Gateway, with an Outbound Calling Application and a VoiceXML Server behind the gateway. Message labels: 1. INVITE, 2. INVITE, 3. RTP.]

• 3rd party application handles outbound call initiation/setup/routing
• “Attaches” VoiceXML dialog to connection


What is CCXML?

• XML-based language that manages the connections and resources used in phone calls

• Designed for 3rd-party call control applications

• Allows for easy integration into back-end web applications, very similar to VoiceXML’s model
• Uses the finite state machine model
– Event handlers move from one state to the next using markup tags

• CCXML provides commands to run a “dialog” on a call leg


Why is CCXML Needed?

• VoiceXML was designed primarily for voice dialogs
– 1st-party call control: <disconnect> and several predefined common <transfer> types
• Connection management requires full asynchronous event handling
– Connection/telephony events can occur any time during a call and must be handled
– VoiceXML specifically limits asynchronous events to simplify the execution and programming model
• 1st-party call control can be useful but has limited flexibility
– VoiceXML 2.1 <transfer> adds "consultation" feature for network redirect


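CCXML System Architecture

[Figure: the caller reaches a telephony interface over the PSTN. A CCXML server fetches CCXML documents from a telephony web application over HTTP and connects to the telephony interface through the telephony control interface and to the dialog server through the dialog control interface. The dialog server fetches VXML documents from a voice web application over HTTP. Media paths and a conference server are also shown.]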


CCXML features

• Telephony channel control: voice paths and signaling
– <createcall>, <accept>, <disconnect>, <reject>, <redirect>
• Media control: Conference Bridges and Mixers
– <join>, <unjoin>, <createconference>, <destroyconference>
• Dialog control: Add a VoiceXML (or other dialog) resource to a connection
– <dialogstart>, <dialogprepare>, <dialogterminate>
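As a rough sketch of telephony and media control working together, the CCXML below answers an inbound call, places an outbound call, and joins the two legs. It follows the element and event names of the CCXML 1.0 specification as I understand them (draft implementations may differ, for example in how the event object is named); the phone number and variable names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="in_id"/>
  <var name="out_id"/>

  <eventprocessor>
    <!-- Inbound leg is ringing: remember its ID and answer it -->
    <transition event="connection.alerting">
      <assign name="in_id" expr="event$.connectionid"/>
      <accept/>
    </transition>

    <!-- Inbound leg answered: place the outbound leg -->
    <transition event="connection.connected" cond="event$.connectionid == in_id">
      <createcall dest="'tel:+15551234567'" connectionid="out_id"/>
    </transition>

    <!-- Outbound leg answered: bridge the two legs -->
    <transition event="connection.connected" cond="event$.connectionid == out_id">
      <join id1="in_id" id2="out_id"/>
    </transition>

    <!-- Either leg fails or hangs up: end the session -->
    <transition event="connection.failed">
      <exit/>
    </transition>
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```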


Integration of CCXML and VoiceXML

• Dialogs are created using <dialogstart>
– You pass the URL of the document that you want to run
• Dialogs can be ended using <dialogterminate>
– This allows CCXML to end a dialog based on an external event such as someone calling you on a second line
• Dialogs can return data back to the CCXML platform
– In VoiceXML use <exit namelist="a b c"/>
– This is exposed in the CCXML dialog.exit event
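The integration points above can be sketched as a small CCXML state machine that starts a VoiceXML dialog and reads back its result. This assumes CCXML 1.0 syntax as published by the W3C (drafts current in 2007 may name the event object differently) and assumes the VoiceXML document main.vxml ends with <exit namelist="main_menu"/>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="state0" expr="'init'"/>

  <eventprocessor statevariable="state0">
    <!-- Incoming call: answer it -->
    <transition state="init" event="connection.alerting">
      <accept/>
    </transition>

    <!-- Call answered: attach a VoiceXML dialog to the connection -->
    <transition state="init" event="connection.connected">
      <assign name="state0" expr="'dialogActive'"/>
      <dialogstart src="'main.vxml'"/>
    </transition>

    <!-- Dialog finished: values from <exit namelist="..."/> arrive on the event -->
    <transition state="dialogActive" event="dialog.exit">
      <log expr="'Caller chose: ' + event$.values.main_menu"/>
      <exit/>
    </transition>

    <!-- Caller hung up at any point -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```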


W3C CCXML 1.0 status

• Nearing "Candidate Recommendation" status
– Language complete
– Test suite under development
– Certification Program under consideration
• Growing support throughout the world
• Several open source projects underway
– See http://www.sourceforge.net


Next-Generation Technologies


Next-Generation Technologies

• Speaker Biometrics-based authentication
– Speaker Identification
– Speaker Verification
• Video IVR: VoiceXML augmented with video
– Early stages of commercial deployment now
– Simple extension to standard platforms
– Straightforward step towards full multimodal
• Multimodal
– Multiple input modalities: speech recognition, keypad, handwriting, biometrics (voice, fingerprint, iris, etc.), geolocation, motion
– Multiple output modalities: graphics, audio (speech, TTS, music, polyphonic tones)


Speaker Biometrics


Why Speaker Biometrics?

• Identify an individual for remote transactions

• Text / DTMF PINs are inadequate
– Easily compromised
– Easily forgotten
– Do not identify an individual
• US Federal Regulations
– FFIEC guidelines for financial services


Speaker Identification and Verification (SIV)

• Authentication
– The process of confirming one or more identities.
• Speaker Identification (one-to-many)
– Authentication with multiple identity claims.
• Speaker Verification (one-to-one)
– Authentication with a single identity claim.


Types of SIV

• Text independent
– SIV technology that can operate on any freeform or structured spoken input.
• Text dependent
– SIV technology (usually verification technology) that requires the voice input of one or more specific, previously enrolled passwords or pass phrases.
• Text prompted
– SIV technology (usually verification) that randomly selects words and/or phrases and prompts the speaker to repeat them. Also called challenge-response.


Fundamental Phases of SIV

• Enrollment
– Capture one or more user utterances to ‘train’ the system
• Verification
– Capture one or more user utterances to make an identity claim
• Adaptation & Scoring
– Judge how close the user’s verification utterance is to the enrolled utterance
– Refine the existing enrolled utterance with information from the verification utterance


Video and Multimodal


“Video” VoiceXML

• Video extensions to VoiceXML
– 3G Wireless
– VoIP phones
• VoiceXML is just a dialog language
– Initially only for voice input/output
• Example
– Videomail is a dialog application very similar to voicemail
• Video and audio are somewhat analogous
– VoiceXML can be ‘hacked’ to handle video now:
  • <audio src="foo.au"/> could “play” a video file via <audio src="foo.mpeg4"/>
– VoiceXML 3.0 might add a new language feature
  • e.g. <video src="foo.avi"> or <media src="foo.mpeg4">


“Video” VoiceXML Deployment and Standardization

• Simple extension to standard platforms
– Easy integration with current platforms
– Doesn’t “break” existing functionality
– Well aligned with “VoiceXML model”
• Early stages of commercial deployment
– Several vendors have deployed large-scale commercial systems
• Step towards full multimodal


Multimodal Applications

• W3C Multimodal Interaction Working Group
– Defining new standards based on extensive industry experience
• IBM / Motorola / Opera X+V 1.2
– Early stages of commercial deployment
– Freely available from Opera http://dev.opera.com/articles/voice/

For more information, see: W3C Multimodal Interaction Working Group http://www.w3.org/2002/mmi


VoiceXML 3.0


VoiceXML 3.0

• Modularization
– Cleanly separate functions to enable integration with other modalities
– Enables code reuse
• New media processing
– Video
– Voice processing
– Navigation
– Speaker biometrics
• Separation of data, control flow and presentation
– Control flow embodied in new language: SCXML
• Clean data model


References

• W3C Voice Browser Working Group http://www.w3.org/voice
– VoiceXML 2.0 Recommendation
  • http://www.w3.org/TR/voicexml20/
– VoiceXML 2.1 Candidate Recommendation
  • http://www.w3.org/TR/voicexml21/
– Semantic Interpretation Working Draft
  • http://www.w3.org/TR/semantic-interpretation/
– SRGS 1.0 Recommendation
  • http://www.w3.org/TR/speech-grammar/
– SSML
  • 1.0 Recommendation http://www.w3.org/TR/speech-synthesis/
  • 1.1 Working Draft http://www.w3.org/TR/speech-synthesis11/
– CCXML 1.0
  • http://www.w3.org/TR/ccxml/
– SCXML
  • http://www.w3.org/TR/scxml/
• IETF http://www.ietf.org


Ken Rehor
http://www.kenrehor.com

VoiceXML Forum Co-founder and past Chair

Chair, VoiceXML Forum Conformance Committee

Co-Chair, VoiceXML Forum Speaker Biometrics Committee

W3C
• Co-editor: VoiceXML 1.0, 2.0, 2.1, 3.0
• Co-editor: CCXML 1.0