Speech Technologies and VoiceXML

Speech Technologies and VoiceXML

try

Department of Computer Science

National Cheng-Chi University

Reference [1]Bob Edgar(2001),“The VoiceXML Handbook” ,NY:CM

P Books. [2]Dave Raggett(2001),”Getting started with VoiceXML

2.0”,W3C. [3]Sun Microsystems(1998),”Java Speech Grammar For

mat Specification v1.0”,Sun Microsystems. [4]Chetan Sharma and Jeff Kunins(2002),”VoiceXML:St

rategies and Techniques for Effective Voice Application Development with VoiceXML 2.0”,Wiley.

[5]Brian Eberman,Jerry Carter,Darren Meyer,David Goddeau(2002),”Building VoiceXML Browsers with OpenVXI”, NY:ACM Press.

Reference [6]Microsoft (2002),“Speech Technology Overview ” , htt

p://www.microsoft.com/speech/evaluation/techover/ [7] VoiceGenie Technologies Inc.(2001),”White Paper:S

peaking Freely About The VoiceGenie VoiceXML Gateway and the VoiceXML Interpreter”,VoiceGenie Technologies Inc.

[8]W3C(2002),”VoiceXML Specification v2.0”,W3C. [9]Chun-Feng,Liao(2002),” Basics of Speech Recognitio

n”,NCCU Computer Center.

Presentation Agenda Voice technologies Backgrounds

ASR/TTS

Voice browsing with VoiceXML VoiceXML architecture Implementations of VoiceXML Platform VoiceXML document structure Bringing Voice Technologies into Virtual Environm

ent

Voice Technologies In the mid- to late 1990s, personal computers started

to become powerful enough to support ASR The two key underlying technologies behind these

advances are speech recognition (SR) and text-to-speech synthesis (TTS).

Classification of Voice Application Basic interactive voice response (IVR)

Computer: “For stock quotes, press 1. For trading, press 2. …”

Human: (presses DTMF “1”)

Basic speech ASR C: “Say the stock name for a price quote.” H: “Lucent Technologies”

Classification of Voice Application Advanced speech ASR

C: “Stock Services, how may I help you?” H: “Uh, what’s Lucent trading at?”

“Near-natural language” ASR C: “How may I help you?” H: “Um, yeah, I’d like to get the current price of Lucent

Technologies” C: “Lucent is up two at sixty eight and a half.” H: “OK. I want to buy one hundred shares at market price.” C: “…”

Speech Recognition Capturing speech (analog) signals Digitizing the sound waves, converting them to basic

language units or phonemes, Constructing words from phonemes, and

contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right).

Speech Recognition Process Flow

Source:Microsoft Speech.NET Home(http://www.microsoft.com/speech/ )

Speech Recognition Process Flow Step 1:User Input

The system catches user’s voice in the form of analog acoustic signal .

Step 2:Digitization Digitize the analog acoustic signal.

Step 3:Phonetic Breakdown Breaking signals into phonemes.

Speech Recognition Process Flow Step 4:Statistical Modeling

Mapping phonemes to their phonetic representation using statistics model (ex:HMM)

Step 5:Matching According to grammar , phonetic representation and Dicti

onary , the system returns an n-best list (I.e.:a word plus a confidence score

Grammar-the union words or phrases to constraint the range of input or output in the voice application.

Dictionary-the mapping table of phonetic representation and word(EX:thu,theethe)

Speech Synthesis Speech Synthesis, or text-to-speech, is the process of

converting text into spoken language. Breaking down the words into phonemes; Analyzing for special handling of text such as numbers,

currency amounts. Generating the digital audio for playback.

Speech Synthesis

Source:Microsoft Speech.NET Home(http://www.microsoft.com/speech/ )

Pervasive Computing Model E-business has changed from client-server model to

web-centric model Once connect to the Internet,one can get any

information he want. But people wants more convenient way to connect to Internet.

Lou Gerstner,CEO of IBM:Pervasive Computing Model is billion people interacting with million e-business with trillion devices interconnected.

Voice Browsing VoiceXML instead of HTML A voice browser instead of an ordinary web browser Phone instead of PC.

Show : An Scenario of Using VoiceXML

應用程式

VoiceXML Overview A language for specifying voice dialogs. Voice dialogs use audio prompts and text-to-speech

(TTS) for output; touch-tone keys (DTMF) and automatic speech recognition (ASR) for input.

Main input/output device (initially) is the phone. Leverages the Internet for application development

and delivery. Standard language enables portability.(unifies dialog

control languages)

History of VoiceXML

Source:VoiceXML forum(http://www.voicexml.org)

Making use of mature Internet Technologies Leverage existing web application development

tools. Leverage existing web infrastructure for application

delivery. Clean separation of service logic from user

interaction.

VoiceXML Platform Architecture

VoiceXML Platform Architecture-1 Telephone and Telephone network-Connects caller’s

telephone with Telephony Server VoiceXML Gateway

Voice Browser Audio input-Speech Recognition (ASR), Touchtone (DT

MF), Audio recording. Audio output-Audio playback, Speech Synthesis (TTS) Interface, Call Controls

VoiceXML Platform Architecture-2 VoiceXML Documents

Dialog and flow control Client-side scripting (ECMAScript) Speech Recognition grammar Speech Synthesis pronunciation control

Document servers(web server) Feeding Static VoiceXML documents or audio files.

Application servers Generate VoiceXML documents dynamically. Server-side application logic Connect to Database, or database interface

Voice Gateway

VoiceXML Gateway(detail)

Implementations of VoiceXML Gateways

In Taiwan: Yes Mobile Chunghwa Telecom Laboratories eWings Technologies, Inc

Free IBM VoiceServerSDK

Open Source CMU:OpenVXI

[DEMO]How to Write and Run VoiceXML Applications?

[DEMO]Generate VoiceXML Document Dynamically-using ASP.NET

VoiceXML Document Structure.

A Simple VoiceXML Document

[DEMO]VoiceXML /HTML Comparison

Bringing Voice Technologies to 3D Virtual Environment

Related Research Raymond L.Smith,III and Stephen D.Roberts:

Using voice input command to operate simulation-animation.

The efficiency issues of ASR/TTS are taken into account. Satoru,Osamu,Katunobu,Takashi,Tomoyoshi,Hideki,

Shotaro,Takio and Katsuhiko: Create 3D virtual user who can speak with user via speake

r and microphone. Virtual User have the ability to learn words and recognize

human face.

We can do more.. Speak to many users who are “moving” in virtual en

vironment. System are built in distributed environment.(I.e. we

b) Make use of XML technology (VoiceXML/SALT).

Problems to Solve Voice /Animation synchronization. Protocol integration. ASR/TTS integration and its performance issues. Virtual user autonomy. The “Voice propagation range” issues.

System Design Prototype

Summary Speech is the most natural way for human to commu

nicate thus it will become an important way in HCI. VoiceXML has revolutionized speech recognition &

telephony application development & deployment. Adding Speech facilities into 3D virtual environment

will make UI more friendly and enable multi-modal input/output.

My research interest on this topic will focus on voice-animation synchronization and enable SR/TTS in distributed 3D virtual environment .

Q & A