Upload
sopoline-knowles
View
46
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Speech Technologies and VoiceXML. try Department of Computer Science National Cheng-Chi University. Reference. [1]Bob Edgar(2001), “ The VoiceXML Handbook ” ,NY:CMP Books. [2]Dave Raggett(2001), ” Getting started with VoiceXML 2.0 ” ,W3C. - PowerPoint PPT Presentation
Citation preview
Reference [1]Bob Edgar(2001),“The VoiceXML Handbook” ,NY:CM
P Books. [2]Dave Raggett(2001),”Getting started with VoiceXML
2.0”,W3C. [3]Sun Microsystems(1998),”Java Speech Grammar For
mat Specification v1.0”,Sun Microsystems. [4]Chetan Sharma and Jeff Kunins(2002),”VoiceXML:St
rategies and Techniques for Effective Voice Application Development with VoiceXML 2.0”,Wiley.
[5]Brian Eberman,Jerry Carter,Darren Meyer,David Goddeau(2002),”Building VoiceXML Browsers with OpenVXI”, NY:ACM Press.
Reference [6]Microsoft (2002),“Speech Technology Overview ” , htt
p://www.microsoft.com/speech/evaluation/techover/ [7] VoiceGenie Technologies Inc.(2001),”White Paper:S
peaking Freely About The VoiceGenie VoiceXML Gateway and the VoiceXML Interpreter”,VoiceGenie Technologies Inc.
[8]W3C(2002),”VoiceXML Specification v2.0”,W3C. [9]Chun-Feng,Liao(2002),” Basics of Speech Recognitio
n”,NCCU Computer Center.
Presentation Agenda Voice technologies Backgrounds
ASR/TTS
Voice browsing with VoiceXML VoiceXML architecture Implementations of VoiceXML Platform VoiceXML document structure Bringing Voice Technologies into Virtual Environm
ent
Voice Technologies In the mid- to late 1990s, personal computers started
to become powerful enough to support ASR The two key underlying technologies behind these
advances are speech recognition (SR) and text-to-speech synthesis (TTS).
Classification of Voice Application Basic interactive voice response (IVR)
Computer: “For stock quotes, press 1. For trading, press 2. …”
Human: (presses DTMF “1”)
Basic speech ASR C: “Say the stock name for a price quote.” H: “Lucent Technologies”
Classification of Voice Application Advanced speech ASR
C: “Stock Services, how may I help you?” H: “Uh, what’s Lucent trading at?”
“Near-natural language” ASR C: “How may I help you?” H: “Um, yeah, I’d like to get the current price of Lucent
Technologies” C: “Lucent is up two at sixty eight and a half.” H: “OK. I want to buy one hundred shares at market price.” C: “…”
Speech Recognition Capturing speech (analog) signals Digitizing the sound waves, converting them to basic
language units or phonemes, Constructing words from phonemes, and
contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right).
Speech Recognition Process Flow Step 1:User Input
The system catches user’s voice in the form of analog acoustic signal .
Step 2:Digitization Digitize the analog acoustic signal.
Step 3:Phonetic Breakdown Breaking signals into phonemes.
Speech Recognition Process Flow Step 4:Statistical Modeling
Mapping phonemes to their phonetic representation using statistics model (ex:HMM)
Step 5:Matching According to grammar , phonetic representation and Dicti
onary , the system returns an n-best list (I.e.:a word plus a confidence score
Grammar-the union words or phrases to constraint the range of input or output in the voice application.
Dictionary-the mapping table of phonetic representation and word(EX:thu,theethe)
Speech Synthesis Speech Synthesis, or text-to-speech, is the process of
converting text into spoken language. Breaking down the words into phonemes; Analyzing for special handling of text such as numbers,
currency amounts. Generating the digital audio for playback.
Pervasive Computing Model E-business has changed from client-server model to
web-centric model Once connect to the Internet,one can get any
information he want. But people wants more convenient way to connect to Internet.
Lou Gerstner,CEO of IBM:Pervasive Computing Model is billion people interacting with million e-business with trillion devices interconnected.
Voice Browsing VoiceXML instead of HTML A voice browser instead of an ordinary web browser Phone instead of PC.
VoiceXML Overview A language for specifying voice dialogs. Voice dialogs use audio prompts and text-to-speech
(TTS) for output; touch-tone keys (DTMF) and automatic speech recognition (ASR) for input.
Main input/output device (initially) is the phone. Leverages the Internet for application development
and delivery. Standard language enables portability.(unifies dialog
control languages)
Making use of mature Internet Technologies Leverage existing web application development
tools. Leverage existing web infrastructure for application
delivery. Clean separation of service logic from user
interaction.
VoiceXML Platform Architecture-1 Telephone and Telephone network-Connects caller’s
telephone with Telephony Server VoiceXML Gateway
Voice Browser Audio input-Speech Recognition (ASR), Touchtone (DT
MF), Audio recording. Audio output-Audio playback, Speech Synthesis (TTS) Interface, Call Controls
VoiceXML Platform Architecture-2 VoiceXML Documents
Dialog and flow control Client-side scripting (ECMAScript) Speech Recognition grammar Speech Synthesis pronunciation control
Document servers(web server) Feeding Static VoiceXML documents or audio files.
Application servers Generate VoiceXML documents dynamically. Server-side application logic Connect to Database, or database interface
Implementations of VoiceXML Gateways
In Taiwan: Yes Mobile Chunghwa Telecom Laboratories eWings Technologies, Inc
Free IBM VoiceServerSDK
Open Source CMU:OpenVXI
Related Research Raymond L.Smith,III and Stephen D.Roberts:
Using voice input command to operate simulation-animation.
The efficiency issues of ASR/TTS are taken into account. Satoru,Osamu,Katunobu,Takashi,Tomoyoshi,Hideki,
Shotaro,Takio and Katsuhiko: Create 3D virtual user who can speak with user via speake
r and microphone. Virtual User have the ability to learn words and recognize
human face.
We can do more.. Speak to many users who are “moving” in virtual en
vironment. System are built in distributed environment.(I.e. we
b) Make use of XML technology (VoiceXML/SALT).
Problems to Solve Voice /Animation synchronization. Protocol integration. ASR/TTS integration and its performance issues. Virtual user autonomy. The “Voice propagation range” issues.
Summary Speech is the most natural way for human to commu
nicate thus it will become an important way in HCI. VoiceXML has revolutionized speech recognition &
telephony application development & deployment. Adding Speech facilities into 3D virtual environment
will make UI more friendly and enable multi-modal input/output.
My research interest on this topic will focus on voice-animation synchronization and enable SR/TTS in distributed 3D virtual environment .