BravoBrava Mississippi State University
Can Advances in Speech Recognition Make Spoken Language as Convenient and as Accessible as Online Text?
Joseph Picone, PhD
Professor, Electrical Engineering
Mississippi State University
Patti Price, PhD
VP Business Development
BravoBrava LLC
Outline
• Introduction and state of the art (Price)
• Research issues (Picone)
– Evaluation metrics
– Acoustic modeling
– Language modeling
– Practical issues
– Technology demands
• Conclusion and future directions (Price)
Introduction: What is Speech Recognition?
[Figure: Speech Signal → Speech Recognition → Words (“How are you?”)]
Goal: Automatically extract the string of words spoken from the speech signal
• Speech recognition does NOT determine
– Who the talker is (speaker recognition, Heck and Reynolds)
– Speech output (speech synthesis, Fruchterman and Ostendorf)
– What the words mean (speech understanding)
Introduction: Speech in the Information Age
• Speech & text were revolutionary because of information access
• New media and connectivity yield information overload
• Can speech technology help?
[Figure: timeline of sources of information (speech; text; film, video, multimedia, voice mail, radio, television, conferences, web, on-line resources) and of access to information (listen, remember; read books; computer typing; careful spoken, written input; conversational language)]
State of the Art: Initial and Current Applications
• Database query
– Resource management
– Air travel information
– Stock quote
1997
• Command and control
– Manufacturing
– Consumer products
http://www.speech.be.philips.com/
• Dictation
– http://www.dragonsys.com
– http://www-4.ibm.com/software/speech
Nuance, American Airlines: 1-800-433-7300, touch 1
State of the Art: How Do You Measure?
USC, October 15, 1999: “the world's first machine system that can recognize spoken words better than humans can.”
“In benchmark testing using just a few spoken words, USC's Berger-Liaw … System not only bested all existing computer speech recognition systems but outperformed the keenest human ears.”
• What benchmarks? What was training? What was test? Were they independent? How large was the vocabulary and the sample size? Did they really test all existing systems?
“… functions at 60 percent recognition with a hubbub level 560 times the strength of the target stimulus.”
• Is that different from chance? Was the noise added or coincident with speech? What kind of noise? Was it independent of the speech?
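Two of the questions above can be made concrete with a little arithmetic. The following is a minimal sketch, not the USC evaluation: the vocabulary sizes and trial counts are hypothetical, since the quoted claim reports neither, and it is an assumption whether “560 times the strength” refers to power or to amplitude. It shows how a reported 60 percent accuracy compares with uniform random guessing, and how the 560:1 noise ratio converts to decibels under each reading.

```python
import math


def chance_accuracy(vocabulary_size: int) -> float:
    """Accuracy expected from uniform random guessing among equally likely words."""
    return 1.0 / vocabulary_size


def binomial_tail(n_trials: int, n_correct: int, p: float) -> float:
    """Exact one-sided p-value: P(X >= n_correct) for X ~ Binomial(n_trials, p)."""
    return sum(
        math.comb(n_trials, k) * p**k * (1.0 - p) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


def ratio_to_db(ratio: float, is_power: bool = True) -> float:
    """Convert a ratio to decibels: 10*log10 for power, 20*log10 for amplitude."""
    factor = 10.0 if is_power else 20.0
    return factor * math.log10(ratio)


if __name__ == "__main__":
    # Hypothetical (vocabulary size, number of test tokens) pairs.
    for vocab, trials in [(2, 10), (2, 100), (10, 10)]:
        correct = round(0.60 * trials)  # 60 percent recognition, as quoted
        p_value = binomial_tail(trials, correct, chance_accuracy(vocab))
        print(f"vocab={vocab:2d} trials={trials:3d} correct={correct:3d} "
              f"p(chance does this well)={p_value:.4g}")

    # Noise 560 times stronger than the target means a negative SNR.
    print(f"SNR if 560 is a power ratio:      {-ratio_to_db(560.0):.1f} dB")
    print(f"SNR if 560 is an amplitude ratio: {-ratio_to_db(560.0, is_power=False):.1f} dB")
```

With a two-word vocabulary and ten test tokens, six correct answers arise by chance with probability 386/1024, about 0.38, so 60 percent would not be distinguishable from guessing. With a ten-word vocabulary the same score would be far from chance. This is why vocabulary size and sample size matter when reading such claims.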
State of the Art: Factors that Affect Performance
• 1985
– Noise environment: quiet room, fixed high-quality mic
– Speech style: careful reading
– User population: speaker-dependent
– Complexity: application-specific speech and language
• 1995
– Noise environment: normal office, various microphones, telephone
– Speech style: planned speech
– User population: speaker independent and adaptive
– Complexity: expert years to create app-specific language model
• 2000
– Noise environment: vehicle noise, radio, cell phones
– Speech style: natural human-machine dialog (user can adapt)
– User population: regional accents, native speakers, competent foreign speakers
– Complexity: some application-specific data and one engineer year
• 2005
– Noise environment: wherever speech occurs
– Speech style: all styles including human-human (unaware)
– User population: all speakers of the language, including foreign
– Complexity: application independent or adaptive
Research Theory and Trends: Initial and Current Applications
• Insert Joe’s slides here
Conclusion and Future Directions: Trends
We need new technology to help with information overload
• Speech information sources are everywhere
– Voice mail messages
– Professional talk
– Lectures, broadcasts
• Speech sources of information will increase
– As devices shrink
– As mobility increases
– New uses: annotation, documentation
Speech as Access: What are the words?
Speech as Source: What does it mean?
Information as Partner: Here’s what you need.
Conclusion and Future Directions: Limitations on Applications
• Recognition performance, especially in error recovery UI
• Natural language understanding (speech differs from text)
– Speech unfolds linearly in time
– Speech is more indeterminate than text
– Speech has different syntax and semantics
– Prosody differs from punctuation
• Cost to develop applications (too few experts)
• Cost to integrate/interoperate with other technologies
• New capabilities
– “When did he say Y and was he angry?”
– Scanning, refocusing quickly (browsing)
– Match past pattern, find novel aspects
– Proactive information
– Gist, summarize, translate for different purposes
Conclusion and Future Directions: Applications on the Horizon
Beginnings of speech as source of information
• ISLIP http://www.mediasite.net/info/frames.htm
• Virage http://www.virage.com
Speech technology in education and training
• Cliff Stoll, High Tech Heretic: Why [the computer] doesn’t belong in the classroom
– Good schools need no computers
– Bad schools won’t be improved by them
• Beulah Arnott: also true of indoor plumbing
• BravoBrava: Co-evolving technology and people can
– Dramatically reduce the cost of delivery of content
– Increase its timeliness, quality and appropriateness
– Target needs of individual and/or group
– Reading Pal demo
Summary
Goal: Speech Better Than Text
Healthy loop between research and applications
• Research leads to applications, which lead to new research opportunities
We need collaboration
• Too much for one person, one site, one country
Humans will probably continue to be better than machines at many things
Can we learn to use technology and training to augment human-human and human-machine collaboration?
It’s not a solved problem
• Further technology development needed to enable the vision