BravoBrava Mississippi State University Can Advances in Speech Recognition make Spoken Language as Convenient and as Accessible as Online Text? Joseph

BravoBrava Mississippi State University

Can Advances in Speech Recognition Make Spoken Language as Convenient and as Accessible as Online Text?

Joseph Picone, PhD
Professor, Electrical Engineering
Mississippi State University

Patti Price, PhD
VP Business Development
BravoBrava LLC


Outline

• Introduction and state of the art (Price)

• Research issues (Picone)

– Evaluation metrics
– Acoustic modeling
– Language modeling
– Practical issues
– Technology demands

• Conclusion and future directions (Price)


Introduction: What is Speech Recognition?

[Figure: Speech Signal → Speech Recognition → Words (“How are you?”)]

Goal: Automatically extract the string of words spoken from the speech signal

• Speech recognition does NOT determine
– Who the talker is (speaker recognition, Heck and Reynolds)
– Speech output (speech synthesis, Fruchterman and Ostendorf)
– What the words mean (speech understanding)
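A common way to score how well a recognizer meets this goal is word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. The sketch below is illustrative only. The function name and the sample strings are assumptions, not part of the original slides, and it assumes the input is already normalized (lowercased here, with punctuation removed beforehand).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; assumes whitespace-separated
    words with punctuation already stripped.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("how are you", "how are you"))
    print(word_error_rate("how are you", "who are you today"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.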


Introduction: Speech in the Information Age

• Speech & text were revolutionary because of information access
• New media and connectivity yield information overload
• Can speech technology help?

[Figure: timeline. Source of information: speech; text; film, video, multimedia, voice mail, radio, television, conferences, web, on-line resources. Access to information: listen and remember; read books; computer typing; careful spoken and written input; conversational language.]


State of the Art: Initial and Current Applications

• Database query
– Resource management
– Air travel information
– Stock quote

1997

• Command and control
– Manufacturing
– Consumer products

http://www.speech.be.philips.com/

• Dictation
– http://www.dragonsys.com
– http://www-4.ibm.com/software/speech

Nuance, American Airlines: 1-800-433-7300, touch 1


State of the Art: How Do You Measure?

USC, October 15, 1999: “the world's first machine system that can recognize spoken words better than humans can.”

“In benchmark testing using just a few spoken words, USC's Berger-Liaw … System not only bested all existing computer speech recognition systems but outperformed the keenest human ears.”

• What benchmarks? What was training? What was test? Were they independent? How large was the vocabulary and the sample size? Did they really test all existing systems?

“… functions at 60 percent recognition with a hubbub level 560 times the strength of the target stimulus.”

• Is that different from chance? Was the noise added or coincident with speech? What kind of noise? Was it independent of the speech?
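These questions can be made concrete with a little arithmetic. The sketch below shows two checks: whether an accuracy figure is distinguishable from guessing, and what a noise-to-target ratio means in decibels. The press release quoted above does not give the vocabulary size or the number of test tokens, so every number passed in here is hypothetical. Whether "strength" means power or amplitude is also not stated, so both conversions are shown.

```python
import math


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p**k * (1.0 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def p_value_vs_chance(correct: int, trials: int, vocabulary_size: int) -> float:
    """Probability of doing at least this well by guessing uniformly
    among vocabulary_size equally likely words (an assumed chance model)."""
    return binomial_tail(correct, trials, 1.0 / vocabulary_size)


def snr_db(noise_to_target_ratio: float, ratio_is_power: bool = True) -> float:
    """Signal-to-noise ratio in dB when noise is `ratio` times the target."""
    scale = 10.0 if ratio_is_power else 20.0
    return -scale * math.log10(noise_to_target_ratio)


if __name__ == "__main__":
    # Hypothetical test: 2-word vocabulary, 10 tokens, 6 correct (60 percent).
    print(p_value_vs_chance(6, 10, 2))
    # Hypothetical test: 10-word vocabulary, 100 tokens, 60 correct (60 percent).
    print(p_value_vs_chance(60, 100, 10))
    # "560 times the strength", read as a power ratio, then as an amplitude ratio.
    print(snr_db(560.0, ratio_is_power=True))
    print(snr_db(560.0, ratio_is_power=False))
```

The same 60 percent is unremarkable for a two-word vocabulary with ten test tokens and far from chance for a ten-word vocabulary with a hundred, which is why vocabulary size and sample size have to be reported alongside the accuracy.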


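State of the Art: Factors that Affect Performance

[Figure: capability by year along four dimensions: noise environment, speech style, user population, complexity]

• 1985
– Noise environment: quiet room, fixed high-quality mic
– Speech style: careful reading
– User population: speaker-dependent
– Complexity: application-specific speech and language
• 1995
– Noise environment: normal office, various microphones, telephone
– Speech style: planned speech
– User population: speaker independent and adaptive
– Complexity: expert years to create app-specific language model
• 2000
– Noise environment: vehicle noise, radio, cell phones
– Speech style: natural human-machine dialog (user can adapt)
– User population: regional accents, native speakers, competent foreign speakers
– Complexity: some application-specific data and one engineer year
• 2005
– Noise environment: wherever speech occurs
– Speech style: all styles including human-human (unaware)
– User population: all speakers of the language, including foreign
– Complexity: application independent or adaptive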

Research Theory and Trends: Initial and Current Applications

• Insert Joe’s slides here


Conclusion and Future Directions: Trends

We need new technology to help with information overload

• Speech information sources are everywhere
– Voice mail messages
– Professional talk
– Lectures, broadcasts
• Speech sources of information will increase
– As devices shrink
– As mobility increases
– New uses: annotation, documentation

• Speech as Access: What are the words?
• Speech as Source: What does it mean?
• Information as Partner: Here’s what you need.


Conclusion and Future Directions: Limitations on Applications

• Recognition performance, especially in error recovery UI

• Natural language understanding (speech differs from text)
– Speech unfolds linearly in time
– Speech is more indeterminate than text
– Speech has different syntax and semantics
– Prosody differs from punctuation

• Cost to develop applications (too few experts)

• Cost to integrate/interoperate with other technologies

• New capabilities
– “When did he say Y and was he angry?”
– Scanning, refocusing quickly (browsing)
– Match past pattern, find novel aspects
– Proactive information
– Gist, summarize, translate for different purposes


Conclusion and Future Directions: Applications on the Horizon

Beginnings of speech as source of information

• ISLIP http://www.mediasite.net/info/frames.htm
• Virage http://www.virage.com

Speech technology in education and training

• Cliff Stoll, High Tech Heretic: Why [the computer] doesn’t belong in the classroom
– Good schools need no computers
– Bad schools won’t be improved by them
• Beulah Arnott: also true of indoor plumbing
• BravoBrava: Co-evolving technology and people can
– Dramatically reduce the cost of delivery of content
– Increase its timeliness, quality and appropriateness
– Target needs of individual and/or group
– Reading Pal demo


Summary

Goal: Speech Better Than Text

Healthy loop between research and applications

• Research leads to applications, which lead to new research opportunities

We need collaboration

• Too much for one person, one site, one country

Humans will probably continue to be better than machines at many things

Can we learn to use technology and training to augment human-human and human-machine collaboration?

It’s not a solved problem

• Further technology development needed to enable the vision
