
Page 1: Can Advances in Speech Recognition Make Spoken Language as Convenient and as Accessible as Online Text?

Joseph Picone, PhD
Professor, Electrical Engineering
Mississippi State University

Patti Price, PhD
VP Business Development
BravoBrava

Page 2: Outline

• Introduction and state of the art (Price)

• Research issues (Picone)

– Evaluation metrics
– Acoustic modeling
– Language modeling
– Practical issues
– Technology demands

• Conclusion and future directions (Price)

Page 3: Introduction: What is Speech Recognition?

[Diagram: Speech Signal → Speech Recognition → Words (“How are you?”)]

Goal: Automatically extract the string of words spoken from the speech signal

• Speech recognition does NOT determine:
– Who the talker is (speaker recognition; Heck and Reynolds)
– Speech output (speech synthesis; Fruchterman and Ostendorf)
– What the words mean (speech understanding)

Page 4: Introduction: Speech in the Information Age

• Speech and text were revolutionary because of information access.
• New media and connectivity yield information overload.
• Can speech technology help?

[Table: sources of and access to information, over time]
Source of information: speech → text → film, video, multimedia, voice mail, radio, television, conferences, web, on-line resources
Access to information: listen and remember → read books → computer typing → careful spoken and written input → conversational language

Page 5: State of the Art: Initial and Current Applications

• Database query
– Resource management
– Flight information
– Stock quote

1997

• Command and control
– Manufacturing
– Consumer products

http://www.speech.be.philips.com/

• Dictation
– http://www.dragonsys.com
– http://www-4.ibm.com/software/speech

Nuance, American Airlines: 1-800-433-7300, touch 1

Page 6: State of the Art: How Do You Measure?

• What benchmarks?

• Have other systems used the same one?

• What was training set?

• What was test set?

• Were training and test independent?

• How large were the vocabulary and the sample size?

• What speakers?

• What style of speech?

• What kind of noise?

Page 7: State of the Art: Factors that Affect Performance

[Chart, reconstructed as a table: what systems could handle, by year]

Year | Noise Environment | Speech Style | User Population | Complexity
1985 | quiet room, fixed high-quality mic | careful reading | speaker-dependent | application-specific speech and language
1995 | normal office, various microphones, telephone | planned speech | speaker independent and adaptive | expert years to create app-specific language model
2000 | vehicle noise, radio, cell phones | natural human-machine dialog (user can adapt) | regional accents, native speakers, competent foreign speakers | some application-specific data and one engineer year
2005 | wherever speech occurs | all styles, including human-human (unaware) | all speakers of the language, including foreign | application independent or adaptive

Page 8: Evaluation Metrics: Evolution

• Spontaneous telephone speech is still a “grand challenge”.

• Telephone-quality speech is still central to the problem.

• The vision for speech technology continues to evolve.

• Broadcast news is a very dynamic domain.

[Chart: word error rate (0% to 40%) vs. level of difficulty, from Digits, Continuous Digits, Command and Control, and Letters and Numbers up through Read Speech, Broadcast News, and Conversational Speech]

Page 9: Evaluation Metrics: Human Performance

[Chart: word error rate (0% to 20%) vs. speech-to-noise ratio (10 dB, 16 dB, 22 dB, quiet) on Wall Street Journal data with additive noise, comparing machines against a committee of human listeners]

• Human performance exceeds machine performance by a factor ranging from 4x to 10x depending on the task.

• On some tasks, such as credit card number recognition, machine performance exceeds that of humans because of the limits of human memory retrieval.

• The nature of the noise is as important as the SNR (e.g., cellular phones).

• A primary failure mode for humans is inattention.

• A second major failure mode is the lack of familiarity with the domain (i.e., business terms and corporation names).


Page 10: Evaluation Metrics: Machine Performance

[Chart: benchmark word error rates by year, 1988–2003, on a log scale from 1% to 100%: read speech (1k, 5k, and 20k word vocabularies; noisy; varied microphones), spontaneous speech, broadcast speech, and conversational speech, including foreign-language variants]

• Common evaluations fuel technology development.

• Tasks become progressively more ambitious and challenging.
• A Word Error Rate (WER) below 10% is considered acceptable (the computation is sketched below).
• Performance in the field is typically 2x to 4x worse than performance on an evaluation.

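WER is the number of word substitutions, deletions, and insertions in the best alignment of the hypothesis against the reference transcript, divided by the number of reference words. A minimal sketch of the standard computation (illustrative; the talk contains no code):

```python
# Levenshtein alignment of a hypothesis against a reference transcript.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("how are you", "how old are you"))  # 0.33: one insertion, three reference words
```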

Page 11: Evaluation Metrics: Beyond WER: Named Entity

• Information extraction is the analysis of natural language to collect information about specified types of entities.

• As the focus shifts to providing enhanced annotations, WER may not be the most appropriate measure of performance (content-based scoring).

[Chart: named-entity F-measure (70% to 100%) vs. word error rate (0% to 30%) on the Hub-4 Eval’98 data]

• Evaluation metrics:

Recall = (# slots correctly filled) / (# slots filled in key)
Precision = (# slots correctly filled) / (# slots filled by system)
F-Measure = (2 × recall × precision) / (recall + precision)

• An example of named entity annotation: Mr. <en type=“person”>Sears</en> bought a new suit at <en type=“org”>Sears</en> in <en type=“location”>Washington</en> <time type=“date”>yesterday</time>

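The metrics above translate directly into code; a small illustrative sketch with invented counts:

```python
# Slot-based scoring for named entity extraction, per the formulas above.
def f_measure(correct, key_slots, system_slots):
    recall = correct / key_slots          # fraction of the answer key recovered
    precision = correct / system_slots    # fraction of system output that is right
    return 2 * recall * precision / (recall + precision)

# If the system fills 5 slots, 4 of them correctly, against a 6-slot key:
print(f_measure(correct=4, key_slots=6, system_slots=5))  # ~0.73
```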

Page 12: Recognition Architectures: Why Is Speech Recognition So Difficult?

• Our measurements of the signal are ambiguous.

• Region of overlap represents classification errors.

• Reduce overlap by introducing acoustic and linguistic context (e.g., context-dependent phones).

[Figure: three phone classes (Ph_1, Ph_2, Ph_3) plotted in a two-dimensional feature space (Feature No. 1 vs. Feature No. 2), with overlapping distributions]

• Comparison of “aa” in “lOck” vs. “iy” in “bEAt” for conversational speech (SWB)


Page 13: Recognition Architectures: A Communication Theoretic Approach

[Diagram: Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel; the observables at each stage: Message → Words → Sounds → Features]

Bayesian formulation for speech recognition:

• P(W|A) = P(A|W) P(W) / P(A)


Objective: minimize the word error rate

Approach: maximize P(W|A) during training

Components:

• P(A|W) : acoustic model (hidden Markov models, mixtures)

• P(W) : language model (statistical, finite state networks, etc.)

The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
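Since P(A) is fixed for a given utterance, maximizing P(W|A) reduces to maximizing the product of the two trained components:

• W* = argmax over W of P(A|W) P(W)

This is why the recognizer needs only an acoustic model and a language model, and never has to estimate P(A).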

Page 14: Recognition Architectures: Incorporating Multiple Knowledge Sources

[Block diagram: Input Speech → Acoustic Front-end → Search → Recognized Utterance; the search is driven by Acoustic Models P(A|W) and a Language Model P(W)]

Acoustic Front-end

• The signal is converted to a sequence of feature vectors based on spectral and temporal measurements.

Acoustic Models P(A|W)

• Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure.

Search

• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.

Language Model P(W)
• The language model predicts the next set of words and controls which models are hypothesized.

Page 15: Acoustic Modeling: Feature Extraction

[Block diagram: Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Energy + Mel-Spaced Cepstrum; a first Time Derivative yields Delta Energy + Delta Cepstrum, and a second yields Delta-Delta Energy + Delta-Delta Cepstrum]

• Incorporate knowledge of the nature of speech sounds in measurement of the features.

• Utilize rudimentary models of human perception.


• Measure features 100 times per sec.

• Use a 25 msec window for frequency domain analysis.

• Include absolute energy and 12 spectral measurements.

• Time derivatives to model spectral change.
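A minimal sketch of this 39-dimensional front-end, assuming the librosa toolkit (an implementation choice of this sketch, not of the talk):

```python
import librosa
import numpy as np

def extract_features(path):
    y, sr = librosa.load(path, sr=16000)          # speech waveform
    hop = int(0.010 * sr)                         # 100 feature vectors per second
    win = int(0.025 * sr)                         # 25 msec analysis window
    # 13 coefficients: an energy-like c0 plus 12 mel-spaced cepstra
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                hop_length=hop, win_length=win)
    delta = librosa.feature.delta(mfcc)           # first time derivative
    delta2 = librosa.feature.delta(mfcc, order=2) # second time derivative
    return np.vstack([mfcc, delta, delta2])       # 39 features per frame
```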

Page 16: Acoustic Modeling: Hidden Markov Models

• Acoustic models encode the temporal evolution of the features (spectrum).

• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.

• Phonetic model topologies are simple left-to-right structures.

• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.

• Sharing model parameters is a common strategy to reduce complexity.

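As a toy illustration (not from the talk), a three-state left-to-right phone model can be written as a transition matrix with self-loops for temporal stretching and a skip arc for time-warping:

```python
import numpy as np

A = np.array([
    [0.6, 0.3, 0.1],   # state 0: self-loop, advance, or skip ahead
    [0.0, 0.7, 0.3],   # state 1: no backward transitions (left-to-right)
    [0.0, 0.0, 1.0],   # state 2: final state; exit handled by the lexicon
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a probability distribution
```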

Page 17: Acoustic Modeling: Parameter Estimation

• Closed-loop, data-driven modeling, supervised only by a word-level transcription.

• The expectation/maximization (EM) algorithm is used to improve our parameter estimates.

• Computationally efficient training algorithms (Forward-Backward) have been crucial.

• Batch mode parameter updates are typically preferred.

• Decision trees are used to optimize parameter-sharing, system complexity, and the use of additional linguistic knowledge.


• Initialization

• Single Gaussian Estimation

• 2-Way Split

• Mixture Distribution Reestimation

• 4-Way Split

• Reestimation

•••
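A minimal sketch of the splitting step in this schedule (illustrative; each split is followed by EM reestimation):

```python
import numpy as np

def split_mixture(means, variances, weights, eps=0.2):
    """Double the number of Gaussian components (the 2-way/4-way splits above)."""
    offset = eps * np.sqrt(variances)       # perturb each mean along its std. dev.
    new_means = np.concatenate([means - offset, means + offset])
    new_vars = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights, weights]) / 2.0
    return new_means, new_vars, new_weights
```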

Page 18: Language Modeling: Is A Lot Like Wheel of Fortune

Page 19: Language Modeling: N-Grams: The Good, The Bad, and The Ugly

Unigrams (SWB):
• Most common: “I”, “and”, “the”, “you”, “a”
• Rank-100: “she”, “an”, “going”
• Least common: “Abraham”, “Alastair”, “Acura”

Bigrams (SWB):
• Most common: “you know”, “yeah SENT!”, “!SENT um-hum”, “I think”
• Rank-100: “do it”, “that we”, “don’t think”
• Least common: “raw fish”, “moisture content”, “Reagan Bush”

Trigrams (SWB):
• Most common: “!SENT um-hum SENT!”, “a lot of”, “I don’t know”
• Rank-100: “it was a”, “you know that”
• Least common: “you have parents”, “you seen Brooklyn”
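A toy sketch of how such N-gram statistics are collected and turned into maximum-likelihood probabilities (illustrative; sentence boundaries appear here as <s> and </s> rather than the SENT tokens above):

```python
from collections import Counter

def bigram_probs(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[:-1])              # context counts for the bigrams
        bigrams.update(zip(tokens, tokens[1:]))
    # P(w2 | w1) = count(w1 w2) / count(w1); real systems smooth these estimates
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

probs = bigram_probs([["you", "know", "I", "think"]])
print(probs[("you", "know")])  # 1.0 in this one-sentence toy corpus
```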

Page 20: Language Modeling: Integration of Natural Language

• Natural language constraints can be easily incorporated.

• Lack of punctuation and search space size pose problems.

• Speech recognition typically produces a word-level, time-aligned annotation.

• Time alignments for other levels of information are also available.

Page 21: Implementation Issues: Search Is Resource Intensive

• Typical LVCSR systems have about 10M free parameters, which makes training a challenge.

• Large speech databases are required (several hundred hours of speech).

• Tying, smoothing, and interpolation are required.

Resource usage by component:

Component            Memory   CPU
Feature Extraction   1 MB     1%
Acoustic Modeling    10 MB    59%
Language Modeling    30 MB    15%
Search               150 MB   25%

Page 22: Implementation Issues: Dynamic Programming-Based Search

• Dynamic programming is used to find the most probable path through the network.

• Beam search is used to control resources.


• Search is time synchronous and left-to-right.

• Arbitrary amounts of silence must be permitted between each word.

• Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
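A toy sketch of time-synchronous Viterbi decoding with beam pruning, as described above (illustrative; real decoders also track backpointers, word identities, and language model scores):

```python
import numpy as np

def viterbi_beam(log_obs, log_trans, beam=10.0):
    """log_obs: (T, S) frame log-likelihoods; log_trans: (S, S) transitions."""
    T, S = log_obs.shape
    scores = np.full(S, -np.inf)
    scores[0] = log_obs[0, 0]                 # start in the initial state
    for t in range(1, T):
        # dynamic programming: best-scoring predecessor for every state
        scores = np.max(scores[:, None] + log_trans, axis=0) + log_obs[t]
        # beam pruning: drop paths that fall too far below the current best
        scores[scores < scores.max() - beam] = -np.inf
    return scores.max()                       # score of the most probable path
```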

Page 23: Implementation Issues: Cross-Word Decoding Is Expensive

• Cross-word decoding: since word boundaries are not marked in spontaneous speech, we must allow for sequences of sounds that span word boundaries.

• Cross-word decoding significantly increases memory requirements.

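A toy illustration (not from the talk) of why this is expensive: with context-dependent phones, the models at each word boundary depend on the neighboring words, so boundary models multiply with the vocabulary:

```python
# Expand a word into triphones whose outer contexts cross word boundaries.
LEXICON = {"a": ["ah"], "nice": ["n", "ay", "s"], "day": ["d", "ey"]}

def crossword_triphones(prev_word, word, next_word):
    phones = LEXICON[word]
    ctx = [LEXICON[prev_word][-1]] + phones + [LEXICON[next_word][0]]
    return [f"{ctx[i]}-{ctx[i+1]}+{ctx[i+2]}" for i in range(len(phones))]

print(crossword_triphones("a", "nice", "day"))  # ['ah-n+ay', 'n-ay+s', 'ay-s+d']
```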

Page 24: Implementation Issues: Decoding Example

Page 25: Implementation Issues: Internet-Based Speech Recognition

Page 26: Technology: Conversational Speech

• Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.

• WER has decreased from 100% to 30% in six years.

• Laughter

• Singing

• Unintelligible

• Spoonerism

• Background Speech

• No pauses

• Restarts

• Vocalized Noise

• Coinage

Page 27: Technology: Audio Indexing of Broadcast News

Broadcast news offers some unique challenges:
• Lexicon: important information in infrequently occurring words

• Acoustic Modeling: variations in channel, particularly within the same segment (“in the studio” vs. “on location”)

• Language Model: must adapt (“Bush,” “Clinton,” “Bush,” “McCain,” “???”)

• Language: multilingual systems? language-independent acoustic modeling?

Page 28: Technology: Real-Time Translation

• From President Clinton’s State of the Union address (January 27, 2000):

“These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today’s fastest supercomputers.”


• Imagine a world where:

• You book a travel reservation from your cellular phone while driving in your car without ever talking to a human (database query)

• You converse with someone in a foreign country and neither speaker speaks a common language (universal translator)

• You place a call to your bank to inquire about your bank account and never have to remember a password (transparent telephony)

• You can ask questions by voice and your Internet browser returns answers to your questions (intelligent query)

• Human Language Engineering: a sophisticated integration of many speech and language related technologies... a science for the next millennium.

Page 29: Technology: Future Directions

What have we learned?

• supervised training is a good machine learning technique

• large databases are essential for the development of robust statistics

What are the challenges?

• discrimination vs. representation

• generalization vs. memorization

• pronunciation modeling

• human-centered language modeling

What are the algorithmic issues for the next decade?
• Better features by extracting articulatory information?

• Bayesian statistics? Bayesian networks?

• Decision Trees? Information-theoretic measures?

• Nonlinear dynamics? Chaos?

[Timeline, 1960–2000: Analog Filter Banks → Dynamic Time-Warping → Hidden Markov Models]

Page 30: To Probe Further: References

Journals and Conferences:

[1] N. Deshmukh, et al., “Hierarchical Search for Large Vocabulary Conversational Speech Recognition,” IEEE Signal Processing Magazine, vol. 16, no. 5, pp. 84-107, September 1999.

[2] N. Deshmukh, et al., “Benchmarking Human Performance for Continuous Speech Recognition,” Proceedings of the Fourth International Conference on Spoken Language Processing, pp. SuP1P1.10, Philadelphia, Pennsylvania, USA, October 1996.

[3] R. Grishman, “Information Extraction and Speech Recognition,” presented at the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, February 1998.

[4] R. P. Lippmann, “Speech Recognition By Machines and Humans,” Speech Communication, vol. 22, pp. 1-15, July 1997.

[5] M. Maybury (editor), “News on Demand,” Communications of the ACM, vol. 43, no. 2, February 2000.

[6] D. Miller, et al., “Named Entity Extraction from Broadcast News,” presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.

[7] D. Pallett, et al., “Broadcast News Benchmark Test Results,” presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.

[8] J. Picone, “Signal Modeling Techniques in Speech Recognition,” Proceedings of the IEEE, vol. 81, no. 9, pp. 1215-1247, September 1993.

[9] P. Robinson, et al., “Overview: Information Extraction from Broadcast News,” presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.

[10] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1998.

URLs and Resources:

[11] “Speech Corpora,” The Linguistic Data Consortium, http://www.ldc.upenn.edu.

[12] “Technology Benchmarks,” Spoken Natural Language Processing Group, The National Institute of Standards and Technology, http://www.itl.nist.gov/iaui/894.01/index.html.

[13] “Signal Processing Resources,” Institute for Signal and Information Technology, Mississippi State University, http://www.isip.msstate.edu.

[14] “Internet-Accessible Speech Recognition Technology,” http://www.isip.msstate.edu/projects/speech/index.html.

[15] “A Public Domain Speech Recognition System,” http://www.isip.msstate.edu/projects/speech/software/index.html.

[16] “Remote Job Submission,” http://www.isip.msstate.edu/projects/speech/experiments/index.html.

[17] “The Switchboard Corpus,” http://www.isip.msstate.edu/projects/switchboard/index.html.

Page 31: Conclusion and Future Directions: Trends

We need new technology to help with information overload.
• Speech information sources are everywhere
– Voice mail messages
– Professional talk
– Lectures, broadcasts
• Speech sources of information will increase
– As devices shrink
– As mobility increases
– New uses: annotation, documentation

Speech as Access: What are the words?
Speech as Source: What does it mean?
Information as Partner: Here’s what you need.

Page 32: Conclusion and Future Directions: Limitations on Applications

• Recognition performance, especially in error recovery

• Natural language understanding (speech differs from text)
– Speech unfolds linearly in time
– Speech is more indeterminate than text
– Speech has different syntax and semantics
– Prosody differs from punctuation

• Cost to develop applications (too few experts)

• Cost to integrate/interoperate with other technologies

• New capabilities
– “When did he say Y and was he angry?”
– Scanning, refocusing quickly (browsing)
– Proactive information: match past patterns, find novel aspects
– Gist, summarize, translate for different purposes

Page 33: Conclusion and Future Directions: Applications on the Horizon

Beginnings of speech as a source of information
• ISLIP http://www.mediasite.net/info/frames.htm
• Virage http://www.virage.com

Speech technology in education and training
• Cliff Stoll, High Tech Heretic: why computers don’t belong in the classroom
– Good schools need no computers
– Bad schools won’t be improved by them
• Beulah Arnott: also true of indoor plumbing
• BravoBrava: co-evolving technology and people can
– Dramatically reduce the cost of delivery of content
– Increase its timeliness, quality, and appropriateness
– Target needs of individual and/or group
– Reading Pal demo

Page 34: Reading Pal

[Screenshot sequence:]
• Child reads
• Errors in red
• Looks up word “massive”
• Clicks ‘You’ to play it back
• Clicks ‘Listen’ to play back from “massive”

Page 35: Summary: Goal: Speech Better Than Text

Healthy loop between research and applications
• Research leads to applications, which lead to new research opportunities

We need collaboration

• Too much for one person, one site, one country

Humans will probably continue to be better than machines at many things

Can we learn to use technology and training to augment human-human and human-machine collaboration?

• We need to “micronize” education and training

It’s not a solved problem
• Further technology development is needed to enable the vision