
Advanced Methods in Automatic Speech Recognition

Dr. Ralf Schlüter, Prof. Dr.-Ing. Hermann Ney

Lehrstuhl für Informatik 6 – Human Language Technology and Pattern Recognition

Computer Science Department, RWTH Aachen University, D-52056 Aachen, Germany

August 5, 2010


Schedule

Course: Introduction to Automatic Speech Recognition

Event      Times                          Room   Start
Lecture    Tue (biweekly) 15:45–17:15h    6124   April 20, 2010
           Thu            14:00–15:30h    6124
Exercises  Tue (biweekly) 15:45–17:15h    6124   April 27, 2010

Room 6124: seminar room of the Lehrstuhl für Informatik 6

See the course site
http://www-i6.informatik.rwth-aachen.de/web/Teaching/Lectures/SS10/advasr
for:

- news
- downloads (documents, exercise sheets, etc.)
- course information
- contacts


Contents

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word Graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Outline

0. Lehrstuhl für Informatik 6
   0.1 Research Topics
   0.2 Projects
   0.3 Courses
   0.4 Textbooks

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word Graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Lehrstuhl für Informatik 6: Research Topics

Method: Stochastic Modelling

- Modelling dependencies and vague knowledge (contrast: rule-based approach)
- Decision making, in particular in context
- Automatic learning from data/examples

Applications: Human Language Technology and Pattern Recognition


Applications: Examples

- Speech recognition
  – small vocabulary
  – large vocabulary
- Machine translation
- Natural language processing
  – text/document classification
  – information retrieval
  – parsing and syntactic analysis
- Language understanding and dialog systems
- Image recognition
  – object recognition
  – handwriting recognition


Applications: Examples

- Diagnosis and expert systems
- Other applications:
  – speaker verification and identification
  – fingerprint verification and identification
  – DNA sequence identification
  – gesture recognition
  – lip reading
  – geological analysis
  – high-energy physics: bubble chamber tracks
  – ...


Lehrstuhl für Informatik 6 (i6): Projects

- ARISE (EU): Automatic Railway Information Systems across Europe
  – Speech Recognition and Language Modelling
- EuTrans II (EU): Translation of Spoken Language
  – Speech Recognition and Translation
- Institut für deutsche Sprache (IdS):
  – Language Modelling for Newspapers
- Audio Document Retrieval (NRW):
  – Speech Recognition and Information Retrieval
- Verbmobil II (BMBF): Speech Recognition and Translation for Appointment Scheduling and Traveling Information
  – Speech Recognition
  – Speech Translation
  – Prototype Modules


Projects i6

- Image Object Recognition (RWTH):
  – OCR (optical character recognition)
  – Medical Images
- Advisor (EU):
  – Speech Recognition for German Broadcast News
- EGYPT follow-up (NSF):
  – Basic Algorithms for Statistical Machine Translation
- Audio Document Retrieval (NRW ?):
  – German Broadcast News: Recognition and Information Retrieval
- Bilateral Projects with Companies (including start-ups)
- German DFG:
  – Improved Acoustic Modelling using Structured Models
  – Statistical Methods for Written Language Translation
  – Statistical Modeling for Image Object Recognition


Projects i6

- Coretex (EU):
  – Improving Core Technology for Speech Recognition
  – Applications: Broadcast News in Several Languages
- LC-Star (EU):
  – Lexical and Corpora Resources for Recognition, Translation and Synthesis
  – Prototype system for machine translation of spoken sentences
- TC-Star (EU):
  – Technology and Corpora for Speech to Speech Translation
  – Applications: Broadcast News and Speeches/Lectures
- Transtype-2 (EU):
  – Machine translation of written text
  – Application: interactive machine-aided translation
- PF-Star (EU):
  – Machine translation of spoken dialogues
  – Application: tourism and travelling


Projects i6

- JUMAS (EU):
  – JUdicial MAnagement by digital libraries Semantics
  – Application: audio and video search of court proceedings
- LUNA (EU):
  – spoken Language UNderstanding in multilinguAl communication systems
  – Application: real-time understanding of spontaneous speech in advanced telecom services
- GALE (US-DARPA):
  – Global Autonomous Language Exploitation
  – Application: Information Processing in Multiple Languages
- QUAERO [Lat.: to search] (OSEO/France):
  – multimedia and multilingual indexing
  – Application: extract information from written texts, speech and music audio, images, and video


Courses

- Introductory lectures (L3/4) with exercises (E2) for Bachelor, Master, and Diploma students:
  – ASR: (Introduction to) Automatic Speech Recognition
  – PRN: (Introduction to) Pattern Recognition and Neural Networks
  – NLP: (Introduction to) Natural Language Processing
- Advanced lectures (L3) with exercises (E1/2) for Master and Doctoral students:
  – advASR: Advanced Automatic Speech Recognition
  – advPRN: Advanced Pattern Recognition
  – advNLP: Advanced Natural Language Processing
- Further lectures (L2) with exercises (E1):
  – MIP: Medical Image Processing ('Ringvorlesung', each WS)


Courses (ctd.)

- Seminars:
  – Bachelor Degree (SS, block)
  – Diplom Degree (SS, block)
  – Doctor Degree (WS+SS)
- Laboratory Courses (WS, block)
- Study Groups (WS+SS: speech, language, image)

New course cycles:

year   term  lectures
09/10  WS    PRNN (L4/3, E2), ASR (L4/3, E2)
       SS    NLP (L4/3, E2), advASR (L3, E1)
10/11  WS    PRNN (L4/3, E2), ASR (L4/3, E2)


Exams i6: Diplom Degree

- Area of specialization (Vertiefungsgebiet) i6 with the topics:
  – Automatic Speech Recognition (ASR)
  – Pattern Recognition and Neural Networks (PRNN)
  – Natural Language Processing (NLP)
  – ...
  Select 12 hours (SWS) out of the i6 lectures.


Exams i6: Diplom Degree (ctd.)

- Practical computer science (Prakt. Informatik) (3 areas): recommendation: 12 hours (SWS) out of
  – two L4 lectures from: ASR, PRNN, NLP
  – one L4 lecture from i6-external lectures:
    - data bases
    - artificial intelligence
    - ... (additional alternatives: on demand)


Examinations i6

- Master in Informatik, Media Informatics, or Software Systems Engineering:
  credit system: oral exam after each course / at the end of the lecture period
- ERASMUS students of Computer Science:
  oral exam/colloquium for a graded certificate at the end of the lecture period

Note: consult Prof. Ney before June 2010 for exam dates, and before registering for the exam.


Textbooks: Topics i6

Textbooks on Speech Recognition:

- emphasis on signal processing and small-vocabulary recognition:
  L. Rabiner, B. H. Juang: Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
- emphasis on large vocabulary and language modelling:
  F. Jelinek: Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1997.
- introduction to both speech and language:
  D. Jurafsky, J. H. Martin: Speech and Language Processing. Prentice Hall, Englewood Cliffs, NJ, 2000.
- advanced topics:
  R. De Mori: Spoken Dialogues with Computers. Academic Press, London, 1998.


Textbooks: Topics i6

Textbooks on Signal Processing:

- A. V. Oppenheim, R. W. Schafer: Discrete-Time Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1989.
- A. Papoulis: Signal Analysis. McGraw-Hill, New York, NY, 1977.
- A. Papoulis: The Fourier Integral and its Applications. McGraw-Hill Classic Textbook Reissue Series, McGraw-Hill, New York, NY, 1987.
- W. K. Pratt: Digital Image Processing. Wiley & Sons, New York, NY, 1991.

Further reading on Signal Processing:

- T. K. Moon, W. C. Stirling: Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2000.
- J. R. Deller, J. G. Proakis, J. H. L. Hansen: Discrete-Time Processing of Speech Signals. Macmillan Publishing Company, New York, NY, 1993.
- L. Berg: Lineare Gleichungssysteme mit Bandstruktur. VEB Deutscher Verlag der Wissenschaften, Berlin, 1986.


Textbooks: Topics i6

Textbooks on Natural Language Processing (statistical/corpus-based):

- introduction to both speech and language:
  D. Jurafsky, J. H. Martin: Speech and Language Processing. Prentice Hall, Englewood Cliffs, NJ, 2000.
- emphasis on statistical methods for written language:
  C. D. Manning, H. Schütze: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
- related field, artificial intelligence:
  S. Russell, P. Norvig: Artificial Intelligence. Prentice Hall, Englewood Cliffs, NJ, 1995 (in particular Chapters 22–25).


Textbooks: Topics i6

Textbooks on Statistical Learning (Pattern Recognition, Neural Networks, Data Mining, ...):

- best introduction (including modern concepts):
  R. O. Duda, P. E. Hart, D. G. Stork: Pattern Classification. 2nd ed., J. Wiley & Sons, New York, NY, 2001.
- emphasis on statistical concepts:
  B. D. Ripley: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, England, 1996.
- emphasis on modern statistical concepts:
  T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2001.
- emphasis on theory and principles:
  L. Devroye, L. Györfi, G. Lugosi: A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.


Textbooks: Topics i6

Textbooks on mathematical methods (vector spaces and matrices, statistics, optimization methods, ...):

- best overall summary:
  T. K. Moon, W. C. Stirling: Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2000.
- introduction to modern statistics:
  G. Casella, R. L. Berger: Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA, 1990.
- good overview of numerical algorithms and implementations:
  W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery: Numerical Recipes in C. 2nd ed., Cambridge Univ. Press, Cambridge, 1992.


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word Graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Overview: Architecture

Starting point: Bayes decision rule

- results in a minimum number of recognition errors (under certain conditions)
- more details: see the lecture Pattern Recognition and Neural Networks


Speech Recognition: Bayes' Decision Rule

[Figure: architecture of the statistical speech recognizer. The speech input is converted by the acoustic analysis into a sequence of acoustic vectors $x_1 \ldots x_T$. The global search maximizes
$$\Pr(w_1 \ldots w_N) \cdot \Pr(x_1 \ldots x_T \mid w_1 \ldots w_N)$$
over the word sequences $w_1 \ldots w_N$, where the acoustic model $\Pr(x_1 \ldots x_T \mid w_1 \ldots w_N)$ is built from the phoneme inventory and the pronunciation lexicon, and $\Pr(w_1 \ldots w_N)$ is the language model; the output is the recognized word sequence.]


Speech Recognizer: Sources of Errors

Why does a recognition system make errors? Reasons from the viewpoint of Bayes' decision rule:

- incorrect acoustic model:
  – poor acoustic analysis
  – poor phoneme models
  – poor pronunciation model
- incorrect language model
- incorrect search procedure: the maximum is not found
- decision rule: discrepancy between the evaluation measure (word error rate) and the decision rule (which minimizes the sentence error rate)


Speech Recognition: Effect of Language Model and Other Knowledge Sources

Importance of higher-level knowledge and its integration in the search process. Test results on the Wall Street Journal 5k task:

knowledge sources used              perplexity PP   phoneme error rate [%]   word error rate [%]
unconstrained phoneme recognition   –               36.3                     —
+ pronunciation lexicon             5000            13.9                     40.0
+ LM: unigram                       746             8.4                      22.9
      bigram                        107             2.8                      6.9
      trigram                       56              1.9                      4.5


Effect of Knowledge Sources

Example from the Wall Street Journal 5k task (error counts on the right):

no lexicon (28 errors):
  k t k t dh ey d v eh d ey n ey ih z n un k oh sh h ee ey d ih ng n dh uh dh s ey l uh f s ur n d h aa s dh aa t s UH b dh uh b r oh k r ih j y ooh n ih t p p

0-gram (11 phoneme errors / 9 word errors):
  h ih t s eh n uh t ur z n ih g oh sh ee ey t ih ng — — s ey l — — s ur t un aa s eh t s aw n t uh b r oh k ur ih j y ooh n ih t s
  HIT SENATORS — — NEGOTIATING — SALE — CERTAIN ASSETS ONTO — BROKERAGE UNIT'S

1-gram (6 phoneme errors / 5 word errors):
  ih t s s eh n ih t ih z n ih g oh sh ee ey t ih ng — — s ey l — — s ur t un aa s eh t s aw v dh uh b r oh k ur ih j y ooh n ih t
  ITS SENATE — IS NEGOTIATING — SALE — CERTAIN ASSETS OF THE BROKERAGE UNIT

2-gram (0 errors):
  ih t s eh d ih t ih z n ih g oh sh ee ey t ih ng dh uh s ey l aw v s ur t un aa s eh t s aw v dh uh b r oh k ur ih j y ooh n ih t
  IT SAID IT IS NEGOTIATING THE SALE OF CERTAIN ASSETS OF THE BROKERAGE UNIT


Phoneme Models and Subword Units

From Small to Large Vocabulary: Why Subword Units?

For large vocabularies, it is prohibitive to use whole-word models for each word of the vocabulary:

- There are not enough training samples for each word.
- The memory requirements increase linearly with the number of words (today: no real problem).

Solution: create word models by concatenating subword units, such as phonemes, context-dependent phonemes, demi-syllables, syllables, ...

Advantages:

- Training data is shared between words.
- Words not seen in training (i.e. without training examples) can be recognized by using a pronunciation lexicon.


Zipf's Law

The problem of sparse data is related to "Zipf's law":

    The frequency N(w) of a word w is (approximately) inversely proportional to some power γ of its rank r(w):

$$N(w) = \mathrm{const} \cdot r(w)^{-\gamma}$$

Example from the Verbmobil corpus:

rank    word           frequency
1       ich            18648
2       ja             16613
3       das            14288
4       wir            13532
...
4440    Abendtermine   1
4441    Aberglaubens   1
...
10000   zwingend       0

[Figure: log-log plot of word frequency versus rank for the Verbmobil corpus.]


Phonetic (Phonemic) Models

Distinguish the various levels:
– acoustic realization: acoustic signal
– class of equivalent sounds: phone (allophone, triphone)
– (more) abstract level: phoneme

Speech sounds may be categorized according to different 'features':

for consonants:
- voiced / voiceless
- manner of articulation: stop, nasal, fricative, approximant
- place of articulation: labial, dental, alveolar, palatal, velar, glottal

for vowels:
- position of the tongue: high/low, front/back
- rounded or not


Subword Units

speech ⇐⇒ temporal sequence of sounds
  ⇓
acoustic signal ⇐⇒ temporal sequence of acoustic vectors (the acoustic realization of the sounds)

Model of speech production:

- Every sound has a program for the movements of the vocal tract.
- The movements of individual sounds merge into one continuous sequence of movements.
- The ideal positioning of the vocal tract is only approximated (depending on the amount of coarticulation).
  ⇒ the real acoustic signal differs from the 'ideal' signal


The Vocal Tract

[Figure: cross-section of the human vocal tract. Drawing by Laszlo Kubinyi, © Scientific American 1977.]


Subword Units

Criteria for sound classification:

- type of articulation (fricative, plosive)
- location of articulation ([p]: labial, [s]: dental)
- consonants and vowels
- voiced and unvoiced sounds
- stationary and non-stationary sounds (vowels vs. diphthongs, plosives)

Perception of sounds:

- loudness
- tone (smoothed spectrum = formant spectrum)
- unvoiced, voiced (fundamental frequency, pitch)


Phonemes

- The pronunciation of a word is usually described in a less detailed way using phonemes.
- A phoneme is an abstraction over different phonetic realizations.
- Two sounds correspond to different phonemes if they can occur in the same context and distinguish different words.
- The phoneme inventory of a language can be inferred from "minimal pairs".
- A minimal pair is a pair of words whose phonetic transcriptions have an edit distance of one.


Phonemes

Examples of minimal pairs for German:

Vowels:
i: / o:    Kiel / Kohl
I / E      fit / fett
e: / Y     fehle / fülle
a: / a     Rate / Ratte
Y / 9      Hülle / Hölle
o: / aU    roh / rau
e: / E:    Tee / Teint

Consonants:
p / b      packe / backe
t / m      Tasse / Masse
k / ts     Kahn / Zahn
f / v      Fall / Wall
s / S      Bus / Busch
s / z      Muße / Muse
l / –      Klette / Kette


Phonemes

Characteristics of the phoneme set:

- The phoneme set is language specific. Examples:
  Chinese: [l] – [r] one phoneme
  Arabic: [ki] – [ku] different phonemes
- Humans are trained to distinguish the sounds of specific languages.
- The acoustic realizations of phonemes are context dependent (coarticulation):
  – static dependencies on the surrounding phonemes
  – dynamic dependency: temporal overlap of the articulation of subsequent phonemes


Phoneme System for German in SAMPA Notation

Consonants, plosives:
p   Pein     p aI n
b   Bein     b aI n
t   Teich    t aI C
d   Deich    d aI C
k   Kunst    k U n s t
g   Gunst    g U n s t

Consonants, fricatives:
f   fast     f a s t
v   was      v a s
s   Tasse    t a s @
z   Hase     h a: z @
S   waschen  v a S @ n
Z   Genie    Z e n i:
C   sicher   z I C 6
j   Jahr     j a: 6
x   Buch     b u: x
h   Hand     h a n t

Consonants, sonorants:
m   mein     m aI n
n   nein     n aI n
N   Ding     d I N
l   Leim     l aI m
R   Reim     R aI m

"Checked" (short) vowels:
I   Sitz       z I t s
E   Gesetz     g @ z E t s
a   Satz       z a t s
O   Trotz      t r O t s
U   Schutz     S U t s
Y   hübsch     h Y p S
9   plötzlich  p l 9 t s l I C

"Free" (long) vowels:
i:  Lied     l i: t
e:  Beet     b e: t
E:  spät     S p E: t
a:  Tat      t a: t
o:  rot      r o: t
u:  Blut     b l u: t
y:  süß      z y: s
2:  blöd     b l 2: t

Diphthongs:
aI  Eis      aI s
aU  Haus     h aU s
OY  Kreuz    k r OY t s

"Schwa" vowels:
@   bitte    b I t @
6   besser   b E s 6


Phonemes

Function of the phonemes:

acoustic signal: continuous, infinite number of realizations
  ⇑ (1:∞)
(allo-)phones: discrete sounds, approx. 40 000
  ⇑ (1:1000)
phonemes: discrete, alphabet of 40–60 symbols, depending on the language
  ⇕ (1:1)
words of the language: several 100 000 words, pronunciation and meaning


Context Dependent Subword Units

The acoustic realization of phonemes is context dependent. Context-dependent modelling is more accurate:

- Diphones: for the phoneme sequence A B C D E:
  #A | AB | BC | CD | DE | E#
- Syllables: group of phonemes, standard form consonant–vowel–consonant, about 20 000 syllables for German.
  [Figure: energy over time for a consonant–vowel–consonant syllable.]
- Demi-syllables (syllables split at the vowel)
- Consonant clusters


Subword Units

Example: possible subword units for German

Subword Unit                    Number (approx.)   Representation of the acoustic signal
phonemes                        50                 inaccurate
consonant clusters and vowels   250                ...
diphones                        2500               ...
demi-syllables                  ...                ...
syllables                       20 000             accurate

Note on terminology: consonant cluster = consonant sequence.


Subword Units

Practical reasons for using subword units in speech recognition:

- Not enough training data for whole-word models.
- More observations for subword units (better training).
- The vocabulary can be extended without new acoustic training; specifying the corresponding subword units is sufficient.

Important issues when using subword units:

- Define and specify the subword units.
- Map the continuous signal to the discrete sequence of units, i.e. specify the units and the pronunciation lexicon.
- Train the subword units.
- Use the subword units and the pronunciation lexicon for recognition.


HMMs for phonemes

Layers of the acoustic modelling:

[Figure: layers of acoustic modelling for the utterance "THIS BOOK IS GOOD":
words:            THIS | BOOK | IS | GOOD
phonemes:         th i s | b uh k | i z | g uh d
subphonemes:      ... b b uh uh uh k k ...
acoustic vectors: ...
speech signal:    ... (cl = closure, rel = release, on/off = onset/offset)]

Speech can be modeled on any of these layers.


HMMs for phonemes

Different HMM topologies for phonemes can be used; a topology defines:

- the number of states and
- the allowed transitions.

Usually three subphonemes are used: Begin – Middle – End (3-state model).

[Figure: two variants of the 3-state topology B–M–E with self-loop, forward, and skip transitions.]
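Such a topology can be written down directly as a transition matrix. The following sketch uses uniform placeholder probabilities (in practice the values are trained); it is an illustration, not the lecture's exact parameterization:

```python
import numpy as np

# 3-state left-to-right HMM topology for one phoneme: Begin, Middle, End.
# Allowed transitions: loop (stay), forward (next state), skip (jump over one).
STATES = ["B", "M", "E"]

# transitions[i, j] = probability of moving from state i to state j.
transitions = np.array([
    [1/3, 1/3, 1/3],   # B: loop, B->M, skip B->E
    [0.0, 1/2, 1/2],   # M: loop, M->E
    [0.0, 0.0, 1.0],   # E: loop (the word exit is handled outside the model)
])

# Each row must be a proper distribution over successor states.
assert np.allclose(transitions.sum(axis=1), 1.0)
```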


HMMs for phonemes: "IBM model"

[Figure: the "IBM model" topology over the subphonemes B, M, E, with the emission distributions attached to the transitions.]

Properties:

- Transition-assigned emissions: the emission probability distributions are assigned to the transitions (not to the states).
- The number of possible paths is restricted for short vector sequences:
  1: B
  2: BM
  3: BME
  4: BMME
  5: BBMME, BMMEE, BMMME
  6: ...


HMMs for phonemes

- 6-state model: two copies of the B–M–E sequence (B M E B M E).

[Figure: the 6-state model and the corresponding trellis of the phoneme state index (B, M, E) over the time index.]


Pronunciation Lexicon

A pronunciation lexicon with phonetic transcriptions is required when using subword units. Usually phonemes are used as subword units.

Example: English digits

word    phonemes
Zero    Z IH R OW
One     W AH N
Two     T UW
Three   TH R IY
Four    F OW R
Five    F AY V
Six     S IH K S
Seven   S EH V AX N
Eight   EY T
Nine    N AY N
Oh      OW

phoneme   number of occurrences
AH        1
AX        1
AY        2
F         2
EH        1
EY        1
IH        2
IY        1
K         1
N         4
OW        3
S         3
R         2
T         2
TH        1
UW        1
V         2
W         1
Z         1
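The occurrence counts follow directly from the lexicon. A small sketch that reproduces them (the dictionary below is just the table above re-typed as Python data):

```python
from collections import Counter

LEXICON = {
    "Zero": "Z IH R OW",   "One": "W AH N",        "Two": "T UW",
    "Three": "TH R IY",    "Four": "F OW R",       "Five": "F AY V",
    "Six": "S IH K S",     "Seven": "S EH V AX N", "Eight": "EY T",
    "Nine": "N AY N",      "Oh": "OW",
}

# Count how often each phoneme occurs across all pronunciations.
counts = Counter(ph for pron in LEXICON.values() for ph in pron.split())
print(counts["N"], counts["OW"], counts["S"])  # -> 4 3 3, matching the table
```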


Pronunciation Lexicon

Context dependencies:

- Context-independent (real) phoneme models: the International Phonetic Alphabet defines 74 phonemes for English; in practical applications, about 40–50 phonemes are typically used.

  z e r o → Z IH R OW

- Context-dependent phoneme models: coarticulation is considered:

  z e r o → #ZIH  ZIHR  IHROW  ROW#

- "Diphone" AB, BC: context-dependent phoneme in diphone context
- "Triphone" ABC: context-dependent phoneme in triphone context


Pronunciation Lexicon

Terminology:

- context-independent phonemes ('monophones', more or less the 'real' phonemes as defined in linguistics)
- phonemes in (left or right) diphone context ('diphones')
- phonemes in triphone context ('triphones')
- phonemes in word context ('wordphones')

The context dependency only determines the labels for the emission probabilities, which have to be specified for each state of a phoneme model. The emission probabilities can be trained independently of the phonetic context.

The sequence of states is used for recognition:
- word: sequence of phoneme models
- phoneme model: sequence of HMM states
⇒ word: sequence of HMM states
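The expansion word → phoneme models → HMM states can be sketched in a few lines (hypothetical data structures with a fixed 3-state topology per phoneme; not the RWTH system):

```python
# Toy lexicon and a fixed Begin-Middle-End topology per phoneme.
LEXICON = {"zero": ["Z", "IH", "R", "OW"]}
SUBSTATES = ["B", "M", "E"]

def word_state_sequence(word: str) -> list:
    """Word -> sequence of phoneme models -> sequence of HMM state labels."""
    return [f"{ph}.{s}" for ph in LEXICON[word] for s in SUBSTATES]

print(word_state_sequence("zero"))
# ['Z.B', 'Z.M', 'Z.E', 'IH.B', 'IH.M', 'IH.E', ..., 'OW.E']
```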


Training Phoneme Models

- The sequence of HMM state indices for a word depends on the phoneme sequence.
- The training procedure corresponds to the one used for word models:

[Figure: time alignment of three training utterances against the state sequences of the spoken words; x, y, z are the labels of the mixtures.]

- Time alignment: assign a state to every acoustic vector.
- Parameter estimation:
  – Collect all observations for every mixture $m$ of every phoneme model based on the time alignment.
  – Estimate the model parameters for all densities $l$ of the mixture $m$:
    - reference or prototype vector $\mu_{lm}$
    - pooled variance vector $\sigma_m^2$
    - mixture weight $p(l \mid m)$


Training Phoneme Models

Practical considerations: all possible triphones (50³ = 125 000) are too many!

- Combine monophones, diphones, and triphones: only use diphones and triphones that occur more than e.g. 100 times in the training data.
- Generalized context-dependent phoneme models: use phoneme classes (nasals, fricatives, vowels, stop consonants, ...):
  g(A) B g(C) instead of A B C, where g(X) denotes the phoneme class that X belongs to.
- Parameter tying: use clustering or Classification And Regression Trees (CART) to tie "similar" phonemes.

⇒ a few thousand models that are actually used
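A toy sketch of the generalized-context idea (the class inventory below is illustrative only): replacing the left and right contexts by broad phonetic classes makes distinct triphones share one model:

```python
# Hypothetical broad-class inventory; a real system would cover all phonemes.
PHONEME_CLASS = {
    "m": "NASAL", "n": "NASAL",
    "f": "FRICATIVE", "s": "FRICATIVE",
    "a": "VOWEL", "i": "VOWEL",
    "p": "STOP", "t": "STOP",
}

def generalize(left: str, center: str, right: str) -> tuple:
    """Map the triphone A B C to the generalized model g(A) B g(C)."""
    return (PHONEME_CLASS.get(left, left), center, PHONEME_CLASS.get(right, right))

# 'm a t' and 'n a p' now share the model ('NASAL', 'a', 'STOP'):
assert generalize("m", "a", "t") == generalize("n", "a", "p") == ("NASAL", "a", "STOP")
```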


Phonetic Decision Trees: Motivation

Classification and Regression Trees (CART):

- used in the acoustic modelling of phonemes
- 50 phonemes ⇒ 50³ = 125 000 possible phonemes in triphone context ("triphones")
- Problem:
  – too many triphones to be trained reliably
  – many triphones are not seen in training
  – considering across-word contexts, this effect increases
- Solution:
  – tie the parameters of similar triphones
  – a decision tree determines similarity


Motivation

A phoneme X has 2500 possible triphone contexts aXb. Phonetic decision tree for a phoneme X:

[Figure: binary tree with root question Q0(a, b) and follow-up questions Q1(a, b) and Q2(a, b) on the yes/no branches.]

A path through the decision tree is defined by the answers to phonetic questions Q0, Q1, Q2, ..., e.g.:
– "Is the left context a fricative?"
– "Is the right context a plosive?"
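Evaluating such a tree is a simple walk from the root to a leaf. A minimal sketch (illustrative questions and phoneme sets, not the tree from the lecture):

```python
# Each inner node asks a yes/no question about the triphone context (a, b);
# each leaf names a tied HMM emission distribution.
FRICATIVES = {"f", "v", "s", "z", "S", "Z", "C", "x", "h"}
PLOSIVES = {"p", "b", "t", "d", "k", "g"}

TREE = ("left context fricative?", lambda a, b: a in FRICATIVES,
        ("right context plosive?", lambda a, b: b in PLOSIVES,
         "leaf_1", "leaf_2"),
        "leaf_3")

def find_leaf(node, a, b):
    if isinstance(node, str):        # reached a leaf: a tied model label
        return node
    _question, test, yes, no = node
    return find_leaf(yes if test(a, b) else no, a, b)

print(find_leaf(TREE, "s", "t"))  # fricative left, plosive right -> leaf_1
```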


Example

[Figure: an example phonetic decision tree with questions such as L-BOUNDARY, R-LIQUID, L-BACK-R, R-LAX-VOWEL, L-R, L-L-NASAL, R-TENSE-VOWEL, L-LAX-VOWEL, R-UR, R-S/SH, and L-EE at the inner nodes; the leaves are labeled with counts such as 1/10477, 3/2370, 2/3628, 1/9337, and 8/3179.]


Motivation

Properties of the tree:

- Every leaf of the tree stands for a generalized phonetic context and has a corresponding HMM emission probability.
- An adequate generalization for triphones not seen in training can be expected.


Motivation

General application of CART: given the two variables

- $c$ = class index
- $x \in \mathbb{R}^D$, or discrete (the observation),

model the conditional probability $p(c \mid x)$ with $\sum_c p(c \mid x) = 1$.

[Figure: decision tree with questions $Q_0(x)$, $Q_1(x)$, $Q_2(x)$; each leaf $t$ carries a distribution $p(c \mid t)$.]

"Classification tree" vs. "estimation tree": an estimation tree models the conditional probability without classification.


Training Principle

Given the training data

$$[x_n, y_n], \quad n = 1, \ldots, N,$$

with $x$ the independent variable and $y$ the dependent variable, two subsets of $x$ are considered:

$$t, t_L, t_R \subset x.$$

Define a tree by binary splitting of a node or subtree:

$$t = t_L \cup t_R, \qquad t_L \cap t_R = \emptyset.$$

[Figure: node $t$ split by a question on $x$ into $t_L$ (yes) and $t_R$ (no).]


Training Principle

Define:

- a "score" $g(y_n \mid t)$ for every observation $(x_n, y_n)$ with $x_n \in t$;
- a score for the node $t$:

$$G(t) := \sum_{n: x_n \in t} g(y_n \mid t).$$

The score function $g(y_n \mid t)$ shall be additive.

Note the change in the score when splitting $t$ into the subsets $t_L$ and $t_R$:

$$\Delta G(t_L \mid t) = G(t) - G(t_L) - G(t_R)$$

Best split $t_L$ for a given $t$:

$$\max_{t_L} \Delta G(t_L \mid t)$$


Training Principle

Use the log-likelihood (log-probability) criterion for $G(t)$ ($\theta$ represents the parameters of the distribution):

$$g(y_n \mid t) := \log p_\theta(y_n \mid t)$$
$$G(t) := \max_\theta \sum_{n: x_n \in t} \log p_\theta(y_n \mid t)$$

$G(t_L)$ and $G(t_R)$ correspondingly.

Optimization:
- Learn the best parameters $\hat\theta$ for a hypothetical split $t_L$ at a node $t$.
- Choose the optimal split.

Thus:

$$\hat\theta = \hat\theta\big(\{(x_n, y_n) : x_n \in t_L,\; n = 1, \ldots, N\}\big)$$
$$G(t_L) = \sum_{n: x_n \in t_L} \log p_{\hat\theta}(y_n \mid t_L)$$


Training Principle: Discrete Observations

For $y$ with discrete values, the parameters $\theta$ are the distribution $p(y \mid t)$ itself (non-parametric model). Then, introducing a Lagrange multiplier $\lambda$ for the normalization constraint:

$$\sum_{n: x_n \in t} \log p(y_n \mid t) - \lambda \left[ \sum_y p(y \mid t) - 1 \right] = \sum_y N(t, y) \cdot \log p(y \mid t) - \lambda \left[ \sum_y p(y \mid t) - 1 \right]$$

Setting the derivatives to zero:

$$\frac{\partial}{\partial p(y \mid t)}: \quad \frac{N(t, y)}{p(y \mid t)} - \lambda = 0$$
$$\frac{\partial}{\partial \lambda}: \quad \sum_y p(y \mid t) - 1 = 0$$
$$\Rightarrow \quad \hat\theta \equiv p(y \mid t) = \frac{N(t, y)}{N(t)}$$

with the counts $N(t, y)$ and $N(t)$.


Training Principle: Discrete Observations

For the optimum:

$$G(t) = \sum_{n: x_n \in t} \log p_{\hat\theta}(y_n \mid t) = \sum_{n: x_n \in t} \log \frac{N(t, y_n)}{N(t)} = \sum_y N(t, y) \cdot \log \frac{N(t, y)}{N(t)} = N(t) \sum_y p(y \mid t) \cdot \log p(y \mid t),$$

i.e. $N(t)$ times the negative entropy of $p(\cdot \mid t)$.
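A toy numerical check of this result (made-up labels): G(t) computed from counts equals N(t) times the negative entropy, and a pure split yields a positive likelihood improvement:

```python
import math
from collections import Counter

def G(labels):
    """Node score at the ML estimate: sum_y N(t,y) * log(N(t,y)/N(t))."""
    n = len(labels)
    return sum(c * math.log(c / n) for c in Counter(labels).values())

parent = ["a", "a", "a", "b", "b", "b"]
left, right = ["a", "a", "a"], ["b", "b", "b"]

# Likelihood improvement of the split (the negative of Delta G as defined
# above); a pure split explains the data better, so it is positive.
improvement = G(left) + G(right) - G(parent)
print(improvement)  # = 6 * log 2, about 4.16
```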


Training Principle: Continuous Observations

For $y$ with continuous values, especially a Gaussian distribution:

$$p_\theta(y \mid t) = \mathcal{N}(y \mid \mu_t, \Sigma_t)$$
$$\mathcal{N}(y \mid \mu_t, \Sigma_t) = \frac{1}{\sqrt{\det(2\pi\Sigma_t)}} \cdot \exp\left[ -\frac{1}{2} (y - \mu_t)^T \Sigma_t^{-1} (y - \mu_t) \right]$$

$$G(t) = \sum_{n: x_n \in t} \log \mathcal{N}(y_n \mid \hat\mu_t, \hat\Sigma_t) = -\frac{N(t)}{2} \log\det\big[ 2\pi \hat\Sigma_t \big] - \frac{1}{2} \sum_{n: x_n \in t} (y_n - \hat\mu_t)^T \hat\Sigma_t^{-1} (y_n - \hat\mu_t)$$

with

$$N(t) := \sum_{n: x_n \in t} 1$$


Training Principle: Continuous Observations

Maximum-likelihood estimation for $\mu_t$ and $\Sigma_t$:

$$\hat\mu_t = \frac{1}{N(t)} \sum_{n: x_n \in t} y_n$$
$$\hat\Sigma_t = \frac{1}{N(t)} \sum_{n: x_n \in t} (y_n - \hat\mu_t)(y_n - \hat\mu_t)^T$$


Training Principle: Continuous Observations

Using a diagonal covariance matrix:

$$\hat\Sigma_t = \begin{pmatrix} \sigma_{t1}^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_{tD}^2 \end{pmatrix}, \qquad \sigma_{td}^2 = \frac{1}{N(t)} \sum_{n: x_n \in t} (y_{nd} - \mu_{td})^2$$

$$\sum_{n: x_n \in t} (y_n - \mu_t)^T \hat\Sigma_t^{-1} (y_n - \mu_t) = \sum_{n: x_n \in t} \sum_d \left( \frac{y_{nd} - \mu_{td}}{\sigma_{td}} \right)^2 = \sum_d \frac{1}{\sigma_{td}^2} \sum_{n: x_n \in t} (y_{nd} - \mu_{td})^2 = N(t) \cdot D$$


Training Principle: Continuous Observations

General case, full covariance matrix (the index $t$ of $\mu_t$ and $\Sigma_t$ is dropped here for simplification):

$$z_n := y_n - \mu, \qquad \hat\Sigma := \frac{1}{N(t)} \sum_{n: x_n \in t} z_n z_n^T$$

$$\sum_{n: x_n \in t} z_n^T \hat\Sigma^{-1} z_n = \sum_n \sum_{i,j=1}^{D} z_{ni} \big(\hat\Sigma^{-1}\big)_{ij} z_{nj} = \sum_{i,j} \left[ \sum_{n: x_n \in t} z_{ni} z_{nj} \right] \big(\hat\Sigma^{-1}\big)_{ij} = N(t) \cdot \sum_{i,j} \hat\Sigma_{ij} \big(\hat\Sigma^{-1}\big)_{ij} = N(t) \cdot \sum_j \sum_i \hat\Sigma_{ji} \big(\hat\Sigma^{-1}\big)_{ij} = N(t) \cdot \sum_{j=1}^{D} \delta_{jj} = N(t) \cdot D$$

(using the symmetry of $\hat\Sigma$).


Training Principle: Continuous Observations

Thus:

$$G(t) = \sum_{n: x_n \in t} \log \mathcal{N}(y_n \mid \hat\mu_t, \hat\Sigma_t) = -\frac{N(t)}{2} \log\det\big[2\pi\hat\Sigma_t\big] - \frac{N(t)}{2} D = -\frac{N(t)}{2} \log\det\big(2\pi e\, \hat\Sigma_t\big) = -\frac{N(t)}{2} \Big[ D \log(2\pi) + D + \log\big(\det \hat\Sigma_t\big) \Big]$$


Training Principle: Continuous Observations

The improvement of the log-likelihood score by splitting $t = t_L \cup t_R$, $t_L \cap t_R = \emptyset$, is:

$$\Delta G(t) = G(t) - G(t_L) - G(t_R) = \ldots = \frac{N(t_L)}{2} \log\left[ \frac{\det \Sigma_{t_L}}{\det \Sigma_t} \right] + \frac{N(t_R)}{2} \log\left[ \frac{\det \Sigma_{t_R}}{\det \Sigma_t} \right]$$

For a diagonal covariance matrix:

$$\log\det\Sigma_t = \log \prod_d \sigma_{td}^2 = \sum_{d=1}^{D} \log \sigma_{td}^2$$
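A small sketch (toy data, diagonal covariances) of using this score for split selection. The quantity computed below is the likelihood improvement $G(t_L) + G(t_R) - G(t)$, i.e. the negative of $\Delta G(t)$ as written above, so a larger value means a better split:

```python
import numpy as np

def log_det_diag(y: np.ndarray) -> float:
    # log det of the diagonal ML covariance = sum of log per-dim variances
    return float(np.sum(np.log(y.var(axis=0))))

def split_improvement(y: np.ndarray, mask: np.ndarray) -> float:
    """(N log det S - N_L log det S_L - N_R log det S_R) / 2."""
    yl, yr = y[mask], y[~mask]
    return (len(y) * log_det_diag(y)
            - len(yl) * log_det_diag(yl)
            - len(yr) * log_det_diag(yr)) / 2.0

rng = np.random.default_rng(0)
y = np.vstack([rng.normal(-2.0, 1.0, (100, 3)), rng.normal(+2.0, 1.0, (100, 3))])
good = np.arange(200) < 100          # separates the two clusters
bad = np.arange(200) % 2 == 0        # mixes them
print(split_improvement(y, good) > split_improvement(y, bad))  # True
```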


Leaving One Out

So far without leaving-one-out:

$$\max_\theta \sum_n \log p_\theta(y_n) \;\Rightarrow\; \hat\theta = \hat\theta(y_1^N)$$

The optimal value is substituted into the log-likelihood:

$$\sum_n \log p_{\hat\theta(y_1^N)}(y_n),$$

so every observation is considered twice:

1. to determine $\hat\theta$;
2. to determine how well the model $p_{\hat\theta}(y)$ explains the observations $y_1, \ldots, y_N$.

⇒ The score evaluation is too optimistic.


Leaving One Out

Idea: take $y_n$ out of the training set when evaluating $p_\theta(y_n)$.

Use $\hat\theta(y_1^N \setminus y_n)$ instead of $\hat\theta(y_1^N)$ for the leaving-one-out score evaluation:

$$\sum_n \log p_{\hat\theta(y_1^N \setminus y_n)}(y_n)$$

For Gaussian distributions, the calculation of $\hat\theta(y_1^N \setminus y_n)$ is easy:

$$\mu_n := \frac{1}{N-1} \sum_{m \neq n} y_m, \qquad \Sigma_n := \frac{1}{N-1} \sum_{m \neq n} (y_m - \mu_n)(y_m - \mu_n)^T$$
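A quick numerical check of the leaving-one-out estimates (1-D toy data): the mean can be downdated from the full-data sum instead of being re-estimated from scratch:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 50)
N, total = len(y), y.sum()

for n in range(3):
    mu_n = (total - y[n]) / (N - 1)                 # mean without y[n]
    var_n = (((y - mu_n) ** 2).sum() - (y[n] - mu_n) ** 2) / (N - 1)
    # Reference computation, deleting y[n] explicitly:
    rest = np.delete(y, n)
    assert np.isclose(mu_n, rest.mean())
    assert np.isclose(var_n, ((rest - mu_n) ** 2).mean())
```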


Language Modelling

Goal: model the syntax and semantics of natural language (spoken or written).

Language models are needed in automatic systems that process speech (= spoken language) or language (= written language):

- speech recognition
- speech and text translation
- (spoken and written) language understanding
- spoken dialog systems
- text summarization
- ...


Example

Finite state networks for digit strings: syntactical constraints can be expressed with formal grammars, here represented as networks.

- String of three digits: [Figure: linear network over nodes 1–4 with three digit arcs V.]
- String with an even number of digits: [Figure: network over nodes 1–3 with a cycle of two digit arcs V.]
- Unconstrained digit string: [Figure: single node 1 with a digit loop arc V.]

Here V stands for one of the digit words zero, one, two, three, ..., nine; ε denotes the empty transition.


Silence Model

Allow silence between the words:

- String of three digits: [Figure: the three-digit network with additional silence (Sil) loop arcs at nodes 1–4.]
- String with an even number of digits: [Figure: the even-digit network with Sil loops at nodes 1–3.]
- Unconstrained digit string: [Figure: single node 1 with loop arc V | Sil.]

Again V stands for one of the digit words zero, one, two, three, ..., nine; ε denotes the empty transition.


Unfolding the network

For the recognition, the network has to be unfolded along the time axis:

[Figure: the three-digit network (nodes 1, 2, 3 with digit arcs V and silence arcs Sil) unfolded over time t: at each time frame, a copy of the network states s is connected to the next frame by V and Sil transitions.]

The computational complexity is proportional to the number of acoustic transitions.

As shown later, it is favorable to reduce the number of "real", i.e. acoustic, transitions.


Language model networks

As shown in the example, a network consists of transitions and nodes.

- Transitions: correspond to spoken words (including silence).
- Nodes: every word has a start and an end node; they define the syntactic (linguistic) context of the transition.

As shown below, a word A can occur in four different contexts.


Language model networks

Possible contexts for a word A:

[Figure: four network fragments a)–d) showing the word A between nodes 1, 2, 3 in different start/end node configurations.]

In case a) the automaton is non-deterministic.


Bayes Decision Rule and Perplexity

- Bayes decision rule using the maximum approximation:

$$\big[w_1^N\big]_{\text{opt}} = \operatorname*{arg\,max}_{[w_1^N]} \left\{ \Pr(w_1^N) \cdot \max_{[s_1^T]} \prod_{t=1}^{T} p(x_t, s_t \mid s_{t-1}, w_1^N) \right\}$$

- The perplexity (corpus perplexity / test perplexity) of a language model and a test corpus $[w_1^N]$ is defined as

$$PP = \Pr(w_1^N)^{-\frac{1}{N}} = \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_1^{n-1}) \right]^{-\frac{1}{N}}$$


Bayes Decision Rule and Perplexity

- The logarithm of the perplexity then is:

$$\log PP = \log\Big[ \Pr(w_1^N)^{-\frac{1}{N}} \Big] = -\frac{1}{N} \sum_{n=1}^{N} \log \Pr(w_n \mid w_1^{n-1})$$

A small perplexity corresponds to strong language model restrictions.

Properties of the perplexity:

- normalization: probability per word
- inverse probability: number of possible choices per word position
- probability zero: infinite penalty
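A toy sketch of the definition (made-up unigram model): the perplexity is the inverse geometric mean of the word probabilities over the corpus:

```python
import math

# Hypothetical unigram model, purely for illustration.
model = {"the": 0.4, "cat": 0.3, "sat": 0.3}

def perplexity(corpus: list) -> float:
    # log PP = -(1/N) * sum_n log Pr(w_n | history); unigram: no history.
    log_pp = -sum(math.log(model[w]) for w in corpus) / len(corpus)
    return math.exp(log_pp)

print(perplexity(["the", "cat", "sat"]))  # about 3.0, between 1/0.4 and 1/0.3
# With a uniform model over W words, PP = W, as derived on the next slide.
```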


Bayes Decision Rule and Perplexity

Now assume constant probabilities (no dependence on the word and between words):

$$\Pr(w_1^N) = \prod_{n=1}^{N} \frac{1}{W} = \left( \frac{1}{W} \right)^N \qquad \text{with } W = \text{size of the vocabulary}$$

Then the perplexity becomes:

$$PP = \Pr(w_1^N)^{-\frac{1}{N}} = \left[ \left( \frac{1}{W} \right)^N \right]^{-\frac{1}{N}} = W$$

Note: in this case, the perplexity only depends on the vocabulary size $W$. Nevertheless, in the general case the perplexity depends on the test corpus it is computed on.


Language Model Networks

Language model $\Pr(w_1^N)$ in networks: a deterministic finite state automaton (DFA) is defined by a transition function $\delta$:

$$\mathcal{V} = \{1, \ldots, V\} \quad \text{nodes (linguistic contexts)}$$
$$\mathcal{W} = \{1, \ldots, W\} \quad \text{arcs (words including silence)}$$
$$\delta : \mathcal{V} \times \mathcal{W} \to \mathcal{V}, \qquad (v, w) \mapsto v' = \delta(v, w)$$

A path in the network defines a word sequence $[w_1^N]$.

For every word $w$ given a node $v$, a probability $p(w \mid v)$ is defined:

$$p(w \mid v) \begin{cases} = 0 & \text{if word } w \text{ does not leave node } v \\ \leq 1 & \text{else} \end{cases}$$


Language Model Networks

The sum of the probabilities $p(w \mid v)$ over all words $w$ for each node $v$ is:

$$\sum_{w=1}^{W} p(w \mid v) = 1$$

[Figure: node $v$ with outgoing word arcs $w = 1, \ldots, W$.]

Index convention: $v_{n+1} := \delta(v_n, w_n)$ for the transition $v_n \xrightarrow{\,w_n\,} v_{n+1}$.


Language model networks

In the general case of non-deterministic finite state automata (NFA), the sum over all paths corresponding to a word sequence $w_1^N$ has to be calculated:

$$\Pr(w_1^N) = \sum_{v_1^N} \left\{ \prod_{n=1}^{N} p(w_n \mid v_n) \right\}.$$

Often the maximum approximation is used:

$$\Pr(w_1^N) \approx \max_{v_1^N} \left\{ \prod_{n=1}^{N} p(w_n \mid v_n) \right\} = \max_{v_1^N} \left\{ \prod_{n=1}^{N} p\big(w_n \mid \delta(v_{n-1}, w_{n-1})\big) \right\}$$

Note the hierarchical structure of the grammars:

- HMM: acoustic modelling for each word, defining the correspondence between word classes and acoustic vectors
- LM: network

Page 87: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

Language model networks

Non-deterministic and deterministic finite state automata (NFA and DFA):

I general case NFA:

  p(w, v | v') = p(v | v') · p(w | v', v)
                 (transition prob.) (emission prob.)

I special case DFA: given a pair (v', w), the successor state v is determined by v = δ(v', w); therefore a different factorization of p(w, v | v') is useful:

  p(w, v | v') = p(w | v') · p(v | v', w)

  with p(v | v', w) = 1 if v = δ(v', w), and 0 if v ≠ δ(v', w)

  For an allowed transition (v', w) → v = δ(v', w) the probability is: p(w, v | v') = p(w | v')


Dynamic Programming Recursion

Search using language model networks: dynamic programming.

Here the auxiliary quantity Q_v(t, s; w) used to derive the dynamic programming recursion is defined as:

  Q_v(t, s; w) := probability of the best path at time t leading to the state s of word w with starting node v.

Note the additional index v.


Dynamic Programming Recursion

I Within words: acoustic search

  Q_v(t, s; w) = max_{s'} { Q_v(t-1, s'; w) · p(x_t, s | s', w) }

  σ_v^opt(t, s; w) := argmax_{s'} { Q_v(t-1, s'; w) · p(x_t, s | s', w) }

  B_v(t, s; w) = B_v(t-1, σ_v^opt(t, s; w); w)

I Word boundaries: language model recombination

  Q_v(t-1, 0; w) = max_{v', w': δ(v',w')=v} { Q_{v'}(t-1, S(w'); w') · p(w|v) }
                 = p(w|v) · max_{v', w': δ(v',w')=v} Q_{v'}(t-1, S(w'); w')

  B_v(t-1, 0; w) = t-1


Dynamic Programming Recursion

Word boundaries: language model recombination

[Figure: word-end hypotheses (v'_1, w'_1), . . . , (v'_N, w'_N) with δ(v', w') = v are recombined at the language model node v, which then starts the words w = 1, . . . , W leaving v.]


Dynamic Programming Recursion

The dynamic programming recursion is carried out for every word w and node v. The context defined by v has to be considered in the traceback arrays:

Score:
  H(v, t) = max_{v', w': δ(v',w')=v} Q_{v'}(t, S(w'); w')

Starting node, word:
  (V, W)(v, t) = argmax_{v', w': δ(v',w')=v} Q_{v'}(t, S(w'); w')

Backpointer:
  B(v, t) = B(t, S(W(v, t)); W(v, t))

The index pair (predecessor node, word) is stored in the traceback array (V, W)(v, t). It can be interpreted as a linguistic copy of word w in the context v.


Example

Language model network and corresponding language model transitions:

[Figure: example network with nodes 1–4 and word arcs A, B, C, D, E plus silence (Sil) at each node, shown next to its unfolded trellis over time; acoustic transitions, empty transitions, and language model transitions are marked.]


m-Gram Language Models

Factorization without restrictions (w ∈ W ∪ {$}; $ = sentence end):

  Pr(w_1^N) = ∏_{n=1}^{N} Pr(w_n | w_1^{n-1})

Limit the dependence:

  Unigram LM:          Pr(w_1^N) = ∏_{n=1}^{N} p(w_n)
  Position Unigram LM: Pr(w_1^N) = ∏_{n=1}^{N} p(w_n | n)
  Bigram LM:           Pr(w_1^N) = ∏_{n=1}^{N} p(w_n | w_{n-1})
  Trigram LM:          Pr(w_1^N) = ∏_{n=1}^{N} p(w_n | w_{n-2}, w_{n-1})


Training Bigram LMs

Training bigram language models: count words and word pairs:

  p(w|v) = N(v, w) / N(v)

  N(v, w) : count of the word pair (v, w) in the training text
  N(v)    : count for word v


Training Bigram LMs

Motivation: bigram probability:

  Pr(w_1^N) = ∏_{n=1}^{N} p(w_n | w_{n-1})

Maximize the log-likelihood function:

  F = ∑_{n=1}^{N} log p(w_n | w_{n-1})    with ∑_w p(w|v) = 1 ∀ v

Collecting equal word pairs into counts and adding Lagrange multipliers µ_v for the normalization constraints:

  F = ∑_{v,w} N(v, w) log p(w|v) − ∑_v µ_v [ ∑_w p(w|v) − 1 ]


Training Bigram LMs

Set the derivatives of the log-likelihood w.r.t. p(w|v) and µ_v to zero to obtain the maximum:

  ∂F/∂p(w|v) = N(v, w)/p(w|v) − µ_v = 0

  ∂F/∂µ_v = ∑_w p(w|v) − 1 = 0

Solution:

  p(w|v) = N(v, w) / ∑_{w'} N(v, w') = N(v, w) / N(v)
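This closed-form solution is a one-liner over the count tables; a minimal sketch (the container layout is an assumption for illustration, not the teaching-patch code):

// Relative-frequency bigram estimate p(w|v) = N(v,w) / N(v).
#include <map>
#include <string>
#include <utility>

typedef std::map<std::pair<std::string, std::string>, int> PairCounts;  // N(v,w)
typedef std::map<std::string, int> WordCounts;                          // N(v)

double bigramML(const PairCounts& nvw, const WordCounts& nv,
                const std::string& v, const std::string& w) {
    PairCounts::const_iterator p = nvw.find(std::make_pair(v, w));
    WordCounts::const_iterator u = nv.find(v);
    if (p == nvw.end() || u == nv.end())
        return 0.0;   // unseen pair: zero probability without discounting
    return double(p->second) / double(u->second);
}

The zero for unseen pairs is exactly the problem addressed by discounting below.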


Discounting

Problem: many pairs (v, w) are not seen in training, N(v, w) = 0, so their relative frequency is zero.

Discounting: shift probability mass from seen to unseen events.

I Linear discounting:

  p(w|v) = (1 − λ) · N(v, w)/N(v)                      if N(v, w) > 0
         = λ · p(w) / ∑_{w': N(v,w')=0} p(w')          if N(v, w) = 0

Estimate 0 < λ < 1 by leaving one out: leave (v, w) out of the corpus
→ change counts: N(v, w) → N(v, w) − 1 for N(v, w) > 1


Linear Discounting

Leaving-one-out distribution with linear discounting:

  p_{−1}(w|v) = (1 − λ) · (N(v, w) − 1)/(N(v) − 1)     if N(v, w) > 1
              = λ · p(w) / ∑_{w': N(v,w')=1} p(w')     if N(v, w) = 1

Log-likelihood criterion:

  F(λ) = ∑_{v,w} N(v, w) · log p_{−1}(w|v)
       = ∑_{v,w: N(v,w)>1} N(v, w) · log[ (1 − λ) · (N(v, w) − 1)/(N(v) − 1) ]
       + ∑_{v,w: N(v,w)=1} N(v, w) · log[ λ · p(w) / ∑_{w': N(v,w')=1} p(w') ]


Linear Discounting

Rewrite the log-likelihood criterion:

  F(λ) = ∑_{v,w: N(v,w)>1} N(v, w) · log(1 − λ) + ∑_{v,w: N(v,w)=1} N(v, w) · log λ
       + ∑_{v,w: N(v,w)>1} N(v, w) · log[ (N(v, w) − 1)/(N(v) − 1) ]
       + ∑_{v,w: N(v,w)=1} N(v, w) · log[ p(w) / ∑_{w': N(v,w')=1} p(w') ]
         (the last two sums do not depend on λ: const(λ))

       = [ N − ∑_{v,w: N(v,w)=1} N(v, w) ] · log(1 − λ)
       + ∑_{v,w: N(v,w)=1} N(v, w) · log λ + const(λ)

       = (N − n_1) · log(1 − λ) + n_1 · log λ + const(λ)


Linear Discounting

Log-likelihood criterion:

  F(λ) = (N − n_1) · log(1 − λ) + n_1 · log λ + const(λ)

  with n_1 := ∑_{v,w: N(v,w)=1} 1 = number of bigram singletons
       N   := size of the corpus

Differentiate and set to zero to obtain the maximum w.r.t. λ:

  λ = n_1 / N
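A minimal sketch of this estimate over an assumed bigram count table (counting the singletons n_1 and the corpus size N):

// Leaving-one-out estimate lambda = n1 / N for linear discounting.
#include <map>
#include <string>
#include <utility>

typedef std::map<std::pair<std::string, std::string>, int> PairCounts;

double estimateLambda(const PairCounts& counts) {
    long n1 = 0;   // number of bigram singletons
    long N  = 0;   // corpus size = sum of all bigram counts
    for (PairCounts::const_iterator it = counts.begin(); it != counts.end(); ++it) {
        N += it->second;
        if (it->second == 1)
            ++n1;
    }
    return N > 0 ? double(n1) / double(N) : 0.0;
}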


Absolute Discounting

I Absolute discounting:

  p(w|v) = (N(v, w) − b)/N(v)                                       if N(v, w) > 0
         = b · (W − W_0(v))/N(v) · p(w) / ∑_{w': N(v,w')=0} p(w')   if N(v, w) = 0

  W      := vocabulary size
  W_0(v) := number of words that do not occur as successor of v.


Absolute Discounting

Leaving-one-out approach with maximum-likelihood estimation:

  F(b) = n_1 · log b + ∑_{v,w: N(v,w)>1} N(v, w) · log[ (N(v, w) − 1 − b)/(N(v) − 1) ]
       = n_1 · log b + ∑_{r>1} r · n_r · log(r − 1 − b) + const(b)

  with n_r := ∑_{v,w: N(v,w)=r} 1 = number of word pairs seen r times

Differentiate F(b) w.r.t. b and rewrite:

  n_1/b − 2 n_2/(1 − b) = ∑_{r>2} r n_r / (r − 1 − b)


Absolute Discounting

There is no closed-form solution, but the following estimate can be proven:

  n_1 / ( n_1 + 2 n_2 + (1/2) [N − n_1 − 2 n_2] )  ≤  b  ≤  n_1 / (n_1 + 2 n_2)

Usually the upper bound is a sufficient estimate:

  b = n_1 / (n_1 + 2 n_2)

For corpora of 10–20 million words and a vocabulary of 10,000–20,000 words, b ≈ 0.95.
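The upper bound only needs the counts-of-counts n_1 and n_2; a minimal sketch (same assumed count table as before):

// Absolute-discounting parameter: upper bound b = n1 / (n1 + 2*n2).
#include <map>
#include <string>
#include <utility>

typedef std::map<std::pair<std::string, std::string>, int> PairCounts;

double estimateB(const PairCounts& counts) {
    long n1 = 0, n2 = 0;   // pairs seen exactly once resp. exactly twice
    for (PairCounts::const_iterator it = counts.begin(); it != counts.end(); ++it) {
        if (it->second == 1) ++n1;
        else if (it->second == 2) ++n2;
    }
    if (n1 + 2 * n2 == 0)
        return 0.0;        // degenerate corpus: no singletons or doubletons
    return double(n1) / double(n1 + 2 * n2);
}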

Result: an LM where all w_1^N can be recognized, i.e. Pr(w_1^N) > 0.

Ideal case:
  typical word sequence w_1^N:   high probability Pr(w_1^N)
  possible word sequence w_1^N:  low probability Pr(w_1^N)
  untypical word sequence w_1^N: very low probability Pr(w_1^N)


m-Grams

Using m-grams, homophones (phonetically equal words with different spellings) can be distinguished.

Examples from the IBM TANGORA system:

I To, too, two:
  Twenty-two people are too many to be put in this room.

I Right, Wright, write:
  Please write to Mrs. Wright right away.


Bigram LM Complexity

A bigram for the three words A, B, and C represented as a network:

[Figure: fully connected network over the words A, B, C with silence (sil) arcs.]

With W words the network has W^2 arcs plus W arcs for silence.

Problem: the computational complexity rises like W^2.

Introducing empty transitions can help.


Bigram LM Complexity

Bigram LM with empty transitions:

[Figure: network over A, B, C with silence arcs and empty transitions; start node: 0, end nodes: 1, 2, 3.]


Bigram LM Complexity

Bigram LM: silence as part of the words:

[Figure: network over A, B, C with optional trailing silence and ε-transitions; start node: 0, end nodes: 1, 2, 3.]


Unfolding the Bigram LM

Unfolding the bigram over time:

[Figure: bigram network for A, B, C and Sil unfolded over time from t to t+1; acoustic transitions, empty transitions, and language model transitions are marked.]


Unfolding the Bigram LM

Unfolding the bigram over time (silence as part of the words):

[Figure: as before, but with the silence arcs attached to the word ends; acoustic transitions, empty transitions, and language model transitions are marked.]


Bigram LM in Recognition

Bigram LM in recognition: the network has the following transitions:

  Sil    (silence at the beginning of the sentence)
  A      word A
  B      word B
  :
  ASil   (silence after word A)
  BSil   (silence after word B)
  :

We augment the vocabulary w as follows:

  w ∈ {Sil, A, B, . . . , ASil, BSil, . . . }


Bigram LM in Recognition

The auxiliary quantity Q(t, s; w) for dynamic programming is defined as:

  Q(t, s; w) := probability of the best partial path at time t leading to state s of word w.

The recursion then is:

· Within words:

  Q(t, s; w) = max_{s'} { Q(t-1, s'; w) · p(x_t, s | s', w) }

· Word boundaries:

  Q(t-1, 0; w) = max_v { Q(t-1, S(v); v) · p(w|v) }

(The special handling of the silence word is not expressed in the equations.)
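The two recursion steps can be sketched as follows. The data layout (one score array per word, a dense lm[v][w] table, per-frame combined transition-emission scores, and a standard 0/1/2 transition topology) is an assumption for illustration, not the actual implementation discussed later:

// Within words: Q(t,s;w) = max_{s'} Q(t-1,s';w) * p(x_t, s | s', w).
// arc[s][sPrev] holds the combined transition*emission score for this frame.
#include <algorithm>
#include <cstddef>
#include <vector>

typedef float Score;

void expandWord(const std::vector<std::vector<Score> >& arc,
                const std::vector<Score>& qPrev, std::vector<Score>& q) {
    const std::size_t S = qPrev.size();
    for (std::size_t s = 0; s < S; ++s) {
        q[s] = 0;
        // left-to-right HMM: loop, forward and skip transitions (s' = s, s-1, s-2)
        for (std::size_t sPrev = (s >= 2 ? s - 2 : 0); sPrev <= s; ++sPrev)
            q[s] = std::max(q[s], qPrev[sPrev] * arc[s][sPrev]);
    }
}

// Word boundaries: Q(t-1,0;w) = max_v Q(t-1,S(v);v) * p(w|v).
Score recombine(const std::vector<Score>& qFinal,            // Q(t-1,S(v);v) per v
                const std::vector<std::vector<Score> >& lm,  // lm[v][w] = p(w|v)
                std::size_t w) {
    Score best = 0;
    for (std::size_t v = 0; v < qFinal.size(); ++v)
        best = std::max(best, qFinal[v] * lm[v][w]);
    return best;
}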


Bigram LM in Recognition

In principle the probability p(w|v) is the LM, but silence transitions require special interpretation:

  Transition     LM probability p(w|v)
  A – B          p(B|A)
  A – ASil       1
  ASil – B       p(B|A)
  ASil – ASil    1
  Sil – B        p(B) : unigram
  ASil – BSil    0 : not possible
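This table can be implemented as a small lookup over the augmented vocabulary; the entry layout below is an assumption for illustration:

// p(w|v) for the augmented vocabulary {Sil, A, B, ..., ASil, BSil, ...}.
#include <vector>

struct Entry {
    int  baseWord;     // underlying word index; -1 for sentence-initial silence
    bool silenceCopy;  // true for entries like "ASil"
};

// bigram[v][w] = p(w|v), unigram[w] = p(w)
double lmScore(const Entry& from, const Entry& to,
               const std::vector<std::vector<double> >& bigram,
               const std::vector<double>& unigram) {
    if (to.silenceCopy)        // A - ASil and ASil - ASil: 1; ASil - BSil: 0
        return to.baseWord == from.baseWord ? 1.0 : 0.0;
    if (from.baseWord < 0)     // Sil - B: unigram p(B)
        return unigram[to.baseWord];
    return bigram[from.baseWord][to.baseWord];  // A - B and ASil - B: p(B|A)
}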


Bigram LM in Recognition

Traceback arrays: it is easiest to store the decisions about words w starting at time t:

  score:       H(w, t) = max_v { Q(t, S(v); v) · p(w|v) }
  predecessor: V(w, t) = argmax_v { Q(t, S(v); v) · p(w|v) }
  backpointer: B(w, t) = B(t, S(V(w, t)); V(w, t))


Bigram LM in Recognition

Remarks:

I Due to the regular structure of the bigram it is sufficient to store either the LM nodes or the predecessor words in the traceback arrays.

I The traceback at the end of the sentence has to start at the word ends, not at the word beginnings.

I The real implementation is different, to optimize memory efficiency:
  I traceback arrays with one index instead of a pair (w, t)
  I when using beam search (following section) the number of word ends reached is smaller; instead of storing word beginnings it is more efficient to store word ends.


Trigram LM

The trigram language model probability is given by:

  Pr(w_n | w_1^{n-1}) = p(w_n | w_{n-2}, w_{n-1})

Notation: (u, v, w) = (w_{n-2}, w_{n-1}, w_n). u, v are the predecessor words of w; these have to be considered in the LM recombination.

The auxiliary quantity for dynamic programming is defined as:

  Q_v(t, s; w) := probability of the best path at time t leading to the state s of word w with predecessor word v.

I For each word w a copy for every predecessor word v has to be made.

I The cost of an arc only depends on the arc itself. This allows the practical implementation of dynamic programming.


Unfolding the Trigram LM

Trigram LM recombination:

[Figure: trigram recombination for the vocabulary A, B, C: one word copy per predecessor word, unfolded over time from t to t+1.]


Trigram LM

Dynamic programming recursion:

I within words:

  Q_v(t, s; w) = max_{s'} { Q_v(t-1, s'; w) · p(x_t, s | s', w) }

I word boundaries:

  Q_v(t-1, 0; w) = max_u { Q_u(t-1, S(v); v) · p(w | u, v) }

Traceback arrays (at word beginnings):

  score:       H(v, w, t) = max_u { Q_u(t, S(v); v) · p(w | u, v) }
  predecessor: U(v, w, t) = argmax_u { Q_u(t, S(v); v) · p(w | u, v) }
  backpointer: B(v, w, t) = B_{U(v,w,t)}(t, S(v); v)

Silence: in principle, silence is treated as in the bigram LM; the implementation is more complex.


Trigram LM: Traceback Implementation

word string: w_1, . . . , w_n, . . . , w_N with word boundaries t_1, . . . , t_n, . . . , t_N
sentence end symbol: $ (= Sil)

I Note: traceback in reverse order (start at the end with n = 1)

I Initialization: best word end

  (w_2, w_1) = argmax_{v,w} { Q_v(T, S(w); w) · p($ | v, w) }

  t_1 = T;  t_2 = B_{w_2}(T, S(w_1); w_1)

I Loop: n = 2
  while t_n > 0 do
    n = n + 1
    w_n = U(w_{n-1}, w_{n-2}, t_{n-1})
    t_n = B(w_{n-1}, w_n, t_{n-1})
  N = n − 1
  reverse: (w_1, t_1), . . . , (w_N, t_N) ← (w_N, t_N), . . . , (w_1, t_1)
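A sketch of this loop in C++; the accessors U and B stand in for the traceback arrays filled during search (an assumed interface, not the teaching-patch code):

// Reverse traceback of the best word sequence for a trigram LM.
#include <algorithm>
#include <vector>

typedef int (*Accessor)(int v, int w, int t);  // U(v,w,t) resp. B(v,w,t)

void traceback(Accessor U, Accessor B, int w1, int t1, int w2, int t2,
               std::vector<int>& words, std::vector<int>& times) {
    words.clear(); times.clear();
    words.push_back(w1); times.push_back(t1);  // best word end at t1 = T
    words.push_back(w2); times.push_back(t2);
    while (times.back() > 0) {                 // walk backwards to t = 0
        std::size_t n = words.size();          // the new entry gets index n
        int w = U(words[n-1], words[n-2], times[n-1]);
        int t = B(words[n-1], w, times[n-1]);
        words.push_back(w);
        times.push_back(t);
    }
    std::reverse(words.begin(), words.end());  // restore spoken order
    std::reverse(times.begin(), times.end());
}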


Time Complexity of DP Beam Search

Time complexity for full DP search (later: DP beam search):

  T = number of time frames of the test utterance
  W = number of acoustic reference words
  S = average number of states per (acoustic) word
  K = number of positions in the position unigram
  silence model: 1 state

  language model    acoustic search comparisons   language model
  type              (= 3 · HMM states)            comparisons

  unigram           3 · T · [W · S + 1]           T · [W + 1]
  position unigram  3 · T · K · [W · S + 1]       T · K · [W + 1]
  bigram            3 · T · [W · S + W]           T · W · [W + 1]
  trigram           3 · T · W · [W · S + W]       T · W · [W^2 + 1]


Memory Complexity of DP Search

Memory requirements:

I acoustic search: one column for backpointer and score

I language model recombination: traceback arrays with one entry for each LM node

  LM type           acoustic search           language model

  unigram           2 · [W · S + 1]           2 · T
  position unigram  2 · K · [W · S + 1]       2 · T · K
  bigram            2 · [W · S + W]           2 · T · [W + 1]
  trigram           2 · W · [W · S + . . . ]  2 · T · [W^2 + . . . ]


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Beam Search

Dynamic programming beam search along with implementation details.

The search consists of these principal components:

I language model recombination: word boundaries

I acoustic search: word interior

I bookkeeping: decisions about word (and boundary) hypotheses

I traceback: construct the best scoring word sequence

Modifications for large vocabulary systems as opposed to digit string recognition:

I limit the search space by beam search

I modified bookkeeping for active hypotheses (due to beam search)

I modified bookkeeping for traceback arrays (due to beam search)

I garbage collection for traceback arrays (due to beam search)


Traceback Arrays

Traceback arrays:

I Use one index:
  I less memory (exhaustive search)
  I with beam search, only few word ends are reached

I Bookkeeping is possible at these stages:
  I at the word ends
  I at the LM nodes (most efficient, smallest number of hypotheses)
  I at the word beginnings

I Organization:
  I so far: one element in the traceback array per time frame
  I now: the backpointer does not point at the time frame the predecessor word ended; it points at the array element with the corresponding information.


Traceback Arrays

Reminder: the entries of the traceback array define the nodes of a tree over time.

I Garbage collection (beam search): hypotheses can be pruned; array entries no backpointer points at are labeled as free.

I Partial traceback: if all backpointers of active hypotheses point at one entry in the traceback array, the decision before this entry is determined.

I Experimental experience (beam search): the delay depends on the task, 1–2 words when using partial traceback.


Beam Search: Pruning

Beam search:

I suboptimal heuristic approach: give up the global optimum.

I time synchronous search: the remaining cost of the path is constant for all hypotheses.

I baseline method for pruning: discard unlikely hypotheses at every time frame t:

  I Acoustic pruning: retains state hypotheses whose scores are close to the score of the best state hypothesis:

    Q_AC(t) := max_{(v,s)} Q_v(t, s),

    prune state hypothesis (t, s; v) iff:

    Q_v(t, s) < f_AC · Q_AC(t)


Beam Search: Pruning

I additional pruning steps:

  I Language model pruning: retains tree start-up hypotheses whose score is close to the score of the best tree start-up hypothesis (a sketch of the threshold pruning follows below):

    Q_LM(t) := max_v Q_v(t, s = 0),

    prune tree start-up hypothesis iff:

    Q_v(t, s = 0) < f_LM · Q_LM(t)

  I Histogram pruning: limits the number of surviving state hypotheses to a maximum number (MaxHyp).
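Both threshold prunings work the same way on a flat list of state hypotheses; a minimal sketch of the acoustic variant (the StateHyp layout is an assumption for illustration):

// Acoustic pruning: keep only hypotheses with Q_v(t,s) >= fAC * Q_AC(t).
#include <algorithm>
#include <vector>

struct StateHyp { int word; int state; float score; };  // score = probability Q

void acousticPrune(std::vector<StateHyp>& hyps, float fAC) {
    float qAC = 0.0f;                          // Q_AC(t): best score at this frame
    for (std::size_t i = 0; i < hyps.size(); ++i)
        qAC = std::max(qAC, hyps[i].score);
    std::vector<StateHyp> kept;
    for (std::size_t i = 0; i < hyps.size(); ++i)
        if (hyps[i].score >= fAC * qAC)        // prune iff score < fAC * Q_AC(t)
            kept.push_back(hyps[i]);
    hyps.swap(kept);
}

Language model pruning is identical in structure, applied only to the hypotheses with s = 0.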


Beam Search: Pruning

Using pruning techniques can lead to search errors, inducing recognition errors.

Remember: possible reasons for recognition errors:

I shortcomings of the acoustic models

I shortcomings of the language models

I search errors (when using beam search or other heuristic methods)

In general, better acoustic models and language models focus the search space, i.e. they allow for tighter pruning thresholds.


Beam Search: Pruning

Illustration of the search process (DP beam search) for connected digit recognition:

[Figure: active state hypotheses plotted over time frames 0–300.]


Beam Search: Example

Example of the dependency between the search space and the word error rate (WER).
WSJ task, vocabulary size = 20 000 words, bigram LM, PP = 200.

AcuThr: acoustic pruning threshold.
States: average number of state hypotheses in the HMM after pruning.

  AcuThr   States      WER
  (k)      (average)   [%]
  50       252         45.6
  60       677         28.3
  65       1068        24.2
  75       2396        20.6
  100      12908       18.4
  110      21894       18.3
  120      32538       18.2
  130      43862       18.2


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


RWTH ASR System: Teaching Patch

Classes and dependencies:

  Search::SearchAlgorithm
  SearchInterface
  LinearSearch
  LinearSearch::SearchSpace
  Lexicon
  Bookkeeping

RWTH ASR System:
  - acoustic model
  - language model
  - corpus handling
  - pronunciation lexicon handling
  - general search environment

RWTH ASR Teaching Patch:
  - interface to RWTH ASR System
  - implementation of linear search, including bookkeeping and traceback


RWTH ASR System: Teaching Patch

Types (Types.hh):

#ifndef _TEACHING_TYPES_HH
#define _TEACHING_TYPES_HH

#include <vector>
#include <limits>

namespace Teaching {
    typedef unsigned int    Time;
    typedef unsigned short  Mixture;
    typedef unsigned int    Word;
    typedef unsigned short  Phoneme;
    typedef unsigned short  State;
    typedef unsigned int    Index;
    typedef std::vector<Word>    WordSequence;
    typedef std::vector<Mixture> MixtureSequence;
    typedef float Score;

    static const Word  invalidWord  = std::numeric_limits<Word>::max();
    static const Index invalidIndex = std::numeric_limits<Index>::max();
    static const Score maxScore     = std::numeric_limits<Score>::max();
}

#endif // _TEACHING_TYPES_HH


Interface to RWTH ASR System

General interface to the teaching patch: class SearchInterface provides the connection to the general search environment, including handling of configuration and resources (corpus and models), as well as an overall wrapper for the specific search implementation.

Main functions to be implemented here are:

I initialize: search initialization

I processFrame: expansion of hypotheses to the next time frame

I getResult: traceback of the best recognized word sequence

Implementation:

I show SearchInterface.hh

I show SearchInterface.cc

I show LinearSearch.hh


Interface to RWTH ASR System

Phoneme list and pronunciation lexicon. Configuration file: XML format, example:

<?xml version="1.0" encoding="ascii"?>
<lexicon>
  <phoneme-inventory>
    <phoneme><symbol>AE</symbol></phoneme>
    <phoneme><symbol>AH</symbol></phoneme>
    <phoneme><symbol>N</symbol></phoneme>
    <phoneme><symbol>D</symbol></phoneme>
    ...
  </phoneme-inventory>
  <lemma>
    <orth>AND</orth>
    <phon>AE N D</phon>
    <phon>AH N D</phon>
  </lemma>
  ...
</lexicon>

Lexicon configuration file: show an4.lexicon
Implementation: show Lexicon.hh, show Lexicon.cc


Example: Implementation of Dynamic Programming Beam Search for Bigram LM

Consider the unfolded bigram network:

[Figure: bigram network for A, B, C and Sil unfolded over time from t to t+1; acoustic transitions, empty transitions, and language model transitions are marked.]



Dynamic Handling of State Hypotheses

Goal: complexity should be linear in the number of active hypotheses.
⇒ discard low probability hypotheses
⇒ incomplete state hypotheses

[Figure: expansion of active states from time frame t to t+1; dead states are skipped.]

Efficient expansion of the hypotheses from t to t+1 requires these operations:

I search(x, S)
I insert(x, S)
I initialize(S)
I enumerate(S)

Compare methods for set representation:

I dictionary operations
I array representation of sets
I inverted lists and bit vectors


Linear Search Implementation

Search space representation: pruning necessitates dynamic handling of word and state hypotheses:

I List of active words:
  word, stateHypBegin, stateHypEnd, entryStateHypothesis

I List of active states for every word:
  state, score, backpointer

I To address active words, a list of all words pointing into the list of active words is used.

I A list of all states of a word is used to handle active successor states during the expansion of the states of a word.

Implementation: show LinearSearch.cc


Linear Search Implementation

Search space representation: word hypotheses.

[Figure: wordHypothesisMap_ (one entry per word, pointing into the list of active words), wordHypotheses_ (word, stateHypBegin, stateHypEnd, entryStateHypothesis), and stateHypotheses_ (state, score, backpointer); inactive entries are marked invalid.]


Linear Search Implementation

Search space representation: state expansion.

[Figure: expansion of the stateHypotheses_ of a word into newStateHypotheses_, using stateHypothesisMap_ to find already-activated successor states; the word hypothesis entry (stateHypBegin, stateHypEnd, entryStateHypothesis) is updated accordingly.]


Linear Search Implementation

Search space representation: bookkeeping.

Implementation: show BookKeeping.hh, show BookKeeping.cc

[Figure: bookKeeping_ array with entries (word, score, time, timestamp, backpointer), a sentinel entry at sentinelBackpointer = 0, and the backpointers of stateHypotheses_ referencing it; lastTimestamp_ marks the current time frame.]


Implementation of DP Beam Search for Bigram LM

Implementation example in C++ code: show Linear Search


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Excursion (for experts): Language Model Factor

Experiments show that to achieve high performance, it is very important to give the language model Pr(w_1^N) much more weight than the acoustic model Pr(x_1^T | w_1^N).

Why?


Language Model Factor

Starting point: Bayes decision rule with the true models Pr(w_1^N) and Pr(x_1^T | w_1^N):

  argmax_{w_1^N} { Pr(w_1^N) · Pr(x_1^T | w_1^N) }

In training, we compute an estimate of the true models:

  Pr(w_1^N)         → p(w_1^N)
  Pr(x_1^T | w_1^N) → p(x_1^T | w_1^N)

The shapes (i.e. the weights) of the model distributions are changed by exponentiation with exponents α and β:

  p(w_1^N)         → p^α(w_1^N)
  p(x_1^T | w_1^N) → p^β(x_1^T | w_1^N)


Language Model Factor

Instead of re-normalizing each individual model separately, we re-normalize by defining the following posterior probability:

  p(w_1^N | x_1^T) = p^α(w_1^N) · p^β(x_1^T | w_1^N) / ∑_{w̃_1^N} p^α(w̃_1^N) · p^β(x_1^T | w̃_1^N)

                   = p^α(w_1^N) · p^β(x_1^T | w_1^N) / const(w_1^N)


Language Model Factor

Decision rule with weight exponents:

  r(x_1^T) = argmax_{w_1^N} p(w_1^N | x_1^T)
           = argmax_{w_1^N} { p^α(w_1^N) · p^β(x_1^T | w_1^N) / const(w_1^N) }
           = argmax_{w_1^N} { p^α(w_1^N) · p^β(x_1^T | w_1^N) }
           = argmax_{w_1^N} { log[ p^α(w_1^N) · p^β(x_1^T | w_1^N) ] }
           = argmax_{w_1^N} { α log p(w_1^N) + β log p(x_1^T | w_1^N) }
           = argmax_{w_1^N} { (α/β) log p(w_1^N) + log p(x_1^T | w_1^N) }

The factor α/β is referred to as the language model factor (e.g. ≈ 10–15).
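In a log-score implementation this reduces to a single multiplicative factor on the language model score; a one-line sketch with assumed names:

// Combined score with the language model factor lmFactor = alpha/beta;
// the argmax is unchanged by the overall division by beta (beta > 0).
double combinedLogScore(double logLm, double logAc, double lmFactor) {
    return lmFactor * logLm + logAc;
}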


Word Dependent Language Model Factor

Consider the posterior probability with suitable word dependent exponents β(w), where x_n denotes the acoustic vectors aligned to word w_n:

  p(w_1^N | x_1^T) = ∏_n [ p^α(w_n | w_1^{n-1}) · p^{β(w_n)}(x_n | w_n) ]
                     / ∑_{w̃_1^N} ∏_n [ p^α(w̃_n | w̃_1^{n-1}) · p^{β(w̃_n)}(x̃_n | w̃_n) ]


Word Dependent Language Model Factor

Decision rule with word-dependent weight exponents:

  r(x_1^T) = argmax_{w_1^N} p(w_1^N | x_1^T)
           = argmax_{w_1^N} ∏_n [ p^α(w_n | w_1^{n-1}) · p^{β(w_n)}(x_n | w_n) ]
           = argmax_{w_1^N} ∑_n [ α log p(w_n | w_1^{n-1}) + β(w_n) log p(x_n | w_n) ]
           = argmax_{w_1^N} ∑_n [ log p(w_n | w_1^{n-1}) + (β(w_n)/α) log p(x_n | w_n) ]

Effect: word dependent scale factors β(w)/α.
Training: like maximum entropy training.


Scale Factors for Each Knowledge Source

Apply scale exponents to each of the knowledge sources: language model, transition and emission probabilities:

  p(w_1^N | x_1^T) =
      ∏_{n=1}^{N} p^α(w_n | w_{n-2}^{n-1}) · max_{s_1^T} ∏_{t=1}^{T} [ p^β(s_t | s_{t-1}, w_1^N) · p^γ(x_t | s_t, w_1^N) ]
    / ∑_{w̃_1^N} ∏_{n=1}^{N} p^α(w̃_n | w̃_{n-2}^{n-1}) · max_{s_1^T} ∏_{t=1}^{T} [ p^β(s_t | s_{t-1}, w̃_1^N) · p^γ(x_t | s_t, w̃_1^N) ]


Scale Factors for Each Knowledge Source

Resulting Bayes decision rule:

  r(x_1^T) = argmax_{w_1^N} p(w_1^N | x_1^T)
           = argmax_{w_1^N} { α ∑_{n=1}^{N} log p(w_n | w_{n-2}^{n-1})
             + max_{s_1^T} ∑_{t=1}^{T} [ β log p(s_t | s_{t-1}, w_1^N) + γ log p(x_t | s_t, w_1^N) ] }


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Excursion (for experts): Length Modelling

Explicit length models: for x_1^T and w_1^N, the lengths T and N are random variables themselves.

I language model: p(N, w_1^N); check normalization:

  p(N, w_1^N) = p(N) · p(w_1^N | N)

  ∑_{N, w_1^N} p(N, w_1^N) = ∑_N p(N) ∑_{w_1^N} p(w_1^N | N)
                           = ∑_N p(N) · ∑_{w_1^N} ∏_{n=1}^{N} p(w_n | w_1^{n-1}, N)
                           = ∑_N p(N) · ∏_{n=1}^{N} ∑_{w_n} p(w_n | w_1^{n-1}, N)
                           = ∑_N p(N) · 1 = 1

I acoustic model: p(T, x_1^T | w_1^N) (check normalization)


Length Modelling

I Language model:

  p(N, w_1^N) = p(N) · p(w_1^N | N)

  with model assumptions:

              = p(N) · ∏_{n=1}^{N} p(w_n | w_{n-2}^{n-1}, N)


Length Modelling

I Acoustic model with word boundaries t_1^N (with t_0 = 0, t_N = T):

  p(T, x_1^T | w_1^N) = ∑_{t_1^{N-1}} p(t_1^N, x_1^T | w_1^N)

  p(t_1^N, x_1^T | w_1^N) = p(t_1^N | w_1^N) · p(x_1^T | t_1^N, w_1^N)

  with model assumptions:

                          = ∏_{n=1}^{N} [ p(t_n | t_{n-1}, w_n) · p(x_{t_{n-1}+1}^{t_n} | w_n, t_{n-1}^{n}) ]


Length Modelling: Bayes Decision Rule

Optimization criterion (maximum approximation) using a trigram LM p(w_n | w_{n-2}^{n-1}, N) and word segmentation t_1^N (with t_0 = 0, t_N = T):

  max_N { p(N) · max_{w_1^N, t_1^N} ∏_{n=1}^{N} [ p(w_n | w_{n-2}^{n-1}, N) · p(t_n | t_{n-1}, w_n) · p(x_{t_{n-1}+1}^{t_n} | w_n, t_{n-1}^{n}) ] }

with the length models:

I length dependencies in language models: p(N) and p(w_n | w_{n-2}^{n-1}, N),

I duration models of acoustic models: p(t_n | t_{n-1}, w_n).

Experimental results: rarely tested and no significant improvements.


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
   2.1 Review: Linear Search
   2.2 Tree Search
   2.3 Dynamic Programming
   2.4 Pruning Methods
   2.5 Language Model Look-Ahead
   2.6 Phoneme Look-Ahead
   2.7 Subtree Dominance Recombination (SDR)
   2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Review: Linear Search

[Figure: Bayes decision rule architecture. Speech input → acoustic analysis → x1 . . . xT → global search: maximize Pr(w1 . . . wN) · Pr(x1 . . . xT | w1 . . . wN) over w1 . . . wN, using the phoneme inventory and pronunciation lexicon (acoustic model Pr(x1 . . . xT | w1 . . . wN)) and the language model Pr(w1 . . . wN) → recognized word sequence.]


Review: Linear Search

The search process has to find the spoken word sequence in the set of possible word sequences.

The set of possible word sequences can be represented as a tree.

[Figure: tree over the vocabulary A, B, C with branch probabilities such as Pr(B), Pr(A|B), Pr(A|BA).]


Review: Linear Search

Maximize over all possible word sequences:

  [w_1^N]_opt = argmax_{w_1^N} { Pr(w_1^N) · ∑_{s_1^T} Pr(x_1^T, s_1^T | w_1^N) }

Typical assumptions:

I Maximum approximation on state level:

  [w_1^N]_opt = argmax_{w_1^N} { Pr(w_1^N) · max_{s_1^T} Pr(x_1^T, s_1^T | w_1^N) }

Note: only consider state sequences s_1^T that are consistent with w_1^N.


Review: Linear Search

Assumptions used:

I Bigram language model: p(w_n | w_{n-1}) or p(w|v).

[Figure: linear search trellis over time t and word positions n = 1, 2, 3 for the words A, B, C.]

When recognizing continuous speech, language model recombination is carried out every 10 ms because the word boundaries are not known.


Review: Linear Search

Words are represented as phoneme sequences:

I Typically about 40 phonemes (German, French, English).

I A pronunciation lexicon gives the transcription.

I Words are modelled as a concatenation of phoneme models.

I Automatic training.

I Context dependent phonemes:
  I diphones
  I triphones
  i.e. one phoneme in the respective context (not two or three phonemes).

[Figure: HMM trellis of a word: state index over time index, phoneme by phoneme.]


Review: Linear Search

Dynamic programming with beam search:

I left-to-right time synchronous evaluation: direct comparison of scores

I beam search:
  I find the best hypothesis at time t
  I prune unlikely hypotheses

I experiments: only a fraction of the potential search space is active (depends on perplexity, acoustic models, . . . )

I implementation: dynamic search space

[Figure: search space over time 0 . . . T for the word hypotheses w1[1:n1], . . . , wr[1:nr], . . . , wR[1:nR].]


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
   2.1 Review: Linear Search
   2.2 Tree Search
   2.3 Dynamic Programming
   2.4 Pruning Methods
   2.5 Language Model Look-Ahead
   2.6 Phoneme Look-Ahead
   2.7 Subtree Dominance Recombination (SDR)
   2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Tree Search

I Every word end hypothesis activates the initial states of all following words.

I Experiments show that in the case of large vocabulary speech recognition (20 000 word vocabulary) 90% of the search effort is in the first two phonemes.

I Solution: represent the vocabulary as a lexical prefix tree (see the sketch below).

[Figure: example with letters instead of phonemes for the words can, cab, cable, car: a) tree lexicon sharing common prefixes, b) linear lexicon.]
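A minimal sketch of building such a prefix tree from phoneme transcriptions; the node layout and names are assumptions for illustration, not the RWTH implementation:

// Lexical prefix tree: words sharing a phoneme prefix share the same arcs.
#include <map>
#include <string>
#include <vector>

struct TreeNode {
    std::map<std::string, int> succ;  // phoneme symbol -> child node index
    std::vector<int> wordEnds;        // words (incl. homophones) ending here
};

struct LexicalTree {
    std::vector<TreeNode> nodes;
    LexicalTree() : nodes(1) {}       // node 0 is the root

    void addWord(int word, const std::vector<std::string>& phonemes) {
        int n = 0;                    // start at the root, reuse common prefixes
        for (std::size_t i = 0; i < phonemes.size(); ++i) {
            std::map<std::string, int>::const_iterator it =
                nodes[n].succ.find(phonemes[i]);
            int child;
            if (it == nodes[n].succ.end()) {
                child = int(nodes.size());
                nodes.push_back(TreeNode());        // may reallocate the vector,
                nodes[n].succ[phonemes[i]] = child; // so re-index nodes[n] here
            } else {
                child = it->second;
            }
            n = child;
        }
        nodes[n].wordEnds.push_back(word);  // the word identity is known only here
    }
};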


Tree Search

The phonemes (triphones) at the arcs of the lexical tree are modeled as usual with 6-state HMMs.

At the tree nodes the following transitions are possible:

[Figure: HMM transitions at a tree node from arc a into the successor arcs n, b, r.]


Tree Search

The following table shows some statistics of the tree lexicon for the so-called Wall Street Journal task (19 978 words):

I 44 context independent (CI) phoneme models

I 4688 context dependent (CD) phoneme models

I a word: typically 6 phonemes

⇒ compression factor:  CI: 6 · 19 978 / 44 587 ≈ 2.7
                       CD: 6 · 19 978 / 63 155 ≈ 1.9

terminology: lexicon: linear vs. tree
terminology: search:  linear vs. tree


Tree Search

  phoneme     number of   number of arcs
  generation  word ends   CD tree   CI tree
  1           10          544       45
  2           31          3626      652
  3           1476        9335      3799
  4           2959        12180     8245
  5           3791        11653     9432
  6           3576        9402      7932
  7           2845        6829      5895
  8           2240        4563      4025
  9           1414        2632      2341
  10          778         1359      1223
  ...         ...         ...       ...
  16          7           8         8
  17          1           1         1
  sum         19978       63155     44587


Tree Search

I The average compression factor of a tree lexicon compared to a linear lexicon is about 2 to 3.

I Since the first phonemes require the most search effort, the search space reduction using beam search is much larger (a factor of 15 was determined in experiments on the WSJ corpus with a vocabulary of only 5000 words).

I A fundamental problem: the word identity is not known until a word end node is reached. This requires a different language model recombination: use a separate tree for every predecessor word (only a few trees are active due to beam search).


Tree Search

Bigram language model recombination for a three word vocabulary:

[Figure: one lexical tree copy per predecessor word A, B, C; the word ends of all trees are recombined by the language model and start new tree copies.]


Trigram language model recombination for a three word vocabulary:

[Figure: one lexical tree copy per predecessor word pair (AA, BA, CA, AB, BB, CB, AC, BC, CC); the word ends are recombined by the trigram language model and start new tree copies.]


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
   2.1 Review: Linear Search
   2.2 Tree Search
   2.3 Dynamic Programming
   2.4 Pruning Methods
   2.5 Language Model Look-Ahead
   2.6 Phoneme Look-Ahead
   2.7 Subtree Dominance Recombination (SDR)
   2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Dynamic Programming

In the previous picture every arc has costs that only depend on the arc itself, not on the history of the preceding path. Therefore dynamic programming can be used.

Indices: t time, v predecessor word, w word index (defined only at word end states), s state, S_v final state of word v in the tree.

Auxiliary quantity:

  Q_v(t, s) := probability of the best partial path producing the acoustic vectors x_1, . . . , x_t and ending in state s of the tree lexicon with v as predecessor word.


Dynamic Programming

Within a word (a tree):

  Q_v(t, s) = max_σ { p(x_t, s | σ) · Q_v(t−1, σ) }

Maximize over the best predecessor state σ. At phoneme boundaries the complex transitions have to be considered.

After finishing the computation for the states within a word, the recombination is carried out at the word boundaries:

  Q_v(t, 0) = max_u { p(v|u) · Q_u(t, S_v) },

where s = 0 is a virtual starting state and S_v the final state of word v.


Silence Copies

The silence model requires special handling:

I silence has to be allowed between words

I the word before silence has to be stored

Solution:

I a silence arc is added to every tree

I during language model recombination, only the root of the same tree can be reached from the silence node.

[Figure: bigram tree recombination with one silence arc per tree copy; the silence arc of tree v loops back to the root of tree v.]


The silence handling for a trigram language model:

[Figure: trigram tree recombination with silence copies; each tree copy has its own silence arc so that the predecessor word pair is preserved across silence.]


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
   2.1 Review: Linear Search
   2.2 Tree Search
   2.3 Dynamic Programming
   2.4 Pruning Methods
   2.5 Language Model Look-Ahead
   2.6 Phoneme Look-Ahead
   2.7 Subtree Dominance Recombination (SDR)
   2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Pruning Methods
I Acoustic pruning after acoustic recombination:

1. determine the score of the best hypothesis at time t:

Q_AC(t) := max_(v,s) Q_v(t, s).

2. prune, i.e. eliminate a hypothesis (t, s; v) if

Q_v(t, s) < f_AC · Q_AC(t).

f_AC < 1 is the acoustic pruning threshold.

[Figure: state hypotheses s sorted by score −log Q_v(t, s); the pruning threshold −(log Q_AC(t) + log f_AC) cuts the curve above the minimum s_min; panels a) and b) show a narrow and a broad minimum of the path scores.]

I Effect of acoustic pruning:
a) narrow minimum of the path scores → only a few hypotheses survive the pruning;
b) broad minimum → many hypotheses survive.
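In negative log arithmetic the comparison Q_v(t, s) < f_AC · Q_AC(t) becomes an additive beam around the best score. A minimal sketch (all names are assumptions):

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct StateHyp { int tree, state; double score; };  // score = -log Q_v(t,s)

    // Keep only hypotheses within 'beam' of the best score; beam = -log f_AC > 0.
    void acousticPrune(std::vector<StateHyp>& hyps, double beam) {
      double best = std::numeric_limits<double>::infinity();
      for (const StateHyp& h : hyps) best = std::min(best, h.score);
      hyps.erase(std::remove_if(hyps.begin(), hyps.end(),
                     [&](const StateHyp& h) { return h.score > best + beam; }),
                 hyps.end());
    }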


Pruning Methods

I Language model pruning (tree start pruning): prune at the tree roots directly after language model recombination:

I determine the score of the best starting tree hypothesis at time t:

Q_LM(t) := max_v Q_v(t, s = 0).

I prune a tree hypothesis (t, s = 0; v) if:

Q_v(t, s = 0) < f_LM · Q_LM(t).

f_LM < 1 is the language model pruning threshold.


Pruning Methods
I Histogram pruning

is applied after acoustic and language model pruning, if the number of active hypotheses N(t) at time t is larger than a value MaxHyp:

N(t) > MaxHyp.

Histogram pruning only retains the MaxHyp best hypotheses (using bucket sort). Temporary peaks in the number of hypotheses are eliminated.
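A sketch of the selection step, with std::nth_element standing in for the bucket sort over quantized scores mentioned above (names are assumptions):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct StateHyp { int tree, state; double score; };  // score = -log Q_v(t,s)

    // If more than maxHyp hypotheses are active, keep only the maxHyp best.
    void histogramPrune(std::vector<StateHyp>& hyps, std::size_t maxHyp) {
      if (hyps.size() <= maxHyp) return;
      std::nth_element(hyps.begin(), hyps.begin() + maxHyp, hyps.end(),
                       [](const StateHyp& a, const StateHyp& b) { return a.score < b.score; });
      hyps.resize(maxHyp);  // drop everything beyond the MaxHyp-th best score
    }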

[Figure: number of active hypotheses N_hyp(t) over time t; histogram pruning caps the curve at MaxHyp.]

Experimental result: acoustic pruning is by far the most important method. The others eliminate temporary peaks in the number of hypotheses.


Experimental Results

Experimental results on the WSJ corpus:

I Training corpus: WSJ0,1;

I Test corpus: NAB ’94 H1 development set;

I speaker independent;

I word conditioned tree copies;
I vocabulary:
  I W = 20 000 words;
  I tree: 63 155 arcs (4 688 context dependent phoneme models);
  I 6 states per phoneme

I bigram language model perplexity 215

I MaxHyp: 100 000


Experimental Results: Effect of Acoustic Pruning
The acoustic pruning threshold was varied. The number of active states, arcs, trees, and words was measured along with the word error rate (WER).

  threshold | states | arcs  | trees | word ends | WER [%]
  (average number of active hypotheses after pruning)
  50 k      |    252 |    80 |     4 |         9 | 45.6
  60 k      |    677 |   213 |     6 |        17 | 28.3
  65 k      |   1068 |   332 |     9 |        24 | 24.2
  75 k      |   2396 |   703 |    15 |        49 | 20.6
  100 k     |  12908 |  3554 |    38 |       238 | 18.4
  120 k     |  32538 |  8838 |    59 |       564 | 18.2
  130 k     |  43862 | 11784 |    67 |       745 | 18.2
  170 k     |  79836 | 20649 |    87 |      1363 | 18.1


Experimental Results: Effect of Acoustic Pruning

I The word error rate reaches saturation at about 10 000 active phoneme arcs or 40 000 states.

I The potential search space consists of 20 000 · 63 000 · 6 ≈ 8 · 10^9 states.


Experimental Results: Effect of Language Model

I Corpus: WSJ (Wall Street Journal) Nov ’92, development and eval.

I 5k vocabulary.

I 1.44 h speech data (516984 time frames).

I 18 speakers, 740 sentences, 12137 spoken words.

I 2001 mixtures, 96277 densities.

I Pooled variance vector.

I Feature vector with 33 components.

I LDA (linear discriminant analysis).

I CART (phonetic decision tree).

I Pruning parameter was optimized for the trigram LM and kept constant for the bi-, uni- and zerogram LM.


Experimental Results: Effect of the Language Model

I LM look-ahead:
  I zerogram-LM: no LM look-ahead
  I unigram-LM: unigram look-ahead
  I bigram-LM: bigram look-ahead
  I trigram-LM: bigram look-ahead

I real time factor (RTF) measured on a 500 MHz Alpha-PC with 512 MB memory.


Experimental Results

Effect of language model:

  model    |   PP | states after expansion | after pruning: states / arcs / trees | del / ins / WER [%] | RTF
  zerogram | 4988 | 10702 | 7537 / 1825 /  1 | 14.6 / 0.1 / 40.0 | 13.3
  unigram  |  746 |  6521 | 3880 /  958 /  1 |  4.6 / 1.1 / 22.9 | 11.0
  bigram   |  107 | 33775 | 6472 / 1835 / 36 |  1.4 / 0.5 /  6.9 | 12.2
  trigram  |   56 | 55769 | 8013 / 2333 / 67 |  0.7 / 0.4 /  4.5 | 14.9


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
2.1 Review: Linear Search
2.2 Tree Search
2.3 Dynamic Programming
2.4 Pruning Methods
2.5 Language Model Look-Ahead
2.6 Phoneme Look-Ahead
2.7 Subtree Dominance Recombination (SDR)
2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Language Model Look-Ahead

I Language model look-ahead
Problem: the word identity needed for the language model is in general only known at word ends.
Idea: use the knowledge from the language model as early as possible in the search process. From a certain state (or arc) only a small subset of leaves, i.e. word ends, can be reached.

W(s) := set of word ends that can be reached from state s.

[Figure: lexical tree copy for predecessor word v; from state s only the subset W(s) of word ends is reachable.]


Language Model Look-Ahead
Only a few language model probabilities are possible. Define the best language model probability reachable from state s:

π_v(s) := max_{w ∈ W(s)} p(w | v).

For acoustic pruning these values are combined with the original auxiliary scores Q_v(t, s):

Q̃_v(t, s) := π_v(s) · Q_v(t, s)

and acoustic pruning is applied to Q̃_v(t, s) instead of Q_v(t, s).
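As an illustration, the look-ahead table π_v(s) can be filled with a single backward sweep over the tree; a minimal sketch in C++ (negative log scores, so the max over p(w|v) becomes a min; node layout and names are assumptions):

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct TreeNode { int parent; int word; };  // word >= 0 at word ends, else -1

    // pi[s] = min over word ends w below s of -log p(w|v), assuming children
    // are numbered after their parents (root = node 0).
    std::vector<double> lmLookAhead(const std::vector<TreeNode>& nodes,
                                    const std::vector<double>& lmCost /* -log p(w|v) */) {
      const double inf = std::numeric_limits<double>::infinity();
      std::vector<double> pi(nodes.size(), inf);
      for (int s = (int)nodes.size() - 1; s > 0; --s) {
        if (nodes[s].word >= 0)
          pi[s] = std::min(pi[s], lmCost[nodes[s].word]);            // word end at/below s
        pi[nodes[s].parent] = std::min(pi[nodes[s].parent], pi[s]);  // propagate upwards
      }
      return pi;  // pruning then uses pi[s] + (-log Q_v(t,s))
    }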

Problem:

I Vocabulary W = 20 000.
I 4 688 context dependent phoneme models.
I Lexical prefix tree with 63 000 phoneme arcs.
I 20 000 · 63 000 factorized bigram language model probabilities:
⇒ huge look-up table!


Language Model Look-Ahead

Solution to the huge potential number of LM probabilities needed:

I Path compression:
  I upper limit: W words ⇒ 2 · W tree nodes → compressed LM look-ahead tree
  I optional: only consider the first 2–4 phoneme arc generations of the LM look-ahead tree.

I LM tree factorization on demand.


Language Model Look-Ahead

Examples of a language model look-ahead tree:

[Figure: LM look-ahead tree for predecessor v with words w1, ..., w6 and factorized probabilities at the arcs; left: LM tree before compression, right: LM tree after compression (linear arc chains merged into single arcs).]


Language Model Look-Ahead

To reduce memory and computation, a unigram language model can be used instead of the bigram model to calculate the look-ahead.

π_v(s) := max_{w ∈ W(s)} p(w)

Then only one value has to be stored for every node.

Experimental results: NAB’94 H1 development set

I 20 speakers (10 men and 10 women).

I 310 utterances with 7387 words (199 out of vocabulary words).

I RTF (real time factor) measured on an SGI R5000.


Language Model Look-Ahead

  LM for look-ahead    | look-ahead tree: gen. / arcs | trees | search space: states / arcs / trees | DEL / INS | WER [%] | RTF
  no                   |    – /     – |   – | 65568 / 16932 / 26 | 2.4 / 2.5 | 16.3 | 110.0
  no                   |    – /     – |   – | 50020 / 13034 / 20 | 2.5 / 2.5 | 16.6 |  91.4
  unigram (PP = 972.6) |   17 / 63155 |   1 | 16960 /  4641 / 32 | 2.5 / 2.5 | 16.4 |  68.1
  unigram (PP = 972.6) |   17 / 63155 |   1 |  9443 /  2599 / 22 | 2.6 / 2.5 | 16.8 |  54.4
  bigram (PP = 198.4)  |   17 / 29270 | 300 |  3312 /   935 / 13 | 2.5 / 2.6 | 16.5 |  32.9
  bigram (PP = 198.4)  |    3 / 12002 | 300 |  3277 /   924 / 13 | 2.4 / 2.6 | 16.5 |  31.4
  bigram (PP = 198.4)  |    2 /  4097 | 300 |  3611 /  1012 / 12 | 2.4 / 2.6 | 16.9 |  32.2
  bigram (PP = 198.4)  |    1 /   544 | 300 |  5786 /  1643 / 11 | 2.6 / 2.8 | 17.0 |  36.2


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
2.1 Review: Linear Search
2.2 Tree Search
2.3 Dynamic Programming
2.4 Pruning Methods
2.5 Language Model Look-Ahead
2.6 Phoneme Look-Ahead
2.7 Subtree Dominance Recombination (SDR)
2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Phoneme Look-Ahead
Phoneme look-ahead:

I look ahead into the 'future';

I by one phoneme (CD, CI);

I for the acoustic data.

[Figure: phoneme look-ahead: from the final state of the predecessor arc α̃ at time t, the successor arcs (α, β, γ, δ, ε) are evaluated against the acoustic vectors in the interval (t, t + ∆t].]


Phoneme Look-Ahead

Definitions and notation:

α: a phoneme arc in the lexical prefix tree that shall be started.

α̃: predecessor arc of α. One of its final states has been reached during search.

q(α; t, ∆t): probability that α produces the vectors x_{t+1}, ..., x_{t+∆t}.

The look-ahead time ∆t is constant and approximately equals the average phoneme length:

q(α; t, ∆t) := max_{[s_{t+1}^{t+∆t}]} p(x_{t+1}^{t+∆t}, s_{t+1}^{t+∆t} | α).


Phoneme Look-Ahead

Pruning strategy:

I determine the best state hypothesis Q_0(t):

Q_v(t, α) := q(α; t, ∆t) · Q_v(t, S_α̃)

Q_0(t) := max_{(v,α)} Q_v(t, α)

I prune the arc hypothesis (α, v) at time t, if:

Q_v(t, α) < f_PA · Q_0(t)

I approximate Q_0(t):

Q_0(t) ≈ max_α q(α; t, ∆t) · max_{(v,β)} Q_v(t, S_β)


Phoneme Look-Ahead

Calculating the look-ahead probability q(α; t, ∆t):

I Time alignment for every hypothesized phoneme α:

Φ_α(τ, s; t) := probability that the phoneme α with the states 1, ..., s generates the vectors x_{t+1}, ..., x_{t+τ}.


Phoneme Look-Ahead
I Hidden Markov model with S = 6 states:

[Figure: trellis (phoneme state index over time index): the 6-state HMM of phoneme α is aligned to the frames t + 1, ..., t + ∆t following the final state of the predecessor arc α̃.]

I Phoneme look-ahead probability:

q(α; t, ∆t) := max { max_s Φ_α(∆t, s; t), max_τ Φ_α(τ, S; t)^{∆t/τ} }


Phoneme Look-Ahead

Efficient calculation of the phoneme look-ahead probability:

⇒ approximations:

I CI phoneme models instead of CD models (44 CI vs. 4688 CD models);

I smaller number of densities;

I calculation of the look-ahead every second time frame;

I collapsed one-state HMM (a minimal sketch follows below).
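For the collapsed one-state HMM the time alignment degenerates to a running sum of framewise emission scores. A sketch under the assumption that the loop transition carries no extra cost (emissionCost is an assumed helper, not part of the lecture's code):

    // Returns the value corresponding to -log q(alpha; t, dt) for a one-state
    // HMM: the sum of the framewise emission costs of the next dt vectors.
    double phonemeLookAhead(int phoneme, int t, int dt,
                            double (*emissionCost)(int phoneme, int tau) /* -log p(x_tau|.) */) {
      double q = 0.0;
      for (int tau = t + 1; tau <= t + dt; ++tau)
        q += emissionCost(phoneme, tau);
      return q;
    }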


Phoneme Look-Ahead

[Figure: trellis of the collapsed Hidden Markov Model with one state: the single looping state of phoneme α spans the frames t + 1, ..., t + ∆t after the predecessor arc α̃.]


Phoneme Look-Ahead

Test conditions: NAB'94-H1-Development 20k

I Acoustic search ('detailed search'):
  I 4 688 CD phoneme models,
  I 290 000 Laplacian densities per gender.

I Phoneme look-ahead:
  I 43 CI phoneme models + 1 silence model,
  I ∆t = 6 time frames,
  I 6 state HMM (497 / 1 926 Laplacian mixtures),
  I 1 state HMM (175 / 675 Laplacian mixtures).

I Bigram LM look-ahead.

I RTF (real time factor) measured on an SGI R5000.


Phoneme Look-Ahead

Experimental results:

  look-ahead HMM | # dns | states | arcs | trees | DEL / INS | WER [%] | RTF
  no             |     – |   3312 |  935 |    13 | 2.5 / 2.6 |    16.5 | 32.9
  1-state        |   175 |   1901 |  452 |    10 | 2.5 / 2.7 |    16.6 | 17.9
  1-state        |   175 |   1551 |  359 |     9 | 2.5 / 2.7 |    16.7 | 15.4
  1-state        |   675 |   1808 |  434 |    11 | 2.5 / 2.6 |    16.6 | 16.4
  1-state        |   675 |   1508 |  354 |    10 | 2.5 / 2.6 |    16.8 | 14.2
  6-state        |   497 |   2472 |  605 |    12 | 2.5 / 2.6 |    16.5 | 19.4
  6-state        |   497 |   1671 |  379 |    10 | 2.4 / 2.7 |    16.8 | 16.4
  6-state        |  1926 |   2160 |  516 |    11 | 2.5 / 2.6 |    16.5 | 18.0
  6-state        |  1926 |   1784 |  413 |    11 | 2.5 / 2.7 |    16.6 | 16.6


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
2.1 Review: Linear Search
2.2 Tree Search
2.3 Dynamic Programming
2.4 Pruning Methods
2.5 Language Model Look-Ahead
2.6 Phoneme Look-Ahead
2.7 Subtree Dominance Recombination (SDR)
2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Subtree Dominance Recombination

Consider a hypothesis (s, t; v) (bigram LM):

[Figure: hypothesis (s, t; v): state s is reached at time t in the tree copy of predecessor v as well as in rivaling copies v′; from s, any word end S_w is reached at a later time τ along an identical path.]


Subtree Dominance Recombination
Consider the acoustic score for the best path from a state s at time t to a final state S_w at time τ:

h((t, s) → (τ, S_w)) := max_{s_t^τ} { Pr(x_t^τ, s_t^τ) : s_t = s, s_τ = S_w }

Note: h((t, s) → (τ, S_w)) does not depend on the preceding word v.

Because of the recombination on word level we can eliminate or recombine the hypotheses (s, t; v) if:

∀ w ∈ W(s) := set of leaves, i.e. words, that can be reached from s:

p(w | v) · Q_v(t, s) · h((t, s) → (τ, S_w)) < max_{v′ ≠ v} { p(w | v′) · Q_{v′}(t, s) · h((t, s) → (τ, S_w)) }

⇔

p(w | v) · Q_v(t, s) < max_{v′ ≠ v} { p(w | v′) · Q_{v′}(t, s) }


Subtree Dominance Recombination
The partial tree (s, t; v_0) with

v_0 = arg max_{v′} { p(w | v′) · Q_{v′}(t, s) }

dominates the rivaling hypotheses (s, t; v).

Due to the complex (w, v) dependency this equation is not very useful. Simplify both sides by approximation, using an upper bound for the left side and a lower bound for the right side:

max_{w ∈ W(s)} { p(w | v) } · Q_v(t, s) < max_{v′ ≠ v} { min_{w ∈ W(s)} p(w | v′) · Q_{v′}(t, s) }.

Especially at the beginning of a tree (s = 0):

max_w { p(w | v) } · Q_v(t, s = 0) < max_{v′ ≠ v} { min_w p(w | v′) · Q_{v′}(t, s = 0) }.


Subtree Dominance Recombination

Note:

a) easy implementation: only two arrays with index v are needed:

π_max(v, s) := max_{w ∈ W(s)} p(w | v) (cf. LM look-ahead)

π_min(v, s) := min_{w ∈ W(s)} p(w | v)

b) similarity to LM look-ahead

c) acoustic pruning is sufficient for an efficient focusing search; LM pruning is not needed.
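A sketch of the resulting dominance test at the tree roots (s = 0), using the two arrays from note a) in negative log arithmetic, where π_max turns into the smallest and π_min into the largest LM cost (all names are assumptions):

    #include <algorithm>
    #include <limits>
    #include <vector>

    // Q0[v] = -log Q_v(t, s=0); piMax[v] = min_w -log p(w|v); piMin[v] = max_w -log p(w|v).
    // Tree copy v is dominated if piMax[v] + Q0[v] > min over v' of { piMin[v'] + Q0[v'] }.
    std::vector<bool> dominatedTrees(const std::vector<double>& Q0,
                                     const std::vector<double>& piMax,
                                     const std::vector<double>& piMin) {
      double bound = std::numeric_limits<double>::infinity();
      for (std::size_t v = 0; v < Q0.size(); ++v)
        bound = std::min(bound, piMin[v] + Q0[v]);
      // Taking the minimum over all v (instead of v' != v) is safe here: if the
      // bound stems from v itself, then piMax[v] + Q0[v] <= piMin[v] + Q0[v],
      // so v is never pruned by its own bound.
      std::vector<bool> dominated(Q0.size());
      for (std::size_t v = 0; v < Q0.size(); ++v)
        dominated[v] = (piMax[v] + Q0[v] > bound);
      return dominated;
    }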


Subtree Dominance Recombination

Experimental results:

I NAB’94 H1 development corpus.

I Vocabulary: 20000 words.

I Bigram language model.

I Language model pruning threshold f_LM = ∞.

I RTF (real time factor) measured on an SGI R5000.

  bigram look-ahead | MaxHyp  | states | arcs | trees | DEL / INS | WER [%] | RTF
  w/o SDR           | 150 000 |   3312 |  935 |    13 | 2.5 / 2.6 |    16.5 | 32.9
  with SDR          |       ∞ |   3496 |  887 |    24 | 2.4 / 2.6 |    16.5 | 39.3
  with SDR          | 150 000 |   3093 |  859 |    12 | 2.4 / 2.6 |    16.5 | 29.0
  with SDR          |  10 000 |   2594 |  743 |    12 | 2.4 / 2.6 |    16.5 | 25.9
  with SDR          |   5 000 |   2091 |  606 |    11 | 2.4 / 2.6 |    16.6 | 25.6


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
2.1 Review: Linear Search
2.2 Tree Search
2.3 Dynamic Programming
2.4 Pruning Methods
2.5 Language Model Look-Ahead
2.6 Phoneme Look-Ahead
2.7 Subtree Dominance Recombination (SDR)
2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Tree Search Implementation

Implementation in RWTH ASR teaching patch:

I Bigram LM.

I Bookkeeping: at the beginning of each tree.

I Dynamic generation of tree copies.

I "Direct memory access" for trees and phoneme arcs (compare to corresponding methods in linear search).

I Language model look-ahead.

I NO phoneme look-ahead.

I NO subtree dominance recombination.

Schluter/Ney: Advanced ASR 211 August 5, 2010

Page 212: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

RWTH ASR System: Tree Search Teaching Patch
Classes and dependencies similar to linear search:

RWTH ASR System: acoustic model, language model, corpus handling, pronunciation lexicon handling, general search environment.

RWTH ASR Tree Search Teaching Patch: interface to the open source RWTH ASR System; implementation of tree search, including bookkeeping and traceback.

[Diagram: classes and dependencies: Search::SearchAlgorithm, SearchInterface, WordConditionedTreeSearch, WordConditionedTreeSearch::SearchSpace, WordConditionedTreeSearch::ActiveTrees, Lexicon, Bookkeeping.]


Interface to RWTH ASR System
General Interface to Teaching Patch
Class SearchInterface provides the connection to the general search environment, as in the linear search implementation (cf. Slides 130ff). It includes handling of configuration and resources (corpus and models), as well as an overall wrapper for the specific search implementation.

As for linear search, the main functions to be implemented are:

I initialize: Search initialization

I processFrame: expansion of hypotheses to next time frame

I getResult: traceback of best recognized word sequence

Implementation:

I show SearchInterface.hh

I show SearchInterface.cc

I show WordConditionedTreeSearch.hh
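Independently of those files, a minimal sketch of such an interface in C++ may clarify the division of labor (the three member functions follow the list above; the signatures and types are assumptions, not the actual RWTH ASR API):

    #include <string>
    #include <vector>

    class SearchInterface {
    public:
      virtual ~SearchInterface() {}
      // allocate the search space, reset scores and bookkeeping
      virtual void initialize() = 0;
      // expand all active hypotheses by one acoustic feature vector
      virtual void processFrame(const std::vector<float>& feature) = 0;
      // traceback of the best recognized word sequence after the last frame
      virtual std::string getResult() = 0;
    };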


Example: Implementation of Dynamic Programming Beam Search for Bigram LM

Bigram language model recombination for a three word vocabulary (the variant including silence follows below):

[Figure: bigram language model recombination for the words A, B, C: the word ends of each tree copy (acoustic model) are recombined by the language model and restart the three tree copies (acoustic model).]


Example: Implementation of Dynamic Programming Beam Search for Bigram LM

Bigram language model recombination for a three word vocabulary including silence:

[Figure: the same recombination with an additional silence arc (Sil) per tree copy; after silence, only the root of the same tree copy is re-entered.]


Bigram LM and Word Conditioned Tree Copies

proceed over time t from left to right

ACOUSTIC LEVEL: process state hypotheses
  - initialization: Q_v(t − 1, 0) = H(v; t − 1)
  - time alignment: Q_v(t, s) using DP
  - propagate back pointers in time alignment
  - prune unlikely hypotheses
  - purge bookkeeping lists

WORD PAIR LEVEL: process word end hypotheses
  single best: for each pair (w; t) do
    H(w; t) = max_v [ p(w | v) Q_v(t, S(w)) ]
    v_0(w; t) = arg max_v [ p(w | v) Q_v(t, S(w)) ]
    - store best predecessor v_0 := v_0(w; t)
    - store best boundary τ_0 := τ(t; v_0, w)

traceback
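A compact sketch of the word pair level in C++, in negative log arithmetic so the max becomes a min (all names are hypothetical):

    #include <limits>
    #include <vector>

    struct WordEnd { int v, w; double score; };  // score = -log Q_v(t, S(w))

    // H(w;t) = min_v { -log p(w|v) + Q_v(t,S(w)) }, with the arg min kept for traceback.
    void recombineWordEnds(const std::vector<WordEnd>& ends, int nWords,
                           double (*lmCost)(int v, int w) /* -log p(w|v) */,
                           std::vector<double>& H, std::vector<int>& bestPred) {
      H.assign(nWords, std::numeric_limits<double>::infinity());
      bestPred.assign(nWords, -1);
      for (const WordEnd& e : ends) {
        double s = lmCost(e.v, e.w) + e.score;
        if (s < H[e.w]) { H[e.w] = s; bestPred[e.w] = e.v; }
      }
      // H then restarts the tree copy conditioned on w; bestPred and the stored
      // word boundary feed the bookkeeping for the final traceback.
    }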


Word Conditioned Tree Search Implementation
Search Space Representation
Pruning necessitates dynamic handling of word and state hypotheses:

I List of active trees: predecessorWord, arcHypBegin, arcHypEnd

I List of active (phoneme/triphone) arcs for every tree: arc, stateHypBegin, stateHypEnd

I List of active states for every (phoneme/triphone) arc: state, score, backpointer

I To address active trees, a list over all trees pointing into the list of active trees is used, which also holds potential word end hypotheses (re)starting a tree: isActive, *wordHyp

I A list of all arcs of a tree is used to handle active successor arcs during the expansion of the states within a tree.

Implementation: show WordConditionedTreeSearch.cc
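As plain structs, these three lists might look as follows (the field names follow the slides; everything else is an assumption):

    // One active state of a phoneme arc.
    struct StateHypothesis { int state; double score; int backpointer; };

    // One active phoneme/triphone arc: a range into the state hypothesis list.
    struct ArcHypothesis  { int arc; int stateHypBegin, stateHypEnd; };

    // One active tree copy: a range into the arc hypothesis list.
    struct TreeHypothesis { int predecessorWord; int arcHypBegin, arcHypEnd; };

    // All three lists are stored contiguously and rebuilt every time frame,
    // so pruning reduces to copying the surviving entries into new lists.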


Tree Search Implementation
Search Space Representation: Tree Hypotheses

[Figure: treeMap_ (one entry per word: isActive, *wordHyp) points into treeHypotheses_ (predecessorWord, arcHypBegin, arcHypEnd); each tree hypothesis addresses a range of arcHypotheses_ (arc, stateHypBegin, stateHypEnd), which in turn address ranges of stateHypotheses_ (state, score, backpointer).]


Tree Search Implementation
Search Space Representation: Within-Tree Arc Hypotheses

[Figure: for one active tree copy (predecessorWord = 12), the helper SearchSpace::ActiveArcs with activeArcs_ and predecessorMap_ (arcHypIndex, predecessor, prePredecessor, predecessorScore, prePredecessorScore, predecessorBackpointer, prePredecessorBackpointer, isStarted) links the arcs of treeLexicon_ to the entries of arcHypotheses_, stateHypotheses_ and wordHypotheses_ (word, score, backpointer).]


Tree Search Implementation
Search Space Representation: Initial State Hypotheses per Tree Arc

[Figure: SearchSpace::initTimeAlignment seeds the HMM score arrays of arcs to be (re)started (e.g. arcs d, h, b) with the predecessor scores and backpointers stored in predecessorMap_; all other states are initialized with score ∞.]


Tree Search Implementation
Search Space Representation: State Expansion per Tree Arc

[Figure: SearchSpace::computeTimeAlignment expands the state hypotheses of every active arc into newStateHypotheses_, newArcHypotheses_ and newTreeHypotheses_; afterwards SearchSpace::pruneStatesAndFindWordEnds copies the surviving entries back into stateHypotheses_, arcHypotheses_ and treeHypotheses_.]


Tree Search Implementation
Search Space Representation: Bookkeeping (as in Linear Search)

Implementation: show BookKeeping.hh, show BookKeeping.cc

[Figure: bookKeeping_ table with one entry per word hypothesis (word, score, time, timestamp, backpointer), starting with a sentinel entry (silence_, backpointer 0); the backpointers of the entries in stateHypotheses_ point into this table, and entries with an outdated timestamp (lastTimestamp_ = 100) can be purged.]


Trigram LM and Word Conditioned Tree Copies
Trigram LM:

proceed over time t from left to right

ACOUSTIC LEVEL: process states of lexical trees
  – initialization: Q_uv(t − 1, 0) = H(u, v; t − 1)
                    B_uv(t − 1, 0) = t − 1
  – time alignment: Q_uv(t, s) using DP
  – propagate back pointers B_uv(t, s)
  – prune unlikely hypotheses
  – purge bookkeeping lists

WORD PAIR LEVEL: process word ends
  for each triple (v, w; t) do
    H(v, w; t) = max_u p(w | u, v) Q_uv(t, S_w)
    u_0(v, w; t) = arg max_u p(w | u, v) Q_uv(t, S_w)
    – store best predecessor u_0 := u_0(v, w; t)
    – store best boundary τ_0 := B_{u_0 v}(t, S_w)


Generalization to Trigram and n-gram

Implementation of tree search for trigram or higher order language models:

I General structure of search remains the same – see the implementation for bigram (especially within each tree).
I Most important difference: hashing instead of full maps for context handling:

  LM      | bigram  | trigram                | n-gram
  context | word    | word pair              | n − 1 words
  maps    | by word | hash index for context | hash index for context

I Generalization to n-gram is straightforward (see the sketch below):
  I Hashing can be applied to any LM context length, therefore:
  I Same implementation for arbitrary LM context length.
  I If larger context helps (lower perplexity), the search space will benefit.
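A sketch of this context hashing (the map type and the simple polynomial hash are assumptions; any reasonable hash over the word indices would do):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct ContextHash {
      std::size_t operator()(const std::vector<int>& context) const {
        std::size_t h = 0;
        for (int w : context) h = h * 1000003u + (std::uint32_t)w;  // polynomial hash
        return h;
      }
    };

    // (n-1 predecessor words) -> index of the corresponding active tree copy.
    // For a bigram the key has length 1, for a trigram length 2, and so on;
    // the search code itself is independent of the context length.
    using TreeMap = std::unordered_map<std::vector<int>, int, ContextHash>;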


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Motivation for Across-Word Models

I The words of the vocabulary are modeled with phonemes (example: Airline – ae r l ah ih n).

I The acoustic realization depends on the context of a phoneme (coarticulation).

I This variability is modeled with triphones, i.e. phonemes in a triphone context.

I A model for phoneme a with predecessor x and successor y (x a y) is different from the model of a with predecessor w and successor z (w a z).


Motivation for Across-Word Models

I The models for words consist of concatenated phoneme models.

I When using triphone context, in principle two ways of modelling are possible:

  I Triphone context is only considered within words. At word boundaries the so called "word boundary context" is used (conventional within word modelling).
    + Advantage: easy implementation of the modelling; word models can be taken directly from the lexicon.
    – Disadvantage: coarticulation at word boundaries is not modelled well.

  I Triphone context is also considered across word boundaries (across-word modelling).
    + Advantage: coarticulation at word boundaries is taken into consideration.
    – Disadvantage: higher search effort is required. The word models depend on the predecessor and successor words.


Motivation for Across-Word Models
I Example for the word: Airline – ae r l ah ih n

– Transcription with context dependencies only within the word:

  (# ae r) (ae r l) (r l ah) (l ah ih) (ah ih n) (ih n #)

– Transcription with context dependent models both within the word and at the word boundaries: the word-internal triphones remain, while the boundary phonemes are expanded over all possible contexts, e.g.

  (a ae r), (b ae r), ($ ae r), ... (ae r l) (r l ah) (l ah ih) (ah ih n) ... (ih n a), (ih n b), (ih n $)

I Note: across-word modeling includes the potential case $ for no coarticulation at word boundaries (e.g. due to pause).


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Search Implementation with Across-Word Models

I Starting point: word conditioned tree search, context dependent models only within words.

I In principle, there are two possibilities for using across-word models:

  1. Multi pass search:
    I generate a word graph from an initial search without across-word models;
    I extract the N best word sequences from the graph;
    I rescore the N best hypothesized sentences using across-word models.

  2. Single pass search:
    I extend word conditioned tree search to allow processing of context dependencies beyond word boundaries
      – extend the tree lexicon;
      – extend the language model recombination.

I In the following the single pass search strategy will be described.


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Extension of the Tree Lexicon
I The conventional tree lexicon only uses within-word models.
I At word boundaries the "word boundary context" is used:

[Figure: lexical prefix tree with within-word triphones for AIRLIFT, AIRLINE and ABOUT; word-initial and word-final triphones use the word boundary context #, e.g. AIRLINE: (# ae r), (ae r l), (r l ah), (l ah ih), (ah ih n), (ih n #).]

I Note: the consistent word boundary context # enables the organization of the pronunciations in a lexical prefix tree, with a single path through the tree for each pronunciation.


Extension of the Tree Lexicon

I To model coarticulation across word boundaries, the word beginnings and ends have to be expanded (fan-in and fan-out):

[Figure: the same tree with fan-in and fan-out: word-initial triphones appear once per left context, e.g. (a ae r), ..., (z ae r), and word-final triphones once per right context, e.g. (ih n a), ..., (ih n z) for AIRLINE.]


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Extension of the Language Model Recombination

Review: language model recombination for within-word modelling with a bigram language model.

Notation:
A, E, I: ending/beginning words
y: last generation phoneme
x: phoneme of second to last generation
a, b, c: first generation phonemes
e, f, g: second generation phonemes
#: general word boundary context
$: silence context, for word transitions with silence between the words

[Figure: within-word recombination with a bigram LM: the ending words A, E, I (last triphone (x y #)) are recombined by the language model, and every successor tree starts with the same first generation arcs (# a e), (# a f), (# b e), (# b f), (# b g), (# c e), (# c f).]


Extension of the Language Model Recombination
When using across-word models:

I Word ends generate more than one arc in the lexical prefix tree.

I For every possible phonetic right across-word context there is an individual arc.

I The hypothesis of a word end is combined with the hypothesis of a successor word.

I The first phoneme of the successor is determined at the ending of a word.

I The hypothesized context influences the future search space.

I Language model recombination can only combine word end hypotheses with the same right across-word context.

The following picture only shows the ending and beginning of word E with the last phoneme y.


[Figure: word E ends in the fan-out arcs (x y a), (x y b), (x y c), one per right context; language model recombination is carried out separately per context and activates the matching first generation arcs (y a e), (y a f), (y b e), (y b f), (y b g), (y c e), (y c f) of the successor trees.]


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Activating the First Phoneme Generation

At the root of a "tree" the set of arcs that will be activated depends on:

I The last phoneme of the predecessor word (left context of the new arc).

I The right context with which the predecessor word has ended (central phoneme of the new arc).

I Special case silence ($): in case of long pauses between the ending and the starting word it is assumed that no coarticulation occurs. Activate all arcs of the first phoneme generation with silence as left context.


[Figure: activating the first phoneme generation after word E: the fan-out arcs (x y a), (x y b), (x y c) activate the matching coarticulated arcs (y a e), (y a f), (y b e), (y b f), (y b g), (y c e), (y c f); the silence fan-out (x y $) leads through Sil and activates all arcs ($ a e), ($ a f), ($ b e), ($ b f), ($ b g), ($ c e), ($ c f) with silence as left context.]


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Recombination after the First Phoneme Generation

I The across-word context of the ending word only influences the first phoneme of the successor.

I It is possible to recombine the arcs of the non-coarticulation transition with those of the coarticulation transition.

[Figure: recombination after the first phoneme generation: the arcs started from the silence transition, e.g. ($ a e), ($ c f), merge with the corresponding coarticulated arcs (y a e), (y c f) after the first generation.]


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Dynamic Programming Recursion

I Within-word modelling:

1. Within words:

Q_v(t, s) = max_σ { Q_v(t − 1, σ) · p(x_t, s | σ) }

2. At word boundaries:

Q_v(t, s = 0) = max_u { Q_u(t, S_v) · p(v | u) }


Dynamic Programming Recursion
I Across-word modelling:

1. Within words (as before):

Q_v(t, s) = max_σ { p(x_t, s | σ) · Q_v(t − 1, σ) }

2. At word boundaries:

The right across-word context χ of the ending word v has to be considered during language model recombination for this word:

Q_v(t, s_0(χ)) = max_u { p(v | u) · Q_u(t, S_(v,χ)) }

with
χ: right context of the predecessor word v
S_(v,χ): last state in the fan-out arc of word v with right context χ
s_0(χ): virtual start state of the subtree following word end hypothesis v with context χ. There is only one virtual start state for within-word models; for across-word modelling a different virtual start state s_0(χ) for every right context χ of the predecessor word v has to be considered.
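A sketch of this boundary recombination in negative log arithmetic (the accessors fanOutCost and lmCost are assumptions):

    #include <algorithm>
    #include <limits>

    // Q_v(t, s0(chi)) = min_u { -log p(v|u) + Q_u(t, S_(v,chi)) }, evaluated once
    // per right context chi of the ending word v.
    double restartScore(int v, int chi, int nPredecessors,
                        double (*fanOutCost)(int u, int v, int chi) /* -log Q_u(t,S_(v,chi)) */,
                        double (*lmCost)(int u, int v) /* -log p(v|u) */) {
      double best = std::numeric_limits<double>::infinity();
      for (int u = 0; u < nPredecessors; ++u)
        best = std::min(best, lmCost(u, v) + fanOutCost(u, v, chi));
      return best;  // seeds the virtual start state s0(chi) of the tree copy of v
    }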


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Extended Language Model Look-Ahead
I The hypothesis of a word end with a certain right across-word context restricts the set of possible successor words to the set of words which begin with the phoneme that was hypothesized as right context.

This restricts the set of possible language model probabilities and allows additional pruning.

Define:

W(χ) := set of words with χ as first phoneme.

For every ending word v and every possible right context χ the maximal language model probability at the end of a (yet unknown) word w following v can be calculated:

π(v, χ) := max_{w ∈ W(χ)} p(w | v)


Extended Language Model Look-Ahead

π(v, χ) := max_{w ∈ W(χ)} p(w | v)

I This maximal probability can be considered as soon as the right context χ at the end of the predecessor word v is hypothesized.

I Let Σ(v, χ) be the set of states in a fan-out arc of an ending word hypothesis v with right context χ.

I The score Q^FO_u(t, s) for a state s ∈ Σ(v, χ) within a fan-out arc of word v, with right context χ, in the tree of the predecessor word u then is:

Q^FO_u(t, s) := π(v, χ) · Q_u(t, s)


Extended Language Model Look-Ahead

Compute the overall best fan-out score:

Q^FO(t) := max_{u,v,χ} { Q^FO_u(t, s) : s ∈ Σ(v, χ) }

A state in the arc of an ending word is pruned if:

Q^FO_u(t, s) < f_AC,FO · Q^FO(t)

with the fan-out pruning threshold 0 < f_AC,FO < 1.


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Experimental Results

I Verbmobil eval’96, 5k vocabulary

I 35 speakers, 305 utterances, 5421 words

I within-word models (gender independent): 2001 mixtures, 195451 densities

I across-word models (gender independent): 3001 mixtures, 209484 densities

I trigram perplexity 36.4

I Without extended language model look-ahead

I RTF (real time factor) on a 500 MHz AlphaPC

  models      | states | arcs  | trees | del. | ins. | WER [%] | RTF
  within-word |  31040 |  8677 |   135 |  3.3 |  3.2 |    17.4 | 46.48
  across-word |  85729 | 21932 |    49 |  3.7 |  2.3 |    15.8 | 64.04


Experimental Results
I NAB'94 H1 development set, 20k vocabulary
I 20 speakers (10 men and 10 women), 310 utterances with 7387 words (199 out of vocabulary words)
I within-word models:
  I female: 2001 mixtures, 279079 densities
  I male: 2001 mixtures, 269141 densities
I across-word models:
  I female: 4001 mixtures, 281022 densities
  I male: 4001 mixtures, 354607 densities
I trigram perplexity 121.8
I With extended language model look-ahead
I RTF (real time factor) on a 533 MHz AlphaPC

  models      | states | arcs | trees | del. | ins. | WER [%] | RTF
  within-word |  19599 | 5821 |   136 |  1.5 |  2.6 |    12.9 | 37.9
  across-word |  17811 | 2822 |    37 |  1.3 |  2.4 |    11.7 | 36.8


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications
4.1 Why Word Graphs?
4.2 Word Graph Generation using Word Pair Approximation
4.3 Word Graph Pruning
4.4 Multi-Pass Search
4.5 Language Model Rescoring
4.6 Path probabilities over word graphs
4.7 Confidence Measures
4.8 Bayes Risk Decoding
4.9 System Combination
4.10 Experiments
4.11 References

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Why Word Graphs?
Goal: generate a search space representation covering all hypotheses from beam search which were not pruned.

I M-best list of sentences: extract a list of the M recognized sentences with the best recognition score/probability.
Problem: inefficient due to redundancy and combinatorial complexity.

I Word graph: compact search space representation, containing those word hypotheses during beam search which were not pruned.
The edges of word graphs contain varying degrees of information about the search space:
  I word identity
  I acoustic score
  I word start and end times
  I context
  I language model score
  I etc., e.g. derived information like confidence scores.


Why Word Graphs?

Word graphs are useful for a number of methods:

I multi-pass search: restrict the search space for
  I language model rescoring,
  I acoustic model rescoring,
  I unsupervised adaptation (see Chapter 7),
  I confidence scoring;

I system combination;

I word error minimization;

I discriminative training (see Chapter 8);

I interface to subsequent applications like machine translation or natural language understanding.


Bayes' Decision Rule

[Figure: recognizer architecture: speech input → acoustic analysis → acoustic vectors x_1 ... x_T → global search, which maximizes Pr(w_1 ... w_N) · Pr(x_1 ... x_T | w_1 ... w_N) over the word sequences w_1 ... w_N, drawing on the phoneme inventory and pronunciation lexicon (acoustic model Pr(x_1 ... x_T | w_1 ... w_N)) and the language model Pr(w_1 ... w_N) → recognized word sequence.]


Search: Word Graph vs. Single Best Sentence

Original interest in word graphs: language model rescoring in a two-pass approach.

How to apply a refined language model?

- One-pass approach: find the single best sentence by performing the full recognition process directly with the refined language model.
- Two-pass approach:
  - first pass: use a baseline language model (e.g. trigram language model); output (interface): word graph or M-best list.
  - second pass: rescore the output with the refined/final language model, e.g. a context-free grammar.
- Extension: multi-pass approach.


Search: Two-Pass Approach

[Figure: two-pass search. The acoustic vectors $x_1^T$ enter the first pass (baseline recognition with the baseline language model $p(w_1^N)$), which outputs an M-best list of sentence hypotheses $w_1^N$ with acoustic scores $p(x_1^T \mid w_1^N)$. The second pass rescores these hypotheses with a refined language model $p(w_1^N)$, yielding the single best sentence. Knowledge sources: phoneme inventory and pronunciation lexicon.]


Search: M-best List of Sentences vs. Word Graph

M-best list of sentences:

- list $L(x_1^T)$ of the $M$ best sentences ($M = 100, \dots, 1000$),
- for each sentence $w_1^N$, report the acoustic probability $\Pr(x_1^T \mid w_1^N)$.

The M-best list $L(x_1^T)$ can be considered as a filter:

- first pass: generate the list $L(x_1^T) \subset \{w_1^N\}$,
- second pass: find the maximum in this subset of word sequences with refined models,

$$\hat{w}_1^N = \operatorname*{arg\,max}_{w_1^N \in L(x_1^T)} \Pr(w_1^N) \, \Pr(x_1^T \mid w_1^N),$$

with more complex language models (language model rescoring) or refined acoustic models like across-word models (acoustic model rescoring).


Search: M-best List of Sentences vs. Word Graph

More compact alternative to the M-best list: the word graph. Attach to each word graph arc:

- word identity $w$,
- beginning time $t_b$ and end time $t_e$,
- acoustic probability $p(x_{t_b}^{t_e} \mid w)$ of the acoustic vectors $x_{t_b}, \dots, x_{t_e}$.

[Figure: a simple word graph.]


Why Word Graphs?

Summary of language model rescoring:

- Separation of two levels:
  - acoustic recognition,
  - postprocessing: refined language model.
- Restricted search space representations:
  - list of the M best sentences,
  - word graph: network.
- Example: 2 word hypotheses per position in a 10-word sentence:
  - M-best list: 2^10 = 1024 sentence hypotheses,
  - word graph: 2 * 10 = 20 arcs.
- Problem for word graphs: word boundaries may vary.

[Figure: the boundary of a word W shifts in time depending on whether its predecessor is U or V.]


Examples of Word Graphs

[Figure: word graph over time (frames of 10 ms) for the spoken sentence "What is wrong with this picture?", with competing word hypotheses such as WHEN/WHAT, A/THE/IT, IS, WRONG, WITH/AS/HIS, A/THIS, SPLIT/PICTURE/SHARE, framed by silence.]

[Figure: word graph over time for the spoken sentence "Can this network be saved?", with competing word hypotheses such as IN/CAN/KEN/AND, TENNIS/KANSAS/CHINA'S/CANADA'S/CANVAS/JANICE, THIS/ITS/AS/HIS/MISS, CHEMIST/NETWORK, BE/B., SAVE/SAVED, TO/TWO, framed by silence.]



Word Boundary Function

Definitions:

$$h(w; \tau, t) := \Pr(x_{\tau+1}^{t} \mid w)$$

= probability that word $w$ produces the acoustic vectors $x_{\tau+1} \dots x_t$.

$$G(w_1^n; t) := \Pr(w_1^n) \cdot \Pr(x_1^t \mid w_1^n)$$

= (joint) probability of generating the acoustic vectors $x_1 \dots x_t$ and a word sequence $w_1^n$ with ending time $t$.

Decomposition:

$$\underbrace{x_1, \cdots, x_\tau}_{G(w_1^{n-1};\, \tau)} \quad \underbrace{x_{\tau+1}, \cdots, x_t}_{h(w_n;\, \tau, t)} \quad \underbrace{x_{t+1}, \cdots, x_T}_{\dots}$$

[Figure: time axis with the boundary τ separating the predecessor sequence $w_1 \dots w_{n-1}$ from the word $w_n$, which ends at time $t$; the remaining words follow until time $T$.]


Word Boundary Function

- Optimization over the word boundary $\tau$:

$$G(w_1^n; t) = \max_\tau \Pr(w_n \mid w_1^{n-1}) \cdot G(w_1^{n-1}; \tau) \cdot h(w_n; \tau, t) = \Pr(w_n \mid w_1^{n-1}) \cdot \max_\tau G(w_1^{n-1}; \tau) \cdot h(w_n; \tau, t)$$

- Define the word boundary function:

$$\tau(t; w_1^n) := \operatorname*{arg\,max}_\tau G(w_1^{n-1}; \tau) \cdot h(w_n; \tau, t)$$


Recombination using the Word Boundary Function

Consider an m-gram language model:

$$p(w_n \mid w_{n-m+1}^{n-1}) \quad \text{or} \quad p(u_m \mid u_1^{m-1}).$$

To apply dynamic programming it is sufficient to consider only the last $m-1$ words $u_2, \dots, u_m$.

Define a corresponding auxiliary quantity:

$$H(u_2^m; t) := \max_{w_1^{n-m+1}} \left[ \Pr(w_1^n) \cdot \Pr(x_1^t \mid w_1^n) : w_{n-m+2}^n = u_2^m \right]$$

= maximum probability for the generation of the acoustic vectors $x_1 \dots x_t$ by a word sequence that ends with the words $u_2^m$ at time $t$.


Recombination using the Word Boundary Function

The dynamic programming recursion to calculate the auxiliary quantity $H(u_2^m; t)$ efficiently optimizes over the word $u_1$ and the word boundary $\tau$:

$$H(u_2^m; t) = \max_{u_1} \left[ p(u_m \mid u_1^{m-1}) \cdot \max_\tau H(u_1^{m-1}; \tau) \cdot h(u_m; \tau, t) \right].$$

The maximum approximation is used here to avoid the sum over all word boundaries $\tau$.

The definition of the corresponding word boundary function becomes:

$$\tau(t; u_1^m) := \operatorname*{arg\,max}_\tau H(u_1^{m-1}; \tau) \cdot h(u_m; \tau, t)$$

leading to:

$$H(u_2^m; t) = \max_{u_1} \left[ p(u_m \mid u_1^{m-1}) \cdot H(u_1^{m-1}; \tau(t; u_1^m)) \cdot h(u_m; \tau(t; u_1^m), t) \right].$$


Recombination using the Word Boundary Function

Example: word boundary function with a bigram language model:

$$\tau(t; u_1, u_2) := \operatorname*{arg\,max}_\tau H(u_1; \tau) \cdot h(u_2; \tau, t)$$

[Figure: the boundary of a word W shifts in time depending on whether its predecessor is U or V.]

The word boundary depends on the identity of the surrounding words.


Word Pair Approximation

Remember: $\tau$ = start time of hypothesis $(t, w)$.

Assumption: $\tau$ depends only on the predecessor word $v$ (in addition to $(t, w)$):

$$\tau = \tau(t; v, w)$$

[Figure: time axis with the word sequence $\dots, u_{m-2}, u_{m-1} = v, u_m = w$ ending at time $t$; the boundary of interest lies between $v$ and $w$.]

Experiments show: the approximation is correct if the predecessor word $v$ is sufficiently long.


Word Pair Approximation: Illustration

Two examples for the word pair approximation:

- Good example: the predecessor word $u_{m-1} := v$ of word $u_m := w$ is sufficiently long; the boundary between $v$ and $w$ stays the same no matter whether $u_{m-2} = a$ or $u_{m-2} = b$ precedes.
- Bad example: the predecessor word $u_{m-1} := v'$ of word $u_m := w$ is too short; the boundary between $v'$ and $w$ shifts with the choice of $u_{m-2}$.

[Figure: time-alignment paths over the time axis illustrating both cases.]


Algorithm for Word Graph Construction

Algorithm in words:

- For each time frame t, consider all word pairs (v, w) using beam search.
- For each triple (t; v, w), keep track of:
  - the word boundary τ(t; v, w),
  - the word score h(w; τ(t; v, w), t).
- End of sentence: traceback.


Algorithm for Word Graph Construction

proceed over time t from left to right

  ACOUSTIC LEVEL: process states of the lexical trees
    - initialization: Q_v(t-1, 0) = H(v; t-1)
                      B_v(t-1, 0) = t-1
    - time alignment: Q_v(t, s) using DP
    - propagate back pointers B_v(t, s)
    - prune unlikely hypotheses
    - purge bookkeeping lists

  WORD PAIR LEVEL: process word ends
    single best: for each pair (w; t) do
      H(w; t)   = max_v p(w|v) Q_v(t, S_w)
      v_0(w; t) = arg max_v p(w|v) Q_v(t, S_w)
      - store best predecessor v_0 := v_0(w; t)
      - store best boundary τ_0 := B_{v_0}(t, S_w)
    word graph: for each triple (t; v, w) store
      - word boundary τ(t; v, w) := B_v(t, S_w)
      - word score h(w; τ, t) := Q_v(t, S_w) / H(v; τ)

traceback after the last time frame T is processed:
  - single best: start at the best sentence end w = arg max_v p($|v) H(v, T)
  - word graph: start at all possible sentence ends.


Algorithm for Word Graph Construction

Depending on how they are constructed, different types of word graphs can be defined:

- Word pair approximation ⇒ word/predecessor conditioned word graphs:
  - A word sequence implies a unique sequence of word boundaries.
  - Each word sequence occurs only once.
  - Generalization: word triple, quadruple, etc. approximation (cf. language model context).
- Other methods ⇒ time conditioned word graphs:
  - A word sequence might occur more than once, with different word boundary times.

Depending on what is to be done with the word graph, one or the other method might be more suitable.

Is it possible to guarantee that the M best sentences are contained in the word graph?

- It is highly probable, but
- there is no formal proof.


Word Graph Pruning

Pruning methods can also be applied to word graphs.

- Forward pruning:
  - Word end pruning retains word hypotheses (v, w; t) whose scores are relatively close to the score of the best word hypothesis at time t:

    $$H_{\max}(t) = \max_w H(w; t)$$

    with

    $$H(w; t) = \max_v p(w \mid v) \cdot H(v; \tau(t; v, w)) \cdot h(w; \tau(t; v, w), t).$$

  - Prune hypothesis (v, w; t) iff:

    $$p(w \mid v) \cdot H(v; \tau(t; v, w)) \cdot h(w; \tau(t; v, w), t) < f_{LAT} \cdot H_{\max}(t),$$

    with the word graph pruning threshold $f_{LAT} < 1$.
  - In addition, histogram pruning limits the number of surviving word hypotheses to a maximum number (typically 1000 per time frame).


Word Graph Pruning

- During acoustic search, forward pruning is the standard approach.
- Like acoustic and language model pruning, forward pruning does not guarantee to find the best scoring word sequence.
- Often the need arises to prune word graphs further after generating initial, usually larger word graphs.
- Observation: when further pruning word graphs after their generation, forward pruning becomes less effective, since in contrast to the initial generation of the word graph, the score of the overall best word sequence is already known.
- Optimal approach: prune the word graph relative to the best recognized/best scoring word sequence.
- Advantage: the best scoring word sequence will not be pruned itself, which is not guaranteed in forward-only pruning.


Word Graph Pruning

Forward-backward pruning:

- Idea: for every edge (w; τ, t) in the word graph, the score Q(w; τ, t) of the best path through the word graph including this arc is calculated.
- An arc (w; τ, t) is pruned iff:

  $$Q(w; \tau, t) > F_{LAT,FB} + \min_{(w'; \tau', t')} Q(w'; \tau', t'),$$

  where $F_{LAT,FB} > 0$ is the forward-backward pruning threshold in the log domain. (Scores here are negative log probabilities, so the overall best path has the minimum score.)
- The term "forward-backward" refers to the corresponding twofold dynamic programming algorithm that computes this pruning step efficiently.


Word Graph Pruning

- To calculate Q(w; τ, t) efficiently, it is decomposed into the sum of a forward and a backward score:

  $$Q(w; \tau, t) = Q_f(w; \tau, t) + Q_b(w; \tau, t),$$

  with

  $Q_f(w; \tau, t) := -\log p(w; \tau, t \mid x_1^t)$ = negative log probability to end in the edge (w; τ, t), given $x_1^t$,
  $Q_b(w; \tau, t) := -\log p(w; \tau, t \mid x_{t+1}^T)$ = negative log probability to start with the edge (w; τ, t), followed by $x_{t+1}^T$.

- The forward score $Q_f(w; \tau, t)$ and the backward score $Q_b(w; \tau, t)$ can then be computed efficiently by one dynamic programming pass each over the word graph:

  1. forward pass from t = 1 to t = T
  2. backward pass from t = T to t = 1
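To make the two passes concrete, here is a minimal sketch of forward-backward pruning in the negative-log (cost) domain. The graph representation (states numbered topologically from 0 to n_states-1, edges as (from, to, cost) tuples) and the function name fb_prune are illustrative assumptions, not the lecture's reference implementation.

```python
import math

def fb_prune(edges, n_states, f_lat_fb):
    """Keep an edge iff the best (minimum-cost) path through it lies within
    f_lat_fb of the globally best path; costs are negative log probabilities."""
    q_f = [math.inf] * n_states       # Q_f: best cost from the initial state
    q_b = [math.inf] * n_states       # Q_b: best cost to the final state
    q_f[0] = 0.0
    q_b[n_states - 1] = 0.0
    for u, v, c in sorted(edges, key=lambda e: e[1]):                # forward pass
        q_f[v] = min(q_f[v], q_f[u] + c)
    for u, v, c in sorted(edges, key=lambda e: e[0], reverse=True):  # backward pass
        q_b[u] = min(q_b[u], c + q_b[v])
    best = q_f[n_states - 1]          # cost of the overall best path
    return [(u, v, c) for u, v, c in edges if q_f[u] + c + q_b[v] <= best + f_lat_fb]

# Tiny usage example: a two-path graph; threshold 1.0 keeps only the cheap path.
edges = [(0, 1, 1.0), (0, 1, 3.0), (1, 2, 1.0)]
print(fb_prune(edges, 3, 1.0))        # -> [(0, 1, 1.0), (1, 2, 1.0)]
```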


Word Graph Pruning: Experimental Results

Experimental results on NAB'94 (ARPA benchmark 1994):

- NAB'94 (development, 20k):
  - W = 20 000 word vocabulary
  - 199 out-of-vocabulary words (2.7%)
  - 20 speakers (10 female, 10 male)
  - 310 sentences (155 per gender)
  - 7387 spoken words
- recognizer setup:
  - 2001 context dependent phoneme models (gender independent)
  - 216 000 densities
  - trigram language model: perplexity 121.8
  - single best word error rate: 14.5%

Additional evaluation measure for word graphs:

- GER [%]: graph error rate, i.e. the error rate of the hypothesis in the word graph with the smallest Levenshtein distance to the spoken word sequence.


Word Graph Pruning: Experimental Results

Forward pruning:

                         per spoken word
-log fLAT    arcs      nodes    word boundaries   GER [%]
500.00       105.07    45.70    10.54              4.44
300.00        99.74    45.12    10.20              4.44
250.00        89.11    43.69     9.50              4.45
200.00        71.87    40.85     8.39              4.51
180.00        59.31    35.55     7.42              4.62
160.00        46.52    29.34     6.35              4.81
140.00        33.74    22.40     5.20              5.08
120.00        22.86    15.96     4.20              5.40
100.00        14.07    10.31     3.29              5.96
 80.00         8.23     6.36     2.59              6.85
 60.00         4.78     3.89     2.10              7.89
 40.00         2.97     2.55     1.76              9.39
 20.00         2.05     1.85     1.54             10.79
 10.00         1.77     1.65     1.45             11.63
  5.00         1.64     1.56     1.40             11.90
  1.00         1.54     1.48     1.36             12.12


Word Graph Pruning: Experimental Results

Forward-backward pruning:

                         per spoken word
F_LAT,FB     arcs      nodes    word boundaries   GER [%]
750.00        97.07    43.11     9.96              4.51
500.00        82.47    39.92     9.96              4.53
250.00        27.55    17.21     6.00              4.78
200.00        16.95    11.26     4.58              4.95
180.00        13.45     9.17     4.03              4.98
160.00        10.51     7.36     3.54              5.28
140.00         8.03     5.78     3.08              5.58
120.00         6.06     4.49     2.66              5.92
100.00         4.58     3.51     2.32              6.44
 80.00         3.48     2.77     2.03              7.16
 60.00         2.68     2.23     1.79              8.11
 40.00         2.09     1.82     1.59              9.67
 20.00         1.65     1.51     1.42             11.87
 10.00         1.45     1.38     1.33             13.04
  5.00         1.35     1.31     1.29             13.75
  1.00         1.25     1.25     1.24             14.34


Word Graph Pruning: Experimental Results

[Figure: graph error rate [%] plotted against word graph density for forward pruning and forward-backward pruning. At equal word graph density, forward-backward pruning yields a lower graph error rate.]



Multi-Pass Search

Word graphs can be used as a search space representation for consecutively constraining the search space.

[Figure: word graph over time with word hypotheses (SIL, A, B, C) as arcs between time-stamped nodes.]

Properties of word graphs:

- incoming arcs: maximum = vocabulary size
- outgoing arcs: no maximum
- silence is included as an integrated part of the words (each word arc w is followed by an optional silence arc sil in parallel with an empty transition ε)

Search over the word graph:

- dynamic programming through the graph.


Multi-Pass Search

Using word graphs as (constrained) search space representations, a number of multi-pass search strategies are possible:

- Language model rescoring:
  - First pass: full search with the baseline language model to generate the word graph.
  - Second pass: search over the word graph using the advanced language model.
  - Word boundaries and acoustic scores are kept constant (from the first pass).
- Acoustic rescoring:
  - First pass: full search with the baseline acoustic model, e.g. using monophones or within-word triphones.
  - Second pass: search over the word graph using the advanced acoustic model, e.g. across-word triphones.
  - If word boundaries are kept constant, acoustic rescoring can be done per arc/word.
  - Word boundaries can also be recalculated: full search constrained to the word graph (discarding first-pass word boundaries).


Multi-Pass Search

- In multi-pass search strategies, language model and acoustic model rescoring are usually combined, forming a consecutive sequence of search passes that are increasingly constrained.
- Potential advantage: applying simpler models at earlier search stages with larger search spaces can reduce the search complexity, allowing more complex and computationally demanding models to be applied in a more constrained search space.
- Disadvantages:
  - Each intermediate step in a multi-pass search strategy represents decisions to constrain the further search space based on incomplete knowledge (i.e. disregarding more advanced models to be applied at a later stage). Hypotheses discarded at some level cannot, or can only partly, be recovered in later stages of the search process.
  - Less advanced models call for less severe pruning, i.e. the reduction of the search space is limited by the models used and needs to be optimized.



Language Model Rescoring

Word graph rescoring algorithm (2nd pass):

input: word graph,
       set of word boundaries τ(t; u_1^m),
       set of word scores h(u_m; τ, t)

proceed over time t from left to right
  process each word pair u_{m-1}^m in the graph
    get word boundaries τ(t; u_1^m) and scores h(u_m; τ, t)
    process all word sequences u_1^m:
      H(u_1^m; t) := p(u_m | u_1^{m-1}) · H(u_1^{m-1}; τ) · h(u_m; τ, t)
      H(u_2^m; t)  = max_{u_1} H(u_1^m; t)
      B(u_2^m; t)  = arg max_{u_1} H(u_1^m; t)
traceback: use the back pointers B(u_2^m; t)

Note: the language model used here is different from the one used to construct the word graph.
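A minimal sketch of this second pass with a bigram replacement LM, assuming word boundaries and acoustic scores are kept fixed from the first pass. The representation (topologically numbered nodes, edges as (from, to, word, acoustic_cost) tuples, costs in the negative-log domain) and the lm callable returning p(word | predecessor) are assumptions for illustration.

```python
import math

def rescore_bigram(edges, n_nodes, lm, lm_scale=1.0):
    """Second-pass search over a word graph: DP state = (graph node, last word)."""
    hyp = [dict() for _ in range(n_nodes)]   # hyp[node][last_word] = (cost, backptr)
    hyp[0]["<s>"] = (0.0, None)
    for u, v, w, am_cost in sorted(edges, key=lambda e: e[1]):
        for h, (cost, _) in hyp[u].items():
            new = cost + am_cost - lm_scale * math.log(lm(w, h))
            if w not in hyp[v] or new < hyp[v][w][0]:
                hyp[v][w] = (new, (u, h))    # recombination: keep best per new history
    h, (cost, bp) = min(hyp[n_nodes - 1].items(), key=lambda kv: kv[1][0])
    words = [h]
    while bp is not None:                    # traceback over back pointers
        u, hprev = bp
        if hprev != "<s>":
            words.append(hprev)
        bp = hyp[u][hprev][1]
    return list(reversed(words)), cost
```

With a trigram, the DP state would carry the last two words instead of one; the structure of the pass is unchanged.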


Word Graph Rescoring: Results

Test: NAB'94 H1 development

- vocabulary: 20 000 words
- 10 male and 10 female speakers
- total of 310 sentences = 7387 spoken words
- unknown words: 199 (out-of-vocabulary rate: 2.7%)
- test set perplexities:
  - bigram: PP_bi = 198.1
  - trigram: PP_tri = 130.2

Acoustic modelling:

- trained on: WSJ0 and WSJ1 (284 speakers; 80 h of speech)
- phoneme models: 4688 (43+1 monophones, 557 diphones, 4087 triphones)
- tied states: 4623 (6-state HMM)
- densities: 290 000 per gender


Word Graph Method: Experimental Results

Search space for word graph generation:

- search space: 27672 states, 7674 arcs, 115 trees
- rate of out-of-vocabulary words (OOV): 2.7%

Word graph and evaluation measures:

- WGD: word graph density (word arcs per spoken word)
- NGD: node graph density (nodes per spoken word)
- BGD: boundary graph density (different word boundaries per spoken word)
- GER [%]: word graph error rate ('sub, del, ins')
- WER [%]: word recognition error rate (trigram: PP_tri = 130.2)


Word Graph Method: Results

        graph density               graph                    recognition
fLat    WGD    NGD      BGD     del - ins   GER [%]     del - ins   WER [%]
300     1476   181.80   19.67     9 -  38     4.2       120 - 202    14.3
150     1415   175.93   19.25     9 -  38     4.2       120 - 202    14.3
100      684   104.22   14.50    13 -  38     4.3       120 - 202    14.3
 90      460    77.35   12.31    12 -  38     4.3       120 - 202    14.3
 80      269    51.75    9.84    15 -  44     4.5       120 - 202    14.3
 70      137    31.43    7.42    19 -  51     4.8       120 - 202    14.3
 60       60    17.20    5.33    23 -  56     5.1       122 - 201    14.3
 50       25     8.98    3.77    40 -  64     5.8       124 - 200    14.3
 40       10     4.79    2.66    50 -  81     6.8       122 - 202    14.3
 30        4     2.75    2.01    84 -  94     8.4       135 - 191    14.6
 20        2     1.83    1.60   113 - 129    10.6       147 - 184    14.8
 10        1     1.41    1.36   156 - 160    13.8       173 - 180    15.6
  5        1     1.29    1.28   168 - 177    15.3       181 - 190    16.3
  1        1     1.25    1.24   167 - 192    16.3       167 - 195    16.4


Recognition Results: Word Graph vs. One Pass

(A) two-pass search (word conditioned bigram search with word graph construction and subsequent word graph rescoring with a trigram)
(B) single-pass time conditioned trigram search (Chapter 5)

- NAB H1 development corpus; 20k vocabulary
- 310 test sentences, 7387 words
- bigram LM with a perplexity of 198
- trigram LM with a perplexity of 130

                           sentences    word errors: two-pass (A) / single-pass (B)
identical score            277          898 / 898
(A) is better than (B)       1           11 / 11
(B) is better than (A)      32          146 / 121
total                      310         1055 / 1030


Recognition Results: Word Graph vs. One Pass

(A) two-pass search (word conditioned bigram search with word graph construction and subsequent word graph rescoring with a trigram)
(B) single-pass time conditioned trigram search (Chapter 5)

- NAB H1 development corpus; 64k vocabulary
- 310 test sentences, 7387 words
- bigram LM with a perplexity of 237
- trigram LM with a perplexity of 172

                           sentences    word errors: two-pass (A) / single-pass (B)
identical score            220          452 / 452
(A) is better than (B)       1            2 / 2
(B) is better than (A)      89          439 / 423
total                      310          893 / 877


Recognition Results: Word Graph vs. Integrated Method

Examples of spoken test sentences (SNT), comparing

- the integrated search method (ISM) and
- the word graph method (WGM):

SNT: ... A FEW MUTUAL FUND INVESTORS INTO MODELS OF ...
WGM: ... A FEW MUTUAL FUND INVESTORS IN TWO MODELS OF ...
ISM: ... A FEW MUTUAL FUND INVESTORS INTO MODELS OF ...

SNT: ... WHEN YOU BUY A MUTUAL FUND FROM YOUR BROKER
WGM: ... WHEN YOU BUY A MUTUAL FUND FROM A BROKER
ISM: ... WHEN YOU BUY A MUTUAL FUND FROM YOUR BROKER

SNT: I ALMOST WISH I HADN'T SEEN THIS PART
WGM: I ALMOST WISH I HADN'T SEEN AS PART
ISM: I ALMOST WISH I HADN'T SEEN THIS PART


Bigram Cache Model using Word Graph

Cache (unigram/bigram) language model:

- A cache LM is difficult to handle in integrated search.
- The cache is updated after each sentence boundary (not after each word!).
- Two modes: supervised/unsupervised.

Language model                         PP      del - ins   WER [%]
Bigram:  without cache                 198.1   179 - 195   16.5
         with cache: unsupervised      178.0   190 - 179   16.3
                     supervised        171.2   194 - 179   16.0
Trigram: without cache                 130.2   120 - 202   14.3
         with cache: unsupervised      117.1   127 - 182   14.0
                     supervised        113.1   128 - 194   13.8


Search Effort: Word Graph vs. Integrated Method

Comparison of the computational search complexity; test condition:

- subset of the NAB'94 H1 development set:
  - 10 male and 10 female speakers,
  - 20 sentences = 519 spoken words (of which 21 words are out of vocabulary),
  - 211 sec of speech
- perplexities: PP_bi = 201.3, PP_tri = 134.1
- SGI workstation (Indy R4400, SpecInt'92: 94)


Search Effort: Word Graph vs. Integrated Method

Method:                          Word Graph                Integrated
Search space                     16274 states /            32291 states /
                                 4516 arcs / 41 trees      9413 arcs / 108 trees

Time                             [sec]     [%]             [sec]     [%]
Log-likelihood computation       14830     78.2            15174     67.2
Acoustic search                   3603     19.0             6576     29.1
Lang. model recombination          139      0.7              621      2.7
Word graph rescoring               210      1.1                -        -
Other operations                   177      1.0              208      1.0
Overall recognition              18959    100.0            22579    100.0
Real time factor                    90        -              107        -


Conclusions: Word Graph Generation and Rescoring

Word graph generation:

- Specification of the word graph problem.
- Word pair approximation.
- Extension of the word conditioned one-pass algorithm.
- Modification: bookkeeping only.

Word graph (language model) rescoring:

- Experimental results for the 20k WSJ/NAB task (within-word modeling): 4 to 25 word arcs per spoken word are sufficient.
- Comparison with integrated search: negligible degradation.
- Note: only 'forward' pruning using the bigram LM.



Path Probabilities over Word Graphs

Word graph L:

- promising part of the search space
- the probability along a path is $p(W, X)$, $W \in \Sigma^*$
- path posterior probability:

$$p(W \mid X) = \frac{p(W, X)}{p(X)} = \frac{p(W, X)}{\sum_{V \in \Sigma^*} p(V, X)}$$

[Figure: example word graph with time-stamped nodes (0, 3, 10, 11, 14, 24) and word arcs I/alike, like/alike, this/his, meal/veal, representing the path hypotheses "I like this meal", "I like this veal", "alike his veal", and "alike this veal".]


Path Probabilities over Word Graphs

Posterior probability of a path (edge sequence) $e_1^N$ through word graph L:

- word graph L is acyclic
- edge $e$ has a word label $i(e) \in \Sigma \cup \{\varepsilon\}$, with finite alphabet (vocabulary) $\Sigma$ and empty word $\varepsilon$
- word sequence for $e_1^N \in \pi(L)$:

$$W := i(e_1^N) := [i(e_n)]_{n=1}^N$$

- edge weight of $e$: $w(e) := p_{AM}(e) \cdot p_{LM}(e)^\beta$, with acoustic score $p_{AM}(e)$, LM score $p_{LM}(e)$, and LM scale $\beta$
- path probability for $e_1^N \in \pi(L)$:

$$p(e_1^N, X) := \prod_{n=1}^N w(e_n)^\gamma,$$

  where usually $\gamma = 1/\beta$
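A small sketch of this scoring in the log domain; the edge field names p_am and p_lm are illustrative assumptions:

```python
import math

def path_log_prob(path, beta, gamma=None):
    """log p(e_1^N, X) = gamma * sum_n [log p_AM(e_n) + beta * log p_LM(e_n)]."""
    gamma = 1.0 / beta if gamma is None else gamma   # usual choice: gamma = 1/beta
    return gamma * sum(math.log(e["p_am"]) + beta * math.log(e["p_lm"]) for e in path)
```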


Edge Probabilities over Word Graphs

Posterior probability of edge $e$ in word graph L:

- probability that a path goes through edge $e$:

$$p(e \mid X) := \sum_{e_1^N \in \pi(L):\, \exists n:\, e_n = e} p(e_1^N \mid X)$$

[Figure: the example word graph from above, with one edge highlighted; its posterior sums the posteriors of all paths running through it.]


Forward/Backward Algorithm

Goal: find an efficient way to calculate the edge probabilities.

Approach: decompose the edge probability into a forward and a backward part:

$$p(e, X) = \sum_{e_1^N \in \pi(L):\, \exists n:\, e_n = e} p(e_1^N, X) = \underbrace{\Bigg[ \sum_{e_1^k:\, \mathrm{to}(e_k) = \mathrm{from}(e)} \; \prod_{m=1}^{k} w(e_m) \Bigg]}_{=:\, \Phi(\mathrm{from}(e))} \cdot \; w(e) \cdot \underbrace{\Bigg[ \sum_{e_l^N:\, \mathrm{to}(e) = \mathrm{from}(e_l)} \; \prod_{n=l}^{N} w(e_n) \Bigg]}_{=:\, \Psi(\mathrm{to}(e))}$$

- $e_1^k$: partial path from the initial state to state from(e)
- $e_l^N$: partial path from state to(e) to a final state
- forward score of state $s$: $\Phi(s)$
- backward score of state $s$: $\Psi(s)$


Forward/Backward Algorithm

Recursive computation of the forward score $\Phi(s)$:

$$\Phi(s) = \sum_{e_1^k:\, \mathrm{to}(e_k) = s} \; \prod_{m=1}^{k} w(e_m) = \sum_{e \in \mathrm{In}(s)} w(e) \underbrace{\sum_{e'^l_1:\, \mathrm{to}(e'_l) = \mathrm{from}(e)} \; \prod_{n=1}^{l} w(e'_n)}_{=:\, \Phi(\mathrm{from}(e))}$$

- Notation:
  - In(s): set of edges ending in state $s$
  - from(e), to(e): predecessor and successor state of edge $e$
- efficient computation via dynamic programming
- let $s_F$ be the unique final state:

$$\Phi(s_F) = \sum_{e_1^N \in \pi(L)} p(e_1^N, X) = p(X)$$


Forward/Backward Algorithm

Recursive computation of the backward score $\Psi(s)$:

$$\Psi(s) = \sum_{e_l^N:\, \mathrm{from}(e_l) = s} \; \prod_{n=l}^{N} w(e_n) = \sum_{e \in \mathrm{Out}(s)} w(e) \underbrace{\sum_{e'^N_k:\, \mathrm{from}(e'_k) = \mathrm{to}(e)} \; \prod_{m=k}^{N} w(e'_m)}_{=:\, \Psi(\mathrm{to}(e))}$$

- Notation:
  - Out(s): set of edges leaving state $s$
  - from(e), to(e): predecessor and successor state of edge $e$
- efficient computation via dynamic programming
- let $s_I$ be the unique initial state:

$$\Psi(s_I) = \sum_{e_1^N \in \pi(L)} p(e_1^N, X) = p(X)$$


Forward/Backward Algorithm

Implementation in pseudo code:

Φ(s_I) := 1.0
for state s ≠ s_I in L in topological order:
    Φ(s) := 0.0
    for edge e in In(s):
        Φ(s) += Φ(from(e)) · w(e)
Ψ(s_F) := 1.0
for state s ≠ s_F in L in reversed topological order:
    Ψ(s) := 0.0
    for edge e in Out(s):
        Ψ(s) += w(e) · Ψ(to(e))
assert Φ(s_F) = Ψ(s_I)
p(X) := Φ(s_F)
for edge e in L:
    p(e|X) := [Φ(from(e)) · w(e) · Ψ(to(e))] / p(X)

Complexities:

- loop over the [reverse] topological order of the acyclic graph L: O(|E(L)|)
- posterior probability for each edge in graph L: O(|E(L)|)
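A runnable rendering of the pseudocode in the probability semiring. The representation (states numbered topologically from 0 to n_states-1, edges as dicts with a precomputed weight w) is an assumption for illustration.

```python
def edge_posteriors(edges, n_states):
    phi = [0.0] * n_states           # forward scores Phi(s)
    psi = [0.0] * n_states           # backward scores Psi(s)
    phi[0] = 1.0
    psi[n_states - 1] = 1.0
    for e in sorted(edges, key=lambda e: e["to"]):                  # forward pass
        phi[e["to"]] += phi[e["from"]] * e["w"]
    for e in sorted(edges, key=lambda e: e["from"], reverse=True):  # backward pass
        psi[e["from"]] += e["w"] * psi[e["to"]]
    p_x = phi[n_states - 1]
    assert abs(p_x - psi[0]) < 1e-12 * max(p_x, 1.0)                # Phi(s_F) = Psi(s_I)
    return [phi[e["from"]] * e["w"] * psi[e["to"]] / p_x for e in edges], p_x

# Tiny usage example: "a" followed by either "b" (weight 3) or "c" (weight 1).
edges = [{"from": 0, "to": 1, "word": "a", "w": 1.0},
         {"from": 1, "to": 2, "word": "b", "w": 3.0},
         {"from": 1, "to": 2, "word": "c", "w": 1.0}]
post, p_x = edge_posteriors(edges, 3)
print(post)                          # -> [1.0, 0.75, 0.25]
```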


Generalization: Forward/Backward over Semirings

Semiring (K, ⊕, ⊗, 0, 1):

- set K
- binary operation ⊕ called addition
- binary operation ⊗ called multiplication

1. (K, ⊕, 0) is a commutative monoid
2. (K, ⊗, 1) is a (commutative) monoid (commutativity is required for the backward scores)
3. multiplication distributes over addition:
   - x ⊗ (y ⊕ z) = (x ⊗ y) ⊕ (x ⊗ z)
   - (x ⊕ y) ⊗ z = (x ⊗ z) ⊕ (y ⊗ z)
4. 0 is an annihilator for ⊗:
   - 0 ⊗ x = x ⊗ 0 = 0

- the forward/backward algorithm can be applied to any commutative semiring
- the interpretation of the result depends on the semiring


Generalization: Forward/Backward over Semirings

Semiring       K           x ⊕ y                     x ⊗ y    0     1
probability    R+          x + y                     x · y    0     1
log            R ∪ {+∞}    -log(e^-x + e^-y)         x + y    +∞    0
tropical       R ∪ {+∞}    min(x, y)                 x + y    +∞    0

- probability semiring:
  - fb(e, X): edge probability p(e, X)
- log semiring:
  - fb(e, X): probability in negated log-space, -log p(e, X)
  - numerical stability
- tropical semiring:
  - fb(e, X): probability of the best path through e
  - Viterbi approximation
  - implements the single-source shortest paths/distances algorithm (Bellman-Ford)
  - forward-backward graph pruning possible/useful (generalization).
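As a sketch, the forward pass can be written once, generically over a semiring given as a (plus, times, zero, one) tuple; edge weights must already live in the semiring's domain (probabilities for the probability semiring, negative log probabilities for the log and tropical semirings). The graph representation matches the earlier probability-semiring sketch and is likewise an assumption.

```python
import math

def log_add(x, y):                       # -log(e^-x + e^-y), numerically stable
    if math.isinf(x): return y
    if math.isinf(y): return x
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))

PROB     = (lambda x, y: x + y,     lambda x, y: x * y, 0.0,      1.0)
LOG      = (log_add,                lambda x, y: x + y, math.inf, 0.0)
TROPICAL = (lambda x, y: min(x, y), lambda x, y: x + y, math.inf, 0.0)

def forward(edges, n_states, semiring):
    plus, times, zero, one = semiring
    phi = [zero] * n_states
    phi[0] = one
    for e in sorted(edges, key=lambda e: e["to"]):
        phi[e["to"]] = plus(phi[e["to"]], times(phi[e["from"]], e["w"]))
    # PROB: p(X); LOG: -log p(X); TROPICAL: cost of the best (Viterbi) path
    return phi[n_states - 1]
```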


Generalization: Forward/Backward over Semirings: Expectation Semiring

Goal: enable efficient computation of expectations.

- Set R+ × R: elements called "probability" and "value"
- Addition: (p1, v1) ⊕ (p2, v2) := (p1 + p2, v1 + v2)
- Multiplication: (p1, v1) ⊗ (p2, v2) := (p1 · p2, p1 · v2 + p2 · v1)
- Define a cost:
  - a cost for each path $e_1^N \in \pi(L)$,
  - additive along paths, i.e. $c(e_1^N) = \sum_{n=1}^N c(e_n)$
- initialize the value component as w(e)[v] := w(e)[p] · c(e)
- forward/backward computation on graph L
- the p-component of the expectation semiring realizes the probability semiring: fb(e, X)[p] = p(e, X), the probability for paths through e
- fb(e, X)[v] / fb(e, X)[p]: expected cost for paths through e
- expected cost for paths through the graph:

$$\Phi(s_F)[v] / \Phi(s_F)[p] = \Psi(s_I)[v] / \Psi(s_I)[p] = \sum_{W \in \Sigma^*} p(W \mid X) \, c(W)$$
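A sketch of the expectation semiring as a plug-in for the generic forward pass above; elements are (probability, value) pairs, and edge weights are formed as (p(e), p(e) · c(e)) per the initialization above.

```python
EXPECT = (
    lambda a, b: (a[0] + b[0], a[1] + b[1]),                # addition: componentwise
    lambda a, b: (a[0] * b[0], a[0] * b[1] + a[1] * b[0]),  # multiplication: product rule
    (0.0, 0.0),   # zero
    (1.0, 0.0),   # one
)

# The forward result at the final state is (p(X), sum over paths of p(path) * c(path)),
# so value / probability gives the expected path cost.
p, v = forward([{"from": 0, "to": 1, "w": (0.6, 0.6 * 2.0)},   # cost 2, probability 0.6
                {"from": 0, "to": 1, "w": (0.4, 0.4 * 5.0)}],  # cost 5, probability 0.4
               2, EXPECT)
print(v / p)   # -> 3.2 = 0.6 * 2 + 0.4 * 5
```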


Generalization: Forward/Backward over Semirings: Expectation Semiring

Efficient computation of the expectation for graph L:

- graph L[p] stores probabilities on the edges
- graph L[c] stores costs on the edges
- L[p] and L[c] are identical apart from the edge weights

$$E_{L[p]} L[c] := p(X)^{-1} \sum_{e_1^N \in \pi(L)} w(e_1^N)[p] \, w(e_1^N)[c] = \sum_{e_1^N \in \pi(L)} p(e_1^N \mid X) \, c(e_1^N) = \sum_{e \in E(L)} p(e \mid X) \, c(e)$$

- proof: exercise


Frame-Wise Word Posteriors over Word Graphs

Posterior probability of word w at time t in word graph L:

- probability that word w is observed at time t given X:

$$p_t(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists n:\, t_b(e_n) \le t < t_e(e_n) \,\wedge\, i(e_n) = w}} p(e_1^N \mid X)$$

with $t_b(e)$ and $t_e(e)$ being the start and end times of edge $e$.

[Figure: the example word graph with a vertical line marking one time frame; all edges intersecting that frame contribute to the frame-wise posteriors.]


Frame-Wise Word Posteriors over Word Graphs

[Figure: the example word graph at time frame t = 13, where the two "this" edges $e_1$ and $e_2$ intersect the frame.]

Example (path posteriors: "I like this meal" 0.4, "I like this veal" 0.3, "alike his veal" 0.2, "alike this veal" 0.1):

$$p_{t=13}(\text{"this"} \mid X) = p(e_1 \mid X) + p(e_2 \mid X) = p(W_1 \mid X) + p(W_2 \mid X) + p(W_4 \mid X) = 0.4 + 0.3 + 0.1 = 0.8$$


Frame-Wise Word Posteriors over Word Graphs

Frame-wise word posterior probabilities from graph L given X:

- probability that word w is observed at time t given X:

$$p_t(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists n:\, t_b(e_n) \le t < t_e(e_n) \,\wedge\, i(e_n) = w}} p(e_1^N \mid X) = \sum_{\substack{e \in E(L):\, t_b(e) \le t < t_e(e) \,\wedge\, i(e) = w}} p(e \mid X)$$

- defines a probability distribution for each time frame
- applications:
  - confidence measures
  - graph decoding
  - acoustic model training
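A sketch accumulating these frame-wise posteriors from precomputed edge posteriors p(e|X) (e.g. from the forward/backward sketch above); the half-open frame convention [t_b, t_e) and the edge field names are assumptions matching the definitions here.

```python
from collections import defaultdict

def frame_posteriors(edges, post):
    """pt[(t, w)] = p_t(w|X): sum of p(e|X) over edges labeled w covering frame t."""
    pt = defaultdict(float)
    for e, p in zip(edges, post):
        for t in range(e["tb"], e["te"]):
            pt[(t, e["word"])] += p
    return pt
```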



Confidence Measures

- How confident is a recognizer about the correctness of a hypothesized word?
- Possible approach: confidence measure for word w occurring at position n in the hypothesis:

$$\text{conf}(w, n) := p_n(w \mid X)$$

- Problems:
  - What is the "position" n? How to compute a position for a word (and edge) in a graph?
    - Theoretically it is given by the Levenshtein alignment.
  - Ignores deletions: introduce a confidence that no word occurs at position n, i.e. $p_n(\varepsilon \mid X)$.
- Resort: avoid explicit word positions by using frame-wise word posterior probabilities.


Confidence Measures

Confidence measures based on frame-wise word posterior probabilities:

- Idea: approximate the position via the time segment.
- Average frame-wise word posterior for a graph edge e:

$$\text{conf}(e) := d(e)^{-1} \sum_{t=t_b(e)}^{t_e(e)-1} p_t(i(e) \mid X),$$

  with the duration d(e) of edge e.
- Maximum approximation:

$$\text{conf}(e) := \max_{t_b(e) \le t < t_e(e)} p_t(i(e) \mid X)$$
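Both measures in a short sketch, on top of the frame-wise posteriors pt from the earlier sketch:

```python
def conf_avg(e, pt):
    """Average frame-wise posterior of the edge's word over its duration d(e)."""
    return sum(pt[(t, e["word"])] for t in range(e["tb"], e["te"])) / (e["te"] - e["tb"])

def conf_max(e, pt):
    """Maximum approximation: best frame-wise posterior within the edge."""
    return max(pt[(t, e["word"])] for t in range(e["tb"], e["te"]))
```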


Confidence Measures: Position-Wise Word Posteriors over Word Graphs

Posterior probability of word w at position n in word graph L:

- assign edges to positions via a mapping $g: E(L) \to \mathbb{N}$

$$p_n(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists m:\, g(e_m) = n \,\wedge\, i(e_m) = w}} p(e_1^N \mid X)$$

[Figure: the example word graph with its edges assigned to the positions 1 to 4.]


Confidence Measures: Position-Wise Word Posteriors over Word Graphs

[Figure: the example word graph with position slots 1 to 4; the "this" edges are mapped to position 3.]

$$p_{n=3}(\text{"this"} \mid X) = p(W_1 \mid X) + p(W_2 \mid X) + p(W_4 \mid X) = 0.4 + 0.3 + 0.1 = 0.8$$


Confidence Measures: Position-Wise Word Posteriors over Word Graphs

Consider a mapping $g: E(L) \to \mathbb{N}$ to compute position-wise word posterior probabilities from graph L given X:

- Monotonicity: for all graph paths $e_1^N$ require:

$$g(e_n) < g(e_{n+j}) \quad \text{for } j > 0$$

- The probability that word w is observed at position n given X reduces to a sum over edge posterior probabilities (→ forward-backward):

$$p_n(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists m:\, g(e_m) = n \,\wedge\, i(e_m) = w}} p(e_1^N \mid X) = \sum_{\substack{e \in E(L):\, g(e) = n \,\wedge\, i(e) = w}} p(e \mid X)$$

- Defines a probability distribution for each position, where

$$p_n(\varepsilon \mid X) := 1 - \sum_{w \in \Sigma} p_n(w \mid X)$$

- How to obtain $g: E(L) \to \mathbb{N}$? This will be discussed later; it is assumed to be given for now.
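Given such a mapping g (here simply a list with one slot index per edge, which is an assumption) and the edge posteriors p(e|X), the slot-wise distributions follow by accumulation; a sketch:

```python
from collections import defaultdict

def slot_posteriors(edges, post, g, n_slots):
    """pn[n][w] = p_n(w|X); the epsilon entry closes each slot's distribution."""
    pn = [defaultdict(float) for _ in range(n_slots)]
    for i, e in enumerate(edges):
        pn[g[i]][e["word"]] += post[i]
    for slot in pn:
        slot["<eps>"] = max(0.0, 1.0 - sum(slot.values()))
    return pn
```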


Confidence Measures

Confidence measures based on position-wise word posteriors:

- Use the position given by $g: E(L) \to \mathbb{N}$:

$$\text{conf}(i(e), g(e)) := p_{g(e)}(i(e) \mid X)$$

- Note: this provides a confidence for deletions. Let $e_1^N$ be the best scoring graph path; the confidence for deletions between $e_m$ and $e_{m+1}$ is:

$$\text{conf}(\varepsilon, n) := \begin{cases} p_n(\varepsilon \mid X) & \text{for } g(e_m) < n < g(e_{m+1}) \\ 0 & \text{otherwise} \end{cases}$$

- A compact representation of the position-wise word posteriors is the confusion network.


Confidence Measures: Confusion Networks from Word Graphs

Confusion network (CN) derived from word graph L:

- slot-wise posterior probability distribution:

$$p_n(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists m:\, g(e_m) = n \,\wedge\, i(e_m) = w}} p(e_1^N \mid X), \qquad p_n(\varepsilon \mid X) := 1 - \sum_{w \in \Sigma} p_n(w \mid X)$$

[Figure: the example word graph mapped to a confusion network with four slots: {I, ε}, {like, alike}, {this, his}, {meal, veal}.]


Confidence Measures: Confusion Networks from Word Graphs

[Figure: step-by-step construction of the confusion network from the example word graph: the edges are grouped into the position slots 1 to 4, yielding the slot alternatives {I, ε}, {like, alike}, {this, his}, and {meal, veal}.]



Bayes Risk Decoding

- Viterbi decoding minimizes the sentence error, but:
- the standard evaluation metric is the word error (defined by the Levenshtein distance).
- Question: how to do better than Viterbi decoding?
- Bayes risk decoding with the Levenshtein distance as loss yields the minimum expected word error.


Bayes Risk Decoding: Motivation

word sequence W       p(W|X)
I like this meal      0.4
I like this veal      0.3
alike his veal        0.2
alike this veal       0.1

- $W := w_1^N \in [\Sigma \cup \{\varepsilon\}]^N$ is the hypothesis
- the expected error at position n is $(1 - p_n(w_n \mid X))$
- the expected error for W is $\sum_{n=1}^N (1 - p_n(w_n \mid X))$
- expected error of the Viterbi hypothesis "I like this meal": 0.3 + 0.3 + 0.2 + 0.6 = 1.4
- minimum expected error hypothesis "I like this veal": 0.3 + 0.3 + 0.2 + 0.4 = 1.2


Bayes Risk Decoding: Confusion Network Decoding

W                     p(W|X)
I like this meal      0.4
I like this veal      0.3
alike his veal        0.2
alike this veal       0.1

Confusion network with position posterior probabilities $p_n(w \mid X)$:

slot 1:   I 0.7     /  ε 0.3
slot 2:   like 0.7  /  alike 0.3
slot 3:   this 0.8  /  his 0.2
slot 4:   meal 0.4  /  veal 0.6

- minimum expected error hypothesis: $\hat{W} = [\arg\max_w p_n(w \mid X)]_{n=1}^N$
- equals decoding of the CN
- CN decoding aims at minimizing a word-based error
- Viterbi decoding aims at minimizing the sentence error
- we are interested in a minimized word error (defined by the Levenshtein distance)
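A worked check of the numbers above; the slot posteriors are read off the confusion network, so nothing new is assumed:

```python
slots = [{"I": 0.7, "<eps>": 0.3}, {"like": 0.7, "alike": 0.3},
         {"this": 0.8, "his": 0.2}, {"meal": 0.4, "veal": 0.6}]

def expected_error(words, slots):
    return sum(1.0 - slot.get(w, 0.0) for w, slot in zip(words, slots))

print(expected_error("I like this meal".split(), slots))  # 1.4 (Viterbi hypothesis)
print(expected_error("I like this veal".split(), slots))  # 1.2 (minimum expected error)
print([max(s, key=s.get) for s in slots])  # CN decoding: ['I', 'like', 'this', 'veal']
```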


Bayes Risk Decoding

Bayes risk decoding with loss function $L: \Sigma^* \times \Sigma^* \to \mathbb{R}$:

$$\hat{W} := \operatorname*{arg\,min}_{W \in \Sigma^*} \sum_{V \in \Sigma^*} p(V \mid X) \, L(W, V)$$

- $\hat{W}$ has the minimum expected loss
- sentence error → MAP (or Viterbi) decoding:

$$L(W, V) := 1 - \delta(W, V)$$

- Levenshtein distance:

$$L(W, V) := \text{Lev}(W, V)$$

  - the loss function of choice for speech recognition evaluation
  - non-local dependencies
  - exact computation prohibitive on graphs
  - use an approximation → only local dependencies
- graph-based Bayes risk decoder:
  - paths through a graph instead of $W \in \Sigma^*$

Page 329: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

Graph-Based Bayes Risk Decoding

Hypothesis and summation space

I summation space graph S defines posterior distributionp(W |X ) for W ∈ Σ∗

I S is given by word graph from HMM decoder

I hypothesis space graph H defines set of possible decodingoutcomes W

I for unknown posterior distributionI H usually a superset of SI H depends on loss functionI sentence error → H = S

W := arg mineN

1 ∈π(H)

∑f M1 ∈π(S)

p(f M1 |X ) L(eN

1 , fM

1 )

efficient computation, if L(eN1 , f

M1 ) has only local dependencies


Local Loss Functions

- for $e_1^N \in \pi(H)$ and $f_1^M \in \pi(S)$:

$$L(e_1^N, f_1^M) = \sum_{n=1}^{N} \sum_{m=1}^{M} L(e_n, f_m)$$

- word-wise loss computation
- only local dependencies


Bayes Risk Decoding with Local Loss Functions

Bayes decision rule with a local loss function:

$$\hat{W} := \operatorname*{arg\,min}_{e_1^N \in \pi(H)} \sum_{f_1^M \in \pi(S)} p(f_1^M \mid X) \sum_{n=1}^{N} \sum_{m=1}^{M} L(e_n, f_m) = \operatorname*{arg\,min}_{e_1^N \in \pi(H)} \sum_{n=1}^{N} \underbrace{\sum_{f \in E(S)} p(f \mid X) \, L(e_n, f)}_{=:\, c(e_n, S)}$$

- efficient computation (see the sketch below):
  1. rescore each arc $e \in E(H)$ with $c(e, S)$
  2. apply a Viterbi decoder to H
- quadratic runtime: $O(|E(H)| \, |E(S)|)$
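A sketch of this two-step decoder; the graph representation (topologically numbered states, edge dicts) and the loss callable are assumptions:

```python
import math

def bayes_risk_decode(h_edges, n_states, s_edges, s_post, loss):
    # step 1: c(e, S) = sum_f p(f|X) * loss(e, f)   -- O(|E(H)| * |E(S)|)
    cost = [sum(p * loss(e, f) for f, p in zip(s_edges, s_post)) for e in h_edges]
    # step 2: Viterbi (shortest path) over H with the arc costs c(e, S)
    best = [(math.inf, None)] * n_states
    best[0] = (0.0, None)
    for i, e in sorted(enumerate(h_edges), key=lambda ie: ie[1]["to"]):
        cand = best[e["from"]][0] + cost[i]
        if cand < best[e["to"]][0]:
            best[e["to"]] = (cand, i)        # remember best incoming edge
    words, i = [], best[n_states - 1][1]
    while i is not None:                     # traceback along best incoming edges
        words.append(h_edges[i]["word"])
        i = best[h_edges[i]["from"]][1]
    return list(reversed(words))
```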

What are good local loss functions?


Levenshtein Distance Approximations

- the local loss function shall approximate the Levenshtein distance
- frame error based loss functions:
  - count errors on a per-frame basis
  - normalize the frame error on a per-word basis
  - no alignment problem
- CN distance:
  - the CN defines an alignment between any two paths in a graph
  - compute the distance based on the CN alignment
  - the Levenshtein distance is a lower bound for the CN distance

For both types of loss functions, experiments show a strong correlation between the exact word error and the word error approximation.


Frame ErrorIdea: compare words frame-wise using word boundaries fromtime-alignment.Example:

[Figure: time axis 0 to 30; reference "I like this meal" vs. hypothesis "alike this meal"; the misaligned region around "I like"/"alike" accounts for 13 frame errors, a remaining boundary mismatch for 1 frame error.]

Frame error (FE) definition:

    L(e_1^N, f_1^M) := ∑_{t=1}^{T} (1 − δ(i(e_t), i(f_t)))

e_t, f_t: the edges from e_1^N and f_1^M which intersect with time frame t, i.e. e_t ∈ {e ∈ e_1^N | t_b(e) ≤ t < t_e(e)}; a sketch follows below.
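A minimal sketch of this frame error computation, assuming each word sequence is given as (word, t_begin, t_end) triples with half-open spans that jointly cover [0, T); the example boundaries in the comment are hypothetical:

    # Frame error between two time-aligned word sequences.

    def word_at(seq, t):
        """Label of the edge that intersects time frame t."""
        return next(w for w, tb, te in seq if tb <= t < te)

    def frame_error(ref, hyp, T):
        """Number of frames whose reference and hypothesis labels differ."""
        return sum(word_at(ref, t) != word_at(hyp, t) for t in range(T))

    # e.g. frame_error([("I", 0, 5), ("like", 5, 15), ("this", 15, 20), ("meal", 20, 30)],
    #                  [("alike", 0, 14), ("this", 14, 20), ("meal", 20, 30)], T=30)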


Frame Error Based Loss Functions

Define:

- d(e): duration in number of time frames
- o(e, f): overlap in number of time frames
- h(e, f): overlap conditioned on label equality

Rewrite frame error:

    L(e_1^N, f_1^M) := ∑_{t=1}^{T} (1 − δ(i(e_t), i(f_t)))

      = ∑_{n=1}^{N} d(e_n) − ∑_{n=1}^{N} ∑_{m=1}^{M} o(e_n, f_m) · δ(i(e_n), i(f_m)),   where   h(e_n, f_m) := o(e_n, f_m) · δ(i(e_n), i(f_m))

      = ∑_{m=1}^{M} [ d(f_m) − ∑_{n=1}^{N} h(e_n, f_m) ]


Frame Error Based Loss Functions

Rescoring function for (position-wise) Bayes risk decoder:

    c(e, S) := ∑_{f ∈ E(S)} p(f|X) L(e, f)
             = d(e) − ∑_{f ∈ E(S)} p(f|X) h(e, f)
             = ∑_{t=t_b(e)}^{t_e(e)−1} [1 − p_t(i(e)|X)]

- simple rescoring with frame-wise word posterior probabilities; a sketch follows below
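A minimal sketch of this rescoring, assuming the frame-wise word posteriors p_t(w|X) are given as one dict per frame; names are illustrative:

    # Arc risk c(e, S) from frame-wise word posteriors.
    # frame_post[t]: dict word -> p_t(word|X); an arc is (word, t_begin, t_end).

    def arc_risk(arc, frame_post):
        word, tb, te = arc
        return sum(1.0 - frame_post[t].get(word, 0.0) for t in range(tb, te))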


Frame Error Based Loss Functions

Rescoring function for (position-wise) Bayes risk decoder:

    c(e, S) = ∑_{t=t_b(e)}^{t_e(e)−1} [1 − p_t(i(e)|X)]

Discrepancy between word and frame error:

- long words have a higher impact on the frame error based decision, i.e. the corresponding frame error generally increases with the word length
- idea: normalize the frame error on a per-word basis
- open: normalize per hypothesis or per reference word?


Hypothesis-Side Normalized Frame Error

Rescoring function for the Bayes risk decoder with the frame error normalized by the hypothesis duration:

    Ŵ := argmin_{e_1^N ∈ H} ∑_{n=1}^{N} [ 1 − (∑_{f ∈ S} p(f|X) h(e_n, f)) / d(e_n) ]

       = argmin_{e_1^N ∈ H} ∑_{n=1}^{N} c(e_n, S),   where   c(e_n, S) := 1 − (1/d(e_n)) ∑_{t=t_b(e_n)}^{t_e(e_n)−1} p_t(i(e_n)|X)

- normalize the FE for each edge e ∈ H by d(e)
- normalized sum of frame-wise word posterior probabilities
- bias: high number of deletions


Reference-Side Normalized Frame Error

Rescoring function for the Bayes risk decoder with the frame error normalized by the reference duration:

    Ŵ := argmin_{e_1^N ∈ H} ∑_{f ∈ S} p(f|X) [ 1 − ∑_{n=1}^{N} h(e_n, f)/d(f) ]

       = argmin_{e_1^N ∈ H} [ ∑_{f ∈ S} p(f|X) − ∑_{f ∈ S} ∑_{n=1}^{N} (h(e_n, f)/d(f)) p(f|X) ]

       = argmin_{e_1^N ∈ H} ∑_{n=1}^{N} c(e_n, S),   where   c(e_n, S) := − ∑_{f ∈ S} p(f|X) h(e_n, f)/d(f)

- normalize the FE for each edge f ∈ S by d(f)
- normalization → cannot use frame-wise word posteriors
- bias: high number of insertions


Bias in Single-Sided Frame Error Normalization

- hypothesis-side normalization does not penalize deletions
- reference-side normalization does not penalize insertions
- best approach in practice: average both normalizations

Example:

    reference:   I     like   that   infection
    hypothesis:  alike        that   satisfaction

    error normalized by
    1. reference:        0.2   0.8   1.0   1.0
    2. hypothesis:       1.0   1.0   0.4   0.6
    average of 1. + 2.:  0.6   0.9   0.7   0.8


Hypothesis Space for Minimum Frame Error Search

- frame error computation requires word boundaries in H
- derive H from the summation space graph S
- build a time-conditioned graph from S: merge all states with equal time stamps
- π(H) ⊇ π(S)

Example:

[Figure: an original lattice with arcs "I", "like", "alike", "this", "his", "veal", "meal" between time stamps 0, 3, 10, 11, 14, 24, and its time-conditioned form, in which all states carrying equal time stamps are merged.]


Confusion Network Distance

- recall: a confusion network CN(L) is derived from a graph L by a function

      g : E(L) → ℕ,

  which maps each edge to a unique position, while keeping the order of each path contained: g(e_n) < g(e_{n+1}) for e_1^N ∈ π(L)
- g(·) maps each path in L to a path in CN(L)
- each path in CN(L) has equal length N := max_{e ∈ E(L)} g(e)
- CN(L) defines a distance between any two paths e_1^N and f_1^N through the confusion network mapping to positions:

      L(e_1^N, f_1^N) := ∑_{n=1}^{N} [1 − δ(i(e_n), i(f_n))]


Confusion Network Distance

Recall the Bayes decision rule with local loss function:

    Ŵ := argmin_{e_1^N ∈ π(H)} ∑_{f_1^M ∈ π(S)} p(f_1^M|X) ∑_{n=1}^{N} ∑_{m=1}^{M} L(e_n, f_m)

- problem: a confusion network in general does not allow keeping the original path probabilities
- idea: define the loss function between paths from the confusion network and its source graph, where the latter contains the original path probabilities
- approach: use the confusion network only as the Bayes risk decoding hypothesis space H, whereas for the summation the original graph S is used
- let g(·) be defined for two graphs H and S ⇒ g(·) defines a loss function L : π(H) × π(S) → ℝ
- note: the loss function does not remain local in the above form, but does retain its efficiency


Hypothesis Space for CN Decoding

- summation space S given by the word graph from the HMM decoder
- let the confusion network mapping g : E(S) → [1, N] be given
- construction of the hypothesis space H:
  1. initialize N + 1 states; state 0 is the initial and N the final state
  2. for each state s ∈ [0, N):
       for each w ∈ Σ ∪ {ε}:
         connect state s with state s + 1 by an edge e, where i(e) = w and g(e) = s + 1
- H has CN structure
  - any path e_1^N ∈ π(H) has exactly N arcs
  - a path e_1^N ∈ π(H) has no gaps, i.e. g(e_s) = s
- H contains all possible Bayes risk decoding results for the CN distance as loss function
- further investigations make use of the special structure of H; minimum CN distance decoding with an arbitrary hypothesis space is left as an exercise


CN Distance for Graphs

CN distance between e_1^N ∈ H and f_1^M ∈ S:

- substitutions: hypothesis and reference word, but hypothesis word ≠ reference word

      L_S(e_1^N, f_1^M) := ∑_{n=1}^{N} ∑_{m=1}^{M} δ(g(e_n), g(f_m)) · [1 − δ(i(e_n), ε)] · [1 − δ(i(f_m), ε)] · [1 − δ(i(e_n), i(f_m))]

- deletions: no hypothesis word, but a reference word

      L_D(e_1^N, f_1^M) := ∑_{n=1}^{N} ∑_{m=1}^{M} δ(g(e_n), g(f_m)) · δ(i(e_n), ε) · [1 − δ(i(f_m), ε)]


CN Distance for Graphs

CN distance between e_1^N ∈ H and f_1^M ∈ S (ctd.):

- insertions: a hypothesis word, but no reference word
  - problem: the position labels for f_1^M are not strictly consecutive, gaps!
  - indirect: #ins = #(hyp. words) − #(cor and sub)

      L_I(e_1^N, f_1^M) := [ ∑_{n=1}^{N} [1 − δ(i(e_n), ε)] ] − [ ∑_{n=1}^{N} ∑_{m=1}^{M} δ(g(e_n), g(f_m)) · [1 − δ(i(e_n), ε)] · [1 − δ(i(f_m), ε)] ]


CN Distance for Graphs

CN distance: sum of four costs:

    L(e_1^N, f_1^M) := [ ∑_{n=1}^{N} [1 − δ(i(e_n), ε)] ]
        + ∑_{n=1}^{N} ∑_{m=1}^{M} δ(g(e_n), g(f_m)) · l(e_n, f_m)

with

    l(e_n, f_m) := − [1 − δ(i(e_n), ε)] · [1 − δ(i(f_m), ε)]
                   + [1 − δ(i(e_n), ε)] · [1 − δ(i(f_m), ε)] · [1 − δ(i(e_n), i(f_m))]
                   + δ(i(e_n), ε) · [1 − δ(i(f_m), ε)]

- consider the local sum of Kronecker delta expressions l(e, f)


CN Distance for Graphs

Simplify the local sum of deltas:

    l(e, f) = − [1 − δ(i(e), ε)] · [1 − δ(i(f), ε)]
              + [1 − δ(i(e), ε)] · [1 − δ(i(f), ε)] · [1 − δ(i(e), i(f))]
              + δ(i(e), ε) · [1 − δ(i(f), ε)]

            = − [1 − δ(i(e), ε)] · [1 − δ(i(f), ε)] · δ(i(e), i(f))
              + δ(i(e), ε) · [1 − δ(i(f), ε)]

            = − [1 − δ(i(e), ε)] · δ(i(e), i(f))
              + δ(i(e), ε) · [1 − δ(i(f), ε)]


CN Distance for Graphs

Resubstitute the local sum of deltas into the CN distance:

    L(e_1^N, f_1^M) = ∑_{n=1}^{N} { [1 − δ(i(e_n), ε)] + ∑_{m=1}^{M} δ(g(e_n), g(f_m)) l(e_n, f_m) }

                    = ∑_{n=1}^{N} { L_1(e_n) + ∑_{m=1}^{M} L_2(e_n, f_m) }

- alternative local loss function

Corresponding decision rule:

    Ŵ := argmin_{e_1^N ∈ π(H)} ∑_{f_1^M ∈ π(S)} p(f_1^M|X) ∑_{n=1}^{N} [ L_1(e_n) + ∑_{m=1}^{M} L_2(e_n, f_m) ]

       = argmin_{e_1^N ∈ π(H)} ∑_{n=1}^{N} [ L_1(e_n) + ∑_{f ∈ E(S)} p(f|X) L_2(e_n, f) ]

       = argmin_{e_1^N ∈ π(H)} R(e_1^N|X)


CN Decoding

    R(e_1^N|X) = ∑_{n=1}^{N} [ L_1(e_n) + ∑_{f ∈ E(S)} p(f|X) L_2(e_n, f) ]        (posterior risk)

      = ∑_{n=1}^{N} { 1 − δ(i(e_n), ε) + ∑_{f ∈ E(S)} p(f|X) δ(g(e_n), g(f)) ·
                      [ δ(i(e_n), ε) [1 − δ(i(f), ε)] − [1 − δ(i(e_n), ε)] δ(i(e_n), i(f)) ] }

      = ∑_{n=1}^{N} { 1 − [1 − δ(i(e_n), ε)] ∑_{f ∈ E(S)} p(f|X) δ(g(e_n), g(f)) δ(i(e_n), i(f))
                      − δ(i(e_n), ε) [ 1 − ∑_{f ∈ E(S)} p(f|X) δ(g(e_n), g(f)) [1 − δ(i(f), ε)] ] }

      = ∑_{n=1, e_n ≠ ε}^{N} [1 − p_n(i(e_n)|X)] + ∑_{n=1, e_n = ε}^{N} [1 − p_n(ε|X)]

      = ∑_{n=1}^{N} [1 − p_n(i(e_n)|X)]


CN Decoding

Bayes risk decoding:

    Ŵ = argmin_{e_1^N ∈ π(H)} R(e_1^N|X)
      = argmin_{e_1^N ∈ π(H)} ∑_{n=1}^{N} [1 − p_n(i(e_n)|X)]
      = [ argmax_{w ∈ Σ ∪ {ε}} p_n(w|X) ]_{n=1}^{N}

with

    p_n(w|X) := ∑_{f ∈ E(S)} p(f|X) · δ(n, g(f)) · δ(w, i(f))

    p_n(ε|X) := 1 − ∑_{f ∈ E(S)} p(f|X) · δ(n, g(f)) · [1 − δ(i(f), ε)]
              = 1 − ∑_{w ∈ Σ} p_n(w|X)


CN Decoding

Bayes risk decoding:

    Ŵ = [ argmax_{w ∈ Σ ∪ {ε}} p_n(w|X) ]_{n=1}^{N}

- decoding only requires the CN derived from S (valid only for the special construction of H); a sketch follows below
- gaps in the position labels are handled by the derived posterior probabilities for the empty word p_n(ε|X)
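A minimal sketch of this decoder, assuming the arcs of S are given as (word, position g(f), posterior) triples and slots are numbered 1..N; names are illustrative:

    # CN decoding: position-wise argmax over p_n(w|X); the empty word
    # absorbs all probability mass not covered by words at a slot.
    from collections import defaultdict

    EPS = "<eps>"

    def cn_decode(arcs, N):
        slots = [defaultdict(float) for _ in range(N + 1)]
        for word, pos, p in arcs:
            if word != EPS:
                slots[pos][word] += p
        output = []
        for n in range(1, N + 1):
            post = dict(slots[n])
            post[EPS] = 1.0 - sum(post.values())  # gaps and eps arcs end up here
            best = max(post, key=post.get)
            if best != EPS:
                output.append(best)
        return output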


CN Distance vs. Levenshtein Distance

- CN alignments cannot always represent Levenshtein alignments
- the CN distance is an upper bound to the Levenshtein distance

Example: given the three strings

    1. a b a b a
    2. b a b a c
    3. a b a c a

    Levenshtein distance:  Lev(1,2) = 2   Lev(1,3) = 1   Lev(2,3) = 2
    CN distance:           CN(1,2) = 2    CN(1,3) = 1    CN(2,3) = 3

The CN alignment of strings 2 and 3 cannot realize their Levenshtein alignment, hence CN(2,3) = 3 > Lev(2,3) = 2.


CN Construction

- CN construction for graphs: find g : E(L) → ℕ
- g(·) induces the Bayes risk decoder

      CN(X; g) := argmin_{e_1^N ∈ H} ∑_{f_1^M} p(f_1^M|X) L(e_1^N, f_1^M; g)

- goal: find g(·) such that the error on a training set [W_r, X_r]_{r=1}^{R} is minimized:

      g_opt := argmin_g ∑_{r=1}^{R} Lev(W_r, CN(X_r; g))

- other optimization criteria possible, e.g. choose g(·) such that the Bayes risk is minimized
- how to compute g_opt for LVCSR graphs?
  - heuristic 1: N-best lists, align entries to the Viterbi path (pivot element)
  - heuristic 2: cluster the edges in S


Edge Clustering Algorithm

An edge clustering algorithm:

- idea: use a fixed set of clusters
  - use a set of pivot edges to initialize the edge clusters
  - add more pivot edges if not all edges can be clustered
- all edges in a cluster must overlap in time
  - sort clusters by average edge start time
  - g(e_n) < g(e_{n+1}) holds for each e_1^N ∈ π(S)
- define an appropriate distance function between an edge e and an edge cluster E; usually based on
  - time overlap
  - word equality
  - pronunciation similarity
- weight the distance for edge e with p(e|X)
  ⇒ clustering is driven by likely edges


Edge Clustering Algorithm

Algorithm outline:

    initialize set of pivot edges P ← ∅
    initialize remaining edge set R ← E(S)
    while R is not empty:
        choose a set of pivot edges P_R from R
        P ← P ∪ P_R
        reinitialize remaining edge set R ← ∅
        initialize edge clusters from the pivot edges
        for e ∈ E(S) \ P sorted by weighted cluster distance:
            C ← nearest edge cluster of e
            if e and e' overlap for each e' ∈ C:
                assign e to C
            else:
                assign e to R

- open: how to find the pivot edges?


Edge Clustering Algorithm

- initialize clusters from
  - non-overlapping edges
  - most likely edges

Pivot edge set algorithm outline (a Python sketch of both steps follows below):

    select_pivot_arcs(E):
        initialize pivot edge set P ← ∅
        for e ∈ E sorted by p(e|X):
            if e' and e do not overlap for each e' ∈ P:
                assign e to P
        return P
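A minimal Python sketch of both steps, assuming arcs (word, t_begin, t_end, posterior); the weighted cluster distance is simplified to posterior-sorted assignment with a time-overlap criterion, so this illustrates the control flow rather than a particular distance function:

    # Pivot selection and edge clustering for CN construction.

    def overlap(e, f):
        """Number of time frames shared by arcs e and f."""
        return max(0, min(e[2], f[2]) - max(e[1], f[1]))

    def select_pivot_arcs(arcs):
        """Greedily pick mutually non-overlapping arcs, most likely first."""
        pivots = []
        for e in sorted(arcs, key=lambda a: -a[3]):
            if all(overlap(e, p) == 0 for p in pivots):
                pivots.append(e)
        return pivots

    def cluster_edges(arcs):
        """Assign every arc to a cluster all of whose members it overlaps;
        arcs that fit nowhere are re-pivoted in the next round."""
        clusters, rest = [], list(arcs)
        while rest:
            pivots = select_pivot_arcs(rest)
            clusters += [[p] for p in pivots]
            rest = [e for e in rest if e not in pivots]
            leftover = []
            for e in sorted(rest, key=lambda a: -a[3]):
                cands = [c for c in clusters if all(overlap(e, f) > 0 for f in c)]
                if cands:
                    max(cands, key=lambda c: overlap(e, c[0])).append(e)
                else:
                    leftover.append(e)
            rest = leftover
        return clusters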


Edge Clustering Algorithm

Example:

[Figure: a lattice over time stamps 0, 10, 32, 40 with arcs "eh:0-10", "hello:10-40", "hello:0-32", and "[si]:32-40".]

1. initial pivot arcs: "eh:0-10", "hello:10-40"
2. assign "hello:0-32" to the second cluster
3. new pivot arc: "[si]:32-40" (since the [si] arc does not overlap)


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications
   4.1 Why Word Graphs?
   4.2 Word Graph Generation using Word Pair Approximation
   4.3 Word Graph Pruning
   4.4 Multi-Pass Search
   4.5 Language Model Rescoring
   4.6 Path probabilities over word graphs
   4.7 Confidence Measures
   4.8 Bayes Risk Decoding
   4.9 System Combination
   4.10 Experiments
   4.11 References

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


System Combination

- observation: different ASR systems lead to different errors
- graphs are provided by J systems
- question: how to minimize the word error by combining the J graphs?
- how to compute a single hypothesis from J parallel word graphs?


System Combination

- idea: introduce the system as a hidden variable

      p(W|X) = ∑_{j=1}^{J} p(W|j, X) p(j),

  assumption: the system prior is independent of the acoustics X

  ⇒ the summation space is given by the graph union

      S := ∪_{j=1}^{J} L_j

  ⇒ path probability for a path through S:

      p(e_1^N|X) := ∑_{j=1}^{J} p(j) p_j(e_1^N|X),

  where p_j(e_1^N|X) = 0 if e_1^N ∉ π(L_j)


System Combination

Can we reduce graph combination to a single graph problem?

- construct a modified graph union S' (a sketch follows below):
  1. introduce a super initial state s_0
  2. for each system j: connect s_0 with the initial state of L_j by an edge e, where i(e) = ε and w(e) = p(j)/p_j(X)
- for a path e_1^N ∈ π(S') it holds that

      p(e_1^N|X) = p(j) p_j(e_1^N|X),

  i.e. the path score in S' equals the combined path probability

graph combination ⇔ Bayes risk decoding of S'
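A minimal sketch of this construction, assuming each system graph is a dict with an initial state and an arc list, and that states are made disjoint by tagging them with the system index; names are illustrative:

    # Modified graph union S': a super initial state s0 with one epsilon arc
    # per system, weighted p(j)/p_j(X), so every path scores p(j) * p_j(e|X).

    EPS = "<eps>"

    def graph_union(graphs, priors, evidences):
        s0 = "s0"
        arcs = [(s0, (j, g["initial"]), EPS, priors[j] / evidences[j])
                for j, g in enumerate(graphs)]
        for j, g in enumerate(graphs):
            arcs += [((j, u), (j, v), w, c) for u, v, w, c in g["arcs"]]
        return {"initial": s0, "arcs": arcs}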


System Combination

Alternative approaches:

- minimum frame error combination
- confusion network combination (CNC)
- ROVER


Minimum frame error combination

Rescoring function for Bayes risk decoder:

    c(e, S) := d(e) − ∑_{f ∈ E(S)} p(f|X) h(e, f) = ∑_{t=t_b(e)}^{t_e(e)−1} [1 − p_t(i(e)|X)]

Extension to system combination:

- compute frame-wise word posteriors as

      p_t(w|X) := ∑_{j=1}^{J} p(j) p_{j,t}(w|X)

- however: equivalent to computing p_t(w|X) from S'!


ROVER Alignment

Recognizer Output Voting Error Reduction (ROVER):

- Levenshtein alignment of the individual systems' outputs:
  1. select a supervisor system (usually the one with the lowest expected WER)
  2. align the other hypotheses to the supervisor
- system-dependent hypothesis W_j := w_{j,1}^{N_j} of the jth system
- alignment of the J system-dependent hypotheses:

      A := (A_1, A_2, ..., A_S)   with   A_s := (a_{s,1}, ..., a_{s,J})

- interpretation: at alignment position s the words w_{1,a_{s,1}}, w_{2,a_{s,2}}, ..., w_{J,a_{s,J}} compete, where a_{s,j} = 0 indicates the empty word, i.e. w_{j,0} := ε


ROVER Decoding

ROVER (ctd.):

- use word-wise scores
- majority voting:

      c_j(s, w) := 1 if w_{j,a_{s,j}} = w, and 0 otherwise

- confidence voting:

      c_j(s, w) := p_{j,a_{s,j}}(w|X)  if a_{s,j} ≠ 0 and w = w_{j,a_{s,j}}
                   p_0                 if a_{s,j} = 0 and w = ε
                   0                   otherwise

  where p_0 is a constant confidence for a deletion

- decode alignment position-wise (a sketch follows below):

      Ŵ = [ argmax_{w ∈ Σ ∪ {ε}} ∑_{j=1}^{J} c_j(s, w) ]_{s=1}^{S}
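A minimal sketch of this position-wise voting over a precomputed alignment; align[s][j] is the word of system j at position s (None for a gap), conf[s][j] its confidence, and p0 the deletion confidence. Names are illustrative:

    from collections import defaultdict

    EPS = "<eps>"

    def rover_vote(align, conf, p0=0.5):
        output = []
        for s, row in enumerate(align):
            scores = defaultdict(float)
            for j, word in enumerate(row):
                if word is None:
                    scores[EPS] += p0      # empty-word slot
                else:
                    scores[word] += conf[s][j]
            best = max(scores, key=scores.get)
            if best != EPS:
                output.append(best)
        return output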


Confusion Network Combination (CNC)

Confusion Network Combination (CNC):

- Levenshtein alignment of the system-dependent CNs
- ROVER vs. CNC: information provided by the jth system at position n

      word   ROVER majority   ROVER conf.       CNC
      w1     1                p_{j,n}(w1|X)     p_{j,n}(w1|X)
      w2     0                0                 p_{j,n}(w2|X)
      w3     0                0                 p_{j,n}(w3|X)
      ...    ...              ...               ...

  (words sorted by p_n(w|X) in decreasing order)
- the CN provides at each position a probability distribution over all words
- Levenshtein alignment of CNs with a local cost defined between probability distributions


Confusion Network Combination (CNC)

CNC (ctd.):

- alignment between two CNs:

      A := (A_1, A_2, ...)   with   A_s := (a_{s,1}, a_{s,2})

- interpretation: combined position-wise word posterior probability distribution at alignment position s:

      p_s(w|X) := ∑_{j=1}^{2} p(j) p_{j,a_{s,j}}(w|X),   where   p_{j,0}(ε|X) := 1

- the alignment result is a CN of length S
  ⇒ expected (position-wise) error for alignment A:

      c(A) = ∑_{s=1}^{S} [ 1 − max_{w ∈ Σ ∪ {ε}} p_s(w|X; A) ]

  ⇒ minimizing c(A) ⇒ Bayes decision rule using the CN distance


Confusion Network Combination (CNC)

Levenshtein alignment of CNs (a sketch follows below):

- align the 1st CN and the 2nd CN:

      C(k, l) := min { C(k−1, l) + c(k, 0),
                       C(k−1, l−1) + c(k, l),
                       C(k, l−1) + c(0, l) }

  with the local cost minimizing the position-wise/local word error:

      c(k, l) := 1 − max_{w ∈ Σ ∪ {ε}} [ p(1) p_{1,k}(w|X) + p(2) p_{2,l}(w|X) ]

- align the combined (1+2)nd CN and the 3rd CN
- ...
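A minimal sketch of this DP, assuming each CN is a list of slot posteriors (dict word -> probability, ε included) with system priors p(1), p(2); only the total cost is computed here, backtracking would recover the combined CN. Names are illustrative:

    # Levenshtein alignment of two confusion networks with local cost
    # c(k, l) = 1 - max_w [p1 * p_{1,k}(w|X) + p2 * p_{2,l}(w|X)].

    EPS = "<eps>"

    def local_cost(slot1, slot2, p1, p2):
        words = set(slot1) | set(slot2)
        return 1.0 - max(p1 * slot1.get(w, 0.0) + p2 * slot2.get(w, 0.0)
                         for w in words)

    def cnc_align_cost(cn1, cn2, p1=0.5, p2=0.5):
        eps = {EPS: 1.0}                 # a gap behaves like a pure-eps slot
        K, L = len(cn1), len(cn2)
        C = [[0.0] * (L + 1) for _ in range(K + 1)]
        for k in range(1, K + 1):
            C[k][0] = C[k - 1][0] + local_cost(cn1[k - 1], eps, p1, p2)
        for l in range(1, L + 1):
            C[0][l] = C[0][l - 1] + local_cost(eps, cn2[l - 1], p1, p2)
        for k in range(1, K + 1):
            for l in range(1, L + 1):
                C[k][l] = min(C[k - 1][l] + local_cost(cn1[k - 1], eps, p1, p2),
                              C[k - 1][l - 1] + local_cost(cn1[k - 1], cn2[l - 1], p1, p2),
                              C[k][l - 1] + local_cost(eps, cn2[l - 1], p1, p2))
        return C[K][L]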


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications
   4.1 Why Word Graphs?
   4.2 Word Graph Generation using Word Pair Approximation
   4.3 Word Graph Pruning
   4.4 Multi-Pass Search
   4.5 Language Model Rescoring
   4.6 Path probabilities over word graphs
   4.7 Confidence Measures
   4.8 Bayes Risk Decoding
   4.9 System Combination
   4.10 Experiments
   4.11 References

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Experiments: Chinese 230h Test System

- 230h training data (BN/BC)
- ML-trained speaker-normalized models, 1M Gaussians
- 60K vocabulary, 4-gram LM
- two-pass recognition (speaker adaptation, LM rescoring)
- systems vary in the
  - acoustic front-ends: MFCC, PLP, or Gammatone features
  - phonetic decision tree: randomized CART estimation

Chinese Development and Test Sets

                          running           vocabulary
    name         speech   words    chars    words    chars
    gale-dev07   2.5h     27.1K    46.8K    5.2K     1.9K
    gale-eval07  1.6h     -        28.1K    -        1.7K
    gale-dev08   1.0h     10.5K    18.2K    2.9K     1.4K


Experiments: English TC-Star 2007 Evaluation System

- 92h training data (European parliament plenary sessions, EPPS)
- single, globally pooled, diagonal covariance matrix
- MPE-trained speaker-normalized models, 1M Gaussians
- 50K vocabulary, 4-gram LM
- two-pass recognition (speaker adaptation, LM rescoring)
- four systems that differ mainly in
  - system 1: MFCC front-end
  - system 2: Gammatone front-end
  - system 3: MFCC front-end + NN-based phoneme posteriors
  - system 4: MFCC front-end + 190h unsupervised EPPS training data

English EPPS Development and Test Sets

    name          speech   running words
    epps-dev06    3.2h     27.0K
    epps-eval06   3.2h     30.0K
    epps-eval07   2.9h     27.0K


Experiments: English TC-Star 2007 Cross-Site System Combination

- 92h training data (European parliament plenary sessions, EPPS)
- 190h unsupervised training data (EPPS, optional)
- lattices provided by
  - FBK/IRST, Trento, Italy (IRST)
  - CNRS/LIMSI, Paris, France (LIMSI)
  - RWTH Aachen University, Germany (RWTH)
  - University of Karlsruhe, Germany (UKA)
- if a site does internal system combination, e.g. RWTH Aachen, lattices are from the best single system


Bayes Risk Decoding: Chinese 230h Test System

                          CER[%]
    System   Decoder   dev07¹   eval07   dev08
    s1       Viterbi   14.54    15.08    13.28
             CN        14.32    14.95    13.10
    s2       Viterbi   14.82    15.02    13.54
             CN        14.52    14.74    13.35
    s3       Viterbi   15.07    15.60    13.80
             CN        14.86    15.42    13.67

    ¹ tuning set; CER: (Chinese) Character Error Rate


Bayes Risk Decoding: English TC-Star 2007 Evaluation System

                          WER[%]
    System   Decoder   dev06    eval06¹   eval07
    s1       Viterbi   11.09    8.43      9.81
             CN        11.97    8.83      10.48
    s2       Viterbi   11.89    8.70      10.07
             CN        11.87    9.31      11.57
    s3       Viterbi   12.43    8.98      10.76
             CN        10.73    8.22      9.57
    s4       Viterbi   12.06    9.44      11.73
             CN        11.42    8.61      9.78

    ¹ tuning set


Bayes Risk Decoding: English TC-Star 2007 Cross-Site

                          WER[%]
    System   Decoder   eval06¹   eval07
    LIMSI    Viterbi   8.16      9.13
             CN        8.07      8.96
    RWTH     Viterbi   8.46      9.71
             CN        8.24      9.54
    UKA      Viterbi   8.80      10.22
             CN        8.98      10.36
    IRST     Viterbi   10.09     9.81
             CN        10.06     9.82

    ¹ tuning set


Bayes Risk Decoding: Frame Error vs. CN Distance

              loss         frame        (del/ins) err
    System    function     norm.        dev¹                  eval
    Chinese   frame err.   hypothesis   (2.92/1.38) 14.35     (3.01/0.75) 13.13
              frame err.   symmetric    (2.52/1.61) 14.23     (2.75/0.94) 13.11
              CN-dist.     n.a.         (2.79/1.45) 14.30     (2.85/0.80) 13.05
    English   frame err.   hypothesis   (1.95/1.15) 8.08      (2.22/0.99) 9.00
              frame err.   symmetric    (1.68/1.32) 8.05      (1.84/1.15) 9.00
              CN-dist.     n.a.         (1.65/1.33) 8.07      (1.76/1.18) 8.96

    ¹ tuning set; Chinese: CER[%] (Character Error Rate), English: WER[%]


System Combination: Chinese 230h Test System

                                  CER[%]
    System      Combination   dev07¹   eval07   dev08
    s1          -             14.54    15.08    13.28
    s2          -             14.82    15.02    13.54
    s3          -             15.07    15.60    13.80
    s1+s2       union         13.54    13.96    12.43
                CNC           13.55    13.95    12.49
                ROVER         13.63    14.09    12.61
    s1+s2+s3    union         13.15    13.65    12.19
                CNC           13.16    13.74    12.14
                ROVER         13.22    13.86    12.47
                union CN      13.15    13.65    12.19

    ¹ tuning set; CER: (Chinese) Character Error Rate


System Combination: English TC-Star 2007 Evaluation System

                                      WER[%]
    System         Combination   dev06   eval06¹   eval07
    s1             -             11.09   8.43      9.81
    s2             -             11.89   8.70      10.07
    s3             -             12.43   8.98      10.76
    s4             -             12.06   9.44      11.73
    s1+s2          union         10.21   7.79      8.97
                   CNC           10.22   7.82      8.98
                   ROVER         10.54   7.90      9.11
    s1+s2+s3       union         10.21   7.73      8.96
                   CNC           10.14   7.70      8.98
                   ROVER         10.42   7.73      9.17
    s1+s2+s3+s4    union         10.33   7.59      8.94
                   CNC           10.22   7.59      8.92
                   ROVER         10.70   7.67      9.15

    ¹ tuning set


System Combination: English TC-Star 2007 Cross-Site

                                              WER[%]
    System                    Combination   eval06¹   eval07
    LIMSI                     -             8.16      9.13
    RWTH                      -             8.46      9.71
    UKA                       -             8.80      10.22
    IRST                      -             10.09     9.81
    LIMSI+RWTH                union         6.46      7.67
                              CNC           6.38      7.51
                              ROVER         6.69      7.85
    LIMSI+RWTH+UKA            union         6.38      7.63
                              CNC           6.27      7.24
                              ROVER         6.32      7.77
    LIMSI+RWTH+UKA+IRST       union         6.28      7.36
                              CNC           6.14      7.12
                              ROVER         6.21      7.26

    ¹ tuning set


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications
   4.1 Why Word Graphs?
   4.2 Word Graph Generation using Word Pair Approximation
   4.3 Word Graph Pruning
   4.4 Multi-Pass Search
   4.5 Language Model Rescoring
   4.6 Path probabilities over word graphs
   4.7 Confidence Measures
   4.8 Bayes Risk Decoding
   4.9 System Combination
   4.10 Experiments
   4.11 References

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


References: Word Graphs

- S. Ortmanns, H. Ney, and X. Aubert. A word graph algorithm for large vocabulary continuous speech recognition. Comput. Speech Lang., vol. 11, pp. 43-72, Jan. 1997.
- H. Ney, S. Ortmanns, and I. Lindam. Extensions to the word graph method for large vocabulary continuous speech recognition. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Munich, Germany, Apr. 1997, pp. 1787-1790.
- A. Sixtus and S. Ortmanns. High quality word graphs using forward-backward pruning. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Phoenix, AZ, Mar. 1999, pp. 593-596.


References: Confidence Measures

- F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 288-298, March 2001.
- T. Kemp and T. Schaaf. Estimating confidence using word lattices. In Proc. 5th Eur. Conf. Speech Communication Technology 1997, Rhodes, Greece, Sept. 1997, pp. 827-830.
- L. Chase. Word and acoustic confidence annotation for large vocabulary speech recognition. In Proc. 5th Eur. Conf. Speech Communication Technology, Rhodes, Greece, Sept. 1997, pp. 815-818.
- S. Cox and R. Rose. Confidence measures for the Switchboard database. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1996, Atlanta, GA, May 1996, pp. 511-514.


References: Confidence Measures

- C. Neti, S. Roukos, and E. Eide. Word-based confidence measures as a guide for stack search in speech recognition. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Munich, Germany, Apr. 1997, pp. 883-886.
- B. Rueber. Obtaining confidence measures from sentence probabilities. In Proc. 5th Eur. Conf. Speech Communication Technology 1997, Rhodes, Greece, Sept. 1997, pp. 739-742.
- T. Schaaf and T. Kemp. Confidence measures for spontaneous speech recognition. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Munich, Germany, Apr. 1997, pp. 875-878.
- M. Siu, H. Gish, and F. Richardson. Improved estimation, evaluation and applications of confidence measures for speech recognition. In Proc. 5th Eur. Conf. Speech Communication Technology 1997, Rhodes, Greece, Sept. 1997, pp. 831-834.
- S. Young. Detecting misrecognitions and out-of-vocabulary words. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1994, Adelaide, Australia, Apr. 1994, pp. 21-24.


References: Bayes Risk Decoding

- A. Stolcke, Y. Konig, and M. Weintraub. Explicit word error minimization in N-best list rescoring. Eurospeech, 1997.
- V. Goel, W. Byrne, and S. Khudanpur. LVCSR rescoring with modified loss functions: a decision theoretic perspective. ICASSP, 1998.
- V. Goel, Sh. Kumar, and W. Byrne. Segmental minimum Bayes-risk ASR voting strategies. ICSLP, 2000.
- V. Goel, Sh. Kumar, and W. Byrne. Confidence based lattice segmentation and minimum Bayes-risk decoding. Eurospeech, 2001.
- F. Wessel, R. Schlüter, and H. Ney. Explicit word error minimization using word hypothesis posterior probabilities. ICASSP, 2001.
- R. Schlüter, Th. Scharrenbach, V. Steinbiss, and H. Ney. Bayes risk minimization using metric loss functions. Eurospeech, 2005.


References: Bayes Risk Decoding

- H. Xu, D. Povey, J. Zhu, and G. Wu. Minimum hypothesis phone error as a decoding method for speech recognition. Interspeech, 2009.
- A. Stolcke et al. The SRI March 2000 Hub-5 conversational speech system. NIST Speech Transcription Workshop, 2000.
- D. Hakkani and G. Riccardi. A general algorithm for word graph matrix decomposition. ICASSP, 2003.
- J. Xue and Y. Zhao. Improved confusion network algorithm and shortest path search from word lattice. ICASSP, 2005.
- L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. CSL, 2000.
- B. Hoffmeister, R. Schlüter, and H. Ney. Bayes risk approximations using time overlap with an application to system combination. Interspeech, 2009.


References: Model and System Combination

- P. Beyerlein. Discriminative model combination. ASRU, 1997.
- D. Vergyri. Integration of multiple knowledge sources in speech recognition using minimum error training. PhD thesis, 2000.
- A. Zolnay, R. Schlüter, and H. Ney. Acoustic feature combination for robust speech recognition. ICASSP, 2005.
- B. Hoffmeister, R. Liang, R. Schlüter, and H. Ney. Log-linear model combination with word-dependent scaling factors. Interspeech, 2009.
- J. Fiscus. A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER). ASRU, 1997.
- A. Stolcke et al. The SRI March 2000 Hub-5 conversational speech transcription system. NIST Speech Transcription Workshop, 2000.


References: Model and System Combination

- G. Evermann and P. Woodland. Posterior probability decoding, confidence estimation and system combination. NIST Speech Transcription Workshop, 2000.
- I.-F. Chen and L.-S. Lee. A new framework for system combination based on integrated hypothesis space. Interspeech, 2006.
- R. Zhang and A. Rudnicky. Investigations of issues for using multiple acoustic models to improve continuous speech recognition. Interspeech, 2006.
- D. Hillard, B. Hoffmeister, M. Ostendorf, R. Schlüter, and H. Ney. iROVER: improving system combination with classification. HLT, 2007.
- B. Hoffmeister, R. Schlüter, and H. Ney. iCNC and iROVER: the limits of improving system combination with classification? Interspeech, 2008.


References: Model and System Combination

- H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig. The IBM 2004 conversational telephony system for rich transcription. ICASSP, 2005.
- S. Stüker, Ch. Fügen, S. Burger, and M. Wölfel. Cross-system adaptation and combination for continuous speech recognition: the influence of phoneme set and acoustic front-end. Interspeech, 2006.
- D. Giuliani and F. Brugnara. Experiments on cross-system acoustic model adaptation. ASRU, 2007.
- B. Hoffmeister, Ch. Plahl, P. Fritz, G. Heigold, J. Lööf, R. Schlüter, and H. Ney. Development of the 2007 RWTH Mandarin LVCSR system. ASRU, 2007.


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Idea

At every time τ a separate copy of the lexical prefix tree (time conditioned tree copy) with the hypotheses Q_τ(t, s) is started.

[Figure: time axis with a tree copy started at time τ and evaluated up to time t.]

Possible advantage: using beam search, only a few time conditioned tree copies are active (experiments will show 40-50).


Time Conditioned Search: Approach

- time synchronous evaluation of Bayes' decision rule as before
- full sequence joint probability:

      Pr(w_1 ... w_N) · Pr(x_1 ... x_T | w_1 ... w_N)

- definitions:
  - probability that word w generated the acoustic vectors x_{τ+1}, ..., x_t:

        h(w; τ, t) := Pr(x_{τ+1}^{t} | w)

  - conditional probability for the generation of the acoustic vectors x_1, ..., x_t and a word sequence w_1^n that ends at time t:

        G(w_1^n; t) := Pr(w_1^n) · Pr(x_1^t | w_1^n)


Time Conditioned Search: Approach

- decomposition (recall word graph generation): the vectors x_1, ..., x_τ are covered by G(w_1^{n−1}; τ), the vectors x_{τ+1}, ..., x_t by h(w_n; τ, t), and the word boundary τ is maximized over.

[Figure: time axis 0 ... τ ... t ... T with the words w_1 ... w_{n−1} ending at time τ and word w_n spanning τ+1 ... t.]

- optimize over the word boundary τ:

      G(w_1^n; t) = max_τ { Pr(w_n | w_1^{n−1}) · G(w_1^{n−1}; τ) · h(w_n; τ, t) }
                  = Pr(w_n | w_1^{n−1}) · max_τ { G(w_1^{n−1}; τ) · h(w_n; τ, t) }

- so far: no optimization of the word boundaries: organization of word successor hypotheses as a tree


Time Conditioned Search: Approach

- so far: no optimization of the word boundaries: organization of word successor hypotheses as a tree

[Figure: word conditioned expansion of tree copies; after each word end (A, B, or C) a new tree copy over the vocabulary {A, B, C} is started, one copy per predecessor word sequence.]


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: M-Gram Language Models

Consider the m-gram language model p(u_m | u_1^{m−1}).

- distinguish word sequences by their last (m − 1) words
- definition: conditional probability for an acoustic vector sequence x_1 ... x_t and a word sequence that ends at time t with the word subsequence u_2^m:

      H(u_2^m; t) = max_{w_1^n} { Pr(w_1^n) · Pr(x_1^t | w_1^n) : w_{n−m+2}^{n} = u_2^m }

- optimize word ends (dynamic programming):

      H(u_2^m; t) = max_{u_1} { p(u_m | u_1^{m−1}) · max_τ [ H(u_1^{m−1}; τ) · h(u_m; τ, t) ] }

- using a bigram language model p(w|v):

      H(w; t) = max_v { p(w|v) · max_τ [ H(v; τ) · h(w; τ, t) ] }


Time Conditioned Search

Consider the hypothesis that word w ends at time t.

[Figure: several tree copies started at times τ_1, τ_2, τ_3 all reach the word end state of w at time t.]

Optimization over the word boundary/word start time at the word end!


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Implementation

Principle:

- time synchronous evaluation of the hypotheses (as before)
- lexical search tree
- start a separate lexical tree at every time frame
- define: Q_τ(t, s) = probability of the best partial path that generates the acoustic vectors x_1 ... x_t up to a state s at time t of a lexical prefix tree with start time τ


Time Conditioned Search: Implementation

Dynamic programming recursion for bigram language models:

- initialization (best predecessor hypothesis):

      Q_{t−1}(t−1, s) = max_u H(u; t−1)   if s = 0
                        0.0               if s > 0

- within words:

      Q_τ(t, s) = max_σ { p(x_t, s | σ) · Q_τ(t−1, σ) }

- word boundaries:

      H(w; t) = max_v p(w|v) · max_τ { H(v; τ) · Q_τ(t, S_w) / max_u H(u; τ) }
              = max_v p(w|v) · max_τ { H(v; τ) · h(w; τ, t) }


Time Conditioned Search: Implementation

Properties of the time conditioned search:

- time synchronous
- store the score H(v; τ) for corrections and modifications of the word boundary optimization
- separate word boundary optimization over start time and predecessor word
- no backpointers


Time Conditioned Search: Implementation

    proceed over time t from left to right

      ACOUSTIC LEVEL: process hypotheses Q_τ(t, s)
        - initialization:
              Q_{t−1}(t−1, s) = max_u H(u; t−1) if s = 0, else 0.0
        - time alignment: Q_τ(t, s) using the DP recursion
        - prune unlikely hypotheses
        - purge bookkeeping lists

      WORD PAIR LEVEL: process hypotheses Q_τ(t, S_w)
        for each pair (w; t) do
          - word score: h(w; τ, t) = Q_τ(t, S_w) / max_u H(u; τ)
          - best predecessor and best boundary:
                [v_0, τ_0](w, t) := argmax_{v,τ} { p(w|v) h(w; τ, t) H(v; τ) }
          - store [v_0, τ_0](w, t)
          - tree start-up score:
                H(w; t) = p(w | v_0(.)) h(w; τ_0(.), t) H(v_0(.); τ_0(.))
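A minimal Python sketch of this loop, assuming a bigram LM; expand_tree() stands in for the within-word DP recursion including pruning, and word_end_score() returns Q_τ(t, S_w). All names are illustrative placeholders:

    def time_conditioned_search(T, vocab, lm, expand_tree, word_end_score):
        H = {0: {"<s>": 1.0}}       # H[t][v]: score of word v ending at time t
        best_H = {0: 1.0}           # max_u H(u; t), divided out of each tree copy
        Q = {}                      # active tree copies, keyed by start time tau
        for t in range(1, T + 1):
            # ACOUSTIC LEVEL: start a new tree copy, advance all active copies
            Q[t - 1] = {0: best_H[t - 1]}         # root state 0
            for tau in list(Q):
                Q[tau] = expand_tree(Q[tau], t)   # time alignment + pruning
                if not Q[tau]:
                    del Q[tau]
            # WORD PAIR LEVEL: word boundary optimization at the word ends
            H[t], best_H[t] = {}, 0.0
            for w in vocab:
                best = 0.0
                for tau, states in Q.items():
                    if best_H[tau] <= 0.0:
                        continue
                    h = word_end_score(states, w) / best_H[tau]  # h(w; tau, t)
                    best = max(best, max((lm(w, v) * h * Hv
                                          for v, Hv in H[tau].items()), default=0.0))
                if best > 0.0:
                    H[t][w] = best
                    best_H[t] = max(best_H[t], best)
        return H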


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Extension to Trigram Language Models

- trigram language model p(w | u v):

      H(v, w; t) = max_u { p(w | u v) · max_τ [ H(u, v; τ) · h(w; τ, t) ] }

- dynamic programming recursion initialization (best word end hypothesis):

      Q_{t−1}(t−1, s) = max_{(u,v)} H(u, v; t−1)   if s = 0
                        0.0                        if s > 0

- within words:

      Q_τ(t, s) = max_σ { p(x_t, s | σ) · Q_τ(t−1, σ) }

- word boundaries:

      H(v, w; t) = max_u p(w | u v) · max_τ [ H(u, v; τ) · Q_τ(t, S_w) / max_{(u',v')} H(u', v'; τ) ]


Time Conditioned Search: Experimental Results

Test: NAB'94 – H1 Development

- vocabulary: 20000 words
- 10 male and 10 female speakers
- total of 310 sentences = 7387 spoken words
- unknown words: 199 (out-of-vocabulary rate: 2.7%)
- test set perplexities:
  - bigram: PP_bi = 198.1
  - trigram: PP_tri = 130.2

Acoustic modelling:

- trained on WSJ0 and WSJ1 (284 speakers; 80h of speech)
- phoneme models: 4688 (43+1 monophones, 557 diphones, 4087 triphones)
- tied states: 4623 (6-state HMM)
- densities: 290 000 per gender


Time Conditioned Search: Experimental Results, Word Conditioned vs. Time Conditioned Tree Search

Bigram language model, PP=198:

                  average number of                         del - ins   WER[%]
    Search        States   Arcs    Trees   Words   LM Rec
    word          5022     1416    20      86      86       2.5 - 2.7   17.1
    conditioned   9335     2593    28      156     156      2.5 - 2.7   16.8
                  26493    7222    45      540     540      2.4 - 2.6   16.5
                  39146    10528   53      610     610      2.4 - 2.6   16.4
                  52786    13968   60      560     560      2.4 - 2.6   16.4
    time          6175     1781    27      117     1539     2.4 - 2.9   17.3
    conditioned   11280    3101    31      197     2976     2.4 - 2.7   16.7
                  30019    7998    37      520     8761     2.4 - 2.6   16.5
                  44289    11679   41      777     13162    2.4 - 2.6   16.5
                  60692    15877   43      1075    18081    2.4 - 2.6   16.4


Time Conditioned Search: Experimental Results, Word Conditioned vs. Time Conditioned Tree Search

Trigram language model, PP=122:

                  average number of                         del - ins   WER[%]
    Search        States   Arcs    Trees   Words   LM Rec
    word          4714     1392    29      80      80       1.7 - 2.9   14.8
    conditioned   18734    5430    70      294     294      1.6 - 2.8   14.2
                  48940    13877   112     702     702      1.6 - 2.8   14.1
                  67541    18764   130     953     953      1.6 - 2.7   14.0
                  86772    23688   145     1223    1223     1.6 - 2.7   13.9
    time          8573     2459    28      136     1477     1.6 - 2.9   14.5
    conditioned   15103    4194    31      234     2806     1.6 - 2.8   14.1
                  24914    6780    34      388     4858     1.6 - 2.7   14.0
                  38323    10299   38      606     7663     1.6 - 2.7   14.0
                  71757    18953   42      1202    14293    1.6 - 2.7   14.0

From: S. Ortmanns, H. Ney, F. Seide, I. Lindam, "A Comparison of Time Conditioned and Word Conditioned Search Techniques for Large Vocabulary Speech Recognition," Proc. Int. Conf. on Spoken Language Processing, Philadelphia, PA, pp. 2091-2094, October 1996.


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

For the time conditioned search define:

    Q_τ(t, s) := max_{w_1^{n−1}} { Pr(w_1^{n−1}) · max_{s_1^t} Pr(x_1^t, s_1^t | w_1^{n−1}) : s_τ = S_{w_{n−1}}; s_t = s }

Review: word conditioned tree search for m-gram language models p(v_m | v_1^{m−1}):

    Q_{v_1^{m−1}}(t, s) := max_{w_1^{n−1}} { Pr(w_1^{n−1}) · max_{s_1^t} Pr(x_1^t, s_1^t | w_1^{n−1}) : w_{n−m+1}^{n−1} = v_1^{m−1}; s_t = s }


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

[Figure: the same hypothesis (t, s) as a state in a time conditioned tree copy started at time τ, and as a state in the word conditioned tree copy for history v_1^{m−1}.]


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

- the path with the history v_1^{m−1} that ends at time t in state s must have reached state S_{v_{m−1}} once at time τ: s_τ := S_{v_{m−1}}
- this path is split into two partial paths at time τ, where time τ has to be optimized:

    Q_{v_1^{m−1}}(t, s)
      = max_τ [ max_{w_1^{n−1}} { Pr(w_1^{n−1}) · max_{s_1^τ} Pr(x_1^τ, s_1^τ | w_1^{n−1}) : w_{n−m+1}^{n−1} = v_1^{m−1}; s_τ = S_{v_{m−1}} }
                · max_{s_{τ+1}^t} { Pr(x_{τ+1}^t, s_{τ+1}^t | ·) : s_t = s } ]

      = max_τ [ H(v_1^{m−1}; τ) · max_{s_{τ+1}^t} { Pr(x_{τ+1}^t, s_{τ+1}^t | ·) : s_t = s } ]

      = max_τ [ H(v_1^{m−1}; τ) · Q_τ(t, s) / max_{u_1^{m−1}} H(u_1^{m−1}; τ) ]


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

    Q_{v_1^{m−1}}(t, s) = max_τ [ H(v_1^{m−1}; τ) / max_{u_1^{m−1}} H(u_1^{m−1}; τ) · Q_τ(t, s) ]   for all s with s ≠ 0.

Especially for a word boundary s = S_{v_m} we get:

    Q_τ(t, S_{v_m}) = max_{u_1^{m−1}} H(u_1^{m−1}; τ) · h(v_m; τ, t)

Substitute:

    Q_{v_1^{m−1}}(t, S_{v_m}) = max_τ [ H(v_1^{m−1}; τ) · h(v_m; τ, t) ]


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

Word conditioned search with recombination for s = 0:

    Q_{v_2^m}(t, s=0) = max_{v_1} { p(v_m | v_1^{m−1}) · Q_{v_1^{m−1}}(t, S_{v_m}) }
                      = max_{v_1} { p(v_m | v_1^{m−1}) · max_τ H(v_1^{m−1}; τ) · h(v_m; τ, t) }
                      = H(v_2^m; t).

- it is possible to convert scores of the time conditioned search into scores of the word conditioned search
- the opposite conversion is usually impossible because of the optimization over τ, which is done implicitly in word conditioned search
  → time conditioned search is "less compact" without further optimization/pruning


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Across-Word Contexts

- Across-word contexts lead to multiple tree roots, depending on the left phoneme context of the words to be started.
- Problem: time conditioned search assumes a unique tree root, relative to which word probabilities can be computed.
- Observation: coarticulated root states are recombined after the first generation.
- Idea: partition the tree into a word-conditioned root network, followed by time-conditioned subtrees.
- Subtrees are compatible with the time conditioned search structure, i.e. they have unique tree roots.

[Figure: lexical tree partitioned into a word-conditioned root network followed by time-conditioned subtrees Ω0, Ω1, Ω2, Ω3.]


Time Conditioned Search: Pruning

Standard pruning:
- As for word conditioned search, acoustic and language model pruning are applied.
- Acoustic pruning:
  - Acoustic pruning threshold:

      Q_{AC}(t) := \max_{\tau, s} Q_\tau(t, s).

  - Discard all hypotheses (τ, s) at time t for which

      Q_\tau(t, s) < f_{AC} \cdot Q_{AC}(t)

    (see the sketch below).
  - Note: implicitly, pruning affects all word end hypotheses attached to a search tree.
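The beam criterion above is straightforward to implement once the hypothesis scores are available. A minimal sketch, assuming hypotheses are kept in a dict keyed by (τ, s) with probability-domain scores (names are illustrative, not from the original system):

```python
def acoustic_prune(hyps, f_ac):
    """Beam pruning for time conditioned search.

    hyps: dict mapping (tau, s) -> Q_tau(t, s) for the current frame t.
    f_ac: acoustic pruning factor, 0 < f_ac < 1 (probability-domain scores).
    Returns the surviving hypotheses.
    """
    if not hyps:
        return hyps
    q_ac = max(hyps.values())            # Q_AC(t) = max_{tau,s} Q_tau(t, s)
    threshold = f_ac * q_ac
    return {key: q for key, q in hyps.items() if q >= threshold}
```

In practice scores are kept in the log domain, so the multiplicative factor f_AC becomes an additive beam offset.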


Time Conditioned Search: Pruning

- Language model pruning:
  - Language model recombination/word boundary optimization:

      H(v_2^m; t) = \max_{v_1} \Big\{ p(v_m \mid v_1^{m-1}) \cdot \max_\tau \frac{H(v_1^{m-1}; \tau) \cdot Q_\tau(t, S_{v_m})}{\max_{u_1^{m-1}} H(u_1^{m-1}; \tau)} \Big\}

      Q_t(t, 0) = \max_{v_2^m} H(v_2^m; t).

  - Pruning threshold:

      Q_{LM}(t) := \max_{u_2^M} H(u_2^M; t).

  - Define the set E(t) of all word end hypotheses surviving language model pruning at a given time frame:

      E(t) := \big\{ (v_1^{m-1}, q) \;\big|\; q = H(v_1^{m-1}; t),\; q > f_{LM} \cdot Q_{LM}(t) \big\}

  - Attach E(t) to the corresponding search tree started at time t.


Time Conditioned Search: Pruning

- LM look-ahead:
  - Only unigram LM look-ahead is efficient.
  - LM look-ahead with context dependency (bigram or higher) requires maximization over all contexts; the information gain does not pay off w.r.t. efficiency.
- Efficiency problems:
  - Only the single best ending word hypothesis is used to start a tree at a given time frame.
  - Word end hypotheses with different probabilities are attached to the search tree.
    ⇒ Some of the word end hypotheses might already be below the pruning thresholds.
  - LM recombination builds on all word end hypotheses with varying start times.
  - LM pruning is applied afterwards.
    ⇒ Word boundary optimization is delayed.


Time Conditioned Search: Pruning

Early word end pruning:
- To prevent too many language model recombinations, apply acoustic pruning to expanded word end hypotheses before computing the LM probability (cf. word conditioned search).
- Word boundary optimization:

    Q_{v_1^{m-1}}(t) := \max_\tau \frac{H(v_1^{m-1}; \tau) \cdot Q_\tau(t, S_{v_m})}{\max_{u_1^{m-1}} H(u_1^{m-1}; \tau)}

- Acoustic pruning on expanded word end hypotheses: discard all word end hypotheses (v_1^{m-1}; t) with

    Q_{v_1^{m-1}}(t) < f_{AC} \cdot Q_{AC}(t)

- LM recombination on the remaining hypotheses:

    H(v_2^m; t) = \max_{v_1} \big\{ p(v_m \mid v_1^{m-1}) \cdot Q_{v_1^{m-1}}(t) \big\}

    Q_t(t, 0) = \max_{v_2^m} H(v_2^m; t).

- Anticipated LM pruning: use/update the current best hypothesis.


Time Conditioned Search: Pruning

Context recombination:
- In time conditioned search, word boundaries are recombined only at the word ends, whereas in word conditioned search, word boundaries are recombined implicitly early on.
- Search trees for adjacent start times can be expected to expand to the same state hypotheses (and therefore also the same word end hypotheses).
- Idea: discard word end hypothesis v_1^{m-1} from a specific time frame if all expanded state hypotheses with respect to this word end hypothesis within one tree are lower than the equivalent state hypotheses within another tree.
- A hypothesis (v_1^{m-1}, q) ∈ E(τ), ending in the word sequence v_1^{m-1} with probability q, can be removed from E(τ) if:

    \frac{q \cdot Q_\tau(t, s)}{Q_\tau(\tau, 0)} < \max_{\tau', q' : (v_1^{m-1}, q') \in E(\tau')} \frac{q' \cdot Q_{\tau'}(t, s)}{Q_{\tau'}(\tau', 0)}


Time Conditioned Search: Pruning

- The complexity of context recombination is problematic: all active contexts v_1^{m-1} and states s have to be visited.
- Approximation: perform context recombination not in every time frame, but at certain intervals.

Further speed-up of time conditioned search:
- Using an HMM with skip transition, a word end hypothesis still is generated only from the last state of a word, even though a skip from the second to last state would also leave the word's HMM.
- Nevertheless, this approximation does not have a significant effect on the search result.
- In the same way it can be expected that the effect would be small if word starts were not allowed at every time frame.
- Approach: only allow word starts (i.e. tree starts) at certain intervals.


Time Conditioned Search: Pruning: Results

Search space comparison:
- Average number of different active word end state hypotheses per time frame, independent of context and start time.
- Average number of word end hypotheses per time frame.

Results on a small subset of the NAB'94 H1 dev corpus with 20k word vocabulary, with relaxed pruning constraints:

                                      search trees conditioned on:
                                      word context [k]   start time [k]
  word end states                           5.00              5.5
  word ends: w/o early pruning              5.00           1150.0
             with early pruning                -             11.3
             after pruning                  1.60              8.3
             after recombination            0.77              1.5


Time Conditioned Search: Pruning: Results

Experiments on European Parliament Plenary Sessions (EPPS) English:
- Vocabulary: 60k words.
- Across-word models.
- Word conditioned search (WCS):
  - optional bigram language model look-ahead (Bigram LAH).
- Time conditioned search (TCS):
  - context recombination every 5th time frame (CR5),
  - tree start every 2nd time frame (Interval 2).
- Measure word error rate (WER) vs. real time factor (RTF).
- Comparison of time and word conditioned search.


Time Conditioned Search: Pruning: Results

[Figure: WER [%] (13.0-14.4) vs. RTF (0-9), EPPS Eval06, 60k words, comparing TCS, TCS + CR5, TCS Interval 2, and TCS Interval 2 + CR5.]


Time Conditioned Search: Pruning: Results

[Figure: WER [%] (13.0-14.4) vs. RTF (0-9), EPPS Eval06, 60k words, comparing WCS, WCS + Bigram LAH, TCS, and TCS Interval 2 + CR5.]


Time Conditioned Search: Conclusions

Pruning experiments show:
- Context recombination leads to moderate improvements.
- A tree start interval of 2 frames gives a significant speed-up.
- Context recombination has a negligible effect on top of the tree start interval.
- Tree start intervals larger than 2 give degradations.

Overall:
- With optimized pruning, time conditioned search outperforms word conditioned search.


Outline
0. Lehrstuhl fur Informatik 6
1. Large Vocabulary Speech Recognition
2. Search using Lexical Pronunciation Tree
3. Across-Word Models
4. Word graphs and Applications
5. Time Conditioned Search
6. Normalization and Adaptation
   6.1 Introduction
   6.2 Audio Segmentation
   6.3 Speaker Clustering
   6.4 Vocal Tract Length Normalization (VTLN)
   6.5 Maximum Likelihood Linear Regression (MLLR)
   6.6 Feature Space MLLR (fMLLR)
   6.7 Speaker Adaptive Training
   6.8 Examples and Results
   6.9 Summary
7. Discriminative Training


Normalization and Adaptation: Motivation

Objective: try to fit the acoustic model to the speaker/condition.
- Need new acoustic adaptation data from the current speaker or condition.

Distinctions:
- Model based vs. feature based:
  - Modify the model to better fit the features ⇒ adaptation.
  - Transform the features to better fit the model ⇒ normalization.
- Supervised vs. unsupervised:
  - Transcriptions available for the adaptation data ⇒ supervised.
  - No transcriptions available ⇒ unsupervised.
- Recognition side vs. also in training:
  - Normalization also in training is straightforward.
  - Possible also for adaptation ⇒ speaker adaptive training (SAT).


Normalization and Adaptation: Supervised Adaptation

Supervised adaptation = constrained acoustic model reestimation.
- Needs manual transcriptions of the adaptation data.
- Theoretically similar to acoustic model training: maximum likelihood.

Model based: M' = f(M, \varphi)

    \hat{\varphi} = \arg\max_\varphi P(x_1^T \mid w_1^N, M').

Feature based: x'^T_1 = f(x_1^T, \theta) (note the Jacobian determinant)

    \hat{\theta} = \arg\max_\theta \left| \frac{d f(x_1^T, \theta)}{d x_1^T} \right| P(x'^T_1 \mid w_1^N, M).

As in acoustic model training: discriminative training is possible and useful.


Normalization and Adaptation: Unsupervised Adaptation - Theoretical Foundation

Simultaneous generation of the adaptation word sequence and estimation of the adaptation parameters.
- Model based: M' = f(M, \varphi)

    (\hat{w}_1^N, \hat{\varphi}) = \arg\max_{w_1^N, \varphi} P(w_1^N) \, P(x_1^T \mid w_1^N, M').

- The feature based method can be defined analogously.
- The above method is infeasible in practice.
- Approximations are always used.


Normalization and Adaptation: Unsupervised Adaptation - Approximations

Word graph based approximation:
- Perform first pass recognition generating a word graph.
- Posterior confidence measures from the word graph.
- Weighted accumulation.

First-best approximation:
- Only take the first best output instead of the word graph.
- Estimation is performed exactly as in the supervised case, but use the first pass output as transcription.
- This method is used most often in practice.


Normalization and Adaptation: Estimation Criteria

Different estimation criteria are possible:
- Supervised adaptation:
  - Discriminative criteria can be applied (see Chapter 7).
  - No principal difference to acoustic model training.
- Unsupervised adaptation:
  - Currently no significant improvement using discriminative criteria compared to maximum likelihood estimation.
  - Appropriate criteria are still an open research issue.


Normalization and Adaptation: Use Cases: Transcription System

- Objective: generate a transcription from untranscribed audio data.
- Offline system: multiple passes over the data are allowed.
- Several different speakers and conditions; this requires:
  - segmentation, to partition the audio into homogeneous portions: single speaker and single condition per segment,
  - clustering of segments into same (similar) speakers and/or conditions.
- Speakers not previously known / no transcribed adaptation data available: unsupervised adaptation.


Normalization and Adaptation: Use Cases: Dictation System

- Objective: transcribe speech from a known speaker in real time.
- Typically only one speaker uses each system (at a time).
- The user corrects errors in the ASR output ⇒ corrected transcripts are usable for supervised adaptation.
- The amount of adaptation data continually increases ⇒
  - the complexity of the adaptation method should also increase,
  - finally a complete retraining of the acoustic model might be used.


Normalization and Adaptation: Use Cases: On-line Interactive System

- Objective: transcribe speech from different unknown speakers in real time.
- Used for dialog as well as command and control systems.
- Unsupervised adaptation.
- Requires fast adaptation, i.e. adaptation using very little data (some seconds).
- Can benefit from incremental adaptation, i.e. continually updating the adaptation for new data.


Audio Segmentation

- Objective: split the audio stream into homogeneous regions according to:
  - speaker identity,
  - recording condition (e.g. background noise, telephone channel),
  - signal type (e.g. speech, music, noise, silence), and
  - spoken word sequence.
- Segmentation affects speech recognition performance:
  - speaker adaptation and speaker clustering assume one speaker per segment,
  - the language model assumes sentence boundaries at segment ends,
  - non-speech regions cause insertion errors,
  - overlapping speech is not recognized correctly and causes errors in surrounding regions,
  - words must not be split.


Audio Segmentation: Methods

- Metric based:
  - Compute a distance between adjacent regions.
  - Segment at maxima of the distances.
  - Distances: KL distance, Bayesian information criterion.
- Model based:
  - Classify regions using precomputed models for music, speech, etc.
  - Segment at changes in acoustic class.
- Decoder guided:
  - Apply speech recognition to the input audio stream.
  - Segment at silence regions.
  - Other information produced by the decoder can be used.


Audio Segmentation: Metric-based using BIC

Bayesian Information Criterion (BIC):
- Likelihood criterion for a model θ, given a data set x_1^N:

    BIC(\theta, x_1^N) = \log p(x_1^N \mid \theta) - \frac{\lambda}{2} \cdot d(\theta) \cdot \log N,

  d(θ): number of parameters in θ,
  λ: penalty weight for model complexity.
- Used for model selection: choose the model maximizing the BIC.


Audio Segmentation: Metric-based using BIC (ctd.)

Change point detection using BIC:
- The input stream is modeled as a Gaussian process in the cepstral space.
- Feature vectors of one segment: drawn from a multivariate Gaussian: x_i ... x_j ~ N(µ, Σ).
- For a hypothesized segment boundary t in x_1^T decide between
  - x_1 ... x_T ~ N(µ, Σ), and
  - x_1 ... x_t ~ N(µ_1, Σ_1), x_{t+1} ... x_T ~ N(µ_2, Σ_2).
- Use the difference of BIC values:

    \Delta BIC(t) = T \log|\Sigma| - t \log|\Sigma_1| - (T - t) \log|\Sigma_2| - \lambda P

    P = \frac{1}{2}\Big(D + \frac{1}{2} D(D+1)\Big) \log T

  D: dimensionality of the feature vectors.
- Change point if ΔBIC(t) > 0 (sketched below).
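As an illustration, a minimal NumPy sketch of the ΔBIC computation for one hypothesized boundary (illustrative only; variable names and the frame-matrix layout are assumptions, not from the original slides, and both sides of the split must contain enough frames for a nonsingular full covariance):

```python
import numpy as np

def delta_bic(x, t, lam=1.0):
    """Delta BIC for a hypothesized change point t in the frame matrix x.

    x:   T x D matrix of cepstral feature vectors.
    t:   hypothesized boundary, 0 < t < T.
    lam: complexity penalty weight lambda.
    Returns Delta BIC(t); a change point is hypothesized if the value is > 0.
    """
    T, D = x.shape
    # log|Sigma| of the ML (biased) covariance estimate of a frame block
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False, bias=True))[1]
    P = 0.5 * (D + 0.5 * D * (D + 1)) * np.log(T)
    return T * logdet(x) - t * logdet(x[:t]) - (T - t) * logdet(x[t:]) - lam * P

# Single change point: t_hat = argmax_t delta_bic(x, t)
```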


Audio Segmentation: Metric-based using BIC (ctd.)

- Detecting a single change point in x_1^T:

    \hat{t} = \arg\max_t \Delta BIC(t)

- Detecting multiple change points:
  - Apply change point detection to a sliding, size-growing window.
  - Start with a window of initial size.
  - Enlarge the window until a change point is detected, or the maximum window size is reached.
  - Restart with the initial window size behind the previous change point.


Literature

- G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
- S. S. Chen and P. S. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian information criterion," in DARPA Broadcast News Transcription and Understanding Workshop, Feb. 1998, pp. 127-132.
- M. Cettolo, M. Vescovi, and R. Rizzi, "Evaluation of BIC-based algorithms for audio segmentation," Computer Speech and Language, vol. 19, no. 2, pp. 147-170, Apr. 2005.
- S. E. Tranter, K. Yu, G. Evermann, and P. C. Woodland, "Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech," in Proc. ICASSP, May 2004, pp. 753-756.
- D. Rybach, C. Gollan, R. Schluter, and H. Ney, "Audio Segmentation for Speech Recognition using Segment Features," in Proc. ICASSP, Apr. 2009, pp. 4197-4200.


Speaker Clustering: Introduction

- Objective: group speech segments into clusters for adaptation.
- Segments from the same or similar speakers should be grouped.
- Requirement: clustering should give the lowest possible adaptation WER.

BIC method:
- Uses acoustic features only.
- Greedy, bottom-up clustering.
- BIC used to control the number of clusters.


Speaker Clustering: BIC Clustering

- Greedy, bottom-up BIC clustering method [Chen+@IBM 1998].
- Each cluster is modeled using a single Gaussian with full covariance.
- BIC criterion, where C_K is a clustering with K clusters, x_1^T the input data points, n_k the number of points in cluster k, and D the feature dimension:

    BIC(C_K, x_1^T) = \sum_{k=1}^{K} \Big( -\frac{1}{2} n_k \log|\Sigma_k| \Big) - \frac{1}{2} K \Big( D + \frac{1}{2} D(D+1) \Big) \log T.

- Algorithm (see the sketch below):
  0. Start with one cluster for each segment.
  1. Try all possible pairwise cluster merges.
  2. Merge the pair that gives the largest increase in BIC(C_K, x_1^T).
  3. Iterate from 1, until BIC(C_K, x_1^T) starts to decrease.
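A minimal sketch of this greedy procedure (illustrative only: segments are assumed to be NumPy frame matrices, the BIC is re-evaluated from scratch for each candidate clustering rather than cached as on the next slide, and the penalty weight λ is already included as discussed there):

```python
import numpy as np

def cluster_bic(segments, lam=1.0):
    """Greedy bottom-up BIC clustering; segments: list of (T_i x D) arrays."""
    D = segments[0].shape[1]
    T = sum(len(s) for s in segments)

    def bic(clusters):
        # likelihood term: sum_k -1/2 n_k log|Sigma_k|, single full-cov Gaussian per cluster
        ll = sum(-0.5 * len(np.vstack(c))
                 * np.linalg.slogdet(np.cov(np.vstack(c), rowvar=False, bias=True))[1]
                 for c in clusters)
        penalty = 0.5 * lam * len(clusters) * (D + 0.5 * D * (D + 1)) * np.log(T)
        return ll - penalty

    clusters = [[s] for s in segments]            # step 0: one cluster per segment
    score = bic(clusters)
    while len(clusters) > 1:
        # step 1: evaluate all pairwise merges, keep the best one
        def merged_clustering(ij):
            return [c for k, c in enumerate(clusters) if k not in ij] \
                   + [clusters[ij[0]] + clusters[ij[1]]]
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        best = max(pairs, key=lambda ij: bic(merged_clustering(ij)))
        merged = merged_clustering(best)
        new_score = bic(merged)
        if new_score <= score:                    # step 3: stop when BIC decreases
            break
        clusters, score = merged, new_score       # step 2: commit the best merge
    return clusters
```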


Speaker Clustering: BIC Clustering (ctd.)

- In practice, modifications of the BIC criterion can be used [Zhou & Hansen 2000].
- The criterion takes the form: likelihood term + complexity term.
- As in the segmentation section, multiply the complexity term with a penalty weight λ.
- Tune λ by optimizing the WER on a development set.
- In a practical implementation, a K × K matrix is used to cache the change in BIC for each merge pair between iterations.
- It is also possible to use BIC only as stop criterion, with a distance measure to choose the pair-wise merging of clusters.


Speaker Clustering: References

- S. S. Chen and P. S. Gopalakrishnan, Clustering via the Bayesian Information Criterion with Applications in Speech Recognition, ICASSP 1998.
- B. Zhou and J. H. L. Hansen, Unsupervised audio stream segmentation and clustering via the Bayesian information criterion, ICSLP 2000.


Vocal Tract Length Normalization (VTLN): Introduction

What is the vocal tract?
- The dimensions of the vocal tract vary between speakers.
- Different vocal tract lengths give different formant positions ⇒ systematic differences in features between speakers.
- Goal: warp the frequency spectrum to match spectra from different speakers for the same phoneme.


Vocal Tract Length Normalization (VTLN): Parameterization

- VTLN warping is performed as part of the acoustic feature extraction.
- Warping is performed on the spectrum, before Mel-warping.
- Typically: warp the filter bank centers/widths (as in Mel-warping).

Linear warping:
- Single (varying) parameter α, termed the warping factor.
- To get a bijective mapping, a knee parameter ω0 is often used ⇒ piecewise linear warping.

[Figure: piecewise linear warping functions ω → ω̃ on [0, π] for α > 1, α = 1, and α < 1, with knee at ω0.]


Vocal Tract Length Normalization (VTLN): Estimation

Estimation from formant position:
- Compute the mean position of (some) formant for the speaker.
- Derive the warping factor from the difference to the global formant position.
- In [Eide+ 96], the third formant is used; cf. the paper for details.

Maximum likelihood (ML) estimation [Lee+@AT&T 1996]:

    \hat{\alpha} = \arg\max_\alpha P\big(x(\alpha)_1^T \mid \alpha, w_1^N\big)

- Grid search over warping factors, e.g. [0.8 : 0.1 : 1.2]; a maximization problem (see the sketch below).
- In practice: use an alignment (or other state distribution) obtained by an existing acoustic model.
- Stable due to the single parameter; can be used speaker-wise or segment-wise.


Vocal Tract Length Normalization (VTLN): Variants

Fast VTLN [Welling+@RWTH 1999]:
- ML estimation requires a transcription.
- For use in recognition ⇒ two pass, unsupervised.
- Goal: eliminate the first recognition pass:
  - Train a (GMM) warping factor classifier on the acoustic training set.
  - The classifier uses only acoustic features (similar to clustering).
  - In recognition, estimate the warping factor using the classifier.

Effect of the Jacobian determinant:
- The transformation (= warping) varies in the optimization.
- Thus, the Jacobian determinant should be included.
- Solution: VTLN as a linear transform, see [Pitz@RWTH 2005] and [Umesh+@Indian Inst. Tech. 2005].


Vocal Tract Length Normalization (VTLN): References

- A. Andreou, T. Kamm, and J. Cohen, Experiments in Vocal Tract Normalization, CAIP Workshop 1994.
- E. Eide and H. Gish, A Parametric Approach to Vocal Tract Length Normalization, ICASSP 1996.
- L. Lee and R. C. Rose, Speaker normalization using efficient frequency warping procedures, ICASSP 1996.
- L. Welling, S. Kanthak, and H. Ney, Improved Methods for Vocal Tract Normalization, ICASSP 1999.
- M. Pitz, Investigations on Linear Transformations for Speaker Adaptation and Normalization, PhD Thesis, Aachen, Germany, March 2005.
- S. Umesh, A. Zolnay, and H. Ney, Implementing Frequency-Warping and VTLN Through Linear Transformation of Conventional MFCC, Interspeech 2005.


Maximum Likelihood Linear Regression (MLLR): Introduction

- Goal: approximate a speaker specific acoustic model using affine transforms of the speaker independent model's Gaussian parameters [Leggetter+@Cambridge 1995].
- Parameterizations (r is the speaker/condition):
  - Mean adaptation: µ'_{rs} = A_{rs} µ_s + b_{rs}.
  - Covariance adaptation: Σ'_{rs} = H_{rs} Σ_s H_{rs}^T.
  - Often only mean adaptation is used; Σ-adaptation is not discussed here.
- Typically estimated using the maximum likelihood criterion.
- State tying: A_{rs}, b_{rs}, and H_{rs} are either global or depend on sets of phonetically related states.


Maximum Likelihood Linear Regression (MLLR): Estimation

- Rewrite the MLLR formulation:

    \xi_s = [1\; \mu_s^T]^T, \quad W_{rs} = [b_{rs}\; A_{rs}], \quad \mu'_{rs} = W_{rs} \xi_s.

- Assume adaptation data (x_1^T, w_1^N) from a single speaker; drop the index r.
- Depending on the amount of adaptation data, tying is applied.
  - For the derivation, we assume a single pooled adaptation matrix W, i.e. complete tying.
  - All other cases of tying can be derived from the pooled case by dividing the adaptation data according to the Viterbi alignment.
- Parameter estimation using the maximum likelihood criterion:

    \hat{W} = \arg\max_W P(x_1^T \mid w_1^N, \theta, W).

- Optimization using expectation maximization (EM) gives:

    W_{ML} = \arg\min_W \sum_s \sum_{t=1}^{T} \gamma_s(t) \, (x_t - W \xi_s)^T \Sigma_s^{-1} (x_t - W \xi_s)


Maximum Likelihood Linear Regression (MLLR): Estimation: General Case - Full Covariance

Take the derivative w.r.t. the adaptation matrix W and set it to zero:

    \frac{\partial}{\partial W} \sum_s \sum_{t=1}^{T} \gamma_s(t) (x_t - W\xi_s)^T \Sigma_s^{-1} (x_t - W\xi_s)
    \propto \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} \big[ x_t \xi_s^T - W \xi_s \xi_s^T \big] \overset{!}{=} 0

    \Leftrightarrow \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} W \xi_s \xi_s^T = \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} x_t \xi_s^T


Maximum Likelihood Linear Regression (MLLR): Estimation: Case with Diagonal Covariances

Now assume diagonal covariances Σ_{s,ij} = σ²_{s,i} δ_{ij}.
Component-wise representation and rearrangement:

    \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} W \xi_s \xi_s^T = \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} x_t \xi_s^T

    \Leftrightarrow \sum_n W_{in} \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \frac{1}{\sigma^2_{s,i}} \xi_{s,n} \xi_{s,j}}_{=: G^{(i)}_{nj}}
    = \underbrace{\sum_s \frac{1}{\sigma^2_{s,i}} \sum_{t=1}^{T} \gamma_s(t) \, x_{t,i} \, \xi_{s,j}}_{=: Z_{ij}} \quad \forall i, j

    \Leftrightarrow \big[ W \cdot G^{(i)} \big]_{ij} = Z_{ij} \quad \forall i, j \qquad \Big|\; \cdot \big(G^{(i)\,-1}\big)_{jk} \;\forall k,\; \textstyle\sum_j

    \Leftrightarrow W_{ik} = \big( Z \cdot G^{(i)\,-1} \big)_{ik} \quad \forall i, k


Maximum Likelihood Linear Regression (MLLR): Estimation

Therefore, assuming diagonal covariances, the adaptation matrix can be obtained row-wise, i.e. for each row i separately:

    W_{ij} = \big( Z \, G^{(i)\,-1} \big)_{ij} \quad \forall i, j

with accumulators

    G^{(i)} = \sum_s \frac{1}{\sigma^2_{s,i}} \, \xi_s \xi_s^T \sum_{t=1}^{T} \gamma_s(t),

    Z = \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} x_t \xi_s^T \quad \text{with} \quad \Sigma_s = \mathrm{diag}(\sigma^2_{s,1}, \ldots, \sigma^2_{s,D}).

- Closed form solution for the adaptation matrix (see the sketch below).
- Accumulator storage requirements: O(D³) (for the G^{(i)}).
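A compact NumPy sketch of these accumulators and the row-wise solve, under the slide's assumptions (diagonal covariances, one pooled transform). The per-state posterior sums `gammas[s]` = Σ_t γ_s(t) and posterior-weighted feature sums `xsum[s]` = Σ_t γ_s(t) x_t are assumed to be precomputed from the alignment:

```python
import numpy as np

def estimate_mllr(means, variances, gammas, xsum):
    """Row-wise closed-form MLLR mean transform W = [b A].

    means:     S x D Gaussian means mu_s.
    variances: S x D diagonal covariances sigma^2_s.
    gammas:    length-S vector, sum_t gamma_s(t).
    xsum:      S x D matrix, sum_t gamma_s(t) * x_t.
    Returns W of shape D x (D+1) with mu'_s = W [1 mu_s^T]^T.
    """
    S, D = means.shape
    xi = np.hstack([np.ones((S, 1)), means])           # extended means [1 mu_s^T]^T
    # Z_ij = sum_s (1/sigma^2_{s,i}) (sum_t gamma_s(t) x_{t,i}) xi_{s,j}
    Z = np.einsum('sd,se->de', xsum / variances, xi)   # D x (D+1)
    W = np.empty((D, D + 1))
    for i in range(D):
        # G^(i) = sum_s (gamma_s / sigma^2_{s,i}) xi_s xi_s^T
        w = gammas / variances[:, i]
        G_i = np.einsum('s,se,sf->ef', w, xi, xi)      # (D+1) x (D+1), symmetric
        W[i] = np.linalg.solve(G_i, Z[i])              # row i: W_i = Z_i G^(i)^-1
    return W
```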


Maximum Likelihood Linear Regression (MLLR): Tree Based State Tying

- Described in [Leggetter+@Cambridge 1995].
- Basis: phonetic classification tree, e.g. from AM state tying.
- Each node represents a regression class, i.e. a set of states.
- Typically silence has a separate class.
- Dynamically adjusted regression classes; algorithm (see the sketch below):
  - Accumulate statistics in each leaf node.
  - Additionally accumulate an observation count.
  - Combine (= add) the accumulators of all child nodes and store the result in the parent node.
  - Continue merging until the root node is reached.
  - For each leaf node:
    1. If the observation count is larger than a threshold: use the accumulator.
    2. Else go to the parent node and repeat from 1.
- The algorithm allows regression classes with too few observations to still be adapted, by combining data from other classes.
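A minimal sketch of the walk-up-the-tree fallback (the `Node` data structure with `parent`, `acc`, and `count` fields is an illustrative assumption, not from the slides):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    acc: object                  # accumulated MLLR statistics (e.g. G^(i), Z)
    count: float                 # accumulated observation count
    parent: Optional["Node"] = None

def accumulator_for_leaf(leaf: Node, min_count: float):
    """Return the statistics a leaf regression class is adapted with:
    its own accumulator if it has enough observations, otherwise the first
    ancestor whose combined (child-summed) count exceeds the threshold."""
    node = leaf
    while node.count < min_count and node.parent is not None:
        node = node.parent       # fall back to a coarser regression class
    return node.acc
```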


Maximum Likelihood Linear Regression (MLLR): Variants

- Standard MLLR uses the same number of matrices A and offsets b.
- But it is also possible to have many more offsets than matrices [Digalakis+@Johns Hopkins Workshop 1999].
  - Allows a finer granularity in adjusting the number of parameters.
  - With no transformation matrices, only offsets: sometimes known as shift MLLR or bias MLLR.
  - Estimation is very similar to standard MLLR, but A and b are computed separately.
  - The method for A is formally the same as for the matrix W in MLLR.
  - The offset b_c is given by the following formula; it depends on the corresponding matrix A_{c'(c)}:

      b_c = \frac{1}{T} \sum_{\substack{t=1 \\ s_t \in c}}^{T} \big( x_t - A_{c'(c)} \, \mu_{s_t} \big).


Maximum Likelihood Linear Regression (MLLR): References

- C. J. Leggetter and P. C. Woodland, Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models, CSL, April 1995.
- C. J. Leggetter and P. C. Woodland, Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression, ARPA SLT Workshop 1995.
- V. Digalakis, S. Berkowitz, E. Bocchieri, C. Boulis, W. Byrne, H. Collier, A. Corduneanu, A. Kannan, S. Khudanpur, A. Sankar, Rapid Speech Recognizer Adaptation to New Speakers, ICASSP 1999.


Feature Space MLLR (fMLLR): Introduction

- Goal: normalize the features to better fit the speaker [Digalakis+@SRI 1995].
- Affine transform of the features: x'_t = A_{rs} x_t + b_{rs}.
- Note: as a transformation of a probability density, the Jacobian determinant is needed (see the scoring sketch below):

    P(x'_t) = N(x'_t \mid \mu_s, \Sigma_s) \;\Leftrightarrow\; P(x_t) = |A_{rs}| \, N(A_{rs} x_t + b_{rs} \mid \mu_s, \Sigma_s).

- Advantages:
  - Pure feature transform ⇒ no changes to the core speech recognition decoder needed.
  - Speaker adaptive training is easy to implement.
- State tying: a single global matrix per speaker.
- But: it can also be seen as a model transform ⇒ regression classes are possible with changes to the decoder.
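A minimal sketch of scoring one frame under an fMLLR transform, with the Jacobian term made explicit in the log domain (illustrative only; diagonal Gaussians assumed):

```python
import numpy as np

def fmllr_log_likelihood(x, A, b, mean, var):
    """log P(x_t) = log|det A| + log N(A x_t + b | mu_s, Sigma_s).

    x, b, mean, var: 1-D arrays of dimension D (var = diagonal of Sigma_s).
    A: D x D fMLLR matrix.
    """
    x_prime = A @ x + b                                  # normalized feature
    log_jac = np.linalg.slogdet(A)[1]                    # log|det A|
    log_gauss = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (x_prime - mean) ** 2 / var
    )
    return log_jac + log_gauss
```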


Feature Space MLLR (fMLLR): Equivalence of Model and Feature Transform

MLLR with the same matrix for the mean and covariance transform:

    P(x_t \mid A_{rs}, b_{rs}, \mu_s, \Sigma_s)
    = \frac{1}{\sqrt{\det(2\pi A_{rs} \Sigma_s A_{rs}^T)}} \, e^{-\frac{1}{2}(x_t - A_{rs}\mu_s - b_{rs})^T (A_{rs} \Sigma_s A_{rs}^T)^{-1} (x_t - A_{rs}\mu_s - b_{rs})}
    = \frac{1}{\det(A_{rs}) \sqrt{\det(2\pi\Sigma_s)}} \, e^{-\frac{1}{2}(A'_{rs} x_t + b'_{rs} - \mu_s)^T \Sigma_s^{-1} (A'_{rs} x_t + b'_{rs} - \mu_s)}
    = \det(A'_{rs}) \, N(A'_{rs} x_t + b'_{rs} \mid \mu_s, \Sigma_s),

where A'_{rs} = A_{rs}^{-1} and b'_{rs} = -A_{rs}^{-1} b_{rs}.

- This is exactly the formula for fMLLR:
  - A Gaussian distributed random variable under an affine transform ⇒ Gaussian also in the new coordinate system.
  - A well known fact from mathematical statistics.


Feature Space MLLR (fMLLR): Maximum Likelihood Estimation

As for MLLR, rewrite in linear form, drop the speaker index r, and assume complete tying:

    x'_t = W \xi_t \quad \text{with} \quad \xi_t = [1\; x_t^T]^T \quad \text{and} \quad W = [b\; A].

Maximum likelihood EM equation:

    W_{ML} = \arg\max_W \Big\{ T \log \det A - \frac{1}{2} \sum_s \sum_{t=1}^{T} \gamma_s(t) \, (W\xi_t - \mu_s)^T \Sigma_s^{-1} (W\xi_t - \mu_s) \Big\}

    = \arg\max_W \Big\{ T \log \det A + \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \mathrm{Trace}\big( W \xi_t \mu_s^T \Sigma_s^{-1} \big)
      - \frac{1}{2} \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \mathrm{Trace}\big( \Sigma_s^{-1} W \xi_t \xi_t^T W^T \big) \Big\}.


Feature Space MLLR (fMLLR): Maximum Likelihood Estimation

Rewrite in terms of sufficient statistics:

    W_{ML} = \arg\max_W \Big\{ T \log \det A
      + \mathrm{Trace}\Big( W \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \, \xi_t \mu_s^T \Sigma_s^{-1}}_{=: Z} \Big)
      - \frac{1}{2} \sum_{i,j=1}^{D} \sum_{k,l=1}^{D+1} W_{jk} W_{il} \cdot \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma^{-1}_{s,ij} \, \xi_{t,k} \xi_{t,l}}_{=: H_{ijkl} \text{ (4th order tensor)}} \Big\}

For diagonal covariances this simplifies to:

    W_{ML} = \arg\max_W \Big\{ T \log \det A
      + \sum_{i=1}^{D} \sum_{k=1}^{D+1} W_{ik} \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \, \xi_{t,k} \, \mu_{s,i} \frac{1}{\sigma^2_{s,i}}}_{=: Z_{ki}}
      - \frac{1}{2} \sum_{i=1}^{D} \sum_{k,l=1}^{D+1} W_{ik} W_{il} \cdot \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \frac{1}{\sigma^2_{s,i}} \xi_{t,k} \xi_{t,l}}_{=: G^{(i)}_{kl} \text{ (one matrix for each } i)} \Big\}


Feature Space MLLR (fMLLR): Maximum Likelihood Estimation

Case of diagonal covariances Σ_{s,ij} = σ²_{s,i} δ_{ij}:

    W_{ML} = \arg\max_W \Big\{ T \log \det A + \sum_{i=1}^{D} \sum_{k=1}^{D+1} W_{ik} Z_{ki}
      - \frac{1}{2} \sum_{i=1}^{D} \sum_{k,l=1}^{D+1} W_{ik} W_{il} \, G^{(i)}_{kl} \Big\},

with sufficient statistics:

    G^{(i)} = \sum_s \sum_{t=1}^{T} \frac{\gamma_s(t)}{\sigma^2_{s,i}} \, \xi_t \xi_t^T, \qquad
    Z = \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \xi_t \mu_s^T \Sigma_s^{-1}.

- No closed form solution for W; iterative methods are needed.
- Row-wise iterative solution in [Gales+@Cambridge 1998], sketched below.
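The row-wise update exploits that det A is linear in each row of A (cofactor expansion), so each row has a closed-form update given the others. A hedged sketch of one update sweep, following the construction summarized above (simplified: it restricts itself to positive det A, and the root of the resulting quadratic is chosen by evaluating the objective for both candidates):

```python
import numpy as np

def fmllr_sweep(W, G, Z, T):
    """One row-wise update sweep for fMLLR, W = [b A] of shape D x (D+1).

    G: list of D matrices G^(i), each (D+1) x (D+1).
    Z: (D+1) x D statistics, objective term sum_{i,k} W_ik Z_ki.
    T: number of frames.
    """
    D = W.shape[0]
    for i in range(D):
        A = W[:, 1:]
        # Extended cofactor row p: det A is linear in row i of A,
        # with cof_ij = det(A) * inv(A)[j, i]; p[0] = 0 for the bias column.
        p = np.zeros(D + 1)
        p[1:] = np.linalg.det(A) * np.linalg.inv(A)[:, i]
        Ginv = np.linalg.inv(G[i])
        k_i = Z[:, i]                           # linear statistics for row i
        beta = p @ Ginv @ p
        gamma = k_i @ Ginv @ p
        # The stationarity condition gives w_i = (alpha p + k_i) G^(i)^-1
        # with alpha solving alpha^2 beta + alpha gamma - T = 0.
        disc = np.sqrt(gamma ** 2 + 4.0 * beta * T)
        candidates = [(-gamma + disc) / (2.0 * beta),
                      (-gamma - disc) / (2.0 * beta)]

        def objective(row):
            A_new = A.copy()
            A_new[i] = row[1:]
            sign, logdet = np.linalg.slogdet(A_new)
            if sign <= 0:
                return -np.inf                  # keep det A positive in this sketch
            return T * logdet + row @ k_i - 0.5 * row @ G[i] @ row

        rows = [(alpha * p + k_i) @ Ginv for alpha in candidates]
        W[i] = max(rows, key=objective)
    return W
```

Several sweeps are run until the objective converges; each sweep only needs the statistics G^{(i)} and Z from the slide above.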


Feature Space MLLR (fMLLR): Dimension Reducing Adaptation

Dimension reducing fMLLR [Loof+@RWTH 2007]:
- Goal: utilize more context information from the features.
- Replace the LDA matrix (instead of applying fMLLR after the LDA).
- Non-quadratic matrix ⇒ the derivation as for fMLLR is not possible.
- Instead of ML, use the ratio of the acoustic model likelihood and a global competing model likelihood.
- Competing model: a single Gaussian trained on the adaptation data,
  parameters: µ', Σ' (in the transformed feature space).

In essence: how much better do the transformed features fit the target acoustic model than they fit a global single Gaussian?


Feature Space MLLR (fMLLR): Dimension Reducing fMLLR

Consider the adaptation data likelihood for the global single Gaussian:

    \log P(W_r x_1^T \mid \mu', \Sigma') = \sum_{t=1}^{T} \log N(W_r x_t \mid \mu', \Sigma')
    = \sum_{t=1}^{T} \Big[ -\frac{1}{2} \log \det(2\pi\Sigma') - \frac{1}{2} (W_r x_t - \mu')^T \Sigma'^{-1} (W_r x_t - \mu') \Big]
    = -\frac{T}{2} \log\big[(2\pi)^D \det(\Sigma')\big]
      - \frac{1}{2} \underbrace{\mathrm{Trace}\Big[ \sum_{t=1}^{T} (W_r x_t - \mu')(W_r x_t - \mu')^T \cdot \Sigma'^{-1} \Big]}_{= \mathrm{Trace}(T \Sigma' \Sigma'^{-1}) = T \cdot \mathrm{Trace}\, I_D = TD}
    = -\frac{T}{2} \log \det \Sigma' - \frac{TD}{2} \log 2\pi - \frac{TD}{2}


Feature Space MLLR (fMLLR): Dimension Reducing fMLLR

Maximize the likelihood ratio:

    \hat{W}_r = \arg\max_{W_r} \Big\{ -\log P(W_r x_1^T \mid \mu', \Sigma') + \log P(W_r x_1^T \mid M, w_1^N) \Big\}

with

    \log P(W_r x_1^T \mid \mu', \Sigma') = -\frac{T}{2} \log \det \Sigma' - \frac{TD}{2} \log 2\pi - \frac{TD}{2}
    = -\frac{T}{2} \log \det\big[ W_r \Sigma W_r^T \big] + \mathrm{const}(W_r)

with the covariance Σ of the untransformed features: Σ' = W_r Σ W_r^T.

Substitute into the maximum likelihood ratio:

    \hat{W}_r = \arg\max_{W_r} \Big\{ \frac{T}{2} \log \det(W_r \Sigma W_r^T) + \log P(W_r x_1^T \mid M, w_1^N) \Big\}

- The further derivation is similar to regular fMLLR; details in [Loof+@RWTH 2007].


Feature Space MLLR (fMLLR): References

- V. Digalakis, D. Rtischev, and L. Neumeyer, Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures, IEEE SAP, Sep. 1995.
- M. J. F. Gales, Maximum Likelihood Linear Transformations for HMM-based Speech Recognition, CSL, Apr. 1998.
- J. Loof, R. Schluter, and H. Ney, Efficient Estimation of Speaker-specific Projecting Feature Transforms, Interspeech 2007.


Speaker Adaptive Training (SAT): Introduction

- Adaptation compensates for speaker differences in recognition.
- But: we also have speaker differences in the training corpus.
- Question: how can we compensate for both these differences?

Speaker-adaptive normalization:
- Apply the transform on the training data also.
- Model training using the transformed acoustic features.

Speaker-adaptive adaptation:
- The interaction between model and transform requires simultaneous training of model and transform parameters.
- The model cannot simply be retrained → modified acoustic model training is necessary.


Speaker Adaptive Training: VTLN and fMLLR

Training procedure:
1. Estimate the speaker independent acoustic model M_SI as normal.
2. Decide on / estimate a target model M_T for adaptation estimation.
3. Use M_T to estimate VTLN or fMLLR adaptation supervised for each speaker in training.
4. Transform the features using the estimated adaptation.
5. Train the speaker adaptive model M_SAT on the transformed features, starting from M_SI.

Recognition procedure (see the sketch below):
1. First pass recognition using the speaker independent acoustic model M_SI.
2. Estimate VTLN or fMLLR adaptation unsupervised using the target model M_T.
3. Transform the features using the estimated adaptation.
4. Second pass recognition using the speaker adaptive model M_SAT.
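The two procedures mirror each other; a schematic sketch of the recognition side (purely illustrative: `recognize`, `estimate_transform`, and `apply_transform` are placeholders for the decoder and estimation routines, not functions from the slides):

```python
def sat_recognize(features_per_speaker, M_SI, M_T, M_SAT,
                  recognize, estimate_transform, apply_transform):
    """Two-pass SAT recognition: first pass -> unsupervised transform -> second pass."""
    results = {}
    for speaker, feats in features_per_speaker.items():
        hyp1 = recognize(M_SI, feats)                    # 1. first pass, SI model
        W = estimate_transform(M_T, feats, hyp1)         # 2. unsupervised estimation
        feats_norm = apply_transform(W, feats)           # 3. normalize features
        results[speaker] = recognize(M_SAT, feats_norm)  # 4. second pass, SAT model
    return results
```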


Speaker Adaptive Training: VTLN and fMLLR

- Originally, SAT was estimated iteratively, where in each iteration both the acoustic model and the adaptation parameters were reestimated [Anastasakos+@BBN 1996].
- Problem: interaction between acoustic model complexity and adaptation.
- A simple, single Gaussian target model is known to lower the WER both for VTLN [Welling+@RWTH 1999] and fMLLR [Stemmer+@FBK 2005].


Speaker Adaptive Training: MLLR

- Iterative re-estimation of MLLR, Gaussian means µ_s, and Gaussian covariances Σ_s [Anastasakos+@BBN 1996].
- MLLR parameters A_r and b_r are estimated as usual.
- Re-estimation of the Gaussian means using maximum likelihood:

    \hat{\mu}_s = \Big( \sum_{t}^{T} \gamma_s(t) \, A_{r(t)}^T \Sigma_s^{-1} A_{r(t)} \Big)^{-1} \Big( \sum_{t}^{T} \gamma_s(t) \, A_{r(t)}^T \Sigma_s^{-1} \big( x_t - b_{r(t)} \big) \Big)

    \hat{\Sigma}_s: same as for standard ML, but using the SAT mean.

- Drawback: requires a D × D accumulator for each mean µ_s.
- 800k Gaussians, D = 45, 64-bit float ⇒ 12 GB of accumulators.


Speaker Adaptive Training: References

- T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, A Compact Model for Speaker-Adaptive Training, ICSLP, Philadelphia, PA, USA, 1996.
- G. Stemmer, F. Brugnara, and D. Giuliani, Adaptive Training Using Simple Target Models, ICASSP, Philadelphia, PA, USA, 2005.
- L. Welling, S. Kanthak, and H. Ney, Improved Methods for Vocal Tract Normalization, ICASSP 1999.


Examples and Results: Experimental Setup

All examples use RWTH TC-STAR 2006 English ASR systems:
- Training corpus: 88h.
- Development and test sets: 3.2h each.
- MFCC + voicedness feature.
- One pass, fast VTLN.
- LDA dimension: 45 × 153.
- ML trained acoustic models, approx. 900k Gaussians.
- Two pass system: fMLLR and MLLR unsupervised using the first best output.
- Note: some differences between systems; a direct WER comparison between slides is not always possible.


Examples and Results: Basic Methods - Progressive Improvement

  WER [%]           Dev    Eval
  Baseline          18.5      -
  + VTLN + voice    17.2   14.4
  + fMLLR           15.7      -
  + MLLR            14.0   11.8

- Different adaptation methods can be combined.
- Here: each method gives an additional improvement; this is not always the case.


Examples and Results: Different Combinations

  fMLLR   MLLR   SAT   Proj.     WER [%]
                                 Dev    Eval
  no      no     no    no        16.4   13.5
  yes     no     no    no        15.1   11.9
  yes     no     no    yes       15.0   11.7
  yes     no     yes   no        14.4   11.0
  yes     no     yes   yes       14.4   10.8
  yes     yes    no    no        14.0   11.0
  yes     yes    no    yes       13.8   11.0
  yes     yes    yes   no        13.6   10.6
  yes     yes    yes   yes       13.3   10.4

  (Proj.: dimension reducing (projecting) fMLLR transform.)

- Here: the baseline includes VTLN.
- fMLLR improvement: 8-12% relative.
- Additional improvement from MLLR: 7-8% relative (from MLLR alone more).
- Improvement from SAT: 5-8% relative.
- Total improvement (over VTLN): 19-23% relative.


Examples and Results: Use of Confidence Measures

- Baseline: fMLLR SAT using dimension reducing transforms.
- Confidence measure for adaptation: instead of using the best recognized transcription from the first pass, use all word graph hypotheses weighted by a confidence measure.
- Goal: evaluate different model adaptation schemes, with and without confidence measures.

                       Dev                Eval
                 No Conf   Conf     No Conf   Conf
  1st pass         16.3      -        13.2      -
  fMLLRP SAT       14.2      -        10.5      -
  MAP              14.1    13.0       10.5    9.8
  MLLR             13.2    12.9       10.1    9.8
  Shift-MLLR       13.2    12.7       10.2    9.6


Summary

- Speaker adaptation is important to obtain good performance in speaker independent speech recognition.
- The approach and parameterization usually depend on:
  - supervision (yes/no),
  - recognition mode (on-/offline),
  - amount of adaptation data.
- Feature and model based approaches usually are additive.
- Unsupervised adaptation: 20% relative word error rate improvement is possible.


Outline
0. Lehrstuhl fur Informatik 6
1. Large Vocabulary Speech Recognition
2. Search using Lexical Pronunciation Tree
3. Across-Word Models
4. Word graphs and Applications
5. Time Conditioned Search
6. Normalization and Adaptation
7. Discriminative Training
   7.1 Introduction
   7.2 Overview
   7.3 Training Criteria
   7.4 Parameter Optimization
   7.5 Efficient Calculation of Discriminative Statistics
   7.6 Generalisation to Log-Linear Modeling
   7.7 Convex Optimization
   7.8 Incorporation of Margin Concept
   7.9 Conclusions
   7.10 References
   7.11 Speech Tasks: Corpus Statistics & Setups


Introduction: Motivation

Aim of discriminative methods: improve class separation.
- Standard maximum likelihood (ML) training: maximize the reference class conditional p_θ(x|c).
- Maximum mutual information (MMI) training: maximize the reference class posterior

    p_\theta(c \mid x) = \frac{p(c) \cdot p_\theta(x \mid c)}{\sum_{c'} p(c') \cdot p_\theta(x \mid c')}

Where is the difference?
- Ideally: (almost) no difference! In the case of infinite training data and correct model assumptions, the true probabilities are obtained in both cases. They lead to equal decisions, provided the class prior p(c) is known. (Proof: model free optimization.)
- ML training: classes are handled independently, therefore decision boundaries are not considered explicitly in training.
- In MMI training, and generally in discriminative training, the reference class directly competes against all other classes; decision boundaries become relevant in training.


Introduction: Motivation

- In practice, model assumptions are incorrect and training data is limited. Here discriminative training can be beneficial.

Example: a two-class problem (with pooled covariance matrix)

[Figure: two scatter plots of classes −1 and +1 with ML and MMI decision boundaries; without the outlier the boundaries coincide, with an outlier in class −1 the ML and MMI boundaries differ.]

- Clearly, in case of ML training, the outlier deteriorates the decision boundary, whereas MMI training registers the minor importance of the outlier.

- MMI captures the decision boundary, although the model assumption (pooled covariance) does not fit in the second case (a sketch of this example follows below).
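To make this concrete, here is a minimal, self-contained Python sketch of such a two-class setup (the data points, learning rate, and iteration count are illustrative assumptions, not the values behind the figure above). ML fits class means with a pooled covariance; MMI, which for a pooled covariance reduces to logistic regression on the class posterior, is optimized by plain gradient ascent:

import numpy as np

# Toy 2D two-class data; the far-left point in class -1 acts as an outlier.
X = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5], [-5.0, 2.0],   # class -1
              [2.0, 0.0], [2.5, 1.0], [3.0, 0.5]])               # class +1
y = np.array([-1, -1, -1, -1, 1, 1, 1])

# ML: class means + pooled covariance give a linear discriminant.
mu = {c: X[y == c].mean(axis=0) for c in (-1, 1)}
Xc = np.vstack([X[y == c] - mu[c] for c in (-1, 1)])
Sigma = Xc.T @ Xc / len(X)                      # pooled covariance
w_ml = np.linalg.inv(Sigma) @ (mu[1] - mu[-1])  # normal of the ML boundary

# MMI: with a pooled covariance the posterior is logistic in x, so maximizing
# the posterior criterion is logistic regression; plain gradient ascent.
w, b = np.zeros(2), 0.0
t = (y + 1) / 2                                 # targets in {0, 1}
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # p(class +1 | x)
    w += 0.1 * (X.T @ (t - p)) / len(X)
    b += 0.1 * float(np.mean(t - p))

print("ML  boundary normal:", w_ml / np.linalg.norm(w_ml))
print("MMI boundary normal:", w / np.linalg.norm(w))

With the outlier included, the ML boundary normal tilts toward the outlier, whereas the MMI normal stays close to the direction separating the bulk of the two classes.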


Overview

Questions:

- Which discriminative criterion to take?

- Relation to decision rule and evaluation measure?

- How to optimize the criterion?

- Efficiency?

- Influence of modeling?

- Uniqueness of the solution?

- Generalization?

Bottom line:

- How to utilize the available training material to obtain optimum recognition performance?


Training Criteria: Notation

X_r          sequence x_{r,1}, x_{r,2}, ..., x_{r,T_r} of acoustic observation vectors
W_r          spoken word sequence w_{r,1}, w_{r,2}, ..., w_{r,N_r} in training utterance r
W            any word sequence
p(W)         language model probability, supposed to be given
p_θ(X_r|W)   acoustic emission probability / acoustic model
θ            set of all parameters of the acoustic model
M_r          set of competing word sequences to be considered
f            smoothing function


Training Criteria: General Approach

- input: training data and stochastic model p_θ(X,W) with free model parameters θ

- output: "optimal" model parameters θ̂; optimality defined via a training criterion:

  θ̂ := argmax_θ F(θ)

Unified training criterion [Macherey+ 2005]:

  F(θ) = Σ_{r=1}^R f( log( Σ_W p(W) p_θ(X_r|W) · A(W,W_r) / Σ_{W∈M_r} p(W) p_θ(X_r|W) ) )

- covers maximum mutual information (MMI), minimum classification error (MCE), and minimum phone/word error (MPE/MWE)

- control set M_r of competing hypotheses, cost function, smoothing function, and scaling of models (not shown)


Training Criteria: Probabilistic Training Criteria

Objective

- find a good estimate of the probability distribution

- optimality regarding error via Bayes' decoding (asymptotic w.r.t. the amount of training data)


Training Criteria: Maximum Likelihood (ML)

- optimization of the joint probability:

  argmax_θ Σ_r log( p(W_r) p_θ(X_r|W_r) ) = argmax_θ Σ_r log p_θ(X_r|W_r)

- Tutorial on HMMs: [Rabiner 1989].

- Maximization of the probability of the reference word sequences (classes).

- Model correctness is important.

- HMM: maximization for each class separately.

- Neglects competing classes.

- Expectation-maximization: local convergence guaranteed.

- Estimation efficient, easily parallelizable.


Training Criteria: Maximum Mutual Information (MMI)

- optimization of the conditional probability:

  argmax_θ Σ_r log p_θ(W_r|X_r) = argmax_θ Σ_r log( p(W_r) p_θ(X_r|W_r) / Σ_V p(V) p_θ(X_r|V) )

- Considers competing classes and therefore decision boundaries.
- Necessitates a set of competing classes on the training data.
- Optimization for standard modeling (HMMs, mixture distributions): only gradient descent or similar.
- Optimization using log-linear modeling: convex problem.
- First application of MMI for ASR using discrete HMMs [Bahl+ 1986]: 2000 isolated words, 18% rel. improvement in word error rate.
- MMI for discrete and continuous probability densities [Brown 1987]: isolated E-set letters, 18% rel. improvement in recognition rate.
- MMI for discrete and continuous probability densities [Normandin 1991]: digit strings, up to 50% rel. improvement in string error rate.


Training Criteria: Error-Based Training Criteria

Objective: optimize some error measure directly, e.g.:

- Empirical recognition error on the training data
  - Advantage: direct relation to the decision rule
  - Problem: non-differentiable training criterion; use of differentiable approximations in practice
  - Problem: ASR classes (words/word sequences) difficult to handle

- Model-based expected error on the training data
  - Advantage: word or phoneme error easy to handle
  - Usually an approximated word/phoneme error, but the correct edit distance is also viable [Heigold+ 2005]
  - Relation to the decision rule less straightforward.
  - Over-training and generalization become an issue (→ regularization, margin)


Training Criteria: Minimum Classification Error (MCE)

- For ASR: minimization of the smoothed empirical sentence error [Juang & Katagiri 1992, Chou+ 1992]:

  argmin_θ (1/R) Σ_{r=1}^R 1 / ( 1 + [ p_θ^α(X_r|W_r) · p^α(W_r) / Σ_{W≠W_r} p_θ^α(X_r|W) · p^α(W) ]^{2ϱ} )

- Smoothing parameters α and ϱ (a numeric sketch follows below).

- Upper bound on Bayes' error rate for any acoustic model [Schluter+ 2001].

- Lesser effect of incorrect model assumptions.
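As a minimal numeric sketch (made-up scores; α and ϱ as free smoothing parameters; the helper name mce_term is ours), the term inside the sum can be evaluated stably in the log domain:

import numpy as np

def mce_term(logp_ref, logp_competing, alpha=1.0, rho=1.0):
    """Smoothed 0/1 error for one utterance r.
    logp_ref: log p(W_r) p_theta(X_r|W_r); logp_competing: log-probs of W != W_r.
    Returns a value in (0, 1): near 0 if the reference wins, near 1 if it loses."""
    log_den = np.logaddexp.reduce(alpha * np.asarray(logp_competing))
    log_ratio = alpha * logp_ref - log_den      # log of the bracketed ratio
    return 1.0 / (1.0 + np.exp(2.0 * rho * log_ratio))

print(mce_term(-100.0, [-110.0, -120.0]))   # reference clearly ahead -> ~0
print(mce_term(-100.0, [-95.0]))            # competitor wins -> ~1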


Training Criteria: Minimum Word/Phone Error (MWE/MPE)

- minimization of the model-based expected word/phone error on the training data [Povey & Woodland 2002]:

  argmax_θ Σ_{r=1}^R ( Σ_W A(W,W_r) p(W) p_θ(X_r|W) / Σ_W p(W) p_θ(X_r|W) )

- Criterion: maximum expected accuracy A(W,W_r).

- Accuracy usually approximate, but the exact case based on the edit (Levenshtein) distance is also possible [Heigold+ 2005].

- Regularization (e.g. I-smoothing [Povey & Woodland 2002]) necessary due to overtraining.

- Usually better than MMI and MCE (the expected-accuracy computation is sketched below).
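The inner quantity is just a posterior-weighted average of the accuracies; a small sketch with hypothetical scores and accuracies (the helper name is ours):

import numpy as np

def expected_accuracy(logp, acc):
    """Model-based expected accuracy E_theta[A(., W_r)] over hypotheses W."""
    post = np.exp(logp - np.logaddexp.reduce(logp))   # hypothesis posteriors
    return float(post @ acc)

logp = np.array([-10.0, -11.0, -13.0])   # log p(W) p_theta(X_r | W), made up
acc = np.array([5.0, 4.0, 2.0])          # A(W, W_r), e.g. approx. phone accuracy
print(expected_accuracy(logp, acc))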


Training Criteria: Practical Issues

- Importance of the language model in training of the acoustic model.

- Relative and absolute scaling of language and acoustic model in training.

- Necessity for recognition of the training data.

- Efficient calculation of discriminative training statistics using word lattices.


Training Criteria: Language Models for Discriminative Training

Potential importance of the language model choice:

- language model for recognition of alternative word sequences

- language model dependence of the discriminative training criterion itself

- interaction of the language model with the acoustic model parameters

Correlation hypothesis: Only those acoustic models need optimization which, even together with a language model, do not discriminate sufficiently.

→ the language model choice would correlate for training and recognition.

Masking hypothesis: The language model usually largely improves recognition accuracy and might mask deficiencies of the acoustic models.

→ suboptimal language models for training would give better performance.


Training Criteria: Language Model for Discriminative Training

- Discriminative training includes the language model.
- In training, a unigram language model usually leads to the best word error rates [Schluter+ 1999] (WSJ 5k):

  recog LM   train LM   criterion   WER dev [%]   eval [%]   dev & eval [%]
  bi         –          ML          6.91          6.78       6.86
             zero       MMI         6.71          6.03       6.41
             uni        MMI         6.59          6.00       6.33  (−8% rel.)
             bi         MMI         6.71          6.20       6.48
             tri        MMI         6.87          6.54       6.72
  tri        –          ML          4.82          4.11       4.51
             zero       MMI         4.63          4.05       4.38
             uni        MMI         4.30          3.64       4.01  (−11% rel.)
             bi         MMI         4.48          3.94       4.24
             tri        MMI         4.58          4.00       4.33


Training Criteria: Scaling of Likelihoods

- recognition: absolute scaling of likelihoods irrelevant (language model scale vs. acoustic model scale)

- absolute scaling does have an impact on the word posterior calculation [Wessel+ 1998, Woodland & Povey 2000]

- use the language model scale β also in training:

  p(X,W) = p(W)^β p_θ(X|W)

- replace p(X,W) with:

  p(X,W)^γ = p(W)^{βγ} p_θ(X|W)^γ   for γ ∈ [0,1]

- optimum approximately for γ = 1/β, i.e. use

  p(X,W)^{1/β} = p(W) p_θ(X|W)^{1/β}

- For simplicity, this is usually omitted in the equations here (the effect of γ is sketched below).
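The effect of the scale γ on posterior peakedness can be seen in a short numeric sketch (the scores and the typical language model scale β ≈ 15 are illustrative assumptions):

import numpy as np

logp = np.array([-100.0, -102.0, -105.0])   # joint log-scores of 3 hypotheses
for gamma in (1.0, 1.0 / 15.0):             # gamma = 1 vs. gamma = 1/beta
    post = np.exp(gamma * logp - np.logaddexp.reduce(gamma * logp))
    print(gamma, post)                      # gamma = 1/beta flattens the posteriors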


Training Criteria: Competing Word Sequences

- Problem: exponential number of competing word sequences.

- Competing word sequences need to be estimated:
  - hypothesis generation on the training data using the recognizer,
  - initial lattice generation using the recognizer is sufficient,
  - later acoustic model rescoring is constrained to the lattice.

- Representation and processing of competing word sequences:
  - efficient algorithms to process word lattices,
  - generic implementation: weighted finite state transducers.


Training Criteria: Competing Word Sequences

History:

- best recognized word sequence for MMI (corrective training) [Normandin 1991]:
  - considers incorrectly recognized training sentences only

- best incorrectly recognized word sequence for MCE [Juang & Katagiri 1992]:
  - interpretation as smoothed sentence error still valid

- N-best recognized word sequences for MMI [Chow 1990]:
  - continuous speech recognition, 1000 words
  - only minor improvements in word error rate

- word graphs from recognition for MMI training [Valtchev+ 1997]:
  - large vocabulary, 64k words
  - efficient implementation
  - 5–10% relative improvement in word error rate


Training Criteria: Comparative Experimental Results

  WER [%]
          SieTill   WSJ 5k          EPPS English             Mandarin BN/BC
  Crit.   Test      Dev     Evl     Dev06   Evl06   Evl07    Dev07   Evl06
  ML      1.81      4.55    3.74    14.4    10.8    12.0     15.1    21.9
  MMI     1.79      4.07    3.53    13.8    11.0    12.0     14.4    20.8
  MCE     1.69      4.02    3.47    13.8    11.0    11.9     –       –
  MWE     –         3.98    3.44    –       –       –        –       –
  MPE     –         4.17    3.62    13.4    10.2    11.5     14.2    20.6

- SieTill [Schluter 2000]

- WSJ 5k [Macherey 2010]

- EPPS/broadcasts [Heigold 2010]


Parameter Optimization: Motivation

Goal: an optimization method for discriminative training criteria F(θ) w.r.t. the set of parameters θ which provides reasonable convergence.

- Various approaches, e.g.:
  - extended Baum-Welch (EBW) [Normandin 1991]
  - gradient descent; study: e.g. [Valtchev 1995]
  - MMI with log-linear models: generalized iterative scaling (GIS)
  - generalization of GIS to log-linear models with hidden variables and further criteria like MPE and MCE [Heigold+ 2008a]

- Problems:
  - robust setting of step sizes/iteration constants (EBW and gradient descent),
  - convergence speed (especially GIS).


Parameter Optimization: Extended Baum-Welch

- Motivated by a growth transformation [Gopalakrishnan+ 1991].

- Widely used for discriminative training of Gaussian mixture HMMs, e.g. [Normandin 1991, Valtchev+ 1997, Schluter 2000, Woodland & Povey 2002].

- Highly optimized heuristics for finding the right order of magnitude of the iteration constants.

- Training of Gaussian mixture HMMs: requires positive variances to obtain an estimate for the iteration constants.


Parameter Optimization: Gradient Descent

Follow the gradient to optimize the parameters:

  θ ← θ + γ ∇_θ F(θ)

Step sizes:

- heuristic, e.g. for MCE [Chou+ 1992]

- by comparison to EBW [Schluter 2000]

Convergence:

- local optimum

- better convergence: general-purpose approaches, e.g. Qprop, Rprop, or L-BFGS; for experimental comparisons see [McDermott & Katagiri 2005, McDermott+ 2007, Gunawardana+ 2005, Mahajan+ 2006]


Parameter Optimization: Rprop [Riedmiller & Braun 1993]

General-purpose gradient-based optimization:

- assume iteration n

- parameter update:

  θ_i^(n+1) = θ_i^(n) + γ_i^(n) · sign( ∂F(θ^(n)) / ∂θ_i )

- update of the step sizes γ_i^(n):

  γ_i^(n+1) = min{ γ_i^(n) · η⁺, γ_max }   if ∂F(θ^(n))/∂θ_i · ∂F(θ^(n−1))/∂θ_i > 0
  γ_i^(n+1) = max{ γ_i^(n) · η⁻, γ_min }   if ∂F(θ^(n))/∂θ_i · ∂F(θ^(n−1))/∂θ_i < 0
  γ_i^(n+1) = γ_i^(n)                      otherwise

- η⁺ ∈ (1, ∞), η⁻ ∈ (0, 1) (a Python transcription follows below)
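A direct transcription of this update rule; the quadratic test function and the constants η⁺ = 1.2, η⁻ = 0.5 are common defaults, used here as assumptions:

import numpy as np

def rprop_step(theta, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=1.0):
    """One Rprop update for maximizing F; all arguments are numpy arrays."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    return theta + step * np.sign(grad), step

# Maximize F(theta) = -(theta - 3)^2, whose gradient is -2 (theta - 3).
theta, step, prev = np.array([0.0]), np.array([0.1]), np.zeros(1)
for _ in range(60):
    grad = -2.0 * (theta - 3.0)
    theta, step = rprop_step(theta, grad, prev, step)
    prev = grad
print(theta)   # close to 3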


Parameter Optimization: Formal Gradient of MMI

- notation:
  - W: word sequence w_1, ..., w_N
  - r: index of the training segment/utterance given by (X_r, W_r)
  - X_r: acoustic observation vector sequence x_{r1}, ..., x_{rT_r}
  - W_r: reference/spoken word sequence w_{r1}, ..., w_{rN_r}
  - s_1^T: HMM state sequence s_1, ..., s_T

- MMI training criterion:

  F_MMI(θ) = Σ_r log( p(W_r) p_θ(X_r|W_r) / Σ_W p(W) p_θ(X_r|W) )
           = Σ_r ( log p(W_r) p_θ(X_r|W_r) − log Σ_W p(W) p_θ(X_r|W) )

- acoustic model (HMM):

  p_θ(X_r|W) = Σ_{s_1^{T_r}: W} Π_{t=1}^{T_r} p(s_t|s_{t−1}) p_θ(x_{rt}|s_t)


Parameter Optimization: Derivative of MMI w.r.t. the Parameters

Gradient of the MMI criterion:

  ∇_θ F_MMI(θ) = Σ_r ( ∇_θ log p_θ(X_r|W_r) − Σ_W p(W) p_θ(X_r|W) ∇_θ log p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W') )

For efficient evaluation, consider the derivative of the acoustic model, ∇_θ log p_θ(X_r|W).


Parameter Optimization: Derivative of the Acoustic Model

  ∇_θ log p_θ(X_r|W)
  = ∇_θ log Σ_{s_1^{T_r}: W} Π_{t=1}^{T_r} p_θ(x_{rt}|s_t) p(s_t|s_{t−1})
  = Σ_{t=1}^{T_r} Σ_{s_1^{T_r}: W} ( ∇_θ log p_θ(x_{rt}|s_t) ) · Π_{t'=1}^{T_r} p_θ(x_{rt'}|s_{t'}) p(s_{t'}|s_{t'−1}) / Σ_{σ_1^{T_r}: W} Π_{τ=1}^{T_r} p_θ(x_{rτ}|σ_τ) p(σ_τ|σ_{τ−1})
  = Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W)
  = Σ_{t=1}^{T_r} Σ_s γ_rt(s|W) · ∇_θ log p_θ(x_{rt}|s)

with the word sequence conditioned state posterior (occupancy):

  γ_rt(s|W) = Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W) = p_{θ,t}(s|X_r, W)


Parameter Optimization: Derivative of MMI w.r.t. the Parameters

Resubstituting the derivative of the acoustic model into the derivative of the MMI criterion:

  ∇_θ F_MMI(θ) = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · ( γ_rt(s|W_r) − Σ_W p(W) p_θ(X_r|W) γ_rt(s|W) / Σ_{W'} p(W') p_θ(X_r|W') )
               = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · ( γ_rt(s|W_r) − γ_rt(s) )

with the general state posterior (occupancy):

  γ_rt(s) = Σ_W p(W) p_θ(X_r|W) γ_rt(s|W) / Σ_{W'} p(W') p_θ(X_r|W') = p_{θ,t}(s|X_r)


Parameter Optimization: Efficient Calculation of State Occupancies

- Efficient calculation of the spoken word sequence conditioned state occupancy γ_rt(s|W_r): forward-backward state probabilities on the trellis of the word sequence.

- Efficient calculation of the general state occupancy γ_rt(s): forward-backward probabilities on the trellis of the word lattice.

Viterbi approximation:

- γ_rt(s|W) = δ_{s,s_rt(W)} with the forced alignment S_r(W) = s_{r1}(W), ..., s_{rT_r}(W) of the spoken word sequence.

- Assume a (word) lattice M_r for utterance r, with edges ω representing a word w(ω) (in context) with start time t_s(ω) and end time t_e(ω), and a corresponding forced alignment s_{t_s}^{t_e}(ω). An edge sequence W ∈ M_r then corresponds to the word sequence W(W). Consequently, the language model and acoustic model can also be defined for an edge sequence, which then might specify word boundaries, phonetic and language model context.


Parameter Optimization: Word Posterior Probabilities

For the general state occupancy in Viterbi approximation we obtain:

  γ_rt(s) = Σ_W p(W) p_θ(X_r|W) δ_{s,s_rt(W)} / p_θ(X_r)
          = Σ_ω δ_{s,s_rt(ω)} · Σ_{W: ω∈W} p(W) p_θ(X_r|W) / p_θ(X_r)
          = Σ_ω δ_{s,s_rt(ω)} · p(ω|X_r)

with the edge (or word-in-context) posterior

  p(ω|X_r) = Σ_{W: ω∈W} p(W) p_θ(X_r|W) / p_θ(X_r)

A forward-backward algorithm is used to efficiently compute edge (word-in-context) posterior probabilities on word lattices.


Parameter Optimization: Formal Gradient of MPE

- A(W,W_r): accuracy (negated error) between word sequence W and W_r

- example (MPE): approximate phone accuracy [Povey & Woodland 2002]

- expectation of the accuracy:

  E_θ[A(·,W_r)] := Σ_W A(W,W_r) · p(W) p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W')

- MPE training criterion:

  F_MPE(θ) = Σ_r E_θ[A(·,W_r)]


Parameter Optimization: Derivative of MPE w.r.t. the Parameters

Derivative of the MPE criterion, using ∇_θ p_θ(X_r|W) = p_θ(X_r|W) · ( ∇_θ log p_θ(X_r|W) ):

  ∇_θ F_MPE(θ) = Σ_r Σ_W ( A(W,W_r) − E_θ[A(·,W_r)] ) · ( ∇_θ log p_θ(X_r|W) ) · p(W) p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W')

For efficient evaluation, consider the derivative of the acoustic model:

  ∇_θ log p_θ(X_r|W) = Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W)


Parameter Optimization: Derivative of MPE w.r.t. the Parameters

Resubstituting the derivative of the acoustic model into the derivative of the MPE criterion:

  ∇_θ F_MPE(θ) = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · γ_rt(s)

with the general state accuracy:

  γ_rt(s) = Σ_W ( A(W,W_r) − E_θ[A(·,W_r)] ) · Σ_{s_1^{T_r}: s_t=s} p(W) p_θ(X_r, s_1^{T_r}|W) / Σ_{W'} p(W') p_θ(X_r|W')

which can be computed efficiently, similar to the case of the general state occupancies.


Parameter Optimization: Efficient Calculation of the State Accuracy

In general:

- assumption: A(W,W_r) = Σ_{t=1}^{T_r} A(s_rt(W), s_rt(W_r))

- example: approximate phone accuracy [Povey & Woodland 2002]

- efficient calculation of the general state accuracy γ_rt(s): forward-backward accuracies on the trellis of the word lattice [Povey & Woodland 2002]


Parameter Optimization: Word Posterior Accuracies

For the general state accuracy, in Viterbi approximation we obtain:

  γ_rt(s) = Σ_W ( A(W,W_r) − E_θ[A(·,W_r)] ) · p(W) p_θ(X_r|W) δ_{s,s_rt(W)} / p_θ(X_r)
          = Σ_ω δ_{s,s_rt(ω)} · Σ_{W: ω∈W} ( A(W,W_r) − E_θ[A(·,W_r)] ) · p(W) p_θ(X_r|W) / p_θ(X_r)
          = Σ_ω δ_{s,s_rt(ω)} · p(ω|X_r)

with the edge (or word-in-context) posterior accuracies

  p(ω|X_r) = Σ_{W: ω∈W} ( A(W,W_r) − E_θ[A(·,W_r)] ) · p(W) p_θ(X_r|W) / p_θ(X_r)

Later, an efficient way of computing edge (word-in-context) posterior accuracies using word lattices will be presented.


Efficient Calculation of Discriminative Statistics: Forward/Backward Probabilities on Word Lattices

- Let ω_s(W) and ω_e(W) be the first and last edge of a continuous edge sequence W on a word lattice.

- Assume that the lattice fully encodes the language model context:

  p(W(W)) = p(W = ω_1^N) = Π_{n=1}^N p(ω_n|ω_{n−1})

Let ω_ri and ω_rf be the initial and final edges of a word lattice for utterance r. Then define the following forward (Φ) and backward (Ψ) probabilities on initial and final partial edge sequences of the word lattice, respectively:

  Φ(ω) = Σ_{W: ω_s(W)=ω_ri, ω_e(W)=ω} p(W) p_θ(x_{r,1}^{t_e(ω)} | W)

  Ψ(ω) = Σ_{W: ω_s(W)=ω, ω_e(W)=ω_rf} p(W) p_θ(x_{r,t_s(ω)}^{T_r} | W)


Efficient Calculation of Discriminative Statistics: Forward/Backward Probabilities on Word Lattices

For the forward probabilities, a recursion formula can be derived by separating the last edge from the edge sequence in the summation, with ≺ denoting direct predecessor edges:

  Φ(ω) = Σ_{W: ω_s(W)=ω_ri, ω_e(W)=ω} p(W) p_θ(x_{r,1}^{t_e(ω)} | W)
       = Σ_{ω'≺ω} Σ_{W': ω_s(W')=ω_ri, ω_e(W')=ω'} p(W') p(ω|ω') p_θ(x_{r,1}^{t_e(ω')} | W') p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)
       = Σ_{ω'≺ω} Φ(ω') p(ω|ω') p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω).

Using this recursion formula, the forward probabilities can be calculated efficiently on word lattices (see the sketch below).
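A minimal Python sketch of this recursion on a toy lattice; for brevity, each edge's language model and acoustic contributions are pre-multiplied into one score and the LM dependence on the predecessor is ignored, and the edge IDs and probabilities are made up:

import math

# Toy lattice: each edge carries a combined log-score and a list of direct
# predecessor edges; IDs are in topological order, edge 0 initial, edge 3 final.
edges = {
    0: {"score": math.log(1.0), "pred": []},
    1: {"score": math.log(0.4), "pred": [0]},
    2: {"score": math.log(0.6), "pred": [0]},
    3: {"score": math.log(0.5), "pred": [1, 2]},
}

def forward(edges):
    """Log forward scores Phi(edge) via the recursion above (log-sum-exp form)."""
    phi = {}
    for e in sorted(edges):                    # topological order by construction
        preds = edges[e]["pred"]
        if not preds:
            phi[e] = edges[e]["score"]
        else:
            m = max(phi[p] for p in preds)
            s = sum(math.exp(phi[p] - m) for p in preds)
            phi[e] = edges[e]["score"] + m + math.log(s)
    return phi

print(forward(edges))   # Phi at the final edge corresponds to log p_theta(X_r)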


Efficient Calculation of Discriminative Statistics: Forward/Backward Probabilities on Word Lattices

Similar to the forward probabilities, a recursion formula can be derived for the efficient calculation of the backward probabilities, with ≻ denoting direct successor edges:

  Ψ(ω) = Σ_{W: ω_s(W)=ω, ω_e(W)=ω_rf} p(W) p_θ(x_{r,t_s(ω)}^{T_r} | W)
       = Σ_{ω'≻ω} Σ_{W': ω_s(W')=ω', ω_e(W')=ω_rf} p(ω'|ω) p(W') p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω) p_θ(x_{r,t_s(ω')}^{T_r} | W')
       = Σ_{ω'≻ω} p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω) p(ω'|ω) Ψ(ω')


Efficient Calculation of Discriminative Statistics: Forward/Backward Probabilities on Word Lattices

Using the forward and backward probabilities, the edge/word posterior on a word lattice can be written as

  p(ω|X_r) = Φ(ω) · Σ_{ω'≻ω} p(ω'|ω) Ψ(ω') / Φ(ω_rf)

with p_θ(X_r) = Φ(ω_rf) = Ψ(ω_ri).

Word posterior probabilities follow naturally from MPE and similar discriminative training criteria. They are also the basis for confidence measures, which are used for unsupervised training, adaptation, or dialog systems. Furthermore, they are part of approximate approaches to Bayes' decision rule with word error cost, like confusion networks [Mangu+ 1999] or minimum frame word error [Wessel+ 2001a, Hoffmeister+ 2006].


Efficient Calculation of Discriminative Statistics: FB Probabilities, Generalization to WFSTs

Replace the word lattice with a WFST:

- edge label: word with pronunciation

- weight of edge ω: p ← p(ω|ω') · p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)

- semiring: substitute the arithmetic operations (multiplication, addition, inversion) with the operations of the probability semiring

  Semiring      IK    p ⊕ p'   p ⊗ p'   0   1   inv(p)
  probability   IR+   p + p'   p · p'   0   1   1/p

[Figure: example WFST from SieTill with nine states; edges are labeled word/digit/score, e.g. "neun /9/ −876"; the path for the spoken word sequence W_r = "drei sechs neun" is marked in red.]


Efficient Calculation of Discriminative Statistics: FB Probabilities, Generalization to WFSTs

Forward probabilities (pre(ω) ∈ W such that pre(ω) ≺ ω):

  Φ(ω) := ⊕_{W: ω_s(W)=ω_ri, ω_e(W)=ω} ⊗_{ω∈W} p(ω|pre(ω)) ⊗ p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)
        = ⊕_{ω'≺ω} Φ(ω') ⊗ p(ω|ω') ⊗ p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)

Backward probabilities: similar.

Using the forward and backward probabilities, the edge posterior on a WFST can be written as

  p(ω|X_r) = Φ(ω) ⊗ ( ⊕_{ω'≻ω} p(ω'|ω) ⊗ Ψ(ω') ) ⊗ inv(Φ(ω_rf))


Efficient Calculation of Discriminative Statistics: Expectation Semiring

Vector weight (p, v) of edge ω with

- p ← p(ω|ω') · p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)

- v ← A(ω) · p

- accuracy A(ω) of edge ω such that the edge accuracies sum up to A(W,W_r) along each edge sequence W:
  - the approximate phone accuracy [Povey & Woodland 2002] can be decomposed in this way
  - such a decomposition is not possible in general

Expectation semiring [Eisner 2001]: a vector semiring whose first component is a probability semiring.

  Semiring      IK         (p,v) ⊕ (p',v')   (p,v) ⊗ (p',v')       0       1       inv(p,v)
  expectation   IR+ × IR   (p+p', v+v')      (p·p', p·v' + p'·v)   (0,0)   (1,0)   (1/p, −v/p²)
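A minimal Python sketch of these operations, with a two-path check that the v/p ratio of the total weight yields the expected accuracy (the edge probabilities and accuracies are made-up numbers):

# Expectation semiring weights are pairs (p, v); a minimal sketch.
def e_add(a, b):
    (p, v), (q, w) = a, b
    return (p + q, v + w)

def e_mul(a, b):
    (p, v), (q, w) = a, b
    return (p * q, p * w + q * v)

def e_inv(a):
    p, v = a
    return (1.0 / p, -v / (p * p))

E_ZERO, E_ONE = (0.0, 0.0), (1.0, 0.0)

# Two-path check: encoding each edge as (p, A * p) makes a path's product equal
# (prod p, (sum A) * prod p), so v / p of the summed total is the expected accuracy.
path1 = e_mul((0.5, 2.0 * 0.5), (0.8, 1.0 * 0.8))   # p = 0.4, accuracy 3
path2 = e_mul((0.5, 0.0 * 0.5), (0.6, 1.0 * 0.6))   # p = 0.3, accuracy 1
total = e_add(path1, path2)
print(total[1] / total[0])   # (0.4*3 + 0.3*1) / 0.7 ≈ 2.14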


Efficient Calculation of Discriminative Statistics: Edge Posteriors & Expectation Semiring

Probability semiring:

- word posterior probabilities (see MMI derivative) are identical to the edge posteriors using the probability semiring:

  p(ω|X_r) = p_probability(ω|X_r)

- intuitive and classical result [Rabiner 1989]

Expectation semiring:

- word posterior accuracies (see MPE derivative) are identical to the v-component of the edge posteriors using the expectation semiring [Heigold+ 2008b]:

  p(ω|X_r) = p_{expectation,v}(ω|X_r)

- this identity can also be used to efficiently calculate
  - the derivative of the unified training criterion,
  - the covariance between two additive random variables (related to the MPE derivative).


Generalisation to Log-Linear Modeling: Transformation of a Gaussian into a Log-Linear Model

Definition of models: assume feature vector x ∈ IR^D and class c ∈ {1, ..., C}.

Gaussian model N(x|μ_c, Σ_c) with

- means μ_c ∈ IR^D

- positive-definite covariance matrices Σ_c ∈ IR^{D×D}

- class priors p(c) ∈ IR+

induces the posterior

  p_θ(c|x) = p(c) N(x|μ_c, Σ_c) / Σ_{c'} p(c') N(x|μ_{c'}, Σ_{c'})

Log-linear model with unconstrained parameters λ_{c0} ∈ IR, λ_{c1} ∈ IR^D, λ_{c2} ∈ IR^{D×D}:

  p_λ(c|x) = exp( x^T λ_{c2} x + λ_{c1}^T x + λ_{c0} ) / Σ_{c'} exp( x^T λ_{c'2} x + λ_{c'1}^T x + λ_{c'0} )


Generalisation to Log-Linear Modeling: Transformation of a Gaussian into a Log-Linear Model

Comparison of the terms quadratic, linear, and constant in the observations x leads to the transformation rules [Saul & Lee 2002, Gunawardana+ 2005]:

  1. λ_{c2} = −(1/2) Σ_c^{−1}

  2. λ_{c1} = Σ_c^{−1} μ_c

  3. λ_{c0} = −(1/2) ( μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c| ) + log p(c)

(A numeric check of these rules is sketched below.)
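These rules are easy to verify numerically; a short sketch (random test point, arbitrary Gaussian parameters of our choosing) checks that the log-linear score reproduces log p(c) + log N(x|μ_c, Σ_c) exactly:

import numpy as np

def gaussian_to_loglinear(mu, Sigma, prior):
    """Transformation rules 1.-3. above; returns (lambda_c2, lambda_c1, lambda_c0)."""
    Si = np.linalg.inv(Sigma)
    l2 = -0.5 * Si
    l1 = Si @ mu
    l0 = -0.5 * (mu @ Si @ mu + np.log(np.linalg.det(2 * np.pi * Sigma))) \
         + np.log(prior)
    return l2, l1, l0

# Check that x^T l2 x + l1^T x + l0 equals log p(c) + log N(x | mu, Sigma).
rng = np.random.default_rng(0)
x = rng.normal(size=2)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
l2, l1, l0 = gaussian_to_loglinear(mu, Sigma, prior=0.5)
score = x @ l2 @ x + l1 @ x + l0
Si = np.linalg.inv(Sigma)
log_gauss = (-0.5 * ((x - mu) @ Si @ (x - mu)
                     + np.log(np.linalg.det(2 * np.pi * Sigma))) + np.log(0.5))
print(np.isclose(score, log_gauss))   # True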


Generalisation to Log-Linear Modeling: Transformation of a Log-Linear into a Gaussian Model

- invert the transformation from the Gaussian to the log-linear model:

  1. Σ_c = −(1/2) λ_{c2}^{−1}

  2. μ_c = Σ_c λ_{c1}

  3. p(c) = exp( λ_{c0} + (1/2)( μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c| ) )

- problem: the parameter constraints are not satisfied in general:
  - covariance matrices Σ_c must be positive-definite
  - priors p(c) must be normalized

- solution: the model parameters of the posterior are ambiguous, e.g. for ∆λ_2 ∈ IR^{D×D}, ∆λ_0 ∈ IR:

  exp( x^T (λ_{c2}+∆λ_2) x + λ_{c1}^T x + (λ_{c0}+∆λ_0) ) / Σ_{c'} exp( x^T (λ_{c'2}+∆λ_2) x + λ_{c'1}^T x + (λ_{c'0}+∆λ_0) )
  = exp( x^T λ_{c2} x + λ_{c1}^T x + λ_{c0} ) / Σ_{c'} exp( x^T λ_{c'2} x + λ_{c'1}^T x + λ_{c'0} )

Generalisation to Log-Linear Modeling: Transformation of a Log-Linear into a Gaussian Model

Invert the transformation rules for the transformed log-linear model:

  1. Σ_c = −(1/2) (λ_{c2} + ∆λ_2)^{−1}

  2. μ_c = Σ_c λ_{c1}

  3. p(c) = exp( (λ_{c0}+∆λ_0) + (1/2)( μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c| ) )

Use the additional degrees of freedom to impose the parameter constraints:

- choose ∆λ_2 ∈ IR^{D×D} such that the λ_{c2} + ∆λ_2 are negative-definite

- choose ∆λ_0 such that p(c) is normalized, i.e.,

  ∆λ_0 := − log Σ_c exp( λ_{c0} + (1/2)( μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c| ) )


Convex Optimization: Motivation

Conventional approach:

- depends on initialization and choice of optimization algorithm

- spurious local optima (non-convex training criterion)

- many heuristics required

- i.e., involves much engineering work

"Fool-proof" approach:

- unique optimum (independent of initialization)

- accessibility of the global optimum (convex training criterion)

- joint optimization of all model parameters, no parameters to be tuned


Convex Optimization: Convex Training Criteria in Speech Recognition

Assumptions to cast an HCRF into a CRF:

- log-linear parameterization, e.g. p(x|s) = exp( x^T λ_{s2} x + λ_{s1}^T x + λ_{s0} ) and p(s|s') = exp(α_{s's})

- MMI-like training criterion

- alignment represents the spoken sequence

- alignment of the spoken sequence known and kept fixed

- use single densities with augmented features instead of mixtures

- exact normalization constant


Convex Optimization: Lattice-Based MMI

  F_lattice(λ) = Σ_r log( Σ_{s_1^{T_r} ∈ N_r} p_λ(x_1^{T_r}, s_1^{T_r}) / Σ_{s_1^{T_r} ∈ D_r} p_λ(x_1^{T_r}, s_1^{T_r}) )

- numerator word lattice N_r: state sequences s_1^{T_r} representing the correct hypothesis

- denominator word lattice D_r: correct and competing state sequences; use word pair approximation and pruning

- non-convex

[Figure: word lattice from SieTill; edges are labeled with words ("fünf", "eins", "sechs", "acht", "neun", "zwei", "drei", "[SIL]") and word boundary times.]


Convex Optimization: Fool-Proof MMI

  F_fool(λ) = Σ_r log( p_λ(x_1^{T_r}, ŝ_1^{T_r}) / Σ_{s_1^{T_r} ∈ S_r} p_λ(x_1^{T_r}, s_1^{T_r}) )

- consider only the best state sequence ŝ_1^{T_r} in the numerator, kept fixed

- sum over the full state sequence network S_r in the denominator

- convex

[Figure: part of a (pruned) HMM state network; edges are labeled with HMM state indices and words, e.g. "18:acht".]


Convex Optimization: Frame-Based MMI

  F_frame(λ) = Σ_r Σ_{t=1}^{T_r} log( p_λ(x_t, s_t) / Σ_{s=1}^S p_λ(x_t, s) )

- frame discrimination, cf. hybrid approach

- assume a fixed alignment s_1^{T_r} for the numerator

- summation over all HMM states s ∈ {1, ..., S} in the denominator

- convex (the objective is sketched below)
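For a log-linear model this objective is just a per-frame log-posterior of the aligned state; a minimal sketch with made-up scores (the helper name is ours):

import numpy as np

def frame_mmi(scores, align):
    """Frame-based MMI objective. scores: T x S matrix of log p_lambda(x_t, s);
    align: fixed numerator state alignment s_1 .. s_T (integer state indices)."""
    log_den = np.logaddexp.reduce(scores, axis=1)   # log sum over all S states
    return float(np.sum(scores[np.arange(len(align)), align] - log_den))

scores = np.array([[2.0, 0.5, -1.0],    # T = 2 frames, S = 3 states
                   [0.1, 1.5, 0.3]])
print(frame_mmi(scores, np.array([0, 1])))   # sum of per-frame log-posteriors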


Convex Optimization: Refinements to MMI

These refinements do not break convexity:

- ℓ2-regularization

- margin term


Convex Optimization: Experimental Results, Initialization

Analyze the effect of the initial parameters on training:

- vary the initialization for different training criteria

- experiments: digit strings (SieTill, German, telephone)

[Figure: training criterion F and WER [%] over 300 iterations for frame-based, lattice-based, and fool-proof M-MMI, each initialized from scratch, from ML, or from the frame-based model.]


Convex Optimization: Read Speech (WSJ)

- 5k vocabulary, trigram language model
- phone-based HMMs, 1,500 CART-tied triphones
- audio data: 15h (training), 0.4h (test)

- log-linear model with kernel-like features f(x):
  - first-order (f_d(x) = x_d) and second-order (f_{dd'}(x) = x_d · x_{d'}) features
  - cluster features: assume a GMM of the marginal distribution, p(x) = Σ_l p(x,l), and set

    f_l(x) = p(l|x) if p(l|x) ≥ threshold, and 0 otherwise

    (a sketch of this feature function follows below)

- starting from scratch (model) and linear segmentation
- frame-based MMI, with re-alignments
- details: [Wiesler 2009]
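A sketch of the thresholded cluster-feature function for a diagonal-covariance GMM; the function name, parameter layout, threshold value, and test data are our assumptions:

import numpy as np

def cluster_features(x, means, variances, priors, threshold=1e-2):
    # log p(x, l) for a diagonal-covariance GMM, one entry per component l
    logp = (np.log(priors)
            - 0.5 * np.sum(np.log(2 * np.pi * variances)
                           + (x - means) ** 2 / variances, axis=1))
    post = np.exp(logp - np.logaddexp.reduce(logp))   # p(l | x)
    return np.where(post >= threshold, post, 0.0)     # sparse posterior features

means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
print(cluster_features(np.array([0.2, -0.1]), means, variances,
                       priors=np.array([0.5, 0.5])))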


Convex Optimization: Read Speech (WSJ)

  Feature setup                                          WER [%]
  First-order features, monophones                       22.7
  + second-order features                                10.3
  + 210 cluster features + temporal context of size 9     6.2
  + 1,500 CART-tied HMM states (triphones)                 3.9
  + realignment                                            3.6
  GHMM (ML)                                                3.6
  GHMM (MMI)                                               3.0


Incorporation of Margin Concept: Motivation

Goal: incorporation of a margin term into conventional training criteria.

- replace the likelihoods p(W) p(X|W) with margin-likelihoods p(W) p(X|W) exp(−ρ A(W,W_r))

- A(W,W_r): accuracy between hypothesis W and reference W_r

- interpretation (boosting): emphasize incorrect hypotheses by up-weighting

- interpretation (large margin): next slides

[Figure: schematic of prior margin-based work, differing in task complexity, loss functions, convergence/local optima, parameterization, and optimization.]

Margin in training is promising. What is the individual contribution of the margin in LVCSR training?


Incorporation of Margin Concept: Support Vector Machines (Hinge Loss)

Optimization problem for SVMs:

  F_SVM(λ) = −(C/2) ‖λ‖² − Σ_{r=1}^R l(W_r, d_r; ρ)

- feature functions f(X,W), model parameters λ

- distance d_rW := λ^T ( f(X_r,W_r) − f(X_r,W) )

- hinge loss function

  l^(hinge)(W_r, d_r; ρ) := max_{W≠W_r} max{ −d_rW + ρ( A(W_r,W_r) − A(W,W_r) ), 0 }

- ℓ2-regularization with constant C > 0

[Altun 2003, Taskar 2003]


Incorporation of Margin Concept: Smooth Approximation to the SVM, Margin-MMI

Margin-based/modified MMI (M-MMI):

  F_{M-MMI,γ}(λ) = −(C/2) ‖λ‖² + Σ_{r=1}^R (1/γ) log( exp(γ(λ^T f(X_r,W_r) − ρ A(W_r,W_r))) / Σ_W exp(γ(λ^T f(X_r,W) − ρ A(W,W_r))) )

Lemma: F_{M-MMI,γ} → F_SVM with hinge loss for γ → ∞ (pointwise convergence).

[Heigold+ 2008b]


Incorporation of Margin Concept: Proof

Define ∆A(W,W_r) := A(W_r,W_r) − A(W,W_r). Then:

  −(1/γ) log( exp(γ(λ^T f(X_r,W_r) − ρ A(W_r,W_r))) / Σ_W exp(γ(λ^T f(X_r,W) − ρ A(W,W_r))) )

  = (1/γ) log( 1 + Σ_{W≠W_r} exp(γ(−d_rW + ρ ∆A(W,W_r))) )

  → (γ→∞)  max_{W≠W_r}{ −d_rW + ρ ∆A(W,W_r) }   if ∃ W≠W_r: d_rW < ρ ∆A(W,W_r)
            0                                     otherwise

  = max_{W≠W_r} max{ −d_rW + ρ ∆A(W,W_r), 0 }

  =: l^(hinge)(W_r, d_r; ρ).
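The pointwise convergence is easy to observe numerically; a small sketch with made-up values z_W = −d_rW + ρ ∆A(W,W_r):

import numpy as np

def smooth_hinge(z, gamma):
    """(1/gamma) * log(1 + sum_W exp(gamma * z_W)), the expression in the proof."""
    return np.logaddexp.reduce(np.concatenate(([0.0], gamma * z))) / gamma

z = np.array([-0.3, 0.8, 0.2])    # z_W = -d_rW + rho * Delta A(W, W_r)
for gamma in (1.0, 10.0, 100.0):
    print(gamma, smooth_hinge(z, gamma))   # tends to max(max_W z_W, 0) = 0.8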


Incorporation of Margin Concept: Support Vector Machines (Margin Error)

Optimization problem for SVMs:

  F_SVM(λ) = −(C/2) ‖λ‖² − Σ_{r=1}^R l(W_r, d_r; ρ)

- feature functions f(X,W), model parameters λ

- distance d_rW := λ^T ( f(X_r,W_r) − f(X_r,W) )

- margin error loss function

  l^(error)(W_r, d_r; ρ) := E( A( argmin_W [ d_rW + ρ A(W,W_r) ], W_r ) )

- ℓ2-regularization with constant C > 0

[Heigold+ 2008b]


Incorporation of Margin Concept: Smooth Approximation to the SVM, Margin-MPE

Margin-based/modified MPE (M-MPE):

  F_{M-MPE,γ}(λ) = −(C/2) ‖λ‖² + Σ_{r=1}^R Σ_W E(W,W_r) · exp(γ(λ^T f(X_r,W) − ρ A(W,W_r))) / Σ_V exp(γ(λ^T f(X_r,V) − ρ A(V,W_r)))

Lemma: F_{M-MPE,γ} → F_SVM with margin error loss for γ → ∞.

[Heigold+ 2008b]


Incorporation of Margin Concept: Experimental Evaluation of the Margin

Digit strings (SieTill, German, telephone):

  dns/state   feature orders         # param   criterion   WER [%]
  1           first                  11k       ML          3.8
                                               MMI         2.9
                                               M-MMI       2.7
  64          first                  690k      ML          1.8
                                               MMI         1.8
                                               M-MMI       1.6
  1           first, second,         1,409k    Frame       1.8
              and third                        MMI         1.7
                                               M-MMI       1.5


Incorporation of Margin Concept: Experimental Evaluation of the Margin

European parliament plenary sessions in English (EPPS) and Mandarin broadcasts:

  WER [%]
  Criterion   EPPS En 90h   Mandarin BN/BC 230h   Mandarin BN/BC 1500h
  ML          12.0          21.9                  17.9
  MMI         –             20.8                  –
  M-MMI       –             20.6                  –
  MPE         11.5          20.6                  16.5
  M-MPE       11.3          20.3                  16.3


Incorporation of Margin Concept: Experimental Evaluation of the Margin

Handwriting recognition (IFN/ENIT):

- isolated town names, handwritten

- slice features chosen so that a 1D HMM can be used

- details: see [Dreuw+ 2009]

  WER [%]
  Criterion   abc-d   abd-c   acd-b   bcd-a   abcd-e
  ML          7.8     8.7     7.8     8.7     16.8
  MMI         7.4     8.2     7.6     8.4     16.4
  M-MMI       6.1     6.8     6.1     7.0     15.4


Conclusions: Effective Discriminative Training

- Discriminative criteria:
  - fit the decision rule: minimize the training error
  - limit overfitting: include regularization and a margin to exploit the remaining degrees of freedom of the parameters

- Optimization methods:
  - general-purpose methods give robust estimates
  - in the convex case, gradient descent is still faster than the growth transform (GIS)

- Log-linear modeling:
  - convex (without hidden variables)
  - covers Gaussians completely, without constraints on e.g. the variances
  - opens the modeling up to arbitrary features
  - initialization: from scratch or from Gaussians

- Estimation of statistics:
  - efficiency: use word lattices to represent competing word sequences
  - implementation: generic approach using WFSTs, covering a whole class of criteria


ReferencesI Y. Altun, I. Tsochantaridis, and T. Hofmann, “Hidden Markov support

vector machines,” in International Conference on Machine Learning(ICML) 2003, Washington, DC, USA, 2003.

I L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer. “MaximumMutual Information Estimation of Hidden Markov Model Parameters forSpeech Recognition,”Proc. 1986 Int. Conf. on Acoustics, Speech and Signal Processing, Vol.1, pp. 49-52, Tokyo, Japan, May 1986.

I P. F. Brown. The Acoustic-Modeling Problem in Automatic SpeechRecognition, Ph.D. thesis, Department of Computer Science, CarnegieMellon University, Pittsburgh, PA, May 1987.

I W. Chou, B.-H. Juang, and C.-H. Lee, “Segmental GPD training ofHMM based speech recognizer,” in IEEE International Conference onAcoustics, Speech, and Signal Processing (ICASSP) 1992, San Francisco,CA, USA, March 1992, pp. 473-476.

I Y.-L. Chow: “Maximum Mutual Information Estimation of HMMParameters for Continuous Speech Recognition using the N-bestAlgorithm,” inProc. Int. Conf. on Acoustics, Speech and SignalProcessing (ICASSP), pp. 701–704, Albuquerque, NM, April 1990.

Schluter/Ney: Advanced ASR 560 August 5, 2010

Page 562: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

ReferencesI P. Dreuw, G. Heigold, and H. Ney, “Confidence-based discriminative

training for model adaptation in offline Arabic handwriting recognition,”in International Conference on Document Analysis and Recognition(ICDAR), Barcelona, Spain, July 2009.

I J. Eisner, “Expectation semirings: Flexible EM for finite-statetransducers,” in International Workshop on Finite-State Methods andNatural Language Processing (FSMNLP), Helsinki, Finland, August 2001.

I P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, D. Nahamoo. “AnInequality for Rational Functions with Applications to Some StatisticalEstimation Problems,” IEEE Transactions on Information Theory, Vol.37, Nr. 1, pp. 107-113, January 1991.

I A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, “Hiddenconditional random fields for phone classification,” in Interspeech, pp.117 120, Lisbon, Portugal, Sept. 2005.

I G. Heigold, W. Macherey, R. Schluter, and H. Ney: ”Minimum ExactWord Error Training,” in Proc. IEEE Automatic Speech Recognition andUnderstanding Workshop (ASRU), pages 186–190, San Juan, PuertoRico, November 2005.

Schluter/Ney: Advanced ASR 561 August 5, 2010

Page 563: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

ReferencesI G. Heigold, T. Deselaers, R. Schluter, H. Ney: “GIS-like Estimation of

Log-Linear Models with Hidden Variables,” in Proc. IEEE InternationalConference on Acoustics, Speech, and Signal Processing (ICASSP), pages4045–4048, Las Vegas, NV, USA, April 2008.

I G. Heigold, T. Deselaers, R. Schluter, H. Ney, “Modified MMI/MPE: Adirect evaluation of the margin in speech recognition,” in InternationalConference on Machine Learning (ICML), pp. 384-391, Helsinki, Finland,July 2008.

I G. Heigold: A Log-Linear Modeling Framework for Speech Recognition,Doctoral Thesis to be submitted, RWTH Aachen University, Aachen,Germany, 2010.

I B. Hoffmeister, T. Klein, R. Schluter, and H. Ney: “Frame Based SystemCombination and a Comparison with Weighted ROVER and CNC,” inProc. Interspeech, pages 537–540, Pittsburgh, PA, USA, September2006.

I B.-H. Juang and S. Katagiri, “Discriminative learning for minimum errorclassification,” IEEE Transactions on Signal Processing, vol. 40, no. 12,pp. 3043-3054, 1992.

Schluter/Ney: Advanced ASR 562 August 5, 2010

Page 564: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

References

I D. Kanevsky: “Extended Baum Welch transformations for generalfunctions,” in Proc. IEEE International Conference on Acoustics, Speech,and Signal Processing (ICASSP), pp. 821-824, Montreal, Quebec,Canada, May 2004.

I W. Macherey, L. Haferkamp, R. Schlüter, H. Ney: “Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition,” in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, September 2005.

I W. Macherey, “Discriminative training and acoustic modeling for automatic speech recognition,” Ph.D. thesis to be submitted, RWTH Aachen University, 2010.

I M. Mahajan, A. Gunawardana, A. Acero: “Training algorithms for hidden conditional random fields,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006.

I L. Mangu, E. Brill, A. Stolcke: “Finding Consensus Among Words: Lattice-Based Word Error Minimization,” in Proc. European Conference on Speech Communication and Technology (EUROSPEECH), pp. 495–498, Budapest, Hungary, Sept. 1999.

I E. McDermott, S. Katagiri: “Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, April 2005.

I E. McDermott, T. Hazen, J. Le Roux, A. Nakamura, S. Katagiri: “Discriminative training for large vocabulary speech recognition using Minimum Classification Error,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 1, pp. 203–223, April 2007.

I Y. Normandin, “Hidden Markov Models, Maximum Mutual Information, and the Speech Recognition Problem,” Ph.D. thesis, McGill University, Montreal, Canada, 1991.

I D. Povey and P. C. Woodland, “Minimum phone error and I-smoothing for improved discriminative training,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2002, Orlando, FL, May 2002, vol. 1, pp. 105–108.

I L.R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

I M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The Rprop algorithm,” in IEEE International Conference on Neural Networks (ICNN) 1993, San Francisco, CA, USA, 1993, pp. 586–591.

I L. Saul and D. Lee, “Multiplicative updates for classification by mixture models,” in T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS), MIT Press, 2002.

I R. Schlüter, B. Müller, F. Wessel, and H. Ney: “Interdependence of Language Models and Discriminative Training,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Vol. 1, pp. 119–122, Keystone, CO, December 1999.

I R. Schlüter: Investigations on Discriminative Training Criteria, Doctoral Thesis, RWTH Aachen University, Aachen, Germany, Sept. 2000.

I R. Schlüter, H. Ney: “Model-based MCE Bound to the True Bayes’ Error,” IEEE Signal Processing Letters, Vol. 8, No. 5, pp. 131–133, May 2001.

I B. Taskar, C. Guestrin, and D. Koller, “Max-margin Markov networks,” in Advances in Neural Information Processing Systems (NIPS), 2003.

I V. Valtchev: Discriminative Methods in HMM-based Speech Recognition, Ph.D. thesis, St. John’s College, University of Cambridge, Cambridge, March 1995.

I V. Valtchev, J. J. Odell, P. C. Woodland, S. J. Young. “MMIE Training of Large Vocabulary Recognition Systems,” Speech Communication, Vol. 22, No. 4, pp. 303–314, September 1997.

I F. Wessel, K. Macherey, and R. Schlüter, “Using word probabilities as confidence measures,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 1998, Seattle, WA, USA, May 1998, pp. 225–228.

I F. Wessel, R. Schlüter, and H. Ney: “Explicit Word Error Minimization using Word Hypothesis Posterior Probabilities,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 33–36, Salt Lake City, Utah, May 2001.

I S. Wiesler et al., “Investigations on features for log-linear acoustic models in continuous speech recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Merano, Italy, Dec. 2009.

I P. C. Woodland and D. Povey, “Large scale discriminative training for speech recognition,” in Automatic Speech Recognition (ASR) 2000, Paris, France, September 2000, pp. 7–16.

I P.C. Woodland, D. Povey: “Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition,” Computer Speech and Language, Vol. 16, No. 1, pp. 25–48, 2002.

Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training
   7.1 Introduction
   7.2 Overview
   7.3 Training Criteria
   7.4 Parameter Optimization
   7.5 Efficient Calculation of Discriminative Statistics
   7.6 Generalisation to Log-Linear Modeling
   7.7 Convex Optimization
   7.8 Incorporation of Margin Concept
   7.9 Conclusions
   7.10 References
   7.11 Speech Tasks: Corpus Statistics & Setups

Speech Tasks: Corpus Statistics & Setups

Identifier      Description                          Train/Test [h]
--------------------------------------------------------------------
SieTill         German digit strings                 11/11 (Test)
EPPS En         English European Parliament          92/2.9 (Evl07)
                plenary speech
BNBC Cn 230h    Mandarin broadcasts                  230/2.2 (Evl06)
BNBC Cn 1500h   Mandarin broadcasts                  1,500/2.2 (Evl06)

Identifier      #States/#Dns    Features
--------------------------------------------------------------------
SieTill         430/27k         25 LDA(MFCC)
EPPS En         4,500/830k      45 LDA(MFCC+voicing)+VTLN+SAT/CMLLR
BNBC Cn 230h    4,500/1,100k    45 LDA(MFCC)+3 tones+VTLN+SAT/CMLLR
BNBC Cn 1500h   4,500/1,200k    45 SAT/CMLLR(PLP+voicing+3 tones+32 NN)+VTLN

(#States/#Dns: number of HMM states / total number of Gaussian densities;
the leading number in the Features column is the feature dimension.)
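The same information in machine-readable form may be easier to work with. The following is a minimal sketch, not part of the lecture materials; the class and field names (AsrSetup, train_hours, densities, etc.) are illustrative assumptions, while all values are taken directly from the table above.

from dataclasses import dataclass

@dataclass
class AsrSetup:
    """One row of the corpus/setup table (illustrative field names)."""
    name: str
    train_hours: float
    test_hours: float
    test_set: str       # evaluation set identifier, e.g. "Evl07"
    hmm_states: int     # number of HMM states
    densities: int      # total number of Gaussian densities
    feature_dim: int    # dimension after the final feature transform
    front_end: str      # feature pipeline as written in the table

SETUPS = [
    AsrSetup("SieTill", 11, 11, "Test", 430, 27_000, 25, "LDA(MFCC)"),
    AsrSetup("EPPS En", 92, 2.9, "Evl07", 4_500, 830_000, 45,
             "LDA(MFCC+voicing)+VTLN+SAT/CMLLR"),
    AsrSetup("BNBC Cn 230h", 230, 2.2, "Evl06", 4_500, 1_100_000, 45,
             "LDA(MFCC)+3 tones+VTLN+SAT/CMLLR"),
    AsrSetup("BNBC Cn 1500h", 1_500, 2.2, "Evl06", 4_500, 1_200_000, 45,
             "SAT/CMLLR(PLP+voicing+3 tones+32 NN)+VTLN"),
]

# Example: list the setups ordered by amount of training data.
for s in sorted(SETUPS, key=lambda x: x.train_hours):
    print(f"{s.name}: {s.train_hours}h training, {s.densities/1e6:.2f}M densities")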

Handwriting Tasks: Corpus Statistics & Setups

IFN/ENIT:

I isolated Tunisian town names

I 4 training folds + 1 additional fold for testing (see the sketch after the table below)

I simple appearance-based image slice features

I each fold comprises approximately 500,000 frames

                   #Observations [k]
Corpus             Towns    Frames
-----------------------------------
IFN/ENIT a         6.5      452
         b         6.7      459
         c         6.5      452
         d         6.7      451
         e         6.0      404
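As a rough illustration of the fold setup, here is a minimal sketch; it is an assumption for illustration only, since the slides do not specify whether the test fold is fixed or rotated. The fold statistics are taken from the table above.

# Fold statistics from the table: (towns, frames), both in thousands.
FOLDS = {"a": (6.5, 452), "b": (6.7, 459), "c": (6.5, 452),
         "d": (6.7, 451), "e": (6.0, 404)}

# Rotate over the five folds: train on four, test on the held-out one.
for test_fold in FOLDS:
    train_folds = [f for f in FOLDS if f != test_fold]
    train_frames = sum(FOLDS[f][1] for f in train_folds)
    # Each fold has roughly 500k frames, so training sees about 4 x 500k.
    print(f"train on {'+'.join(train_folds)}, test on {test_fold}: "
          f"~{train_frames}k training frames")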
