
Advanced Methods in Automatic Speech Recognition

Dr. Ralf Schlüter, Prof. Dr.-Ing. Hermann Ney

Lehrstuhl für Informatik 6 – Human Language Technology and Pattern Recognition

Computer Science Department, RWTH Aachen University, D-52056 Aachen, Germany

August 5, 2010


Schedule

Course: Introduction to Automatic Speech Recognition

Event      Times                          Room   Start
Lecture    Tue (biweekly) 15:45–17:15h    6124   April 20, 2010
           Thu            14:00–15:30h    6124
Exercises  Tue (biweekly) 15:45–17:15h    6124   April 27, 2010

Room 6124: seminar room of the Lehrstuhl für Informatik 6

See the course site
http://www-i6.informatik.rwth-aachen.de/web/Teaching/Lectures/SS10/advasr
for:

- news
- downloads (documents, exercise sheets, etc.)
- course information
- contacts


Contents

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word Graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Outline

0. Lehrstuhl für Informatik 6
   0.1 Research Topics
   0.2 Projects
   0.3 Courses
   0.4 Textbooks

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word Graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Lehrstuhl für Informatik 6: Research Topics

Method: Stochastic Modelling

- Modelling dependencies and vague knowledge (contrast: rule-based approach)
- Decision making, in particular in context
- Automatic learning from data/examples

Applications: Human Language Technology and Pattern Recognition


Applications: Examples

- Speech recognition
  – small vocabulary
  – large vocabulary
- Machine translation
- Natural language processing
  – text/document classification
  – information retrieval
  – parsing and syntactic analysis
- Language understanding and dialog systems
- Image recognition
  – object recognition
  – handwriting recognition


Applications: Examples

- Diagnosis and expert systems
- Other applications:
  – speaker verification and identification
  – fingerprint verification and identification
  – DNA sequence identification
  – gesture recognition
  – lip reading
  – geological analysis
  – high-energy physics: bubble chamber tracks
  – ...


Lehrstuhl für Informatik 6 (i6): Projects

- ARISE (EU): Automatic Railway Information Systems across Europe
  – Speech Recognition and Language Modelling
- EuTrans II (EU): Translation of Spoken Language
  – Speech Recognition and Translation
- Institut für deutsche Sprache (IdS):
  – Language Modelling for Newspapers
- Audio Document Retrieval (NRW):
  – Speech Recognition and Information Retrieval
- Verbmobil II (BMBF): Speech Recognition and Translation for Appointment Scheduling and Traveling Information
  – Speech Recognition
  – Speech Translation
  – Prototype Modules


Projects i6

- Image Object Recognition (RWTH):
  – OCR (optical character recognition)
  – Medical Images
- Advisor (EU):
  – Speech Recognition for German Broadcast News
- EGYPT follow-up (NSF):
  – Basic Algorithms for Statistical Machine Translation
- Audio Document Retrieval (NRW ?):
  – German Broadcast News: Recognition and Information Retrieval
- Bilateral Projects with Companies (including start-ups)
- German DFG:
  – Improved Acoustic Modelling using Structured Models
  – Statistical Methods for Written Language Translation
  – Statistical Modeling for Image Object Recognition


Projects i6

- Coretex (EU):
  – Improving Core Technology for Speech Recognition
  – Applications: Broadcast News in Several Languages
- LC-Star (EU):
  – Lexical and Corpora Resources for Recognition, Translation and Synthesis
  – Prototype system for machine translation of spoken sentences
- TC-Star (EU):
  – Technology and Corpora for Speech to Speech Translation
  – Applications: Broadcast News and Speeches/Lectures
- Transtype-2 (EU):
  – Machine translation of written text
  – Application: interactive machine-aided translation
- PF-Star (EU):
  – Machine translation of spoken dialogues
  – Application: tourism and travelling


Projects i6

- JUMAS (EU):
  – JUdicial MAnagement by digital libraries Semantics
  – Application: audio and video search of court proceedings
- LUNA (EU):
  – spoken Language UNderstanding in multilinguAl communication systems
  – Application: real-time understanding of spontaneous speech in advanced telecom services
- GALE (US-DARPA):
  – Global Autonomous Language Exploitation
  – Application: Information Processing in Multiple Languages
- QUAERO [Lat.: to search] (OSEO/France):
  – multimedia and multilingual indexing
  – Application: extract information from written texts, speech and music audio, images, and video


Courses

- Introductory lectures (L3/4) with exercises (E2) for Bachelor, Master, and Diploma students:
  – ASR: (Introduction to) Automatic Speech Recognition
  – PRN: (Introduction to) Pattern Recognition and Neural Networks
  – NLP: (Introduction to) Natural Language Processing
- Advanced lectures (L3) with exercises (E1/2) for Master and Doctoral students:
  – advASR: Advanced Automatic Speech Recognition
  – advPRN: Advanced Pattern Recognition
  – advNLP: Advanced Natural Language Processing
- Further lectures (L2) with exercises (E1):
  – MIP: Medical Image Processing ('Ringvorlesung', each WS)


Courses (ctd.)

- Seminars:
  – Bachelor Degree (SS, block)
  – Diplom Degree (SS, block)
  – Doctor Degree (WS+SS)
- Laboratory Courses (WS, block)
- Study Groups (WS+SS: speech, language, image)

New course cycles:

year   term  lectures
09/10  WS    PRNN (L4/3, E2), ASR (L4/3, E2)
       SS    NLP (L4/3, E2), advASR (L3, E1)
10/11  WS    PRNN (L4/3, E2), ASR (L4/3, E2)


Exams i6: Diplom Degree

- Area of specialization (Vertiefungsgebiet) i6 with the topics:
  – Automatic Speech Recognition (ASR)
  – Pattern Recognition and Neural Networks (PRNN)
  – Natural Language Processing (NLP)
  – ...
  Select 12 hours (SWS) out of the i6 lectures.


Exams i6: Diplom Degree (ctd.)

- Practical computer science (Prakt. Informatik) (3 areas): recommendation: 12 hours (SWS) out of
  – two L4 lectures from: ASR, PRNN, NLP
  – one L4 lecture from i6-external lectures:
    - data bases
    - artificial intelligence
    - ... (additional alternatives: on demand)


Examinations i6

- Master in Informatik, Media Informatics, or Software Systems Engineering:
  credit system: oral exam after each course / at the end of the lecture period
- ERASMUS students of Computer Science:
  oral exam/colloquium for a graded certificate at the end of the lecture period

Note: consult Prof. Ney before June 2010 for exam dates, and before registering for the exam.


Textbooks: Topics i6

Textbooks on Speech Recognition:

- emphasis on signal processing and small-vocabulary recognition:
  L. Rabiner, B. H. Juang: Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
- emphasis on large vocabulary and language modelling:
  F. Jelinek: Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1997.
- introduction to both speech and language:
  D. Jurafsky, J. H. Martin: Speech and Language Processing. Prentice Hall, Englewood Cliffs, NJ, 2000.
- advanced topics:
  R. De Mori: Spoken Dialogues with Computers. Academic Press, London, 1998.


Textbooks: Topics i6

Textbooks on Signal Processing:

- A. V. Oppenheim, R. W. Schafer: Discrete-Time Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1989.
- A. Papoulis: Signal Analysis. McGraw-Hill, New York, NY, 1977.
- A. Papoulis: The Fourier Integral and its Applications. McGraw-Hill Classic Textbook Reissue Series, McGraw-Hill, New York, NY, 1987.
- W. K. Pratt: Digital Image Processing. Wiley & Sons, New York, NY, 1991.

Further reading on Signal Processing:

- T. K. Moon, W. C. Stirling: Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2000.
- J. R. Deller, J. G. Proakis, J. H. L. Hansen: Discrete-Time Processing of Speech Signals. Macmillan Publishing Company, New York, NY, 1993.
- L. Berg: Lineare Gleichungssysteme mit Bandstruktur. VEB Deutscher Verlag der Wissenschaften, Berlin, 1986.


Textbooks: Topics i6

Textbooks on Natural Language Processing (statistical/corpus-based):

- introduction to both speech and language:
  D. Jurafsky, J. H. Martin: Speech and Language Processing. Prentice Hall, Englewood Cliffs, NJ, 2000.
- emphasis on statistical methods for written language:
  C. D. Manning, H. Schütze: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
- related field, artificial intelligence:
  S. Russell, P. Norvig: Artificial Intelligence. Prentice Hall, Englewood Cliffs, NJ, 1995 (in particular Chapters 22–25).


Textbooks: Topics i6

Textbooks on Statistical Learning (Pattern Recognition, Neural Networks, Data Mining, ...):

- best introduction (including modern concepts):
  R. O. Duda, P. E. Hart, D. G. Stork: Pattern Classification. 2nd ed., J. Wiley & Sons, New York, NY, 2001.
- emphasis on statistical concepts:
  B. D. Ripley: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, England, 1996.
- emphasis on modern statistical concepts:
  T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2001.
- emphasis on theory and principles:
  L. Devroye, L. Györfi, G. Lugosi: A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.


Textbooks: Topics i6

Textbooks on mathematical methods (vector spaces and matrices, statistics, optimization methods, ...):

- best overall summary:
  T. K. Moon, W. C. Stirling: Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2000.
- introduction to modern statistics:
  G. Casella, R. L. Berger: Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA, 1990.
- good overview of numerical algorithms and implementations:
  W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery: Numerical Recipes in C. 2nd ed., Cambridge Univ. Press, Cambridge, 1992.


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word Graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Overview: Architecture

Starting point: Bayes decision rule

- results in a minimum number of recognition errors (under certain conditions)
- more details: see the lecture Pattern Recognition and Neural Networks


Speech Recognition: Bayes' Decision Rule

[Figure: architecture of the statistical speech recognizer. The speech input is converted by the acoustic analysis into a sequence of acoustic vectors $x_1 \ldots x_T$. The global search maximizes
$$\Pr(w_1 \ldots w_N) \cdot \Pr(x_1 \ldots x_T \mid w_1 \ldots w_N)$$
over the word sequences $w_1 \ldots w_N$, where the acoustic model $\Pr(x_1 \ldots x_T \mid w_1 \ldots w_N)$ is built from the phoneme inventory and the pronunciation lexicon, and $\Pr(w_1 \ldots w_N)$ is the language model; the output is the recognized word sequence.]


Speech Recognizer: Sources of Errors

Why does a recognition system make errors? Reasons from the viewpoint of Bayes' decision rule:

- incorrect acoustic model:
  – poor acoustic analysis
  – poor phoneme models
  – poor pronunciation model
- incorrect language model
- incorrect search procedure: the maximum is not found
- decision rule: discrepancy between the evaluation measure (word error rate) and the decision rule (which minimizes the sentence error rate)


Speech Recognition: Effect of Language Model and Other Knowledge Sources

Importance of higher-level knowledge and its integration in the search process. Test results on the Wall Street Journal 5k task:

knowledge sources used              perplexity PP   phoneme error rate [%]   word error rate [%]
unconstrained phoneme recognition   –               36.3                     —
+ pronunciation lexicon             5000            13.9                     40.0
+ LM: unigram                       746             8.4                      22.9
      bigram                        107             2.8                      6.9
      trigram                       56              1.9                      4.5


Effect of Knowledge Sources

Example from the Wall Street Journal 5k task (error counts on the right):

no lexicon (28 errors):
  k t k t dh ey d v eh d ey n ey ih z n un k oh sh h ee ey d ih ng n dh uh dh s ey l uh f s ur n d h aa s dh aa t s UH b dh uh b r oh k r ih j y ooh n ih t p p

0-gram (11 phoneme errors / 9 word errors):
  h ih t s eh n uh t ur z n ih g oh sh ee ey t ih ng — — s ey l — — s ur t un aa s eh t s aw n t uh b r oh k ur ih j y ooh n ih t s
  HIT SENATORS — — NEGOTIATING — SALE — CERTAIN ASSETS ONTO — BROKERAGE UNIT'S

1-gram (6 phoneme errors / 5 word errors):
  ih t s s eh n ih t ih z n ih g oh sh ee ey t ih ng — — s ey l — — s ur t un aa s eh t s aw v dh uh b r oh k ur ih j y ooh n ih t
  ITS SENATE — IS NEGOTIATING — SALE — CERTAIN ASSETS OF THE BROKERAGE UNIT

2-gram (0 errors):
  ih t s eh d ih t ih z n ih g oh sh ee ey t ih ng dh uh s ey l aw v s ur t un aa s eh t s aw v dh uh b r oh k ur ih j y ooh n ih t
  IT SAID IT IS NEGOTIATING THE SALE OF CERTAIN ASSETS OF THE BROKERAGE UNIT


Phoneme Models and Subword Units

From Small to Large Vocabulary: Why Subword Units?

For large vocabularies, it is prohibitive to use whole-word models for each word of the vocabulary:

- There are not enough training samples for each word.
- The memory requirements increase linearly with the number of words (today: no real problem).

Solution: create word models by concatenating subword units, such as phonemes, context-dependent phonemes, demi-syllables, syllables, ...

Advantages:

- Training data is shared between words.
- Words not seen in training (i.e. without training examples) can be recognized by using a pronunciation lexicon.


Zipf's Law

The problem of sparse data is related to "Zipf's law":

    The frequency N(w) of a word w is (approximately) inversely proportional to some power γ of its rank r(w):

$$N(w) = \mathrm{const} \cdot r(w)^{-\gamma}$$

Example from the Verbmobil corpus:

rank    word           frequency
1       ich            18648
2       ja             16613
3       das            14288
4       wir            13532
...
4440    Abendtermine   1
4441    Aberglaubens   1
...
10000   zwingend       0

[Figure: log-log plot of word frequency versus rank for the Verbmobil corpus.]


Phonetic (Phonemic) Models

Distinguish the various levels:
– acoustic realization: acoustic signal
– class of equivalent sounds: phone (allophone, triphone)
– (more) abstract level: phoneme

Speech sounds may be categorized according to different 'features':

for consonants:
- voiced / voiceless
- manner of articulation: stop, nasal, fricative, approximant
- place of articulation: labial, dental, alveolar, palatal, velar, glottal

for vowels:
- position of the tongue: high/low, front/back
- rounded or not


Subword Units

speech ⇐⇒ temporal sequence of sounds
  ⇓
acoustic signal ⇐⇒ temporal sequence of acoustic vectors (the acoustic realization of the sounds)

Model of speech production:

- Every sound has a program for the movements of the vocal tract.
- The movements of individual sounds merge into one continuous sequence of movements.
- The ideal positioning of the vocal tract is only approximated (depending on the amount of coarticulation).
  ⇒ the real acoustic signal differs from the 'ideal' signal


The Vocal Tract

[Figure: cross-section of the human vocal tract. Drawing by Laszlo Kubinyi, © Scientific American 1977.]


Subword Units

Criteria for sound classification:

- type of articulation (fricative, plosive)
- location of articulation ([p]: labial, [s]: dental)
- consonants and vowels
- voiced and unvoiced sounds
- stationary and non-stationary sounds (vowels vs. diphthongs, plosives)

Perception of sounds:

- loudness
- tone (smoothed spectrum = formant spectrum)
- unvoiced, voiced (fundamental frequency, pitch)


Phonemes

- The pronunciation of a word is usually described in a less detailed way using phonemes.
- A phoneme is an abstraction over different phonetic realizations.
- Two sounds correspond to different phonemes if they can occur in the same context and distinguish different words.
- The phoneme inventory of a language can be inferred from "minimal pairs".
- A minimal pair is a pair of words whose phonetic transcriptions have an edit distance of one.


Phonemes

Examples of minimal pairs for German:

Vowels:
i: / o:    Kiel / Kohl
I / E      fit / fett
e: / Y     fehle / fülle
a: / a     Rate / Ratte
Y / 9      Hülle / Hölle
o: / aU    roh / rau
e: / E:    Tee / Teint

Consonants:
p / b      packe / backe
t / m      Tasse / Masse
k / ts     Kahn / Zahn
f / v      Fall / Wall
s / S      Bus / Busch
s / z      Muße / Muse
l / –      Klette / Kette


Phonemes

Characteristics of the phoneme set:

- The phoneme set is language specific. Examples:
  Chinese: [l] – [r] one phoneme
  Arabic: [ki] – [ku] different phonemes
- Humans are trained to distinguish the sounds of specific languages.
- The acoustic realizations of phonemes are context dependent (coarticulation):
  – static dependencies on the surrounding phonemes
  – dynamic dependency: temporal overlap of the articulation of subsequent phonemes


Phoneme System for German in SAMPA Notation

Consonants, plosives:
p   Pein     p aI n
b   Bein     b aI n
t   Teich    t aI C
d   Deich    d aI C
k   Kunst    k U n s t
g   Gunst    g U n s t

Consonants, fricatives:
f   fast     f a s t
v   was      v a s
s   Tasse    t a s @
z   Hase     h a: z @
S   waschen  v a S @ n
Z   Genie    Z e n i:
C   sicher   z I C 6
j   Jahr     j a: 6
x   Buch     b u: x
h   Hand     h a n t

Consonants, sonorants:
m   mein     m aI n
n   nein     n aI n
N   Ding     d I N
l   Leim     l aI m
R   Reim     R aI m

"Checked" (short) vowels:
I   Sitz       z I t s
E   Gesetz     g @ z E t s
a   Satz       z a t s
O   Trotz      t r O t s
U   Schutz     S U t s
Y   hübsch     h Y p S
9   plötzlich  p l 9 t s l I C

"Free" (long) vowels:
i:  Lied     l i: t
e:  Beet     b e: t
E:  spät     S p E: t
a:  Tat      t a: t
o:  rot      r o: t
u:  Blut     b l u: t
y:  süß      z y: s
2:  blöd     b l 2: t

Diphthongs:
aI  Eis      aI s
aU  Haus     h aU s
OY  Kreuz    k r OY t s

"Schwa" vowels:
@   bitte    b I t @
6   besser   b E s 6


Phonemes

Function of the phonemes:

acoustic signal: continuous, infinite number of realizations
  ⇑ (1:∞)
(allo-)phones: discrete sounds, approx. 40 000
  ⇑ (1:1000)
phonemes: discrete, alphabet of 40–60 symbols, depending on the language
  ⇕ (1:1)
words of the language: several 100 000 words, pronunciation and meaning


Context Dependent Subword Units

The acoustic realization of phonemes is context dependent. Context-dependent modelling is more accurate:

- Diphones: for the phoneme sequence A B C D E:
  #A | AB | BC | CD | DE | E#
- Syllables: group of phonemes, standard form consonant–vowel–consonant, about 20 000 syllables for German.
  [Figure: energy over time for a consonant–vowel–consonant syllable.]
- Demi-syllables (syllables split at the vowel)
- Consonant clusters


Subword Units

Example: possible subword units for German

Subword Unit                    Number (approx.)   Representation of the acoustic signal
phonemes                        50                 inaccurate
consonant clusters and vowels   250                ...
diphones                        2500               ...
demi-syllables                  ...                ...
syllables                       20 000             accurate

Note on terminology: consonant cluster = consonant sequence.


Subword Units

Practical reasons for using subword units in speech recognition:

- Not enough training data for whole-word models.
- More observations for subword units (better training).
- The vocabulary can be extended without new acoustic training; specifying the corresponding subword units is sufficient.

Important issues when using subword units:

- Define and specify the subword units.
- Map the continuous signal to the discrete sequence of units, i.e. specify the units and the pronunciation lexicon.
- Train the subword units.
- Use the subword units and the pronunciation lexicon for recognition.


HMMs for phonemes

Layers of the acoustic modelling:

[Figure: layers of acoustic modelling for the utterance "THIS BOOK IS GOOD":
words:            THIS | BOOK | IS | GOOD
phonemes:         th i s | b uh k | i z | g uh d
subphonemes:      ... b b uh uh uh k k ...
acoustic vectors: ...
speech signal:    ... (cl = closure, rel = release, on/off = onset/offset)]

Speech can be modeled on any of these layers.


HMMs for phonemes

Different HMM topologies for phonemes can be used; a topology defines:

- the number of states and
- the allowed transitions.

Usually three subphonemes are used: Begin – Middle – End (3-state model).

[Figure: two variants of the 3-state topology B–M–E with self-loop, forward, and skip transitions.]
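Such a topology can be written down directly as a transition matrix. The following sketch uses uniform placeholder probabilities (in practice the values are trained); it is an illustration, not the lecture's exact parameterization:

```python
import numpy as np

# 3-state left-to-right HMM topology for one phoneme: Begin, Middle, End.
# Allowed transitions: loop (stay), forward (next state), skip (jump over one).
STATES = ["B", "M", "E"]

# transitions[i, j] = probability of moving from state i to state j.
transitions = np.array([
    [1/3, 1/3, 1/3],   # B: loop, B->M, skip B->E
    [0.0, 1/2, 1/2],   # M: loop, M->E
    [0.0, 0.0, 1.0],   # E: loop (the word exit is handled outside the model)
])

# Each row must be a proper distribution over successor states.
assert np.allclose(transitions.sum(axis=1), 1.0)
```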


HMMs for phonemes: "IBM model"

[Figure: the "IBM model" topology over the subphonemes B, M, E, with the emission distributions attached to the transitions.]

Properties:

- Transition-assigned emissions: the emission probability distributions are assigned to the transitions (not to the states).
- The number of possible paths is restricted for short vector sequences:
  1: B
  2: BM
  3: BME
  4: BMME
  5: BBMME, BMMEE, BMMME
  6: ...


HMMs for phonemes

- 6-state model: two copies of the B–M–E sequence (B M E B M E).

[Figure: the 6-state model and the corresponding trellis of the phoneme state index (B, M, E) over the time index.]


Pronunciation Lexicon

A pronunciation lexicon with phonetic transcriptions is required when using subword units. Usually phonemes are used as subword units.

Example: English digits

word    phonemes
Zero    Z IH R OW
One     W AH N
Two     T UW
Three   TH R IY
Four    F OW R
Five    F AY V
Six     S IH K S
Seven   S EH V AX N
Eight   EY T
Nine    N AY N
Oh      OW

phoneme   number of occurrences
AH        1
AX        1
AY        2
F         2
EH        1
EY        1
IH        2
IY        1
K         1
N         4
OW        3
S         3
R         2
T         2
TH        1
UW        1
V         2
W         1
Z         1
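The occurrence counts follow directly from the lexicon. A small sketch that reproduces them (the dictionary below is just the table above re-typed as Python data):

```python
from collections import Counter

LEXICON = {
    "Zero": "Z IH R OW",   "One": "W AH N",        "Two": "T UW",
    "Three": "TH R IY",    "Four": "F OW R",       "Five": "F AY V",
    "Six": "S IH K S",     "Seven": "S EH V AX N", "Eight": "EY T",
    "Nine": "N AY N",      "Oh": "OW",
}

# Count how often each phoneme occurs across all pronunciations.
counts = Counter(ph for pron in LEXICON.values() for ph in pron.split())
print(counts["N"], counts["OW"], counts["S"])  # -> 4 3 3, matching the table
```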


Pronunciation Lexicon

Context dependencies:

- Context-independent (real) phoneme models: the International Phonetic Alphabet defines 74 phonemes for English; in practical applications, about 40–50 phonemes are typically used.

  z e r o → Z IH R OW

- Context-dependent phoneme models: coarticulation is considered:

  z e r o → #ZIH  ZIHR  IHROW  ROW#

- "Diphone" AB, BC: context-dependent phoneme in diphone context
- "Triphone" ABC: context-dependent phoneme in triphone context


Pronunciation Lexicon

Terminology:

- context-independent phonemes ('monophones', more or less the 'real' phonemes as defined in linguistics)
- phonemes in (left or right) diphone context ('diphones')
- phonemes in triphone context ('triphones')
- phonemes in word context ('wordphones')

The context dependency only determines the labels for the emission probabilities, which have to be specified for each state of a phoneme model. The emission probabilities can be trained independently of the phonetic context.

The sequence of states is used for recognition:
- word: sequence of phoneme models
- phoneme model: sequence of HMM states
⇒ word: sequence of HMM states
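The expansion word → phoneme models → HMM states can be sketched in a few lines (hypothetical data structures with a fixed 3-state topology per phoneme; not the RWTH system):

```python
# Toy lexicon and a fixed Begin-Middle-End topology per phoneme.
LEXICON = {"zero": ["Z", "IH", "R", "OW"]}
SUBSTATES = ["B", "M", "E"]

def word_state_sequence(word: str) -> list:
    """Word -> sequence of phoneme models -> sequence of HMM state labels."""
    return [f"{ph}.{s}" for ph in LEXICON[word] for s in SUBSTATES]

print(word_state_sequence("zero"))
# ['Z.B', 'Z.M', 'Z.E', 'IH.B', 'IH.M', 'IH.E', ..., 'OW.E']
```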


Training Phoneme Models

- The sequence of HMM state indices for a word depends on the phoneme sequence.
- The training procedure corresponds to the one used for word models:

[Figure: time alignment of three training utterances against the state sequences of the spoken words; x, y, z are the labels of the mixtures.]

- Time alignment: assign a state to every acoustic vector.
- Parameter estimation:
  – Collect all observations for every mixture $m$ of every phoneme model based on the time alignment.
  – Estimate the model parameters for all densities $l$ of the mixture $m$:
    - reference or prototype vector $\mu_{lm}$
    - pooled variance vector $\sigma_m^2$
    - mixture weight $p(l \mid m)$


Training Phoneme Models

Practical considerations: all possible triphones (50³ = 125 000) are too many!

- Combine monophones, diphones, and triphones: only use diphones and triphones that occur more than e.g. 100 times in the training data.
- Generalized context-dependent phoneme models: use phoneme classes (nasals, fricatives, vowels, stop consonants, ...):
  g(A) B g(C) instead of A B C, where g(X) denotes the phoneme class that X belongs to.
- Parameter tying: use clustering or Classification And Regression Trees (CART) to tie "similar" phonemes.

⇒ a few thousand models that are actually used
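A toy sketch of the generalized-context idea (the class inventory below is illustrative only): replacing the left and right contexts by broad phonetic classes makes distinct triphones share one model:

```python
# Hypothetical broad-class inventory; a real system would cover all phonemes.
PHONEME_CLASS = {
    "m": "NASAL", "n": "NASAL",
    "f": "FRICATIVE", "s": "FRICATIVE",
    "a": "VOWEL", "i": "VOWEL",
    "p": "STOP", "t": "STOP",
}

def generalize(left: str, center: str, right: str) -> tuple:
    """Map the triphone A B C to the generalized model g(A) B g(C)."""
    return (PHONEME_CLASS.get(left, left), center, PHONEME_CLASS.get(right, right))

# 'm a t' and 'n a p' now share the model ('NASAL', 'a', 'STOP'):
assert generalize("m", "a", "t") == generalize("n", "a", "p") == ("NASAL", "a", "STOP")
```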


Phonetic Decision Trees: Motivation

Classification and Regression Trees (CART):

- used in the acoustic modelling of phonemes
- 50 phonemes ⇒ 50³ = 125 000 possible phonemes in triphone context ("triphones")
- Problem:
  – too many triphones to be trained reliably
  – many triphones are not seen in training
  – considering across-word contexts, this effect increases
- Solution:
  – tie the parameters of similar triphones
  – a decision tree determines similarity


Motivation

A phoneme X has 2500 possible triphone contexts aXb. Phonetic decision tree for a phoneme X:

[Figure: binary tree with root question Q0(a, b) and follow-up questions Q1(a, b) and Q2(a, b) on the yes/no branches.]

A path through the decision tree is defined by the answers to phonetic questions Q0, Q1, Q2, ..., e.g.:
– "Is the left context a fricative?"
– "Is the right context a plosive?"
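Evaluating such a tree is a simple walk from the root to a leaf. A minimal sketch (illustrative questions and phoneme sets, not the tree from the lecture):

```python
# Each inner node asks a yes/no question about the triphone context (a, b);
# each leaf names a tied HMM emission distribution.
FRICATIVES = {"f", "v", "s", "z", "S", "Z", "C", "x", "h"}
PLOSIVES = {"p", "b", "t", "d", "k", "g"}

TREE = ("left context fricative?", lambda a, b: a in FRICATIVES,
        ("right context plosive?", lambda a, b: b in PLOSIVES,
         "leaf_1", "leaf_2"),
        "leaf_3")

def find_leaf(node, a, b):
    if isinstance(node, str):        # reached a leaf: a tied model label
        return node
    _question, test, yes, no = node
    return find_leaf(yes if test(a, b) else no, a, b)

print(find_leaf(TREE, "s", "t"))  # fricative left, plosive right -> leaf_1
```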


Example

[Figure: an example phonetic decision tree with questions such as L-BOUNDARY, R-LIQUID, L-BACK-R, R-LAX-VOWEL, L-R, L-L-NASAL, R-TENSE-VOWEL, L-LAX-VOWEL, R-UR, R-S/SH, and L-EE at the inner nodes; the leaves are labeled with counts such as 1/10477, 3/2370, 2/3628, 1/9337, and 8/3179.]


Motivation

Properties of the tree:

- Every leaf of the tree stands for a generalized phonetic context and has a corresponding HMM emission probability.
- An adequate generalization for triphones not seen in training can be expected.


Motivation

General application of CART: given the two variables

- $c$ = class index
- $x \in \mathbb{R}^D$, or discrete (the observation),

model the conditional probability $p(c \mid x)$ with $\sum_c p(c \mid x) = 1$.

[Figure: decision tree with questions $Q_0(x)$, $Q_1(x)$, $Q_2(x)$; each leaf $t$ carries a distribution $p(c \mid t)$.]

"Classification tree" vs. "estimation tree": an estimation tree models the conditional probability without classification.


Training Principle

Given the training data

$$[x_n, y_n], \quad n = 1, \ldots, N,$$

with $x$ the independent variable and $y$ the dependent variable, two subsets of $x$ are considered:

$$t, t_L, t_R \subset x.$$

Define a tree by binary splitting of a node or subtree:

$$t = t_L \cup t_R, \qquad t_L \cap t_R = \emptyset.$$

[Figure: node $t$ split by a question on $x$ into $t_L$ (yes) and $t_R$ (no).]


Training Principle

Define:

- a "score" $g(y_n \mid t)$ for every observation $(x_n, y_n)$ with $x_n \in t$;
- a score for the node $t$:

$$G(t) := \sum_{n: x_n \in t} g(y_n \mid t).$$

The score function $g(y_n \mid t)$ shall be additive.

Note the change in the score when splitting $t$ into the subsets $t_L$ and $t_R$:

$$\Delta G(t_L \mid t) = G(t) - G(t_L) - G(t_R)$$

Best split $t_L$ for a given $t$:

$$\max_{t_L} \Delta G(t_L \mid t)$$


Training Principle

Use the log-likelihood (log-probability) criterion for $G(t)$ ($\theta$ represents the parameters of the distribution):

$$g(y_n \mid t) := \log p_\theta(y_n \mid t)$$
$$G(t) := \max_\theta \sum_{n: x_n \in t} \log p_\theta(y_n \mid t)$$

$G(t_L)$ and $G(t_R)$ correspondingly.

Optimization:
- Learn the best parameters $\hat\theta$ for a hypothetical split $t_L$ at a node $t$.
- Choose the optimal split.

Thus:

$$\hat\theta = \hat\theta\big(\{(x_n, y_n) : x_n \in t_L,\; n = 1, \ldots, N\}\big)$$
$$G(t_L) = \sum_{n: x_n \in t_L} \log p_{\hat\theta}(y_n \mid t_L)$$


Training Principle: Discrete Observations

For $y$ with discrete values, the parameters $\theta$ are the distribution $p(y \mid t)$ itself (non-parametric model). Then, introducing a Lagrange multiplier $\lambda$ for the normalization constraint:

$$\sum_{n: x_n \in t} \log p(y_n \mid t) - \lambda \left[ \sum_y p(y \mid t) - 1 \right] = \sum_y N(t, y) \cdot \log p(y \mid t) - \lambda \left[ \sum_y p(y \mid t) - 1 \right]$$

Setting the derivatives to zero:

$$\frac{\partial}{\partial p(y \mid t)}: \quad \frac{N(t, y)}{p(y \mid t)} - \lambda = 0$$
$$\frac{\partial}{\partial \lambda}: \quad \sum_y p(y \mid t) - 1 = 0$$
$$\Rightarrow \quad \hat\theta \equiv p(y \mid t) = \frac{N(t, y)}{N(t)}$$

with the counts $N(t, y)$ and $N(t)$.


Training Principle: Discrete Observations

For the optimum:

$$G(t) = \sum_{n: x_n \in t} \log p_{\hat\theta}(y_n \mid t) = \sum_{n: x_n \in t} \log \frac{N(t, y_n)}{N(t)} = \sum_y N(t, y) \cdot \log \frac{N(t, y)}{N(t)} = N(t) \sum_y p(y \mid t) \cdot \log p(y \mid t),$$

i.e. $N(t)$ times the negative entropy of $p(\cdot \mid t)$.
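A toy numerical check of this result (made-up labels): G(t) computed from counts equals N(t) times the negative entropy, and a pure split yields a positive likelihood improvement:

```python
import math
from collections import Counter

def G(labels):
    """Node score at the ML estimate: sum_y N(t,y) * log(N(t,y)/N(t))."""
    n = len(labels)
    return sum(c * math.log(c / n) for c in Counter(labels).values())

parent = ["a", "a", "a", "b", "b", "b"]
left, right = ["a", "a", "a"], ["b", "b", "b"]

# Likelihood improvement of the split (the negative of Delta G as defined
# above); a pure split explains the data better, so it is positive.
improvement = G(left) + G(right) - G(parent)
print(improvement)  # = 6 * log 2, about 4.16
```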


Training Principle: Continuous Observations

For $y$ with continuous values, especially a Gaussian distribution:

$$p_\theta(y \mid t) = \mathcal{N}(y \mid \mu_t, \Sigma_t)$$
$$\mathcal{N}(y \mid \mu_t, \Sigma_t) = \frac{1}{\sqrt{\det(2\pi\Sigma_t)}} \cdot \exp\left[ -\frac{1}{2} (y - \mu_t)^T \Sigma_t^{-1} (y - \mu_t) \right]$$

$$G(t) = \sum_{n: x_n \in t} \log \mathcal{N}(y_n \mid \hat\mu_t, \hat\Sigma_t) = -\frac{N(t)}{2} \log\det\big[ 2\pi \hat\Sigma_t \big] - \frac{1}{2} \sum_{n: x_n \in t} (y_n - \hat\mu_t)^T \hat\Sigma_t^{-1} (y_n - \hat\mu_t)$$

with

$$N(t) := \sum_{n: x_n \in t} 1$$


Training Principle: Continuous Observations

Maximum-likelihood estimation for $\mu_t$ and $\Sigma_t$:

$$\hat\mu_t = \frac{1}{N(t)} \sum_{n: x_n \in t} y_n$$
$$\hat\Sigma_t = \frac{1}{N(t)} \sum_{n: x_n \in t} (y_n - \hat\mu_t)(y_n - \hat\mu_t)^T$$


Training Principle: Continuous Observations

Using a diagonal covariance matrix:

$$\hat\Sigma_t = \begin{pmatrix} \sigma_{t1}^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_{tD}^2 \end{pmatrix}, \qquad \sigma_{td}^2 = \frac{1}{N(t)} \sum_{n: x_n \in t} (y_{nd} - \mu_{td})^2$$

$$\sum_{n: x_n \in t} (y_n - \mu_t)^T \hat\Sigma_t^{-1} (y_n - \mu_t) = \sum_{n: x_n \in t} \sum_d \left( \frac{y_{nd} - \mu_{td}}{\sigma_{td}} \right)^2 = \sum_d \frac{1}{\sigma_{td}^2} \sum_{n: x_n \in t} (y_{nd} - \mu_{td})^2 = N(t) \cdot D$$


Training Principle: Continuous Observations

General case, full covariance matrix (the index $t$ of $\mu_t$ and $\Sigma_t$ is dropped here for simplification):

$$z_n := y_n - \mu, \qquad \hat\Sigma := \frac{1}{N(t)} \sum_{n: x_n \in t} z_n z_n^T$$

$$\sum_{n: x_n \in t} z_n^T \hat\Sigma^{-1} z_n = \sum_n \sum_{i,j=1}^{D} z_{ni} \big(\hat\Sigma^{-1}\big)_{ij} z_{nj} = \sum_{i,j} \left[ \sum_{n: x_n \in t} z_{ni} z_{nj} \right] \big(\hat\Sigma^{-1}\big)_{ij} = N(t) \cdot \sum_{i,j} \hat\Sigma_{ij} \big(\hat\Sigma^{-1}\big)_{ij} = N(t) \cdot \sum_j \sum_i \hat\Sigma_{ji} \big(\hat\Sigma^{-1}\big)_{ij} = N(t) \cdot \sum_{j=1}^{D} \delta_{jj} = N(t) \cdot D$$

(using the symmetry of $\hat\Sigma$).


Training Principle: Continuous Observations

Thus:

$$G(t) = \sum_{n: x_n \in t} \log \mathcal{N}(y_n \mid \hat\mu_t, \hat\Sigma_t) = -\frac{N(t)}{2} \log\det\big[2\pi\hat\Sigma_t\big] - \frac{N(t)}{2} D = -\frac{N(t)}{2} \log\det\big(2\pi e\, \hat\Sigma_t\big) = -\frac{N(t)}{2} \Big[ D \log(2\pi) + D + \log\big(\det \hat\Sigma_t\big) \Big]$$


Training Principle: Continuous Observations

The improvement of the log-likelihood score by splitting $t = t_L \cup t_R$, $t_L \cap t_R = \emptyset$, is:

$$\Delta G(t) = G(t) - G(t_L) - G(t_R) = \ldots = \frac{N(t_L)}{2} \log\left[ \frac{\det \Sigma_{t_L}}{\det \Sigma_t} \right] + \frac{N(t_R)}{2} \log\left[ \frac{\det \Sigma_{t_R}}{\det \Sigma_t} \right]$$

For a diagonal covariance matrix:

$$\log\det\Sigma_t = \log \prod_d \sigma_{td}^2 = \sum_{d=1}^{D} \log \sigma_{td}^2$$
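A small sketch (toy data, diagonal covariances) of using this score for split selection. The quantity computed below is the likelihood improvement $G(t_L) + G(t_R) - G(t)$, i.e. the negative of $\Delta G(t)$ as written above, so a larger value means a better split:

```python
import numpy as np

def log_det_diag(y: np.ndarray) -> float:
    # log det of the diagonal ML covariance = sum of log per-dim variances
    return float(np.sum(np.log(y.var(axis=0))))

def split_improvement(y: np.ndarray, mask: np.ndarray) -> float:
    """(N log det S - N_L log det S_L - N_R log det S_R) / 2."""
    yl, yr = y[mask], y[~mask]
    return (len(y) * log_det_diag(y)
            - len(yl) * log_det_diag(yl)
            - len(yr) * log_det_diag(yr)) / 2.0

rng = np.random.default_rng(0)
y = np.vstack([rng.normal(-2.0, 1.0, (100, 3)), rng.normal(+2.0, 1.0, (100, 3))])
good = np.arange(200) < 100          # separates the two clusters
bad = np.arange(200) % 2 == 0        # mixes them
print(split_improvement(y, good) > split_improvement(y, bad))  # True
```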


Leaving One Out

So far without leaving-one-out:

$$\max_\theta \sum_n \log p_\theta(y_n) \;\Rightarrow\; \hat\theta = \hat\theta(y_1^N)$$

The optimal value is substituted into the log-likelihood:

$$\sum_n \log p_{\hat\theta(y_1^N)}(y_n),$$

so every observation is considered twice:

1. to determine $\hat\theta$;
2. to determine how well the model $p_{\hat\theta}(y)$ explains the observations $y_1, \ldots, y_N$.

⇒ The score evaluation is too optimistic.


Leaving One Out

Idea: take $y_n$ out of the training set when evaluating $p_\theta(y_n)$.

Use $\hat\theta(y_1^N \setminus y_n)$ instead of $\hat\theta(y_1^N)$ for the leaving-one-out score evaluation:

$$\sum_n \log p_{\hat\theta(y_1^N \setminus y_n)}(y_n)$$

For Gaussian distributions, the calculation of $\hat\theta(y_1^N \setminus y_n)$ is easy:

$$\mu_n := \frac{1}{N-1} \sum_{m \neq n} y_m, \qquad \Sigma_n := \frac{1}{N-1} \sum_{m \neq n} (y_m - \mu_n)(y_m - \mu_n)^T$$
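A quick numerical check of the leaving-one-out estimates (1-D toy data): the mean can be downdated from the full-data sum instead of being re-estimated from scratch:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 50)
N, total = len(y), y.sum()

for n in range(3):
    mu_n = (total - y[n]) / (N - 1)                 # mean without y[n]
    var_n = (((y - mu_n) ** 2).sum() - (y[n] - mu_n) ** 2) / (N - 1)
    # Reference computation, deleting y[n] explicitly:
    rest = np.delete(y, n)
    assert np.isclose(mu_n, rest.mean())
    assert np.isclose(var_n, ((rest - mu_n) ** 2).mean())
```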


Language Modelling

Goal: model the syntax and semantics of natural language (spoken or written).

Language models are needed in automatic systems that process speech (= spoken language) or language (= written language):

- speech recognition
- speech and text translation
- (spoken and written) language understanding
- spoken dialog systems
- text summarization
- ...


Example

Finite state networks for digit strings: syntactical constraints can be expressed with formal grammars, here represented as networks.

- String of three digits: [Figure: linear network over nodes 1–4 with three digit arcs V.]
- String with an even number of digits: [Figure: network over nodes 1–3 with a cycle of two digit arcs V.]
- Unconstrained digit string: [Figure: single node 1 with a digit loop arc V.]

Here V stands for one of the digit words zero, one, two, three, ..., nine; ε denotes the empty transition.


Silence Model

Allow silence between the words:

- String of three digits: [Figure: the three-digit network with additional silence (Sil) loop arcs at nodes 1–4.]
- String with an even number of digits: [Figure: the even-digit network with Sil loops at nodes 1–3.]
- Unconstrained digit string: [Figure: single node 1 with loop arc V | Sil.]

Again V stands for one of the digit words zero, one, two, three, ..., nine; ε denotes the empty transition.


Unfolding the network

For the recognition, the network has to be unfolded along the time axis:

[Figure: the three-digit network (nodes 1, 2, 3 with digit arcs V and silence arcs Sil) unfolded over time t: at each time frame, a copy of the network states s is connected to the next frame by V and Sil transitions.]

The computational complexity is proportional to the number of acoustic transitions.

As shown later, it is favorable to reduce the number of "real", i.e. acoustic, transitions.


Language model networks

As shown in the example, a network consists of transitions and nodes.

- Transitions: correspond to spoken words (including silence).
- Nodes: every word has a start and an end node; they define the syntactic (linguistic) context of the transition.

As shown below, a word A can occur in four different contexts.


Language model networks

Possible contexts for a word A:

[Figure: four network fragments a)–d) showing the word A between nodes 1, 2, 3 in different start/end node configurations.]

In case a) the automaton is non-deterministic.


Bayes Decision Rule and Perplexity

- Bayes decision rule using the maximum approximation:

$$\big[w_1^N\big]_{\text{opt}} = \operatorname*{arg\,max}_{[w_1^N]} \left\{ \Pr(w_1^N) \cdot \max_{[s_1^T]} \prod_{t=1}^{T} p(x_t, s_t \mid s_{t-1}, w_1^N) \right\}$$

- The perplexity (corpus perplexity / test perplexity) of a language model and a test corpus $[w_1^N]$ is defined as

$$PP = \Pr(w_1^N)^{-\frac{1}{N}} = \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_1^{n-1}) \right]^{-\frac{1}{N}}$$


Bayes Decision Rule and Perplexity

- The logarithm of the perplexity then is:

$$\log PP = \log\Big[ \Pr(w_1^N)^{-\frac{1}{N}} \Big] = -\frac{1}{N} \sum_{n=1}^{N} \log \Pr(w_n \mid w_1^{n-1})$$

A small perplexity corresponds to strong language model restrictions.

Properties of the perplexity:

- normalization: probability per word
- inverse probability: number of possible choices per word position
- probability zero: infinite penalty
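A toy sketch of the definition (made-up unigram model): the perplexity is the inverse geometric mean of the word probabilities over the corpus:

```python
import math

# Hypothetical unigram model, purely for illustration.
model = {"the": 0.4, "cat": 0.3, "sat": 0.3}

def perplexity(corpus: list) -> float:
    # log PP = -(1/N) * sum_n log Pr(w_n | history); unigram: no history.
    log_pp = -sum(math.log(model[w]) for w in corpus) / len(corpus)
    return math.exp(log_pp)

print(perplexity(["the", "cat", "sat"]))  # about 3.0, between 1/0.4 and 1/0.3
# With a uniform model over W words, PP = W, as derived on the next slide.
```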


Bayes Decision Rule and Perplexity

Now assume constant probabilities (no dependence on the word and between words):

$$\Pr(w_1^N) = \prod_{n=1}^{N} \frac{1}{W} = \left( \frac{1}{W} \right)^N \qquad \text{with } W = \text{size of the vocabulary}$$

Then the perplexity becomes:

$$PP = \Pr(w_1^N)^{-\frac{1}{N}} = \left[ \left( \frac{1}{W} \right)^N \right]^{-\frac{1}{N}} = W$$

Note: in this case, the perplexity only depends on the vocabulary size $W$. Nevertheless, in the general case the perplexity depends on the test corpus it is computed on.


Language Model Networks

Language model $\Pr(w_1^N)$ in networks: a deterministic finite state automaton (DFA) is defined by a transition function $\delta$:

$$\mathcal{V} = \{1, \ldots, V\} \quad \text{nodes (linguistic contexts)}$$
$$\mathcal{W} = \{1, \ldots, W\} \quad \text{arcs (words including silence)}$$
$$\delta : \mathcal{V} \times \mathcal{W} \to \mathcal{V}, \qquad (v, w) \mapsto v' = \delta(v, w)$$

A path in the network defines a word sequence $[w_1^N]$.

For every word $w$ given a node $v$, a probability $p(w \mid v)$ is defined:

$$p(w \mid v) \begin{cases} = 0 & \text{if word } w \text{ does not leave node } v \\ \leq 1 & \text{else} \end{cases}$$


Language Model Networks

The sum of the probabilities $p(w \mid v)$ over all words $w$ for each node $v$ is:

$$\sum_{w=1}^{W} p(w \mid v) = 1$$

[Figure: node $v$ with outgoing word arcs $w = 1, \ldots, W$.]

Index convention: $v_{n+1} := \delta(v_n, w_n)$ for the transition $v_n \xrightarrow{\,w_n\,} v_{n+1}$.


Language model networks

In the general case of non-deterministic finite state automata (NFA), the sum over all paths corresponding to a word sequence $w_1^N$ has to be calculated:

$$\Pr(w_1^N) = \sum_{v_1^N} \left\{ \prod_{n=1}^{N} p(w_n \mid v_n) \right\}.$$

Often the maximum approximation is used:

$$\Pr(w_1^N) \approx \max_{v_1^N} \left\{ \prod_{n=1}^{N} p(w_n \mid v_n) \right\} = \max_{v_1^N} \left\{ \prod_{n=1}^{N} p\big(w_n \mid \delta(v_{n-1}, w_{n-1})\big) \right\}$$

Note the hierarchical structure of the grammars:

- HMM: acoustic modelling for each word, defining the correspondence between word classes and acoustic vectors
- LM: network

Page 87: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

Language model networks

Non-deterministic and deterministic finite state automata (NFA and DFA):

I general case NFA:

  p(w, v | v') = p(v | v') · p(w | v', v)
                 (transition prob.) (emission prob.)

I special case DFA: given a pair (v', w), the successor state v is determined by v = δ(v', w); therefore a different factorization of p(w, v | v') is useful:

  p(w, v | v') = p(w | v') · p(v | v', w)

  with p(v | v', w) = 1 if v = δ(v', w), and 0 if v ≠ δ(v', w)

  For an allowed transition (v', w) → v = δ(v', w) the probability is: p(w, v | v') = p(w | v')


Dynamic Programming Recursion

Search using language model networks: dynamic programming.

Here the auxiliary quantity Q_v(t, s; w) used to derive the dynamic programming recursion is defined as:

  Q_v(t, s; w) := probability of the best path at time t leading to the state s of word w with starting node v.

Note the additional index v.


Dynamic Programming Recursion

I Within words: acoustic search

  Q_v(t, s; w) = max_{s'} { Q_v(t-1, s'; w) · p(x_t, s | s', w) }

  σ_v^opt(t, s; w) := argmax_{s'} { Q_v(t-1, s'; w) · p(x_t, s | s', w) }

  B_v(t, s; w) = B_v(t-1, σ_v^opt(t, s; w); w)

I Word boundaries: language model recombination

  Q_v(t-1, 0; w) = max_{v', w': δ(v',w')=v} { Q_{v'}(t-1, S(w'); w') · p(w|v) }
                 = p(w|v) · max_{v', w': δ(v',w')=v} Q_{v'}(t-1, S(w'); w')

  B_v(t-1, 0; w) = t-1


Dynamic Programming Recursion

Word boundaries: language model recombination

[Figure: word-end hypotheses (v'_1, w'_1), . . . , (v'_N, w'_N) with δ(v', w') = v are recombined at the language model node v, which then starts the words w = 1, . . . , W leaving v.]


Dynamic Programming Recursion

The dynamic programming recursion is carried out for every word w and node v. The context defined by v has to be considered in the traceback arrays:

Score:
  H(v, t) = max_{v', w': δ(v',w')=v} Q_{v'}(t, S(w'); w')

Starting node, word:
  (V, W)(v, t) = argmax_{v', w': δ(v',w')=v} Q_{v'}(t, S(w'); w')

Backpointer:
  B(v, t) = B(t, S(W(v, t)); W(v, t))

The index pair (predecessor node, word) is stored in the traceback array (V, W)(v, t). It can be interpreted as a linguistic copy of word w in the context v.


Example

Language model network and corresponding language model transitions:

[Figure: example network with nodes 1–4 and word arcs A, B, C, D, E plus silence (Sil) at each node, shown next to its unfolded trellis over time; acoustic transitions, empty transitions, and language model transitions are marked.]


m-Gram Language Models

Factorization without restrictions (w ∈ W ∪ {$}; $ = sentence end):

  Pr(w_1^N) = ∏_{n=1}^{N} Pr(w_n | w_1^{n-1})

Limit the dependence:

  Unigram LM:          Pr(w_1^N) = ∏_{n=1}^{N} p(w_n)
  Position Unigram LM: Pr(w_1^N) = ∏_{n=1}^{N} p(w_n | n)
  Bigram LM:           Pr(w_1^N) = ∏_{n=1}^{N} p(w_n | w_{n-1})
  Trigram LM:          Pr(w_1^N) = ∏_{n=1}^{N} p(w_n | w_{n-2}, w_{n-1})


Training Bigram LMs

Training bigram language models: count words and word pairs:

  p(w|v) = N(v, w) / N(v)

  N(v, w) : count of the word pair (v, w) in the training text
  N(v)    : count for word v


Training Bigram LMs

Motivation: bigram probability:

  Pr(w_1^N) = ∏_{n=1}^{N} p(w_n | w_{n-1})

Maximize the log-likelihood function:

  F = ∑_{n=1}^{N} log p(w_n | w_{n-1})    with ∑_w p(w|v) = 1 ∀ v

Collecting equal word pairs into counts and adding Lagrange multipliers µ_v for the normalization constraints:

  F = ∑_{v,w} N(v, w) log p(w|v) − ∑_v µ_v [ ∑_w p(w|v) − 1 ]


Training Bigram LMs

Set the derivatives of the log-likelihood w.r.t. p(w|v) and µ_v to zero to obtain the maximum:

  ∂F/∂p(w|v) = N(v, w)/p(w|v) − µ_v = 0

  ∂F/∂µ_v = ∑_w p(w|v) − 1 = 0

Solution:

  p(w|v) = N(v, w) / ∑_{w'} N(v, w') = N(v, w) / N(v)
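This closed-form solution is a one-liner over the count tables; a minimal sketch (the container layout is an assumption for illustration, not the teaching-patch code):

// Relative-frequency bigram estimate p(w|v) = N(v,w) / N(v).
#include <map>
#include <string>
#include <utility>

typedef std::map<std::pair<std::string, std::string>, int> PairCounts;  // N(v,w)
typedef std::map<std::string, int> WordCounts;                          // N(v)

double bigramML(const PairCounts& nvw, const WordCounts& nv,
                const std::string& v, const std::string& w) {
    PairCounts::const_iterator p = nvw.find(std::make_pair(v, w));
    WordCounts::const_iterator u = nv.find(v);
    if (p == nvw.end() || u == nv.end())
        return 0.0;   // unseen pair: zero probability without discounting
    return double(p->second) / double(u->second);
}

The zero for unseen pairs is exactly the problem addressed by discounting below.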


Discounting

Problem: many pairs (v, w) are not seen in training, N(v, w) = 0, so their relative frequency is zero.

Discounting: shift probability mass from seen to unseen events.

I Linear discounting:

  p(w|v) = (1 − λ) · N(v, w)/N(v)                      if N(v, w) > 0
         = λ · p(w) / ∑_{w': N(v,w')=0} p(w')          if N(v, w) = 0

Estimate 0 < λ < 1 by leaving one out: leave (v, w) out of the corpus
→ change counts: N(v, w) → N(v, w) − 1 for N(v, w) > 1


Linear Discounting

Leaving-one-out distribution with linear discounting:

  p_{−1}(w|v) = (1 − λ) · (N(v, w) − 1)/(N(v) − 1)     if N(v, w) > 1
              = λ · p(w) / ∑_{w': N(v,w')=1} p(w')     if N(v, w) = 1

Log-likelihood criterion:

  F(λ) = ∑_{v,w} N(v, w) · log p_{−1}(w|v)
       = ∑_{v,w: N(v,w)>1} N(v, w) · log[ (1 − λ) · (N(v, w) − 1)/(N(v) − 1) ]
       + ∑_{v,w: N(v,w)=1} N(v, w) · log[ λ · p(w) / ∑_{w': N(v,w')=1} p(w') ]


Linear Discounting

Rewrite the log-likelihood criterion:

  F(λ) = ∑_{v,w: N(v,w)>1} N(v, w) · log(1 − λ) + ∑_{v,w: N(v,w)=1} N(v, w) · log λ
       + ∑_{v,w: N(v,w)>1} N(v, w) · log[ (N(v, w) − 1)/(N(v) − 1) ]
       + ∑_{v,w: N(v,w)=1} N(v, w) · log[ p(w) / ∑_{w': N(v,w')=1} p(w') ]
         (the last two sums do not depend on λ: const(λ))

       = [ N − ∑_{v,w: N(v,w)=1} N(v, w) ] · log(1 − λ)
       + ∑_{v,w: N(v,w)=1} N(v, w) · log λ + const(λ)

       = (N − n_1) · log(1 − λ) + n_1 · log λ + const(λ)


Linear Discounting

Log-likelihood criterion:

  F(λ) = (N − n_1) · log(1 − λ) + n_1 · log λ + const(λ)

  with n_1 := ∑_{v,w: N(v,w)=1} 1 = number of bigram singletons
       N   := size of the corpus

Differentiate and set to zero to obtain the maximum w.r.t. λ:

  λ = n_1 / N
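A minimal sketch of this estimate over an assumed bigram count table (counting the singletons n_1 and the corpus size N):

// Leaving-one-out estimate lambda = n1 / N for linear discounting.
#include <map>
#include <string>
#include <utility>

typedef std::map<std::pair<std::string, std::string>, int> PairCounts;

double estimateLambda(const PairCounts& counts) {
    long n1 = 0;   // number of bigram singletons
    long N  = 0;   // corpus size = sum of all bigram counts
    for (PairCounts::const_iterator it = counts.begin(); it != counts.end(); ++it) {
        N += it->second;
        if (it->second == 1)
            ++n1;
    }
    return N > 0 ? double(n1) / double(N) : 0.0;
}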


Absolute Discounting

I Absolute discounting:

  p(w|v) = (N(v, w) − b)/N(v)                                       if N(v, w) > 0
         = b · (W − W_0(v))/N(v) · p(w) / ∑_{w': N(v,w')=0} p(w')   if N(v, w) = 0

  W      := vocabulary size
  W_0(v) := number of words that do not occur as successor of v.


Absolute Discounting

Leaving-one-out approach with maximum-likelihood estimation:

  F(b) = n_1 · log b + ∑_{v,w: N(v,w)>1} N(v, w) · log[ (N(v, w) − 1 − b)/(N(v) − 1) ]
       = n_1 · log b + ∑_{r>1} r · n_r · log(r − 1 − b) + const(b)

  with n_r := ∑_{v,w: N(v,w)=r} 1 = number of word pairs seen r times

Differentiate F(b) w.r.t. b and rewrite:

  n_1/b − 2 n_2/(1 − b) = ∑_{r>2} r n_r / (r − 1 − b)


Absolute Discounting

There is no closed-form solution, but the following estimate can be proven:

  n_1 / ( n_1 + 2 n_2 + (1/2) [N − n_1 − 2 n_2] )  ≤  b  ≤  n_1 / (n_1 + 2 n_2)

Usually the upper bound is a sufficient estimate:

  b = n_1 / (n_1 + 2 n_2)

For corpora of 10–20 million words and a vocabulary of 10,000–20,000 words, b ≈ 0.95.
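The upper bound only needs the counts-of-counts n_1 and n_2; a minimal sketch (same assumed count table as before):

// Absolute-discounting parameter: upper bound b = n1 / (n1 + 2*n2).
#include <map>
#include <string>
#include <utility>

typedef std::map<std::pair<std::string, std::string>, int> PairCounts;

double estimateB(const PairCounts& counts) {
    long n1 = 0, n2 = 0;   // pairs seen exactly once resp. exactly twice
    for (PairCounts::const_iterator it = counts.begin(); it != counts.end(); ++it) {
        if (it->second == 1) ++n1;
        else if (it->second == 2) ++n2;
    }
    if (n1 + 2 * n2 == 0)
        return 0.0;        // degenerate corpus: no singletons or doubletons
    return double(n1) / double(n1 + 2 * n2);
}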

Result: an LM where all w_1^N can be recognized, i.e. Pr(w_1^N) > 0.

Ideal case:
  typical word sequence w_1^N:   high probability Pr(w_1^N)
  possible word sequence w_1^N:  low probability Pr(w_1^N)
  untypical word sequence w_1^N: very low probability Pr(w_1^N)


m-Grams

Using m-grams, homophones (phonetically equal words with different spellings) can be distinguished.

Examples from the IBM TANGORA system:

I To, too, two:
  Twenty-two people are too many to be put in this room.

I Right, Wright, write:
  Please write to Mrs. Wright right away.


Bigram LM Complexity

A bigram for the three words A, B, and C represented as a network:

[Figure: fully connected network over the words A, B, C with silence (sil) arcs.]

With W words the network has W^2 arcs plus W arcs for silence.

Problem: the computational complexity rises like W^2.

Introducing empty transitions can help.


Bigram LM Complexity

Bigram LM with empty transitions:

[Figure: network over A, B, C with silence arcs and empty transitions; start node: 0, end nodes: 1, 2, 3.]


Bigram LM Complexity

Bigram LM: silence as part of the words:

[Figure: network over A, B, C with optional trailing silence and ε-transitions; start node: 0, end nodes: 1, 2, 3.]


Unfolding the Bigram LM

Unfolding the bigram over time:

[Figure: bigram network for A, B, C and Sil unfolded over time from t to t+1; acoustic transitions, empty transitions, and language model transitions are marked.]


Unfolding the Bigram LM

Unfolding the bigram over time (silence as part of the words):

[Figure: as before, but with the silence arcs attached to the word ends; acoustic transitions, empty transitions, and language model transitions are marked.]


Bigram LM in Recognition

Bigram LM in recognition: the network has the following transitions:

  Sil    (silence at the beginning of the sentence)
  A      word A
  B      word B
  :
  ASil   (silence after word A)
  BSil   (silence after word B)
  :

We augment the vocabulary w as follows:

  w ∈ {Sil, A, B, . . . , ASil, BSil, . . . }


Bigram LM in Recognition

The auxiliary quantity Q(t, s; w) for dynamic programming is defined as:

  Q(t, s; w) := probability of the best partial path at time t leading to state s of word w.

The recursion then is:

· Within words:

  Q(t, s; w) = max_{s'} { Q(t-1, s'; w) · p(x_t, s | s', w) }

· Word boundaries:

  Q(t-1, 0; w) = max_v { Q(t-1, S(v); v) · p(w|v) }

(The special handling of the silence word is not expressed in the equations.)
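The two recursion steps can be sketched as follows. The data layout (one score array per word, a dense lm[v][w] table, per-frame combined transition-emission scores, and a standard 0/1/2 transition topology) is an assumption for illustration, not the actual implementation discussed later:

// Within words: Q(t,s;w) = max_{s'} Q(t-1,s';w) * p(x_t, s | s', w).
// arc[s][sPrev] holds the combined transition*emission score for this frame.
#include <algorithm>
#include <cstddef>
#include <vector>

typedef float Score;

void expandWord(const std::vector<std::vector<Score> >& arc,
                const std::vector<Score>& qPrev, std::vector<Score>& q) {
    const std::size_t S = qPrev.size();
    for (std::size_t s = 0; s < S; ++s) {
        q[s] = 0;
        // left-to-right HMM: loop, forward and skip transitions (s' = s, s-1, s-2)
        for (std::size_t sPrev = (s >= 2 ? s - 2 : 0); sPrev <= s; ++sPrev)
            q[s] = std::max(q[s], qPrev[sPrev] * arc[s][sPrev]);
    }
}

// Word boundaries: Q(t-1,0;w) = max_v Q(t-1,S(v);v) * p(w|v).
Score recombine(const std::vector<Score>& qFinal,            // Q(t-1,S(v);v) per v
                const std::vector<std::vector<Score> >& lm,  // lm[v][w] = p(w|v)
                std::size_t w) {
    Score best = 0;
    for (std::size_t v = 0; v < qFinal.size(); ++v)
        best = std::max(best, qFinal[v] * lm[v][w]);
    return best;
}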


Bigram LM in Recognition

In principle the probability p(w|v) is the LM, but silence transitions require special interpretation:

  Transition     LM probability p(w|v)
  A – B          p(B|A)
  A – ASil       1
  ASil – B       p(B|A)
  ASil – ASil    1
  Sil – B        p(B) : unigram
  ASil – BSil    0 : not possible
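This table can be implemented as a small lookup over the augmented vocabulary; the entry layout below is an assumption for illustration:

// p(w|v) for the augmented vocabulary {Sil, A, B, ..., ASil, BSil, ...}.
#include <vector>

struct Entry {
    int  baseWord;     // underlying word index; -1 for sentence-initial silence
    bool silenceCopy;  // true for entries like "ASil"
};

// bigram[v][w] = p(w|v), unigram[w] = p(w)
double lmScore(const Entry& from, const Entry& to,
               const std::vector<std::vector<double> >& bigram,
               const std::vector<double>& unigram) {
    if (to.silenceCopy)        // A - ASil and ASil - ASil: 1; ASil - BSil: 0
        return to.baseWord == from.baseWord ? 1.0 : 0.0;
    if (from.baseWord < 0)     // Sil - B: unigram p(B)
        return unigram[to.baseWord];
    return bigram[from.baseWord][to.baseWord];  // A - B and ASil - B: p(B|A)
}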


Bigram LM in Recognition

Traceback arrays: it is easiest to store the decisions about words w starting at time t:

  score:       H(w, t) = max_v { Q(t, S(v); v) · p(w|v) }
  predecessor: V(w, t) = argmax_v { Q(t, S(v); v) · p(w|v) }
  backpointer: B(w, t) = B(t, S(V(w, t)); V(w, t))


Bigram LM in Recognition

Remarks:

I Due to the regular structure of the bigram it is sufficient to store either the LM nodes or the predecessor words in the traceback arrays.

I The traceback at the end of the sentence has to start at the word ends, not at the word beginnings.

I The real implementation is different, to optimize memory efficiency:
  I traceback arrays with one index instead of a pair (w, t)
  I when using beam search (following section) the number of word ends reached is smaller; instead of storing word beginnings it is more efficient to store word ends.


Trigram LM

The trigram language model probability is given by:

  Pr(w_n | w_1^{n-1}) = p(w_n | w_{n-2}, w_{n-1})

Notation: (u, v, w) = (w_{n-2}, w_{n-1}, w_n). u, v are the predecessor words of w; these have to be considered in the LM recombination.

The auxiliary quantity for dynamic programming is defined as:

  Q_v(t, s; w) := probability of the best path at time t leading to the state s of word w with predecessor word v.

I For each word w a copy for every predecessor word v has to be made.

I The cost of an arc only depends on the arc itself. This allows the practical implementation of dynamic programming.


Unfolding the Trigram LM

Trigram LM recombination:

[Figure: trigram recombination for the vocabulary A, B, C: one word copy per predecessor word, unfolded over time from t to t+1.]


Trigram LM

Dynamic programming recursion:

I within words:

  Q_v(t, s; w) = max_{s'} { Q_v(t-1, s'; w) · p(x_t, s | s', w) }

I word boundaries:

  Q_v(t-1, 0; w) = max_u { Q_u(t-1, S(v); v) · p(w | u, v) }

Traceback arrays (at word beginnings):

  score:       H(v, w, t) = max_u { Q_u(t, S(v); v) · p(w | u, v) }
  predecessor: U(v, w, t) = argmax_u { Q_u(t, S(v); v) · p(w | u, v) }
  backpointer: B(v, w, t) = B_{U(v,w,t)}(t, S(v); v)

Silence: in principle, silence is treated as in the bigram LM; the implementation is more complex.


Trigram LM: Traceback Implementation

word string: w_1, . . . , w_n, . . . , w_N with word boundaries t_1, . . . , t_n, . . . , t_N
sentence end symbol: $ (= Sil)

I Note: traceback in reverse order (start at the end with n = 1)

I Initialization: best word end

  (w_2, w_1) = argmax_{v,w} { Q_v(T, S(w); w) · p($ | v, w) }

  t_1 = T;  t_2 = B_{w_2}(T, S(w_1); w_1)

I Loop: n = 2
  while t_n > 0 do
    n = n + 1
    w_n = U(w_{n-1}, w_{n-2}, t_{n-1})
    t_n = B(w_{n-1}, w_n, t_{n-1})
  N = n − 1
  reverse: (w_1, t_1), . . . , (w_N, t_N) ← (w_N, t_N), . . . , (w_1, t_1)
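A sketch of this loop in C++; the accessors U and B stand in for the traceback arrays filled during search (an assumed interface, not the teaching-patch code):

// Reverse traceback of the best word sequence for a trigram LM.
#include <algorithm>
#include <vector>

typedef int (*Accessor)(int v, int w, int t);  // U(v,w,t) resp. B(v,w,t)

void traceback(Accessor U, Accessor B, int w1, int t1, int w2, int t2,
               std::vector<int>& words, std::vector<int>& times) {
    words.clear(); times.clear();
    words.push_back(w1); times.push_back(t1);  // best word end at t1 = T
    words.push_back(w2); times.push_back(t2);
    while (times.back() > 0) {                 // walk backwards to t = 0
        std::size_t n = words.size();          // the new entry gets index n
        int w = U(words[n-1], words[n-2], times[n-1]);
        int t = B(words[n-1], w, times[n-1]);
        words.push_back(w);
        times.push_back(t);
    }
    std::reverse(words.begin(), words.end());  // restore spoken order
    std::reverse(times.begin(), times.end());
}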


Time Complexity of DP Beam Search

Time complexity for full DP search (later: DP beam search):

  T = number of time frames of the test utterance
  W = number of acoustic reference words
  S = average number of states per (acoustic) word
  K = number of positions in the position unigram
  silence model: 1 state

  language model    acoustic search comparisons   language model
  type              (= 3 · HMM states)            comparisons

  unigram           3 · T · [W · S + 1]           T · [W + 1]
  position unigram  3 · T · K · [W · S + 1]       T · K · [W + 1]
  bigram            3 · T · [W · S + W]           T · W · [W + 1]
  trigram           3 · T · W · [W · S + W]       T · W · [W^2 + 1]


Memory Complexity of DP Search

Memory requirements:

I acoustic search: one column for backpointer and score

I language model recombination: traceback arrays with one entry for each LM node

  LM type           acoustic search           language model

  unigram           2 · [W · S + 1]           2 · T
  position unigram  2 · K · [W · S + 1]       2 · T · K
  bigram            2 · [W · S + W]           2 · T · [W + 1]
  trigram           2 · W · [W · S + . . . ]  2 · T · [W^2 + . . . ]


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Beam Search

Dynamic programming beam search along with implementation details.

The search consists of these principal components:

I language model recombination: word boundaries

I acoustic search: word interior

I bookkeeping: decisions about word (and boundary) hypotheses

I traceback: construct the best scoring word sequence

Modifications for large vocabulary systems as opposed to digit string recognition:

I limit the search space by beam search

I modified bookkeeping for active hypotheses (due to beam search)

I modified bookkeeping for traceback arrays (due to beam search)

I garbage collection for traceback arrays (due to beam search)


Traceback Arrays

Traceback arrays:

I Use one index:
  I less memory (exhaustive search)
  I with beam search, only few word ends are reached

I Bookkeeping is possible at these stages:
  I at the word ends
  I at the LM nodes (most efficient, smallest number of hypotheses)
  I at the word beginnings

I Organization:
  I so far: one element in the traceback array per time frame
  I now: the backpointer does not point at the time frame the predecessor word ended; it points at the array element with the corresponding information.


Traceback Arrays

Reminder: the entries of the traceback array define the nodes of a tree over time.

I Garbage collection (beam search): hypotheses can be pruned; array entries no backpointer points at are labeled as free.

I Partial traceback: if all backpointers of active hypotheses point at one entry in the traceback array, the decision before this entry is determined.

I Experimental experience (beam search): the delay depends on the task, 1–2 words when using partial traceback.


Beam Search: Pruning

Beam search:

I suboptimal heuristic approach: give up the global optimum.

I time synchronous search: the remaining cost of the path is constant for all hypotheses.

I baseline method for pruning: discard unlikely hypotheses at every time frame t:

  I Acoustic pruning: retains state hypotheses whose scores are close to the score of the best state hypothesis:

    Q_AC(t) := max_{(v,s)} Q_v(t, s),

    prune state hypothesis (t, s; v) iff:

    Q_v(t, s) < f_AC · Q_AC(t)


Beam Search: Pruning

I additional pruning steps:

  I Language model pruning: retains tree start-up hypotheses whose score is close to the score of the best tree start-up hypothesis (a sketch of the threshold pruning follows below):

    Q_LM(t) := max_v Q_v(t, s = 0),

    prune tree start-up hypothesis iff:

    Q_v(t, s = 0) < f_LM · Q_LM(t)

  I Histogram pruning: limits the number of surviving state hypotheses to a maximum number (MaxHyp).
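Both threshold prunings work the same way on a flat list of state hypotheses; a minimal sketch of the acoustic variant (the StateHyp layout is an assumption for illustration):

// Acoustic pruning: keep only hypotheses with Q_v(t,s) >= fAC * Q_AC(t).
#include <algorithm>
#include <vector>

struct StateHyp { int word; int state; float score; };  // score = probability Q

void acousticPrune(std::vector<StateHyp>& hyps, float fAC) {
    float qAC = 0.0f;                          // Q_AC(t): best score at this frame
    for (std::size_t i = 0; i < hyps.size(); ++i)
        qAC = std::max(qAC, hyps[i].score);
    std::vector<StateHyp> kept;
    for (std::size_t i = 0; i < hyps.size(); ++i)
        if (hyps[i].score >= fAC * qAC)        // prune iff score < fAC * Q_AC(t)
            kept.push_back(hyps[i]);
    hyps.swap(kept);
}

Language model pruning is identical in structure, applied only to the hypotheses with s = 0.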


Beam Search: Pruning

Using pruning techniques can lead to search errors, inducing recognition errors.

Remember: possible reasons for recognition errors:

I shortcomings of the acoustic models

I shortcomings of the language models

I search errors (when using beam search or other heuristic methods)

In general, better acoustic models and language models focus the search space, i.e. they allow for tighter pruning thresholds.


Beam Search: Pruning

Illustration of the search process (DP beam search) for connected digit recognition:

[Figure: active state hypotheses plotted over time frames 0–300.]


Beam Search: Example

Example of the dependency between the search space and the word error rate (WER).
WSJ task, vocabulary size = 20 000 words, bigram LM, PP = 200.

AcuThr: acoustic pruning threshold.
States: average number of state hypotheses in the HMM after pruning.

  AcuThr   States      WER
  (k)      (average)   [%]
  50       252         45.6
  60       677         28.3
  65       1068        24.2
  75       2396        20.6
  100      12908       18.4
  110      21894       18.3
  120      32538       18.2
  130      43862       18.2


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


RWTH ASR System: Teaching Patch

Classes and dependencies:

  Search::SearchAlgorithm
  SearchInterface
  LinearSearch
  LinearSearch::SearchSpace
  Lexicon
  Bookkeeping

RWTH ASR System:
  - acoustic model
  - language model
  - corpus handling
  - pronunciation lexicon handling
  - general search environment

RWTH ASR Teaching Patch:
  - interface to RWTH ASR System
  - implementation of linear search, including bookkeeping and traceback


RWTH ASR System: Teaching Patch

Types (Types.hh):

#ifndef _TEACHING_TYPES_HH
#define _TEACHING_TYPES_HH

#include <vector>
#include <limits>

namespace Teaching {
    typedef unsigned int    Time;
    typedef unsigned short  Mixture;
    typedef unsigned int    Word;
    typedef unsigned short  Phoneme;
    typedef unsigned short  State;
    typedef unsigned int    Index;
    typedef std::vector<Word>    WordSequence;
    typedef std::vector<Mixture> MixtureSequence;
    typedef float Score;

    static const Word  invalidWord  = std::numeric_limits<Word>::max();
    static const Index invalidIndex = std::numeric_limits<Index>::max();
    static const Score maxScore     = std::numeric_limits<Score>::max();
}

#endif // _TEACHING_TYPES_HH


Interface to RWTH ASR System

General interface to the teaching patch: class SearchInterface provides the connection to the general search environment, including handling of configuration and resources (corpus and models), as well as an overall wrapper for the specific search implementation.

Main functions to be implemented here are:

I initialize: search initialization

I processFrame: expansion of hypotheses to the next time frame

I getResult: traceback of the best recognized word sequence

Implementation:

I show SearchInterface.hh

I show SearchInterface.cc

I show LinearSearch.hh


Interface to RWTH ASR System

Phoneme list and pronunciation lexicon. Configuration file: XML format, example:

<?xml version="1.0" encoding="ascii"?>
<lexicon>
  <phoneme-inventory>
    <phoneme><symbol>AE</symbol></phoneme>
    <phoneme><symbol>AH</symbol></phoneme>
    <phoneme><symbol>N</symbol></phoneme>
    <phoneme><symbol>D</symbol></phoneme>
    ...
  </phoneme-inventory>
  <lemma>
    <orth>AND</orth>
    <phon>AE N D</phon>
    <phon>AH N D</phon>
  </lemma>
  ...
</lexicon>

Lexicon configuration file: show an4.lexicon
Implementation: show Lexicon.hh, show Lexicon.cc


Example: Implementation of Dynamic Programming Beam Search for Bigram LM

Consider the unfolded bigram network:

[Figure: bigram network for A, B, C and Sil unfolded over time from t to t+1; acoustic transitions, empty transitions, and language model transitions are marked.]



Dynamic Handling of State Hypotheses

Goal: complexity should be linear in the number of active hypotheses.
⇒ discard low probability hypotheses
⇒ incomplete state hypotheses

[Figure: expansion of active states from time frame t to t+1; dead states are skipped.]

Efficient expansion of the hypotheses from t to t+1 requires these operations:

I search(x, S)
I insert(x, S)
I initialize(S)
I enumerate(S)

Compare methods for set representation:

I dictionary operations
I array representation of sets
I inverted lists and bit vectors


Linear Search Implementation

Search space representation: pruning necessitates dynamic handling of word and state hypotheses:

I List of active words:
  word, stateHypBegin, stateHypEnd, entryStateHypothesis

I List of active states for every word:
  state, score, backpointer

I To address active words, a list of all words pointing into the list of active words is used.

I A list of all states of a word is used to handle active successor states during the expansion of the states of a word.

Implementation: show LinearSearch.cc


Linear Search Implementation

Search space representation: word hypotheses.

[Figure: wordHypothesisMap_ (one entry per word, pointing into the list of active words), wordHypotheses_ (word, stateHypBegin, stateHypEnd, entryStateHypothesis), and stateHypotheses_ (state, score, backpointer); inactive entries are marked invalid.]


Linear Search Implementation

Search space representation: state expansion.

[Figure: expansion of the stateHypotheses_ of a word into newStateHypotheses_, using stateHypothesisMap_ to find already-activated successor states; the word hypothesis entry (stateHypBegin, stateHypEnd, entryStateHypothesis) is updated accordingly.]


Linear Search Implementation

Search space representation: bookkeeping.

Implementation: show BookKeeping.hh, show BookKeeping.cc

[Figure: bookKeeping_ array with entries (word, score, time, timestamp, backpointer), a sentinel entry at sentinelBackpointer = 0, and the backpointers of stateHypotheses_ referencing it; lastTimestamp_ marks the current time frame.]


Implementation of DP Beam Search for Bigram LM

Implementation example in C++ code: show Linear Search


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Excursion (for experts): Language Model Factor

Experiments show that to achieve high performance, it is very important to give the language model Pr(w_1^N) much more weight than the acoustic model Pr(x_1^T | w_1^N).

Why?


Language Model Factor

Starting point: Bayes decision rule with the true models Pr(w_1^N) and Pr(x_1^T | w_1^N):

  argmax_{w_1^N} { Pr(w_1^N) · Pr(x_1^T | w_1^N) }

In training, we compute an estimate of the true models:

  Pr(w_1^N)         → p(w_1^N)
  Pr(x_1^T | w_1^N) → p(x_1^T | w_1^N)

The shapes (i.e. the weights) of the model distributions are changed by exponentiation with exponents α and β:

  p(w_1^N)         → p^α(w_1^N)
  p(x_1^T | w_1^N) → p^β(x_1^T | w_1^N)


Language Model Factor

Instead of re-normalizing each individual model separately, we re-normalize by defining the following posterior probability:

  p(w_1^N | x_1^T) = p^α(w_1^N) · p^β(x_1^T | w_1^N) / ∑_{w̃_1^N} p^α(w̃_1^N) · p^β(x_1^T | w̃_1^N)

                   = p^α(w_1^N) · p^β(x_1^T | w_1^N) / const(w_1^N)


Language Model Factor

Decision rule with weight exponents:

  r(x_1^T) = argmax_{w_1^N} p(w_1^N | x_1^T)
           = argmax_{w_1^N} { p^α(w_1^N) · p^β(x_1^T | w_1^N) / const(w_1^N) }
           = argmax_{w_1^N} { p^α(w_1^N) · p^β(x_1^T | w_1^N) }
           = argmax_{w_1^N} { log[ p^α(w_1^N) · p^β(x_1^T | w_1^N) ] }
           = argmax_{w_1^N} { α log p(w_1^N) + β log p(x_1^T | w_1^N) }
           = argmax_{w_1^N} { (α/β) log p(w_1^N) + log p(x_1^T | w_1^N) }

The factor α/β is referred to as the language model factor (e.g. ≈ 10–15).
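In a log-score implementation this reduces to a single multiplicative factor on the language model score; a one-line sketch with assumed names:

// Combined score with the language model factor lmFactor = alpha/beta;
// the argmax is unchanged by the overall division by beta (beta > 0).
double combinedLogScore(double logLm, double logAc, double lmFactor) {
    return lmFactor * logLm + logAc;
}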


Word Dependent Language Model Factor

Consider the posterior probability with suitable word dependent exponents β(w), where x_n denotes the acoustic vectors aligned to word w_n:

  p(w_1^N | x_1^T) = ∏_n [ p^α(w_n | w_1^{n-1}) · p^{β(w_n)}(x_n | w_n) ]
                     / ∑_{w̃_1^N} ∏_n [ p^α(w̃_n | w̃_1^{n-1}) · p^{β(w̃_n)}(x̃_n | w̃_n) ]


Word Dependent Language Model Factor

Decision rule with word-dependent weight exponents:

  r(x_1^T) = argmax_{w_1^N} p(w_1^N | x_1^T)
           = argmax_{w_1^N} ∏_n [ p^α(w_n | w_1^{n-1}) · p^{β(w_n)}(x_n | w_n) ]
           = argmax_{w_1^N} ∑_n [ α log p(w_n | w_1^{n-1}) + β(w_n) log p(x_n | w_n) ]
           = argmax_{w_1^N} ∑_n [ log p(w_n | w_1^{n-1}) + (β(w_n)/α) log p(x_n | w_n) ]

Effect: word dependent scale factors β(w)/α.
Training: like maximum entropy training.


Scale Factors for Each Knowledge Source

Apply scale exponents to each of the knowledge sources: language model, transition and emission probabilities:

  p(w_1^N | x_1^T) =
      ∏_{n=1}^{N} p^α(w_n | w_{n-2}^{n-1}) · max_{s_1^T} ∏_{t=1}^{T} [ p^β(s_t | s_{t-1}, w_1^N) · p^γ(x_t | s_t, w_1^N) ]
    / ∑_{w̃_1^N} ∏_{n=1}^{N} p^α(w̃_n | w̃_{n-2}^{n-1}) · max_{s_1^T} ∏_{t=1}^{T} [ p^β(s_t | s_{t-1}, w̃_1^N) · p^γ(x_t | s_t, w̃_1^N) ]


Scale Factors for Each Knowledge Source

Resulting Bayes decision rule:

  r(x_1^T) = argmax_{w_1^N} p(w_1^N | x_1^T)
           = argmax_{w_1^N} { α ∑_{n=1}^{N} log p(w_n | w_{n-2}^{n-1})
             + max_{s_1^T} ∑_{t=1}^{T} [ β log p(s_t | s_{t-1}, w_1^N) + γ log p(x_t | s_t, w_1^N) ] }


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition
   1.1 Overview: Architecture
   1.2 Phoneme Models and Subword Units
   1.3 Phonetic Decision Trees
   1.4 Language Modelling
   1.5 Dynamic Programming Beam Search
   1.6 Implementation Details
   1.7 Excursion (for experts): Language Model Factor
   1.8 Excursion (for experts): Length Modelling

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Excursion (for experts): Length Modelling

Explicit length models: for x_1^T and w_1^N, the lengths T and N are random variables themselves.

I language model: p(N, w_1^N); check normalization:

  p(N, w_1^N) = p(N) · p(w_1^N | N)

  ∑_{N, w_1^N} p(N, w_1^N) = ∑_N p(N) ∑_{w_1^N} p(w_1^N | N)
                           = ∑_N p(N) · ∑_{w_1^N} ∏_{n=1}^{N} p(w_n | w_1^{n-1}, N)
                           = ∑_N p(N) · ∏_{n=1}^{N} ∑_{w_n} p(w_n | w_1^{n-1}, N)
                           = ∑_N p(N) · 1 = 1

I acoustic model: p(T, x_1^T | w_1^N) (check normalization)


Length Modelling

I Language model:

  p(N, w_1^N) = p(N) · p(w_1^N | N)

  with model assumptions:

              = p(N) · ∏_{n=1}^{N} p(w_n | w_{n-2}^{n-1}, N)


Length Modelling

I Acoustic model with word boundaries t_1^N (with t_0 = 0, t_N = T):

  p(T, x_1^T | w_1^N) = ∑_{t_1^{N-1}} p(t_1^N, x_1^T | w_1^N)

  p(t_1^N, x_1^T | w_1^N) = p(t_1^N | w_1^N) · p(x_1^T | t_1^N, w_1^N)

  with model assumptions:

                          = ∏_{n=1}^{N} [ p(t_n | t_{n-1}, w_n) · p(x_{t_{n-1}+1}^{t_n} | w_n, t_{n-1}^{n}) ]


Length Modelling: Bayes Decision Rule

Optimization criterion (maximum approximation) using a trigram LM p(w_n | w_{n-2}^{n-1}, N) and word segmentation t_1^N (with t_0 = 0, t_N = T):

  max_N { p(N) · max_{w_1^N, t_1^N} ∏_{n=1}^{N} [ p(w_n | w_{n-2}^{n-1}, N) · p(t_n | t_{n-1}, w_n) · p(x_{t_{n-1}+1}^{t_n} | w_n, t_{n-1}^{n}) ] }

with the length models:

I length dependencies in language models: p(N) and p(w_n | w_{n-2}^{n-1}, N),

I duration models of acoustic models: p(t_n | t_{n-1}, w_n).

Experimental results: rarely tested and no significant improvements.


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
   2.1 Review: Linear Search
   2.2 Tree Search
   2.3 Dynamic Programming
   2.4 Pruning Methods
   2.5 Language Model Look-Ahead
   2.6 Phoneme Look-Ahead
   2.7 Subtree Dominance Recombination (SDR)
   2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Review: Linear Search

[Figure: Bayes decision rule architecture. Speech input → acoustic analysis → x1 . . . xT → global search: maximize Pr(w1 . . . wN) · Pr(x1 . . . xT | w1 . . . wN) over w1 . . . wN, using the phoneme inventory and pronunciation lexicon (acoustic model Pr(x1 . . . xT | w1 . . . wN)) and the language model Pr(w1 . . . wN) → recognized word sequence.]


Review: Linear Search

The search process has to find the spoken word sequence in the set of possible word sequences.

The set of possible word sequences can be represented as a tree.

[Figure: tree over the vocabulary A, B, C with branch probabilities such as Pr(B), Pr(A|B), Pr(A|BA).]


Review: Linear Search

Maximize over all possible word sequences:

  [w_1^N]_opt = argmax_{w_1^N} { Pr(w_1^N) · ∑_{s_1^T} Pr(x_1^T, s_1^T | w_1^N) }

Typical assumptions:

I Maximum approximation on state level:

  [w_1^N]_opt = argmax_{w_1^N} { Pr(w_1^N) · max_{s_1^T} Pr(x_1^T, s_1^T | w_1^N) }

Note: only consider state sequences s_1^T that are consistent with w_1^N.


Review: Linear Search

Assumptions used:

I Bigram language model: p(w_n | w_{n-1}) or p(w|v).

[Figure: linear search trellis over time t and word positions n = 1, 2, 3 for the words A, B, C.]

When recognizing continuous speech, language model recombination is carried out every 10 ms because the word boundaries are not known.


Review: Linear Search

Words are represented as phoneme sequences:

I Typically about 40 phonemes (German, French, English).

I A pronunciation lexicon gives the transcription.

I Words are modelled as a concatenation of phoneme models.

I Automatic training.

I Context dependent phonemes:
  I diphones
  I triphones
  i.e. one phoneme in the respective context (not two or three phonemes).

[Figure: HMM trellis of a word: state index over time index, phoneme by phoneme.]


Review: Linear Search

Dynamic programming with beam search:

I left-to-right time synchronous evaluation: direct comparison of scores

I beam search:
  I find the best hypothesis at time t
  I prune unlikely hypotheses

I experiments: only a fraction of the potential search space is active (depends on perplexity, acoustic models, . . . )

I implementation: dynamic search space

[Figure: search space over time 0 . . . T for the word hypotheses w1[1:n1], . . . , wr[1:nr], . . . , wR[1:nR].]


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
   2.1 Review: Linear Search
   2.2 Tree Search
   2.3 Dynamic Programming
   2.4 Pruning Methods
   2.5 Language Model Look-Ahead
   2.6 Phoneme Look-Ahead
   2.7 Subtree Dominance Recombination (SDR)
   2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Tree Search

I Every word end hypothesis activates the initial states of all following words.

I Experiments show that in the case of large vocabulary speech recognition (20 000 word vocabulary) 90% of the search effort is in the first two phonemes.

I Solution: represent the vocabulary as a lexical prefix tree (see the sketch below).

[Figure: example with letters instead of phonemes for the words can, cab, cable, car: a) tree lexicon sharing common prefixes, b) linear lexicon.]
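A minimal sketch of building such a prefix tree from phoneme transcriptions; the node layout and names are assumptions for illustration, not the RWTH implementation:

// Lexical prefix tree: words sharing a phoneme prefix share the same arcs.
#include <map>
#include <string>
#include <vector>

struct TreeNode {
    std::map<std::string, int> succ;  // phoneme symbol -> child node index
    std::vector<int> wordEnds;        // words (incl. homophones) ending here
};

struct LexicalTree {
    std::vector<TreeNode> nodes;
    LexicalTree() : nodes(1) {}       // node 0 is the root

    void addWord(int word, const std::vector<std::string>& phonemes) {
        int n = 0;                    // start at the root, reuse common prefixes
        for (std::size_t i = 0; i < phonemes.size(); ++i) {
            std::map<std::string, int>::const_iterator it =
                nodes[n].succ.find(phonemes[i]);
            int child;
            if (it == nodes[n].succ.end()) {
                child = int(nodes.size());
                nodes.push_back(TreeNode());        // may reallocate the vector,
                nodes[n].succ[phonemes[i]] = child; // so re-index nodes[n] here
            } else {
                child = it->second;
            }
            n = child;
        }
        nodes[n].wordEnds.push_back(word);  // the word identity is known only here
    }
};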


Tree Search

The phonemes (triphones) at the arcs of the lexical tree are modeled as usual with 6-state HMMs.

At the tree nodes the following transitions are possible:

[Figure: HMM transitions at a tree node from arc a into the successor arcs n, b, r.]


Tree Search

The following table shows some statistics of the tree lexicon for the so-called Wall Street Journal task (19 978 words):

I 44 context independent (CI) phoneme models

I 4688 context dependent (CD) phoneme models

I a word: typically 6 phonemes

⇒ compression factor:  CI: 6 · 19 978 / 44 587 ≈ 2.7
                       CD: 6 · 19 978 / 63 155 ≈ 1.9

terminology: lexicon: linear vs. tree
terminology: search:  linear vs. tree


Tree Search

  phoneme     number of   number of arcs
  generation  word ends   CD tree   CI tree
  1           10          544       45
  2           31          3626      652
  3           1476        9335      3799
  4           2959        12180     8245
  5           3791        11653     9432
  6           3576        9402      7932
  7           2845        6829      5895
  8           2240        4563      4025
  9           1414        2632      2341
  10          778         1359      1223
  ...         ...         ...       ...
  16          7           8         8
  17          1           1         1
  sum         19978       63155     44587


Tree Search

I The average compression factor of a tree lexicon compared to a linear lexicon is about 2 to 3.

I Since the first phonemes require the most search effort, the search space reduction using beam search is much larger (a factor of 15 was determined in experiments on the WSJ corpus with a vocabulary of only 5000 words).

I A fundamental problem: the word identity is not known until a word end node is reached. This requires a different language model recombination: use a separate tree for every predecessor word (only a few trees are active due to beam search).


Tree Search

Bigram language model recombination for a three word vocabulary:

[Figure: one lexical tree copy per predecessor word A, B, C; the word ends of all trees are recombined by the language model and start new tree copies.]


Trigram language model recombination for a three word vocabulary:

[Figure: one lexical tree copy per predecessor word pair (AA, BA, CA, AB, BB, CB, AC, BC, CC); the word ends are recombined by the trigram language model and start new tree copies.]


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
   2.1 Review: Linear Search
   2.2 Tree Search
   2.3 Dynamic Programming
   2.4 Pruning Methods
   2.5 Language Model Look-Ahead
   2.6 Phoneme Look-Ahead
   2.7 Subtree Dominance Recombination (SDR)
   2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Dynamic Programming

In the previous picture every arc has costs that only depend on the arc itself, not on the history of the preceding path. Therefore dynamic programming can be used.

Indices: t time, v predecessor word, w word index (defined only at word end states), s state, S_v final state of word v in the tree.

Auxiliary quantity:

  Q_v(t, s) := probability of the best partial path producing the acoustic vectors x_1, . . . , x_t and ending in state s of the tree lexicon with v as predecessor word.


Dynamic Programming

Within a word (a tree):

  Q_v(t, s) = max_σ { p(x_t, s | σ) · Q_v(t−1, σ) }

Maximize over the best predecessor state σ. At phoneme boundaries the complex transitions have to be considered.

After finishing the computation for the states within a word, the recombination is carried out at the word boundaries:

  Q_v(t, 0) = max_u { p(v|u) · Q_u(t, S_v) },

where s = 0 is a virtual starting state and S_v the final state of word v.


Silence Copies

The silence model requires special handling:

I silence has to be allowed between words

I the word before silence has to be stored

Solution:

I a silence arc is added to every tree

I during language model recombination, only the root of the same tree can be reached from the silence node.

[Figure: bigram tree recombination with one silence arc per tree copy; the silence arc of tree v loops back to the root of tree v.]


The silence handling for a trigram language model:

[Figure: trigram tree recombination with silence copies; each tree copy has its own silence arc so that the predecessor word pair is preserved across silence.]


Outline

0. Lehrstuhl fur Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
   2.1 Review: Linear Search
   2.2 Tree Search
   2.3 Dynamic Programming
   2.4 Pruning Methods
   2.5 Language Model Look-Ahead
   2.6 Phoneme Look-Ahead
   2.7 Subtree Dominance Recombination (SDR)
   2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Pruning Methods
I Acoustic pruning after acoustic recombination:

1. determine the score of the best hypothesis at time t:

Q_AC(t) := max_(v,s) Q_v(t, s).

2. prune, i.e. eliminate a hypothesis (t, s; v) if

Q_v(t, s) < f_AC · Q_AC(t).

f_AC < 1 is the acoustic pruning threshold.

[Figure: state hypotheses s sorted by score −log Q_v(t, s); the pruning threshold −(log Q_AC(t) + log f_AC) cuts the curve above the minimum s_min; panels a) and b) show a narrow and a broad minimum of the path scores.]

I Effect of acoustic pruning:
a) narrow minimum of the path scores → only a few hypotheses survive the pruning;
b) broad minimum → many hypotheses survive.
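In negative log arithmetic the comparison Q_v(t, s) < f_AC · Q_AC(t) becomes an additive beam around the best score. A minimal sketch (all names are assumptions):

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct StateHyp { int tree, state; double score; };  // score = -log Q_v(t,s)

    // Keep only hypotheses within 'beam' of the best score; beam = -log f_AC > 0.
    void acousticPrune(std::vector<StateHyp>& hyps, double beam) {
      double best = std::numeric_limits<double>::infinity();
      for (const StateHyp& h : hyps) best = std::min(best, h.score);
      hyps.erase(std::remove_if(hyps.begin(), hyps.end(),
                     [&](const StateHyp& h) { return h.score > best + beam; }),
                 hyps.end());
    }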


Pruning Methods

I Language model pruning (tree start pruning): prune at the tree roots directly after language model recombination:

I determine the score of the best starting tree hypothesis at time t:

Q_LM(t) := max_v Q_v(t, s = 0).

I prune a tree hypothesis (t, s = 0; v) if:

Q_v(t, s = 0) < f_LM · Q_LM(t).

f_LM < 1 is the language model pruning threshold.


Pruning Methods
I Histogram pruning

is applied after acoustic and language model pruning, if the number of active hypotheses N(t) at time t is larger than a value MaxHyp:

N(t) > MaxHyp.

Histogram pruning only retains the MaxHyp best hypotheses (using bucket sort). Temporary peaks in the number of hypotheses are eliminated.
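A sketch of the selection step, with std::nth_element standing in for the bucket sort over quantized scores mentioned above (names are assumptions):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct StateHyp { int tree, state; double score; };  // score = -log Q_v(t,s)

    // If more than maxHyp hypotheses are active, keep only the maxHyp best.
    void histogramPrune(std::vector<StateHyp>& hyps, std::size_t maxHyp) {
      if (hyps.size() <= maxHyp) return;
      std::nth_element(hyps.begin(), hyps.begin() + maxHyp, hyps.end(),
                       [](const StateHyp& a, const StateHyp& b) { return a.score < b.score; });
      hyps.resize(maxHyp);  // drop everything beyond the MaxHyp-th best score
    }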

[Figure: number of active hypotheses N_hyp(t) over time t; histogram pruning caps the curve at MaxHyp.]

Experimental result: acoustic pruning is by far the most important method. The others eliminate temporary peaks in the number of hypotheses.


Experimental Results

Experimental results on the WSJ corpus:

I Training corpus: WSJ0,1;

I Test corpus: NAB ’94 H1 development set;

I speaker independent;

I word conditioned tree copies;
I vocabulary:
  I W = 20 000 words;
  I tree: 63 155 arcs (4 688 context dependent phoneme models);
  I 6 states per phoneme

I bigram language model perplexity 215

I MaxHyp: 100 000


Experimental Results: Effect of Acoustic Pruning
The acoustic pruning threshold was varied. The number of active states, arcs, trees, and words was measured along with the word error rate (WER).

  threshold | states | arcs  | trees | word ends | WER [%]
  (average number of active hypotheses after pruning)
  50 k      |    252 |    80 |     4 |         9 | 45.6
  60 k      |    677 |   213 |     6 |        17 | 28.3
  65 k      |   1068 |   332 |     9 |        24 | 24.2
  75 k      |   2396 |   703 |    15 |        49 | 20.6
  100 k     |  12908 |  3554 |    38 |       238 | 18.4
  120 k     |  32538 |  8838 |    59 |       564 | 18.2
  130 k     |  43862 | 11784 |    67 |       745 | 18.2
  170 k     |  79836 | 20649 |    87 |      1363 | 18.1


Experimental Results: Effect of Acoustic Pruning

I The word error rate reaches saturation at about 10 000 active phoneme arcs or 40 000 states.

I The potential search space consists of 20 000 · 63 000 · 6 ≈ 8 · 10^9 states.


Experimental Results: Effect of Language Model

I Corpus: WSJ (Wall Street Journal) Nov ’92, development and eval.

I 5k vocabulary.

I 1.44 h speech data (516984 time frames).

I 18 speakers, 740 sentences, 12137 spoken words.

I 2001 mixtures, 96277 densities.

I Pooled variance vector.

I Feature vector with 33 components.

I LDA (linear discriminant analysis).

I CART (phonetic decision tree).

I Pruning parameter was optimized for the trigram LM and kept constant for the bi-, uni- and zerogram LM.


Experimental Results: Effect of the Language Model

I LM look-ahead:
  I zerogram-LM: no LM look-ahead
  I unigram-LM: unigram look-ahead
  I bigram-LM: bigram look-ahead
  I trigram-LM: bigram look-ahead

I real time factor (RTF) measured on a 500 MHz Alpha-PC with 512 MB memory.


Experimental Results

Effect of language model:

  model    |   PP | states after expansion | after pruning: states / arcs / trees | del / ins / WER [%] | RTF
  zerogram | 4988 | 10702 | 7537 / 1825 /  1 | 14.6 / 0.1 / 40.0 | 13.3
  unigram  |  746 |  6521 | 3880 /  958 /  1 |  4.6 / 1.1 / 22.9 | 11.0
  bigram   |  107 | 33775 | 6472 / 1835 / 36 |  1.4 / 0.5 /  6.9 | 12.2
  trigram  |   56 | 55769 | 8013 / 2333 / 67 |  0.7 / 0.4 /  4.5 | 14.9


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
2.1 Review: Linear Search
2.2 Tree Search
2.3 Dynamic Programming
2.4 Pruning Methods
2.5 Language Model Look-Ahead
2.6 Phoneme Look-Ahead
2.7 Subtree Dominance Recombination (SDR)
2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Language Model Look-Ahead

I Language model look-ahead
Problem: the word identity needed for the language model is in general only known at word ends.
Idea: use the knowledge from the language model as early as possible in the search process. From a certain state (or arc) only a small subset of leaves, i.e. word ends, can be reached.

W(s) := set of word ends that can be reached from state s.

[Figure: lexical tree copy for predecessor word v; from state s only the subset W(s) of word ends is reachable.]


Language Model Look-Ahead
Only a few language model probabilities are possible. Define the best language model probability reachable from state s:

π_v(s) := max_{w ∈ W(s)} p(w | v).

For acoustic pruning these values are combined with the original auxiliary scores Q_v(t, s):

Q̃_v(t, s) := π_v(s) · Q_v(t, s)

and acoustic pruning is applied to Q̃_v(t, s) instead of Q_v(t, s).
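As an illustration, the look-ahead table π_v(s) can be filled with a single backward sweep over the tree; a minimal sketch in C++ (negative log scores, so the max over p(w|v) becomes a min; node layout and names are assumptions):

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct TreeNode { int parent; int word; };  // word >= 0 at word ends, else -1

    // pi[s] = min over word ends w below s of -log p(w|v), assuming children
    // are numbered after their parents (root = node 0).
    std::vector<double> lmLookAhead(const std::vector<TreeNode>& nodes,
                                    const std::vector<double>& lmCost /* -log p(w|v) */) {
      const double inf = std::numeric_limits<double>::infinity();
      std::vector<double> pi(nodes.size(), inf);
      for (int s = (int)nodes.size() - 1; s > 0; --s) {
        if (nodes[s].word >= 0)
          pi[s] = std::min(pi[s], lmCost[nodes[s].word]);            // word end at/below s
        pi[nodes[s].parent] = std::min(pi[nodes[s].parent], pi[s]);  // propagate upwards
      }
      return pi;  // pruning then uses pi[s] + (-log Q_v(t,s))
    }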

Problem:

I Vocabulary W = 20 000.
I 4 688 context dependent phoneme models.
I Lexical prefix tree with 63 000 phoneme arcs.
I 20 000 · 63 000 factorized bigram language model probabilities:
⇒ huge look-up table!


Language Model Look-Ahead

Solution to the huge potential number of LM probabilities needed:

I Path compression:
  I upper limit: W words ⇒ 2 · W tree nodes → compressed LM look-ahead tree
  I optional: only consider the first 2–4 phoneme arc generations of the LM look-ahead tree.

I LM tree factorization on demand.


Language Model Look-Ahead

Examples of a language model look-ahead tree:

[Figure: LM look-ahead tree for predecessor v with words w1, ..., w6 and factorized probabilities at the arcs; left: LM tree before compression, right: LM tree after compression (linear arc chains merged into single arcs).]


Language Model Look-Ahead

To reduce memory and computation, a unigram language model can be used instead of the bigram model to calculate the look-ahead.

π_v(s) := max_{w ∈ W(s)} p(w)

Then only one value has to be stored for every node.

Experimental results: NAB’94 H1 development set

I 20 speakers (10 men and 10 women).

I 310 utterances with 7387 words (199 out of vocabulary words).

I RTF (real time factor) measured on an SGI R5000.


Language Model Look-Ahead

  LM for look-ahead    | look-ahead tree: gen. / arcs | trees | search space: states / arcs / trees | DEL / INS | WER [%] | RTF
  no                   |    – /     – |   – | 65568 / 16932 / 26 | 2.4 / 2.5 | 16.3 | 110.0
  no                   |    – /     – |   – | 50020 / 13034 / 20 | 2.5 / 2.5 | 16.6 |  91.4
  unigram (PP = 972.6) |   17 / 63155 |   1 | 16960 /  4641 / 32 | 2.5 / 2.5 | 16.4 |  68.1
  unigram (PP = 972.6) |   17 / 63155 |   1 |  9443 /  2599 / 22 | 2.6 / 2.5 | 16.8 |  54.4
  bigram (PP = 198.4)  |   17 / 29270 | 300 |  3312 /   935 / 13 | 2.5 / 2.6 | 16.5 |  32.9
  bigram (PP = 198.4)  |    3 / 12002 | 300 |  3277 /   924 / 13 | 2.4 / 2.6 | 16.5 |  31.4
  bigram (PP = 198.4)  |    2 /  4097 | 300 |  3611 /  1012 / 12 | 2.4 / 2.6 | 16.9 |  32.2
  bigram (PP = 198.4)  |    1 /   544 | 300 |  5786 /  1643 / 11 | 2.6 / 2.8 | 17.0 |  36.2


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
2.1 Review: Linear Search
2.2 Tree Search
2.3 Dynamic Programming
2.4 Pruning Methods
2.5 Language Model Look-Ahead
2.6 Phoneme Look-Ahead
2.7 Subtree Dominance Recombination (SDR)
2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Phoneme Look-Ahead
Phoneme look-ahead:

I look ahead into the 'future';

I by one phoneme (CD, CI);

I for the acoustic data.

[Figure: phoneme look-ahead: from the final state of the predecessor arc α̃ at time t, the successor arcs (α, β, γ, δ, ε) are evaluated against the acoustic vectors in the interval (t, t + ∆t].]


Phoneme Look-Ahead

Definitions and notation:

α: a phoneme arc in the lexical prefix tree that shall be started.

α̃: predecessor arc of α. One of its final states has been reached during search.

q(α; t, ∆t): probability that α produces the vectors x_{t+1}, ..., x_{t+∆t}.

The look-ahead time ∆t is constant and approximately equals the average phoneme length:

q(α; t, ∆t) := max_{[s_{t+1}^{t+∆t}]} p(x_{t+1}^{t+∆t}, s_{t+1}^{t+∆t} | α).


Phoneme Look-Ahead

Pruning strategy:

I determine the best state hypothesis Q_0(t):

Q_v(t, α) := q(α; t, ∆t) · Q_v(t, S_α̃)

Q_0(t) := max_{(v,α)} Q_v(t, α)

I prune the arc hypothesis (α, v) at time t, if:

Q_v(t, α) < f_PA · Q_0(t)

I approximate Q_0(t):

Q_0(t) ≈ max_α q(α; t, ∆t) · max_{(v,β)} Q_v(t, S_β)


Phoneme Look-Ahead

Calculating the look-ahead probability q(α; t, ∆t):

I Time alignment for every hypothesized phoneme α:

Φ_α(τ, s; t) := probability that the phoneme α with the states 1, ..., s generates the vectors x_{t+1}, ..., x_{t+τ}.


Phoneme Look-Ahead
I Hidden Markov model with S = 6 states:

[Figure: trellis (phoneme state index over time index): the 6-state HMM of phoneme α is aligned to the frames t + 1, ..., t + ∆t following the final state of the predecessor arc α̃.]

I Phoneme look-ahead probability:

q(α; t, ∆t) := max { max_s Φ_α(∆t, s; t), max_τ Φ_α(τ, S; t)^{∆t/τ} }


Phoneme Look-Ahead

Efficient calculation of the phoneme look-ahead probability:

⇒ approximations:

I CI phoneme models instead of CD models (44 CI vs. 4688 CD models);

I smaller number of densities;

I calculation of the look-ahead every second time frame;

I collapsed one-state HMM (a minimal sketch follows below).
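For the collapsed one-state HMM the time alignment degenerates to a running sum of framewise emission scores. A sketch under the assumption that the loop transition carries no extra cost (emissionCost is an assumed helper, not part of the lecture's code):

    // Returns the value corresponding to -log q(alpha; t, dt) for a one-state
    // HMM: the sum of the framewise emission costs of the next dt vectors.
    double phonemeLookAhead(int phoneme, int t, int dt,
                            double (*emissionCost)(int phoneme, int tau) /* -log p(x_tau|.) */) {
      double q = 0.0;
      for (int tau = t + 1; tau <= t + dt; ++tau)
        q += emissionCost(phoneme, tau);
      return q;
    }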


Phoneme Look-Ahead

[Figure: trellis of the collapsed Hidden Markov Model with one state: the single looping state of phoneme α spans the frames t + 1, ..., t + ∆t after the predecessor arc α̃.]


Phoneme Look-Ahead

Test conditions: NAB'94-H1-Development 20k

I Acoustic search ('detailed search'):
  I 4 688 CD phoneme models,
  I 290 000 Laplacian densities per gender.

I Phoneme look-ahead:
  I 43 CI phoneme models + 1 silence model,
  I ∆t = 6 time frames,
  I 6 state HMM (497 / 1 926 Laplacian mixtures),
  I 1 state HMM (175 / 675 Laplacian mixtures).

I Bigram LM look-ahead.

I RTF (real time factor) measured on an SGI R5000.


Phoneme Look-Ahead

Experimental results:

  look-ahead HMM | # dns | states | arcs | trees | DEL / INS | WER [%] | RTF
  no             |     – |   3312 |  935 |    13 | 2.5 / 2.6 |    16.5 | 32.9
  1-state        |   175 |   1901 |  452 |    10 | 2.5 / 2.7 |    16.6 | 17.9
  1-state        |   175 |   1551 |  359 |     9 | 2.5 / 2.7 |    16.7 | 15.4
  1-state        |   675 |   1808 |  434 |    11 | 2.5 / 2.6 |    16.6 | 16.4
  1-state        |   675 |   1508 |  354 |    10 | 2.5 / 2.6 |    16.8 | 14.2
  6-state        |   497 |   2472 |  605 |    12 | 2.5 / 2.6 |    16.5 | 19.4
  6-state        |   497 |   1671 |  379 |    10 | 2.4 / 2.7 |    16.8 | 16.4
  6-state        |  1926 |   2160 |  516 |    11 | 2.5 / 2.6 |    16.5 | 18.0
  6-state        |  1926 |   1784 |  413 |    11 | 2.5 / 2.7 |    16.6 | 16.6


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
2.1 Review: Linear Search
2.2 Tree Search
2.3 Dynamic Programming
2.4 Pruning Methods
2.5 Language Model Look-Ahead
2.6 Phoneme Look-Ahead
2.7 Subtree Dominance Recombination (SDR)
2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Subtree Dominance Recombination

Consider a hypothesis (s, t; v) (bigram LM):

[Figure: hypothesis (s, t; v): state s is reached at time t in the tree copy of predecessor v as well as in rivaling copies v′; from s, any word end S_w is reached at a later time τ along an identical path.]


Subtree Dominance Recombination
Consider the acoustic score for the best path from a state s at time t to a final state S_w at time τ:

h((t, s) → (τ, S_w)) := max_{s_t^τ} { Pr(x_t^τ, s_t^τ) : s_t = s, s_τ = S_w }

Note: h((t, s) → (τ, S_w)) does not depend on the preceding word v.

Because of the recombination on word level we can eliminate or recombine the hypotheses (s, t; v) if:

∀ w ∈ W(s) := set of leaves, i.e. words, that can be reached from s:

p(w | v) · Q_v(t, s) · h((t, s) → (τ, S_w)) < max_{v′ ≠ v} { p(w | v′) · Q_{v′}(t, s) · h((t, s) → (τ, S_w)) }

⇔

p(w | v) · Q_v(t, s) < max_{v′ ≠ v} { p(w | v′) · Q_{v′}(t, s) }


Subtree Dominance Recombination
The partial tree (s, t; v_0) with

v_0 = arg max_{v′} { p(w | v′) · Q_{v′}(t, s) }

dominates the rivaling hypotheses (s, t; v).

Due to the complex (w, v) dependency this equation is not very useful. Simplify both sides by approximation, using an upper bound for the left side and a lower bound for the right side:

max_{w ∈ W(s)} { p(w | v) } · Q_v(t, s) < max_{v′ ≠ v} { min_{w ∈ W(s)} p(w | v′) · Q_{v′}(t, s) }.

Especially at the beginning of a tree (s = 0):

max_w { p(w | v) } · Q_v(t, s = 0) < max_{v′ ≠ v} { min_w p(w | v′) · Q_{v′}(t, s = 0) }.


Subtree Dominance Recombination

Note:

a) easy implementation: only two arrays with index v are needed:

π_max(v, s) := max_{w ∈ W(s)} p(w | v) (cf. LM look-ahead)

π_min(v, s) := min_{w ∈ W(s)} p(w | v)

b) similarity to LM look-ahead

c) acoustic pruning is sufficient for an efficient focusing search; LM pruning is not needed.
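A sketch of the resulting dominance test at the tree roots (s = 0), using the two arrays from note a) in negative log arithmetic, where π_max turns into the smallest and π_min into the largest LM cost (all names are assumptions):

    #include <algorithm>
    #include <limits>
    #include <vector>

    // Q0[v] = -log Q_v(t, s=0); piMax[v] = min_w -log p(w|v); piMin[v] = max_w -log p(w|v).
    // Tree copy v is dominated if piMax[v] + Q0[v] > min over v' of { piMin[v'] + Q0[v'] }.
    std::vector<bool> dominatedTrees(const std::vector<double>& Q0,
                                     const std::vector<double>& piMax,
                                     const std::vector<double>& piMin) {
      double bound = std::numeric_limits<double>::infinity();
      for (std::size_t v = 0; v < Q0.size(); ++v)
        bound = std::min(bound, piMin[v] + Q0[v]);
      // Taking the minimum over all v (instead of v' != v) is safe here: if the
      // bound stems from v itself, then piMax[v] + Q0[v] <= piMin[v] + Q0[v],
      // so v is never pruned by its own bound.
      std::vector<bool> dominated(Q0.size());
      for (std::size_t v = 0; v < Q0.size(); ++v)
        dominated[v] = (piMax[v] + Q0[v] > bound);
      return dominated;
    }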


Subtree Dominance Recombination

Experimental results:

I NAB’94 H1 development corpus.

I Vocabulary: 20000 words.

I Bigram language model.

I Language model pruning threshold f_LM = ∞.

I RTF (real time factor) measured on an SGI R5000.

  bigram look-ahead | MaxHyp  | states | arcs | trees | DEL / INS | WER [%] | RTF
  w/o SDR           | 150 000 |   3312 |  935 |    13 | 2.5 / 2.6 |    16.5 | 32.9
  with SDR          |       ∞ |   3496 |  887 |    24 | 2.4 / 2.6 |    16.5 | 39.3
  with SDR          | 150 000 |   3093 |  859 |    12 | 2.4 / 2.6 |    16.5 | 29.0
  with SDR          |  10 000 |   2594 |  743 |    12 | 2.4 / 2.6 |    16.5 | 25.9
  with SDR          |   5 000 |   2091 |  606 |    11 | 2.4 / 2.6 |    16.6 | 25.6


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree
2.1 Review: Linear Search
2.2 Tree Search
2.3 Dynamic Programming
2.4 Pruning Methods
2.5 Language Model Look-Ahead
2.6 Phoneme Look-Ahead
2.7 Subtree Dominance Recombination (SDR)
2.8 Implementation Details

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Tree Search Implementation

Implementation in RWTH ASR teaching patch:

I Bigram LM.

I Bookkeeping: at the beginning of each tree.

I Dynamic generation of tree copies.

I "Direct memory access" for trees and phoneme arcs (compare to corresponding methods in linear search).

I Language model look-ahead.

I NO phoneme look-ahead.

I NO subtree dominance recombination.

Schluter/Ney: Advanced ASR 211 August 5, 2010

Page 212: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

RWTH ASR System: Tree Search Teaching Patch
Classes and dependencies similar to linear search:

RWTH ASR System: acoustic model, language model, corpus handling, pronunciation lexicon handling, general search environment.

RWTH ASR Tree Search Teaching Patch: interface to the open source RWTH ASR System; implementation of tree search, including bookkeeping and traceback.

[Diagram: classes and dependencies: Search::SearchAlgorithm, SearchInterface, WordConditionedTreeSearch, WordConditionedTreeSearch::SearchSpace, WordConditionedTreeSearch::ActiveTrees, Lexicon, Bookkeeping.]


Interface to RWTH ASR System
General Interface to Teaching Patch
Class SearchInterface provides the connection to the general search environment, as in the linear search implementation (cf. Slides 130ff). It includes handling of configuration and resources (corpus and models), as well as an overall wrapper for the specific search implementation.

As for linear search, the main functions to be implemented are:

I initialize: Search initialization

I processFrame: expansion of hypotheses to next time frame

I getResult: traceback of best recognized word sequence

Implementation:

I show SearchInterface.hh

I show SearchInterface.cc

I show WordConditionedTreeSearch.hh
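Independently of those files, a minimal sketch of such an interface in C++ may clarify the division of labor (the three member functions follow the list above; the signatures and types are assumptions, not the actual RWTH ASR API):

    #include <string>
    #include <vector>

    class SearchInterface {
    public:
      virtual ~SearchInterface() {}
      // allocate the search space, reset scores and bookkeeping
      virtual void initialize() = 0;
      // expand all active hypotheses by one acoustic feature vector
      virtual void processFrame(const std::vector<float>& feature) = 0;
      // traceback of the best recognized word sequence after the last frame
      virtual std::string getResult() = 0;
    };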


Example: Implementation of Dynamic Programming Beam Search for Bigram LM

Bigram language model recombination for a three word vocabulary (the variant including silence follows below):

[Figure: bigram language model recombination for the words A, B, C: the word ends of each tree copy (acoustic model) are recombined by the language model and restart the three tree copies (acoustic model).]


Example: Implementation of Dynamic Programming Beam Search for Bigram LM

Bigram language model recombination for a three word vocabulary including silence:

[Figure: the same recombination with an additional silence arc (Sil) per tree copy; after silence, only the root of the same tree copy is re-entered.]


Bigram LM and Word Conditioned Tree Copies

proceed over time t from left to right

ACOUSTIC LEVEL: process state hypotheses
  - initialization: Q_v(t − 1, 0) = H(v; t − 1)
  - time alignment: Q_v(t, s) using DP
  - propagate back pointers in time alignment
  - prune unlikely hypotheses
  - purge bookkeeping lists

WORD PAIR LEVEL: process word end hypotheses
  single best: for each pair (w; t) do
    H(w; t) = max_v [ p(w | v) Q_v(t, S(w)) ]
    v_0(w; t) = arg max_v [ p(w | v) Q_v(t, S(w)) ]
    - store best predecessor v_0 := v_0(w; t)
    - store best boundary τ_0 := τ(t; v_0, w)

traceback
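A compact sketch of the word pair level in C++, in negative log arithmetic so the max becomes a min (all names are hypothetical):

    #include <limits>
    #include <vector>

    struct WordEnd { int v, w; double score; };  // score = -log Q_v(t, S(w))

    // H(w;t) = min_v { -log p(w|v) + Q_v(t,S(w)) }, with the arg min kept for traceback.
    void recombineWordEnds(const std::vector<WordEnd>& ends, int nWords,
                           double (*lmCost)(int v, int w) /* -log p(w|v) */,
                           std::vector<double>& H, std::vector<int>& bestPred) {
      H.assign(nWords, std::numeric_limits<double>::infinity());
      bestPred.assign(nWords, -1);
      for (const WordEnd& e : ends) {
        double s = lmCost(e.v, e.w) + e.score;
        if (s < H[e.w]) { H[e.w] = s; bestPred[e.w] = e.v; }
      }
      // H then restarts the tree copy conditioned on w; bestPred and the stored
      // word boundary feed the bookkeeping for the final traceback.
    }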


Word Conditioned Tree Search Implementation
Search Space Representation
Pruning necessitates dynamic handling of word and state hypotheses:

I List of active trees: predecessorWord, arcHypBegin, arcHypEnd

I List of active (phoneme/triphone) arcs for every tree: arc, stateHypBegin, stateHypEnd

I List of active states for every (phoneme/triphone) arc: state, score, backpointer

I To address active trees, a list over all trees pointing into the list of active trees is used, which also holds potential word end hypotheses (re)starting a tree: isActive, *wordHyp

I A list of all arcs of a tree is used to handle active successor arcs during the expansion of the states within a tree.

Implementation: show WordConditionedTreeSearch.cc
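As plain structs, these three lists might look as follows (the field names follow the slides; everything else is an assumption):

    // One active state of a phoneme arc.
    struct StateHypothesis { int state; double score; int backpointer; };

    // One active phoneme/triphone arc: a range into the state hypothesis list.
    struct ArcHypothesis  { int arc; int stateHypBegin, stateHypEnd; };

    // One active tree copy: a range into the arc hypothesis list.
    struct TreeHypothesis { int predecessorWord; int arcHypBegin, arcHypEnd; };

    // All three lists are stored contiguously and rebuilt every time frame,
    // so pruning reduces to copying the surviving entries into new lists.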


Tree Search Implementation
Search Space Representation: Tree Hypotheses

[Figure: treeMap_ (one entry per word: isActive, *wordHyp) points into treeHypotheses_ (predecessorWord, arcHypBegin, arcHypEnd); each tree hypothesis addresses a range of arcHypotheses_ (arc, stateHypBegin, stateHypEnd), which in turn address ranges of stateHypotheses_ (state, score, backpointer).]


Tree Search Implementation
Search Space Representation: Within-Tree Arc Hypotheses

[Figure: for one active tree copy (predecessorWord = 12), the helper SearchSpace::ActiveArcs with activeArcs_ and predecessorMap_ (arcHypIndex, predecessor, prePredecessor, predecessorScore, prePredecessorScore, predecessorBackpointer, prePredecessorBackpointer, isStarted) links the arcs of treeLexicon_ to the entries of arcHypotheses_, stateHypotheses_ and wordHypotheses_ (word, score, backpointer).]


Tree Search Implementation
Search Space Representation: Initial State Hypotheses per Tree Arc

[Figure: SearchSpace::initTimeAlignment seeds the HMM score arrays of arcs to be (re)started (e.g. arcs d, h, b) with the predecessor scores and backpointers stored in predecessorMap_; all other states are initialized with score ∞.]


Tree Search Implementation
Search Space Representation: State Expansion per Tree Arc

[Figure: SearchSpace::computeTimeAlignment expands the state hypotheses of every active arc into newStateHypotheses_, newArcHypotheses_ and newTreeHypotheses_; afterwards SearchSpace::pruneStatesAndFindWordEnds copies the surviving entries back into stateHypotheses_, arcHypotheses_ and treeHypotheses_.]


Tree Search Implementation
Search Space Representation: Bookkeeping (as in Linear Search)

Implementation: show BookKeeping.hh, show BookKeeping.cc

[Figure: bookKeeping_ table with one entry per word hypothesis (word, score, time, timestamp, backpointer), starting with a sentinel entry (silence_, backpointer 0); the backpointers of the entries in stateHypotheses_ point into this table, and entries with an outdated timestamp (lastTimestamp_ = 100) can be purged.]


Trigram LM and Word Conditioned Tree Copies
Trigram LM:

proceed over time t from left to right

ACOUSTIC LEVEL: process states of lexical trees
  – initialization: Q_uv(t − 1, 0) = H(u, v; t − 1)
                    B_uv(t − 1, 0) = t − 1
  – time alignment: Q_uv(t, s) using DP
  – propagate back pointers B_uv(t, s)
  – prune unlikely hypotheses
  – purge bookkeeping lists

WORD PAIR LEVEL: process word ends
  for each triple (v, w; t) do
    H(v, w; t) = max_u p(w | u, v) Q_uv(t, S_w)
    u_0(v, w; t) = arg max_u p(w | u, v) Q_uv(t, S_w)
    – store best predecessor u_0 := u_0(v, w; t)
    – store best boundary τ_0 := B_{u_0 v}(t, S_w)


Generalization to Trigram and n-gram

Implementation of tree search for trigram or higher order language models:

I General structure of search remains the same – see the implementation for bigram (especially within each tree).
I Most important difference: hashing instead of full maps for context handling:

  LM      | bigram  | trigram                | n-gram
  context | word    | word pair              | n − 1 words
  maps    | by word | hash index for context | hash index for context

I Generalization to n-gram is straightforward (see the sketch below):
  I Hashing can be applied to any LM context length, therefore:
  I Same implementation for arbitrary LM context length.
  I If larger context helps (lower perplexity), the search space will benefit.
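A sketch of this context hashing (the map type and the simple polynomial hash are assumptions; any reasonable hash over the word indices would do):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct ContextHash {
      std::size_t operator()(const std::vector<int>& context) const {
        std::size_t h = 0;
        for (int w : context) h = h * 1000003u + (std::uint32_t)w;  // polynomial hash
        return h;
      }
    };

    // (n-1 predecessor words) -> index of the corresponding active tree copy.
    // For a bigram the key has length 1, for a trigram length 2, and so on;
    // the search code itself is independent of the context length.
    using TreeMap = std::unordered_map<std::vector<int>, int, ContextHash>;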


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Motivation for Across-Word Models

I The words of the vocabulary are modeled with phonemes (example: Airline – ae r l ah ih n).

I The acoustic realization depends on the context of a phoneme (coarticulation).

I This variability is modeled with triphones, i.e. phonemes in a triphone context.

I A model for phoneme a with predecessor x and successor y (x a y) is different from the model of a with predecessor w and successor z (w a z).


Motivation for Across-Word Models

I The models for words consist of concatenated phoneme models.

I When using triphone context, in principle two ways of modelling are possible:

  I Triphone context is only considered within words. At word boundaries the so called "word boundary context" is used (conventional within word modelling).
    + Advantage: easy implementation of the modelling; word models can be taken directly from the lexicon.
    – Disadvantage: coarticulation at word boundaries is not modelled well.

  I Triphone context is also considered across word boundaries (across-word modelling).
    + Advantage: coarticulation at word boundaries is taken into consideration.
    – Disadvantage: higher search effort is required. The word models depend on the predecessor and successor words.


Motivation for Across-Word Models
I Example for the word: Airline – ae r l ah ih n

– Transcription with context dependencies only within the word:

  (# ae r) (ae r l) (r l ah) (l ah ih) (ah ih n) (ih n #)

– Transcription with context dependent models both within the word and at the word boundaries: the word-internal triphones remain, while the boundary phonemes are expanded over all possible contexts, e.g.

  (a ae r), (b ae r), ($ ae r), ... (ae r l) (r l ah) (l ah ih) (ah ih n) ... (ih n a), (ih n b), (ih n $)

I Note: across-word modeling includes the potential case $ for no coarticulation at word boundaries (e.g. due to pause).


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Search Implementation with Across-Word Models

I Starting point: word conditioned tree search, context dependent models only within words.

I In principle, there are two possibilities for using across-word models:

  1. Multi pass search:
    I generate a word graph from an initial search without across-word models;
    I extract the N best word sequences from the graph;
    I rescore the N best hypothesized sentences using across-word models.

  2. Single pass search:
    I extend word conditioned tree search to allow processing of context dependencies beyond word boundaries
      – extend the tree lexicon;
      – extend the language model recombination.

I In the following the single pass search strategy will be described.


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Extension of the Tree Lexicon
I The conventional tree lexicon only uses within-word models.
I At word boundaries the "word boundary context" is used:

[Figure: lexical prefix tree with within-word triphones for AIRLIFT, AIRLINE and ABOUT; word-initial and word-final triphones use the word boundary context #, e.g. AIRLINE: (# ae r), (ae r l), (r l ah), (l ah ih), (ah ih n), (ih n #).]

I Note: the consistent word boundary context # enables the organization of the pronunciations in a lexical prefix tree, with a single path through the tree for each pronunciation.


Extension of the Tree Lexicon

I To model coarticulation across word boundaries, the word beginnings and ends have to be expanded (fan-in and fan-out):

[Figure: the same tree with fan-in and fan-out: word-initial triphones appear once per left context, e.g. (a ae r), ..., (z ae r), and word-final triphones once per right context, e.g. (ih n a), ..., (ih n z) for AIRLINE.]


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Extension of the Language Model Recombination

Review: language model recombination for within-word modelling with a bigram language model.

Notation:
A, E, I: ending/beginning words
y: last generation phoneme
x: phoneme of second to last generation
a, b, c: first generation phonemes
e, f, g: second generation phonemes
#: general word boundary context
$: silence context, for word transitions with silence between the words

[Figure: within-word recombination with a bigram LM: the ending words A, E, I (last triphone (x y #)) are recombined by the language model, and every successor tree starts with the same first generation arcs (# a e), (# a f), (# b e), (# b f), (# b g), (# c e), (# c f).]


Extension of the Language Model Recombination
When using across-word models:

I Word ends generate more than one arc in the lexical prefix tree.

I For every possible phonetic right across-word context there is an individual arc.

I The hypothesis of a word end is combined with the hypothesis of a successor word.

I The first phoneme of the successor is determined at the ending of a word.

I The hypothesized context influences the future search space.

I Language model recombination can only combine word end hypotheses with the same right across-word context.

The following picture only shows the ending and beginning of word E with the last phoneme y.


[Figure: word E ends in the fan-out arcs (x y a), (x y b), (x y c), one per right context; language model recombination is carried out separately per context and activates the matching first generation arcs (y a e), (y a f), (y b e), (y b f), (y b g), (y c e), (y c f) of the successor trees.]


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Activating the First Phoneme Generation

At the root of a "tree" the set of arcs that will be activated depends on:

I The last phoneme of the predecessor word (left context of the new arc).

I The right context with which the predecessor word has ended (central phoneme of the new arc).

I Special case silence ($): in case of long pauses between the ending and the starting word it is assumed that no coarticulation occurs. Activate all arcs of the first phoneme generation with silence as left context.


[Figure: activating the first phoneme generation after word E: the fan-out arcs (x y a), (x y b), (x y c) activate the matching coarticulated arcs (y a e), (y a f), (y b e), (y b f), (y b g), (y c e), (y c f); the silence fan-out (x y $) leads through Sil and activates all arcs ($ a e), ($ a f), ($ b e), ($ b f), ($ b g), ($ c e), ($ c f) with silence as left context.]


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Recombination after the First Phoneme Generation

I The across-word context of the ending word only influences the first phoneme of the successor.

I It is possible to recombine the arcs of the non-coarticulation transition with those of the coarticulation transition.

[Figure: recombination after the first phoneme generation: the arcs started from the silence transition, e.g. ($ a e), ($ c f), merge with the corresponding coarticulated arcs (y a e), (y c f) after the first generation.]


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Dynamic Programming Recursion

I Within-word modelling:

1. Within words:

Q_v(t, s) = max_σ { Q_v(t − 1, σ) · p(x_t, s | σ) }

2. At word boundaries:

Q_v(t, s = 0) = max_u { Q_u(t, S_v) · p(v | u) }


Dynamic Programming Recursion
I Across-word modelling:

1. Within words (as before):

Q_v(t, s) = max_σ { p(x_t, s | σ) · Q_v(t − 1, σ) }

2. At word boundaries:

The right across-word context χ of the ending word v has to be considered during language model recombination for this word:

Q_v(t, s_0(χ)) = max_u { p(v | u) · Q_u(t, S_(v,χ)) }

with
χ: right context of the predecessor word v
S_(v,χ): last state in the fan-out arc of word v with right context χ
s_0(χ): virtual start state of the subtree following word end hypothesis v with context χ. There is only one virtual start state for within-word models; for across-word modelling a different virtual start state s_0(χ) for every right context χ of the predecessor word v has to be considered.
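A sketch of this boundary recombination in negative log arithmetic (the accessors fanOutCost and lmCost are assumptions):

    #include <algorithm>
    #include <limits>

    // Q_v(t, s0(chi)) = min_u { -log p(v|u) + Q_u(t, S_(v,chi)) }, evaluated once
    // per right context chi of the ending word v.
    double restartScore(int v, int chi, int nPredecessors,
                        double (*fanOutCost)(int u, int v, int chi) /* -log Q_u(t,S_(v,chi)) */,
                        double (*lmCost)(int u, int v) /* -log p(v|u) */) {
      double best = std::numeric_limits<double>::infinity();
      for (int u = 0; u < nPredecessors; ++u)
        best = std::min(best, lmCost(u, v) + fanOutCost(u, v, chi));
      return best;  // seeds the virtual start state s0(chi) of the tree copy of v
    }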


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Extended Language Model Look-Ahead
I The hypothesis of a word end with a certain right across-word context restricts the set of possible successor words to the set of words which begin with the phoneme that was hypothesized as right context.

This restricts the set of possible language model probabilities and allows additional pruning.

Define:

W(χ) := set of words with χ as first phoneme.

For every ending word v and every possible right context χ the maximal language model probability at the end of a (yet unknown) word w following v can be calculated:

π(v, χ) := max_{w ∈ W(χ)} p(w | v)


Extended Language Model Look-Ahead

π(v, χ) := max_{w ∈ W(χ)} p(w | v)

I This maximal probability can be considered as soon as the right context χ at the end of the predecessor word v is hypothesized.

I Let Σ(v, χ) be the set of states in a fan-out arc of an ending word hypothesis v with right context χ.

I The score Q^FO_u(t, s) for a state s ∈ Σ(v, χ) within a fan-out arc of word v, with right context χ, in the tree of the predecessor word u then is:

Q^FO_u(t, s) := π(v, χ) · Q_u(t, s)


Extended Language Model Look-Ahead

Compute the overall best fan-out score:

Q^FO(t) := max_{u,v,χ} { Q^FO_u(t, s) : s ∈ Σ(v, χ) }

A state in the arc of an ending word is pruned if:

Q^FO_u(t, s) < f_AC,FO · Q^FO(t)

with the fan-out pruning threshold 0 < f_AC,FO < 1.


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models
3.1 Motivation for Across-Word Models
3.2 Implementation of Across-Word Models
3.3 Extension of the Tree Lexicon
3.4 Extension of the Language Model Recombination
3.5 Activating the First Phoneme Generation
3.6 Recombination after the First Phoneme Generation
3.7 Dynamic Programming Recursion
3.8 Extended Language Model Look-Ahead
3.9 Experimental Results

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Experimental Results

I Verbmobil eval’96, 5k vocabulary

I 35 speakers, 305 utterances, 5421 words

I within-word models (gender independent): 2001 mixtures, 195451 densities

I across-word models (gender independent): 3001 mixtures, 209484 densities

I trigram perplexity 36.4

I Without extended language model look-ahead

I RTF (real time factor) on a 500 MHz AlphaPC

  models      | states | arcs  | trees | del. | ins. | WER [%] | RTF
  within-word |  31040 |  8677 |   135 |  3.3 |  3.2 |    17.4 | 46.48
  across-word |  85729 | 21932 |    49 |  3.7 |  2.3 |    15.8 | 64.04


Experimental Results
I NAB'94 H1 development set, 20k vocabulary
I 20 speakers (10 men and 10 women), 310 utterances with 7387 words (199 out of vocabulary words)
I within-word models:
  I female: 2001 mixtures, 279079 densities
  I male: 2001 mixtures, 269141 densities
I across-word models:
  I female: 4001 mixtures, 281022 densities
  I male: 4001 mixtures, 354607 densities
I trigram perplexity 121.8
I With extended language model look-ahead
I RTF (real time factor) on a 533 MHz AlphaPC

  models      | states | arcs | trees | del. | ins. | WER [%] | RTF
  within-word |  19599 | 5821 |   136 |  1.5 |  2.6 |    12.9 | 37.9
  across-word |  17811 | 2822 |    37 |  1.3 |  2.4 |    11.7 | 36.8


Outline
0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications
4.1 Why Word Graphs?
4.2 Word Graph Generation using Word Pair Approximation
4.3 Word Graph Pruning
4.4 Multi-Pass Search
4.5 Language Model Rescoring
4.6 Path probabilities over word graphs
4.7 Confidence Measures
4.8 Bayes Risk Decoding
4.9 System Combination
4.10 Experiments
4.11 References

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Why Word Graphs?
Goal: generate a search space representation covering all hypotheses from beam search which were not pruned.

I M-best list of sentences: extract a list of the M recognized sentences with the best recognition score/probability.
Problem: inefficient due to redundancy and combinatorial complexity.

I Word graph: compact search space representation, containing those word hypotheses during beam search which were not pruned.
The edges of word graphs contain varying degrees of information about the search space:
  I word identity
  I acoustic score
  I word start and end times
  I context
  I language model score
  I etc., e.g. derived information like confidence scores.


Why Word Graphs?

Word graphs are useful for a number of methods:

I multi-pass search: restrict the search space for
  I language model rescoring,
  I acoustic model rescoring,
  I unsupervised adaptation (see Chapter 7),
  I confidence scoring;

I system combination;

I word error minimization;

I discriminative training (see Chapter 8);

I interface to subsequent applications like machine translation or natural language understanding.


Bayes' Decision Rule

[Figure: recognizer architecture: speech input → acoustic analysis → acoustic vectors x_1 ... x_T → global search, which maximizes Pr(w_1 ... w_N) · Pr(x_1 ... x_T | w_1 ... w_N) over the word sequences w_1 ... w_N, drawing on the phoneme inventory and pronunciation lexicon (acoustic model Pr(x_1 ... x_T | w_1 ... w_N)) and the language model Pr(w_1 ... w_N) → recognized word sequence.]


Search: Word Graph vs. Single Best Sentence

Original interest in word graphs: language model rescoring in a two-pass approach.

How to apply a refined language model?

- One-pass approach: find the single best sentence by performing the full recognition process directly with the refined language model.
- Two-pass approach:
  - first pass: use a baseline language model (e.g. trigram language model); output (interface): word graph or M-best list.
  - second pass: rescore the output with the refined/final language model, e.g. a context-free grammar.
- Extension: multi-pass approach.


Search: Two-Pass Approach

[Figure: two-pass search. The acoustic vectors $x_1^T$ enter the first pass (baseline recognition with the baseline language model $p(w_1^N)$), which outputs an M-best list of sentence hypotheses $w_1^N$ with acoustic scores $p(x_1^T \mid w_1^N)$. The second pass rescores these hypotheses with a refined language model $p(w_1^N)$, yielding the single best sentence. Knowledge sources: phoneme inventory and pronunciation lexicon.]


Search: M-best List of Sentences vs. Word Graph

M-best list of sentences:

- list $L(x_1^T)$ of the $M$ best sentences ($M = 100, \dots, 1000$),
- for each sentence $w_1^N$, report the acoustic probability $\Pr(x_1^T \mid w_1^N)$.

The M-best list $L(x_1^T)$ can be considered as a filter:

- first pass: generate the list $L(x_1^T) \subset \{w_1^N\}$,
- second pass: find the maximum in this subset of word sequences with refined models,

$$\hat{w}_1^N = \operatorname*{arg\,max}_{w_1^N \in L(x_1^T)} \Pr(w_1^N) \, \Pr(x_1^T \mid w_1^N),$$

with more complex language models (language model rescoring) or refined acoustic models like across-word models (acoustic model rescoring).


Search: M-best List of Sentences vs. Word Graph

More compact alternative to the M-best list: the word graph. Attach to each word graph arc:

- word identity $w$,
- beginning time $t_b$ and end time $t_e$,
- acoustic probability $p(x_{t_b}^{t_e} \mid w)$ of the acoustic vectors $x_{t_b}, \dots, x_{t_e}$.

[Figure: a simple word graph.]


Why Word Graphs?

Summary of language model rescoring:

- Separation of two levels:
  - acoustic recognition,
  - postprocessing: refined language model.
- Restricted search space representations:
  - list of the M best sentences,
  - word graph: network.
- Example: 2 word hypotheses per position in a 10-word sentence:
  - M-best list: 2^10 = 1024 sentence hypotheses,
  - word graph: 2 * 10 = 20 arcs.
- Problem for word graphs: word boundaries may vary.

[Figure: the boundary of a word W shifts in time depending on whether its predecessor is U or V.]


Examples of Word Graphs

[Figure: word graph over time (frames of 10 ms) for the spoken sentence "What is wrong with this picture?", with competing word hypotheses such as WHEN/WHAT, A/THE/IT, IS, WRONG, WITH/AS/HIS, A/THIS, SPLIT/PICTURE/SHARE, framed by silence.]

[Figure: word graph over time for the spoken sentence "Can this network be saved?", with competing word hypotheses such as IN/CAN/KEN/AND, TENNIS/KANSAS/CHINA'S/CANADA'S/CANVAS/JANICE, THIS/ITS/AS/HIS/MISS, CHEMIST/NETWORK, BE/B., SAVE/SAVED, TO/TWO, framed by silence.]



Word Boundary Function

Definitions:

$$h(w; \tau, t) := \Pr(x_{\tau+1}^{t} \mid w)$$

= probability that word $w$ produces the acoustic vectors $x_{\tau+1} \dots x_t$.

$$G(w_1^n; t) := \Pr(w_1^n) \cdot \Pr(x_1^t \mid w_1^n)$$

= (joint) probability of generating the acoustic vectors $x_1 \dots x_t$ and a word sequence $w_1^n$ with ending time $t$.

Decomposition:

$$\underbrace{x_1, \cdots, x_\tau}_{G(w_1^{n-1};\, \tau)} \quad \underbrace{x_{\tau+1}, \cdots, x_t}_{h(w_n;\, \tau, t)} \quad \underbrace{x_{t+1}, \cdots, x_T}_{\dots}$$

[Figure: time axis with the boundary τ separating the predecessor sequence $w_1 \dots w_{n-1}$ from the word $w_n$, which ends at time $t$; the remaining words follow until time $T$.]


Word Boundary Function

- Optimization over the word boundary $\tau$:

$$G(w_1^n; t) = \max_\tau \Pr(w_n \mid w_1^{n-1}) \cdot G(w_1^{n-1}; \tau) \cdot h(w_n; \tau, t) = \Pr(w_n \mid w_1^{n-1}) \cdot \max_\tau G(w_1^{n-1}; \tau) \cdot h(w_n; \tau, t)$$

- Define the word boundary function:

$$\tau(t; w_1^n) := \operatorname*{arg\,max}_\tau G(w_1^{n-1}; \tau) \cdot h(w_n; \tau, t)$$


Recombination using the Word Boundary Function

Consider an m-gram language model:

$$p(w_n \mid w_{n-m+1}^{n-1}) \quad \text{or} \quad p(u_m \mid u_1^{m-1}).$$

To apply dynamic programming it is sufficient to consider only the last $m-1$ words $u_2, \dots, u_m$.

Define a corresponding auxiliary quantity:

$$H(u_2^m; t) := \max_{w_1^{n-m+1}} \left[ \Pr(w_1^n) \cdot \Pr(x_1^t \mid w_1^n) : w_{n-m+2}^n = u_2^m \right]$$

= maximum probability for the generation of the acoustic vectors $x_1 \dots x_t$ by a word sequence that ends with the words $u_2^m$ at time $t$.


Recombination using the Word Boundary Function

The dynamic programming recursion to calculate the auxiliary quantity $H(u_2^m; t)$ efficiently optimizes over the word $u_1$ and the word boundary $\tau$:

$$H(u_2^m; t) = \max_{u_1} \left[ p(u_m \mid u_1^{m-1}) \cdot \max_\tau H(u_1^{m-1}; \tau) \cdot h(u_m; \tau, t) \right].$$

The maximum approximation is used here to avoid the sum over all word boundaries $\tau$.

The definition of the corresponding word boundary function becomes:

$$\tau(t; u_1^m) := \operatorname*{arg\,max}_\tau H(u_1^{m-1}; \tau) \cdot h(u_m; \tau, t)$$

leading to:

$$H(u_2^m; t) = \max_{u_1} \left[ p(u_m \mid u_1^{m-1}) \cdot H(u_1^{m-1}; \tau(t; u_1^m)) \cdot h(u_m; \tau(t; u_1^m), t) \right].$$


Recombination using the Word Boundary Function

Example: word boundary function with a bigram language model:

$$\tau(t; u_1, u_2) := \operatorname*{arg\,max}_\tau H(u_1; \tau) \cdot h(u_2; \tau, t)$$

[Figure: the boundary of a word W shifts in time depending on whether its predecessor is U or V.]

The word boundary depends on the identity of the surrounding words.


Word Pair Approximation

Remember: $\tau$ = start time of hypothesis $(t, w)$.

Assumption: $\tau$ depends only on the predecessor word $v$ (in addition to $(t, w)$):

$$\tau = \tau(t; v, w)$$

[Figure: time axis with the word sequence $\dots, u_{m-2}, u_{m-1} = v, u_m = w$ ending at time $t$; the boundary of interest lies between $v$ and $w$.]

Experiments show: the approximation is correct if the predecessor word $v$ is sufficiently long.


Word Pair Approximation: Illustration

Two examples for the word pair approximation:

- Good example: the predecessor word $u_{m-1} := v$ of word $u_m := w$ is sufficiently long; the boundary between $v$ and $w$ stays the same no matter whether $u_{m-2} = a$ or $u_{m-2} = b$ precedes.
- Bad example: the predecessor word $u_{m-1} := v'$ of word $u_m := w$ is too short; the boundary between $v'$ and $w$ shifts with the choice of $u_{m-2}$.

[Figure: time-alignment paths over the time axis illustrating both cases.]


Algorithm for Word Graph Construction

Algorithm in words:

- For each time frame t, consider all word pairs (v, w) using beam search.
- For each triple (t; v, w), keep track of:
  - the word boundary τ(t; v, w),
  - the word score h(w; τ(t; v, w), t).
- End of sentence: traceback.


Algorithm for Word Graph Construction

proceed over time t from left to right

  ACOUSTIC LEVEL: process states of the lexical trees
    - initialization: Q_v(t-1, 0) = H(v; t-1)
                      B_v(t-1, 0) = t-1
    - time alignment: Q_v(t, s) using DP
    - propagate back pointers B_v(t, s)
    - prune unlikely hypotheses
    - purge bookkeeping lists

  WORD PAIR LEVEL: process word ends
    single best: for each pair (w; t) do
      H(w; t)   = max_v p(w|v) Q_v(t, S_w)
      v_0(w; t) = arg max_v p(w|v) Q_v(t, S_w)
      - store best predecessor v_0 := v_0(w; t)
      - store best boundary τ_0 := B_{v_0}(t, S_w)
    word graph: for each triple (t; v, w) store
      - word boundary τ(t; v, w) := B_v(t, S_w)
      - word score h(w; τ, t) := Q_v(t, S_w) / H(v; τ)

traceback after the last time frame T is processed:
  - single best: start at the best sentence end w = arg max_v p($|v) H(v, T)
  - word graph: start at all possible sentence ends.


Algorithm for Word Graph Construction

Depending on how they are constructed, different types of word graphs can be defined:

- Word pair approximation ⇒ word/predecessor conditioned word graphs:
  - A word sequence implies a unique sequence of word boundaries.
  - Each word sequence occurs only once.
  - Generalization: word triple, quadruple, etc. approximation (cf. language model context).
- Other methods ⇒ time conditioned word graphs:
  - A word sequence might occur more than once, with different word boundary times.

Depending on what is to be done with the word graph, one or the other method might be more suitable.

Is it possible to guarantee that the M best sentences are contained in the word graph?

- It is highly probable, but
- there is no formal proof.


Word Graph Pruning

Pruning methods can also be applied to word graphs.

- Forward pruning:
  - Word end pruning retains word hypotheses (v, w; t) whose scores are relatively close to the score of the best word hypothesis at time t:

    $$H_{\max}(t) = \max_w H(w; t)$$

    with

    $$H(w; t) = \max_v p(w \mid v) \cdot H(v; \tau(t; v, w)) \cdot h(w; \tau(t; v, w), t).$$

  - Prune hypothesis (v, w; t) iff:

    $$p(w \mid v) \cdot H(v; \tau(t; v, w)) \cdot h(w; \tau(t; v, w), t) < f_{LAT} \cdot H_{\max}(t),$$

    with the word graph pruning threshold $f_{LAT} < 1$.
  - In addition, histogram pruning limits the number of surviving word hypotheses to a maximum number (typically 1000 per time frame).


Word Graph Pruning

- During acoustic search, forward pruning is the standard approach.
- Like acoustic and language model pruning, forward pruning does not guarantee to find the best scoring word sequence.
- Often the need arises to prune word graphs further after generating initial, usually larger word graphs.
- Observation: when further pruning word graphs after their generation, forward pruning becomes less effective, since in contrast to the initial generation of the word graph, the score of the overall best word sequence is already known.
- Optimal approach: prune the word graph relative to the best recognized/best scoring word sequence.
- Advantage: the best scoring word sequence will not be pruned itself, which is not guaranteed in forward-only pruning.


Word Graph Pruning

Forward-backward pruning:

- Idea: for every edge (w; τ, t) in the word graph, the score Q(w; τ, t) of the best path through the word graph including this arc is calculated.
- An arc (w; τ, t) is pruned iff:

  $$Q(w; \tau, t) > F_{LAT,FB} + \min_{(w'; \tau', t')} Q(w'; \tau', t'),$$

  where $F_{LAT,FB} > 0$ is the forward-backward pruning threshold in the log domain. (Scores here are negative log probabilities, so the overall best path has the minimum score.)
- The term "forward-backward" refers to the corresponding twofold dynamic programming algorithm that computes this pruning step efficiently.


Word Graph Pruning

- To calculate Q(w; τ, t) efficiently, it is decomposed into the sum of a forward and a backward score:

  $$Q(w; \tau, t) = Q_f(w; \tau, t) + Q_b(w; \tau, t),$$

  with

  $Q_f(w; \tau, t) := -\log p(w; \tau, t \mid x_1^t)$ = negative log probability to end in the edge (w; τ, t), given $x_1^t$,
  $Q_b(w; \tau, t) := -\log p(w; \tau, t \mid x_{t+1}^T)$ = negative log probability to start with the edge (w; τ, t), followed by $x_{t+1}^T$.

- The forward score $Q_f(w; \tau, t)$ and the backward score $Q_b(w; \tau, t)$ can then be computed efficiently by one dynamic programming pass each over the word graph:

  1. forward pass from t = 1 to t = T
  2. backward pass from t = T to t = 1
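To make the two passes concrete, here is a minimal sketch of forward-backward pruning in the negative-log (cost) domain. The graph representation (states numbered topologically from 0 to n_states-1, edges as (from, to, cost) tuples) and the function name fb_prune are illustrative assumptions, not the lecture's reference implementation.

```python
import math

def fb_prune(edges, n_states, f_lat_fb):
    """Keep an edge iff the best (minimum-cost) path through it lies within
    f_lat_fb of the globally best path; costs are negative log probabilities."""
    q_f = [math.inf] * n_states       # Q_f: best cost from the initial state
    q_b = [math.inf] * n_states       # Q_b: best cost to the final state
    q_f[0] = 0.0
    q_b[n_states - 1] = 0.0
    for u, v, c in sorted(edges, key=lambda e: e[1]):                # forward pass
        q_f[v] = min(q_f[v], q_f[u] + c)
    for u, v, c in sorted(edges, key=lambda e: e[0], reverse=True):  # backward pass
        q_b[u] = min(q_b[u], c + q_b[v])
    best = q_f[n_states - 1]          # cost of the overall best path
    return [(u, v, c) for u, v, c in edges if q_f[u] + c + q_b[v] <= best + f_lat_fb]

# Tiny usage example: a two-path graph; threshold 1.0 keeps only the cheap path.
edges = [(0, 1, 1.0), (0, 1, 3.0), (1, 2, 1.0)]
print(fb_prune(edges, 3, 1.0))        # -> [(0, 1, 1.0), (1, 2, 1.0)]
```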


Word Graph Pruning: Experimental Results

Experimental results on NAB'94 (ARPA benchmark 1994):

- NAB'94 (development, 20k):
  - W = 20 000 word vocabulary
  - 199 out-of-vocabulary words (2.7%)
  - 20 speakers (10 female, 10 male)
  - 310 sentences (155 per gender)
  - 7387 spoken words
- recognizer setup:
  - 2001 context dependent phoneme models (gender independent)
  - 216 000 densities
  - trigram language model: perplexity 121.8
  - single best word error rate: 14.5%

Additional evaluation measure for word graphs:

- GER [%]: graph error rate, i.e. the error rate of the hypothesis in the word graph with the smallest Levenshtein distance to the spoken word sequence.


Word Graph Pruning: Experimental Results

Forward pruning:

                         per spoken word
-log fLAT    arcs      nodes    word boundaries   GER [%]
500.00       105.07    45.70    10.54              4.44
300.00        99.74    45.12    10.20              4.44
250.00        89.11    43.69     9.50              4.45
200.00        71.87    40.85     8.39              4.51
180.00        59.31    35.55     7.42              4.62
160.00        46.52    29.34     6.35              4.81
140.00        33.74    22.40     5.20              5.08
120.00        22.86    15.96     4.20              5.40
100.00        14.07    10.31     3.29              5.96
 80.00         8.23     6.36     2.59              6.85
 60.00         4.78     3.89     2.10              7.89
 40.00         2.97     2.55     1.76              9.39
 20.00         2.05     1.85     1.54             10.79
 10.00         1.77     1.65     1.45             11.63
  5.00         1.64     1.56     1.40             11.90
  1.00         1.54     1.48     1.36             12.12


Word Graph Pruning: Experimental Results

Forward-backward pruning:

                         per spoken word
F_LAT,FB     arcs      nodes    word boundaries   GER [%]
750.00        97.07    43.11     9.96              4.51
500.00        82.47    39.92     9.96              4.53
250.00        27.55    17.21     6.00              4.78
200.00        16.95    11.26     4.58              4.95
180.00        13.45     9.17     4.03              4.98
160.00        10.51     7.36     3.54              5.28
140.00         8.03     5.78     3.08              5.58
120.00         6.06     4.49     2.66              5.92
100.00         4.58     3.51     2.32              6.44
 80.00         3.48     2.77     2.03              7.16
 60.00         2.68     2.23     1.79              8.11
 40.00         2.09     1.82     1.59              9.67
 20.00         1.65     1.51     1.42             11.87
 10.00         1.45     1.38     1.33             13.04
  5.00         1.35     1.31     1.29             13.75
  1.00         1.25     1.25     1.24             14.34


Word Graph Pruning: Experimental Results

[Figure: graph error rate [%] plotted against word graph density for forward pruning and forward-backward pruning. At equal word graph density, forward-backward pruning yields a lower graph error rate.]



Multi-Pass Search

Word graphs can be used as a search space representation for consecutively constraining the search space.

[Figure: word graph over time with word hypotheses (SIL, A, B, C) as arcs between time-stamped nodes.]

Properties of word graphs:

- incoming arcs: maximum = vocabulary size
- outgoing arcs: no maximum
- silence is included as an integrated part of the words (each word arc w is followed by an optional silence arc sil in parallel with an empty transition ε)

Search over the word graph:

- dynamic programming through the graph.


Multi-Pass Search

Using word graphs as (constrained) search space representations, a number of multi-pass search strategies are possible:

- Language model rescoring:
  - First pass: full search with the baseline language model to generate the word graph.
  - Second pass: search over the word graph using the advanced language model.
  - Word boundaries and acoustic scores are kept constant (from the first pass).
- Acoustic rescoring:
  - First pass: full search with the baseline acoustic model, e.g. using monophones or within-word triphones.
  - Second pass: search over the word graph using the advanced acoustic model, e.g. across-word triphones.
  - If word boundaries are kept constant, acoustic rescoring can be done per arc/word.
  - Word boundaries can also be recalculated: full search constrained to the word graph (discarding first-pass word boundaries).


Multi-Pass Search

- In multi-pass search strategies, language model and acoustic model rescoring are usually combined, forming a consecutive sequence of search passes that are increasingly constrained.
- Potential advantage: applying simpler models at earlier search stages with larger search spaces can reduce the search complexity, allowing more complex and computationally demanding models to be applied in a more constrained search space.
- Disadvantages:
  - Each intermediate step in a multi-pass search strategy represents decisions to constrain the further search space based on incomplete knowledge (i.e. disregarding more advanced models to be applied at a later stage). Hypotheses discarded at some level cannot, or can only partly, be recovered in later stages of the search process.
  - Less advanced models call for less severe pruning, i.e. the reduction of the search space is limited by the models used and needs to be optimized.



Language Model Rescoring

Word graph rescoring algorithm (2nd pass):

input: word graph,
       set of word boundaries τ(t; u_1^m),
       set of word scores h(u_m; τ, t)

proceed over time t from left to right
  process each word pair u_{m-1}^m in the graph
    get word boundaries τ(t; u_1^m) and scores h(u_m; τ, t)
    process all word sequences u_1^m:
      H(u_1^m; t) := p(u_m | u_1^{m-1}) · H(u_1^{m-1}; τ) · h(u_m; τ, t)
      H(u_2^m; t)  = max_{u_1} H(u_1^m; t)
      B(u_2^m; t)  = arg max_{u_1} H(u_1^m; t)
traceback: use the back pointers B(u_2^m; t)

Note: the language model used here is different from the one used to construct the word graph.
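A minimal sketch of this second pass with a bigram replacement LM, assuming word boundaries and acoustic scores are kept fixed from the first pass. The representation (topologically numbered nodes, edges as (from, to, word, acoustic_cost) tuples, costs in the negative-log domain) and the lm callable returning p(word | predecessor) are assumptions for illustration.

```python
import math

def rescore_bigram(edges, n_nodes, lm, lm_scale=1.0):
    """Second-pass search over a word graph: DP state = (graph node, last word)."""
    hyp = [dict() for _ in range(n_nodes)]   # hyp[node][last_word] = (cost, backptr)
    hyp[0]["<s>"] = (0.0, None)
    for u, v, w, am_cost in sorted(edges, key=lambda e: e[1]):
        for h, (cost, _) in hyp[u].items():
            new = cost + am_cost - lm_scale * math.log(lm(w, h))
            if w not in hyp[v] or new < hyp[v][w][0]:
                hyp[v][w] = (new, (u, h))    # recombination: keep best per new history
    h, (cost, bp) = min(hyp[n_nodes - 1].items(), key=lambda kv: kv[1][0])
    words = [h]
    while bp is not None:                    # traceback over back pointers
        u, hprev = bp
        if hprev != "<s>":
            words.append(hprev)
        bp = hyp[u][hprev][1]
    return list(reversed(words)), cost
```

With a trigram, the DP state would carry the last two words instead of one; the structure of the pass is unchanged.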


Word Graph Rescoring: Results

Test: NAB'94 H1 development

- vocabulary: 20 000 words
- 10 male and 10 female speakers
- total of 310 sentences = 7387 spoken words
- unknown words: 199 (out-of-vocabulary rate: 2.7%)
- test set perplexities:
  - bigram: PP_bi = 198.1
  - trigram: PP_tri = 130.2

Acoustic modelling:

- trained on: WSJ0 and WSJ1 (284 speakers; 80 h of speech)
- phoneme models: 4688 (43+1 monophones, 557 diphones, 4087 triphones)
- tied states: 4623 (6-state HMM)
- densities: 290 000 per gender


Word Graph Method: Experimental Results

Search space for word graph generation:

- search space: 27672 states, 7674 arcs, 115 trees
- rate of out-of-vocabulary words (OOV): 2.7%

Word graph and evaluation measures:

- WGD: word graph density (word arcs per spoken word)
- NGD: node graph density (nodes per spoken word)
- BGD: boundary graph density (different word boundaries per spoken word)
- GER [%]: word graph error rate ('sub, del, ins')
- WER [%]: word recognition error rate (trigram: PP_tri = 130.2)


Word Graph Method: Results

        graph density               graph                    recognition
fLat    WGD    NGD      BGD     del - ins   GER [%]     del - ins   WER [%]
300     1476   181.80   19.67     9 -  38     4.2       120 - 202    14.3
150     1415   175.93   19.25     9 -  38     4.2       120 - 202    14.3
100      684   104.22   14.50    13 -  38     4.3       120 - 202    14.3
 90      460    77.35   12.31    12 -  38     4.3       120 - 202    14.3
 80      269    51.75    9.84    15 -  44     4.5       120 - 202    14.3
 70      137    31.43    7.42    19 -  51     4.8       120 - 202    14.3
 60       60    17.20    5.33    23 -  56     5.1       122 - 201    14.3
 50       25     8.98    3.77    40 -  64     5.8       124 - 200    14.3
 40       10     4.79    2.66    50 -  81     6.8       122 - 202    14.3
 30        4     2.75    2.01    84 -  94     8.4       135 - 191    14.6
 20        2     1.83    1.60   113 - 129    10.6       147 - 184    14.8
 10        1     1.41    1.36   156 - 160    13.8       173 - 180    15.6
  5        1     1.29    1.28   168 - 177    15.3       181 - 190    16.3
  1        1     1.25    1.24   167 - 192    16.3       167 - 195    16.4


Recognition Results: Word Graph vs. One Pass

(A) two-pass search (word conditioned bigram search with word graph construction and subsequent word graph rescoring with a trigram)
(B) single-pass time conditioned trigram search (Chapter 5)

- NAB H1 development corpus; 20k vocabulary
- 310 test sentences, 7387 words
- bigram LM with a perplexity of 198
- trigram LM with a perplexity of 130

                           sentences    word errors: two-pass (A) / single-pass (B)
identical score            277          898 / 898
(A) is better than (B)       1           11 / 11
(B) is better than (A)      32          146 / 121
total                      310         1055 / 1030


Recognition Results: Word Graph vs. One Pass

(A) two-pass search (word conditioned bigram search with word graph construction and subsequent word graph rescoring with a trigram)
(B) single-pass time conditioned trigram search (Chapter 5)

- NAB H1 development corpus; 64k vocabulary
- 310 test sentences, 7387 words
- bigram LM with a perplexity of 237
- trigram LM with a perplexity of 172

                           sentences    word errors: two-pass (A) / single-pass (B)
identical score            220          452 / 452
(A) is better than (B)       1            2 / 2
(B) is better than (A)      89          439 / 423
total                      310          893 / 877


Recognition Results: Word Graph vs. Integrated Method

Examples of spoken test sentences (SNT), comparing

- the integrated search method (ISM) and
- the word graph method (WGM):

SNT: ... A FEW MUTUAL FUND INVESTORS INTO MODELS OF ...
WGM: ... A FEW MUTUAL FUND INVESTORS IN TWO MODELS OF ...
ISM: ... A FEW MUTUAL FUND INVESTORS INTO MODELS OF ...

SNT: ... WHEN YOU BUY A MUTUAL FUND FROM YOUR BROKER
WGM: ... WHEN YOU BUY A MUTUAL FUND FROM A BROKER
ISM: ... WHEN YOU BUY A MUTUAL FUND FROM YOUR BROKER

SNT: I ALMOST WISH I HADN'T SEEN THIS PART
WGM: I ALMOST WISH I HADN'T SEEN AS PART
ISM: I ALMOST WISH I HADN'T SEEN THIS PART


Bigram Cache Model using Word Graph

Cache (unigram/bigram) language model:

- A cache LM is difficult to handle in integrated search.
- The cache is updated after each sentence boundary (not after each word!).
- Two modes: supervised/unsupervised.

Language model                         PP      del - ins   WER [%]
Bigram:  without cache                 198.1   179 - 195   16.5
         with cache: unsupervised      178.0   190 - 179   16.3
                     supervised        171.2   194 - 179   16.0
Trigram: without cache                 130.2   120 - 202   14.3
         with cache: unsupervised      117.1   127 - 182   14.0
                     supervised        113.1   128 - 194   13.8


Search Effort: Word Graph vs. Integrated Method

Comparison of the computational search complexity; test condition:

- subset of the NAB'94 H1 development set:
  - 10 male and 10 female speakers,
  - 20 sentences = 519 spoken words (of which 21 words are out of vocabulary),
  - 211 sec of speech
- perplexities: PP_bi = 201.3, PP_tri = 134.1
- SGI workstation (Indy R4400, SpecInt'92: 94)


Search Effort: Word Graph vs. Integrated Method

Method:                          Word Graph                Integrated
Search space                     16274 states /            32291 states /
                                 4516 arcs / 41 trees      9413 arcs / 108 trees

Time                             [sec]     [%]             [sec]     [%]
Log-likelihood computation       14830     78.2            15174     67.2
Acoustic search                   3603     19.0             6576     29.1
Lang. model recombination          139      0.7              621      2.7
Word graph rescoring               210      1.1                -        -
Other operations                   177      1.0              208      1.0
Overall recognition              18959    100.0            22579    100.0
Real time factor                    90        -              107        -


Conclusions: Word Graph Generation and Rescoring

Word graph generation:

- Specification of the word graph problem.
- Word pair approximation.
- Extension of the word conditioned one-pass algorithm.
- Modification: bookkeeping only.

Word graph (language model) rescoring:

- Experimental results for the 20k WSJ/NAB task (within-word modeling): 4 to 25 word arcs per spoken word are sufficient.
- Comparison with integrated search: negligible degradation.
- Note: only 'forward' pruning using the bigram LM.



Path Probabilities over Word Graphs

Word graph L:

- promising part of the search space
- the probability along a path is $p(W, X)$, $W \in \Sigma^*$
- path posterior probability:

$$p(W \mid X) = \frac{p(W, X)}{p(X)} = \frac{p(W, X)}{\sum_{V \in \Sigma^*} p(V, X)}$$

[Figure: example word graph with time-stamped nodes (0, 3, 10, 11, 14, 24) and word arcs I/alike, like/alike, this/his, meal/veal, representing the path hypotheses "I like this meal", "I like this veal", "alike his veal", and "alike this veal".]


Path Probabilities over Word Graphs

Posterior probability of a path (edge sequence) $e_1^N$ through word graph L:

- word graph L is acyclic
- edge $e$ has a word label $i(e) \in \Sigma \cup \{\varepsilon\}$, with finite alphabet (vocabulary) $\Sigma$ and empty word $\varepsilon$
- word sequence for $e_1^N \in \pi(L)$:

$$W := i(e_1^N) := [i(e_n)]_{n=1}^N$$

- edge weight of $e$: $w(e) := p_{AM}(e) \cdot p_{LM}(e)^\beta$, with acoustic score $p_{AM}(e)$, LM score $p_{LM}(e)$, and LM scale $\beta$
- path probability for $e_1^N \in \pi(L)$:

$$p(e_1^N, X) := \prod_{n=1}^N w(e_n)^\gamma,$$

  where usually $\gamma = 1/\beta$
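A small sketch of this scoring in the log domain; the edge field names p_am and p_lm are illustrative assumptions:

```python
import math

def path_log_prob(path, beta, gamma=None):
    """log p(e_1^N, X) = gamma * sum_n [log p_AM(e_n) + beta * log p_LM(e_n)]."""
    gamma = 1.0 / beta if gamma is None else gamma   # usual choice: gamma = 1/beta
    return gamma * sum(math.log(e["p_am"]) + beta * math.log(e["p_lm"]) for e in path)
```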


Edge Probabilities over Word Graphs

Posterior probability of edge $e$ in word graph L:

- probability that a path goes through edge $e$:

$$p(e \mid X) := \sum_{e_1^N \in \pi(L):\, \exists n:\, e_n = e} p(e_1^N \mid X)$$

[Figure: the example word graph from above, with one edge highlighted; its posterior sums the posteriors of all paths running through it.]


Forward/Backward Algorithm

Goal: find an efficient way to calculate the edge probabilities.

Approach: decompose the edge probability into a forward and a backward part:

$$p(e, X) = \sum_{e_1^N \in \pi(L):\, \exists n:\, e_n = e} p(e_1^N, X) = \underbrace{\Bigg[ \sum_{e_1^k:\, \mathrm{to}(e_k) = \mathrm{from}(e)} \; \prod_{m=1}^{k} w(e_m) \Bigg]}_{=:\, \Phi(\mathrm{from}(e))} \cdot \; w(e) \cdot \underbrace{\Bigg[ \sum_{e_l^N:\, \mathrm{to}(e) = \mathrm{from}(e_l)} \; \prod_{n=l}^{N} w(e_n) \Bigg]}_{=:\, \Psi(\mathrm{to}(e))}$$

- $e_1^k$: partial path from the initial state to state from(e)
- $e_l^N$: partial path from state to(e) to a final state
- forward score of state $s$: $\Phi(s)$
- backward score of state $s$: $\Psi(s)$


Forward/Backward Algorithm

Recursive computation of the forward score $\Phi(s)$:

$$\Phi(s) = \sum_{e_1^k:\, \mathrm{to}(e_k) = s} \; \prod_{m=1}^{k} w(e_m) = \sum_{e \in \mathrm{In}(s)} w(e) \underbrace{\sum_{e'^l_1:\, \mathrm{to}(e'_l) = \mathrm{from}(e)} \; \prod_{n=1}^{l} w(e'_n)}_{=:\, \Phi(\mathrm{from}(e))}$$

- Notation:
  - In(s): set of edges ending in state $s$
  - from(e), to(e): predecessor and successor state of edge $e$
- efficient computation via dynamic programming
- let $s_F$ be the unique final state:

$$\Phi(s_F) = \sum_{e_1^N \in \pi(L)} p(e_1^N, X) = p(X)$$


Forward/Backward Algorithm

Recursive computation of the backward score $\Psi(s)$:

$$\Psi(s) = \sum_{e_l^N:\, \mathrm{from}(e_l) = s} \; \prod_{n=l}^{N} w(e_n) = \sum_{e \in \mathrm{Out}(s)} w(e) \underbrace{\sum_{e'^N_k:\, \mathrm{from}(e'_k) = \mathrm{to}(e)} \; \prod_{m=k}^{N} w(e'_m)}_{=:\, \Psi(\mathrm{to}(e))}$$

- Notation:
  - Out(s): set of edges leaving state $s$
  - from(e), to(e): predecessor and successor state of edge $e$
- efficient computation via dynamic programming
- let $s_I$ be the unique initial state:

$$\Psi(s_I) = \sum_{e_1^N \in \pi(L)} p(e_1^N, X) = p(X)$$


Forward/Backward Algorithm

Implementation in pseudo code:

Φ(s_I) := 1.0
for state s ≠ s_I in L in topological order:
    Φ(s) := 0.0
    for edge e in In(s):
        Φ(s) += Φ(from(e)) · w(e)
Ψ(s_F) := 1.0
for state s ≠ s_F in L in reversed topological order:
    Ψ(s) := 0.0
    for edge e in Out(s):
        Ψ(s) += w(e) · Ψ(to(e))
assert Φ(s_F) = Ψ(s_I)
p(X) := Φ(s_F)
for edge e in L:
    p(e|X) := [Φ(from(e)) · w(e) · Ψ(to(e))] / p(X)

Complexities:

- loop over the [reverse] topological order of the acyclic graph L: O(|E(L)|)
- posterior probability for each edge in graph L: O(|E(L)|)
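A runnable rendering of the pseudocode in the probability semiring. The representation (states numbered topologically from 0 to n_states-1, edges as dicts with a precomputed weight w) is an assumption for illustration.

```python
def edge_posteriors(edges, n_states):
    phi = [0.0] * n_states           # forward scores Phi(s)
    psi = [0.0] * n_states           # backward scores Psi(s)
    phi[0] = 1.0
    psi[n_states - 1] = 1.0
    for e in sorted(edges, key=lambda e: e["to"]):                  # forward pass
        phi[e["to"]] += phi[e["from"]] * e["w"]
    for e in sorted(edges, key=lambda e: e["from"], reverse=True):  # backward pass
        psi[e["from"]] += e["w"] * psi[e["to"]]
    p_x = phi[n_states - 1]
    assert abs(p_x - psi[0]) < 1e-12 * max(p_x, 1.0)                # Phi(s_F) = Psi(s_I)
    return [phi[e["from"]] * e["w"] * psi[e["to"]] / p_x for e in edges], p_x

# Tiny usage example: "a" followed by either "b" (weight 3) or "c" (weight 1).
edges = [{"from": 0, "to": 1, "word": "a", "w": 1.0},
         {"from": 1, "to": 2, "word": "b", "w": 3.0},
         {"from": 1, "to": 2, "word": "c", "w": 1.0}]
post, p_x = edge_posteriors(edges, 3)
print(post)                          # -> [1.0, 0.75, 0.25]
```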


Generalization: Forward/Backward over Semirings

Semiring (K, ⊕, ⊗, 0, 1):

- set K
- binary operation ⊕ called addition
- binary operation ⊗ called multiplication

1. (K, ⊕, 0) is a commutative monoid
2. (K, ⊗, 1) is a (commutative) monoid (commutativity is required for the backward scores)
3. multiplication distributes over addition:
   - x ⊗ (y ⊕ z) = (x ⊗ y) ⊕ (x ⊗ z)
   - (x ⊕ y) ⊗ z = (x ⊗ z) ⊕ (y ⊗ z)
4. 0 is an annihilator for ⊗:
   - 0 ⊗ x = x ⊗ 0 = 0

- the forward/backward algorithm can be applied to any commutative semiring
- the interpretation of the result depends on the semiring


Generalization: Forward/Backward over Semirings

Semiring       K           x ⊕ y                     x ⊗ y    0     1
probability    R+          x + y                     x · y    0     1
log            R ∪ {+∞}    -log(e^-x + e^-y)         x + y    +∞    0
tropical       R ∪ {+∞}    min(x, y)                 x + y    +∞    0

- probability semiring:
  - fb(e, X): edge probability p(e, X)
- log semiring:
  - fb(e, X): probability in negated log-space, -log p(e, X)
  - numerical stability
- tropical semiring:
  - fb(e, X): probability of the best path through e
  - Viterbi approximation
  - implements the single-source shortest paths/distances algorithm (Bellman-Ford)
  - forward-backward graph pruning possible/useful (generalization).
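As a sketch, the forward pass can be written once, generically over a semiring given as a (plus, times, zero, one) tuple; edge weights must already live in the semiring's domain (probabilities for the probability semiring, negative log probabilities for the log and tropical semirings). The graph representation matches the earlier probability-semiring sketch and is likewise an assumption.

```python
import math

def log_add(x, y):                       # -log(e^-x + e^-y), numerically stable
    if math.isinf(x): return y
    if math.isinf(y): return x
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))

PROB     = (lambda x, y: x + y,     lambda x, y: x * y, 0.0,      1.0)
LOG      = (log_add,                lambda x, y: x + y, math.inf, 0.0)
TROPICAL = (lambda x, y: min(x, y), lambda x, y: x + y, math.inf, 0.0)

def forward(edges, n_states, semiring):
    plus, times, zero, one = semiring
    phi = [zero] * n_states
    phi[0] = one
    for e in sorted(edges, key=lambda e: e["to"]):
        phi[e["to"]] = plus(phi[e["to"]], times(phi[e["from"]], e["w"]))
    # PROB: p(X); LOG: -log p(X); TROPICAL: cost of the best (Viterbi) path
    return phi[n_states - 1]
```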


Generalization: Forward/Backward over Semirings: Expectation Semiring

Goal: enable efficient computation of expectations.

- Set R+ × R: elements called "probability" and "value"
- Addition: (p1, v1) ⊕ (p2, v2) := (p1 + p2, v1 + v2)
- Multiplication: (p1, v1) ⊗ (p2, v2) := (p1 · p2, p1 · v2 + p2 · v1)
- Define a cost:
  - a cost for each path $e_1^N \in \pi(L)$,
  - additive along paths, i.e. $c(e_1^N) = \sum_{n=1}^N c(e_n)$
- initialize the value component as w(e)[v] := w(e)[p] · c(e)
- forward/backward computation on graph L
- the p-component of the expectation semiring realizes the probability semiring: fb(e, X)[p] = p(e, X), the probability for paths through e
- fb(e, X)[v] / fb(e, X)[p]: expected cost for paths through e
- expected cost for paths through the graph:

$$\Phi(s_F)[v] / \Phi(s_F)[p] = \Psi(s_I)[v] / \Psi(s_I)[p] = \sum_{W \in \Sigma^*} p(W \mid X) \, c(W)$$
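A sketch of the expectation semiring as a plug-in for the generic forward pass above; elements are (probability, value) pairs, and edge weights are formed as (p(e), p(e) · c(e)) per the initialization above.

```python
EXPECT = (
    lambda a, b: (a[0] + b[0], a[1] + b[1]),                # addition: componentwise
    lambda a, b: (a[0] * b[0], a[0] * b[1] + a[1] * b[0]),  # multiplication: product rule
    (0.0, 0.0),   # zero
    (1.0, 0.0),   # one
)

# The forward result at the final state is (p(X), sum over paths of p(path) * c(path)),
# so value / probability gives the expected path cost.
p, v = forward([{"from": 0, "to": 1, "w": (0.6, 0.6 * 2.0)},   # cost 2, probability 0.6
                {"from": 0, "to": 1, "w": (0.4, 0.4 * 5.0)}],  # cost 5, probability 0.4
               2, EXPECT)
print(v / p)   # -> 3.2 = 0.6 * 2 + 0.4 * 5
```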


Generalization: Forward/Backward over Semirings: Expectation Semiring

Efficient computation of the expectation for graph L:

- graph L[p] stores probabilities on the edges
- graph L[c] stores costs on the edges
- L[p] and L[c] are identical apart from the edge weights

$$E_{L[p]} L[c] := p(X)^{-1} \sum_{e_1^N \in \pi(L)} w(e_1^N)[p] \, w(e_1^N)[c] = \sum_{e_1^N \in \pi(L)} p(e_1^N \mid X) \, c(e_1^N) = \sum_{e \in E(L)} p(e \mid X) \, c(e)$$

- proof: exercise


Frame-Wise Word Posteriors over Word Graphs

Posterior probability of word w at time t in word graph L:

- probability that word w is observed at time t given X:

$$p_t(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists n:\, t_b(e_n) \le t < t_e(e_n) \,\wedge\, i(e_n) = w}} p(e_1^N \mid X)$$

with $t_b(e)$ and $t_e(e)$ being the start and end times of edge $e$.

[Figure: the example word graph with a vertical line marking one time frame; all edges intersecting that frame contribute to the frame-wise posteriors.]


Frame-Wise Word Posteriors over Word Graphs

[Figure: the example word graph at time frame t = 13, where the two "this" edges $e_1$ and $e_2$ intersect the frame.]

Example (path posteriors: "I like this meal" 0.4, "I like this veal" 0.3, "alike his veal" 0.2, "alike this veal" 0.1):

$$p_{t=13}(\text{"this"} \mid X) = p(e_1 \mid X) + p(e_2 \mid X) = p(W_1 \mid X) + p(W_2 \mid X) + p(W_4 \mid X) = 0.4 + 0.3 + 0.1 = 0.8$$


Frame-Wise Word Posteriors over Word Graphs

Frame-wise word posterior probabilities from graph L given X:

- probability that word w is observed at time t given X:

$$p_t(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists n:\, t_b(e_n) \le t < t_e(e_n) \,\wedge\, i(e_n) = w}} p(e_1^N \mid X) = \sum_{\substack{e \in E(L):\, t_b(e) \le t < t_e(e) \,\wedge\, i(e) = w}} p(e \mid X)$$

- defines a probability distribution for each time frame
- applications:
  - confidence measures
  - graph decoding
  - acoustic model training
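A sketch accumulating these frame-wise posteriors from precomputed edge posteriors p(e|X) (e.g. from the forward/backward sketch above); the half-open frame convention [t_b, t_e) and the edge field names are assumptions matching the definitions here.

```python
from collections import defaultdict

def frame_posteriors(edges, post):
    """pt[(t, w)] = p_t(w|X): sum of p(e|X) over edges labeled w covering frame t."""
    pt = defaultdict(float)
    for e, p in zip(edges, post):
        for t in range(e["tb"], e["te"]):
            pt[(t, e["word"])] += p
    return pt
```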



Confidence Measures

- How confident is a recognizer about the correctness of a hypothesized word?
- Possible approach: confidence measure for word w occurring at position n in the hypothesis:

$$\text{conf}(w, n) := p_n(w \mid X)$$

- Problems:
  - What is the "position" n? How to compute a position for a word (and edge) in a graph?
    - Theoretically it is given by the Levenshtein alignment.
  - Ignores deletions: introduce a confidence that no word occurs at position n, i.e. $p_n(\varepsilon \mid X)$.
- Resort: avoid explicit word positions by using frame-wise word posterior probabilities.


Confidence Measures

Confidence measures based on frame-wise word posterior probabilities:

- Idea: approximate the position via the time segment.
- Average frame-wise word posterior for a graph edge e:

$$\text{conf}(e) := d(e)^{-1} \sum_{t=t_b(e)}^{t_e(e)-1} p_t(i(e) \mid X),$$

  with the duration d(e) of edge e.
- Maximum approximation:

$$\text{conf}(e) := \max_{t_b(e) \le t < t_e(e)} p_t(i(e) \mid X)$$
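Both measures in a short sketch, on top of the frame-wise posteriors pt from the earlier sketch:

```python
def conf_avg(e, pt):
    """Average frame-wise posterior of the edge's word over its duration d(e)."""
    return sum(pt[(t, e["word"])] for t in range(e["tb"], e["te"])) / (e["te"] - e["tb"])

def conf_max(e, pt):
    """Maximum approximation: best frame-wise posterior within the edge."""
    return max(pt[(t, e["word"])] for t in range(e["tb"], e["te"]))
```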


Confidence Measures: Position-Wise Word Posteriors over Word Graphs

Posterior probability of word w at position n in word graph L:

- assign edges to positions via a mapping $g: E(L) \to \mathbb{N}$

$$p_n(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists m:\, g(e_m) = n \,\wedge\, i(e_m) = w}} p(e_1^N \mid X)$$

[Figure: the example word graph with its edges assigned to the positions 1 to 4.]


Confidence Measures: Position-Wise Word Posteriors over Word Graphs

[Figure: the example word graph with position slots 1 to 4; the "this" edges are mapped to position 3.]

$$p_{n=3}(\text{"this"} \mid X) = p(W_1 \mid X) + p(W_2 \mid X) + p(W_4 \mid X) = 0.4 + 0.3 + 0.1 = 0.8$$


Confidence Measures: Position-Wise Word Posteriors over Word Graphs

Consider a mapping $g: E(L) \to \mathbb{N}$ to compute position-wise word posterior probabilities from graph L given X:

- Monotonicity: for all graph paths $e_1^N$ require:

$$g(e_n) < g(e_{n+j}) \quad \text{for } j > 0$$

- The probability that word w is observed at position n given X reduces to a sum over edge posterior probabilities (→ forward-backward):

$$p_n(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists m:\, g(e_m) = n \,\wedge\, i(e_m) = w}} p(e_1^N \mid X) = \sum_{\substack{e \in E(L):\, g(e) = n \,\wedge\, i(e) = w}} p(e \mid X)$$

- Defines a probability distribution for each position, where

$$p_n(\varepsilon \mid X) := 1 - \sum_{w \in \Sigma} p_n(w \mid X)$$

- How to obtain $g: E(L) \to \mathbb{N}$? This will be discussed later; it is assumed to be given for now.
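Given such a mapping g (here simply a list with one slot index per edge, which is an assumption) and the edge posteriors p(e|X), the slot-wise distributions follow by accumulation; a sketch:

```python
from collections import defaultdict

def slot_posteriors(edges, post, g, n_slots):
    """pn[n][w] = p_n(w|X); the epsilon entry closes each slot's distribution."""
    pn = [defaultdict(float) for _ in range(n_slots)]
    for i, e in enumerate(edges):
        pn[g[i]][e["word"]] += post[i]
    for slot in pn:
        slot["<eps>"] = max(0.0, 1.0 - sum(slot.values()))
    return pn
```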


Confidence Measures

Confidence measures based on position-wise word posteriors:

- Use the position given by $g: E(L) \to \mathbb{N}$:

$$\text{conf}(i(e), g(e)) := p_{g(e)}(i(e) \mid X)$$

- Note: this provides a confidence for deletions. Let $e_1^N$ be the best scoring graph path; the confidence for deletions between $e_m$ and $e_{m+1}$ is:

$$\text{conf}(\varepsilon, n) := \begin{cases} p_n(\varepsilon \mid X) & \text{for } g(e_m) < n < g(e_{m+1}) \\ 0 & \text{otherwise} \end{cases}$$

- A compact representation of the position-wise word posteriors is the confusion network.


Confidence Measures: Confusion Networks from Word Graphs

Confusion network (CN) derived from word graph L:

- slot-wise posterior probability distribution:

$$p_n(w \mid X) := \sum_{\substack{e_1^N \in \pi(L):\, \exists m:\, g(e_m) = n \,\wedge\, i(e_m) = w}} p(e_1^N \mid X), \qquad p_n(\varepsilon \mid X) := 1 - \sum_{w \in \Sigma} p_n(w \mid X)$$

[Figure: the example word graph mapped to a confusion network with four slots: {I, ε}, {like, alike}, {this, his}, {meal, veal}.]


Confidence Measures: Confusion Networks from Word Graphs

[Figure: step-by-step construction of the confusion network from the example word graph: the edges are grouped into the position slots 1 to 4, yielding the slot alternatives {I, ε}, {like, alike}, {this, his}, and {meal, veal}.]



Bayes Risk Decoding

- Viterbi decoding minimizes the sentence error, but:
- the standard evaluation metric is the word error (defined by the Levenshtein distance).
- Question: how to do better than Viterbi decoding?
- Bayes risk decoding with the Levenshtein distance as loss yields the minimum expected word error.


Bayes Risk Decoding: Motivation

word sequence W       p(W|X)
I like this meal      0.4
I like this veal      0.3
alike his veal        0.2
alike this veal       0.1

- $W := w_1^N \in [\Sigma \cup \{\varepsilon\}]^N$ is the hypothesis
- the expected error at position n is $(1 - p_n(w_n \mid X))$
- the expected error for W is $\sum_{n=1}^N (1 - p_n(w_n \mid X))$
- expected error of the Viterbi hypothesis "I like this meal": 0.3 + 0.3 + 0.2 + 0.6 = 1.4
- minimum expected error hypothesis "I like this veal": 0.3 + 0.3 + 0.2 + 0.4 = 1.2


Bayes Risk Decoding: Confusion Network Decoding

W                     p(W|X)
I like this meal      0.4
I like this veal      0.3
alike his veal        0.2
alike this veal       0.1

Confusion network with position posterior probabilities $p_n(w \mid X)$:

slot 1:   I 0.7     /  ε 0.3
slot 2:   like 0.7  /  alike 0.3
slot 3:   this 0.8  /  his 0.2
slot 4:   meal 0.4  /  veal 0.6

- minimum expected error hypothesis: $\hat{W} = [\arg\max_w p_n(w \mid X)]_{n=1}^N$
- equals decoding of the CN
- CN decoding aims at minimizing a word-based error
- Viterbi decoding aims at minimizing the sentence error
- we are interested in a minimized word error (defined by the Levenshtein distance)
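A worked check of the numbers above; the slot posteriors are read off the confusion network, so nothing new is assumed:

```python
slots = [{"I": 0.7, "<eps>": 0.3}, {"like": 0.7, "alike": 0.3},
         {"this": 0.8, "his": 0.2}, {"meal": 0.4, "veal": 0.6}]

def expected_error(words, slots):
    return sum(1.0 - slot.get(w, 0.0) for w, slot in zip(words, slots))

print(expected_error("I like this meal".split(), slots))  # 1.4 (Viterbi hypothesis)
print(expected_error("I like this veal".split(), slots))  # 1.2 (minimum expected error)
print([max(s, key=s.get) for s in slots])  # CN decoding: ['I', 'like', 'this', 'veal']
```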


Bayes Risk Decoding

Bayes risk decoding with loss function $L: \Sigma^* \times \Sigma^* \to \mathbb{R}$:

$$\hat{W} := \operatorname*{arg\,min}_{W \in \Sigma^*} \sum_{V \in \Sigma^*} p(V \mid X) \, L(W, V)$$

- $\hat{W}$ has the minimum expected loss
- sentence error → MAP (or Viterbi) decoding:

$$L(W, V) := 1 - \delta(W, V)$$

- Levenshtein distance:

$$L(W, V) := \text{Lev}(W, V)$$

  - the loss function of choice for speech recognition evaluation
  - non-local dependencies
  - exact computation prohibitive on graphs
  - use an approximation → only local dependencies
- graph-based Bayes risk decoder:
  - paths through a graph instead of $W \in \Sigma^*$

Page 329: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

Graph-Based Bayes Risk Decoding

Hypothesis and summation space

I summation space graph S defines posterior distributionp(W |X ) for W ∈ Σ∗

I S is given by word graph from HMM decoder

I hypothesis space graph H defines set of possible decodingoutcomes W

I for unknown posterior distributionI H usually a superset of SI H depends on loss functionI sentence error → H = S

W := arg mineN

1 ∈π(H)

∑f M1 ∈π(S)

p(f M1 |X ) L(eN

1 , fM

1 )

efficient computation, if L(eN1 , f

M1 ) has only local dependencies


Local Loss Functions

- for $e_1^N \in \pi(H)$ and $f_1^M \in \pi(S)$:

$$L(e_1^N, f_1^M) = \sum_{n=1}^{N} \sum_{m=1}^{M} L(e_n, f_m)$$

- word-wise loss computation
- only local dependencies


Bayes Risk Decoding with Local Loss Functions

Bayes decision rule with a local loss function:

$$\hat{W} := \operatorname*{arg\,min}_{e_1^N \in \pi(H)} \sum_{f_1^M \in \pi(S)} p(f_1^M \mid X) \sum_{n=1}^{N} \sum_{m=1}^{M} L(e_n, f_m) = \operatorname*{arg\,min}_{e_1^N \in \pi(H)} \sum_{n=1}^{N} \underbrace{\sum_{f \in E(S)} p(f \mid X) \, L(e_n, f)}_{=:\, c(e_n, S)}$$

- efficient computation (see the sketch below):
  1. rescore each arc $e \in E(H)$ with $c(e, S)$
  2. apply a Viterbi decoder to H
- quadratic runtime: $O(|E(H)| \, |E(S)|)$
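A sketch of this two-step decoder; the graph representation (topologically numbered states, edge dicts) and the loss callable are assumptions:

```python
import math

def bayes_risk_decode(h_edges, n_states, s_edges, s_post, loss):
    # step 1: c(e, S) = sum_f p(f|X) * loss(e, f)   -- O(|E(H)| * |E(S)|)
    cost = [sum(p * loss(e, f) for f, p in zip(s_edges, s_post)) for e in h_edges]
    # step 2: Viterbi (shortest path) over H with the arc costs c(e, S)
    best = [(math.inf, None)] * n_states
    best[0] = (0.0, None)
    for i, e in sorted(enumerate(h_edges), key=lambda ie: ie[1]["to"]):
        cand = best[e["from"]][0] + cost[i]
        if cand < best[e["to"]][0]:
            best[e["to"]] = (cand, i)        # remember best incoming edge
    words, i = [], best[n_states - 1][1]
    while i is not None:                     # traceback along best incoming edges
        words.append(h_edges[i]["word"])
        i = best[h_edges[i]["from"]][1]
    return list(reversed(words))
```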

What are good local loss functions?


Levenshtein Distance Approximations

- the local loss function shall approximate the Levenshtein distance
- frame error based loss functions:
  - count errors on a per-frame basis
  - normalize the frame error on a per-word basis
  - no alignment problem
- CN distance:
  - the CN defines an alignment between any two paths in a graph
  - compute the distance based on the CN alignment
  - the Levenshtein distance is a lower bound for the CN distance

For both types of loss functions, experiments show a strong correlation between the exact word error and the word error approximation.


Frame ErrorIdea: compare words frame-wise using word boundaries fromtime-alignment.Example:

[Figure: time axis 0 to 30; reference "I like this meal" vs. hypothesis "alike this meal"; the misaligned region around "I like"/"alike" accounts for 13 frame errors, a remaining boundary mismatch for 1 frame error.]

Frame error (FE) definition:

    L(e_1^N, f_1^M) := ∑_{t=1}^{T} (1 − δ(i(e_t), i(f_t)))

e_t, f_t: the edges from e_1^N and f_1^M which intersect with time frame t, i.e. e_t ∈ {e ∈ e_1^N | t_b(e) ≤ t < t_e(e)}; a sketch follows below.
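A minimal sketch of this frame error computation, assuming each word sequence is given as (word, t_begin, t_end) triples with half-open spans that jointly cover [0, T); the example boundaries in the comment are hypothetical:

    # Frame error between two time-aligned word sequences.

    def word_at(seq, t):
        """Label of the edge that intersects time frame t."""
        return next(w for w, tb, te in seq if tb <= t < te)

    def frame_error(ref, hyp, T):
        """Number of frames whose reference and hypothesis labels differ."""
        return sum(word_at(ref, t) != word_at(hyp, t) for t in range(T))

    # e.g. frame_error([("I", 0, 5), ("like", 5, 15), ("this", 15, 20), ("meal", 20, 30)],
    #                  [("alike", 0, 14), ("this", 14, 20), ("meal", 20, 30)], T=30)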


Frame Error Based Loss Functions

Define:

- d(e): duration in number of time frames
- o(e, f): overlap in number of time frames
- h(e, f): overlap conditioned on label equality

Rewrite frame error:

    L(e_1^N, f_1^M) := ∑_{t=1}^{T} (1 − δ(i(e_t), i(f_t)))

      = ∑_{n=1}^{N} d(e_n) − ∑_{n=1}^{N} ∑_{m=1}^{M} o(e_n, f_m) · δ(i(e_n), i(f_m)),   where   h(e_n, f_m) := o(e_n, f_m) · δ(i(e_n), i(f_m))

      = ∑_{m=1}^{M} [ d(f_m) − ∑_{n=1}^{N} h(e_n, f_m) ]


Frame Error Based Loss Functions

Rescoring function for (position-wise) Bayes risk decoder:

    c(e, S) := ∑_{f ∈ E(S)} p(f|X) L(e, f)
             = d(e) − ∑_{f ∈ E(S)} p(f|X) h(e, f)
             = ∑_{t=t_b(e)}^{t_e(e)−1} [1 − p_t(i(e)|X)]

- simple rescoring with frame-wise word posterior probabilities; a sketch follows below
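A minimal sketch of this rescoring, assuming the frame-wise word posteriors p_t(w|X) are given as one dict per frame; names are illustrative:

    # Arc risk c(e, S) from frame-wise word posteriors.
    # frame_post[t]: dict word -> p_t(word|X); an arc is (word, t_begin, t_end).

    def arc_risk(arc, frame_post):
        word, tb, te = arc
        return sum(1.0 - frame_post[t].get(word, 0.0) for t in range(tb, te))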


Frame Error Based Loss Functions

Rescoring function for (position-wise) Bayes risk decoder:

    c(e, S) = ∑_{t=t_b(e)}^{t_e(e)−1} [1 − p_t(i(e)|X)]

Discrepancy between word and frame error:

- long words have a higher impact on the frame error based decision, i.e. the corresponding frame error generally increases with the word length
- idea: normalize the frame error on a per-word basis
- open: normalize per hypothesis or per reference word?


Hypothesis-Side Normalized Frame Error

Rescoring function for the Bayes risk decoder with the frame error normalized by the hypothesis duration:

    Ŵ := argmin_{e_1^N ∈ H} ∑_{n=1}^{N} [ 1 − (∑_{f ∈ S} p(f|X) h(e_n, f)) / d(e_n) ]

       = argmin_{e_1^N ∈ H} ∑_{n=1}^{N} c(e_n, S),   where   c(e_n, S) := 1 − (1/d(e_n)) ∑_{t=t_b(e_n)}^{t_e(e_n)−1} p_t(i(e_n)|X)

- normalize the FE for each edge e ∈ H by d(e)
- normalized sum of frame-wise word posterior probabilities
- bias: high number of deletions


Reference-Side Normalized Frame Error

Rescoring function for the Bayes risk decoder with the frame error normalized by the reference duration:

    Ŵ := argmin_{e_1^N ∈ H} ∑_{f ∈ S} p(f|X) [ 1 − ∑_{n=1}^{N} h(e_n, f)/d(f) ]

       = argmin_{e_1^N ∈ H} [ ∑_{f ∈ S} p(f|X) − ∑_{f ∈ S} ∑_{n=1}^{N} (h(e_n, f)/d(f)) p(f|X) ]

       = argmin_{e_1^N ∈ H} ∑_{n=1}^{N} c(e_n, S),   where   c(e_n, S) := − ∑_{f ∈ S} p(f|X) h(e_n, f)/d(f)

- normalize the FE for each edge f ∈ S by d(f)
- normalization → cannot use frame-wise word posteriors
- bias: high number of insertions


Bias in Single-Sided Frame Error Normalization

- hypothesis-side normalization does not penalize deletions
- reference-side normalization does not penalize insertions
- best approach in practice: average both normalizations

Example:

    reference:   I     like   that   infection
    hypothesis:  alike        that   satisfaction

    error normalized by
    1. reference:        0.2   0.8   1.0   1.0
    2. hypothesis:       1.0   1.0   0.4   0.6
    average of 1. + 2.:  0.6   0.9   0.7   0.8


Hypothesis Space for Minimum Frame Error Search

- frame error computation requires word boundaries in H
- derive H from the summation space graph S
- build a time-conditioned graph from S: merge all states with equal time stamps
- π(H) ⊇ π(S)

Example:

[Figure: an original lattice with arcs "I", "like", "alike", "this", "his", "veal", "meal" between time stamps 0, 3, 10, 11, 14, 24, and its time-conditioned form, in which all states carrying equal time stamps are merged.]


Confusion Network Distance

- recall: a confusion network CN(L) is derived from a graph L by a function

      g : E(L) → ℕ,

  which maps each edge to a unique position, while keeping the order of each path contained: g(e_n) < g(e_{n+1}) for e_1^N ∈ π(L)
- g(·) maps each path in L to a path in CN(L)
- each path in CN(L) has equal length N := max_{e ∈ E(L)} g(e)
- CN(L) defines a distance between any two paths e_1^N and f_1^N through the confusion network mapping to positions:

      L(e_1^N, f_1^N) := ∑_{n=1}^{N} [1 − δ(i(e_n), i(f_n))]


Confusion Network Distance

Recall the Bayes decision rule with local loss function:

    Ŵ := argmin_{e_1^N ∈ π(H)} ∑_{f_1^M ∈ π(S)} p(f_1^M|X) ∑_{n=1}^{N} ∑_{m=1}^{M} L(e_n, f_m)

- problem: a confusion network in general does not allow keeping the original path probabilities
- idea: define the loss function between paths from the confusion network and its source graph, where the latter contains the original path probabilities
- approach: use the confusion network only as the Bayes risk decoding hypothesis space H, whereas for the summation the original graph S is used
- let g(·) be defined for two graphs H and S ⇒ g(·) defines a loss function L : π(H) × π(S) → ℝ
- note: the loss function does not remain local in the above form, but does retain its efficiency


Hypothesis Space for CN Decoding

- summation space S given by the word graph from the HMM decoder
- let the confusion network mapping g : E(S) → [1, N] be given
- construction of the hypothesis space H:
  1. initialize N + 1 states; state 0 is the initial and N the final state
  2. for each state s ∈ [0, N):
       for each w ∈ Σ ∪ {ε}:
         connect state s with state s + 1 by an edge e, where i(e) = w and g(e) = s + 1
- H has CN structure
  - any path e_1^N ∈ π(H) has exactly N arcs
  - a path e_1^N ∈ π(H) has no gaps, i.e. g(e_s) = s
- H contains all possible Bayes risk decoding results for the CN distance as loss function
- further investigations make use of the special structure of H; minimum CN distance decoding with an arbitrary hypothesis space is left as an exercise


CN Distance for Graphs

CN distance between e_1^N ∈ H and f_1^M ∈ S:

- substitutions: hypothesis and reference word, but hypothesis word ≠ reference word

      L_S(e_1^N, f_1^M) := ∑_{n=1}^{N} ∑_{m=1}^{M} δ(g(e_n), g(f_m)) · [1 − δ(i(e_n), ε)] · [1 − δ(i(f_m), ε)] · [1 − δ(i(e_n), i(f_m))]

- deletions: no hypothesis word, but a reference word

      L_D(e_1^N, f_1^M) := ∑_{n=1}^{N} ∑_{m=1}^{M} δ(g(e_n), g(f_m)) · δ(i(e_n), ε) · [1 − δ(i(f_m), ε)]


CN Distance for Graphs

CN distance between e_1^N ∈ H and f_1^M ∈ S (ctd.):

- insertions: a hypothesis word, but no reference word
  - problem: the position labels for f_1^M are not strictly consecutive, gaps!
  - indirect: #ins = #(hyp. words) − #(cor and sub)

      L_I(e_1^N, f_1^M) := [ ∑_{n=1}^{N} [1 − δ(i(e_n), ε)] ] − [ ∑_{n=1}^{N} ∑_{m=1}^{M} δ(g(e_n), g(f_m)) · [1 − δ(i(e_n), ε)] · [1 − δ(i(f_m), ε)] ]


CN Distance for Graphs

CN distance: sum of four costs:

    L(e_1^N, f_1^M) := [ ∑_{n=1}^{N} [1 − δ(i(e_n), ε)] ]
        + ∑_{n=1}^{N} ∑_{m=1}^{M} δ(g(e_n), g(f_m)) · l(e_n, f_m)

with

    l(e_n, f_m) := − [1 − δ(i(e_n), ε)] · [1 − δ(i(f_m), ε)]
                   + [1 − δ(i(e_n), ε)] · [1 − δ(i(f_m), ε)] · [1 − δ(i(e_n), i(f_m))]
                   + δ(i(e_n), ε) · [1 − δ(i(f_m), ε)]

- consider the local sum of Kronecker delta expressions l(e, f)


CN Distance for Graphs

Simplify the local sum of deltas:

    l(e, f) = − [1 − δ(i(e), ε)] · [1 − δ(i(f), ε)]
              + [1 − δ(i(e), ε)] · [1 − δ(i(f), ε)] · [1 − δ(i(e), i(f))]
              + δ(i(e), ε) · [1 − δ(i(f), ε)]

            = − [1 − δ(i(e), ε)] · [1 − δ(i(f), ε)] · δ(i(e), i(f))
              + δ(i(e), ε) · [1 − δ(i(f), ε)]

            = − [1 − δ(i(e), ε)] · δ(i(e), i(f))
              + δ(i(e), ε) · [1 − δ(i(f), ε)]


CN Distance for Graphs

Resubstitute the local sum of deltas into the CN distance:

    L(e_1^N, f_1^M) = ∑_{n=1}^{N} { [1 − δ(i(e_n), ε)] + ∑_{m=1}^{M} δ(g(e_n), g(f_m)) l(e_n, f_m) }

                    = ∑_{n=1}^{N} { L_1(e_n) + ∑_{m=1}^{M} L_2(e_n, f_m) }

- alternative local loss function

Corresponding decision rule:

    Ŵ := argmin_{e_1^N ∈ π(H)} ∑_{f_1^M ∈ π(S)} p(f_1^M|X) ∑_{n=1}^{N} [ L_1(e_n) + ∑_{m=1}^{M} L_2(e_n, f_m) ]

       = argmin_{e_1^N ∈ π(H)} ∑_{n=1}^{N} [ L_1(e_n) + ∑_{f ∈ E(S)} p(f|X) L_2(e_n, f) ]

       = argmin_{e_1^N ∈ π(H)} R(e_1^N|X)


CN Decoding

    R(e_1^N|X) = ∑_{n=1}^{N} [ L_1(e_n) + ∑_{f ∈ E(S)} p(f|X) L_2(e_n, f) ]        (posterior risk)

      = ∑_{n=1}^{N} { 1 − δ(i(e_n), ε) + ∑_{f ∈ E(S)} p(f|X) δ(g(e_n), g(f)) ·
                      [ δ(i(e_n), ε) [1 − δ(i(f), ε)] − [1 − δ(i(e_n), ε)] δ(i(e_n), i(f)) ] }

      = ∑_{n=1}^{N} { 1 − [1 − δ(i(e_n), ε)] ∑_{f ∈ E(S)} p(f|X) δ(g(e_n), g(f)) δ(i(e_n), i(f))
                      − δ(i(e_n), ε) [ 1 − ∑_{f ∈ E(S)} p(f|X) δ(g(e_n), g(f)) [1 − δ(i(f), ε)] ] }

      = ∑_{n=1, e_n ≠ ε}^{N} [1 − p_n(i(e_n)|X)] + ∑_{n=1, e_n = ε}^{N} [1 − p_n(ε|X)]

      = ∑_{n=1}^{N} [1 − p_n(i(e_n)|X)]


CN Decoding

Bayes risk decoding:

    Ŵ = argmin_{e_1^N ∈ π(H)} R(e_1^N|X)
      = argmin_{e_1^N ∈ π(H)} ∑_{n=1}^{N} [1 − p_n(i(e_n)|X)]
      = [ argmax_{w ∈ Σ ∪ {ε}} p_n(w|X) ]_{n=1}^{N}

with

    p_n(w|X) := ∑_{f ∈ E(S)} p(f|X) · δ(n, g(f)) · δ(w, i(f))

    p_n(ε|X) := 1 − ∑_{f ∈ E(S)} p(f|X) · δ(n, g(f)) · [1 − δ(i(f), ε)]
              = 1 − ∑_{w ∈ Σ} p_n(w|X)


CN Decoding

Bayes risk decoding:

    Ŵ = [ argmax_{w ∈ Σ ∪ {ε}} p_n(w|X) ]_{n=1}^{N}

- decoding only requires the CN derived from S (valid only for the special construction of H); a sketch follows below
- gaps in the position labels are handled by the derived posterior probabilities for the empty word p_n(ε|X)
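A minimal sketch of this decoder, assuming the arcs of S are given as (word, position g(f), posterior) triples and slots are numbered 1..N; names are illustrative:

    # CN decoding: position-wise argmax over p_n(w|X); the empty word
    # absorbs all probability mass not covered by words at a slot.
    from collections import defaultdict

    EPS = "<eps>"

    def cn_decode(arcs, N):
        slots = [defaultdict(float) for _ in range(N + 1)]
        for word, pos, p in arcs:
            if word != EPS:
                slots[pos][word] += p
        output = []
        for n in range(1, N + 1):
            post = dict(slots[n])
            post[EPS] = 1.0 - sum(post.values())  # gaps and eps arcs end up here
            best = max(post, key=post.get)
            if best != EPS:
                output.append(best)
        return output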


CN Distance vs. Levenshtein Distance

- CN alignments cannot always represent Levenshtein alignments
- the CN distance is an upper bound to the Levenshtein distance

Example: given the three strings

    1. a b a b a
    2. b a b a c
    3. a b a c a

    Levenshtein distance:  Lev(1,2) = 2   Lev(1,3) = 1   Lev(2,3) = 2
    CN distance:           CN(1,2) = 2    CN(1,3) = 1    CN(2,3) = 3

The CN alignment of strings 2 and 3 cannot realize their Levenshtein alignment, hence CN(2,3) = 3 > Lev(2,3) = 2.


CN Construction

- CN construction for graphs: find g : E(L) → ℕ
- g(·) induces the Bayes risk decoder

      CN(X; g) := argmin_{e_1^N ∈ H} ∑_{f_1^M} p(f_1^M|X) L(e_1^N, f_1^M; g)

- goal: find g(·) such that the error on a training set [W_r, X_r]_{r=1}^{R} is minimized:

      g_opt := argmin_g ∑_{r=1}^{R} Lev(W_r, CN(X_r; g))

- other optimization criteria possible, e.g. choose g(·) such that the Bayes risk is minimized
- how to compute g_opt for LVCSR graphs?
  - heuristic 1: N-best lists, align entries to the Viterbi path (pivot element)
  - heuristic 2: cluster the edges in S


Edge Clustering Algorithm

An edge clustering algorithm:

- idea: use a fixed set of clusters
  - use a set of pivot edges to initialize the edge clusters
  - add more pivot edges if not all edges can be clustered
- all edges in a cluster must overlap in time
  - sort clusters by average edge start time
  - g(e_n) < g(e_{n+1}) holds for each e_1^N ∈ π(S)
- define an appropriate distance function between an edge e and an edge cluster E; usually based on
  - time overlap
  - word equality
  - pronunciation similarity
- weight the distance for edge e with p(e|X)
  ⇒ clustering is driven by likely edges


Edge Clustering Algorithm

Algorithm outline:

    initialize set of pivot edges P ← ∅
    initialize remaining edge set R ← E(S)
    while R is not empty:
        choose a set of pivot edges P_R from R
        P ← P ∪ P_R
        reinitialize remaining edge set R ← ∅
        initialize edge clusters from the pivot edges
        for e ∈ E(S) \ P sorted by weighted cluster distance:
            C ← nearest edge cluster of e
            if e and e' overlap for each e' ∈ C:
                assign e to C
            else:
                assign e to R

- open: how to find the pivot edges?


Edge Clustering Algorithm

- initialize clusters from
  - non-overlapping edges
  - most likely edges

Pivot edge set algorithm outline (a Python sketch of both steps follows below):

    select_pivot_arcs(E):
        initialize pivot edge set P ← ∅
        for e ∈ E sorted by p(e|X):
            if e' and e do not overlap for each e' ∈ P:
                assign e to P
        return P
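A minimal Python sketch of both steps, assuming arcs (word, t_begin, t_end, posterior); the weighted cluster distance is simplified to posterior-sorted assignment with a time-overlap criterion, so this illustrates the control flow rather than a particular distance function:

    # Pivot selection and edge clustering for CN construction.

    def overlap(e, f):
        """Number of time frames shared by arcs e and f."""
        return max(0, min(e[2], f[2]) - max(e[1], f[1]))

    def select_pivot_arcs(arcs):
        """Greedily pick mutually non-overlapping arcs, most likely first."""
        pivots = []
        for e in sorted(arcs, key=lambda a: -a[3]):
            if all(overlap(e, p) == 0 for p in pivots):
                pivots.append(e)
        return pivots

    def cluster_edges(arcs):
        """Assign every arc to a cluster all of whose members it overlaps;
        arcs that fit nowhere are re-pivoted in the next round."""
        clusters, rest = [], list(arcs)
        while rest:
            pivots = select_pivot_arcs(rest)
            clusters += [[p] for p in pivots]
            rest = [e for e in rest if e not in pivots]
            leftover = []
            for e in sorted(rest, key=lambda a: -a[3]):
                cands = [c for c in clusters if all(overlap(e, f) > 0 for f in c)]
                if cands:
                    max(cands, key=lambda c: overlap(e, c[0])).append(e)
                else:
                    leftover.append(e)
            rest = leftover
        return clusters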


Edge Clustering Algorithm

Example:

[Figure: a lattice over time stamps 0, 10, 32, 40 with arcs "eh:0-10", "hello:10-40", "hello:0-32", and "[si]:32-40".]

1. initial pivot arcs: "eh:0-10", "hello:10-40"
2. assign "hello:0-32" to the second cluster
3. new pivot arc: "[si]:32-40" (since the [si] arc does not overlap)


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications
   4.1 Why Word Graphs?
   4.2 Word Graph Generation using Word Pair Approximation
   4.3 Word Graph Pruning
   4.4 Multi-Pass Search
   4.5 Language Model Rescoring
   4.6 Path probabilities over word graphs
   4.7 Confidence Measures
   4.8 Bayes Risk Decoding
   4.9 System Combination
   4.10 Experiments
   4.11 References

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


System Combination

- observation: different ASR systems lead to different errors
- graphs are provided by J systems
- question: how to minimize the word error by combining the J graphs?
- how to compute a single hypothesis from J parallel word graphs?


System Combination

- idea: introduce the system as a hidden variable

      p(W|X) = ∑_{j=1}^{J} p(W|j, X) p(j),

  assumption: the system prior is independent of the acoustics X

  ⇒ the summation space is given by the graph union

      S := ∪_{j=1}^{J} L_j

  ⇒ path probability for a path through S:

      p(e_1^N|X) := ∑_{j=1}^{J} p(j) p_j(e_1^N|X),

  where p_j(e_1^N|X) = 0 if e_1^N ∉ π(L_j)


System Combination

Can we reduce graph combination to a single graph problem?

- construct a modified graph union S' (a sketch follows below):
  1. introduce a super initial state s_0
  2. for each system j: connect s_0 with the initial state of L_j by an edge e, where i(e) = ε and w(e) = p(j)/p_j(X)
- for a path e_1^N ∈ π(S') it holds that

      p(e_1^N|X) = p(j) p_j(e_1^N|X),

  i.e. the path score in S' equals the combined path probability

graph combination ⇔ Bayes risk decoding of S'
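A minimal sketch of this construction, assuming each system graph is a dict with an initial state and an arc list, and that states are made disjoint by tagging them with the system index; names are illustrative:

    # Modified graph union S': a super initial state s0 with one epsilon arc
    # per system, weighted p(j)/p_j(X), so every path scores p(j) * p_j(e|X).

    EPS = "<eps>"

    def graph_union(graphs, priors, evidences):
        s0 = "s0"
        arcs = [(s0, (j, g["initial"]), EPS, priors[j] / evidences[j])
                for j, g in enumerate(graphs)]
        for j, g in enumerate(graphs):
            arcs += [((j, u), (j, v), w, c) for u, v, w, c in g["arcs"]]
        return {"initial": s0, "arcs": arcs}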


System Combination

Alternative approaches:

- minimum frame error combination
- confusion network combination (CNC)
- ROVER


Minimum frame error combination

Rescoring function for Bayes risk decoder:

    c(e, S) := d(e) − ∑_{f ∈ E(S)} p(f|X) h(e, f) = ∑_{t=t_b(e)}^{t_e(e)−1} [1 − p_t(i(e)|X)]

Extension to system combination:

- compute frame-wise word posteriors as

      p_t(w|X) := ∑_{j=1}^{J} p(j) p_{j,t}(w|X)

- however: equivalent to computing p_t(w|X) from S'!


ROVER Alignment

Recognizer Output Voting Error Reduction (ROVER):

- Levenshtein alignment of the individual systems' outputs:
  1. select a supervisor system (usually the one with the lowest expected WER)
  2. align the other hypotheses to the supervisor
- system-dependent hypothesis W_j := w_{j,1}^{N_j} of the jth system
- alignment of the J system-dependent hypotheses:

      A := (A_1, A_2, ..., A_S)   with   A_s := (a_{s,1}, ..., a_{s,J})

- interpretation: at alignment position s the words w_{1,a_{s,1}}, w_{2,a_{s,2}}, ..., w_{J,a_{s,J}} compete, where a_{s,j} = 0 indicates the empty word, i.e. w_{j,0} := ε


ROVER Decoding

ROVER (ctd.):

- use word-wise scores
- majority voting:

      c_j(s, w) := 1 if w_{j,a_{s,j}} = w, and 0 otherwise

- confidence voting:

      c_j(s, w) := p_{j,a_{s,j}}(w|X)  if a_{s,j} ≠ 0 and w = w_{j,a_{s,j}}
                   p_0                 if a_{s,j} = 0 and w = ε
                   0                   otherwise

  where p_0 is a constant confidence for a deletion

- decode alignment position-wise (a sketch follows below):

      Ŵ = [ argmax_{w ∈ Σ ∪ {ε}} ∑_{j=1}^{J} c_j(s, w) ]_{s=1}^{S}
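A minimal sketch of this position-wise voting over a precomputed alignment; align[s][j] is the word of system j at position s (None for a gap), conf[s][j] its confidence, and p0 the deletion confidence. Names are illustrative:

    from collections import defaultdict

    EPS = "<eps>"

    def rover_vote(align, conf, p0=0.5):
        output = []
        for s, row in enumerate(align):
            scores = defaultdict(float)
            for j, word in enumerate(row):
                if word is None:
                    scores[EPS] += p0      # empty-word slot
                else:
                    scores[word] += conf[s][j]
            best = max(scores, key=scores.get)
            if best != EPS:
                output.append(best)
        return output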


Confusion Network Combination (CNC)

Confusion Network Combination (CNC):

- Levenshtein alignment of the system-dependent CNs
- ROVER vs. CNC: information provided by the jth system at position n

      word   ROVER majority   ROVER conf.       CNC
      w1     1                p_{j,n}(w1|X)     p_{j,n}(w1|X)
      w2     0                0                 p_{j,n}(w2|X)
      w3     0                0                 p_{j,n}(w3|X)
      ...    ...              ...               ...

  (words sorted by p_n(w|X) in decreasing order)
- the CN provides at each position a probability distribution over all words
- Levenshtein alignment of CNs with a local cost defined between probability distributions


Confusion Network Combination (CNC)

CNC (ctd.):

- alignment between two CNs:

      A := (A_1, A_2, ...)   with   A_s := (a_{s,1}, a_{s,2})

- interpretation: combined position-wise word posterior probability distribution at alignment position s:

      p_s(w|X) := ∑_{j=1}^{2} p(j) p_{j,a_{s,j}}(w|X),   where   p_{j,0}(ε|X) := 1

- the alignment result is a CN of length S
  ⇒ expected (position-wise) error for alignment A:

      c(A) = ∑_{s=1}^{S} [ 1 − max_{w ∈ Σ ∪ {ε}} p_s(w|X; A) ]

  ⇒ minimizing c(A) ⇒ Bayes decision rule using the CN distance


Confusion Network Combination (CNC)

Levenshtein alignment of CNs (a sketch follows below):

- align the 1st CN and the 2nd CN:

      C(k, l) := min { C(k−1, l) + c(k, 0),
                       C(k−1, l−1) + c(k, l),
                       C(k, l−1) + c(0, l) }

  with the local cost minimizing the position-wise/local word error:

      c(k, l) := 1 − max_{w ∈ Σ ∪ {ε}} [ p(1) p_{1,k}(w|X) + p(2) p_{2,l}(w|X) ]

- align the combined (1+2)nd CN and the 3rd CN
- ...
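A minimal sketch of this DP, assuming each CN is a list of slot posteriors (dict word -> probability, ε included) with system priors p(1), p(2); only the total cost is computed here, backtracking would recover the combined CN. Names are illustrative:

    # Levenshtein alignment of two confusion networks with local cost
    # c(k, l) = 1 - max_w [p1 * p_{1,k}(w|X) + p2 * p_{2,l}(w|X)].

    EPS = "<eps>"

    def local_cost(slot1, slot2, p1, p2):
        words = set(slot1) | set(slot2)
        return 1.0 - max(p1 * slot1.get(w, 0.0) + p2 * slot2.get(w, 0.0)
                         for w in words)

    def cnc_align_cost(cn1, cn2, p1=0.5, p2=0.5):
        eps = {EPS: 1.0}                 # a gap behaves like a pure-eps slot
        K, L = len(cn1), len(cn2)
        C = [[0.0] * (L + 1) for _ in range(K + 1)]
        for k in range(1, K + 1):
            C[k][0] = C[k - 1][0] + local_cost(cn1[k - 1], eps, p1, p2)
        for l in range(1, L + 1):
            C[0][l] = C[0][l - 1] + local_cost(eps, cn2[l - 1], p1, p2)
        for k in range(1, K + 1):
            for l in range(1, L + 1):
                C[k][l] = min(C[k - 1][l] + local_cost(cn1[k - 1], eps, p1, p2),
                              C[k - 1][l - 1] + local_cost(cn1[k - 1], cn2[l - 1], p1, p2),
                              C[k][l - 1] + local_cost(eps, cn2[l - 1], p1, p2))
        return C[K][L]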


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications
   4.1 Why Word Graphs?
   4.2 Word Graph Generation using Word Pair Approximation
   4.3 Word Graph Pruning
   4.4 Multi-Pass Search
   4.5 Language Model Rescoring
   4.6 Path probabilities over word graphs
   4.7 Confidence Measures
   4.8 Bayes Risk Decoding
   4.9 System Combination
   4.10 Experiments
   4.11 References

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


Experiments: Chinese 230h Test System

- 230h training data (BN/BC)
- ML-trained speaker-normalized models, 1M Gaussians
- 60K vocabulary, 4-gram LM
- two-pass recognition (speaker adaptation, LM rescoring)
- systems vary in the
  - acoustic front-ends: MFCC, PLP, or Gammatone features
  - phonetic decision tree: randomized CART estimation

Chinese Development and Test Sets

                          running           vocabulary
    name         speech   words    chars    words    chars
    gale-dev07   2.5h     27.1K    46.8K    5.2K     1.9K
    gale-eval07  1.6h     -        28.1K    -        1.7K
    gale-dev08   1.0h     10.5K    18.2K    2.9K     1.4K


Experiments: English TC-Star 2007 Evaluation System

- 92h training data (European parliament plenary sessions, EPPS)
- single, globally pooled, diagonal covariance matrix
- MPE-trained speaker-normalized models, 1M Gaussians
- 50K vocabulary, 4-gram LM
- two-pass recognition (speaker adaptation, LM rescoring)
- four systems that differ mainly in
  - system 1: MFCC front-end
  - system 2: Gammatone front-end
  - system 3: MFCC front-end + NN-based phoneme posteriors
  - system 4: MFCC front-end + 190h unsupervised EPPS training data

English EPPS Development and Test Sets

    name          speech   running words
    epps-dev06    3.2h     27.0K
    epps-eval06   3.2h     30.0K
    epps-eval07   2.9h     27.0K


Experiments: English TC-Star 2007 Cross-Site System Combination

- 92h training data (European parliament plenary sessions, EPPS)
- 190h unsupervised training data (EPPS, optional)
- lattices provided by
  - FBK/IRST, Trento, Italy (IRST)
  - CNRS/LIMSI, Paris, France (LIMSI)
  - RWTH Aachen University, Germany (RWTH)
  - University of Karlsruhe, Germany (UKA)
- if a site does internal system combination, e.g. RWTH Aachen, lattices are from the best single system


Bayes Risk Decoding: Chinese 230h Test System

                          CER[%]
    System   Decoder   dev07¹   eval07   dev08
    s1       Viterbi   14.54    15.08    13.28
             CN        14.32    14.95    13.10
    s2       Viterbi   14.82    15.02    13.54
             CN        14.52    14.74    13.35
    s3       Viterbi   15.07    15.60    13.80
             CN        14.86    15.42    13.67

    ¹ tuning set; CER: (Chinese) Character Error Rate


Bayes Risk Decoding: English TC-Star 2007 Evaluation System

                          WER[%]
    System   Decoder   dev06    eval06¹   eval07
    s1       Viterbi   11.09    8.43      9.81
             CN        11.97    8.83      10.48
    s2       Viterbi   11.89    8.70      10.07
             CN        11.87    9.31      11.57
    s3       Viterbi   12.43    8.98      10.76
             CN        10.73    8.22      9.57
    s4       Viterbi   12.06    9.44      11.73
             CN        11.42    8.61      9.78

    ¹ tuning set


Bayes Risk Decoding: English TC-Star 2007 Cross-Site

                          WER[%]
    System   Decoder   eval06¹   eval07
    LIMSI    Viterbi   8.16      9.13
             CN        8.07      8.96
    RWTH     Viterbi   8.46      9.71
             CN        8.24      9.54
    UKA      Viterbi   8.80      10.22
             CN        8.98      10.36
    IRST     Viterbi   10.09     9.81
             CN        10.06     9.82

    ¹ tuning set


Bayes Risk Decoding: Frame Error vs. CN Distance

              loss         frame        (del/ins) err
    System    function     norm.        dev¹                  eval
    Chinese   frame err.   hypothesis   (2.92/1.38) 14.35     (3.01/0.75) 13.13
              frame err.   symmetric    (2.52/1.61) 14.23     (2.75/0.94) 13.11
              CN-dist.     n.a.         (2.79/1.45) 14.30     (2.85/0.80) 13.05
    English   frame err.   hypothesis   (1.95/1.15) 8.08      (2.22/0.99) 9.00
              frame err.   symmetric    (1.68/1.32) 8.05      (1.84/1.15) 9.00
              CN-dist.     n.a.         (1.65/1.33) 8.07      (1.76/1.18) 8.96

    ¹ tuning set; Chinese: CER[%] (Character Error Rate), English: WER[%]


System Combination: Chinese 230h Test System

                                  CER[%]
    System      Combination   dev07¹   eval07   dev08
    s1          -             14.54    15.08    13.28
    s2          -             14.82    15.02    13.54
    s3          -             15.07    15.60    13.80
    s1+s2       union         13.54    13.96    12.43
                CNC           13.55    13.95    12.49
                ROVER         13.63    14.09    12.61
    s1+s2+s3    union         13.15    13.65    12.19
                CNC           13.16    13.74    12.14
                ROVER         13.22    13.86    12.47
                union CN      13.15    13.65    12.19

    ¹ tuning set; CER: (Chinese) Character Error Rate


System Combination: English TC-Star 2007 Evaluation System

                                      WER[%]
    System         Combination   dev06   eval06¹   eval07
    s1             -             11.09   8.43      9.81
    s2             -             11.89   8.70      10.07
    s3             -             12.43   8.98      10.76
    s4             -             12.06   9.44      11.73
    s1+s2          union         10.21   7.79      8.97
                   CNC           10.22   7.82      8.98
                   ROVER         10.54   7.90      9.11
    s1+s2+s3       union         10.21   7.73      8.96
                   CNC           10.14   7.70      8.98
                   ROVER         10.42   7.73      9.17
    s1+s2+s3+s4    union         10.33   7.59      8.94
                   CNC           10.22   7.59      8.92
                   ROVER         10.70   7.67      9.15

    ¹ tuning set


System Combination: English TC-Star 2007 Cross-Site

                                              WER[%]
    System                    Combination   eval06¹   eval07
    LIMSI                     -             8.16      9.13
    RWTH                      -             8.46      9.71
    UKA                       -             8.80      10.22
    IRST                      -             10.09     9.81
    LIMSI+RWTH                union         6.46      7.67
                              CNC           6.38      7.51
                              ROVER         6.69      7.85
    LIMSI+RWTH+UKA            union         6.38      7.63
                              CNC           6.27      7.24
                              ROVER         6.32      7.77
    LIMSI+RWTH+UKA+IRST       union         6.28      7.36
                              CNC           6.14      7.12
                              ROVER         6.21      7.26

    ¹ tuning set


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications
   4.1 Why Word Graphs?
   4.2 Word Graph Generation using Word Pair Approximation
   4.3 Word Graph Pruning
   4.4 Multi-Pass Search
   4.5 Language Model Rescoring
   4.6 Path probabilities over word graphs
   4.7 Confidence Measures
   4.8 Bayes Risk Decoding
   4.9 System Combination
   4.10 Experiments
   4.11 References

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training


References: Word Graphs

- S. Ortmanns, H. Ney, and X. Aubert. A word graph algorithm for large vocabulary continuous speech recognition. Comput. Speech Lang., vol. 11, pp. 43-72, Jan. 1997.
- H. Ney, S. Ortmanns, and I. Lindam. Extensions to the word graph method for large vocabulary continuous speech recognition. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Munich, Germany, Apr. 1997, pp. 1787-1790.
- A. Sixtus and S. Ortmanns. High quality word graphs using forward-backward pruning. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Phoenix, AZ, Mar. 1999, pp. 593-596.


References: Confidence Measures

- F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 288-298, March 2001.
- T. Kemp and T. Schaaf. Estimating confidence using word lattices. In Proc. 5th Eur. Conf. Speech Communication Technology 1997, Rhodes, Greece, Sept. 1997, pp. 827-830.
- L. Chase. Word and acoustic confidence annotation for large vocabulary speech recognition. In Proc. 5th Eur. Conf. Speech Communication Technology, Rhodes, Greece, Sept. 1997, pp. 815-818.
- S. Cox and R. Rose. Confidence measures for the Switchboard database. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1996, Atlanta, GA, May 1996, pp. 511-514.


References: Confidence Measures

- C. Neti, S. Roukos, and E. Eide. Word-based confidence measures as a guide for stack search in speech recognition. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Munich, Germany, Apr. 1997, pp. 883-886.
- B. Rueber. Obtaining confidence measures from sentence probabilities. In Proc. 5th Eur. Conf. Speech Communication Technology 1997, Rhodes, Greece, Sept. 1997, pp. 739-742.
- T. Schaaf and T. Kemp. Confidence measures for spontaneous speech recognition. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Munich, Germany, Apr. 1997, pp. 875-878.
- M. Siu, H. Gish, and F. Richardson. Improved estimation, evaluation and applications of confidence measures for speech recognition. In Proc. 5th Eur. Conf. Speech Communication Technology 1997, Rhodes, Greece, Sept. 1997, pp. 831-834.
- S. Young. Detecting misrecognitions and out-of-vocabulary words. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1994, Adelaide, Australia, Apr. 1994, pp. 21-24.


References: Bayes Risk Decoding

- A. Stolcke, Y. Konig, and M. Weintraub. Explicit word error minimization in N-best list rescoring. Eurospeech, 1997.
- V. Goel, W. Byrne, and S. Khudanpur. LVCSR rescoring with modified loss functions: a decision theoretic perspective. ICASSP, 1998.
- V. Goel, Sh. Kumar, and W. Byrne. Segmental minimum Bayes-risk ASR voting strategies. ICSLP, 2000.
- V. Goel, Sh. Kumar, and W. Byrne. Confidence based lattice segmentation and minimum Bayes-risk decoding. Eurospeech, 2001.
- F. Wessel, R. Schlüter, and H. Ney. Explicit word error minimization using word hypothesis posterior probabilities. ICASSP, 2001.
- R. Schlüter, Th. Scharrenbach, V. Steinbiss, and H. Ney. Bayes risk minimization using metric loss functions. Eurospeech, 2005.


References: Bayes Risk Decoding

- H. Xu, D. Povey, J. Zhu, and G. Wu. Minimum hypothesis phone error as a decoding method for speech recognition. Interspeech, 2009.
- A. Stolcke et al. The SRI March 2000 Hub-5 conversational speech system. NIST Speech Transcription Workshop, 2000.
- D. Hakkani and G. Riccardi. A general algorithm for word graph matrix decomposition. ICASSP, 2003.
- J. Xue and Y. Zhao. Improved confusion network algorithm and shortest path search from word lattice. ICASSP, 2005.
- L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. CSL, 2000.
- B. Hoffmeister, R. Schlüter, and H. Ney. Bayes risk approximations using time overlap with an application to system combination. Interspeech, 2009.


References: Model and System Combination

- P. Beyerlein. Discriminative model combination. ASRU, 1997.
- D. Vergyri. Integration of multiple knowledge sources in speech recognition using minimum error training. PhD thesis, 2000.
- A. Zolnay, R. Schlüter, and H. Ney. Acoustic feature combination for robust speech recognition. ICASSP, 2005.
- B. Hoffmeister, R. Liang, R. Schlüter, and H. Ney. Log-linear model combination with word-dependent scaling factors. Interspeech, 2009.
- J. Fiscus. A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER). ASRU, 1997.
- A. Stolcke et al. The SRI March 2000 Hub-5 conversational speech transcription system. NIST Speech Transcription Workshop, 2000.


References: Model and System Combination

- G. Evermann and P. Woodland. Posterior probability decoding, confidence estimation and system combination. NIST Speech Transcription Workshop, 2000.
- I.-F. Chen and L.-S. Lee. A new framework for system combination based on integrated hypothesis space. Interspeech, 2006.
- R. Zhang and A. Rudnicky. Investigations of issues for using multiple acoustic models to improve continuous speech recognition. Interspeech, 2006.
- D. Hillard, B. Hoffmeister, M. Ostendorf, R. Schlüter, and H. Ney. iROVER: improving system combination with classification. HLT, 2007.
- B. Hoffmeister, R. Schlüter, and H. Ney. iCNC and iROVER: the limits of improving system combination with classification? Interspeech, 2008.


References: Model and System Combination

- H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig. The IBM 2004 conversational telephony system for rich transcription. ICASSP, 2005.
- S. Stüker, Ch. Fügen, S. Burger, and M. Wölfel. Cross-system adaptation and combination for continuous speech recognition: the influence of phoneme set and acoustic front-end. Interspeech, 2006.
- D. Giuliani and F. Brugnara. Experiments on cross-system acoustic model adaptation. ASRU, 2007.
- B. Hoffmeister, Ch. Plahl, P. Fritz, G. Heigold, J. Lööf, R. Schlüter, and H. Ney. Development of the 2007 RWTH Mandarin LVCSR system. ASRU, 2007.


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Idea

At every time τ a separate copy of the lexical prefix tree (time conditioned tree copy) with the hypotheses Q_τ(t, s) is started.

[Figure: time axis with a tree copy started at time τ and evaluated up to time t.]

Possible advantage: using beam search, only a few time conditioned tree copies are active (experiments will show 40-50).


Time Conditioned Search: Approach

- time synchronous evaluation of Bayes' decision rule as before
- full sequence joint probability:

      Pr(w_1 ... w_N) · Pr(x_1 ... x_T | w_1 ... w_N)

- definitions:
  - probability that word w generated the acoustic vectors x_{τ+1}, ..., x_t:

        h(w; τ, t) := Pr(x_{τ+1}^{t} | w)

  - conditional probability for the generation of the acoustic vectors x_1, ..., x_t and a word sequence w_1^n that ends at time t:

        G(w_1^n; t) := Pr(w_1^n) · Pr(x_1^t | w_1^n)


Time Conditioned Search: Approach

- decomposition (recall word graph generation): the vectors x_1, ..., x_τ are covered by G(w_1^{n−1}; τ), the vectors x_{τ+1}, ..., x_t by h(w_n; τ, t), and the word boundary τ is maximized over.

[Figure: time axis 0 ... τ ... t ... T with the words w_1 ... w_{n−1} ending at time τ and word w_n spanning τ+1 ... t.]

- optimize over the word boundary τ:

      G(w_1^n; t) = max_τ { Pr(w_n | w_1^{n−1}) · G(w_1^{n−1}; τ) · h(w_n; τ, t) }
                  = Pr(w_n | w_1^{n−1}) · max_τ { G(w_1^{n−1}; τ) · h(w_n; τ, t) }

- so far: no optimization of the word boundaries: organization of word successor hypotheses as a tree


Time Conditioned Search: Approach

- so far: no optimization of the word boundaries: organization of word successor hypotheses as a tree

[Figure: word conditioned expansion of tree copies; after each word end (A, B, or C) a new tree copy over the vocabulary {A, B, C} is started, one copy per predecessor word sequence.]


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: M-Gram Language Models

Consider the m-gram language model p(u_m | u_1^{m−1}).

- distinguish word sequences by their last (m − 1) words
- definition: conditional probability for an acoustic vector sequence x_1 ... x_t and a word sequence that ends at time t with the word subsequence u_2^m:

      H(u_2^m; t) = max_{w_1^n} { Pr(w_1^n) · Pr(x_1^t | w_1^n) : w_{n−m+2}^{n} = u_2^m }

- optimize word ends (dynamic programming):

      H(u_2^m; t) = max_{u_1} { p(u_m | u_1^{m−1}) · max_τ [ H(u_1^{m−1}; τ) · h(u_m; τ, t) ] }

- using a bigram language model p(w|v):

      H(w; t) = max_v { p(w|v) · max_τ [ H(v; τ) · h(w; τ, t) ] }


Time Conditioned Search

Consider the hypothesis that word w ends at time t.

[Figure: several tree copies started at times τ_1, τ_2, τ_3 all reach the word end state of w at time t.]

Optimization over the word boundary/word start time at the word end!


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Implementation

Principle:

- time synchronous evaluation of the hypotheses (as before)
- lexical search tree
- start a separate lexical tree at every time frame
- define: Q_τ(t, s) = probability of the best partial path that generates the acoustic vectors x_1 ... x_t up to a state s at time t of a lexical prefix tree with start time τ


Time Conditioned Search: Implementation

Dynamic programming recursion for bigram language models:

- initialization (best predecessor hypothesis):

      Q_{t−1}(t−1, s) = max_u H(u; t−1)   if s = 0
                        0.0               if s > 0

- within words:

      Q_τ(t, s) = max_σ { p(x_t, s | σ) · Q_τ(t−1, σ) }

- word boundaries:

      H(w; t) = max_v p(w|v) · max_τ { H(v; τ) · Q_τ(t, S_w) / max_u H(u; τ) }
              = max_v p(w|v) · max_τ { H(v; τ) · h(w; τ, t) }


Time Conditioned Search: Implementation

Properties of the time conditioned search:

- time synchronous
- store the score H(v; τ) for corrections and modifications of the word boundary optimization
- separate word boundary optimization over start time and predecessor word
- no backpointers


Time Conditioned Search: Implementation

    proceed over time t from left to right

      ACOUSTIC LEVEL: process hypotheses Q_τ(t, s)
        - initialization:
              Q_{t−1}(t−1, s) = max_u H(u; t−1) if s = 0, else 0.0
        - time alignment: Q_τ(t, s) using the DP recursion
        - prune unlikely hypotheses
        - purge bookkeeping lists

      WORD PAIR LEVEL: process hypotheses Q_τ(t, S_w)
        for each pair (w; t) do
          - word score: h(w; τ, t) = Q_τ(t, S_w) / max_u H(u; τ)
          - best predecessor and best boundary:
                [v_0, τ_0](w, t) := argmax_{v,τ} { p(w|v) h(w; τ, t) H(v; τ) }
          - store [v_0, τ_0](w, t)
          - tree start-up score:
                H(w; t) = p(w | v_0(.)) h(w; τ_0(.), t) H(v_0(.); τ_0(.))
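A minimal Python sketch of this loop, assuming a bigram LM; expand_tree() stands in for the within-word DP recursion including pruning, and word_end_score() returns Q_τ(t, S_w). All names are illustrative placeholders:

    def time_conditioned_search(T, vocab, lm, expand_tree, word_end_score):
        H = {0: {"<s>": 1.0}}       # H[t][v]: score of word v ending at time t
        best_H = {0: 1.0}           # max_u H(u; t), divided out of each tree copy
        Q = {}                      # active tree copies, keyed by start time tau
        for t in range(1, T + 1):
            # ACOUSTIC LEVEL: start a new tree copy, advance all active copies
            Q[t - 1] = {0: best_H[t - 1]}         # root state 0
            for tau in list(Q):
                Q[tau] = expand_tree(Q[tau], t)   # time alignment + pruning
                if not Q[tau]:
                    del Q[tau]
            # WORD PAIR LEVEL: word boundary optimization at the word ends
            H[t], best_H[t] = {}, 0.0
            for w in vocab:
                best = 0.0
                for tau, states in Q.items():
                    if best_H[tau] <= 0.0:
                        continue
                    h = word_end_score(states, w) / best_H[tau]  # h(w; tau, t)
                    best = max(best, max((lm(w, v) * h * Hv
                                          for v, Hv in H[tau].items()), default=0.0))
                if best > 0.0:
                    H[t][w] = best
                    best_H[t] = max(best_H[t], best)
        return H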


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Extension to Trigram Language Models

- trigram language model p(w | u v):

      H(v, w; t) = max_u { p(w | u v) · max_τ [ H(u, v; τ) · h(w; τ, t) ] }

- dynamic programming recursion initialization (best word end hypothesis):

      Q_{t−1}(t−1, s) = max_{(u,v)} H(u, v; t−1)   if s = 0
                        0.0                        if s > 0

- within words:

      Q_τ(t, s) = max_σ { p(x_t, s | σ) · Q_τ(t−1, σ) }

- word boundaries:

      H(v, w; t) = max_u p(w | u v) · max_τ [ H(u, v; τ) · Q_τ(t, S_w) / max_{(u',v')} H(u', v'; τ) ]


Time Conditioned Search: Experimental Results

Test: NAB'94 – H1 Development

- vocabulary: 20000 words
- 10 male and 10 female speakers
- total of 310 sentences = 7387 spoken words
- unknown words: 199 (out-of-vocabulary rate: 2.7%)
- test set perplexities:
  - bigram: PP_bi = 198.1
  - trigram: PP_tri = 130.2

Acoustic modelling:

- trained on WSJ0 and WSJ1 (284 speakers; 80h of speech)
- phoneme models: 4688 (43+1 monophones, 557 diphones, 4087 triphones)
- tied states: 4623 (6-state HMM)
- densities: 290 000 per gender


Time Conditioned Search: Experimental Results, Word Conditioned vs. Time Conditioned Tree Search

Bigram language model, PP=198:

                  average number of                         del - ins   WER[%]
    Search        States   Arcs    Trees   Words   LM Rec
    word          5022     1416    20      86      86       2.5 - 2.7   17.1
    conditioned   9335     2593    28      156     156      2.5 - 2.7   16.8
                  26493    7222    45      540     540      2.4 - 2.6   16.5
                  39146    10528   53      610     610      2.4 - 2.6   16.4
                  52786    13968   60      560     560      2.4 - 2.6   16.4
    time          6175     1781    27      117     1539     2.4 - 2.9   17.3
    conditioned   11280    3101    31      197     2976     2.4 - 2.7   16.7
                  30019    7998    37      520     8761     2.4 - 2.6   16.5
                  44289    11679   41      777     13162    2.4 - 2.6   16.5
                  60692    15877   43      1075    18081    2.4 - 2.6   16.4


Time Conditioned Search: Experimental Results, Word Conditioned vs. Time Conditioned Tree Search

Trigram language model, PP=122:

                  average number of                         del - ins   WER[%]
    Search        States   Arcs    Trees   Words   LM Rec
    word          4714     1392    29      80      80       1.7 - 2.9   14.8
    conditioned   18734    5430    70      294     294      1.6 - 2.8   14.2
                  48940    13877   112     702     702      1.6 - 2.8   14.1
                  67541    18764   130     953     953      1.6 - 2.7   14.0
                  86772    23688   145     1223    1223     1.6 - 2.7   13.9
    time          8573     2459    28      136     1477     1.6 - 2.9   14.5
    conditioned   15103    4194    31      234     2806     1.6 - 2.8   14.1
                  24914    6780    34      388     4858     1.6 - 2.7   14.0
                  38323    10299   38      606     7663     1.6 - 2.7   14.0
                  71757    18953   42      1202    14293    1.6 - 2.7   14.0

From: S. Ortmanns, H. Ney, F. Seide, I. Lindam, "A Comparison of Time Conditioned and Word Conditioned Search Techniques for Large Vocabulary Speech Recognition," Proc. Int. Conf. on Spoken Language Processing, Philadelphia, PA, pp. 2091-2094, October 1996.


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

For the time conditioned search define:

    Q_τ(t, s) := max_{w_1^{n−1}} { Pr(w_1^{n−1}) · max_{s_1^t} Pr(x_1^t, s_1^t | w_1^{n−1}) : s_τ = S_{w_{n−1}}; s_t = s }

Review: word conditioned tree search for m-gram language models p(v_m | v_1^{m−1}):

    Q_{v_1^{m−1}}(t, s) := max_{w_1^{n−1}} { Pr(w_1^{n−1}) · max_{s_1^t} Pr(x_1^t, s_1^t | w_1^{n−1}) : w_{n−m+1}^{n−1} = v_1^{m−1}; s_t = s }


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

[Figure: the same hypothesis (t, s) as a state in a time conditioned tree copy started at time τ, and as a state in the word conditioned tree copy for history v_1^{m−1}.]


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

- the path with the history v_1^{m−1} that ends at time t in state s must have reached state S_{v_{m−1}} once at time τ: s_τ := S_{v_{m−1}}
- this path is split into two partial paths at time τ, where time τ has to be optimized:

    Q_{v_1^{m−1}}(t, s)
      = max_τ [ max_{w_1^{n−1}} { Pr(w_1^{n−1}) · max_{s_1^τ} Pr(x_1^τ, s_1^τ | w_1^{n−1}) : w_{n−m+1}^{n−1} = v_1^{m−1}; s_τ = S_{v_{m−1}} }
                · max_{s_{τ+1}^t} { Pr(x_{τ+1}^t, s_{τ+1}^t | ·) : s_t = s } ]

      = max_τ [ H(v_1^{m−1}; τ) · max_{s_{τ+1}^t} { Pr(x_{τ+1}^t, s_{τ+1}^t | ·) : s_t = s } ]

      = max_τ [ H(v_1^{m−1}; τ) · Q_τ(t, s) / max_{u_1^{m−1}} H(u_1^{m−1}; τ) ]


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

    Q_{v_1^{m−1}}(t, s) = max_τ [ H(v_1^{m−1}; τ) / max_{u_1^{m−1}} H(u_1^{m−1}; τ) · Q_τ(t, s) ]   for all s with s ≠ 0.

Especially for a word boundary s = S_{v_m} we get:

    Q_τ(t, S_{v_m}) = max_{u_1^{m−1}} H(u_1^{m−1}; τ) · h(v_m; τ, t)

Substitute:

    Q_{v_1^{m−1}}(t, S_{v_m}) = max_τ [ H(v_1^{m−1}; τ) · h(v_m; τ, t) ]


Time Conditioned Search: Equivalence of Word Conditioned and Time Conditioned Search Organization

Word conditioned search with recombination for s = 0:

    Q_{v_2^m}(t, s=0) = max_{v_1} { p(v_m | v_1^{m−1}) · Q_{v_1^{m−1}}(t, S_{v_m}) }
                      = max_{v_1} { p(v_m | v_1^{m−1}) · max_τ H(v_1^{m−1}; τ) · h(v_m; τ, t) }
                      = H(v_2^m; t).

- it is possible to convert scores of the time conditioned search into scores of the word conditioned search
- the opposite conversion is usually impossible because of the optimization over τ, which is done implicitly in word conditioned search
  → time conditioned search is "less compact" without further optimization/pruning


Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search
   5.1 Idea
   5.2 M-Gram Language Models
   5.3 Implementation
   5.4 Extension to Trigram Language Models
   5.5 Equivalence of Word Conditioned and Time Conditioned Search Organization
   5.6 Across-Word Contexts
   5.7 Pruning

6. Normalization and Adaptation

7. Discriminative Training


Time Conditioned Search: Across-Word Contexts

- Across-word contexts lead to multiple tree roots, depending on the left phoneme context of the words to be started.
- Problem: time conditioned search assumes a unique tree root, relative to which word probabilities can be computed.
- Observation: coarticulated root states are recombined after the first generation.
- Idea: partition the tree into a word-conditioned root network, followed by time-conditioned subtrees.
- Subtrees are compatible with the time conditioned search structure, i.e. they have unique tree roots.

[Figure: lexical tree partitioned into a word-conditioned root network followed by time-conditioned subtrees Ω0, Ω1, Ω2, Ω3.]


Time Conditioned Search: Pruning

Standard pruning:
- As for word conditioned search, acoustic and language model pruning are applied.
- Acoustic pruning:
  - Acoustic pruning threshold:

      Q_{AC}(t) := \max_{\tau, s} Q_\tau(t, s).

  - Discard all hypotheses (τ, s) at time t for which

      Q_\tau(t, s) < f_{AC} \cdot Q_{AC}(t)

    (see the sketch below).
  - Note: implicitly, pruning affects all word end hypotheses attached to a search tree.
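The beam criterion above is straightforward to implement once the hypothesis scores are available. A minimal sketch, assuming hypotheses are kept in a dict keyed by (τ, s) with probability-domain scores (names are illustrative, not from the original system):

```python
def acoustic_prune(hyps, f_ac):
    """Beam pruning for time conditioned search.

    hyps: dict mapping (tau, s) -> Q_tau(t, s) for the current frame t.
    f_ac: acoustic pruning factor, 0 < f_ac < 1 (probability-domain scores).
    Returns the surviving hypotheses.
    """
    if not hyps:
        return hyps
    q_ac = max(hyps.values())            # Q_AC(t) = max_{tau,s} Q_tau(t, s)
    threshold = f_ac * q_ac
    return {key: q for key, q in hyps.items() if q >= threshold}
```

In practice scores are kept in the log domain, so the multiplicative factor f_AC becomes an additive beam offset.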


Time Conditioned Search: Pruning

- Language model pruning:
  - Language model recombination/word boundary optimization:

      H(v_2^m; t) = \max_{v_1} \Big\{ p(v_m \mid v_1^{m-1}) \cdot \max_\tau \frac{H(v_1^{m-1}; \tau) \cdot Q_\tau(t, S_{v_m})}{\max_{u_1^{m-1}} H(u_1^{m-1}; \tau)} \Big\}

      Q_t(t, 0) = \max_{v_2^m} H(v_2^m; t).

  - Pruning threshold:

      Q_{LM}(t) := \max_{u_2^M} H(u_2^M; t).

  - Define the set E(t) of all word end hypotheses surviving language model pruning at a given time frame:

      E(t) := \big\{ (v_1^{m-1}, q) \;\big|\; q = H(v_1^{m-1}; t),\; q > f_{LM} \cdot Q_{LM}(t) \big\}

  - Attach E(t) to the corresponding search tree started at time t.


Time Conditioned Search: Pruning

- LM look-ahead:
  - Only unigram LM look-ahead is efficient.
  - LM look-ahead with context dependency (bigram or higher) requires maximization over all contexts; the information gain does not pay off w.r.t. efficiency.
- Efficiency problems:
  - Only the single best ending word hypothesis is used to start a tree at a given time frame.
  - Word end hypotheses with different probabilities are attached to the search tree.
    ⇒ Some of the word end hypotheses might already be below the pruning thresholds.
  - LM recombination builds on all word end hypotheses with varying start times.
  - LM pruning is applied afterwards.
    ⇒ Word boundary optimization is delayed.


Time Conditioned Search: Pruning

Early word end pruning:
- To prevent too many language model recombinations, apply acoustic pruning to expanded word end hypotheses before computing the LM probability (cf. word conditioned search).
- Word boundary optimization:

    Q_{v_1^{m-1}}(t) := \max_\tau \frac{H(v_1^{m-1}; \tau) \cdot Q_\tau(t, S_{v_m})}{\max_{u_1^{m-1}} H(u_1^{m-1}; \tau)}

- Acoustic pruning on expanded word end hypotheses: discard all word end hypotheses (v_1^{m-1}; t) with

    Q_{v_1^{m-1}}(t) < f_{AC} \cdot Q_{AC}(t)

- LM recombination on the remaining hypotheses:

    H(v_2^m; t) = \max_{v_1} \big\{ p(v_m \mid v_1^{m-1}) \cdot Q_{v_1^{m-1}}(t) \big\}

    Q_t(t, 0) = \max_{v_2^m} H(v_2^m; t).

- Anticipated LM pruning: use/update the current best hypothesis.


Time Conditioned Search: Pruning

Context recombination:
- In time conditioned search, word boundaries are recombined only at the word ends, whereas in word conditioned search, word boundaries are recombined implicitly early on.
- Search trees for adjacent start times can be expected to expand to the same state hypotheses (and therefore also the same word end hypotheses).
- Idea: discard word end hypothesis v_1^{m-1} from a specific time frame if all expanded state hypotheses with respect to this word end hypothesis within one tree are lower than the equivalent state hypotheses within another tree.
- A hypothesis (v_1^{m-1}, q) ∈ E(τ), ending in the word sequence v_1^{m-1} with probability q, can be removed from E(τ) if:

    \frac{q \cdot Q_\tau(t, s)}{Q_\tau(\tau, 0)} < \max_{\tau', q' : (v_1^{m-1}, q') \in E(\tau')} \frac{q' \cdot Q_{\tau'}(t, s)}{Q_{\tau'}(\tau', 0)}


Time Conditioned Search: Pruning

- The complexity of context recombination is problematic: all active contexts v_1^{m-1} and states s have to be visited.
- Approximation: perform context recombination not in every time frame, but at certain intervals.

Further speed-up of time conditioned search:
- Using an HMM with skip transition, a word end hypothesis still is generated only from the last state of a word, even though a skip from the second to last state would also leave the word's HMM.
- Nevertheless, this approximation does not have a significant effect on the search result.
- In the same way it can be expected that the effect would be small if word starts were not allowed at every time frame.
- Approach: only allow word starts (i.e. tree starts) at certain intervals.


Time Conditioned Search: Pruning: Results

Search space comparison:
- Average number of different active word end state hypotheses per time frame, independent of context and start time.
- Average number of word end hypotheses per time frame.

Results on a small subset of the NAB'94 H1 dev corpus with 20k word vocabulary, with relaxed pruning constraints:

                                      search trees conditioned on:
                                      word context [k]   start time [k]
  word end states                           5.00              5.5
  word ends: w/o early pruning              5.00           1150.0
             with early pruning                -             11.3
             after pruning                  1.60              8.3
             after recombination            0.77              1.5


Time Conditioned Search: Pruning: Results

Experiments on European Parliament Plenary Sessions (EPPS) English:
- Vocabulary: 60k words.
- Across-word models.
- Word conditioned search (WCS):
  - optional bigram language model look-ahead (Bigram LAH).
- Time conditioned search (TCS):
  - context recombination every 5th time frame (CR5),
  - tree start every 2nd time frame (Interval 2).
- Measure word error rate (WER) vs. real time factor (RTF).
- Comparison of time and word conditioned search.


Time Conditioned Search: Pruning: Results

[Figure: WER [%] (13.0-14.4) vs. RTF (0-9), EPPS Eval06, 60k words, comparing TCS, TCS + CR5, TCS Interval 2, and TCS Interval 2 + CR5.]


Time Conditioned Search: Pruning: Results

[Figure: WER [%] (13.0-14.4) vs. RTF (0-9), EPPS Eval06, 60k words, comparing WCS, WCS + Bigram LAH, TCS, and TCS Interval 2 + CR5.]


Time Conditioned Search: Conclusions

Pruning experiments show:
- Context recombination leads to moderate improvements.
- A tree start interval of 2 frames gives a significant speed-up.
- Context recombination has a negligible effect on top of the tree start interval.
- Tree start intervals larger than 2 give degradations.

Overall:
- With optimized pruning, time conditioned search outperforms word conditioned search.


Outline
0. Lehrstuhl fur Informatik 6
1. Large Vocabulary Speech Recognition
2. Search using Lexical Pronunciation Tree
3. Across-Word Models
4. Word graphs and Applications
5. Time Conditioned Search
6. Normalization and Adaptation
   6.1 Introduction
   6.2 Audio Segmentation
   6.3 Speaker Clustering
   6.4 Vocal Tract Length Normalization (VTLN)
   6.5 Maximum Likelihood Linear Regression (MLLR)
   6.6 Feature Space MLLR (fMLLR)
   6.7 Speaker Adaptive Training
   6.8 Examples and Results
   6.9 Summary
7. Discriminative Training


Normalization and Adaptation: Motivation

Objective: try to fit the acoustic model to the speaker/condition.
- Need new acoustic adaptation data from the current speaker or condition.

Distinctions:
- Model based vs. feature based:
  - Modify the model to better fit the features ⇒ adaptation.
  - Transform the features to better fit the model ⇒ normalization.
- Supervised vs. unsupervised:
  - Transcriptions available for the adaptation data ⇒ supervised.
  - No transcriptions available ⇒ unsupervised.
- Recognition side vs. also in training:
  - Normalization also in training is straightforward.
  - Possible also for adaptation ⇒ speaker adaptive training (SAT).


Normalization and Adaptation: Supervised Adaptation

Supervised adaptation = constrained acoustic model reestimation.
- Needs manual transcriptions of the adaptation data.
- Theoretically similar to acoustic model training: maximum likelihood.

Model based: M' = f(M, \varphi)

    \hat{\varphi} = \arg\max_\varphi P(x_1^T \mid w_1^N, M').

Feature based: x'^T_1 = f(x_1^T, \theta) (note the Jacobian determinant)

    \hat{\theta} = \arg\max_\theta \left| \frac{d f(x_1^T, \theta)}{d x_1^T} \right| P(x'^T_1 \mid w_1^N, M).

As in acoustic model training: discriminative training is possible and useful.


Normalization and Adaptation: Unsupervised Adaptation - Theoretical Foundation

Simultaneous generation of the adaptation word sequence and estimation of the adaptation parameters.
- Model based: M' = f(M, \varphi)

    (\hat{w}_1^N, \hat{\varphi}) = \arg\max_{w_1^N, \varphi} P(w_1^N) \, P(x_1^T \mid w_1^N, M').

- The feature based method can be defined analogously.
- The above method is infeasible in practice.
- Approximations are always used.


Normalization and Adaptation: Unsupervised Adaptation - Approximations

Word graph based approximation:
- Perform first pass recognition generating a word graph.
- Posterior confidence measures from the word graph.
- Weighted accumulation.

First-best approximation:
- Only take the first best output instead of the word graph.
- Estimation is performed exactly as in the supervised case, but use the first pass output as transcription.
- This method is used most often in practice.


Normalization and Adaptation: Estimation Criteria

Different estimation criteria are possible:
- Supervised adaptation:
  - Discriminative criteria can be applied (see Chapter 7).
  - No principal difference to acoustic model training.
- Unsupervised adaptation:
  - Currently no significant improvement using discriminative criteria compared to maximum likelihood estimation.
  - Appropriate criteria are still an open research issue.


Normalization and Adaptation: Use Cases: Transcription System

- Objective: generate a transcription from untranscribed audio data.
- Offline system: multiple passes over the data are allowed.
- Several different speakers and conditions; this requires:
  - segmentation, to partition the audio into homogeneous portions: single speaker and single condition per segment,
  - clustering of segments into same (similar) speakers and/or conditions.
- Speakers not previously known / no transcribed adaptation data available: unsupervised adaptation.


Normalization and Adaptation: Use Cases: Dictation System

- Objective: transcribe speech from a known speaker in real time.
- Typically only one speaker uses each system (at a time).
- The user corrects errors in the ASR output ⇒ corrected transcripts are usable for supervised adaptation.
- The amount of adaptation data continually increases ⇒
  - the complexity of the adaptation method should also increase,
  - finally a complete retraining of the acoustic model might be used.


Normalization and Adaptation: Use Cases: On-line Interactive System

- Objective: transcribe speech from different unknown speakers in real time.
- Used for dialog as well as command and control systems.
- Unsupervised adaptation.
- Requires fast adaptation, i.e. adaptation using very little data (some seconds).
- Can benefit from incremental adaptation, i.e. continually updating the adaptation for new data.


Audio Segmentation

- Objective: split the audio stream into homogeneous regions according to:
  - speaker identity,
  - recording condition (e.g. background noise, telephone channel),
  - signal type (e.g. speech, music, noise, silence), and
  - spoken word sequence.
- Segmentation affects speech recognition performance:
  - speaker adaptation and speaker clustering assume one speaker per segment,
  - the language model assumes sentence boundaries at segment ends,
  - non-speech regions cause insertion errors,
  - overlapping speech is not recognized correctly and causes errors in surrounding regions,
  - words must not be split.


Audio Segmentation: Methods

- Metric based:
  - Compute a distance between adjacent regions.
  - Segment at maxima of the distances.
  - Distances: KL distance, Bayesian information criterion.
- Model based:
  - Classify regions using precomputed models for music, speech, etc.
  - Segment at changes in acoustic class.
- Decoder guided:
  - Apply speech recognition to the input audio stream.
  - Segment at silence regions.
  - Other information produced by the decoder can be used.


Audio Segmentation: Metric-based using BIC

Bayesian Information Criterion (BIC):
- Likelihood criterion for a model θ, given a data set x_1^N:

    BIC(\theta, x_1^N) = \log p(x_1^N \mid \theta) - \frac{\lambda}{2} \cdot d(\theta) \cdot \log N,

  d(θ): number of parameters in θ,
  λ: penalty weight for model complexity.
- Used for model selection: choose the model maximizing the BIC.


Audio Segmentation: Metric-based using BIC (ctd.)

Change point detection using BIC:
- The input stream is modeled as a Gaussian process in the cepstral space.
- Feature vectors of one segment: drawn from a multivariate Gaussian: x_i ... x_j ~ N(µ, Σ).
- For a hypothesized segment boundary t in x_1^T decide between
  - x_1 ... x_T ~ N(µ, Σ), and
  - x_1 ... x_t ~ N(µ_1, Σ_1), x_{t+1} ... x_T ~ N(µ_2, Σ_2).
- Use the difference of BIC values:

    \Delta BIC(t) = T \log|\Sigma| - t \log|\Sigma_1| - (T - t) \log|\Sigma_2| - \lambda P

    P = \frac{1}{2}\Big(D + \frac{1}{2} D(D+1)\Big) \log T

  D: dimensionality of the feature vectors.
- Change point if ΔBIC(t) > 0 (sketched below).
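As an illustration, a minimal NumPy sketch of the ΔBIC computation for one hypothesized boundary (illustrative only; variable names and the frame-matrix layout are assumptions, not from the original slides, and both sides of the split must contain enough frames for a nonsingular full covariance):

```python
import numpy as np

def delta_bic(x, t, lam=1.0):
    """Delta BIC for a hypothesized change point t in the frame matrix x.

    x:   T x D matrix of cepstral feature vectors.
    t:   hypothesized boundary, 0 < t < T.
    lam: complexity penalty weight lambda.
    Returns Delta BIC(t); a change point is hypothesized if the value is > 0.
    """
    T, D = x.shape
    # log|Sigma| of the ML (biased) covariance estimate of a frame block
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False, bias=True))[1]
    P = 0.5 * (D + 0.5 * D * (D + 1)) * np.log(T)
    return T * logdet(x) - t * logdet(x[:t]) - (T - t) * logdet(x[t:]) - lam * P

# Single change point: t_hat = argmax_t delta_bic(x, t)
```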


Audio Segmentation: Metric-based using BIC (ctd.)

- Detecting a single change point in x_1^T:

    \hat{t} = \arg\max_t \Delta BIC(t)

- Detecting multiple change points:
  - Apply change point detection to a sliding, size-growing window.
  - Start with a window of initial size.
  - Enlarge the window until a change point is detected, or the maximum window size is reached.
  - Restart with the initial window size behind the previous change point.


Literature

- G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
- S. S. Chen and P. S. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian information criterion," in DARPA Broadcast News Transcription and Understanding Workshop, Feb. 1998, pp. 127-132.
- M. Cettolo, M. Vescovi, and R. Rizzi, "Evaluation of BIC-based algorithms for audio segmentation," Computer Speech and Language, vol. 19, no. 2, pp. 147-170, Apr. 2005.
- S. E. Tranter, K. Yu, G. Evermann, and P. C. Woodland, "Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech," in Proc. ICASSP, May 2004, pp. 753-756.
- D. Rybach, C. Gollan, R. Schluter, and H. Ney, "Audio Segmentation for Speech Recognition using Segment Features," in Proc. ICASSP, Apr. 2009, pp. 4197-4200.


Speaker Clustering: Introduction

- Objective: group speech segments into clusters for adaptation.
- Segments from the same or similar speakers should be grouped.
- Requirement: clustering should give the lowest possible adaptation WER.

BIC method:
- Uses acoustic features only.
- Greedy, bottom-up clustering.
- BIC used to control the number of clusters.


Speaker Clustering: BIC Clustering

- Greedy, bottom-up BIC clustering method [Chen+@IBM 1998].
- Each cluster is modeled using a single Gaussian with full covariance.
- BIC criterion, where C_K is a clustering with K clusters, x_1^T the input data points, n_k the number of points in cluster k, and D the feature dimension:

    BIC(C_K, x_1^T) = \sum_{k=1}^{K} \Big( -\frac{1}{2} n_k \log|\Sigma_k| \Big) - \frac{1}{2} K \Big( D + \frac{1}{2} D(D+1) \Big) \log T.

- Algorithm (see the sketch below):
  0. Start with one cluster for each segment.
  1. Try all possible pairwise cluster merges.
  2. Merge the pair that gives the largest increase in BIC(C_K, x_1^T).
  3. Iterate from 1, until BIC(C_K, x_1^T) starts to decrease.
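A minimal sketch of this greedy procedure (illustrative only: segments are assumed to be NumPy frame matrices, the BIC is re-evaluated from scratch for each candidate clustering rather than cached as on the next slide, and the penalty weight λ is already included as discussed there):

```python
import numpy as np

def cluster_bic(segments, lam=1.0):
    """Greedy bottom-up BIC clustering; segments: list of (T_i x D) arrays."""
    D = segments[0].shape[1]
    T = sum(len(s) for s in segments)

    def bic(clusters):
        # likelihood term: sum_k -1/2 n_k log|Sigma_k|, single full-cov Gaussian per cluster
        ll = sum(-0.5 * len(np.vstack(c))
                 * np.linalg.slogdet(np.cov(np.vstack(c), rowvar=False, bias=True))[1]
                 for c in clusters)
        penalty = 0.5 * lam * len(clusters) * (D + 0.5 * D * (D + 1)) * np.log(T)
        return ll - penalty

    clusters = [[s] for s in segments]            # step 0: one cluster per segment
    score = bic(clusters)
    while len(clusters) > 1:
        # step 1: evaluate all pairwise merges, keep the best one
        def merged_clustering(ij):
            return [c for k, c in enumerate(clusters) if k not in ij] \
                   + [clusters[ij[0]] + clusters[ij[1]]]
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        best = max(pairs, key=lambda ij: bic(merged_clustering(ij)))
        merged = merged_clustering(best)
        new_score = bic(merged)
        if new_score <= score:                    # step 3: stop when BIC decreases
            break
        clusters, score = merged, new_score       # step 2: commit the best merge
    return clusters
```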


Speaker Clustering: BIC Clustering (ctd.)

- In practice, modifications of the BIC criterion can be used [Zhou & Hansen 2000].
- The criterion takes the form: likelihood term + complexity term.
- As in the segmentation section, multiply the complexity term with a penalty weight λ.
- Tune λ by optimizing the WER on a development set.
- In a practical implementation, a K × K matrix is used to cache the change in BIC for each merge pair between iterations.
- It is also possible to use BIC only as stop criterion, with a distance measure to choose the pair-wise merging of clusters.


Speaker Clustering: References

- S. S. Chen and P. S. Gopalakrishnan, Clustering via the Bayesian Information Criterion with Applications in Speech Recognition, ICASSP 1998.
- B. Zhou and J. H. L. Hansen, Unsupervised audio stream segmentation and clustering via the Bayesian information criterion, ICSLP 2000.


Vocal Tract Length Normalization (VTLN): Introduction

What is the vocal tract?
- The dimensions of the vocal tract vary between speakers.
- Different vocal tract lengths give different formant positions ⇒ systematic differences in features between speakers.
- Goal: warp the frequency spectrum to match spectra from different speakers for the same phoneme.


Vocal Tract Length Normalization (VTLN): Parameterization

- VTLN warping is performed as part of the acoustic feature extraction.
- Warping is performed on the spectrum, before Mel-warping.
- Typically: warp the filter bank centers/widths (as in Mel-warping).

Linear warping:
- Single (varying) parameter α, termed the warping factor.
- To get a bijective mapping, a knee parameter ω0 is often used ⇒ piecewise linear warping.

[Figure: piecewise linear warping functions ω → ω̃ on [0, π] for α > 1, α = 1, and α < 1, with knee at ω0.]


Vocal Tract Length Normalization (VTLN): Estimation

Estimation from formant position:
- Compute the mean position of (some) formant for the speaker.
- Derive the warping factor from the difference to the global formant position.
- In [Eide+ 96], the third formant is used; cf. the paper for details.

Maximum likelihood (ML) estimation [Lee+@AT&T 1996]:

    \hat{\alpha} = \arg\max_\alpha P\big(x(\alpha)_1^T \mid \alpha, w_1^N\big)

- Grid search over warping factors, e.g. [0.8 : 0.1 : 1.2]; a maximization problem (see the sketch below).
- In practice: use an alignment (or other state distribution) obtained by an existing acoustic model.
- Stable due to the single parameter; can be used speaker-wise or segment-wise.


Vocal Tract Length Normalization (VTLN): Variants

Fast VTLN [Welling+@RWTH 1999]:
- ML estimation requires a transcription.
- For use in recognition ⇒ two pass, unsupervised.
- Goal: eliminate the first recognition pass:
  - Train a (GMM) warping factor classifier on the acoustic training set.
  - The classifier uses only acoustic features (similar to clustering).
  - In recognition, estimate the warping factor using the classifier.

Effect of the Jacobian determinant:
- The transformation (= warping) varies in the optimization.
- Thus, the Jacobian determinant should be included.
- Solution: VTLN as a linear transform, see [Pitz@RWTH 2005] and [Umesh+@Indian Inst. Tech. 2005].


Vocal Tract Length Normalization (VTLN): References

- A. Andreou, T. Kamm, and J. Cohen, Experiments in Vocal Tract Normalization, CAIP Workshop 1994.
- E. Eide and H. Gish, A Parametric Approach to Vocal Tract Length Normalization, ICASSP 1996.
- L. Lee and R. C. Rose, Speaker normalization using efficient frequency warping procedures, ICASSP 1996.
- L. Welling, S. Kanthak, and H. Ney, Improved Methods for Vocal Tract Normalization, ICASSP 1999.
- M. Pitz, Investigations on Linear Transformations for Speaker Adaptation and Normalization, PhD Thesis, Aachen, Germany, March 2005.
- S. Umesh, A. Zolnay, and H. Ney, Implementing Frequency-Warping and VTLN Through Linear Transformation of Conventional MFCC, Interspeech 2005.


Maximum Likelihood Linear Regression (MLLR): Introduction

- Goal: approximate a speaker specific acoustic model using affine transforms of the speaker independent model's Gaussian parameters [Leggetter+@Cambridge 1995].
- Parameterizations (r is the speaker/condition):
  - Mean adaptation: µ'_{rs} = A_{rs} µ_s + b_{rs}.
  - Covariance adaptation: Σ'_{rs} = H_{rs} Σ_s H_{rs}^T.
  - Often only mean adaptation is used; Σ-adaptation is not discussed here.
- Typically estimated using the maximum likelihood criterion.
- State tying: A_{rs}, b_{rs}, and H_{rs} are either global or depend on sets of phonetically related states.


Maximum Likelihood Linear Regression (MLLR): Estimation

- Rewrite the MLLR formulation:

    \xi_s = [1\; \mu_s^T]^T, \quad W_{rs} = [b_{rs}\; A_{rs}], \quad \mu'_{rs} = W_{rs} \xi_s.

- Assume adaptation data (x_1^T, w_1^N) from a single speaker; drop the index r.
- Depending on the amount of adaptation data, tying is applied.
  - For the derivation, we assume a single pooled adaptation matrix W, i.e. complete tying.
  - All other cases of tying can be derived from the pooled case by dividing the adaptation data according to the Viterbi alignment.
- Parameter estimation using the maximum likelihood criterion:

    \hat{W} = \arg\max_W P(x_1^T \mid w_1^N, \theta, W).

- Optimization using expectation maximization (EM) gives:

    W_{ML} = \arg\min_W \sum_s \sum_{t=1}^{T} \gamma_s(t) \, (x_t - W \xi_s)^T \Sigma_s^{-1} (x_t - W \xi_s)


Maximum Likelihood Linear Regression (MLLR): Estimation: General Case - Full Covariance

Take the derivative w.r.t. the adaptation matrix W and set it to zero:

    \frac{\partial}{\partial W} \sum_s \sum_{t=1}^{T} \gamma_s(t) (x_t - W\xi_s)^T \Sigma_s^{-1} (x_t - W\xi_s)
    \propto \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} \big[ x_t \xi_s^T - W \xi_s \xi_s^T \big] \overset{!}{=} 0

    \Leftrightarrow \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} W \xi_s \xi_s^T = \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} x_t \xi_s^T


Maximum Likelihood Linear Regression (MLLR): Estimation: Case with Diagonal Covariances

Now assume diagonal covariances Σ_{s,ij} = σ²_{s,i} δ_{ij}.
Component-wise representation and rearrangement:

    \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} W \xi_s \xi_s^T = \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} x_t \xi_s^T

    \Leftrightarrow \sum_n W_{in} \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \frac{1}{\sigma^2_{s,i}} \xi_{s,n} \xi_{s,j}}_{=: G^{(i)}_{nj}}
    = \underbrace{\sum_s \frac{1}{\sigma^2_{s,i}} \sum_{t=1}^{T} \gamma_s(t) \, x_{t,i} \, \xi_{s,j}}_{=: Z_{ij}} \quad \forall i, j

    \Leftrightarrow \big[ W \cdot G^{(i)} \big]_{ij} = Z_{ij} \quad \forall i, j \qquad \Big|\; \cdot \big(G^{(i)\,-1}\big)_{jk} \;\forall k,\; \textstyle\sum_j

    \Leftrightarrow W_{ik} = \big( Z \cdot G^{(i)\,-1} \big)_{ik} \quad \forall i, k


Maximum Likelihood Linear Regression (MLLR): Estimation

Therefore, assuming diagonal covariances, the adaptation matrix can be obtained row-wise, i.e. for each row i separately:

    W_{ij} = \big( Z \, G^{(i)\,-1} \big)_{ij} \quad \forall i, j

with accumulators

    G^{(i)} = \sum_s \frac{1}{\sigma^2_{s,i}} \, \xi_s \xi_s^T \sum_{t=1}^{T} \gamma_s(t),

    Z = \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma_s^{-1} x_t \xi_s^T \quad \text{with} \quad \Sigma_s = \mathrm{diag}(\sigma^2_{s,1}, \ldots, \sigma^2_{s,D}).

- Closed form solution for the adaptation matrix (see the sketch below).
- Accumulator storage requirements: O(D³) (for the G^{(i)}).
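A compact NumPy sketch of these accumulators and the row-wise solve, under the slide's assumptions (diagonal covariances, one pooled transform). The per-state posterior sums `gammas[s]` = Σ_t γ_s(t) and posterior-weighted feature sums `xsum[s]` = Σ_t γ_s(t) x_t are assumed to be precomputed from the alignment:

```python
import numpy as np

def estimate_mllr(means, variances, gammas, xsum):
    """Row-wise closed-form MLLR mean transform W = [b A].

    means:     S x D Gaussian means mu_s.
    variances: S x D diagonal covariances sigma^2_s.
    gammas:    length-S vector, sum_t gamma_s(t).
    xsum:      S x D matrix, sum_t gamma_s(t) * x_t.
    Returns W of shape D x (D+1) with mu'_s = W [1 mu_s^T]^T.
    """
    S, D = means.shape
    xi = np.hstack([np.ones((S, 1)), means])           # extended means [1 mu_s^T]^T
    # Z_ij = sum_s (1/sigma^2_{s,i}) (sum_t gamma_s(t) x_{t,i}) xi_{s,j}
    Z = np.einsum('sd,se->de', xsum / variances, xi)   # D x (D+1)
    W = np.empty((D, D + 1))
    for i in range(D):
        # G^(i) = sum_s (gamma_s / sigma^2_{s,i}) xi_s xi_s^T
        w = gammas / variances[:, i]
        G_i = np.einsum('s,se,sf->ef', w, xi, xi)      # (D+1) x (D+1), symmetric
        W[i] = np.linalg.solve(G_i, Z[i])              # row i: W_i = Z_i G^(i)^-1
    return W
```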


Maximum Likelihood Linear Regression (MLLR): Tree Based State Tying

- Described in [Leggetter+@Cambridge 1995].
- Basis: phonetic classification tree, e.g. from AM state tying.
- Each node represents a regression class, i.e. a set of states.
- Typically silence has a separate class.
- Dynamically adjusted regression classes; algorithm (see the sketch below):
  - Accumulate statistics in each leaf node.
  - Additionally accumulate an observation count.
  - Combine (= add) the accumulators of all child nodes and store the result in the parent node.
  - Continue merging until the root node is reached.
  - For each leaf node:
    1. If the observation count is larger than a threshold: use the accumulator.
    2. Else go to the parent node and repeat from 1.
- The algorithm allows regression classes with too few observations to still be adapted, by combining data from other classes.
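A minimal sketch of the walk-up-the-tree fallback (the `Node` data structure with `parent`, `acc`, and `count` fields is an illustrative assumption, not from the slides):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    acc: object                  # accumulated MLLR statistics (e.g. G^(i), Z)
    count: float                 # accumulated observation count
    parent: Optional["Node"] = None

def accumulator_for_leaf(leaf: Node, min_count: float):
    """Return the statistics a leaf regression class is adapted with:
    its own accumulator if it has enough observations, otherwise the first
    ancestor whose combined (child-summed) count exceeds the threshold."""
    node = leaf
    while node.count < min_count and node.parent is not None:
        node = node.parent       # fall back to a coarser regression class
    return node.acc
```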


Maximum Likelihood Linear Regression (MLLR): Variants

- Standard MLLR uses the same number of matrices A and offsets b.
- But it is also possible to have many more offsets than matrices [Digalakis+@Johns Hopkins Workshop 1999].
  - Allows a finer granularity in adjusting the number of parameters.
  - With no transformation matrices, only offsets: sometimes known as shift MLLR or bias MLLR.
  - Estimation is very similar to standard MLLR, but A and b are computed separately.
  - The method for A is formally the same as for the matrix W in MLLR.
  - The offset b_c is given by the following formula; it depends on the corresponding matrix A_{c'(c)}:

      b_c = \frac{1}{T} \sum_{\substack{t=1 \\ s_t \in c}}^{T} \big( x_t - A_{c'(c)} \, \mu_{s_t} \big).


Maximum Likelihood Linear Regression (MLLR): References

- C. J. Leggetter and P. C. Woodland, Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models, CSL, April 1995.
- C. J. Leggetter and P. C. Woodland, Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression, ARPA SLT Workshop 1995.
- V. Digalakis, S. Berkowitz, E. Bocchieri, C. Boulis, W. Byrne, H. Collier, A. Corduneanu, A. Kannan, S. Khudanpur, A. Sankar, Rapid Speech Recognizer Adaptation to New Speakers, ICASSP 1999.


Feature Space MLLR (fMLLR): Introduction

- Goal: normalize the features to better fit the speaker [Digalakis+@SRI 1995].
- Affine transform of the features: x'_t = A_{rs} x_t + b_{rs}.
- Note: as a transformation of a probability density, the Jacobian determinant is needed (see the scoring sketch below):

    P(x'_t) = N(x'_t \mid \mu_s, \Sigma_s) \;\Leftrightarrow\; P(x_t) = |A_{rs}| \, N(A_{rs} x_t + b_{rs} \mid \mu_s, \Sigma_s).

- Advantages:
  - Pure feature transform ⇒ no changes to the core speech recognition decoder needed.
  - Speaker adaptive training is easy to implement.
- State tying: a single global matrix per speaker.
- But: it can also be seen as a model transform ⇒ regression classes are possible with changes to the decoder.
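A minimal sketch of scoring one frame under an fMLLR transform, with the Jacobian term made explicit in the log domain (illustrative only; diagonal Gaussians assumed):

```python
import numpy as np

def fmllr_log_likelihood(x, A, b, mean, var):
    """log P(x_t) = log|det A| + log N(A x_t + b | mu_s, Sigma_s).

    x, b, mean, var: 1-D arrays of dimension D (var = diagonal of Sigma_s).
    A: D x D fMLLR matrix.
    """
    x_prime = A @ x + b                                  # normalized feature
    log_jac = np.linalg.slogdet(A)[1]                    # log|det A|
    log_gauss = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (x_prime - mean) ** 2 / var
    )
    return log_jac + log_gauss
```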


Feature Space MLLR (fMLLR): Equivalence of Model and Feature Transform

MLLR with the same matrix for the mean and covariance transform:

    P(x_t \mid A_{rs}, b_{rs}, \mu_s, \Sigma_s)
    = \frac{1}{\sqrt{\det(2\pi A_{rs} \Sigma_s A_{rs}^T)}} \, e^{-\frac{1}{2}(x_t - A_{rs}\mu_s - b_{rs})^T (A_{rs} \Sigma_s A_{rs}^T)^{-1} (x_t - A_{rs}\mu_s - b_{rs})}
    = \frac{1}{\det(A_{rs}) \sqrt{\det(2\pi\Sigma_s)}} \, e^{-\frac{1}{2}(A'_{rs} x_t + b'_{rs} - \mu_s)^T \Sigma_s^{-1} (A'_{rs} x_t + b'_{rs} - \mu_s)}
    = \det(A'_{rs}) \, N(A'_{rs} x_t + b'_{rs} \mid \mu_s, \Sigma_s),

where A'_{rs} = A_{rs}^{-1} and b'_{rs} = -A_{rs}^{-1} b_{rs}.

- This is exactly the formula for fMLLR:
  - A Gaussian distributed random variable under an affine transform ⇒ Gaussian also in the new coordinate system.
  - A well known fact from mathematical statistics.


Feature Space MLLR (fMLLR): Maximum Likelihood Estimation

As for MLLR, rewrite in linear form, drop the speaker index r, and assume complete tying:

    x'_t = W \xi_t \quad \text{with} \quad \xi_t = [1\; x_t^T]^T \quad \text{and} \quad W = [b\; A].

Maximum likelihood EM equation:

    W_{ML} = \arg\max_W \Big\{ T \log \det A - \frac{1}{2} \sum_s \sum_{t=1}^{T} \gamma_s(t) \, (W\xi_t - \mu_s)^T \Sigma_s^{-1} (W\xi_t - \mu_s) \Big\}

    = \arg\max_W \Big\{ T \log \det A + \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \mathrm{Trace}\big( W \xi_t \mu_s^T \Sigma_s^{-1} \big)
      - \frac{1}{2} \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \mathrm{Trace}\big( \Sigma_s^{-1} W \xi_t \xi_t^T W^T \big) \Big\}.


Feature Space MLLR (fMLLR): Maximum Likelihood Estimation

Rewrite in terms of sufficient statistics:

    W_{ML} = \arg\max_W \Big\{ T \log \det A
      + \mathrm{Trace}\Big( W \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \, \xi_t \mu_s^T \Sigma_s^{-1}}_{=: Z} \Big)
      - \frac{1}{2} \sum_{i,j=1}^{D} \sum_{k,l=1}^{D+1} W_{jk} W_{il} \cdot \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \, \Sigma^{-1}_{s,ij} \, \xi_{t,k} \xi_{t,l}}_{=: H_{ijkl} \text{ (4th order tensor)}} \Big\}

For diagonal covariances this simplifies to:

    W_{ML} = \arg\max_W \Big\{ T \log \det A
      + \sum_{i=1}^{D} \sum_{k=1}^{D+1} W_{ik} \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \, \xi_{t,k} \, \mu_{s,i} \frac{1}{\sigma^2_{s,i}}}_{=: Z_{ki}}
      - \frac{1}{2} \sum_{i=1}^{D} \sum_{k,l=1}^{D+1} W_{ik} W_{il} \cdot \underbrace{\sum_s \sum_{t=1}^{T} \gamma_s(t) \frac{1}{\sigma^2_{s,i}} \xi_{t,k} \xi_{t,l}}_{=: G^{(i)}_{kl} \text{ (one matrix for each } i)} \Big\}


Feature Space MLLR (fMLLR): Maximum Likelihood Estimation

Case of diagonal covariances Σ_{s,ij} = σ²_{s,i} δ_{ij}:

    W_{ML} = \arg\max_W \Big\{ T \log \det A + \sum_{i=1}^{D} \sum_{k=1}^{D+1} W_{ik} Z_{ki}
      - \frac{1}{2} \sum_{i=1}^{D} \sum_{k,l=1}^{D+1} W_{ik} W_{il} \, G^{(i)}_{kl} \Big\},

with sufficient statistics:

    G^{(i)} = \sum_s \sum_{t=1}^{T} \frac{\gamma_s(t)}{\sigma^2_{s,i}} \, \xi_t \xi_t^T, \qquad
    Z = \sum_s \sum_{t=1}^{T} \gamma_s(t) \, \xi_t \mu_s^T \Sigma_s^{-1}.

- No closed form solution for W; iterative methods are needed.
- Row-wise iterative solution in [Gales+@Cambridge 1998], sketched below.
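The row-wise update exploits that det A is linear in each row of A (cofactor expansion), so each row has a closed-form update given the others. A hedged sketch of one update sweep, following the construction summarized above (simplified: it restricts itself to positive det A, and the root of the resulting quadratic is chosen by evaluating the objective for both candidates):

```python
import numpy as np

def fmllr_sweep(W, G, Z, T):
    """One row-wise update sweep for fMLLR, W = [b A] of shape D x (D+1).

    G: list of D matrices G^(i), each (D+1) x (D+1).
    Z: (D+1) x D statistics, objective term sum_{i,k} W_ik Z_ki.
    T: number of frames.
    """
    D = W.shape[0]
    for i in range(D):
        A = W[:, 1:]
        # Extended cofactor row p: det A is linear in row i of A,
        # with cof_ij = det(A) * inv(A)[j, i]; p[0] = 0 for the bias column.
        p = np.zeros(D + 1)
        p[1:] = np.linalg.det(A) * np.linalg.inv(A)[:, i]
        Ginv = np.linalg.inv(G[i])
        k_i = Z[:, i]                           # linear statistics for row i
        beta = p @ Ginv @ p
        gamma = k_i @ Ginv @ p
        # The stationarity condition gives w_i = (alpha p + k_i) G^(i)^-1
        # with alpha solving alpha^2 beta + alpha gamma - T = 0.
        disc = np.sqrt(gamma ** 2 + 4.0 * beta * T)
        candidates = [(-gamma + disc) / (2.0 * beta),
                      (-gamma - disc) / (2.0 * beta)]

        def objective(row):
            A_new = A.copy()
            A_new[i] = row[1:]
            sign, logdet = np.linalg.slogdet(A_new)
            if sign <= 0:
                return -np.inf                  # keep det A positive in this sketch
            return T * logdet + row @ k_i - 0.5 * row @ G[i] @ row

        rows = [(alpha * p + k_i) @ Ginv for alpha in candidates]
        W[i] = max(rows, key=objective)
    return W
```

Several sweeps are run until the objective converges; each sweep only needs the statistics G^{(i)} and Z from the slide above.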


Feature Space MLLR (fMLLR): Dimension Reducing Adaptation

Dimension reducing fMLLR [Loof+@RWTH 2007]:
- Goal: utilize more context information from the features.
- Replace the LDA matrix (instead of applying fMLLR after the LDA).
- Non-quadratic matrix ⇒ the derivation as for fMLLR is not possible.
- Instead of ML, use the ratio of the acoustic model likelihood and a global competing model likelihood.
- Competing model: a single Gaussian trained on the adaptation data,
  parameters: µ', Σ' (in the transformed feature space).

In essence: how much better do the transformed features fit the target acoustic model than they fit a global single Gaussian?


Feature Space MLLR (fMLLR): Dimension Reducing fMLLR

Consider the adaptation data likelihood for the global single Gaussian:

    \log P(W_r x_1^T \mid \mu', \Sigma') = \sum_{t=1}^{T} \log N(W_r x_t \mid \mu', \Sigma')
    = \sum_{t=1}^{T} \Big[ -\frac{1}{2} \log \det(2\pi\Sigma') - \frac{1}{2} (W_r x_t - \mu')^T \Sigma'^{-1} (W_r x_t - \mu') \Big]
    = -\frac{T}{2} \log\big[(2\pi)^D \det(\Sigma')\big]
      - \frac{1}{2} \underbrace{\mathrm{Trace}\Big[ \sum_{t=1}^{T} (W_r x_t - \mu')(W_r x_t - \mu')^T \cdot \Sigma'^{-1} \Big]}_{= \mathrm{Trace}(T \Sigma' \Sigma'^{-1}) = T \cdot \mathrm{Trace}\, I_D = TD}
    = -\frac{T}{2} \log \det \Sigma' - \frac{TD}{2} \log 2\pi - \frac{TD}{2}


Feature Space MLLR (fMLLR): Dimension Reducing fMLLR

Maximize the likelihood ratio:

    \hat{W}_r = \arg\max_{W_r} \Big\{ -\log P(W_r x_1^T \mid \mu', \Sigma') + \log P(W_r x_1^T \mid M, w_1^N) \Big\}

with

    \log P(W_r x_1^T \mid \mu', \Sigma') = -\frac{T}{2} \log \det \Sigma' - \frac{TD}{2} \log 2\pi - \frac{TD}{2}
    = -\frac{T}{2} \log \det\big[ W_r \Sigma W_r^T \big] + \mathrm{const}(W_r)

with the covariance Σ of the untransformed features: Σ' = W_r Σ W_r^T.

Substitute into the maximum likelihood ratio:

    \hat{W}_r = \arg\max_{W_r} \Big\{ \frac{T}{2} \log \det(W_r \Sigma W_r^T) + \log P(W_r x_1^T \mid M, w_1^N) \Big\}

- The further derivation is similar to regular fMLLR; details in [Loof+@RWTH 2007].


Feature Space MLLR (fMLLR): References

- V. Digalakis, D. Rtischev, and L. Neumeyer, Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures, IEEE SAP, Sep. 1995.
- M. J. F. Gales, Maximum Likelihood Linear Transformations for HMM-based Speech Recognition, CSL, Apr. 1998.
- J. Loof, R. Schluter, and H. Ney, Efficient Estimation of Speaker-specific Projecting Feature Transforms, Interspeech 2007.


Speaker Adaptive Training (SAT): Introduction

- Adaptation compensates for speaker differences in recognition.
- But: we also have speaker differences in the training corpus.
- Question: how can we compensate for both these differences?

Speaker-adaptive normalization:
- Apply the transform on the training data also.
- Model training using the transformed acoustic features.

Speaker-adaptive adaptation:
- The interaction between model and transform requires simultaneous training of model and transform parameters.
- The model cannot simply be retrained → modified acoustic model training is necessary.


Speaker Adaptive Training: VTLN and fMLLR

Training procedure:
1. Estimate the speaker independent acoustic model M_SI as normal.
2. Decide on / estimate a target model M_T for adaptation estimation.
3. Use M_T to estimate VTLN or fMLLR adaptation supervised for each speaker in training.
4. Transform the features using the estimated adaptation.
5. Train the speaker adaptive model M_SAT on the transformed features, starting from M_SI.

Recognition procedure (see the sketch below):
1. First pass recognition using the speaker independent acoustic model M_SI.
2. Estimate VTLN or fMLLR adaptation unsupervised using the target model M_T.
3. Transform the features using the estimated adaptation.
4. Second pass recognition using the speaker adaptive model M_SAT.
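The two procedures mirror each other; a schematic sketch of the recognition side (purely illustrative: `recognize`, `estimate_transform`, and `apply_transform` are placeholders for the decoder and estimation routines, not functions from the slides):

```python
def sat_recognize(features_per_speaker, M_SI, M_T, M_SAT,
                  recognize, estimate_transform, apply_transform):
    """Two-pass SAT recognition: first pass -> unsupervised transform -> second pass."""
    results = {}
    for speaker, feats in features_per_speaker.items():
        hyp1 = recognize(M_SI, feats)                    # 1. first pass, SI model
        W = estimate_transform(M_T, feats, hyp1)         # 2. unsupervised estimation
        feats_norm = apply_transform(W, feats)           # 3. normalize features
        results[speaker] = recognize(M_SAT, feats_norm)  # 4. second pass, SAT model
    return results
```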


Speaker Adaptive Training: VTLN and fMLLR

- Originally, SAT was estimated iteratively, where in each iteration both the acoustic model and the adaptation parameters were reestimated [Anastasakos+@BBN 1996].
- Problem: interaction between acoustic model complexity and adaptation.
- A simple, single Gaussian target model is known to lower the WER both for VTLN [Welling+@RWTH 1999] and fMLLR [Stemmer+@FBK 2005].


Speaker Adaptive Training: MLLR

- Iterative re-estimation of MLLR, Gaussian means µ_s, and Gaussian covariances Σ_s [Anastasakos+@BBN 1996].
- MLLR parameters A_r and b_r are estimated as usual.
- Re-estimation of the Gaussian means using maximum likelihood:

    \hat{\mu}_s = \Big( \sum_{t}^{T} \gamma_s(t) \, A_{r(t)}^T \Sigma_s^{-1} A_{r(t)} \Big)^{-1} \Big( \sum_{t}^{T} \gamma_s(t) \, A_{r(t)}^T \Sigma_s^{-1} \big( x_t - b_{r(t)} \big) \Big)

    \hat{\Sigma}_s: same as for standard ML, but using the SAT mean.

- Drawback: requires a D × D accumulator for each mean µ_s.
- 800k Gaussians, D = 45, 64-bit float ⇒ 12 GB of accumulators.


Speaker Adaptive Training: References

- T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, A Compact Model for Speaker-Adaptive Training, ICSLP, Philadelphia, PA, USA, 1996.
- G. Stemmer, F. Brugnara, and D. Giuliani, Adaptive Training Using Simple Target Models, ICASSP, Philadelphia, PA, USA, 2005.
- L. Welling, S. Kanthak, and H. Ney, Improved Methods for Vocal Tract Normalization, ICASSP 1999.


Examples and Results: Experimental Setup

All examples use RWTH TC-STAR 2006 English ASR systems:
- Training corpus: 88h.
- Development and test sets: 3.2h each.
- MFCC + voicedness feature.
- One pass, fast VTLN.
- LDA dimension: 45 × 153.
- ML trained acoustic models, approx. 900k Gaussians.
- Two pass system: fMLLR and MLLR unsupervised using the first best output.
- Note: some differences between systems; a direct WER comparison between slides is not always possible.


Examples and Results: Basic Methods - Progressive Improvement

  WER [%]           Dev    Eval
  Baseline          18.5      -
  + VTLN + voice    17.2   14.4
  + fMLLR           15.7      -
  + MLLR            14.0   11.8

- Different adaptation methods can be combined.
- Here: each method gives an additional improvement; this is not always the case.


Examples and Results: Different Combinations

  fMLLR   MLLR   SAT   Proj.     WER [%]
                                 Dev    Eval
  no      no     no    no        16.4   13.5
  yes     no     no    no        15.1   11.9
  yes     no     no    yes       15.0   11.7
  yes     no     yes   no        14.4   11.0
  yes     no     yes   yes       14.4   10.8
  yes     yes    no    no        14.0   11.0
  yes     yes    no    yes       13.8   11.0
  yes     yes    yes   no        13.6   10.6
  yes     yes    yes   yes       13.3   10.4

  (Proj.: dimension reducing (projecting) fMLLR transform.)

- Here: the baseline includes VTLN.
- fMLLR improvement: 8-12% relative.
- Additional improvement from MLLR: 7-8% relative (from MLLR alone more).
- Improvement from SAT: 5-8% relative.
- Total improvement (over VTLN): 19-23% relative.


Examples and Results: Use of Confidence Measures

- Baseline: fMLLR SAT using dimension reducing transforms.
- Confidence measure for adaptation: instead of using the best recognized transcription from the first pass, use all word graph hypotheses weighted by a confidence measure.
- Goal: evaluate different model adaptation schemes, with and without confidence measures.

                       Dev                Eval
                 No Conf   Conf     No Conf   Conf
  1st pass         16.3      -        13.2      -
  fMLLRP SAT       14.2      -        10.5      -
  MAP              14.1    13.0       10.5    9.8
  MLLR             13.2    12.9       10.1    9.8
  Shift-MLLR       13.2    12.7       10.2    9.6


Summary

- Speaker adaptation is important to obtain good performance in speaker independent speech recognition.
- The approach and parameterization usually depend on:
  - supervision (yes/no),
  - recognition mode (on-/offline),
  - amount of adaptation data.
- Feature and model based approaches usually are additive.
- Unsupervised adaptation: 20% relative word error rate improvement is possible.


Outline
0. Lehrstuhl fur Informatik 6
1. Large Vocabulary Speech Recognition
2. Search using Lexical Pronunciation Tree
3. Across-Word Models
4. Word graphs and Applications
5. Time Conditioned Search
6. Normalization and Adaptation
7. Discriminative Training
   7.1 Introduction
   7.2 Overview
   7.3 Training Criteria
   7.4 Parameter Optimization
   7.5 Efficient Calculation of Discriminative Statistics
   7.6 Generalisation to Log-Linear Modeling
   7.7 Convex Optimization
   7.8 Incorporation of Margin Concept
   7.9 Conclusions
   7.10 References
   7.11 Speech Tasks: Corpus Statistics & Setups


Introduction: Motivation

Aim of discriminative methods: improve class separation.
- Standard maximum likelihood (ML) training: maximize the reference class conditional p_θ(x|c).
- Maximum mutual information (MMI) training: maximize the reference class posterior

    p_\theta(c \mid x) = \frac{p(c) \cdot p_\theta(x \mid c)}{\sum_{c'} p(c') \cdot p_\theta(x \mid c')}

Where is the difference?
- Ideally: (almost) no difference! In the case of infinite training data and correct model assumptions, the true probabilities are obtained in both cases. They lead to equal decisions, provided the class prior p(c) is known. (Proof: model free optimization.)
- ML training: classes are handled independently, therefore decision boundaries are not considered explicitly in training.
- In MMI training, and generally in discriminative training, the reference class directly competes against all other classes; decision boundaries become relevant in training.


Introduction: Motivation

- In practice, model assumptions are incorrect and training data is limited. Here discriminative training can be beneficial.

Example: a two-class problem (with pooled covariance matrix)

[Figure: two scatter plots of classes −1 and +1 with ML and MMI decision boundaries; without the outlier the boundaries coincide, with an outlier in class −1 the ML and MMI boundaries differ.]

- Clearly, in case of ML training, the outlier deteriorates the decision boundary, whereas MMI training registers the minor importance of the outlier.

- MMI captures the decision boundary, although the model assumption (pooled covariance) does not fit in the second case (a sketch of this example follows below).
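To make this concrete, here is a minimal, self-contained Python sketch of such a two-class setup (the data points, learning rate, and iteration count are illustrative assumptions, not the values behind the figure above). ML fits class means with a pooled covariance; MMI, which for a pooled covariance reduces to logistic regression on the class posterior, is optimized by plain gradient ascent:

import numpy as np

# Toy 2D two-class data; the far-left point in class -1 acts as an outlier.
X = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5], [-5.0, 2.0],   # class -1
              [2.0, 0.0], [2.5, 1.0], [3.0, 0.5]])               # class +1
y = np.array([-1, -1, -1, -1, 1, 1, 1])

# ML: class means + pooled covariance give a linear discriminant.
mu = {c: X[y == c].mean(axis=0) for c in (-1, 1)}
Xc = np.vstack([X[y == c] - mu[c] for c in (-1, 1)])
Sigma = Xc.T @ Xc / len(X)                      # pooled covariance
w_ml = np.linalg.inv(Sigma) @ (mu[1] - mu[-1])  # normal of the ML boundary

# MMI: with a pooled covariance the posterior is logistic in x, so maximizing
# the posterior criterion is logistic regression; plain gradient ascent.
w, b = np.zeros(2), 0.0
t = (y + 1) / 2                                 # targets in {0, 1}
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # p(class +1 | x)
    w += 0.1 * (X.T @ (t - p)) / len(X)
    b += 0.1 * float(np.mean(t - p))

print("ML  boundary normal:", w_ml / np.linalg.norm(w_ml))
print("MMI boundary normal:", w / np.linalg.norm(w))

With the outlier included, the ML boundary normal tilts toward the outlier, whereas the MMI normal stays close to the direction separating the bulk of the two classes.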


Overview

Questions:

- Which discriminative criterion to take?

- Relation to decision rule and evaluation measure?

- How to optimize the criterion?

- Efficiency?

- Influence of modeling?

- Uniqueness of the solution?

- Generalization?

Bottom line:

- How to utilize the available training material to obtain optimum recognition performance?


Training Criteria: Notation

X_r          sequence x_{r,1}, x_{r,2}, ..., x_{r,T_r} of acoustic observation vectors
W_r          spoken word sequence w_{r,1}, w_{r,2}, ..., w_{r,N_r} in training utterance r
W            any word sequence
p(W)         language model probability, supposed to be given
p_θ(X_r|W)   acoustic emission probability / acoustic model
θ            set of all parameters of the acoustic model
M_r          set of competing word sequences to be considered
f            smoothing function


Training Criteria: General Approach

- input: training data and stochastic model p_θ(X,W) with free model parameters θ

- output: "optimal" model parameters θ̂; optimality defined via a training criterion:

  θ̂ := argmax_θ F(θ)

Unified training criterion [Macherey+ 2005]:

  F(θ) = Σ_{r=1}^R f( log( Σ_W p(W) p_θ(X_r|W) · A(W,W_r) / Σ_{W∈M_r} p(W) p_θ(X_r|W) ) )

- covers maximum mutual information (MMI), minimum classification error (MCE), and minimum phone/word error (MPE/MWE)

- control set M_r of competing hypotheses, cost function, smoothing function, and scaling of models (not shown)


Training Criteria: Probabilistic Training Criteria

Objective

- find a good estimate of the probability distribution

- optimality regarding error via Bayes' decoding (asymptotic w.r.t. the amount of training data)


Training Criteria: Maximum Likelihood (ML)

- optimization of the joint probability:

  argmax_θ Σ_r log( p(W_r) p_θ(X_r|W_r) ) = argmax_θ Σ_r log p_θ(X_r|W_r)

- Tutorial on HMMs: [Rabiner 1989].

- Maximization of the probability of the reference word sequences (classes).

- Model correctness is important.

- HMM: maximization for each class separately.

- Neglects competing classes.

- Expectation-maximization: local convergence guaranteed.

- Estimation efficient, easily parallelizable.


Training Criteria: Maximum Mutual Information (MMI)

- optimization of the conditional probability:

  argmax_θ Σ_r log p_θ(W_r|X_r) = argmax_θ Σ_r log( p(W_r) p_θ(X_r|W_r) / Σ_V p(V) p_θ(X_r|V) )

- Considers competing classes and therefore decision boundaries.
- Necessitates a set of competing classes on the training data.
- Optimization for standard modeling (HMMs, mixture distributions): only gradient descent or similar.
- Optimization using log-linear modeling: convex problem.
- First application of MMI for ASR using discrete HMMs [Bahl+ 1986]: 2000 isolated words, 18% rel. improvement in word error rate.
- MMI for discrete and continuous probability densities [Brown 1987]: isolated E-set letters, 18% rel. improvement in recognition rate.
- MMI for discrete and continuous probability densities [Normandin 1991]: digit strings, up to 50% rel. improvement in string error rate.


Training Criteria: Error-Based Training Criteria

Objective: optimize some error measure directly, e.g.:

- Empirical recognition error on the training data
  - Advantage: direct relation to the decision rule
  - Problem: non-differentiable training criterion; use of differentiable approximations in practice
  - Problem: ASR classes (words/word sequences) difficult to handle

- Model-based expected error on the training data
  - Advantage: word or phoneme error easy to handle
  - Usually an approximated word/phoneme error, but the correct edit distance is also viable [Heigold+ 2005]
  - Relation to the decision rule less straightforward.
  - Over-training and generalization become an issue (→ regularization, margin)


Training Criteria: Minimum Classification Error (MCE)

- For ASR: minimization of the smoothed empirical sentence error [Juang & Katagiri 1992, Chou+ 1992]:

  argmin_θ (1/R) Σ_{r=1}^R 1 / ( 1 + [ p_θ^α(X_r|W_r) · p^α(W_r) / Σ_{W≠W_r} p_θ^α(X_r|W) · p^α(W) ]^{2ϱ} )

- Smoothing parameters α and ϱ (a numeric sketch follows below).

- Upper bound on Bayes' error rate for any acoustic model [Schluter+ 2001].

- Lesser effect of incorrect model assumptions.
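As a minimal numeric sketch (made-up scores; α and ϱ as free smoothing parameters; the helper name mce_term is ours), the term inside the sum can be evaluated stably in the log domain:

import numpy as np

def mce_term(logp_ref, logp_competing, alpha=1.0, rho=1.0):
    """Smoothed 0/1 error for one utterance r.
    logp_ref: log p(W_r) p_theta(X_r|W_r); logp_competing: log-probs of W != W_r.
    Returns a value in (0, 1): near 0 if the reference wins, near 1 if it loses."""
    log_den = np.logaddexp.reduce(alpha * np.asarray(logp_competing))
    log_ratio = alpha * logp_ref - log_den      # log of the bracketed ratio
    return 1.0 / (1.0 + np.exp(2.0 * rho * log_ratio))

print(mce_term(-100.0, [-110.0, -120.0]))   # reference clearly ahead -> ~0
print(mce_term(-100.0, [-95.0]))            # competitor wins -> ~1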


Training Criteria: Minimum Word/Phone Error (MWE/MPE)

- minimization of the model-based expected word/phone error on the training data [Povey & Woodland 2002]:

  argmax_θ Σ_{r=1}^R ( Σ_W A(W,W_r) p(W) p_θ(X_r|W) / Σ_W p(W) p_θ(X_r|W) )

- Criterion: maximum expected accuracy A(W,W_r).

- Accuracy usually approximate, but the exact case based on the edit (Levenshtein) distance is also possible [Heigold+ 2005].

- Regularization (e.g. I-smoothing [Povey & Woodland 2002]) necessary due to overtraining.

- Usually better than MMI and MCE (the expected-accuracy computation is sketched below).
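The inner quantity is just a posterior-weighted average of the accuracies; a small sketch with hypothetical scores and accuracies (the helper name is ours):

import numpy as np

def expected_accuracy(logp, acc):
    """Model-based expected accuracy E_theta[A(., W_r)] over hypotheses W."""
    post = np.exp(logp - np.logaddexp.reduce(logp))   # hypothesis posteriors
    return float(post @ acc)

logp = np.array([-10.0, -11.0, -13.0])   # log p(W) p_theta(X_r | W), made up
acc = np.array([5.0, 4.0, 2.0])          # A(W, W_r), e.g. approx. phone accuracy
print(expected_accuracy(logp, acc))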


Training Criteria: Practical Issues

- Importance of the language model in training of the acoustic model.

- Relative and absolute scaling of language and acoustic model in training.

- Necessity for recognition of the training data.

- Efficient calculation of discriminative training statistics using word lattices.


Training Criteria: Language Models for Discriminative Training

Potential importance of the language model choice:

- language model for recognition of alternative word sequences

- language model dependence of the discriminative training criterion itself

- interaction of the language model with the acoustic model parameters

Correlation hypothesis: Only those acoustic models need optimization which, even together with a language model, do not discriminate sufficiently.

→ the language model choice would correlate for training and recognition.

Masking hypothesis: The language model usually largely improves recognition accuracy and might mask deficiencies of the acoustic models.

→ suboptimal language models for training would give better performance.


Training Criteria: Language Model for Discriminative Training

- Discriminative training includes the language model.
- In training, a unigram language model usually leads to the best word error rates [Schluter+ 1999] (WSJ 5k):

  recog LM   train LM   criterion   WER dev [%]   eval [%]   dev & eval [%]
  bi         –          ML          6.91          6.78       6.86
             zero       MMI         6.71          6.03       6.41
             uni        MMI         6.59          6.00       6.33  (−8% rel.)
             bi         MMI         6.71          6.20       6.48
             tri        MMI         6.87          6.54       6.72
  tri        –          ML          4.82          4.11       4.51
             zero       MMI         4.63          4.05       4.38
             uni        MMI         4.30          3.64       4.01  (−11% rel.)
             bi         MMI         4.48          3.94       4.24
             tri        MMI         4.58          4.00       4.33


Training Criteria: Scaling of Likelihoods

- recognition: absolute scaling of likelihoods irrelevant (language model scale vs. acoustic model scale)

- absolute scaling does have an impact on the word posterior calculation [Wessel+ 1998, Woodland & Povey 2000]

- use the language model scale β also in training:

  p(X,W) = p(W)^β p_θ(X|W)

- replace p(X,W) with:

  p(X,W)^γ = p(W)^{βγ} p_θ(X|W)^γ   for γ ∈ [0,1]

- optimum approximately for γ = 1/β, i.e. use

  p(X,W)^{1/β} = p(W) p_θ(X|W)^{1/β}

- For simplicity, this is usually omitted in the equations here (the effect of γ is sketched below).
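The effect of the scale γ on posterior peakedness can be seen in a short numeric sketch (the scores and the typical language model scale β ≈ 15 are illustrative assumptions):

import numpy as np

logp = np.array([-100.0, -102.0, -105.0])   # joint log-scores of 3 hypotheses
for gamma in (1.0, 1.0 / 15.0):             # gamma = 1 vs. gamma = 1/beta
    post = np.exp(gamma * logp - np.logaddexp.reduce(gamma * logp))
    print(gamma, post)                      # gamma = 1/beta flattens the posteriors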


Training Criteria: Competing Word Sequences

- Problem: exponential number of competing word sequences.

- Competing word sequences need to be estimated:
  - hypothesis generation on the training data using the recognizer,
  - initial lattice generation using the recognizer is sufficient,
  - later acoustic model rescoring is constrained to the lattice.

- Representation and processing of competing word sequences:
  - efficient algorithms to process word lattices,
  - generic implementation: weighted finite state transducers.


Training Criteria: Competing Word Sequences

History:

- best recognized word sequence for MMI (corrective training) [Normandin 1991]:
  - considers incorrectly recognized training sentences only

- best incorrectly recognized word sequence for MCE [Juang & Katagiri 1992]:
  - interpretation as smoothed sentence error still valid

- N-best recognized word sequences for MMI [Chow 1990]:
  - continuous speech recognition, 1000 words
  - only minor improvements in word error rate

- word graphs from recognition for MMI training [Valtchev+ 1997]:
  - large vocabulary, 64k words
  - efficient implementation
  - 5–10% relative improvement in word error rate


Training Criteria: Comparative Experimental Results

  WER [%]
          SieTill   WSJ 5k          EPPS English             Mandarin BN/BC
  Crit.   Test      Dev     Evl     Dev06   Evl06   Evl07    Dev07   Evl06
  ML      1.81      4.55    3.74    14.4    10.8    12.0     15.1    21.9
  MMI     1.79      4.07    3.53    13.8    11.0    12.0     14.4    20.8
  MCE     1.69      4.02    3.47    13.8    11.0    11.9     –       –
  MWE     –         3.98    3.44    –       –       –        –       –
  MPE     –         4.17    3.62    13.4    10.2    11.5     14.2    20.6

- SieTill [Schluter 2000]

- WSJ 5k [Macherey 2010]

- EPPS/broadcasts [Heigold 2010]


Parameter Optimization: Motivation

Goal: an optimization method for discriminative training criteria F(θ) w.r.t. the set of parameters θ which provides reasonable convergence.

- Various approaches, e.g.:
  - extended Baum-Welch (EBW) [Normandin 1991]
  - gradient descent; study: e.g. [Valtchev 1995]
  - MMI with log-linear models: generalized iterative scaling (GIS)
  - generalization of GIS to log-linear models with hidden variables and further criteria like MPE and MCE [Heigold+ 2008a]

- Problems:
  - robust setting of step sizes/iteration constants (EBW and gradient descent),
  - convergence speed (especially GIS).


Parameter Optimization: Extended Baum-Welch

- Motivated by a growth transformation [Gopalakrishnan+ 1991].

- Widely used for discriminative training of Gaussian mixture HMMs, e.g. [Normandin 1991, Valtchev+ 1997, Schluter 2000, Woodland & Povey 2002].

- Highly optimized heuristics for finding the right order of magnitude of the iteration constants.

- Training of Gaussian mixture HMMs: requires positive variances to obtain an estimate for the iteration constants.


Parameter Optimization: Gradient Descent

Follow the gradient to optimize the parameters:

  θ ← θ + γ ∇_θ F(θ)

Step sizes:

- heuristic, e.g. for MCE [Chou+ 1992]

- by comparison to EBW [Schluter 2000]

Convergence:

- local optimum

- better convergence: general-purpose approaches, e.g. Qprop, Rprop, or L-BFGS; for experimental comparisons see [McDermott & Katagiri 2005, McDermott+ 2007, Gunawardana+ 2005, Mahajan+ 2006]


Parameter Optimization: Rprop [Riedmiller & Braun 1993]

General-purpose gradient-based optimization:

- assume iteration n

- parameter update:

  θ_i^(n+1) = θ_i^(n) + γ_i^(n) · sign( ∂F(θ^(n)) / ∂θ_i )

- update of the step sizes γ_i^(n):

  γ_i^(n+1) = min{ γ_i^(n) · η⁺, γ_max }   if ∂F(θ^(n))/∂θ_i · ∂F(θ^(n−1))/∂θ_i > 0
  γ_i^(n+1) = max{ γ_i^(n) · η⁻, γ_min }   if ∂F(θ^(n))/∂θ_i · ∂F(θ^(n−1))/∂θ_i < 0
  γ_i^(n+1) = γ_i^(n)                      otherwise

- η⁺ ∈ (1, ∞), η⁻ ∈ (0, 1) (a Python transcription follows below)
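A direct transcription of this update rule; the quadratic test function and the constants η⁺ = 1.2, η⁻ = 0.5 are common defaults, used here as assumptions:

import numpy as np

def rprop_step(theta, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=1.0):
    """One Rprop update for maximizing F; all arguments are numpy arrays."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    return theta + step * np.sign(grad), step

# Maximize F(theta) = -(theta - 3)^2, whose gradient is -2 (theta - 3).
theta, step, prev = np.array([0.0]), np.array([0.1]), np.zeros(1)
for _ in range(60):
    grad = -2.0 * (theta - 3.0)
    theta, step = rprop_step(theta, grad, prev, step)
    prev = grad
print(theta)   # close to 3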


Parameter Optimization: Formal Gradient of MMI

- notation:
  - W: word sequence w_1, ..., w_N
  - r: index of the training segment/utterance given by (X_r, W_r)
  - X_r: acoustic observation vector sequence x_{r1}, ..., x_{rT_r}
  - W_r: reference/spoken word sequence w_{r1}, ..., w_{rN_r}
  - s_1^T: HMM state sequence s_1, ..., s_T

- MMI training criterion:

  F_MMI(θ) = Σ_r log( p(W_r) p_θ(X_r|W_r) / Σ_W p(W) p_θ(X_r|W) )
           = Σ_r ( log p(W_r) p_θ(X_r|W_r) − log Σ_W p(W) p_θ(X_r|W) )

- acoustic model (HMM):

  p_θ(X_r|W) = Σ_{s_1^{T_r}: W} Π_{t=1}^{T_r} p(s_t|s_{t−1}) p_θ(x_{rt}|s_t)


Parameter Optimization: Derivative of MMI w.r.t. the Parameters

Gradient of the MMI criterion:

  ∇_θ F_MMI(θ) = Σ_r ( ∇_θ log p_θ(X_r|W_r) − Σ_W p(W) p_θ(X_r|W) ∇_θ log p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W') )

For efficient evaluation, consider the derivative of the acoustic model, ∇_θ log p_θ(X_r|W).


Parameter Optimization: Derivative of the Acoustic Model

  ∇_θ log p_θ(X_r|W)
  = ∇_θ log Σ_{s_1^{T_r}: W} Π_{t=1}^{T_r} p_θ(x_{rt}|s_t) p(s_t|s_{t−1})
  = Σ_{t=1}^{T_r} Σ_{s_1^{T_r}: W} ( ∇_θ log p_θ(x_{rt}|s_t) ) · Π_{t'=1}^{T_r} p_θ(x_{rt'}|s_{t'}) p(s_{t'}|s_{t'−1}) / Σ_{σ_1^{T_r}: W} Π_{τ=1}^{T_r} p_θ(x_{rτ}|σ_τ) p(σ_τ|σ_{τ−1})
  = Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W)
  = Σ_{t=1}^{T_r} Σ_s γ_rt(s|W) · ∇_θ log p_θ(x_{rt}|s)

with the word sequence conditioned state posterior (occupancy):

  γ_rt(s|W) = Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W) = p_{θ,t}(s|X_r, W)


Parameter Optimization: Derivative of MMI w.r.t. the Parameters

Resubstituting the derivative of the acoustic model into the derivative of the MMI criterion:

  ∇_θ F_MMI(θ) = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · ( γ_rt(s|W_r) − Σ_W p(W) p_θ(X_r|W) γ_rt(s|W) / Σ_{W'} p(W') p_θ(X_r|W') )
               = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · ( γ_rt(s|W_r) − γ_rt(s) )

with the general state posterior (occupancy):

  γ_rt(s) = Σ_W p(W) p_θ(X_r|W) γ_rt(s|W) / Σ_{W'} p(W') p_θ(X_r|W') = p_{θ,t}(s|X_r)


Parameter Optimization: Efficient Calculation of State Occupancies

- Efficient calculation of the spoken word sequence conditioned state occupancy γ_rt(s|W_r): forward-backward state probabilities on the trellis of the word sequence.

- Efficient calculation of the general state occupancy γ_rt(s): forward-backward probabilities on the trellis of the word lattice.

Viterbi approximation:

- γ_rt(s|W) = δ_{s,s_rt(W)} with the forced alignment S_r(W) = s_{r1}(W), ..., s_{rT_r}(W) of the spoken word sequence.

- Assume a (word) lattice M_r for utterance r, with edges ω representing a word w(ω) (in context) with start time t_s(ω) and end time t_e(ω), and a corresponding forced alignment s_{t_s}^{t_e}(ω). An edge sequence W ∈ M_r then corresponds to the word sequence W(W). Consequently, the language model and acoustic model can also be defined for an edge sequence, which then might specify word boundaries, phonetic and language model context.


Parameter Optimization: Word Posterior Probabilities

For the general state occupancy in Viterbi approximation we obtain:

  γ_rt(s) = Σ_W p(W) p_θ(X_r|W) δ_{s,s_rt(W)} / p_θ(X_r)
          = Σ_ω δ_{s,s_rt(ω)} · Σ_{W: ω∈W} p(W) p_θ(X_r|W) / p_θ(X_r)
          = Σ_ω δ_{s,s_rt(ω)} · p(ω|X_r)

with the edge (or word-in-context) posterior

  p(ω|X_r) = Σ_{W: ω∈W} p(W) p_θ(X_r|W) / p_θ(X_r)

A forward-backward algorithm is used to efficiently compute edge (word-in-context) posterior probabilities on word lattices.


Parameter Optimization: Formal Gradient of MPE

- A(W,W_r): accuracy (negated error) between word sequence W and W_r

- example (MPE): approximate phone accuracy [Povey & Woodland 2002]

- expectation of the accuracy:

  E_θ[A(·,W_r)] := Σ_W A(W,W_r) · p(W) p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W')

- MPE training criterion:

  F_MPE(θ) = Σ_r E_θ[A(·,W_r)]


Parameter Optimization: Derivative of MPE w.r.t. the Parameters

Derivative of the MPE criterion, using ∇_θ p_θ(X_r|W) = p_θ(X_r|W) · ( ∇_θ log p_θ(X_r|W) ):

  ∇_θ F_MPE(θ) = Σ_r Σ_W ( A(W,W_r) − E_θ[A(·,W_r)] ) · ( ∇_θ log p_θ(X_r|W) ) · p(W) p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W')

For efficient evaluation, consider the derivative of the acoustic model:

  ∇_θ log p_θ(X_r|W) = Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W)


Parameter Optimization: Derivative of MPE w.r.t. the Parameters

Resubstituting the derivative of the acoustic model into the derivative of the MPE criterion:

  ∇_θ F_MPE(θ) = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_{rt}|s) ) · γ_rt(s)

with the general state accuracy:

  γ_rt(s) = Σ_W ( A(W,W_r) − E_θ[A(·,W_r)] ) · Σ_{s_1^{T_r}: s_t=s} p(W) p_θ(X_r, s_1^{T_r}|W) / Σ_{W'} p(W') p_θ(X_r|W')

which can be computed efficiently, similar to the case of the general state occupancies.


Parameter Optimization: Efficient Calculation of the State Accuracy

In general:

- assumption: A(W,W_r) = Σ_{t=1}^{T_r} A(s_rt(W), s_rt(W_r))

- example: approximate phone accuracy [Povey & Woodland 2002]

- efficient calculation of the general state accuracy γ_rt(s): forward-backward accuracies on the trellis of the word lattice [Povey & Woodland 2002]


Parameter Optimization: Word Posterior Accuracies

For the general state accuracy, in Viterbi approximation we obtain:

  γ_rt(s) = Σ_W ( A(W,W_r) − E_θ[A(·,W_r)] ) · p(W) p_θ(X_r|W) δ_{s,s_rt(W)} / p_θ(X_r)
          = Σ_ω δ_{s,s_rt(ω)} · Σ_{W: ω∈W} ( A(W,W_r) − E_θ[A(·,W_r)] ) · p(W) p_θ(X_r|W) / p_θ(X_r)
          = Σ_ω δ_{s,s_rt(ω)} · p(ω|X_r)

with the edge (or word-in-context) posterior accuracies

  p(ω|X_r) = Σ_{W: ω∈W} ( A(W,W_r) − E_θ[A(·,W_r)] ) · p(W) p_θ(X_r|W) / p_θ(X_r)

Later, an efficient way of computing edge (word-in-context) posterior accuracies using word lattices will be presented.


Efficient Calculation of Discriminative Statistics: Forward/Backward Probabilities on Word Lattices

- Let ω_s(W) and ω_e(W) be the first and last edge of a continuous edge sequence W on a word lattice.

- Assume that the lattice fully encodes the language model context:

  p(W(W)) = p(W = ω_1^N) = Π_{n=1}^N p(ω_n|ω_{n−1})

Let ω_ri and ω_rf be the initial and final edges of a word lattice for utterance r. Then define the following forward (Φ) and backward (Ψ) probabilities on initial and final partial edge sequences of the word lattice, respectively:

  Φ(ω) = Σ_{W: ω_s(W)=ω_ri, ω_e(W)=ω} p(W) p_θ(x_{r,1}^{t_e(ω)} | W)

  Ψ(ω) = Σ_{W: ω_s(W)=ω, ω_e(W)=ω_rf} p(W) p_θ(x_{r,t_s(ω)}^{T_r} | W)


Efficient Calculation of Discriminative Statistics: Forward/Backward Probabilities on Word Lattices

For the forward probabilities, a recursion formula can be derived by separating the last edge from the edge sequence in the summation, with ≺ denoting direct predecessor edges:

  Φ(ω) = Σ_{W: ω_s(W)=ω_ri, ω_e(W)=ω} p(W) p_θ(x_{r,1}^{t_e(ω)} | W)
       = Σ_{ω'≺ω} Σ_{W': ω_s(W')=ω_ri, ω_e(W')=ω'} p(W') p(ω|ω') p_θ(x_{r,1}^{t_e(ω')} | W') p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)
       = Σ_{ω'≺ω} Φ(ω') p(ω|ω') p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω).

Using this recursion formula, the forward probabilities can be calculated efficiently on word lattices (see the sketch below).
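A minimal Python sketch of this recursion on a toy lattice; for brevity, each edge's language model and acoustic contributions are pre-multiplied into one score and the LM dependence on the predecessor is ignored, and the edge IDs and probabilities are made up:

import math

# Toy lattice: each edge carries a combined log-score and a list of direct
# predecessor edges; IDs are in topological order, edge 0 initial, edge 3 final.
edges = {
    0: {"score": math.log(1.0), "pred": []},
    1: {"score": math.log(0.4), "pred": [0]},
    2: {"score": math.log(0.6), "pred": [0]},
    3: {"score": math.log(0.5), "pred": [1, 2]},
}

def forward(edges):
    """Log forward scores Phi(edge) via the recursion above (log-sum-exp form)."""
    phi = {}
    for e in sorted(edges):                    # topological order by construction
        preds = edges[e]["pred"]
        if not preds:
            phi[e] = edges[e]["score"]
        else:
            m = max(phi[p] for p in preds)
            s = sum(math.exp(phi[p] - m) for p in preds)
            phi[e] = edges[e]["score"] + m + math.log(s)
    return phi

print(forward(edges))   # Phi at the final edge corresponds to log p_theta(X_r)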


Efficient Calculation of Discriminative Statistics: Forward/Backward Probabilities on Word Lattices

Similar to the forward probabilities, a recursion formula can be derived for the efficient calculation of the backward probabilities, with ≻ denoting direct successor edges:

  Ψ(ω) = Σ_{W: ω_s(W)=ω, ω_e(W)=ω_rf} p(W) p_θ(x_{r,t_s(ω)}^{T_r} | W)
       = Σ_{ω'≻ω} Σ_{W': ω_s(W')=ω', ω_e(W')=ω_rf} p(ω'|ω) p(W') p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω) p_θ(x_{r,t_s(ω')}^{T_r} | W')
       = Σ_{ω'≻ω} p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω) p(ω'|ω) Ψ(ω')


Efficient Calculation of Discriminative Statistics: Forward/Backward Probabilities on Word Lattices

Using the forward and backward probabilities, the edge/word posterior on a word lattice can be written as

  p(ω|X_r) = Φ(ω) · Σ_{ω'≻ω} p(ω'|ω) Ψ(ω') / Φ(ω_rf)

with p_θ(X_r) = Φ(ω_rf) = Ψ(ω_ri).

Word posterior probabilities follow naturally from MPE and similar discriminative training criteria. They are also the basis for confidence measures, which are used for unsupervised training, adaptation, or dialog systems. Furthermore, they are part of approximate approaches to Bayes' decision rule with word error cost, like confusion networks [Mangu+ 1999] or minimum frame word error [Wessel+ 2001a, Hoffmeister+ 2006].


Efficient Calculation of Discriminative Statistics: FB Probabilities, Generalization to WFSTs

Replace the word lattice with a WFST:

- edge label: word with pronunciation

- weight of edge ω: p ← p(ω|ω') · p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)

- semiring: substitute the arithmetic operations (multiplication, addition, inversion) with the operations of the probability semiring

  Semiring      IK    p ⊕ p'   p ⊗ p'   0   1   inv(p)
  probability   IR+   p + p'   p · p'   0   1   1/p

[Figure: example WFST from SieTill with nine states; edges are labeled word/digit/score, e.g. "neun /9/ −876"; the path for the spoken word sequence W_r = "drei sechs neun" is marked in red.]


Efficient Calculation of Discriminative Statistics: FB Probabilities, Generalization to WFSTs

Forward probabilities (pre(ω) ∈ W such that pre(ω) ≺ ω):

  Φ(ω) := ⊕_{W: ω_s(W)=ω_ri, ω_e(W)=ω} ⊗_{ω∈W} p(ω|pre(ω)) ⊗ p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)
        = ⊕_{ω'≺ω} Φ(ω') ⊗ p(ω|ω') ⊗ p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)

Backward probabilities: similar.

Using the forward and backward probabilities, the edge posterior on a WFST can be written as

  p(ω|X_r) = Φ(ω) ⊗ ( ⊕_{ω'≻ω} p(ω'|ω) ⊗ Ψ(ω') ) ⊗ inv(Φ(ω_rf))


Efficient Calculation of Discriminative Statistics: Expectation Semiring

Vector weight (p, v) of edge ω with

- p ← p(ω|ω') · p_θ(x_{r,t_s(ω)}^{t_e(ω)} | ω)

- v ← A(ω) · p

- accuracy A(ω) of edge ω such that the edge accuracies sum up to A(W,W_r) along each edge sequence W:
  - the approximate phone accuracy [Povey & Woodland 2002] can be decomposed in this way
  - such a decomposition is not possible in general

Expectation semiring [Eisner 2001]: a vector semiring whose first component is a probability semiring.

  Semiring      IK         (p,v) ⊕ (p',v')   (p,v) ⊗ (p',v')       0       1       inv(p,v)
  expectation   IR+ × IR   (p+p', v+v')      (p·p', p·v' + p'·v)   (0,0)   (1,0)   (1/p, −v/p²)
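A minimal Python sketch of these operations, with a two-path check that the v/p ratio of the total weight yields the expected accuracy (the edge probabilities and accuracies are made-up numbers):

# Expectation semiring weights are pairs (p, v); a minimal sketch.
def e_add(a, b):
    (p, v), (q, w) = a, b
    return (p + q, v + w)

def e_mul(a, b):
    (p, v), (q, w) = a, b
    return (p * q, p * w + q * v)

def e_inv(a):
    p, v = a
    return (1.0 / p, -v / (p * p))

E_ZERO, E_ONE = (0.0, 0.0), (1.0, 0.0)

# Two-path check: encoding each edge as (p, A * p) makes a path's product equal
# (prod p, (sum A) * prod p), so v / p of the summed total is the expected accuracy.
path1 = e_mul((0.5, 2.0 * 0.5), (0.8, 1.0 * 0.8))   # p = 0.4, accuracy 3
path2 = e_mul((0.5, 0.0 * 0.5), (0.6, 1.0 * 0.6))   # p = 0.3, accuracy 1
total = e_add(path1, path2)
print(total[1] / total[0])   # (0.4*3 + 0.3*1) / 0.7 ≈ 2.14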


Efficient Calculation of Discriminative Statistics: Edge Posteriors & Expectation Semiring

Probability semiring:

- word posterior probabilities (see MMI derivative) are identical to the edge posteriors using the probability semiring:

  p(ω|X_r) = p_probability(ω|X_r)

- intuitive and classical result [Rabiner 1989]

Expectation semiring:

- word posterior accuracies (see MPE derivative) are identical to the v-component of the edge posteriors using the expectation semiring [Heigold+ 2008b]:

  p(ω|X_r) = p_{expectation,v}(ω|X_r)

- this identity can also be used to efficiently calculate
  - the derivative of the unified training criterion,
  - the covariance between two additive random variables (related to the MPE derivative).


Generalisation to Log-Linear Modeling: Transformation of a Gaussian into a Log-Linear Model

Definition of models: assume feature vector x ∈ IR^D and class c ∈ {1, ..., C}.

Gaussian model N(x|μ_c, Σ_c) with

- means μ_c ∈ IR^D

- positive-definite covariance matrices Σ_c ∈ IR^{D×D}

- class priors p(c) ∈ IR+

induces the posterior

  p_θ(c|x) = p(c) N(x|μ_c, Σ_c) / Σ_{c'} p(c') N(x|μ_{c'}, Σ_{c'})

Log-linear model with unconstrained parameters λ_{c0} ∈ IR, λ_{c1} ∈ IR^D, λ_{c2} ∈ IR^{D×D}:

  p_λ(c|x) = exp( x^T λ_{c2} x + λ_{c1}^T x + λ_{c0} ) / Σ_{c'} exp( x^T λ_{c'2} x + λ_{c'1}^T x + λ_{c'0} )


Generalisation to Log-Linear Modeling: Transformation of a Gaussian into a Log-Linear Model

Comparison of the terms quadratic, linear, and constant in the observations x leads to the transformation rules [Saul & Lee 2002, Gunawardana+ 2005]:

  1. λ_{c2} = −(1/2) Σ_c^{−1}

  2. λ_{c1} = Σ_c^{−1} μ_c

  3. λ_{c0} = −(1/2) ( μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c| ) + log p(c)

(A numeric check of these rules is sketched below.)
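These rules are easy to verify numerically; a short sketch (random test point, arbitrary Gaussian parameters of our choosing) checks that the log-linear score reproduces log p(c) + log N(x|μ_c, Σ_c) exactly:

import numpy as np

def gaussian_to_loglinear(mu, Sigma, prior):
    """Transformation rules 1.-3. above; returns (lambda_c2, lambda_c1, lambda_c0)."""
    Si = np.linalg.inv(Sigma)
    l2 = -0.5 * Si
    l1 = Si @ mu
    l0 = -0.5 * (mu @ Si @ mu + np.log(np.linalg.det(2 * np.pi * Sigma))) \
         + np.log(prior)
    return l2, l1, l0

# Check that x^T l2 x + l1^T x + l0 equals log p(c) + log N(x | mu, Sigma).
rng = np.random.default_rng(0)
x = rng.normal(size=2)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
l2, l1, l0 = gaussian_to_loglinear(mu, Sigma, prior=0.5)
score = x @ l2 @ x + l1 @ x + l0
Si = np.linalg.inv(Sigma)
log_gauss = (-0.5 * ((x - mu) @ Si @ (x - mu)
                     + np.log(np.linalg.det(2 * np.pi * Sigma))) + np.log(0.5))
print(np.isclose(score, log_gauss))   # True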


Generalisation to Log-Linear Modeling: Transformation of a Log-Linear into a Gaussian Model

- invert the transformation from the Gaussian to the log-linear model:

  1. Σ_c = −(1/2) λ_{c2}^{−1}

  2. μ_c = Σ_c λ_{c1}

  3. p(c) = exp( λ_{c0} + (1/2)( μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c| ) )

- problem: the parameter constraints are not satisfied in general:
  - covariance matrices Σ_c must be positive-definite
  - priors p(c) must be normalized

- solution: the model parameters of the posterior are ambiguous, e.g. for ∆λ_2 ∈ IR^{D×D}, ∆λ_0 ∈ IR:

  exp( x^T (λ_{c2}+∆λ_2) x + λ_{c1}^T x + (λ_{c0}+∆λ_0) ) / Σ_{c'} exp( x^T (λ_{c'2}+∆λ_2) x + λ_{c'1}^T x + (λ_{c'0}+∆λ_0) )
  = exp( x^T λ_{c2} x + λ_{c1}^T x + λ_{c0} ) / Σ_{c'} exp( x^T λ_{c'2} x + λ_{c'1}^T x + λ_{c'0} )

Generalisation to Log-Linear Modeling: Transformation of a Log-Linear into a Gaussian Model

Invert the transformation rules for the transformed log-linear model:

  1. Σ_c = −(1/2) (λ_{c2} + ∆λ_2)^{−1}

  2. μ_c = Σ_c λ_{c1}

  3. p(c) = exp( (λ_{c0}+∆λ_0) + (1/2)( μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c| ) )

Use the additional degrees of freedom to impose the parameter constraints:

- choose ∆λ_2 ∈ IR^{D×D} such that the λ_{c2} + ∆λ_2 are negative-definite

- choose ∆λ_0 such that p(c) is normalized, i.e.,

  ∆λ_0 := − log Σ_c exp( λ_{c0} + (1/2)( μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c| ) )


Convex Optimization: Motivation

Conventional approach:

- depends on initialization and choice of optimization algorithm

- spurious local optima (non-convex training criterion)

- many heuristics required

- i.e., involves much engineering work

"Fool-proof" approach:

- unique optimum (independent of initialization)

- accessibility of the global optimum (convex training criterion)

- joint optimization of all model parameters, no parameters to be tuned


Convex Optimization: Convex Training Criteria in Speech Recognition

Assumptions to cast an HCRF into a CRF:

- log-linear parameterization, e.g. p(x|s) = exp( x^T λ_{s2} x + λ_{s1}^T x + λ_{s0} ) and p(s|s') = exp(α_{s's})

- MMI-like training criterion

- alignment represents the spoken sequence

- alignment of the spoken sequence known and kept fixed

- use single densities with augmented features instead of mixtures

- exact normalization constant


Convex Optimization: Lattice-Based MMI

  F_lattice(λ) = Σ_r log( Σ_{s_1^{T_r} ∈ N_r} p_λ(x_1^{T_r}, s_1^{T_r}) / Σ_{s_1^{T_r} ∈ D_r} p_λ(x_1^{T_r}, s_1^{T_r}) )

- numerator word lattice N_r: state sequences s_1^{T_r} representing the correct hypothesis

- denominator word lattice D_r: correct and competing state sequences; use word pair approximation and pruning

- non-convex

[Figure: word lattice from SieTill; edges are labeled with words ("fünf", "eins", "sechs", "acht", "neun", "zwei", "drei", "[SIL]") and word boundary times.]


Convex Optimization: Fool-Proof MMI

  F_fool(λ) = Σ_r log( p_λ(x_1^{T_r}, ŝ_1^{T_r}) / Σ_{s_1^{T_r} ∈ S_r} p_λ(x_1^{T_r}, s_1^{T_r}) )

- consider only the best state sequence ŝ_1^{T_r} in the numerator, kept fixed

- sum over the full state sequence network S_r in the denominator

- convex

[Figure: part of a (pruned) HMM state network; edges are labeled with HMM state indices and words, e.g. "18:acht".]


Convex Optimization: Frame-Based MMI

  F_frame(λ) = Σ_r Σ_{t=1}^{T_r} log( p_λ(x_t, s_t) / Σ_{s=1}^S p_λ(x_t, s) )

- frame discrimination, cf. hybrid approach

- assume a fixed alignment s_1^{T_r} for the numerator

- summation over all HMM states s ∈ {1, ..., S} in the denominator

- convex (the objective is sketched below)
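For a log-linear model this objective is just a per-frame log-posterior of the aligned state; a minimal sketch with made-up scores (the helper name is ours):

import numpy as np

def frame_mmi(scores, align):
    """Frame-based MMI objective. scores: T x S matrix of log p_lambda(x_t, s);
    align: fixed numerator state alignment s_1 .. s_T (integer state indices)."""
    log_den = np.logaddexp.reduce(scores, axis=1)   # log sum over all S states
    return float(np.sum(scores[np.arange(len(align)), align] - log_den))

scores = np.array([[2.0, 0.5, -1.0],    # T = 2 frames, S = 3 states
                   [0.1, 1.5, 0.3]])
print(frame_mmi(scores, np.array([0, 1])))   # sum of per-frame log-posteriors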


Convex Optimization: Refinements to MMI

These refinements do not break convexity:

- ℓ2-regularization

- margin term


Convex Optimization: Experimental Results, Initialization

Analyze the effect of the initial parameters on training:

- vary the initialization for different training criteria

- experiments: digit strings (SieTill, German, telephone)

[Figure: training criterion F and WER [%] over 300 iterations for frame-based, lattice-based, and fool-proof M-MMI, each initialized from scratch, from ML, or from the frame-based model.]


Convex Optimization: Read Speech (WSJ)

- 5k vocabulary, trigram language model
- phone-based HMMs, 1,500 CART-tied triphones
- audio data: 15h (training), 0.4h (test)

- log-linear model with kernel-like features f(x):
  - first-order (f_d(x) = x_d) and second-order (f_{dd'}(x) = x_d · x_{d'}) features
  - cluster features: assume a GMM of the marginal distribution, p(x) = Σ_l p(x,l), and set

    f_l(x) = p(l|x) if p(l|x) ≥ threshold, and 0 otherwise

    (a sketch of this feature function follows below)

- starting from scratch (model) and linear segmentation
- frame-based MMI, with re-alignments
- details: [Wiesler 2009]
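A sketch of the thresholded cluster-feature function for a diagonal-covariance GMM; the function name, parameter layout, threshold value, and test data are our assumptions:

import numpy as np

def cluster_features(x, means, variances, priors, threshold=1e-2):
    # log p(x, l) for a diagonal-covariance GMM, one entry per component l
    logp = (np.log(priors)
            - 0.5 * np.sum(np.log(2 * np.pi * variances)
                           + (x - means) ** 2 / variances, axis=1))
    post = np.exp(logp - np.logaddexp.reduce(logp))   # p(l | x)
    return np.where(post >= threshold, post, 0.0)     # sparse posterior features

means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
print(cluster_features(np.array([0.2, -0.1]), means, variances,
                       priors=np.array([0.5, 0.5])))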


Convex Optimization: Read Speech (WSJ)

  Feature setup                                          WER [%]
  First-order features, monophones                       22.7
  + second-order features                                10.3
  + 210 cluster features + temporal context of size 9     6.2
  + 1,500 CART-tied HMM states (triphones)                 3.9
  + realignment                                            3.6
  GHMM (ML)                                                3.6
  GHMM (MMI)                                               3.0


Incorporation of Margin Concept: Motivation

Goal: incorporation of a margin term into conventional training criteria.

- replace the likelihoods p(W) p(X|W) with margin-likelihoods p(W) p(X|W) exp(−ρ A(W,W_r))

- A(W,W_r): accuracy between hypothesis W and reference W_r

- interpretation (boosting): emphasize incorrect hypotheses by up-weighting

- interpretation (large margin): next slides

[Figure: schematic of prior margin-based work, differing in task complexity, loss functions, convergence/local optima, parameterization, and optimization.]

Margin in training is promising. What is the individual contribution of the margin in LVCSR training?


Incorporation of Margin Concept: Support Vector Machines (Hinge Loss)

Optimization problem for SVMs:

  F_SVM(λ) = −(C/2) ‖λ‖² − Σ_{r=1}^R l(W_r, d_r; ρ)

- feature functions f(X,W), model parameters λ

- distance d_rW := λ^T ( f(X_r,W_r) − f(X_r,W) )

- hinge loss function

  l^(hinge)(W_r, d_r; ρ) := max_{W≠W_r} max{ −d_rW + ρ( A(W_r,W_r) − A(W,W_r) ), 0 }

- ℓ2-regularization with constant C > 0

[Altun 2003, Taskar 2003]


Incorporation of Margin Concept: Smooth Approximation to the SVM, Margin-MMI

Margin-based/modified MMI (M-MMI):

  F_{M-MMI,γ}(λ) = −(C/2) ‖λ‖² + Σ_{r=1}^R (1/γ) log( exp(γ(λ^T f(X_r,W_r) − ρ A(W_r,W_r))) / Σ_W exp(γ(λ^T f(X_r,W) − ρ A(W,W_r))) )

Lemma: F_{M-MMI,γ} → F_SVM with hinge loss for γ → ∞ (pointwise convergence).

[Heigold+ 2008b]


Incorporation of Margin Concept: Proof

Define ∆A(W,W_r) := A(W_r,W_r) − A(W,W_r). Then:

  −(1/γ) log( exp(γ(λ^T f(X_r,W_r) − ρ A(W_r,W_r))) / Σ_W exp(γ(λ^T f(X_r,W) − ρ A(W,W_r))) )

  = (1/γ) log( 1 + Σ_{W≠W_r} exp(γ(−d_rW + ρ ∆A(W,W_r))) )

  → (γ→∞)  max_{W≠W_r}{ −d_rW + ρ ∆A(W,W_r) }   if ∃ W≠W_r: d_rW < ρ ∆A(W,W_r)
            0                                     otherwise

  = max_{W≠W_r} max{ −d_rW + ρ ∆A(W,W_r), 0 }

  =: l^(hinge)(W_r, d_r; ρ).
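The pointwise convergence is easy to observe numerically; a small sketch with made-up values z_W = −d_rW + ρ ∆A(W,W_r):

import numpy as np

def smooth_hinge(z, gamma):
    """(1/gamma) * log(1 + sum_W exp(gamma * z_W)), the expression in the proof."""
    return np.logaddexp.reduce(np.concatenate(([0.0], gamma * z))) / gamma

z = np.array([-0.3, 0.8, 0.2])    # z_W = -d_rW + rho * Delta A(W, W_r)
for gamma in (1.0, 10.0, 100.0):
    print(gamma, smooth_hinge(z, gamma))   # tends to max(max_W z_W, 0) = 0.8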


Incorporation of Margin Concept: Support Vector Machines (Margin Error)

Optimization problem for SVMs:

  F_SVM(λ) = −(C/2) ‖λ‖² − Σ_{r=1}^R l(W_r, d_r; ρ)

- feature functions f(X,W), model parameters λ

- distance d_rW := λ^T ( f(X_r,W_r) − f(X_r,W) )

- margin error loss function

  l^(error)(W_r, d_r; ρ) := E( A( argmin_W [ d_rW + ρ A(W,W_r) ], W_r ) )

- ℓ2-regularization with constant C > 0

[Heigold+ 2008b]


Incorporation of Margin Concept: Smooth Approximation to the SVM, Margin-MPE

Margin-based/modified MPE (M-MPE):

  F_{M-MPE,γ}(λ) = −(C/2) ‖λ‖² + Σ_{r=1}^R Σ_W E(W,W_r) · exp(γ(λ^T f(X_r,W) − ρ A(W,W_r))) / Σ_V exp(γ(λ^T f(X_r,V) − ρ A(V,W_r)))

Lemma: F_{M-MPE,γ} → F_SVM with margin error loss for γ → ∞.

[Heigold+ 2008b]


Incorporation of Margin Concept: Experimental Evaluation of the Margin

Digit strings (SieTill, German, telephone):

  dns/state   feature orders         # param   criterion   WER [%]
  1           first                  11k       ML          3.8
                                               MMI         2.9
                                               M-MMI       2.7
  64          first                  690k      ML          1.8
                                               MMI         1.8
                                               M-MMI       1.6
  1           first, second,         1,409k    Frame       1.8
              and third                        MMI         1.7
                                               M-MMI       1.5


Incorporation of Margin Concept: Experimental Evaluation of the Margin

European parliament plenary sessions in English (EPPS) and Mandarin broadcasts:

  WER [%]
  Criterion   EPPS En 90h   Mandarin BN/BC 230h   Mandarin BN/BC 1500h
  ML          12.0          21.9                  17.9
  MMI         –             20.8                  –
  M-MMI       –             20.6                  –
  MPE         11.5          20.6                  16.5
  M-MPE       11.3          20.3                  16.3


Incorporation of Margin Concept: Experimental Evaluation of the Margin

Handwriting recognition (IFN/ENIT):

- isolated town names, handwritten

- slice features chosen so that a 1D HMM can be used

- details: see [Dreuw+ 2009]

  WER [%]
  Criterion   abc-d   abd-c   acd-b   bcd-a   abcd-e
  ML          7.8     8.7     7.8     8.7     16.8
  MMI         7.4     8.2     7.6     8.4     16.4
  M-MMI       6.1     6.8     6.1     7.0     15.4


Conclusions: Effective Discriminative Training

- Discriminative criteria:
  - fit the decision rule: minimize the training error
  - limit overfitting: include regularization and a margin to exploit the remaining degrees of freedom of the parameters

- Optimization methods:
  - general-purpose methods give robust estimates
  - in the convex case, gradient descent is still faster than the growth transform (GIS)

- Log-linear modeling:
  - convex (without hidden variables)
  - covers Gaussians completely, without constraints on e.g. the variances
  - opens the modeling up to arbitrary features
  - initialization: from scratch or from Gaussians

- Estimation of statistics:
  - efficiency: use word lattices to represent competing word sequences
  - implementation: generic approach using WFSTs, covering a whole class of criteria


ReferencesI Y. Altun, I. Tsochantaridis, and T. Hofmann, “Hidden Markov support

vector machines,” in International Conference on Machine Learning(ICML) 2003, Washington, DC, USA, 2003.

I L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer. “MaximumMutual Information Estimation of Hidden Markov Model Parameters forSpeech Recognition,”Proc. 1986 Int. Conf. on Acoustics, Speech and Signal Processing, Vol.1, pp. 49-52, Tokyo, Japan, May 1986.

I P. F. Brown. The Acoustic-Modeling Problem in Automatic SpeechRecognition, Ph.D. thesis, Department of Computer Science, CarnegieMellon University, Pittsburgh, PA, May 1987.

I W. Chou, B.-H. Juang, and C.-H. Lee, “Segmental GPD training ofHMM based speech recognizer,” in IEEE International Conference onAcoustics, Speech, and Signal Processing (ICASSP) 1992, San Francisco,CA, USA, March 1992, pp. 473-476.

I Y.-L. Chow: “Maximum Mutual Information Estimation of HMMParameters for Continuous Speech Recognition using the N-bestAlgorithm,” inProc. Int. Conf. on Acoustics, Speech and SignalProcessing (ICASSP), pp. 701–704, Albuquerque, NM, April 1990.

Schluter/Ney: Advanced ASR 560 August 5, 2010

Page 562: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

ReferencesI P. Dreuw, G. Heigold, and H. Ney, “Confidence-based discriminative

training for model adaptation in offline Arabic handwriting recognition,”in International Conference on Document Analysis and Recognition(ICDAR), Barcelona, Spain, July 2009.

I J. Eisner, “Expectation semirings: Flexible EM for finite-statetransducers,” in International Workshop on Finite-State Methods andNatural Language Processing (FSMNLP), Helsinki, Finland, August 2001.

I P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, D. Nahamoo. “AnInequality for Rational Functions with Applications to Some StatisticalEstimation Problems,” IEEE Transactions on Information Theory, Vol.37, Nr. 1, pp. 107-113, January 1991.

I A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, “Hiddenconditional random fields for phone classification,” in Interspeech, pp.117 120, Lisbon, Portugal, Sept. 2005.

I G. Heigold, W. Macherey, R. Schluter, and H. Ney: ”Minimum ExactWord Error Training,” in Proc. IEEE Automatic Speech Recognition andUnderstanding Workshop (ASRU), pages 186–190, San Juan, PuertoRico, November 2005.

Schluter/Ney: Advanced ASR 561 August 5, 2010

Page 563: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

ReferencesI G. Heigold, T. Deselaers, R. Schluter, H. Ney: “GIS-like Estimation of

Log-Linear Models with Hidden Variables,” in Proc. IEEE InternationalConference on Acoustics, Speech, and Signal Processing (ICASSP), pages4045–4048, Las Vegas, NV, USA, April 2008.

I G. Heigold, T. Deselaers, R. Schluter, H. Ney, “Modified MMI/MPE: Adirect evaluation of the margin in speech recognition,” in InternationalConference on Machine Learning (ICML), pp. 384-391, Helsinki, Finland,July 2008.

I G. Heigold: A Log-Linear Modeling Framework for Speech Recognition,Doctoral Thesis to be submitted, RWTH Aachen University, Aachen,Germany, 2010.

I B. Hoffmeister, T. Klein, R. Schluter, and H. Ney: “Frame Based SystemCombination and a Comparison with Weighted ROVER and CNC,” inProc. Interspeech, pages 537–540, Pittsburgh, PA, USA, September2006.

I B.-H. Juang and S. Katagiri, “Discriminative learning for minimum errorclassification,” IEEE Transactions on Signal Processing, vol. 40, no. 12,pp. 3043-3054, 1992.

Schluter/Ney: Advanced ASR 562 August 5, 2010

Page 564: Advanced methods in Automatic Speech Recognition · Advanced methods in Automatic Speech Recognition ... Human Language Technology and Pattern Recognition ... speech and music audio,

References

I D. Kanevsky: “Extended Baum Welch transformations for generalfunctions,” in Proc. IEEE International Conference on Acoustics, Speech,and Signal Processing (ICASSP), pp. 821-824, Montreal, Quebec,Canada, May 2004.

I W. Macherey, L. Haferkamp, R. Schlüter, H. Ney: “Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition,” in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, September 2005.

I W. Macherey, “Discriminative training and acoustic modeling for automatic speech recognition,” Ph.D. thesis to be submitted, RWTH Aachen University, 2010.

I M. Mahajan, A. Gunawardana, A. Acero: “Training algorithms for hidden conditional random fields,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006.

I L. Mangu, E. Brill, A. Stolcke: “Finding Consensus Among Words: Lattice-Based Word Error Minimization,” in Proc. European Conference on Speech Communication and Technology (EUROSPEECH), pp. 495–498, Budapest, Hungary, Sept. 1999.

I E. McDermott, S. Katagiri: “Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, April 2005.

I E. McDermott, T. Hazen, J. Le Roux, A. Nakamura, S. Katagiri: “Discriminative training for large vocabulary speech recognition using Minimum Classification Error,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 1, pp. 203–223, April 2007.

I Y. Normandin, “Hidden Markov Models, Maximum Mutual Information, and the Speech Recognition Problem,” Ph.D. thesis, McGill University, Montreal, Canada, 1991.

I D. Povey and P. C. Woodland, “Minimum phone error and I-smoothing for improved discriminative training,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2002, Orlando, FL, May 2002, vol. 1, pp. 105–108.

I L.R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

I M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The Rprop algorithm,” in IEEE International Conference on Neural Networks (ICNN) 1993, San Francisco, CA, USA, 1993, pp. 586–591.

I L. Saul and D. Lee, “Multiplicative updates for classification by mixture models,” in T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS), MIT Press, 2002.

I R. Schlüter, B. Müller, F. Wessel, and H. Ney: “Interdependence of Language Models and Discriminative Training,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Vol. 1, pp. 119–122, Keystone, CO, December 1999.

I R. Schlüter: Investigations on Discriminative Training Criteria, Doctoral Thesis, RWTH Aachen University, Aachen, Germany, Sept. 2000.

I R. Schlüter, H. Ney: “Model-based MCE Bound to the True Bayes’ Error,” IEEE Signal Processing Letters, Vol. 8, No. 5, pp. 131–133, May 2001.

I B. Taskar, C. Guestrin, and D. Koller, “Max-margin Markov networks,” in Advances in Neural Information Processing Systems (NIPS), 2003.

I V. Valtchev: Discriminative Methods in HMM-based Speech Recognition, Ph.D. thesis, St. John’s College, University of Cambridge, Cambridge, March 1995.

I V. Valtchev, J. J. Odell, P. C. Woodland, S. J. Young. “MMIE Training of Large Vocabulary Recognition Systems,” Speech Communication, Vol. 22, No. 4, pp. 303–314, September 1997.

I F. Wessel, K. Macherey, and R. Schlüter, “Using word probabilities as confidence measures,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 1998, Seattle, WA, USA, May 1998, pp. 225–228.

I F. Wessel, R. Schlüter, and H. Ney: “Explicit Word Error Minimization using Word Hypothesis Posterior Probabilities,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 33–36, Salt Lake City, Utah, May 2001.

I S. Wiesler et al., “Investigations on features for log-linear acoustic models in continuous speech recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Merano, Italy, Dec. 2009.

I P. C. Woodland and D. Povey, “Large scale discriminative training for speech recognition,” in Automatic Speech Recognition (ASR) 2000, Paris, France, September 2000, pp. 7–16.

I P.C. Woodland, D. Povey: “Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition,” Computer Speech and Language, Vol. 16, No. 1, pp. 25–48, 2002.

Outline

0. Lehrstuhl für Informatik 6

1. Large Vocabulary Speech Recognition

2. Search using Lexical Pronunciation Tree

3. Across-Word Models

4. Word graphs and Applications

5. Time Conditioned Search

6. Normalization and Adaptation

7. Discriminative Training
   7.1 Introduction
   7.2 Overview
   7.3 Training Criteria
   7.4 Parameter Optimization
   7.5 Efficient Calculation of Discriminative Statistics
   7.6 Generalisation to Log-Linear Modeling
   7.7 Convex Optimization
   7.8 Incorporation of Margin Concept
   7.9 Conclusions
   7.10 References
   7.11 Speech Tasks: Corpus Statistics & Setups

Speech Tasks: Corpus Statistics & Setups

Identifier      Description                          Train/Test [h]
--------------------------------------------------------------------
SieTill         German digit strings                 11/11 (Test)
EPPS En         English European Parliament          92/2.9 (Evl07)
                plenary speech
BNBC Cn 230h    Mandarin broadcasts                  230/2.2 (Evl06)
BNBC Cn 1500h   Mandarin broadcasts                  1,500/2.2 (Evl06)

Identifier      #States/#Dns    Features
--------------------------------------------------------------------
SieTill         430/27k         25 LDA(MFCC)
EPPS En         4,500/830k      45 LDA(MFCC+voicing)+VTLN+SAT/CMLLR
BNBC Cn 230h    4,500/1,100k    45 LDA(MFCC)+3 tones+VTLN+SAT/CMLLR
BNBC Cn 1500h   4,500/1,200k    45 SAT/CMLLR(PLP+voicing+3 tones+32 NN)+VTLN

(#States/#Dns: number of HMM states / total number of Gaussian densities;
the leading number in the Features column is the feature dimension.)
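The same information in machine-readable form may be easier to work with. The following is a minimal sketch, not part of the lecture materials; the class and field names (AsrSetup, train_hours, densities, etc.) are illustrative assumptions, while all values are taken directly from the table above.

from dataclasses import dataclass

@dataclass
class AsrSetup:
    """One row of the corpus/setup table (illustrative field names)."""
    name: str
    train_hours: float
    test_hours: float
    test_set: str       # evaluation set identifier, e.g. "Evl07"
    hmm_states: int     # number of HMM states
    densities: int      # total number of Gaussian densities
    feature_dim: int    # dimension after the final feature transform
    front_end: str      # feature pipeline as written in the table

SETUPS = [
    AsrSetup("SieTill", 11, 11, "Test", 430, 27_000, 25, "LDA(MFCC)"),
    AsrSetup("EPPS En", 92, 2.9, "Evl07", 4_500, 830_000, 45,
             "LDA(MFCC+voicing)+VTLN+SAT/CMLLR"),
    AsrSetup("BNBC Cn 230h", 230, 2.2, "Evl06", 4_500, 1_100_000, 45,
             "LDA(MFCC)+3 tones+VTLN+SAT/CMLLR"),
    AsrSetup("BNBC Cn 1500h", 1_500, 2.2, "Evl06", 4_500, 1_200_000, 45,
             "SAT/CMLLR(PLP+voicing+3 tones+32 NN)+VTLN"),
]

# Example: list the setups ordered by amount of training data.
for s in sorted(SETUPS, key=lambda x: x.train_hours):
    print(f"{s.name}: {s.train_hours}h training, {s.densities/1e6:.2f}M densities")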

Handwriting Tasks: Corpus Statistics & Setups

IFN/ENIT:

I isolated Tunisian town names

I 4 training folds + 1 additional fold for testing (see the sketch after the table below)

I simple appearance-based image slice features

I each fold comprises approximately 500,000 frames

                   #Observations [k]
Corpus             Towns    Frames
-----------------------------------
IFN/ENIT a         6.5      452
         b         6.7      459
         c         6.5      452
         d         6.7      451
         e         6.0      404
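As a rough illustration of the fold setup, here is a minimal sketch; it is an assumption for illustration only, since the slides do not specify whether the test fold is fixed or rotated. The fold statistics are taken from the table above.

# Fold statistics from the table: (towns, frames), both in thousands.
FOLDS = {"a": (6.5, 452), "b": (6.7, 459), "c": (6.5, 452),
         "d": (6.7, 451), "e": (6.0, 404)}

# Rotate over the five folds: train on four, test on the held-out one.
for test_fold in FOLDS:
    train_folds = [f for f in FOLDS if f != test_fold]
    train_frames = sum(FOLDS[f][1] for f in train_folds)
    # Each fold has roughly 500k frames, so training sees about 4 x 500k.
    print(f"train on {'+'.join(train_folds)}, test on {test_fold}: "
          f"~{train_frames}k training frames")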
