
Lecture 5
The Big Picture/Language Modeling

Michael Picheny, Stanley F. Chen, Ellen Eide
IBM T.J. Watson Research Center

Yorktown Heights, NY, USA
{picheny,stanchen,eeide}@us.ibm.com

October 6, 2005


Administrivia


Outline

■ The Big Picture (review)
  ● the story so far, plus situating Lab 1 and Lab 2

■ language modeling
  ● why?
  ● how? (grammars, n-grams)
  ● parameter estimation (smoothing)
  ● evaluation


Pattern Classification

■ given training samples (x1, ω1), . . . , (xN, ωN)
  ● feature vectors x = (x1, . . . , xT)
  ● class labels ω

■ guess class label of unseen feature vectors x (i.e., from the test set)

■ example: guess gender of person given height+weight
  ● feature vector x = (height, weight)
  ● class labels ω ∈ {M, F}


Speech Recognition is Pattern Classification

■ e.g., isolated digit recognition
  ● feature vector x = (x1, . . . , xT) is acoustic waveform
  ● T = 20000 if 1 sec sample, 20 kHz sampling rate
  ● class labels ω ∈ {ONE, TWO, . . . , NINE, ZERO, OH}


Nearest Neighbor Classification

■ to classify a test vector x
  ● find nearest sample xi in training set
  ● select associated class ωi

[scatter plot: height (cm) vs. weight (kg) for male and female training samples]


Nearest Neighbor Classification

Let’s write down some equations

■ for each class ω, we have a model Mω

■ to classify a test vector xtest, find nearest model Mω

class(xtest) = arg min_ω distance(xtest, Mω)

■ nearest neighbor classification
  ● let Mω = {training samples x labeled with class ω}
  ● distance(xtest, Mω) = min_{x ∈ Mω} distance(xtest, x)
  ● e.g., distance(xtest, x) = Euclidean distance
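
A minimal sketch of this nearest-neighbor rule in Python (the toy height/weight data and the function name are illustrative assumptions, not from the lecture or Lab 1):

```python
import numpy as np

def nn_classify(x_test, train_x, train_y):
    """Return the label of the training sample closest to x_test (Euclidean distance)."""
    dists = np.linalg.norm(train_x - x_test, axis=1)
    return train_y[np.argmin(dists)]

# toy height/weight example
train_x = np.array([[180.0, 80.0], [165.0, 55.0], [175.0, 70.0], [160.0, 50.0]])
train_y = np.array(["M", "F", "M", "F"])
print(nn_classify(np.array([170.0, 60.0]), train_x, train_y))
```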


Speech Recognition as Pattern Classification

Problem 1: acoustic waveform is not good representation for nearest neighbor classification

■ consider two equal-length acoustic waveforms

x = (x1, . . . , x20000)

x′ = (x′1, . . . , x′20000)

■ do we believe that distance(x, x′) . . .
  ● will be smaller if class(x) = class(x′) . . .
  ● than if class(x) ≠ class(x′)?

■ Lab 1: windowing only; computing error rate with DTW
  ● windows are constant-length acoustic waveforms
  ● performance is near random


Speech Recognition as Pattern Classification

Finding a transformation F for samples: F (xraw) ⇒ xnew

■ want samples xnew in same class to be close to each other
■ want samples xnew in different classes to be far apart
■ ⇒ signal processing/front end processing


Feature Transformation

[two scatter plots of male/female samples: left, fractional parts of height and weight ×100 (cm/kg); right, height (cm) vs. weight (kg)]


What Did We Learn in Lab 1?

■ a smoothed frequency-domain representation works well
  ● FFT computes energy at each frequency (fine-grained)
  ● mel-binning does coarse-grained grouping (93.6% accuracy)
  ● log+DCT, Hamming window help
  ● not completely unlike human auditory system

■ accuracy on “PCM” (pulse code modulation): 89.1%
  ● DTW is powerful; speaker-dependent only?

■ DTW is OK for speaker-dependent small-voc isolated word ASR
  ● >95% accuracy in clean conditions

■ data reduction
  ● instead of (x1, . . . , x20000) for 1 sec sample
  ● (~x1, . . . , ~x100), dim(~xt) = 12
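
A rough sketch of such a front end in Python. This is only an approximation of Lab 1's pipeline: the coarse log-spaced binning below stands in for true mel binning, and the frame length, hop, bin count, and cepstral order are illustrative assumptions rather than the lab's actual settings.

```python
import numpy as np
from scipy.fftpack import dct

def simple_cepstra(signal, frame_len=400, hop=160, n_bins=20, n_cep=12):
    """Hamming window -> power spectrum -> coarse log-spaced binning -> log -> DCT."""
    window = np.hamming(frame_len)
    # coarse log-spaced bin edges over the FFT bins (a crude mel-like grouping)
    edges = np.unique(np.logspace(np.log10(2), np.log10(frame_len // 2),
                                  n_bins + 1).astype(int))
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        bins = np.array([power[lo:hi].sum() + 1e-10
                         for lo, hi in zip(edges[:-1], edges[1:])])
        feats.append(dct(np.log(bins), norm='ortho')[:n_cep])
    return np.array(feats)          # shape: (number of frames, n_cep)
```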


Speech Recognition as Pattern Classification

Problem 2: acoustic waveforms have different lengths

■ e.g., how to compute:

distance{(~x1, . . . , ~x100), (~x′1, . . . , ~x′150)}

■ answer: dynamic time warping (asymmetric)
  ● introduce concept of alignment A = a1 · · · aT
  ● map each test frame t to a template frame a_t


Alignments

■ take minimum distance over all possible alignments A

distance(x, x′) = min_A distance_A(x, x′)

■ distance given an alignment is . . .
  ● sum of distances between each pair of aligned frames

distance_A{(~x1, . . . , ~xT), (~x′1, . . . , ~x′T′)} = Σ_{t=1}^{T} distance(~xt, ~x′_{a_t})

■ dynamic programming computes min_A distance_A(·, ·) efficiently
  ● search exponential number of alignments in polynomial time
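
A minimal sketch of this dynamic-programming recursion in Python. The asymmetric step pattern (the template index may stay put or advance by 1 or 2 at each test frame) is an assumption; Lab 1's exact path constraints may differ.

```python
import numpy as np

def dtw_distance(test, template):
    """test: (T, D) feature frames; template: (T2, D) feature frames."""
    T, T2 = len(test), len(template)
    frame_dist = lambda t, j: np.linalg.norm(test[t] - template[j])
    D = np.full((T, T2), np.inf)     # D[t, j] = best cost aligning frames 0..t to template 0..j
    D[0, 0] = frame_dist(0, 0)
    for t in range(1, T):
        for j in range(T2):
            # allowed moves: stay on same template frame, or advance by 1 or 2
            best_prev = min(D[t - 1, j],
                            D[t - 1, j - 1] if j >= 1 else np.inf,
                            D[t - 1, j - 2] if j >= 2 else np.inf)
            D[t, j] = best_prev + frame_dist(t, j)
    return D[T - 1, T2 - 1]
```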


DTW as Nearest Neighbor Classification

Lab 1: collect only one training sample xω per class ω

■ variation: keep many/all training samples

[scatter plot: height (cm) vs. weight (kg) for male and female training samples]


Long Live Nearest Neighbor?

Issue 1: outliers

[scatter plot: height (cm) vs. weight (kg), male and female samples illustrating an outlier]

■ k-nearest neighbors?
  ● find k > 1 nearest training samples; pick most frequent class
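
A small k-NN variation of the earlier nearest-neighbor sketch (the choice of k and the tie-breaking via most_common are my assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(x_test, train_x, train_y, k=3):
    """Return the most frequent label among the k nearest training samples."""
    dists = np.linalg.norm(train_x - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
```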


Long Live k-Nearest Neighbors?

Issue 2: different sample densities

[scatter plot: height (cm) vs. weight (kg), two classes with very different sample densities]

■ averaging all samples in a class to find “canonical” sample?


Long Live Canonical Samples?

Issue 3: bimodal distributions

[scatter plot: height (cm) vs. weight (kg), a class with a bimodal sample distribution]

■ what if we try to compute several “canonical” samples?


Is There a Better Way?

■ using “distance” is too restrictive?
■ more generally, want some sort of “score”: score_ω(x)
  ● reflects how strongly point x belongs to class ω

■ to classify a sample x, pick class with highest score


Is There a Better Way?

■ is there a framework for assigning scores that . . .
  ● can handle arbitrarily weird sample distributions?
    ● outliers, varying sample densities, bimodal, etc.
  ● provides guidance on how to find models that perform well at classification?
  ● provides tools for finding these models efficiently?

■ ⇒ probabilistic modeling!


Probability 101

■ probabilities P(x) are nonnegative scores that sum to 1

  Σ_x P(x) = 1,   P(x) ≥ 0

● semantics: P (x) ∼ frequency of event x

■ joint probabilities: P(x, y)

  Σ_{x,y} P(x, y) = 1,   P(x, y) ≥ 0


Probability 101

■ marginal probabilities

P(x) = Σ_y P(x, y)     P(y) = Σ_x P(x, y)

● probability that I trip is equal to probability I trip and fall plus the probability I trip and don’t fall

■ conditional probabilities

P (x, y) = P (x)P (y|x) = P (y)P (x|y)

● probability that I trip and fall is equal to probability that I trip times the probability that I fall given that I trip


Probabilistic Modeling as Nearest Neighbor

■ for each class ω, we have a model Pω(·)
  ● Pω(x) is a probability distribution over samples x
  ● proportional to frequency that samples from class ω are located at/around x
■ to classify a test vector xtest, find nearest model Pω(·)

class(xtest) = arg min_ω distance(xtest, Pω(·))

■ define distance as negative log probability
  ● distance(xtest, Pω(·)) ≡ − log Pω(xtest)


What Does Probabilistic Modeling Give Us?

■ if each of our class probability distributions Pω(x) is (perfectly) accurate . . .
  ● we can perform classification optimally!

■ e.g., two-class classification ω ∈ {M, F} (equally frequent)
  ● choose M if PM(x) > PF(x)
  ● choose F if PF(x) > PM(x)
  ● this is the best you can do!


What Does Probabilistic Modeling Give Us?

■ consider a simple (1-D) probability distribution (Gaussian)
  ● Pω(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − µ)/σ)² ]

  ● parameters µ, σ
  ● assume distribution of samples from class ω is truly Gaussian

■ maximum likelihood estimate (µ_MLE, σ_MLE)
  ● choose parameter values that maximize likelihood/probability of training data (x1, ω1), . . . , (xN, ωN)

(µ_MLE, σ_MLE) = arg max_{(µ,σ)} Π_{i=1}^{N} P_{ωi}(xi)


What Does Probabilistic Modeling Give Us?

■ in the presence of infinite training samples
  ● maximum likelihood estimation (MLE) approaches the “true” parameter estimates
  ● for most models, MLE is asymptotically consistent, unbiased, and efficient
■ maximum likelihood estimation is easy for a wide class of models
  ● count and normalize


Long Live Probabilistic Modeling!

■ if we choose form of Pω(x) correctly and have lots of data . . .
  ● we win!

■ ASR moved to probabilistic modeling in mid-70’s
■ in ASR, no serious challenges to this framework since then
■ by and large, probabilistic modeling . . .
  ● does as well or better than all other techniques . . .
  ● on all classification problems
  ● (not everyone agrees)


What’s the Catch?

■ guarantees only hold if . . .
  ● know form of probability distribution that data comes from
    ● MLE tells us how to find parameters only if we know the form of the model
  ● can actually find MLE
    ● in training HMM’s w/ FB, can only find local optimum
    ● MLE solution for GMM’s is pathological
  ● infinite training set
    ● “there’s no data like more data”

■ these are the challenges!


From DTW to Probabilistic Modeling

■ HMM’s are natural way to express DTW as probabilistic model
■ replace template for each word with an HMM
  ● each frame in template corresponds to state in HMM

[HMM diagram: states 1–7 in a left-to-right chain; each state has a self-loop and a forward arc, with output distributions P(x|a1), . . . , P(x|a6)]

■ alignment A = a1 · · · aT

  ● DTW: map each test frame t to template frame a_t
  ● HMM: map each test frame t to arc a_t in HMM
  ● each alignment corresponds to path through HMM

■ can design HMM such that (see Holmes, Sec. 9.13, p. 155)

distance_DTW(xtest, xω) ≈ − log P_ω^HMM(xtest)


DTW vs. HMM’s

DTW                         HMM
acoustic template           HMM
frame in template           state in HMM
DTW alignment               path through HMM
move cost                   transition (log)prob
distance between frames     output (log)prob
DTW search                  Forward/Viterbi algorithm


Key Points for HMM’s

■ an HMM describes a probability distribution over observation sequences x (i.e., acoustic feature vectors): P(x)
  ● separate HMM for each word ω, describing likely acoustic feature vector sequences for that word: Pω(x)
■ a path through the HMM
  ● alignment A = a1 · · · aT between arcs a_t and observations (frames) x_t

■ to calculate probability Pω(x), sum (or max) probability for each alignment A
  ● Forward algorithm: Pω(x) = Σ_A Pω(x, A)
  ● Viterbi algorithm: Pω(x) ≈ max_A Pω(x, A)


Key Points for HMM’s

■ probability of alignment and observations

Pω(x, A) = Pω(A)Pω(x|A)

  ● alignment probability Pω(A): product of transition probabilities along path

Pω(A) = Π_{t=1}^{T} P_trans(a_t)

  ● observation probability Pω(x|A): product of observation probabilities along path

Pω(x|A) = Π_{t=1}^{T} P_obs(~xt | a_t)


Key Points for HMM’s

■ each observation probability modeled as a Gaussian
  ● diagonal covariance

P_obs(~xt | a_t) = Π_{d=1}^{D} (1 / (√(2π) σ_{a_t,d})) exp[ −(1/2) ((x_{t,d} − µ_{a_t,d}) / σ_{a_t,d})² ]
                 = Π_{d=1}^{D} N(x_{t,d}; µ_{a_t,d}, σ_{a_t,d})

■ or mixture of Gaussians

P_obs(~xt | a_t) = Σ_{m=1}^{M} λ_{a_t,m} Π_{d=1}^{D} N(x_{t,d}; µ_{a_t,m,d}, σ_{a_t,m,d})


Why Gaussians and GMM’s?

■ a Gaussian is good at modeling a single cluster of samples

[scatter plot: training samples for some arc in HMM]


Why Gaussians and GMM’s?

■ if you sum enough Gaussians, you can approximate any distribution arbitrarily well

[scatter plot: training samples for some arc in HMM]


Optimizing HMM’s for Isolated Word Recognition

■ each frame in DTW template for word ω corresponds to state in HMM?

■ consecutive frames may be very similar to each other
  ● no point having consecutive states in HMM with nearly identical output distributions

[HMM diagram: states 1–7 in a left-to-right chain with self-loops; arc output distributions P(x|a1), . . . , P(x|a6)]

■ how many states do we really need?
  ● rule of thumb: three states per phoneme
  ● one for start, middle, and end of phoneme


Word HMM’s

■ word TWO is composed of phonemes T UW

■ two phonemes ⇒ six HMM states

[HMM diagram: states 1–7 in a left-to-right chain with self-loops; arc output distributions P(x|a1), . . . , P(x|a6)]

■ convention: same output distribution for all arcs exiting a state
  ● or place output distributions on states

■ parameters: 12 transition probabilities, 6 output distributions


Isolated Word Recognition with HMM’s

■ training
  ● for each class (word) ω, collect all training samples x with that label
  ● to train HMM for word ω, run FB on these samples

■ decoding a test sample xtest
  ● calculate (Viterbi or Forward) likelihood Pω(xtest) of sample with each word model Pω(·), pick best

class(xtest) = arg max_ω Pω(xtest)


Control Flow for Training

For each iteration of Forward-Backward training

■ initialize training counts to 0
■ for each utterance (xi, ωi) in training data
  ● calculate acoustic feature vectors from xi using front end
  ● construct HMM representing word sequence ωi
  ● run FB algorithm to update training counts for all transitions present in HMM

■ reestimate HMM parameters using training counts


Decoding with HMM’s

■ instead of separate HMM for each word ω
  ● merge each word HMM into single big HMM

[decoding graph: states 1 and 2 connected by parallel word arcs ONE, TWO, . . . ]

■ use Viterbi algorithm (not Forward algorithm, why?)
  ● in backtrace, collect all words along the way

■ why one big graph?
  ● when move to word sequences, can’t have HMM per sequence
  ● pruning during search
  ● graph minimization


Continuous Speech Recognition with HMM’s

■ e.g., ASR for digit sequences of arbitrary length
  ● set of classes ω for classification is infinite

■ instead of separate HMM parameters for each digit string ω
  ● have one HMM for each digit as before
  ● glue together to make HMM for a digit string


Continuous Speech Recognition with HMM’s

■ training
  ● e.g., reference transcript: ONE TWO

[HMM diagram: states 1 → 2 → 3 with arcs labeled ONE and TWO]

  ● for each training sample, update counts for each word HMM in transcript

■ decoding
  ● want HMM that represents all digit strings

[HMM diagram: a single state 1 with looping arcs ONE, TWO, THREE, . . . ]


Silence is Golden

■ what about all those pauses in speech?
  ● at begin/end of utterance; in-between words

■ add silence word ∼SIL to vocabulary
  ● three states? (skip arcs?)

■ allow optional silence at beginning/end of utterances and in-between words


Silence is Golden

■ e.g., HMM for training with reference script: ONE TWO

[HMM diagram: states 1–6 with ONE and TWO arcs, and optional ~SIL arcs at the start, between the words, and at the end]

■ in practice, often precompute best word sequence using some other model
  ● e.g., bootstrap models without using optional silences

■ Lab 2: all these graphs are automatically constructed for you
  ● silence is also used in isolated word training/decoding


Wreck a Nice Beach?

The story so far:

■ to decode a test sample xtest, find best word sequence ω

class(xtest) = arg max_ω Pω(xtest) ≡ arg max_ω P(xtest | ω)

  ● find word sequence ω maximizing likelihood P(xtest | ω)
  ● maximum likelihood classification

■ well, you know:

THIS IS OUR ROOM FOR A FOUR HOUR PERIOD .
THIS IS HOUR ROOM FOUR A FOR OUR . PERIOD
