CS 224S / LINGUIST 285: Spoken Language Processing
Andrew Maas, Stanford University
Spring 2017
Lecture 3: ASR: HMMs, Forward, Viterbi
Original slides by Dan Jurafsky

Fun informative read on phonetics: The Art of Language Invention. David J. Peterson. 2015. http://www.artoflanguageinvention.com/books/
Outline for Today
- ASR Architecture
- Decoding with HMMs
  - Forward
  - Viterbi Decoding
- How this fits into the ASR component of the course
  - On your own: N-grams and Language Modeling
  - Apr 12: Training, Advanced Decoding
  - Apr 17: Feature Extraction, GMM Acoustic Modeling
  - Apr 24: Neural Network Acoustic Models
  - May 1: End-to-end neural network speech recognition
The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
[Figure: the noisy channel. A source sentence passes through the noisy channel; the decoder searches over candidate source sentences ("Every happy family ...", "In a hole in the ground ...", "If music be the food of love ...") and guesses the most probable source: "If music be the food of love".]
The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, …, ot
- Define a sentence as a sequence of words: W = w1, w2, w3, …, wn
Noisy Channel Model (III)
- Probabilistic implication: pick the highest-probability sentence W:
  $\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)$
- We can use Bayes' rule to rewrite this:
  $\hat{W} = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}$
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
  $\hat{W} = \operatorname*{argmax}_{W \in L} P(O \mid W)\,P(W)$
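To make the decoding rule concrete, here is a minimal sketch (not from the slides; `acoustic_logprob` and `lm_logprob` are hypothetical stand-ins for the acoustic and language model scores):

```python
def decode(observations, candidate_sentences, acoustic_logprob, lm_logprob):
    """Noisy-channel decoding: pick the sentence W maximizing
    P(O|W) * P(W), scored in log space for numerical stability."""
    return max(
        candidate_sentences,
        key=lambda W: acoustic_logprob(observations, W) + lm_logprob(W),
    )
```

In practice the decoder never enumerates all sentences; the Viterbi algorithm below searches this space with dynamic programming.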
Speech Recognition Architecture
[Figure: speech recognition architecture. Cepstral feature extraction turns the waveform into MFCC features; a Gaussian acoustic model produces phone likelihoods; an HMM lexicon and an N-gram language model constrain the search; the Viterbi decoder outputs the word string W maximizing P(O|W)P(W), e.g. "if music be the food of love ...".]
Noisy channel model
$\hat{W} = \operatorname*{argmax}_{W \in L} \overbrace{P(O \mid W)}^{\text{likelihood}}\ \overbrace{P(W)}^{\text{prior}}$
The noisy channel model
Ignoring the denominator leaves us with two factors: P(Source) and P(Signal | Source).
Speech Architecture meets Noisy Channel
Decoding Architecture: five easy pieces
- Feature Extraction: 39 "MFCC" features
- Acoustic Model: Gaussians for computing p(o|q)
- Lexicon/Pronunciation Model: HMM for what phones can follow each other
- Language Model: N-grams for computing p(wi|wi-1)
- Decoder: Viterbi algorithm, dynamic programming for combining all these to get the word sequence from speech
Lexicon
- A list of words, each with a pronunciation in terms of phones
- We get these from an on-line pronunciation dictionary
  - CMU dictionary: 127K words: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- We'll represent the lexicon as an HMM
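In code, a lexicon is just a map from words to phone strings. A minimal sketch (entries follow CMUdict's ARPAbet conventions with stress marks dropped; treat them as illustrative):

```python
# Word -> phone sequence, CMUdict-style.
lexicon = {
    "six":  ["S", "IH", "K", "S"],
    "five": ["F", "AY", "V"],
    "one":  ["W", "AH", "N"],
}
```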
HMMs for speech
Phones are not homogeneous!
[Figure: spectrogram (0-5000 Hz) of the stretch 0.48-0.94 s containing the phones ay and k: the spectral pattern changes over the course of each phone.]
Each phone has 3 subphones
[Figure: a phone HMM with three subphone states (beginning, middle, end), each with a self-loop transition a_ii and a forward transition a_i,i+1.]
Resulting HMM word model for “six”
[Figure: start → s1 s2 s3 → ih1 ih2 ih3 → k1 k2 k3 → s1 s2 s3 → end, each subphone state with a self-loop.]
HMM for the digit recognition task
Markov chain for weather
[Figure: Markov chain for weather: states such as hot, cold, and warm plus start and end states, connected by transition probabilities a_ij.]
Markov chain for words
[Figure: the same Markov chain structure with words as the states, connected by transition probabilities a_ij.]
Markov chain = First-order observable Markov Model
- A set of states Q = q1, q2 … qN; the state at time t is qt
- Transition probabilities: a set of probabilities A = a01 a02 … an1 … ann
  - Each aij represents the probability of transitioning from state i to state j:
    $a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N$
  - The set of these is the transition probability matrix A:
    $\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$
- Distinguished start and end states
Markov chain = First-order observable Markov Model
- The current state only depends on the previous state (Markov assumption):
  $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$
Another representation for start state
- Instead of a start state, use a special initial probability vector π, an initial distribution over start states:
  $\pi_i = P(q_1 = i), \quad 1 \le i \le N$
- Constraint: $\sum_{j=1}^{N} \pi_j = 1$
The weather figure using pi
The weather figure: specific example
Markov chain for weather
- What is the probability of 4 consecutive warm days?
- The sequence is warm-warm-warm-warm, i.e., state sequence 3-3-3-3
- P(3, 3, 3, 3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)³ = 0.0432
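A minimal sketch of this computation (the 0.2 start probability and 0.6 self-loop for warm are the slide's values; the function itself is generic):

```python
def markov_chain_prob(states, pi, A):
    """Probability of a state sequence under a first-order Markov chain:
    pi[first state] times the product of the transition probabilities."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

pi = {"warm": 0.2}
A = {"warm": {"warm": 0.6}}
print(markov_chain_prob(["warm"] * 4, pi, A))  # 0.2 * 0.6**3 = 0.0432
```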
How about?
- Hot hot hot hot
- Cold hot cold hot
- What does the difference in these probabilities tell you about the real-world weather info encoded in the figure?
HMM for Ice Cream
- You are a climatologist in the year 2799, studying global warming
- You can't find any records of the weather in Baltimore, MD for the summer of 2008
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was
Hidden Markov Model
- For Markov chains, output symbols = state symbols: see hot weather, we're in state hot
- But not in speech recognition
  - Output symbols: vectors of acoustics (cepstral features)
  - Hidden states: phones
- So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states.
- This means we don't know which state we are in.
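As a data structure, an HMM is just the triple (π, A, B). A minimal sketch with the Eisner ice-cream parameters used in the figures below (reconstructed from the slides; treat the exact numbers as illustrative):

```python
states = ["H", "C"]                  # hidden states: hot, cold
pi = {"H": 0.8, "C": 0.2}            # initial state distribution
A = {"H": {"H": 0.7, "C": 0.3},      # transition probabilities P(next | current)
     "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},  # emission probabilities
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}  # P(ice creams eaten | weather)
```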
Hidden Markov Models
Assumptions
- Markov assumption:
  $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$
- Output-independence assumption:
  $P(o_t \mid o_1^{t-1}, q_1^{t}) = P(o_t \mid q_t)$
Eisner task
- Given observed ice cream sequence: 1, 2, 3, 2, 2, 2, 3, …
- Produce hidden weather sequence: H, C, H, H, H, C, …
HMM for ice cream
[Figure: ice-cream HMM. Hidden states HOT and COLD; start probabilities P(H|start) = .8, P(C|start) = .2; transitions P(H|H) = .7, P(C|H) = .3, P(H|C) = .4, P(C|C) = .6; emissions P(1|HOT) = .2, P(2|HOT) = .4, P(3|HOT) = .4 and P(1|COLD) = .5, P(2|COLD) = .4, P(3|COLD) = .1.]
Different types of HMM structure
- Bakis = left-to-right
- Ergodic = fully connected
The Three Basic Problems for HMMs
- Problem 1 (Evaluation): Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
- Problem 2 (Decoding): Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B), how do we choose a corresponding state sequence Q = (q1 q2 … qT) that is optimal in some sense (i.e., best explains the observations)?
- Problem 3 (Learning): How do we adjust the model parameters λ = (A, B) to maximize P(O|λ)?
Jack Ferguson at IDA in the 1960s
Problem 1: computing the observation likelihood
Given the following HMM, how likely is the sequence 3 1 3?
[Figure: the ice-cream HMM shown above.]
How to compute likelihood
- For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities
- But for an HMM, we don't know what the states are!
- So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence
  - Suppose we knew the weather and wanted to predict how much ice cream Jason would eat
  - i.e., P(3 1 3 | H H C)
Computing likelihood of 3 1 3 given hidden state sequence
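The missing figure's computation follows directly from the output-independence assumption, using the ice-cream emission probabilities above (reconstructed values):

$P(3\,1\,3 \mid H\,H\,C) = P(3 \mid H) \cdot P(1 \mid H) \cdot P(3 \mid C) = .4 \times .2 \times .1 = .008$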
Computing joint probability of observation and state sequence
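Likewise for the joint probability, multiplying in the transition probabilities (same reconstructed parameters):

$P(3\,1\,3,\ H\,H\,C) = P(H \mid start)\,P(3 \mid H) \cdot P(H \mid H)\,P(1 \mid H) \cdot P(C \mid H)\,P(3 \mid C) = .8 \times .4 \times .7 \times .2 \times .3 \times .1 = .001344$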
Computing total likelihood of 3 1 3
- We would need to sum over
  - hot hot cold
  - hot hot hot
  - hot cold hot
  - …
- How many possible hidden state sequences are there for this sequence?
- How about in general, for an HMM with N hidden states and a sequence of T observations? N^T
- So we can't just do a separate computation for each hidden state sequence.
Instead: the Forward algorithm
- A dynamic programming algorithm, just like Minimum Edit Distance or CKY parsing
- Uses a table to store intermediate values
- Idea: compute the likelihood of the observation sequence by summing over all possible hidden state sequences, but do it efficiently by folding all the sequences into a single trellis
The forward algorithm
- The goal of the forward algorithm is to compute
  $P(o_1, o_2, \dots, o_T, q_T = q_F \mid \lambda)$
- We'll do this by recursion
The forward algorithm
- Each cell of the forward algorithm trellis, αt(j):
  - represents the probability of being in state j after seeing the first t observations, given the automaton
- Each cell thus expresses the following probability:
  $\alpha_t(j) = P(o_1, o_2, \dots, o_t, q_t = j \mid \lambda)$
The Forward Recursion
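The recursion itself (the slide's image is missing; this is the standard formulation, as in Jurafsky and Martin):

1. Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N$
2. Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 < t \le T,\ 1 \le j \le N$
3. Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$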
The Forward Trellis
[Figure: the forward trellis for observation sequence 3 1 3. At t = 1: α1(H) = P(H|start)·P(3|H) = .8 × .4 = .32 and α1(C) = P(C|start)·P(3|C) = .2 × .1 = .02. At t = 2: α2(H) = .32 × .14 + .02 × .08 = .0464, where .14 = P(H|H)·P(1|H) and .08 = P(H|C)·P(1|H); similarly α2(C) = .32 × .15 + .02 × .30 = .054.]
We update each cell
[Figure: each trellis cell αt(j) is computed by summing, over all states i, the previous cell αt-1(i) times the transition probability a_ij times the emission probability b_j(ot).]
The Forward Algorithm
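The slide's pseudocode image did not survive extraction; here is a minimal runnable sketch (the π/A/B dicts repeat the ice-cream parameters reconstructed above):

```python
def forward(observations, states, pi, A, B):
    """Forward algorithm: total probability of the observations,
    summed over all hidden state paths via a trellis of alphas."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{s: pi[s] * B[s][observations[0]] for s in states}]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in observations[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    # Termination: sum over the final column of the trellis
    return sum(alpha[-1].values())

pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
print(forward([3, 1, 3], ["H", "C"], pi, A, B))
```

With these numbers the first two trellis columns match the figure above: α1(H) = .32, α1(C) = .02, α2(H) = .0464.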
Decoding
- Given an observation sequence (3 1 3) and an HMM, the task of the decoder is to find the best hidden state sequence
- Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B), how do we choose a corresponding state sequence Q = (q1 q2 … qT) that is optimal in some sense (i.e., best explains the observations)?
Decoding
- One possibility:
  - For each hidden state sequence Q (HHH, HHC, HCH, …), compute P(O|Q)
  - Pick the highest one
- Why not? N^T
- Instead: the Viterbi algorithm
  - Again a dynamic programming algorithm
  - Uses a similar trellis to the Forward algorithm
Viterbi intuition
- We want to compute the joint probability of the observation sequence together with the best state sequence:
  $\max_{q_0, q_1, \dots, q_T} P(q_0, q_1, \dots, q_T, o_1, o_2, \dots, o_T, q_T = q_F \mid \lambda)$
Viterbi Recursion
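The recursion (the image is missing; this is the standard formulation) replaces the forward algorithm's sum with a max, and records backpointers:

$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$

$bt_t(j) = \operatorname*{argmax}_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$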
The Viterbi trellis
[Figure: the Viterbi trellis for 3 1 3. At t = 1: v1(H) = .8 × .4 = .32 and v1(C) = .2 × .1 = .02. At t = 2: v2(H) = max(.32 × .14, .02 × .08) = .0448 and v2(C) = max(.32 × .15, .02 × .30) = .048: the same arc scores as the forward trellis, but with max in place of sum.]
Viterbi intuition
- Process the observation sequence left to right, filling out the trellis
- Each cell: $v_t(j) = \max_i v_{t-1}(i)\, a_{ij}\, b_j(o_t)$
Viterbi Algorithm
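Again the pseudocode image is missing; a minimal runnable sketch (same illustrative ice-cream parameters):

```python
def viterbi(observations, states, pi, A, B):
    """Viterbi algorithm: most probable hidden state sequence,
    via a max-product trellis plus backpointers."""
    # Initialization
    v = [{s: pi[s] * B[s][observations[0]] for s in states}]
    backpointers = [{}]
    # Recursion: keep only the best incoming path into each state
    for o in observations[1:]:
        prev, col, bp = v[-1], {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] * A[i][j])
            bp[j] = best
            col[j] = prev[best] * A[best][j] * B[j][o]
        v.append(col)
        backpointers.append(bp)
    # Termination: best final state, then trace backpointers in reverse
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for bp in reversed(backpointers[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path)), v[-1][last]

pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
print(viterbi([3, 1, 3], ["H", "C"], pi, A, B))  # (['H', 'H', 'H'], 0.012544)
```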
Viterbi backtrace
[Figure: the same Viterbi trellis, now with the backtrace: from the best final state, follow the stored backpointers right to left to recover the best hidden state sequence.]
HMMs for Speech
- We haven't yet shown how to learn the A and B matrices for HMMs; we'll do that on Thursday (the Baum-Welch, or Forward-Backward, algorithm)
- But let's return to thinking about speech
Reminder: a word looks like this:
[Figure: the HMM word model for "six" from earlier: start → s1 s2 s3 → ih1 ih2 ih3 → k1 k2 k3 → s1 s2 s3 → end.]
HMM for digit recognition task
The Evaluation (forward) problem for speech
- The observation sequence O is a series of MFCC vectors
- The hidden states W are the phones and words
- For a given phone/word string W, our job is to evaluate P(O|W)
- Intuition: how likely is the input to have been generated by just that word string W?
Evaluation for speech: summing over all different paths!
- f ay ay ay ay v v v v
- f f ay ay ay ay v v v
- f f f f ay ay ay ay v
- f f ay ay ay ay ay ay v
- f f ay ay ay ay ay ay ay ay v
- f f ay v v v v v v v
The forward lattice for “five”
The forward trellis for “five”
Viterbi trellis for “five”
Viterbi trellis for “five”
Search space with bigrams
[Figure: search space with bigrams. The phone HMMs for each word (e.g. one, two, zero) are joined by bigram transitions such as P(one | two), P(two | zero), and P(zero | zero), so the decoder searches over words and phones jointly.]
Viterbi trellis
Viterbi backtrace
Summary: ASR Architecture
- Five easy pieces: ASR noisy channel architecture
  - Feature extraction: 39 "MFCC" features
  - Acoustic model: Gaussians for computing p(o|q)
  - Lexicon/pronunciation model: HMM for what phones can follow each other
  - Language model: N-grams for computing p(wi|wi-1)
  - Decoder: Viterbi algorithm, dynamic programming for combining all these to get the word sequence from speech