CS 224S / LINGUIST 285: Spoken Language Processing
Andrew Maas, Stanford University
Spring 2017
Lecture 3: ASR: HMMs, Forward, Viterbi
Original slides by Dan Jurafsky

Fun informative read on phonetics: The Art of Language Invention. David J. Peterson. 2015. http://www.artoflanguageinvention.com/books/
Outline for Today
- ASR Architecture
- Decoding with HMMs
  - Forward
  - Viterbi Decoding
- How this fits into the ASR component of the course
  - On your own: N-grams and Language Modeling
  - Apr 12: Training, Advanced Decoding
  - Apr 17: Feature Extraction, GMM Acoustic Modeling
  - Apr 24: Neural Network Acoustic Models
  - May 1: End-to-end neural network speech recognition
The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
[Figure: the noisy channel. A source sentence passes through the noisy channel; the decoder searches over candidate source sentences ("Every happy family ...", "In a hole in the ground ...", "If music be the food of love ...") and guesses the most probable source: "If music be the food of love".]
The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, …, ot
- Define a sentence as a sequence of words: W = w1, w2, w3, …, wn
Noisy Channel Model (III)
- Probabilistic implication: pick the highest-probability sentence W:
  $\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)$
- We can use Bayes' rule to rewrite this:
  $\hat{W} = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}$
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
  $\hat{W} = \operatorname*{argmax}_{W \in L} P(O \mid W)\,P(W)$
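To make the decoding rule concrete, here is a minimal sketch (not from the slides; `acoustic_logprob` and `lm_logprob` are hypothetical stand-ins for the acoustic and language model scores):

```python
def decode(observations, candidate_sentences, acoustic_logprob, lm_logprob):
    """Noisy-channel decoding: pick the sentence W maximizing
    P(O|W) * P(W), scored in log space for numerical stability."""
    return max(
        candidate_sentences,
        key=lambda W: acoustic_logprob(observations, W) + lm_logprob(W),
    )
```

In practice the decoder never enumerates all sentences; the Viterbi algorithm below searches this space with dynamic programming.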
Speech Recognition Architecture
[Figure: speech recognition architecture. Cepstral feature extraction turns the waveform into MFCC features; a Gaussian acoustic model produces phone likelihoods; an HMM lexicon and an N-gram language model constrain the search; the Viterbi decoder outputs the word string W maximizing P(O|W)P(W), e.g. "if music be the food of love ...".]
Noisy channel model
$\hat{W} = \operatorname*{argmax}_{W \in L} \overbrace{P(O \mid W)}^{\text{likelihood}}\ \overbrace{P(W)}^{\text{prior}}$
The noisy channel model
Ignoring the denominator leaves us with two factors: P(Source) and P(Signal | Source).
Speech Architecture meets Noisy Channel
Decoding Architecture: five easy pieces
- Feature Extraction: 39 "MFCC" features
- Acoustic Model: Gaussians for computing p(o|q)
- Lexicon/Pronunciation Model: HMM for what phones can follow each other
- Language Model: N-grams for computing p(wi|wi-1)
- Decoder: Viterbi algorithm, dynamic programming for combining all these to get the word sequence from speech
Lexicon
- A list of words, each with a pronunciation in terms of phones
- We get these from an on-line pronunciation dictionary
  - CMU dictionary: 127K words: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- We'll represent the lexicon as an HMM
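In code, a lexicon is just a map from words to phone strings. A minimal sketch (entries follow CMUdict's ARPAbet conventions with stress marks dropped; treat them as illustrative):

```python
# Word -> phone sequence, CMUdict-style.
lexicon = {
    "six":  ["S", "IH", "K", "S"],
    "five": ["F", "AY", "V"],
    "one":  ["W", "AH", "N"],
}
```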
HMMs for speech
Phones are not homogeneous!
[Figure: spectrogram (0-5000 Hz) of the stretch 0.48-0.94 s containing the phones ay and k: the spectral pattern changes over the course of each phone.]
Each phone has 3 subphones
[Figure: a phone HMM with three subphone states (beginning, middle, end), each with a self-loop transition a_ii and a forward transition a_i,i+1.]
Resulting HMM word model for “six”
[Figure: start → s1 s2 s3 → ih1 ih2 ih3 → k1 k2 k3 → s1 s2 s3 → end, each subphone state with a self-loop.]
HMM for the digit recognition task
Markov chain for weather
[Figure: Markov chain for weather: states such as hot, cold, and warm plus start and end states, connected by transition probabilities a_ij.]
Markov chain for words
[Figure: the same Markov chain structure with words as the states, connected by transition probabilities a_ij.]
Markov chain = First-order observable Markov Model
- A set of states Q = q1, q2 … qN; the state at time t is qt
- Transition probabilities: a set of probabilities A = a01 a02 … an1 … ann
  - Each aij represents the probability of transitioning from state i to state j:
    $a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N$
  - The set of these is the transition probability matrix A:
    $\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$
- Distinguished start and end states
Markov chain = First-order observable Markov Model
- The current state only depends on the previous state (Markov assumption):
  $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$
Another representation for start state
- Instead of a start state, use a special initial probability vector π, an initial distribution over start states:
  $\pi_i = P(q_1 = i), \quad 1 \le i \le N$
- Constraint: $\sum_{j=1}^{N} \pi_j = 1$
The weather figure using pi
The weather figure: specific example
Markov chain for weather
- What is the probability of 4 consecutive warm days?
- The sequence is warm-warm-warm-warm, i.e., state sequence 3-3-3-3
- P(3, 3, 3, 3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)³ = 0.0432
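A minimal sketch of this computation (the 0.2 start probability and 0.6 self-loop for warm are the slide's values; the function itself is generic):

```python
def markov_chain_prob(states, pi, A):
    """Probability of a state sequence under a first-order Markov chain:
    pi[first state] times the product of the transition probabilities."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

pi = {"warm": 0.2}
A = {"warm": {"warm": 0.6}}
print(markov_chain_prob(["warm"] * 4, pi, A))  # 0.2 * 0.6**3 = 0.0432
```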
How about?
- Hot hot hot hot
- Cold hot cold hot
- What does the difference in these probabilities tell you about the real-world weather info encoded in the figure?
HMM for Ice Cream
- You are a climatologist in the year 2799, studying global warming
- You can't find any records of the weather in Baltimore, MD for the summer of 2008
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was
Hidden Markov Model
- For Markov chains, output symbols = state symbols: see hot weather, we're in state hot
- But not in speech recognition
  - Output symbols: vectors of acoustics (cepstral features)
  - Hidden states: phones
- So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states.
- This means we don't know which state we are in.
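As a data structure, an HMM is just the triple (π, A, B). A minimal sketch with the Eisner ice-cream parameters used in the figures below (reconstructed from the slides; treat the exact numbers as illustrative):

```python
states = ["H", "C"]                  # hidden states: hot, cold
pi = {"H": 0.8, "C": 0.2}            # initial state distribution
A = {"H": {"H": 0.7, "C": 0.3},      # transition probabilities P(next | current)
     "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},  # emission probabilities
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}  # P(ice creams eaten | weather)
```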
Hidden Markov Models
Assumptions
- Markov assumption:
  $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$
- Output-independence assumption:
  $P(o_t \mid o_1^{t-1}, q_1^{t}) = P(o_t \mid q_t)$
Eisner task
- Given observed ice cream sequence: 1, 2, 3, 2, 2, 2, 3, …
- Produce hidden weather sequence: H, C, H, H, H, C, …
HMM for ice cream
[Figure: ice-cream HMM. Hidden states HOT and COLD; start probabilities P(H|start) = .8, P(C|start) = .2; transitions P(H|H) = .7, P(C|H) = .3, P(H|C) = .4, P(C|C) = .6; emissions P(1|HOT) = .2, P(2|HOT) = .4, P(3|HOT) = .4 and P(1|COLD) = .5, P(2|COLD) = .4, P(3|COLD) = .1.]
Different types of HMM structure
- Bakis = left-to-right
- Ergodic = fully connected
The Three Basic Problems for HMMs
- Problem 1 (Evaluation): Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
- Problem 2 (Decoding): Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B), how do we choose a corresponding state sequence Q = (q1 q2 … qT) that is optimal in some sense (i.e., best explains the observations)?
- Problem 3 (Learning): How do we adjust the model parameters λ = (A, B) to maximize P(O|λ)?
Jack Ferguson at IDA in the 1960s
Problem 1: computing the observation likelihood
Given the following HMM, how likely is the sequence 3 1 3?
[Figure: the ice-cream HMM shown above.]
How to compute likelihood
- For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities
- But for an HMM, we don't know what the states are!
- So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence
  - Suppose we knew the weather and wanted to predict how much ice cream Jason would eat
  - i.e., P(3 1 3 | H H C)
Computing likelihood of 3 1 3 given hidden state sequence
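The missing figure's computation follows directly from the output-independence assumption, using the ice-cream emission probabilities above (reconstructed values):

$P(3\,1\,3 \mid H\,H\,C) = P(3 \mid H) \cdot P(1 \mid H) \cdot P(3 \mid C) = .4 \times .2 \times .1 = .008$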
Computing joint probability of observation and state sequence
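Likewise for the joint probability, multiplying in the transition probabilities (same reconstructed parameters):

$P(3\,1\,3,\ H\,H\,C) = P(H \mid start)\,P(3 \mid H) \cdot P(H \mid H)\,P(1 \mid H) \cdot P(C \mid H)\,P(3 \mid C) = .8 \times .4 \times .7 \times .2 \times .3 \times .1 = .001344$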
Computing total likelihood of 3 1 3
- We would need to sum over
  - hot hot cold
  - hot hot hot
  - hot cold hot
  - …
- How many possible hidden state sequences are there for this sequence?
- How about in general, for an HMM with N hidden states and a sequence of T observations? N^T
- So we can't just do a separate computation for each hidden state sequence.
Instead: the Forward algorithm
- A dynamic programming algorithm, just like Minimum Edit Distance or CKY parsing
- Uses a table to store intermediate values
- Idea: compute the likelihood of the observation sequence by summing over all possible hidden state sequences, but do it efficiently by folding all the sequences into a single trellis
The forward algorithm
- The goal of the forward algorithm is to compute
  $P(o_1, o_2, \dots, o_T, q_T = q_F \mid \lambda)$
- We'll do this by recursion
The forward algorithm
- Each cell of the forward algorithm trellis, αt(j):
  - represents the probability of being in state j after seeing the first t observations, given the automaton
- Each cell thus expresses the following probability:
  $\alpha_t(j) = P(o_1, o_2, \dots, o_t, q_t = j \mid \lambda)$
The Forward Recursion
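The recursion itself (the slide's image is missing; this is the standard formulation, as in Jurafsky and Martin):

1. Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N$
2. Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 < t \le T,\ 1 \le j \le N$
3. Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$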
The Forward Trellis
[Figure: the forward trellis for observation sequence 3 1 3. At t = 1: α1(H) = P(H|start)·P(3|H) = .8 × .4 = .32 and α1(C) = P(C|start)·P(3|C) = .2 × .1 = .02. At t = 2: α2(H) = .32 × .14 + .02 × .08 = .0464, where .14 = P(H|H)·P(1|H) and .08 = P(H|C)·P(1|H); similarly α2(C) = .32 × .15 + .02 × .30 = .054.]
We update each cell
[Figure: each trellis cell αt(j) is computed by summing, over all states i, the previous cell αt-1(i) times the transition probability a_ij times the emission probability b_j(ot).]
The Forward Algorithm
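The slide's pseudocode image did not survive extraction; here is a minimal runnable sketch (the π/A/B dicts repeat the ice-cream parameters reconstructed above):

```python
def forward(observations, states, pi, A, B):
    """Forward algorithm: total probability of the observations,
    summed over all hidden state paths via a trellis of alphas."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{s: pi[s] * B[s][observations[0]] for s in states}]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in observations[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    # Termination: sum over the final column of the trellis
    return sum(alpha[-1].values())

pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
print(forward([3, 1, 3], ["H", "C"], pi, A, B))
```

With these numbers the first two trellis columns match the figure above: α1(H) = .32, α1(C) = .02, α2(H) = .0464.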
Decoding
- Given an observation sequence (3 1 3) and an HMM, the task of the decoder is to find the best hidden state sequence
- Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B), how do we choose a corresponding state sequence Q = (q1 q2 … qT) that is optimal in some sense (i.e., best explains the observations)?
Decoding
- One possibility:
  - For each hidden state sequence Q (HHH, HHC, HCH, …), compute P(O|Q)
  - Pick the highest one
- Why not? N^T
- Instead: the Viterbi algorithm
  - Again a dynamic programming algorithm
  - Uses a similar trellis to the Forward algorithm
Viterbi intuition
- We want to compute the joint probability of the observation sequence together with the best state sequence:
  $\max_{q_0, q_1, \dots, q_T} P(q_0, q_1, \dots, q_T, o_1, o_2, \dots, o_T, q_T = q_F \mid \lambda)$
Viterbi Recursion
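The recursion (the image is missing; this is the standard formulation) replaces the forward algorithm's sum with a max, and records backpointers:

$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$

$bt_t(j) = \operatorname*{argmax}_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$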
The Viterbi trellis
[Figure: the Viterbi trellis for 3 1 3. At t = 1: v1(H) = .8 × .4 = .32 and v1(C) = .2 × .1 = .02. At t = 2: v2(H) = max(.32 × .14, .02 × .08) = .0448 and v2(C) = max(.32 × .15, .02 × .30) = .048: the same arc scores as the forward trellis, but with max in place of sum.]
Viterbi intuition
- Process the observation sequence left to right, filling out the trellis
- Each cell: $v_t(j) = \max_i v_{t-1}(i)\, a_{ij}\, b_j(o_t)$
Viterbi Algorithm
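Again the pseudocode image is missing; a minimal runnable sketch (same illustrative ice-cream parameters):

```python
def viterbi(observations, states, pi, A, B):
    """Viterbi algorithm: most probable hidden state sequence,
    via a max-product trellis plus backpointers."""
    # Initialization
    v = [{s: pi[s] * B[s][observations[0]] for s in states}]
    backpointers = [{}]
    # Recursion: keep only the best incoming path into each state
    for o in observations[1:]:
        prev, col, bp = v[-1], {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] * A[i][j])
            bp[j] = best
            col[j] = prev[best] * A[best][j] * B[j][o]
        v.append(col)
        backpointers.append(bp)
    # Termination: best final state, then trace backpointers in reverse
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for bp in reversed(backpointers[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path)), v[-1][last]

pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
print(viterbi([3, 1, 3], ["H", "C"], pi, A, B))  # (['H', 'H', 'H'], 0.012544)
```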
Viterbi backtrace
[Figure: the same Viterbi trellis, now with the backtrace: from the best final state, follow the stored backpointers right to left to recover the best hidden state sequence.]
HMMs for Speech
- We haven't yet shown how to learn the A and B matrices for HMMs; we'll do that on Thursday (the Baum-Welch, or Forward-Backward, algorithm)
- But let's return to thinking about speech
Reminder: a word looks like this:
[Figure: the HMM word model for "six" from earlier: start → s1 s2 s3 → ih1 ih2 ih3 → k1 k2 k3 → s1 s2 s3 → end.]
HMM for digit recognition task
The Evaluation (forward) problem for speech
- The observation sequence O is a series of MFCC vectors
- The hidden states W are the phones and words
- For a given phone/word string W, our job is to evaluate P(O|W)
- Intuition: how likely is the input to have been generated by just that word string W?
Evaluation for speech: summing over all different paths!
- f ay ay ay ay v v v v
- f f ay ay ay ay v v v
- f f f f ay ay ay ay v
- f f ay ay ay ay ay ay v
- f f ay ay ay ay ay ay ay ay v
- f f ay v v v v v v v
The forward lattice for “five”
The forward trellis for “five”
Viterbi trellis for “five”
Viterbi trellis for “five”
Search space with bigrams
[Figure: search space with bigrams. The phone HMMs for each word (e.g. one, two, zero) are joined by bigram transitions such as P(one | two), P(two | zero), and P(zero | zero), so the decoder searches over words and phones jointly.]
Viterbi trellis
Viterbi backtrace
Summary: ASR Architecture
- Five easy pieces: ASR noisy channel architecture
  - Feature extraction: 39 "MFCC" features
  - Acoustic model: Gaussians for computing p(o|q)
  - Lexicon/pronunciation model: HMM for what phones can follow each other
  - Language model: N-grams for computing p(wi|wi-1)
  - Decoder: Viterbi algorithm, dynamic programming for combining all these to get the word sequence from speech