Outline
• Hidden Markov Models – Formalism
• The Three Basic Problems of HMMs
• Solutions
• Applications of HMMs for Automatic Speech Recognition (ASR)
Example: The Dishonest Casino
A casino has two dice:
• Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
Casino player switches back-&-forth between fair and loaded die once in a while
Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die, maybe with loaded die)
4. Highest number wins $2
Question # 1 – Evaluation
GIVEN
A sequence of rolls by the casino player
12455264621461461361366616646616366163661636165
QUESTION
How likely is this sequence, given our model of how the casino works?
This is the EVALUATION problem in HMMs
Question # 2 – Decoding
GIVEN
A sequence of rolls by the casino player
12455264621461461361366616646616366163661636165
QUESTION
What portion of the sequence was generated with the fair die, and what portion with the loaded die?
This is the DECODING question in HMMs
Question # 3 – Learning
GIVEN
A sequence of rolls by the casino player
12455264621461461361366616646616366163661636165
QUESTION
How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs
The dishonest casino model
Two states: FAIR and LOADED.

Transitions: P(Fair → Fair) = 0.95, P(Fair → Loaded) = 0.05, P(Loaded → Loaded) = 0.95, P(Loaded → Fair) = 0.05

Emissions:
P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
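For concreteness, this model can be written down directly as arrays. The following Python/NumPy encoding is a minimal sketch of my own (state 0 = Fair, state 1 = Loaded), not code from the slides:

import numpy as np

states = ["Fair", "Loaded"]
pi = np.array([0.5, 0.5])            # initial state probabilities (as assumed in the example below)
A  = np.array([[0.95, 0.05],         # transitions out of Fair
               [0.05, 0.95]])        # transitions out of Loaded
B  = np.array([[1/6] * 6,            # P(roll = 1..6 | Fair)
               [1/10] * 5 + [1/2]])  # P(roll = 1..6 | Loaded)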
Example: the dishonest casino
Let the sequence of rolls be:
O = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
Then, what is the likelihood of
X= Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?
(say initial probs P(t=0,Fair) = ½, P(t=0,Loaded)= ½)
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) =
½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 5.2 × 10^-9
Example: the dishonest casino

So, the likelihood that the die was fair throughout this run is just about 5.2 × 10^-9.
OK, but what is the likelihood of
X= Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?
½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded) =
½ × (1/10)^8 × (1/2)^2 × (0.95)^9 = 0.00000000078781176215 ≈ 7.9 × 10^-10
Therefore, it is about 6.6 times more likely that the die was fair all the way than that it was loaded all the way.
Example: the dishonest casino

Let the sequence of rolls be:
O = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
Now, what is the likelihood X = F, F, …, F?
½ × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9, same as before
What is the likelihood
X= L, L, …, L?
½ × (1/10)^4 × (1/2)^6 × (0.95)^9 = 0.00000049238235134735 ≈ 4.9 × 10^-7
So, it is about 100 times more likely that the die was loaded all the way.
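Using the illustrative pi, A, B arrays defined earlier, both path likelihoods can be checked in a few lines (again a sketch, not code from the slides):

def path_likelihood(rolls, path):
    # Joint probability P(O, X) of the observed rolls and one fixed state path.
    p = pi[path[0]] * B[path[0], rolls[0] - 1]
    for t in range(1, len(rolls)):
        p *= A[path[t - 1], path[t]] * B[path[t], rolls[t] - 1]
    return p

rolls = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
print(path_likelihood(rolls, [0] * 10))   # all Fair:   ~5.2e-09
print(path_likelihood(rolls, [1] * 10))   # all Loaded: ~4.9e-07, roughly 100x larger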
HMM Timeline
• Arrows indicate probabilistic dependencies.
• x’s are hidden states, each dependent only on the previous state.
– The Markov assumption holds for the state sequence.
• o’s are observations, dependent only on their corresponding hidden state.
[Figure: HMM trellis unrolled over time — hidden states x1 … x(t-1), x(t), x(t+1), … xT, each emitting the corresponding observation o1 … o(t-1), o(t), o(t+1), … oT]
HMM Formalism
• An HMM can be specified by three sets of parameters (Π, A, B):
• Π = {πi} are the initial state probabilities
• A = {a_ij} are the state transition probabilities, a_ij = Pr(x_j | x_i)
• B = {b_ik} are the observation probabilities, b_ik = Pr(o_k | x_i)
Generating a sequence by the model
Given an HMM, we can generate a sequence of length T as follows:

1. Start at state x_i according to the initial probability π_i
2. Emit letter o1 according to prob bi(o1)
3. Go to state xj according to prob aij
4. … until emitting oT
[Figure: generation lattice — at each of the T time steps one of the N states is occupied and emits the corresponding observation o1 o2 o3 … oT]
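The four steps above translate almost line by line into code. A sketch of my own, reusing the illustrative pi, A, B arrays from the casino example (observations are returned as 0-based symbol indices):

import numpy as np

def sample_sequence(pi, A, B, T, seed=0):
    # Walk the Markov chain for T steps, emitting one symbol per visited state.
    rng = np.random.default_rng(seed)
    states, observations = [], []
    x = rng.choice(len(pi), p=pi)                              # 1. start state ~ pi
    for _ in range(T):
        states.append(x)
        observations.append(rng.choice(B.shape[1], p=B[x]))    # 2. emit o_t ~ b_x(.)
        x = rng.choice(A.shape[1], p=A[x])                     # 3. next state ~ a_x,.
    return states, observations                                # 4. stop after emitting o_T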
The three main questions on HMMs
1. Evaluation
GIVEN an HMM λ and a sequence O,
FIND Prob[ O | λ ]
2. Decoding
GIVEN an HMM λ and a sequence O,
FIND the sequence X of states that maximizes P[ X | O, λ ]
3. Learning
GIVEN a sequence O,
FIND a model λ with parameters Π, A and B that maximizes P[ O | λ ]
Probability of an Observation

Given an observation sequence and a model, compute the probability of the observation sequence:

Compute Pr(O | λ), where O = (o1 … oT) and λ = (A, B, Π)
Let X = x1 … xT be a state sequence. Then

Pr(O | X, λ) = b_{x1 o1} · b_{x2 o2} · … · b_{xT oT} = ∏_{t=1..T} b_{x_t o_t}

Pr(X | λ) = π_{x1} · a_{x1 x2} · a_{x2 x3} · … · a_{x_{T-1} x_T} = π_{x1} ∏_{t=1..T-1} a_{x_t x_{t+1}}
Probability of an Observation
Pr(O | λ) = Σ_X Pr(O, X | λ) = Σ_X Pr(O | X, λ) · Pr(X | λ)
= Σ_X π_{x1} ∏_{t=1..T-1} a_{x_t x_{t+1}} ∏_{t=1..T} b_{x_t o_t}
= Σ_X π_{x1} b_{x1 o1} ∏_{t=1..T-1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}
HMM – Evaluation (cont.)
• Why isn’t it efficient? – For a given state sequence of length T we have about 2T
calculations– Let N be the number of states in the graph.– There are NT possible state sequences. – Complexity : O(2TNT )– Can be done more efficiently by the forward-backward (F-
B) procedure.
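For intuition, the naive computation is an explicit sum over all N^T state paths. A sketch of my own (practical only for tiny T; observations are 0-based symbol indices and pi, A, B are NumPy arrays as before):

from itertools import product

def brute_force_likelihood(obs, pi, A, B):
    # Sum P(O, X | lambda) over every one of the N**T state sequences.
    N, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(N), repeat=T):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total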
The Forward Procedure (Prefix Probs)

α_i(t) = P(o1 … o_t, x_t = i | λ)
The probability of being in state i after generating the first t observations.
Forward Procedure

α_j(t+1) = P(o1 … o_{t+1}, x_{t+1} = j)
         = P(o1 … o_{t+1} | x_{t+1} = j) P(x_{t+1} = j)
         = P(o1 … o_t | x_{t+1} = j) P(o_{t+1} | x_{t+1} = j) P(x_{t+1} = j)
         = P(o1 … o_t, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)
α_j(t+1) = P(o1 … o_t, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1..N} P(o1 … o_t, x_t = i, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1..N} P(o1 … o_t, x_t = i) P(x_{t+1} = j | x_t = i) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1..N} α_i(t) a_ij b_{j o_{t+1}}
The Forward Procedure

Initialization: α_i(1) = π_i b_{i o1},   1 ≤ i ≤ N

Iteration: α_j(t+1) = [ Σ_{i=1..N} α_i(t) a_ij ] b_{j o_{t+1}},   1 ≤ t ≤ T-1, 1 ≤ j ≤ N

Termination: P(O | λ) = Σ_{i=1..N} α_i(T)

Computational Complexity: O(N²T)
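These three steps translate directly into code. A sketch of my own, with 0-based time indexing (so alpha[t] corresponds to α(t+1) in the slides):

import numpy as np

def forward(obs, pi, A, B):
    # alpha[t, i] = P(o_1 .. o_{t+1}, x_{t+1} = i | lambda); returns alpha and P(O | lambda).
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                          # initialization
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]  # iteration
    return alpha, alpha[-1].sum()                         # termination: P(O | lambda)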
Another Version: The Backward Procedure (Suffix Probs)

β_i(t) = P(o_{t+1} … o_T | x_t = i)

The probability of generating the remaining observations o_{t+1} … o_T, given that we are in state i at time t.

Initialization: β_i(T) = 1

Iteration: β_i(t) = Σ_{j=1..N} a_ij b_{j o_{t+1}} β_j(t+1)
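A matching sketch of the backward recursion (my own, same 0-based conventions as the forward code above):

import numpy as np

def backward(obs, A, B):
    # beta[t, i] = P(obs[t+1], ..., obs[T-1] | state at time index t is i).
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                          # initialization: beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])      # iteration
    return beta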
Decoding
• Given an HMM and a new sequence of observations, find the most probable sequence of hidden states that generated these observations:
• In general, there is an exponential number of possible sequences.
• Use dynamic programming to reduce the computation to O(N²T).
X̂ = argmax_X Pr(X | O, λ) = argmax_X Pr(X, O | λ)
Viterbi Algorithm
δ_j(t) = max_{x1 … x_{t-1}} P(x1 … x_{t-1}, o1 … o_{t-1}, x_t = j, o_t)
The highest probability, over state sequences x1 … x_{t-1}, of seeing the observations up to time t-1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm

Recursion:

δ_j(t+1) = max_i δ_i(t) a_ij b_{j o_{t+1}}   (probability of the most likely path ending in state j)

ψ_j(t+1) = argmax_i δ_i(t) a_ij b_{j o_{t+1}}   (name of the most likely predecessor state)
Viterbi Algorithm

Termination:

x̂_T = argmax_i δ_i(T)

P(X̂) = max_i δ_i(T)

Backtracking: x̂_t = ψ_{x̂_{t+1}}(t+1)

“Read out” the most likely state sequence, working backwards.
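The recursion, termination, and backtracking fit in one short function. A sketch of my own (0-based indices; returns the most likely path and its probability):

import numpy as np

def viterbi(obs, pi, A, B):
    # Max-product dynamic programming over the state trellis.
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))               # delta[t, j]: best path score ending in j at time t
    psi   = np.zeros((T, N), dtype=int)    # psi[t, j]: best predecessor of j at time t
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                 # termination
    for t in range(T - 1, 0, -1):                    # backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()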
Viterbi Training
Initialization: Same as Baum-Welch
Iteration:
Perform Viterbi to find the optimal state sequence.
Calculate P_t(i, j) and γ_i(t) according to the optimal state sequence.
Calculate the new parameters A, B and Π.
Until convergence
Notes:
• In general, worse performance than Baum-Welch
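A sketch of the Viterbi-training loop (hard EM), reusing the viterbi function sketched earlier; the small count-smoothing constant is my own addition to avoid zero probabilities, not part of the slides:

import numpy as np

def viterbi_training(obs, pi, A, B, n_iters=20):
    # Re-estimate parameters from counts along the single best state path.
    N, K = B.shape
    for _ in range(n_iters):
        path, _ = viterbi(obs, pi, A, B)
        A_counts = np.full((N, N), 1e-6)             # tiny smoothing
        B_counts = np.full((N, K), 1e-6)
        for t in range(len(obs) - 1):
            A_counts[path[t], path[t + 1]] += 1
            B_counts[path[t], obs[t]] += 1
        B_counts[path[-1], obs[-1]] += 1
        pi = np.full(N, 1e-6)
        pi[path[0]] += 1.0
        pi /= pi.sum()
        A = A_counts / A_counts.sum(axis=1, keepdims=True)
        B = B_counts / B_counts.sum(axis=1, keepdims=True)
    return pi, A, B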
Learning by Parameter Estimation:
• Goal : Given an observation sequence, find the model that is most likely to produce that sequence.
• Problem: We don’t know the relative frequencies of hidden visited states.
• No analytical solution is known for HMMs.• We will approach the solution by successive approximations.
The Baum-Welch Algorithm
• Find the expected frequencies of possible values of the hidden variables.
• Compute the maximum likelihood distributions of the hidden variables (by normalizing, as usual for MLE).
• Repeat until “convergence.”
• This is the Expectation-Maximization (EM) algorithm for parameter estimation.
• Applicable to any stochastic process, in theory.
• The special case for HMMs is called the Baum-Welch algorithm.
Arc and State Probabilities
p_t(i, j) = P(x_t = i, x_{t+1} = j | O, λ) = α_i(t) a_ij b_{j o_{t+1}} β_j(t+1) / P(O | λ)

where P(O | λ) = Σ_{m=1..N} α_m(t) β_m(t).

Probability of traversing an arc from state i (at time t) to state j (at time t+1).

γ_i(t) = P(x_t = i | O, λ) = Σ_{j=1..N} p_t(i, j)

Probability of being in state i at time t.
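These two quantities are exactly what an implementation computes from alpha and beta. A sketch of my own, reusing the forward and backward functions above; gamma is computed as alpha·beta / P(O | λ), which is equivalent to summing p_t(i, j) over j and is also defined at the final time step:

import numpy as np

def arc_and_state_probs(obs, pi, A, B):
    # xi[t, i, j] = p_t(i, j); gamma[t, i] = gamma_i(t).
    alpha, likelihood = forward(obs, pi, A, B)
    beta = backward(obs, A, B)
    N, T = len(pi), len(obs)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]][None, :]
                 * beta[t + 1][None, :]) / likelihood
    gamma = alpha * beta / likelihood     # P(x_t = i | O, lambda)
    return xi, gamma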
Aggregation and Normalization
Now we can compute the new MLEs of the model parameters:

π̂_i = γ_i(1)

â_ij = Σ_{t=1..T-1} p_t(i, j) / Σ_{t=1..T-1} γ_i(t)

b̂_ik = Σ_{t: o_t = k} γ_i(t) / Σ_{t=1..T} γ_i(t)
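The three update formulas above, written out with NumPy (a sketch; xi and gamma come from the arc_and_state_probs function sketched earlier):

import numpy as np

def reestimate(obs, xi, gamma, n_symbols):
    # M-step: MLE updates for pi, A, B from the expected counts.
    pi_new = gamma[0]                                          # pi_i = gamma_i(1)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # expected transitions / visits
    obs = np.asarray(obs)
    B_new = np.zeros((gamma.shape[1], n_symbols))
    for k in range(n_symbols):
        B_new[:, k] = gamma[obs == k].sum(axis=0)              # expected emissions of symbol k
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new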
The Baum-Welch Algorithm
1. Initialize A, B and Π (pick the best guess for the model parameters, or arbitrary values).
2. Repeat:
3. Calculate α_j(t) and β_j(t).
4. Calculate P_t(i, j) and γ_j(t).
5. Estimate π̂_i, â_ij and b̂_ik.
6. Until the changes are small enough.
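Putting the pieces together, the outer loop is only a few lines. A sketch that reuses the forward, arc_and_state_probs, and reestimate functions above; the log-likelihood stopping rule is one reasonable way to make “small enough” concrete:

import numpy as np

def baum_welch(obs, pi, A, B, n_iters=100, tol=1e-6):
    # Alternate the E-step (steps 3-4) and the M-step (step 5) until convergence (step 6).
    prev_ll = -np.inf
    for _ in range(n_iters):
        xi, gamma = arc_and_state_probs(obs, pi, A, B)       # E-step
        pi, A, B = reestimate(obs, xi, gamma, B.shape[1])    # M-step
        ll = np.log(forward(obs, pi, A, B)[1])
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, A, B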
The Baum-Welch Algorithm – Comments
Time Complexity: (# iterations) × O(N²T)
• Guaranteed to increase the (log) likelihood of the model
P(λ | O) = P(O, λ) / P(O) = P(O | λ) P(λ) / P(O)
• Not guaranteed to find globally best parameters
Converges to local optimum, depending on initial conditions
• Too many parameters / too large model - Overtraining
Simple Features Extraction - LPC
• Speech difference equation for a p-th order filter:
• Want to minimize the mean squared prediction error
• The ak’s create the feature vector.
s(n) = Σ_{k=1..p} a_k s(n−k) + e(n)
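One common way to realize this is the autocorrelation method: form the autocorrelation of a windowed frame and solve the resulting Toeplitz normal equations for the a_k’s. The sketch below is my own illustration; it solves the system with a general linear solver, whereas a real front end would typically use Levinson–Durbin and may need a tiny regularization for silent frames:

import numpy as np

def lpc_coefficients(frame, p=10):
    # Estimate (a_1 .. a_p) for one speech frame by the autocorrelation method.
    frame = np.asarray(frame, dtype=float)
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz matrix
    a = np.linalg.solve(R, r[1:])        # normal equations R a = r[1..p]
    return a                             # the feature vector of a_k's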