Center of Signal and Image Processing, Georgia Institute of Technology
ECE8813: Statistical Natural Language Processing
Chin-Hui Lee, School of ECE, Georgia Tech
Atlanta, GA 30332, [email protected]
ECE8813, Spring 2009
Lectures 12-13: Hidden Markov Models
Markov Assumptions
• Let X = (X1, ..., XT) be a sequence of random variables taking values in some finite set S = {s1, ..., sN}, the state space. If X possesses the following properties, then X is a Markov chain:
1. Limited horizon: P(Xt+1 = sk | X1, ..., Xt) = P(Xt+1 = sk | Xt), i.e., the next state depends only on the current state
2. Time invariant (stationary): P(Xt+1 = sk | Xt) = P(X2 = sk | X1), i.e., the dependency does not change over time
Definition of a Markov Chain
• A is a stochastic N × N matrix of transition probabilities:
aij = Pr{transition from si to sj} = Pr(Xt+1 = sj | Xt = si), with aij ≥ 0 for all i, j and, for every state i,
$$\sum_{j=1}^{N} a_{ij} = 1$$
• Π is a vector of N elements representing the initial state probability distribution:
πi = Pr{the initial state is si} = Pr(X1 = si), for all i, with
$$\sum_{i=1}^{N} \pi_i = 1$$
• One can avoid a separate Π by creating a special start state s0
Markov Chain: An Example
Markov Process & Language Models
• Bayes formula (chain rule):
P(W) = P(w1, w2, ..., wT) = ∏i=1..T p(wi | w1, w2, ..., wi−1)
• n-gram language models: a Markov process (chain) of order n−1:
P(W) = P(w1, w2, ..., wT) = ∏i=1..T p(wi | wi−n+1, wi−n+2, ..., wi−1)
• Using just one distribution (e.g., a trigram model: p(wi | wi−2, wi−1)):
Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Words: My car broke down , and within hours Bob 's car broke down , too .
p(, | broke down) = p(w5 | w3, w4) = p(w14 | w12, w13)
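To make the factorization concrete, here is a minimal Python sketch (not from the slides) that scores a word sequence with a trigram table; the probability values are made up purely for illustration.

```python
# Hedged sketch: scoring a word sequence with a trigram model stored as a dict.
# All probability values below are illustrative, not estimated from data.

def trigram_prob(sentence, p, bos="<s>"):
    """P(W) = prod_i p(w_i | w_{i-2}, w_{i-1}), padding the left context with <s>."""
    words = [bos, bos] + sentence.split()
    prob = 1.0
    for i in range(2, len(words)):
        prob *= p.get((words[i - 2], words[i - 1], words[i]), 0.0)
    return prob

p = {
    ("<s>", "<s>", "My"): 0.1,
    ("<s>", "My", "car"): 0.2,
    ("My", "car", "broke"): 0.3,
    ("car", "broke", "down"): 0.9,
    ("broke", "down", ","): 0.5,
}
print(trigram_prob("My car broke down ,", p))   # 0.1 * 0.2 * 0.3 * 0.9 * 0.5 = 0.0027
```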
Another Example with a Markov Chain
[Figure: a Markov chain over the letters t, i, p, a, h, e plus a Start state; arcs are labeled with transition probabilities (1.0, 0.6, 0.4, 0.3, ...)]
Probability of a Sequence of States
P(t, a, p, p) = 1.0 × 0.3 × 0.4 × 1.0 = 0.12
$$\begin{aligned}
P(X_1 \cdots X_T) &= P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1 X_2) \cdots P(X_T \mid X_1 \cdots X_{T-1}) \\
&= P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_2) \cdots P(X_T \mid X_{T-1}) \\
&= \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}
\end{aligned}$$
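As a small illustration, this product can be coded directly. The sketch below is not part of the original slides; it simply reuses the transition values from the worked example above so that it reproduces the 0.12 result.

```python
# Sketch: P(X_1..X_T) = pi_{X_1} * prod_t a_{X_t, X_{t+1}}.
# The pi and A entries below are just the values used in the worked example above.

pi = {"t": 1.0}
A = {("t", "a"): 0.3, ("a", "p"): 0.4, ("p", "p"): 1.0}

def sequence_prob(states, pi, A):
    prob = pi.get(states[0], 0.0)
    for s, s_next in zip(states, states[1:]):
        prob *= A.get((s, s_next), 0.0)
    return prob

print(sequence_prob(["t", "a", "p", "p"], pi, A))   # 1.0 * 0.3 * 0.4 * 1.0 = 0.12
```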
Hidden Markov Models (HMMs)
• Sometimes it is not possible to know precisely which states the model passes through; all we can do is observe some phenomenon that occurs in each state with some probability distribution
• An HMM is an appropriate model when you don't know the state sequence the model passes through, but only some probabilistic function of it. For example:
– Word recognition (discrete utterance or within continuous speech)
– Phoneme recognition
– Part-of-speech tagging
– Linear interpolation
What is an HMM?
• Green circles are hidden states, each depending only on the previous state
• Purple circles are observed events, each depending only on its emitting state
HMM: An Occasionally Dishonest Casino
• Assume: a casino occasionally switches to a biased die to increase its winning odds
• Can we model this with an HMM?
• How do we prove it cheats?
• Can we estimate the HMM?
• How many samples are needed?
• Which die is used at what time?
Fair die: P(1) = P(2) = ... = P(6) = 1/6
Biased die: P(1) = ... = P(5) = 1/10, P(6) = 1/2
Transitions: Fair→Fair 0.95, Fair→Biased 0.05, Biased→Biased 0.9, Biased→Fair 0.1
Estimation: More vs. Less Data
Estimates with 300 rolls:
– "Fair" state emissions: 1: 0.19, 2: 0.19, 3: 0.23, 4: 0.08, 5: 0.23, 6: 0.08
– "Biased" state emissions: 1: 0.07, 2: 0.10, 3: 0.10, 4: 0.17, 5: 0.05, 6: 0.52
– Transition estimates: 0.73, 0.27, 0.71, 0.29
Estimates with 30,000 rolls:
– "Fair" state emissions: 1: 0.17, 2: 0.17, 3: 0.17, 4: 0.15, 5: 0.18, 6: 0.16
– "Biased" state emissions: 1: 0.10, 2: 0.11, 3: 0.10, 4: 0.11, 5: 0.10, 6: 0.48
– Transition estimates: 0.93, 0.07, 0.88, 0.12
Observations and Hidden States
• An HMM is also called a probabilistic function of a Markov chain
– State transitions follow a Markov chain
– In each state, observation symbols are generated according to a probability function; each state has its own probability function
– An HMM is thus a doubly embedded stochastic process
• In an HMM,
– the state is not directly observable (hidden states)
– we only observe the observation symbols generated from the states
S = ω1, ω4, ω2, ω2, ω1, ω4 (hidden)
O = v4, v1, v1, v4, v2, v3 (observed)
An HMM Example: Urn & Ball
Urn 1: Pr(RED) = b1(1), Pr(BLE) = b1(2), Pr(GRN) = b1(3), ...
Urn 2: Pr(RED) = b2(1), Pr(BLE) = b2(2), Pr(GRN) = b2(3), ...
...
Urn N−1: Pr(RED) = bN−1(1), Pr(BLE) = bN−1(2), Pr(GRN) = bN−1(3), ...
Urn N: Pr(RED) = bN(1), Pr(BLE) = bN(2), Pr(GRN) = bN(3), ...
Observation: O = {GRN, GRN, BLE, RED, RED, ..., BLE}
Elements of an HMM
• HMM (the most general case): a five-tuple (S, K, Π, A, B), where
S = {s1, s2, ..., sN} is the set of states,
K = {k1, k2, ..., kM} is the output alphabet,
Π = {πi}, i ∈ S, is the initial state distribution,
A = {aij}, i, j ∈ S, are the state transition probabilities,
B = {bijk}, i, j ∈ S, k ∈ K (arc emission), or B = {bik}, i ∈ S, k ∈ K (state emission), are the emission probabilities
• State sequence: X = (x1, x2, ..., xT+1), xt: S → {1, 2, ..., N}
• Observation sequence: O = (o1, ..., oT), ot ∈ K
HMM as a Generating Model
• Given an HMM, denoted Λ = {A, B, π}, and an observation sequence O = {O1, O2, ..., OT}
• The HMM can be viewed as a generator that produces O as follows:
1. Choose an initial state q1 = Si according to the initial probability distribution π
2. Set t = 1
3. Choose an observation Ot according to the symbol observation probability distribution in state Si, i.e., bi(k)
4. Transit to a new state qt+1 = Sj according to the state transition probability distribution, i.e., aij
5. Set t = t + 1 and return to step 3 if t < T
6. Terminate the procedure
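A minimal Python sketch of this generation procedure for a state-emission HMM is shown below; it reuses the dishonest-casino parameters from the earlier slide, and the uniform initial distribution is an assumption added here (the slide does not specify one).

```python
import random

# Sketch of the HMM-as-generator procedure above, using the casino model.
# The uniform start distribution is an assumption for illustration only.
pi = {"Fair": 0.5, "Biased": 0.5}
A = {"Fair": {"Fair": 0.95, "Biased": 0.05},
     "Biased": {"Biased": 0.90, "Fair": 0.10}}
B = {"Fair": {k: 1 / 6 for k in range(1, 7)},
     "Biased": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome  # guard against floating-point rounding

def generate(T):
    q = sample(pi)                      # step 1: pick the initial state from pi
    states, obs = [], []
    for _ in range(T):                  # steps 3-5: emit a symbol, then transit
        states.append(q)
        obs.append(sample(B[q]))
        q = sample(A[q])
    return states, obs

print(generate(10))
```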
Assumptions in HMM
• Markov assumption:
– State transitions follow a 1st-order Markov chain
– This assumption implies that the duration in each state follows a geometric distribution:
$$p_j(d) = (a_{jj})^{d-1}\,(1 - a_{jj})$$
• Output independence assumption: the probability that a particular observation symbol is emitted at time t depends only on the current state st and is conditionally independent of the past and future observations
• The two assumptions limit the memory of an HMM and may lead to model deficiency, but they significantly simplify HMM computation and greatly reduce the number of free parameters to be estimated in practice
– Some work to relax these assumptions has been done in the literature to enhance HMMs for modeling speech signals
Types of HMMs (I)
• Different transition matrices:
– Ergodic HMM topology:
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$
– Left-to-right HMM: states proceed from left to right, with self-loops a11, a22, a33, a44 and forward transitions a12, a23, a34:
$$A = \begin{bmatrix} a_{11} & a_{12} & 0 & 0 \\ 0 & a_{22} & a_{23} & 0 \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{bmatrix}$$
Types of HMMs (II)
• Different observation symbols: discrete vs. continuous
– Discrete density HMM (DDHMM): the observation is discrete, one of a finite set. The observation function is a discrete probability distribution, i.e., a table. In state j, for example,
$$b_j(k) = \begin{bmatrix} 0.1 & 0.4 & 0.3 & 0.2 \end{bmatrix} \text{ over } (v_1, v_2, v_3, v_4)$$
– Continuous density HMM (CDHMM): the observation x is continuous in an observation space. The state observation density is a probability density function (p.d.f.). Common functional forms:
• Multivariate Gaussian density
$$N(x; \mu, C) = \frac{1}{\sqrt{(2\pi)^n |C|}}\, e^{-\frac{1}{2}(x-\mu)^t C^{-1}(x-\mu)}, \quad -\infty < x < \infty$$
• Gaussian mixture density
$$G_M(x) = \sum_{m=1}^{M} \omega_m\, N(x; \mu_m, \sigma_m^2), \quad \sum_{m=1}^{M} \omega_m = 1, \; 0 \le \omega_m \le 1$$
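As a small illustration of the CDHMM case, the sketch below (my own, with made-up numbers) evaluates a diagonal-covariance Gaussian mixture density of the form G_M(x) above.

```python
import numpy as np

# Sketch: evaluate G_M(x) = sum_m w_m N(x; mu_m, sigma_m^2) for diagonal covariances.
# All parameter values are illustrative only.

def gaussian(x, mu, var):
    """Diagonal-covariance multivariate Gaussian = product of 1-D Gaussians."""
    return float(np.prod(np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)))

def mixture_density(x, weights, means, variances):
    return sum(w * gaussian(x, mu, var)
               for w, mu, var in zip(weights, means, variances))

x = np.array([0.3, -1.2])
weights = [0.6, 0.4]                                   # mixture weights, sum to 1
means = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
variances = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
print(mixture_density(x, weights, means, variances))
```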
HMM for Data Modeling
• The HMM is a powerful statistical model for sequential and temporal observation data
• HMMs are theoretically (mathematically) sound, and relatively simple learning and decoding algorithms exist
• HMMs are widely used in pattern recognition, machine learning, etc.
– Speech recognition: modeling speech signals
– Statistical language processing: modeling language (word/semantic sequences)
– OCR (optical character recognition): modeling 2-d character images
– Gene finding: modeling DNA sequences
Three Fundamental Problems in HMM
• How do we use an HMM to model sequential data?
– The entire data sequence is viewed as one data sample O
– The HMM is characterized by its parameters Λ = {A, B, π}
• Learning problem: the HMM parameters Λ must be estimated from a set of data samples {O1, O2, ..., OL}; the parameters are set so as to best explain the known data
• Evaluation problem: for an unknown data sample Ox, calculate the probability of the data sample given the model, p(Ox | Λ)
• Decoding problem: uncover the hidden information; for an observation sequence O = {o1, o2, ..., oT}, decode the best state sequence S = {q1, q2, ..., qT} that is optimal in explaining O
HMM Formalism
• An HMM is specified by {S, K, Π, A, B}, where
Π = {πi} are the initial state probabilities,
A = {aij} are the state transition probabilities,
B = {bik} are the observation probabilities
[Figure: trellis of hidden states (from S) emitting observations (from K), with A labeling the state-to-state arcs and B labeling the state-to-observation arcs]
Example of an Arc Emit HMM
[Figure: a four-state arc-emission HMM over the symbols {t, o, e}; the model is entered at state 1, arcs carry transition probabilities (0.6, 0.4, 1, 1, 0.88, 0.12), and each arc has its own emission distribution over t, o, e, e.g., p(t)=.5, p(o)=.2, p(e)=.3 on one arc and p(t)=0, p(o)=0, p(e)=1 on another]
HMM Properties
• N states in the model: a state has some measurable, distinctive behavior
• At clock time t, you make a state transition (to a different state or back to the same state), based on a transition probability distribution that depends on the current state (i.e., the one you're in before making the transition)
• After each transition, you output an observation symbol according to a probability distribution that depends on the current state or the current arc
• Goal: from the observations, determine which model generated the output; e.g., in word recognition with a different HMM for each word, the goal is to choose the one that best fits the input word (based on state transitions and output observations)
Example HMM
• N states corresponding to urns: q1, q2, ..., qN
• M colors for balls found in the urns
• B: N × M matrix; for each urn, a probability distribution over colors:
bij = P(COLOR = zj | URN = qi)
• A: N × N matrix; transition probability distribution (between urns)
• Π: N-vector; initial state probability distribution (where do we start?)
An HMM Evolution Example
• Process:
– According to π, pick a state qi; select a ball from urn i (the state is hidden)
– Observe the color (observable)
– Replace the ball
– According to A, transition to the next urn from which to pick a ball (the state is hidden)
• Design: choose N and M. Specify a model Λ = (A, B, Π) from the training data. Adjust the model parameters to maximize P(O | Λ)
• Use: given O and Λ = (A, B, Π), what is P(O | Λ)? If we compute P(O | Λ) for all models Λ, we can determine which model most likely generated O
Simulating a Markov Process
t := 1
Start in state si with probability πi (i.e., X1 = i)
Forever do:
  Move from state si to state sj with probability aij (i.e., Xt+1 = j)
  Emit observation symbol ot = k with probability bijk
  t := t + 1
End
Why Use Hidden Markov Models?
• HMMs are useful when one can think of underlying events that probabilistically generate surface events. Example: PoS tagging
• HMMs can be trained efficiently using the EM algorithm
• Another example where HMMs are useful is in generating parameters for linear interpolation of n-gram models
• Assuming that some set of data was generated by an HMM, an HMM is useful for calculating the probabilities of possible underlying state sequences
Fundamental Questions for HMMs
• Given a model Λ = (A, B, Π), how do we efficiently compute how likely a certain observation is, that is, P(O | Λ)?
• Given an observation sequence O and a model Λ, how do we choose a state sequence (X1, …, X T+1) that best explains the observations?
• Given an observation sequence O, and a space of possible models found by varying the model parameters Λ = (A, B, π), how do we find the model that best explains the observed data?
Probability of an Observation
• Given the observation sequence O = (o1, ..., oT) and a model Λ = (A, B, Π), we wish to know how to efficiently compute P(O | Λ)
• Summing over all state sequences S = (q1, ..., qT+1), we find P(O | Λ) = ΣS P(O | S, Λ) P(S | Λ), which is the probability of the observation sequence given the model
• Direct evaluation of this expression is extremely inefficient; however, there are dynamic programming methods that compute it quite efficiently
Probability of an Observation
Given an observation sequence O = (o1, ..., oT) and a model μ = (A, B, Π), compute the probability of the observation sequence, P(O | μ).
[Figure: trellis of hidden states q1, ..., qt−1, qt, qt+1, ..., qT emitting o1, ..., ot−1, ot, ot+1, ..., oT]
Probability of an Observation (Cont.)
$$P(O \mid S, \mu) = b_{q_1 o_1}\, b_{q_2 o_2} \cdots b_{q_T o_T}$$
$$P(S \mid \mu) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$$
$$P(O, S \mid \mu) = P(O \mid S, \mu)\, P(S \mid \mu)$$
$$P(O \mid \mu) = \sum_S P(O \mid S, \mu)\, P(S \mid \mu)$$
Probability of an Observation with an Arc Emit HMM
Using the arc-emission HMM from the earlier figure, the probability of the string "toe" is the sum over all paths that can generate it:
p(toe) = .6×.8×.88×.7×1×.6 + .4×.5×1×1×.88×.2 + .4×.5×1×1×.12×1 ≅ .237
Making Computation Efficient
• To avoid this computational complexity, use dynamic programming or memoization techniques, exploiting the trellis structure of the problem
• Use an array of states versus time to compute the probability of being in each state at time t+1 in terms of the probabilities of being in each state at time t
• A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time; the probability of longer subpaths can then be worked out in terms of the shorter subpaths
• A forward probability, αi(t) = P(o1 o2 ... ot−1, Xt = i | μ), is stored at (si, t) in the trellis and expresses the total probability of ending up in state si at time t
The Trellis Structure
Forward Procedure
• The special trellis structure gives us an efficient solution using dynamic programming
• Intuition: the probability of the first t observations is the same for all possible length-(t+1) state sequences (so don't recompute it!)
• Define:
$$\alpha_i(t) = P(o_1 \cdots o_{t-1},\, q_t = i \mid \mu)$$
Forward Procedure (Cont.)
$$\begin{aligned}
\alpha_j(t+1) &= P(o_1 \cdots o_t,\, q_{t+1} = j \mid \mu) \\
&= P(o_1 \cdots o_t \mid q_{t+1} = j)\, P(q_{t+1} = j) \\
&= P(o_1 \cdots o_{t-1} \mid q_{t+1} = j)\, P(o_t \mid q_{t+1} = j)\, P(q_{t+1} = j) \\
&= P(o_1 \cdots o_{t-1},\, q_{t+1} = j)\, P(o_t \mid q_{t+1} = j)
\end{aligned}$$
Forward Procedure (Cont.)
$$\begin{aligned}
\alpha_j(t+1) &= \sum_{i=1 \ldots N} P(o_1 \cdots o_{t-1},\, q_t = i,\, q_{t+1} = j)\, P(o_t \mid q_{t+1} = j) \\
&= \sum_{i=1 \ldots N} P(o_1 \cdots o_{t-1},\, q_{t+1} = j \mid q_t = i)\, P(q_t = i)\, P(o_t \mid q_{t+1} = j) \\
&= \sum_{i=1 \ldots N} P(o_1 \cdots o_{t-1},\, q_t = i)\, P(q_{t+1} = j \mid q_t = i)\, P(o_t \mid q_t = i,\, q_{t+1} = j) \\
&= \sum_{i=1 \ldots N} \alpha_i(t)\, a_{ij}\, b_{ijo_t}
\end{aligned}$$
The Forward Algorithm
• Forward variables are calculated as follows:
– Initialization: αi(1) = πi, 1 ≤ i ≤ N
– Induction: αj(t+1) = Σi=1..N αi(t) aij bijot, 1 ≤ t ≤ T, 1 ≤ j ≤ N
– Total: P(O | μ) = Σi=1..N αi(T+1)
• This algorithm requires about 2N²T multiplications (far fewer than the direct method, which takes on the order of (2T+1)·N^(T+1))
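A compact Python sketch of the forward recursion is given below. For simplicity it uses the state-emission form b_i(k) (the convention also used in the Baum-Welch slides later) rather than arc emission, and the toy model parameters are made up.

```python
import numpy as np

# Sketch of the forward algorithm for a state-emission HMM.
# alpha[t, i] = P(o_1 .. o_t, q_t = i | model); P(O | model) = sum_i alpha[T-1, i].

def forward(pi, A, B, obs):
    """pi: (N,), A: (N, N), B: (N, M); obs: list of symbol indices."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, float(alpha[-1].sum())              # total probability

# Toy model: 2 states, 3 output symbols (illustrative numbers only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
alpha, p_obs = forward(pi, A, B, [0, 2, 1])
print(p_obs)
```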
The Backward Procedure
• We could also compute these probabilities by working backward through time
• The backward procedure computes backward variables, which give the total probability of seeing the rest of the observation sequence given that we were in state si at time t
• βi(t) = P(ot…oT | qt = i, Λ) is a backward variable (probability)
• Backward variables are useful for the problem of parameter reestimation
Backward Procedure (Cont.)
$$\beta_i(t) = P(o_t \cdots o_T \mid q_t = i,\, \mu)$$
$$\beta_i(T+1) = 1$$
$$\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ijo_t}\, \beta_j(t+1)$$
The Backward Algorithm
• Backward variables can be calculated by working backward through the trellis as follows:
– Initialization: βi(T+1) = 1, 1 ≤ i ≤ N
– Induction: βi(t) = Σj=1..N aij bijot βj(t+1), 1 ≤ t ≤ T, 1 ≤ i ≤ N
– Total: P(O | μ) = Σi=1..N πi βi(1)
• Backward variables can also be combined with forward variables:
P(O | μ) = Σi=1..N αi(t) βi(t), 1 ≤ t ≤ T+1
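The companion backward recursion, in the same state-emission convention as the forward sketch above, can look like this:

```python
import numpy as np

# Sketch of the backward algorithm: beta[t, i] = P(o_{t+1} .. o_T | q_t = i, model),
# so that P(O | model) = sum_i pi_i * b_i(o_1) * beta[0, i].

def backward(pi, A, B, obs):
    N, T = len(pi), len(obs)
    beta = np.ones((T, N))                              # initialization at the last frame
    for t in range(T - 2, -1, -1):                      # induction, backward in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = float(np.sum(pi * B[:, obs[0]] * beta[0]))  # total probability
    return beta, p_obs
```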
Summary: Probability of an Observation
Forward procedure:
$$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)$$
Backward procedure:
$$P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$$
Combination:
$$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t), \quad 1 \le t \le T+1$$
Finding the Best State Sequence
• One method consists of optimizing the states individually
• For each t, 1 ≤ t ≤ T+1, we would like to find the Xt that maximizes P(Xt | O, Λ)
• Let γi(t) = P(Xt = i | O, Λ) = P(Xt = i, O | Λ) / P(O | Λ) = αi(t) βi(t) / Σj=1..N αj(t) βj(t)
• The individually most likely state is X̂t = argmax1≤i≤N γi(t), 1 ≤ t ≤ T+1
• This criterion maximizes the expected number of states that are guessed correctly; however, it may yield a quite unlikely state sequence
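Reusing the forward() and backward() sketches above, the per-frame posteriors γ and the individually most likely states can be computed as follows (again a sketch, not code from the course):

```python
import numpy as np

# Sketch: gamma[t, i] = alpha[t, i] * beta[t, i] / P(O | model), then take the argmax
# per time step; assumes forward() and backward() from the earlier sketches.

def posterior_decode(pi, A, B, obs):
    alpha, p_obs = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)
    gamma = alpha * beta / p_obs
    return gamma, gamma.argmax(axis=1)      # individually most likely state at each t
```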
The Viterbi Algorithm
• The Viterbi algorithm efficiently computes the most likely state sequence
• To find the most likely complete path, compute argmaxS P(S | O, Λ)
• To do this, it is sufficient to maximize, for a fixed O, argmaxS P(S, O | Λ)
• We define δj(t) = maxq1..qt−1 P(q1 ⋯ qt−1, o1 ⋯ ot−1, qt = j | μ), where ψj(t) records the node of the incoming arc that led to this most probable path
Viterbi Algorithm Properties
$$\delta_j(t) = \max_{q_1 \cdots q_{t-1}} P(q_1 \cdots q_{t-1},\, o_1 \cdots o_{t-1},\, q_t = j \mid \mu)$$
The probability of the state sub-sequence that best accounts for the observations up to time t−1 and lands in state j at time t
Viterbi Algorithm Properties (Cont.)
DP recursive computation:
$$\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ijo_t}$$
$$\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ijo_t}$$
Finally: The Viterbi Algorithm
The Viterbi algorithm works as follows:
• Initialization: δj(1) = πj, 1 ≤ j ≤ N
• Induction: δj(t+1) = max1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
• Store backtrace: ψj(t+1) = argmax1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
• Termination and path readout: next slide
Viterbi Algorithm: Another Look
Compute the most likely state sequence by working backwards:
$$\hat{q}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$$
$$\hat{q}_t = \psi_{\hat{q}_{t+1}}(t+1)$$
$$P(\hat{S}) = \max_{1 \le i \le N} \delta_i(T+1)$$
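A Python sketch of the full Viterbi recursion, again in the state-emission convention of the earlier sketches, might look like this:

```python
import numpy as np

# Sketch of the Viterbi algorithm: delta[t, i] is the probability of the best path that
# accounts for o_1 .. o_t and ends in state i; psi stores the backpointers.

def viterbi(pi, A, B, obs):
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)               # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                 # termination
    for t in range(T - 1, 0, -1):                    # path readout by backtracking
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```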
HMM Parameter Estimation
• Given a collection of observation sequences, we want to find the values of the model parameters Λ = (A, B, π) that best explain the observed data
• Using maximum likelihood estimation (MLE), we seek values that maximize P(O | Λ), i.e., argmaxμ P(Otraining | μ)
• There is no known analytic method for choosing μ to maximize P(O | Λ). However, we can locally maximize it with an iterative hill-climbing algorithm known as Baum-Welch or the forward-backward algorithm (a special case of the EM algorithm, which is used in many missing-data problems in statistical inference)
The Forward-Backward Algorithm
• We don't know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model
• Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most
• By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence
Parameter Estimation
• Given an observation sequence, find the model that is most likely to produce that sequence
• No analytic method exists
• Given a model and training sequences, update the model parameters to better fit the observations
Some Definitions
• The probability of traversing a certain arc (from state i to state j) at time t, given the observation sequence O:
$$p_t(i,j) = p(q_t = i,\, q_{t+1} = j \mid O, \mu) = \frac{p(q_t = i,\, q_{t+1} = j,\, O \mid \mu)}{p(O \mid \mu)} = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1 \ldots N} \alpha_m(t)\, \beta_m(t)}$$
• Let:
$$\gamma_i(t) = \sum_{j=1}^{N} p_t(i,j)$$
The Probability of Traversing an Arc
[Figure: the arc from state si at time t to state sj at time t+1, carrying probability aij bijot; the forward probability αi(t) accounts for the trellis to its left and the backward probability βj(t+1) for the trellis to its right]
More Definitions
• The expected number of transitions from state i in O:
$$\sum_{t=1}^{T} \gamma_i(t)$$
• The expected number of transitions from state i to state j in O:
$$\sum_{t=1}^{T} p_t(i,j)$$
Reestimation Procedure
• Begin with a model μ, perhaps selected at random
• Run O through the current model to estimate the expectation of each model parameter
• Change the model to increase the probability of the paths that are used a lot, while respecting the stochastic constraints
• Repeat this process (hoping to converge on optimal values of the model parameters μ)
The Reestimation Formula
• The expected frequency of being in state i at time t = 1:
$$\hat{\pi}_i = \gamma_i(1)$$
• Transition probabilities:
$$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \frac{\sum_{t=1}^{T} p_t(i,j)}{\sum_{t=1}^{T} \gamma_i(t)}$$
• From μ = (A, B, Π), derive μ̂ = (Â, B̂, Π̂). Note that P(O | μ̂) ≥ P(O | μ).
Reestimation Formula (Cont.)
$$\hat{b}_{ijk} = \frac{\text{expected number of transitions from } i \text{ to } j \text{ with } k \text{ observed}}{\text{expected number of transitions from } i \text{ to } j} = \frac{\sum_{\{t:\, o_t = k,\, 1 \le t \le T\}} p_t(i,j)}{\sum_{t=1}^{T} p_t(i,j)}$$
Parameter Estimation: Summary 1
• Probability of traversing an arc at time t:
$$p_t(i,j) = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1 \ldots N} \alpha_m(t)\, \beta_m(t)}$$
• Probability of being in state i at time t:
$$\gamma_i(t) = \sum_{j=1 \ldots N} p_t(i,j)$$
Parameter Estimation: Summary 2
Now we can compute the new estimates of the model parameters:
$$\hat{\pi}_i = \gamma_i(1)$$
$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} p_t(i,j)}{\sum_{t=1}^{T} \gamma_i(t)}$$
$$\hat{b}_{ik} = \frac{\sum_{\{t:\, o_t = k\}} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$$
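Putting the pieces together, one reestimation step can be sketched as below; it follows the p_t(i,j) and γ_i(t) formulas above in the state-emission convention and reuses the forward() and backward() sketches from the earlier slides.

```python
import numpy as np

# Sketch of one Baum-Welch (forward-backward) reestimation step for a discrete,
# state-emission HMM; assumes forward() and backward() from the earlier sketches.

def baum_welch_step(pi, A, B, obs):
    N, M, T = len(pi), B.shape[1], len(obs)
    alpha, p_obs = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)
    gamma = alpha * beta / p_obs                                  # gamma[t, i]
    xi = np.zeros((T - 1, N, N))                                  # xi[t, i, j] ~ p_t(i, j)
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / p_obs
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    obs = np.asarray(obs)
    for k in range(M):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```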
Baum-Welch Algorithm: DDHMM (I)
E-step: for L training sequences, the EM auxiliary function decomposes into three terms,
$$Q(\Lambda; \Lambda^{(n)}) = \sum_{l=1}^{L} \sum_{S_l} \Pr(S_l \mid O_l, \Lambda^{(n)}) \ln p(O_l, S_l \mid \Lambda) = Q_\pi(\pi) + Q_A(A) + Q_B(B)$$
where
$$Q_\pi(\pi) = \sum_{l=1}^{L} \sum_{i=1}^{N} \ln \pi_i \cdot \Pr(s_1^{(l)} = s_i \mid O_l, \Lambda^{(n)})$$
$$Q_A(A) = \sum_{l=1}^{L} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=2}^{T_l} \ln a_{ij} \cdot \Pr(s_{t-1}^{(l)} = s_i,\, s_t^{(l)} = s_j \mid O_l, \Lambda^{(n)})$$
$$Q_B(B) = \sum_{l=1}^{L} \sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T_l} \ln b_i(v_m) \cdot \Pr(s_t^{(l)} = s_i,\, o_t^{(l)} = v_m \mid O_l, \Lambda^{(n)})$$
Baum-Welch Algorithm: DDHMM (II)
M-step: setting the derivative of each term to zero under the stochastic constraints gives
$$\frac{\partial Q_\pi}{\partial \pi_i} = 0 \;\Rightarrow\; \pi_i^{(n+1)} = \frac{\sum_{l=1}^{L} \Pr(s_1^{(l)} = s_i \mid O_l, \Lambda^{(n)})}{\sum_{l=1}^{L} \sum_{i=1}^{N} \Pr(s_1^{(l)} = s_i \mid O_l, \Lambda^{(n)})}$$
$$\frac{\partial Q_A}{\partial a_{ij}} = 0 \;\Rightarrow\; a_{ij}^{(n+1)} = \frac{\sum_{l=1}^{L} \sum_{t=2}^{T_l} \Pr(s_{t-1}^{(l)} = s_i,\, s_t^{(l)} = s_j \mid O_l, \Lambda^{(n)})}{\sum_{l=1}^{L} \sum_{t=2}^{T_l} \sum_{j=1}^{N} \Pr(s_{t-1}^{(l)} = s_i,\, s_t^{(l)} = s_j \mid O_l, \Lambda^{(n)})}$$
$$\frac{\partial Q_B}{\partial b_i(v_m)} = 0 \;\Rightarrow\; b_i(v_m)^{(n+1)} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \Pr(s_t^{(l)} = s_i,\, o_t^{(l)} = v_m \mid O_l, \Lambda^{(n)})}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \sum_{m=1}^{M} \Pr(s_t^{(l)} = s_i,\, o_t^{(l)} = v_m \mid O_l, \Lambda^{(n)})}$$
Baum-Welch Algorithm: DDHMM (III)
How do we calculate the posterior probability of traversing an arc from state i to state j at time t?
$$\Pr(s_{t-1} = s_i,\, s_t = s_j \mid O_l, \Lambda) = \frac{\text{prob. of all paths in state } i \text{ at } t-1 \text{ and state } j \text{ at } t}{\text{prob. of all paths}} = \frac{\alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)\, \beta_t(j)}{\Pr(O_l \mid \Lambda)} = \frac{\alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)\, \beta_t(j)}{\sum_{i=1}^{N} \alpha_T(i)} \equiv \xi_t(i,j)$$
[Figure: an arc from state si to state sj with probability aij bj(ot); the forward probability α covers the observations before the arc and the backward probability β covers those after it]
Baum-Welch Algorithm: DDHMM (IV)
Define the state occupancy probability:
$$\gamma_t(i) = \Pr(s_t = i \mid O, \Lambda) = \frac{\Pr(s_t = i,\, O \mid \Lambda)}{\Pr(O \mid \Lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$$
Note that
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_{t+1}(i,j)$$
$$\sum_{t=1}^{T} \gamma_t(i) = \text{expected number of transitions from state } i \text{ in } O$$
$$\sum_{t=1}^{T} \xi_t(i,j) = \text{expected number of transitions from state } i \text{ to state } j \text{ in } O$$
Baum-Welch Algorithm: DDHMM (V)
Final results for one iteration, from Λ^(n) = {π^(n), A^(n), B^(n)}:
$$\pi_i^{(n+1)} = \frac{\sum_{l=1}^{L} \Pr(s_1^{(l)} = s_i \mid O_l, \Lambda^{(n)})}{\sum_{l=1}^{L} \sum_{i=1}^{N} \Pr(s_1^{(l)} = s_i \mid O_l, \Lambda^{(n)})}$$
$$a_{ij}^{(n+1)} = \frac{\sum_{l=1}^{L} \sum_{t=2}^{T_l} \Pr(s_{t-1}^{(l)} = s_i,\, s_t^{(l)} = s_j \mid O_l, \Lambda^{(n)})}{\sum_{l=1}^{L} \sum_{t=2}^{T_l} \sum_{j=1}^{N} \Pr(s_{t-1}^{(l)} = s_i,\, s_t^{(l)} = s_j \mid O_l, \Lambda^{(n)})} = \frac{\sum_{l=1}^{L} \sum_{t=2}^{T_l} \xi_t^{(l)}(i,j)}{\sum_{l=1}^{L} \sum_{t=2}^{T_l} \sum_{j=1}^{N} \xi_t^{(l)}(i,j)}$$
$$b_i(v_m)^{(n+1)} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \Pr(s_t^{(l)} = s_i,\, o_t^{(l)} = v_m \mid O_l, \Lambda^{(n)})}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \sum_{m=1}^{M} \Pr(s_t^{(l)} = s_i,\, o_t^{(l)} = v_m \mid O_l, \Lambda^{(n)})} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{(l)}(i)\, \delta(o_t^{(l)}, v_m)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{(l)}(i)}$$
Baum-Welch: Gaussian Mixture CDHMM (I)
• Treat both the state sequence Sl and the mixture-component label sequence ll as missing data
• Only the estimation of B is different
• E-step:
$$Q_B(B; B^{(n)}) = \sum_{l=1}^{L} \sum_{i=1}^{N} \sum_{k=1}^{K} \sum_{t=1}^{T_l} \ln b_{ik}(X_t^{(l)}) \cdot \Pr(s_t^{(l)} = s_i,\, l_t^{(l)} = k \mid O_l, \Lambda^{(n)})$$
with
$$\ln b_{ik}(X) = \ln \omega_{ik} - \frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_{ik}| - \frac{1}{2}(X - \mu_{ik})^t\, \Sigma_{ik}^{-1}\,(X - \mu_{ik})$$
• M-step (for the mean vectors):
$$\frac{\partial Q_B}{\partial \mu_{ik}} = 0 \;\Rightarrow\; \mu_{ik}^{(n+1)} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} X_t^{(l)} \cdot \Pr(s_t^{(l)} = s_i,\, l_t^{(l)} = k \mid O_l, \Lambda^{(n)})}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \Pr(s_t^{(l)} = s_i,\, l_t^{(l)} = k \mid O_l, \Lambda^{(n)})}$$
Baum-Welch: Gaussian Mixture CDHMM (II)
M-step (for the covariances and the mixture weights):
$$\frac{\partial Q_B}{\partial \Sigma_{ik}} = 0 \;\Rightarrow\; \Sigma_{ik}^{(n+1)} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} (X_t^{(l)} - \mu_{ik}^{(n)})(X_t^{(l)} - \mu_{ik}^{(n)})^t \cdot \Pr(s_t^{(l)} = s_i,\, l_t^{(l)} = k \mid O_l, \Lambda^{(n)})}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \Pr(s_t^{(l)} = s_i,\, l_t^{(l)} = k \mid O_l, \Lambda^{(n)})}$$
$$\frac{\partial Q_B}{\partial \omega_{ik}} = 0 \;\Rightarrow\; \omega_{ik}^{(n+1)} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \Pr(s_t^{(l)} = s_i,\, l_t^{(l)} = k \mid O_l, \Lambda^{(n)})}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \sum_{k=1}^{K} \Pr(s_t^{(l)} = s_i,\, l_t^{(l)} = k \mid O_l, \Lambda^{(n)})}$$
where the posterior probabilities are calculated as
$$\Pr(s_t^{(l)} = s_i,\, l_t^{(l)} = k \mid O_l, \Lambda^{(n)}) \equiv \zeta_t^{(l)}(i,k) = \gamma_t^{(l)}(i) \cdot \frac{\omega_{ik}^{(n)}\, N(X_t^{(l)}; \mu_{ik}^{(n)}, \Sigma_{ik}^{(n)})}{\sum_{k=1}^{K} \omega_{ik}^{(n)}\, N(X_t^{(l)}; \mu_{ik}^{(n)}, \Sigma_{ik}^{(n)})}, \qquad \gamma_t^{(l)}(i) = \frac{\alpha_t^{(l)}(i)\, \beta_t^{(l)}(i)}{P(O_l \mid \Lambda^{(n)})}$$
HMM Learning: Summary
• For an HMM model Λ = {A, B, π} and a training data set D = {O1, O2, ..., OL}:
1. Initialization: Λ^(0) = {A^(0), B^(0), π^(0)}
2. n = 0
3. For each observation sequence Ol (l = 1, 2, ..., L):
• Calculate αt(i) and βt(i) based on Λ^(n)
• Calculate all other posterior probabilities
• Accumulate a numerator and a denominator for each HMM parameter
4. HMM parameter update: Λ^(n+1) = the accumulated numerators divided by the denominators
5. n = n + 1; go to step 3 and repeat until convergence
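A simplified training loop in the spirit of this summary is sketched below; for brevity it applies the single-sequence update from the earlier sketch to each Ol in turn rather than pooling the numerators and denominators across sequences as the summary describes.

```python
import numpy as np

# Sketch of the outer training loop; baum_welch_step() and forward() are the earlier sketches.
# Note: this updates after every sequence for brevity instead of pooling the accumulators.

def train(pi, A, B, sequences, n_iters=20, tol=1e-6):
    prev_ll = -np.inf
    for _ in range(n_iters):
        ll = 0.0
        for obs in sequences:
            pi, A, B = baum_welch_step(pi, A, B, obs)
            ll += np.log(forward(pi, A, B, obs)[1])
        if ll - prev_ll < tol:           # stop when the log-likelihood stops improving
            break
        prev_ll = ll
    return pi, A, B
```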
HMM Applications
• Parameters for interpolated n-gram models• Part-of-speech tagging• Speech recognition• Machine translation• Many others
Summary
• Today's and next classes
– Hidden Markov Models
• Notes
– Project summary due on 2/24; let's start our discussion
– Project plan finalized on 3/3 (presentation on 4/16 ???)
– Lab 3 assigned on 2/12 and due on 2/26
– Midterm on 3/12 (???)
– Final at 8am on 4/27 (shall we try a take-home ???)
• Reading Assignments
– Manning and Schütze, Chapter 9