Fall 2001 EE669: Natural Language Processing 1
Lecture 9: Hidden Markov Models (HMMs)
(Chapter 9 of Manning and Schutze)
Dr. Mary P. Harper
ECE, Purdue University
harper@purdue.edu
yara.ecn.purdue.edu/~harper
Fall 2001 EE669: Natural Language Processing 2
Markov Assumptions
• Often we want to consider a sequence of random variables that aren't independent, but rather the value of each variable depends on previous elements in the sequence.
• Let X = (X1, .., XT) be a sequence of random variables taking values
in some finite set S = {s1, …, sN}, the state space. If X possesses the following
properties, then X is a Markov Chain.
• Limited Horizon:
P(Xt+1 = sk | X1, …, Xt) = P(Xt+1 = sk | Xt), i.e., a word's state only depends on the previous state.
• Time Invariant (Stationary):
P(Xt+1 = sk | Xt) = P(X2 = sk | X1), i.e., the dependency does not change over time.
Fall 2001 EE669: Natural Language Processing 3
Definition of a Markov Chain
• A is a stochastic N x N matrix of transition probabilities with:
– aij = Pr{transition from si to sj} = Pr(Xt+1 = sj | Xt = si), aij ≥ 0 for all i, j, and for all i:
$$\sum_{j=1}^{N} a_{ij} = 1$$
• Π is a vector of N elements representing the initial state probability distribution with:
– πi = Pr{the initial state is si} = Pr(X1 = si), and
$$\sum_{i=1}^{N} \pi_i = 1$$
– Can avoid Π by creating a special start state s0.
Fall 2001 EE669: Natural Language Processing 4
Markov Chain
Fall 2001 EE669: Natural Language Processing 5
Markov Process & Language Models
• Chain rule:
P(W) = P(w1,w2,...,wT) = ∏i=1..T p(wi | w1,w2,...,wi-1)
• n-gram language models:
– Markov process (chain) of order n-1:
P(W) = P(w1,w2,...,wT) = ∏i=1..T p(wi | wi-n+1,wi-n+2,...,wi-1)
Using just one distribution (Ex.: trigram model: p(wi|wi-2,wi-1)):
Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Words: My car broke down , and within hours Bob ’s car broke down , too .
p( ,|broke down) = p(w5|w3,w4) = p(w14|w12,w13)
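As a concrete illustration of the trigram factorization above, here is a minimal Python sketch; the probability values and the <s> sentence-start padding are invented for illustration, not taken from the slides or estimated from data.

```python
# Minimal sketch of the trigram factorization P(W) = prod_i p(w_i | w_{i-2}, w_{i-1}).
# The probabilities below are invented for illustration only.
trigram_p = {
    ("<s>", "<s>", "My"): 0.01,
    ("<s>", "My", "car"): 0.20,
    ("My", "car", "broke"): 0.10,
    ("car", "broke", "down"): 0.50,
    ("broke", "down", ","): 0.30,
}

def sentence_prob(words, p, floor=1e-6):
    """Multiply p(w_i | w_{i-2}, w_{i-1}) over the sentence, padding with <s>."""
    padded = ["<s>", "<s>"] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= p.get(tuple(padded[i-2:i+1]), floor)  # tiny floor for unseen trigrams
    return prob

print(sentence_prob(["My", "car", "broke", "down", ","], trigram_p))
```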
Fall 2001 EE669: Natural Language Processing 6
Example with a Markov Chain
(Figure: a Markov chain over the letter states h, a, p, i, t, e plus a Start state, with arcs labeled by transition probabilities such as .3, .4, .6, and 1.)
Fall 2001 EE669: Natural Language Processing 7
Probability of a Sequence of States
• P(t, a, p, p) = 1.0 × 0.3 × 0.4 × 1.0 = 0.12
• P(t, a, p, e) = ?
$$
P(X_1, \dots, X_T) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_T \mid X_1, \dots, X_{T-1})
= P(X_1) \prod_{t=1}^{T-1} P(X_{t+1} \mid X_t)
= \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}
$$
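A minimal Python sketch of this product, assuming a start distribution and the transition probabilities implied by the worked example above (π(t) = 1.0, a(t→a) = 0.3, a(a→p) = 0.4, a(p→p) = 1.0); arcs not listed are assumed to have probability 0, since the original figure is not reproduced here.

```python
# Sketch: P(X1..XT) = pi[X1] * prod_t a[X_t -> X_{t+1}] for a Markov chain.
pi = {"t": 1.0}                       # start in state t (from the worked example)
A = {("t", "a"): 0.3,
     ("a", "p"): 0.4,
     ("p", "p"): 1.0}                 # values implied by P(t,a,p,p) = 1.0*0.3*0.4*1.0

def seq_prob(states, pi, A):
    prob = pi.get(states[0], 0.0)
    for s, s_next in zip(states, states[1:]):
        prob *= A.get((s, s_next), 0.0)   # missing arcs assumed to have probability 0
    return prob

print(seq_prob(["t", "a", "p", "p"], pi, A))   # 0.12, matching the slide
```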
Fall 2001 EE669: Natural Language Processing 8
Hidden Markov Models (HMMs)
• Sometimes it is not possible to know precisely which states the model passes through; all we can do is observe some phenomenon that occurs when in that state with some probability distribution.
• An HMM is an appropriate model for cases when you don't know the state sequence that the model passes through, but only some probabilistic function of it. For example:
– Word recognition (discrete utterance or within continuous speech)
– Phoneme recognition
– Part-of-Speech Tagging
– Linear Interpolation
Fall 2001 EE669: Natural Language Processing 9
Example HMM
• Process:
– According to Π, pick a state qi. Select a ball from urn i (state is hidden).
– Observe the color (observable).
– Replace the ball.
– According to A, transition to the next urn from which to pick a ball (state is hidden).
• N states corresponding to urns: q1, q2, …, qN
• M colors for balls found in urns: z1, z2, …, zM
• A: N x N matrix; transition probability distribution (between urns)
• B: N x M matrix; for each urn, there is a probability density function for each color.
– bij = Pr(COLOR = zj | URN = qi)
• Π: N vector; initial state probability distribution (where do we start?)
Fall 2001 EE669: Natural Language Processing 10
What is an HMM?
• Green circles are hidden states
• Dependent only on the previous state
Fall 2001 EE669: Natural Language Processing 11
What is an HMM?
• Purple nodes are observations
• Dependent only on their corresponding hidden state (for a state emit model)
Fall 2001 EE669: Natural Language Processing 12
Elements of an HMM
• HMM (the most general case):
– a five-tuple (S, K, Π, A, B), where:
• S = {s1,s2,...,sN} is the set of states,
• K = {k1,k2,...,kM} is the output alphabet,
• Π = {πi}, i ∈ S, the initial state probabilities,
• A = {aij}, i, j ∈ S, the transition probabilities,
• B = {bijk}, i, j ∈ S, k ∈ K (arc emission), or
• B = {bik}, i ∈ S, k ∈ K (state emission), the emission probabilities.
• State Sequence: X = (x1, x2, ..., xT+1), xt : S → {1, 2, …, N}
• Observation (output) Sequence: O = (o1, ..., oT), ot ∈ K
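One possible way to hold this five-tuple in code, sketched here for the state-emission variant; the class and field names are illustrative choices, not something the slides prescribe.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    states: list          # S = {s1..sN}
    alphabet: list        # K = {k1..kM}
    pi: np.ndarray        # (N,)   initial state probabilities
    A: np.ndarray         # (N, N) transition probabilities a_ij
    B: np.ndarray         # (N, M) state-emission probabilities b_ik

# Tiny made-up example: two states, two output symbols.
model = HMM(states=["s1", "s2"], alphabet=["k1", "k2"],
            pi=np.array([0.6, 0.4]),
            A=np.array([[0.7, 0.3], [0.4, 0.6]]),
            B=np.array([[0.9, 0.1], [0.2, 0.8]]))
```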
Fall 2001 EE669: Natural Language Processing 13
HMM Formalism
• {S, K, Π, A, B}
• S : {s1…sN} are the values for the hidden states
• K : {k1…kM} are the values for the observations
(Figure: graphical model of the HMM — a chain of hidden states S, each emitting an observation K.)
Fall 2001 EE669: Natural Language Processing 14
HMM Formalism
• {S, K, Π, A, B}
• Π = {πi} are the initial state probabilities
• A = {aij} are the state transition probabilities
• B = {bik} are the observation probabilities
(Figure: the same chain of hidden states S and observations K, with A labeling state-to-state arcs and B labeling state-to-observation arcs.)
Fall 2001 EE669: Natural Language Processing 15
Example of an Arc Emit HMM
(Figure: a four-state arc-emission HMM, entered at state 1, with transition probabilities such as 0.4, 0.6, 0.12, 0.88, and 1, and with each arc carrying an output distribution over the symbols t, o, e, e.g., p(t)=.8, p(o)=.1, p(e)=.1 on one arc and p(t)=0, p(o)=0, p(e)=1 on another.)
Fall 2001 EE669: Natural Language Processing 16
HMMs
1. N states in the model: a state has some measurable, distinctive properties.
2. At clock time t, you make a state transition (to a different state or back to the same state), based on a transition probability distribution that depends on the current state (i.e., the one you're in before making the transition).
3. After each transition, you output an observation symbol according to a probability distribution which depends on the current state or the current arc.
Goal: From the observations, determine what model generated the output, e.g., in word recognition with a different HMM for each word, the goal is to choose the one that best fits the input word (based on state transitions and output observations).
Fall 2001 EE669: Natural Language Processing 17
Example HMM
• N states corresponding to urns: q1, q2, …,qN
• M colors for balls found in urns: z1, z2, …, zM
• A: N x N matrix; transition probability distribution (between urns)
• B: N x M matrix; for each urn, there is a probability density function for each color.
– bij = Pr(COLOR = zj | URN = qi)
• Π: N vector; initial state probability distribution (where do we start?)
Fall 2001 EE669: Natural Language Processing 18
Example HMM
• Process:
– According to Π, pick a state qi. Select a ball from urn i (state is hidden).
– Observe the color (observable).
– Replace the ball.
– According to A, transition to the next urn from which to pick a ball (state is hidden).
• Design: Choose N and M. Specify a model λ = (A, B, Π) from the training data. Adjust the model parameters (A, B, Π) to maximize P(O|λ).
• Use: Given O and λ = (A, B, Π), what is P(O|λ)? If we compute P(O|λ) for all models λ, then we can determine which model most likely generated O.
Fall 2001 EE669: Natural Language Processing 19
Simulating a Markov Process
t := 1
Start in state si with probability πi (i.e., X1 = i)
Forever do:
  Move from state si to state sj with probability aij (i.e., Xt+1 = j)
  Emit observation symbol ot = k with probability bik (bijk for arc emission)
  t := t + 1
End
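A runnable sketch of this simulation loop for a state-emission HMM, with the infinite loop cut off after T steps and each visited state emitting before the transition; the model values are made up.

```python
import numpy as np

def simulate(pi, A, B, T, rng=np.random.default_rng(0)):
    """Sample a state sequence and an observation sequence from a state-emission HMM."""
    N, M = B.shape
    states, obs = [], []
    i = rng.choice(N, p=pi)                 # start in state i with probability pi[i]
    for _ in range(T):                      # "Forever do", cut off after T steps
        states.append(i)
        obs.append(rng.choice(M, p=B[i]))   # emit o_t = k with probability b_ik
        i = rng.choice(N, p=A[i])           # move to state j with probability a_ij
    return states, obs

# Made-up two-state example.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(simulate(pi, A, B, T=5))
```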
Fall 2001 EE669: Natural Language Processing 20
Why Use Hidden Markov Models?
• HMMs are useful when one can think of underlying events probabilistically generating surface events. Example: Part-of-Speech Tagging.
• HMMs can be efficiently trained using the EM Algorithm.
• Another example where HMMs are useful is in generating parameters for linear interpolation of n-gram models.
• Assuming that some set of data was generated by an HMM, the HMM is useful for calculating the probabilities of possible underlying state sequences.
Fall 2001 EE669: Natural Language Processing 21
Fundamental Questions for HMMs
• Given a model λ = (A, B, Π), how do we efficiently compute how likely a certain observation is, that is, P(O|λ)?
• Given the observation sequence O and a model λ, how do we choose a state sequence (X1, …, XT+1) that best explains the observations?
• Given an observation sequence O, and a space of possible models found by varying the model parameters λ = (A, B, Π), how do we find the model that best explains the observed data?
Fall 2001 EE669: Natural Language Processing 22
Probability of an Observation
• Given the observation sequence O = (o1, …, oT) and a model λ = (A, B, Π), we wish to know how to efficiently compute P(O|λ).
• Summing over all state sequences X = (X1, …, XT+1), we find: P(O|λ) = ΣX P(X|λ) P(O|X, λ)
• This is simply the probability of an observation sequence given the model.
• Direct evaluation of this expression is extremely inefficient; however, there are dynamic programming methods that compute it quite efficiently.
Fall 2001 EE669: Natural Language Processing 23
Probability of an Observation
Given an observation sequence O = (o1 … oT) and a model λ = (A, B, Π), compute the probability of the observation sequence, P(O|λ).
Fall 2001 EE669: Natural Language Processing 24
Probability of an Observation
For any state sequence X = (x1, …, xT+1):
$$P(O \mid X, \lambda) = b_{x_1 o_1}\, b_{x_2 o_2} \cdots b_{x_T o_T}$$
$$P(X \mid \lambda) = \pi_{x_1}\, a_{x_1 x_2}\, a_{x_2 x_3} \cdots a_{x_T x_{T+1}}$$
$$P(O, X \mid \lambda) = P(O \mid X, \lambda)\, P(X \mid \lambda)$$
$$P(O \mid \lambda) = \sum_X P(X \mid \lambda)\, P(O \mid X, \lambda)$$
Fall 2001 EE669: Natural Language Processing 28
Probability of an Observation (State Emit)
$$P(O \mid \lambda) = \sum_{\{x_1 \dots x_{T+1}\}} \pi_{x_1} \prod_{t=1}^{T} a_{x_t x_{t+1}}\, b_{x_t o_t}$$
Fall 2001 EE669: Natural Language Processing 29
Probability of an Observation with an Arc Emit HMM
(Figure: the same four-state arc-emission HMM shown earlier, entered at state 1, with per-arc output distributions over t, o, e.)
p(toe) = .6×.8 × .88×.7 × 1×.6  +  .4×.5 × 1×1 × .88×.2  +  .4×.5 × 1×1 × .12×1  ≅ .237
Fall 2001 EE669: Natural Language Processing 30
Making Computation Efficient
• To avoid this complexity, use dynamic programming or memoization techniques, exploiting the trellis structure of the problem.
• Use an array of states versus time to compute the probability of being at each state at time t+1 in terms of the probabilities for being in each state at t.
• A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time. The probability of longer subpaths can then be worked out in terms of the shorter subpaths.
• A forward variable, αi(t) = P(o1o2…ot-1, Xt = i | λ), is stored at (si, t) in the trellis and expresses the total probability of ending up in state si at time t.
Fall 2001 EE669: Natural Language Processing 31
The Trellis Structure
Fall 2001 EE669: Natural Language Processing 32
Forward Procedure
• Special structure gives us an efficient solution using dynamic programming.
• Intuition: Probability of the first t observations is the same for all possible t+1 length state sequences (so don't recompute it!)
• Define:
$$\alpha_i(t) = P(o_1 \dots o_{t-1},\ x_t = i \mid \lambda)$$
Fall 2001 EE669: Natural Language Processing 33
Forward Procedure
$$
\begin{aligned}
\alpha_j(t+1) &= P(o_1 \dots o_t,\ x_{t+1} = j)\\
&= P(o_1 \dots o_t \mid x_{t+1} = j)\, P(x_{t+1} = j)\\
&= P(o_1 \dots o_{t-1} \mid x_{t+1} = j)\, P(o_t \mid x_{t+1} = j)\, P(x_{t+1} = j)\\
&= P(o_1 \dots o_{t-1},\ x_{t+1} = j)\, P(o_t \mid x_{t+1} = j)\\
&= \sum_{i=1}^{N} P(o_1 \dots o_{t-1},\ x_t = i,\ x_{t+1} = j)\, P(o_t \mid x_{t+1} = j)\\
&= \sum_{i=1}^{N} P(o_1 \dots o_{t-1},\ x_t = i)\, P(x_{t+1} = j \mid x_t = i)\, P(o_t \mid x_{t+1} = j)\\
&= \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ijo_t}
\end{aligned}
$$
Note: We omit the conditioning on λ from the formulas.
Note: P(A, B) = P(A | B) · P(B)
Fall 2001 EE669: Natural Language Processing 41
The Forward Procedure
• Forward variables are calculated as follows:
– Initialization: αi(1) = πi, 1 ≤ i ≤ N
– Induction: αj(t+1) = Σi=1..N αi(t) aij bijot, 1 ≤ t ≤ T, 1 ≤ j ≤ N
– Total: P(O|λ) = Σi=1..N αi(T+1)
• This algorithm requires 2N²T multiplications (much less than the direct method, which takes on the order of (2T+1)·N^(T+1) multiplications).
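A minimal numpy sketch of the forward procedure for a state-emission HMM. One bookkeeping difference: the slides define αi(t) = P(o1…ot-1, Xt = i | λ), while this sketch folds the emission of ot into α at time t, an equivalent and common convention; the example numbers are invented.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return P(O | lambda) and the alpha trellis for a state-emission HMM.
    pi: (N,) initial probabilities; A: (N, N) transitions; B: (N, M) emissions;
    obs: list of observation indices o_1..o_T."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization
    for t in range(1, T):                          # induction: O(N^2) per step
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                  # total: sum over final states

# Made-up two-state model, observation sequence of symbol indices.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0])[0])
```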
Fall 2001 EE669: Natural Language Processing 42
The Backward Procedure
• We could also compute these probabilities by working backward through time.
• The backward procedure computes backward variables which are the total probability of seeing the rest of the observation sequence given that we were in state si at time t.
• βi(t) = P(ot…oT | Xt = i, λ) is a backward variable.
• Backward variables are useful for the problem of parameter reestimation.
Fall 2001 EE669: Natural Language Processing 43
Backward Procedure
$$\beta_i(t) = P(o_t \dots o_T \mid x_t = i)$$
$$\beta_i(T+1) = 1$$
$$\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ijo_t}\, \beta_j(t+1)$$
Fall 2001 EE669: Natural Language Processing 44
The Backward Procedure
• Backward variables can be calculated working backward through the trellis as follows:
– Initialization: βi(T+1) = 1, 1 ≤ i ≤ N
– Induction: βi(t) = Σj=1..N aij bijot βj(t+1), 1 ≤ t ≤ T, 1 ≤ i ≤ N
– Total: P(O|λ) = Σi=1..N πi βi(1)
• Backward variables can also be combined with forward variables:
P(O|λ) = Σi=1..N αi(t) βi(t), 1 ≤ t ≤ T+1
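A matching numpy sketch of the backward procedure under the same state-emission convention as the forward sketch above (so β here covers the observations after time t, and the total Σi πi bi,o1 βi(1) equals the forward result); example values are invented.

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward procedure for a state-emission HMM (0-indexed over the T observations).
    beta[t, i] = P(o_{t+1} .. o_T | state at time t is i)."""
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                  # initialization at the last step
    for t in range(T - 2, -1, -1):                  # work backward through the trellis
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    total = np.sum(pi * B[:, obs[0]] * beta[0])     # combine with the start distribution
    return total, beta

# Made-up two-state model; the total matches the forward procedure's P(O | lambda).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(backward(pi, A, B, [0, 1, 0])[0])
```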
Fall 2001 EE669: Natural Language Processing 45
The Backward Procedure
• Combining forward and backward variables:
$$
\begin{aligned}
P(O,\ X_t = i \mid \lambda) &= P(o_1 \dots o_T,\ X_t = i \mid \lambda)\\
&= P(o_1 \dots o_{t-1},\ X_t = i,\ o_t \dots o_T \mid \lambda)\\
&= P(o_1 \dots o_{t-1},\ X_t = i \mid \lambda)\, P(o_t \dots o_T \mid o_1 \dots o_{t-1},\ X_t = i,\ \lambda)\\
&= P(o_1 \dots o_{t-1},\ X_t = i \mid \lambda)\, P(o_t \dots o_T \mid X_t = i,\ \lambda)\\
&= \alpha_i(t)\, \beta_i(t)
\end{aligned}
$$
Fall 2001 EE669: Natural Language Processing 46
Probability of an Observation
• Forward Procedure: $$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T+1)$$
• Backward Procedure: $$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$$
• Combination: $$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t), \quad 1 \le t \le T+1$$
Fall 2001 EE669: Natural Language Processing 47
Finding the Best State Sequence
• One method consists of optimizing on the states individually.
• For each t, 1 ≤ t ≤ T+1, we would like to find the X̂t that maximizes P(Xt | O, λ).
• Let γi(t) = P(Xt = i | O, λ) = P(Xt = i, O | λ) / P(O | λ) = αi(t)βi(t) / Σj=1..N αj(t)βj(t)
• The individually most likely state is X̂t = argmax1≤i≤N γi(t), 1 ≤ t ≤ T+1
• This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence.
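A short sketch of this individually-most-likely-state (posterior) decoding, built on the same forward/backward convention as the earlier sketches; model values are invented.

```python
import numpy as np

def posterior_decode(pi, A, B, obs):
    """Pick X_t = argmax_i gamma_i(t), where gamma_i(t) = alpha_i(t) beta_i(t) / P(O)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize by P(O | lambda) at each t
    return gamma.argmax(axis=1), gamma

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(posterior_decode(pi, A, B, [0, 1, 0])[0])
```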
Fall 2001 EE669: Natural Language Processing 48
Best State Sequence
• Find the state sequence that best explains the observation sequence
• Viterbi algorithm:
$$\hat{X} = \arg\max_X P(X \mid O, \lambda)$$
Fall 2001 EE669: Natural Language Processing 49
Finding the Best State Sequence: The Viterbi Algorithm
• The Viterbi algorithm efficiently computes the most likely state sequence.
• To find the most likely complete path, compute: argmaxX P(X | O, λ)
• To do this, it is sufficient to maximize for a fixed O: argmaxX P(X, O | λ)
• We define δj(t) = maxX1…Xt-1 P(X1…Xt-1, o1…ot-1, Xt = j | λ)
• ψj(t) records the node of the incoming arc that led to this most probable path.
Fall 2001 EE669: Natural Language Processing 50
Viterbi Algorithm
$$\delta_j(t) = \max_{x_1 \dots x_{t-1}} P(x_1 \dots x_{t-1},\ o_1 \dots o_{t-1},\ x_t = j \mid \lambda)$$
The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t
Fall 2001 EE669: Natural Language Processing 51
Viterbi Algorithm: Recursive Computation
$$\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ijo_t}$$
$$\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ijo_t}$$
Fall 2001 EE669: Natural Language Processing 52
Finding the Best State Sequence: The Viterbi Algorithm
The Viterbi Algorithm works as follows:
• Initialization: δj(1) = πj, 1 ≤ j ≤ N
• Induction: δj(t+1) = max1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
Store backtrace: ψj(t+1) = argmax1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
• Termination and path readout:
X̂T+1 = argmax1≤i≤N δi(T+1)
X̂t = ψX̂t+1(t+1)
P(X̂) = max1≤i≤N δi(T+1)
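A minimal Viterbi sketch in the same state-emission convention as the earlier code (so δ here includes the emission at time t, a slight departure from the slides' arc-emission indexing); example values are invented.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely state sequence and its probability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                   # read the path out backwards
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 1, 0]))
```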
Fall 2001 EE669: Natural Language Processing 53
Viterbi Algorithm
Compute the most likely state sequence by working backwards:
$$\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$$
$$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$$
$$P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$$
Fall 2001 EE669: Natural Language Processing 54
Third Problem: Parameter Estimation
• Given a certain observation sequence, we want to find the values of the model parameters λ = (A, B, Π) which best explain what we observed.
• Using Maximum Likelihood Estimation, find values to maximize P(O|λ), i.e., argmaxλ P(Otraining|λ).
• There is no known analytic method to choose λ to maximize P(O|λ). However, we can locally maximize it by an iterative hill-climbing algorithm known as the Baum-Welch or Forward-Backward algorithm. (This is a special case of the EM Algorithm.)
Fall 2001 EE669: Natural Language Processing 55
Parameter Estimation: The Forward-Backward Algorithm
• We don’t know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model.
• Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most.
• By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.
Fall 2001 EE669: Natural Language Processing 56
Parameter Estimation
• Given an observation sequence, find the model that is most likely to produce that sequence.
• No analytic method.
• Given a model and training sequences, update the model parameters to better fit the observations.
Fall 2001 EE669: Natural Language Processing 57
Definitions
The probability of traversing a certain arc at time t, given the observation sequence O:
$$
p_t(i, j) = P(X_t = i,\ X_{t+1} = j \mid O, \lambda)
= \frac{P(X_t = i,\ X_{t+1} = j,\ O \mid \lambda)}{P(O \mid \lambda)}
= \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}
$$
Let:
$$\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$$
Fall 2001 EE669: Natural Language Processing 58
The Probability of Traversing an Arc
(Figure: traversing the arc from state si to state sj between times t and t+1; the path probability factors as αi(t) · aij bijot · βj(t+1).)
Fall 2001 EE669: Natural Language Processing 59
Definitions
• The expected number of transitions from state i in O:
$$\sum_{t=1}^{T} \gamma_i(t)$$
• The expected number of transitions from state i to j in O:
$$\sum_{t=1}^{T} p_t(i, j)$$
Fall 2001 EE669: Natural Language Processing 60
Reestimation Procedure
• Begin with a model λ, perhaps selected at random.
• Run O through the current model to estimate the expectations of each model parameter.
• Change the model to maximize the values of the paths that are used a lot, while respecting the stochastic constraints.
• Repeat this process (hoping to converge on optimal values for the model parameters λ).
Fall 2001 EE669: Natural Language Processing 61
The Reestimation Formulas
The expected frequency in state i at time t = 1:
$$\hat{\pi}_i = \gamma_i(1)$$
$$
\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to } j}{\text{expected number of transitions from state } i}
= \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}
$$
From (A, B, Π), derive (Â, B̂, Π̂). Note that P(O|λ̂) ≥ P(O|λ).
Fall 2001 EE669: Natural Language Processing 62
Reestimation Formulas
$$
\hat{b}_{ijk} = \frac{\text{expected number of transitions from state } i \text{ to } j \text{ observing symbol } k}{\text{expected number of transitions from state } i \text{ to } j}
= \frac{\sum_{\{t:\ o_t = k,\ 1 \le t \le T\}} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}
$$
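A compact sketch of one Baum-Welch reestimation pass for a state-emission HMM, so it reestimates b̂ik rather than the arc-emission b̂ijk above; it follows the same conventions as the earlier sketches, with invented example values, and is not the original course code.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) update of (pi, A, B) from a single observation sequence."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()

    gamma = alpha * beta / p_obs                    # gamma[t, i] = P(X_t = i | O)
    xi = np.zeros((T - 1, N, N))                    # xi[t, i, j] = P(X_t=i, X_{t+1}=j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_obs

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                     # expected emissions of symbol k
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(baum_welch_step(pi, A, B, [0, 1, 0, 0, 1]))
```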
Fall 2001 EE669: Natural Language Processing 63
Parameter Estimation
Probability of traversing an arc at time t:
$$p_t(i, j) = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$
Probability of being in state i at time t:
$$\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$$
Fall 2001 EE669: Natural Language Processing 64
Parameter Estimation
Now we can compute the new estimates of the model parameters:
$$\hat{\pi}_i = \gamma_i(1)$$
$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$$
$$\hat{b}_{ik} = \frac{\sum_{\{t:\ o_t = k\}} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$$
Fall 2001 EE669: Natural Language Processing 65
HMM Applications
• Parameters for interpolated n-gram models
• Part of speech tagging
• Speech recognition