7/30/2019 Chap09 Hmm
Statistical NLP:Hidden Markov Models
Updated 4/09
Markov Models
Markov models are statistical tools that are useful for NLP because they can be used for part-of-speech tagging applications.
Their first use was in modeling the letter sequences in works of Russian literature. They were later developed as a general statistical tool.
More specifically, they model a sequence (perhaps through time) of random variables that are not necessarily independent.
They rely on two assumptions: Limited Horizon and Time Invariance.
Markov Assumptions
Let X = (X1, ..., XT) be a sequence of random variables taking values in some finite set S = {s1, ..., sN}, the state space. The Markov properties are:
Limited Horizon: P(Xt = sk | X1, ..., Xt-1) = P(Xt = sk | Xt-1), i.e., a state at position t only depends on the previous state.
Time Invariant: P(X2 = sk | X1) = P(Xt = sk | Xt-1) for all t, i.e., the dependency does not change over time.
If X possesses these properties, then X is said to be a Markov chain.
Markov Model Parameters
A Markov chain is described by:
A stochastic transition matrix A: aij = P(Xt+1 = sj | Xt = si)
Probabilities of the initial states of the chain: πi = P(X1 = si)
There are obvious normalization constraints on A and on π: Σj aij = 1 for every i, and Σi πi = 1.
Finally, a Markov model is formally specified by a three-tuple (S, π, A), where S is the set of states and π, A are the probabilities for the initial state and the state transitions.
Probability of a sequence of states
By the Markov assumption:
P(X1, X2, ..., XT) = P(X1) P(X2 | X1) ... P(XT | XT-1)
= πX1 Πt=1..T-1 aXt,Xt+1
Example of a Markov Chain
[Figure: example Markov chain over the states t, e, h, a, p, i, with a Start state; arc probabilities include 0.3, 0.4, 0.6, and 1.0.]
Probability of a sequence of states:
P(t, i, p) = P(X1 = t) P(X2 = i | X1 = t) P(X3 = p | X2 = i) = 1.0 * 0.3 * 0.6 = 0.18
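The chain probability above can be sketched in a few lines. The transition table below fills in only the two arcs needed for the worked example (the remaining arcs of the figure are omitted):

```python
# Start distribution and transitions for the example chain; only the
# arcs used in the worked example are filled in (others omitted).
start = {"t": 1.0}                       # pi: the chain starts in state t
trans = {"t": {"i": 0.3},                # a(t -> i) = 0.3
         "i": {"p": 0.6}}                # a(i -> p) = 0.6

def sequence_probability(states):
    """P(X1..XT) = pi(X1) * prod_t a(Xt -> Xt+1), by the Markov assumption."""
    p = start.get(states[0], 0.0)
    for prev, nxt in zip(states, states[1:]):
        p *= trans.get(prev, {}).get(nxt, 0.0)
    return p

print(sequence_probability(["t", "i", "p"]))   # 1.0 * 0.3 * 0.6 = 0.18
```

Any transition not listed in the table gets probability 0, so impossible sequences correctly score 0.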
Are n-gram models Markov Models?
Bigram models are obviously Markov models.
What about trigram, four-gram models, etc.? (They are too, over a larger state space: e.g., a trigram model is a Markov chain whose states are word pairs.)
Stationary distribution of Markov Models
What is the distribution of a Markov state P(Xn) for very large n?
Obviously, the initial state is forgotten.
The stationary distribution satisfies p A = p, i.e., p is a left eigenvector of A with eigenvalue 1 (equivalently, a right eigenvector of the transpose of A).
Example:
[Figure: two-state chain with states C and I; C→C 0.7, C→I 0.3, I→C 0.5, I→I 0.5; a start state.]
Here A = [[0.7, 0.3], [0.5, 0.5]]
Solution of the eigenvalue problem: p = (5/8, 3/8)
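Assuming A is the row-stochastic matrix above, the stationary distribution can be found numerically as the left eigenvector of A for eigenvalue 1; a minimal NumPy sketch:

```python
import numpy as np

# Transition matrix from the example: rows = current state (C, I),
# columns = next state; rows sum to 1.
A = np.array([[0.7, 0.3],
              [0.5, 0.5]])

# The stationary distribution solves p A = p, i.e. p is a left
# eigenvector of A with eigenvalue 1 (a right eigenvector of A.T).
vals, vecs = np.linalg.eig(A.T)
p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
p = p / p.sum()          # normalize to a probability distribution

print(p)                 # -> [0.625 0.375], i.e. (5/8, 3/8)
```

Dividing by the sum both normalizes the eigenvector and fixes its arbitrary sign.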
Hidden Markov Models (HMM)
Here the states are hidden. We have only clues about the states from the symbols each state emits:
P(Ot = k | Xt = si, Xt+1 = sj) = bijk
The crazy soft drink machine
[Figure: two hidden states, Cola Preferred (CP) and Ice tea Preferred (IP); CP→CP 0.7, CP→IP 0.3, IP→IP 0.5, IP→CP 0.5; start → CP.]
Output probabilities:
        cola   ice_t   lem
CP      0.6    0.1     0.3
IP      0.1    0.7     0.2
Comment: for this machine the output really depends only on si, namely bijk = bik.
The crazy soft drink machine cont.
What is the probability of observing {lem, ice_t}? We need to sum over all 4 possible paths that might be taken through the HMM:
0.7*0.3*0.7*0.1 + 0.7*0.3*0.3*0.1 + 0.3*0.3*0.5*0.7 + 0.3*0.3*0.5*0.7 = 0.084
[Figure: trellis of the four paths through CP and IP, with the output probability table repeated.]
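This sum over paths can be checked by brute force. The sketch below uses the emission table and transitions from the previous slide, assumes the machine starts in CP (as in the figure), and emits from the current state (bik):

```python
from itertools import product

# Soft-drink machine, state-emission form (output depends only on
# the current state, as noted on the previous slide).
states = ["CP", "IP"]
start  = {"CP": 1.0, "IP": 0.0}               # machine starts in Cola Preferred
a = {"CP": {"CP": 0.7, "IP": 0.3},
     "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

def observation_probability(obs):
    """Brute-force sum over every state path of length len(obs)."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = start[path[0]]
        for t, (s, o) in enumerate(zip(path, obs)):
            p *= b[s][o]                      # emit o_t from state X_t
            if t + 1 < len(path):
                p *= a[s][path[t + 1]]        # move to X_{t+1}
        total += p
    return total

print(observation_probability(["lem", "ice_t"]))   # ~0.084, as on the slide
```

Enumerating length-2 paths suffices here: the slide's four length-3 paths only add trailing transitions, which sum to 1.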
Why Use Hidden Markov Models?
HMMs are useful when one can think of underlying events probabilistically generating surface events. Example: part-of-speech tagging or speech.
HMMs can be trained efficiently using the EM algorithm.
Another example where HMMs are useful is in generating parameters for linear interpolation of n-gram models.
Interpolating parameters for n-gram models
[Figure: an HMM for linearly interpolating n-gram models. From the history state wawb, three ε-transitions with probabilities λ1, λ2, λ3 lead to states that emit the next word w with the unigram P1(w), bigram P2(w | wb), and trigram P3(w | wawb) distributions respectively, then continue to the next history state wbw1 ... wbwM.]
Interpolating parameters for n-gram models: comments
The HMM probability of observing the sequence (wn-2, wn-1, wn) is equivalent to
Plin(wn | wn-2, wn-1) = λ1 P1(wn) + λ2 P2(wn | wn-1) + λ3 P3(wn | wn-2, wn-1)
The ε-transitions are special transitions that produce no output symbol.
In the above model, each word pair (a, b) has a different HMM. This is relaxed by using tied states.
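As a concrete illustration of the interpolation formula, here is a minimal sketch; the λ weights and the toy probability tables are invented for the example (in practice the λs are exactly the HMM parameters trained by EM):

```python
# Illustrative weights and toy distributions; all numbers are made up.
lambdas = (0.2, 0.3, 0.5)                  # l1 + l2 + l3 = 1

p1 = {"cat": 0.1, "dog": 0.05}             # unigram P1(w)
p2 = {("cat", "the"): 0.3}                 # bigram  P2(w | w_{n-1})
p3 = {("cat", "saw", "the"): 0.6}          # trigram P3(w | w_{n-2}, w_{n-1})

def p_lin(w, w2, w1):
    """Plin(w | w2, w1) = l1 P1(w) + l2 P2(w | w1) + l3 P3(w | w2, w1)."""
    l1, l2, l3 = lambdas
    return (l1 * p1.get(w, 0.0)
            + l2 * p2.get((w, w1), 0.0)
            + l3 * p3.get((w, w2, w1), 0.0))

print(p_lin("cat", "saw", "the"))          # 0.2*0.1 + 0.3*0.3 + 0.5*0.6 = 0.41
```

Missing entries fall back to probability 0, which is where the lower-order models earn their keep.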
General Form of an HMM
An HMM is specified by a five-tuple (S, K, π, A, B), where S and K are the set of states and the output alphabet, and π, A, B are the probabilities for the initial state, state transitions, and symbol emissions, respectively.
When the sets S and K are obvious, an HMM will be represented by a three-tuple (π, A, B).
Given a specification of an HMM, we can simulate the running of a Markov process and produce an output sequence using the algorithm shown on the next page.
More interesting than a simulation, however, is assuming that some set of data was generated by an HMM, and then being able to calculate probabilities and probable underlying state sequences.
A Program for a Markov Process modeled by an HMM
t := 1
Start in state si with probability πi (i.e., X1 = i)
Forever do:
    Move from state si to state sj with probability aij (i.e., Xt+1 = j)
    Emit observation symbol ot = k with probability bijk
    t := t + 1
End
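A runnable version of this loop, instantiated (as an assumption, for illustration) with the soft-drink machine's state-emission parameters from the earlier slides:

```python
import random

random.seed(0)   # fixed seed so the simulation is reproducible

# Soft-drink machine parameters, state-emission form (b depends only
# on the current state).
A = {"CP": [("CP", 0.7), ("IP", 0.3)],
     "IP": [("CP", 0.5), ("IP", 0.5)]}
B = {"CP": [("cola", 0.6), ("ice_t", 0.1), ("lem", 0.3)],
     "IP": [("cola", 0.1), ("ice_t", 0.7), ("lem", 0.2)]}

def draw(pairs):
    """Sample one item from a list of (item, probability) pairs."""
    items, weights = zip(*pairs)
    return random.choices(items, weights)[0]

def simulate(T, start="CP"):
    state, out = start, []
    for _ in range(T):
        out.append(draw(B[state]))   # emit o_t from X_t
        state = draw(A[state])       # move to X_{t+1}
    return out

print(simulate(5))                   # a random length-5 output sequence
```

The hidden state sequence is discarded here, exactly as an outside observer of the machine would only see the drinks.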
The Three Fundamental Questions for HMMs
Given a model μ = (A, B, π), how do we efficiently compute how likely a certain observation is, that is, P(O | μ)?
Given the observation sequence O and a model μ, how do we choose a state sequence (X1, ..., XT+1) that best explains the observations?
Given an observation sequence O, and a space of possible models found by varying the model parameters μ = (A, B, π), how do we find the model that best explains the observed data?
Finding the probability of an observation I
Given the observation sequence O = (o1, ..., oT) and a model μ = (A, B, π), we wish to know how to efficiently compute P(O | μ).
For any state sequence X = (X1, ..., XT+1), we find:
P(O | μ) = ΣX P(O | X, μ) P(X | μ)
This is simply the sum of the probability of the observation occurring according to each possible state sequence.
Direct evaluation of this expression, however, is extremely inefficient.
[Figure: successive paths through a trellis of states S1, S2 over times t = 1..7, illustrating the exponentially many state sequences.]
Finding the probability of an observation II
In order to avoid this complexity, we can use dynamic programming or memoization techniques. In particular, we use trellis algorithms.
We make a square array of states versus time and compute the probabilities of being at each state at each time in terms of the probabilities for being in each state at the preceding time.
A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time. The probability of longer subpaths can then be worked out in terms of the shorter subpaths.
Finding the probability of an observation III: The forward procedure
A forward variable, αi(t) = P(o1 o2 ... ot-1, Xt = i | μ), is stored at (si, t) in the trellis and expresses the total probability of ending up in state si at time t.
Forward variables are calculated as follows:
Initialization: αi(1) = πi, 1 ≤ i ≤ N
Induction: αj(t+1) = Σi=1..N αi(t) aij bijot, 1 ≤ t ≤ T, 1 ≤ j ≤ N
Total: P(O | μ) = Σi=1..N αi(T+1)
This algorithm requires 2N²T multiplications (much less than the direct method, which takes on the order of (2T+1)·N^(T+1)).
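A minimal sketch of the forward procedure on the soft-drink model. It uses the state-emission form bik (the simpler special case noted for that machine) rather than the arc emissions bijot written above, so one N×N update per symbol replaces the exponential path enumeration:

```python
import numpy as np

# Soft-drink model, state-emission variant (b depends only on the
# source state); pi, A, B are taken from the earlier slides.
pi = np.array([1.0, 0.0])                  # start in CP
A  = np.array([[0.7, 0.3],                 # rows: from-state (CP, IP)
               [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3],            # rows: CP, IP; cols: cola, ice_t, lem
               [0.1, 0.7, 0.2]])

def forward(obs):
    """alpha_i(1) = pi_i; alpha_j(t+1) = sum_i alpha_i(t) b_i(o_t) a_ij;
    returns P(O) = sum_i alpha_i(T+1)."""
    alpha = pi.copy()
    for o in obs:                          # one N x N update per symbol
        alpha = (alpha * B[:, o]) @ A
    return alpha.sum()

print(forward([2, 1]))                     # P({lem, ice_t}) ~ 0.084
```

This matches the brute-force sum over paths computed for the soft-drink machine earlier.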
Finding the probability of an observation IV: The backward procedure
The backward procedure computes backward variables, which give the total probability of seeing the rest of the observation sequence given that we were in state si at time t.
Backward variables are useful for the problem of parameter estimation.
Finding the probability of an observation V: The backward procedure
Let βi(t) = P(ot ... oT | Xt = i, μ) be the backward variables.
Backward variables can be calculated working backward through the trellis as follows:
Initialization: βi(T+1) = 1, 1 ≤ i ≤ N
Induction: βi(t) = Σj=1..N aij bijot βj(t+1), 1 ≤ t ≤ T, 1 ≤ i ≤ N
Total: P(O | μ) = Σi=1..N πi βi(1)
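The same quantity computed with backward variables, again in the state-emission form on the soft-drink model; it reproduces the forward result:

```python
import numpy as np

# Soft-drink model, state-emission form (numbers from the earlier slides).
pi = np.array([1.0, 0.0])
A  = np.array([[0.7, 0.3],
               [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3],            # cols: cola, ice_t, lem
               [0.1, 0.7, 0.2]])

def backward(obs):
    """beta_i(T+1) = 1; beta_i(t) = b_i(o_t) * sum_j a_ij beta_j(t+1);
    returns P(O) = sum_i pi_i beta_i(1)."""
    beta = np.ones(len(pi))
    for o in reversed(obs):                # work backward through the trellis
        beta = B[:, o] * (A @ beta)
    return pi @ beta

print(backward([2, 1]))                    # ~0.084, same as the forward pass
```

Getting identical totals from both directions is a handy sanity check when implementing an HMM.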
Finding the probability of an observation VI: Combining forward and backward
P(O, Xt = i | μ)
= P(o1 ... oT, Xt = i | μ)
= P(o1 ... ot-1, Xt = i, ot ... oT | μ)
= P(o1 ... ot-1, Xt = i | μ) P(ot ... oT | o1 ... ot-1, Xt = i, μ)
= P(o1 ... ot-1, Xt = i | μ) P(ot ... oT | Xt = i, μ)
= αi(t) βi(t)
Total:
P(O | μ) = Σi=1..N αi(t) βi(t), 1 ≤ t ≤ T+1
Finding the Best State Sequence I
One method consists of finding the states individually:
For each t, 1 ≤ t ≤ T+1, we would like to find the X̂t that maximizes P(Xt | O, μ).
Let γi(t) = P(Xt = i | O, μ) = P(Xt = i, O | μ) / P(O | μ) = αi(t) βi(t) / Σj=1..N αj(t) βj(t)
The individually most likely state is X̂t = argmax1≤i≤N γi(t), 1 ≤ t ≤ T+1.
This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence.
Finding the Best State Sequence II: The Viterbi Algorithm
The Viterbi algorithm efficiently computes the most likely state sequence.
Commonly, we want to find the most likely complete path, that is: argmaxX P(X | O, μ)
To do this, it is sufficient to maximize for a fixed O: argmaxX P(X, O | μ)
We define δj(t) = maxX1..Xt-1 P(X1 ... Xt-1, o1 ... ot-1, Xt = j | μ)
ψj(t) records the node of the incoming arc that led to this most probable path.
Finding the Best State Sequence III: The Viterbi Algorithm
The Viterbi algorithm works as follows:
Initialization: δj(1) = πj, 1 ≤ j ≤ N
Induction: δj(t+1) = max1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
Store backtrace: ψj(t+1) = argmax1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
Termination and path readout:
X̂T+1 = argmax1≤j≤N δj(T+1)
X̂t = ψX̂t+1(t+1)
P(X̂) = max1≤j≤N δj(T+1)
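A sketch of Viterbi on the soft-drink model (state-emission form, as before): identical to the forward pass but with max in place of sum, plus backpointers ψ. NumPy's argmax breaks ties toward the lower-numbered state.

```python
import numpy as np

# Soft-drink model, state-emission form (numbers from the earlier slides).
pi = np.array([1.0, 0.0])
A  = np.array([[0.7, 0.3],
               [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3],            # cols: cola, ice_t, lem
               [0.1, 0.7, 0.2]])

def viterbi(obs):
    """delta_j(1) = pi_j; delta_j(t+1) = max_i delta_i(t) b_i(o_t) a_ij;
    psi records the argmax for the backtrace."""
    delta = pi.copy()
    backptr = []
    for o in obs:
        scores = (delta * B[:, o])[:, None] * A   # scores[i, j]
        backptr.append(scores.argmax(axis=0))     # psi_j(t+1)
        delta = scores.max(axis=0)
    # Path readout: best final state, then follow the backpointers.
    path = [int(delta.argmax())]
    for psi in reversed(backptr):
        path.append(int(psi[path[-1]]))
    path.reverse()
    return path, float(delta.max())

path, p = viterbi([2, 1])                  # observe (lem, ice_t)
print(path, p)                             # path [0, 1, 0], probability ~0.0315
```

For this observation the best path visits CP, then IP; the final state is actually a tie between CP and IP (both continuations score 0.0315), resolved here in CP's favor.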
Finding the Best State Sequence IV: Visualization
[Figure: trellis over states S1..S4 and times t = 1..7, with the most probable path marked.]
δ2(t=6) = probability of reaching state 2 at t=6 by the most probable path (marked) through state 2 at t=6
ψ2(t=6) = 3 is the state preceding state 2 at t=6 on the most probable path through state 2 at t=6
Parameter Estimation I
Given a certain observation sequence, we want to find the values of the model parameters μ = (A, B, π) which best explain what we observed.
Using Maximum Likelihood Estimation, we can find the values that maximize P(O | μ), i.e., argmaxμ P(Otraining | μ).
There is no known analytic method to choose μ to maximize P(O | μ). However, we can locally maximize it by an iterative hill-climbing algorithm known as Baum-Welch or the Forward-Backward algorithm (a special case of the EM algorithm).
Parameter Estimation II
We don't know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model.
Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most.
By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.
Parameter Estimation III
Define pt(i, j) as the probability of traversing the arc from state i to state j at time t, given the observation sequence O:
pt(i, j) = αi(t) aij bijot βj(t+1) / Σm=1..N αm(t) βm(t)

Parameter Estimation IV
The probability of leaving state i at time t is then
γi(t) = Σj=1..N pt(i, j)
and the reestimated initial-state and transition probabilities are
π̂i = γi(1)
âij = Σt=1..T pt(i, j) / Σt=1..T γi(t)

Parameter Estimation V
Similarly, the reestimated emission probabilities count the arcs that emit symbol k:
b̂ijk = Σ{t : ot = k} pt(i, j) / Σt=1..T pt(i, j)
Parameter Estimation VI
Namely, given an HMM model μ = (A, B, π), the Baum-Welch algorithm gives new values μ̂ = (Â, B̂, π̂) for the model parameters.
One may prove that P(O | μ̂) ≥ P(O | μ), which is a general property of EM.
The process is repeated until convergence.
It is prone to local maxima, another property of EM.
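A compact sketch of one reestimation step, again in the state-emission form (bik) rather than the arc-emission form written on the slides, for a single observation sequence. It computes the arc probabilities pt(i, j), the state probabilities γi(t), and the reestimates π̂, Â, B̂; the soft-drink model at the end is just an illustrative assumption:

```python
import numpy as np

# One Baum-Welch (EM) reestimation step for a state-emission HMM,
# following the slides' conventions alpha_i(1) = pi_i, beta_i(T+1) = 1.
def forward_backward(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T + 1, N))
    beta = np.zeros((T + 1, N))
    alpha[0] = pi                              # alpha_i(1) = pi_i
    for t, o in enumerate(obs):
        alpha[t + 1] = (alpha[t] * B[:, o]) @ A
    beta[T] = 1.0                              # beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):
        beta[t] = B[:, obs[t]] * (A @ beta[t + 1])
    return alpha, beta

def baum_welch_step(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha, beta = forward_backward(pi, A, B, obs)
    p_obs = alpha[T].sum()                     # P(O | mu)
    gamma = alpha * beta / p_obs               # gamma_i(t) = P(X_t = i | O, mu)
    xi = np.zeros((T, N, N))                   # p_t(i, j), arc probabilities
    for t, o in enumerate(obs):
        xi[t] = (alpha[t] * B[:, o])[:, None] * A * beta[t + 1] / p_obs
    new_pi = gamma[0]                          # pi-hat_i = gamma_i(1)
    denom = gamma[:T].sum(axis=0)              # sum_t gamma_i(t)
    new_A = xi.sum(axis=0) / denom[:, None]    # a-hat_ij
    new_B = np.zeros_like(B)                   # b-hat_ik
    for k in range(B.shape[1]):
        mask = np.array([float(o == k) for o in obs])
        new_B[:, k] = (gamma[:T] * mask[:, None]).sum(axis=0) / denom
    return new_pi, new_A, new_B

# Example: one step on the soft-drink model, observing (lem, ice_t).
pi = np.array([1.0, 0.0])
A  = np.array([[0.7, 0.3], [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])
new_pi, new_A, new_B = baum_welch_step(pi, A, B, [2, 1])
```

Iterating baum_welch_step until the likelihood stops improving is the full algorithm; each step keeps the rows of Â and B̂ stochastic and never decreases P(O | μ).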