Chap09: Hidden Markov Models


    Statistical NLP: Hidden Markov Models

    Updated 4/09


    Markov Models

    Markov models are statistical tools that are useful for NLP because they can be used for part-of-speech-tagging applications.

    Their first use was in modeling the letter sequences in works of Russian literature. They were later developed as a general statistical tool.

    More specifically, they model a sequence (perhaps through time) of random variables that are not necessarily independent.

    They rely on two assumptions: Limited Horizon and Time Invariance.


    Markov Assumptions

    Let X = (X1, ..., XT) be a sequence of random variables taking values in some finite set S = {s1, ..., sn}, the state space. The Markov properties are:

    Limited Horizon: P(Xt = sk | X1, ..., Xt-1) = P(Xt = sk | Xt-1), i.e., the state at position t only depends on the previous state.

    Time Invariant: P(Xt+1 = sk | Xt) = P(X2 = sk | X1), i.e., the dependency does not change over time.

    If X possesses these properties, then X is said to be a Markov Chain.


    Markov Model parameters

    A Markov chain is described by:

    A stochastic transition matrix A: aij = P(Xt+1 = sj | Xt = si)

    Probabilities of the initial state of the chain: πi = P(X1 = si)

    There are obvious normalization constraints on A and on π.

    Finally, a Markov Model is formally specified by a three-tuple (S, π, A), where S is the set of states, and π, A are the probabilities for the initial state and the state transitions.


    Probability of sequence of states

    By the Markov assumption:

    P(X1, X2, ..., XT) = P(X1) P(X2 | X1) ... P(XT | XT-1)
                       = πX1 ∏t=1..T-1 aXt,Xt+1


    Example of a Markov Chain

    [Figure: a Markov chain over the states t, e, h, a, p, i plus a Start state; the arcs carry the transition probabilities shown on the slide (1, .4, 1, .3, .3, .4, .6, .6, .4). The calculation below uses P(X1 = t) = 1.0, P(i | t) = 0.3, and P(p | i) = 0.6.]

    Probability of a sequence of states:

    P(t, i, p) = P(X1 = t) P(X2 = i | X1 = t) P(X3 = p | X2 = i) = 1.0 * 0.3 * 0.6 = 0.18
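
    The same calculation can be scripted directly from the chain parameters. This is a minimal sketch: only the start probability of t and the transitions t→i and i→p are actually given by the slide's calculation, so the dictionaries below contain just those entries.

      # Chain probability of a state sequence: P(X1..XT) = pi[X1] * prod of a[Xt, Xt+1].
      # Only the entries used in the slide's example are filled in.
      pi = {"t": 1.0}                              # start probability of state t (from the slide)
      a = {("t", "i"): 0.3, ("i", "p"): 0.6}       # transition probabilities used in the example

      def sequence_probability(states):
          """P(X1, ..., XT) = pi[X1] * product over t of a[Xt, Xt+1]."""
          p = pi.get(states[0], 0.0)
          for prev, nxt in zip(states, states[1:]):
              p *= a.get((prev, nxt), 0.0)
          return p

      print(sequence_probability(["t", "i", "p"]))   # 0.18, as in the slide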


    Are n-gram models Markov Models?

    Bi-gram models are obviously Markov Models.

    What about tri-gram, four-gram models, etc.?


    Stationary distribution of Markov Models

    What is the distribution of a Markov state p(Xn) for very large n? Obviously, the initial state is forgotten.

    Stationary distribution: p A = p, i.e., p is a left eigenvector of A with eigenvalue 1 (equivalently, the eigenvalue problem A'p' = 1*p' for the transpose).

    Example (the two-state chain with states C and I, starting in C):

        A = | 0.7  0.3 |
            | 0.5  0.5 |

    Solution of the eigenvalue problem: p = (5/8, 3/8)
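
    Numerically, the stationary distribution is the left eigenvector of A for eigenvalue 1, normalized to sum to 1. A small sketch using numpy, with the matrix values from the slide:

      import numpy as np

      # Transition matrix of the two-state chain from the slide (rows sum to 1).
      A = np.array([[0.7, 0.3],
                    [0.5, 0.5]])

      # Left eigenvector for eigenvalue 1: solve A^T v = v, then normalize v to sum to 1.
      eigvals, eigvecs = np.linalg.eig(A.T)
      v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
      p = v / v.sum()

      print(p)    # approximately [0.625 0.375], i.e., (5/8, 3/8)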


    Hidden Markov Models (HMM)

    Here the states are hidden. We have only indirect clues about the states, namely the symbols each state emits:

    P(Ot = k | Xt = si, Xt+1 = sj) = bijk


    The crazy soft drink machine

    [Figure: two states, Cola Preferred (CP) and Ice tea Preferred (IP); the machine starts in CP; transitions CP→CP 0.7, CP→IP 0.3, IP→IP 0.5, IP→CP 0.5.]

    Output probabilities:

              cola   ice_t   lem
        CP    0.6    0.1     0.3
        IP    0.1    0.7     0.2

    Comment: for this machine the output really depends only on si, namely bijk = bik.


    The crazy soft drink machine cont.

    What is the probability of observing {lem, ice_t}? We need to sum over all 4 possible paths that might be taken through the HMM:

    0.7*0.3*0.7*0.1 + 0.7*0.3*0.3*0.1 + 0.3*0.3*0.5*0.7 + 0.3*0.3*0.5*0.7 = 0.084

    [Figure: the four paths CP→{CP,IP}→{CP,IP} through the trellis, labelled with the transition probabilities 0.7/0.3 (out of CP) and 0.5/0.5 (out of IP), together with the output probability table from the previous slide.]
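
    The same number can be obtained by brute-force enumeration of the hidden paths, a useful check before moving to the forward algorithm. A sketch under the slide's assumptions (start in CP; arc emissions depend only on the state being left, i.e., bijk = bik):

      from itertools import product

      states = ["CP", "IP"]
      a = {"CP": {"CP": 0.7, "IP": 0.3},                      # transition probabilities
           "IP": {"CP": 0.5, "IP": 0.5}}
      b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},     # emission probabilities (bik)
           "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

      obs = ["lem", "ice_t"]                                  # observation sequence from the slide

      total = 0.0
      for path in product(states, repeat=len(obs)):           # choices for X2, X3 (X1 is fixed to CP)
          p, prev = 1.0, "CP"
          for o, nxt in zip(obs, path):
              p *= b[prev][o] * a[prev][nxt]                  # emit o from prev, then move to nxt
              prev = nxt
          total += p

      print(total)    # 0.084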


    Why Use Hidden Markov Models?

    HMMs are useful when one can think of underlying events probabilistically generating surface events. Example: part-of-speech tagging, or speech.

    HMMs can be trained efficiently using the EM algorithm.

    Another example where HMMs are useful is in generating the parameters for linear interpolation of n-gram models.


    Interpolating parameters for n-gram models

    [Figure: an HMM for interpolating n-gram models. From state wa wb, ε-arcs weighted λ1ab, λ2ab, λ3ab lead to emissions of the next word w with probabilities P1(w), P2(w | wb), and P3(w | wa wb) respectively, ending in the corresponding state wb w (wb w1, wb w2, ..., wb wM).]


    Interpolating parameters for n-gram models: comments

    The HMM probability of observing the sequence (wn-2, wn-1, wn) is equivalent to

    Plin(wn | wn-2, wn-1) = λ1 P1(wn) + λ2 P2(wn | wn-1) + λ3 P3(wn | wn-1, wn-2)

    The λ-arcs are special (ε-)transitions that produce no output symbol.

    In the above model each word pair (a, b) has a different HMM. This is relaxed by using tied states.
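
    As a concrete illustration, the linear interpolation itself is just a weighted sum of the component model probabilities. A minimal sketch: the component models p1, p2, p3 and the λ weights below are made-up placeholders, not values from the slides.

      # Linearly interpolated trigram probability:
      # Plin(w | u, v) = lam1*P1(w) + lam2*P2(w | v) + lam3*P3(w | u, v)

      def p_lin(w, u, v, p1, p2, p3, lams):
          """p1, p2, p3 are callables for the uni-, bi-, and trigram models; lams sums to 1."""
          lam1, lam2, lam3 = lams
          return lam1 * p1(w) + lam2 * p2(w, v) + lam3 * p3(w, u, v)

      # Toy component models (placeholders for illustration only).
      p1 = lambda w: {"cat": 0.1, "dog": 0.05}.get(w, 0.01)
      p2 = lambda w, v: 0.2 if (v, w) == ("the", "cat") else 0.02
      p3 = lambda w, u, v: 0.4 if (u, v, w) == ("saw", "the", "cat") else 0.01

      print(p_lin("cat", "saw", "the", p1, p2, p3, (0.2, 0.3, 0.5)))   # 0.2*0.1 + 0.3*0.2 + 0.5*0.4 = 0.28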


    General Form of an HMM

    An HMM is specified by a five-tuple (S, K, π, A, B), where S and K are the set of states and the output alphabet, and π, A, B are the probabilities for the initial state, state transitions, and symbol emissions, respectively.

    When the sets S and K are obvious, an HMM will be represented by a three-tuple (π, A, B).

    Given a specification of an HMM, we can simulate the running of a Markov process and produce an output sequence using the algorithm shown on the next page.

    More interesting than a simulation, however, is assuming that some set of data was generated by an HMM, and then being able to calculate probabilities and probable underlying state sequences.


    A Program for a Markov Process modeled by an HMM

    t := 1
    Start in state si with probability πi (i.e., X1 = i)
    Forever do:
        Move from state si to state sj with probability aij (i.e., Xt+1 = j)
        Emit observation symbol ot = k with probability bijk
        t := t + 1
    End
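
    A runnable version of this simulation, as a sketch. The helper names are mine; the parameters are the crazy soft drink machine's, using the state-emission simplification bijk = bik so that the emission can be drawn from the state being left.

      import random

      states = ["CP", "IP"]
      pi = {"CP": 1.0, "IP": 0.0}                              # the machine always starts in CP
      a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
      b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},      # bik: emission from the state being left
           "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

      def draw(dist):
          """Sample a key from a {key: probability} dictionary."""
          return random.choices(list(dist), weights=dist.values())[0]

      def simulate(T):
          """Run the Markov process for T steps, returning (state sequence, observations)."""
          x = draw(pi)                       # X1
          xs, os = [x], []
          for _ in range(T):
              os.append(draw(b[x]))          # emit ot from the current state
              x = draw(a[x])                 # move to Xt+1
              xs.append(x)
          return xs, os

      print(simulate(5))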


    The Three Fundamental Questions for HMMs

    Given a model λ = (A, B, π), how do we efficiently compute how likely a certain observation is, that is, P(O | λ)?

    Given the observation sequence O and a model λ, how do we choose a state sequence (X1, ..., XT+1) that best explains the observations?

    Given an observation sequence O, and a space of possible models found by varying the model parameters λ = (A, B, π), how do we find the model that best explains the observed data?


    Finding the probability of an observation I

    Given the observation sequence O = (o1, ..., oT) and a model λ = (A, B, π), we wish to know how to efficiently compute P(O | λ).

    For any state sequence X = (X1, ..., XT+1), we find:

    P(O, X | λ) = P(X | λ) P(O | X, λ) = πX1 ∏t=1..T aXt,Xt+1 bXt,Xt+1,ot    and    P(O | λ) = ΣX P(O, X | λ)

    This is simply the sum of the probability of the observation occurring according to each possible state sequence. Direct evaluation of this expression, however, is extremely inefficient.

    [Figure: a trellis of states S1, S2 unfolded over time t = 1, 2, 3, 4, 5, 6, 7, showing the possible paths through the HMM.]


    Finding the probability of an observation II

    In order to avoid this complexity, we can use dynamic programming or memoization techniques. In particular, we use trellis algorithms.

    We make a square array of states versus time and compute the probabilities of being at each state at each time in terms of the probabilities of being in each state at the preceding time.

    A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time. The probability of longer subpaths can then be worked out in terms of the shorter subpaths.


    Finding the probability of an observation III: The forward procedure

    A forward variable αi(t) = P(o1 o2 ... ot-1, Xt = i | λ) is stored at (si, t) in the trellis and expresses the total probability of ending up in state si at time t.

    Forward variables are calculated as follows:

    Initialization: αi(1) = πi,  1 ≤ i ≤ N

    Induction: αj(t+1) = Σi=1..N αi(t) aij bijot,  1 ≤ t ≤ T,  1 ≤ j ≤ N

    Total: P(O | λ) = Σi=1..N αi(T+1)

    This algorithm requires 2N²T multiplications (much less than the direct method, which takes on the order of (2T+1)·N^(T+1)).
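
    A direct transcription of the forward procedure as a Python sketch (the function name and data layout are mine), using the arc-emission convention of these slides with b indexed by the pair of states and the symbol. The example parameters are the crazy soft drink machine, with bijk taken to depend only on i.

      def forward(pi, a, b, obs, states):
          """alpha[t][i] = P(o1..ot, Xt+1 = i | model), indexed 0-based; returns (alpha, P(O | model))."""
          T = len(obs)
          alpha = [{i: 0.0 for i in states} for _ in range(T + 1)]
          for i in states:                                   # initialization: alpha_i(1) = pi_i
              alpha[0][i] = pi[i]
          for t in range(T):                                 # induction over o_t
              for j in states:
                  alpha[t + 1][j] = sum(alpha[t][i] * a[i][j] * b[i][j][obs[t]] for i in states)
          return alpha, sum(alpha[T][i] for i in states)     # total: P(O | model)

      # Crazy soft drink machine, with b[i][j][k] = b[i][k] (emission depends only on i).
      states = ["CP", "IP"]
      pi = {"CP": 1.0, "IP": 0.0}
      a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
      bik = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
             "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
      b = {i: {j: bik[i] for j in states} for i in states}

      alpha, prob = forward(pi, a, b, ["lem", "ice_t"], states)
      print(prob)    # 0.084, matching the path enumeration earlier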


    Finding the probability of an observation IV: The backward procedure

    The backward procedure computes backward variables, which are the total probability of seeing the rest of the observation sequence given that we were in state si at time t.

    Backward variables are useful for the problem of parameter estimation.


    Finding the probability of an observation V: The backward procedure

    Let βi(t) = P(ot ... oT | Xt = i, λ) be the backward variables.

    Backward variables can be calculated working backward through the trellis as follows:

    Initialization: βi(T+1) = 1,  1 ≤ i ≤ N

    Induction: βi(t) = Σj=1..N aij bijot βj(t+1),  1 ≤ t ≤ T,  1 ≤ i ≤ N

    Total: P(O | λ) = Σi=1..N πi βi(1)
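
    A matching sketch of the backward procedure, with the same arc-emission convention and the same hypothetical parameter layout as the forward sketch above.

      def backward(pi, a, b, obs, states):
          """beta[t][i] = P(ot+1..oT | Xt+1 = i, model), indexed 0-based; returns (beta, P(O | model))."""
          T = len(obs)
          beta = [{i: 0.0 for i in states} for _ in range(T + 1)]
          for i in states:                                   # initialization: beta_i(T+1) = 1
              beta[T][i] = 1.0
          for t in range(T - 1, -1, -1):                     # induction, working backward over t
              for i in states:
                  beta[t][i] = sum(a[i][j] * b[i][j][obs[t]] * beta[t + 1][j] for j in states)
          return beta, sum(pi[i] * beta[0][i] for i in states)   # total: P(O | model)

      # Same crazy-soft-drink-machine parameters as in the forward sketch.
      states = ["CP", "IP"]
      pi = {"CP": 1.0, "IP": 0.0}
      a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
      bik = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
             "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
      b = {i: {j: bik[i] for j in states} for i in states}

      beta, prob = backward(pi, a, b, ["lem", "ice_t"], states)
      print(prob)    # 0.084 again, as expected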


    Finding the probability of an observation VI: Combining forward and backward

    P(O, Xt = i | λ)
      = P(o1 ... oT, Xt = i | λ)
      = P(o1 ... ot-1, Xt = i, ot, ..., oT | λ)
      = P(o1 ... ot-1, Xt = i | λ) P(ot, ..., oT | o1 ... ot-1, Xt = i, λ)
      = P(o1 ... ot-1, Xt = i | λ) P(ot, ..., oT | Xt = i, λ)
      = αi(t) βi(t)

    Total: P(O | λ) = Σi=1..N αi(t) βi(t),  1 ≤ t ≤ T+1
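
    The identity Σi αi(t) βi(t) = P(O | λ) holds at every t, which makes a convenient correctness check. A compact matrix-form sketch with numpy; the parameters are again the soft drink machine, written in state-emission form (b is an N×K matrix, bijk = bik).

      import numpy as np

      A = np.array([[0.7, 0.3],                    # transition matrix (rows: CP, IP)
                    [0.5, 0.5]])
      B = np.array([[0.6, 0.1, 0.3],               # emission matrix (columns: cola, ice_t, lem)
                    [0.1, 0.7, 0.2]])
      pi = np.array([1.0, 0.0])
      obs = [2, 1]                                  # the sequence lem, ice_t as column indices of B

      T, N = len(obs), len(pi)
      alpha = np.zeros((T + 1, N)); beta = np.ones((T + 1, N))
      alpha[0] = pi
      for t in range(T):                            # forward: alpha_j(t+1) = sum_i alpha_i(t) a_ij b_i,ot
          alpha[t + 1] = (alpha[t] * B[:, obs[t]]) @ A
      for t in range(T - 1, -1, -1):                # backward: beta_i(t) = b_i,ot * sum_j a_ij beta_j(t+1)
          beta[t] = B[:, obs[t]] * (A @ beta[t + 1])

      print([float((alpha[t] * beta[t]).sum()) for t in range(T + 1)])   # 0.084 at every t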


    Finding the Best State Sequence I

    One method consists of finding the states individually:

    For each t, 1 ≤ t ≤ T+1, we would like to find the X̂t that maximizes P(Xt | O, λ).

    Let γi(t) = P(Xt = i | O, λ) = P(Xt = i, O | λ) / P(O | λ) = αi(t) βi(t) / Σj=1..N αj(t) βj(t)

    The individually most likely state is X̂t = argmax1≤i≤N γi(t),  1 ≤ t ≤ T+1

    This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence.


    Finding the Best State Sequence II: The Viterbi Algorithm

    The Viterbi algorithm efficiently computes the most likely state sequence.

    Commonly, we want to find the most likely complete path, that is: argmaxX P(X | O, λ)

    To do this, it is sufficient to maximize for a fixed O: argmaxX P(X, O | λ)

    We define δj(t) = maxX1..Xt-1 P(X1 ... Xt-1, o1 ... ot-1, Xt = j | λ)

    ψj(t) records the node of the incoming arc that led to this most probable path.


    Finding the Best State Sequence III: The Viterbi Algorithm

    The Viterbi Algorithm works as follows:

    Initialization: δj(1) = πj,  1 ≤ j ≤ N

    Induction: δj(t+1) = max1≤i≤N δi(t) aij bijot,  1 ≤ j ≤ N

    Store backtrace: ψj(t+1) = argmax1≤i≤N δi(t) aij bijot,  1 ≤ j ≤ N

    Termination and path readout:

    X̂T+1 = argmax1≤j≤N δj(T+1)

    X̂t = ψX̂t+1(t+1)

    P(X̂) = max1≤j≤N δj(T+1)
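
    A Python sketch of the Viterbi recursion (the function name and data layout are mine), again using the arc-emission convention bijot, with the soft drink machine parameters from the earlier slides as a worked example.

      def viterbi(pi, a, b, obs, states):
          """Return the most likely state sequence X1..XT+1 and its probability."""
          T = len(obs)
          delta = [{j: 0.0 for j in states} for _ in range(T + 1)]
          psi = [{j: None for j in states} for _ in range(T + 1)]
          for j in states:                                   # initialization
              delta[0][j] = pi[j]
          for t in range(T):                                 # induction with backtrace
              for j in states:
                  scores = {i: delta[t][i] * a[i][j] * b[i][j][obs[t]] for i in states}
                  psi[t + 1][j] = max(scores, key=scores.get)
                  delta[t + 1][j] = scores[psi[t + 1][j]]
          last = max(delta[T], key=delta[T].get)             # termination
          path = [last]
          for t in range(T, 0, -1):                          # path readout via the backtrace
              path.append(psi[t][path[-1]])
          return list(reversed(path)), delta[T][last]

      states = ["CP", "IP"]
      pi = {"CP": 1.0, "IP": 0.0}
      a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
      bik = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
             "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
      b = {i: {j: bik[i] for j in states} for i in states}

      print(viterbi(pi, a, b, ["lem", "ice_t"], states))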


    Finding the Best State Sequence IV: Visualization

    [Figure: a trellis of states S1-S4 unfolded over time t = 1, 2, 3, 4, 5, 6, 7, with the most probable path into state 2 at t = 6 marked.]

    δ2(t=6) = probability of reaching state 2 at t=6 by the most probable path (marked) through state 2 at t=6

    ψ2(t=6) = 3: the state preceding state 2 at t=6 on the most probable path through state 2 at t=6


    Parameter Estimation I

    Given a certain observation sequence, we want to find the values of the model parameters λ = (A, B, π) which best explain what we observed.

    Using Maximum Likelihood Estimation, we can find the values that maximize P(O | λ), i.e., argmaxλ P(Otraining | λ).

    There is no known analytic method to choose λ so as to maximize P(O | λ). However, we can locally maximize it with an iterative hill-climbing algorithm known as the Baum-Welch or Forward-Backward algorithm (a special case of the EM algorithm).


    Parameter Estimation II

    We don't know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model.

    Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most.

    By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.


    Parameter Estimation III

    Define pt(i,j) as the probability of traversing a certain arc (from state i at time t to state j at time t+1), given the observation sequence; the expression for pt(i,j) is given on the next slide.


    Parameter Estimation IV

    The probability of traversing a certain arc at time t, given the observation sequence, can be expressed as:

    pt(i,j) = P(Xt = i, Xt+1 = j | O, λ) = αi(t) aij bijot βj(t+1) / Σm=1..N αm(t) βm(t)

    The probability of leaving state i at time t is then:

    γi(t) = Σj=1..N pt(i,j)


    Parameter Estimation V

    Re-estimation of the initial-state and transition probabilities:

    π̂i = γi(1)

    âij = Σt=1..T pt(i,j) / Σt=1..T γi(t)
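
    A sketch of one re-estimation step built on these formulas. It uses the state-emission matrix layout of the earlier numpy sketch (bijk = bik); the re-estimation of B is not written out on these slides, so only π̂ and â are updated here, and the observation sequence is an arbitrary example.

      import numpy as np

      # One Baum-Welch re-estimation step for pi and A (state-emission form, b_ijk = b_ik).
      A = np.array([[0.7, 0.3], [0.5, 0.5]])
      B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])
      pi = np.array([1.0, 0.0])
      obs = [2, 1, 0]                         # e.g., lem, ice_t, cola
      T, N = len(obs), len(pi)

      # Forward and backward variables (same recurrences as in the earlier sketches).
      alpha = np.zeros((T + 1, N)); beta = np.ones((T + 1, N))
      alpha[0] = pi
      for t in range(T):
          alpha[t + 1] = (alpha[t] * B[:, obs[t]]) @ A
      for t in range(T - 1, -1, -1):
          beta[t] = B[:, obs[t]] * (A @ beta[t + 1])
      prob = float(alpha[0] @ beta[0])        # P(O | model)

      # p_t(i,j) = alpha_i(t) a_ij b_i,ot beta_j(t+1) / P(O | model); gamma_i(t) = sum_j p_t(i,j).
      p = np.array([np.outer(alpha[t] * B[:, obs[t]], beta[t + 1]) * A / prob for t in range(T)])
      gamma = p.sum(axis=2)                   # shape (T, N)

      pi_hat = gamma[0]                                        # pi^_i = gamma_i(1)
      a_hat = p.sum(axis=0) / gamma.sum(axis=0)[:, None]       # a^_ij = sum_t p_t(i,j) / sum_t gamma_i(t)
      print(pi_hat, a_hat, sep="\n")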


    Parameter Estimation VI

    Namely, given an HMM model λ = (A, B, π), the Baum-Welch algorithm gives new values for the model parameters λ̂ = (Â, B̂, π̂).

    One may prove that P(O | λ̂) ≥ P(O | λ), which is a general property of EM.

    The process is repeated until convergence.

    It is prone to local maxima, another property of EM.