Fall 2001 EE669: Natural Language Processing 1
Lecture 9: Hidden Markov Models (HMMs)
(Chapter 9 of Manning and Schutze)
Dr. Mary P. Harper
ECE, Purdue University
harper@purdue.edu
yara.ecn.purdue.edu/~harper
Fall 2001 EE669: Natural Language Processing 2
Markov Assumptions
• Often we want to consider a sequence of random variables that aren't independent, but rather the value of each variable depends on previous elements in the sequence.
• Let X = (X1, .., XT) be a sequence of random variables taking values
in some finite set S = {s1, …, sN}, the state space. If X possesses the following
properties, then X is a Markov Chain.
• Limited Horizon:
P(Xt+1 = sk | X1, …, Xt) = P(Xt+1 = sk | Xt), i.e., a word's state only depends on the previous state.
• Time Invariant (Stationary):
P(Xt+1 = sk | Xt) = P(X2 = sk | X1), i.e., the dependency does not change over time.
Fall 2001 EE669: Natural Language Processing 3
Definition of a Markov Chain
• A is a stochastic N x N matrix of transition probabilities with:
– aij = Pr{transition from si to sj} = Pr(Xt+1 = sj | Xt = si), aij ≥ 0 for all i, j, and for all i:
$$\sum_{j=1}^{N} a_{ij} = 1$$
• Π is a vector of N elements representing the initial state probability distribution with:
– πi = Pr{the initial state is si} = Pr(X1 = si), and
$$\sum_{i=1}^{N} \pi_i = 1$$
– Can avoid Π by creating a special start state s0.
Fall 2001 EE669: Natural Language Processing 4
Markov Chain
Fall 2001 EE669: Natural Language Processing 5
Markov Process & Language Models
• Chain rule:
P(W) = P(w1,w2,...,wT) = ∏i=1..T p(wi | w1,w2,...,wi-1)
• n-gram language models:
– Markov process (chain) of order n-1:
P(W) = P(w1,w2,...,wT) = ∏i=1..T p(wi | wi-n+1,wi-n+2,...,wi-1)
Using just one distribution (Ex.: trigram model: p(wi|wi-2,wi-1)):
Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Words: My car broke down , and within hours Bob ’s car broke down , too .
p( ,|broke down) = p(w5|w3,w4) = p(w14|w12,w13)
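As a concrete illustration of the trigram factorization above, here is a minimal Python sketch; the probability values and the <s> sentence-start padding are invented for illustration, not taken from the slides or estimated from data.

```python
# Minimal sketch of the trigram factorization P(W) = prod_i p(w_i | w_{i-2}, w_{i-1}).
# The probabilities below are invented for illustration only.
trigram_p = {
    ("<s>", "<s>", "My"): 0.01,
    ("<s>", "My", "car"): 0.20,
    ("My", "car", "broke"): 0.10,
    ("car", "broke", "down"): 0.50,
    ("broke", "down", ","): 0.30,
}

def sentence_prob(words, p, floor=1e-6):
    """Multiply p(w_i | w_{i-2}, w_{i-1}) over the sentence, padding with <s>."""
    padded = ["<s>", "<s>"] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= p.get(tuple(padded[i-2:i+1]), floor)  # tiny floor for unseen trigrams
    return prob

print(sentence_prob(["My", "car", "broke", "down", ","], trigram_p))
```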
Fall 2001 EE669: Natural Language Processing 6
Example with a Markov Chain
(Figure: a Markov chain over the letter states h, a, p, i, t, e plus a Start state, with arcs labeled by transition probabilities such as .3, .4, .6, and 1.)
Fall 2001 EE669: Natural Language Processing 7
Probability of a Sequence of States
• P(t, a, p, p) = 1.0 × 0.3 × 0.4 × 1.0 = 0.12
• P(t, a, p, e) = ?
$$
P(X_1, \dots, X_T) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_T \mid X_1, \dots, X_{T-1})
= P(X_1) \prod_{t=1}^{T-1} P(X_{t+1} \mid X_t)
= \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}
$$
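A minimal Python sketch of this product, assuming a start distribution and the transition probabilities implied by the worked example above (π(t) = 1.0, a(t→a) = 0.3, a(a→p) = 0.4, a(p→p) = 1.0); arcs not listed are assumed to have probability 0, since the original figure is not reproduced here.

```python
# Sketch: P(X1..XT) = pi[X1] * prod_t a[X_t -> X_{t+1}] for a Markov chain.
pi = {"t": 1.0}                       # start in state t (from the worked example)
A = {("t", "a"): 0.3,
     ("a", "p"): 0.4,
     ("p", "p"): 1.0}                 # values implied by P(t,a,p,p) = 1.0*0.3*0.4*1.0

def seq_prob(states, pi, A):
    prob = pi.get(states[0], 0.0)
    for s, s_next in zip(states, states[1:]):
        prob *= A.get((s, s_next), 0.0)   # missing arcs assumed to have probability 0
    return prob

print(seq_prob(["t", "a", "p", "p"], pi, A))   # 0.12, matching the slide
```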
Fall 2001 EE669: Natural Language Processing 8
Hidden Markov Models (HMMs)
• Sometimes it is not possible to know precisely which states the model passes through; all we can do is observe some phenomenon that occurs when in that state with some probability distribution.
• An HMM is an appropriate model for cases when you don't know the state sequence that the model passes through, but only some probabilistic function of it. For example:
– Word recognition (discrete utterance or within continuous speech)
– Phoneme recognition
– Part-of-Speech Tagging
– Linear Interpolation
Fall 2001 EE669: Natural Language Processing 9
Example HMM
• Process:
– According to Π, pick a state qi. Select a ball from urn i (state is hidden).
– Observe the color (observable).
– Replace the ball.
– According to A, transition to the next urn from which to pick a ball (state is hidden).
• N states corresponding to urns: q1, q2, …, qN
• M colors for balls found in urns: z1, z2, …, zM
• A: N x N matrix; transition probability distribution (between urns)
• B: N x M matrix; for each urn, there is a probability density function for each color.
– bij = Pr(COLOR = zj | URN = qi)
• Π: N vector; initial state probability distribution (where do we start?)
Fall 2001 EE669: Natural Language Processing 10
What is an HMM?
• Green circles are hidden states
• Dependent only on the previous state
Fall 2001 EE669: Natural Language Processing 11
What is an HMM?
• Purple nodes are observations
• Dependent only on their corresponding hidden state (for a state emit model)
Fall 2001 EE669: Natural Language Processing 12
Elements of an HMM
• HMM (the most general case):
– a five-tuple (S, K, Π, A, B), where:
• S = {s1,s2,...,sN} is the set of states,
• K = {k1,k2,...,kM} is the output alphabet,
• Π = {πi}, i ∈ S, the initial state probabilities,
• A = {aij}, i, j ∈ S, the transition probabilities,
• B = {bijk}, i, j ∈ S, k ∈ K (arc emission), or
• B = {bik}, i ∈ S, k ∈ K (state emission), the emission probabilities.
• State Sequence: X = (x1, x2, ..., xT+1), xt : S → {1, 2, …, N}
• Observation (output) Sequence: O = (o1, ..., oT), ot ∈ K
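One possible way to hold this five-tuple in code, sketched here for the state-emission variant; the class and field names are illustrative choices, not something the slides prescribe.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    states: list          # S = {s1..sN}
    alphabet: list        # K = {k1..kM}
    pi: np.ndarray        # (N,)   initial state probabilities
    A: np.ndarray         # (N, N) transition probabilities a_ij
    B: np.ndarray         # (N, M) state-emission probabilities b_ik

# Tiny made-up example: two states, two output symbols.
model = HMM(states=["s1", "s2"], alphabet=["k1", "k2"],
            pi=np.array([0.6, 0.4]),
            A=np.array([[0.7, 0.3], [0.4, 0.6]]),
            B=np.array([[0.9, 0.1], [0.2, 0.8]]))
```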
Fall 2001 EE669: Natural Language Processing 13
HMM Formalism
• {S, K, Π, A, B}
• S : {s1…sN} are the values for the hidden states
• K : {k1…kM} are the values for the observations
(Figure: graphical model of the HMM — a chain of hidden states S, each emitting an observation K.)
Fall 2001 EE669: Natural Language Processing 14
HMM Formalism
• {S, K, Π, A, B}
• Π = {πi} are the initial state probabilities
• A = {aij} are the state transition probabilities
• B = {bik} are the observation probabilities
(Figure: the same chain of hidden states S and observations K, with A labeling state-to-state arcs and B labeling state-to-observation arcs.)
Fall 2001 EE669: Natural Language Processing 15
Example of an Arc Emit HMM
(Figure: a four-state arc-emission HMM, entered at state 1, with transition probabilities such as 0.4, 0.6, 0.12, 0.88, and 1, and with each arc carrying an output distribution over the symbols t, o, e, e.g., p(t)=.8, p(o)=.1, p(e)=.1 on one arc and p(t)=0, p(o)=0, p(e)=1 on another.)
Fall 2001 EE669: Natural Language Processing 16
HMMs
1. N states in the model: a state has some measurable, distinctive properties.
2. At clock time t, you make a state transition (to a different state or back to the same state), based on a transition probability distribution that depends on the current state (i.e., the one you're in before making the transition).
3. After each transition, you output an observation symbol according to a probability distribution which depends on the current state or the current arc.
Goal: From the observations, determine what model generated the output, e.g., in word recognition with a different HMM for each word, the goal is to choose the one that best fits the input word (based on state transitions and output observations).
Fall 2001 EE669: Natural Language Processing 17
Example HMM
• N states corresponding to urns: q1, q2, …,qN
• M colors for balls found in urns: z1, z2, …, zM
• A: N x N matrix; transition probability distribution (between urns)
• B: N x M matrix; for each urn, there is a probability density function for each color.
– bij = Pr(COLOR = zj | URN = qi)
• Π: N vector; initial state probability distribution (where do we start?)
Fall 2001 EE669: Natural Language Processing 18
Example HMM
• Process:
– According to Π, pick a state qi. Select a ball from urn i (state is hidden).
– Observe the color (observable).
– Replace the ball.
– According to A, transition to the next urn from which to pick a ball (state is hidden).
• Design: Choose N and M. Specify a model λ = (A, B, Π) from the training data. Adjust the model parameters (A, B, Π) to maximize P(O|λ).
• Use: Given O and λ = (A, B, Π), what is P(O|λ)? If we compute P(O|λ) for all models λ, then we can determine which model most likely generated O.
Fall 2001 EE669: Natural Language Processing 19
Simulating a Markov Process
t := 1
Start in state si with probability πi (i.e., X1 = i)
Forever do:
  Move from state si to state sj with probability aij (i.e., Xt+1 = j)
  Emit observation symbol ot = k with probability bik (bijk for arc emission)
  t := t + 1
End
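A runnable sketch of this simulation loop for a state-emission HMM, with the infinite loop cut off after T steps and each visited state emitting before the transition; the model values are made up.

```python
import numpy as np

def simulate(pi, A, B, T, rng=np.random.default_rng(0)):
    """Sample a state sequence and an observation sequence from a state-emission HMM."""
    N, M = B.shape
    states, obs = [], []
    i = rng.choice(N, p=pi)                 # start in state i with probability pi[i]
    for _ in range(T):                      # "Forever do", cut off after T steps
        states.append(i)
        obs.append(rng.choice(M, p=B[i]))   # emit o_t = k with probability b_ik
        i = rng.choice(N, p=A[i])           # move to state j with probability a_ij
    return states, obs

# Made-up two-state example.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(simulate(pi, A, B, T=5))
```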
Fall 2001 EE669: Natural Language Processing 20
Why Use Hidden Markov Models?
• HMMs are useful when one can think of underlying events probabilistically generating surface events. Example: Part-of-Speech Tagging.
• HMMs can be efficiently trained using the EM Algorithm.
• Another example where HMMs are useful is in generating parameters for linear interpolation of n-gram models.
• Assuming that some set of data was generated by an HMM, the HMM is useful for calculating the probabilities of possible underlying state sequences.
Fall 2001 EE669: Natural Language Processing 21
Fundamental Questions for HMMs
• Given a model λ = (A, B, Π), how do we efficiently compute how likely a certain observation is, that is, P(O|λ)?
• Given the observation sequence O and a model λ, how do we choose a state sequence (X1, …, XT+1) that best explains the observations?
• Given an observation sequence O, and a space of possible models found by varying the model parameters λ = (A, B, Π), how do we find the model that best explains the observed data?
Fall 2001 EE669: Natural Language Processing 22
Probability of an Observation
• Given the observation sequence O = (o1, …, oT) and a model λ = (A, B, Π), we wish to know how to efficiently compute P(O|λ).
• Summing over all state sequences X = (X1, …, XT+1), we find: P(O|λ) = ΣX P(X|λ) P(O|X, λ)
• This is simply the probability of an observation sequence given the model.
• Direct evaluation of this expression is extremely inefficient; however, there are dynamic programming methods that compute it quite efficiently.
Fall 2001 EE669: Natural Language Processing 23
Probability of an Observation
Given an observation sequence O = (o1 … oT) and a model λ = (A, B, Π), compute the probability of the observation sequence, P(O|λ).
Fall 2001 EE669: Natural Language Processing 24
Probability of an Observation
For any state sequence X = (x1, …, xT+1):
$$P(O \mid X, \lambda) = b_{x_1 o_1}\, b_{x_2 o_2} \cdots b_{x_T o_T}$$
$$P(X \mid \lambda) = \pi_{x_1}\, a_{x_1 x_2}\, a_{x_2 x_3} \cdots a_{x_T x_{T+1}}$$
$$P(O, X \mid \lambda) = P(O \mid X, \lambda)\, P(X \mid \lambda)$$
$$P(O \mid \lambda) = \sum_X P(X \mid \lambda)\, P(O \mid X, \lambda)$$
Fall 2001 EE669: Natural Language Processing 28
Probability of an Observation (State Emit)
$$P(O \mid \lambda) = \sum_{\{x_1 \dots x_{T+1}\}} \pi_{x_1} \prod_{t=1}^{T} a_{x_t x_{t+1}}\, b_{x_t o_t}$$
Fall 2001 EE669: Natural Language Processing 29
Probability of an Observation with an Arc Emit HMM
(Figure: the same four-state arc-emission HMM shown earlier, entered at state 1, with per-arc output distributions over t, o, e.)
p(toe) = .6×.8 × .88×.7 × 1×.6  +  .4×.5 × 1×1 × .88×.2  +  .4×.5 × 1×1 × .12×1  ≅ .237
Fall 2001 EE669: Natural Language Processing 30
Making Computation Efficient
• To avoid this complexity, use dynamic programming or memoization techniques, exploiting the trellis structure of the problem.
• Use an array of states versus time to compute the probability of being at each state at time t+1 in terms of the probabilities for being in each state at t.
• A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time. The probability of longer subpaths can then be worked out in terms of the shorter subpaths.
• A forward variable, αi(t) = P(o1o2…ot-1, Xt = i | λ), is stored at (si, t) in the trellis and expresses the total probability of ending up in state si at time t.
Fall 2001 EE669: Natural Language Processing 31
The Trellis Structure
Fall 2001 EE669: Natural Language Processing 32
Forward Procedure
• Special structure gives us an efficient solution using dynamic programming.
• Intuition: Probability of the first t observations is the same for all possible t+1 length state sequences (so don't recompute it!)
• Define:
$$\alpha_i(t) = P(o_1 \dots o_{t-1},\ x_t = i \mid \lambda)$$
Fall 2001 EE669: Natural Language Processing 33
Forward Procedure
$$
\begin{aligned}
\alpha_j(t+1) &= P(o_1 \dots o_t,\ x_{t+1} = j)\\
&= P(o_1 \dots o_t \mid x_{t+1} = j)\, P(x_{t+1} = j)\\
&= P(o_1 \dots o_{t-1} \mid x_{t+1} = j)\, P(o_t \mid x_{t+1} = j)\, P(x_{t+1} = j)\\
&= P(o_1 \dots o_{t-1},\ x_{t+1} = j)\, P(o_t \mid x_{t+1} = j)\\
&= \sum_{i=1}^{N} P(o_1 \dots o_{t-1},\ x_t = i,\ x_{t+1} = j)\, P(o_t \mid x_{t+1} = j)\\
&= \sum_{i=1}^{N} P(o_1 \dots o_{t-1},\ x_t = i)\, P(x_{t+1} = j \mid x_t = i)\, P(o_t \mid x_{t+1} = j)\\
&= \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ijo_t}
\end{aligned}
$$
Note: We omit the conditioning on λ from the formulas.
Note: P(A, B) = P(A | B) · P(B)
Fall 2001 EE669: Natural Language Processing 41
The Forward Procedure
• Forward variables are calculated as follows:
– Initialization: αi(1) = πi, 1 ≤ i ≤ N
– Induction: αj(t+1) = Σi=1..N αi(t) aij bijot, 1 ≤ t ≤ T, 1 ≤ j ≤ N
– Total: P(O|λ) = Σi=1..N αi(T+1)
• This algorithm requires 2N²T multiplications (much less than the direct method, which takes on the order of (2T+1)·N^(T+1) multiplications).
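A minimal numpy sketch of the forward procedure for a state-emission HMM. One bookkeeping difference: the slides define αi(t) = P(o1…ot-1, Xt = i | λ), while this sketch folds the emission of ot into α at time t, an equivalent and common convention; the example numbers are invented.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return P(O | lambda) and the alpha trellis for a state-emission HMM.
    pi: (N,) initial probabilities; A: (N, N) transitions; B: (N, M) emissions;
    obs: list of observation indices o_1..o_T."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization
    for t in range(1, T):                          # induction: O(N^2) per step
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                  # total: sum over final states

# Made-up two-state model, observation sequence of symbol indices.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0])[0])
```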
Fall 2001 EE669: Natural Language Processing 42
The Backward Procedure
• We could also compute these probabilities by working backward through time.
• The backward procedure computes backward variables which are the total probability of seeing the rest of the observation sequence given that we were in state si at time t.
• βi(t) = P(ot…oT | Xt = i, λ) is a backward variable.
• Backward variables are useful for the problem of parameter reestimation.
Fall 2001 EE669: Natural Language Processing 43
Backward Procedure
$$\beta_i(t) = P(o_t \dots o_T \mid x_t = i)$$
$$\beta_i(T+1) = 1$$
$$\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ijo_t}\, \beta_j(t+1)$$
Fall 2001 EE669: Natural Language Processing 44
The Backward Procedure
• Backward variables can be calculated working backward through the trellis as follows:
– Initialization: βi(T+1) = 1, 1 ≤ i ≤ N
– Induction: βi(t) = Σj=1..N aij bijot βj(t+1), 1 ≤ t ≤ T, 1 ≤ i ≤ N
– Total: P(O|λ) = Σi=1..N πi βi(1)
• Backward variables can also be combined with forward variables:
P(O|λ) = Σi=1..N αi(t) βi(t), 1 ≤ t ≤ T+1
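A matching numpy sketch of the backward procedure under the same state-emission convention as the forward sketch above (so β here covers the observations after time t, and the total Σi πi bi,o1 βi(1) equals the forward result); example values are invented.

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward procedure for a state-emission HMM (0-indexed over the T observations).
    beta[t, i] = P(o_{t+1} .. o_T | state at time t is i)."""
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                  # initialization at the last step
    for t in range(T - 2, -1, -1):                  # work backward through the trellis
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    total = np.sum(pi * B[:, obs[0]] * beta[0])     # combine with the start distribution
    return total, beta

# Made-up two-state model; the total matches the forward procedure's P(O | lambda).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(backward(pi, A, B, [0, 1, 0])[0])
```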
Fall 2001 EE669: Natural Language Processing 45
The Backward Procedure
• Combining forward and backward variables:
$$
\begin{aligned}
P(O,\ X_t = i \mid \lambda) &= P(o_1 \dots o_T,\ X_t = i \mid \lambda)\\
&= P(o_1 \dots o_{t-1},\ X_t = i,\ o_t \dots o_T \mid \lambda)\\
&= P(o_1 \dots o_{t-1},\ X_t = i \mid \lambda)\, P(o_t \dots o_T \mid o_1 \dots o_{t-1},\ X_t = i,\ \lambda)\\
&= P(o_1 \dots o_{t-1},\ X_t = i \mid \lambda)\, P(o_t \dots o_T \mid X_t = i,\ \lambda)\\
&= \alpha_i(t)\, \beta_i(t)
\end{aligned}
$$
Fall 2001 EE669: Natural Language Processing 46
Probability of an Observation
• Forward Procedure: $$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T+1)$$
• Backward Procedure: $$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$$
• Combination: $$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t), \quad 1 \le t \le T+1$$
Fall 2001 EE669: Natural Language Processing 47
Finding the Best State Sequence
• One method consists of optimizing on the states individually.
• For each t, 1 ≤ t ≤ T+1, we would like to find the X̂t that maximizes P(Xt | O, λ).
• Let γi(t) = P(Xt = i | O, λ) = P(Xt = i, O | λ) / P(O | λ) = αi(t)βi(t) / Σj=1..N αj(t)βj(t)
• The individually most likely state is X̂t = argmax1≤i≤N γi(t), 1 ≤ t ≤ T+1
• This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence.
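A short sketch of this individually-most-likely-state (posterior) decoding, built on the same forward/backward convention as the earlier sketches; model values are invented.

```python
import numpy as np

def posterior_decode(pi, A, B, obs):
    """Pick X_t = argmax_i gamma_i(t), where gamma_i(t) = alpha_i(t) beta_i(t) / P(O)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize by P(O | lambda) at each t
    return gamma.argmax(axis=1), gamma

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(posterior_decode(pi, A, B, [0, 1, 0])[0])
```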
Fall 2001 EE669: Natural Language Processing 48
Best State Sequence
• Find the state sequence that best explains the observation sequence
• Viterbi algorithm:
$$\hat{X} = \arg\max_X P(X \mid O, \lambda)$$
Fall 2001 EE669: Natural Language Processing 49
Finding the Best State Sequence: The Viterbi Algorithm
• The Viterbi algorithm efficiently computes the most likely state sequence.
• To find the most likely complete path, compute: argmaxX P(X | O, λ)
• To do this, it is sufficient to maximize for a fixed O: argmaxX P(X, O | λ)
• We define δj(t) = maxX1…Xt-1 P(X1…Xt-1, o1…ot-1, Xt = j | λ)
• ψj(t) records the node of the incoming arc that led to this most probable path.
Fall 2001 EE669: Natural Language Processing 50
Viterbi Algorithm
$$\delta_j(t) = \max_{x_1 \dots x_{t-1}} P(x_1 \dots x_{t-1},\ o_1 \dots o_{t-1},\ x_t = j \mid \lambda)$$
The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t
Fall 2001 EE669: Natural Language Processing 51
Viterbi Algorithm: Recursive Computation
$$\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ijo_t}$$
$$\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ijo_t}$$
Fall 2001 EE669: Natural Language Processing 52
Finding the Best State Sequence: The Viterbi Algorithm
The Viterbi Algorithm works as follows:
• Initialization: δj(1) = πj, 1 ≤ j ≤ N
• Induction: δj(t+1) = max1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
Store backtrace: ψj(t+1) = argmax1≤i≤N δi(t) aij bijot, 1 ≤ j ≤ N
• Termination and path readout:
X̂T+1 = argmax1≤i≤N δi(T+1)
X̂t = ψX̂t+1(t+1)
P(X̂) = max1≤i≤N δi(T+1)
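A minimal Viterbi sketch in the same state-emission convention as the earlier code (so δ here includes the emission at time t, a slight departure from the slides' arc-emission indexing); example values are invented.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely state sequence and its probability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                   # read the path out backwards
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 1, 0]))
```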
Fall 2001 EE669: Natural Language Processing 53
Viterbi Algorithm
Compute the most likely state sequence by working backwards:
$$\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$$
$$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$$
$$P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$$
Fall 2001 EE669: Natural Language Processing 54
Third Problem: Parameter Estimation
• Given a certain observation sequence, we want to find the values of the model parameters λ = (A, B, Π) which best explain what we observed.
• Using Maximum Likelihood Estimation, find values to maximize P(O|λ), i.e., argmaxλ P(Otraining|λ).
• There is no known analytic method to choose λ to maximize P(O|λ). However, we can locally maximize it by an iterative hill-climbing algorithm known as the Baum-Welch or Forward-Backward algorithm. (This is a special case of the EM Algorithm.)
Fall 2001 EE669: Natural Language Processing 55
Parameter Estimation: The Forward-Backward Algorithm
• We don’t know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model.
• Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most.
• By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.
Fall 2001 EE669: Natural Language Processing 56
Parameter Estimation
• Given an observation sequence, find the model that is most likely to produce that sequence.
• No analytic method.
• Given a model and training sequences, update the model parameters to better fit the observations.
Fall 2001 EE669: Natural Language Processing 57
Definitions
The probability of traversing a certain arc at time t, given the observation sequence O:
$$
p_t(i, j) = P(X_t = i,\ X_{t+1} = j \mid O, \lambda)
= \frac{P(X_t = i,\ X_{t+1} = j,\ O \mid \lambda)}{P(O \mid \lambda)}
= \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}
$$
Let:
$$\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$$
Fall 2001 EE669: Natural Language Processing 58
The Probability of Traversing an Arc
(Figure: traversing the arc from state si to state sj between times t and t+1; the path probability factors as αi(t) · aij bijot · βj(t+1).)
Fall 2001 EE669: Natural Language Processing 59
Definitions
• The expected number of transitions from state i in O:
$$\sum_{t=1}^{T} \gamma_i(t)$$
• The expected number of transitions from state i to j in O:
$$\sum_{t=1}^{T} p_t(i, j)$$
Fall 2001 EE669: Natural Language Processing 60
Reestimation Procedure
• Begin with a model λ, perhaps selected at random.
• Run O through the current model to estimate the expectations of each model parameter.
• Change the model to maximize the values of the paths that are used a lot, while respecting the stochastic constraints.
• Repeat this process (hoping to converge on optimal values for the model parameters λ).
Fall 2001 EE669: Natural Language Processing 61
The Reestimation Formulas
The expected frequency in state i at time t = 1:
$$\hat{\pi}_i = \gamma_i(1)$$
$$
\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to } j}{\text{expected number of transitions from state } i}
= \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}
$$
From (A, B, Π), derive (Â, B̂, Π̂). Note that P(O|λ̂) ≥ P(O|λ).
Fall 2001 EE669: Natural Language Processing 62
Reestimation Formulas
$$
\hat{b}_{ijk} = \frac{\text{expected number of transitions from state } i \text{ to } j \text{ observing symbol } k}{\text{expected number of transitions from state } i \text{ to } j}
= \frac{\sum_{\{t:\ o_t = k,\ 1 \le t \le T\}} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}
$$
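A compact sketch of one Baum-Welch reestimation pass for a state-emission HMM, so it reestimates b̂ik rather than the arc-emission b̂ijk above; it follows the same conventions as the earlier sketches, with invented example values, and is not the original course code.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) update of (pi, A, B) from a single observation sequence."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()

    gamma = alpha * beta / p_obs                    # gamma[t, i] = P(X_t = i | O)
    xi = np.zeros((T - 1, N, N))                    # xi[t, i, j] = P(X_t=i, X_{t+1}=j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_obs

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                     # expected emissions of symbol k
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(baum_welch_step(pi, A, B, [0, 1, 0, 0, 1]))
```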
Fall 2001 EE669: Natural Language Processing 63
Parameter Estimation
Probability of traversing an arc at time t:
$$p_t(i, j) = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$
Probability of being in state i at time t:
$$\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$$
Fall 2001 EE669: Natural Language Processing 64
Parameter Estimation
Now we can compute the new estimates of the model parameters:
$$\hat{\pi}_i = \gamma_i(1)$$
$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$$
$$\hat{b}_{ik} = \frac{\sum_{\{t:\ o_t = k\}} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$$
Fall 2001 EE669: Natural Language Processing 65
HMM Applications
• Parameters for interpolated n-gram models
• Part of speech tagging
• Speech recognition