Pattern Recognition, Chapter 3: Hidden Markov Models (HMMs)



Hidden Markov Models (HMMs)

• Sequential patterns:
  – The order of the data points is irrelevant.
  – No explicit sequencing ...

• Temporal patterns:
  – The result of a time process (e.g., a time series).
  – Can be represented by a number of states.
  – States at time t are influenced directly by states in previous time steps (i.e., correlated).


Hidden Markov Models (HMMs)

• HMMs are appropriate for problems that have an inherent temporality.

  – Speech recognition
  – Gesture recognition
  – Human activity recognition


First-Order Markov Models

• They are represented by a graph where every node corresponds to a state ωi.

• The graph can be fully connected with self-loops.

• Links between nodes ωi and ωj are associated with a transition probability:

      P(ω(t+1)=ωj / ω(t)=ωi) = a_ij

  which is the probability of going to state ωj at time t+1 given that the state at time t was ωi (first-order model).


First-Order Markov Models (cont’d)

• The following constraints should be satisfied:

      Σ_j a_ij = 1, for all i

• Markov models are fully described by their transition probabilities a_ij.
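As a small illustration (not from the slides), the row-sum constraint can be checked numerically; the 2-state matrix below is made up for the example:

    import numpy as np

    # Hypothetical transition matrix a_ij for a 2-state Markov model.
    A = np.array([[0.9, 0.1],
                  [0.3, 0.7]])

    # Each row must be a probability distribution over successor states:
    # a_ij >= 0 and sum_j a_ij = 1 for every state i.
    assert np.all(A >= 0)
    assert np.allclose(A.sum(axis=1), 1.0)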


Example: Weather Prediction Model

• Assume three weather states:
  – ω1: Precipitation (rain, snow, hail, etc.)
  – ω2: Cloudy
  – ω3: Sunny

Transition matrix A = [a_ij], with rows and columns indexed by ω1, ω2, ω3 (numerical values are given in the slide's figure).


Computing P(ω^T) of a sequence of states ω^T

• Given a sequence of states ω^T = (ω(1), ω(2), ..., ω(T)), the probability that the model generated ω^T is equal to the product of the corresponding transition probabilities:

      P(ω^T) = Π_{t=1..T} P(ω(t) / ω(t-1))

  where P(ω(1) / ω(0)) = P(ω(1)) is the prior probability of the first state.


Example: Weather Prediction Model (cont’d)

• What is the probability that the weather for eight consecutive days is

      “sun-sun-sun-rain-rain-sun-cloudy-sun” ?

  ω^8 = ω3 ω3 ω3 ω1 ω1 ω3 ω2 ω3

  P(ω^8) = P(ω3) P(ω3/ω3) P(ω3/ω3) P(ω1/ω3) P(ω1/ω1) P(ω3/ω1) P(ω2/ω3) P(ω3/ω2) = 1.536 x 10^-4
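The arithmetic can be reproduced in a few lines of Python. The transition values used below are the ones from the classic version of this example (Rabiner's tutorial) and are an assumption here, since the matrix itself only appears as a figure on the earlier slide:

    import numpy as np

    # Assumed transition matrix a_ij over (omega_1 = rain, omega_2 = cloudy, omega_3 = sunny);
    # each row sums to 1.
    A = np.array([[0.4, 0.3, 0.3],   # rain   -> rain, cloudy, sunny
                  [0.2, 0.6, 0.2],   # cloudy -> ...
                  [0.1, 0.1, 0.8]])  # sunny  -> ...

    rain, cloudy, sunny = 0, 1, 2
    sequence = [sunny, sunny, sunny, rain, rain, sunny, cloudy, sunny]

    # P(omega^8) = P(omega(1)) * prod_t P(omega(t)/omega(t-1)),
    # with P(omega(1) = sunny) = 1 here (day 1 is given to be sunny).
    p = 1.0
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        p *= A[prev, curr]
    print(p)   # ~1.536e-04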


Limitations of Markov models

• In Markov models, each state is uniquely associated with an observable event.
  – Once an observation is made, the state of the system is trivially retrieved.

• Such models are of limited use in most practical applications.


Hidden States and Observations

• Assume that observations are a probabilistic function of each state.
  – Each state can generate a number of outputs (i.e., observations) according to a unique probability distribution.
  – Each observation can potentially be generated at any state.

• The state sequence is not directly observable.
  – It can be approximated by a sequence of observations.


First-order HMMs

• We augment the model such that when it is in state ω(t) it also emits some symbol v(t) (visible state) among a set of possible symbols.

• We have access to the visible states only, while the ω(t) are unobservable.


Example: Weather Prediction Model (cont’d)

Observations:
  v1: temperature
  v2: humidity
  etc.


First-order HMMs

• For every sequence of (hidden) states, there is an associated sequence of visible states:

      ω^T = (ω(1), ω(2), ..., ω(T))      V^T = (v(1), v(2), ..., v(T))

• When the model is in state ωj at time t, the probability of emitting a visible state vk at that time is denoted as:

      P(v(t)=vk / ω(t)=ωj) = b_jk,   where Σ_k b_jk = 1 for all j

  (observation probabilities)


Absorbing State

• Given a state sequence and its corresponding observation sequence:

      ω^T = (ω(1), ω(2), ..., ω(T))      V^T = (v(1), v(2), ..., v(T))

  we assume that ω(T) = ω0 is some absorbing state, which uniquely emits the symbol v(T) = v0.

• Once entering the absorbing state, the system cannot escape from it.


HMM Formalism

• An HMM is defined by {Ω, V, π, A, B}:
  – Ω = {ω1, ..., ωn} are the possible states
  – V = {v1, ..., vm} are the possible observations
  – π = {πi} are the prior state probabilities
  – A = {aij} are the state transition probabilities
  – B = {bjk} are the observation probabilities
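As a concrete, purely illustrative way to hold these five ingredients in code, one might use a small container like the following; the class name and field names are made up:

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class HMM:
        """Container for the HMM parameters {Omega, V, pi, A, B}."""
        states: list        # Omega = {omega_1, ..., omega_n}
        symbols: list       # V = {v_1, ..., v_m}
        pi: np.ndarray      # prior state probabilities, shape (n,)
        A: np.ndarray       # transition probabilities a_ij, shape (n, n)
        B: np.ndarray       # observation probabilities b_jk, shape (n, m)

        def check(self):
            # pi and every row of A and B must be probability distributions.
            assert np.allclose(self.pi.sum(), 1.0)
            assert np.allclose(self.A.sum(axis=1), 1.0)
            assert np.allclose(self.B.sum(axis=1), 1.0)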


Some Terminology

• Causal: the probabilities depend only upon previous states.

• Ergodic: every state has a non-zero probability of occurring given some starting state.

(Figure: a “left-right” HMM.)


Coin toss example

• You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening.

• On the other side of the barrier is another person who is performing a coin (or multiple coin) toss experiment.

• The other person will tell you only the result of the experiment, not how he obtained that result!

  e.g., V^T = HHTHTTHH...T = v(1), v(2), ..., v(T)


Coin toss example (cont’d)

• Problem: derive an HMM model to explain the observed sequence of heads and tails.
  – The coins represent the states; these are hidden because we do not know which coin was tossed each time.
  – The outcome of each toss represents an observation.
  – A “likely” sequence of coins may be inferred from the observations.
  – As we will see, the state sequence will not be unique in general.


Coin toss example: 1-fair coin model

• There are 2 states, each associated with either heads (state 1) or tails (state 2).

• The observation sequence uniquely defines the states (the model is not hidden).


Coin toss example: 2-fair coins model

• There are 2 states, but neither state is uniquely associated with heads or tails (i.e., each state can be associated with a different fair coin).

• A third coin is used to decide which of the fair coins to flip.


Coin toss example: 2-biased coins model

• There are 2 states, each associated with a biased coin.

• A third coin is used to decide which of the biased coins to flip.


Coin toss example: 3-biased coins model

• There are 3 states, each associated with a biased coin.

• We decide which coin to flip in some way (e.g., using other coins).


Which model is best?

• Since the states are not observable, the best we can do is select the model that best explains the data.

• Long observation sequences are preferable for selecting the best model ...


Classification Using HMMs

• Given an observation sequence V^T and a set of possible models, choose the model with the highest probability.

Bayes formula:

      P(θ / V^T) = P(V^T / θ) P(θ) / P(V^T)

where θ denotes a candidate model.
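A minimal sketch of this classification rule, assuming the likelihoods P(V^T / θj) for each candidate model have already been computed (e.g., with the forward algorithm described later); the numbers below are made up:

    import numpy as np

    # Hypothetical per-model likelihoods P(V^T / theta_j) and priors P(theta_j).
    likelihoods = np.array([1.2e-5, 4.7e-6, 3.1e-5])   # one entry per candidate HMM
    priors      = np.array([1/3, 1/3, 1/3])

    # Bayes: P(theta_j / V^T) is proportional to P(V^T / theta_j) * P(theta_j);
    # the denominator P(V^T) is the same for every model, so it can be ignored.
    best_model = int(np.argmax(likelihoods * priors))
    print("choose model", best_model)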


Main Problems in HMMs

• Evaluation
  – Determine the probability P(V^T) that a particular sequence of visible states V^T was generated by a given model (based on dynamic programming).

• Decoding
  – Given a sequence of visible states V^T, determine the most likely sequence of hidden states ω^T that led to those observations (based on dynamic programming).

• Learning
  – Given a set of visible observations, determine a_ij and b_jk (based on the EM algorithm).


Evaluation

• The probability of a particular sequence of hidden states ω_r^T, where r = 1, ..., r_max indexes the possible state sequences (i.e., r_max is the possible # of state sequences), is:

      P(ω_r^T) = Π_{t=1..T} P(ω_r(t) / ω_r(t-1))


Evaluation (cont’d)

      P(V^T / ω_r^T) = Π_{t=1..T} P(v(t) / ω_r(t))

      P(V^T) = Σ_{r=1..r_max} Π_{t=1..T} P(v(t) / ω_r(t)) P(ω_r(t) / ω_r(t-1))

(enumerate all possible transitions to determine how good the model is)


Example: Evaluation

(enumerate all possible transitions to determine how good the model is)


Computational Complexity

• Direct enumeration evaluates on the order of c^T state sequences of length T (c = number of states), which quickly becomes intractable; the forward algorithm below computes P(V^T) in O(c^2 T).


Recursive computation of P(V^T) (HMM Forward)

(Trellis: the hidden states ω(1), ..., ω(t), ω(t+1), ..., ω(T) unfold over time, emitting v(1), ..., v(t), v(t+1), ..., v(T); each state ωi at time t is linked to every state ωj at time t+1.)


Recursive computation of P(V^T) (HMM Forward) (cont’d)

Using marginalization:

      α_j(t+1) = P(v(1), v(2), ..., v(t), v(t+1), ω(t+1)=ωj)
               = Σ_i P(v(1), ..., v(t+1), ω(t)=ωi, ω(t+1)=ωj)
               = [ Σ_i α_i(t) a_ij ] b_j,v(t+1)


Recursive computation of P(V^T) (HMM Forward) (cont’d)

      P(V^T) = α_0(T)


Recursive computation of P(V^T) (HMM Forward) (cont’d)

• At each time step t = 1, ..., T, compute α_j(t) for j = 0 to c.

• Return P(V^T) = α_0(T)   (i.e., α at the absorbing state ω0 = ω(T)).
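A minimal NumPy sketch of this forward recursion. It uses the generic formulation with a prior over the initial state rather than the slides' absorbing-state convention, and the tiny 2-state model is made up, so read it as an illustration rather than the slides' exact algorithm:

    import numpy as np

    def forward(A, B, pi, obs):
        """Forward pass: alpha[t, j] is the joint probability of the first t+1
        observations and being in state j at that time; returns (alpha, P(V^T))."""
        n = A.shape[0]
        T = len(obs)
        alpha = np.zeros((T, n))
        alpha[0] = pi * B[:, obs[0]]                      # initialization
        for t in range(1, T):
            # alpha_j(t) = b_{j, v(t)} * sum_i alpha_i(t-1) * a_ij
            alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
        return alpha, alpha[-1].sum()                     # P(V^T) = sum_j alpha_j(T)

    # Tiny made-up model: 2 hidden states, 2 visible symbols.
    A  = np.array([[0.7, 0.3], [0.4, 0.6]])
    B  = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.5, 0.5])
    alpha, p = forward(A, B, pi, [0, 1, 0])
    print(p)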


Example

(The example's transition probabilities a_ij and observation probabilities b_jk are given as matrices over the states ω0, ω1, ω2, ω3 in the slide's figure.)


Example (cont’d)

• Similarly for t = 2, 3, 4.

• Finally:

      P(V^T) = α_0(T) ≈ 0.0011   (0.00108)


The backward algorithm (HMM backward)

(Trellis, traversed backwards in time: β_i(t) is attached to state ωi at time t, and β_j(t+1) = P(v(t+2), ..., v(T) / ω(t+1)=ωj) to state ωj at time t+1.)


The backward algorithm (HMM backward) (cont’d)

      β_i(t) = P(v(t+1), v(t+2), ..., v(T) / ω(t)=ωi)
             = Σ_{j=1..c} P(v(t+1), v(t+2), ..., v(T), ω(t+1)=ωj / ω(t)=ωi)
             = Σ_{j=1..c} P(v(t+2), ..., v(T) / ω(t+1)=ωj) P(v(t+1) / ω(t+1)=ωj) P(ω(t+1)=ωj / ω(t)=ωi)

or

      β_i(t) = Σ_{j=1..c} β_j(t+1) a_ij b_j,v(t+1)


The backward algorithm (HMM backward) (cont’d)

      β_i(t) = Σ_{j=1..c} β_j(t+1) a_ij b_j,v(t+1)
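A matching sketch of the backward recursion, under the same generic (non-absorbing-state) convention and the same made-up model as the forward sketch:

    import numpy as np

    def backward(A, B, obs):
        """beta[t, i] = probability of the observations after time index t,
        given state i at that time."""
        n = A.shape[0]
        T = len(obs)
        beta = np.zeros((T, n))
        beta[T - 1] = 1.0                                  # nothing left to emit after time T
        for t in range(T - 2, -1, -1):
            # beta_i(t) = sum_j a_ij * b_{j, v(t+1)} * beta_j(t+1)
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return beta

    A  = np.array([[0.7, 0.3], [0.4, 0.6]])
    B  = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.5, 0.5])
    obs = [0, 1, 0]
    beta = backward(A, B, obs)
    # Sanity check: sum_i pi_i * b_{i, v(1)} * beta_i(1) equals P(V^T) from the forward pass.
    print((pi * B[:, obs[0]] * beta[0]).sum())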


Decoding

• We need to use an optimality criterion to solve this problem (there are several possible ways of solving it, since there are various optimality criteria we could use).

• Algorithm 1: choose the states ω(t) that are individually most likely (i.e., maximize the expected number of correct individual states).


Decoding – Algorithm 2

• Algorithm 2: at each time step t, find the state that has the highest probability α_i(t).

• Uses the forward algorithm with minor changes.
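A sketch of this greedy rule (Algorithm 2), repeating the forward recursion inline so the snippet stays self-contained; as the next slides point out, simply taking argmax_i α_i(t) at every step does not guarantee a valid path:

    import numpy as np

    def decode_greedy(A, B, pi, obs):
        """Algorithm 2: pick, at every time step, the state with the largest alpha_i(t)."""
        n = A.shape[0]
        T = len(obs)
        alpha = np.zeros((T, n))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
        return alpha.argmax(axis=1)        # one state index per time step

    A  = np.array([[0.7, 0.3], [0.4, 0.6]])
    B  = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.5, 0.5])
    print(decode_greedy(A, B, pi, [0, 1, 0]))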


Decoding – Algorithm 2 (cont’d)

• There is no guarantee that the path is a valid one.

• The path might imply a transition that is not allowed by the model.

  Example (figure): the selected path includes a transition with a_32 = 0, which is not allowed!


Decoding – Algorithm 3
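A standard way to fix the invalid-path problem above is the Viterbi dynamic program, which keeps, for every state and time step, the probability of the best valid path ending there and then backtracks from the end. The sketch below follows that standard formulation and is an illustration, not necessarily the exact variant the slides call Algorithm 3:

    import numpy as np

    def viterbi(A, B, pi, obs):
        """Most likely valid state path for the observation sequence (standard Viterbi)."""
        n = A.shape[0]
        T = len(obs)
        delta = np.zeros((T, n))               # best path probability ending in state j at time t
        psi   = np.zeros((T, n), dtype=int)    # back-pointers (argmax of the previous state)
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_i(t-1) * a_ij
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[:, obs[t]]
        # Backtrack from the best final state.
        path = [int(delta[T - 1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t][path[-1]]))
        return path[::-1]

    A  = np.array([[0.7, 0.3], [0.4, 0.6]])
    B  = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.5, 0.5])
    print(viterbi(A, B, pi, [0, 1, 0]))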



Learning

• Use EM:
  – Update the weights iteratively to better explain the observed training sequences:

        max P(V_1^T, V_2^T, ..., V_n^T / θ)

    where the maximization is over the model parameters θ (i.e., the a_ij and b_jk).


Learning (cont’d)

• Idea:

      â_ij = E[# times it goes from ωi to ωj] / E[# times it goes from ωi to any other state]

      b̂_jk = E[# times it emits symbol vk while at state ωj] / E[# times it emits any other symbol while at state ωj]


Learning (cont’d)

• Define the probability of transitioning from ωi to ωj at step t, given V^T:

      γ_ij(t) = P(ω(t+1)=ωj, ω(t)=ωi / V^T) = α_i(t) a_ij b_jk β_j(t+1) / P(V^T)

  where b_jk is the probability of emitting the symbol v(t+1) = vk observed at time t+1.

(expectation step)
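A sketch of this expectation step, reusing the forward and backward sketches from earlier (so the array names and the generic, non-absorbing-state convention are assumptions):

    import numpy as np

    def expected_transitions(A, B, pi, obs, alpha, beta):
        """gamma[t, i, j]: probability of being in state i at step t and state j at
        step t+1, given the whole observation sequence (0-based t)."""
        T, n = alpha.shape
        p_obs = alpha[-1].sum()                                # P(V^T)
        gamma = np.zeros((T - 1, n, n))
        for t in range(T - 1):
            # alpha_i(t) * a_ij * b_{j, v(t+1)} * beta_j(t+1) / P(V^T)
            gamma[t] = (alpha[t][:, None] * A
                        * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]) / p_obs
        # Each gamma[t] sums to 1 over all (i, j) pairs, a useful sanity check.
        return gamma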



Learning (cont’d)

      â_ij = E[# times it goes from ωi to ωj] / E[# times it goes from ωi to any other state]
           = Σ_{t=1..T} γ_ij(t) / Σ_{t=1..T} Σ_k γ_ik(t)

(maximization step)


Learning (cont’d)

      b̂_jk = E[# times it emits symbol vk while at state ωj] / E[# times it emits any other symbol while at state ωj]
            = Σ_{t=1..T, v(t)=vk} Σ_l γ_jl(t) / Σ_{t=1..T} Σ_l γ_jl(t)

(maximization step)
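Putting the expectation step and the two re-estimation formulas together, here is a compact illustrative EM update, again reusing the earlier forward, backward, and expected_transitions sketches; a practical implementation would add log-space arithmetic, convergence checks, and support for multiple training sequences:

    import numpy as np

    def baum_welch_step(A, B, pi, obs):
        """One EM iteration: E-step (gamma) followed by re-estimation of A and B."""
        alpha, _ = forward(A, B, pi, obs)                 # forward sketch from earlier
        beta = backward(A, B, obs)                        # backward sketch from earlier
        gamma = expected_transitions(A, B, pi, obs, alpha, beta)

        # a_ij_hat = sum_t gamma_ij(t) / sum_t sum_k gamma_ik(t)
        A_new = gamma.sum(axis=0)
        A_new /= A_new.sum(axis=1, keepdims=True)

        # sum_l gamma_jl(t) is the probability of being in state j at time t,
        # so b_jk_hat accumulates that occupancy only at times where v(t) = v_k.
        occupancy = gamma.sum(axis=2)                     # shape (T-1, n)
        B_new = np.zeros_like(B)
        for t in range(len(obs) - 1):                     # final time step omitted in this sketch
            B_new[:, obs[t]] += occupancy[t]
        B_new /= B_new.sum(axis=1, keepdims=True)
        return A_new, B_new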


Difficulties

• How do we decide on the number of states and the structure of the model?
  – Use domain knowledge; otherwise it is a very hard problem!

• What about the size of the observation sequence?
  – It should be sufficiently long to guarantee that all state transitions will appear a sufficient number of times.
  – A large amount of training data is necessary to learn the HMM parameters.