
Hmm Revisited


Page 1: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 1/108

PATTERN RECOGNITION

Markov models

Vu [email protected]

Department of Computer Science

March 28 th , 2011

06/04/2011 1Markov models

Page 2: Hmm Revisited


Contents

• Introduction
  – Motivation
• Markov Chain
• Hidden Markov Models
• Markov Random Field

Page 3: Hmm Revisited


Introduction
• Markov processes were first proposed by the Russian mathematician Andrei Markov.
  – He used these processes to investigate the letter sequences in Pushkin's verse novel Eugene Onegin.
• Nowadays, the Markov property and HMMs are widely used in many domains:
  – Natural Language Processing
  – Speech Recognition
  – Bioinformatics
  – Image/video processing
  – ...

Page 4: Hmm Revisited


Motivation [0]
• As shown in his 1906 paper, Markov's original motivation was purely mathematical:
  – applying the Weak Law of Large Numbers to dependent random variables.
• However, we shall not follow this motivation...

Page 5: Hmm Revisited


Motivation [1]
• From the viewpoint of classification:

  Context-free classification: Bayes classifier
  p(ωi | x) > p(ωj | x)  ∀ j ≠ i

Page 6: Hmm Revisited


Motivation [1]
• In context-free classification with the Bayes classifier:
  – Classes are independent.
  – Feature vectors are independent.

Page 7: Hmm Revisited


Motivation [1]
• However, there are some applications where the various classes are closely related:
  – POS tagging, tracking, gene boundary recovery, ...

  s1 → s2 → s3 → ... → sm

Page 8: Hmm Revisited


Motivation [1]
• Context-dependent classification:
  – s1, s2, ..., sm: a sequence of m feature vectors
  – ω1, ω2, ..., ωm: the classes to which these vectors are assigned, with each ωi ∈ {1, ..., k}

  s1 → s2 → s3 → ... → sm

Page 9: Hmm Revisited


Motivation [1]
• To apply the Bayes classifier:
  – X = s1s2...sm: the extended feature vector
  – Ωi = (ωi1, ωi2, ..., ωim): one possible classification; there are k^m possible classifications

  p(Ωi | X) > p(Ωj | X)  ∀ j ≠ i
  ⇔ p(X | Ωi) p(Ωi) > p(X | Ωj) p(Ωj)  ∀ j ≠ i


Page 11: Hmm Revisited


Motivation [2]
• From a general viewpoint, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables.

Page 12: Hmm Revisited


Motivation [2]
• Example (a Vietnamese folk couplet: "Today is the eighth of March / The women keep walking in and out..."):

  Hôm nay mùng tám tháng ba
  Chị em phụ nữ đi ra đi vào...

  Hôm → nay → mùng → ... → vào
  q1     q2     q3           qm

Page 13: Hmm Revisited


Motivation [2]
• What is p(Hôm nay ... vào) = p(q1=Hôm, q2=nay, ..., qm=vào)?

Page 14: Hmm Revisited


Motivation [2]
• The conditional probabilities follow from the joint distribution:

  p(sm | s1s2...sm-1) = p(s1s2...sm-1sm) / p(s1s2...sm-1)

Page 15: Hmm Revisited


Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field

Page 16: Hmm Revisited


Markov Chain
• Has N states, called s1, s2, ..., sN
• There are discrete timesteps, t=0, t=1, ...
• On the t'th timestep the system is in exactly one of the available states. Call it qt, where qt ∈ {s1, s2, ..., sN}.

Current state in the three-state example: N = 3, t = 0, qt = q0 = s3.

Page 17: Hmm Revisited


Markov Chain
• Between each timestep, the next state is chosen randomly.

Current state in the example: N = 3, t = 1, qt = q1 = s2.

Page 18: Hmm Revisited


Markov Chain
• The current state determines the probability distribution for the next state.

Example: N = 3, t = 1, qt = q1 = s2, with

  p(qt+1=s1 | qt=s1) = 0      p(qt+1=s2 | qt=s1) = 0      p(qt+1=s3 | qt=s1) = 1
  p(qt+1=s1 | qt=s2) = 1/2    p(qt+1=s2 | qt=s2) = 1/2    p(qt+1=s3 | qt=s2) = 0
  p(qt+1=s1 | qt=s3) = 1/3    p(qt+1=s2 | qt=s3) = 2/3    p(qt+1=s3 | qt=s3) = 0

Page 19: Hmm Revisited


Markov Chain
• The transition probabilities are often notated with arcs between states (here: s1→s3: 1, s2→s1: 1/2, s2→s2: 1/2, s3→s1: 1/3, s3→s2: 2/3).

Page 20: Hmm Revisited


Markov Property
• qt+1 is conditionally independent of qt-1, qt-2, ..., q0 given qt.
• In other words:

  p(qt+1 = sj | qt, qt-1, ..., q0) = p(qt+1 = sj | qt)

Page 21: Hmm Revisited


Markov Property
• The state at timestep t+1 depends only on the state at timestep t.

Page 22: Hmm Revisited


Markov Property
• A Markov chain of order m (m finite): the state at timestep t+1 depends on the past m states:

  p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt, qt-1, ..., qt-m+1)

Page 23: Hmm Revisited


Markov Property
• How to represent the joint distribution of (q0, q1, q2, ...) using graphical models?

Page 24: Hmm Revisited


Markov Property
• As a graphical model, the chain has one arc per dependency:

  q0 → q1 → q2 → q3 → ...

Page 25: Hmm Revisited


Markov chain
• So, the chain of qt is called a Markov chain:

  q0 → q1 → q2 → q3 → ...

Page 26: Hmm Revisited


Markov chain
• Each qt takes a value from the countable state space {s1, s2, s3, ...}
• Each qt is observed at a discrete timestep t

Page 27: Hmm Revisited


Markov chain
• The transition from qt to qt+1 is given by the transition probability matrix (row = current state, column = next state):

  Transition probabilities:
        s1    s2    s3
  s1    0     0     1
  s2    1/2   1/2   0
  s3    1/3   2/3   0


Page 29: Hmm Revisited


Markov Chain – Important property
• In a Markov chain, the joint distribution is

  p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | qj-1)

Page 30: Hmm Revisited

Markov Chain – Important property
• In a Markov chain, the joint distribution is

  p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | qj-1)

• Why?

  p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | previous states)
                     = p(q0) ∏_{j=1}^{m} p(qj | qj-1)    (due to the Markov property)

Page 31: Hmm Revisited

Markov Chain: e.g.
• The state space of weather: {rain, cloud, wind}

Page 32: Hmm Revisited

Markov Chain: e.g.
• The state space of weather: {rain, cloud, wind}, with transition probabilities:

         Rain   Cloud   Wind
  Rain   1/2    0       1/2
  Cloud  1/3    0       2/3
  Wind   0      1       0

Page 33: Hmm Revisited

Markov Chain: e.g.
• Markov assumption: the weather on the (t+1)'th day depends only on the t'th day.

Page 34: Hmm Revisited

Markov Chain: e.g.
• We have observed the weather in a week:

  Day:      0     1     2      3     4
  Weather:  rain  wind  cloud  rain  wind
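With the Markov assumption, the probability of this observed sequence is just a product of entries of the transition matrix (times the probability of the first state; the slides give no initial distribution, so the sketch below conditions on day 0):

```python
# Probability of the observed weather sequence under the Markov chain,
# conditioned on the first day's state.
T = {  # transition matrix from the slides: row = today, column = tomorrow
    "rain":  {"rain": 0.5, "cloud": 0.0, "wind": 0.5},
    "cloud": {"rain": 1/3, "cloud": 0.0, "wind": 2/3},
    "wind":  {"rain": 0.0, "cloud": 1.0, "wind": 0.0},
}

def sequence_prob_given_start(seq, T):
    # p(q1, ..., qm | q0) = prod_j p(qj | qj-1), by the Markov property.
    p = 1.0
    for today, tomorrow in zip(seq, seq[1:]):
        p *= T[today][tomorrow]
    return p

week = ["rain", "wind", "cloud", "rain", "wind"]
print(sequence_prob_given_start(week, T))  # 1/2 * 1 * 1/3 * 1/2 = 1/12
```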


Page 36: Hmm Revisited


Contents
• Introduction
• Markov Chain
• Hidden Markov Models
  – Independence assumptions
  – Formal definition
  – Forward algorithm
  – Viterbi algorithm
  – Baum-Welch algorithm
• Markov Random Field

Page 37: Hmm Revisited

Modeling pairs of sequences
• In many applications, we have to model pairs of sequences
• Examples:
  – POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb, ...)
  – Speech recognition (map acoustic sequences to sequences of words)
  – Computational biology (recover gene boundaries in DNA sequences)
  – Video tracking (estimate the underlying model states from the observation sequences)
  – And many others...

Page 38: Hmm Revisited

Probabilistic models for sequence pairs

• We have two sequences of random variables:

X1, X2, ..., Xm and S 1, S2, ..., S m

• Intuitively, in a pratical system, each X i corresponds to an observation

and each S i corresponds to a state that generated the observation.

• Let each S i be in 1, 2, ..., k and each X i be in 1, 2, ..., o• How do we model the joint distribution:

06/04/2011 Markov models 38

( )1 1 1 1,..., , ,...,m m m m p X x X x S s S s= = = =

Page 39: Hmm Revisited

Hidden Markov Models (HMMs)
• In HMMs, we assume that

  p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
    = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1) ∏_{j=1}^{m} p(Xj = xj | Sj = sj)

• This factorization follows from what are often called the independence assumptions in HMMs
• We are going to prove it in the next slides

Page 40: Hmm Revisited

Independence Assumptions in HMMs [1]
• By the chain rule, the following equality is exact:

  p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
    = p(S1 = s1, ..., Sm = sm) × p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)

  (compare: p(A, B, C) = p(A | B, C) p(B, C) = p(A | B, C) p(B | C) p(C))

• Assumption 1: the state sequence forms a Markov chain

  p(S1 = s1, ..., Sm = sm) = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1)

Page 41: Hmm Revisited

Independence Assumptions in HMMs [2]
• By the chain rule, the following equality is exact:

  p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)
    = ∏_{j=1}^{m} p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj-1 = xj-1)

• Assumption 2: each observation depends only on the underlying state:

  p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj-1 = xj-1) = p(Xj = xj | Sj = sj)

• These two assumptions are often called the independence assumptions in HMMs

Page 42: Hmm Revisited

The Model form for HMMs
• The model takes the following form:

  p(x1, ..., xm, s1, ..., sm; θ) = π(s1) ∏_{j=2}^{m} t(sj | sj-1) ∏_{j=1}^{m} e(xj | sj)

• Parameters in the model:
  – Initial probabilities π(s) for s ∈ {1, 2, ..., k}
  – Transition probabilities t(s' | s) for s, s' ∈ {1, 2, ..., k}
  – Emission probabilities e(x | s) for s ∈ {1, 2, ..., k} and x ∈ {1, 2, ..., o}
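This model form can be evaluated directly once π, t and e are fixed. A minimal sketch, using the example HMM that appears a few slides later (its T, E and π tables); the dictionary encoding of states and events is just for illustration:

```python
# Joint probability p(x1..xm, s1..sm) = pi(s1) * prod t(sj|sj-1) * prod e(xj|sj)
PI = {"s1": 0.3, "s2": 0.3, "s3": 0.4}
T = {"s1": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
     "s2": {"s1": 0.4, "s2": 0.0, "s3": 0.6},
     "s3": {"s1": 0.2, "s2": 0.8, "s3": 0.0}}
E = {"s1": {"x1": 0.3, "x2": 0.0, "x3": 0.7},
     "s2": {"x1": 0.0, "x2": 0.1, "x3": 0.9},
     "s3": {"x1": 0.2, "x2": 0.0, "x3": 0.8}}

def joint_prob(states, observations):
    p = PI[states[0]]                         # initial probability pi(s1)
    for prev, cur in zip(states, states[1:]):
        p *= T[prev][cur]                     # transition factors t(sj | sj-1)
    for s, x in zip(states, observations):
        p *= E[s][x]                          # emission factors e(xj | sj)
    return p

# Path S3, S1, S1 with observations X3, X1, X3:
# 0.4 * 0.2 * 0.5  *  0.8 * 0.3 * 0.7 = 0.00672
print(joint_prob(["s3", "s1", "s1"], ["x3", "x1", "x3"]))
```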

Page 43: Hmm Revisited

6 components of HMMs
• Discrete timesteps: 1, 2, ...
• Finite state space: si (N states)
• Events xi (M events)
• Vector of initial probabilities: Π = {πi}, πi = p(q1 = si)
• Matrix of transition probabilities: T = {Tij}, Tij = p(qt+1 = sj | qt = si)
• Matrix of emission probabilities: E = {Eij}, Eij = p(ot = xj | qt = si)

The observations at the discrete timesteps form an observation sequence o1, o2, ..., ot, where oi ∈ {x1, x2, ..., xM}.

Page 44: Hmm Revisited

6 components of HMMs
• Constraints:

  Σ_{i=1}^{N} πi = 1;   Σ_{j=1}^{N} Tij = 1 for each i;   Σ_{j=1}^{M} Eij = 1 for each i

Page 45: Hmm Revisited

6 components of HMMs
• Given a specific HMM and an observation sequence, the corresponding sequence of states is generally not deterministic
• Example: given the observation sequence x1, x3, x3, x2, the corresponding states can be any of the following sequences:
  – s1, s2, s1, s2
  – s1, s2, s3, s2
  – s1, s1, s1, s2
  – ...

Here’s an HMM

Page 46: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 46/108

Here's an HMM

Transition matrix T:
       s1    s2    s3
  s1   0.5   0.5   0
  s2   0.4   0     0.6
  s3   0.2   0.8   0

Emission matrix E:
       x1    x2    x3
  s1   0.3   0     0.7
  s2   0     0.1   0.9
  s3   0.2   0     0.8

Initial probabilities π:
  s1: 0.3   s2: 0.3   s3: 0.4

Here’s a HMM

Page 47: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 47/108

Here's a HMM
• Start randomly in state 1, 2 or 3.
• Choose an output at each state at random.
• Let's generate a sequence of observations:

  q1 = ?   o1 = ?    ← choose q1 among S1, S2, S3 with probabilities 0.3, 0.3, 0.4
  q2 = ?   o2 = ?
  q3 = ?   o3 = ?

Here’s a HMM

Page 48: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 48/108

Here's a HMM
• We drew q1 = S3. Now choose o1: from S3, emit X1 with probability 0.2 or X3 with probability 0.8.

  q1 = S3   o1 = ?
  q2 = ?    o2 = ?
  q3 = ?    o3 = ?


Here’s a HMM

Page 50: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 50/108

Here's a HMM
• We got o1 = X3 and moved to q2 = S1. Now choose o2: from S1, emit X1 with probability 0.3 or X3 with probability 0.7.

  q1 = S3   o1 = X3
  q2 = S1   o2 = ?
  q3 = ?    o3 = ?

Here’s a HMM

Page 51: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 51/108

Here's a HMM
• We got o2 = X1. From S1, go to S2 with probability 0.5 or S1 with probability 0.5.

  q1 = S3   o1 = X3
  q2 = S1   o2 = X1
  q3 = ?    o3 = ?

Here’s a HMM

Page 52: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 52/108

Here's a HMM
• We moved to q3 = S1. Now choose o3: X1 with probability 0.3 or X3 with probability 0.7.

  q1 = S3   o1 = X3
  q2 = S1   o2 = X1
  q3 = S1   o3 = ?

Here’s a HMM

Page 53: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 53/108

Here's a HMM
• We got a sequence of states and corresponding observations!

  q1 = S3   o1 = X3
  q2 = S1   o2 = X1
  q3 = S1   o3 = X3
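The generation procedure just walked through (draw a start state from π, emit an output from that state's emission row, move according to its transition row, repeat) can be sketched as:

```python
import random

PI = {"s1": 0.3, "s2": 0.3, "s3": 0.4}
T = {"s1": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
     "s2": {"s1": 0.4, "s2": 0.0, "s3": 0.6},
     "s3": {"s1": 0.2, "s2": 0.8, "s3": 0.0}}
E = {"s1": {"x1": 0.3, "x2": 0.0, "x3": 0.7},
     "s2": {"x1": 0.0, "x2": 0.1, "x3": 0.9},
     "s3": {"x1": 0.2, "x2": 0.0, "x3": 0.8}}

def draw(dist):
    # One weighted draw from a {outcome: probability} dict.
    outcomes = list(dist)
    return random.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def generate(n):
    states, observations = [], []
    state = draw(PI)                         # start randomly in state 1, 2 or 3
    for _ in range(n):
        states.append(state)
        observations.append(draw(E[state]))  # choose an output at this state
        state = draw(T[state])               # move to the next state
    return states, observations

states, observations = generate(3)
print(states, observations)
```

Running it repeatedly produces different state/observation pairs, which is exactly why the state sequence behind an observation sequence is generally not deterministic.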

Page 54: Hmm Revisited

Three famous HMM tasks
• Given an HMM Φ = (T, E, π), the three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
  – Given: Φ, observation O = o1, o2, ..., ot
  – Goal: p(O | Φ), or equivalently p(qt = si | O)
• Most likely explanation (inference)
  – Given: Φ, the observation O = o1, o2, ..., ot
  – Goal: Q* = argmax_Q p(Q | O)
• Learning the HMM
  – Given: observation O = o1, o2, ..., ot and the corresponding state sequence
  – Goal: estimate the parameters of the HMM Φ = (T, E, π)

Page 55: Hmm Revisited

Three famous HMM tasks
• State estimation: calculating the probability of observing the sequence O, summed over all possible state sequences.

Page 56: Hmm Revisited

Three famous HMM tasks
• Inference: calculating the best corresponding state sequence, given an observation sequence.

Page 57: Hmm Revisited

Three famous HMM tasks
• Learning: given one (or a set of) observation sequence(s) and corresponding state sequence(s), estimate the Transition matrix, Emission matrix and initial probabilities of the HMM.

Page 58: Hmm Revisited

  Problem                                          Algorithm           Complexity
  State estimation: p(O | Φ)                       Forward             O(TN²)
  Inference: Q* = argmax_Q p(Q | O)                Viterbi decoding    O(TN²)
  Learning: Φ* = argmax_Φ p(O | Φ)                 Baum-Welch (EM)     O(TN²)

  T: number of timesteps; N: number of states

Page 59: Hmm Revisited

State estimation problem
• Given: Φ = (T, E, π), observation O = o1, o2, ..., ot
• Goal: what is p(o1o2...ot)?
• We can do this in a slow, stupid way
  – As shown in the next slide...

Page 61: Hmm Revisited

Here's a HMM
• What is p(O) = p(o1o2o3) = p(o1=X3 ∧ o2=X1 ∧ o3=X3)?
• Slow, stupid way:

  p(O) = Σ_{Q ∈ paths of length 3} p(O ∧ Q) = Σ_{Q ∈ paths of length 3} p(O | Q) p(Q)

• How to compute p(Q) for an arbitrary path Q?
• How to compute p(O|Q) for an arbitrary path Q?

  p(Q) = p(q1q2q3) = p(q1) p(q2|q1) p(q3|q2,q1)   (chain rule)
       = p(q1) p(q2|q1) p(q3|q2)                  (Markov property)

  Example in the case Q = S3S1S1:
  p(Q) = 0.4 * 0.2 * 0.5 = 0.04

Here’s a HMM0 2

Page 62: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 62/108

Here's a HMM
• And p(O|Q):

  p(O|Q) = p(o1o2o3 | q1q2q3) = p(o1|q1) p(o2|q2) p(o3|q3)   (independence assumptions)

  Example in the case Q = S3S1S1:
  p(O|Q) = p(X3|S3) p(X1|S1) p(X3|S1) = 0.8 * 0.3 * 0.7 = 0.168

Here’s a HMM0 2

Page 63: Hmm Revisited

8/6/2019 Hmm Revisited

http://slidepdf.com/reader/full/hmm-revisited 63/108

Here's a HMM
• Computed this way, p(O) needs 27 p(Q) computations and 27 p(O|Q) computations.
• What if the sequence had 20 observations?
• So let's be smarter...

Page 64: Hmm Revisited

The Forward algorithm
• Given observations o1o2...oT
• Forward probabilities:

  αt(i) = p(o1o2...ot ∧ qt = si | Φ), where 1 ≤ t ≤ T

• αt(i) = probability that, in a random trial:
  – We'd have seen the first t observations
  – We'd have ended up in si as the t'th state visited.
• In our example, what is α2(3)?

αααα t(i): easy to define recursively


α_t(i) = p(o1 o2 ... ot ∧ qt = si | Φ)

Recall the model parameters:
  π_i = p(q1 = si)
  T_ij = p(q_{t+1} = sj | qt = si)
  E_i(x) = p(ot = x | qt = si)

Base case:
  α_1(i) = p(o1 ∧ q1 = si)
         = p(q1 = si) p(o1 | q1 = si)
         = π_i E_i(o1)

Recursive case:
  α_{t+1}(j) = p(o1 o2 ... o_{t+1} ∧ q_{t+1} = sj)
             = Σ_{i=1}^{N} p(o1 ... ot ∧ qt = si ∧ o_{t+1} ∧ q_{t+1} = sj)
             = Σ_{i=1}^{N} p(o_{t+1} | q_{t+1} = sj) p(q_{t+1} = sj | qt = si) p(o1 ... ot ∧ qt = si)
             = E_j(o_{t+1}) Σ_{i=1}^{N} T_ij α_t(i)

In our example


[Figure: the example HMM with π = (0.3, 0.3, 0.4)]

  α_t(i) = p(o1 o2 ... ot ∧ qt = si | Φ)
  α_1(i) = π_i E_i(o1)
  α_{t+1}(j) = E_j(o_{t+1}) Σ_i T_ij α_t(i)

We observed: x1 x2

  α_1(1) = 0.3 * 0.3 = 0.09
  α_1(2) = 0
  α_1(3) = 0.2 * 0.4 = 0.08

  α_2(1) = 0 * (0.09*0.5 + 0*0.4 + 0.08*0.2) = 0
  α_2(2) = 0.1 * (0.09*0.5 + 0*0 + 0.08*0.8) = 0.0109
  α_2(3) = 0 * (0.09*0 + 0*0.6 + 0.08*0) = 0
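The base and recursive cases translate line-for-line into code. A sketch, reusing the example HMM whose numbers are read off the slides (an assumption of the sketch):

```python
# Forward algorithm: alpha[t][i] = p(o1..ot AND q_t = s_i).
# Example HMM numbers read off the slides' worked examples (assumed).
PI = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N = 3

def forward(O):
    # Base case: alpha_1(i) = pi_i * E_i(o1)
    alpha = [[PI[i] * E[i][O[0]] for i in range(N)]]
    # Recursive case: alpha_{t+1}(j) = E_j(o_{t+1}) * sum_i alpha_t(i) * T_ij
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([E[j][o] * sum(prev[i] * T[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

alpha = forward([0, 1])   # observed x1 x2
print(alpha[0])           # ≈ [0.09, 0.0, 0.08]  (matches the slide)
print(alpha[1])           # ≈ [0.0, 0.0109, 0.0] (matches the slide)
```

Each time step costs O(N²) work, so the whole pass is O(N²T) instead of O(N^T).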

Forward probabilities - Trellis

[Figure: trellis of states s1..sN (rows) against time t = 1..T (columns); each node (i, t) holds α_t(i)]

Base case (first column):
  α_1(i) = π_i E_i(o1)

Recursive case (each later column from the previous one):
  α_{t+1}(j) = E_j(o_{t+1}) Σ_i T_ij α_t(i)

Forward probabilities


• So, we can cheaply compute:  α_t(i) = p(o1 o2 ... ot ∧ qt = si)
• How can we cheaply compute:  p(o1 o2 ... ot)?
• How can we cheaply compute:  p(qt = si | o1 o2 ... ot)?

Look back at the trellis...

  p(o1 o2 ... ot) = Σ_i α_t(i)

  p(qt = si | o1 o2 ... ot) = α_t(i) / Σ_j α_t(j)
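Both answers come from a single trellis column: sum it, or normalize it. A small sketch (the forward pass is repeated so the snippet stays self-contained; example HMM numbers assumed as before):

```python
# Same assumed example HMM as in the earlier sketches.
PI = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N = 3

def forward(O):
    alpha = [[PI[i] * E[i][O[0]] for i in range(N)]]
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([E[j][o] * sum(prev[i] * T[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

def p_obs(O):
    """p(o1..ot) = sum_i alpha_t(i): sum the last trellis column."""
    return sum(forward(O)[-1])

def state_posterior(O):
    """p(q_t = s_i | o1..ot) = alpha_t(i) / sum_j alpha_t(j)."""
    last = forward(O)[-1]
    z = sum(last)
    return [a / z for a in last]

print(p_obs([0, 1]))            # 0 + 0.0109 + 0 = 0.0109
print(state_posterior([0, 1]))  # after x1 x2 only s2 is possible
```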

State estimation problem


• State estimation is solved:
  p(O | Φ) = p(o1 o2 ... oT | Φ) = Σ_{i=1}^{N} α_T(i)

• Can we utilize the elegant trellis to solve the Inference problem?
  – Given an observation sequence O, find the best state sequence Q:
    Q* = argmax_Q p(Q | O)

Inference problem


• Given: Φ = (T, E, π), observation O = o1, o2, ..., ot
• Goal: Find
  Q* = argmax_Q p(Q | O) = argmax_{q1 q2 ... qt} p(q1 q2 ... qt | o1 o2 ... ot)

• Practical problems:
  – Speech recognition: Given an utterance (sound), what is the best sentence (text) that matches the utterance?
  – Video tracking
  – POS Tagging

[Figure: the example HMM]

Inference problem


• We can do this in a slow, stupid way:
  Q* = argmax_Q p(Q | O)
     = argmax_Q p(O | Q) p(Q) / p(O)
     = argmax_Q p(O | Q) p(Q)
     = argmax_Q p(o1 o2 ... ot | Q) p(Q)

• But it's better if we can find another way to compute the most probable path (MPP)...

Efficient MPP computation


• We are going to compute the following variables:

  δ_t(i) = max_{q1 q2 ... q_{t-1}} p(q1 q2 ... q_{t-1} ∧ qt = si ∧ o1 o2 ... ot)

• δ_t(i) is the probability of the best path of length t-1 which ends up in si and emits o1 ... ot.
• Define: mpp_t(i) = that path
  so: δ_t(i) = p(mpp_t(i))

Viterbi algorithm

  δ_t(i) = max_{q1 q2 ... q_{t-1}} p(q1 q2 ... q_{t-1} ∧ qt = si ∧ o1 o2 ... ot)

Base case (t = 1):
  mpp_1(i) = (q1 = si)   (only one choice)
  δ_1(i) = p(q1 = si ∧ o1) = π_i E_i(o1) = α_1(i)

[Figure: trellis with the first column δ_1(1) ... δ_1(N) filled in]

Viterbi algorithm

• The most probable path with last two states si sj is the most probable path to si, followed by the transition si → sj.
• The probability of that path will be:
  δ_t(i) × p(si → sj ∧ o_{t+1}) = δ_t(i) T_ij E_j(o_{t+1})

• So, the previous state at time t is:
  i* = argmax_i δ_t(i) T_ij E_j(o_{t+1})

Viterbi algorithm

• Summary:
  δ_1(i) = π_i E_i(o1) = α_1(i)
  δ_{t+1}(j) = E_j(o_{t+1}) max_i δ_t(i) T_ij
  i* = argmax_i δ_t(i) T_ij E_j(o_{t+1})
  mpp_{t+1}(j) = mpp_t(i*) followed by sj

[Figure: trellis with δ values propagated column by column]
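The summary fits in a short function: propagate δ column by column, remember each argmax as a backpointer, then trace back from the best final state. A sketch with the same assumed example HMM (numbers read off the slides):

```python
# Viterbi: delta values plus backpointers over the trellis.
# Example HMM numbers are assumptions read off the slides.
PI = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N = 3

def viterbi(O):
    # Base case: delta_1(i) = pi_i * E_i(o1); no predecessor yet.
    delta = [PI[i] * E[i][O[0]] for i in range(N)]
    back = []                         # back[t][j] = best predecessor i*
    for o in O[1:]:
        # i* = argmax_i delta_t(i) * T_ij  (E_j(o) is constant in i)
        prev_best = [max(range(N), key=lambda i: delta[i] * T[i][j])
                     for j in range(N)]
        # delta_{t+1}(j) = E_j(o) * max_i delta_t(i) * T_ij
        delta = [E[j][o] * delta[prev_best[j]] * T[prev_best[j]][j]
                 for j in range(N)]
        back.append(prev_best)
    # Trace the most probable path backwards from the best final state.
    q = max(range(N), key=lambda i: delta[i])
    path = [q]
    for bp in reversed(back):
        q = bp[q]
        path.append(q)
    path.reverse()
    return path, max(delta)

path, p = viterbi([2, 0, 2])      # observations X3 X1 X3
print([s + 1 for s in path], p)   # 1-based state sequence and its probability
```

With these assumed numbers the decoded path comes out as S2 S3 S2, which is not obvious from the observations alone.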

What's Viterbi used for?

• Speech Recognition

Chong, Jike and Yi, Youngmin and Faria, Arlo and Satish, Nadathur Rajagopalan and Keutzer, Kurt, "Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors", EECS Department, University of California, Berkeley, 2008.

Training HMMs


• Given: a large sequence of observations o1 o2 ... oT and the number of states N.
• Goal: Estimation of the parameters Φ = (T, E, π)

• That is, how to design an HMM.
• We will infer the model from a large amount of data o1 o2 ... oT with a big "T".

Training HMMs


• Remember, we have just computed p(o1 o2 ... oT | Φ).
  Now, we have some observations and we want to infer Φ from them.
• So, we could use:
  – MAX LIKELIHOOD:  Φ* = argmax_Φ p(o1 ... oT | Φ)
  – BAYES: Compute p(Φ | o1 ... oT), then take E[Φ] or max_Φ p(Φ | o1 ... oT)

Max likelihood for HMMs


• Forward probability: the probability of producing o1 ... ot while ending up in state si:

  α_t(i) = p(o1 o2 ... ot ∧ qt = si)
  α_1(i) = π_i E_i(o1)
  α_{t+1}(j) = E_j(o_{t+1}) Σ_i T_ij α_t(i)

• Backward probability: the probability of producing o_{t+1} ... oT given that at time t, we are at state si:

  β_t(i) = p(o_{t+1} o_{t+2} ... oT | qt = si)

Max likelihood for HMMs - Backward


• Backward probability: easy to define recursively

  β_t(i) = p(o_{t+1} o_{t+2} ... oT | qt = si)

Base case:
  β_T(i) = 1

Recursive case:
  β_t(i) = p(o_{t+1} o_{t+2} ... oT | qt = si)
         = Σ_{j=1}^{N} p(o_{t+1} ∧ q_{t+1} = sj | qt = si) p(o_{t+2} ... oT | q_{t+1} = sj)
         = Σ_{j=1}^{N} T_ij E_j(o_{t+1}) β_{t+1}(j)
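The backward recursion mirrors the forward one, filled in from t = T down to 1. A sketch (same assumed example HMM); as a sanity check, p(O) can be recovered from β_1 via p(O) = Σ_i π_i E_i(o1) β_1(i):

```python
# Backward probabilities: beta_t(i) = p(o_{t+1}..oT | q_t = s_i).
# Example HMM numbers are assumptions read off the slides.
PI = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N = 3

def backward(O):
    # Base case: beta_T(i) = 1
    beta = [[1.0] * N]
    # Recursive case: beta_t(i) = sum_j T_ij * E_j(o_{t+1}) * beta_{t+1}(j)
    for o in reversed(O[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(T[i][j] * E[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

O = [2, 0, 2]   # X3 X1 X3
beta = backward(O)
# Consistency check: p(O) = sum_i pi_i * E_i(o1) * beta_1(i),
# which must match the forward pass's sum_i alpha_T(i).
p_O = sum(PI[i] * E[i][O[0]] * beta[0][i] for i in range(N))
print(p_O)
```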

Max likelihood for HMMs


• The probability of traversing a certain arc at time t given o1 o2 ... oT:

  ε_t(ij) = p(qt = si ∧ q_{t+1} = sj | o1 o2 ... oT)
          = p(qt = si ∧ q_{t+1} = sj ∧ o1 o2 ... oT) / p(o1 o2 ... oT)
          = p(o1 ... ot ∧ qt = si) p(q_{t+1} = sj | qt = si) p(o_{t+1} | q_{t+1} = sj) p(o_{t+2} ... oT | q_{t+1} = sj) / p(o1 o2 ... oT)
          = α_t(i) T_ij E_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} α_T(i)

Max likelihood for HMMs


• The probability of being at state si at time t given o1 o2 ... oT:

  γ_t(i) = p(qt = si | o1 o2 ... oT)
         = Σ_{j=1}^{N} p(qt = si ∧ q_{t+1} = sj | o1 o2 ... oT)

  γ_t(i) = Σ_{j=1}^{N} ε_t(ij)
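ε and γ just combine the forward and backward passes. A sketch under the same assumptions (example HMM numbers read off the slides; forward/backward repeated for self-containment):

```python
# Expected counts eps/gamma from forward-backward (assumed example HMM).
PI = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N = 3

def forward(O):
    alpha = [[PI[i] * E[i][O[0]] for i in range(N)]]
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([E[j][o] * sum(prev[i] * T[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

def backward(O):
    beta = [[1.0] * N]
    for o in reversed(O[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(T[i][j] * E[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def arc_and_state_posteriors(O):
    alpha, beta = forward(O), backward(O)
    pO = sum(alpha[-1])
    # eps[t][i][j] = p(q_t = s_i AND q_{t+1} = s_j | O)
    eps = [[[alpha[t][i] * T[i][j] * E[j][O[t + 1]] * beta[t + 1][j] / pO
             for j in range(N)] for i in range(N)]
           for t in range(len(O) - 1)]
    # gamma[t][i] = p(q_t = s_i | O) = sum_j eps[t][i][j]
    gamma = [[sum(eps[t][i]) for i in range(N)] for t in range(len(O) - 1)]
    return eps, gamma

eps, gamma = arc_and_state_posteriors([2, 0, 2])
print(sum(gamma[0]))   # a posterior over states: sums to 1
```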


Update parameters

Recall:  π_i = p(q1 = si),  T_ij = p(q_{t+1} = sj | qt = si),  E_i(x) = p(ot = x | qt = si)

  π̂_i = expected frequency in state i at time t = 1 = γ_1(i)

  T̂_ij = expected # of transitions from state i to j / expected # of transitions from state i
       = Σ_{t=1}^{T-1} ε_t(ij) / Σ_{t=1}^{T-1} γ_t(i)

  Ê_i(x_k) = expected # of transitions from state i with x_k observed / expected # of transitions from state i
           = Σ_{t=1}^{T-1} δ(ot, x_k) γ_t(i) / Σ_{t=1}^{T-1} γ_t(i)

Kronecker delta function: δ(ot, x_k) = 1 ⇔ ot = x_k

The inner loop of Forward-Backward

Given an input sequence:
1. Calculate forward probabilities:
   – Base case:       α_1(i) = π_i E_i(o1)
   – Recursive case:  α_{t+1}(j) = E_j(o_{t+1}) Σ_i T_ij α_t(i)
2. Calculate backward probabilities:
   – Base case:       β_T(i) = 1
   – Recursive case:  β_t(i) = Σ_j T_ij E_j(o_{t+1}) β_{t+1}(j)
3. Calculate expected counts:
   ε_t(ij) = α_t(i) T_ij E_j(o_{t+1}) β_{t+1}(j) / Σ_i α_T(i)
   γ_t(i) = Σ_j ε_t(ij)
4. Update parameters:
   T̂_ij = Σ_{t=1}^{T-1} ε_t(ij) / Σ_{t=1}^{T-1} γ_t(i)
   Ê_i(x_k) = Σ_{t=1}^{T-1} δ(ot, x_k) γ_t(i) / Σ_{t=1}^{T-1} γ_t(i)
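The four steps combine into one re-estimation iteration. The sketch below starts from the same assumed example HMM; it computes γ_t(i) directly as α_t(i)β_t(i)/p(O), which equals Σ_j ε_t(ij) for t < T but is also defined at t = T, so the emission update can use every time step. A production implementation would rescale α/β or work in log space to avoid underflow on long sequences.

```python
# One Baum-Welch (EM) re-estimation step; example HMM numbers assumed.
PI = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N, M = 3, 3

def forward(O, pi, t_, e_):
    alpha = [[pi[i] * e_[i][O[0]] for i in range(N)]]
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([e_[j][o] * sum(prev[i] * t_[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

def backward(O, t_, e_):
    beta = [[1.0] * N]
    for o in reversed(O[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(t_[i][j] * e_[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def baum_welch_step(O, pi, t_, e_):
    alpha, beta = forward(O, pi, t_, e_), backward(O, t_, e_)
    pO, L = sum(alpha[-1]), len(O)
    # gamma_t(i) = alpha_t(i) beta_t(i) / p(O)
    gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(L)]
    # eps_t(ij) = alpha_t(i) T_ij E_j(o_{t+1}) beta_{t+1}(j) / p(O)
    eps = [[[alpha[t][i] * t_[i][j] * e_[j][O[t + 1]] * beta[t + 1][j] / pO
             for j in range(N)] for i in range(N)] for t in range(L - 1)]
    new_pi = gamma[0][:]
    new_t = [[sum(eps[t][i][j] for t in range(L - 1)) /
              sum(gamma[t][i] for t in range(L - 1)) for j in range(N)]
             for i in range(N)]
    new_e = [[sum(gamma[t][i] for t in range(L) if O[t] == k) /
              sum(gamma[t][i] for t in range(L)) for k in range(M)]
             for i in range(N)]
    return new_pi, new_t, new_e, pO   # pO is p(O) under the *input* parameters

O = [2, 0, 2, 1, 2, 0, 2]
pi, t_, e_, p_prev = PI, T, E, 0.0
for _ in range(5):
    pi, t_, e_, p = baum_welch_step(O, pi, t_, e_)
    assert p >= p_prev - 1e-12   # EM: data likelihood never decreases
    p_prev = p
```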

Forward-Backward: EM for HMM

• If we knew Φ we could estimate expectations of quantities such as
  – Expected number of times in state i
  – Expected number of transitions i → j
• If we knew the quantities
  – Expected number of times in state i
  – Expected number of transitions i → j
  we could compute the max likelihood estimate of Φ = (T, E, π)
• Also known (for the HMM case) as the Baum-Welch algorithm.

EM for HMM

• Each iteration provides values for all the parameters.
• The new model always improves the likelihood of the training data:

  p(o1 o2 ... oT | Φ̂) ≥ p(o1 o2 ... oT | Φ)

• The algorithm does not guarantee to reach the global maximum.

EM for HMM

• Bad News
  – There are lots of local minima
• Good News
  – The local minima are usually adequate models of the data.
• Notice
  – EM does not estimate the number of states. That must be given (tradeoffs).
  – Often, HMMs are forced to have some links with zero probability. This is done by setting T_ij = 0 in the initial estimate Φ(0).
  – Easy extension of everything seen today: HMMs with real-valued outputs

Contents

• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field (from the viewpoint of classification)

Example: Image segmentation


• Observations: pixel values
• Hidden variables: the class of each pixel
• It's reasonable to think that there are some underlying relationships between neighbouring pixels... Can we use Markov models?
• Errr.... the relationships are in 2D!

MRF as a 2D generalization of MC

• Array of observations:  X = {x_ij}, 0 ≤ i < N_x, 0 ≤ j < N_y
• Classes/States:  S = {s_ij}, s_ij ∈ {1, ..., M}
• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that p(X|S) p(S) is maximum.

2D context-dependent classification

• Assumptions:
  – The values of elements in S are mutually dependent.
  – The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood N_ij is defined so that:
  – s_ij ∉ N_ij: the (i, j) element does not belong to its own set of neighbors.
  – s_ij ∈ N_kl ⇔ s_kl ∈ N_ij: if s_ij is a neighbor of s_kl then s_kl is also a neighbor of s_ij.

2D context-dependent classification

• The Markov property for the 2D case:

  p(s_ij | S_ij) = p(s_ij | N_ij)

  where S_ij includes all the elements of S except the (i, j) one.
• The elegant dynamic programming is not applicable: the problem is much harder now!

We are gonna see an application of MRF for Image Segmentation and Restoration.

MRF for Image Segmentation

• Cliques: sets of pixels that are neighbors of one another (w.r.t. the type of neighborhood)

[Figure: clique types for different neighborhood systems]

MRF for Image Segmentation

• Dual lattice
• Line process

[Figure: the image lattice, its dual lattice, and the line process]

MRF for Image Segmentation

• Gibbs distribution:

  π(s) = (1/Z) exp(−U(s)/T)

  – Z: normalizing constant
  – T: parameter
• It turns out that the Gibbs distribution implies MRF ([Geman 84]).
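For intuition, here is a tiny sketch of a Gibbs distribution over 2×2 binary labelings with an Ising-style clique potential that penalizes disagreeing 4-neighbors; the energy function and temperature are illustrative assumptions, not the slides' exact model:

```python
from itertools import product
from math import exp

# Gibbs distribution pi(s) = (1/Z) exp(-U(s)/T) on 2x2 binary labelings.
# U(s): sum over pairwise cliques (horizontal/vertical neighbor pairs),
# adding 1 for each disagreeing pair. Illustrative assumption.
TEMP = 1.0

def energy(s):
    u = 0.0
    for i, j in product(range(2), range(2)):
        if i + 1 < 2:                                   # vertical clique
            u += 0.0 if s[i][j] == s[i + 1][j] else 1.0
        if j + 1 < 2:                                   # horizontal clique
            u += 0.0 if s[i][j] == s[i][j + 1] else 1.0
    return u

# Z sums exp(-U/T) over all 2^4 labelings (tractable only for tiny grids).
configs = [((a, b), (c, d)) for a, b, c, d in product((0, 1), repeat=4)]
Z = sum(exp(-energy(s) / TEMP) for s in configs)

def gibbs(s):
    return exp(-energy(s) / TEMP) / Z

# Smooth labelings are more probable than checkerboards:
print(gibbs(((0, 0), (0, 0))), gibbs(((0, 1), (1, 0))))
```

This is the sense in which the Gibbs prior p(S) favors spatially coherent segmentations; [Geman 84] combines such a prior with p(X|S) and maximizes the product.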


MRF for Image Segmentation

• Then, the joint probability for the Gibbs model is

  p(S) = (1/Z) exp( − Σ_{(i,j)} Σ_k F_k(C_k(i, j)) / T )

  – The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X|S).
• Then p(X|S) p(S) can be maximized... [Geman 84]


What you should know

• Markov property, Markov Chain
• HMM:
  – Defining and computing α_t(i)
  – Viterbi algorithm
  – Outline of the EM algorithm for HMM
• Markov Random Field
  – And an application in Image Segmentation
  – [Geman 84] for more information.

Q & A


References

• L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Andrew W. Moore, "Hidden Markov Models", http://www.autonlab.org/tutorials/
• S. Geman, D. Geman, "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.