8/6/2019 Hmm Revisited
http://slidepdf.com/reader/full/hmm-revisited 1/108
PATTERN RECOGNITION
Markov models
Department of Computer Science
March 28th, 2011
06/04/2011 Markov models
Contents
• Introduction
  – Motivation
• Markov Chain
• Hidden Markov Models
• Markov Random Field
Introduction
• Markov processes were first proposed by the Russian mathematician Andrei Markov.
  – He used these processes to investigate the sequence of letters in Pushkin's poem Eugene Onegin.
• Nowadays, the Markov property and HMMs are widely used in many domains:
  – Natural Language Processing
  – Speech Recognition
  – Bioinformatics
  – Image/video processing
  – ...
Motivation [0]
• As shown in his 1906 paper, Markov's original motivation was purely mathematical:
  – applying the Weak Law of Large Numbers to dependent random variables.
• However, we shall not follow this motivation...
Motivation [1]
• From the viewpoint of classification:
  – Context-free classification: Bayes classifier
    p(ωi | x) > p(ωj | x), ∀ j ≠ i
Motivation [1]
• From the viewpoint of classification:
  – Context-free classification: Bayes classifier
    p(ωi | x) > p(ωj | x), ∀ j ≠ i
    • Classes are independent.
    • Feature vectors are independent.
Motivation [1]
• From the viewpoint of classification:
  – Context-free classification: Bayes classifier
    p(ωi | x) > p(ωj | x), ∀ j ≠ i
  – However, there are some applications where the various classes are closely related:
    • POS tagging, tracking, gene boundary recovery...
    s1 → s2 → s3 → ... → sm
Motivation [1]
• Context-dependent classification:
  – s1, s2, ..., sm: a sequence of m feature vectors
  – ω1, ω2, ..., ωN: the classes to which these vectors are assigned, each ωi ∈ {1, ..., k}
    s1 → s2 → s3 → ... → sm
Motivation [1]
• Context-dependent classification:
  – s1, s2, ..., sm: a sequence of m feature vectors
  – ω1, ω2, ..., ωN: the classes to which these vectors are assigned, each ωi ∈ {1, ..., k}
• To apply the Bayes classifier:
  – X = s1 s2 ... sm: extended feature vector
  – Ωi = ωi1, ωi2, ..., ωiN: one classification; there are N^m possible classifications
    p(Ωi | X) > p(Ωj | X), ∀ j ≠ i
    p(X | Ωi) p(Ωi) > p(X | Ωj) p(Ωj), ∀ j ≠ i
Motivation [2]
• From a more general viewpoint, we sometimes want to evaluate the joint distribution of a sequence of dependent random variables.
Motivation [2]
• From a more general viewpoint, we sometimes want to evaluate the joint distribution of a sequence of dependent random variables:
  Hôm nay mùng tám tháng ba
  Chị em phụ nữ đi ra đi vào...
  ("Today is the eighth of March; the women keep walking in and out...")
  q1 = Hôm, q2 = nay, q3 = mùng, ..., qm = vào
Motivation [2]
• From a more general viewpoint, we sometimes want to evaluate the joint distribution of a sequence of dependent random variables:
  Hôm nay mùng tám tháng ba
  Chị em phụ nữ đi ra đi vào...
  q1 = Hôm, q2 = nay, q3 = mùng, ..., qm = vào
• What is p(Hôm nay ... vào) = p(q1 = Hôm, q2 = nay, ..., qm = vào)?
  p(sm | s1 s2 ... sm-1) = p(s1 s2 ... sm-1 sm) / p(s1 s2 ... sm-1)
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field
Markov Chain
• Has N states, called s1, s2, ..., sN.
• There are discrete timesteps, t = 0, t = 1, ...
• On the t'th timestep the system is in exactly one of the available states; call it qt ∈ {s1, s2, ..., sN} (the current state).
  Example: N = 3, t = 0, qt = q0 = s3.
Markov Chain
• Has N states, called s1, s2, ..., sN.
• There are discrete timesteps, t = 0, t = 1, ...
• On the t'th timestep the system is in exactly one of the available states; call it qt.
  Example: N = 3, t = 1, qt = q1 = s2.
• Between each timestep, the next state is chosen randomly.
• The current state determines the probability distribution for the next state:
  p(qt+1 = s1 | qt = s1) = 0      p(qt+1 = s2 | qt = s1) = 0      p(qt+1 = s3 | qt = s1) = 1
  p(qt+1 = s1 | qt = s2) = 1/2    p(qt+1 = s2 | qt = s2) = 1/2    p(qt+1 = s3 | qt = s2) = 0
  p(qt+1 = s1 | qt = s3) = 1/3    p(qt+1 = s2 | qt = s3) = 2/3    p(qt+1 = s3 | qt = s3) = 0
• These probabilities are often notated with arcs between the states.
Markov Property
• qt+1 is conditionally independent of qt-1, qt-2, ..., q0 given qt. In other words:
  p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt)
  The state at timestep t+1 depends only on the state at timestep t.
• A Markov chain of order m (m finite): the state at timestep t+1 depends on the past m states:
  p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt, qt-1, ..., qt-m+1)
Markov Property
• qt+1 is conditionally independent of qt-1, qt-2, ..., q0 given qt:
  p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt)
  The state at timestep t+1 depends only on the state at timestep t.
• How do we represent the joint distribution of (q0, q1, q2, ...) using graphical models?
  As a chain: q0 → q1 → q2 → q3 → ...
Markov chain
• So, the chain of qt is called a Markov chain:
  q0 → q1 → q2 → q3 → ...
Markov chain
• So, the chain of qt is called a Markov chain:
  q0 → q1 → q2 → q3 → ...
• Each qt takes a value from the countable state-space {s1, s2, s3, ...}.
• Each qt is observed at a discrete timestep t.
• The Markov property holds: p(qt+1 | qt, ..., q0) = p(qt+1 | qt).
• The transition from qt to qt+1 is governed by the transition probability matrix:

  Transition probabilities:
         s1     s2     s3
  s1     0      0      1
  s2     1/2    1/2    0
  s3     1/3    2/3    0
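The transition matrix above can be used directly to propagate a distribution over states from one timestep to the next. A minimal sketch in plain Python (0-indexed states, an assumption of this snippet, not the deck's notation), starting from q0 = s3 as in the earlier example:

```python
# Transition matrix from the slide (row: current state s1..s3, column: next state).
T = [[0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0],
     [1/3, 2/3, 0.0]]

def step(p, T):
    """One step of the chain: p_next[j] = sum_i p[i] * T[i][j]."""
    n = len(p)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

p = [0.0, 0.0, 1.0]   # start in s3, i.e. q0 = s3
for _ in range(2):
    p = step(p, T)
# After one step: [1/3, 2/3, 0]; after two steps the distribution is uniform.
print(p)
```

Note that the distribution stays normalized at every step because each row of T sums to 1.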
Markov Chain – Important property
• In a Markov chain, the joint distribution is
  p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | qj-1)
• Why?
  p(q0, q1, ..., qm) = ∏_j p(qj | previous states)        (chain rule)
                     = p(q0) ∏_{j=1}^{m} p(qj | qj-1)     (due to the Markov property)
Markov Chain: e.g.
• The state-space of weather: {rain, cloud, wind}

  Transition probabilities:
           Rain    Cloud   Wind
  Rain     1/2     0       1/2
  Cloud    1/3     0       2/3
  Wind     0       1       0

• Markov assumption: the weather on the (t+1)'th day depends only on the t'th day.
• We have observed the weather for a week:
  Day:   0     1     2      3     4
         rain  wind  cloud  rain  wind
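Under the factorization p(q0, ..., qm) = p(q0) ∏ p(qj | qj-1), the probability of the observed week is a product of transition probabilities. A minimal sketch; since the slides give no initial distribution for the weather chain, this conditions on the first day (i.e. takes p(q0 = rain) = 1, an assumption of the snippet):

```python
# Transition probabilities from the weather example (row: today, column: tomorrow).
T = {
    "rain":  {"rain": 0.5, "cloud": 0.0, "wind": 0.5},
    "cloud": {"rain": 1/3, "cloud": 0.0, "wind": 2/3},
    "wind":  {"rain": 0.0, "cloud": 1.0, "wind": 0.0},
}

def chain_probability(sequence, T, p_start=1.0):
    """p(q0, ..., qm) = p(q0) * prod_j p(q_j | q_{j-1})."""
    p = p_start
    for prev, cur in zip(sequence, sequence[1:]):
        p *= T[prev][cur]
    return p

week = ["rain", "wind", "cloud", "rain", "wind"]
print(chain_probability(week, T))  # 0.5 * 1.0 * (1/3) * 0.5 = 1/12
```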
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
  – Independence assumptions
  – Formal definition
  – Forward algorithm
  – Viterbi algorithm
  – Baum-Welch algorithm
• Markov Random Field
Modeling pairs of sequences
• In many applications, we have to model pairs of sequences.
• Examples:
  – POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb, ...)
  – Speech recognition (map acoustic sequences to sequences of words)
  – Computational biology (recover gene boundaries in DNA sequences)
  – Video tracking (estimate the underlying model states from the observation sequences)
  – And many others...
Probabilistic models for sequence pairs
• We have two sequences of random variables:
  X1, X2, ..., Xm and S1, S2, ..., Sm
• Intuitively, in a practical system, each Xi corresponds to an observation and each Si corresponds to the state that generated the observation.
• Let each Si be in {1, 2, ..., k} and each Xi be in {1, 2, ..., o}.
• How do we model the joint distribution
  p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)?
Hidden Markov Models (HMMs)
• In HMMs, we assume that
  p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
    = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1) ∏_{j=1}^{m} p(Xj = xj | Sj = sj)
• This factorization is often called the independence assumptions in HMMs.
• We will derive it in the next slides.
Independence Assumptions in HMMs [1]
• By the chain rule, the following equality is exact:
  p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
    = p(S1 = s1, ..., Sm = sm) × p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)
  (compare: p(ABC) = p(A | BC) p(BC) = p(A | BC) p(B | C) p(C))
• Assumption 1: the state sequence forms a Markov chain:
  p(S1 = s1, ..., Sm = sm) = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1)
Independence Assumptions in HMMs [2]
• By the chain rule, the following equality is exact:
  p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)
    = ∏_{j=1}^{m} p(Xj = xj | X1 = x1, ..., Xj-1 = xj-1, S1 = s1, ..., Sm = sm)
• Assumption 2: each observation depends only on the underlying state:
  p(Xj = xj | X1 = x1, ..., Xj-1 = xj-1, S1 = s1, ..., Sm = sm) = p(Xj = xj | Sj = sj)
• These two assumptions are often called the independence assumptions in HMMs.
The Model form for HMMs
• The model takes the following form:
  p(x1, ..., xm, s1, ..., sm; θ) = π(s1) ∏_{j=2}^{m} t(sj | sj-1) ∏_{j=1}^{m} e(xj | sj)
• Parameters in the model:
  – π(s): initial probabilities, for s ∈ {1, 2, ..., k}
  – t(s′ | s): transition probabilities, for s, s′ ∈ {1, 2, ..., k}
  – e(x | s): emission probabilities, for s ∈ {1, 2, ..., k} and x ∈ {1, 2, ..., o}
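The model form can be evaluated directly. A sketch using, as an example, the HMM that appears later in the deck (states and events 0-indexed here, an assumption of the snippet):

```python
pi = [0.3, 0.3, 0.4]            # pi(s) for s1, s2, s3
t = [[0.5, 0.5, 0.0],           # t(s'|s): row = s, column = s'
     [0.4, 0.0, 0.6],
     [0.2, 0.8, 0.0]]
e = [[0.3, 0.0, 0.7],           # e(x|s): row = s, column = x
     [0.0, 0.1, 0.9],
     [0.2, 0.0, 0.8]]

def hmm_joint(xs, ss):
    """p(x1..xm, s1..sm; theta) = pi(s1) * prod_{j=2..m} t(s_j|s_{j-1})
                                         * prod_{j=1..m} e(x_j|s_j)."""
    p = pi[ss[0]]
    for prev, cur in zip(ss, ss[1:]):
        p *= t[prev][cur]
    for x, s in zip(xs, ss):
        p *= e[s][x]
    return p

# p(x3 x1, s3 s1) = 0.4 * 0.2 * 0.8 * 0.3 = 0.0192
print(hmm_joint([2, 0], [2, 0]))
```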
6 components of HMMs
• Discrete timesteps: 1, 2, ...
• Finite state space: {si} (N states)
• Events {xi} (M events)
• Vector of initial probabilities: Π = {πi}, where πi = p(q1 = si)
• Matrix of transition probabilities: T = {Tij}, where Tij = p(qt+1 = sj | qt = si)
• Matrix of emission probabilities: E = {Eij}, where Eij = p(ot = xj | qt = si)
The observations at the discrete timesteps form an observation sequence o1, o2, ..., ot, where each oi ∈ {x1, x2, ..., xM}.
Constraints:
  Σ_{i=1}^{N} πi = 1;   Σ_{j=1}^{N} Tij = 1 for every i;   Σ_{j=1}^{M} Eij = 1 for every i
6 components of HMMs
• Given a specific HMM and an observation sequence, the corresponding sequence of states is generally not deterministic.
• Example: given the observation sequence x1, x3, x3, x2, the corresponding states can be any of the following sequences:
  s1, s2, s1, s2
  s1, s2, s3, s2
  s1, s1, s1, s2
  ...
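This ambiguity can be checked by brute force: enumerate all state sequences and keep those with nonzero joint probability for the observations x1, x3, x3, x2, using the example HMM defined on the next slide (0-indexed here, an assumption of the snippet):

```python
from itertools import product

pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
O = [0, 2, 2, 1]   # observations x1, x3, x3, x2

def joint(Q):
    """p(O, Q) = pi(q1) e(o1|q1) * prod_t T(q_t, q_{t+1}) e(o_{t+1}|q_{t+1})."""
    p = pi[Q[0]] * E[Q[0]][O[0]]
    for (a, b), o in zip(zip(Q, Q[1:]), O[1:]):
        p *= T[a][b] * E[b][o]
    return p

consistent = [Q for Q in product(range(3), repeat=4) if joint(Q) > 0]
# Includes (s1,s2,s1,s2), (s1,s2,s3,s2), (s1,s1,s1,s2) from the slide, among others.
print(len(consistent))
```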
Here's an HMM
(Three states s1, s2, s3 and three events x1, x2, x3.)

Transition matrix T:
       s1    s2    s3
  s1   0.5   0.5   0
  s2   0.4   0     0.6
  s3   0.2   0.8   0

Emission matrix E:
       x1    x2    x3
  s1   0.3   0     0.7
  s2   0     0.1   0.9
  s3   0.2   0     0.8

Initial probabilities π:
  s1: 0.3   s2: 0.3   s3: 0.4
Here's an HMM
• Start randomly in state 1, 2, or 3.
• Choose an output at each state at random.
• Let's generate a sequence of observations:
  – q1: choose among S1, S2, S3 with probabilities 0.3, 0.3, 0.4 → q1 = S3
  – o1: in S3, choose between X1 (0.2) and X3 (0.8) → o1 = X3
  – q2: from S3, go to S1 (0.2) or S2 (0.8) → q2 = S1
  – o2: in S1, choose between X1 (0.3) and X3 (0.7) → o2 = X1
  – q3: from S1, go to S1 (0.5) or S2 (0.5) → q3 = S1
  – o3: in S1, choose between X1 (0.3) and X3 (0.7) → o3 = X3
• We got a sequence of states and corresponding observations:
  states S3, S1, S1; observations X3, X1, X3.
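The generation procedure above can be sketched as a small sampler (plain Python; `random.choices` draws with the given weights; 0-indexed states and events are an assumption of the snippet):

```python
import random

pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def sample_sequence(length, seed=None):
    """Draw q1 from pi, then emit o_t from E[q_t] and move with T[q_t],
    mirroring the walk on the slides."""
    rng = random.Random(seed)
    states, obs = [], []
    q = rng.choices(range(3), weights=pi)[0]
    for _ in range(length):
        states.append(q)
        obs.append(rng.choices(range(3), weights=E[q])[0])
        q = rng.choices(range(3), weights=T[q])[0]
    return states, obs

states, obs = sample_sequence(3, seed=0)
```

Every sampled pair is consistent with the model: each emission has E[q][o] > 0 and each transition has T[a][b] > 0.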
Three famous HMM tasks
• Given an HMM Φ = (T, E, π), the three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
  – Given: Φ, observation O = o1, o2, ..., ot
  – Goal: p(O | Φ), or equivalently p(qt = si | O)
  – That is, calculating the probability of observing the sequence O over all possible state sequences.
• Most likely explanation (inference)
  – Given: Φ, observation O = o1, o2, ..., ot
  – Goal: Q* = argmaxQ p(Q | O)
  – That is, calculating the best corresponding state sequence, given an observation sequence.
• Learning the HMM
  – Given: observation O = o1, o2, ..., ot and the corresponding state sequence
  – Goal: estimate the parameters of the HMM Φ = (T, E, π), i.e. the transition matrix, the emission matrix, and the initial probabilities.
Three famous HMM tasks

  Problem                                      Algorithm          Complexity
  State estimation: compute p(O | Φ)           Forward            O(TN²)
  Inference: compute Q* = argmaxQ p(Q | O)     Viterbi decoding   O(TN²)
  Learning: compute Φ* = argmaxΦ p(O | Φ)      Baum-Welch (EM)    O(TN²)

  T: number of timesteps; N: number of states
State estimation problem
• Given: Φ = (T, E, π), observation O = o1, o2, ..., ot
• Goal: what is p(o1 o2 ... ot)?
• We can do this in a slow, stupid way, as shown on the next slide...
Here's the HMM again
• What is p(O) = p(o1 o2 o3) = p(o1 = X3 ∧ o2 = X1 ∧ o3 = X3)?
• Slow, stupid way:
  p(O) = Σ_{Q ∈ paths of length 3} p(O ∧ Q) = Σ_{Q ∈ paths of length 3} p(O | Q) p(Q)
• How to compute p(Q) for an arbitrary path Q?
  p(Q) = p(q1 q2 q3) = p(q1) p(q2 | q1) p(q3 | q2, q1)   (chain rule)
       = p(q1) p(q2 | q1) p(q3 | q2)                     (why? the Markov property)
  Example, Q = S3 S1 S1: p(Q) = 0.4 × 0.2 × 0.5 = 0.04
• How to compute p(O | Q) for an arbitrary path Q?
  p(O | Q) = p(o1 o2 o3 | q1 q2 q3) = p(o1 | q1) p(o2 | q2) p(o3 | q3)   (why? each observation depends only on its state)
  Example, Q = S3 S1 S1: p(O | Q) = p(X3 | S3) p(X1 | S1) p(X3 | S1) = 0.8 × 0.3 × 0.7 = 0.168
• p(O) needs 27 p(Q) computations and 27 p(O | Q) computations. What if the sequence had 20 observations? Then there would be 3²⁰ ≈ 3.5 billion paths.
  So let's be smarter...
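The slow, stupid way can be written down directly: sum p(Q) p(O | Q) over all 3³ = 27 paths. A sketch (0-indexed states and events, an assumption of the snippet; O = x3, x1, x3):

```python
from itertools import product

pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
O = [2, 0, 2]   # x3, x1, x3

total = 0.0
for Q in product(range(3), repeat=len(O)):      # all 27 paths
    pQ = pi[Q[0]]
    for a, b in zip(Q, Q[1:]):
        pQ *= T[a][b]                           # p(Q), by the Markov property
    pOQ = 1.0
    for o, q in zip(O, Q):
        pOQ *= E[q][o]                          # p(O|Q), emissions independent given states
    total += pQ * pOQ
print(total)   # p(O) ~ 0.094344
```

The same value falls out of the forward algorithm described next, at a fraction of the cost.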
The Forward algorithm
• Given observation o1 o2 ... oT.
• Forward probabilities:
  αt(i) = p(o1 o2 ... ot ∧ qt = si | Φ), where 1 ≤ t ≤ T
• αt(i) = probability that, in a random trial:
  – we'd have seen the first t observations, and
  – we'd have ended up in si as the t'th state visited.
• In our example, what is α2(3)?
αt(i): easy to define recursively
• Recall the parameters:
  Π = {πi = p(q1 = si)},  T = {Tij = p(qt+1 = sj | qt = si)},  E = {Eij = p(ot = xj | qt = si)}
• Base case:
  α1(i) = p(o1 ∧ q1 = si) = p(q1 = si) p(o1 | q1 = si) = πi Ei(o1)
• Recursive case:
  αt+1(i) = p(o1 o2 ... ot+1 ∧ qt+1 = si)
          = Σ_{j=1}^{N} p(o1 o2 ... ot ∧ qt = sj ∧ ot+1 ∧ qt+1 = si)
          = Σ_{j=1}^{N} p(ot+1 | qt+1 = si) p(qt+1 = si | qt = sj) p(o1 o2 ... ot ∧ qt = sj)
          = Ei(ot+1) Σ_{j=1}^{N} Tji αt(j)
In our example
• Recall: α1(i) = πi Ei(o1) and αt+1(i) = Ei(ot+1) Σj Tji αt(j).
• We observed x1 x2:
  α1(1) = 0.3 × 0.3 = 0.09
  α1(2) = 0 × 0.3 = 0
  α1(3) = 0.2 × 0.4 = 0.08
  α2(1) = 0 × (0.09×0.5 + 0×0.4 + 0.08×0.2) = 0
  α2(2) = 0.1 × (0.09×0.5 + 0×0 + 0.08×0.8) = 0.0109
  α2(3) = 0 × (0.09×0 + 0×0.6 + 0.08×0) = 0
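The recursion and the numbers above can be checked with a short implementation (a sketch; α is stored 0-indexed, so alpha[t-1][i-1] is αt(i), an indexing choice of the snippet):

```python
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def forward(O):
    """Forward pass: alpha_1(i) = pi_i * E_i(o1);
    alpha_{t+1}(i) = E_i(o_{t+1}) * sum_j T_ji * alpha_t(j)."""
    N = len(pi)
    alpha = [[pi[i] * E[i][O[0]] for i in range(N)]]
    for o in O[1:]:
        prev = alpha[-1]
        alpha.append([E[i][o] * sum(prev[j] * T[j][i] for j in range(N))
                      for i in range(N)])
    return alpha

alpha = forward([0, 1])   # we observed x1 x2
print(alpha)   # [[0.09, 0.0, 0.08], [0.0, 0.0109, 0.0]] up to rounding
```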
Forward probabilities – Trellis
• The αt(i) values can be arranged on a trellis: states s1, ..., sN on the vertical axis, timesteps t = 1, 2, ..., T on the horizontal axis; cell (i, t) holds αt(i).
• The first column is filled with the base case α1(i) = πi Ei(o1).
• Each later column is filled from the previous one with the recursion αt+1(i) = Ei(ot+1) Σj Tji αt(j).
Forward probabilities
• So, we can cheaply compute αt(i) = p(o1 o2 ... ot ∧ qt = si).
• How can we cheaply compute p(o1 o2 ... ot)?
  Sum the t'th column of the trellis: p(o1 o2 ... ot) = Σi αt(i)
• How can we cheaply compute p(qt = si | o1 o2 ... ot)?
  Normalize: p(qt = si | o1 o2 ... ot) = αt(i) / Σj αt(j)
  Look back at the trellis...
State estimation problem
• State estimation is solved:
  p(O | Φ) = p(o1 o2 ... ot) = Σ_{i=1}^{N} αt(i)
• Can we utilize the elegant trellis to solve the inference problem?
  – Given an observation sequence O, find the best state sequence Q* = argmaxQ p(Q | O).
Inference problem

• Given: Φ = (T, E, π), observation O = o_1, o_2, ..., o_t
• Goal: Find
  Q* = argmax_Q p(Q | O) = argmax_{q_1 q_2 ... q_t} p(q_1 q_2 ... q_t | o_1 o_2 ... o_t)
• Practical problems:
  – Speech recognition: Given an utterance (sound), what is the best sentence (text) that matches the utterance?
  – Video tracking
  – POS Tagging
Inference problem

• We can do this in a slow, stupid way:
  Q* = argmax_Q p(Q | O)
     = argmax_Q p(O | Q) p(Q) / p(O)
     = argmax_Q p(O | Q) p(Q)
     = argmax_Q p(o_1 o_2 ... o_t | Q) p(Q)
• But it's better if we can find another way to compute the most probable path (MPP)...
Efficient MPP computation

• We are going to compute the following variables:
  δ_t(i) = max_{q_1 q_2 ... q_{t-1}} p(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ o_1 o_2 ... o_t)
• δ_t(i) is the probability of the best path of length t-1 which ends up in s_i and emits o_1 ... o_t.
• Define: mpp_t(i) = that path
  so: δ_t(i) = p(mpp_t(i))
Viterbi algorithm

• δ_t(i) = max_{q_1 q_2 ... q_{t-1}} p(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ o_1 o_2 ... o_t)
• mpp_t(i) = argmax_{q_1 q_2 ... q_{t-1}} p(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ o_1 o_2 ... o_t)
• Base case (one choice):
  δ_1(i) = p(q_1 = s_i ∧ o_1) = π_i E_i(o_1) = α_1(i)

[Trellis figure: each node (i, t) is annotated with δ_t(i).]
Viterbi algorithm

• The most probable path with last two states s_i, s_j is the most probable path to s_i, followed by the transition s_i → s_j.
• The probability of that path will be:
  δ_t(i) × p(s_i → s_j ∧ o_{t+1}) = δ_t(i) T_ij E_j(o_{t+1})
• So, the previous state at time t is:
  i* = argmax_i δ_t(i) T_ij E_j(o_{t+1})
Viterbi algorithm

• Summary:
  δ_1(i) = π_i E_i(o_1) = α_1(i)
  δ_{t+1}(j) = max_i δ_t(i) T_ij E_j(o_{t+1})
  i* = argmax_i δ_t(i) T_ij E_j(o_{t+1})
  mpp_{t+1}(j) = mpp_t(i*) followed by s_j

[Trellis figure: each node (i, t) is annotated with δ_t(i).]
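The summary above can be sketched directly: δ is updated column by column, back-pointers record i*, and mpp is recovered by backtracking from the best final state. The model values below are made-up toy numbers (the same ones used in the forward sketch), not a model from the slides.

```python
# Viterbi algorithm sketch: delta_{t+1}(j) = max_i delta_t(i) * T[i][j] * E[j][o_{t+1}].
pi = [0.6, 0.4]
T  = [[0.7, 0.3],
      [0.4, 0.6]]
E  = [[0.5, 0.5],
      [0.1, 0.9]]

def viterbi(obs):
    n = len(pi)
    delta = [pi[i] * E[i][obs[0]] for i in range(n)]   # base case: delta_1(i) = pi_i E_i(o_1)
    back = []                                          # back[t][j] = i* leading into s_j
    for o in obs[1:]:
        # scores[j][i]: probability of extending the best path ending in s_i by s_i -> s_j
        scores = [[delta[i] * T[i][j] * E[j][o] for i in range(n)] for j in range(n)]
        back.append([max(range(n), key=lambda i: scores[j][i]) for j in range(n)])
        delta = [max(scores[j]) for j in range(n)]
    # Backtrack from the best final state to recover the most probable path.
    q = max(range(n), key=lambda j: delta[j])
    path = [q]
    for ptr in reversed(back):
        q = ptr[q]
        path.append(q)
    return list(reversed(path)), max(delta)

path, prob = viterbi([0, 1, 1, 0])
```

With these toy numbers the MPP stays in state 0 throughout, since state 0's flat emissions outweigh the cost of visiting state 1.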
What's Viterbi used for?

• Speech Recognition

Chong, Jike and Yi, Youngmin and Faria, Arlo and Satish, Nadathur Rajagopalan and Keutzer, Kurt, "Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors", EECS Department, University of California, Berkeley, 2008.
Training HMMs

• Given: a large sequence of observations o_1 o_2 ... o_T and the number of states N.
• Goal: estimation of parameters Φ = (T, E, π)
• That is, how to design an HMM.
• We will infer the model from a large amount of data o_1 o_2 ... o_T with a big "T".
Training HMMs

• Remember, we have just computed p(o_1 o_2 ... o_T | Φ).
• Now, we have some observations and we want to infer Φ from them.
• So, we could use:
  – MAX LIKELIHOOD: Φ* = argmax_Φ p(o_1 ... o_T | Φ)
  – BAYES: compute p(Φ | o_1 ... o_T), then take E[Φ] or argmax_Φ p(Φ | o_1 ... o_T)
Max likelihood for HMMs

• Forward probability: the probability of producing o_1 ... o_t while ending up in state s_i:
  α_t(i) = p(o_1 o_2 ... o_t ∧ q_t = s_i)
  α_1(i) = π_i E_i(o_1)
  α_{t+1}(i) = E_i(o_{t+1}) Σ_j T_ji α_t(j)
• Backward probability: the probability of producing o_{t+1} ... o_T given that at time t, we are at state s_i:
  β_t(i) = p(o_{t+1} o_{t+2} ... o_T | q_t = s_i)
Max likelihood for HMMs - Backward

• Backward probability: easy to define recursively
  β_t(i) = p(o_{t+1} o_{t+2} ... o_T | q_t = s_i)
  β_T(i) = 1
  β_t(i) = Σ_{j=1}^N p(o_{t+1} o_{t+2} ... o_T ∧ q_{t+1} = s_j | q_t = s_i)
         = Σ_{j=1}^N p(o_{t+1} | q_{t+1} = s_j) p(o_{t+2} ... o_T | q_{t+1} = s_j) p(q_{t+1} = s_j | q_t = s_i)
         = Σ_{j=1}^N T_ij E_j(o_{t+1}) β_{t+1}(j)
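The backward recursion can be sketched the same way as the forward one. A useful sanity check implied by the definitions: Σ_i α_t(i) β_t(i) = p(o_1 ... o_T) for every t, since α_t(i) β_t(i) = p(o_1 ... o_T ∧ q_t = s_i). The model values are again made-up toy numbers.

```python
# Backward probabilities: beta_t(i) = p(o_{t+1}..o_T | q_t = s_i).
pi = [0.6, 0.4]
T  = [[0.7, 0.3],
      [0.4, 0.6]]
E  = [[0.5, 0.5],
      [0.1, 0.9]]
N  = 2

def forward(obs):
    alpha = [[pi[i] * E[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        alpha.append([E[i][o] * sum(T[j][i] * alpha[-1][j] for j in range(N))
                      for i in range(N)])
    return alpha

def backward(obs):
    beta = [[1.0] * N]                      # base case: beta_T(i) = 1
    for o in reversed(obs[1:]):             # recursive case, for t = T-1 down to 1
        beta.append([sum(T[i][j] * E[j][o] * beta[-1][j] for j in range(N))
                     for i in range(N)])
    return list(reversed(beta))             # beta[t][i], t = 0 .. T-1

obs = [0, 1, 1, 0]
alpha, beta = forward(obs), backward(obs)
# Sigma_i alpha_t(i) * beta_t(i) should be the same value p(o_1..o_T) for every t.
totals = [sum(a * b for a, b in zip(alpha[t], beta[t])) for t in range(len(obs))]
```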
Max likelihood for HMMs

• The probability of traversing a certain arc at time t given o_1 o_2 ... o_T:
  ε_t(i, j) = p(q_t = s_i ∧ q_{t+1} = s_j | o_1 o_2 ... o_T)
            = p(q_t = s_i ∧ q_{t+1} = s_j ∧ o_1 o_2 ... o_T) / p(o_1 o_2 ... o_T)
            = α_t(i) T_ij E_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^N α_t(i) β_t(i)
Max likelihood for HMMs

• The probability of being at state s_i at time t given o_1 o_2 ... o_T:
  γ_t(i) = p(q_t = s_i | o_1 o_2 ... o_T)
         = Σ_{j=1}^N p(q_t = s_i ∧ q_{t+1} = s_j | o_1 o_2 ... o_T)
         = Σ_{j=1}^N ε_t(i, j)
Update parameters

• The parameters to re-estimate:
  π_i = p(q_1 = s_i)
  T_ij = p(q_{t+1} = s_j | q_t = s_i)
  E_i(k) = p(o_t = x_k | q_t = s_i)
• Re-estimation formulas:
  π̂_i = expected frequency in state i at time t = 1 = γ_1(i)
  T̂_ij = expected # of transitions from state i to j / expected # of transitions from state i
       = Σ_{t=1}^{T-1} ε_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)
  Ê_i(k) = expected # of transitions from state i with x_k observed / expected # of transitions from state i
         = Σ_t δ(o_t, x_k) γ_t(i) / Σ_t γ_t(i)
• Kronecker delta function: δ(o_t, x_k) = 1 ⇔ o_t = x_k
The inner loop of Forward-Backward

Given an input sequence:
1. Calculate forward probabilities:
   – Base case: α_1(i) = π_i E_i(o_1)
   – Recursive case: α_{t+1}(i) = E_i(o_{t+1}) Σ_j T_ji α_t(j)
2. Calculate backward probabilities:
   – Base case: β_T(i) = 1
   – Recursive case: β_t(i) = Σ_j T_ij E_j(o_{t+1}) β_{t+1}(j)
3. Calculate expected counts:
   ε_t(i, j) = α_t(i) T_ij E_j(o_{t+1}) β_{t+1}(j) / Σ_i α_t(i) β_t(i)
4. Update parameters:
   T̂_ij = Σ_{t=1}^{T-1} ε_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)
   Ê_i(k) = Σ_t δ(o_t, x_k) γ_t(i) / Σ_t γ_t(i)
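Steps 1-4 above can be combined into one complete iteration of the inner loop. This is an unscaled, single-sequence sketch on made-up toy data; a practical implementation would rescale α and β (or work in log space) to avoid underflow on long sequences.

```python
# One Baum-Welch (Forward-Backward) iteration for a discrete HMM, unscaled.
N, M = 2, 2                                  # number of states, number of symbols
pi = [0.6, 0.4]
T  = [[0.7, 0.3], [0.4, 0.6]]
E  = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1, 0, 1, 0, 0, 1]               # toy observation sequence

def likelihood(pi, T, E, obs):
    """p(o_1..o_T | model), by the forward recursion."""
    alpha = [pi[i] * E[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [E[i][o] * sum(T[j][i] * alpha[j] for j in range(N)) for i in range(N)]
    return sum(alpha)

def baum_welch_step(pi, T, E, obs):
    L = len(obs)
    # 1. forward probabilities
    alpha = [[pi[i] * E[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        alpha.append([E[i][o] * sum(T[j][i] * alpha[-1][j] for j in range(N))
                      for i in range(N)])
    # 2. backward probabilities
    beta = [[1.0] * N]
    for o in reversed(obs[1:]):
        beta.append([sum(T[i][j] * E[j][o] * beta[-1][j] for j in range(N))
                     for i in range(N)])
    beta.reverse()
    # 3. expected counts eps_t(i,j) and gamma_t(i)
    pO = sum(a * b for a, b in zip(alpha[0], beta[0]))          # p(o_1..o_T)
    eps = [[[alpha[t][i] * T[i][j] * E[j][obs[t+1]] * beta[t+1][j] / pO
             for j in range(N)] for i in range(N)] for t in range(L - 1)]
    gamma = [[sum(eps[t][i]) for i in range(N)] for t in range(L - 1)]
    gamma.append([alpha[L-1][i] * beta[L-1][i] / pO for i in range(N)])
    # 4. update parameters
    new_pi = gamma[0][:]
    new_T = [[sum(eps[t][i][j] for t in range(L-1)) /
              sum(gamma[t][i] for t in range(L-1)) for j in range(N)] for i in range(N)]
    new_E = [[sum(g[i] for t, g in enumerate(gamma) if obs[t] == k) /
              sum(g[i] for g in gamma) for k in range(M)] for i in range(N)]
    return new_pi, new_T, new_E

pi2, T2, E2 = baum_welch_step(pi, T, E, obs)
```

As the next slides note, one step of this update can never decrease the likelihood of the training data.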
Forward-Backward: EM for HMM

• If we knew Φ we could estimate expectations of quantities such as:
  – Expected number of times in state i
  – Expected number of transitions i → j
• If we knew the expected number of times in state i and the expected number of transitions i → j, we could compute the max likelihood estimate of Φ = (T, E, π).
• Also known (for the HMM case) as the Baum-Welch algorithm.
EM for HMM

• Each iteration provides values for all the parameters.
• The new model always improves the likelihood of the training data:
  p(o_1 o_2 ... o_T | Φ̂) ≥ p(o_1 o_2 ... o_T | Φ)
• The algorithm does not guarantee to reach the global maximum.
EM for HMM

• Bad News
  – There are lots of local minima.
• Good News
  – The local minima are usually adequate models of the data.
• Notice
  – EM does not estimate the number of states. That must be given (tradeoffs).
  – Often, HMMs are forced to have some links with zero probability. This is done by setting T_ij = 0 in the initial estimate Φ(0).
  – Easy extension of everything seen today: HMMs with real-valued outputs.
Contents

• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field (from the viewpoint of classification)
Example: Image segmentation

• Observations: pixel values
• Hidden variable: class of each pixel
• It's reasonable to think that there are some underlying relationships between neighbouring pixels... Can we use Markov models?
• Errr.... the relationships are in 2D!
MRF as a 2D generalization of MC

• Array of observations: X = {x_ij, 0 ≤ i < N_x, 0 ≤ j < N_y}
• Classes/States: S = {s_ij}, s_ij ∈ {1, ..., M}
• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that p(X | S) p(S) is maximum.
2D context-dependent classification

• Assumptions:
  – The values of elements in S are mutually dependent.
  – The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood N_ij is defined so that:
  – s_ij ∉ N_ij: the (i, j) element does not belong to its own set of neighbors.
  – s_ij ∈ N_kl ⇔ s_kl ∈ N_ij: if s_ij is a neighbor of s_kl then s_kl is also a neighbor of s_ij.
2D context-dependent classification

• The Markov property for the 2D case:
  p(s_ij | S̄_ij) = p(s_ij | N_ij)
  where S̄_ij includes all the elements of S except the (i, j) one.
• The elegant dynamic programming is not applicable: the problem is much harder now!
• We are gonna see an application of MRF for Image Segmentation and Restoration.
MRF for Image Segmentation

• Cliques: sets of pixels which are neighbors of each other (w.r.t. the type of neighborhood).
MRF for Image Segmentation

• Dual lattice
• Line process

[Figure illustrating the dual lattice and the line process.]
MRF for Image Segmentation

• Gibbs distribution:
  π(s) = (1/Z) exp(−U(s)/T)
  – Z: normalizing constant
  – T: parameter
• It turns out that the Gibbs distribution implies MRF ([Geman 84]).
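A minimal numerical illustration of the Gibbs distribution: a 2×2 binary labeling with 4-neighbour pairwise cliques and a Potts-style energy U. Both the energy function and the parameter values are assumptions for illustration, not the clique potentials of [Geman 84]. Enumerating all 16 configurations shows the distribution is properly normalized and that smooth labelings are the most probable under this prior.

```python
# Gibbs distribution pi(s) = exp(-U(s)/T) / Z on a 2x2 binary label field.
import itertools
import math

Temp = 1.0                                     # the temperature parameter T
pairs = [((0, 0), (0, 1)), ((1, 0), (1, 1)),   # horizontal neighbour cliques
         ((0, 0), (1, 0)), ((0, 1), (1, 1))]   # vertical neighbour cliques

def U(s):
    """Potts-style energy: each pairwise clique contributes +1 when its labels disagree."""
    return sum(1.0 for a, b in pairs if s[a] != s[b])

# All 2^4 = 16 labelings of the 2x2 grid, as dicts keyed by pixel coordinate.
configs = [{(i, j): v for (i, j), v in zip(itertools.product(range(2), range(2)), vals)}
           for vals in itertools.product([0, 1], repeat=4)]

Z = sum(math.exp(-U(s) / Temp) for s in configs)   # normalizing constant

def gibbs(s):
    return math.exp(-U(s) / Temp) / Z

p_uniform = gibbs(configs[0])   # all-zero labeling: energy 0, hence maximal probability
```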
MRF for Image Segmentation

• Then, the joint probability for the Gibbs model is:
  p(S) = (1/Z) exp(− Σ_k Σ_{(i,j)} F_k C_k(i, j) / T)
  – The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X | S).
• Then p(X | S) p(S) can be maximized... [Geman 84]
What you should know

• Markov property, Markov Chain
• HMM:
  – Defining and computing α_t(i)
  – Viterbi algorithm
  – Outline of the EM algorithm for HMM
• Markov Random Field
  – And an application in Image Segmentation
  – [Geman 84] for more information.
Q & A
References

• L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Andrew W. Moore, "Hidden Markov Models", http://www.autonlab.org/tutorials/
• Geman S., Geman D., "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.