Markov Models & Hidden Markov Models
Time-based Models
• Simple parametric distributions are typically built on the “independence assumption”: each data point is independent of the others, and there is no time-sequencing or ordering.
• What if the data has correlations based on its order, like a time series?
States
• An atomic event is an assignment to every random variable in the domain.
• States are atomic events that can transfer from one to another.
• Suppose a model has n states.
• A state-transition diagram describes how the model behaves.
State-transition
We make the following assumptions:
• Transition probabilities are stationary.
• The event space does not change over time.
• The probability distribution over next states depends only on the current state.
Markov Assumption
Markov random processes
• A random sequence has the Markov property if the distribution of its next state is determined solely by its current state.
• Any random process having this property is called a Markov random process.
• A system with states that obey the Markov assumption is called a Markov Model.
• A sequence of states resulting from such a model is called a Markov Chain.
Chain Rule & Markov Property
Chain rule (Bayes rule, applied repeatedly):
  P(q1, …, qt) = P(qt | q1, …, qt−1) P(q1, …, qt−1)
               = P(qt | q1, …, qt−1) P(qt−1 | q1, …, qt−2) P(q1, …, qt−2)
               = P(q1) ∏_{i=2}^{t} P(qi | q1, …, qi−1)

Markov property:
  P(qi | q1, …, qi−1) = P(qi | qi−1)  for i > 1

Therefore:
  P(q1, …, qt) = P(q1) P(q2 | q1) P(q3 | q2) … P(qt | qt−1)
Markov Assumption
• The Markov assumption states that the probability of the occurrence of word wi at time t depends only on the occurrence of word wi−1 at time t−1.
  – Chain rule:
    P(w1, …, wn) = P(w1) ∏_{i=2}^{n} P(wi | w1, …, wi−1)
  – Markov assumption:
    P(w1, …, wn) ≈ P(w1) ∏_{i=2}^{n} P(wi | wi−1)
Andrei Andreyevich Markov
Born: 14 June 1856 in Ryazan, Russia
Died: 20 July 1922 in Petrograd (now St Petersburg), Russia
Markov is particularly remembered for his study of Markov chains: sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.
A Markov System

[Diagram: three states s1, s2, s3 with transition arcs; N = 3]

• Has N states, called s1, s2, …, sN.
• There are discrete timesteps, t = 0, t = 1, …
• On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2, …, sN}.
  – For example, at t = 0 the current state might be qt = q0 = s3; at t = 1, qt = q1 = s2.
• Between each timestep, the next state is chosen randomly.
• The current state determines the probability distribution for the next state:

  P(qt+1=s1 | qt=s1) = 0     P(qt+1=s1 | qt=s2) = 1/2   P(qt+1=s1 | qt=s3) = 1/3
  P(qt+1=s2 | qt=s1) = 0     P(qt+1=s2 | qt=s2) = 1/2   P(qt+1=s2 | qt=s3) = 2/3
  P(qt+1=s3 | qt=s1) = 1     P(qt+1=s3 | qt=s2) = 0     P(qt+1=s3 | qt=s3) = 0

• This is often notated with arcs between states, labelled with the transition probabilities: 1 on s1→s3, 1/2 on s2→s1 and s2→s2, 1/3 on s3→s1, 2/3 on s3→s2.
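The transition table above fully specifies the system, so it can be simulated directly. A minimal sketch (the probabilities are the ones from the slide; the helper functions themselves are illustrative):

```python
import random

# Transition probabilities from the slide: P[current][next] = P(q_{t+1} = next | q_t = current).
P = {
    "s1": {"s1": 0.0, "s2": 0.0, "s3": 1.0},
    "s2": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
    "s3": {"s1": 1/3, "s2": 2/3, "s3": 0.0},
}

def step(state, rng=random):
    """Sample the next state given only the current one (the Markov property)."""
    states = list(P[state])
    weights = [P[state][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

def run(start, T, rng=random):
    """Generate a chain q_0, q_1, ..., q_T starting from `start`."""
    chain = [start]
    for _ in range(T):
        chain.append(step(chain[-1], rng))
    return chain

print(run("s3", 5))
```

Note that from s1 the system moves to s3 with certainty, matching the table.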
Markov Property

qt+1 is conditionally independent of { qt−1, qt−2, …, q1, q0 } given qt. In other words:

  P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history)

Notation:
• Initial probability:    πi = P(q1 = si)
• Transition probability: aij = P(qt+1 = sj | qt = si)
Example: A Simple Markov Model For Weather Prediction
• On any given day, the weather can be described as being in one of three states:
  – State 1: precipitation (rain, snow, hail, etc.)
  – State 2: cloudy
  – State 3: sunny
• Transitions between states are described by the transition matrix. This model can then be described by the following directed graph.
Basic Calculations
• Example: What is the probability that the weather for eight consecutive days is “sun-sun-sun-rain-rain-sun-cloudy-sun”?
• Solution: O = sun sun sun rain rain sun cloudy sun corresponds to the state sequence 3 3 3 1 1 3 2 3, so
  P(O | model) = π3 · a33 · a33 · a31 · a11 · a13 · a32 · a23
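With numbers for the transition matrix, this calculation is a single product. The values below are an assumption (the slide's matrix did not survive extraction; these are the values commonly used with this rain/cloudy/sunny example):

```python
# Assumed transition matrix A[i][j] = P(next = j | current = i),
# with 0 = rain, 1 = cloudy, 2 = sunny. These numbers are illustrative.
A = [
    [0.4, 0.3, 0.3],  # from rain
    [0.2, 0.6, 0.2],  # from cloudy
    [0.1, 0.1, 0.8],  # from sunny
]

def seq_probability(states, A, first_prob=1.0):
    """P(q1, ..., qT) = P(q1) * product of a_{q_t, q_{t+1}}."""
    p = first_prob
    for cur, nxt in zip(states, states[1:]):
        p *= A[cur][nxt]
    return p

# sun-sun-sun-rain-rain-sun-cloudy-sun; condition on day 1 being sunny, so P(q1) = 1.
O = [2, 2, 2, 0, 0, 2, 1, 2]
print(seq_probability(O, A))
```

The product is a33 · a33 · a31 · a11 · a13 · a32 · a23 under the assumed matrix.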
From Markov To Hidden Markov
• The previous model assumes that each state can be uniquely associated with an observable event.
  – Once an observation is made, the state of the system is then trivially retrieved.
  – This model, however, is too restrictive to be of practical use for most realistic problems.
• To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state.
  – Each state can produce a number of outputs according to a unique probability distribution, and each distinct output can potentially be generated at any state.
  – These are known as Hidden Markov Models (HMMs), because the state sequence is not directly observable; it can only be approximated from the sequence of observations produced by the system.
The coin-toss problem
• To illustrate the concept of an HMM, consider the following scenario:
  – Assume that you are placed in a room with a curtain.
  – Behind the curtain there is a person performing a coin-toss experiment.
  – This person selects one of several coins and tosses it: heads (H) or tails (T).
  – The person tells you the outcome (H or T), but not which coin was used each time.
• Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}.
  – The coins represent the states; these are hidden because you do not know which coin was tossed each time.
  – The outcome of each toss represents an observation.
  – A “likely” sequence of coins may be inferred from the observations, but this state sequence will not be unique.
The Coin Toss Example – 2 coins
From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins
Hidden model
• As spectators, we cannot tell which coin is being used; all we can observe is the output (head/tail).
• We assume the outputs are governed by each coin's tendencies (output probabilities).
Coin Toss Example

[Graphical model: a chain of hidden state variables C1, C2, …, CL−1, CL, each emitting an observation P1, P2, …, PL−1, PL; L is the sequence length.]

• Ci: hidden state variables = coins
• Pi: observed data (“output”) = heads/tails
Hidden Markov Models
• Used when states cannot be directly observed; good for noisy data.
• Requirements:
  – A finite number of states, each with an output probability distribution
  – State transition probabilities
  – An observed phenomenon, which can be randomly generated given state-associated probabilities
HMM Notation (from Rabiner's survey*)
The states are labeled S1, S2, …, SN.
For a particular trial:
  Let T be the number of observations. T is also the number of states passed through.
  O = O1 O2 … OT is the sequence of observations.
  Q = q1 q2 … qT is the notation for a path of states.
  λ = (N, M, πi, aij, bi(j)) is the specification of an HMM.

*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257–286, 1989.
HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of:
• N, the number of states
• M, the number of possible observations
• {π1, π2, …, πN}, the starting state probabilities:
  P(q0 = Si) = πi
• The state transition probabilities, arranged as an N×N matrix:
  a11 a12 … a1N
  a21 a22 … a2N
   :   :      :
  aN1 aN2 … aNN
  where P(qt+1 = Sj | qt = Si) = aij
• The observation probabilities:
  b1(1) b1(2) … b1(M)
  b2(1) b2(2) … b2(M)
   :     :        :
  bN(1) bN(2) … bN(M)
  where P(Ot = k | qt = Si) = bi(k)
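The 5-tuple can be captured directly as a small data structure. A sketch with made-up numbers for a two-state, two-symbol model (class and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """λ = (N, M, pi, A, B), mirroring the slide's 5-tuple."""
    pi: list  # pi[i] = P(q0 = S_i), length N
    A: list   # A[i][j] = P(q_{t+1} = S_j | q_t = S_i), N x N
    B: list   # B[i][k] = P(O_t = k | q_t = S_i), N x M

    @property
    def N(self):
        return len(self.pi)

    @property
    def M(self):
        return len(self.B[0])

    def validate(self, tol=1e-9):
        """Every probability distribution in the model must sum to 1."""
        assert abs(sum(self.pi) - 1) < tol
        assert all(abs(sum(row) - 1) < tol for row in self.A)
        assert all(abs(sum(row) - 1) < tol for row in self.B)

# A toy model with made-up numbers:
hmm = HMM(pi=[0.6, 0.4],
          A=[[0.7, 0.3], [0.4, 0.6]],
          B=[[0.9, 0.1], [0.2, 0.8]])
hmm.validate()
print(hmm.N, hmm.M)
```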
Assumptions
• Markov assumption: a state depends only on the previous state.
• Stationary assumption: transition probabilities are independent of time (“memoryless”).
• Output independence: observations are independent of previous observations.
The three main questions on HMMs
• Evaluation: what is the probability that the observations were generated by a given model?
• Decoding: given a model and a sequence of observations, what is the most likely state sequence?
• Learning: given a model and a sequence of observations, how should we modify the model parameters to maximize P(observations | model)?
The three main questions on HMMs
1. Evaluation
   GIVEN an HMM M and a sequence x,
   FIND Prob[ x | M ]
2. Decoding
   GIVEN an HMM M and a sequence x,
   FIND the state path Q that maximizes P[ x, Q | M ]
3. Learning
   GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x,
   FIND parameters θ = (bi(.), aij) that maximize P[ x | θ ]
Let’s not be confused by notation
P[ x | M ]: the probability that sequence x was generated by the model.
The model is: architecture (number of states, etc.) + parameters θ = (aij, bi(.)).
So P[ x | θ ] and P[ x ] are the same, when the architecture and the entire model, respectively, are implied.
Similarly, P[ x, Q | M ] and P[ x, Q ] are the same.
In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ that maximizes P[ x | θ ].
Specification of an HMM
• N — the number of states
  – Q = {q1, q2, …, qN} — set of states
• M — the number of symbols (observables)
  – O = {o1, o2, …, oM} — set of symbols
• A — the state transition probability matrix
  – aij = P(qt+1 = j | qt = i)
• B — the observation probability distribution
  – bj(k) = P(ot = k | qt = j),  1 ≤ k ≤ M
• π — the initial state distribution
Central problems in HMM modelling
• Problem 1 — Evaluation:
  – Probability of occurrence of a particular observation sequence, O = {o1, …, ok}, given the model: P(O | λ)
  – Complicated because of the hidden states
  – Useful in sequence classification
Central problems in HMM modelling
• Problem 2 — Decoding:
  – Optimal state sequence to produce the given observations, O = {o1, …, ok}, given the model
  – Requires an optimality criterion
  – Useful in recognition problems
Central problems in HMM modelling
• Problem 3 — Learning:
  – Determine the optimum model, given a training set of observations
  – Find λ such that P(O | λ) is maximal
Task: Part-Of-Speech Tagging
• Goal: Assign the correct part-of-speech to each word (and punctuation) in a text.
• Example:
    Two  old  men  bet  on    the  game  .
    CRD  ADJ  NN   VBD  Prep  Det  NN    SYM
• Learn a local model of POS dependencies, usually from pretagged data.
Hidden Markov Models
• Assume: the POS tags are generated as a random process, and each POS randomly generates a word.
[Diagram: a small HMM over the tags Det, NN, NNS, ADJ, with transition probabilities (e.g. 0.2, 0.3, 0.5, 0.9, 0.1) on the arcs; Det emits “the” and “a” (e.g. 0.6, 0.4), and the noun states emit words such as “cat”, “bet”, “cats”, “men”.]
HMMs For Tagging
• First-order (bigram) Markov assumptions:
  – Limited horizon: a tag depends only on the previous tag
    P(ti+1 = tk | t1 = tj1, …, ti = tji) = P(ti+1 = tk | ti = tj)
  – Time invariance: no change over time
    P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj → tk)
• Output probabilities:
  – Probability of getting word wk for tag tj: P(wk | tj)
  – Assumption: not dependent on other tags or words!
Combining Probabilities
• Probability of a tag sequence:
  P(t1 t2 … tn) = P(t1) P(t1→t2) P(t2→t3) … P(tn−1→tn)
  Assuming a starting tag t0:
  = P(t0→t1) P(t1→t2) P(t2→t3) … P(tn−1→tn)
• Probability of a word sequence and tag sequence:
  P(W, T) = ∏i P(ti−1→ti) P(wi | ti)
Training from labeled training data
• Labeled training data = each word has a POS tag
• Thus:
  π(tj) = PMLE(tj) = C(tj) / N
  a(tj→tk) = PMLE(tk | tj) = C(tj, tk) / C(tj)
  b(wk | tj) = PMLE(wk | tj) = C(tj : wk) / C(tj)
• Smoothing applies as usual
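These maximum-likelihood estimates are just relative counts. A sketch on a tiny made-up tagged corpus (the corpus, tag names, and START symbol are all illustrative):

```python
from collections import Counter

# Tiny hand-labeled corpus (made up): each sentence is a list of (word, tag)
# pairs, with an implicit START tag before each sentence.
corpus = [
    [("the", "Det"), ("cat", "NN"), ("sleeps", "VB")],
    [("the", "Det"), ("dogs", "NNS"), ("sleep", "VB")],
]

START = "<s>"
tag_count = Counter()
bigram_count = Counter()
emit_count = Counter()

for sent in corpus:
    prev = START
    tag_count[START] += 1
    for word, tag in sent:
        tag_count[tag] += 1
        bigram_count[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        prev = tag

def a(t_prev, t):
    """Transition MLE: C(t_prev, t) / C(t_prev)."""
    return bigram_count[(t_prev, t)] / tag_count[t_prev]

def b(word, tag):
    """Emission MLE: C(tag, word) / C(tag)."""
    return emit_count[(tag, word)] / tag_count[tag]

print(a(START, "Det"), b("the", "Det"))
```

On real data these raw counts would be smoothed, as the slide notes.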
Three Basic Problems
• Compute the probability of a text: Pλ(W1,N)
• Compute the maximum probability tag sequence: argmaxT1,N Pλ(T1,N | W1,N)
• Compute the maximum likelihood model: argmaxλ Pλ(W1,N)
Problem 1: Naïve solution
• State sequence Q = (q1, …, qT)
• Assume independent observations:
  P(O | q, λ) = ∏_{t=1}^{T} P(ot | qt, λ) = b_{q1}(o1) b_{q2}(o2) … b_{qT}(oT)
NB: Observations are mutually independent, given the hidden states. (The joint distribution of independent variables factorises into the marginal distributions of the independent variables.)
Problem 1: Naïve solution
• Observe that:
  P(q | λ) = π_{q1} a_{q1q2} a_{q2q3} … a_{q(T−1)qT}
• And that:
  P(O | λ) = Σ_q P(O | q, λ) P(q | λ)
Problem 1: Naïve solution
• Finally get:
  P(O | λ) = Σ_q P(O | q, λ) P(q | λ)
NB:
– The above sum is over all state paths.
– There are N^T state paths, each ‘costing’ O(T) calculations, leading to O(T·N^T) time complexity.
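The naïve sum can be written down directly by enumerating all N^T paths. A sketch on a made-up two-state model (all numbers illustrative); it is exponential in T, which is exactly why the forward algorithm is needed:

```python
from itertools import product

def naive_evaluate(pi, A, B, obs):
    """P(O | λ) = sum over all N^T state paths of P(O | q, λ) P(q | λ). O(T N^T)."""
    N, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(N), repeat=T):
        p = pi[path[0]] * B[path[0]][obs[0]]       # pi_{q1} * b_{q1}(o1)
        for t in range(1, T):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

# Toy two-state, two-symbol model; observations coded as 0/1.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(naive_evaluate(pi, A, B, [0, 1, 0]))
```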
Problem 1: Efficient solution
• Define the auxiliary forward variable α:
  αt(i) = P(o1, …, ot, qt = i | λ)
  αt(i) is the probability of observing the partial sequence of observables o1, …, ot such that at time t the state is qt = i.
Forward algorithm:
Problem 1: Efficient solution
• Recursive algorithm:
  – Initialise:
    α1(i) = πi bi(o1)
  – Calculate:
    αt+1(j) = [ Σ_{i=1}^{N} αt(i) aij ] bj(ot+1)
    (partial obs. sequence to t AND state i at t) × (transition to j at t+1) × (sensor);
    the sum runs over all i, since j can be reached from any preceding state
  – Obtain:
    P(O | λ) = Σ_{i=1}^{N} αT(i)
    (sum over the different ways of getting the observation sequence)
• Complexity is O(N²T)
Forward Algorithm
Define αk(i) = P(w1,k, tk = ti)
1. For i = 1 to N: α1(i) = a(t0→ti) b(w1 | ti)
2. For k = 2 to T; for j = 1 to N:
   αk(j) = [Σi αk−1(i) a(ti→tj)] b(wk | tj)
3. Then: Pλ(W1,T) = Σi αT(i)
Complexity = O(N²T)
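The recursion in code: a sketch of the forward algorithm on a made-up two-state model (all numbers illustrative):

```python
def forward(pi, A, B, obs):
    """alpha_t(i) = P(o_1..o_t, q_t = i); returns P(O | λ) in O(N^2 T)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]       # initialise
    for o in obs[1:]:                                       # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                       # terminate

# Toy two-state, two-symbol model; observations coded as 0/1.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward(pi, A, B, [0, 1, 0]))
```

The result agrees with the exponential brute-force sum over paths, at O(N²T) cost.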
Forward Algorithm
[Trellis diagram: tags t1–t5 as nodes in each of three columns for words w1, w2, w3; α1(i) is set from a(t0→ti), each αk(j) is computed from the previous column via the transitions a(ti→tj), and summing the final column gives Pλ(W1,3).]
Problem 1: Alternative solution
• Define the auxiliary backward variable β:
  βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)
  βt(i) is the probability of observing the sequence of observables ot+1, …, oT given state qt = i at time t.
Backward algorithm:
Problem 1: Alternative solution
• Recursive algorithm:
  – Initialise:
    βT(i) = 1
  – Calculate:
    βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j),   t = T−1, …, 1
  – Terminate:
    P(O | λ) = Σ_{i=1}^{N} πi bi(o1) β1(i)
• Complexity is O(N²T)
Backward Algorithm
Define βk(i) = P(wk+1,N | tk = ti) — note the difference!
1. For i = 1 to N: βT(i) = 1
2. For k = T−1 to 1; for j = 1 to N:
   βk(j) = Σi a(tj→ti) b(wk+1 | ti) βk+1(i)
3. Then: Pλ(W1,T) = Σi a(t0→ti) b(w1 | ti) β1(i)
Complexity = O(N²T)
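The backward recursion in code: a sketch on a made-up two-state model (all numbers illustrative). Run forwards or backwards, the same P(O | λ) comes out:

```python
def backward(pi, A, B, obs):
    """beta_t(i) = P(o_{t+1}..o_T | q_t = i); returns P(O | λ) in O(N^2 T)."""
    N = len(pi)
    beta = [1.0] * N                                  # beta_T(i) = 1
    for o in reversed(obs[1:]):                       # t = T-1, ..., 1
        beta = [sum(A[i][j] * B[j][o] * beta[j] for j in range(N))
                for i in range(N)]
    # Terminate: P(O | λ) = sum_i pi_i b_i(o_1) beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in range(N))

# Toy two-state, two-symbol model; observations coded as 0/1.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(backward(pi, A, B, [0, 1, 0]))
```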
Backward Algorithm
[Trellis diagram: tags t1–t5 for words w1, w2, w3; β3(i) = 1 at the last column, each βk(j) is computed from the following column via a(tj→ti), and combining β1(i) with a(t0→ti) b(w1 | ti) gives Pλ(W1,3).]
Viterbi Algorithm (Decoding)
• Most probable tag sequence given text:
  T* = argmaxT Pλ(T | W)
     = argmaxT Pλ(W | T) Pλ(T) / Pλ(W)    (Bayes' theorem)
     = argmaxT Pλ(W | T) Pλ(T)            (W is constant for all T)
     = argmaxT ∏i a(ti−1→ti) b(wi | ti)
     = argmaxT Σi log[ a(ti−1→ti) b(wi | ti) ]
[Trellis: tags t1, t2, t3 at each of the words w1, w2, w3, starting from t0.]

Example transition and emission probabilities (and their negative base-10 logs):

A(·,·)   t1      t2      t3
t0       0.005   0.02    0.1
t1       0.02    0.1     0.005
t2       0.5     0.0005  0.0005
t3       0.05    0.05    0.005

B(·,·)   w1      w2      w3
t1       0.2     0.005   0.005
t2       0.02    0.2     0.0005
t3       0.02    0.02    0.05

−log A   t1      t2      t3
t0       2.3     1.7     1
t1       1.7     1       2.3
t2       0.3     3.3     3.3
t3       1.3     1.3     2.3

−log B   w1      w2      w3
t1       0.7     2.3     2.3
t2       1.7     0.7     3.3
t3       1.7     1.7     1.3
[Trellis for the three-word example: each node (tag tj, word wi) carries a running log-probability D(i, tj) (e.g. −1.7, −0.3, −1.3 in the first column; −6, −4.7, −6.7 later), and the best score in the final column gives the most probable tag sequence.]
Viterbi Algorithm
1. D(0, START) = 0
2. For each tag t ≠ START: D(0, t) = −∞
3. For i ← 1 to N:
   a. For each tag tj:
      D(i, tj) ← maxk [ D(i−1, tk) + log b(wi | tj) + log a(tk→tj) ]
4. log P(W, T) = maxj D(N, tj)
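The same dynamic program with backpointers for recovering the path: a sketch on a made-up two-state model (all numbers illustrative; zero probabilities would need explicit −inf handling, avoided here by keeping all entries positive):

```python
import math

def viterbi(pi, A, B, obs):
    """Most probable state path, scored in log space as on the slide."""
    N = len(pi)
    D = [math.log(pi[i]) + math.log(B[i][obs[0]]) for i in range(N)]
    back = []                                  # backpointers per timestep
    for o in obs[1:]:
        prev, D, ptr = D, [], []
        for j in range(N):
            k = max(range(N), key=lambda i: prev[i] + math.log(A[i][j]))
            D.append(prev[k] + math.log(A[k][j]) + math.log(B[j][o]))
            ptr.append(k)
        back.append(ptr)
    best = max(range(N), key=lambda j: D[j])
    path = [best]
    for ptr in reversed(back):                 # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path)), D[best]

# Toy two-state, two-symbol model; observations coded as 0/1.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path, logp = viterbi(pi, A, B, [0, 1, 0])
print(path)
```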
[HMM diagram: from “start”, enter state “fair” or “loaded” with probability 0.5 each; each state stays put with probability 0.9 and switches to the other coin with probability 0.1; each state emits Heads or Tails.]
Question: Suppose the sequence of our game is HHHTHHHTTHHTH.
What is the probability of the sequence given the model?
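This is exactly the evaluation problem, answered by the forward algorithm. A sketch: the transition structure (0.5/0.5 start, stay 0.9, switch 0.1) is taken from the slide, but the emission probabilities did not survive extraction, so the loaded coin's head bias of 0.75 below is an assumption:

```python
pi = [0.5, 0.5]                   # start in fair / loaded with equal prob
A = [[0.9, 0.1],                  # fair:   stay 0.9, switch 0.1
     [0.1, 0.9]]                  # loaded: stay 0.9, switch 0.1
B = [[0.5, 0.5],                  # fair coin:   P(H), P(T)
     [0.75, 0.25]]                # loaded coin: assumed head bias

def forward_prob(pi, A, B, obs):
    """P(O | λ) via the forward recursion."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)

seq = [0 if c == "H" else 1 for c in "HHHTHHHTTHHTH"]
print(forward_prob(pi, A, B, seq))
```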
Decoding
• Suppose we have a text written by Shakespeare and a monkey. Can we tell who wrote what?
• Text: Shakespeare or Monkey?
• Case 1: – Fehwufhweuromeojulietpoisonjigjreijge
• Case 2:– mmmmbananammmmmmmbananammm