CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
Soleymani
Hidden Markov Models (HMMs)
Sequential Data
The i.i.d. assumption is a poor fit for many applications, e.g., sequential data.
Sequential data: examples
Time series: weather, stock market forecasts
DNA, protein sequences
Speech, online handwriting
Sequence of characters in an English sentence
[Bishop]
Markov Chain
First-order Markov chain: $p(x_t \mid x_1, \dots, x_{t-1}) = p(x_t \mid x_{t-1})$, so $p(x_1, \dots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})$
$M$-th-order Markov chain: $p(x_t \mid x_1, \dots, x_{t-1}) = p(x_t \mid x_{t-M}, \dots, x_{t-1})$
Zero-order: observations are independent; second-order: each observation depends on the previous two.
The model is specified completely by the prior probabilities of states and the probabilities of transition between states.
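As a small illustration (not from the slides), a sketch of a first-order chain specified by an assumed prior and transition table, sampled with NumPy; the weather states and the numbers are illustrative assumptions:

```python
import numpy as np

# The chain is fully specified by the prior pi over states and the
# transition matrix A; states and numbers below are assumed for illustration.
states = ["sunny", "rainy"]
pi = np.array([0.7, 0.3])                  # p(x_1 = i)
A = np.array([[0.8, 0.2],                  # p(x_t = j | x_{t-1} = sunny)
              [0.4, 0.6]])                 # p(x_t = j | x_{t-1} = rainy)

rng = np.random.default_rng(0)
x = [rng.choice(2, p=pi)]                  # sample x_1 from the prior
for t in range(1, 7):
    x.append(rng.choice(2, p=A[x[-1]]))    # sample x_t from p(x_t | x_{t-1})
print([states[i] for i in x])
```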
Markov Models, Markov Chains, and HMMs
Markov models (general): future predictions are independent of all but the most recent observations.
Markov chains: when the (discrete) states are known from observable data, Markov models lead to Markov chains. They are tractable but severely limited.
Hidden Markov Models (HMMs): the (discrete) states are not observable, but observations giving some information about the sequence of states are available. HMMs are more general than Markov chains, while still retaining tractability, through the introduction of latent variables, leading to state space models.
Hidden Markov Models (HMMs)
The state is not directly visible, but the output, which depends on the state, is visible: each observation variable has a corresponding latent (hidden) state variable.
Assumption: latent variables form a Markov chain
Each state has a probability distribution over possible outputs, so the sequence of observations generated by an HMM gives some information about the sequence of states.
HMM: Probabilistic Model
Transition distribution: a table of numbers giving the transition probabilities between states, $a_{ij} \equiv p(S_{t+1} = j \mid S_t = i)$.
Initial state distribution: the initial latent node has no parent node, so it needs a vector of probabilities $\pi_i \equiv p(S_1 = i)$.
Observation model: the conditional distributions of the observed variables, $p(O_t \mid S_t, \varphi)$, where $\varphi$ is the set of parameters of the distribution.
E.g., a Gaussian distribution when observations are continuous, or a table of probabilities when observations are discrete (a set of symbols).
$S$: states (latent variables); $O$: observations
HMM: Probabilistic Model (Discrete Case)
An HMM is thus specified as a triplet $\lambda = (A, B, \pi)$:
$A$: the state transition probability matrix, $a_{ij} = p(S_{t+1} = j \mid S_t = i)$, $i, j = 1, \dots, N$
$\pi$: initial state probabilities, $\pi_i \equiv p(S_1 = i)$
$B$: emission (observation) probabilities, $b_j(k) = p(O_t = k \mid S_t = j)$, $j = 1, \dots, N$, $k = 1, \dots, M$
$N$: number of states; $M$: number of symbols (observables)
$S$: latent variables as states; $O$: discrete observations as symbols
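To make the triplet concrete, a minimal sketch of how $\lambda = (A, B, \pi)$ might be stored; the class name `DiscreteHMM` and the toy numbers are assumptions for illustration, not part of the slides:

```python
import numpy as np

class DiscreteHMM:
    """Minimal container for the triplet lambda = (A, B, pi)."""
    def __init__(self, A, B, pi):
        self.A = np.asarray(A, dtype=float)    # A[i, j] = p(S_{t+1}=j | S_t=i), N x N
        self.B = np.asarray(B, dtype=float)    # B[j, k] = p(O_t=k | S_t=j),     N x M
        self.pi = np.asarray(pi, dtype=float)  # pi[i]  = p(S_1=i),              length N

# Toy model with N = 2 states and M = 3 symbols (numbers are illustrative)
hmm = DiscreteHMM(A=[[0.7, 0.3], [0.4, 0.6]],
                  B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
                  pi=[0.6, 0.4])
```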
HMM: Example
HMMs generalize mixture models.
[Figure: a GMM (left) and an HMM (right) fit to the same data, with states 1, 2, 3]
Example: in the right figure, after constructing an HMM from the training data, the most probable state sequence for the two observations is (1, 1). However, under the i.i.d. assumption, according to the GMM in the left figure, the most probable state for the first observation is 1 and for the second is 3.
[Bishop]
HMM: Properties
Some degree of invariance to local warping (compression and stretching) of the time axis.
Speech recognition: warping of the time axis is associated with natural variations in the speed of speech; an HMM can accommodate such a distortion without penalizing it too heavily.
[Bishop]
HMMs: Applications
Speech and handwriting: online handwriting recognition; speech recognition, processing, and synthesis
Text processing: natural language modeling; parsing raw records into structured records
Bioinformatics: analysis of biological sequences such as proteins and DNA
Finance: e.g., stock market forecasts
Main Questions in HMMs
Evaluation problem: How likely is the sequence of observations, given our model? Compute $p(O_1, \dots, O_T \mid \lambda)$. Useful in sequence classification.
Decoding problem: What is the sequence of latent variables corresponding to the observations? Find $\arg\max_{S_1, \dots, S_T} p(S_1, \dots, S_T \mid O_1, \dots, O_T, \lambda)$, the most likely state sequence that produces the given observations.
Learning problem: Learn the parameters $\lambda = \{A, B, \pi\}$ from a set of training data; determine the optimum model given a training set of observations: $\lambda^* = \arg\max_{\lambda} p(O_1, \dots, O_T \mid \lambda)$.
Explanation
Given an HMM and an observation history $O_1, \dots, O_t$, find a sequence of states that best explains the observations. Decoding is a special case of the explanation problem.
Slightly different versions of the explanation problem:
Decoding: find the most likely state history $S_1, \dots, S_t$ given the observation history $O_1, \dots, O_t$ ($\arg\max p(S_1, \dots, S_t \mid O_1, \dots, O_t) = ?$).
Filtering: given observations up to time $t$, compute the distribution of $S_t$ ($p(S_t \mid O_1, \dots, O_t) = ?$).
Smoothing: given observations up to time $t$, compute the distribution of $S_{t'}$, $t' < t$ ($p(S_{t'} \mid O_1, \dots, O_t) = ?$).
Prediction: given measurements up to time $t$, compute the distribution of $S_{t'}$, $t' > t$ ($p(S_{t'} \mid O_1, \dots, O_t) = ?$).
Core Questions
How do we calculate the likelihood of an observation sequence, $p(O_1, \dots, O_T)$? How do we calculate the posterior over states, $p(S_t \mid O_1, \dots, O_T)$? How do we train the HMM parameters given its structure and:
Fully observed training examples: $\langle O_1, \dots, O_T, S_1, \dots, S_T \rangle$
Partially observed training examples: $\langle O_1, \dots, O_T \rangle$
Evaluation Problem
Goal: compute $p(O_1, \dots, O_T \mid \lambda)$. To compute it efficiently, we use variable elimination (dynamic programming):
Forward algorithm
Backward algorithm
Evaluation Problem: Forward Algorithm
$\alpha_t(i) = p(O_1, O_2, \dots, O_t, S_t = i)$: the probability of observing $O_1, \dots, O_t$ and ending in state $S_t = i$.
Recursive relation: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, p(S_{t+1} = j \mid S_t = i) \right] p(O_{t+1} \mid S_{t+1} = j)$
Initialization: $\alpha_1(i) = p(O_1, S_1 = i) = p(O_1 \mid S_1 = i)\, \pi_i$, for $i = 1, \dots, N$
Iterations: for $t = 1$ to $T - 1$ and $j = 1, \dots, N$: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(O_{t+1})$
Final computation: $p(O_1, \dots, O_T) = \sum_{i=1}^{N} \alpha_T(i)$
(Dynamic programming)
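A sketch of the forward recursion for the `DiscreteHMM` container assumed above (unscaled; a practical implementation would rescale $\alpha_t$ or work in log space to avoid underflow):

```python
import numpy as np

def forward(hmm, obs):
    """Forward algorithm for the DiscreteHMM container sketched earlier.

    obs is a sequence of symbol indices; returns alpha (T x N) and
    p(O_1..T | lambda). Unscaled, so very long sequences will underflow.
    """
    T, N = len(obs), len(hmm.pi)
    alpha = np.zeros((T, N))
    alpha[0] = hmm.pi * hmm.B[:, obs[0]]               # alpha_1(i) = b_i(O_1) * pi_i
    for t in range(T - 1):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
        alpha[t + 1] = (alpha[t] @ hmm.A) * hmm.B[:, obs[t + 1]]
    return alpha, alpha[-1].sum()                      # p(O) = sum_i alpha_T(i)

# Example: alpha, p_obs = forward(hmm, [0, 2, 1]) for the toy model above
```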
Evaluation Problem: Forward Algorithm
$\alpha_t$ contains all of the relevant information about the past observations for the purpose of prediction.
[Figure: the forward recursion as message passing along the chain, computing $\alpha_1(\cdot), \alpha_2(\cdot), \dots$ from left to right]
Evaluation Problem: Backward Algorithm
$\beta_t(i) = p(O_{t+1}, O_{t+2}, \dots, O_T \mid S_t = i)$: the probability of observing $O_{t+1}, \dots, O_T$ given $S_t = i$.
Recursive relation: $\beta_t(i) = \sum_{j=1}^{N} p(S_{t+1} = j \mid S_t = i)\, p(O_{t+1} \mid S_{t+1} = j)\, \beta_{t+1}(j)$
Initialization: $\beta_T(i) = 1$, for $i = 1, \dots, N$
Iterations: for $t = T - 1$ down to 1 and $i = 1, \dots, N$: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$
Final computation: $p(O_1, \dots, O_T) = \sum_{i=1}^{N} \pi_i\, b_i(O_1)\, \beta_1(i)$
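A matching sketch of the backward recursion, again for the assumed `DiscreteHMM` container:

```python
import numpy as np

def backward(hmm, obs):
    """Backward algorithm for the same assumed DiscreteHMM container."""
    T, N = len(obs), len(hmm.pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                     # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[t] = hmm.A @ (hmm.B[:, obs[t + 1]] * beta[t + 1])
    return beta

# p(O_1..T) can equivalently be computed as sum_i pi_i * b_i(O_1) * beta_1(i)
```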
Evaluation Problem: Backward Algorithm
[Figure: the backward recursion as message passing along the chain, starting from $\beta_T(\cdot) = 1$ and moving right to left]
Classification Using Evaluation
Given observations and trained HMM models, one per class: $\lambda_1 = \{A_1, B_1, \pi_1\}, \dots, \lambda_C = \{A_C, B_C, \pi_C\}$
Bayesian decision: $c^* = \arg\max_{c} p(\lambda_c \mid O_1, \dots, O_T) = \arg\max_{c} p(O_1, \dots, O_T \mid \lambda_c)\, p(\lambda_c)$
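A sketch of this Bayesian decision rule using the forward likelihoods computed above; `models` and `priors` are illustrative names, not from the slides:

```python
import numpy as np

def classify(models, priors, obs):
    """Bayesian decision among trained HMMs lambda_1, ..., lambda_C.

    models: list of the assumed DiscreteHMM containers, one per class;
    priors: class priors p(lambda_c); obs: a sequence of symbol indices.
    """
    scores = [forward(m, obs)[1] * p for m, p in zip(models, priors)]
    return int(np.argmax(scores))   # c* = argmax_c p(O | lambda_c) p(lambda_c)
```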
Decoding Problem
Choose the state sequence that maximizes $p(S_1, S_2, \dots, S_T \mid O_1, O_2, \dots, O_T)$.
Viterbi algorithm: define the auxiliary variable
$\delta_t(i) = \max_{S_1, \dots, S_{t-1}} p(S_1, S_2, \dots, S_t = i, O_1, O_2, \dots, O_t \mid \lambda)$
$\delta_t(i)$: the probability of the most probable path ending in state $S_t = i$
Recursive relation: $\delta_{t+1}(j) = \left[ \max_{i=1,\dots,N} \delta_t(i)\, a_{ij} \right] p(O_{t+1} \mid S_{t+1} = j)$
The Viterbi algorithm uses dynamic programming to find the most probable state sequence given the observations.
Decoding Problem: Viterbi Algorithm
Initialization: $\delta_1(i) = p(O_1 \mid S_1 = i)\, \pi_i$, $\psi_1(i) = 0$, for $i = 1, \dots, N$
Iterations: for $t = 1, \dots, T - 1$ and $j = 1, \dots, N$:
$\delta_{t+1}(j) = \left[ \max_{i} \delta_t(i)\, a_{ij} \right] b_j(O_{t+1})$, $\psi_{t+1}(j) = \arg\max_{i} \delta_t(i)\, a_{ij}$
Final computation: $P^* = \max_{i=1,\dots,N} \delta_T(i)$, $S_T^* = \arg\max_{i=1,\dots,N} \delta_T(i)$
Backtrack state sequence: for $t = T - 1$ down to 1, $S_t^* = \psi_{t+1}(S_{t+1}^*)$
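A sketch of Viterbi decoding for the assumed `DiscreteHMM` container, working in log space to avoid underflow on long sequences:

```python
import numpy as np

def viterbi(hmm, obs):
    """Most probable state sequence for obs under the assumed DiscreteHMM."""
    T, N = len(obs), len(hmm.pi)
    logA, logB = np.log(hmm.A), np.log(hmm.B)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = np.log(hmm.pi) + logB[:, obs[0]]        # delta_1(i)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA          # delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)                 # best predecessor psi_t(j)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()                    # S_T* = argmax_i delta_T(i)
    for t in range(T - 2, -1, -1):                     # backtrack
        states[t] = psi[t + 1, states[t + 1]]
    return states
```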
Learning
Problem: how do we construct an HMM given only observations? Train the HMM to encode the observation sequence so that it will identify similar observation sequences in the future (generative model).
Find $\lambda = (A, B, \pi)$ maximizing $p(O_1, \dots, O_T \mid \lambda)$:
Initialize parameters $\lambda \leftarrow \lambda_0$
Repeat until convergence:
Compute a new model $\lambda'$ using $\lambda$ and the observed sequence $O_1, \dots, O_T$
$\lambda \leftarrow \lambda'$
EM Algorithm
EM: general procedure for learning from partly observed data
Define: $Q(\lambda', \lambda) = \mathbb{E}_{S \sim p(S \mid O, \lambda)}\!\left[\log p(O, S \mid \lambda')\right] = \sum_{S} p(S \mid O, \lambda) \log p(O, S \mid \lambda')$
Choose an initial setting $\lambda = \lambda_0$
Iterate until convergence:
E-step: use $O$ and the current $\lambda$ to calculate $p(S \mid O, \lambda)$
M-step: $\lambda' = \arg\max_{\lambda'} Q(\lambda', \lambda)$, then $\lambda \leftarrow \lambda'$
HMM Learning by EM
$\lambda = (A, \pi, \varphi)$
E-step:
$\gamma_t(i) = p(S_t = i \mid O_1, \dots, O_T; \lambda)$, for $i = 1, \dots, N$, $t = 1, \dots, T$
$\xi_t(i, j) = p(S_t = i, S_{t+1} = j \mid O_1, \dots, O_T; \lambda)$, for $i, j = 1, \dots, N$, $t = 1, \dots, T - 1$
M-step:
$\pi_i = \gamma_1(i)$, $\quad a_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'} \xi_t(i, j')}$
$\mu_i = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\, O_t}{\sum_{t=1}^{T} \gamma_t(i)}$, $\quad \Sigma_i = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\,(O_t - \mu_i)(O_t - \mu_i)^\top}{\sum_{t=1}^{T} \gamma_t(i)}$
Assumption: for each of the states we consider Gaussian emission probabilities $p(O_t \mid S_t = i) = \mathcal{N}(O_t \mid \mu_i, \Sigma_i)$.
Forward-Backward Algorithm
Central to efficient inference (the E-step of EM).
This is nothing more than the sum-product algorithm for inference on graphical models, applied to HMMs. HMMs are a special case of graphical models (poly-trees) for which the sum-product algorithm is well defined.
Forward-Backward (Baum-Welch) Algorithm
This will be used within expectation maximization to train an HMM.
$\gamma_t(i) \equiv p(S_t = i \mid O_1, \dots, O_T) = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j} \alpha_t(j)\, \beta_t(j)}$
[Figure: the HMM trellis combining the forward messages $\alpha_t(\cdot)$ with the backward messages $\beta_t(\cdot)$, where $\beta_T = 1$]
Baum-Welch algorithm (Baum, 1972)
Forward-Backward Algorithm
$p(O_1, \dots, O_T, S_t = i)$
$= p(O_1, \dots, O_t, S_t = i)\, p(O_{t+1}, \dots, O_T \mid O_1, \dots, O_t, S_t = i)$
$= p(O_1, \dots, O_t, S_t = i)\, p(O_{t+1}, \dots, O_T \mid S_t = i)$
$= \alpha_t(i)\, \beta_t(i)$
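Putting the pieces together, a sketch of one Baum-Welch (EM) iteration built on the earlier `forward` and `backward` sketches; it updates the discrete emission table $B$ rather than the Gaussian emissions of the M-step slide, purely for simplicity of illustration:

```python
import numpy as np

def baum_welch_step(hmm, obs):
    """One EM (Baum-Welch) iteration for the assumed discrete-emission HMM."""
    T, N = len(obs), len(hmm.pi)
    alpha, p_obs = forward(hmm, obs)
    beta = backward(hmm, obs)

    # E-step: gamma_t(i) = alpha_t(i) beta_t(i) / p(O)
    gamma = alpha * beta / p_obs
    # xi_t(i, j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / p(O)
    xi = (alpha[:-1, :, None] * hmm.A[None, :, :]
          * hmm.B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / p_obs

    # M-step: re-estimate pi, A, B from the expected counts
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(hmm.B)
    for k in range(hmm.B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return DiscreteHMM(A_new, B_new, pi_new)
```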
HMMs for On-line Handwriting Recognition
Modeling characters (we could also model sub-character strokes)
Modeling words
Modeling sentences
[Figure: a left-to-right HMM used as the character model]
Example: Handwriting Recognition
Assume that all characters in the input data are separated.
Character models show the probability of the input being a particular character.
[Figure: character models giving $p(\text{Observation} \mid \text{character})$ for an input segment; example values such as 0.6, 0.1, 0.03, 0.02, 0.01 are shown for characters a, b, c, d, ..., z]
Example: Handwriting Recognition
Word recognition:
States: characters
Observations: a sequence of handwritten characters
Emission probabilities: the probability of each segment of the input under a character model
Transition probabilities: two different assumptions:
Model 1: a lexicon is given
Model 2: without constraining the words to be in the lexicon
Example: Handwriting Recognition (Model 1: Lexicon)
We construct a separate HMM model for each lexicon word; the character models are concatenated to represent words.
Recognition of a word image is then equivalent to evaluating a few HMM models.
[Figure: word HMMs for 'dog' (d-o-g) and 'cat' (c-a-t) formed by concatenating character models; for the input image, the character-level scores shown are 0.6, 0.5, 0.7 for one word and 0.08, 0.1, 0.03 for the other]
Example: Handwriting Recognition (Model 1: Lexicon & Grammar)
If the input is a sequence of words, we can consider a higher-level HMM: a grammar of sentences with bigram (word-pair) co-occurrence probabilities.
The bigram probabilities and the initial word probabilities are required.
Example: Handwriting Recognition (Model 1: Lexicon)
[Figure: two-level model for Model 1: a word-level node ('dog') expands into the left-to-right character chain d-o-g]
Example: Handwriting Recognition (Model 1: Lexicon & Grammar)
[Figure: three-level model with a grammar: sentence level (noun, verb, ...) over word level ('dog', 'cat', ...) over character chains such as d-o-g]
Example: Handwriting Recognition (Model 2: Without Lexicon but Using Bigrams)
A single HMM for the whole language (based on a corpus or language model).
States: the characters in the alphabet.
Transition probabilities and initial probabilities are calculated from statistics of the language (obtained from a corpus); emission probabilities are defined as before.
Recognition of a word image is then equivalent to decoding (finding the best sequence of hidden states for the input); see the sketch after the figure below.
[Figure: a character-level HMM over the alphabet (a, b, c, ..., y, z); links show the co-occurrence probabilities of the characters in the English language, e.g., 0.09, 0.15]
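A sketch of how Model 2's character-level transition matrix and initial probabilities might be estimated from corpus statistics; `corpus_words` and `alphabet` are illustrative names, not from the slides:

```python
import numpy as np

def bigram_transitions(corpus_words, alphabet):
    """Estimate the character transition matrix A and initial probabilities pi
    from corpus statistics, as Model 2 suggests (add-one smoothing assumed)."""
    idx = {ch: i for i, ch in enumerate(alphabet)}
    N = len(alphabet)
    A = np.ones((N, N))            # smoothed counts of character bigrams
    pi = np.ones(N)                # smoothed counts of word-initial characters
    for word in corpus_words:
        pi[idx[word[0]]] += 1
        for prev, nxt in zip(word, word[1:]):
            A[idx[prev], idx[nxt]] += 1
    return A / A.sum(axis=1, keepdims=True), pi / pi.sum()

# Recognition then amounts to Viterbi decoding of the character states given
# the emission scores of the input segments (see viterbi() above).
```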
References
C.M. Bishop, Pattern Recognition and Machine Learning, 2006 (Chapter 13).
L. R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, pp. 257-286, 1989.