CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
Soleymani
Hidden Markov Models (HMMs)
Sequential Data
The i.i.d. assumption is a poor fit for many applications, e.g., sequential data.
Sequential data: examples
Time series: weather, stock market forecasts
DNA, protein sequences
Speech, online handwriting
Sequence of characters in an English sentence
[Bishop]
Markov Chain
First-order Markov chain: $p(x_t \mid x_1, \dots, x_{t-1}) = p(x_t \mid x_{t-1})$, so $p(x_1, \dots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})$
$M$-th-order Markov chain: $p(x_t \mid x_1, \dots, x_{t-1}) = p(x_t \mid x_{t-M}, \dots, x_{t-1})$
Zero-order: observations are independent; second-order: each observation depends on the previous two.
The model is specified completely by the prior probabilities of states and the probabilities of transition between states.
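As a small illustration (not from the slides), a sketch of a first-order chain specified by an assumed prior and transition table, sampled with NumPy; the weather states and the numbers are illustrative assumptions:

```python
import numpy as np

# The chain is fully specified by the prior pi over states and the
# transition matrix A; states and numbers below are assumed for illustration.
states = ["sunny", "rainy"]
pi = np.array([0.7, 0.3])                  # p(x_1 = i)
A = np.array([[0.8, 0.2],                  # p(x_t = j | x_{t-1} = sunny)
              [0.4, 0.6]])                 # p(x_t = j | x_{t-1} = rainy)

rng = np.random.default_rng(0)
x = [rng.choice(2, p=pi)]                  # sample x_1 from the prior
for t in range(1, 7):
    x.append(rng.choice(2, p=A[x[-1]]))    # sample x_t from p(x_t | x_{t-1})
print([states[i] for i in x])
```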
Markov Models, Markov Chains, and HMMs
Markov models (general): future predictions are independent of all but the most recent observations.
Markov chains: when the (discrete) states are known from observable data, Markov models lead to Markov chains. They are tractable but severely limited.
Hidden Markov Models (HMMs): the (discrete) states are not observable, but observations giving some information about the sequence of states are available. HMMs are more general than Markov chains, while still retaining tractability, through the introduction of latent variables, leading to state space models.
Hidden Markov Models (HMMs)
The state is not directly visible, but the output, which depends on the state, is visible: each observation variable has a corresponding latent (hidden) state variable.
Assumption: latent variables form a Markov chain
Each state has a probability distribution over possible outputs, so the sequence of observations generated by an HMM gives some information about the sequence of states.
HMM: Probabilistic Model
Transition distribution: a table of numbers giving the transition probabilities between states, $a_{ij} \equiv p(S_{t+1} = j \mid S_t = i)$.
Initial state distribution: the initial latent node has no parent node, so it needs a vector of probabilities $\pi_i \equiv p(S_1 = i)$.
Observation model: the conditional distributions of the observed variables, $p(O_t \mid S_t, \varphi)$, where $\varphi$ is the set of parameters of the distribution.
E.g., a Gaussian distribution when observations are continuous, or a table of probabilities when observations are discrete (a set of symbols).
$S$: states (latent variables); $O$: observations
HMM: Probabilistic Model (Discrete Case)
An HMM is thus specified as a triplet $\lambda = (A, B, \pi)$:
$A$: the state transition probability matrix, $a_{ij} = p(S_{t+1} = j \mid S_t = i)$, $i, j = 1, \dots, N$
$\pi$: initial state probabilities, $\pi_i \equiv p(S_1 = i)$
$B$: emission (observation) probabilities, $b_j(k) = p(O_t = k \mid S_t = j)$, $j = 1, \dots, N$, $k = 1, \dots, M$
$N$: number of states; $M$: number of symbols (observables)
$S$: latent variables as states; $O$: discrete observations as symbols
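To make the triplet concrete, a minimal sketch of how $\lambda = (A, B, \pi)$ might be stored; the class name `DiscreteHMM` and the toy numbers are assumptions for illustration, not part of the slides:

```python
import numpy as np

class DiscreteHMM:
    """Minimal container for the triplet lambda = (A, B, pi)."""
    def __init__(self, A, B, pi):
        self.A = np.asarray(A, dtype=float)    # A[i, j] = p(S_{t+1}=j | S_t=i), N x N
        self.B = np.asarray(B, dtype=float)    # B[j, k] = p(O_t=k | S_t=j),     N x M
        self.pi = np.asarray(pi, dtype=float)  # pi[i]  = p(S_1=i),              length N

# Toy model with N = 2 states and M = 3 symbols (numbers are illustrative)
hmm = DiscreteHMM(A=[[0.7, 0.3], [0.4, 0.6]],
                  B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
                  pi=[0.6, 0.4])
```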
HMM: Example
HMMs generalize mixture models.
[Figure: a GMM (left) and an HMM (right) fit to the same data, with states 1, 2, 3]
Example: in the right figure, after constructing an HMM from the training data, the most probable state sequence for the two observations is (1, 1). However, under the i.i.d. assumption, according to the GMM in the left figure, the most probable state for the first observation is 1 and for the second is 3.
[Bishop]
HMM: Properties
Some degree of invariance to local warping (compression and stretching) of the time axis.
Speech recognition: warping of the time axis is associated with natural variations in the speed of speech; an HMM can accommodate such a distortion without penalizing it too heavily.
[Bishop]
HMMs: Applications
Speech and handwriting: online handwriting recognition; speech recognition, processing, and synthesis
Text processing: natural language modeling; parsing raw records into structured records
Bioinformatics: analysis of biological sequences such as proteins and DNA
Finance: e.g., stock market forecasts
Main Questions in HMMs
Evaluation problem: How likely is the sequence of observations, given our model? Compute $p(O_1, \dots, O_T \mid \lambda)$. Useful in sequence classification.
Decoding problem: What is the sequence of latent variables corresponding to the observations? Find $\arg\max_{S_1, \dots, S_T} p(S_1, \dots, S_T \mid O_1, \dots, O_T, \lambda)$, the most likely state sequence that produces the given observations.
Learning problem: Learn the parameters $\lambda = \{A, B, \pi\}$ from a set of training data; determine the optimum model given a training set of observations: $\lambda^* = \arg\max_{\lambda} p(O_1, \dots, O_T \mid \lambda)$.
Explanation
Given an HMM and an observation history $O_1, \dots, O_t$, find a sequence of states that best explains the observations. Decoding is a special case of the explanation problem.
Slightly different versions of the explanation problem:
Decoding: find the most likely state history $S_1, \dots, S_t$ given the observation history $O_1, \dots, O_t$ ($\arg\max p(S_1, \dots, S_t \mid O_1, \dots, O_t) = ?$).
Filtering: given observations up to time $t$, compute the distribution of $S_t$ ($p(S_t \mid O_1, \dots, O_t) = ?$).
Smoothing: given observations up to time $t$, compute the distribution of $S_{t'}$, $t' < t$ ($p(S_{t'} \mid O_1, \dots, O_t) = ?$).
Prediction: given measurements up to time $t$, compute the distribution of $S_{t'}$, $t' > t$ ($p(S_{t'} \mid O_1, \dots, O_t) = ?$).
Core Questions
How do we calculate the likelihood of an observation sequence, $p(O_1, \dots, O_T)$? How do we calculate the posterior over states, $p(S_t \mid O_1, \dots, O_T)$? How do we train the HMM parameters given its structure and:
Fully observed training examples: $\langle O_1, \dots, O_T, S_1, \dots, S_T \rangle$
Partially observed training examples: $\langle O_1, \dots, O_T \rangle$
Evaluation Problem
Goal: compute $p(O_1, \dots, O_T \mid \lambda)$. To compute it efficiently, we use variable elimination (dynamic programming):
Forward algorithm
Backward algorithm
Evaluation Problem: Forward Algorithm
$\alpha_t(i) = p(O_1, O_2, \dots, O_t, S_t = i)$: the probability of observing $O_1, \dots, O_t$ and ending in state $S_t = i$.
Recursive relation: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, p(S_{t+1} = j \mid S_t = i) \right] p(O_{t+1} \mid S_{t+1} = j)$
Initialization: $\alpha_1(i) = p(O_1, S_1 = i) = p(O_1 \mid S_1 = i)\, \pi_i$, for $i = 1, \dots, N$
Iterations: for $t = 1$ to $T - 1$ and $j = 1, \dots, N$: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(O_{t+1})$
Final computation: $p(O_1, \dots, O_T) = \sum_{i=1}^{N} \alpha_T(i)$
(Dynamic programming)
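A sketch of the forward recursion for the `DiscreteHMM` container assumed above (unscaled; a practical implementation would rescale $\alpha_t$ or work in log space to avoid underflow):

```python
import numpy as np

def forward(hmm, obs):
    """Forward algorithm for the DiscreteHMM container sketched earlier.

    obs is a sequence of symbol indices; returns alpha (T x N) and
    p(O_1..T | lambda). Unscaled, so very long sequences will underflow.
    """
    T, N = len(obs), len(hmm.pi)
    alpha = np.zeros((T, N))
    alpha[0] = hmm.pi * hmm.B[:, obs[0]]               # alpha_1(i) = b_i(O_1) * pi_i
    for t in range(T - 1):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
        alpha[t + 1] = (alpha[t] @ hmm.A) * hmm.B[:, obs[t + 1]]
    return alpha, alpha[-1].sum()                      # p(O) = sum_i alpha_T(i)

# Example: alpha, p_obs = forward(hmm, [0, 2, 1]) for the toy model above
```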
Evaluation Problem: Forward Algorithm
$\alpha_t$ contains all of the relevant information about the past observations for the purpose of prediction.
[Figure: the forward recursion as message passing along the chain, computing $\alpha_1(\cdot), \alpha_2(\cdot), \dots$ from left to right]
Evaluation Problem: Backward Algorithm
$\beta_t(i) = p(O_{t+1}, O_{t+2}, \dots, O_T \mid S_t = i)$: the probability of observing $O_{t+1}, \dots, O_T$ given $S_t = i$.
Recursive relation: $\beta_t(i) = \sum_{j=1}^{N} p(S_{t+1} = j \mid S_t = i)\, p(O_{t+1} \mid S_{t+1} = j)\, \beta_{t+1}(j)$
Initialization: $\beta_T(i) = 1$, for $i = 1, \dots, N$
Iterations: for $t = T - 1$ down to 1 and $i = 1, \dots, N$: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$
Final computation: $p(O_1, \dots, O_T) = \sum_{i=1}^{N} \pi_i\, b_i(O_1)\, \beta_1(i)$
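A matching sketch of the backward recursion, again for the assumed `DiscreteHMM` container:

```python
import numpy as np

def backward(hmm, obs):
    """Backward algorithm for the same assumed DiscreteHMM container."""
    T, N = len(obs), len(hmm.pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                     # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[t] = hmm.A @ (hmm.B[:, obs[t + 1]] * beta[t + 1])
    return beta

# p(O_1..T) can equivalently be computed as sum_i pi_i * b_i(O_1) * beta_1(i)
```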
Evaluation Problem: Backward Algorithm
[Figure: the backward recursion as message passing along the chain, starting from $\beta_T(\cdot) = 1$ and moving right to left]
Classification Using Evaluation
Given observations and trained HMM models, one per class: $\lambda_1 = \{A_1, B_1, \pi_1\}, \dots, \lambda_C = \{A_C, B_C, \pi_C\}$
Bayesian decision: $c^* = \arg\max_{c} p(\lambda_c \mid O_1, \dots, O_T) = \arg\max_{c} p(O_1, \dots, O_T \mid \lambda_c)\, p(\lambda_c)$
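A sketch of this Bayesian decision rule using the forward likelihoods computed above; `models` and `priors` are illustrative names, not from the slides:

```python
import numpy as np

def classify(models, priors, obs):
    """Bayesian decision among trained HMMs lambda_1, ..., lambda_C.

    models: list of the assumed DiscreteHMM containers, one per class;
    priors: class priors p(lambda_c); obs: a sequence of symbol indices.
    """
    scores = [forward(m, obs)[1] * p for m, p in zip(models, priors)]
    return int(np.argmax(scores))   # c* = argmax_c p(O | lambda_c) p(lambda_c)
```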
Decoding Problem
Choose the state sequence that maximizes $p(S_1, S_2, \dots, S_T \mid O_1, O_2, \dots, O_T)$.
Viterbi algorithm: define the auxiliary variable
$\delta_t(i) = \max_{S_1, \dots, S_{t-1}} p(S_1, S_2, \dots, S_t = i, O_1, O_2, \dots, O_t \mid \lambda)$
$\delta_t(i)$: the probability of the most probable path ending in state $S_t = i$
Recursive relation: $\delta_{t+1}(j) = \left[ \max_{i=1,\dots,N} \delta_t(i)\, a_{ij} \right] p(O_{t+1} \mid S_{t+1} = j)$
The Viterbi algorithm uses dynamic programming to find the most probable state sequence given the observations.
Decoding Problem: Viterbi Algorithm
Initialization: $\delta_1(i) = p(O_1 \mid S_1 = i)\, \pi_i$, $\psi_1(i) = 0$, for $i = 1, \dots, N$
Iterations: for $t = 1, \dots, T - 1$ and $j = 1, \dots, N$:
$\delta_{t+1}(j) = \left[ \max_{i} \delta_t(i)\, a_{ij} \right] b_j(O_{t+1})$, $\psi_{t+1}(j) = \arg\max_{i} \delta_t(i)\, a_{ij}$
Final computation: $P^* = \max_{i=1,\dots,N} \delta_T(i)$, $S_T^* = \arg\max_{i=1,\dots,N} \delta_T(i)$
Backtrack state sequence: for $t = T - 1$ down to 1, $S_t^* = \psi_{t+1}(S_{t+1}^*)$
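A sketch of Viterbi decoding for the assumed `DiscreteHMM` container, working in log space to avoid underflow on long sequences:

```python
import numpy as np

def viterbi(hmm, obs):
    """Most probable state sequence for obs under the assumed DiscreteHMM."""
    T, N = len(obs), len(hmm.pi)
    logA, logB = np.log(hmm.A), np.log(hmm.B)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = np.log(hmm.pi) + logB[:, obs[0]]        # delta_1(i)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA          # delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)                 # best predecessor psi_t(j)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()                    # S_T* = argmax_i delta_T(i)
    for t in range(T - 2, -1, -1):                     # backtrack
        states[t] = psi[t + 1, states[t + 1]]
    return states
```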
Learning
Problem: how do we construct an HMM given only observations? Train the HMM to encode the observation sequence so that it will identify similar observation sequences in the future (generative model).
Find $\lambda = (A, B, \pi)$ maximizing $p(O_1, \dots, O_T \mid \lambda)$:
Initialize parameters $\lambda \leftarrow \lambda_0$
Repeat until convergence:
Compute a new model $\lambda'$ using $\lambda$ and the observed sequence $O_1, \dots, O_T$
$\lambda \leftarrow \lambda'$
EM Algorithm
EM: general procedure for learning from partly observed data
Define: $Q(\lambda', \lambda) = \mathbb{E}_{S \sim p(S \mid O, \lambda)}\!\left[\log p(O, S \mid \lambda')\right] = \sum_{S} p(S \mid O, \lambda) \log p(O, S \mid \lambda')$
Choose an initial setting $\lambda = \lambda_0$
Iterate until convergence:
E-step: use $O$ and the current $\lambda$ to calculate $p(S \mid O, \lambda)$
M-step: $\lambda' = \arg\max_{\lambda'} Q(\lambda', \lambda)$, then $\lambda \leftarrow \lambda'$
HMM Learning by EM
$\lambda = (A, \pi, \varphi)$
E-step:
$\gamma_t(i) = p(S_t = i \mid O_1, \dots, O_T; \lambda)$, for $i = 1, \dots, N$, $t = 1, \dots, T$
$\xi_t(i, j) = p(S_t = i, S_{t+1} = j \mid O_1, \dots, O_T; \lambda)$, for $i, j = 1, \dots, N$, $t = 1, \dots, T - 1$
M-step:
$\pi_i = \gamma_1(i)$, $\quad a_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'} \xi_t(i, j')}$
$\mu_i = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\, O_t}{\sum_{t=1}^{T} \gamma_t(i)}$, $\quad \Sigma_i = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\,(O_t - \mu_i)(O_t - \mu_i)^\top}{\sum_{t=1}^{T} \gamma_t(i)}$
Assumption: for each of the states we consider Gaussian emission probabilities $p(O_t \mid S_t = i) = \mathcal{N}(O_t \mid \mu_i, \Sigma_i)$.
Forward-Backward Algorithm
Central to efficient inference (the E-step of EM).
This is nothing more than the sum-product algorithm for inference on graphical models, applied to HMMs. HMMs are a special case of graphical models (poly-trees) for which the sum-product algorithm is well defined.
Forward-Backward (Baum-Welch) Algorithm
This will be used within expectation maximization to train an HMM.
$\gamma_t(i) \equiv p(S_t = i \mid O_1, \dots, O_T) = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j} \alpha_t(j)\, \beta_t(j)}$
[Figure: the HMM trellis combining the forward messages $\alpha_t(\cdot)$ with the backward messages $\beta_t(\cdot)$, where $\beta_T = 1$]
Baum-Welch algorithm (Baum, 1972)
Forward-Backward Algorithm
$p(O_1, \dots, O_T, S_t = i)$
$= p(O_1, \dots, O_t, S_t = i)\, p(O_{t+1}, \dots, O_T \mid O_1, \dots, O_t, S_t = i)$
$= p(O_1, \dots, O_t, S_t = i)\, p(O_{t+1}, \dots, O_T \mid S_t = i)$
$= \alpha_t(i)\, \beta_t(i)$
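Putting the pieces together, a sketch of one Baum-Welch (EM) iteration built on the earlier `forward` and `backward` sketches; it updates the discrete emission table $B$ rather than the Gaussian emissions of the M-step slide, purely for simplicity of illustration:

```python
import numpy as np

def baum_welch_step(hmm, obs):
    """One EM (Baum-Welch) iteration for the assumed discrete-emission HMM."""
    T, N = len(obs), len(hmm.pi)
    alpha, p_obs = forward(hmm, obs)
    beta = backward(hmm, obs)

    # E-step: gamma_t(i) = alpha_t(i) beta_t(i) / p(O)
    gamma = alpha * beta / p_obs
    # xi_t(i, j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / p(O)
    xi = (alpha[:-1, :, None] * hmm.A[None, :, :]
          * hmm.B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / p_obs

    # M-step: re-estimate pi, A, B from the expected counts
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(hmm.B)
    for k in range(hmm.B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return DiscreteHMM(A_new, B_new, pi_new)
```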
HMMs for On-line Handwriting Recognition
Modeling characters (we could also model sub-character strokes)
Modeling words
Modeling sentences
[Figure: a left-to-right HMM used as the character model]
Example: Handwriting Recognition
Assume that all characters in the input data are separated.
Character models show the probability of the input being a particular character.
[Figure: character models giving $p(\text{Observation} \mid \text{character})$ for an input segment; example values such as 0.6, 0.1, 0.03, 0.02, 0.01 are shown for characters a, b, c, d, ..., z]
Example: Handwriting Recognition
Word recognition:
States: characters
Observations: a sequence of handwritten characters
Emission probabilities: the probability of each segment of the input under a character model
Transition probabilities: two different assumptions:
Model 1: a lexicon is given
Model 2: without constraining the words to be in the lexicon
Example: Handwriting Recognition (Model 1: Lexicon)
We construct a separate HMM model for each lexicon word; the character models are concatenated to represent words.
Recognition of a word image is then equivalent to evaluating a few HMM models.
[Figure: word HMMs for 'dog' (d-o-g) and 'cat' (c-a-t) formed by concatenating character models; for the input image, the character-level scores shown are 0.6, 0.5, 0.7 for one word and 0.08, 0.1, 0.03 for the other]
Example: Handwriting Recognition (Model 1: Lexicon & Grammar)
If the input is a sequence of words, we can consider a higher-level HMM: a grammar of sentences with bigram (word-pair) co-occurrence probabilities.
The bigram probabilities and the initial word probabilities are required.
Example: Handwriting Recognition (Model 1: Lexicon)
[Figure: two-level model for Model 1: a word-level node ('dog') expands into the left-to-right character chain d-o-g]
Example: Handwriting Recognition (Model 1: Lexicon & Grammar)
[Figure: three-level model with a grammar: sentence level (noun, verb, ...) over word level ('dog', 'cat', ...) over character chains such as d-o-g]
Example: Handwriting Recognition (Model 2: Without Lexicon but Using Bigrams)
A single HMM for the whole language (based on a corpus or language model).
States: the characters in the alphabet.
Transition probabilities and initial probabilities are calculated from statistics of the language (obtained from a corpus); emission probabilities are defined as before.
Recognition of a word image is then equivalent to decoding (finding the best sequence of hidden states for the input); see the sketch after the figure below.
[Figure: a character-level HMM over the alphabet (a, b, c, ..., y, z); links show the co-occurrence probabilities of the characters in the English language, e.g., 0.09, 0.15]
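A sketch of how Model 2's character-level transition matrix and initial probabilities might be estimated from corpus statistics; `corpus_words` and `alphabet` are illustrative names, not from the slides:

```python
import numpy as np

def bigram_transitions(corpus_words, alphabet):
    """Estimate the character transition matrix A and initial probabilities pi
    from corpus statistics, as Model 2 suggests (add-one smoothing assumed)."""
    idx = {ch: i for i, ch in enumerate(alphabet)}
    N = len(alphabet)
    A = np.ones((N, N))            # smoothed counts of character bigrams
    pi = np.ones(N)                # smoothed counts of word-initial characters
    for word in corpus_words:
        pi[idx[word[0]]] += 1
        for prev, nxt in zip(word, word[1:]):
            A[idx[prev], idx[nxt]] += 1
    return A / A.sum(axis=1, keepdims=True), pi / pi.sum()

# Recognition then amounts to Viterbi decoding of the character states given
# the emission scores of the input segments (see viterbi() above).
```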
References
C.M. Bishop, Pattern Recognition and Machine Learning, 2006 (Chapter 13).
L. R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, pp. 257-286, 1989.