Hidden Markov Models By Marc Sobel

Hidden Markov Models with applications to speech recognition


Page 1: Hidden Markov Models with applications to speech recognition

Hidden Markov Models

By Marc Sobel

Page 2: Hidden Markov Models with applications to speech recognition

Lecture Notes for E. Alpaydın, Introduction to Machine Learning, © 2004 The MIT Press (V1.1)

Introduction

Modeling dependencies in input; no longer iid Sequences:

Temporal: In speech; phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language). In handwriting, pen movements

Spatial: In a DNA sequence; base pairs

Page 3: Hidden Markov Models with applications to speech recognition


Discrete Markov Process

N states: S1, S2, ..., SN. The state at "time" t is qt = Si.

First-order Markov property: P(qt+1=Sj | qt=Si, qt-1=Sk, ...) = P(qt+1=Sj | qt=Si)

Transition probabilities: aij ≡ P(qt+1=Sj | qt=Si), with aij ≥ 0 and Σj=1..N aij = 1

Initial probabilities: πi ≡ P(q1=Si), with Σi=1..N πi = 1
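
As a concrete illustration, here is a minimal Python/NumPy sketch (the numbers are the three-state transition matrix and initial distribution used later in these slides' urn example) checking that A and π satisfy these constraints:

```python
import numpy as np

# Three-state chain; values taken from the urn example later in these slides.
A = np.array([[0.4, 0.3, 0.3],   # a_ij = P(q_{t+1} = S_j | q_t = S_i)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.5, 0.2, 0.3])   # pi_i = P(q_1 = S_i)

# Every row of A, and the vector pi, must be a probability distribution.
assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)
assert np.all(pi >= 0) and np.isclose(pi.sum(), 1.0)
```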

Page 4: Hidden Markov Models with applications to speech recognition


Time-based Models

The models typically examined in statistics are simple parametric distributions and discrete distribution estimates.

These are typically based on what is called the "independence assumption": each data point is independent of the others, and there is no time-sequencing or ordering.

What if the data has correlations based on its order, like a time-series?

Page 5: Hidden Markov Models with applications to speech recognition


Applications of time based models

Sequential pattern recognition is a relevant problem in a number of disciplines:
Human-computer interaction: speech recognition.
Bioengineering: ECG and EEG analysis.
Robotics: mobile robot navigation.
Bioinformatics: DNA base sequence alignment.

Page 6: Hidden Markov Models with applications to speech recognition


Andrei Andreyevich Markov

Born: 14 June 1856 in Ryazan, Russia. Died: 20 July 1922 in Petrograd (now St Petersburg), Russia. Markov is particularly remembered for his study of Markov chains: sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.

Page 7: Hidden Markov Models with applications to speech recognition


Markov random processes

A random sequence has the Markov property if the distribution of its next state is determined solely by its current state. Any random process having this property is called a Markov random process.

For observable state sequences (state is known from data), this leads to a Markov chain model.

For non-observable states, this leads to a Hidden Markov Model (HMM).

Page 8: Hidden Markov Models with applications to speech recognition


Chain Rule & Markov Property

By the chain (Bayes) rule,
$$P(q_1,\dots,q_t) = P(q_t \mid q_1,\dots,q_{t-1})\,P(q_1,\dots,q_{t-1}) = P(q_t \mid q_1,\dots,q_{t-1})\,P(q_{t-1} \mid q_1,\dots,q_{t-2})\,P(q_1,\dots,q_{t-2}) = P(q_1)\prod_{i=2}^{t} P(q_i \mid q_1,\dots,q_{i-1}).$$

By the Markov property, $P(q_i \mid q_1,\dots,q_{i-1}) = P(q_i \mid q_{i-1})$ for $i > 1$, so
$$P(q_1,\dots,q_t) = P(q_1)\,P(q_2 \mid q_1)\,P(q_3 \mid q_2)\cdots P(q_t \mid q_{t-1}) = P(q_1)\prod_{i=2}^{t} P(q_i \mid q_{i-1}).$$

Page 9: Hidden Markov Models with applications to speech recognition


A Markov System

[Figure: a three-state system with states s1, s2, s3 (N = 3), shown at time t = 0.]

Has N states, called s1, s2, ..., sN. There are discrete time steps, t = 0, t = 1, ...

Page 10: Hidden Markov Models with applications to speech recognition


Example: Balls and Urns (a Markov process with a non-hidden observation process, i.e., a stochastic automaton)

Three urns, each full of balls of one color. S1: red, S2: blue, S3: green.

$$\pi = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \end{bmatrix}, \qquad A = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$$

For the observation sequence $O = \{S_1, S_1, S_3, S_3\}$:
$$P(O \mid A, \pi) = P(S_1)\,P(S_1 \mid S_1)\,P(S_3 \mid S_1)\,P(S_3 \mid S_3) = \pi_1\, a_{11}\, a_{13}\, a_{33} = 0.5 \times 0.4 \times 0.3 \times 0.8 = 0.048$$
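
A minimal Python sketch of this computation (the chain is fully observed, so the probability is just the product of the initial and transition probabilities along the state path); it reproduces the 0.048 above:

```python
import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.5, 0.2, 0.3])

def sequence_prob(states, A, pi):
    """P(q_1, ..., q_T) for a fully observed state sequence (0-based indices)."""
    p = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

# O = {S1, S1, S3, S3}: 0.5 * 0.4 * 0.3 * 0.8 = 0.048
print(sequence_prob([0, 0, 2, 2], A, pi))
```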

Page 11: Hidden Markov Models with applications to speech recognition


A plot of 100 observed numbers for the stochastic automaton

Page 12: Hidden Markov Models with applications to speech recognition


Histogram for the stochastic automaton: the proportions reflect the stationary distribution of the chain
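
To make the claim concrete, here is a sketch (assuming the transition matrix A of the preceding example) that computes the stationary distribution as the left eigenvector of A for eigenvalue 1 and compares it with the empirical state frequencies of a long simulated run:

```python
import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# Stationary distribution: left eigenvector of A with eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(A.T)
v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
stationary = v / v.sum()
print(stationary)                      # approximately [0.1818, 0.2727, 0.5455]

# Empirical check: state frequencies of a long simulated run converge to it.
rng = np.random.default_rng(0)
state, counts = 0, np.zeros(3)
for _ in range(100_000):
    counts[state] += 1
    state = rng.choice(3, p=A[state])
print(counts / counts.sum())
```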

Page 13: Hidden Markov Models with applications to speech recognition


Hidden Markov Models

States are not observable. Discrete observations {v1, v2, ..., vM} are recorded; they are a probabilistic function of the state.

Emission probabilities: bj(m) ≡ P(Ot=vm | qt=Sj)

Example: in each urn, there are balls of different colors, but with different probabilities.

For each observation sequence, there are multiple possible state sequences.
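
A short generative sketch of this idea (assuming, for illustration, the transition matrix used earlier and a 3-color emission matrix): sampling a state path and the corresponding observations makes clear that only the second sequence would actually be visible.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[0.4, 0.3, 0.3],    # transitions a_ij (from the earlier example)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.5, 0.2, 0.3],    # illustrative emissions b_j(m) = P(O_t = v_m | q_t = S_j)
              [0.2, 0.3, 0.5],
              [0.2, 0.5, 0.3]])
pi = np.array([0.5, 0.2, 0.3])

def sample_hmm(T):
    """Generate (hidden states, observations) of length T."""
    q = rng.choice(3, p=pi)
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(3, p=B[q]))   # observation is a noisy function of the state
        q = rng.choice(3, p=A[q])           # then the chain moves on
    return states, obs

states, obs = sample_hmm(10)
print(obs)      # visible
print(states)   # hidden
```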

Page 14: Hidden Markov Models with applications to speech recognition


From Markov To Hidden Markov

The previous model assumes that each state can be uniquely associated with an observable event: once an observation is made, the state of the system is trivially retrieved. This model, however, is too restrictive to be of practical use for most realistic problems.

To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state: each state can produce a number of outputs according to a unique probability distribution, and each distinct output can potentially be generated at any state.

These are known as Hidden Markov Models (HMMs), because the state sequence is not directly observable; it can only be approximated from the sequence of observations produced by the system.

Page 15: Hidden Markov Models with applications to speech recognition


The coin-toss problem

To illustrate the concept of an HMM, consider the following scenario. Assume that you are placed in a room with a curtain. Behind the curtain there is a person performing a coin-toss experiment: this person selects one of several coins and tosses it, getting heads (H) or tails (T). The person tells you the outcome (H or T), but not which coin was used each time.

Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}.

The coins represent the states; these are hidden because you do not know which coin was tossed each time. The outcome of each toss represents an observation. A "likely" sequence of coins may be inferred from the observations, but this state sequence will not be unique.

Page 16: Hidden Markov Models with applications to speech recognition


Speech Recognition

We record the sound signals associated with words.

We'd like to identify the speech-recognition features associated with pronouncing these words.

The features are the states and the sound signals are the observations.

Page 17: Hidden Markov Models with applications to speech recognition


The Coin Toss Example – 2 coins

Page 18: Hidden Markov Models with applications to speech recognition


From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins

Page 19: Hidden Markov Models with applications to speech recognition


The urn-ball problem

To further illustrate the concept of an HMM, consider this scenario. You are placed in the same room with a curtain. Behind the curtain there are N urns, each containing a large number of balls with M different colors. The person behind the curtain selects an urn according to an internal random process, then randomly grabs a ball from the selected urn. He shows you the ball, and places it back in the urn. This process is repeated over and over.

Questions: How would you represent this experiment with an HMM? What are the states? Why are the states hidden? What are the observations?

Page 20: Hidden Markov Models with applications to speech recognition


Doubly Stochastic System: The Urn-and-Ball Model

O = {green, blue, green, yellow, red, ..., blue}

How can we determine the appropriate model for the observation sequence given the system above?

Page 21: Hidden Markov Models with applications to speech recognition


Four Basic Problems of HMMs

1. Evaluation: Given λ and O, calculate P(O | λ).

2. State sequence: Given λ and O, find Q* such that P(Q* | O, λ) = maxQ P(Q | O, λ).

3. Learning: Given X = {Ok}k, find λ* such that P(X | λ*) = maxλ P(X | λ).

4. Statistical inference: Given X = {Ok}k, and given observation distributions P(X | θλ) for different λ's, estimate the θ parameters.

(Rabiner, 1989)
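
The slides list the problems without the algorithms; for Problem 1 (evaluation), the standard solution is the forward recursion (Rabiner, 1989). A minimal NumPy sketch, assuming the A = transitions, B = emissions, pi = initial-distribution convention of the later "Elements of an HMM" slide:

```python
import numpy as np

def forward_prob(obs, A, B, pi):
    """Evaluation problem: P(O | lambda) by the forward recursion.
    obs: list of observation indices; A: NxN transitions; B: NxM emissions."""
    alpha = pi * B[:, obs[0]]             # alpha_1(i) = pi_i * b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # alpha_t(j) = (sum_i alpha_{t-1}(i) a_ij) * b_j(O_t)
    return alpha.sum()                    # P(O | lambda) = sum_i alpha_T(i)
```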

Page 22: Hidden Markov Models with applications to speech recognition


Example: Balls and Urns (HMM): Learning I

Three urns, each full of balls of different colors (red, green, blue). S1: urn 1, S2: urn 2, S3: urn 3 (start at urn 1).

Emission probabilities (rows = urns; columns = red, green, blue):
$$B = \begin{bmatrix} 0.5 & 0.2 & 0.3 \\ 0.2 & 0.3 & 0.5 \\ 0.2 & 0.5 & 0.3 \end{bmatrix}$$

Transition probabilities:
$$A = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$$

For a state sequence $S = (S_1, \dots, S_T)$ with $S_1 = 1$ (urn 1) and an observed color sequence $O = (O_1, \dots, O_T)$, the joint probability factors as
$$P(O, S \mid A, B) = B[S_1, O_1] \times a_{S_1 S_2}\, B[S_2, O_2] \times a_{S_2 S_3}\, B[S_3, O_3] \times \cdots = B[S_1, O_1] \prod_{t=2}^{T} a_{S_{t-1} S_t}\, B[S_t, O_t].$$
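
A sketch of this joint probability for a given state path and color sequence (the chain is started in urn 1 on this slide, so no initial distribution is needed; A and B are the matrices above):

```python
import numpy as np

B = np.array([[0.5, 0.2, 0.3],   # emission: P(color | urn)
              [0.2, 0.3, 0.5],
              [0.2, 0.5, 0.3]])
A = np.array([[0.4, 0.3, 0.3],   # transition: P(next urn | urn)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def joint_prob(states, obs, A, B):
    """P(O, S | A, B) = B[S_1, O_1] * prod_{t>=2} a_{S_{t-1} S_t} * B[S_t, O_t]."""
    p = B[states[0], obs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

# e.g. states urn1, urn1, urn3 with colors red, blue, green (0-based indices)
print(joint_prob([0, 0, 2], [0, 2, 1], A, B))
```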

Page 23: Hidden Markov Models with applications to speech recognition


Baum-Welch EM for Hidden Markov Models

We use the notation: qt for the probability of the result at time t; ai[t-1],i[t] for the probability of going from the observed state at time t-1 to the observed state at time t; ni for the observed number of results i; and ni,j for the number of transitions from i to j.

$$\log L = \sum_t \log\!\big(q_{i[t]}\, a_{i[t-1],\,i[t]}\big) = \sum_i n_i \log(q_i) + \sum_{s,t} n_{s,t} \log(a_{s,t})$$

Page 24: Hidden Markov Models with applications to speech recognition


Baum-Welch EM for hmm’s

The constraints are that:
$$\sum_i q_i = 1; \qquad \sum_t a_{s,t} = 1 \ \text{ for each } s.$$

So, differentiating under the constraints we get:
$$0 = \frac{n_i}{q_i} - \lambda; \qquad 0 = \frac{n_{s,t}}{a_{s,t}} - \mu_s;$$
$$\hat{q}_i = \frac{n_i}{\sum_j n_j}; \qquad \hat{a}_{s,t} = \frac{n_{s,t}}{\sum_{t'} n_{s,t'}}.$$
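
These maximizers are just normalized counts. In full Baum-Welch the n's are expected counts computed from the observations with forward-backward; the sketch below applies the same normalized-count formulas to counts taken from an available state sequence, matching the simplified notation of this slide:

```python
import numpy as np

def count_estimates(states, obs, n_states, n_symbols):
    """Normalized-count maximizers: a_hat[s, t'] = n_{s,t'} / sum_{t'} n_{s,t'},
    b_hat[s, m] = (# times symbol m seen in state s) / (# times in state s).
    Assumes every state is visited at least once (it is only a sketch)."""
    n_trans = np.zeros((n_states, n_states))
    n_emit = np.zeros((n_states, n_symbols))
    for s, o in zip(states, obs):
        n_emit[s, o] += 1
    for s_prev, s_cur in zip(states[:-1], states[1:]):
        n_trans[s_prev, s_cur] += 1
    A_hat = n_trans / n_trans.sum(axis=1, keepdims=True)
    B_hat = n_emit / n_emit.sum(axis=1, keepdims=True)
    return A_hat, B_hat
```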

Page 25: Hidden Markov Models with applications to speech recognition


Observed colored balls in the HMM model

Page 26: Hidden Markov Models with applications to speech recognition


EM results

We have
$$\hat{b} = \begin{bmatrix} 0.2500 & 0.4300 & 0.3200 \end{bmatrix},$$
$$\hat{B} = \begin{bmatrix} 0.4545 & 0.1818 & 0.3636 \\ 0.1875 & 0.6250 & 0.1875 \\ 0.1475 & 0.0328 & 0.8197 \end{bmatrix}, \qquad \text{compared with the true } B = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}.$$

Page 27: Hidden Markov Models with applications to speech recognition


More General Elements of an HMM

N: number of states
M: number of observation symbols
A = [aij]: N × N state transition probability matrix
B = [bj(m)]: N × M observation probability matrix
Π = [πi]: N × 1 initial state probability vector
λ = (A, B, Π): parameter set of the HMM
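
Gathering these elements into one object keeps the later algorithms tidy; a minimal sketch (the class name and layout are illustrative, not from the slides):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    A: np.ndarray    # N x N state transition probabilities
    B: np.ndarray    # N x M observation (emission) probabilities
    pi: np.ndarray   # initial state probabilities (length N)

    def __post_init__(self):
        N, M = self.B.shape
        assert self.A.shape == (N, N) and self.pi.shape == (N,)
        assert np.allclose(self.A.sum(axis=1), 1) and np.allclose(self.B.sum(axis=1), 1)
        assert np.isclose(self.pi.sum(), 1)
```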

Page 28: Hidden Markov Models with applications to speech recognition


Particle Evaluation

At stage t, simulate the new state for the j-th particle from its former state using the distribution
$$s_t^{(j)} \sim B\big(s_{t-1}^{(j)},\, :\big),$$
and weight the result by the probability of the current observation,
$$w_t^{(j)} \propto A\big(s_t^{(j)},\, O_t\big).$$
The resulting weight for the j-th particle is
$$W_t^{(j)} = w_t^{(j)}\, W_{t-1}^{(j)}.$$

We should use standard residual resampling. The result gets 50 percent accuracy. [Note: I haven't perfected good residual sampling.]
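
A sketch of this particle scheme in NumPy. It uses the standard convention (A = transitions, B = emissions) from the "Elements of an HMM" slide, and multinomial rather than residual resampling, so it is an approximation of the procedure described above rather than a faithful reimplementation:

```python
import numpy as np

def particle_filter(obs, A, B, pi, n_particles=500, seed=0):
    """Propagate particles with the transition matrix, weight each one by the
    emission probability of the current observation, resample, and keep a
    running state estimate (the most common particle)."""
    rng = np.random.default_rng(seed)
    n_states = len(pi)
    particles = rng.choice(n_states, size=n_particles, p=pi)
    estimates = []
    for o in obs:
        w = B[particles, o]                                    # weight by P(O_t | state)
        w = w / w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)   # multinomial resampling
        particles = particles[idx]
        estimates.append(np.bincount(particles, minlength=n_states).argmax())
        # propagate each surviving particle one step through the chain
        particles = np.array([rng.choice(n_states, p=A[s]) for s in particles])
    return estimates
```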

Page 29: Hidden Markov Models with applications to speech recognition


Particle Results: based on 50 observations

Page 30: Hidden Markov Models with applications to speech recognition


Viterbi’s Algorithm

δt(i) ≡ maxq1q2∙∙∙qt-1 p(q1q2∙∙∙qt-1, qt = Si, O1∙∙∙Ot | λ)

Initialization: δ1(i) = πi bi(O1), ψ1(i) = 0

Recursion: δt(j) = maxi δt-1(i) aij bj(Ot), ψt(j) = argmaxi δt-1(i) aij

Termination: p* = maxi δT(i), qT* = argmaxi δT(i)

Path backtracking: qt* = ψt+1(qt+1*), t = T-1, T-2, ..., 1
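
The recursion above translates directly into code; a minimal NumPy sketch:

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most probable state path arg max_Q P(Q | O, lambda) and its probability."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # recursion
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()                     # termination
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]             # path backtracking
    return path, delta[-1].max()
```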

Page 31: Hidden Markov Models with applications to speech recognition


Viterbi learning versus the actual state (estimate =3; 62% accuracy)

Page 32: Hidden Markov Models with applications to speech recognition


General EM

At each step assume k states, with densities
$$p(x_t \mid \theta_1), \dots, p(x_t \mid \theta_k),$$
with p known and the θ's unknown. We use the terminology $Z_1, \dots, Z_t$ for the (unobserved) states.

Then the EM equations are (with the $p_s$'s the stationary probabilities of the states):
$$Q(\theta, \theta^{(i)}) = \sum_t \sum_{s=1}^{k} \log p(X_t \mid \theta_s)\; P(Z_t = s \mid X, \theta^{(i)});$$
$$P(Z_t = s \mid X, \theta^{(i)}) = \frac{p_s\, P(X_t \mid \theta_s^{(i)})}{\sum_{s'=1}^{k} p_{s'}\, P(X_t \mid \theta_{s'}^{(i)})}.$$

Page 33: Hidden Markov Models with applications to speech recognition


EM Equations

We have
$$\sum_t P(Z_t = s \mid X, \theta)\; \frac{\partial \log p(X_t \mid \theta_s)}{\partial \theta_s} = 0 \qquad (s = 1, \dots, k).$$

So, in the Poisson hidden case (with $N_t$ the observed count at time $t$) we have:
$$\hat{\theta}_s = \frac{\sum_t N_t\, P(Z_t = s \mid X, \theta)}{\sum_t P(Z_t = s \mid X, \theta)} \qquad (s = 1, \dots, k).$$
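
A one-iteration sketch of this update (the E step uses fixed state probabilities p, as in the Q function two slides back; the binomial case on the next slide is analogous):

```python
import numpy as np
from scipy.stats import poisson

def poisson_em_step(counts, lam, p):
    """One EM step for the Poisson hidden model:
    E step: resp[t, s] = P(Z_t = s | X_t) proportional to p_s * Poisson(X_t | lam_s)
    M step: lam_s = sum_t N_t * resp[t, s] / sum_t resp[t, s]."""
    lik = poisson.pmf(counts[:, None], lam[None, :]) * p[None, :]
    resp = lik / lik.sum(axis=1, keepdims=True)
    lam_new = (resp * counts[:, None]).sum(axis=0) / resp.sum(axis=0)
    return lam_new, resp
```

Here counts is the vector of observed counts N_t; in practice the step would be iterated until the means stabilize.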

Page 34: Hidden Markov Models with applications to speech recognition


Binomial hidden model

We have (with $n_t$ successes out of $N_t$ trials at time $t$, and $p_s$ the success probability of state $s$):
$$\hat{p}_s = \frac{\sum_t n_t\, P(Z_t = s \mid N_t, p)}{\sum_t N_t\, P(Z_t = s \mid N_t, p)} \qquad (s = 1, \dots, k),$$
where
$$P(Z_t = s \mid N_t, p) = \frac{\binom{N_t}{n_t}\, p_s^{\,n_t} (1 - p_s)^{N_t - n_t}}{\sum_{s'} \binom{N_t}{n_t}\, p_{s'}^{\,n_t} (1 - p_{s'})^{N_t - n_t}}.$$

Page 35: Hidden Markov Models with applications to speech recognition


Coin-Tossing Model

Coin 1: 0.2000 0.8000
Coin 2: 0.7000 0.3000
Coin 3: 0.5000 0.5000

State Matrix:
         C1     C2     C3
Coin 1 0.4000 0.3000 0.3000
Coin 2 0.2000 0.6000 0.2000
Coin 3 0.1000 0.1000 0.8000

Page 36: Hidden Markov Models with applications to speech recognition


Coin tossing model: results

Page 37: Hidden Markov Models with applications to speech recognition


Maximum Likelihood Model

The stationary distribution for the states is: 0.1818, 0.2727, 0.5455. Therefore, using a binomial hidden HMM we get:

Page 38: Hidden Markov Models with applications to speech recognition


MCMC approach

Update the posterior distributions for the parameters and the (unobserved) state variables.

$$P(\theta_i \mid X, Z) \;\propto\; \prod_{t=1}^{T} P(X_t \mid \theta_i)^{\,\mathbb{1}[Z_t = i]}; \qquad Z_s \;\sim\; P(Z_s \mid Z_{-s}, X, \theta).$$

Page 39: Hidden Markov Models with applications to speech recognition


Continuous Observations

Discrete:
$$P(O_t \mid q_t = S_j, \lambda) = \prod_{m=1}^{M} b_j(m)^{\,r_t^m}, \qquad r_t^m = \begin{cases} 1 & \text{if } O_t = v_m \\ 0 & \text{otherwise} \end{cases}$$

Gaussian mixture (discretize using k-means):
$$P(O_t \mid q_t = S_j, \lambda) = \sum_{l=1}^{L} P(G_l \mid q_t = S_j)\; p(O_t \mid q_t = S_j, G_l, \lambda)$$

Continuous:
$$P(O_t \mid q_t = S_j, \lambda) \sim \mathcal{N}(\mu_j, \sigma_j^2)$$

Use EM to learn the parameters, e.g.,
$$\hat{\mu}_j = \frac{\sum_t \gamma_t(j)\, O_t}{\sum_t \gamma_t(j)}$$
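
For the continuous (Gaussian) case, the re-estimation formula above is a posterior-weighted average. A small sketch, assuming the posteriors gamma[t, j] = P(q_t = S_j | O, λ) have already been computed (e.g. by forward-backward); the variance update is not on the slide but follows the same pattern:

```python
import numpy as np

def gaussian_updates(obs, gamma):
    """EM re-estimation of Gaussian emission parameters:
    mu_j = sum_t gamma_t(j) O_t / sum_t gamma_t(j), plus the matching variance."""
    weight = gamma.sum(axis=0)                                        # sum_t gamma_t(j)
    mu = (gamma * obs[:, None]).sum(axis=0) / weight
    var = (gamma * (obs[:, None] - mu[None, :]) ** 2).sum(axis=0) / weight
    return mu, var
```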

Page 40: Hidden Markov Models with applications to speech recognition


HMM with Input

Input-dependent observations:
$$P(O_t \mid q_t = S_j, x_t, \lambda) \sim \mathcal{N}\big(g_j(x_t \mid \theta_j), \sigma_j^2\big)$$

Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996):
$$P(q_{t+1} = S_j \mid q_t = S_i, x_t)$$

Time-delay input:
$$x_t = f(O_{t-\tau}, \dots, O_{t-1})$$

Page 41: Hidden Markov Models with applications to speech recognition


Model Selection in HMM

Left-to-right HMMs: transitions only go forward, so the transition matrix is upper triangular, e.g.
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & 0 \\ 0 & a_{22} & a_{23} & a_{24} \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{bmatrix}$$

In classification, for each class Ci, estimate P(O | λi) with a separate HMM and use Bayes' rule:
$$P(\lambda_i \mid O) = \frac{P(O \mid \lambda_i)\, P(\lambda_i)}{\sum_j P(O \mid \lambda_j)\, P(\lambda_j)}$$
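
A sketch of this classification rule, with one (A, B, pi) triple per class and the standard forward recursion used for each likelihood P(O | λi); the class priors are assumed given:

```python
import numpy as np

def classify(obs, class_models, priors):
    """Posterior P(lambda_i | O) proportional to P(O | lambda_i) P(lambda_i),
    with one HMM (A, B, pi) per class."""
    liks = []
    for A, B, pi in class_models:
        alpha = pi * B[:, obs[0]]              # forward recursion for P(O | lambda_i)
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
        liks.append(alpha.sum())
    post = np.array(liks) * np.array(priors)
    return post / post.sum()                   # Bayes' rule
```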