Hidden Markov Models By Marc Sobel

Hidden Markov Models with applications to speech recognition


Page 1: Hidden Markov Models with applications to speech recognition

Hidden Markov Models

By Marc Sobel

Page 2: Hidden Markov Models with applications to speech recognition

Lecture Notes for E. Alpaydın, Introduction to Machine Learning, © 2004 The MIT Press (V1.1)

Introduction

Modeling dependencies in input; no longer iid Sequences:

Temporal: In speech; phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language). In handwriting, pen movements

Spatial: In a DNA sequence; base pairs

Page 3: Hidden Markov Models with applications to speech recognition


Discrete Markov Process

N states: S1, S2, ..., SN. The state at "time" t is qt = Si.

First-order Markov property: P(qt+1=Sj | qt=Si, qt-1=Sk, ...) = P(qt+1=Sj | qt=Si)

Transition probabilities: aij ≡ P(qt+1=Sj | qt=Si), with aij ≥ 0 and Σj=1..N aij = 1

Initial probabilities: πi ≡ P(q1=Si), with Σi=1..N πi = 1
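
As a concrete illustration, here is a minimal Python/NumPy sketch (the numbers are the three-state transition matrix and initial distribution used later in these slides' urn example) checking that A and π satisfy these constraints:

```python
import numpy as np

# Three-state chain; values taken from the urn example later in these slides.
A = np.array([[0.4, 0.3, 0.3],   # a_ij = P(q_{t+1} = S_j | q_t = S_i)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.5, 0.2, 0.3])   # pi_i = P(q_1 = S_i)

# Every row of A, and the vector pi, must be a probability distribution.
assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)
assert np.all(pi >= 0) and np.isclose(pi.sum(), 1.0)
```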

Page 4: Hidden Markov Models with applications to speech recognition


Time-based Models

The models typically examined in statistics are simple parametric distributions and discrete distribution estimates.

These are typically based on what is called the "independence assumption": each data point is independent of the others, and there is no time-sequencing or ordering.

What if the data has correlations based on its order, like a time-series?

Page 5: Hidden Markov Models with applications to speech recognition


Applications of time based models

Sequential pattern recognition is a relevant problem in a number of disciplines:
Human-computer interaction: speech recognition.
Bioengineering: ECG and EEG analysis.
Robotics: mobile robot navigation.
Bioinformatics: DNA base sequence alignment.

Page 6: Hidden Markov Models with applications to speech recognition


Andrei Andreyevich Markov

Born: 14 June 1856 in Ryazan, Russia. Died: 20 July 1922 in Petrograd (now St Petersburg), Russia. Markov is particularly remembered for his study of Markov chains: sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.

Page 7: Hidden Markov Models with applications to speech recognition


Markov random processes

A random sequence has the Markov property if the distribution of its next state is determined solely by its current state. Any random process having this property is called a Markov random process.

For observable state sequences (state is known from data), this leads to a Markov chain model.

For non-observable states, this leads to a Hidden Markov Model (HMM).

Page 8: Hidden Markov Models with applications to speech recognition


Chain Rule & Markov Property

By the chain (Bayes) rule,
$$P(q_1,\dots,q_t) = P(q_t \mid q_1,\dots,q_{t-1})\,P(q_1,\dots,q_{t-1}) = P(q_t \mid q_1,\dots,q_{t-1})\,P(q_{t-1} \mid q_1,\dots,q_{t-2})\,P(q_1,\dots,q_{t-2}) = P(q_1)\prod_{i=2}^{t} P(q_i \mid q_1,\dots,q_{i-1}).$$

By the Markov property, $P(q_i \mid q_1,\dots,q_{i-1}) = P(q_i \mid q_{i-1})$ for $i > 1$, so
$$P(q_1,\dots,q_t) = P(q_1)\,P(q_2 \mid q_1)\,P(q_3 \mid q_2)\cdots P(q_t \mid q_{t-1}) = P(q_1)\prod_{i=2}^{t} P(q_i \mid q_{i-1}).$$

Page 9: Hidden Markov Models with applications to speech recognition


A Markov System

[Figure: a three-state system with states s1, s2, s3 (N = 3), shown at time t = 0.]

Has N states, called s1, s2, ..., sN. There are discrete time steps, t = 0, t = 1, ...

Page 10: Hidden Markov Models with applications to speech recognition


Example: Balls and Urns (a Markov process with a non-hidden observation process, i.e., a stochastic automaton)

Three urns, each full of balls of one color. S1: red, S2: blue, S3: green.

$$\pi = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \end{bmatrix}, \qquad A = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$$

For the observation sequence $O = \{S_1, S_1, S_3, S_3\}$:
$$P(O \mid A, \pi) = P(S_1)\,P(S_1 \mid S_1)\,P(S_3 \mid S_1)\,P(S_3 \mid S_3) = \pi_1\, a_{11}\, a_{13}\, a_{33} = 0.5 \times 0.4 \times 0.3 \times 0.8 = 0.048$$
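
A minimal Python sketch of this computation (the chain is fully observed, so the probability is just the product of the initial and transition probabilities along the state path); it reproduces the 0.048 above:

```python
import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.5, 0.2, 0.3])

def sequence_prob(states, A, pi):
    """P(q_1, ..., q_T) for a fully observed state sequence (0-based indices)."""
    p = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

# O = {S1, S1, S3, S3}: 0.5 * 0.4 * 0.3 * 0.8 = 0.048
print(sequence_prob([0, 0, 2, 2], A, pi))
```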

Page 11: Hidden Markov Models with applications to speech recognition


A plot of 100 observed numbers for the stochastic automaton

Page 12: Hidden Markov Models with applications to speech recognition


Histogram for the stochastic automaton: the proportions reflect the stationary distribution of the chain
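
To make the claim concrete, here is a sketch (assuming the transition matrix A of the preceding example) that computes the stationary distribution as the left eigenvector of A for eigenvalue 1 and compares it with the empirical state frequencies of a long simulated run:

```python
import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# Stationary distribution: left eigenvector of A with eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(A.T)
v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
stationary = v / v.sum()
print(stationary)                      # approximately [0.1818, 0.2727, 0.5455]

# Empirical check: state frequencies of a long simulated run converge to it.
rng = np.random.default_rng(0)
state, counts = 0, np.zeros(3)
for _ in range(100_000):
    counts[state] += 1
    state = rng.choice(3, p=A[state])
print(counts / counts.sum())
```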

Page 13: Hidden Markov Models with applications to speech recognition


Hidden Markov Models

States are not observable. Discrete observations {v1, v2, ..., vM} are recorded; they are a probabilistic function of the state.

Emission probabilities: bj(m) ≡ P(Ot=vm | qt=Sj)

Example: in each urn, there are balls of different colors, but with different probabilities.

For each observation sequence, there are multiple possible state sequences.
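
A short generative sketch of this idea (assuming, for illustration, the transition matrix used earlier and a 3-color emission matrix): sampling a state path and the corresponding observations makes clear that only the second sequence would actually be visible.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[0.4, 0.3, 0.3],    # transitions a_ij (from the earlier example)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.5, 0.2, 0.3],    # illustrative emissions b_j(m) = P(O_t = v_m | q_t = S_j)
              [0.2, 0.3, 0.5],
              [0.2, 0.5, 0.3]])
pi = np.array([0.5, 0.2, 0.3])

def sample_hmm(T):
    """Generate (hidden states, observations) of length T."""
    q = rng.choice(3, p=pi)
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(3, p=B[q]))   # observation is a noisy function of the state
        q = rng.choice(3, p=A[q])           # then the chain moves on
    return states, obs

states, obs = sample_hmm(10)
print(obs)      # visible
print(states)   # hidden
```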

Page 14: Hidden Markov Models with applications to speech recognition


From Markov To Hidden Markov

The previous model assumes that each state can be uniquely associated with an observable event: once an observation is made, the state of the system is trivially retrieved. This model, however, is too restrictive to be of practical use for most realistic problems.

To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state: each state can produce a number of outputs according to a unique probability distribution, and each distinct output can potentially be generated at any state.

These are known as Hidden Markov Models (HMMs), because the state sequence is not directly observable; it can only be approximated from the sequence of observations produced by the system.

Page 15: Hidden Markov Models with applications to speech recognition


The coin-toss problem

To illustrate the concept of an HMM, consider the following scenario. Assume that you are placed in a room with a curtain. Behind the curtain there is a person performing a coin-toss experiment: this person selects one of several coins and tosses it, getting heads (H) or tails (T). The person tells you the outcome (H or T), but not which coin was used each time.

Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}.

The coins represent the states; these are hidden because you do not know which coin was tossed each time. The outcome of each toss represents an observation. A "likely" sequence of coins may be inferred from the observations, but this state sequence will not be unique.

Page 16: Hidden Markov Models with applications to speech recognition


Speech Recognition

We record the sound signals associated with words.

We'd like to identify the speech-recognition features associated with pronouncing these words.

The features are the states and the sound signals are the observations.

Page 17: Hidden Markov Models with applications to speech recognition


The Coin Toss Example – 2 coins

Page 18: Hidden Markov Models with applications to speech recognition


From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins

Page 19: Hidden Markov Models with applications to speech recognition


The urn-ball problem

To further illustrate the concept of an HMM, consider this scenario. You are placed in the same room with a curtain. Behind the curtain there are N urns, each containing a large number of balls with M different colors. The person behind the curtain selects an urn according to an internal random process, then randomly grabs a ball from the selected urn. He shows you the ball, and places it back in the urn. This process is repeated over and over.

Questions: How would you represent this experiment with an HMM? What are the states? Why are the states hidden? What are the observations?

Page 20: Hidden Markov Models with applications to speech recognition


Doubly Stochastic System: The Urn-and-Ball Model

O = {green, blue, green, yellow, red, ..., blue}

How can we determine the appropriate model for the observation sequence given the system above?

Page 21: Hidden Markov Models with applications to speech recognition


Four Basic Problems of HMMs

1. Evaluation: Given λ and O, calculate P(O | λ).

2. State sequence: Given λ and O, find Q* such that P(Q* | O, λ) = maxQ P(Q | O, λ).

3. Learning: Given X = {Ok}k, find λ* such that P(X | λ*) = maxλ P(X | λ).

4. Statistical inference: Given X = {Ok}k, and given observation distributions P(X | θλ) for different λ's, estimate the θ parameters.

(Rabiner, 1989)
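
The slides list the problems without the algorithms; for Problem 1 (evaluation), the standard solution is the forward recursion (Rabiner, 1989). A minimal NumPy sketch, assuming the A = transitions, B = emissions, pi = initial-distribution convention of the later "Elements of an HMM" slide:

```python
import numpy as np

def forward_prob(obs, A, B, pi):
    """Evaluation problem: P(O | lambda) by the forward recursion.
    obs: list of observation indices; A: NxN transitions; B: NxM emissions."""
    alpha = pi * B[:, obs[0]]             # alpha_1(i) = pi_i * b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # alpha_t(j) = (sum_i alpha_{t-1}(i) a_ij) * b_j(O_t)
    return alpha.sum()                    # P(O | lambda) = sum_i alpha_T(i)
```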

Page 22: Hidden Markov Models with applications to speech recognition


Example: Balls and Urns (HMM): Learning I

Three urns, each full of balls of different colors (red, green, blue). S1: urn 1, S2: urn 2, S3: urn 3 (start at urn 1).

Emission probabilities (rows = urns; columns = red, green, blue):
$$B = \begin{bmatrix} 0.5 & 0.2 & 0.3 \\ 0.2 & 0.3 & 0.5 \\ 0.2 & 0.5 & 0.3 \end{bmatrix}$$

Transition probabilities:
$$A = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$$

For a state sequence $S = (S_1, \dots, S_T)$ with $S_1 = 1$ (urn 1) and an observed color sequence $O = (O_1, \dots, O_T)$, the joint probability factors as
$$P(O, S \mid A, B) = B[S_1, O_1] \times a_{S_1 S_2}\, B[S_2, O_2] \times a_{S_2 S_3}\, B[S_3, O_3] \times \cdots = B[S_1, O_1] \prod_{t=2}^{T} a_{S_{t-1} S_t}\, B[S_t, O_t].$$
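
A sketch of this joint probability for a given state path and color sequence (the chain is started in urn 1 on this slide, so no initial distribution is needed; A and B are the matrices above):

```python
import numpy as np

B = np.array([[0.5, 0.2, 0.3],   # emission: P(color | urn)
              [0.2, 0.3, 0.5],
              [0.2, 0.5, 0.3]])
A = np.array([[0.4, 0.3, 0.3],   # transition: P(next urn | urn)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def joint_prob(states, obs, A, B):
    """P(O, S | A, B) = B[S_1, O_1] * prod_{t>=2} a_{S_{t-1} S_t} * B[S_t, O_t]."""
    p = B[states[0], obs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

# e.g. states urn1, urn1, urn3 with colors red, blue, green (0-based indices)
print(joint_prob([0, 0, 2], [0, 2, 1], A, B))
```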

Page 23: Hidden Markov Models with applications to speech recognition


Baum-Welch EM for Hidden Markov Models

We use the notation: qt for the probability of the result at time t; ai[t-1],i[t] for the probability of going from the observed state at time t-1 to the observed state at time t; ni for the observed number of results i; and ni,j for the number of transitions from i to j.

$$\log L = \sum_t \log\!\big(q_{i[t]}\, a_{i[t-1],\,i[t]}\big) = \sum_i n_i \log(q_i) + \sum_{s,t} n_{s,t} \log(a_{s,t})$$

Page 24: Hidden Markov Models with applications to speech recognition


Baum-Welch EM for hmm’s

The constraints are that:
$$\sum_i q_i = 1; \qquad \sum_t a_{s,t} = 1 \ \text{ for each } s.$$

So, differentiating under the constraints we get:
$$0 = \frac{n_i}{q_i} - \lambda; \qquad 0 = \frac{n_{s,t}}{a_{s,t}} - \mu_s;$$
$$\hat{q}_i = \frac{n_i}{\sum_j n_j}; \qquad \hat{a}_{s,t} = \frac{n_{s,t}}{\sum_{t'} n_{s,t'}}.$$
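
These maximizers are just normalized counts. In full Baum-Welch the n's are expected counts computed from the observations with forward-backward; the sketch below applies the same normalized-count formulas to counts taken from an available state sequence, matching the simplified notation of this slide:

```python
import numpy as np

def count_estimates(states, obs, n_states, n_symbols):
    """Normalized-count maximizers: a_hat[s, t'] = n_{s,t'} / sum_{t'} n_{s,t'},
    b_hat[s, m] = (# times symbol m seen in state s) / (# times in state s).
    Assumes every state is visited at least once (it is only a sketch)."""
    n_trans = np.zeros((n_states, n_states))
    n_emit = np.zeros((n_states, n_symbols))
    for s, o in zip(states, obs):
        n_emit[s, o] += 1
    for s_prev, s_cur in zip(states[:-1], states[1:]):
        n_trans[s_prev, s_cur] += 1
    A_hat = n_trans / n_trans.sum(axis=1, keepdims=True)
    B_hat = n_emit / n_emit.sum(axis=1, keepdims=True)
    return A_hat, B_hat
```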

Page 25: Hidden Markov Models with applications to speech recognition


Observed colored balls in the HMM model

Page 26: Hidden Markov Models with applications to speech recognition


EM results

We have
$$\hat{b} = \begin{bmatrix} 0.2500 & 0.4300 & 0.3200 \end{bmatrix},$$
$$\hat{B} = \begin{bmatrix} 0.4545 & 0.1818 & 0.3636 \\ 0.1875 & 0.6250 & 0.1875 \\ 0.1475 & 0.0328 & 0.8197 \end{bmatrix}, \qquad \text{compared with the true } B = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}.$$

Page 27: Hidden Markov Models with applications to speech recognition


More General Elements of an HMM

N: number of states
M: number of observation symbols
A = [aij]: N × N state transition probability matrix
B = [bj(m)]: N × M observation probability matrix
Π = [πi]: N × 1 initial state probability vector
λ = (A, B, Π): parameter set of the HMM
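
Gathering these elements into one object keeps the later algorithms tidy; a minimal sketch (the class name and layout are illustrative, not from the slides):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    A: np.ndarray    # N x N state transition probabilities
    B: np.ndarray    # N x M observation (emission) probabilities
    pi: np.ndarray   # initial state probabilities (length N)

    def __post_init__(self):
        N, M = self.B.shape
        assert self.A.shape == (N, N) and self.pi.shape == (N,)
        assert np.allclose(self.A.sum(axis=1), 1) and np.allclose(self.B.sum(axis=1), 1)
        assert np.isclose(self.pi.sum(), 1)
```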

Page 28: Hidden Markov Models with applications to speech recognition


Particle Evaluation

At stage t, simulate the new state for the j-th particle from its former state using the distribution
$$s_t^{(j)} \sim B\big(s_{t-1}^{(j)},\, :\big),$$
and weight the result by the probability of the current observation,
$$w_t^{(j)} \propto A\big(s_t^{(j)},\, O_t\big).$$
The resulting weight for the j-th particle is
$$W_t^{(j)} = w_t^{(j)}\, W_{t-1}^{(j)}.$$

We should use standard residual resampling. The result gets 50 percent accuracy. [Note: I haven't perfected good residual sampling.]
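
A sketch of this particle scheme in NumPy. It uses the standard convention (A = transitions, B = emissions) from the "Elements of an HMM" slide, and multinomial rather than residual resampling, so it is an approximation of the procedure described above rather than a faithful reimplementation:

```python
import numpy as np

def particle_filter(obs, A, B, pi, n_particles=500, seed=0):
    """Propagate particles with the transition matrix, weight each one by the
    emission probability of the current observation, resample, and keep a
    running state estimate (the most common particle)."""
    rng = np.random.default_rng(seed)
    n_states = len(pi)
    particles = rng.choice(n_states, size=n_particles, p=pi)
    estimates = []
    for o in obs:
        w = B[particles, o]                                    # weight by P(O_t | state)
        w = w / w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)   # multinomial resampling
        particles = particles[idx]
        estimates.append(np.bincount(particles, minlength=n_states).argmax())
        # propagate each surviving particle one step through the chain
        particles = np.array([rng.choice(n_states, p=A[s]) for s in particles])
    return estimates
```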

Page 29: Hidden Markov Models with applications to speech recognition


Particle Results: based on 50 observations

Page 30: Hidden Markov Models with applications to speech recognition


Viterbi’s Algorithm

δt(i) ≡ maxq1q2∙∙∙qt-1 p(q1q2∙∙∙qt-1, qt = Si, O1∙∙∙Ot | λ)

Initialization: δ1(i) = πi bi(O1), ψ1(i) = 0

Recursion: δt(j) = maxi δt-1(i) aij bj(Ot), ψt(j) = argmaxi δt-1(i) aij

Termination: p* = maxi δT(i), qT* = argmaxi δT(i)

Path backtracking: qt* = ψt+1(qt+1*), t = T-1, T-2, ..., 1
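
The recursion above translates directly into code; a minimal NumPy sketch:

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most probable state path arg max_Q P(Q | O, lambda) and its probability."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # recursion
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()                     # termination
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]             # path backtracking
    return path, delta[-1].max()
```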

Page 31: Hidden Markov Models with applications to speech recognition


Viterbi learning versus the actual state (estimate =3; 62% accuracy)

Page 32: Hidden Markov Models with applications to speech recognition


General EM

At each step assume k states, with densities
$$p(x_t \mid \theta_1), \dots, p(x_t \mid \theta_k),$$
with p known and the θ's unknown. We use the terminology $Z_1, \dots, Z_t$ for the (unobserved) states.

Then the EM equations are (with the $p_s$'s the stationary probabilities of the states):
$$Q(\theta, \theta^{(i)}) = \sum_t \sum_{s=1}^{k} \log p(X_t \mid \theta_s)\; P(Z_t = s \mid X, \theta^{(i)});$$
$$P(Z_t = s \mid X, \theta^{(i)}) = \frac{p_s\, P(X_t \mid \theta_s^{(i)})}{\sum_{s'=1}^{k} p_{s'}\, P(X_t \mid \theta_{s'}^{(i)})}.$$

Page 33: Hidden Markov Models with applications to speech recognition


EM Equations

We have
$$\sum_t P(Z_t = s \mid X, \theta)\; \frac{\partial \log p(X_t \mid \theta_s)}{\partial \theta_s} = 0 \qquad (s = 1, \dots, k).$$

So, in the Poisson hidden case (with $N_t$ the observed count at time $t$) we have:
$$\hat{\theta}_s = \frac{\sum_t N_t\, P(Z_t = s \mid X, \theta)}{\sum_t P(Z_t = s \mid X, \theta)} \qquad (s = 1, \dots, k).$$
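
A one-iteration sketch of this update (the E step uses fixed state probabilities p, as in the Q function two slides back; the binomial case on the next slide is analogous):

```python
import numpy as np
from scipy.stats import poisson

def poisson_em_step(counts, lam, p):
    """One EM step for the Poisson hidden model:
    E step: resp[t, s] = P(Z_t = s | X_t) proportional to p_s * Poisson(X_t | lam_s)
    M step: lam_s = sum_t N_t * resp[t, s] / sum_t resp[t, s]."""
    lik = poisson.pmf(counts[:, None], lam[None, :]) * p[None, :]
    resp = lik / lik.sum(axis=1, keepdims=True)
    lam_new = (resp * counts[:, None]).sum(axis=0) / resp.sum(axis=0)
    return lam_new, resp
```

Here counts is the vector of observed counts N_t; in practice the step would be iterated until the means stabilize.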

Page 34: Hidden Markov Models with applications to speech recognition


Binomial hidden model

We have (with $n_t$ successes out of $N_t$ trials at time $t$, and $p_s$ the success probability of state $s$):
$$\hat{p}_s = \frac{\sum_t n_t\, P(Z_t = s \mid N_t, p)}{\sum_t N_t\, P(Z_t = s \mid N_t, p)} \qquad (s = 1, \dots, k),$$
where
$$P(Z_t = s \mid N_t, p) = \frac{\binom{N_t}{n_t}\, p_s^{\,n_t} (1 - p_s)^{N_t - n_t}}{\sum_{s'} \binom{N_t}{n_t}\, p_{s'}^{\,n_t} (1 - p_{s'})^{N_t - n_t}}.$$

Page 35: Hidden Markov Models with applications to speech recognition


Coin-Tossing Model

Coin 1: 0.2000 0.8000
Coin 2: 0.7000 0.3000
Coin 3: 0.5000 0.5000

State Matrix:
         C1     C2     C3
Coin 1 0.4000 0.3000 0.3000
Coin 2 0.2000 0.6000 0.2000
Coin 3 0.1000 0.1000 0.8000

Page 36: Hidden Markov Models with applications to speech recognition


Coin tossing model: results

Page 37: Hidden Markov Models with applications to speech recognition


Maximum Likelihood Model

The stationary distribution for the states is: 0.1818, 0.2727, 0.5455. Therefore, using a binomial hidden HMM we get:

Page 38: Hidden Markov Models with applications to speech recognition


MCMC approach

Update the posterior distributions for the parameters and the (unobserved) state variables.

$$P(\theta_i \mid X, Z) \;\propto\; \prod_{t=1}^{T} P(X_t \mid \theta_i)^{\,\mathbb{1}[Z_t = i]}; \qquad Z_s \;\sim\; P(Z_s \mid Z_{-s}, X, \theta).$$

Page 39: Hidden Markov Models with applications to speech recognition


Continuous Observations

Discrete:
$$P(O_t \mid q_t = S_j, \lambda) = \prod_{m=1}^{M} b_j(m)^{\,r_t^m}, \qquad r_t^m = \begin{cases} 1 & \text{if } O_t = v_m \\ 0 & \text{otherwise} \end{cases}$$

Gaussian mixture (discretize using k-means):
$$P(O_t \mid q_t = S_j, \lambda) = \sum_{l=1}^{L} P(G_l \mid q_t = S_j)\; p(O_t \mid q_t = S_j, G_l, \lambda)$$

Continuous:
$$P(O_t \mid q_t = S_j, \lambda) \sim \mathcal{N}(\mu_j, \sigma_j^2)$$

Use EM to learn the parameters, e.g.,
$$\hat{\mu}_j = \frac{\sum_t \gamma_t(j)\, O_t}{\sum_t \gamma_t(j)}$$
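
For the continuous (Gaussian) case, the re-estimation formula above is a posterior-weighted average. A small sketch, assuming the posteriors gamma[t, j] = P(q_t = S_j | O, λ) have already been computed (e.g. by forward-backward); the variance update is not on the slide but follows the same pattern:

```python
import numpy as np

def gaussian_updates(obs, gamma):
    """EM re-estimation of Gaussian emission parameters:
    mu_j = sum_t gamma_t(j) O_t / sum_t gamma_t(j), plus the matching variance."""
    weight = gamma.sum(axis=0)                                        # sum_t gamma_t(j)
    mu = (gamma * obs[:, None]).sum(axis=0) / weight
    var = (gamma * (obs[:, None] - mu[None, :]) ** 2).sum(axis=0) / weight
    return mu, var
```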

Page 40: Hidden Markov Models with applications to speech recognition


HMM with Input

Input-dependent observations:
$$P(O_t \mid q_t = S_j, x_t, \lambda) \sim \mathcal{N}\big(g_j(x_t \mid \theta_j), \sigma_j^2\big)$$

Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996):
$$P(q_{t+1} = S_j \mid q_t = S_i, x_t)$$

Time-delay input:
$$x_t = f(O_{t-\tau}, \dots, O_{t-1})$$

Page 41: Hidden Markov Models with applications to speech recognition


Model Selection in HMM

Left-to-right HMMs: transitions only go forward, so the transition matrix is upper triangular, e.g.
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & 0 \\ 0 & a_{22} & a_{23} & a_{24} \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{bmatrix}$$

In classification, for each class Ci, estimate P(O | λi) with a separate HMM and use Bayes' rule:
$$P(\lambda_i \mid O) = \frac{P(O \mid \lambda_i)\, P(\lambda_i)}{\sum_j P(O \mid \lambda_j)\, P(\lambda_j)}$$
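
A sketch of this classification rule, with one (A, B, pi) triple per class and the standard forward recursion used for each likelihood P(O | λi); the class priors are assumed given:

```python
import numpy as np

def classify(obs, class_models, priors):
    """Posterior P(lambda_i | O) proportional to P(O | lambda_i) P(lambda_i),
    with one HMM (A, B, pi) per class."""
    liks = []
    for A, B, pi in class_models:
        alpha = pi * B[:, obs[0]]              # forward recursion for P(O | lambda_i)
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
        liks.append(alpha.sum())
    post = np.array(liks) * np.array(priors)
    return post / post.sum()                   # Bayes' rule
```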