Design and Implementation of Speech Recognition Systemsasr.cs.cmu.edu/spring2011/class8.14feb/class8.hmm.pdf · 14 Feb 2011 2 T 11 T 22 T 33 d T 12 T 23 1 (x) d 2 (x) d 3 (x) •

Design and Implementation of

Speech Recognition Systems

Spring 2011

Class 8: HMMs

14 Feb 2011

14 Feb 2011 1

Recap: Generalized Templates

• A set of “states”

– A distance function associated with each state

• A set of transitions

– Transition-specific penalties

14 Feb 2011 2

T11 T22 T33

T12 T23 d1(x) d2(x) d3(x)

• Identical to generalized templates in principle

• “Distance” functions at states replaced by “probability distribution

function” for state

• Transition “penalties” replaced by transition probabilities

• Maximize probability of observation

– Instead of minimizing cost

• The entire structure may be viewed as one generalization of the

DTW models we have discussed thus far

Recap: HMMs

T11 T22 T33

T12 T23

T13

3

14 Feb 2011

The HMM Process• The HMM models the process underlying the observations

as going through a number of states

– E.g., to produce the sound “W”, it first goes through a state where it

produces the sound “UH”, then goes into a state where it transitions

from “UH” to “AH”, and finally to a state where it produced “AH”

• The true underlying process is the vocal tract here

– Which roughly goes from the configuration for “UH” to the

configuration for “AH”

UH

W AH

4

HMMs are abstractions• The states are not directly observed

– Here states of the process are analogous to configurations of the vocal tract that

produces the signal

– We only hear the speech; we do not see the vocal tract

– i.e. the states are hidden

• The interpretation of states is not always obvious

– The vocal tract actually goes through a continuum of configurations

– The model represents all of these using only a fixed number of states

• The model abstracts the process that generates the data

– The system goes through a finite number of states

– When in any state it can either remain at that state, or go to another with some

probability

– When at any states it generates observations according to a distribution

associated with that state

14 Feb 2011 5

• A Hidden Markov Model consists of two components

– A state/transition backbone that specifies how many states there are, and how they can follow one another

– A set of probability distributions, one for each state, which specifies the distribution of all vectors in that state

Hidden Markov Models

• This can be factored into two separate probabilistic entities– A probabilistic Markov chain with states and transitions

– A set of data probability distributions, associated with the states

Markov chain

Data distributions

6

Equivalence to DTW templates

• HMM – inference equivalent to DTW modified to use a

probabilistic function, for the local node or edge “costs” in the

trellis

– Edges have transition probabilities

– Nodes have output or observation probabilities

• They provide the probability of the observed input

• The output probability may be a Gaussian

– Goal is to find the template with highest probability of matching the

input

• Probability values associated with transitions and edges are

called likelihoods

14 Feb 2011 7

Likelihoods and Cost: Transition

• Transitions in the HMM have associated probabilities

– P11, P12 etc

• They can be converted to “scores” through a logarithm

– T11 = log(P11)

• Or to “costs” through a negative logarithm

– T11 = -log(P11)

Markov chain

P11 P22 P33

P12 P23

P13

DTW Template

T11 T22 T33

T12 T23

T13

14 Feb 2011 8

Likelihoods and Cost: Nodes

• States in the HMM have probability distributions associated with them

– E.g Gaussians

• Whose means and variances have been obtained from the segments associated

with the node

• Nodes in the trellis have a probabilities associated with them

– Pi(O)

– i is the “state” / template node

– O is the observation associated with any node in the trellis

• Node probabilities may be converted to:

– Scores: Ni(O) = log(Pi(O))

– Or Costs: Ni(O) = - log(Pi(O))

Data distributionsP1(X) P2(X) P3(X)

14 Feb 2011 9

Log Likelihoods

• Use probabilities or likelihoods instead of cost

– Scores combines multiplicatively along a path

– Path Score = P1(O1) .P11 .P1(O2) .P12 .P2(O3) .P22 .P2(O4) .P23 .P3(O5) .P23

Alternately use log probabilities as scores: Ni(O) = log(Pi(O)), T11 = log(P11)

– Scores add as in DTW

– Path Score = N1(O1) + T11 + N1(O2) + T12 + N2(O3) + T22 + N2(O4) + T23 + N3(O5) + T23

• Replace all “Min” operations in DTW by “Max”

• Alternately use negative log probabilities as cost: Ni(O) = log(Pi(O)), T11 = -log(P11)

– Cost adds as in DTW

– Computation remains identical to DTW (with edge costs factored in)

P1(O1) P1(O2)

P2(O3)

P2(O4)

P3(O5)

P11

P12

P22

P23

P34

10

• An HMM is a statistical model for a time-varying process

• The process is always in one of a countable number of

• When the process visits in any state, it generates an observation

by a random draw from a distribution associated with that state

• The process constantly moves from state to state. The

probability that the process will move to any state is

determined solely by the current state

– i.e. the dynamics of the process are Markovian

• The entire model represents a probability distribution over the

sequence of observations

– It has a specific probability of generating any particular sequence

– The probabilities of all possible observation sequences sums to 1

HMM as a statistical model

14 Feb 2011 11

HMM assumed to be

generating data

How an HMM models a process

state

distributions

state

sequence

observation

sequence

14 Feb 2011 12

HMM Parameters

• The topology of the HMM

– No. of states and allowed transitions

– E.g. here we have 3 states and

cannot go from the blue state to the

red

• The transition probabilities

– Often represented as a matrix as

here

– Tij is the probability that when in

state i, the process will move to j

• The probability of being at a

particular state at the first instant

• The state output distributions

0.60.4 0.7

0.3

0.5

0.5

5.05.

3.7.0

04.6.

T

13

HMM state output distributions

• The state output distribution represents the distribution of

data produced from any state

• In the previous lecture we assume the state output

distribution to be Gaussian

• Albeit largely in a DTW context

• In reality, the distribution of vectors for any state need not

be Gaussian

In the most general case it can be arbitrarily complex

The Gaussian is only a coarse representation of this distribution

• If we model the output distributions of states better, we can

expect the model to be a better representation of the data

)()(5.0 1

2

1),;()(

mvCmv T

eC

CmvGaussianvP

14

14 Feb 2011

Node Score: The Gaussian Distribution

• What does a Gaussian distribution look like?

• For a single (scalar) variable, it is a bell-shaped curve representing the density of data around the mean

• Example:

Four different scalar Gaussian distributions, with different means and variances

The mean is represented by m, and

variance by s2

m and s are the parameters of the Gaussian distribution(Taken from Wikipedia)

den

sity

data

15

14 Feb 2011

The Scalar Gaussian Function

• The Gaussian density function (the bell curve) is:

• p(x) is the density function of the variable x, with mean m and

variance s2

• The attraction of the Gaussian function (regardless of how

appropriate it is!) comes from how easily the mean and

variance can be estimated from sample data x1, x2, x3 … xN

– m = Si xi/N

– s2 = Si (xi – m)2/N =Si (xi2 – m2)/N

16

The 2-D Gaussian Distribution

• Speech data are not scalar values, but vectors!

• Needs multi-variate (multi-dimensional) Gaussians

• Figure: A Gaussian for 2-D data– Shown as a 3-D plot

• Distributions for higher dimensions are tough to visualize!

14 Feb 2011 17

14 Feb 2011

The Multidimensional Gaussian Distribution

• Instead of variance, the multidimensional Gaussian has a covariance matrix

• The multi-dimensional Gaussian distribution of a vector variable x with mean m and covariance S is given by:

– where N is the vector dimensionality, and det is the determinant function

• The complexity in a full multi-dimensional Gaussian distribution comes from the covariance matrix, which accounts for dependenciesbetween the dimensions

18

The Diagonal Covariance Matrix

• In speech recognition, we frequently assume that the feature vector

dimensions are all independent of each other

• Result: The covariance matrix is reduced to a diagonal form

– The exponential term becomes, simply:

(Si (xi – mi)2/si2)/2, i going over all vector dimensions

– The determinant of the diagonal S matrix is easy to compute

• Further, each si2 (the i-th digonal element in the covariance matrix) is easily

estimated from xi and mi like a scalar

Full covariance:all elements arenon-zero

-0.5(x-m)TC-1(x-m)

Diagonal covariance:off-diagonal elementsare zero

Si (xi-mi)2 / 2si

2

19

Gaussian Mixtures

• A Gaussian Mixture is literally a mixture of Gaussians. It is

a weighted combination of several Gaussian distributions

• v is any data vector. P(v) is the probability given to that vector by the

Gaussian mixture

• K is the number of Gaussians being mixed

• wi is the mixture weight of the ith Gaussian. mi is its mean and Ci is

its covariance

• The Gaussian mixture distribution is also a distribution

• It is positive everywhere.

• The total volume under a Gaussian mixture is 1.0.

• Constraint: the mixture weights wi must all be positive and sum to 1

1

0

),;()(K

i

iii CvGaussianwvP m

20

14 Feb 2011

Gaussian Mixtures

• A Gaussian mixture can represent data distributions

far better than a simple Gaussian

• The two panels show the histogram of an unknown

random variable

• The first panel shows how it is modeled by a

simple Gaussian

• The second panel models the histogram by a

mixture of two Gaussians

• Caveat: It is hard to know the optimal number of

Gaussians in a mixture distribution for any random

variable

21

14 Feb 2011

Generating an observation from a

Gaussian mixture state distribution

First draw the identity of the

Gaussian from the a priori

probability distribution of

Gaussians (mixture weights)

Then draw a vector from

the selected Gaussian

22

14 Feb 2011

• The parameters of an HMM with Gaussian mixture

state distributions are:

– the set of initial state probabilities for all states

– T the matrix of transition probabilities

– A Gaussian mixture distribution for every state in the

HMM. The Gaussian mixture for the ith state is

characterized by

• Ki, the number of Gaussians in the mixture for the ith state

• The set of mixture weights wi,j 0<j<Ki

• The set of Gaussian means mi,j 0 <j<Ki

• The set of Covariance matrices Ci,j 0 < j <Ki

HMMs with Gaussian mixture state distributions

23

14 Feb 2011

Three Basic HMM Problems

• Given an HMM:

– What is the probability that it will generate a specific observation sequence

– Given a observation sequence, how do we determine which observation was generated from which state

• The state segmentation problem

– How do we learn the parameters of the HMM from observation sequences

24

14 Feb 2011

Computing the Probability of an

Observation Sequence

• Two aspects to producing the observation:

– Progressing through a sequence of states

– Producing observations from these states

25

14 Feb 2011

HMM assumed to be

generating data

Progressing through states

state

sequence

• The process begins at some state (red) here

• From that state, it makes an allowed transition

– To arrive at the same or any other state

• From that state it makes another allowed transition

– And so on

26

14 Feb 2011

Probability that the HMM will follow a

particular state sequence

• P(s1) is the probability that the process will initially be in

state s1

• P(si | sj) is the transition probability of moving to state si at

the next time instant when the system is currently in sj

– Also denoted by Pij earlier

– Related to edge scores in DTW as Tij = -log(P(si | sj))

P s s s P s P s s P s s( , , ,...) ( ) ( | ) ( | )...1 2 3 1 2 1 3 2

27

14 Feb 2011

HMM assumed to be

generating data

Generating Observations from States

state

distributions

state

sequence

observation

sequence

• At each time it generates an observation from the state it is in at that time

28

P o o o s s s P o s P o s P o s( , , ,...| , , ,...) ( | ) ( | ) ( | )...1 2 3 1 2 3 1 1 2 2 3 3

• P(oi | si) is the probability of generating observation

oi when the system is in state si

• Related to node scores in DTW trellis as:

Ni(O) = -log(P(oi | si))

Probability that the HMM will generate a

particular observation sequence given a

state sequence (state sequence known)

Computed from the Gaussian or Gaussian mixture for state s1

14 Feb 2011 29

14 Feb 2011

HMM assumed to be

generating data

Progressing through States and Producing

Observations

state

distributions

state

sequence

observation

sequence

• At each time it produces an observation and makes a transition

30

14 Feb 2011

Probability that the HMM will generate a

particular state sequence and, from it,

generate a particular observation sequence

P o s P o s P o s P s P s s P s s( | ) ( | ) ( | )... ( ) ( | ) ( | )...1 1 2 2 3 3 1 2 1 3 2

P o o o s s s( , , ,..., , , ,...)1 2 3 1 2 3

P o o o s s s P s s s( , , ,...| , , ,...) ( , , ,...)1 2 3 1 2 3 1 2 3

31

14 Feb 2011

Probability of Generating an Observation

Sequence

P o s P o s P o s P s P s s P s sall possible

state sequences

( | ) ( | ) ( | )... ( ) ( | ) ( | )....

.

1 1 2 2 3 3 1 2 1 3 2

P o o o s s sall possible

state sequences

( , , ,..., , , ,...).

.

1 2 3 1 2 3P o o o( , , ,...)

1 2 3

• If only the observation is known, the precise state sequence followed to produce it is not known

• All possible state sequences must be considered

32

14 Feb 2011

Computing it Efficiently

• Explicit summing over all state sequences is not

efficient

– A very large number of possible state sequences

– For long observation sequences it may be intractable

• Fortunately, we have an efficient algorithm for this:

The forward algorithm

• At each time, for each state compute the total

probability of all state sequences that generate

observations until that time and end at that state

33

14 Feb 2011

Illustrative Example

• Consider a generic HMM with 5 states and a “terminating

state”. We wish to find the probability of the best state

sequence for an observation sequence assuming it was

generated by this HMM

– P(si) = 1 for state 1 and 0 for others

– The arrows represent transition for which the probability is not 0.

P(si | sj) = aij

– We sometimes also represent the state output probability of si as P(ot |

si) = bi(t) for brevity

91

34

14 Feb 2011

Diversion: The HMM Trellis

Feature vectors

(time)

u

s t( , )

Sta

te inde

x

t-1 t

s

• The trellis is a graphical representation of all possible state sequences through the

HMM to produce a given observation

– Analogous to the DTW search graph / trellis

• The Y-axis represents HMM states, X axis represents observations

• Edges in trellis represent valid transitions in the HMM over a single time step

• Every node represents the event of a particular observation being generated from

a particular state

35

14 Feb 2011

The Forward Algorithm

u u u u t

s t P x x x state t s( , ) ( , ,..., , ( ) | ), , ,

1 2

time

u

s t( , )

Sta

te inde

x

t-1 t

s

• u(s,t) is the total probability of ALL state sequences that end at state s at time t, and all observations until xt

36

14 Feb 2011

u u

su t

s t s t P s s P x s( , ) ( , ) ( | ) ( | ),

1

u u u u t

s t P x x x state t s( , ) ( , ,..., , ( ) | ), , ,

1 2

timeu

t( , )1 1

u

s t( , )1

t-1 t

Can be recursively

estimated starting

from the first time

instant

(forward recursion)s

us t( , )

Sta

te inde

x

• u(s,t) can be recursively computed in terms of

u(s’,t’), the forward probabilities at time t-1


37

14 Feb 2011

time

Sta

te inde

x

T

• In the final observation the alpha at each state gives the probability of all state sequences ending at that state

• The total probability of the observation is the sum of the alpha values at all states

s

u TsTotalprob ),(


38

14 Feb 2011

Problem 2: The state segmentation problem

• Given only a sequence of observations, how do

we determine which sequence of states was

followed in producing it?

39

14 Feb 2011

HMM assumed to be

generating data

The HMM as a generator

state

distributions

state

sequence

observation

sequence

• The process goes through a series of states and produces observations from them

40

14 Feb 2011

HMM assumed to be

generating data

States are Hidden

state

distributions

state

sequence

observation

sequence

• The observations do not reveal the underlying state

41

14 Feb 2011

HMM assumed to be

generating data

The state segmentation problem

state

distributions

state

sequence

observation

sequence

• State segmentation: Estimate state sequence given observations

42

14 Feb 2011

Estimating the State Sequence

• Any number of state sequences could have been

traversed in producing the observation

– In the worst case every state sequence may have produced

it

• Solution: Identify the most probable state sequence

– The state sequence for which the probability of progressing

through that sequence and gen erating the observation

sequence is maximum

– i.e is maximumP o o o s s s( , , ,..., , , ,...)1 2 3 1 2 3

43

• Once again, exhaustive evaluation is impossibly

expensive

• But once again a simple dynamic-programming

solution is available

• Needed:

14 Feb 2011

Estimating the state sequence


P o o o s s s( , , ,..., , , ,...)1 2 3 1 2 3

)|()|()|()|()()|(maxarg 23331222111,...,, 321ssPsoPssPsoPsPsoPsss

44

14 Feb 2011

Estimating the state sequence

• Once again, exhaustive evaluation is impossibly

expensive

• But once again a simple dynamic-programming

solution is available

• Needed:


P o o o s s s( , , ,..., , , ,...)1 2 3 1 2 3

)|()|()|()|()()|(maxarg 23331222111,...,, 321ssPsoPssPsoPsPsoPsss

45

14 Feb 2011

The state sequence

• The probability of a state sequence ?,?,?,?,sx,sy ending

at time t is simply

– P(?,?,?,?, sx ,sy) = P(?,?,?,?, sx ) P(ot|sy)P(sy|sx)

• The best state sequence that ends with sx,sy at t will

have a probability equal to the probability of the best

state sequence ending at t-1 at sx times P(ot|sy)P(sy|sx)

– Since the last term is independent of the state sequence

leading to sx at t-1

46

14 Feb 2011

Trellis

time

94

• The graph below shows the set of all possible state sequences through this HMM in five time intants

t47

14 Feb 2011

The cost of extending a state sequence

time

94

• The cost of extending a state sequence ending at sx is

only dependent on the transition from sx to sy, and the

observation probability at sy

t

sy

sx

48

14 Feb 2011

The cost of extending a state sequence

time

94

• The best path to sy through sx is simply an

extension of the best path to sx

t

sy

sx

49

14 Feb 2011

The Recursion

• The overall best path to sx is an extension of

the best path to one of the states at the previous

time

timet

sy

sx

50

14 Feb 2011

The Recursion

• Bestpath prob(sy,t) =

Best (Bestpath prob(s?,t) * P(sy | s?) * P(ot|sy))

timet

sy

sx

51

14 Feb 2011

Finding the best state sequence

• This gives us a simple recursive formulation to find the overall

best state sequence:

1. The best state sequence X1,i of length 1 ending at state si is

simply si.

– The probability C(X1,i) of X1,i is P(o1 | si) P(si)

2. The best state sequence of length t+1 is simply given by

– (argmax Xt,iC(Xt,i)P(ot+1 | sj) P(sj | si)) si

3. The best overall state sequence for an utterance of length T is

given by

argmax Xt,i sjC(XT,i)

– The state sequence of length T with the highest overall probability

89

52

14 Feb 2011

Finding the best state sequence

• The simple algorithm just presented is called the VITERBI

algorithm in the literature

– After A.J.Viterbi, who invented this dynamic programming algorithm for a

completely different purpose: decoding error correction codes!

• The Viterbi algorithm can also be viewed as a breadth-first graph

search algorithm

– The HMM forms the Y axis of a 2-D plane

• Edge costs of this graph are transition probabilities P(s|s). Node costs are P(o|s)

– A linear graph with every node at a time step forms the X axis

– A trellis is a graph formed as the crossproduct of these two graphs

– The Viterbi algorithm finds the best path through this graph

90

53

14 Feb 2011

Viterbi Search (contd.)

timeInitial state initialized with path-score = P(s1)b1(1)

All other states have score 0 since P(si) = 0 for them92

54

14 Feb 2011


time

State with best path-score

State with path-score < best

State without a valid path-score

P (t)j

= max [P (t-1) a b (t)]i ij ji

Total path-score ending up at state j at time t

State transition probability, i to j

Score for state j, given the input at time t

93

55

14 Feb 2011


time

94

P (t)j

= max [P (t-1) a b (t)]i ij ji

Total path-score ending up at state j at time t

State transition probability, i to j

Score for state j, given the input at time t

56

14 Feb 2011


time

94

57

14 Feb 2011


time

94

58

14 Feb 2011


time

94

59

14 Feb 2011


time

94

60

14 Feb 2011


time

94

61

14 Feb 2011


time

94

THE BEST STATE SEQUENCE IS THE ESTIMATE OF THE STATESEQUENCE FOLLOWED IN GENERATING THE OBSERVATION

62

14 Feb 2011

Viterbi and DTW

• The Viterbi algorithm is identical to the string-

matching procedure used for DTW that we

saw earlier

• It computes an estimate of the state sequence

followed in producing the observation

• It also gives us the probability of the best state

sequence

63

14 Feb 2011

Problem3: Training HMM parameters

• We can compute the probability of an observation,

and the best state sequence given an observation,

using the HMM’s parameters

• But where do the HMM parameters come from?

• They must be learned from a collection of

observation sequences

• We have already seen one technique for training

HMMs: The segmental K-means procedure

64

14 Feb 2011

• The entire segmental K-means algorithm:

1. Initialize all parameters

• State means and covariances

• Transition probabilities

• Initial state probabilities

2. Segment all training sequences

3. Reestimate parameters from segmented

training sequences

4. If not converged, return to 2

Modified segmental K-means AKA

Viterbi training

65

14 Feb 2011

Segmental K-means

T1 T2 T3 T4

The procedure can be continued until convergence

Convergence is achieved when the total best-alignment error for

all training sequences does not change significantly with further

refinement of the model

Initialize Iterate

66

14 Feb 2011

A Better Technique

• The Segmental K-means technique uniquely

assigns each observation to one state

• However, this is only an estimate and may be

wrong

• A better approach is to take a “soft” decision

– Assign each observation to every state with a

probability

67

68

Each vector belongs uniquely to a

segment

T1 T2 T3 T4

MODELS

Training by segmentation: Hard

Assignment

)(

)(

1

1

jsegmenti

i

jsegmenti

j xm

14 Feb 2011

69

Assignment is fractioned:

Every segment gets a piece of

every vector

T1 T2 T3 T4

MODELS

Training by segmentation:

Soft Assignment

vectorsAlli

iji

vectorsAlli

ji

j xff

,

,

1m

Means and variances are computed

from fractioned vectors

1

, segmentsallj

jif

Where do the fractions come from?14 Feb 2011

14 Feb 2011

),...,,,)(( 21 TxxxststateP

The “probability” of a state

• The probability assigned to any state s, for any

observation xt is the probability that the

process was at s when it generated xt

• We want to compute

• We will compute first

– This is the probability that the process visited s at

time t while producing the entire observation

),...,,,)((),...,,|)(( 2121 TT xxxststatePxxxststateP

70

14 Feb 2011

Probability of Assigning an Observation to a

State• The probability that the HMM was in a particular state s when

generating the observation sequence is the probability that it

followed a state sequence that passed through s at time t

s

timet

71

14 Feb 2011


State• This can be decomposed into two multiplicative sections

– The section of the lattice leading into state s at time t and the section

leading out of it

s

timet

72

14 Feb 2011


State• The probability of the red section is the total probability of all

state sequences ending at state s at time t

– This is simply (s,t)

– Can be computed using the forward algorithm

timet

s

73

14 Feb 2011

The forward algorithm

u u

su t

s t s t P s s P x s( , ) ( , ) ( | ) ( | ),

1

u u u u t

s t P x x x state t s( , ) ( , ,..., , ( ) | ), , ,

1 2

timeu

t( , )1 1

u

s t( , )1

t-1 t

Can be recursively

estimated starting

from the first time

instant

(forward recursion)s

us t( , )

Sta

te inde

x

represents the complete current set of HMM parameters

74

14 Feb 2011

The Future Paths• The blue portion represents the probability of all state

sequences that began at state s at time t

– Like the red portion it can be computed using a backward recursion

timet

75

14 Feb 2011

The Backward Recursion

u u t u t u T

s t P x x x state t s( , ) ( , ,..., | ( ) , ), , ,

1 2

u

s t( , )1

u

N t( , )1

t+1

s

t

u u

su t

s t s t P s s P x s( , ) ( , ) ( | ) ( | ),

1

1

u

s t( , )

Can be recursively

estimated starting

from the final time

time instant

(backward recursion)

time

• u(s,t) is the total probability of ALL state sequences that depart from s at time t, and all observations after xt

– (s,T) = 1 at the final time instant for all valid final states76

14 Feb 2011

The complete probability

t+1tt-1

s

u u u u u T

s t s t P x x x state t s( , ) ( , ) ( , ,..., , ( ) | ), , ,

1 2

timeu

t( , )1 1

u

s t( , )1 u

s t( , )1

u

N t( , )1

P state t su

( , ( ) | )X

77

14 Feb 2011

Posterior probability of a state

• The probability that the process was in state s

at time t, given that we have observed the data

is obtained by simple normalization

• This term is often referred to as the gamma

term and denoted by gs,t

P state t sP state t s

P state t s

s t s t

s t s tu

u

us

u u

u us

( ( ) | , )( , ( ) | )

( , ( ) | )

( , ) ( , )

( , ) ( , )

XX

X

78

14 Feb 2011

Update Rules

• Once we have the state probabilities (the

gammas) the update rules are obtained through

a simple modification of the formulae used for

segmental K-means

– This new learning algorithm is known as the

Baum-Welch learning procedure

• Case1: State output densities are Gaussians

79

14 Feb 2011

Update Rules

sxs

s xN

1m

u t

tsu

u t

tutsu

s

x

,,

,,,

g

g

m

sx

s

T

s

s

s xxN

C )()(1

mm

u t

tus

u t

s

T

stsu

s

xx

C,,

,, )()(

g

mmg

Segmental K-means Baum Welch

• A similar update formula reestimates transition probabilities• The initial state probabilities P(s) also have a similar update rule

80

14 Feb 2011

Case 2: State ouput densities are Gaussian

Mixtures

• When state output densities are Gaussian

mixtures, more parameters must be estimated

• The mixture weights ws,i, mean ms,i and

covariance Cs,i of every Gaussian in the

distribution of each state must be estimated

1

0

,,, ),;()|(K

i

isisis CxGaussianwsxP m

81

14 Feb 2011

Splitting the Gamma

g k s u t u

th

u tP state t s P k Gaussian state t s x

, , , ,( ( ) | , ) ( . | ( ) , , ) X

A posteriori probability that the tth vector was generated by the kth Gaussian of state s

Re-estimation of state

parameters

We split the gamma for any state among all the Gaussians at that state

82

14 Feb 2011

Splitting the Gamma among Gaussians

g k s u t u

th

u tP state t s P k Gaussian state t s x

, , , ,( ( ) | , ) ( . | ( ) , , ) X

A posteriori probability that the tth vector was generated by the kth Gaussian of state s

g

m m

m mk s u t u

k s d

k

u t k s k u t k s

k s d

k s

u t k s k u t k s

k

P state t s

w ex x

w ex x

T

T, , ,

,

, , , ,

,

,

, , , ,

( ( ) | , )

XC

C

C

C

1

2

1

2

1

2

1

2

1

1

83

14 Feb 2011

Updating HMM Parameters

~,

, , , ,

, , ,

mg

gk s

k s u t u ttu

k s u ttu

x

~~ ~

,

, , , , , , ,

, , ,

Ck s

k s u t u t k s u t k s

T

tu

k s u ttu

x x

g m m

g

~,

, , ,

, , ,

wk s

k s u ttu

j s u tjtu

g

g

• Note: Every observation contributes to the update of parametervalues of every Gaussian of every state

84

14 Feb 2011

Overall Training Procedure: Single Gaussian

PDF• Determine a topology for the HMM

• Initialize all HMM parameters

– Initialize all allowed transitions to have the same probability

– Initialize all state output densities to be Gaussians

• We’ll revisit initialization

1. Over all utterances, compute the “sufficient” statistics

2. Use update formulae to compute new HMM parameters

3. If the overall probability of the training data has not converged, return to step 1

u t

tutsu x ,,,gu t

tsu ,,g u t

s

T

stsu xx )()(,, mmg

85

14 Feb 2011

An Implementational Detail

• Step1 computes “buffers” over all utterance

• This can be split and parallelized

– U1, U2 etc. can be processed on separate machines

–

...21

,,,,,,,,, Uu t

tutsu

Uu t

tutsu

u t

tutsu xxx ggg

..21

,,,,,, Uu t

tsu

Uu t

tsu

u t

tsu ggg

..)()()()()()(21

,,,,,, Uu t

s

T

stsu

Uu t

s

T

stsu

u t

s

T

stsu xxxxxx mmgmmgmmg

1

,,

Uu t

tsug 1

,,,

Uu t

tutsu xg

1

)()(,,

Uu t

s

T

stsu xx mmg

2

,,

Uu t

tsug 2

,,,

Uu t

tutsu xg

2

)()(,,

Uu t

s

T

stsu xx mmg

Machine 1 Machine 2

86

14 Feb 2011

An Implementational Detail• Step2 aggregates and adds buffers before updating the models

~,

, , , ,

, , ,

mg

gk s

k s u t u ttu

k s u ttu

x

~~ ~

,

, , , , , , ,

, , ,

Ck s


T

tu

k s u ttu

x x

g m m

g

~,

, , ,

, , ,

wk s

k s u ttu

j s u tjtu

g

g

...21

,,,,,,,,, Uu t

tutsu

Uu t

tutsu

u t

tutsu xxx ggg

..21

,,,,,, Uu t

tsu

Uu t

tsu

u t

tsu ggg

..)()()()()()(21

,,,,,, Uu t

s

T

stsu

Uu t

s

T

stsu

u t

s

T


87

...21

,,,,,,,,, Uu t

tutsu

Uu t

tutsu

u t

tutsu xxx ggg

..21

,,,,,, Uu t

tsu

Uu t

tsu

u t

tsu ggg

..)()()()()()(21

,,,,,, Uu t

s

T

stsu

Uu t

s

T

stsu

u t

s

T


14 Feb 2011

An Implementational Detail

• Step2 aggregates and adds buffers before updating the models

~,

, , , ,

, , ,

mg

gk s

k s u t u ttu

k s u ttu

x

~~ ~

,

, , , , , , ,

, , ,

Ck s


T

tu

k s u ttu

x x

g m m

g

~,

, , ,

, , ,

wk s

k s u ttu

j s u tjtu

g

g

Computed bymachine 1

Computed bymachine 2

88

14 Feb 2011

Training for HMMs with Gaussian Mixture

State Output Distributions

• Gaussian Mixtures are obtained by splitting

1. Train an HMM with (single) Gaussian state output

distributions

2. Split the Gaussian with the largest variance

• Perturb the mean by adding and subtracting a small number

• This gives us 2 Gaussians. Partition the mixture weight of the

Gaussian into two halves, one for each Gaussian

• A mixture with N Gaussians now becomes a mixture of N+1

Gaussians

3. Iterate BW to convergence

4. If the desired number of Gaussians not obtained, return to 2

89

14 Feb 2011

Splitting a Gaussian

• The mixture weight w for the Gaussian gets shared as 0.5w by each of the two split Gaussians

m m

me me

90

Transition Probabilities

• How about transition probabilities in an HMM?

– “Hard” estimation – by counting, as for templates

– “Soft” estimation – need soft counts14 Feb 2011 91

T11 T22 T33

T12 T23 d1(x) d2(x) d3(x)

P11 P22 P33

P12 P23

P13

• We have seen how to compute transition penalties for templates

Transition penalties by counting

• 20 vectors in state 1– 16 are followed by vectors in state 1

– 4 are followed by vectors in state 2

• P11 = 16/20 = 0.8 T11 = -log(P11) = -log(0.8)

• P12 = 4/20 = 0.2 T12 = -log(P12) = -log(0.2)

14 Feb 2011 92

T11 T22 T33

T12 T23

Transitions by counting

• We found the best state sequence for each input

– And counted transitions

14 Feb 2011 93

Transitions by counting

• We found the best state sequence for each input

– And counted transitions

14 Feb 2011 94

Observation no. 6 contributed one to the count of occurrences of state 1Observations 6 and 7 contributed one to the count of transitions from state 1 to state 2

Probability of transitions

• P(transition state I state J =

– Count transitions(I,J) / count instances(I)

14 Feb 2011 95




– Count instances(1) = 20

14 Feb 2011 96





• Count transitions (1,1) = 16

• P (transition state 1 state 1) = 0.8

14 Feb 2011 97





• Count transitions (1,2) = 4

• P (transition state 1 state 2) = 0.2

14 Feb 2011 98

Transitions by soft counting

• Each observation pair contributes to every transition

– E.g. observations 6,7 contribute counts to all of the following:

• Transition (11), Transition (12), Transition(22), Transition(23), Transition(33)

14 Feb 2011 99


• Contribution of any transition to the count is the a posteriori probability of the count– This is a fraction

– The fractions for all possible transitions at any time sum to 114 Feb 2011 100


• Probability of a transition is the total probability of all paths that include the transition

14 Feb 2011 101


• The forward probability of the source state at taccounts for all incoming paths at time t

– including the t-th observation xt14 Feb 2011 102

t t+1

u(s , t)


• The backward probability of the destination state at t+1

accounts for all outgoing paths from the state at time t+1

– NOT including the t+1-th observation xt+114 Feb 2011 103

t t+1

u(s’ , t+1)


• The product of the forward probability of s at t and s’ at t+1 accounts

for all paths TO state s at t, and all paths FROM s’ at t+1

– But not the transition from s to s’ or the observation at t+1

14 Feb 2011 104

t t+1

u(s , t) u(s’ , t+1)


• By factoring in the transition probability and observation

probabilities, the total probability is obtained

14 Feb 2011 105

t t+1

u(s , t) u(s’ , t+1) P(s’|s) P(xt+1|s’)

From probability to a posteriori probability

• The a posteriori probability of a transition is

the ratio of its probability to the sum of all

transitions at the same time

14 Feb 2011 106

A posteriori probability of a transition

• Probability of a transition

14 Feb 2011 107

• A posteriori probability of a transition

)1,'()'|1()|'(),(),....,,')1(,)(( 21 tssxPssPtsxxxststateststateP utuN

',

21

211,',,,

),....,,')1(,)((

),....,,')1(,)((

SS

N

Ntstsu

xxxStstateStstateP

xxxststateststatePg

Estimate of Transition Probabilities

• Numberator is total “soft” count of transitions

from state s to s’

• Denumberator is total “soft” count of instances

of state s

14 Feb 2011 108

u t

tsu

u t

tstsu

ssP,,

1,',,,

)|'(g

g

14 Feb 2011

• Arithmetic underflow is a problem

Implementation of BW: underflow

u u

su t

s t s t P s s P x s( , ) ( , ) ( | ) ( | ),

1

• The alpha terms are a recursive product of probability terms

– As t increases, an increasingly greater number probability terms are factored into

the alpha

• All probability terms are less than 1

– State output probabilities are actually probability densities

– Probability density values can be greater than 1

– On the other hand, for large dimensional data, probability density values are usually

much less than 1

• With increasing time, alpha values decrease

• Within a few time instants, they underflow to 0

– Every alpha goes to 0 at some time t. All future alphas remain 0

– As the dimensionality of the data increases, alphas goes to 0 faster

probability termsprobability term

109

14 Feb 2011

• One method of avoiding underflow is to scale all alphas at each

time instant

– Scale with respect to the largest alpha to make sure the largest scaled alpha

is 1.0

– Scale with respect to the sum of the alphas to ensure that all alphas sum to

1.0

– Scaling constants must be appropriately considered when computing the

final probabilities of an observation sequence

• An alternate method: Compute alphas and betas in log

domain

– How? (Not obvious)

Underflow: Solution

110

14 Feb 2011

• Similarly, arithmetic underflow can occur during beta computation

Implementation of BW: underflow

• The beta terms are also a recursive product of probability terms and can

underflow

• Underflow can be prevented by

– Scaling: Divide all beta terms by a constant that prevents underflow

– By performing beta computation in the log domain (now? Not obvious..)

– QUESTION: HOW DOES SCALING AFFECT THE ESTIMATION OF

GAMMA TERMS

– For Gaussian parameter updates?

– For transition probability updates?

'

1, )'|()|'(log)1,'(),(s

tuuu sxPssPtsts

111

14 Feb 2011

Implementation of BW: pruning

• The forward backward computation can get very expensive

• Solution: Prune

• Pruning in the forward backward algorithm raises some additional

issues

• Pruning from forward pass can be employed by backward pass

• Convergence criteria and tests may be affected

• More later

s = pruned out

112

14 Feb 2011

Building a recognizer for isolated words

• Now have all necessary components to build

an HMM-based recognizer for isolated words

– Where each word is spoken by itself in isolation

– E.g. a simple application, where one may either

say “Yes” or “No” to a recognizer and it must

recognize what was said

113

14 Feb 2011

Isolated Word Recognition with HMMs

• Assuming all words are equally likely

• Training

– Collect a set of “training” recordings for each word

– Compute feature vector sequences for the words

– Train HMMs for each word

• Recognition:

– Compute feature vector sequence for test utterance

– Compute the forward probability of the feature vector sequence

from the HMM for each word

• Alternately compute the best state sequence probability using Viterbi

– Select the word for which this value is highest

114

14 Feb 2011

Issues

• What is the topology to use for the HMMs

– How many states

– What kind of transition structure

– If state output densities have Gaussian Mixtures:

how many Gaussians?

115

14 Feb 2011

HMM Topology• For speech a left-to-right topology works best

– The “Bakis” topology

– Note that the initial state probability P(s) is 1 for the 1st state and 0 for

others. This need not be learned

• States may be skipped

116

14 Feb 2011

Determining the Number of States

• How do we know the number of states to use for any word?

– We do not, really

– Ideally there should be at least one state for each “basic sound” within the word

• Otherwise widely differing sounds may be collapsed into one state

• The average feature vector for that state would be a poor representation

• For computational efficiency, the number of states should be small

– These two are conflicting requirements, usually solved by making some educated guesses

117

14 Feb 2011

Determining the Number of States

• For small vocabularies, it is possible to examine each word in

detail and arrive at reasonable numbers:

• For larger vocabularies, we may be forced to rely on some ad

hoc principles

– E.g. proportional to the number of letters in the word

• Works better for some languages than others

• Spanish and Indian languages are good examples where this works as

almost every letter in a word produces a sound

S O ME TH I NG

118

14 Feb 2011

How many Gaussians

• No clear answer for this either

• The number of Gaussians is usually a function

of the amount of training data available

– Often set by trial and error

– A minimum of 4 Gaussians is usually required for

reasonable recognition

119

14 Feb 2011

Implementation of BW: initialization of

alphas and betas

• Initialization for alpha: u(s,1) set to 0 for all

states except the first state of the model. u(s,1)

set to 1 for the first state

– All observations must begin at the first state

• Initialization for beta: u(s, T) set to 0 for all states

except the terminating state. u(s, t) set to 1 for

this state

– All observations must terminate at the final state

120

14 Feb 2011

Initializing State Output Density Parameters

1. Initially only a single Gaussian per state assumed

• Mixtures obtained by splitting Gaussians

2. For Bakis-topology HMMs, a good initialization is the “flat”

initialization

• Compute the global mean and variance of all feature vectors in all

training instances of the word

• Initialize all Gaussians (i.e all state output distributions) with this

mean and variance

• Their means and variances will converge to appropriate values

automatically with iteration

• Gaussian splitting to compute Gaussian mixtures takes care of the

rest

121

14 Feb 2011

Isolated word recognition: Final thoughts

• All relevant topics covered

– How to compute features from recordings of the words

• We will not explicitly refer to feature computation in future lectures

– How to set HMM topologies for the words

– How to train HMMs for the words

• Baum-Welch algorithm

– How to select the most probable HMM for a test instance

• Computing probabilities using the forward algorithm

• Computing probabilities using the Viterbi algorithm

– Which also gives the state segmentation

122

14 Feb 2011

Questions

• ?

123

Documents

Design and Implementation of Speech Recognition Systemsasr.cs.cmu.edu/spring2011/class8.14feb/class8.hmm.pdf · 14 Feb 2011 2 T 11 T 22 T 33 d T 12 T 23 1 (x) d 2 (x) d 3 (x) •