Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Design and Implementation of
Speech Recognition Systems
Spring 2011
Class 8: HMMs
14 Feb 2011
14 Feb 2011 1
Recap: Generalized Templates
• A set of “states”
– A distance function associated with each state
• A set of transitions
– Transition-specific penalties
14 Feb 2011 2
T11 T22 T33
T12 T23 d1(x) d2(x) d3(x)
• Identical to generalized templates in principle
• “Distance” functions at states replaced by “probability distribution
function” for state
• Transition “penalties” replaced by transition probabilities
• Maximize probability of observation
– Instead of minimizing cost
• The entire structure may be viewed as one generalization of the
DTW models we have discussed thus far
Recap: HMMs
T11 T22 T33
T12 T23
T13
3
14 Feb 2011
The HMM Process• The HMM models the process underlying the observations
as going through a number of states
– E.g., to produce the sound “W”, it first goes through a state where it
produces the sound “UH”, then goes into a state where it transitions
from “UH” to “AH”, and finally to a state where it produced “AH”
• The true underlying process is the vocal tract here
– Which roughly goes from the configuration for “UH” to the
configuration for “AH”
UH
W AH
4
HMMs are abstractions• The states are not directly observed
– Here states of the process are analogous to configurations of the vocal tract that
produces the signal
– We only hear the speech; we do not see the vocal tract
– i.e. the states are hidden
• The interpretation of states is not always obvious
– The vocal tract actually goes through a continuum of configurations
– The model represents all of these using only a fixed number of states
• The model abstracts the process that generates the data
– The system goes through a finite number of states
– When in any state it can either remain at that state, or go to another with some
probability
– When at any states it generates observations according to a distribution
associated with that state
14 Feb 2011 5
• A Hidden Markov Model consists of two components
– A state/transition backbone that specifies how many states there are, and how they can follow one another
– A set of probability distributions, one for each state, which specifies the distribution of all vectors in that state
Hidden Markov Models
• This can be factored into two separate probabilistic entities– A probabilistic Markov chain with states and transitions
– A set of data probability distributions, associated with the states
Markov chain
Data distributions
6
Equivalence to DTW templates
• HMM – inference equivalent to DTW modified to use a
probabilistic function, for the local node or edge “costs” in the
trellis
– Edges have transition probabilities
– Nodes have output or observation probabilities
• They provide the probability of the observed input
• The output probability may be a Gaussian
– Goal is to find the template with highest probability of matching the
input
• Probability values associated with transitions and edges are
called likelihoods
14 Feb 2011 7
Likelihoods and Cost: Transition
• Transitions in the HMM have associated probabilities
– P11, P12 etc
• They can be converted to “scores” through a logarithm
– T11 = log(P11)
• Or to “costs” through a negative logarithm
– T11 = -log(P11)
Markov chain
P11 P22 P33
P12 P23
P13
DTW Template
T11 T22 T33
T12 T23
T13
14 Feb 2011 8
Likelihoods and Cost: Nodes
• States in the HMM have probability distributions associated with them
– E.g Gaussians
• Whose means and variances have been obtained from the segments associated
with the node
• Nodes in the trellis have a probabilities associated with them
– Pi(O)
– i is the “state” / template node
– O is the observation associated with any node in the trellis
• Node probabilities may be converted to:
– Scores: Ni(O) = log(Pi(O))
– Or Costs: Ni(O) = - log(Pi(O))
Data distributionsP1(X) P2(X) P3(X)
14 Feb 2011 9
Log Likelihoods
• Use probabilities or likelihoods instead of cost
– Scores combines multiplicatively along a path
– Path Score = P1(O1) .P11 .P1(O2) .P12 .P2(O3) .P22 .P2(O4) .P23 .P3(O5) .P23
Alternately use log probabilities as scores: Ni(O) = log(Pi(O)), T11 = log(P11)
– Scores add as in DTW
– Path Score = N1(O1) + T11 + N1(O2) + T12 + N2(O3) + T22 + N2(O4) + T23 + N3(O5) + T23
• Replace all “Min” operations in DTW by “Max”
• Alternately use negative log probabilities as cost: Ni(O) = log(Pi(O)), T11 = -log(P11)
– Cost adds as in DTW
– Computation remains identical to DTW (with edge costs factored in)
P1(O1) P1(O2)
P2(O3)
P2(O4)
P3(O5)
P11
P12
P22
P23
P34
10
• An HMM is a statistical model for a time-varying process
• The process is always in one of a countable number of
• When the process visits in any state, it generates an observation
by a random draw from a distribution associated with that state
• The process constantly moves from state to state. The
probability that the process will move to any state is
determined solely by the current state
– i.e. the dynamics of the process are Markovian
• The entire model represents a probability distribution over the
sequence of observations
– It has a specific probability of generating any particular sequence
– The probabilities of all possible observation sequences sums to 1
HMM as a statistical model
14 Feb 2011 11
HMM assumed to be
generating data
How an HMM models a process
state
distributions
state
sequence
observation
sequence
14 Feb 2011 12
HMM Parameters
• The topology of the HMM
– No. of states and allowed transitions
– E.g. here we have 3 states and
cannot go from the blue state to the
red
• The transition probabilities
– Often represented as a matrix as
here
– Tij is the probability that when in
state i, the process will move to j
• The probability of being at a
particular state at the first instant
• The state output distributions
0.60.4 0.7
0.3
0.5
0.5
5.05.
3.7.0
04.6.
T
13
HMM state output distributions
• The state output distribution represents the distribution of
data produced from any state
• In the previous lecture we assume the state output
distribution to be Gaussian
• Albeit largely in a DTW context
• In reality, the distribution of vectors for any state need not
be Gaussian
In the most general case it can be arbitrarily complex
The Gaussian is only a coarse representation of this distribution
• If we model the output distributions of states better, we can
expect the model to be a better representation of the data
)()(5.0 1
2
1),;()(
mvCmv T
eC
CmvGaussianvP
14
14 Feb 2011
Node Score: The Gaussian Distribution
• What does a Gaussian distribution look like?
• For a single (scalar) variable, it is a bell-shaped curve representing the density of data around the mean
• Example:
Four different scalar Gaussian distributions, with different means and variances
The mean is represented by m, and
variance by s2
m and s are the parameters of the Gaussian distribution(Taken from Wikipedia)
den
sity
data
15
14 Feb 2011
The Scalar Gaussian Function
• The Gaussian density function (the bell curve) is:
• p(x) is the density function of the variable x, with mean m and
variance s2
• The attraction of the Gaussian function (regardless of how
appropriate it is!) comes from how easily the mean and
variance can be estimated from sample data x1, x2, x3 … xN
– m = Si xi/N
– s2 = Si (xi – m)2/N =Si (xi2 – m2)/N
16
The 2-D Gaussian Distribution
• Speech data are not scalar values, but vectors!
• Needs multi-variate (multi-dimensional) Gaussians
• Figure: A Gaussian for 2-D data– Shown as a 3-D plot
• Distributions for higher dimensions are tough to visualize!
14 Feb 2011 17
14 Feb 2011
The Multidimensional Gaussian Distribution
• Instead of variance, the multidimensional Gaussian has a covariance matrix
• The multi-dimensional Gaussian distribution of a vector variable x with mean m and covariance S is given by:
– where N is the vector dimensionality, and det is the determinant function
• The complexity in a full multi-dimensional Gaussian distribution comes from the covariance matrix, which accounts for dependenciesbetween the dimensions
18
The Diagonal Covariance Matrix
• In speech recognition, we frequently assume that the feature vector
dimensions are all independent of each other
• Result: The covariance matrix is reduced to a diagonal form
– The exponential term becomes, simply:
(Si (xi – mi)2/si2)/2, i going over all vector dimensions
– The determinant of the diagonal S matrix is easy to compute
• Further, each si2 (the i-th digonal element in the covariance matrix) is easily
estimated from xi and mi like a scalar
Full covariance:all elements arenon-zero
-0.5(x-m)TC-1(x-m)
Diagonal covariance:off-diagonal elementsare zero
Si (xi-mi)2 / 2si
2
19
Gaussian Mixtures
• A Gaussian Mixture is literally a mixture of Gaussians. It is
a weighted combination of several Gaussian distributions
• v is any data vector. P(v) is the probability given to that vector by the
Gaussian mixture
• K is the number of Gaussians being mixed
• wi is the mixture weight of the ith Gaussian. mi is its mean and Ci is
its covariance
• The Gaussian mixture distribution is also a distribution
• It is positive everywhere.
• The total volume under a Gaussian mixture is 1.0.
• Constraint: the mixture weights wi must all be positive and sum to 1
1
0
),;()(K
i
iii CvGaussianwvP m
20
14 Feb 2011
Gaussian Mixtures
• A Gaussian mixture can represent data distributions
far better than a simple Gaussian
• The two panels show the histogram of an unknown
random variable
• The first panel shows how it is modeled by a
simple Gaussian
• The second panel models the histogram by a
mixture of two Gaussians
• Caveat: It is hard to know the optimal number of
Gaussians in a mixture distribution for any random
variable
21
14 Feb 2011
Generating an observation from a
Gaussian mixture state distribution
First draw the identity of the
Gaussian from the a priori
probability distribution of
Gaussians (mixture weights)
Then draw a vector from
the selected Gaussian
22
14 Feb 2011
• The parameters of an HMM with Gaussian mixture
state distributions are:
– the set of initial state probabilities for all states
– T the matrix of transition probabilities
– A Gaussian mixture distribution for every state in the
HMM. The Gaussian mixture for the ith state is
characterized by
• Ki, the number of Gaussians in the mixture for the ith state
• The set of mixture weights wi,j 0<j<Ki
• The set of Gaussian means mi,j 0 <j<Ki
• The set of Covariance matrices Ci,j 0 < j <Ki
HMMs with Gaussian mixture state distributions
23
14 Feb 2011
Three Basic HMM Problems
• Given an HMM:
– What is the probability that it will generate a specific observation sequence
– Given a observation sequence, how do we determine which observation was generated from which state
• The state segmentation problem
– How do we learn the parameters of the HMM from observation sequences
24
14 Feb 2011
Computing the Probability of an
Observation Sequence
• Two aspects to producing the observation:
– Progressing through a sequence of states
– Producing observations from these states
25
14 Feb 2011
HMM assumed to be
generating data
Progressing through states
state
sequence
• The process begins at some state (red) here
• From that state, it makes an allowed transition
– To arrive at the same or any other state
• From that state it makes another allowed transition
– And so on
26
14 Feb 2011
Probability that the HMM will follow a
particular state sequence
• P(s1) is the probability that the process will initially be in
state s1
• P(si | sj) is the transition probability of moving to state si at
the next time instant when the system is currently in sj
– Also denoted by Pij earlier
– Related to edge scores in DTW as Tij = -log(P(si | sj))
P s s s P s P s s P s s( , , ,...) ( ) ( | ) ( | )...1 2 3 1 2 1 3 2
27
14 Feb 2011
HMM assumed to be
generating data
Generating Observations from States
state
distributions
state
sequence
observation
sequence
• At each time it generates an observation from the state it is in at that time
28
P o o o s s s P o s P o s P o s( , , ,...| , , ,...) ( | ) ( | ) ( | )...1 2 3 1 2 3 1 1 2 2 3 3
• P(oi | si) is the probability of generating observation
oi when the system is in state si
• Related to node scores in DTW trellis as:
Ni(O) = -log(P(oi | si))
Probability that the HMM will generate a
particular observation sequence given a
state sequence (state sequence known)
Computed from the Gaussian or Gaussian mixture for state s1
14 Feb 2011 29
14 Feb 2011
HMM assumed to be
generating data
Progressing through States and Producing
Observations
state
distributions
state
sequence
observation
sequence
• At each time it produces an observation and makes a transition
30
14 Feb 2011
Probability that the HMM will generate a
particular state sequence and, from it,
generate a particular observation sequence
P o s P o s P o s P s P s s P s s( | ) ( | ) ( | )... ( ) ( | ) ( | )...1 1 2 2 3 3 1 2 1 3 2
P o o o s s s( , , ,..., , , ,...)1 2 3 1 2 3
P o o o s s s P s s s( , , ,...| , , ,...) ( , , ,...)1 2 3 1 2 3 1 2 3
31
14 Feb 2011
Probability of Generating an Observation
Sequence
P o s P o s P o s P s P s s P s sall possible
state sequences
( | ) ( | ) ( | )... ( ) ( | ) ( | )....
.
1 1 2 2 3 3 1 2 1 3 2
P o o o s s sall possible
state sequences
( , , ,..., , , ,...).
.
1 2 3 1 2 3P o o o( , , ,...)
1 2 3
• If only the observation is known, the precise state sequence followed to produce it is not known
• All possible state sequences must be considered
32
14 Feb 2011
Computing it Efficiently
• Explicit summing over all state sequences is not
efficient
– A very large number of possible state sequences
– For long observation sequences it may be intractable
• Fortunately, we have an efficient algorithm for this:
The forward algorithm
• At each time, for each state compute the total
probability of all state sequences that generate
observations until that time and end at that state
33
14 Feb 2011
Illustrative Example
• Consider a generic HMM with 5 states and a “terminating
state”. We wish to find the probability of the best state
sequence for an observation sequence assuming it was
generated by this HMM
– P(si) = 1 for state 1 and 0 for others
– The arrows represent transition for which the probability is not 0.
P(si | sj) = aij
– We sometimes also represent the state output probability of si as P(ot |
si) = bi(t) for brevity
91
34
14 Feb 2011
Diversion: The HMM Trellis
Feature vectors
(time)
u
s t( , )
Sta
te inde
x
t-1 t
s
• The trellis is a graphical representation of all possible state sequences through the
HMM to produce a given observation
– Analogous to the DTW search graph / trellis
• The Y-axis represents HMM states, X axis represents observations
• Edges in trellis represent valid transitions in the HMM over a single time step
• Every node represents the event of a particular observation being generated from
a particular state
35
14 Feb 2011
The Forward Algorithm
u u u u t
s t P x x x state t s( , ) ( , ,..., , ( ) | ), , ,
1 2
time
u
s t( , )
Sta
te inde
x
t-1 t
s
• u(s,t) is the total probability of ALL state sequences that end at state s at time t, and all observations until xt
36
14 Feb 2011
u u
su t
s t s t P s s P x s( , ) ( , ) ( | ) ( | ),
1
u u u u t
s t P x x x state t s( , ) ( , ,..., , ( ) | ), , ,
1 2
timeu
t( , )1 1
u
s t( , )1
t-1 t
Can be recursively
estimated starting
from the first time
instant
(forward recursion)s
us t( , )
Sta
te inde
x
• u(s,t) can be recursively computed in terms of
u(s’,t’), the forward probabilities at time t-1
The Forward Algorithm
37
14 Feb 2011
time
Sta
te inde
x
T
• In the final observation the alpha at each state gives the probability of all state sequences ending at that state
• The total probability of the observation is the sum of the alpha values at all states
s
u TsTotalprob ),(
The Forward Algorithm
38
14 Feb 2011
Problem 2: The state segmentation problem
• Given only a sequence of observations, how do
we determine which sequence of states was
followed in producing it?
39
14 Feb 2011
HMM assumed to be
generating data
The HMM as a generator
state
distributions
state
sequence
observation
sequence
• The process goes through a series of states and produces observations from them
40
14 Feb 2011
HMM assumed to be
generating data
States are Hidden
state
distributions
state
sequence
observation
sequence
• The observations do not reveal the underlying state
41
14 Feb 2011
HMM assumed to be
generating data
The state segmentation problem
state
distributions
state
sequence
observation
sequence
• State segmentation: Estimate state sequence given observations
42
14 Feb 2011
Estimating the State Sequence
• Any number of state sequences could have been
traversed in producing the observation
– In the worst case every state sequence may have produced
it
• Solution: Identify the most probable state sequence
– The state sequence for which the probability of progressing
through that sequence and gen erating the observation
sequence is maximum
– i.e is maximumP o o o s s s( , , ,..., , , ,...)1 2 3 1 2 3
43
• Once again, exhaustive evaluation is impossibly
expensive
• But once again a simple dynamic-programming
solution is available
• Needed:
14 Feb 2011
Estimating the state sequence
P o s P o s P o s P s P s s P s s( | ) ( | ) ( | )... ( ) ( | ) ( | )...1 1 2 2 3 3 1 2 1 3 2
P o o o s s s( , , ,..., , , ,...)1 2 3 1 2 3
)|()|()|()|()()|(maxarg 23331222111,...,, 321ssPsoPssPsoPsPsoPsss
44
14 Feb 2011
Estimating the state sequence
• Once again, exhaustive evaluation is impossibly
expensive
• But once again a simple dynamic-programming
solution is available
• Needed:
P o s P o s P o s P s P s s P s s( | ) ( | ) ( | )... ( ) ( | ) ( | )...1 1 2 2 3 3 1 2 1 3 2
P o o o s s s( , , ,..., , , ,...)1 2 3 1 2 3
)|()|()|()|()()|(maxarg 23331222111,...,, 321ssPsoPssPsoPsPsoPsss
45
14 Feb 2011
The state sequence
• The probability of a state sequence ?,?,?,?,sx,sy ending
at time t is simply
– P(?,?,?,?, sx ,sy) = P(?,?,?,?, sx ) P(ot|sy)P(sy|sx)
• The best state sequence that ends with sx,sy at t will
have a probability equal to the probability of the best
state sequence ending at t-1 at sx times P(ot|sy)P(sy|sx)
– Since the last term is independent of the state sequence
leading to sx at t-1
46
14 Feb 2011
Trellis
time
94
• The graph below shows the set of all possible state sequences through this HMM in five time intants
t47
14 Feb 2011
The cost of extending a state sequence
time
94
• The cost of extending a state sequence ending at sx is
only dependent on the transition from sx to sy, and the
observation probability at sy
t
sy
sx
48
14 Feb 2011
The cost of extending a state sequence
time
94
• The best path to sy through sx is simply an
extension of the best path to sx
t
sy
sx
49
14 Feb 2011
The Recursion
• The overall best path to sx is an extension of
the best path to one of the states at the previous
time
timet
sy
sx
50
14 Feb 2011
The Recursion
• Bestpath prob(sy,t) =
Best (Bestpath prob(s?,t) * P(sy | s?) * P(ot|sy))
timet
sy
sx
51
14 Feb 2011
Finding the best state sequence
• This gives us a simple recursive formulation to find the overall
best state sequence:
1. The best state sequence X1,i of length 1 ending at state si is
simply si.
– The probability C(X1,i) of X1,i is P(o1 | si) P(si)
2. The best state sequence of length t+1 is simply given by
– (argmax Xt,iC(Xt,i)P(ot+1 | sj) P(sj | si)) si
3. The best overall state sequence for an utterance of length T is
given by
argmax Xt,i sjC(XT,i)
– The state sequence of length T with the highest overall probability
89
52
14 Feb 2011
Finding the best state sequence
• The simple algorithm just presented is called the VITERBI
algorithm in the literature
– After A.J.Viterbi, who invented this dynamic programming algorithm for a
completely different purpose: decoding error correction codes!
• The Viterbi algorithm can also be viewed as a breadth-first graph
search algorithm
– The HMM forms the Y axis of a 2-D plane
• Edge costs of this graph are transition probabilities P(s|s). Node costs are P(o|s)
– A linear graph with every node at a time step forms the X axis
– A trellis is a graph formed as the crossproduct of these two graphs
– The Viterbi algorithm finds the best path through this graph
90
53
14 Feb 2011
Viterbi Search (contd.)
timeInitial state initialized with path-score = P(s1)b1(1)
All other states have score 0 since P(si) = 0 for them92
54
14 Feb 2011
Viterbi Search (contd.)
time
State with best path-score
State with path-score < best
State without a valid path-score
P (t)j
= max [P (t-1) a b (t)]i ij ji
Total path-score ending up at state j at time t
State transition probability, i to j
Score for state j, given the input at time t
93
55
14 Feb 2011
Viterbi Search (contd.)
time
94
P (t)j
= max [P (t-1) a b (t)]i ij ji
Total path-score ending up at state j at time t
State transition probability, i to j
Score for state j, given the input at time t
56
14 Feb 2011
Viterbi Search (contd.)
time
94
57
14 Feb 2011
Viterbi Search (contd.)
time
94
58
14 Feb 2011
Viterbi Search (contd.)
time
94
59
14 Feb 2011
Viterbi Search (contd.)
time
94
60
14 Feb 2011
Viterbi Search (contd.)
time
94
61
14 Feb 2011
Viterbi Search (contd.)
time
94
THE BEST STATE SEQUENCE IS THE ESTIMATE OF THE STATESEQUENCE FOLLOWED IN GENERATING THE OBSERVATION
62
14 Feb 2011
Viterbi and DTW
• The Viterbi algorithm is identical to the string-
matching procedure used for DTW that we
saw earlier
• It computes an estimate of the state sequence
followed in producing the observation
• It also gives us the probability of the best state
sequence
63
14 Feb 2011
Problem3: Training HMM parameters
• We can compute the probability of an observation,
and the best state sequence given an observation,
using the HMM’s parameters
• But where do the HMM parameters come from?
• They must be learned from a collection of
observation sequences
• We have already seen one technique for training
HMMs: The segmental K-means procedure
64
14 Feb 2011
• The entire segmental K-means algorithm:
1. Initialize all parameters
• State means and covariances
• Transition probabilities
• Initial state probabilities
2. Segment all training sequences
3. Reestimate parameters from segmented
training sequences
4. If not converged, return to 2
Modified segmental K-means AKA
Viterbi training
65
14 Feb 2011
Segmental K-means
T1 T2 T3 T4
The procedure can be continued until convergence
Convergence is achieved when the total best-alignment error for
all training sequences does not change significantly with further
refinement of the model
Initialize Iterate
66
14 Feb 2011
A Better Technique
• The Segmental K-means technique uniquely
assigns each observation to one state
• However, this is only an estimate and may be
wrong
• A better approach is to take a “soft” decision
– Assign each observation to every state with a
probability
67
68
Each vector belongs uniquely to a
segment
T1 T2 T3 T4
MODELS
Training by segmentation: Hard
Assignment
)(
)(
1
1
jsegmenti
i
jsegmenti
j xm
14 Feb 2011
69
Assignment is fractioned:
Every segment gets a piece of
every vector
T1 T2 T3 T4
MODELS
Training by segmentation:
Soft Assignment
vectorsAlli
iji
vectorsAlli
ji
j xff
,
,
1m
Means and variances are computed
from fractioned vectors
1
, segmentsallj
jif
Where do the fractions come from?14 Feb 2011
14 Feb 2011
),...,,,)(( 21 TxxxststateP
The “probability” of a state
• The probability assigned to any state s, for any
observation xt is the probability that the
process was at s when it generated xt
• We want to compute
• We will compute first
– This is the probability that the process visited s at
time t while producing the entire observation
),...,,,)((),...,,|)(( 2121 TT xxxststatePxxxststateP
70
14 Feb 2011
Probability of Assigning an Observation to a
State• The probability that the HMM was in a particular state s when
generating the observation sequence is the probability that it
followed a state sequence that passed through s at time t
s
timet
71
14 Feb 2011
Probability of Assigning an Observation to a
State• This can be decomposed into two multiplicative sections
– The section of the lattice leading into state s at time t and the section
leading out of it
s
timet
72
14 Feb 2011
Probability of Assigning an Observation to a
State• The probability of the red section is the total probability of all
state sequences ending at state s at time t
– This is simply (s,t)
– Can be computed using the forward algorithm
timet
s
73
14 Feb 2011
The forward algorithm
u u
su t
s t s t P s s P x s( , ) ( , ) ( | ) ( | ),
1
u u u u t
s t P x x x state t s( , ) ( , ,..., , ( ) | ), , ,
1 2
timeu
t( , )1 1
u
s t( , )1
t-1 t
Can be recursively
estimated starting
from the first time
instant
(forward recursion)s
us t( , )
Sta
te inde
x
represents the complete current set of HMM parameters
74
14 Feb 2011
The Future Paths• The blue portion represents the probability of all state
sequences that began at state s at time t
– Like the red portion it can be computed using a backward recursion
timet
75
14 Feb 2011
The Backward Recursion
u u t u t u T
s t P x x x state t s( , ) ( , ,..., | ( ) , ), , ,
1 2
u
s t( , )1
u
N t( , )1
t+1
s
t
u u
su t
s t s t P s s P x s( , ) ( , ) ( | ) ( | ),
1
1
u
s t( , )
Can be recursively
estimated starting
from the final time
time instant
(backward recursion)
time
• u(s,t) is the total probability of ALL state sequences that depart from s at time t, and all observations after xt
– (s,T) = 1 at the final time instant for all valid final states76
14 Feb 2011
The complete probability
t+1tt-1
s
u u u u u T
s t s t P x x x state t s( , ) ( , ) ( , ,..., , ( ) | ), , ,
1 2
timeu
t( , )1 1
u
s t( , )1 u
s t( , )1
u
N t( , )1
P state t su
( , ( ) | )X
77
14 Feb 2011
Posterior probability of a state
• The probability that the process was in state s
at time t, given that we have observed the data
is obtained by simple normalization
• This term is often referred to as the gamma
term and denoted by gs,t
P state t sP state t s
P state t s
s t s t
s t s tu
u
us
u u
u us
( ( ) | , )( , ( ) | )
( , ( ) | )
( , ) ( , )
( , ) ( , )
XX
X
78
14 Feb 2011
Update Rules
• Once we have the state probabilities (the
gammas) the update rules are obtained through
a simple modification of the formulae used for
segmental K-means
– This new learning algorithm is known as the
Baum-Welch learning procedure
• Case1: State output densities are Gaussians
79
14 Feb 2011
Update Rules
sxs
s xN
1m
u t
tsu
u t
tutsu
s
x
,,
,,,
g
g
m
sx
s
T
s
s
s xxN
C )()(1
mm
u t
tus
u t
s
T
stsu
s
xx
C,,
,, )()(
g
mmg
Segmental K-means Baum Welch
• A similar update formula reestimates transition probabilities• The initial state probabilities P(s) also have a similar update rule
80
14 Feb 2011
Case 2: State ouput densities are Gaussian
Mixtures
• When state output densities are Gaussian
mixtures, more parameters must be estimated
• The mixture weights ws,i, mean ms,i and
covariance Cs,i of every Gaussian in the
distribution of each state must be estimated
1
0
,,, ),;()|(K
i
isisis CxGaussianwsxP m
81
14 Feb 2011
Splitting the Gamma
g k s u t u
th
u tP state t s P k Gaussian state t s x
, , , ,( ( ) | , ) ( . | ( ) , , ) X
A posteriori probability that the tth vector was generated by the kth Gaussian of state s
Re-estimation of state
parameters
We split the gamma for any state among all the Gaussians at that state
82
14 Feb 2011
Splitting the Gamma among Gaussians
g k s u t u
th
u tP state t s P k Gaussian state t s x
, , , ,( ( ) | , ) ( . | ( ) , , ) X
A posteriori probability that the tth vector was generated by the kth Gaussian of state s
g
m m
m mk s u t u
k s d
k
u t k s k u t k s
k s d
k s
u t k s k u t k s
k
P state t s
w ex x
w ex x
T
T, , ,
,
, , , ,
,
,
, , , ,
( ( ) | , )
XC
C
C
C
1
2
1
2
1
2
1
2
1
1
83
14 Feb 2011
Updating HMM Parameters
~,
, , , ,
, , ,
mg
gk s
k s u t u ttu
k s u ttu
x
~~ ~
,
, , , , , , ,
, , ,
Ck s
k s u t u t k s u t k s
T
tu
k s u ttu
x x
g m m
g
~,
, , ,
, , ,
wk s
k s u ttu
j s u tjtu
g
g
• Note: Every observation contributes to the update of parametervalues of every Gaussian of every state
84
14 Feb 2011
Overall Training Procedure: Single Gaussian
PDF• Determine a topology for the HMM
• Initialize all HMM parameters
– Initialize all allowed transitions to have the same probability
– Initialize all state output densities to be Gaussians
• We’ll revisit initialization
1. Over all utterances, compute the “sufficient” statistics
2. Use update formulae to compute new HMM parameters
3. If the overall probability of the training data has not converged, return to step 1
u t
tutsu x ,,,gu t
tsu ,,g u t
s
T
stsu xx )()(,, mmg
85
14 Feb 2011
An Implementational Detail
• Step1 computes “buffers” over all utterance
• This can be split and parallelized
– U1, U2 etc. can be processed on separate machines
–
...21
,,,,,,,,, Uu t
tutsu
Uu t
tutsu
u t
tutsu xxx ggg
..21
,,,,,, Uu t
tsu
Uu t
tsu
u t
tsu ggg
..)()()()()()(21
,,,,,, Uu t
s
T
stsu
Uu t
s
T
stsu
u t
s
T
stsu xxxxxx mmgmmgmmg
1
,,
Uu t
tsug 1
,,,
Uu t
tutsu xg
1
)()(,,
Uu t
s
T
stsu xx mmg
2
,,
Uu t
tsug 2
,,,
Uu t
tutsu xg
2
)()(,,
Uu t
s
T
stsu xx mmg
Machine 1 Machine 2
86
14 Feb 2011
An Implementational Detail• Step2 aggregates and adds buffers before updating the models
~,
, , , ,
, , ,
mg
gk s
k s u t u ttu
k s u ttu
x
~~ ~
,
, , , , , , ,
, , ,
Ck s
k s u t u t k s u t k s
T
tu
k s u ttu
x x
g m m
g
~,
, , ,
, , ,
wk s
k s u ttu
j s u tjtu
g
g
...21
,,,,,,,,, Uu t
tutsu
Uu t
tutsu
u t
tutsu xxx ggg
..21
,,,,,, Uu t
tsu
Uu t
tsu
u t
tsu ggg
..)()()()()()(21
,,,,,, Uu t
s
T
stsu
Uu t
s
T
stsu
u t
s
T
stsu xxxxxx mmgmmgmmg
87
...21
,,,,,,,,, Uu t
tutsu
Uu t
tutsu
u t
tutsu xxx ggg
..21
,,,,,, Uu t
tsu
Uu t
tsu
u t
tsu ggg
..)()()()()()(21
,,,,,, Uu t
s
T
stsu
Uu t
s
T
stsu
u t
s
T
stsu xxxxxx mmgmmgmmg
14 Feb 2011
An Implementational Detail
• Step2 aggregates and adds buffers before updating the models
~,
, , , ,
, , ,
mg
gk s
k s u t u ttu
k s u ttu
x
~~ ~
,
, , , , , , ,
, , ,
Ck s
k s u t u t k s u t k s
T
tu
k s u ttu
x x
g m m
g
~,
, , ,
, , ,
wk s
k s u ttu
j s u tjtu
g
g
Computed bymachine 1
Computed bymachine 2
88
14 Feb 2011
Training for HMMs with Gaussian Mixture
State Output Distributions
• Gaussian Mixtures are obtained by splitting
1. Train an HMM with (single) Gaussian state output
distributions
2. Split the Gaussian with the largest variance
• Perturb the mean by adding and subtracting a small number
• This gives us 2 Gaussians. Partition the mixture weight of the
Gaussian into two halves, one for each Gaussian
• A mixture with N Gaussians now becomes a mixture of N+1
Gaussians
3. Iterate BW to convergence
4. If the desired number of Gaussians not obtained, return to 2
89
14 Feb 2011
Splitting a Gaussian
• The mixture weight w for the Gaussian gets shared as 0.5w by each of the two split Gaussians
m m
me me
90
Transition Probabilities
• How about transition probabilities in an HMM?
– “Hard” estimation – by counting, as for templates
– “Soft” estimation – need soft counts14 Feb 2011 91
T11 T22 T33
T12 T23 d1(x) d2(x) d3(x)
P11 P22 P33
P12 P23
P13
• We have seen how to compute transition penalties for templates
Transition penalties by counting
• 20 vectors in state 1– 16 are followed by vectors in state 1
– 4 are followed by vectors in state 2
• P11 = 16/20 = 0.8 T11 = -log(P11) = -log(0.8)
• P12 = 4/20 = 0.2 T12 = -log(P12) = -log(0.2)
14 Feb 2011 92
T11 T22 T33
T12 T23
Transitions by counting
• We found the best state sequence for each input
– And counted transitions
14 Feb 2011 93
Transitions by counting
• We found the best state sequence for each input
– And counted transitions
14 Feb 2011 94
Observation no. 6 contributed one to the count of occurrences of state 1Observations 6 and 7 contributed one to the count of transitions from state 1 to state 2
Probability of transitions
• P(transition state I state J =
– Count transitions(I,J) / count instances(I)
14 Feb 2011 95
Probability of transitions
• P(transition state I state J =
– Count transitions(I,J) / count instances(I)
– Count instances(1) = 20
14 Feb 2011 96
Probability of transitions
• P(transition state I state J =
– Count transitions(I,J) / count instances(I)
– Count instances(1) = 20
• Count transitions (1,1) = 16
• P (transition state 1 state 1) = 0.8
14 Feb 2011 97
Probability of transitions
• P(transition state I state J =
– Count transitions(I,J) / count instances(I)
– Count instances(1) = 20
• Count transitions (1,2) = 4
• P (transition state 1 state 2) = 0.2
14 Feb 2011 98
Transitions by soft counting
• Each observation pair contributes to every transition
– E.g. observations 6,7 contribute counts to all of the following:
• Transition (11), Transition (12), Transition(22), Transition(23), Transition(33)
14 Feb 2011 99
Transitions by soft counting
• Contribution of any transition to the count is the a posteriori probability of the count– This is a fraction
– The fractions for all possible transitions at any time sum to 114 Feb 2011 100
Transitions by soft counting
• Probability of a transition is the total probability of all paths that include the transition
14 Feb 2011 101
Transitions by soft counting
• The forward probability of the source state at taccounts for all incoming paths at time t
– including the t-th observation xt14 Feb 2011 102
t t+1
u(s , t)
Transitions by soft counting
• The backward probability of the destination state at t+1
accounts for all outgoing paths from the state at time t+1
– NOT including the t+1-th observation xt+114 Feb 2011 103
t t+1
u(s’ , t+1)
Transitions by soft counting
• The product of the forward probability of s at t and s’ at t+1 accounts
for all paths TO state s at t, and all paths FROM s’ at t+1
– But not the transition from s to s’ or the observation at t+1
14 Feb 2011 104
t t+1
u(s , t) u(s’ , t+1)
Transitions by soft counting
• By factoring in the transition probability and observation
probabilities, the total probability is obtained
14 Feb 2011 105
t t+1
u(s , t) u(s’ , t+1) P(s’|s) P(xt+1|s’)
From probability to a posteriori probability
• The a posteriori probability of a transition is
the ratio of its probability to the sum of all
transitions at the same time
14 Feb 2011 106
A posteriori probability of a transition
• Probability of a transition
14 Feb 2011 107
• A posteriori probability of a transition
)1,'()'|1()|'(),(),....,,')1(,)(( 21 tssxPssPtsxxxststateststateP utuN
',
21
211,',,,
),....,,')1(,)((
),....,,')1(,)((
SS
N
Ntstsu
xxxStstateStstateP
xxxststateststatePg
Estimate of Transition Probabilities
• Numberator is total “soft” count of transitions
from state s to s’
• Denumberator is total “soft” count of instances
of state s
14 Feb 2011 108
u t
tsu
u t
tstsu
ssP,,
1,',,,
)|'(g
g
14 Feb 2011
• Arithmetic underflow is a problem
Implementation of BW: underflow
u u
su t
s t s t P s s P x s( , ) ( , ) ( | ) ( | ),
1
• The alpha terms are a recursive product of probability terms
– As t increases, an increasingly greater number probability terms are factored into
the alpha
• All probability terms are less than 1
– State output probabilities are actually probability densities
– Probability density values can be greater than 1
– On the other hand, for large dimensional data, probability density values are usually
much less than 1
• With increasing time, alpha values decrease
• Within a few time instants, they underflow to 0
– Every alpha goes to 0 at some time t. All future alphas remain 0
– As the dimensionality of the data increases, alphas goes to 0 faster
probability termsprobability term
109
14 Feb 2011
• One method of avoiding underflow is to scale all alphas at each
time instant
– Scale with respect to the largest alpha to make sure the largest scaled alpha
is 1.0
– Scale with respect to the sum of the alphas to ensure that all alphas sum to
1.0
– Scaling constants must be appropriately considered when computing the
final probabilities of an observation sequence
• An alternate method: Compute alphas and betas in log
domain
– How? (Not obvious)
Underflow: Solution
110
14 Feb 2011
• Similarly, arithmetic underflow can occur during beta computation
Implementation of BW: underflow
• The beta terms are also a recursive product of probability terms and can
underflow
• Underflow can be prevented by
– Scaling: Divide all beta terms by a constant that prevents underflow
– By performing beta computation in the log domain (now? Not obvious..)
– QUESTION: HOW DOES SCALING AFFECT THE ESTIMATION OF
GAMMA TERMS
– For Gaussian parameter updates?
– For transition probability updates?
'
1, )'|()|'(log)1,'(),(s
tuuu sxPssPtsts
111
14 Feb 2011
Implementation of BW: pruning
• The forward backward computation can get very expensive
• Solution: Prune
• Pruning in the forward backward algorithm raises some additional
issues
• Pruning from forward pass can be employed by backward pass
• Convergence criteria and tests may be affected
• More later
s = pruned out
112
14 Feb 2011
Building a recognizer for isolated words
• Now have all necessary components to build
an HMM-based recognizer for isolated words
– Where each word is spoken by itself in isolation
– E.g. a simple application, where one may either
say “Yes” or “No” to a recognizer and it must
recognize what was said
113
14 Feb 2011
Isolated Word Recognition with HMMs
• Assuming all words are equally likely
• Training
– Collect a set of “training” recordings for each word
– Compute feature vector sequences for the words
– Train HMMs for each word
• Recognition:
– Compute feature vector sequence for test utterance
– Compute the forward probability of the feature vector sequence
from the HMM for each word
• Alternately compute the best state sequence probability using Viterbi
– Select the word for which this value is highest
114
14 Feb 2011
Issues
• What is the topology to use for the HMMs
– How many states
– What kind of transition structure
– If state output densities have Gaussian Mixtures:
how many Gaussians?
115
14 Feb 2011
HMM Topology• For speech a left-to-right topology works best
– The “Bakis” topology
– Note that the initial state probability P(s) is 1 for the 1st state and 0 for
others. This need not be learned
• States may be skipped
116
14 Feb 2011
Determining the Number of States
• How do we know the number of states to use for any word?
– We do not, really
– Ideally there should be at least one state for each “basic sound” within the word
• Otherwise widely differing sounds may be collapsed into one state
• The average feature vector for that state would be a poor representation
• For computational efficiency, the number of states should be small
– These two are conflicting requirements, usually solved by making some educated guesses
117
14 Feb 2011
Determining the Number of States
• For small vocabularies, it is possible to examine each word in
detail and arrive at reasonable numbers:
• For larger vocabularies, we may be forced to rely on some ad
hoc principles
– E.g. proportional to the number of letters in the word
• Works better for some languages than others
• Spanish and Indian languages are good examples where this works as
almost every letter in a word produces a sound
S O ME TH I NG
118
14 Feb 2011
How many Gaussians
• No clear answer for this either
• The number of Gaussians is usually a function
of the amount of training data available
– Often set by trial and error
– A minimum of 4 Gaussians is usually required for
reasonable recognition
119
14 Feb 2011
Implementation of BW: initialization of
alphas and betas
• Initialization for alpha: u(s,1) set to 0 for all
states except the first state of the model. u(s,1)
set to 1 for the first state
– All observations must begin at the first state
• Initialization for beta: u(s, T) set to 0 for all states
except the terminating state. u(s, t) set to 1 for
this state
– All observations must terminate at the final state
120
14 Feb 2011
Initializing State Output Density Parameters
1. Initially only a single Gaussian per state assumed
• Mixtures obtained by splitting Gaussians
2. For Bakis-topology HMMs, a good initialization is the “flat”
initialization
• Compute the global mean and variance of all feature vectors in all
training instances of the word
• Initialize all Gaussians (i.e all state output distributions) with this
mean and variance
• Their means and variances will converge to appropriate values
automatically with iteration
• Gaussian splitting to compute Gaussian mixtures takes care of the
rest
121
14 Feb 2011
Isolated word recognition: Final thoughts
• All relevant topics covered
– How to compute features from recordings of the words
• We will not explicitly refer to feature computation in future lectures
– How to set HMM topologies for the words
– How to train HMMs for the words
• Baum-Welch algorithm
– How to select the most probable HMM for a test instance
• Computing probabilities using the forward algorithm
• Computing probabilities using the Viterbi algorithm
– Which also gives the state segmentation
122
14 Feb 2011
Questions
• ?
123