
Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Christian Walder
Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")


Part XXII

Sequential Data 2


Example of a Hidden Markov Model

Assume Peter and Mary are students in Canberra and Sydney, respectively. Peter is a computer science student and is only interested in riding his bicycle, shopping for new computer gadgets, and studying. (He also does other things, but because those other activities don't depend on the weather we neglect them here.)

Mary does not know the current weather in Canberra, but she knows the general trends of the weather there. She also knows Peter well enough to know what he does on average on rainy days and also when the skies are blue.

She believes that the weather (rainy or not rainy) follows a given discrete Markov chain. She tries to guess the sequence of weather patterns for a number of days after Peter tells her on the phone what he did over the last days.


Example of a Hidden Markov Model

Mary uses the following HMM:

                                 rainy   sunny
  initial probability             0.2     0.8
  transition probability
    from rainy                    0.3     0.7
    from sunny                    0.4     0.6
  emission probability (given the state)
    cycle                         0.1     0.6
    shop                          0.4     0.3
    study                         0.5     0.1

Assume Peter tells Mary that the list of his activities in the last days was ['cycle', 'shop', 'study'].

(a) Calculate the probability of this observation sequence.
(b) Calculate the most probable sequence of hidden states for these observations.


Hidden Markov Model

The trellis for the hidden Markov model.
Find the probability of 'cycle', 'shop', 'study'.

[Trellis diagram: the states Rainy (initial probability 0.2) and Sunny (initial probability 0.8) unrolled over the three observations cycle, shop, study, with the transition probabilities 0.3/0.7 from Rainy and 0.4/0.6 from Sunny between consecutive steps, and the emission probabilities from the table above attached to each state at each step.]
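The forward (alpha) recursion over this trellis answers question (a). Below is a minimal NumPy sketch, not part of the original slides; the encodings (state 0 = rainy, 1 = sunny; symbols 0 = cycle, 1 = shop, 2 = study) are my own choice.

```python
import numpy as np

# HMM from the table above: state 0 = rainy, state 1 = sunny
pi = np.array([0.2, 0.8])                  # initial probabilities
A = np.array([[0.3, 0.7],                  # transitions: row = from state, column = to state
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],             # emissions given rainy: cycle, shop, study
              [0.6, 0.3, 0.1]])            # emissions given sunny
obs = [0, 1, 2]                            # 'cycle', 'shop', 'study'

# Forward recursion: alpha[k] = p(x_1, ..., x_n, z_n = k)
alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = B[:, x] * (alpha @ A)          # sum over the previous state, then emit
print("p(cycle, shop, study) =", alpha.sum())
```

With the numbers above, the sum of the final alpha values should come out at roughly 0.041.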


Hidden Markov Model

Find the most probable hidden states for 'cycle', 'shop', 'study'.

[The same trellis diagram as on the previous slide, now used to search for the most probable sequence of hidden states.]
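For question (b) the state space is tiny, so a brute-force search over all 2³ hidden paths is enough to check the answer; the sketch below (again not from the slides) reuses pi, A, B and obs from the previous snippet. The Viterbi algorithm at the end of this lecture performs the same search efficiently.

```python
from itertools import product

def joint(states, observations):
    """p(x_1..x_N, z_1..z_N) for one particular hidden path."""
    p = pi[states[0]] * B[states[0], observations[0]]
    for (prev, cur), x in zip(zip(states, states[1:]), observations[1:]):
        p *= A[prev, cur] * B[cur, x]
    return p

best = max(product([0, 1], repeat=len(obs)), key=lambda z: joint(z, obs))
names = ["rainy", "sunny"]
print([names[s] for s in best], joint(best, obs))
```

For this example the search should return the path (sunny, sunny, rainy).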


Homogeneous Hidden Markov Model

Joint probability distribution over both latent and observed variables:

p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi)

where X = (x_1, \dots, x_N), Z = (z_1, \dots, z_N), and θ = {π, A, φ}.

Most of the discussion will be independent of the particular choice of emission probabilities (e.g. discrete tables, Gaussians, mixtures of Gaussians).
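To make the factorisation concrete, here is a small sketch (not from the slides) that accumulates the joint log-probability term by term; log_emission is a hypothetical callback standing in for whatever emission model φ defines.

```python
import numpy as np

def log_joint(X, Z, pi, A, log_emission):
    """ln p(X, Z | theta) under the factorisation above.
    log_emission(x, k) must return ln p(x | z = k, phi) for the chosen emission model."""
    lp = np.log(pi[Z[0]]) + log_emission(X[0], Z[0])
    for n in range(1, len(X)):
        lp += np.log(A[Z[n - 1], Z[n]]) + log_emission(X[n], Z[n])
    return lp
```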


Maximum Likelihood for HMM

We have observed a data set X = (x_1, ..., x_N).
Assume it came from an HMM with a given structure (number of nodes, form of emission probabilities).
The likelihood of the data is

p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)

This joint distribution does not factorise over n (as it did for the mixture distribution).
We have N latent variables, each with K states: K^N terms.
The number of terms therefore grows exponentially with the length of the chain.
But we can use the conditional independence of the latent variables to reorder their calculation later.
A further obstacle to a closed-form maximum likelihood solution: calculating the emission probabilities for the different states z_n.
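A brute-force version of this sum (a sketch, assuming the log_joint helper from the previous snippet) makes the K^N cost explicit; it is only feasible for very short chains and motivates the forward-backward recursions below.

```python
from itertools import product
import numpy as np

def log_likelihood_bruteforce(X, K, pi, A, log_emission):
    """ln p(X | theta) by summing p(X, Z | theta) over all K**N hidden paths Z."""
    terms = [log_joint(X, Z, pi, A, log_emission)
             for Z in product(range(K), repeat=len(X))]   # K**N terms
    return np.logaddexp.reduce(terms)
```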


Maximum Likelihood for HMM - EM

Employ the EM algorithm to find the maximum likelihood solution for the HMM.

Start with some initial parameter settings θ^old.

E-step: Find the posterior distribution of the latent variables p(Z | X, θ^old).

M-step: Maximise

Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta)

with respect to the parameters θ = {π, A, φ}.



Maximum Likelihood for HMM - EM

Denote the marginal posterior distribution of z_n by γ(z_n), and the joint posterior distribution of two successive latent variables by ξ(z_{n-1}, z_n):

\gamma(z_n) = p(z_n \mid X, \theta^{\text{old}})

\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X, \theta^{\text{old}})

For each step n, γ(z_n) has K nonnegative values which sum to 1.
For each step n, ξ(z_{n-1}, z_n) has K × K nonnegative values which sum to 1.
Elements of these vectors are denoted by γ(z_nk) and ξ(z_{n-1,j}, z_nk) respectively.


Maximum Likelihood for HMM - EM

Because the expectation of a binary random variable is the probability that it is one, we get with this notation

\gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \sum_{z_n} \gamma(z_n) \, z_{nk}

\xi(z_{n-1,j}, z_{nk}) = \mathbb{E}[z_{n-1,j} \, z_{nk}] = \sum_{z_{n-1}, z_n} \xi(z_{n-1}, z_n) \, z_{n-1,j} \, z_{nk}

Putting it all together we get

Q(\theta, \theta^{\text{old}}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
  + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
  + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)


Maximum Likelihood for HMM - EM

M-step: Maximising

Q(\theta, \theta^{\text{old}}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
  + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
  + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

results in

\pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}

A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}
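A possible NumPy rendering of these two updates (a sketch; the array layout with gamma of shape (N, K) and xi of shape (N-1, K, K) is my own convention):

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """pi and A updates from gamma[n, k] = gamma(z_nk) and xi[n-1, j, k] = xi(z_{n-1,j}, z_nk)."""
    pi_new = gamma[0] / gamma[0].sum()
    A_new = xi.sum(axis=0)                       # numerator: sum over n = 2..N
    A_new /= A_new.sum(axis=1, keepdims=True)    # denominator: normalise each row j over k
    return pi_new, A_new
```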


Maximum Likelihood for HMM - EM

Still left: maximising with respect to φ.

Q(\theta, \theta^{\text{old}}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
  + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
  + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

But φ only appears in the last term, and under the assumption that all the φ_k are independent of each other, this term decouples into a sum.
Then maximise each contribution

\sum_{n=1}^{N} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

individually.


Maximum Likelihood for HMM - EM

In the case of Gaussian emission densities

p(x \mid \phi_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k),

we get for the maximising parameters of the emission densities

\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}

\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^{T}}{\sum_{n=1}^{N} \gamma(z_{nk})}
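These are the same responsibility-weighted averages as in a Gaussian mixture model. A sketch of the update, assuming X of shape (N, D) and gamma of shape (N, K) as in the earlier snippets:

```python
import numpy as np

def m_step_gaussian(X, gamma):
    """Weighted mean and covariance per state from the responsibilities gamma(z_nk)."""
    Nk = gamma.sum(axis=0)                          # effective number of points per state
    mu = (gamma.T @ X) / Nk[:, None]                # (K, D) weighted means
    K, D = gamma.shape[1], X.shape[1]
    Sigma = np.empty((K, D, D))
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k]
    return mu, Sigma
```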


Forward-Backward HMM

We need to evaluate the quantities γ(z_nk) and ξ(z_{n-1,j}, z_nk) efficiently.
The graphical model for the HMM is a tree!
We know we can use a two-stage message passing algorithm to calculate the posterior distribution of the latent variables.
For HMMs this is called the forward-backward algorithm (Rabiner, 1989), or the Baum-Welch algorithm (Baum, 1972).
Other variants exist, differing only in the form of the messages propagated.
We look at the most widely known, the alpha-beta algorithm.


Conditional Independence for HMM

Given the data X = {x_1, ..., x_N}, the following independence relations hold:

p(X \mid z_n) = p(x_1, \dots, x_n \mid z_n) \, p(x_{n+1}, \dots, x_N \mid z_n)

p(x_1, \dots, x_{n-1} \mid x_n, z_n) = p(x_1, \dots, x_{n-1} \mid z_n)

p(x_1, \dots, x_{n-1} \mid z_{n-1}, z_n) = p(x_1, \dots, x_{n-1} \mid z_{n-1})

p(x_{n+1}, \dots, x_N \mid z_n, z_{n+1}) = p(x_{n+1}, \dots, x_N \mid z_{n+1})

p(x_{n+2}, \dots, x_N \mid x_{n+1}, z_{n+1}) = p(x_{n+2}, \dots, x_N \mid z_{n+1})

p(X \mid z_{n-1}, z_n) = p(x_1, \dots, x_{n-1} \mid z_{n-1}) \, p(x_n \mid z_n) \, p(x_{n+1}, \dots, x_N \mid z_n)

p(x_{N+1} \mid X, z_{N+1}) = p(x_{N+1} \mid z_{N+1})

p(z_{N+1} \mid X, z_N) = p(z_{N+1} \mid z_N)

[Graphical model: the HMM chain z_1 → z_2 → ... → z_{n-1} → z_n → z_{n+1} → ..., with each observation x_n emitted from its latent state z_n.]


Conditional Independence for HMM - Example

Let's look at the following independence relation:

p(X \mid z_n) = p(x_1, \dots, x_n \mid z_n) \, p(x_{n+1}, \dots, x_N \mid z_n)

Any path from the set {x_1, ..., x_n} to the set {x_{n+1}, ..., x_N} has to go through z_n.
In p(X | z_n) the node z_n is conditioned on (= observed).
All paths from x_1, ..., x_{n-1} through z_n to x_{n+1}, ..., x_N are head-to-tail at z_n, and hence blocked.
The path from x_n through z_n to z_{n+1} is tail-to-tail at z_n, so it is also blocked.
Therefore x_1, ..., x_n ⊥⊥ x_{n+1}, ..., x_N | z_n.


Alpha-Beta HMM

Define the joint probability of observing all data up to step n and having z_n as latent variable to be

\alpha(z_n) = p(x_1, \dots, x_n, z_n).

Define the probability of all future data given z_n to be

\beta(z_n) = p(x_{n+1}, \dots, x_N \mid z_n).

Then it can be shown that the following recursion holds

\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) \, p(z_n \mid z_{n-1})

with the initial condition

\alpha(z_1) = \prod_{k=1}^{K} \left\{ \pi_k \, p(x_1 \mid \phi_k) \right\}^{z_{1k}}


Alpha-Beta HMM

At step n we can efficiently calculate α(z_n) given α(z_{n-1}):

\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) \, p(z_n \mid z_{n-1})

[Figure: lattice fragment from step n-1 to step n for K = 3 states; α(z_{n,1}) is obtained by weighting α(z_{n-1,1}), α(z_{n-1,2}), α(z_{n-1,3}) by the transition probabilities A_{11}, A_{21}, A_{31}, summing, and multiplying by the emission probability p(x_n | z_{n,1}).]
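A direct, unscaled NumPy sketch of this recursion (the function name and the (N, K) layout of the precomputed emission likelihoods are my own choices; in practice α is rescaled at every step to avoid underflow, which is omitted here):

```python
import numpy as np

def forward(pi, A, emission_lik):
    """Unscaled alpha recursion.  emission_lik[n, k] = p(x_n | z_n = k)."""
    N, K = emission_lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * emission_lik[0]                      # alpha(z_1)
    for n in range(1, N):
        alpha[n] = emission_lik[n] * (alpha[n - 1] @ A)  # sum over z_{n-1}, then emit
    return alpha
```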


Alpha-Beta HMM

And for β(z_n) we get the recursion

\beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1}) \, p(x_{n+1} \mid z_{n+1}) \, p(z_{n+1} \mid z_n)

[Figure: lattice fragment from step n to step n+1 for K = 3 states; β(z_{n,1}) is obtained by weighting β(z_{n+1,1}), β(z_{n+1,2}), β(z_{n+1,3}) by the emission probabilities p(x_{n+1} | z_{n+1,k}) and the transition probabilities A_{11}, A_{12}, A_{13}, and summing.]


Alpha-Beta HMM

How do we start the β recursion? What is β(z_N)?
Taking the definition literally at n = N gives

\beta(z_N) = p(x_{N+1}, \dots, x_N \mid z_N),

a conditional probability of an empty set of observations.
It can be shown that the following choice is consistent with the approach:

\beta(z_N) = 1.

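A matching unscaled sketch of the β recursion, started from β(z_N) = 1 (same assumed array layout as the forward sketch):

```python
import numpy as np

def backward(A, emission_lik):
    """Unscaled beta recursion with beta(z_N) = 1."""
    N, K = emission_lik.shape
    beta = np.ones((N, K))
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (emission_lik[n + 1] * beta[n + 1])  # sum over z_{n+1}
    return beta
```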


Alpha-Beta HMM

Now we know how to calculate α(z_n) and β(z_n) for each step.
What is the probability of the data, p(X)?
Use the definition of γ(z_n) and Bayes' rule

\gamma(z_n) = p(z_n \mid X) = \frac{p(X \mid z_n) \, p(z_n)}{p(X)} = \frac{p(X, z_n)}{p(X)}

and the following conditional independence statement from the graphical model of the HMM

p(X \mid z_n) = \underbrace{p(x_1, \dots, x_n \mid z_n)}_{\alpha(z_n)/p(z_n)} \; \underbrace{p(x_{n+1}, \dots, x_N \mid z_n)}_{\beta(z_n)}

and therefore

\gamma(z_n) = \frac{\alpha(z_n) \, \beta(z_n)}{p(X)}


Alpha-Beta HMM

Marginalising over z_n results in

1 = \sum_{z_n} \gamma(z_n) = \frac{\sum_{z_n} \alpha(z_n) \, \beta(z_n)}{p(X)}

and therefore at each step n

p(X) = \sum_{z_n} \alpha(z_n) \, \beta(z_n)

Most conveniently evaluated at step N, where β(z_N) = 1, as

p(X) = \sum_{z_N} \alpha(z_N).
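Putting the two recursions together, a small sketch of how p(X) and γ(z_n) are read off the α and β arrays produced above:

```python
def posteriors(alpha, beta):
    """Likelihood p(X) and marginal posteriors gamma[n, k] = p(z_n = k | X)."""
    pX = alpha[-1].sum()          # sum over z_N of alpha(z_N), since beta(z_N) = 1
    gamma = alpha * beta / pX
    return pX, gamma
```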


Alpha-Beta HMM

Finally, we need to calculate the joint posterior distribution of two successive latent variables, ξ(z_{n-1}, z_n), defined by

\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X).

This can be done directly from the α and β values in the form

\xi(z_{n-1}, z_n) = \frac{\alpha(z_{n-1}) \, p(x_n \mid z_n) \, p(z_n \mid z_{n-1}) \, \beta(z_n)}{p(X)}
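And a corresponding sketch for ξ, using the same assumed array layouts as the previous snippets:

```python
import numpy as np

def pairwise_posteriors(alpha, beta, A, emission_lik, pX):
    """xi[n-1, j, k] = p(z_{n-1} = j, z_n = k | X)."""
    N, K = alpha.shape
    xi = np.zeros((N - 1, K, K))
    for n in range(1, N):
        xi[n - 1] = alpha[n - 1][:, None] * A * (emission_lik[n] * beta[n])[None, :] / pX
    return xi
```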


How to train a HMM using EM?

1. Make an initial selection for the parameters θ^old, where θ = {π, A, φ}. (Often A and π are initialised uniformly or randomly. The initialisation of φ_k depends on the emission distribution; for Gaussians, run K-means first and get μ_k and Σ_k from there.)
2. (Start of E-step) Run the forward recursion to calculate α(z_n).
3. Run the backward recursion to calculate β(z_n).
4. Calculate γ(z_n) and ξ(z_{n-1}, z_n) from α(z_n) and β(z_n).
5. Evaluate the likelihood p(X).
6. (Start of M-step) Find θ^new maximising Q(θ, θ^old). This results in new settings for the parameters π_k, A_jk and φ_k as described before.
7. Iterate until convergence is detected.
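The sketch below strings the previous snippets together into a complete (unscaled) EM loop for an HMM with a discrete emission table; the function name, the random Dirichlet initialisation and the fixed iteration count are my own choices, and a real implementation would rescale α and β and monitor p(X) for convergence.

```python
import numpy as np

def fit_hmm_em(X, K, n_iter=50, seed=0):
    """EM ("Baum-Welch") sketch for an HMM with a discrete emission table.
    X: 1-d array of integer symbols, K: number of hidden states.
    Relies on forward, backward, posteriors, pairwise_posteriors and m_step_pi_A above."""
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    M = X.max() + 1                                      # number of distinct symbols
    pi = np.full(K, 1.0 / K)                             # uniform initial distribution
    A = rng.dirichlet(np.ones(K), size=K)                # random row-stochastic transitions
    B = rng.dirichlet(np.ones(M), size=K)                # emission table, B[k, x] = p(x | z = k)
    for _ in range(n_iter):
        lik = B[:, X].T                                  # (N, K): p(x_n | z_n = k)
        alpha = forward(pi, A, lik)                      # E-step: alpha-beta recursions
        beta = backward(A, lik)
        pX, gamma = posteriors(alpha, beta)
        xi = pairwise_posteriors(alpha, beta, A, lik, pX)
        pi, A = m_step_pi_A(gamma, xi)                   # M-step: pi, A, then emissions
        B = np.stack([gamma[X == x].sum(axis=0) for x in range(M)], axis=1)
        B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```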


Alpha-Beta HMM - Notes

In order to calculate the likelihood, we need the joint probability p(X, Z) summed over all possible values of Z.
That means every particular choice of Z corresponds to one path through the lattice diagram. There are exponentially many!
Using the alpha-beta algorithm, the exponential cost is reduced to a cost linear in the length of the chain.
How did we do that? By swapping the order of multiplication and summation.


HMM - Viterbi Algorithm

Motivation: The latent states can have some meaningful interpretation, e.g. phonemes in a speech recognition system where the observed variables are the acoustic signals.
Goal: After the system has been trained, find the most probable states of the latent variables for a given sequence of observations.
Warning: Finding the set of states which are each individually the most probable does NOT solve this problem.


HMM - Viterbi Algorithm

Define

\omega(z_n) = \max_{z_1, \dots, z_{n-1}} \ln p(x_1, \dots, x_n, z_1, \dots, z_n)

From the joint distribution of the HMM, given by

p(x_1, \dots, x_N, z_1, \dots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n),

the following recursion can be derived

\omega(z_n) = \ln p(x_n \mid z_n) + \max_{z_{n-1}} \left\{ \ln p(z_n \mid z_{n-1}) + \omega(z_{n-1}) \right\}

\omega(z_1) = \ln p(z_1) + \ln p(x_1 \mid z_1) = \ln p(x_1, z_1)

[Figure: lattice of K = 3 states over steps n-2, n-1, n, n+1, illustrating the paths considered by the recursion.]


HMM - Viterbi Algorithm

Calculate

\omega(z_n) = \max_{z_1, \dots, z_{n-1}} \ln p(x_1, \dots, x_n, z_1, \dots, z_n)

for n = 1, ..., N.
For each step n, remember which transition is the best one into each state at the next step.
At step N: find the state with the highest probability.
For n = N-1 down to 1: backtrace which transition led to the most probable state and identify from which state it came.

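A log-space sketch of this procedure with explicit backpointers (not from the slides; the array layout follows the earlier snippets):

```python
import numpy as np

def viterbi(pi, A, emission_lik):
    """Most probable hidden state sequence; emission_lik[n, k] = p(x_n | z_n = k)."""
    N, K = emission_lik.shape
    log_A = np.log(A)
    omega = np.log(pi) + np.log(emission_lik[0])   # omega(z_1)
    back = np.zeros((N, K), dtype=int)
    for n in range(1, N):
        scores = omega[:, None] + log_A            # scores[j, k] for the transition j -> k
        back[n] = scores.argmax(axis=0)            # best predecessor j for each state k
        omega = np.log(emission_lik[n]) + scores.max(axis=0)
    path = [int(omega.argmax())]                   # best final state
    for n in range(N - 1, 0, -1):
        path.append(int(back[n, path[-1]]))        # follow the stored transitions back
    return path[::-1]
```

For the Peter and Mary example, viterbi(pi, A, B[:, obs].T) should again return the path (sunny, sunny, rainy).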