
Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Christian Walder
Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")


Part XXII

Sequential Data 2


Example of a Hidden Markov Model

Assume Peter and Mary are students in Canberra and Sydney, respectively. Peter is a computer science student and is only interested in riding his bicycle, shopping for new computer gadgets, and studying. (He also does other things, but because those other activities don't depend on the weather we neglect them here.)

Mary does not know the current weather in Canberra, but she knows the general trends of the weather there. She also knows Peter well enough to know what he does on average on rainy days and also when the skies are blue.

She believes that the weather (rainy or not rainy) follows a given discrete Markov chain. She tries to guess the sequence of weather patterns for a number of days after Peter tells her on the phone what he did over the last days.


Example of a Hidden Markov Model

Mary uses the following HMM:

                                 rainy   sunny
  initial probability             0.2     0.8
  transition probability
    from rainy                    0.3     0.7
    from sunny                    0.4     0.6
  emission probability (given the state)
    cycle                         0.1     0.6
    shop                          0.4     0.3
    study                         0.5     0.1

Assume Peter tells Mary that the list of his activities in the last days was ['cycle', 'shop', 'study'].

(a) Calculate the probability of this observation sequence.
(b) Calculate the most probable sequence of hidden states for these observations.


Hidden Markov Model

The trellis for the hidden Markov model.
Find the probability of 'cycle', 'shop', 'study'.

[Trellis diagram: the states Rainy (initial probability 0.2) and Sunny (initial probability 0.8) unrolled over the three observations cycle, shop, study, with the transition probabilities 0.3/0.7 from Rainy and 0.4/0.6 from Sunny between consecutive steps, and the emission probabilities from the table above attached to each state at each step.]
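The forward (alpha) recursion over this trellis answers question (a). Below is a minimal NumPy sketch, not part of the original slides; the encodings (state 0 = rainy, 1 = sunny; symbols 0 = cycle, 1 = shop, 2 = study) are my own choice.

```python
import numpy as np

# HMM from the table above: state 0 = rainy, state 1 = sunny
pi = np.array([0.2, 0.8])                  # initial probabilities
A = np.array([[0.3, 0.7],                  # transitions: row = from state, column = to state
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],             # emissions given rainy: cycle, shop, study
              [0.6, 0.3, 0.1]])            # emissions given sunny
obs = [0, 1, 2]                            # 'cycle', 'shop', 'study'

# Forward recursion: alpha[k] = p(x_1, ..., x_n, z_n = k)
alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = B[:, x] * (alpha @ A)          # sum over the previous state, then emit
print("p(cycle, shop, study) =", alpha.sum())
```

With the numbers above, the sum of the final alpha values should come out at roughly 0.041.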


Hidden Markov Model

Find the most probable hidden states for 'cycle', 'shop', 'study'.

[The same trellis diagram as on the previous slide, now used to search for the most probable sequence of hidden states.]
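For question (b) the state space is tiny, so a brute-force search over all 2³ hidden paths is enough to check the answer; the sketch below (again not from the slides) reuses pi, A, B and obs from the previous snippet. The Viterbi algorithm at the end of this lecture performs the same search efficiently.

```python
from itertools import product

def joint(states, observations):
    """p(x_1..x_N, z_1..z_N) for one particular hidden path."""
    p = pi[states[0]] * B[states[0], observations[0]]
    for (prev, cur), x in zip(zip(states, states[1:]), observations[1:]):
        p *= A[prev, cur] * B[cur, x]
    return p

best = max(product([0, 1], repeat=len(obs)), key=lambda z: joint(z, obs))
names = ["rainy", "sunny"]
print([names[s] for s in best], joint(best, obs))
```

For this example the search should return the path (sunny, sunny, rainy).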


Homogeneous Hidden Markov Model

Joint probability distribution over both latent and observed variables:

p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi)

where X = (x_1, \dots, x_N), Z = (z_1, \dots, z_N), and θ = {π, A, φ}.

Most of the discussion will be independent of the particular choice of emission probabilities (e.g. discrete tables, Gaussians, mixtures of Gaussians).
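To make the factorisation concrete, here is a small sketch (not from the slides) that accumulates the joint log-probability term by term; log_emission is a hypothetical callback standing in for whatever emission model φ defines.

```python
import numpy as np

def log_joint(X, Z, pi, A, log_emission):
    """ln p(X, Z | theta) under the factorisation above.
    log_emission(x, k) must return ln p(x | z = k, phi) for the chosen emission model."""
    lp = np.log(pi[Z[0]]) + log_emission(X[0], Z[0])
    for n in range(1, len(X)):
        lp += np.log(A[Z[n - 1], Z[n]]) + log_emission(X[n], Z[n])
    return lp
```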


Maximum Likelihood for HMM

We have observed a data set X = (x_1, ..., x_N).
Assume it came from an HMM with a given structure (number of nodes, form of emission probabilities).
The likelihood of the data is

p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)

This joint distribution does not factorise over n (as it did for the mixture distribution).
We have N latent variables, each with K states: K^N terms.
The number of terms therefore grows exponentially with the length of the chain.
But we can use the conditional independence of the latent variables to reorder their calculation later.
A further obstacle to a closed-form maximum likelihood solution: calculating the emission probabilities for the different states z_n.
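A brute-force version of this sum (a sketch, assuming the log_joint helper from the previous snippet) makes the K^N cost explicit; it is only feasible for very short chains and motivates the forward-backward recursions below.

```python
from itertools import product
import numpy as np

def log_likelihood_bruteforce(X, K, pi, A, log_emission):
    """ln p(X | theta) by summing p(X, Z | theta) over all K**N hidden paths Z."""
    terms = [log_joint(X, Z, pi, A, log_emission)
             for Z in product(range(K), repeat=len(X))]   # K**N terms
    return np.logaddexp.reduce(terms)
```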


Maximum Likelihood for HMM - EM

Employ the EM algorithm to find the maximum likelihood solution for the HMM.

Start with some initial parameter settings θ^old.

E-step: Find the posterior distribution of the latent variables p(Z | X, θ^old).

M-step: Maximise

Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta)

with respect to the parameters θ = {π, A, φ}.



Maximum Likelihood for HMM - EM

Denote the marginal posterior distribution of z_n by γ(z_n), and the joint posterior distribution of two successive latent variables by ξ(z_{n-1}, z_n):

\gamma(z_n) = p(z_n \mid X, \theta^{\text{old}})

\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X, \theta^{\text{old}})

For each step n, γ(z_n) has K nonnegative values which sum to 1.
For each step n, ξ(z_{n-1}, z_n) has K × K nonnegative values which sum to 1.
Elements of these vectors are denoted by γ(z_nk) and ξ(z_{n-1,j}, z_nk) respectively.


Maximum Likelihood for HMM - EM

Because the expectation of a binary random variable is the probability that it is one, we get with this notation

\gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \sum_{z_n} \gamma(z_n) \, z_{nk}

\xi(z_{n-1,j}, z_{nk}) = \mathbb{E}[z_{n-1,j} \, z_{nk}] = \sum_{z_{n-1}, z_n} \xi(z_{n-1}, z_n) \, z_{n-1,j} \, z_{nk}

Putting it all together we get

Q(\theta, \theta^{\text{old}}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
  + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
  + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)


Maximum Likelihood for HMM - EM

M-step: Maximising

Q(\theta, \theta^{\text{old}}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
  + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
  + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

results in

\pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}

A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}
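A possible NumPy rendering of these two updates (a sketch; the array layout with gamma of shape (N, K) and xi of shape (N-1, K, K) is my own convention):

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """pi and A updates from gamma[n, k] = gamma(z_nk) and xi[n-1, j, k] = xi(z_{n-1,j}, z_nk)."""
    pi_new = gamma[0] / gamma[0].sum()
    A_new = xi.sum(axis=0)                       # numerator: sum over n = 2..N
    A_new /= A_new.sum(axis=1, keepdims=True)    # denominator: normalise each row j over k
    return pi_new, A_new
```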


Maximum Likelihood for HMM - EM

Still left: maximising with respect to φ.

Q(\theta, \theta^{\text{old}}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
  + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
  + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

But φ only appears in the last term, and under the assumption that all the φ_k are independent of each other, this term decouples into a sum.
Then maximise each contribution

\sum_{n=1}^{N} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)

individually.


Maximum Likelihood for HMM - EM

In the case of Gaussian emission densities

p(x \mid \phi_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k),

we get for the maximising parameters of the emission densities

\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}

\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^{T}}{\sum_{n=1}^{N} \gamma(z_{nk})}
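These are the same responsibility-weighted averages as in a Gaussian mixture model. A sketch of the update, assuming X of shape (N, D) and gamma of shape (N, K) as in the earlier snippets:

```python
import numpy as np

def m_step_gaussian(X, gamma):
    """Weighted mean and covariance per state from the responsibilities gamma(z_nk)."""
    Nk = gamma.sum(axis=0)                          # effective number of points per state
    mu = (gamma.T @ X) / Nk[:, None]                # (K, D) weighted means
    K, D = gamma.shape[1], X.shape[1]
    Sigma = np.empty((K, D, D))
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k]
    return mu, Sigma
```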


Forward-Backward HMM

We need to evaluate the quantities γ(z_nk) and ξ(z_{n-1,j}, z_nk) efficiently.
The graphical model for the HMM is a tree!
We know we can use a two-stage message passing algorithm to calculate the posterior distribution of the latent variables.
For HMMs this is called the forward-backward algorithm (Rabiner, 1989), or the Baum-Welch algorithm (Baum, 1972).
Other variants exist, differing only in the form of the messages propagated.
We look at the most widely known, the alpha-beta algorithm.


Conditional Independence for HMM

Given the data X = {x_1, ..., x_N}, the following independence relations hold:

p(X \mid z_n) = p(x_1, \dots, x_n \mid z_n) \, p(x_{n+1}, \dots, x_N \mid z_n)

p(x_1, \dots, x_{n-1} \mid x_n, z_n) = p(x_1, \dots, x_{n-1} \mid z_n)

p(x_1, \dots, x_{n-1} \mid z_{n-1}, z_n) = p(x_1, \dots, x_{n-1} \mid z_{n-1})

p(x_{n+1}, \dots, x_N \mid z_n, z_{n+1}) = p(x_{n+1}, \dots, x_N \mid z_{n+1})

p(x_{n+2}, \dots, x_N \mid x_{n+1}, z_{n+1}) = p(x_{n+2}, \dots, x_N \mid z_{n+1})

p(X \mid z_{n-1}, z_n) = p(x_1, \dots, x_{n-1} \mid z_{n-1}) \, p(x_n \mid z_n) \, p(x_{n+1}, \dots, x_N \mid z_n)

p(x_{N+1} \mid X, z_{N+1}) = p(x_{N+1} \mid z_{N+1})

p(z_{N+1} \mid X, z_N) = p(z_{N+1} \mid z_N)

[Graphical model: the HMM chain z_1 → z_2 → ... → z_{n-1} → z_n → z_{n+1} → ..., with each observation x_n emitted from its latent state z_n.]


Conditional Independence for HMM - Example

Let's look at the following independence relation:

p(X \mid z_n) = p(x_1, \dots, x_n \mid z_n) \, p(x_{n+1}, \dots, x_N \mid z_n)

Any path from the set {x_1, ..., x_n} to the set {x_{n+1}, ..., x_N} has to go through z_n.
In p(X | z_n) the node z_n is conditioned on (= observed).
All paths from x_1, ..., x_{n-1} through z_n to x_{n+1}, ..., x_N are head-to-tail at z_n, and hence blocked.
The path from x_n through z_n to z_{n+1} is tail-to-tail at z_n, so it is also blocked.
Therefore x_1, ..., x_n ⊥⊥ x_{n+1}, ..., x_N | z_n.


Alpha-Beta HMM

Define the joint probability of observing all data up to step n and having z_n as latent variable to be

\alpha(z_n) = p(x_1, \dots, x_n, z_n).

Define the probability of all future data given z_n to be

\beta(z_n) = p(x_{n+1}, \dots, x_N \mid z_n).

Then it can be shown that the following recursion holds

\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) \, p(z_n \mid z_{n-1})

with the initial condition

\alpha(z_1) = \prod_{k=1}^{K} \left\{ \pi_k \, p(x_1 \mid \phi_k) \right\}^{z_{1k}}


Alpha-Beta HMM

At step n we can efficiently calculate α(z_n) given α(z_{n-1}):

\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) \, p(z_n \mid z_{n-1})

[Figure: lattice fragment from step n-1 to step n for K = 3 states; α(z_{n,1}) is obtained by weighting α(z_{n-1,1}), α(z_{n-1,2}), α(z_{n-1,3}) by the transition probabilities A_{11}, A_{21}, A_{31}, summing, and multiplying by the emission probability p(x_n | z_{n,1}).]
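A direct, unscaled NumPy sketch of this recursion (the function name and the (N, K) layout of the precomputed emission likelihoods are my own choices; in practice α is rescaled at every step to avoid underflow, which is omitted here):

```python
import numpy as np

def forward(pi, A, emission_lik):
    """Unscaled alpha recursion.  emission_lik[n, k] = p(x_n | z_n = k)."""
    N, K = emission_lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * emission_lik[0]                      # alpha(z_1)
    for n in range(1, N):
        alpha[n] = emission_lik[n] * (alpha[n - 1] @ A)  # sum over z_{n-1}, then emit
    return alpha
```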


Alpha-Beta HMM

And for β(z_n) we get the recursion

\beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1}) \, p(x_{n+1} \mid z_{n+1}) \, p(z_{n+1} \mid z_n)

[Figure: lattice fragment from step n to step n+1 for K = 3 states; β(z_{n,1}) is obtained by weighting β(z_{n+1,1}), β(z_{n+1,2}), β(z_{n+1,3}) by the emission probabilities p(x_{n+1} | z_{n+1,k}) and the transition probabilities A_{11}, A_{12}, A_{13}, and summing.]


Alpha-Beta HMM

How do we start the β recursion? What is β(z_N)?
Taking the definition literally at n = N gives

\beta(z_N) = p(x_{N+1}, \dots, x_N \mid z_N),

a conditional probability of an empty set of observations.
It can be shown that the following choice is consistent with the approach:

\beta(z_N) = 1.

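A matching unscaled sketch of the β recursion, started from β(z_N) = 1 (same assumed array layout as the forward sketch):

```python
import numpy as np

def backward(A, emission_lik):
    """Unscaled beta recursion with beta(z_N) = 1."""
    N, K = emission_lik.shape
    beta = np.ones((N, K))
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (emission_lik[n + 1] * beta[n + 1])  # sum over z_{n+1}
    return beta
```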


Alpha-Beta HMM

Now we know how to calculate α(z_n) and β(z_n) for each step.
What is the probability of the data, p(X)?
Use the definition of γ(z_n) and Bayes' rule

\gamma(z_n) = p(z_n \mid X) = \frac{p(X \mid z_n) \, p(z_n)}{p(X)} = \frac{p(X, z_n)}{p(X)}

and the following conditional independence statement from the graphical model of the HMM

p(X \mid z_n) = \underbrace{p(x_1, \dots, x_n \mid z_n)}_{\alpha(z_n)/p(z_n)} \; \underbrace{p(x_{n+1}, \dots, x_N \mid z_n)}_{\beta(z_n)}

and therefore

\gamma(z_n) = \frac{\alpha(z_n) \, \beta(z_n)}{p(X)}


Alpha-Beta HMM

Marginalising over z_n results in

1 = \sum_{z_n} \gamma(z_n) = \frac{\sum_{z_n} \alpha(z_n) \, \beta(z_n)}{p(X)}

and therefore at each step n

p(X) = \sum_{z_n} \alpha(z_n) \, \beta(z_n)

Most conveniently evaluated at step N, where β(z_N) = 1, as

p(X) = \sum_{z_N} \alpha(z_N).
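Putting the two recursions together, a small sketch of how p(X) and γ(z_n) are read off the α and β arrays produced above:

```python
def posteriors(alpha, beta):
    """Likelihood p(X) and marginal posteriors gamma[n, k] = p(z_n = k | X)."""
    pX = alpha[-1].sum()          # sum over z_N of alpha(z_N), since beta(z_N) = 1
    gamma = alpha * beta / pX
    return pX, gamma
```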


Alpha-Beta HMM

Finally, we need to calculate the joint posterior distribution of two successive latent variables, ξ(z_{n-1}, z_n), defined by

\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X).

This can be done directly from the α and β values in the form

\xi(z_{n-1}, z_n) = \frac{\alpha(z_{n-1}) \, p(x_n \mid z_n) \, p(z_n \mid z_{n-1}) \, \beta(z_n)}{p(X)}
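And a corresponding sketch for ξ, using the same assumed array layouts as the previous snippets:

```python
import numpy as np

def pairwise_posteriors(alpha, beta, A, emission_lik, pX):
    """xi[n-1, j, k] = p(z_{n-1} = j, z_n = k | X)."""
    N, K = alpha.shape
    xi = np.zeros((N - 1, K, K))
    for n in range(1, N):
        xi[n - 1] = alpha[n - 1][:, None] * A * (emission_lik[n] * beta[n])[None, :] / pX
    return xi
```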


How to train a HMM using EM?

1. Make an initial selection for the parameters θ^old, where θ = {π, A, φ}. (Often A and π are initialised uniformly or randomly. The initialisation of φ_k depends on the emission distribution; for Gaussians, run K-means first and get μ_k and Σ_k from there.)
2. (Start of E-step) Run the forward recursion to calculate α(z_n).
3. Run the backward recursion to calculate β(z_n).
4. Calculate γ(z_n) and ξ(z_{n-1}, z_n) from α(z_n) and β(z_n).
5. Evaluate the likelihood p(X).
6. (Start of M-step) Find θ^new maximising Q(θ, θ^old). This results in new settings for the parameters π_k, A_jk and φ_k as described before.
7. Iterate until convergence is detected.
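The sketch below strings the previous snippets together into a complete (unscaled) EM loop for an HMM with a discrete emission table; the function name, the random Dirichlet initialisation and the fixed iteration count are my own choices, and a real implementation would rescale α and β and monitor p(X) for convergence.

```python
import numpy as np

def fit_hmm_em(X, K, n_iter=50, seed=0):
    """EM ("Baum-Welch") sketch for an HMM with a discrete emission table.
    X: 1-d array of integer symbols, K: number of hidden states.
    Relies on forward, backward, posteriors, pairwise_posteriors and m_step_pi_A above."""
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    M = X.max() + 1                                      # number of distinct symbols
    pi = np.full(K, 1.0 / K)                             # uniform initial distribution
    A = rng.dirichlet(np.ones(K), size=K)                # random row-stochastic transitions
    B = rng.dirichlet(np.ones(M), size=K)                # emission table, B[k, x] = p(x | z = k)
    for _ in range(n_iter):
        lik = B[:, X].T                                  # (N, K): p(x_n | z_n = k)
        alpha = forward(pi, A, lik)                      # E-step: alpha-beta recursions
        beta = backward(A, lik)
        pX, gamma = posteriors(alpha, beta)
        xi = pairwise_posteriors(alpha, beta, A, lik, pX)
        pi, A = m_step_pi_A(gamma, xi)                   # M-step: pi, A, then emissions
        B = np.stack([gamma[X == x].sum(axis=0) for x in range(M)], axis=1)
        B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```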


Alpha-Beta HMM - Notes

In order to calculate the likelihood, we need the joint probability p(X, Z) summed over all possible values of Z.
That means every particular choice of Z corresponds to one path through the lattice diagram. There are exponentially many!
Using the alpha-beta algorithm, the exponential cost is reduced to a cost linear in the length of the chain.
How did we do that? By swapping the order of multiplication and summation.


HMM - Viterbi Algorithm

Motivation: The latent states can have some meaningful interpretation, e.g. phonemes in a speech recognition system where the observed variables are the acoustic signals.
Goal: After the system has been trained, find the most probable states of the latent variables for a given sequence of observations.
Warning: Finding the set of states which are each individually the most probable does NOT solve this problem.


HMM - Viterbi Algorithm

Define

\omega(z_n) = \max_{z_1, \dots, z_{n-1}} \ln p(x_1, \dots, x_n, z_1, \dots, z_n)

From the joint distribution of the HMM, given by

p(x_1, \dots, x_N, z_1, \dots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n),

the following recursion can be derived

\omega(z_n) = \ln p(x_n \mid z_n) + \max_{z_{n-1}} \left\{ \ln p(z_n \mid z_{n-1}) + \omega(z_{n-1}) \right\}

\omega(z_1) = \ln p(z_1) + \ln p(x_1 \mid z_1) = \ln p(x_1, z_1)

[Figure: lattice of K = 3 states over steps n-2, n-1, n, n+1, illustrating the paths considered by the recursion.]


HMM - Viterbi Algorithm

Calculate

\omega(z_n) = \max_{z_1, \dots, z_{n-1}} \ln p(x_1, \dots, x_n, z_1, \dots, z_n)

for n = 1, ..., N.
For each step n, remember which transition is the best one into each state at the next step.
At step N: find the state with the highest probability.
For n = N-1 down to 1: backtrace which transition led to the most probable state and identify from which state it came.

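A log-space sketch of this procedure with explicit backpointers (not from the slides; the array layout follows the earlier snippets):

```python
import numpy as np

def viterbi(pi, A, emission_lik):
    """Most probable hidden state sequence; emission_lik[n, k] = p(x_n | z_n = k)."""
    N, K = emission_lik.shape
    log_A = np.log(A)
    omega = np.log(pi) + np.log(emission_lik[0])   # omega(z_1)
    back = np.zeros((N, K), dtype=int)
    for n in range(1, N):
        scores = omega[:, None] + log_A            # scores[j, k] for the transition j -> k
        back[n] = scores.argmax(axis=0)            # best predecessor j for each state k
        omega = np.log(emission_lik[n]) + scores.max(axis=0)
    path = [int(omega.argmax())]                   # best final state
    for n in range(N - 1, 0, -1):
        path.append(int(back[n, path[-1]]))        # follow the stored transitions back
    return path[::-1]
```

For the Peter and Mary example, viterbi(pi, A, B[:, obs].T) should again return the path (sunny, sunny, rainy).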