
Stefanos Zafeiriou Adv. Statistical Machine Learning (course 495)

Linear Dynamical Systems (Kalman filter)

(a) Overview of HMMs

(b) From HMMs to Linear Dynamical Systems (LDS)


[Markov chain graph: $\mathbf{x}_1 \to \mathbf{x}_2 \to \mathbf{x}_3 \to \cdots \to \mathbf{x}_T$]

Let's assume we have discrete random variables, encoded as one-hot vectors (e.g., taking 3 discrete values, $\mathbf{x}_t \in \{(1,0,0), (0,1,0), (0,0,1)\}$).

Markov property: $p(\mathbf{x}_t \mid \mathbf{x}_1, \dots, \mathbf{x}_{t-1}) = p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$

The chain is called stationary, homogeneous, or time-invariant if the distribution $p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ does not depend on $t$, e.g. $p(\mathbf{x}_t = (1,0,0) \mid \mathbf{x}_{t-1} = (0,1,0))$.

Markov Chains with Discrete Random Variables


$p(\mathbf{x}_1, \dots, \mathbf{x}_T) = p(\mathbf{x}_1) \prod_{t=2}^{T} p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$

What do we need in order to describe the whole process?

(1) A probability for the first frame/timestamp, $p(\mathbf{x}_1)$. To define it we need the vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$.

(2) A transition probability $p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$. To define it we need a $K \times K$ transition matrix $\mathbf{A} = [a_{jk}]$.

$p(\mathbf{x}_1 \mid \boldsymbol{\pi}) = \prod_{c=1}^{K} \pi_c^{x_{1c}}$

$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{A}) = \prod_{k=1}^{K} \prod_{j=1}^{K} a_{jk}^{\,x_{t-1,j}\,x_{tk}}$

Markov Chains with Discrete Random Variables
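As an illustration of how $\boldsymbol{\pi}$ and $\mathbf{A}$ fully specify the chain, here is a minimal NumPy sketch (not from the slides; the function name `sample_chain` and the example numbers are made up) that draws a sequence of one-hot states, assuming row $j$ of `A` holds $p(\mathbf{x}_t \mid \mathbf{x}_{t-1} = j)$:

```python
# Minimal sketch: sampling a discrete Markov chain with one-hot states x_t,
# given pi (K,) and a row-stochastic transition matrix A (K, K).
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(pi, A, T):
    """Return a (T, K) array of one-hot state vectors x_1, ..., x_T."""
    K = len(pi)
    X = np.zeros((T, K))
    state = rng.choice(K, p=pi)               # draw x_1 from p(x_1 | pi)
    X[0, state] = 1.0
    for t in range(1, T):
        state = rng.choice(K, p=A[state])     # draw x_t from p(x_t | x_{t-1}, A)
        X[t, state] = 1.0
    return X

pi = np.array([0.5, 0.3, 0.2])                # example values, not from the slides
A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
print(sample_chain(pi, A, T=5))
```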


(1) Using the transition matrix we can compute various probabilities about the future: $p(\mathbf{x}_{t+1} \mid \mathbf{x}_{t-1}, \mathbf{A})$ is described by $\mathbf{A}^2$, $p(\mathbf{x}_{t+2} \mid \mathbf{x}_{t-1}, \mathbf{A})$ by $\mathbf{A}^3$, and in general $p(\mathbf{x}_{t+n} \mid \mathbf{x}_{t-1}, \mathbf{A})$ by $\mathbf{A}^{n+1}$.

(2) The stationary probability of a Markov chain is very important (it indicates how likely the chain is to end up in each state after a long random walk), e.g. Google PageRank:

$\boldsymbol{\pi}^{T}\mathbf{A} = \boldsymbol{\pi}^{T}$

Markov Chains with Discrete Random Variables
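A minimal sketch (not from the slides; the matrix values are made up) of the two points above: multi-step transition probabilities come from powers of $\mathbf{A}$, and the stationary distribution $\boldsymbol{\pi}^{T}\mathbf{A} = \boldsymbol{\pi}^{T}$ can be found by power iteration, the same fixed point that PageRank relies on:

```python
# Minimal sketch: n-step transition matrices and the stationary distribution.
import numpy as np

A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])

# Two-step transition probabilities, i.e. the matrix describing p(x_{t+1} | x_{t-1}).
A2 = np.linalg.matrix_power(A, 2)

# Stationary distribution: iterate pi^T <- pi^T A until it stops changing.
pi = np.full(A.shape[0], 1.0 / A.shape[0])
for _ in range(1000):
    pi_next = pi @ A
    if np.allclose(pi_next, pi, atol=1e-12):
        break
    pi = pi_next

print(A2)
print(pi)          # satisfies pi @ A ≈ pi up to numerical tolerance
```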


Latent Variables in a Markov Chain

[HMM graph: latent chain $\mathbf{z}_1 \to \mathbf{z}_2 \to \cdots \to \mathbf{z}_T$, with an observation $\mathbf{x}_t$ emitted from each $\mathbf{z}_t$]

For discrete observations, the emission probability $p(\mathbf{x}_t \mid \mathbf{z}_t)$ is defined by an $L \times K$ emission probability matrix $\mathbf{B}$.


Latent Variables in a Markov Chain


$p(\mathbf{x}_t \mid \mathbf{z}_t) = \prod_{k=1}^{K} N(\mathbf{x}_t \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_{tk}}$

$K$ Gaussian emission distributions, one per latent state.
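A minimal sketch (not from the slides; the function name and shapes are made up) of generating data from such an HMM with $K$ Gaussian emission distributions:

```python
# Minimal sketch: sampling an HMM with one Gaussian emission per latent state.
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(pi, A, means, covs, T):
    """Return latent states z (T,) and observations X (T, D)."""
    D = means.shape[1]
    z = np.zeros(T, dtype=int)
    X = np.zeros((T, D))
    z[0] = rng.choice(len(pi), p=pi)                              # z_1 ~ pi
    X[0] = rng.multivariate_normal(means[z[0]], covs[z[0]])       # x_1 | z_1
    for t in range(1, T):
        z[t] = rng.choice(len(pi), p=A[z[t - 1]])                 # z_t | z_{t-1}
        X[t] = rng.multivariate_normal(means[z[t]], covs[z[t]])   # x_t | z_t
    return z, X
```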


$p(\mathbf{z}_1, \dots, \mathbf{z}_T) = p(\mathbf{z}_1) \prod_{t=2}^{T} p(\mathbf{z}_t \mid \mathbf{z}_{t-1})$

$p(\mathbf{X}, \mathbf{Z} \mid \theta) = p(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T, \mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_T \mid \theta) = \left[ \prod_{t=1}^{T} p(\mathbf{x}_t \mid \mathbf{z}_t) \right] p(\mathbf{z}_1) \prod_{t=2}^{T} p(\mathbf{z}_t \mid \mathbf{z}_{t-1})$

Factorization of an HMM
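To make the factorization concrete, here is a minimal sketch (not from the slides; it assumes SciPy for the Gaussian log-density and Gaussian emissions as on the previous slide) that evaluates $\log p(\mathbf{X}, \mathbf{Z} \mid \theta)$ for a given state path:

```python
# Minimal sketch: HMM joint log-probability log p(X, Z | theta) using the
# factorization p(z_1) * prod p(z_t | z_{t-1}) * prod p(x_t | z_t).
import numpy as np
from scipy.stats import multivariate_normal

def hmm_log_joint(X, z, pi, A, means, covs):
    ll = np.log(pi[z[0]])                                            # log p(z_1)
    ll += multivariate_normal.logpdf(X[0], means[z[0]], covs[z[0]])  # log p(x_1 | z_1)
    for t in range(1, len(z)):
        ll += np.log(A[z[t - 1], z[t]])                              # log p(z_t | z_{t-1})
        ll += multivariate_normal.logpdf(X[t], means[z[t]], covs[z[t]])  # log p(x_t | z_t)
    return ll
```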


What can we do with an HMM?

Given a sequence of observations and the parameters:

(1) For a timestamp $t$, we want to find the probabilities of $\mathbf{z}_t$ given the observations up to that point. This process is called Filtering: $p(\mathbf{z}_t \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$ (see the filtering sketch after this list).

(2) For a timestamp $t$, we want to find the probabilities of $\mathbf{z}_t$ given the whole sequence. This process is called Smoothing: $p(\mathbf{z}_t \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T)$

(3) Given the observation sequence, find the sequence of hidden variables that maximizes the posterior:
$\operatorname{argmax}_{\mathbf{z}_1 \dots \mathbf{z}_t} p(\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_t \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$
This process is called Decoding (Viterbi).
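A minimal sketch (not from the slides) of filtering for a discrete-state HMM via the normalised forward recursion; `lik` is an assumed precomputed table with `lik[t, k]` $= p(\mathbf{x}_t \mid \mathbf{z}_t = k)$. The accumulated normalisers also give the evaluation quantity $\log p(\mathbf{x}_1, \dots, \mathbf{x}_T)$ discussed below:

```python
# Minimal sketch: HMM filtering p(z_t | x_1..x_t) with the forward recursion.
import numpy as np

def hmm_filter(pi, A, lik):
    T, K = lik.shape
    alpha = np.zeros((T, K))
    loglik = 0.0
    for t in range(T):
        a = pi * lik[0] if t == 0 else lik[t] * (alpha[t - 1] @ A)
        norm = a.sum()                 # p(x_t | x_1..x_{t-1})
        alpha[t] = a / norm            # alpha[t, k] = p(z_t = k | x_1..x_t)
        loglik += np.log(norm)         # accumulates log p(x_1..x_T)
    return alpha, loglik
```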


Hidden Markov Models

[Figure: illustration of filtering, smoothing, and decoding; taken from Machine Learning: A Probabilistic Perspective by K. Murphy]


Hidden Markov Models

(4) Find the probability of the observations under the model, $p(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T)$. This process is called Evaluation.

(5) Prediction: $p(\mathbf{z}_{t+\delta} \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$ and $p(\mathbf{x}_{t+\delta} \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$

(6) EM parameter estimation (Baum-Welch algorithm) of the parameters $\mathbf{A}, \boldsymbol{\pi}, \theta$


Hidden Markov Models: summary of inference tasks

Filtering: $p(\mathbf{z}_t \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$

Smoothing: $p(\mathbf{z}_t \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T)$

Decoding: $\operatorname{argmax}_{\mathbf{z}_1 \dots \mathbf{z}_t} p(\mathbf{z}_1, \dots, \mathbf{z}_t \mid \mathbf{x}_1, \dots, \mathbf{x}_t)$

Fixed-lag smoothing: $p(\mathbf{z}_{t-\tau} \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$

Prediction: $p(\mathbf{z}_{t+\delta} \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$ and $p(\mathbf{x}_{t+\delta} \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$


Linear Dynamical Systems (LDS)

• Up until now we had Markov chains with discrete variables.

[Markov chain graph: $\mathbf{x}_1 \to \mathbf{x}_2 \to \mathbf{x}_3 \to \cdots \to \mathbf{x}_T$]

• How can we define a transition relationship with continuous-valued variables?

𝒙𝑑 = π‘¨π’™π‘‘βˆ’1 + 𝒗𝑑 𝒙1 = 𝝁0 + 𝒖

𝒗~𝑁(𝒗|𝟎, πšͺ) 𝒖~𝑁 𝒖 𝟎,𝑷0

or

𝒙1~𝑁(𝒙1|𝝁0, 𝑷0) 𝒙𝑑~𝑁(𝒙𝑑|π‘¨π’™π‘‘βˆ’1, πšͺ) or

𝑝(𝒙1) = 𝑁(𝒙1|𝝁0, 𝑷0) 𝒑(𝒙𝑑 π’™π‘‘βˆ’1 = 𝑁(π‘¨π’™π‘‘βˆ’1|𝟎, πšͺ)


Latent Variable Models (Dynamic, Continuous)


[Graphical model: observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ with latent variables $\mathbf{y}_1, \dots, \mathbf{y}_N$; they share a common linear structure, and we want to find the parameters.]

Linear Dynamical Systems (LDS)


πœƒ = {𝑾,𝑨, 𝝁0, 𝚺, πšͺ, 𝑷0}

𝒙1 𝒙2 𝒙3 𝒙𝑇

π’š1 π’š2 π’š3 π’šπ‘‡

𝒙𝑑 = π–π’šπ‘‘ + 𝒆𝑑

π’šπ‘‘ = π‘¨π’šπ‘‘βˆ’1 + 𝒗𝑑

π’š1 = 𝝁0 + 𝒖

𝐞~𝑁(𝒆|𝟎, 𝚺)

𝒗~𝑁(𝒗|𝟎, πšͺ)

𝒖~𝑁 𝒖 𝟎,𝑷0

Parameters:

15

Transition model

Linear Dynamical Systems (LDS)
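A minimal sketch (not from the slides; names and shapes are illustrative) of generating data from this model, drawing the latent chain from the transition model and the observations from the emission model:

```python
# Minimal sketch: simulate y_t = A y_{t-1} + v_t (latent), x_t = W y_t + e_t (observed).
import numpy as np

rng = np.random.default_rng(0)

def simulate_lds(W, A, mu0, Sigma, Gamma, P0, T):
    Dy, Dx = A.shape[0], W.shape[0]
    Y = np.zeros((T, Dy))
    X = np.zeros((T, Dx))
    Y[0] = rng.multivariate_normal(mu0, P0)                  # y_1 ~ N(mu0, P0)
    X[0] = rng.multivariate_normal(W @ Y[0], Sigma)          # x_1 | y_1
    for t in range(1, T):
        Y[t] = rng.multivariate_normal(A @ Y[t - 1], Gamma)  # y_t | y_{t-1}
        X[t] = rng.multivariate_normal(W @ Y[t], Sigma)      # x_t | y_t
    return Y, X
```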


Linear Dynamical Systems (LDS)

First timestamp: $p(\mathbf{y}_1) = N(\mathbf{y}_1 \mid \boldsymbol{\mu}_0, \mathbf{P}_0)$

Transition probability: $p(\mathbf{y}_t \mid \mathbf{y}_{t-1}) = N(\mathbf{y}_t \mid \mathbf{A}\mathbf{y}_{t-1}, \boldsymbol{\Gamma})$

Emission: $p(\mathbf{x}_t \mid \mathbf{y}_t) = N(\mathbf{x}_t \mid \mathbf{W}\mathbf{y}_t, \boldsymbol{\Sigma})$


HMM vs LDS

HMM: a Markov chain with discrete latent variables
First timestamp $p(\mathbf{z}_1)$: the $K \times 1$ vector $\boldsymbol{\pi}$
Transition $p(\mathbf{z}_t \mid \mathbf{z}_{t-1})$: the $K \times K$ matrix $\mathbf{A}$
Emission $p(\mathbf{x}_t \mid \mathbf{z}_t)$: the $L \times K$ matrix $\mathbf{B}$, or $K$ emission distributions

LDS: a Markov chain with continuous latent variables
First timestamp: $p(\mathbf{y}_1) = N(\mathbf{y}_1 \mid \boldsymbol{\mu}_0, \mathbf{P}_0)$
Transition: $p(\mathbf{y}_t \mid \mathbf{y}_{t-1}) = N(\mathbf{y}_t \mid \mathbf{A}\mathbf{y}_{t-1}, \boldsymbol{\Gamma})$
Emission: $p(\mathbf{x}_t \mid \mathbf{y}_t) = N(\mathbf{x}_t \mid \mathbf{W}\mathbf{y}_t, \boldsymbol{\Sigma})$


What can we do with LDS?

Global Positioning System (GPS): the receiver estimates its position $\{x_t, y_t, z_t\}$ (and its clock) from noisy satellite measurements.

[Figure: actual path, noisy measured path, and filtered path of a moving receiver]

We still have noisy measurements.


What can we do with LDS?

Given a sequence of observations and the parameters:

(1) For a timestamp $t$, we want to find the probabilities of $\mathbf{y}_t$ given the observations up to that point. This process is called Filtering: $p(\mathbf{y}_t \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$

(2) For a timestamp $t$, we want to find the probabilities of $\mathbf{y}_t$ given the whole time series. This process is called Smoothing: $p(\mathbf{y}_t \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T)$

All the above probabilities are Gaussian; their means and covariance matrices are computed recursively (a sketch of the filtering recursion follows below).
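A minimal sketch (not from the slides; variable names are illustrative) of the Kalman filter recursion that produces these filtered Gaussians, alternating a correction step (condition on $\mathbf{x}_t$) with a prediction step (propagate through the transition model):

```python
# Minimal sketch: Kalman filter for the LDS above, returning the means and
# covariances of the filtered posteriors p(y_t | x_1, ..., x_t).
import numpy as np

def kalman_filter(X, W, A, mu0, Sigma, Gamma, P0):
    T = X.shape[0]
    Dy = A.shape[0]
    mus = np.zeros((T, Dy))
    Ps = np.zeros((T, Dy, Dy))
    mu_pred, P_pred = mu0, P0                        # prior on y_1
    for t in range(T):
        # Correction: condition the predicted Gaussian on the observation x_t.
        S = W @ P_pred @ W.T + Sigma                 # innovation covariance
        K = P_pred @ W.T @ np.linalg.inv(S)          # Kalman gain
        mus[t] = mu_pred + K @ (X[t] - W @ mu_pred)
        Ps[t] = (np.eye(Dy) - K @ W) @ P_pred
        # Prediction: propagate the filtered Gaussian to the next timestamp.
        mu_pred = A @ mus[t]
        P_pred = A @ Ps[t] @ A.T + Gamma
    return mus, Ps
```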


What can we do with LDS?

(3) Find the probability of the observations under the model, $p(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T)$. This process is called Evaluation.

(4) Prediction: $p(\mathbf{y}_{t+\delta} \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$ and $p(\mathbf{x}_{t+\delta} \mid \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t)$ (a prediction sketch follows below).

(5) EM parameter estimation (the LDS analogue of the Baum-Welch algorithm) of $\theta = \{\mathbf{W}, \mathbf{A}, \boldsymbol{\mu}_0, \boldsymbol{\Sigma}, \boldsymbol{\Gamma}, \mathbf{P}_0\}$
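For prediction, the filtered Gaussian can simply be pushed through the transition model $\delta$ times without any correction, and through the emission model for the predicted observation; a minimal sketch (not from the slides; names are illustrative):

```python
# Minimal sketch: delta-step-ahead prediction for the LDS, starting from a
# filtered Gaussian p(y_t | x_1..x_t) = N(mu, P).
import numpy as np

def lds_predict(mu, P, A, Gamma, W, Sigma, delta):
    """Return the parameters of p(y_{t+delta} | x_1..x_t) and p(x_{t+delta} | x_1..x_t)."""
    for _ in range(delta):
        mu = A @ mu
        P = A @ P @ A.T + Gamma          # latent prediction
    x_mean = W @ mu
    x_cov = W @ P @ W.T + Sigma          # predicted observation distribution
    return mu, P, x_mean, x_cov
```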
