
Page 1

Introduction to Graphical Models

Brookes Vision Lab Reading Group

Page 2

Graphical Models

• Build a complex system from simpler parts.
• The combined system should remain consistent.
• Parts are combined using probability theory.

• Undirected – Markov random fields

• Directed – Bayesian Networks

Page 3

Overview

• Representation

• Inference

• Linear Gaussian Models

• Approximate inference

• Learning

Page 4

Representation

Causality: Sprinkler "causes" wet grass

Page 5

Conditional Independence

• Each node is independent of its ancestors given its parents
• P(C,S,R,W) = P(C) P(S|C) P(R|C,S) P(W|C,S,R)
• = P(C) P(S|C) P(R|C) P(W|S,R)

• Space required for n binary nodes:
  – O(2^n) without factorization
  – O(n·2^k) with factorization, k = maximum fan-in

Page 6

Inference

• Pr(S=1|W=1) = Pr(S=1,W=1)/Pr(W=1)

= 0.2781/0.6471

= 0.430

• Pr(R=1|W=1) = Pr(R=1,W=1)/Pr(W=1)

= 0.4581/0.6471

= 0.708

Page 7

Explaining Away

• S and R “compete” to explain W=1

• S and R are conditionally dependent given W

• Pr(S=1|R=1,W=1) = 0.1945, down from Pr(S=1|W=1) = 0.430 (verified in the sketch below)
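The numbers on this and the previous page can be reproduced by brute-force enumeration of the factorized joint. A minimal Python sketch; the CPT values are not shown on these slides, so the standard ones from Kevin Murphy's Bayes-net tutorial are assumed here:

```python
import itertools

# CPTs for the sprinkler network (assumed values; the slides do not list them).
p_c = {0: 0.5, 1: 0.5}                      # P(C=1) and P(C=0)
p_s = {0: 0.5, 1: 0.1}                      # P(S=1 | C)
p_r = {0: 0.2, 1: 0.8}                      # P(R=1 | C)
p_w = {(0, 0): 0.0, (0, 1): 0.9,
       (1, 0): 0.9, (1, 1): 0.99}           # P(W=1 | S, R)

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    ps = p_s[c] if s else 1 - p_s[c]
    pr = p_r[c] if r else 1 - p_r[c]
    pw = p_w[s, r] if w else 1 - p_w[s, r]
    return p_c[c] * ps * pr * pw

def prob(**fixed):
    """Marginal probability of the assignments in `fixed`, by enumeration."""
    return sum(joint(c, s, r, w)
               for c, s, r, w in itertools.product((0, 1), repeat=4)
               if all({'C': c, 'S': s, 'R': r, 'W': w}[k] == v
                      for k, v in fixed.items()))

print(prob(S=1, W=1) / prob(W=1))               # 0.4298 -> 0.430
print(prob(R=1, W=1) / prob(W=1))               # 0.7079 -> 0.708
print(prob(S=1, R=1, W=1) / prob(R=1, W=1))     # 0.1945 (explaining away)
```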

Page 8

Inference

Pr(S=1|W=1) = Σ_{c,r} Pr(C=c, S=1, R=r, W=1) / Pr(W=1)

where Pr(W=1) = Σ_{c,s,r} Pr(C=c, S=s, R=r, W=1), and each joint term factorizes as on Page 5.

Page 9

Inference

• Variable elimination
• Choosing an optimal elimination ordering is NP-hard
• Greedy methods work well
• Computing several marginals
• Dynamic programming avoids redundant computation (see the sketch below)
• Sound familiar?
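As an illustration of elimination order, here is a sketch (reusing the CPT dictionaries and assumed values from the sketch after Page 7) that computes Pr(W=1) by summing out C first, leaving a factor over (S, R):

```python
def phi_sr(s, r):
    """Factor over (S, R) after eliminating C: sum_c P(c) P(s|c) P(r|c)."""
    return sum(p_c[c]
               * (p_s[c] if s else 1 - p_s[c])
               * (p_r[c] if r else 1 - p_r[c])
               for c in (0, 1))

# P(W=1) = sum_{s,r} phi(s,r) P(W=1|s,r): four weighted terms instead of
# enumerating all eight assignments of (C, S, R).
p_w1 = sum(phi_sr(s, r) * p_w[s, r] for s in (0, 1) for r in (0, 1))
print(p_w1)  # 0.6471
```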

Page 10

Bayes Ball Algorithm for Conditional Independence

Page 11

A Unifying (Re)View

The basic model specializes into continuous-state and discrete-state Linear Gaussian Models (LGMs):

• Continuous-State LGM: FA, SPCA, PCA, LDS
• Discrete-State LGM: Mixture of Gaussians, VQ, HMM

Page 12

Basic Model

• State of the system is a k-vector x (unobserved)
• Output of the system is a p-vector y (observed)
• Often k << p
• Basic model:
  x_{t+1} = A x_t + w
  y_t = C x_t + v
• A is the k x k transition matrix
• C is the p x k observation matrix
• w ~ N(0, Q)
• v ~ N(0, R)
• Noise processes are essential
• Means are zero w.l.o.g. (simulation sketch below)
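A minimal simulation sketch of this generative process; all dimensions and parameter values below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
k, p, T = 2, 5, 100                       # illustrative sizes with k << p
A = np.array([[0.9, -0.2],
              [0.2,  0.9]])               # stable k x k transition (assumed values)
C = rng.standard_normal((p, k))           # p x k observation matrix
Q, R = np.eye(k), 0.1 * np.eye(p)         # state and observation noise covariances
mu1, Q1 = np.zeros(k), np.eye(k)          # initial state distribution

x = rng.multivariate_normal(mu1, Q1)
xs, ys = [], []
for _ in range(T):
    xs.append(x)
    ys.append(C @ x + rng.multivariate_normal(np.zeros(p), R))  # y_t = C x_t + v
    x = A @ x + rng.multivariate_normal(np.zeros(k), Q)         # x_{t+1} = A x_t + w
xs, ys = np.array(xs), np.array(ys)
```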

Page 13

Degeneracy in Basic Model

• Structure in Q can be moved into A and C
• W.l.o.g. Q = I
• R cannot be restricted, as the y_t are observed
• Components of x can be reordered arbitrarily
• The ordering is fixed by the norms of the columns of C
• x_1 ~ N(µ_1, Q_1)
• A and C are assumed to have rank k
• Q, R, Q_1 are assumed to be full rank

Page 14

Probability Computation

• P(x_{t+1} | x_t) = N(A x_t, Q; x_{t+1})
• P(y_t | x_t) = N(C x_t, R; y_t)
• P({x_1,…,x_T}, {y_1,…,y_T}) = P(x_1) ∏_{t=1}^{T-1} P(x_{t+1} | x_t) ∏_{t=1}^{T} P(y_t | x_t)
• The negative log probability is a sum of quadratic (Mahalanobis) terms (sketch below)
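A direct transcription of the factorized probability above; it reuses the simulated xs, ys and parameters from the Page 12 sketch. Each logpdf call contributes a quadratic (Mahalanobis) penalty plus a normalization constant:

```python
from scipy.stats import multivariate_normal as mvn

def joint_log_prob(xs, ys, A, C, Q, R, mu1, Q1):
    """log P({x_1..x_T}, {y_1..y_T}) for the basic LGM."""
    lp = mvn.logpdf(xs[0], mu1, Q1)                 # initial state term
    for t in range(len(xs) - 1):
        lp += mvn.logpdf(xs[t + 1], A @ xs[t], Q)   # transition terms
    for t in range(len(ys)):
        lp += mvn.logpdf(ys[t], C @ xs[t], R)       # observation terms
    return lp

print(joint_log_prob(xs, ys, A, C, Q, R, mu1, Q1))
```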

Page 15

Inference

• Given model parameters {A, C, Q, R, µ_1, Q_1}
• Given observations y
• What can be inferred about the hidden states x?
• Total likelihood P({y_1, …, y_T})
• Filtering: P(x_t | y_1, …, y_t) (Kalman-style sketch below)
• Smoothing: P(x_t | y_1, …, y_T)
• Partial smoothing: P(x_t | y_1, …, y_{t+t'})
• Partial prediction: P(x_t | y_1, …, y_{t-t'})
• All are intermediate values of the recursive methods for computing the total likelihood
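A minimal filtering sketch. The slide only names the distributions; the predict/update recursions below are the standard Kalman filter, written for the basic model of Page 12:

```python
import numpy as np

def kalman_filter(ys, A, C, Q, R, mu1, Q1):
    """Forward recursion for P(x_t | y_1..y_t) in the basic LGM.
    ys is (T, p); returns filtered means (T, k) and covariances (T, k, k)."""
    k = len(mu1)
    mu, P = np.asarray(mu1, float), np.asarray(Q1, float)
    mus, Ps = [], []
    for t, y in enumerate(ys):
        if t > 0:                              # predict one step ahead
            mu = A @ mu
            P = A @ P @ A.T + Q
        S = C @ P @ C.T + R                    # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)         # Kalman gain
        mu = mu + K @ (y - C @ mu)             # measurement update
        P = (np.eye(k) - K @ C) @ P
        mus.append(mu.copy()); Ps.append(P.copy())
    return np.array(mus), np.array(Ps)
```

Running it on the simulated ys from the Page 12 sketch recovers the hidden states up to the degeneracies noted on Page 13.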

Page 16

Learning

• Unknown parameters {A, C, Q, R, µ_1, Q_1}
• Given observations y
• Log-likelihood L(Θ)
• F(Q, Θ) – free energy

Page 17

EM algorithm

• Alternate between maximizing F(Q, Θ) w.r.t. Q (E-step) and Θ (M-step)
• F = L at the beginning of the M-step
• The E-step does not change Θ
• Therefore, the likelihood does not decrease (derivation below)
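The monotonicity argument can be made explicit with the standard free-energy decomposition (not printed on the slide, but standard EM theory):

F(Q, Θ) = E_Q[log P(x, y | Θ)] - E_Q[log Q(x)]
        = L(Θ) - KL( Q(x) || P(x | y, Θ) )  ≤  L(Θ)

The E-step sets Q(x) = P(x | y, Θ), driving the KL term to zero so that F = L. The M-step then increases F w.r.t. Θ, giving

L(Θ_new) ≥ F(Q, Θ_new) ≥ F(Q, Θ_old) = L(Θ_old).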

Page 18

Continuous-State LGM

• Static data modeling (no temporal dependence): Factor Analysis, SPCA, PCA
• Time-series modeling (time ordering of the data is crucial): LDS (Kalman filter models)

Page 19

Static Data Modelling

• A = 0
• x = w
• y = C x + v
• x ~ N(0, Q)
• y ~ N(0, C Q C' + R)
• Degeneracy in the model
• Learning: EM
  – R restricted
• Inference

Page 20

Factor Analysis

• Restrict R to be diagonal
• Q = I
• x – factors
• C – factor loading matrix
• R – uniquenesses
• Learning – EM, quasi-Newton optimization (EM sketch below)
• Inference
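A minimal EM sketch for factor analysis, assuming zero-mean data Y of shape (N, p); the closed-form updates are the standard ones (e.g. Ghahramani and Hinton's tech report), not copied from the slide:

```python
import numpy as np

def fa_em(Y, k, n_iter=100, seed=0):
    """EM for factor analysis: y = C x + v, x ~ N(0, I), v ~ N(0, R), R diagonal.
    Returns C (p, k) and the diagonal of R (p,)."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    C = rng.standard_normal((p, k))
    R = Y.var(axis=0) + 1e-6                  # diagonal uniquenesses
    S = (Y.T @ Y) / N                         # sample covariance
    I = np.eye(k)
    for _ in range(n_iter):
        # E-step: posterior moments of the factors given current C, R
        G = C @ C.T + np.diag(R)
        beta = C.T @ np.linalg.inv(G)         # (k, p)
        Ex = Y @ beta.T                       # E[x | y], one row per data point
        sum_Exx = N * (I - beta @ C) + Ex.T @ Ex
        # M-step: closed-form updates
        C = (Y.T @ Ex) @ np.linalg.inv(sum_Exx)
        R = np.diag(S - C @ (Ex.T @ Y) / N)
    return C, R
```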

Page 21

SPCA

• R = εI
• ε – global noise level
• Columns of C span the principal subspace
• Learning – EM algorithm
• Inference

Page 22

PCA

• R = lim_{ε→0} εI
• Learning:
  – Diagonalize the sample covariance of the data
  – Leading k eigenvalues and eigenvectors define C
  – EM determines the leading eigenvectors without diagonalization
• Inference (sketch below):
  – Noise becomes infinitesimal
  – Posterior collapses to a single point
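A minimal sketch of the learning and inference just described, again assuming zero-mean data Y of shape (N, p):

```python
import numpy as np

def pca(Y, k):
    """PCA by diagonalizing the sample covariance of zero-mean data Y (N, p)."""
    S = (Y.T @ Y) / len(Y)
    eigvals, eigvecs = np.linalg.eigh(S)           # eigh returns ascending order
    C = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # leading k eigenvectors
    X = Y @ C   # inference: the collapsed posterior is the orthogonal projection
    return C, X
```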

Page 23

Linear Dynamical Systems

• Inference – Kalman filter (filtering sketch on Page 15)
• Smoothing – RTS recursions (sketch below)
• Learning – EM algorithm
  – C known – Shumway and Stoffer, 1982
  – All unknown – Ghahramani and Hinton, 1995
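A sketch of the RTS backward pass, consuming the filtered means and covariances returned by the Kalman filter sketch on Page 15; the recursion is the standard Rauch-Tung-Striebel smoother, not spelled out on the slide:

```python
import numpy as np

def rts_smooth(mu_f, P_f, A, Q):
    """Backward pass: turns filtered P(x_t | y_1..t) into smoothed P(x_t | y_1..T)."""
    mu_s, P_s = mu_f.copy(), P_f.copy()
    for t in range(len(mu_f) - 2, -1, -1):
        P_pred = A @ P_f[t] @ A.T + Q               # predicted covariance
        J = P_f[t] @ A.T @ np.linalg.inv(P_pred)    # smoother gain
        mu_s[t] = mu_f[t] + J @ (mu_s[t + 1] - A @ mu_f[t])
        P_s[t] = P_f[t] + J @ (P_s[t + 1] - P_pred) @ J.T
    return mu_s, P_s
```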

Page 24

Discrete-State LGM

• x_{t+1} = WTA[A x_t + w]
• y_t = C x_t + v
• x_1 = WTA[N(µ_1, Q_1)]
• WTA[·] is the winner-take-all nonlinearity: it returns the unit vector e_j with a 1 in the position of its argument's largest component

Page 25

Discrete-State LGM

• Static data modeling: Mixture of Gaussians, VQ
• Time-series modeling: HMM

Page 26

Static Data Modelling

• A = 0
• x = WTA[w]
• w ~ N(µ, Q)
• y = C x + v
• π_j = P(x = e_j)
• Nonzero µ for nonuniform π_j
• y | x = e_j ~ N(C_j, R)
• C_j – jth column of C

Page 27

Mixture of Gaussians

• Mixing coefficient of cluster j – π_j
• Means – columns C_j of C
• Covariance – R
• Learning: EM (corresponds to maximum-likelihood competitive learning; sketch below)
• Inference
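A minimal EM sketch matching the parameterization above: mixing coefficients π, cluster means as the columns of C, and a single shared covariance R, as the LGM view implies. Data Y is (N, p); the initialization choices are assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def mog_em(Y, k, n_iter=100, seed=0):
    """EM for a mixture of k Gaussians with shared covariance R."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    pi = np.full(k, 1.0 / k)                     # mixing coefficients
    C = Y[rng.choice(N, k, replace=False)].T     # means = columns of C, (p, k)
    R = np.cov(Y.T) + 1e-6 * np.eye(p)           # shared covariance
    for _ in range(n_iter):
        # E-step: responsibilities r[n, j] = P(x = e_j | y_n)
        logp = np.stack([np.log(pi[j]) + mvn.logpdf(Y, C[:, j], R)
                         for j in range(k)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)  # stabilize before exponentiating
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        Nj = r.sum(axis=0)
        pi = Nj / N
        C = (Y.T @ r) / Nj
        d = Y[:, :, None] - C[None, :, :]        # (N, p, k) residuals
        R = np.einsum('nk,npk,nqk->pq', r, d, d) / N + 1e-6 * np.eye(p)
    return pi, C, R
```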

Page 28

Vector Quantization

• Observation noise becomes infinitesimal

• Inference problem solved by 1NN rule

• Euclidean distance for isotropic R

• Mahalanobis distance for general R

• Posterior collapses to closest cluster

• Learning with EM = batch version of k-means (sketch below)
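A batch k-means sketch, implementing the zero-noise limit just described (1-NN assignment with Euclidean distance, i.e. isotropic R is assumed):

```python
import numpy as np

def kmeans(Y, k, n_iter=50, seed=0):
    """Batch k-means: hard 1-NN assignments, then mean updates."""
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), k, replace=False)]            # initial centers (k, p)
    for _ in range(n_iter):
        d = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # squared distances
        z = d.argmin(axis=1)                 # posterior collapses to closest cluster
        C = np.array([Y[z == j].mean(axis=0) if np.any(z == j) else C[j]
                      for j in range(k)])    # keep empty clusters where they are
    return C, z
```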

Page 29

Time-series modelling

Page 30

HMM

• Transition matrix T
• T_{ij} = P(x_{t+1} = e_j | x_t = e_i)
• For every T, there exist A and Q realizing it under the WTA nonlinearity
• Filtering: forward recursions (sketch below)
• Smoothing: forward-backward algorithm
• Learning: EM (called Baum-Welch re-estimation)
• MAP state sequences – Viterbi
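A scaled forward-recursion sketch. The observation likelihoods obs_lik are passed in precomputed; under this model they would be Gaussian densities N(C_j, R) evaluated at each y_t (an assumption noted in the docstring, since the slide only names the algorithm):

```python
import numpy as np

def hmm_forward(obs_lik, T, pi):
    """Scaled forward recursion for an HMM.
    obs_lik[t, j] = P(y_t | x_t = e_j)  (here: a Gaussian density N(C_j, R) at y_t),
    T[i, j]       = P(x_{t+1} = e_j | x_t = e_i),
    pi[j]         = P(x_1 = e_j).
    Returns filtered posteriors alpha[t] = P(x_t | y_1..y_t) and log P(y_1..y_n)."""
    n, k = obs_lik.shape
    alpha = np.zeros((n, k))
    loglik = 0.0
    for t in range(n):
        pred = pi if t == 0 else alpha[t - 1] @ T    # one-step state prediction
        alpha[t] = pred * obs_lik[t]
        s = alpha[t].sum()                           # P(y_t | y_1..y_{t-1})
        loglik += np.log(s)
        alpha[t] /= s                                # normalize for stability
    return alpha, loglik
```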