Probabilistic Graphical Models
Lecture Notes Fall 2009
November 11, 2009
Byoung-Tak Zhang
School of Computer Science and Engineering & Cognitive Science, Brain Science, and Bioinformatics
Seoul National University http://bi.snu.ac.kr/~btzhang/
Chapter 7. Latent Variable Models

7.1 Factor Graphs

Directed vs. Undirected Graphs
Both graphical models
♦ Specify a factorization (how to express the joint distribution)
♦ Define a set of conditional independence properties
Fig. 7-1. Parent-child local conditional distribution
Fig. 7-2. Maximal clique potential function
Relation of Directed and Undirected Graphs
Converting a directed graph to an undirected graph
♦ Case 1: a chain (straight line)
♦ In this case, the partition function Z = 1
♦ Case 2: general case. Moralization = 'marrying the parents'
♦ Add additional undirected links between all pairs of parents of each node
♦ Drop the arrows
♦ Results in the moral graph
♦ If the moral graph becomes fully connected, it expresses no conditional independence properties, in contrast to the original directed graph
♦ We should add the fewest extra links so as to retain the maximum number of independence properties
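The moralization step can be sketched in a few lines of Python; the parent structure below is a hypothetical example, not one from the notes.

```python
from itertools import combinations

# Directed graph as a dict: node -> set of parents (hypothetical example).
parents = {
    "x1": set(), "x2": set(), "x3": set(),
    "x4": {"x1", "x2", "x3"},   # x4 has three parents
}

# Moralization: marry the parents, then drop the arrows.
undirected_edges = set()
for child, pa in parents.items():
    # Keep the original parent-child links (with arrows dropped).
    for p in pa:
        undirected_edges.add(frozenset((p, child)))
    # Add links between all pairs of parents of the same child.
    for p, q in combinations(pa, 2):
        undirected_edges.add(frozenset((p, q)))

print(sorted(tuple(sorted(e)) for e in undirected_edges))
```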
Factor Graphs

• A factor graph is a bipartite graph representing a joint distribution in the form of a product of factors (factor graphs are bipartite)
♦ Factorization as in directed/undirected graphs
♦ Introducing additional nodes for the factors themselves
♦ Explicit decomposition/factorization

\[ p(\mathbf{x}) = \prod_{s} f_s(\mathbf{x}_s) \]

• Definition. Given a factorization of a function g(X_1, ..., X_n),

\[ g(X_1, \ldots, X_n) = \prod_{j=1}^{m} f_j(S_j), \qquad S_j \subseteq \{X_1, \ldots, X_n\}, \]

the corresponding factor graph G = (X, F, E) consists of variable vertices X = {X_1, ..., X_n}, factor vertices F = {f_1, ..., f_m}, and edges E. The edges depend on the factorization as follows: there is an undirected edge between factor vertex f_j and variable vertex X_k when X_k ∈ S_j. The function is assumed to be real-valued.

• Factor graphs can be combined with message passing algorithms to efficiently compute certain characteristics of the function, such as the marginals.

• Example. Consider a function that factorizes as follows:

\[ g(X_1, X_2, X_3) = f_1(X_1)\, f_2(X_1, X_2)\, f_3(X_1, X_2)\, f_4(X_2, X_3), \]

with a corresponding factor graph.
• This factor graph has a cycle. If we merge f2(X1,X2)f3(X1,X2) into a single factor, the resulting factor graph will be a tree. This is an important distinction, as message passing algorithms are usually exact for trees, but only approximate for graphs with cycles.
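As a small illustration (not part of the original notes), the example's bipartite structure can be stored as a mapping from factors to their variable scopes; assuming the graph is connected, it is a tree exactly when the number of edges equals the number of nodes minus one.

```python
# Factor graph for g(X1,X2,X3) = f1(X1) f2(X1,X2) f3(X1,X2) f4(X2,X3),
# stored as factor name -> tuple of adjacent variable nodes.
factors = {
    "f1": ("X1",),
    "f2": ("X1", "X2"),
    "f3": ("X1", "X2"),
    "f4": ("X2", "X3"),
}
variables = {"X1", "X2", "X3"}

def is_tree(factors, variables):
    """A connected bipartite graph is a tree iff #edges == #nodes - 1."""
    n_nodes = len(factors) + len(variables)
    n_edges = sum(len(scope) for scope in factors.values())
    return n_edges == n_nodes - 1

print(is_tree(factors, variables))   # False: 7 edges, 7 nodes -> there is a cycle

# Merging f2 and f3 into a single factor f23 removes the cycle.
merged = {"f1": ("X1",), "f23": ("X1", "X2"), "f4": ("X2", "X3")}
print(is_tree(merged, variables))    # True: the resulting factor graph is a tree
```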
• Inferences on factor graphs
• Sum-product algorithm: evaluating local marginals over nodes or subsets of nodes
\[ p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x}) \]

\[ p(\mathbf{x}) = \prod_{s \in \mathrm{Ne}(x)} F_s(x, X_s) \]

\[ p(x) = \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{Ne}(x)} F_s(x, X_s)
        = \prod_{s \in \mathrm{Ne}(x)} \Bigg[ \sum_{X_s} F_s(x, X_s) \Bigg]
        = \prod_{s \in \mathrm{Ne}(x)} \mu_{f_s \to x}(x) \]
The messages \( \mu_{f_s \to x}(x) \) from the factor node f_s to the variable node x are computed in the vertices and passed along the edges.
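A minimal numerical check of the marginalization formula above (not from the notes), for an assumed two-factor chain p(x1, x2, x3) ∝ f_a(x1, x2) f_b(x2, x3) over binary variables: the marginal of x2 is the product of the factor-to-variable messages, each obtained by summing its factor over the remaining variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two positive factors over binary variables: fa(x1, x2), fb(x2, x3).
fa = rng.random((2, 2)) + 0.1
fb = rng.random((2, 2)) + 0.1

# Factor-to-variable messages into x2 (sum out the other variable of each factor).
mu_fa_to_x2 = fa.sum(axis=0)        # sum over x1 -> vector indexed by x2
mu_fb_to_x2 = fb.sum(axis=1)        # sum over x3 -> vector indexed by x2

# Marginal of x2 is the product of incoming messages, then normalize.
p_x2 = mu_fa_to_x2 * mu_fb_to_x2
p_x2 /= p_x2.sum()

# Brute-force check against the full joint distribution.
joint = fa[:, :, None] * fb[None, :, :]          # indexed by (x1, x2, x3)
p_x2_brute = joint.sum(axis=(0, 2))
p_x2_brute /= p_x2_brute.sum()

print(np.allclose(p_x2, p_x2_brute))             # True
```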
• Max-sum algorithm: finding the most probable state
• The Hammersley–Clifford theorem shows that other probabilistic models such as Markov networks and Bayesian networks can be represented as factor graphs.
• Factor graph representation is frequently used when performing inference over such networks using belief propagation.
• On the other hand, Bayesian networks are more naturally suited for generative models, as they can directly represent the causalities of the model.
Properties of Factor Graphs
Converting directed and undirected graphs into factor graphs
♦ undirected graph → factor graph
Note: For a given fully connected undirected graph, two (or more) different factor graphs are possible. Factor graphs are more specific than the undirected graphs.
♦ directed graph → factor graph
♦ For the same directed graph, two or more factor graphs are possible. There can be multiple factor
graphs all of which correspond to the same undirected/directed graph
Converting a directed/undirected tree to a factor graph
♦ The result is again a tree (no loops, one and only one path connecting any two nodes)
Converting a directed polytree to a factor graph
♦ The result is again a tree.
♦ Cf. Converting a directed polytree into an undirected graph results in loops due to the moralization step.
(Figure: Polytree / Undirected graph (moral graph) / Factor graph)
Local cycles in a directed graph can be removed on conversion to a factor graph
Factor graphs are more specific about the precise form of the factorization
For a fully connected undirected graph, two (or more) factor graphs are possible.
Directed and undirected graphs can express different conditional independence properties
(Figure: a directed graph and an undirected graph expressing different conditional independence properties)
7.2 Probabilistic Latent Semantic Analysis (PLSA)

Latent Variable Models

Latent variables
♦ Variables that are not directly observed but are rather inferred from other variables that are observed and directly measured

Latent variable models
♦ Explain the statistical properties of the observed variables in terms of the latent variables

General formulation

\[ p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z} = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z} \]
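A hedged sketch of this general formulation for an assumed one-dimensional linear-Gaussian model (chosen only because its marginal is available in closed form for checking): the integral is approximated by a Monte Carlo average of p(x | z) over samples z ~ p(z).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical latent variable model: z ~ N(0, 1), x | z ~ N(z, 0.5^2).
def sample_z(n):
    return rng.normal(0.0, 1.0, size=n)

def p_x_given_z(x, z):
    return norm.pdf(x, loc=z, scale=0.5)

# Monte Carlo estimate of p(x) = E_{z ~ p(z)}[ p(x | z) ].
x = 1.0
z_samples = sample_z(100_000)
p_x_mc = p_x_given_z(x, z_samples).mean()

# For this linear-Gaussian model the marginal is x ~ N(0, 1 + 0.25),
# which lets us check the Monte Carlo estimate.
p_x_exact = norm.pdf(x, loc=0.0, scale=np.sqrt(1.0 + 0.25))
print(p_x_mc, p_x_exact)
```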
7.3 Gaussian Mixture Models
Graphical representation of a mixture model
A binary random variable z having a 1-of-K representation
Gaussian mixture distribution can be written as a linear superposition of Gaussians
\[ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \qquad (*) \]
An equivalent formulation of the Gaussian mixture involving an explicit latent variable
\[ p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k} \]

\[ p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

\[ p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k} \]

\[ p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z})
              = \sum_{\mathbf{z}} p(\mathbf{z}) \, p(\mathbf{x} \mid \mathbf{z})
              = \sum_{\mathbf{z}} \prod_{k=1}^{K} \big[ \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \big]^{z_k}
              = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

\[ p(z_k = 1) = \pi_k, \qquad 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1 \]
The marginal distribution of x is a Gaussian mixture of the form (*)
For every observed data point x_n, there is a corresponding latent variable z_n.

\[ p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z}) \]
\[ \gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x})
   = \frac{p(z_k = 1)\, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\mathbf{x} \mid z_j = 1)}
   = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \]
γ(z_k) can also be viewed as the responsibility that component k takes for explaining the observation x.
Generating random samples distributed according to the Gaussian mixture model
♦ Generate a value for z, denoted ẑ, from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | ẑ)
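A short sketch of this ancestral sampling procedure with hypothetical mixture parameters (the π, μ, Σ values below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D mixture with K = 3 components.
pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])])

def sample_gmm(n):
    # Step 1: sample z_n (a component index) from p(z) = Categorical(pi).
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: sample x_n from the conditional p(x | z_n) = N(mu_z, Sigma_z).
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

X, Z = sample_gmm(500)
print(X.shape, np.bincount(Z))
```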
a. The three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue
b. The corresponding samples from the marginal distribution p(x)
c. The same samples in which the colors represent the value of the responsibilities γ(z_nk) associated with data point x_n, illustrating the responsibilities by evaluating the posterior probability for each component in the mixture distribution from which this data set was generated
Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {x_n}, with corresponding latent points {z_n}

Data set: X (N × D matrix) with n-th row x_n^T
Latent variables: Z (N × K matrix) with n-th row z_n^T
The log of the likelihood function
\[ \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
   = \sum_{n=1}^{N} \ln \Bigg\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \Bigg\} \]
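A possible way (not given in the notes) to evaluate this log likelihood numerically; working in the log domain with logsumexp guards against underflow when the component densities are small.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pi)
    # log_prob[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_prob = np.empty((N, K))
    for k in range(K):
        log_prob[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
    return logsumexp(log_prob, axis=1).sum()
```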
7.4 Learning Gaussian Mixtures by EM
The Gaussian mixture models can be learned by the expectation-maximization (EM) algorithm.
Repeat
♦ Expectation step: calculate posterior probabilities or responsibilities using the current parameters
♦ Maximization step: re-estimate the parameters based on the responsibilities
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters.
1. Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k
2. E-step: evaluate the posterior probabilities or responsibilities using the current value for the parameters
\[ \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \]
3. M-step: re-estimate the means, covariances, and mixing coefficients using the result of E-step.
\[ \boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n \]

\[ \boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^{\mathrm{T}} \]

\[ \pi_k^{\text{new}} = \frac{N_k}{N}, \qquad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \]
4. Evaluate the log likelihood
\[ \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
   = \sum_{n=1}^{N} \ln \Bigg( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \Bigg) \]
If converged, terminate; otherwise, go to Step 2.
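A non-authoritative NumPy sketch of steps 1-4 above; the random-data-point initialization and the small diagonal term added to the covariances are illustrative choices, not part of the notes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_densities(X, pis, mus, Sigmas):
    """dens[n, k] = pi_k N(x_n | mu_k, Sigma_k)."""
    return np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                            for k in range(len(pis))])

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # 1. Initialize means (K random data points), covariances, mixing coefficients.
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. E-step: responsibilities gamma(z_nk) under the current parameters.
        dens = mixture_densities(X, pis, mus, Sigmas)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # 3. M-step: re-estimate means, covariances, and mixing coefficients.
        Nk = gamma.sum(axis=0)                       # N_k = sum_n gamma(z_nk)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N

        # 4. Evaluate the log likelihood with the updated parameters; stop if converged.
        ll = np.log(mixture_densities(X, pis, mus, Sigmas).sum(axis=1)).sum()
        if np.abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return pis, mus, Sigmas, gamma
```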
The General EM Algorithm
In maximizing the log likelihood function \( \ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \big\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \big\} \), the summation prevents the logarithm from acting directly on the joint distribution.
♦ Instead, the log likelihood function for the complete data set {X, Z} is straightforward.
♦ In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z | X, θ) of the latent variables.
♦ General EM Algorithm
1. Choose an initial setting for the parameters θ^old
2. E step: Evaluate p(Z | X, θ^old)
3. M step: Evaluate θ^new given by
\[ \boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) \]

\[ Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \]
4. If the convergence criterion is not satisfied, then let θ^old ← θ^new and return to Step 2.
♦ The EM algorithm can also be used for finding MAP (maximum a posteriori) estimates, using the modified M-step that maximizes

\[ Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) + \ln p(\boldsymbol{\theta}) \]
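A schematic of the general EM loop (a sketch, not the notes' formulation); `e_step` and `m_step` are hypothetical callables supplied by a concrete model, the latter maximizing Q(θ, θ^old), optionally plus ln p(θ) for the MAP variant.

```python
def em(theta_init, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM skeleton (hypothetical interface).

    e_step(theta)     -> posterior statistics of p(Z | X, theta)
    m_step(posterior) -> (theta_new, log_likelihood), where theta_new maximizes
                         Q(theta, theta_old) (plus ln p(theta) for MAP).
    """
    theta, prev_ll = theta_init, float("-inf")
    for _ in range(max_iter):
        posterior = e_step(theta)         # E step: evaluate p(Z | X, theta_old)
        theta, ll = m_step(posterior)     # M step: theta_new = argmax Q(theta, theta_old)
        if abs(ll - prev_ll) < tol:       # convergence check on the log likelihood
            break
        prev_ll = ll
    return theta
```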