Probabilistic Graphical Models
Lecture Notes Fall 2009
November 11, 2009
Byoung-Tak Zhang
School of Computer Science and Engineering & Cognitive Science, Brain Science, and Bioinformatics
Seoul National University http://bi.snu.ac.kr/~btzhang/
Chapter 7. Latent Variable Models

7.1 Factor Graphs

Directed vs. Undirected Graphs
Both graphical models
♦ Specify a factorization (how to express the joint distribution)
♦ Define a set of conditional independence properties
Fig. 7-1. Parent-child local conditional distribution
Fig. 7-2. Maximal clique potential function
Relation of Directed and Undirected Graphs
Converting a directed graph to an undirected graph
♦ Case 1: a chain (straight line)
♦ In this case, the partition function Z = 1
♦ Case 2: general case. Moralization = 'marrying the parents'
♦ Add additional undirected links between all pairs of parents of each node
♦ Drop the arrows
♦ Results in the moral graph
♦ If the moral graph becomes fully connected, it expresses no conditional independence properties, in contrast to the original directed graph
♦ We should add the fewest extra links so as to retain the maximum number of independence properties
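The moralization step can be sketched in a few lines of Python; the parent structure below is a hypothetical example, not one from the notes.

```python
from itertools import combinations

# Directed graph as a dict: node -> set of parents (hypothetical example).
parents = {
    "x1": set(), "x2": set(), "x3": set(),
    "x4": {"x1", "x2", "x3"},   # x4 has three parents
}

# Moralization: marry the parents, then drop the arrows.
undirected_edges = set()
for child, pa in parents.items():
    # Keep the original parent-child links (with arrows dropped).
    for p in pa:
        undirected_edges.add(frozenset((p, child)))
    # Add links between all pairs of parents of the same child.
    for p, q in combinations(pa, 2):
        undirected_edges.add(frozenset((p, q)))

print(sorted(tuple(sorted(e)) for e in undirected_edges))
```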
Factor Graphs

• A factor graph is a bipartite graph representing a joint distribution in the form of a product of factors (factor graphs are bipartite)
♦ Factorization as in directed/undirected graphs
♦ Introducing additional nodes for the factors themselves
♦ Explicit decomposition/factorization

\[ p(\mathbf{x}) = \prod_{s} f_s(\mathbf{x}_s) \]

• Definition. Given a factorization of a function g(X_1, ..., X_n),

\[ g(X_1, \ldots, X_n) = \prod_{j=1}^{m} f_j(S_j), \qquad S_j \subseteq \{X_1, \ldots, X_n\}, \]

the corresponding factor graph G = (X, F, E) consists of variable vertices X = {X_1, ..., X_n}, factor vertices F = {f_1, ..., f_m}, and edges E. The edges depend on the factorization as follows: there is an undirected edge between factor vertex f_j and variable vertex X_k when X_k ∈ S_j. The function is assumed to be real-valued.

• Factor graphs can be combined with message passing algorithms to efficiently compute certain characteristics of the function, such as the marginals.

• Example. Consider a function that factorizes as follows:

\[ g(X_1, X_2, X_3) = f_1(X_1)\, f_2(X_1, X_2)\, f_3(X_1, X_2)\, f_4(X_2, X_3), \]

with a corresponding factor graph.
• This factor graph has a cycle. If we merge f2(X1,X2)f3(X1,X2) into a single factor, the resulting factor graph will be a tree. This is an important distinction, as message passing algorithms are usually exact for trees, but only approximate for graphs with cycles.
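As a small illustration (not part of the original notes), the example's bipartite structure can be stored as a mapping from factors to their variable scopes; assuming the graph is connected, it is a tree exactly when the number of edges equals the number of nodes minus one.

```python
# Factor graph for g(X1,X2,X3) = f1(X1) f2(X1,X2) f3(X1,X2) f4(X2,X3),
# stored as factor name -> tuple of adjacent variable nodes.
factors = {
    "f1": ("X1",),
    "f2": ("X1", "X2"),
    "f3": ("X1", "X2"),
    "f4": ("X2", "X3"),
}
variables = {"X1", "X2", "X3"}

def is_tree(factors, variables):
    """A connected bipartite graph is a tree iff #edges == #nodes - 1."""
    n_nodes = len(factors) + len(variables)
    n_edges = sum(len(scope) for scope in factors.values())
    return n_edges == n_nodes - 1

print(is_tree(factors, variables))   # False: 7 edges, 7 nodes -> there is a cycle

# Merging f2 and f3 into a single factor f23 removes the cycle.
merged = {"f1": ("X1",), "f23": ("X1", "X2"), "f4": ("X2", "X3")}
print(is_tree(merged, variables))    # True: the resulting factor graph is a tree
```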
• Inferences on factor graphs
• Sum-product algorithm: evaluating local marginals over nodes or subsets of nodes
\[ p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x}) \]

\[ p(\mathbf{x}) = \prod_{s \in \mathrm{Ne}(x)} F_s(x, X_s) \]

\[ p(x) = \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{Ne}(x)} F_s(x, X_s)
        = \prod_{s \in \mathrm{Ne}(x)} \Bigg[ \sum_{X_s} F_s(x, X_s) \Bigg]
        = \prod_{s \in \mathrm{Ne}(x)} \mu_{f_s \to x}(x) \]
The messages \( \mu_{f_s \to x}(x) \) from the factor node f_s to the variable node x are computed in the vertices and passed along the edges.
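A minimal numerical check of the marginalization formula above (not from the notes), for an assumed two-factor chain p(x1, x2, x3) ∝ f_a(x1, x2) f_b(x2, x3) over binary variables: the marginal of x2 is the product of the factor-to-variable messages, each obtained by summing its factor over the remaining variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two positive factors over binary variables: fa(x1, x2), fb(x2, x3).
fa = rng.random((2, 2)) + 0.1
fb = rng.random((2, 2)) + 0.1

# Factor-to-variable messages into x2 (sum out the other variable of each factor).
mu_fa_to_x2 = fa.sum(axis=0)        # sum over x1 -> vector indexed by x2
mu_fb_to_x2 = fb.sum(axis=1)        # sum over x3 -> vector indexed by x2

# Marginal of x2 is the product of incoming messages, then normalize.
p_x2 = mu_fa_to_x2 * mu_fb_to_x2
p_x2 /= p_x2.sum()

# Brute-force check against the full joint distribution.
joint = fa[:, :, None] * fb[None, :, :]          # indexed by (x1, x2, x3)
p_x2_brute = joint.sum(axis=(0, 2))
p_x2_brute /= p_x2_brute.sum()

print(np.allclose(p_x2, p_x2_brute))             # True
```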
• Max-sum algorithm: finding the most probable state
• The Hammersley–Clifford theorem shows that other probabilistic models such as Markov networks and Bayesian networks can be represented as factor graphs.
• Factor graph representation is frequently used when performing inference over such networks using belief propagation.
• On the other hand, Bayesian networks are more naturally suited for generative models, as they can directly represent the causalities of the model.
Properties of Factor Graphs
Converting directed and undirected graphs into factor graphs
♦ undirected graph → factor graph
Note: For a given fully connected undirected graph, two (or more) different factor graphs are possible. Factor graphs are more specific than the undirected graphs.
♦ directed graph → factor graph
♦ For the same directed graph, two or more factor graphs are possible. There can be multiple factor
graphs all of which correspond to the same undirected/directed graph
Converting a directed/undirected tree to a factor graph
♦ The result is again a tree (no loops, one and only one path connecting any two nodes)
Converting a directed polytree to a factor graph
♦ The result is again a tree.
♦ Cf. Converting a directed polytree into an undirected graph results in loops due to the moralization step.
(Figure: Polytree / Undirected graph (moral graph) / Factor graph)
Local cycles in a directed graph can be removed on conversion to a factor graph
Factor graphs are more specific about the precise form of the factorization
For a fully connected undirected graph, two (or more) factor graphs are possible.
Directed and undirected graphs can express different conditional independence properties
(Figure: a directed graph and an undirected graph expressing different conditional independence properties)
7.2 Probabilistic Latent Semantic Analysis (PLSA)

Latent Variable Models

Latent variables
♦ Variables that are not directly observed but are rather inferred from other variables that are observed and directly measured

Latent variable models
♦ Explain the statistical properties of the observed variables in terms of the latent variables

General formulation

\[ p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z} = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z} \]
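A hedged sketch of this general formulation for an assumed one-dimensional linear-Gaussian model (chosen only because its marginal is available in closed form for checking): the integral is approximated by a Monte Carlo average of p(x | z) over samples z ~ p(z).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical latent variable model: z ~ N(0, 1), x | z ~ N(z, 0.5^2).
def sample_z(n):
    return rng.normal(0.0, 1.0, size=n)

def p_x_given_z(x, z):
    return norm.pdf(x, loc=z, scale=0.5)

# Monte Carlo estimate of p(x) = E_{z ~ p(z)}[ p(x | z) ].
x = 1.0
z_samples = sample_z(100_000)
p_x_mc = p_x_given_z(x, z_samples).mean()

# For this linear-Gaussian model the marginal is x ~ N(0, 1 + 0.25),
# which lets us check the Monte Carlo estimate.
p_x_exact = norm.pdf(x, loc=0.0, scale=np.sqrt(1.0 + 0.25))
print(p_x_mc, p_x_exact)
```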
7.3 Gaussian Mixture Models
Graphical representation of a mixture model
A binary random variable z having a 1-of-K representation
Gaussian mixture distribution can be written as a linear superposition of Gaussians
\[ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \qquad (*) \]
An equivalent formulation of the Gaussian mixture involving an explicit latent variable
\[ p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k} \]

\[ p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

\[ p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k} \]

\[ p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z})
              = \sum_{\mathbf{z}} p(\mathbf{z}) \, p(\mathbf{x} \mid \mathbf{z})
              = \sum_{\mathbf{z}} \prod_{k=1}^{K} \big[ \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \big]^{z_k}
              = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

\[ p(z_k = 1) = \pi_k, \qquad 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1 \]
The marginal distribution of x is a Gaussian mixture of the form (*)
For every observed data point x_n, there is a corresponding latent variable z_n.

\[ p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z}) \]
\[ \gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x})
   = \frac{p(z_k = 1)\, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\mathbf{x} \mid z_j = 1)}
   = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \]
γ(z_k) can also be viewed as the responsibility that component k takes for explaining the observation x.
Generating random samples distributed according to the Gaussian mixture model
♦ Generate a value for z, denoted ẑ, from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | ẑ)
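A short sketch of this ancestral sampling procedure with hypothetical mixture parameters (the π, μ, Σ values below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D mixture with K = 3 components.
pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])])

def sample_gmm(n):
    # Step 1: sample z_n (a component index) from p(z) = Categorical(pi).
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: sample x_n from the conditional p(x | z_n) = N(mu_z, Sigma_z).
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

X, Z = sample_gmm(500)
print(X.shape, np.bincount(Z))
```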
a. The three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue
b. The corresponding samples from the marginal distribution p(x)
c. The same samples in which the colors represent the value of the responsibilities γ(z_nk) associated with data point x_n, illustrating the responsibilities by evaluating the posterior probability for each component in the mixture distribution from which this data set was generated
Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {x_n}, with corresponding latent points {z_n}

Data set: X (N × D matrix) with n-th row x_n^T
Latent variables: Z (N × K matrix) with n-th row z_n^T
The log of the likelihood function
\[ \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
   = \sum_{n=1}^{N} \ln \Bigg\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \Bigg\} \]
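A possible way (not given in the notes) to evaluate this log likelihood numerically; working in the log domain with logsumexp guards against underflow when the component densities are small.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pi)
    # log_prob[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_prob = np.empty((N, K))
    for k in range(K):
        log_prob[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
    return logsumexp(log_prob, axis=1).sum()
```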
7.4 Learning Gaussian Mixtures by EM
The Gaussian mixture models can be learned by the expectation-maximization (EM) algorithm.
Repeat
♦ Expectation step: calculate posterior probabilities or responsibilities using the current parameters
♦ Maximization step: re-estimate the parameters based on the responsibilities
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters.
1. Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k
2. E-step: evaluate the posterior probabilities or responsibilities using the current value for the parameters
\[ \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \]
3. M-step: re-estimate the means, covariances, and mixing coefficients using the result of E-step.
\[ \boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n \]

\[ \boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^{\mathrm{T}} \]

\[ \pi_k^{\text{new}} = \frac{N_k}{N}, \qquad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \]
4. Evaluate the log likelihood
\[ \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
   = \sum_{n=1}^{N} \ln \Bigg( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \Bigg) \]
If converged, terminate; otherwise, go to Step 2.
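A non-authoritative NumPy sketch of steps 1-4 above; the random-data-point initialization and the small diagonal term added to the covariances are illustrative choices, not part of the notes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_densities(X, pis, mus, Sigmas):
    """dens[n, k] = pi_k N(x_n | mu_k, Sigma_k)."""
    return np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                            for k in range(len(pis))])

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # 1. Initialize means (K random data points), covariances, mixing coefficients.
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. E-step: responsibilities gamma(z_nk) under the current parameters.
        dens = mixture_densities(X, pis, mus, Sigmas)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # 3. M-step: re-estimate means, covariances, and mixing coefficients.
        Nk = gamma.sum(axis=0)                       # N_k = sum_n gamma(z_nk)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N

        # 4. Evaluate the log likelihood with the updated parameters; stop if converged.
        ll = np.log(mixture_densities(X, pis, mus, Sigmas).sum(axis=1)).sum()
        if np.abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return pis, mus, Sigmas, gamma
```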
The General EM Algorithm
In maximizing the log likelihood function \( \ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \big\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \big\} \), the summation prevents the logarithm from acting directly on the joint distribution.
♦ Instead, the log likelihood function for the complete data set {X, Z} is straightforward.
♦ In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z | X, θ) of the latent variables.
♦ General EM Algorithm
1. Choose an initial setting for the parameters θ^old
2. E step: Evaluate p(Z | X, θ^old)
3. M step: Evaluate θ^new given by
\[ \boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) \]

\[ Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \]
4. If the convergence criterion is not satisfied, then let θ^old ← θ^new and return to Step 2.
♦ The EM algorithm can also be used for finding MAP (maximum a posteriori) estimates, using the modified M-step that maximizes

\[ Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) + \ln p(\boldsymbol{\theta}) \]
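A schematic of the general EM loop (a sketch, not the notes' formulation); `e_step` and `m_step` are hypothetical callables supplied by a concrete model, the latter maximizing Q(θ, θ^old), optionally plus ln p(θ) for the MAP variant.

```python
def em(theta_init, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM skeleton (hypothetical interface).

    e_step(theta)     -> posterior statistics of p(Z | X, theta)
    m_step(posterior) -> (theta_new, log_likelihood), where theta_new maximizes
                         Q(theta, theta_old) (plus ln p(theta) for MAP).
    """
    theta, prev_ll = theta_init, float("-inf")
    for _ in range(max_iter):
        posterior = e_step(theta)         # E step: evaluate p(Z | X, theta_old)
        theta, ll = m_step(posterior)     # M step: theta_new = argmax Q(theta, theta_old)
        if abs(ll - prev_ll) < tol:       # convergence check on the log likelihood
            break
        prev_ll = ll
    return theta
```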