CS242: Probabilistic Graphical Models
Lecture 7: Exponential Families, Directed Graphs
Instructor: Erik Sudderth, Brown University Computer Science
September 25, 2014
Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/
Graphical Models as Exponential Family Distributions
[Figure: three graphical models over variables x1, ..., x5: a directed graph, a factor graph, and an undirected graph]
Exponential Families of Distributions
p(x | θ) = (1/Z(θ)) ν(x) exp{θᵀφ(x)} = ν(x) exp{θᵀφ(x) − Φ(θ)},  Φ(θ) = log Z(θ)

φ(x) ∈ R^d: fixed vector of sufficient statistics (features), specifying the family of distributions
θ ∈ Θ ⊆ R^d: unknown vector of natural parameters, determining the particular distribution in this family
Z(θ) = ∫_X ν(x) exp{θᵀφ(x)} dx > 0: normalization constant or partition function (for discrete variables, the integral becomes a sum)
ν(x) > 0: reference measure independent of parameters (for many models, we simply have ν(x) = 1)
Θ = {θ ∈ R^d | Z(θ) < ∞} ensures we construct a valid distribution
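The definitions above can be made concrete with a small numerical sketch (not from the lecture): a discrete exponential family over X = {0, 1, 2}, with hypothetical features φ(x) = [x, x²] and ν(x) = 1, normalized by summing over the sample space.

```python
import numpy as np

X = np.array([0, 1, 2])
phi = np.stack([X, X**2], axis=1)           # sufficient statistics, shape (3, 2)

def log_partition(theta):
    """Phi(theta) = log Z(theta) = log sum_x exp(theta^T phi(x))."""
    scores = phi @ theta
    m = scores.max()                         # log-sum-exp trick for stability
    return m + np.log(np.exp(scores - m).sum())

def pmf(theta):
    """p(x | theta) = exp(theta^T phi(x) - Phi(theta))."""
    return np.exp(phi @ theta - log_partition(theta))

theta = np.array([0.5, -0.2])
p = pmf(theta)
```

Because the log-partition term is subtracted inside the exponential, the resulting pmf is normalized by construction for any θ with finite Φ(θ).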
Factor Graphs & Exponential Families
V = {1, 2, ..., N}: set of N nodes or vertices
F: set of hyperedges f ⊆ V linking subsets of nodes
p(x) = (1/Z(θ)) ∏_{f∈F} ψf(xf | θf)
Z(θ): normalization constant (partition function)
• A factor graph is created from non-negative potential functions, often defined in the local exponential form below:
50 CHAPTER 2. NONPARAMETRIC AND GRAPHICAL MODELS
Figure 2.4. Three graphical representations of a distribution over five random variables (see [175]). (a) Directed graph G depicting a causal, generative process. (b) Factor graph expressing the factorization underlying G. (c) A "moralized" undirected graph capturing the Markov structure of G.
For example, in the factor graph of Fig. 2.5(c), there are 5 variable nodes, and the joint distribution has one potential for each of the 3 hyperedges:
p(x) ∝ ψ123(x1, x2, x3)ψ234(x2, x3, x4)ψ35(x3, x5)
Often, these potentials can be interpreted as local dependencies or constraints. Note, however, that ψf(xf) does not typically correspond to the marginal distribution pf(xf), due to interactions with the graph's other potentials.
In many applications, factor graphs are used to impose structure on an exponential family of densities. In particular, suppose that each potential function is described by the following unnormalized exponential form:

ψf(xf | θf) = νf(xf) exp{ ∑_{a∈Af} θfa φfa(xf) }   (2.67)
Here, θf ≜ {θfa | a ∈ Af} are the canonical parameters of the local exponential family for hyperedge f. From eq. (2.66), the joint distribution can then be written as

p(x | θ) = ( ∏_{f∈F} νf(xf) ) exp{ ∑_{f∈F} ∑_{a∈Af} θfa φfa(xf) − Φ(θ) }   (2.68)
Comparing to eq. (2.1), we see that factor graphs define regular exponential families [104, 311], with parameters θ = {θf | f ∈ F}, whenever local potentials are chosen from such families. The results of Sec. 2.1 then show that local statistics, computed over the support of each hyperedge, are sufficient for learning from training data.
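The factorization p(x) ∝ ψ123 ψ234 ψ35 can be checked by brute-force enumeration over binary variables; the potential values below are arbitrary positive placeholders, not from the text.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
psi123 = rng.uniform(0.5, 2.0, size=(2, 2, 2))   # psi_123(x1, x2, x3)
psi234 = rng.uniform(0.5, 2.0, size=(2, 2, 2))   # psi_234(x2, x3, x4)
psi35 = rng.uniform(0.5, 2.0, size=(2, 2))       # psi_35(x3, x5)

def unnorm(x):
    """Product of the three local potentials at joint state x."""
    x1, x2, x3, x4, x5 = x
    return psi123[x1, x2, x3] * psi234[x2, x3, x4] * psi35[x3, x5]

states = list(itertools.product([0, 1], repeat=5))
Z = sum(unnorm(x) for x in states)               # partition function by enumeration
p = {x: unnorm(x) / Z for x in states}
```

Enumerating the 2⁵ joint states also makes the warning above concrete: the marginal over (x3, x5) computed from p generally differs from the normalized potential ψ35, because the other factors share variable x3.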
Local exponential family:
ψf(xf | θf) = νf(xf) exp{ ∑_{a∈Af} θfa φfa(xf) },  Φ(θ) = log Z(θ)
Undirected Graphs & Exponential Families
Φ(θ) = log Z(θ)
• Pick features to define an exponential family of distributions
• Use factor graph to represent structure of chosen statistics
• Create undirected graph with a clique for every factor node
• Result: Visualization of Markov properties of your family
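The clique-construction step above can be sketched by brute force: connect every pair of variables that share a factor. The factor list mirrors the running five-variable example; the loop itself is illustrative.

```python
from itertools import combinations

# Hyperedges of the running example: psi_123, psi_234, psi_35.
factors = [{1, 2, 3}, {2, 3, 4}, {3, 5}]

edges = set()
for f in factors:
    # Every factor becomes a clique: connect all pairs within it.
    for u, v in combinations(sorted(f), 2):
        edges.add((u, v))

moral_edges = sorted(edges)
```

The resulting edge set is exactly the undirected ("moralized") graph of Fig. 2.4(c): each factor's variables form a clique, so the factor structure is no longer recoverable from the graph alone.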
Directed Graphs & Exponential Families
p(x) = ∏_{i=1}^{N} p(xi | xΓ(i), θi)
• For each node, pick an appropriate exponential family
• Pick features of parent nodes relevant to child variable. Most generally, indicators for all joint configurations of parents.
• Child parameters are a (learned) linear function of parent features
• Result: Node-specific conditional exponential family models
p(xi | xΓ(i), θi) = ν(xi) exp{ θiᵀ φ(xi, xΓ(i)) − Φ(θi, xΓ(i)) }
Φ(θi, xΓ(i)) = log ∫_{Xi} ν(xi) exp{ θiᵀ φ(xi, xΓ(i)) } dxi
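As a minimal sketch of such a node-specific conditional family: a binary child whose Bernoulli natural parameter is a linear function of its parents' features. The weight vector w and the parent feature vector are illustrative, not from the lecture.

```python
import numpy as np

def child_log_prob(x_child, parent_feats, w):
    """log p(x_i | x_parents, theta_i) for a Bernoulli child in natural form.

    The natural parameter is linear in the parent features; the local
    log-normalizer is Phi(theta) = log(1 + e^theta).
    """
    theta = w @ parent_feats                 # theta_i as linear function of parents
    log_norm = np.log1p(np.exp(theta))       # Phi(theta, x_parents)
    return theta * x_child - log_norm

w = np.array([1.0, -2.0])                    # hypothetical learned weights
feats = np.array([1.0, 0.5])                 # hypothetical parent features
lp0 = child_log_prob(0, feats, w)
lp1 = child_log_prob(1, feats, w)
```

For each fixed parent configuration this is a valid Bernoulli distribution, so the two conditional probabilities sum to one.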
Learning Directed Graphical Models
[Figure: (a) directed graphical model over X1, X2, X3 replicated for training examples n = 1, 2, ..., N; (b) equivalent plate notation]
Terminology: Inference versus Learning
• Inference: Given a model with known parameters, estimate the "hidden" variables for some data instance, or find their marginal distribution
• Learning: Given multiple data instances, estimate parameters for a graphical model of their structure: Maximum Likelihood, Maximum a Posteriori, ... Training instances may be completely or partially observed
Example: Expert systems for medical diagnosis
• Inference: Given observed symptoms for a particular patient, infer probabilities that they have contracted various diseases
• Learning: Given a database of many patient diagnoses, learn the relationships between diseases and symptoms
Example: Markov random fields for semantic image segmentation
• Inference: What object category is depicted at each pixel in some image?
• Learning: Across a database of images, in which some example objects have been labeled, how do those labels relate to low-level image features?
Learning Directed Graphical Models
[Figure: (a) a directed graph over x1, ..., x4; (b) its decomposition into per-node conditional models]
Intuition: Must learn a good predictive model of each node, given its parent nodes
p(x) = ∏_{i∈V} p(xi | xΓ(i), θi)
• Directed factorization allows likelihood to locally decompose:
p(x | θ) = p(x1 | θ1) p(x2 | x1, θ2) p(x3 | x1, θ3) p(x4 | x2, x3, θ4)
log p(x | θ) = log p(x1 | θ1) + log p(x2 | x1, θ2) + log p(x3 | x1, θ3) + log p(x4 | x2, x3, θ4)
• We often assume a similarly factorized (meta-independent) prior:
p(θ) = p(θ1) p(θ2) p(θ3) p(θ4)
log p(θ) = log p(θ1) + log p(θ2) + log p(θ3) + log p(θ4)
• We thus have independent ML or Bayesian learning problems at each node
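Because the log-likelihood decomposes node by node, ML learning for a discrete network reduces to independent conditional frequency counts at each node. A minimal sketch for one node of the four-variable example (the data table is made up; x4's parents are x2 and x3):

```python
from collections import Counter

# Rows are complete observations (x1, x2, x3, x4); values are illustrative.
data = [(0, 1, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1), (0, 1, 0, 1)]

# Sufficient statistics for node 4: joint counts of (parents, child).
counts = Counter(((x2, x3), x4) for (_, x2, x3, x4) in data)
parent_totals = Counter((x2, x3) for (_, x2, x3, _) in data)

def mle_p_x4(x4, parents):
    """ML estimate of p(x4 | x2, x3): conditional relative frequency."""
    return counts[(parents, x4)] / parent_totals[parents]
```

The same counting is carried out independently at every other node; no joint optimization over θ is needed when observations are complete.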
Learning with Complete Observations
[Figure: (a) directed model for training examples xi,n over nodes i = 1, 2, 3; (b) equivalent plate notation over n = 1, ..., N]
Plate Notation
[Figure: (a) directed graphical model for N independent, identically distributed training examples; (b) equivalent plate notation]
• Directed graph encodes statistical structure of single training examples: D = {xV,1, ..., xV,N}
• Given N independent, identically distributed, completely observed samples:
p(D | θ) = ∏_{n=1}^{N} ∏_{i∈V} p(xi,n | xΓ(i),n, θi)
log p(D | θ) = ∑_{n=1}^{N} ∑_{i∈V} log p(xi,n | xΓ(i),n, θi) = ∑_{i∈V} [ ∑_{n=1}^{N} log p(xi,n | xΓ(i),n, θi) ]
Prior Distributions and Tied Parameters
log p(D | θ) = ∑_{n=1}^{N} ∑_{i∈V} log p(xi,n | xΓ(i),n, θi) = ∑_{i∈V} [ ∑_{n=1}^{N} log p(xi,n | xΓ(i),n, θi) ]
A “meta-independent” factorized prior
log p(θ) = ∑_{i∈V} log p(θi)
• Factorized posterior allows independent learning for each node:
log p(θ | D) = C + ∑_{i∈V} [ log p(θi) + ∑_{n=1}^{N} log p(xi,n | xΓ(i),n, θi) ]
• Learning easy when groups of nodes are tied to use identical, shared parameters (bi: type or class for variable node i):
log p(D | θ) = ∑_{i∈V} [ ∑_{n=1}^{N} log p(xi,n | xΓ(i),n, θbi) ]
log p(θb | D) = C + log p(θb) + ∑_{i | bi = b} ∑_{n=1}^{N} log p(xi,n | xΓ(i),n, θb)
Example: Tied Temporal Parameters
[Figure: three hidden Markov chains of different lengths (Sequence 1, 2, 3), with hidden states z0, z1, z2, ... and observations x1, x2, ...]
p(x, z | θ) = ∏_{n=1}^{N} ∏_{t=1}^{Tn} p(zt,n | zt−1,n, θtime) p(xt,n | zt,n, θobs)
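A minimal sketch of ML estimation for the tied transition parameters θtime: because every time step in every sequence shares the same parameters, the estimate pools transition counts across all n and t. The two-state sequences below are made up, and observations are omitted.

```python
import numpy as np

# Hidden state sequences of different lengths (illustrative values).
sequences = [[0, 0, 1, 1, 0], [1, 0, 0], [0, 1, 1, 1]]
K = 2                                        # number of hidden states

counts = np.zeros((K, K))
for z in sequences:
    for t in range(1, len(z)):               # every (z_{t-1}, z_t) pair, all n, t
        counts[z[t - 1], z[t]] += 1

# ML transition matrix: normalize pooled counts row by row.
trans = counts / counts.sum(axis=1, keepdims=True)
```

With untied parameters there would be one (poorly estimated) transition matrix per time step; tying turns every observed transition into evidence about the single shared θtime.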
Exponential Families of Probability Distributions
Why the Exponential Family?
• Allows simultaneous study of many popular probability distributions: Bernoulli (binary), Categorical, Poisson (counts), Exponential (positive), Gaussian (real), ...
• Maximum likelihood (ML) learning is simple: moment matching of sufficient statistics
• Bayesian learning is simple: conjugate priors are available: Beta, Dirichlet, Gamma, Gaussian, Wishart, ...
• The maximum entropy interpretation: Among all distributions with certain moments of interest, the exponential family is the most random (makes fewest assumptions)
• Parametric and predictive sufficiency: For arbitrarily large datasets, optimal learning is possible from a finite-dimensional set of statistics (streaming, big data)
φ(x) ∈ R^d: vector of sufficient statistics (features) defining the family
θ ∈ Θ ⊆ R^d: vector of natural parameters indexing particular distributions
p(x | θ) = ν(x) exp{θᵀφ(x) − Φ(θ)},  Φ(θ) = log ∫_X ν(x) exp{θᵀφ(x)} dx
Θ = {θ ∈ R^d | Φ(θ) < ∞}
Example: Bernoulli Distribution
Bernoulli Distribution: Single toss of a (possibly biased) coin
Ber(x | μ) = μ^x (1 − μ)^{1−x},  x ∈ {0, 1},  0 ≤ μ ≤ 1,  E[x | μ] = P[x = 1] = μ
Exponential Family Form: Derivation on board
Ber(x | θ) = exp{θx − Φ(θ)},  θ ∈ Θ = R
Logistic Function: μ = (1 + exp(−θ))^{−1},  θ = log(μ) − log(1 − μ)
[Plot: logistic function mapping θ ∈ R to μ ∈ (0, 1)]
Log Normalization: Φ(θ) = log(1 + e^θ)
[Plot: the log-normalizer Φ(θ) is convex]
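The two formulas above are linked: the logistic mean μ(θ) is exactly the derivative of the log-normalizer Φ(θ) = log(1 + e^θ). A quick numerical check by central finite differences:

```python
import numpy as np

def mu(theta):
    """Logistic function: the Bernoulli mean as a function of theta."""
    return 1.0 / (1.0 + np.exp(-theta))

def phi(theta):
    """Bernoulli log-normalizer Phi(theta) = log(1 + e^theta)."""
    return np.log1p(np.exp(theta))

theta, eps = 0.7, 1e-6
finite_diff = (phi(theta + eps) - phi(theta - eps)) / (2 * eps)
```

The finite-difference slope of Φ matches μ(θ) to numerical precision, a one-dimensional instance of the general identity ∇Φ(θ) = E_θ[φ(x)] derived on the next slides.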
Log Partition Function (Normalizer)
p(x | θ) = ν(x) exp{θᵀφ(x) − Φ(θ)},  Φ(θ) = log ∫_X ν(x) exp{θᵀφ(x)} dx
Sec. 2.1. Exponential Families 31
Proposition 2.1.1. The log partition function Φ(θ) of eq. (2.2) is convex (strictly so for minimal representations) and continuously differentiable over its domain Θ. Its derivatives are the cumulants of the sufficient statistics {φa | a ∈ A}, so that

∂Φ(θ)/∂θa = Eθ[φa(x)] ≜ ∫_X φa(x) p(x | θ) dx   (2.4)
∂²Φ(θ)/∂θa∂θb = Eθ[φa(x)φb(x)] − Eθ[φa(x)] Eθ[φb(x)]   (2.5)

Proof. For a detailed proof of this classic result, see [15, 36, 311]. The cumulant generating properties follow from the chain rule and algebraic manipulation. From eq. (2.5), ∇²Φ(θ) is a positive semi-definite covariance matrix, implying convexity of Φ(θ). For minimal families, ∇²Φ(θ) must be positive definite, guaranteeing strict convexity.

Due to this result, the log partition function is also known as the cumulant generating function of the exponential family. The convexity of Φ(θ) has important implications for the geometry of exponential families [6, 15, 36, 74].
Entropy, Information, and Divergence
Concepts from information theory play a central role in the study of learning and inference in exponential families. Given a probability distribution p(x) defined on a discrete space X, Shannon's measure of entropy (in natural units, or nats) equals

H(p) = −∑_{x∈X} p(x) log p(x)   (2.6)

In such diverse fields as communications, signal processing, and statistical physics, entropy arises as a natural measure of the inherent uncertainty in a random variable [49]. The differential entropy extends this definition to continuous spaces:

H(p) = −∫_X p(x) log p(x) dx   (2.7)

In both discrete and continuous domains, the (differential) entropy H(p) is concave, continuous, and maximal for uniform densities. However, while the discrete entropy is guaranteed to be non-negative, differential entropy is sometimes less than zero.
For problems of model selection and approximation, we need a measure of the distance between probability distributions. The relative entropy or Kullback-Leibler (KL) divergence between two probability distributions p(x) and q(x) equals

D(p || q) = ∫_X p(x) log [p(x)/q(x)] dx   (2.8)

Important properties of the KL divergence follow from Jensen's inequality [49], which bounds the expectation of convex functions:

E[f(x)] ≥ f(E[x]) for any convex f : X → R   (2.9)
• Proof: Chain rule of derivatives and algebraic manipulation (board)
• Derivatives of the log partition function have an intuitive form:
Θ = {θ ∈ R^d | Φ(θ) < ∞}
Reminder: Gradients & Hessians
Gradient Vector: for f : R^d → R, ∇f : R^d → R^d, with (∇f(θ))k = ∂f(θ)/∂θk
Hessian Matrix: ∇²f : R^d → R^{d×d}, with (∇²f(θ))kℓ = ∂²f(θ)/∂θk∂θℓ and (∇²f(θ))kk = ∂²f(θ)/∂θk²
• The Hessian is always symmetric
• For linear functions, the Hessian is zero
• For quadratic functions, the Hessian is constant over the input space
• The function f is convex if and only if the Hessian is positive semidefinite
Reminder: Convex Functions
[Plot: a convex function f(θ), with the chord between θ and θ′ lying above the curve]
f(λθ + (1 − λ)θ′) ≤ λf(θ) + (1 − λ)f(θ′)
• From any initialization, gradient descent (eventually) finds the global minimum
• Newton's method, which uses the Hessian, often converges faster
• A sum of convex functions remains convex
• The negative of a convex function is said to be concave
• A function is convex if and only if its epigraph (the set of points lying above the curve) is a convex set
Log Partition Function (Normalizer)
p(x | θ) = ν(x) exp{θᵀφ(x) − Φ(θ)},  Φ(θ) = log ∫_X ν(x) exp{θᵀφ(x)} dx
Θ = {θ ∈ R^d | Φ(θ) < ∞}
• Derivatives of the log partition function have an intuitive form:
∇θΦ(θ) = Eθ[φ(x)]
∇²θΦ(θ) = Covθ[φ(x)] = Eθ[φ(x)φ(x)ᵀ] − Eθ[φ(x)] Eθ[φ(x)]ᵀ
• Important consequences for learning with exponential families:
Finding gradients is equivalent to finding expected sufficient statistics, or moments, of some current model. This is an inference problem!
The Hessian is positive semidefinite, so the log partition function is convex
This in turn implies that the parameter space Θ is convex
Learning is a convex problem: no local optima! At least when we have complete observations...
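The identity ∇θΦ(θ) = Eθ[φ(x)] can be checked numerically on a toy discrete family over {0, 1, 2} with the single feature φ(x) = x (a made-up example, not from the slides):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0])                # sample space

def log_Z(theta):
    """Log partition function for feature phi(x) = x, nu(x) = 1."""
    return np.log(np.exp(theta * X).sum())

def expected_stat(theta):
    """E_theta[phi(x)] under the exponential family pmf."""
    p = np.exp(theta * X - log_Z(theta))
    return (p * X).sum()

theta, eps = 0.3, 1e-6
grad = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)
```

The finite-difference gradient of log Z agrees with the model's expected sufficient statistic, which is exactly why computing likelihood gradients requires solving an inference (moment-computation) problem.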
A Little Information Theory
• The entropy is a natural measure of the inherent uncertainty (difficulty of compression) of some random variable:
discrete entropy (concave, non-negative)
differential entropy (concave, real-valued)
• The relative entropy or Kullback-Leibler (KL) divergence is a non-negative, but asymmetric, "distance" between a given pair of probability distributions. The KL divergence equals zero if and only if p(x) = q(x) almost everywhere.
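Both properties are easy to see in the discrete case, where D(p || q) = ∑_x p(x) log(p(x)/q(x)); the two distributions below are made-up examples.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D(p || q); assumes p, q strictly positive."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
```

Evaluating kl(p, q) and kl(q, p) gives two different positive numbers, while kl(p, p) is exactly zero: non-negative, zero only at equality, and asymmetric.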
32 CHAPTER 2. NONPARAMETRIC AND GRAPHICAL MODELS
Applying Jensen's inequality to the logarithm of eq. (2.8), which is concave, it is easily shown that the KL divergence D(p || q) ≥ 0, with D(p || q) = 0 if and only if p(x) = q(x) almost everywhere. However, it is not a true distance metric because D(p || q) ≠ D(q || p). Given a target density p(x) and an approximation q(x), D(p || q) can be motivated as the information gain achievable by using p(x) in place of q(x) [49]. Interestingly, the alternate KL divergence D(q || p) also plays an important role in the development of variational methods for approximate inference (see Sec. 2.3).
An important special case arises when we consider the dependency between two random variables x and y. Let pxy(x, y) denote their joint distribution, px(x) and py(y) their corresponding marginals, and X and Y their sample spaces. The mutual information between x and y then equals

I(pxy) ≜ D(pxy || px py) = ∫_X ∫_Y pxy(x, y) log [ pxy(x, y) / (px(x) py(y)) ] dy dx   (2.10)
= H(px) + H(py) − H(pxy)   (2.11)

where eq. (2.11) follows from algebraic manipulation. The mutual information can be interpreted as the expected reduction in uncertainty about one random variable from observation of another [49].
Projections onto Exponential Families
In many cases, learning problems can be posed as a search for the best approximation of an empirically derived target density p̃(x). As discussed in the previous section, the KL divergence D(p̃ || q) is a natural measure of the accuracy of an approximation q(x). For exponential families, the optimal approximating density is elegantly characterized by the following moment-matching conditions:

Proposition 2.1.2. Let p̃ denote a target probability density, and pθ an exponential family. The approximating density minimizing D(p̃ || pθ) then has canonical parameters θ̂ chosen to match the expected values of that family's sufficient statistics:

Eθ̂[φa(x)] = ∫_X φa(x) p̃(x) dx,  a ∈ A   (2.12)

For minimal families, these optimal parameters θ̂ are uniquely determined.
Proof. From the definition of KL divergence (eq. (2.8)), we have

D(p̃ || pθ) = ∫_X p̃(x) log [ p̃(x) / p(x | θ) ] dx
= ∫_X p̃(x) log p̃(x) dx − ∫_X p̃(x) [ log ν(x) + ∑_{a∈A} θa φa(x) − Φ(θ) ] dx
= −H(p̃) − ∫_X p̃(x) log ν(x) dx − ∑_{a∈A} θa ∫_X φa(x) p̃(x) dx + Φ(θ)
• The mutual information measures dependence between a pair of random variables:
Zero if and only if variables are independent.
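The two forms of eqs. (2.10)-(2.11) can be checked on a toy joint pmf over binary x and y (the joint table below is made up): compute I once as D(pxy || px py) and once from the three entropies.

```python
import numpy as np

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])                 # made-up joint over binary x, y
px = pxy.sum(axis=1)                         # marginal of x
py = pxy.sum(axis=0)                         # marginal of y

def H(p):
    """Discrete entropy in nats; assumes strictly positive entries."""
    p = p.ravel()
    return float(-(p * np.log(p)).sum())

I_kl = float((pxy * np.log(pxy / np.outer(px, py))).sum())   # eq. (2.10)
I_ent = H(px) + H(py) - H(pxy)                               # eq. (2.11)
```

The two values agree, and both are strictly positive here because this joint does not factorize into its marginals.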