Variational Inference for the Indian Buffet Process
Finale Doshi-Velez† (Cambridge University), Kurt T. Miller† (UC Berkeley), Jurgen Van Gael† (Cambridge University), Yee Whye Teh (Gatsby Unit)
† Authors contributed equally
Introduction
Motivating example
We are interested in extracting unobserved features from observed data. For example:

• Latent classes ⇒ Mixture models
• Latent features ⇒ Latent feature models
Introduction
Linear Gaussian Latent Feature Model

We will focus on one example of a latent feature model:

[Figure: X = ZA + noise, where X (N × D) is observed and Z (N × K, binary) and A (K × D) are unobserved; row i of X is the observation for object i, and row i of Z gives the features for object i.]
• N = Number of data points
• D = Dimension of observed data
• K = Number of latent features
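As a concrete illustration (not from the deck), here is a minimal sketch that generates data from this model; the parameter values and the Bernoulli(0.5) draw for Z are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the linear Gaussian latent feature model X = ZA + noise.
# N, D, K, sigma_A, sigma_n follow the deck's notation; the specific values
# and the Bernoulli(0.5) choice for Z are illustrative assumptions.
rng = np.random.default_rng(0)
N, D, K = 100, 500, 25
sigma_A, sigma_n = 1.0, 0.5

Z = (rng.random((N, K)) < 0.5).astype(float)       # N x K binary features
A = sigma_A * rng.standard_normal((K, D))          # K x D feature values
X = Z @ A + sigma_n * rng.standard_normal((N, D))  # observed N x D data
```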
Introduction
Linear Gaussian Latent Feature Model
Goal: Infer Z and A given data X.
Approach: Bayes’ rule:
p(Z, A | X) ∝ p(X | Z, A) p(A) × p(Z)

where p(X | Z, A) p(A) is model specific and p(Z) is the prior on binary matrices.
In the linear Gaussian model, we use
• p(X | Z, A) ∼ N(ZA, σ_n² I)
• p(A) ∼ N(0, σ_A² I)
• p(Z) ∼ ?

[Graphical model: σ_A → A and ? → Z; A and Z generate X, with noise σ_n.]
The Indian Buffet Process
The Indian Buffet Process - Stick-breaking construction
• First generate v_1, v_2, … i.i.d. ∼ Beta(α, 1).
• Let π_k = ∏_{j=1}^{k} v_j.
• Sample z_nk ∼ Bernoulli(π_k).

[Figure: sticks v_1, …, v_9 and the resulting decreasing weights π_1, …, π_9, together with a sampled binary matrix Z.]

(Teh et al., 2007)
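A short sketch of this construction, truncated at K columns so it terminates (the truncation is an implementation assumption; the construction itself is infinite):

```python
import numpy as np

def ibp_stick_breaking(N, K, alpha, rng=None):
    """Sample an N x K binary matrix via the IBP stick-breaking
    construction of Teh et al. (2007), truncated at K columns."""
    rng = rng or np.random.default_rng()
    v = rng.beta(alpha, 1.0, size=K)    # v_k ~ Beta(alpha, 1), i.i.d.
    pi = np.cumprod(v)                  # pi_k = v_1 * ... * v_k (decreasing)
    return (rng.random((N, K)) < pi).astype(int)  # z_nk ~ Bernoulli(pi_k)

Z = ibp_stick_breaking(N=10, K=20, alpha=2.0)
```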
The Indian Buffet Process
Full Linear Gaussian Latent Feature Model
Model:
• p(X | Z, A) ∼ N(ZA, σ_n² I)
• p(A) ∼ N(0, σ_A² I)
• p(Z) ∼ IBP(α)

[Graphical model: α → v → Z and σ_A → A; A and Z generate X, with noise σ_n.]
Given X, how do we do inference on Z and A?
• Even for finite K, there are 2^{NK} possible Z.
• Many local optima.
The Indian Buffet Process

Inference in the Linear Gaussian Model

[Figure: predictive log likelihood vs. sampler run time (minutes) for collapsed Gibbs, uncollapsed Gibbs, and variational inference. Left panel: N = 100, D = 500, K = 25. Right panel: N = 500, D = 500, K = 25.]
Variational Inference for the IBP
Mean Field Variational Inference
Approximate p(Z, A | X) with a distribution q(Z, A) from a family Q that is “close” to p(Z, A | X).

How do we define “close”? We will attempt to find

q(Z, A) = arg min_{q ∈ Q} D(q(Z, A) || p(Z, A | X)).
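Why is this minimization tractable when p(Z, A | X) itself is unknown? A standard identity (not spelled out on the slide) rewrites the KL divergence as

D(q(Z, A) || p(Z, A | X)) = log p(X) − E_q[log p(X, Z, A) − log q(Z, A)],

where the expectation term is the evidence lower bound (ELBO). Since log p(X) does not depend on q, minimizing the KL divergence is equivalent to maximizing the ELBO, which involves only the joint p(X, Z, A).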
Variational Inference for the IBP
How do we choose Q?
p(Z, A | X) is a distribution over infinitely many features.

Trick (Blei and Jordan, 2004): Let Q be a truncated family where we assume that Z is nonzero in at most the first K columns.

Why can we do this? Intuitively, the probability π_k that z_nk is one decreases exponentially quickly.
Variational Inference for the IBP
Truncation bound

More formally, let m_K(X) be the marginal of X, with Z and A integrated out, when we truncate the stick-breaking construction at column K.

Then we can show

(1/4) ∫ |m_K(X) − m_∞(X)| dX ≤ 1 − exp(−Nα (α/(α + 1))^K).
[Figure: bound on L1 distance vs. K, comparing the truncation bound to the true distance.]
This is the first such bound for the IBP and can serve as a guideline for how to choose K for the family Q.
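As a usage sketch, the bound gives a direct recipe for choosing the truncation level; the tolerance below is an illustrative assumption:

```python
import numpy as np

def truncation_bound(N, alpha, K):
    """The slide's bound on (1/4) * integral |m_K(X) - m_inf(X)| dX."""
    return 1.0 - np.exp(-N * alpha * (alpha / (alpha + 1.0)) ** K)

def smallest_K(N, alpha, tol=0.01, K_max=1000):
    """Smallest truncation level K whose bound falls below tol."""
    for K in range(1, K_max + 1):
        if truncation_bound(N, alpha, K) <= tol:
            return K
    raise ValueError("no K <= K_max meets the tolerance")

print(smallest_K(N=100, alpha=5.0))   # e.g. pick K for N = 100, alpha = 5
```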
Variational Inference for the IBP
How do we choose Q?
We let our family Q be the parameterized family (introducing the stick-breaking variables v)

q(Z, A, v) = q_ν(Z) q_φ(A) q_τ(v)
[Figure: the true distribution is the graphical model with α → v → Z and σ_A → A, where A and Z generate X; the variational distribution is fully factorized, with Z, A, and v independent under parameters ν, φ, and τ.]
• q_ν(z_nk) = Bernoulli(z_nk; ν_nk)
• q_φ(A_k·) = N(A_k·; φ̄_k, Φ_k)
• q_τ(v_k) = Beta(v_k; τ_k1, τ_k2)
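To make the family concrete, here is a sketch of the parameter shapes it implies; the diagonal form for Φ_k and the initialization values are assumptions for illustration, not the deck's scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, alpha = 100, 500, 25, 5.0

tau = np.ones((K, 2))        # (tau_k1, tau_k2) for q(v_k) = Beta
tau[:, 0] = alpha            # start at the Beta(alpha, 1) prior
phi_mean = 0.01 * rng.standard_normal((K, D))  # phi_k, mean of q(A_k.)
phi_cov = np.ones((K, D))    # diagonal of Phi_k (diagonal is an assumption)
nu = rng.uniform(0.3, 0.7, size=(N, K))        # nu_nk for q(z_nk) = Bernoulli
```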
Variational Inference for the IBP
Inference
Inference now reduces to finding variational parameters (τ, φ, Φ, ν) such that q ∈ Q is “close” to p:

(τ, φ, Φ, ν) = arg min_{q ∈ Q} D(q(Z, A) || p(Z, A | X)).

This is not a convex optimization, so we can only hope to find a local optimum.

⇒ Parameter updates are done iteratively.
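Concretely, the iteration is a coordinate ascent: each factor is updated in turn with the others held fixed, until the objective plateaus. The sketch below is a generic skeleton only; the update_* and elbo callables stand in for the model-specific closed-form updates, which the slides do not reproduce:

```python
def coordinate_ascent(X, params, elbo, update_tau, update_phi, update_nu,
                      max_iters=200, rel_tol=1e-5):
    """Generic mean-field coordinate-ascent loop. `params` bundles
    (tau, phi_mean, phi_cov, nu); each update_* callable is a placeholder
    for a closed-form update of one factor given the others."""
    prev = -float("inf")
    for _ in range(max_iters):
        params = update_phi(X, params)   # refit q(A) given q(Z), q(v)
        params = update_nu(X, params)    # refit q(Z) given q(A), q(v)
        params = update_tau(X, params)   # refit q(v) given q(Z)
        cur = elbo(X, params)            # objective is non-decreasing
        if cur - prev < rel_tol * abs(cur):   # converged to a local optimum
            break
        prev = cur
    return params
```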
Variational Inference for the IBP
Parameter updates
Many calculations are straightforward exponential family calculations.
The only nontrivial calculation is E_{v,Z}[log p(z_nk | v)], which requires evaluating

E_v[log(1 − ∏_{m=1}^{k} v_m)].
We provide an efficient way to lower bound this term.
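The efficient lower bound itself is in the paper; as a sanity-check sketch, the expectation can also be estimated by brute-force Monte Carlo (the Beta parameters here are illustrative):

```python
import numpy as np

def mc_log_one_minus_prod(tau, k, n_samples=100_000, rng=None):
    """Monte Carlo estimate of E_v[log(1 - prod_{m=1}^{k} v_m)] with
    independent v_m ~ Beta(tau[m, 0], tau[m, 1]). Brute force only; the
    paper replaces this with an efficient lower bound."""
    rng = rng or np.random.default_rng(0)
    v = rng.beta(tau[:k, 0], tau[:k, 1], size=(n_samples, k))
    return np.log1p(-np.prod(v, axis=1)).mean()

tau = np.ones((10, 2)); tau[:, 0] = 5.0   # e.g. Beta(5, 1) sticks
print(mc_log_one_minus_prod(tau, k=5))
```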
Results

Results: Synthetic data

[Figure: predictive log likelihood vs. sampler run time (minutes) for collapsed Gibbs, uncollapsed Gibbs, and variational inference. Left panel: N = 100, D = 500, K = 25. Right panel: N = 500, D = 500, K = 25.]
Results
Results: Real data
Two data sets:

• Yale faces data set: linear Gaussian model, N = 721, D = 1024 (32 × 32 images)
• Speech data set: iICA model, N = 245, D = 10
[Figure: speech waveforms over time.]
Results
Results: Real data
Faces data set: N = 721, D = 1024
[Figure: negative log likelihood vs. K ∈ {5, 10, 25}. Large D and N: variational helps.]
Speech data set: N = 245, D = 10
[Figure: negative log likelihood vs. K ∈ {2, 5, 9}, comparing uncollapsed Gibbs and variational. Small N and D: variational does not help.]
Summary
• We present the first variational inference algorithm for the IBP.
• For large N and D, it finds better local optima than the samplers.
• We also present the first truncation bound for the IBP.
Code will be available soon from our websites.
Questions?