Variational Inference for the Indian Buffet Process
Finale Doshi-Velez† (Cambridge University), Kurt T. Miller† (UC Berkeley), Jurgen Van Gael† (Cambridge University), Yee Whye Teh (Gatsby Unit)
† Authors contributed equally
Introduction
Motivating example
We are interested in extracting unobserved features from observed data. For example:

• Latent classes ⇒ Mixture models
• Latent features ⇒ Latent feature models
Introduction
Linear Gaussian Latent Feature Model

We will focus on one example of a latent feature model:

[Figure: X = ZA + noise, where X (N × D) is observed and Z (N × K, binary) and A (K × D) are unobserved; row i of X is the observation for object i, and row i of Z gives the features for object i.]
• N = Number of data points
• D = Dimension of observed data
• K = Number of latent features
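As a concrete illustration (not from the deck), here is a minimal sketch that generates data from this model; the parameter values and the Bernoulli(0.5) draw for Z are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the linear Gaussian latent feature model X = ZA + noise.
# N, D, K, sigma_A, sigma_n follow the deck's notation; the specific values
# and the Bernoulli(0.5) choice for Z are illustrative assumptions.
rng = np.random.default_rng(0)
N, D, K = 100, 500, 25
sigma_A, sigma_n = 1.0, 0.5

Z = (rng.random((N, K)) < 0.5).astype(float)       # N x K binary features
A = sigma_A * rng.standard_normal((K, D))          # K x D feature values
X = Z @ A + sigma_n * rng.standard_normal((N, D))  # observed N x D data
```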
Introduction
Linear Gaussian Latent Feature Model
Goal: Infer Z and A given data X.
Approach: Bayes’ rule:
p(Z, A | X) ∝ p(X | Z, A) p(A) × p(Z)

where p(X | Z, A) p(A) is model specific and p(Z) is the prior on binary matrices.
In the linear Gaussian model, we use
• p(X | Z, A) ∼ N(ZA, σ_n² I)
• p(A) ∼ N(0, σ_A² I)
• p(Z) ∼ ?

[Graphical model: σ_A → A and ? → Z; A and Z generate X, with noise σ_n.]
The Indian Buffet Process
The Indian Buffet Process - Stick-breaking construction
• First generate v_1, v_2, … i.i.d. ∼ Beta(α, 1).
• Let π_k = ∏_{j=1}^{k} v_j.
• Sample z_nk ∼ Bernoulli(π_k).

[Figure: sticks v_1, …, v_9 and the resulting decreasing weights π_1, …, π_9, together with a sampled binary matrix Z.]

(Teh et al., 2007)
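A short sketch of this construction, truncated at K columns so it terminates (the truncation is an implementation assumption; the construction itself is infinite):

```python
import numpy as np

def ibp_stick_breaking(N, K, alpha, rng=None):
    """Sample an N x K binary matrix via the IBP stick-breaking
    construction of Teh et al. (2007), truncated at K columns."""
    rng = rng or np.random.default_rng()
    v = rng.beta(alpha, 1.0, size=K)    # v_k ~ Beta(alpha, 1), i.i.d.
    pi = np.cumprod(v)                  # pi_k = v_1 * ... * v_k (decreasing)
    return (rng.random((N, K)) < pi).astype(int)  # z_nk ~ Bernoulli(pi_k)

Z = ibp_stick_breaking(N=10, K=20, alpha=2.0)
```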
The Indian Buffet Process
Full Linear Gaussian Latent Feature Model
Model:
• p(X | Z, A) ∼ N(ZA, σ_n² I)
• p(A) ∼ N(0, σ_A² I)
• p(Z) ∼ IBP(α)

[Graphical model: α → v → Z and σ_A → A; A and Z generate X, with noise σ_n.]
Given X, how do we do inference on Z and A?
• Even for finite K, there are 2^{NK} possible Z.
• Many local optima.
The Indian Buffet Process

Inference in the Linear Gaussian Model

[Figure: predictive log likelihood vs. sampler run time (minutes) for collapsed Gibbs, uncollapsed Gibbs, and variational inference. Left panel: N = 100, D = 500, K = 25. Right panel: N = 500, D = 500, K = 25.]
Variational Inference for the IBP
Mean Field Variational Inference
Approximate p(Z, A | X) with a distribution q(Z, A) from a family Q that is “close” to p(Z, A | X).

How do we define “close”? We will attempt to find

q(Z, A) = arg min_{q ∈ Q} D(q(Z, A) || p(Z, A | X)).
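Why is this minimization tractable when p(Z, A | X) itself is unknown? A standard identity (not spelled out on the slide) rewrites the KL divergence as

D(q(Z, A) || p(Z, A | X)) = log p(X) − E_q[log p(X, Z, A) − log q(Z, A)],

where the expectation term is the evidence lower bound (ELBO). Since log p(X) does not depend on q, minimizing the KL divergence is equivalent to maximizing the ELBO, which involves only the joint p(X, Z, A).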
Variational Inference for the IBP
How do we choose Q?
p(Z, A | X) is a distribution over infinitely many features.

Trick (Blei and Jordan, 2004): Let Q be a truncated family where we assume that Z is nonzero in at most the first K columns.

Why can we do this? Intuitively, the probability π_k that z_nk is one decreases exponentially quickly.
Variational Inference for the IBP
Truncation bound

More formally, let m_K(X) be the marginal of X, with Z and A integrated out, when we truncate the stick-breaking construction at column K.

Then we can show

(1/4) ∫ |m_K(X) − m_∞(X)| dX ≤ 1 − exp(−Nα (α/(α + 1))^K).
[Figure: bound on L1 distance vs. K, comparing the truncation bound to the true distance.]
This is the first such bound for the IBP and can serve as a guideline for how to choose K for the family Q.
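As a usage sketch, the bound gives a direct recipe for choosing the truncation level; the tolerance below is an illustrative assumption:

```python
import numpy as np

def truncation_bound(N, alpha, K):
    """The slide's bound on (1/4) * integral |m_K(X) - m_inf(X)| dX."""
    return 1.0 - np.exp(-N * alpha * (alpha / (alpha + 1.0)) ** K)

def smallest_K(N, alpha, tol=0.01, K_max=1000):
    """Smallest truncation level K whose bound falls below tol."""
    for K in range(1, K_max + 1):
        if truncation_bound(N, alpha, K) <= tol:
            return K
    raise ValueError("no K <= K_max meets the tolerance")

print(smallest_K(N=100, alpha=5.0))   # e.g. pick K for N = 100, alpha = 5
```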
Variational Inference for the IBP
How do we choose Q?
We let our family Q be the parameterized family (introducing the stick-breaking variables v)

q(Z, A, v) = q_ν(Z) q_φ(A) q_τ(v)
[Figure: the true distribution is the graphical model with α → v → Z and σ_A → A, where A and Z generate X; the variational distribution is fully factorized, with Z, A, and v independent under parameters ν, φ, and τ.]
• q_ν(z_nk) = Bernoulli(z_nk; ν_nk)
• q_φ(A_k·) = N(A_k·; φ̄_k, Φ_k)
• q_τ(v_k) = Beta(v_k; τ_k1, τ_k2)
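To make the family concrete, here is a sketch of the parameter shapes it implies; the diagonal form for Φ_k and the initialization values are assumptions for illustration, not the deck's scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, alpha = 100, 500, 25, 5.0

tau = np.ones((K, 2))        # (tau_k1, tau_k2) for q(v_k) = Beta
tau[:, 0] = alpha            # start at the Beta(alpha, 1) prior
phi_mean = 0.01 * rng.standard_normal((K, D))  # phi_k, mean of q(A_k.)
phi_cov = np.ones((K, D))    # diagonal of Phi_k (diagonal is an assumption)
nu = rng.uniform(0.3, 0.7, size=(N, K))        # nu_nk for q(z_nk) = Bernoulli
```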
Variational Inference for the IBP
Inference
Inference now reduces to finding variational parameters (τ, φ, Φ, ν) such that q ∈ Q is “close” to p:

(τ, φ, Φ, ν) = arg min_{q ∈ Q} D(q(Z, A) || p(Z, A | X)).

This is not a convex optimization, so we can only hope to find a local optimum.

⇒ Parameter updates are done iteratively.
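Concretely, the iteration is a coordinate ascent: each factor is updated in turn with the others held fixed, until the objective plateaus. The sketch below is a generic skeleton only; the update_* and elbo callables stand in for the model-specific closed-form updates, which the slides do not reproduce:

```python
def coordinate_ascent(X, params, elbo, update_tau, update_phi, update_nu,
                      max_iters=200, rel_tol=1e-5):
    """Generic mean-field coordinate-ascent loop. `params` bundles
    (tau, phi_mean, phi_cov, nu); each update_* callable is a placeholder
    for a closed-form update of one factor given the others."""
    prev = -float("inf")
    for _ in range(max_iters):
        params = update_phi(X, params)   # refit q(A) given q(Z), q(v)
        params = update_nu(X, params)    # refit q(Z) given q(A), q(v)
        params = update_tau(X, params)   # refit q(v) given q(Z)
        cur = elbo(X, params)            # objective is non-decreasing
        if cur - prev < rel_tol * abs(cur):   # converged to a local optimum
            break
        prev = cur
    return params
```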
Variational Inference for the IBP
Parameter updates
Many calculations are straightforward exponential family calculations.
The only nontrivial calculation is E_{v,Z}[log p(z_nk | v)], which requires evaluating

E_v[log(1 − ∏_{m=1}^{k} v_m)].
We provide an efficient way to lower bound this term.
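The efficient lower bound itself is in the paper; as a sanity-check sketch, the expectation can also be estimated by brute-force Monte Carlo (the Beta parameters here are illustrative):

```python
import numpy as np

def mc_log_one_minus_prod(tau, k, n_samples=100_000, rng=None):
    """Monte Carlo estimate of E_v[log(1 - prod_{m=1}^{k} v_m)] with
    independent v_m ~ Beta(tau[m, 0], tau[m, 1]). Brute force only; the
    paper replaces this with an efficient lower bound."""
    rng = rng or np.random.default_rng(0)
    v = rng.beta(tau[:k, 0], tau[:k, 1], size=(n_samples, k))
    return np.log1p(-np.prod(v, axis=1)).mean()

tau = np.ones((10, 2)); tau[:, 0] = 5.0   # e.g. Beta(5, 1) sticks
print(mc_log_one_minus_prod(tau, k=5))
```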
Results

Results: Synthetic data

[Figure: predictive log likelihood vs. sampler run time (minutes) for collapsed Gibbs, uncollapsed Gibbs, and variational inference. Left panel: N = 100, D = 500, K = 25. Right panel: N = 500, D = 500, K = 25.]
Results
Results: Real data
Two data sets:

• Yale faces data set: linear Gaussian model, N = 721, D = 1024 (32 × 32 images)
• Speech data set: iICA model, N = 245, D = 10
[Figure: speech waveforms over time.]
Results
Results: Real data
Faces data set: N = 721, D = 1024
[Figure: negative log likelihood vs. K ∈ {5, 10, 25}. Large D and N: variational helps.]
Speech data set: N = 245, D = 10
[Figure: negative log likelihood vs. K ∈ {2, 5, 9}, comparing uncollapsed Gibbs and variational. Small N and D: variational does not help.]
Summary
• We present the first variational inference algorithm for the IBP.
• For large N and D, it finds better local optima than the samplers.
• We also present the first truncation bound for the IBP.
Code will be available soon from our websites.
Questions?