Stochastic Variational Inference


Jesus Fernandez Bes

Machine Learning Group

March 27, 2014

Jesus Fernandez Bes Stochastic Variational Inference

http://arxiv.org/abs/1206.7051

1 Introduction

2 Stochastic Variational Inference
   Models with local and global hidden variables
   Mean-field variational inference
   The natural gradient of the ELBO
   Stochastic Variational Inference

3 Stochastic Variational Inference in Topic Models
   Topic Models
   Latent Dirichlet Allocation
   Hierarchical Dirichlet Process

4 Some Bibliography

Motivation

Challenges of modern data analysis

Massive

Complex

High-dimensional

Probability models (and graphical models) deal with complexity.

Scale is the problem.

“Traditional” Variational Inference

1 Inference ⟹ high-dimensional optimization.

2 Solved using coordinate ascent algorithms:

Analyze ALL the data. Re-estimate hidden structure. Analyze ALL the data. ...

DOES NOT SCALE TO BIG DATA

Main Ideas

How can we build a general variational method that scales?

Use stochastic optimization. Follow cheap, noisy estimates of the gradient.

Use the natural gradient. With it, the Stochastic Variational Inference update has an attractive form.

Structure of SVI

1 Subsample one or more data points from the data.

2 Analyze the subsample using current variational parameters.

3 Implement a closed-form update of the parameters.

4 Repeat.

Models with local and global hidden variables

$$p(x, z, \beta \mid \alpha) = p(\beta \mid \alpha) \prod_{n=1}^{N} p(x_n, z_n \mid \beta)$$

N observations $x = x_{1:N}$.

Vector of global hidden variables β.

N local hidden variables $z = z_{1:N}$; each $z_n$ is a collection of J variables, $z_n = z_{n,1:J}$.

Vector of fixed parameters α.

Complete Conditional assumption

Complete conditionals are in the exponential family

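$$p(\beta \mid x, z, \alpha) = h(\beta) \exp\{\eta_g(x, z, \alpha)^T t(\beta) - a_g(\eta_g(x, z, \alpha))\}$$

$$p(z_{nj} \mid x_n, z_{n,-j}, \beta) = h(z_{nj}) \exp\{\eta_l(x_n, z_{n,-j}, \beta)^T t(z_{nj}) - a_l(\eta_l(x_n, z_{n,-j}, \beta))\}$$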

h(·) is the base measure.

a(·) is the log normalizer.

η(·) is the natural parameter vector.

t(·) are the sufficient statistics.

Several distributions in the exponential family

Bernoulli, Gaussian, Multinomial, Dirichlet, Gamma, Poisson, Beta, ...
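As a concrete instance of the notation above, the following sketch writes the Bernoulli distribution in exponential-family form and compares it with the textbook formula. It assumes h(x) = 1, t(x) = x, η = log(p / (1 − p)) and a(η) = log(1 + exp(η)); the function names are illustrative only and do not come from the slides.

```python
import math


def bernoulli_pmf(x, p):
    """Textbook Bernoulli probability of x in {0, 1} with success probability p."""
    return p if x == 1 else 1.0 - p


def natural_parameter(p):
    """eta = log(p / (1 - p)), the log-odds."""
    return math.log(p / (1.0 - p))


def log_normalizer(eta):
    """a(eta) = log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))


def expfam_pmf(x, eta):
    """h(x) * exp{eta * t(x) - a(eta)} with h(x) = 1 and t(x) = x."""
    return math.exp(eta * x - log_normalizer(eta))


if __name__ == "__main__":
    eta = natural_parameter(0.3)
    for x in (0, 1):
        # Both columns should agree up to floating-point rounding.
        print(x, bernoulli_pmf(x, 0.3), expfam_pmf(x, eta))
```

The same recipe (pick h, t, η and a) applies to the other families in the list; only the four ingredients change.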

Examples of this kind of model

Bayesian Mixture Models.

Latent Dirichlet Allocation.

Hidden Markov Models (+ many variants).

Kalman filters (+ many variants).

Hierarchical linear regression models.

Hierarchical probit classification models.

Probabilistic factor analysis/matrix factorization models.

Certain Bayesian nonparametric mixture models.

Approximate the posterior distribution of the hidden variables given the observations.

$$p(z, \beta \mid x) = \frac{p(x, z, \beta)}{\int p(x, z, \beta)\, dz\, d\beta}$$

The problem is the denominator: it is intractable to compute.

The evidence lower bound (ELBO)

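$$
\begin{aligned}
\log p(x) &= \log \int p(x, z, \beta)\, dz\, d\beta \\
&= \log \int p(x, z, \beta)\, \frac{q(z, \beta)}{q(z, \beta)}\, dz\, d\beta \\
&= \log \left( E_q\left[ \frac{p(x, z, \beta)}{q(z, \beta)} \right] \right) \\
&\geq E_q[\log p(x, z, \beta)] - E_q[\log q(z, \beta)] \\
&\triangleq \mathcal{L}(q).
\end{aligned}
$$

$$KL(q(z, \beta) \,\|\, p(z, \beta \mid x)) = -\mathcal{L}(q) + \text{const.}$$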
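The relation between the ELBO, the evidence and the KL divergence can be checked numerically on a toy problem. The sketch below is an illustration under a simplifying assumption: all hidden variables are collapsed into a single discrete variable z with three states, and `joint[z]` holds p(x, z) for one fixed observation x. The numbers and function names are made up for the example.

```python
import math


def log_evidence(joint):
    """log p(x) = log sum_z p(x, z) for a fixed observation x."""
    return math.log(sum(joint))


def elbo(joint, q):
    """L(q) = E_q[log p(x, z)] - E_q[log q(z)]."""
    return sum(qz * (math.log(pz) - math.log(qz))
               for pz, qz in zip(joint, q) if qz > 0.0)


def kl_to_posterior(joint, q):
    """KL(q(z) || p(z | x)), computed with the exact posterior."""
    evidence = sum(joint)
    return sum(qz * (math.log(qz) - math.log(pz / evidence))
               for pz, qz in zip(joint, q) if qz > 0.0)


if __name__ == "__main__":
    joint = [0.10, 0.05, 0.25]   # hypothetical values of p(x, z) for z = 0, 1, 2
    q = [0.2, 0.3, 0.5]          # an arbitrary variational distribution
    gap = log_evidence(joint) - elbo(joint, q)
    # The gap between log p(x) and the ELBO is exactly the KL divergence.
    print(gap, kl_to_posterior(joint, q))
```

Here the constant in the KL identity is log p(x), so maximizing the ELBO over q is the same as minimizing the KL divergence to the posterior.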

Mean-field Approximation

Assumption on q(z, β):

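$$q(z, \beta) = q(\beta \mid \lambda) \prod_{n=1}^{N} \prod_{j=1}^{J} q(z_{nj} \mid \phi_{nj})$$

with $q(\beta \mid \lambda)$ and $q(z_{nj} \mid \phi_{nj})$ in the same exponential family as the complete conditionals:

$$q(\beta \mid \lambda) = h(\beta) \exp\{\lambda^T t(\beta) - a_g(\lambda)\}$$

$$q(z_{nj} \mid \phi_{nj}) = h(z_{nj}) \exp\{\phi_{nj}^T t(z_{nj}) - a_l(\phi_{nj})\}$$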

Easy coordinate ascent algorithm.

Gradient of the ELBO and Coordinate Ascent Inference

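$$\nabla_\lambda \mathcal{L} = \nabla^2_\lambda a_g(\lambda) \left( E_q[\eta_g(x, z, \alpha)] - \lambda \right)$$

$$\nabla_{\phi_{nj}} \mathcal{L} = \nabla^2_{\phi_{nj}} a_l(\phi_{nj}) \left( E_q[\eta_l(x_n, z_{n,-j}, \beta)] - \phi_{nj} \right)$$

Both gradients equal 0 when we set

$$\lambda = E_q[\eta_g(x, z, \alpha)]$$

$$\phi_{nj} = E_q[\eta_l(x_n, z_{n,-j}, \beta)]$$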

The gradient, if it exists, points in the direction of steepest ascent,

$$\arg\max_{d\lambda} f(\lambda + d\lambda) \quad \text{subject to } \|d\lambda\|^2 < \epsilon$$

for small $\epsilon$. The gradient therefore depends on the Euclidean distance metric in the parameter space.

For probability distributions, the Euclidean metric can be a bad measure of dissimilarity.

The natural gradient accounts for the information geometry of its parameter space.

Symmetrized KL divergence

Natural measure of dissimilarity between probability distributions

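$$D^{\text{sym}}_{KL}(\lambda, \lambda') = E_\lambda\left[ \log \frac{q(\beta \mid \lambda)}{q(\beta \mid \lambda')} \right] + E_{\lambda'}\left[ \log \frac{q(\beta \mid \lambda')}{q(\beta \mid \lambda)} \right]$$

Using this distance, the direction of steepest ascent is

$$\arg\max_{d\lambda} f(\lambda + d\lambda) \quad \text{subject to } D^{\text{sym}}_{KL}(\lambda, \lambda + d\lambda) < \epsilon$$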

Natural Gradient

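The natural gradient points in the direction of steepest ascent in the Riemannian space.

$$\hat{\nabla}_\lambda f(\lambda) = G(\lambda)^{-1} \nabla_\lambda f(\lambda)$$

where $G(\lambda) = E_\lambda\left[ (\nabla_\lambda \log q(\beta \mid \lambda)) (\nabla_\lambda \log q(\beta \mid \lambda))^T \right]$ is the Fisher information matrix of $q(\beta \mid \lambda)$.

For the exponential family: $G(\lambda) = \nabla^2_\lambda a_g(\lambda)$

For our mean-field model:

$$\hat{\nabla}_\lambda \mathcal{L} = E_\phi[\eta_g(x, z, \alpha)] - \lambda$$

$$\hat{\nabla}_{\phi_{nj}} \mathcal{L} = E_{\lambda, \phi_{n,-j}}[\eta_l(x_n, z_{n,-j}, \beta)] - \phi_{nj}$$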

Why Natural Gradients?

Traditional Gradients

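$$\nabla_\lambda \mathcal{L} = \nabla^2_\lambda a_g(\lambda) \left( E_q[\eta_g(x, z, \alpha)] - \lambda \right)$$

$$\nabla_{\phi_{nj}} \mathcal{L} = \nabla^2_{\phi_{nj}} a_l(\phi_{nj}) \left( E_q[\eta_l(x_n, z_{n,-j}, \beta)] - \phi_{nj} \right)$$

Natural Gradients

$$\hat{\nabla}_\lambda \mathcal{L} = E_\phi[\eta_g(x, z, \alpha)] - \lambda$$

$$\hat{\nabla}_{\phi_{nj}} \mathcal{L} = E_{\lambda, \phi_{n,-j}}[\eta_l(x_n, z_{n,-j}, \beta)] - \phi_{nj}$$

Coordinate ascent is equal to taking a natural gradient step of length one.

Natural gradients are easier to compute. Use them to develop scalable variational inference algorithms.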
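A one-dimensional sketch can make these claims concrete. It assumes a toy variational family q(b | λ) = exp{λ b − a(λ)} over b ∈ {0, 1} with a(λ) = log(1 + exp(λ)); `expected_eta` stands in for the expected natural parameter E_q[η_g(x, z, α)]. The family and all names are illustrative choices, not part of the slides.

```python
import math


def mean_parameter(lam):
    """a'(lam) = E[b] for q(b | lam) = exp{lam * b - a(lam)}, b in {0, 1}."""
    return 1.0 / (1.0 + math.exp(-lam))


def fisher_information(lam):
    """G(lam) = E[(d/dlam log q(b | lam))^2]; the score is b - a'(lam)."""
    mu = mean_parameter(lam)
    return (1.0 - mu) * (0.0 - mu) ** 2 + mu * (1.0 - mu) ** 2


def log_normalizer_hessian(lam):
    """a''(lam) for a(lam) = log(1 + exp(lam))."""
    mu = mean_parameter(lam)
    return mu * (1.0 - mu)


def euclidean_gradient(lam, expected_eta):
    """Gradient of the ELBO: a''(lam) * (E_q[eta] - lam)."""
    return log_normalizer_hessian(lam) * (expected_eta - lam)


def natural_gradient(lam, expected_eta):
    """Premultiply by G(lam)^-1; in one dimension this is a division."""
    return euclidean_gradient(lam, expected_eta) / fisher_information(lam)


if __name__ == "__main__":
    lam, expected_eta = -1.2, 0.7
    print(fisher_information(lam), log_normalizer_hessian(lam))
    # A natural gradient step of length one lands on the coordinate update.
    print(lam + natural_gradient(lam, expected_eta), expected_eta)
```

In this toy case the Fisher information cancels the Hessian of the log normalizer, which is why the natural gradient needs no second-derivative computation.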

Stochastic Optimization

We have a random function $B(\lambda)$ with $E_q[B(\lambda)] = \nabla_\lambda f(\lambda)$. We can optimize $f(\lambda)$ iteratively as

$$\lambda^{(t)} = \lambda^{(t-1)} + \rho_t b_t(\lambda^{(t-1)})$$

where $b_t$ is an independent draw from $B$. The sequence of step sizes $\rho_t$ must satisfy the Robbins-Monro conditions, $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$.

Follow noisy estimates of the gradient with a decreasing step size.

If the gradient can be written as a sum of terms (one per data point), a fast noisy approximation can be computed by subsampling the data.

$\lambda^{(t)}$ will converge to the optimal $\lambda^*$ (if f is convex) or to a local optimum of f (if f is not convex).
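The recursion can be tried on a toy objective. The sketch below assumes f(λ) = −(1/2N) Σ_n (λ − x_n)², whose maximizer is the sample mean, and the step-size schedule ρ_t = (t + τ)^(−κ) with κ in (0.5, 1], a common choice that satisfies the Robbins-Monro conditions. The objective, the default values and the function names are assumptions made for illustration.

```python
import random


def step_size(t, tau=1.0, kappa=0.7):
    """rho_t = (t + tau)^(-kappa); kappa in (0.5, 1] satisfies Robbins-Monro."""
    return (t + tau) ** (-kappa)


def stochastic_ascent(data, lam=0.0, n_steps=20000, seed=0):
    """Maximize f(lam) = -(1 / 2N) * sum_n (lam - x_n)^2 using one point per step."""
    rng = random.Random(seed)
    for t in range(1, n_steps + 1):
        x_i = rng.choice(data)      # subsample a single data point
        noisy_grad = x_i - lam      # unbiased: its average over i is grad f(lam)
        lam = lam + step_size(t) * noisy_grad
    return lam


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(3.0, 1.0) for _ in range(500)]
    # The two printed numbers should be close to each other.
    print(stochastic_ascent(data), sum(data) / len(data))
```

Each step touches a single data point, so the cost per iteration does not grow with the size of the data set.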

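$$\mathcal{L}(\lambda) = \underbrace{E_q[\log p(\beta)] - E_q[\log q(\beta)]}_{\text{global}} + \underbrace{\sum_{n=1}^{N} \max_{\phi_n} \left( E_q[\log p(x_n, z_n \mid \beta)] - E_q[\log q(z_n)] \right)}_{\text{sum of local}}$$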

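We choose $I \sim \text{Unif}(1, \dots, N)$ and define $\mathcal{L}_I(\lambda)$ as the random function

$$\mathcal{L}_I(\lambda) = E_q[\log p(\beta)] - E_q[\log q(\beta)] + N \max_{\phi_I} \left( E_q[\log p(x_I, z_I \mid \beta)] - E_q[\log q(z_I)] \right)$$

The expectation of $\mathcal{L}_I$ over I is equal to the objective, and consequently $\hat{\nabla}_\lambda \mathcal{L}_I$ is a noisy but unbiased estimate of the natural gradient of the objective.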

Stochastic Optimization for global parameters

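$$\hat{\nabla}_\lambda \mathcal{L}_i = E_q\left[ \eta_g(x_i^{(N)}, z_i^{(N)}, \alpha) \right] - \lambda$$

where $\{x_i^{(N)}, z_i^{(N)}\}$ is a data set formed by N replicates of observation $x_i$ and hidden variables $z_i$.

$$\eta_g(x_i^{(N)}, z_i^{(N)}, \alpha) = \alpha + N \cdot (t(x_i, z_i), 1)$$

$$\hat{\nabla}_\lambda \mathcal{L}_i = \alpha + N \cdot \left( E_q[t(x_i, z_i)], 1 \right) - \lambda$$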

Using Stochastic optimization

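$$\hat{\lambda}_t \triangleq \alpha + N E_{\phi(\lambda)}[(t(x_i, z_i), 1)]$$

$$\lambda^{(t)} = \lambda^{(t-1)} + \rho_t \left( \hat{\lambda}_t - \lambda^{(t-1)} \right) = (1 - \rho_t) \lambda^{(t-1)} + \rho_t \hat{\lambda}_t$$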
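The update can be exercised on a model small enough to check by hand. The sketch below assumes Bernoulli observations with a conjugate prior on the natural parameter and no local hidden variables, so that t(x_i) = x_i and the expectation in the intermediate parameter is trivial. Under this assumption λ = (λ_1, λ_2) corresponds to a Beta(λ_1, λ_2 − λ_1) distribution over the success probability, and the exact posterior is α + (Σ_n x_n, N). The model choice, the schedule ρ_t = (t + τ)^(−κ) and all names are illustrative.

```python
import random


def svi_bernoulli(x, alpha, n_steps=20000, tau=1.0, kappa=1.0, seed=0):
    """SVI for x_n ~ Bernoulli(p) with a conjugate prior on the natural parameter.

    lam = (lam_1, lam_2) corresponds to q(p) = Beta(lam_1, lam_2 - lam_1).
    """
    rng = random.Random(seed)
    n_data = len(x)
    lam = list(alpha)
    for t in range(1, n_steps + 1):
        x_i = x[rng.randrange(n_data)]              # subsample one data point
        # Intermediate parameter: pretend the data set is N copies of x_i.
        lam_hat = [alpha[0] + n_data * x_i, alpha[1] + n_data * 1.0]
        rho = (t + tau) ** (-kappa)                 # decreasing step size
        lam = [(1.0 - rho) * old + rho * new for old, new in zip(lam, lam_hat)]
    return lam


def exact_posterior(x, alpha):
    """Closed-form posterior parameter, available here because the model is conjugate."""
    return [alpha[0] + sum(x), alpha[1] + len(x)]


if __name__ == "__main__":
    rng = random.Random(2)
    x = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(1000)]
    alpha = [1.0, 2.0]   # uniform prior over the success probability
    # The stochastic estimate should be close to the exact posterior parameter.
    print(svi_bernoulli(x, alpha), exact_posterior(x, alpha))
```

In models with local hidden variables, the only change is that the sampled sufficient statistic is replaced by its expectation under the optimized local variational distribution.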

Stochastic Variational Inference

Extensions

Minibatches

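Pick S data points at each iteration and average their intermediate global parameters,

$$\lambda^{(t)} = (1 - \rho_t) \lambda^{(t-1)} + \frac{\rho_t}{S} \sum_{s=1}^{S} \hat{\lambda}_s.$$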
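The minibatch update is a small change to the single-point update. A minimal sketch, assuming the parameters are stored as plain lists of floats and that `lam_hats` holds one intermediate global parameter per sampled data point (both names are hypothetical):

```python
def minibatch_update(lam, lam_hats, rho):
    """Return (1 - rho) * lam + (rho / S) * sum_s lam_hats[s]."""
    batch_size = len(lam_hats)
    # Average the S intermediate parameters coordinate by coordinate.
    averaged = [sum(column) / batch_size for column in zip(*lam_hats)]
    return [(1.0 - rho) * old + rho * avg for old, avg in zip(lam, averaged)]


if __name__ == "__main__":
    print(minibatch_update([0.0, 0.0], [[2.0, 4.0], [4.0, 8.0]], 0.5))
```

With S = 1 this reduces to the single-point update; larger S lowers the variance of each step at a higher cost per iteration.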

Empirical Bayes estimation of hyperparameters

Get a point estimate of the hyperparameters α:

$$\alpha^{(t)} = \alpha^{(t-1)} + \rho_t \nabla_\alpha \mathcal{L}_t(\lambda^{(t-1)}, \phi, \alpha^{(t-1)}).$$


Topic Models

Observations:

Words: $w_{dn}$ is the nth word in the dth document, an element of a fixed vocabulary of V terms.

Latent Variables:

A topic $\beta_k$ is a distribution over the vocabulary, i.e. a point in the (V − 1)-simplex.

Topic proportions $\theta_d$ are associated with each document; each is a distribution over topics.

Each word in each document comes from a single topic. Topic assignments $z_{dn}$ are topic indices.

We consider two models: Latent Dirichlet Allocation (LDA) has a fixed number K of topics; the Hierarchical Dirichlet Process (HDP) has an infinite number of topics.

Analyzing the documents

Posterior inference of p(β, θ, z|w)

Generative model

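1 Draw topics $\beta_k \sim \text{Dirichlet}(\eta, \dots, \eta)$ for $k \in \{1, \dots, K\}$.
2 For each document $d \in \{1, \dots, D\}$:
   1 Draw topic proportions $\theta_d \sim \text{Dirichlet}(\alpha, \dots, \alpha)$.
   2 For each word $n \in \{1, \dots, N\}$:
      1 Draw topic assignment $z_{dn} \sim \text{Multinomial}(\theta_d)$.
      2 Draw word $w_{dn} \sim \text{Multinomial}(\beta_{z_{dn}})$.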
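The generative process can be sketched directly as a sampler. The code below uses only the Python standard library; Dirichlet draws are obtained by normalizing independent Gamma draws, every document is assumed to have the same length N, and the function names and argument order are illustrative.

```python
import random


def sample_dirichlet(rng, concentration, dim):
    """Symmetric Dirichlet draw obtained by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [g / total for g in draws]


def sample_categorical(rng, probs):
    """Draw one index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_lda_corpus(n_docs, n_words, n_topics, n_vocab, alpha, eta, seed=0):
    """Sample topics and documents following the LDA generative process."""
    rng = random.Random(seed)
    topics = [sample_dirichlet(rng, eta, n_vocab) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        theta = sample_dirichlet(rng, alpha, n_topics)   # topic proportions
        doc = []
        for _ in range(n_words):
            z = sample_categorical(rng, theta)           # topic assignment
            doc.append(sample_categorical(rng, topics[z]))  # observed word
        docs.append(doc)
    return topics, docs


if __name__ == "__main__":
    topics, docs = generate_lda_corpus(3, 8, 2, 6, alpha=0.5, eta=0.5)
    print(docs)
```

Posterior inference runs this process in reverse: only the word ids in `docs` are observed, and the topics, proportions and assignments must be inferred.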

Variational Inference in LDA

Mean-field for LDA

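$$q(z_{dn}) = \text{Multinomial}(\phi_{dn})$$

$$q(\theta_d) = \text{Dirichlet}(\gamma_d)$$

$$q(\beta_k) = \text{Dirichlet}(\lambda_k)$$

1 Update the local variational parameters of each document d:

$$\phi_{dn}^k \propto \exp\left\{ \Psi(\gamma_{dk}) + \Psi(\lambda_{k, w_{dn}}) - \Psi\left( \sum_v \lambda_{kv} \right) \right\} \quad \text{for } n \in \{1, \dots, N\}$$

$$\gamma_d = \alpha + \sum_{n=1}^{N} \phi_{dn}$$

2 Update the global parameters:

$$\lambda_k = \eta + \sum_{d=1}^{D} \sum_{n=1}^{N} \phi_{dn}^k w_{dn}$$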

Stochastic Variational Inference in LDA
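The algorithm shown on this slide did not survive extraction. The sketch below is a reconstruction under stated assumptions, not the original listing: it combines the mean-field updates for LDA with the generic SVI recipe (sample one document, fit its local parameters, form the intermediate topics as if the corpus were D copies of that document, then blend). Documents are lists of word ids, the local step runs a fixed number of iterations instead of testing convergence, the step size is ρ_t = (t + τ)^(−κ), and the digamma function Ψ is approximated with a standard recurrence plus asymptotic series to keep the example dependency-free. All names and default values are illustrative.

```python
import math
import random


def digamma(x):
    """Psi(x) for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def local_step(doc, lam, alpha, n_iters=20):
    """Fit phi (per word) and gamma (per document) for one document of word ids."""
    n_topics = len(lam)
    psi_totals = [digamma(sum(row)) for row in lam]
    gamma = [1.0] * n_topics
    phi = []
    for _ in range(n_iters):
        phi = []
        for w in doc:
            logits = [digamma(gamma[k]) + digamma(lam[k][w]) - psi_totals[k]
                      for k in range(n_topics)]
            top = max(logits)                       # subtract max for stability
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            phi.append([v / total for v in weights])
        gamma = [alpha + sum(p[k] for p in phi) for k in range(n_topics)]
    return phi, gamma


def svi_lda(docs, n_vocab, n_topics, alpha=0.5, eta=0.1,
            n_steps=200, tau=1.0, kappa=0.7, seed=0):
    """Return the topic parameters lam (n_topics x n_vocab) after n_steps updates."""
    rng = random.Random(seed)
    n_docs = len(docs)
    lam = [[rng.gammavariate(100.0, 0.01) for _ in range(n_vocab)]
           for _ in range(n_topics)]
    for t in range(1, n_steps + 1):
        doc = docs[rng.randrange(n_docs)]           # 1. subsample a document
        phi, _ = local_step(doc, lam, alpha)        # 2. local variational step
        lam_hat = [[eta] * n_vocab for _ in range(n_topics)]
        for word_phi, w in zip(phi, doc):           # 3. corpus of D replicates
            for k in range(n_topics):
                lam_hat[k][w] += n_docs * word_phi[k]
        rho = (t + tau) ** (-kappa)                 # 4. decreasing step size
        lam = [[(1.0 - rho) * a + rho * b for a, b in zip(row, hat_row)]
               for row, hat_row in zip(lam, lam_hat)]
    return lam


if __name__ == "__main__":
    docs = [[0, 1, 2, 0, 1, 2], [3, 4, 5, 3, 4, 5],
            [0, 1, 2, 2, 1, 0], [3, 4, 5, 5, 4, 3]]
    for row in svi_lda(docs, n_vocab=6, n_topics=2):
        print([round(v, 2) for v in row])
```

Only one document is analyzed per iteration, so the global topics start improving long before a full pass over the corpus has been completed.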

Results LDA

Nature: 350k docs, 58M words, 4200 terms.

New York Times: 1.8M docs, 461M words, 8000 terms.

Wikipedia: 3.8M docs, 482M words, 7700 terms.

* Batch Variational uses a subset of 100k docs.

Results HDP

Some Bibliography

Main Paper

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). “Stochastic variational inference”. The Journal of Machine Learning Research, 14(1), 1303-1347.

Other References

Blei, D. M. “Variational Inference”. Lecture Notes of COS597C: Advanced Methods in Probabilistic Modeling, Princeton University, fall 2011, www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf.

Blei, D. M. “Exponential Families”. Lecture Notes of COS597C: Advanced Methods in Probabilistic Modeling, Princeton University, fall 2011, www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/exponential-families.pdf.
