This is the deck for a Hulu internal machine learning workshop, which introduces the background, theory, and applications of the expectation propagation method.
Expectation Propagation: Theory and Application
Dong Guo, Research Workshop 2013, Hulu Internal
See more details in
http://dongguo.me/blog/2014/01/01/expectation-propagation/
http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/
Outline
• Overview
• Background
• Theory
• Applications
OVERVIEW
Bayesian Paradigm
• Infer the posterior distribution: Prior + Data → Posterior → Make decision

Note: the figure of LDA is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'
Bayesian inference methods
• Exact inference
  – Belief propagation
• Approximate inference
  – Stochastic (sampling)
  – Deterministic
    • Assumed density filtering
    • Expectation propagation
    • Variational Bayes
Message passing
• A form of communication used in multiple domains of computer science
  – Parallel computing (MPI)
  – Object-oriented programming
  – Inter-process communication
  – Bayesian inference
• A family of methods to infer posterior distributions
Expectation Propagation
• Belongs to the message passing family
• An approximate method (iteration is needed)
• Very popular in Bayesian inference, especially for graphical models
Researchers
• Thomas Minka
  – EP was proposed in his PhD thesis
• Kevin P. Murphy
  – Machine Learning: A Probabilistic Perspective
BACKGROUND
Background
• (Truncated) Gaussian
• Exponential family
• Graphical models
• Factor graph
• Belief propagation
• Moment matching
Gaussian and Truncated Gaussian
• Gaussian operations are the basis of EP inference
  – Gaussian addition / multiplication / division
  – Gaussian integrals
• Truncated Gaussians are used in many EP applications
• See details here
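Since EP represents everything with Gaussians, the multiply/divide operations above reduce to arithmetic on natural parameters (precisions add under multiplication and subtract under division). A minimal sketch; the function names are my own:

```python
def gaussian_multiply(m1, v1, m2, v2):
    """Product of two Gaussian densities N(m1, v1) * N(m2, v2) is
    proportional to a Gaussian: precisions add, and the new mean is the
    precision-weighted average."""
    v = 1.0 / (1.0 / v1 + 1.0 / v2)
    m = v * (m1 / v1 + m2 / v2)
    return m, v

def gaussian_divide(m1, v1, m2, v2):
    """Quotient N(m1, v1) / N(m2, v2): precisions subtract.
    This is the operation EP uses to remove a site approximation
    from the posterior."""
    v = 1.0 / (1.0 / v1 - 1.0 / v2)
    m = v * (m1 / v1 - m2 / v2)
    return m, v

# Multiplying N(0, 1) by N(2, 1) gives a Gaussian with mean 1, variance 0.5
m, v = gaussian_multiply(0.0, 1.0, 2.0, 1.0)
print(m, v)                                  # 1.0 0.5
# Dividing the product back by N(2, 1) recovers N(0, 1)
print(gaussian_divide(m, v, 2.0, 1.0))       # (0.0, 1.0)
```

Division can produce a negative "variance" when the divisor is sharper than the dividend; EP tolerates this in the site approximations as long as the overall posterior stays proper.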
Exponential family distribution
• Very good summary on Wikipedia
• Sufficient statistics of the Gaussian distribution: (x, x²)
• Typical distributions

q(z) = h(z)g(η)exp{η^T u(z)}
Note: above 4 figures are from Wikipedia
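To make the Gaussian case concrete: with sufficient statistics u(x) = (x, x²), the natural parameters work out to η = (μ/σ², −1/(2σ²)). A small sketch of the conversion in both directions (function names are illustrative):

```python
# Gaussian N(mu, var) written in exponential-family form
# q(x) = h(x) g(eta) exp(eta . u(x)) with u(x) = (x, x^2).

def gaussian_to_natural(mu, var):
    """Natural parameters eta = (mu/var, -1/(2*var))."""
    return (mu / var, -1.0 / (2.0 * var))

def natural_to_gaussian(eta1, eta2):
    """Invert the map: var = -1/(2*eta2), mu = eta1 * var."""
    var = -1.0 / (2.0 * eta2)
    return (eta1 * var, var)

eta = gaussian_to_natural(3.0, 4.0)
print(eta)                          # (0.75, -0.125)
print(natural_to_gaussian(*eta))    # (3.0, 4.0)
```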
Graphical Models
• Directed graph (Bayesian Network)
• Undirected graph (Conditional Random Field)

P(x) = ∏_{k=1}^{K} p(x_k | pa_k)
[Figure: example directed and undirected graphs over variables x1, x2, x3, x4]
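The directed factorization above can be checked numerically: the joint is just the product of the local conditionals, and it must sum to one. A toy sketch for a four-variable network; the structure and all CPT values below are made-up illustrations, not from the deck:

```python
# Joint probability of a small Bayesian network via P(x) = prod_k p(x_k | pa_k).
# Structure (hypothetical): x1 -> x3 <- x2, x3 -> x4, all variables binary.

p_x1 = {0: 0.6, 1: 0.4}
p_x2 = {0: 0.7, 1: 0.3}
p_x3_given = {(0, 0): {0: 0.9, 1: 0.1},
              (0, 1): {0: 0.5, 1: 0.5},
              (1, 0): {0: 0.4, 1: 0.6},
              (1, 1): {0: 0.2, 1: 0.8}}   # p(x3 | x1, x2)
p_x4_given = {0: {0: 0.8, 1: 0.2},
              1: {0: 0.3, 1: 0.7}}        # p(x4 | x3)

def joint(x1, x2, x3, x4):
    """P(x1, x2, x3, x4) as the product of the local conditionals."""
    return (p_x1[x1] * p_x2[x2]
            * p_x3_given[(x1, x2)][x3]
            * p_x4_given[x3][x4])

# A valid factorization sums to 1 over all 16 assignments.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1))
print(total)   # 1.0
```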
Factor graph
• Expresses the relations between variable nodes explicitly
  – A relation on an edge becomes a factor node
• Hides the difference between BN and CRF during inference
• Makes inference more intuitive
[Figure: the same graphs over x1–x4 redrawn as factor graphs, with factor nodes such as fa and fc between the variables]
BELIEF PROPAGATION
Belief Propagation Overview
• An exact Bayesian method to infer marginal distributions
  – 'sum-product' message passing
• Key components
  – Calculating the posterior distribution of a variable node
  – Two kinds of messages
Posterior distribution of a variable node
• Factor graph

p(X) = ∏_{s∈ne(x)} F_s(x, X_s), for any variable x in the graph

p(x) = ∑_{X\x} p(X) = ∑_{X\x} ∏_{s∈ne(x)} F_s(x, X_s) = ∏_{s∈ne(x)} [ ∑_{X_s} F_s(x, X_s) ] = ∏_{s∈ne(x)} μ_{f_s→x}(x)

in which μ_{f_s→x}(x) = ∑_{X_s} F_s(x, X_s)
Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Message: factor → variable node
• Factor graph

μ_{f_s→x}(x) = ∑_{x_1} … ∑_{x_M} f_s(x, x_1, …, x_M) ∏_{x_m∈ne(f_s)\x} μ_{x_m→f_s}(x_m),
in which {x_1, …, x_M} is the set of variables on which the factor f_s depends

Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Message: variable → factor node
• Factor graph

μ_{x_m→f_s}(x_m) = ∏_{l∈ne(x_m)\f_s} μ_{f_l→x_m}(x_m)

Summary: the posterior distribution is determined only by the factors!

Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Whole steps of BP
• Steps to calculate the posterior distribution of a given variable node
  – Step 1: construct the factor graph
  – Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
  – Step 3: apply the message passing steps recursively until the root node receives messages from all of its neighbors
  – Step 4: get the marginal distribution by multiplying all incoming messages

Note: the figures are from the book 'Pattern Recognition and Machine Learning'
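The four steps can be sketched on a tiny chain factor graph: the factor→variable message sums the factor table against the incoming message, and the marginal is the normalized product of incoming messages. The factor tables below are made-up illustrations:

```python
import numpy as np

# Sum-product on a chain factor graph x1 -- f12 -- x2 -- f23 -- x3,
# all variables binary. We infer the marginal of x3 (the root).
f12 = np.array([[1.0, 0.5],
                [0.5, 2.0]])   # f12[x1, x2]
f23 = np.array([[2.0, 1.0],
                [0.3, 1.0]])   # f23[x2, x3]

# Step 2: the leaf variable x1 sends the unit message.
msg_x1_to_f12 = np.ones(2)
# Step 3: each factor sums out its other variable (factor -> variable message).
msg_f12_to_x2 = f12.T @ msg_x1_to_f12    # sum over x1
msg_x2_to_f23 = msg_f12_to_x2            # x2 has only one other neighbor
msg_f23_to_x3 = f23.T @ msg_x2_to_f23    # sum over x2

# Step 4: multiply (here, a single incoming message) and normalize.
p_x3 = msg_f23_to_x3 / msg_f23_to_x3.sum()

# Brute-force check: p(x3) ∝ sum over x1, x2 of f12[x1,x2] * f23[x2,x3]
joint = np.einsum('ab,bc->abc', f12, f23)
p_x3_brute = joint.sum(axis=(0, 1))
p_x3_brute /= p_x3_brute.sum()
print(p_x3, p_x3_brute)   # identical marginals
```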
BP: example
• Infer the marginal distribution of x_3
• Infer the marginal distributions of all variables

Note: the figures are from the book 'Pattern Recognition and Machine Learning'
The posterior is sometimes intractable
• Examples
  – Inferring the mean of a Gaussian distribution:
    p(x|θ) = (1−w)N(x|θ, I) + wN(x|0, aI)
    p(θ) = N(θ|0, bI)
  – Ad predictor

Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Distribution Approximation

Approximate p(x) with q(x), which belongs to the exponential family, such that: q(x) = h(x)g(η)exp{η^T u(x)}

KL(p||q) = −∫ p(x) ln[q(x)/p(x)] dx = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx
         = −∫ p(x) ln g(η) dx − ∫ p(x) η^T u(x) dx + const
         = −ln g(η) − η^T E_{p(x)}[u(x)] + const, where the const terms are independent of the natural parameter η

Minimize KL(p||q) by setting the gradient with respect to η to zero:
  ⇒ −∇ln g(η) = E_{p(x)}[u(x)]
By leveraging formula (2.226) in PRML:
  ⇒ E_{q(x)}[u(x)] = −∇ln g(η) = E_{p(x)}[u(x)]
Moment matching
• Moments of a distribution: the k-th moment M_k = ∫_a^b x^k f(x) dx
• It is called moment matching when q(x) is a Gaussian distribution; then u(x) = (x, x²)^T
  ⇒ ∫ q(x)x dx = ∫ p(x)x dx, and ∫ q(x)x² dx = ∫ p(x)x² dx
  ⇒ mean_{q(x)} = mean_{p(x)}, and
    variance_{q(x)} = ∫ q(x)x² dx − (mean_{q(x)})² = ∫ p(x)x² dx − (mean_{p(x)})² = variance_{p(x)}
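As a worked example of moment matching, a two-component Gaussian mixture can be collapsed to a single Gaussian by matching E[x] and E[x²]; the mixture parameters below are illustrative:

```python
# Moment-match a two-component Gaussian mixture with a single Gaussian
# by equating the first two moments.
w = [0.3, 0.7]       # mixture weights (illustrative)
mu = [-1.0, 2.0]     # component means
var = [1.0, 0.5]     # component variances

# E[x] = sum_i w_i * mu_i ;  E[x^2] = sum_i w_i * (var_i + mu_i^2)
mean = sum(wi * mi for wi, mi in zip(w, mu))
second = sum(wi * (vi + mi * mi) for wi, vi, mi in zip(w, var, mu))
variance = second - mean * mean

print(mean, variance)   # parameters of the matched Gaussian N(mean, variance)
```

Note the matched variance (2.54 here) is larger than either component's: moment matching inflates the variance to cover both modes, a characteristic difference from the mode-seeking behavior of minimizing KL(q||p).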
EXPECTATION PROPAGATION = Belief Propagation + Moment matching?
Key Idea
• Approximate each factor with a Gaussian distribution
• Approximate the corresponding factor pairs one by one?
• Approximate each factor in turn in the context of all remaining factors (proposed by Minka)

Refine factor f̃_j(θ) by ensuring q_new(θ) ∝ f̃_j(θ)q^{\j}(θ) is close to f_j(θ)q^{\j}(θ),
in which q^{\j}(θ) = q(θ) / f̃_j(θ)
EP: The detailed steps

1. Initialize all of the approximating factors f̃_i(θ).
2. Initialize the posterior approximation by setting q(θ) ∝ ∏_i f̃_i(θ).
3. Until convergence:
   (a) Choose a factor f̃_j(θ) to refine.
   (b) Remove f̃_j(θ) from the posterior by division: q^{\j}(θ) = q(θ) / f̃_j(θ).
   (c) Get the new posterior q_new(θ) by setting its sufficient statistics equal to those of f_j(θ)q^{\j}(θ)/Z_j
       (i.e. minimize KL( f_j(θ)q^{\j}(θ)/Z_j || q_new(θ) )), in which Z_j = ∫ f_j(θ)q^{\j}(θ) dθ.
   (d) Get the refined factor: f̃_j(θ) = Z_j q_new(θ) / q^{\j}(θ).
Example: The clutter problem
• Infer the mean of a Gaussian distribution
• Want to try MLE, but the exact posterior is intractable
• Approximate with q(θ) = N(θ|m, vI), and each factor f̃_n(θ) = N(θ|m_n, v_n I)
  – Approximate the mixture of Gaussians using a Gaussian

p(x|θ) = (1−w)N(x|θ, I) + wN(x|0, aI)
p(θ) = N(θ|0, bI)

Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Example: The clutter problem (2)
• Approximate a complex factor (e.g. a mixture of Gaussians) with a Gaussian

f_n(θ) in blue, f̃_n(θ) in red, and q^{\n}(θ) in green. Remember the variance of q^{\n}(θ) is usually very small, so f̃_n(θ) only needs to approximate f_n(θ) over a small range.

Note: the above 2 figures are from the book 'Pattern Recognition and Machine Learning'
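The EP loop from the 'detailed steps' slide can be sketched for a 1-D version of this clutter model. This is a minimal numerical sketch, not Minka's closed-form updates: moment matching in step (c) is done on a grid, and all constants (theta_true, w, a, b, n) are illustrative:

```python
import numpy as np

# 1-D clutter model: x ~ (1-w) N(theta, 1) + w N(0, a), prior theta ~ N(0, b).
rng = np.random.default_rng(0)
theta_true, w, a, b = 2.0, 0.2, 10.0, 100.0
n = 50
is_clutter = rng.random(n) < w
x = np.where(is_clutter,
             rng.normal(0.0, np.sqrt(a), n),    # background clutter
             rng.normal(theta_true, 1.0, n))    # informative points

grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]

def norm_pdf(t, m, v):
    return np.exp(-(t - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

def exact_factor(xn, t):
    # f_n(theta) = (1-w) N(x_n | theta, 1) + w N(x_n | 0, a)
    return (1.0 - w) * norm_pdf(xn, t, 1.0) + w * norm_pdf(xn, 0.0, a)

# Gaussian sites in natural parameters (precision tau, precision*mean nu);
# the Gaussian prior N(0, b) is kept exact.
tau = np.zeros(n)
nu = np.zeros(n)
for _ in range(10):                      # EP sweeps
    for i in range(n):
        tau_q = 1.0 / b + tau.sum()      # current posterior q
        nu_q = nu.sum()
        tau_c = tau_q - tau[i]           # (b) divide out site i -> cavity
        nu_c = nu_q - nu[i]
        cavity = norm_pdf(grid, nu_c / tau_c, 1.0 / tau_c)
        tilted = exact_factor(x[i], grid) * cavity
        tilted /= tilted.sum() * dx
        mean = (grid * tilted).sum() * dx              # (c) moment matching
        var = (grid ** 2 * tilted).sum() * dx - mean ** 2
        tau[i] = 1.0 / var - tau_c                     # (d) refined site
        nu[i] = mean / var - nu_c

post_mean = nu.sum() / (1.0 / b + tau.sum())
print(post_mean)   # should land near theta_true = 2.0
```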
Application: Bayesian CTR predictor for Bing
• See the details here
  – Inference step by step
  – Making predictions
• Some insights
  – The variance of each feature increases after every exposure
  – Samples with more features will have bigger variance
• Independence assumption for features
Experimentation
• The dataset is very inhomogeneous
• Performance
  – Other metrics
• Pros: speed, low parameter-tuning cost, online learning support, interpretability, support for adding more factors
• Cons: sparsity
• Code
Model   FTRL    OWLQN   Ad predictor
AUC     0.638   0.641   0.639
Application: Xbox skill rating system
• See details on pages 793–798 of Machine Learning: A Probabilistic Perspective

Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'
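The TrueSkill update for a single win/loss game is exactly a truncated-Gaussian moment-matching step, using the standard correction functions v(t) and w(t). A sketch with illustrative parameter values; see the paper for draws, teams, and skill dynamics:

```python
import math

def phi(t):    # standard normal pdf
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):    # standard normal cdf
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def v(t):      # mean correction for a Gaussian truncated at t
    return phi(t) / Phi(t)

def w(t):      # variance correction
    return v(t) * (v(t) + t)

def trueskill_update(mu_w, s2_w, mu_l, s2_l, beta=25.0 / 6.0):
    """Update (mean, variance) of winner and loser after one game
    (win/loss only, no draws). beta is the performance noise scale."""
    c2 = 2.0 * beta ** 2 + s2_w + s2_l
    c = math.sqrt(c2)
    t = (mu_w - mu_l) / c
    winner = (mu_w + (s2_w / c) * v(t), s2_w * (1.0 - (s2_w / c2) * w(t)))
    loser = (mu_l - (s2_l / c) * v(t), s2_l * (1.0 - (s2_l / c2) * w(t)))
    return winner, loser

# Two new players with the common prior N(25, (25/3)^2):
winner, loser = trueskill_update(25.0, (25.0 / 3) ** 2, 25.0, (25.0 / 3) ** 2)
print(winner, loser)   # winner's mean rises, loser's falls; both variances shrink
```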
Apply to all Bayesian models
• Infer.NET (Microsoft / Bishop)
  – A framework for running Bayesian inference in graphical models
  – Model-based machine learning
References
• Books
  – Chapters 2, 8, and 10 of Pattern Recognition and Machine Learning
  – Chapter 22 of Machine Learning: A Probabilistic Perspective
• Papers
  – A Family of Algorithms for Approximate Bayesian Inference
  – From Belief Propagation to Expectation Propagation
  – TrueSkill: A Bayesian Skill Rating System
  – Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine
• Roadmap for EP