PAC-Bayesian Theorems for Gaussian Process Classifications
Matthias Seeger
University of Edinburgh
Overview
PAC-Bayesian theorem for Gibbs classifiers
Application to Gaussian process classification
Experiments
Conclusions
What Is a PAC Bound?
Sample S = {(x_i, t_i) | i = 1, …, n}, drawn i.i.d. from an unknown distribution P*
Algorithm: S → predictor of t* from x*
Generalisation error: gen(S)
PAC / distribution-free bound: must hold whatever the unknown P* is
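The bound displayed on this slide did not survive extraction; a generic PAC bound has the following shape (a sketch only; the concrete complexity term ε depends on the particular theorem):

```latex
% For any data distribution P^*, with probability at least 1 - delta
% over the i.i.d. sample S:
\Pr_{S \sim (P^*)^n}\Big[\, \mathrm{gen}(S) \;\le\; \mathrm{emp}(S) + \varepsilon(n, \delta) \,\Big] \;\ge\; 1 - \delta
```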
Nonuniform PAC Bounds
A PAC bound has to hold independent of the correctness of prior knowledge.
It does not have to be independent of prior knowledge.
Unfortunately, most standard VC bounds are only vaguely dependent on the prior/model they are applied to, and therefore lack tightness.
Gibbs Classifiers
Bayes classifier:
Gibbs classifier:
A new, independent w is drawn for each prediction.
(Diagram: weight vector w ∈ R^3 generates latent values y1, y2, y3, which in turn generate the labels t1, t2, t3 ∈ {-1, +1}.)
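To make the distinction concrete, here is a minimal numpy sketch (the posterior Q and all numbers are invented for illustration): the Bayes classifier averages sign predictions under Q and takes the majority vote, while the Gibbs classifier draws a fresh w from Q for every single prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior Q over linear weights w in R^2 (illustrative only).
mu = np.array([1.0, -0.5])
cov = 0.1 * np.eye(2)

def bayes_predict(x, n_samples=1000):
    # Bayes classifier: average the sign predictions under Q, then
    # take the majority sign.
    w = rng.multivariate_normal(mu, cov, size=n_samples)
    return np.sign(np.mean(np.sign(w @ x)))

def gibbs_predict(x):
    # Gibbs classifier: a new, independent w ~ Q for each prediction.
    w = rng.multivariate_normal(mu, cov)
    return np.sign(w @ x)

x = np.array([1.0, 0.0])
print(bayes_predict(x), gibbs_predict(x))
```

The Gibbs prediction is random even for a fixed input, which is exactly what makes its expected error tractable in the PAC-Bayesian analysis.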
PAC-Bayesian Theorem
Result for Gibbs classifiers
Prior P(w), independent of S
Posterior Q(w), may depend on S
Expected generalisation error: gen(Q)
Expected empirical error: emp(S, Q)
PAC-Bayesian Theorem (II)
McAllester (1999):
D[Q || P]: relative entropy (KL divergence).
If Q(w) is a feasible approximation to the Bayesian posterior, we can compute D[Q || P].
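The statement of the theorem was lost in extraction; McAllester's 1999 PAC-Bayesian bound is commonly quoted in the following form (a standard version, whose constants may differ slightly from the slide): with probability at least 1 − δ over S, simultaneously for all posteriors Q,

```latex
\mathrm{gen}(Q) \;\le\; \mathrm{emp}(S, Q)
  \;+\; \sqrt{\frac{D[Q \,\|\, P] + \ln\frac{n}{\delta} + 2}{2n - 1}}
```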
The Proof Idea
Step 1: An inequality for a dumb classifier.
Let Δ(w) be the relevant deviation for fixed w. A large deviation bound holds for each fixed w (use the Asymptotic Equipartition Property).
Since P(w) is independent of S, the bound also holds "on average".
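One standard instantiation of this "on average" step (the slide's exact statement is lost; this is the form used in later PAC-Bayesian work):

```latex
% Large deviations for fixed w, then averaged over the S-independent prior:
\mathbb{E}_S\,\mathbb{E}_{w \sim P}\!\left[ e^{\,n\,\Delta(w)} \right] \;\le\; c(n)
\quad\text{(e.g. } c(n) = n + 1 \text{ when } \Delta \text{ is the binary relative entropy)}
```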
The Proof Idea (II)
Could use Jensen's inequality:
But so what? P is fixed a priori, giving a pretty dumb classifier!
Can we exchange P for Q? Yes!
What do we have to pay? n^{-1} D[Q || P]
Convex Duality
Could finish the proof using tricks and Jensen. Let's see what's behind it instead!
Convex (Legendre) duality: a very simple but powerful concept. Parameterise the linear lower bounds to a convex function.
Behind the scenes (almost) everywhere: EM, variational bounds, primal-dual optimisation, …, the PAC-Bayesian theorem.
Convex Duality (II)
Convex Duality (III)
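The pictures on these two slides are missing from the extraction; the concept they illustrate is the standard Legendre pair for a convex function f, in which each dual variable λ parameterises one linear lower bound:

```latex
f^*(\lambda) \;=\; \sup_x \big( \lambda x - f(x) \big),
\qquad
f(x) \;=\; \sup_\lambda \big( \lambda x - f^*(\lambda) \big)
```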
The Proof Idea (III)
Works just as well for spaces of functions and distributions.
For our purpose:
is convex and has the dual
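The two expressions the slide refers to did not survive extraction; the standard dual pair used in PAC-Bayesian proofs is the log moment generating functional and its Legendre dual, the relative entropy:

```latex
\Lambda(\varphi) \;=\; \ln \mathbb{E}_{w \sim P}\!\left[ e^{\varphi(w)} \right]
\;=\; \sup_{Q} \Big( \mathbb{E}_{w \sim Q}[\varphi(w)] \;-\; D[Q \,\|\, P] \Big)
```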
The Proof Idea (IV)
This gives the bound
for all Q. Set φ(w) = n Δ(w). Then:
We have already bounded the second term on the right. On the left (Jensen again):
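A sketch of the missing algebra: the duality inequality E_Q[φ(w)] ≤ D[Q || P] + ln E_P[e^{φ(w)}], applied with φ(w) = nΔ(w), gives

```latex
\mathbb{E}_{w \sim Q}[\Delta(w)]
\;\le\; \frac{1}{n}\Big( D[Q \,\|\, P] \;+\; \ln \mathbb{E}_{w \sim P}\big[ e^{\,n \Delta(w)} \big] \Big)
```

The second term on the right is controlled by the large deviation bound, and Jensen's inequality moves the expectation over Q inside Δ on the left, yielding the theorem.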
Comments
The PAC-Bayesian technique is generic: use specific large deviation bounds for the Q-independent term.
Choice of Q: a trade-off between emp(S, Q) and the divergence D[Q || P]. The Bayesian posterior is a good candidate.
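The trade-off can be made concrete with a small sketch: plugging an expected empirical Gibbs error and a divergence value into a McAllester-style bound (a standard form; the talk's exact constants may differ, and the numbers below are invented):

```python
import math

def pac_bayes_bound(emp, kl, n, delta=0.05):
    """McAllester-style PAC-Bayes bound on the expected Gibbs error.

    emp: expected empirical error emp(S, Q); kl: divergence D[Q || P];
    n: sample size; delta: confidence parameter.
    """
    return emp + math.sqrt((kl + math.log(n / delta) + 2.0) / (2.0 * n - 1.0))

# A Q close to the prior: small KL, but larger empirical error.
print(pac_bayes_bound(emp=0.10, kl=20.0, n=5000))
# A Q fitted harder to the data: larger KL, smaller empirical error.
print(pac_bayes_bound(emp=0.03, kl=800.0, n=5000))
```

Whichever Q minimises the sum of the two terms gives the tighter certificate, which is why a (well-approximated) Bayesian posterior is a natural choice.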
Gaussian Process Classification
Recall yesterday: we approximate the true posterior process by a Gaussian one:
The Relative Entropy
But then the relative entropy is just:
Straightforward to compute for all GPC approximations in this class.
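For a Gaussian approximate posterior Q = N(mu_q, Sigma_q) and a Gaussian prior P = N(mu_p, Sigma_p), the relative entropy has a standard closed form, sketched below for the finite-dimensional case (function and variable names are mine, not from the talk):

```python
import numpy as np

def gauss_kl(mu_q, cov_q, mu_p, cov_p):
    # D[N(mu_q, cov_q) || N(mu_p, cov_p)] in nats (standard closed form).
    d = mu_q.shape[0]
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mu_p - mu_q
    _, logdet_p = np.linalg.slogdet(cov_p)  # log-determinants, numerically
    _, logdet_q = np.linalg.slogdet(cov_q)  # safer than np.linalg.det
    return 0.5 * (np.trace(cov_p_inv @ cov_q)
                  + diff @ cov_p_inv @ diff
                  - d + logdet_p - logdet_q)

# Identical distributions give zero divergence.
print(gauss_kl(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2)))  # → 0.0
```

Because every GPC approximation in this class produces exactly such a Gaussian Q against a Gaussian GP prior, the D[Q || P] term of the bound is available in closed form.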
Concrete GPC Methods
We considered so far:
Laplace GPC [Barber/Williams]
Sparse greedy GPC (IVM) [Csato/Opper, Lawrence/Seeger/Herbrich]
Setup: downsampled MNIST (2s vs. 3s), RBF kernels, model selection using independent holdout sets (no ML-II allowed here!).
Results for Laplace GPC
Results Sparse Greedy GPC
Extremely tight for a kernel classifier bound.
Note: these results are for Gibbs classifiers. Bayes classifiers do better, but the (original) PAC-Bayesian theorem does not hold for them.
Comparison Compression Bound
Compression bound for sparse greedy GPC (Bayes version, not Gibbs).
Problem: the bound is not configurable by prior knowledge and is not specific to the algorithm.
Comparison With SVM
Compression bound (the best we could find!)
Note: the bound values are lower than for sparse GPC only because of the sparser solution. The bound does not depend on the algorithm!
Model Selection
The Bayes Classifier
Very recently, Meir and Zhang obtained a PAC-Bayesian bound for Bayes-type classifiers.
It uses recent Rademacher complexity bounds together with a convex duality argument.
It can be applied to GP classification as well (not yet done).
Conclusions
The PAC-Bayesian technique (convex duality) leads to tighter bounds than previously available for Bayes-type classifiers (to our knowledge).
Easy extension to multi-class scenarios.
Application to GP classification: tighter bounds than previously available for kernel machines (to our knowledge).
Conclusions (II)
Value in practice: the bound holds for any posterior approximation, not just the true posterior itself.
Some open problems:
Unbounded loss functions
Characterising the slack in the bound
Incorporating ML-II model selection over continuous hyperparameter spaces