Page 1: PAC-Bayesian Theorems for Gaussian Process Classifications

PAC-Bayesian Theorems for Gaussian Process Classifications

Matthias Seeger

University of Edinburgh

Page 2: PAC-Bayesian Theorems for Gaussian Process Classifications

Overview

PAC-Bayesian theorem for Gibbs classifiers
Application to Gaussian process classification
Experiments
Conclusions

Page 3: PAC-Bayesian Theorems for Gaussian Process Classifications

What Is a PAC Bound?

Algorithm: S → predictor of t_* from x_*

Generalisation error: gen(S). PAC / distribution-free bound:

Unknown distribution P*. Sample S = {(x_i, t_i) | i = 1, …, n}, drawn i.i.d. from P*.
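A sketch of the generic form of such a statement (notation assumed, not taken verbatim from the slide): for a confidence parameter \delta \in (0,1),

P_S\bigl\{\, \mathrm{gen}(S) \le \mathrm{emp}(S) + \varepsilon(n,\delta) \,\bigr\} \;\ge\; 1-\delta \quad \text{for every data distribution } P^*,

where \mathrm{emp}(S) is the training error and the complexity term \varepsilon(n,\delta) must not depend on P^*.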

Page 4: PAC-Bayesian Theorems for Gaussian Process Classifications

Nonuniform PAC Bounds

A PAC bound has to hold independently of the correctness of prior knowledge.

It does not have to be independent of prior knowledge.

Unfortunately, most standard VC bounds are only vaguely dependent on the prior/model they are applied to, and therefore lack tightness.

Page 5: PAC-Bayesian Theorems for Gaussian Process Classifications

Gibbs Classifiers

Bayes classifier:

Gibbs classifier:

New independent w for each prediction.

[Figure: graphical model in which the parameter w generates real-valued latent outputs y1, y2, y3, which in turn generate binary targets t1, t2, t3 ∈ {-1,+1}]
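A sketch of the two prediction rules in the form commonly used in the PAC-Bayesian setting (notation assumed: distribution Q(w) over parameters, latent function y_w, test input x_*):

\text{Bayes:}\;\; t_* = \mathrm{sgn}\,\mathbb{E}_{w\sim Q}\!\bigl[\mathrm{sgn}\,y_w(x_*)\bigr], \qquad
\text{Gibbs:}\;\; \text{draw } w \sim Q,\;\; t_* = \mathrm{sgn}\,y_w(x_*).

The Bayes classifier votes over all of Q; the Gibbs classifier uses a fresh, independent draw of w for every prediction.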

Page 6: PAC-Bayesian Theorems for Gaussian Process Classifications

PAC-Bayesian Theorem

Result for Gibbs classifiers.
Prior P(w), independent of S.
Posterior Q(w), may depend on S.
Expected generalisation error:

Expected empirical error:
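A sketch of the two quantities for the 0/1 loss (notation assumed, matching the Gibbs classifier above):

\mathrm{gen}(Q) = \mathbb{E}_{w\sim Q}\,\mathbb{E}_{(x,t)\sim P^*}\!\bigl[\mathrm{I}\{\mathrm{sgn}\,y_w(x) \ne t\}\bigr],
\qquad
\mathrm{emp}(S,Q) = \mathbb{E}_{w\sim Q}\!\Bigl[\tfrac{1}{n}\textstyle\sum_{i=1}^{n} \mathrm{I}\{\mathrm{sgn}\,y_w(x_i) \ne t_i\}\Bigr].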

Page 7: PAC-Bayesian Theorems for Gaussian Process Classifications

PAC-Bayesian Theorem (II)

McAllester (1999):

D[Q || P]: relative entropy. If Q(w) is a feasible approximation to the Bayesian posterior, we can compute D[Q || P].
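A sketch of the bound in the form McAllester's result is usually quoted (constants differ slightly between versions): with probability at least 1-\delta over the draw of S, simultaneously for all posteriors Q,

\mathrm{gen}(Q) \;\le\; \mathrm{emp}(S,Q) \;+\; \sqrt{\frac{D[Q\,\|\,P] + \ln\frac{1}{\delta} + \ln n + 2}{2n-1}}.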

Page 8: PAC-Bayesian Theorems for Gaussian Process Classifications

The Proof Idea

Step 1: Inequality for a dumb classifier

Let Δ(w) denote the deviation between the empirical and true error of a fixed classifier w (a precise form is sketched below). A large deviation bound holds for fixed w (use the Asymptotic Equipartition Property).

Since P(w) is independent of S, the bound also holds “on average”.
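A sketch of this step under an assumed notation (taking \Delta(w) to be the binary relative entropy between the empirical and true error of a fixed classifier w, as in the standard PAC-Bayesian proof):

\Delta(w) = D\bigl[\mathrm{emp}(S,w)\,\big\|\,\mathrm{gen}(w)\bigr], \qquad
\mathbb{E}_S\bigl[e^{\,n\,\Delta(w)}\bigr] \le n+1 \;\;\text{for fixed } w,

and since P(w) does not depend on S, also \mathbb{E}_S\,\mathbb{E}_{w\sim P}\bigl[e^{\,n\,\Delta(w)}\bigr] \le n+1.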

Page 9: PAC-Bayesian Theorems for Gaussian Process Classifications

The Proof Idea (II)

Could use Jensen's inequality:

But so what? P is fixed a priori, giving a pretty dumb classifier!

Can we exchange P for Q? Yes! What do we have to pay? n^{-1} D[Q || P].

Page 10: PAC-Bayesian Theorems for Gaussian Process Classifications

Convex Duality

Could finish the proof using tricks and Jensen. Let's see what's behind it instead!

Convex (Legendre) duality: a very simple but powerful concept. Parameterise linear lower bounds to a convex function.

Behind the scenes (almost) everywhere: EM, variational bounds, primal-dual optimisation, …, the PAC-Bayesian theorem.
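The standard statement behind this, for reference: a convex function f and its conjugate f^* satisfy

f^*(\lambda) = \sup_x\,\bigl[\lambda x - f(x)\bigr], \qquad f(x) = \sup_\lambda\,\bigl[\lambda x - f^*(\lambda)\bigr],

so every choice of \lambda yields a linear lower bound \lambda x - f^*(\lambda) \le f(x), tight at one point.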

Page 11: PAC-Bayesian Theorems for Gaussian Process Classifications

Convex Duality (II)

Page 12: PAC-Bayesian Theorems for Gaussian Process Classifications

Convex Duality (III)

Page 13: PAC-Bayesian Theorems for Gaussian Process Classifications

The Proof Idea (III)

Works just as well for spaces of functions and distributions.

For our purpose, the relevant functional (sketched below) is convex and has the relative entropy as its dual.
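A sketch of the missing pair, assuming the duality standardly used in PAC-Bayesian proofs: the functional \varphi \mapsto \log\mathbb{E}_{w\sim P}\bigl[e^{\varphi(w)}\bigr] is convex, and its dual is the relative entropy D[Q\,\|\,P], so that

\log\mathbb{E}_{w\sim P}\bigl[e^{\varphi(w)}\bigr] = \sup_Q\Bigl\{\mathbb{E}_{w\sim Q}[\varphi(w)] - D[Q\,\|\,P]\Bigr\},
\qquad\text{hence}\qquad
\mathbb{E}_{w\sim Q}[\varphi(w)] \;\le\; D[Q\,\|\,P] + \log\mathbb{E}_{w\sim P}\bigl[e^{\varphi(w)}\bigr] \quad\text{for all } Q.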

Page 14: PAC-Bayesian Theorems for Gaussian Process Classifications

The Proof Idea (IV)

This gives the bound, for all Q. Set φ(w) = n Δ(w). Then:

We have already bounded the second term on the right; on the left, apply Jensen again.
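Putting the pieces together, a sketch of the resulting chain (with the assumed notation \Delta(w) = D[\mathrm{emp}(S,w)\,\|\,\mathrm{gen}(w)]): applying Markov's inequality to the averaged large deviation bound, with probability at least 1-\delta over S, simultaneously for all Q,

n\,D\bigl[\mathbb{E}_Q[\mathrm{emp}]\,\big\|\,\mathbb{E}_Q[\mathrm{gen}]\bigr]
\;\le\; \mathbb{E}_Q\bigl[n\,\Delta(w)\bigr]
\;\le\; D[Q\,\|\,P] + \log\mathbb{E}_{w\sim P}\bigl[e^{\,n\,\Delta(w)}\bigr]
\;\le\; D[Q\,\|\,P] + \log\frac{n+1}{\delta},

where the first inequality is Jensen's inequality applied to the jointly convex binary relative entropy.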

Page 15: PAC-Bayesian Theorems for Gaussian Process Classifications

Comments

The PAC-Bayesian technique is generic: use specific large deviation bounds for the Q-independent term.

Choice of Q: trade-off between emp(S,Q) and the divergence D[Q || P]. The Bayesian posterior is a good candidate.

Page 16: PAC-Bayesian Theorems for Gaussian Process Classifications

Gaussian Process Classification

Recall yesterday: we approximate the true posterior process by a Gaussian one:
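A sketch with assumed notation, matching common treatments of approximate GP classification: writing u = (y(x_1), \dots, y(x_n)) for the latent values at the training inputs, the prior and the Gaussian posterior approximation are

P(u) = N(u \mid 0, K), \qquad Q(u) = N(u \mid m, A),

where K is the kernel matrix and the mean m and covariance A are produced by the approximate inference method.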

Page 17: PAC-Bayesian Theorems for Gaussian Process Classifications

The Relative Entropy

But then the relative entropy is just:

Straightforward to compute for all GPC approximations in this class.
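Assuming Q and P are the finite-dimensional Gaussians above, the relative entropy has the standard closed form

D[Q\,\|\,P] = \tfrac{1}{2}\Bigl(\mathrm{tr}(K^{-1}A) + m^{\top}K^{-1}m - n + \ln\det K - \ln\det A\Bigr).

A minimal numerical sketch (the helper name is hypothetical):

import numpy as np

def gaussian_kl(m, A, K):
    # D[N(m, A) || N(0, K)] for n-dimensional Gaussians.
    # m: (n,) mean; A, K: (n, n) positive definite covariance matrices.
    n = m.shape[0]
    L_K = np.linalg.cholesky(K)
    L_A = np.linalg.cholesky(A)
    trace_term = np.trace(np.linalg.solve(K, A))   # tr(K^{-1} A)
    maha_term = m @ np.linalg.solve(K, m)          # m^T K^{-1} m
    logdet_K = 2.0 * np.sum(np.log(np.diag(L_K)))
    logdet_A = 2.0 * np.sum(np.log(np.diag(L_A)))
    return 0.5 * (trace_term + maha_term - n + logdet_K - logdet_A)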

Page 18: PAC-Bayesian Theorems for Gaussian Process Classifications

Concrete GPC Methods

We considered so far:
Laplace GPC [Barber/Williams]
Sparse greedy GPC (IVM) [Csato/Opper, Lawrence/Seeger/Herbrich]

Setup: downsampled MNIST (2s vs. 3s). RBF kernels. Model selection using independent holdout sets (no ML-II allowed here!)

Page 19: PAC-Bayesian Theorems for Gaussian Process Classifications

Results for Laplace GPC

Page 20: PAC-Bayesian Theorems for Gaussian Process Classifications

Results for Sparse Greedy GPC

Extremely tight for a kernel classifier bound.

Note: These results are for Gibbs classifiers. Bayes classifiers do better, but the (original) PAC-Bayesian theorem does not hold for them.

Page 21: PAC-Bayesian Theorems for Gaussian Process Classifications

Comparison: Compression Bound

Compression bound for sparse greedy GPC (Bayes version, not Gibbs).

Problem: the bound is not configurable by prior knowledge and is not specific to the algorithm.

Page 22: PAC-Bayesian Theorems for Gaussian Process Classifications

Comparison With SVM

Compression bound (the best we could find!).

Note: Bound values are lower than for sparse GPC only because of the sparser solution: the bound does not depend on the algorithm!

Page 23: PAC-Bayesian Theorems for Gaussian Process Classifications

Model Selection

Page 24: PAC-Bayesian Theorems for Gaussian Process Classifications

The Bayes Classifier

Very recently, Meir and Zhang obtained a PAC-Bayesian bound for Bayes-type classifiers.

It uses recent Rademacher complexity bounds together with a convex duality argument.

It can be applied to GP classification as well (not yet done).

Page 25: PAC-Bayesian Theorems for Gaussian Process Classifications

Conclusions

The PAC-Bayesian technique (convex duality) leads to tighter bounds than previously available for Bayes-type classifiers (to our knowledge).

Easy extension to multi-class scenarios.

Application to GP classification: tighter bounds than previously available for kernel machines (to our knowledge).

Page 26: PAC-Bayesian Theorems for Gaussian Process Classifications

Conclusions (II)

Value in practice: the bound holds for any posterior approximation, not just the true posterior itself.

Some open problems:
Unbounded loss functions
Characterize the slack in the bound
Incorporating ML-II model selection over continuous hyperparameter space