Page 1: PAC-Bayesian Theorems for Gaussian Process Classifications

PAC-Bayesian Theorems for Gaussian Process Classifications

Matthias Seeger

University of Edinburgh

Page 2: PAC-Bayesian Theorems for Gaussian Process Classifications

Overview

PAC-Bayesian theorem for Gibbs classifiers
Application to Gaussian process classification
Experiments
Conclusions

Page 3: PAC-Bayesian Theorems for Gaussian Process Classifications

What Is a PAC Bound?

Algorithm: S → predictor of t_* from x_*

Generalisation error: gen(S). PAC / distribution-free bound:

Unknown distribution P*. Sample S = {(x_i, t_i) | i = 1, …, n}, drawn i.i.d. from P*.
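A sketch of the generic form of such a statement (notation assumed, not taken verbatim from the slide): for a confidence parameter \delta \in (0,1),

P_S\bigl\{\, \mathrm{gen}(S) \le \mathrm{emp}(S) + \varepsilon(n,\delta) \,\bigr\} \;\ge\; 1-\delta \quad \text{for every data distribution } P^*,

where \mathrm{emp}(S) is the training error and the complexity term \varepsilon(n,\delta) must not depend on P^*.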

Page 4: PAC-Bayesian Theorems for Gaussian Process Classifications

Nonuniform PAC Bounds

A PAC bound has to hold independently of the correctness of prior knowledge.

It does not have to be independent of prior knowledge.

Unfortunately, most standard VC bounds are only vaguely dependent on the prior/model they are applied to, and therefore lack tightness.

Page 5: PAC-Bayesian Theorems for Gaussian Process Classifications

Gibbs Classifiers

Bayes classifier:

Gibbs classifier:

New independent w for each prediction.

[Figure: graphical model in which the parameter w generates real-valued latent outputs y1, y2, y3, which in turn generate binary targets t1, t2, t3 ∈ {-1,+1}]
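A sketch of the two prediction rules in the form commonly used in the PAC-Bayesian setting (notation assumed: distribution Q(w) over parameters, latent function y_w, test input x_*):

\text{Bayes:}\;\; t_* = \mathrm{sgn}\,\mathbb{E}_{w\sim Q}\!\bigl[\mathrm{sgn}\,y_w(x_*)\bigr], \qquad
\text{Gibbs:}\;\; \text{draw } w \sim Q,\;\; t_* = \mathrm{sgn}\,y_w(x_*).

The Bayes classifier votes over all of Q; the Gibbs classifier uses a fresh, independent draw of w for every prediction.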

Page 6: PAC-Bayesian Theorems for Gaussian Process Classifications

PAC-Bayesian Theorem

Result for Gibbs classifiers.
Prior P(w), independent of S.
Posterior Q(w), may depend on S.
Expected generalisation error:

Expected empirical error:
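A sketch of the two quantities for the 0/1 loss (notation assumed, matching the Gibbs classifier above):

\mathrm{gen}(Q) = \mathbb{E}_{w\sim Q}\,\mathbb{E}_{(x,t)\sim P^*}\!\bigl[\mathrm{I}\{\mathrm{sgn}\,y_w(x) \ne t\}\bigr],
\qquad
\mathrm{emp}(S,Q) = \mathbb{E}_{w\sim Q}\!\Bigl[\tfrac{1}{n}\textstyle\sum_{i=1}^{n} \mathrm{I}\{\mathrm{sgn}\,y_w(x_i) \ne t_i\}\Bigr].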

Page 7: PAC-Bayesian Theorems for Gaussian Process Classifications

PAC-Bayesian Theorem (II)

McAllester (1999):

D[Q || P]: relative entropy. If Q(w) is a feasible approximation to the Bayesian posterior, we can compute D[Q || P].
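A sketch of the bound in the form McAllester's result is usually quoted (constants differ slightly between versions): with probability at least 1-\delta over the draw of S, simultaneously for all posteriors Q,

\mathrm{gen}(Q) \;\le\; \mathrm{emp}(S,Q) \;+\; \sqrt{\frac{D[Q\,\|\,P] + \ln\frac{1}{\delta} + \ln n + 2}{2n-1}}.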

Page 8: PAC-Bayesian Theorems for Gaussian Process Classifications

The Proof Idea

Step 1: Inequality for a dumb classifier

Let Δ(w) denote the deviation between the empirical and true error of a fixed classifier w (a precise form is sketched below). A large deviation bound holds for fixed w (use the Asymptotic Equipartition Property).

Since P(w) is independent of S, the bound also holds “on average”.
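A sketch of this step under an assumed notation (taking \Delta(w) to be the binary relative entropy between the empirical and true error of a fixed classifier w, as in the standard PAC-Bayesian proof):

\Delta(w) = D\bigl[\mathrm{emp}(S,w)\,\big\|\,\mathrm{gen}(w)\bigr], \qquad
\mathbb{E}_S\bigl[e^{\,n\,\Delta(w)}\bigr] \le n+1 \;\;\text{for fixed } w,

and since P(w) does not depend on S, also \mathbb{E}_S\,\mathbb{E}_{w\sim P}\bigl[e^{\,n\,\Delta(w)}\bigr] \le n+1.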

Page 9: PAC-Bayesian Theorems for Gaussian Process Classifications

The Proof Idea (II)

Could use Jensen's inequality:

But so what? P is fixed a priori, giving a pretty dumb classifier!

Can we exchange P for Q? Yes! What do we have to pay? n^{-1} D[Q || P].

Page 10: PAC-Bayesian Theorems for Gaussian Process Classifications

Convex Duality

Could finish the proof using tricks and Jensen. Let's see what's behind it instead!

Convex (Legendre) duality: a very simple but powerful concept. Parameterise linear lower bounds to a convex function.

Behind the scenes (almost) everywhere: EM, variational bounds, primal-dual optimisation, …, the PAC-Bayesian theorem.
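The standard statement behind this, for reference: a convex function f and its conjugate f^* satisfy

f^*(\lambda) = \sup_x\,\bigl[\lambda x - f(x)\bigr], \qquad f(x) = \sup_\lambda\,\bigl[\lambda x - f^*(\lambda)\bigr],

so every choice of \lambda yields a linear lower bound \lambda x - f^*(\lambda) \le f(x), tight at one point.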

Page 11: PAC-Bayesian Theorems for Gaussian Process Classifications

Convex Duality (II)

Page 12: PAC-Bayesian Theorems for Gaussian Process Classifications

Convex Duality (III)

Page 13: PAC-Bayesian Theorems for Gaussian Process Classifications

The Proof Idea (III)

Works just as well for spaces of functions and distributions.

For our purpose, the relevant functional (sketched below) is convex and has the relative entropy as its dual.
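A sketch of the missing pair, assuming the duality standardly used in PAC-Bayesian proofs: the functional \varphi \mapsto \log\mathbb{E}_{w\sim P}\bigl[e^{\varphi(w)}\bigr] is convex, and its dual is the relative entropy D[Q\,\|\,P], so that

\log\mathbb{E}_{w\sim P}\bigl[e^{\varphi(w)}\bigr] = \sup_Q\Bigl\{\mathbb{E}_{w\sim Q}[\varphi(w)] - D[Q\,\|\,P]\Bigr\},
\qquad\text{hence}\qquad
\mathbb{E}_{w\sim Q}[\varphi(w)] \;\le\; D[Q\,\|\,P] + \log\mathbb{E}_{w\sim P}\bigl[e^{\varphi(w)}\bigr] \quad\text{for all } Q.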

Page 14: PAC-Bayesian Theorems for Gaussian Process Classifications

The Proof Idea (IV)

This gives the bound, for all Q. Set φ(w) = n Δ(w). Then:

We have already bounded the second term on the right; on the left, apply Jensen again.
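Putting the pieces together, a sketch of the resulting chain (with the assumed notation \Delta(w) = D[\mathrm{emp}(S,w)\,\|\,\mathrm{gen}(w)]): applying Markov's inequality to the averaged large deviation bound, with probability at least 1-\delta over S, simultaneously for all Q,

n\,D\bigl[\mathbb{E}_Q[\mathrm{emp}]\,\big\|\,\mathbb{E}_Q[\mathrm{gen}]\bigr]
\;\le\; \mathbb{E}_Q\bigl[n\,\Delta(w)\bigr]
\;\le\; D[Q\,\|\,P] + \log\mathbb{E}_{w\sim P}\bigl[e^{\,n\,\Delta(w)}\bigr]
\;\le\; D[Q\,\|\,P] + \log\frac{n+1}{\delta},

where the first inequality is Jensen's inequality applied to the jointly convex binary relative entropy.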

Page 15: PAC-Bayesian Theorems for Gaussian Process Classifications

Comments

The PAC-Bayesian technique is generic: use specific large deviation bounds for the Q-independent term.

Choice of Q: trade-off between emp(S,Q) and the divergence D[Q || P]. The Bayesian posterior is a good candidate.

Page 16: PAC-Bayesian Theorems for Gaussian Process Classifications

Gaussian Process Classification

Recall yesterday: we approximate the true posterior process by a Gaussian one:
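A sketch with assumed notation, matching common treatments of approximate GP classification: writing u = (y(x_1), \dots, y(x_n)) for the latent values at the training inputs, the prior and the Gaussian posterior approximation are

P(u) = N(u \mid 0, K), \qquad Q(u) = N(u \mid m, A),

where K is the kernel matrix and the mean m and covariance A are produced by the approximate inference method.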

Page 17: PAC-Bayesian Theorems for Gaussian Process Classifications

The Relative Entropy

But then the relative entropy is just:

Straightforward to compute for all GPC approximations in this class.
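Assuming Q and P are the finite-dimensional Gaussians above, the relative entropy has the standard closed form

D[Q\,\|\,P] = \tfrac{1}{2}\Bigl(\mathrm{tr}(K^{-1}A) + m^{\top}K^{-1}m - n + \ln\det K - \ln\det A\Bigr).

A minimal numerical sketch (the helper name is hypothetical):

import numpy as np

def gaussian_kl(m, A, K):
    # D[N(m, A) || N(0, K)] for n-dimensional Gaussians.
    # m: (n,) mean; A, K: (n, n) positive definite covariance matrices.
    n = m.shape[0]
    L_K = np.linalg.cholesky(K)
    L_A = np.linalg.cholesky(A)
    trace_term = np.trace(np.linalg.solve(K, A))   # tr(K^{-1} A)
    maha_term = m @ np.linalg.solve(K, m)          # m^T K^{-1} m
    logdet_K = 2.0 * np.sum(np.log(np.diag(L_K)))
    logdet_A = 2.0 * np.sum(np.log(np.diag(L_A)))
    return 0.5 * (trace_term + maha_term - n + logdet_K - logdet_A)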

Page 18: PAC-Bayesian Theorems for Gaussian Process Classifications

Concrete GPC Methods

We considered so far:
Laplace GPC [Barber/Williams]
Sparse greedy GPC (IVM) [Csato/Opper, Lawrence/Seeger/Herbrich]

Setup: downsampled MNIST (2s vs. 3s). RBF kernels. Model selection using independent holdout sets (no ML-II allowed here!)

Page 19: PAC-Bayesian Theorems for Gaussian Process Classifications

Results for Laplace GPC

Page 20: PAC-Bayesian Theorems for Gaussian Process Classifications

Results for Sparse Greedy GPC

Extremely tight for a kernel classifier bound.

Note: These results are for Gibbs classifiers. Bayes classifiers do better, but the (original) PAC-Bayesian theorem does not hold for them.

Page 21: PAC-Bayesian Theorems for Gaussian Process Classifications

Comparison: Compression Bound

Compression bound for sparse greedy GPC (Bayes version, not Gibbs).

Problem: the bound is not configurable by prior knowledge and is not specific to the algorithm.

Page 22: PAC-Bayesian Theorems for Gaussian Process Classifications

Comparison With SVM

Compression bound (the best we could find!).

Note: Bound values are lower than for sparse GPC only because of the sparser solution: the bound does not depend on the algorithm!

Page 23: PAC-Bayesian Theorems for Gaussian Process Classifications

Model Selection

Page 24: PAC-Bayesian Theorems for Gaussian Process Classifications

The Bayes Classifier

Very recently, Meir and Zhang obtained a PAC-Bayesian bound for Bayes-type classifiers.

It uses recent Rademacher complexity bounds together with a convex duality argument.

It can be applied to GP classification as well (not yet done).

Page 25: PAC-Bayesian Theorems for Gaussian Process Classifications

Conclusions

The PAC-Bayesian technique (convex duality) leads to tighter bounds than previously available for Bayes-type classifiers (to our knowledge).

Easy extension to multi-class scenarios.

Application to GP classification: tighter bounds than previously available for kernel machines (to our knowledge).

Page 26: PAC-Bayesian Theorems for Gaussian Process Classifications

Conclusions (II)

Value in practice: the bound holds for any posterior approximation, not just the true posterior itself.

Some open problems:
Unbounded loss functions
Characterize the slack in the bound
Incorporating ML-II model selection over continuous hyperparameter space