Suboptimality of Bayes and MDL in Classification
Peter Grünwald, CWI/EURANDOM, www.grunwald.nl
Joint work with John Langford, TTI Chicago. Preliminary version appeared in Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004).


Page 1

Suboptimality of Bayes and MDL in Classification

Peter Grünwald

CWI/EURANDOM

www.grunwald.nl

joint work with John Langford, TTI Chicago,

Preliminary version appeared in Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004)

Page 2

Our Result

• We study Bayesian and Minimum Description Length (MDL) inference in classification problems

• Bayes and MDL should automatically deal with overfitting

• We show there exist classification domains where Bayes and MDL, when applied in a standard manner, perform suboptimally (overfit!) even as the sample size tends to infinity

Page 3

Why is this interesting?

• Practical viewpoint:
  – Bayesian methods
    • used a lot in practice
    • sometimes claimed to be ‘universally optimal’
  – MDL methods
    • even designed to deal with overfitting
  – Yet MDL and Bayes can ‘fail’ even with infinite data

• Theoretical viewpoint:
  – How can the result be reconciled with various strong Bayesian consistency theorems?

Page 4

Menu

1. Classification

2. Abstract statement of main result

3. Precise statement of result

4. Discussion

Page 5

Classification

• Given:
  – Feature space
  – Label space
  – Sample
  – Set of hypotheses (classifiers)

• Goal: find a classifier that makes few mistakes on future data from the same source
  – We say ‘c has small generalization error/classification risk’

Page 6

Classification Models

• Types of classifiers:
  1. hard classifiers (-1/+1 output) — initial focus
     • decision trees, stumps, forests
  2. soft classifiers (real-valued output)
     • support vector machines
     • neural networks
  3. probabilistic classifiers
     • Naïve Bayes/Bayesian network classifiers
     • logistic regression

Page 7

Generalization Error

• As is customary in statistical learning theory, we analyze classification by postulating some (unknown) distribution D on the joint (input, label) space

• The performance of a classifier c is measured by its generalization error (classification risk), defined as err_D(c) = P_{(X,Y)~D}( c(X) ≠ Y )
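A minimal sketch of this definition in Python: since D is unknown in practice, the risk is estimated by a Monte Carlo average over i.i.d. draws. The source distribution below is an illustrative stand-in, not one from the talk.

```python
import random

def classification_risk(classifier, sample):
    """Empirical 0/1-risk: the fraction of (x, y) pairs the classifier
    gets wrong. With i.i.d. draws from D this is a Monte Carlo estimate
    of the generalization error err_D(c) = P_{(X,Y)~D}(c(X) != Y)."""
    mistakes = sum(1 for x, y in sample if classifier(x) != y)
    return mistakes / len(sample)

# Illustrative source: Y is a fair coin, the feature x equals Y with
# probability 0.8, so the classifier c(x) = x has true risk 0.2.
random.seed(0)
def draw():
    y = random.choice([-1, +1])
    x = y if random.random() < 0.8 else -y
    return x, y

sample = [draw() for _ in range(100_000)]
est = classification_risk(lambda x: x, sample)
print(round(est, 2))  # close to 0.2
```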

Page 8

Learning Algorithms

• A learning algorithm LA, based on a set of candidate classifiers, is a function that, for each sample S of arbitrary length, outputs a classifier LA(S)

Page 9

Consistent Learning Algorithms

• Suppose the examples are i.i.d.
• A learning algorithm is consistent or asymptotically optimal if, no matter what the ‘true’ distribution D is, the generalization error of the learned classifier converges to that of the best classifier in the candidate set, in D-probability, as the sample size tends to infinity.

Page 10

Consistent Learning Algorithms

• Suppose are i.i.d.• A learning algorithm is consistent or asymptotically

optimal if, no matter what the ‘true’ distribution D is,

in D – probability, as .

‘learned’ classifier where is ‘best’ classifier in

Page 11

Main Result

• There exists
  – an input domain
  – a prior P, non-zero on a countable set of classifiers
  – a ‘true’ distribution D
  – a constant K > 0

such that the Bayesian learning algorithm is asymptotically K-suboptimal: its generalization error exceeds that of the best classifier in the set by at least K, no matter how large the sample.

Page 12

Main Result

• There exists
  – an input domain
  – a prior P, non-zero on a countable set of classifiers
  – a ‘true’ distribution D
  – a constant K > 0

such that the Bayesian learning algorithm is asymptotically K-suboptimal: its generalization error exceeds that of the best classifier in the set by at least K, no matter how large the sample.

• The same holds for the MDL learning algorithm

Page 13

Remainder of Talk

1. How is “Bayes learning algorithm” defined?

2. What is the scenario?
   • what do the classifier set, the ‘true’ distribution D, and the prior P look like?

3. How dramatic is the result?
   • How large is K?
   • How strange are the choices made in the construction?

4. How bad can Bayes get?

5. Why is the result surprising?
   • can it be reconciled with Bayesian consistency results?

Page 14

Bayesian Learning of Classifiers

• Problem: Bayesian inference is defined for models that are sets of probability distributions

• In our scenario, models are sets of classifiers, i.e. functions from the feature space to the label space

• How can we find a posterior over classifiers using Bayes’ rule?

• Standard answer: convert each classifier to a corresponding distribution and apply Bayes to the set of distributions thus obtained

Page 15

From classifiers to probability distributions

• Standard conversion method: the logistic (sigmoid) transformation

• For each classifier c and each value of a noise parameter η, define a conditional distribution P_{c,η}(y|x)

• Define priors on the classifiers and on η, and combine them into a prior on the resulting set of distributions
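A minimal sketch of the logistic conversion for hard classifiers, assuming ±1-valued labels and a nonnegative noise parameter (called eta here; the symbol is a conventional choice, since the slide's formulas were images):

```python
import math

def logistic_prob(c_x, y, eta):
    """P_{c,eta}(y | x) under the logistic transformation of a hard
    classifier: labels y and outputs c(x) lie in {-1, +1}; eta >= 0
    controls how confident the induced distribution is."""
    return 1.0 / (1.0 + math.exp(-2.0 * eta * y * c_x))

# A correct prediction (y == c(x)) gets probability sigma(2*eta) > 1/2,
# a mistake gets sigma(-2*eta) < 1/2; eta = 0 gives the uniform 1/2.
print(logistic_prob(+1, +1, 1.0))  # ~0.881
print(logistic_prob(+1, -1, 1.0))  # ~0.119
print(logistic_prob(+1, -1, 0.0))  # 0.5
```

The two label probabilities sum to one for every x, so each (c, η) pair really does define a conditional distribution.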

Page 16

Logistic transformation - intuition

• Consider ‘hard’ classifiers (output in {-1, +1})
• For each such c and each η, the probability the model assigns to the sample depends on the data only through the number of mistakes c makes

• Here err̂(c) is the empirical error that c makes on the data, and k is the number of mistakes c makes on the data

Page 17

Logistic transformation - intuition

• For fixed η:
  – the log-likelihood is a linear function of the number of mistakes c makes on the data
  – the (log-)likelihood is maximized by the c that is optimal for the observed data

• For fixed c:
  – maximizing the likelihood over η also makes sense
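Both observations can be checked numerically. Under the logistic model, a sample of size m on which c makes k mistakes has log-likelihood (m−k)·log σ(2η) + k·log σ(−2η), and the η maximizing it matches the empirical accuracy, η = ½·ln((m−k)/k) — a standard calculation assumed here, not copied from the slides:

```python
import math

def sigma(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(m, k, eta):
    """Log-likelihood of a size-m sample on which classifier c makes
    k mistakes: linear (and decreasing, for eta > 0) in k."""
    return (m - k) * math.log(sigma(2 * eta)) + k * math.log(sigma(-2 * eta))

def ml_eta(m, k):
    """eta maximizing the likelihood for fixed c: solves
    sigma(2*eta) = (m - k) / m, i.e. eta = 0.5 * ln((m - k) / k)."""
    assert 0 < k < m
    return 0.5 * math.log((m - k) / k)

m, k = 100, 20
eta_hat = ml_eta(m, k)
# Fewer mistakes -> strictly higher likelihood at any fixed eta > 0:
print(log_likelihood(m, 10, 1.0) > log_likelihood(m, 20, 1.0))  # True
# eta_hat beats other values of eta for the same (m, k):
print(log_likelihood(m, k, eta_hat) >= log_likelihood(m, k, 1.0))  # True
```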

Page 18

Logistic transformation - intuition

• In Bayesian practice, the logistic transformation is a standard tool, nowadays performed without any motivation or explanation
  – We did not find it in Bayesian textbooks…
  – …but tested it with three well-known Bayesians!

• Analogous to turning a set of predictors with squared error into conditional distributions with normally distributed noise

• It expresses the label as c(X) corrupted by an independent noise bit Z

Page 19

Main Result

• There exists
  – an input domain
  – a prior P on a countable set of classifiers
  – a ‘true’ distribution D
  – a constant K > 0

such that the Bayesian learning algorithm is asymptotically K-suboptimal: its generalization error exceeds that of the best classifier in the set by at least K, no matter how large the sample.

holds both for full Bayes and for Bayes (S)MAP

Grünwald & Langford, COLT 2004

Page 20

Definition of the Bayesian learning algorithm

• Posterior: P(c, η | S) ∝ P(S | c, η) · P(c) · P(η)

• Predictive distribution: P(y | x, S), obtained by averaging P_{c,η}(y | x) over the posterior

• “Full Bayes” learning algorithm: output the classifier induced by the predictive distribution, i.e. predict +1 iff P(Y = +1 | x, S) ≥ 1/2
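These three objects can be sketched for a finite candidate set, with a small grid over eta standing in for a continuous prior on it; the function names and the grid are illustrative, not from the talk.

```python
import math

def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

def posterior(classifiers, etas, prior, sample):
    """Bayes: weight each (c, eta) pair by its prior mass times the
    logistic likelihood of the sample, then normalize."""
    w = {}
    for i, c in enumerate(classifiers):
        for eta in etas:
            ll = sum(math.log(sigma(2 * eta * y * c(x))) for x, y in sample)
            w[(i, eta)] = prior[i] / len(etas) * math.exp(ll)
    z = sum(w.values())
    return {key: v / z for key, v in w.items()}

def full_bayes_predict(classifiers, post, x):
    """Predictive probability P(Y=+1 | x, S) by posterior averaging,
    then the Bayes act: +1 iff that probability is at least 1/2."""
    p_plus = sum(p * sigma(2 * eta * classifiers[i](x))
                 for (i, eta), p in post.items())
    return +1 if p_plus >= 0.5 else -1

# Two candidates on x in {-1, +1}: the identity and its negation.
cs = [lambda x: x, lambda x: -x]
S = [(+1, +1), (-1, -1), (+1, +1)]              # identity fits perfectly
post = posterior(cs, [0.5, 1.0, 2.0], [0.5, 0.5], S)
print(full_bayes_predict(cs, post, +1))         # 1 (predicts +1)
```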

Page 21

Issues/Remainder of Talk

1. How is “Bayes learning algorithm” defined?

2. What is the scenario?
   • what do the classifier set, the ‘true’ distribution D, and the prior P look like?

3. How dramatic is the result?
   • How large is K?
   • How strange are the choices made in the construction?

4. How bad can Bayes get?

5. Why is the result surprising?
   • can it be reconciled with Bayesian consistency results?

Page 22

Scenario

• Definition of Y, X and the set of classifiers:

• Definition of the prior:
  – for some small constant, for all large n, …
  – the prior on η can be any strictly positive smooth prior (or a discrete prior with sufficient precision)

Page 23

Scenario – II: Definition of true D

1. Toss fair coin to determine value of Y .

2. Toss coin Z with bias

3. If (easy example) then for all , set

4. If (hard example) then set

and for all , independently set
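This sampling scheme can be sketched as below. The slide's numeric parameters were in the lost formulas, so the bias of Z (`p_hard`) and the feature-agreement probability (`q1`) are illustrative stand-ins, chosen so that the designated feature is strictly more informative than the rest, as the next slide asserts.

```python
import random

def draw_example(n_features, p_hard=0.5, q1=0.75, rng=random):
    """One draw of the slide's shape of D (parameter values illustrative).
    Easy example: every feature copies the label.
    Hard example: the designated feature (index 0) agrees with Y with
    probability q1 > 1/2; the remaining features are independent fair
    coins, hence uninformative on hard examples."""
    y = rng.choice([-1, +1])
    if rng.random() >= p_hard:                       # easy example
        x = [y] * n_features
    else:                                            # hard example
        x = [y if rng.random() < q1 else -y]
        x += [rng.choice([-1, +1]) for _ in range(n_features - 1)]
    return x, y

random.seed(1)
sample = [draw_example(5) for _ in range(50_000)]
# c_j(x) = x_j: the designated feature gives the best single-feature rule.
errs = [sum(x[j] != y for x, y in sample) / len(sample) for j in range(5)]
print(min(range(5), key=errs.__getitem__))  # 0
```

With these stand-in values the designated feature's classifier has risk p_hard·(1−q1) = 0.125, while every other feature's classifier has risk p_hard·½ = 0.25.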

Page 24

Result:

• All features are informative of Y, but the designated feature is more informative than all the others, so the corresponding classifier is the best classifier in the set:

• Nevertheless, with ‘true’ D-probability 1, as the sample size tends to infinity, …

  (but note: for each fixed j, …)

Page 25

Issues/Remainder of Talk

1. How is “Bayes learning algorithm” defined?

2. What is the scenario?
   • what do the classifier set, the ‘true’ distribution D, and the prior P look like?

3. How dramatic is the result?
   • How large is K?
   • How strange are the choices made in the construction?

4. How bad can Bayes get?

5. Why is the result surprising?
   • can it be reconciled with Bayesian consistency results?

Page 26

Theorem 1

• There exists
  – an input domain
  – a prior P on a countable set of classifiers
  – a ‘true’ distribution D
  – a constant K > 0

such that the Bayesian learning algorithm is asymptotically K-suboptimal: its generalization error exceeds that of the best classifier in the set by at least K, no matter how large the sample.

holds both for full Bayes and for Bayes MAP

Grünwald & Langford, COLT 2004

Page 27

Theorem 1, extended

• X-axis: • = maximum

Bayes MAP/MDL

• = maximum

full Bayes (binary entropy)• Maximum difference

achieved at

Page 28

How ‘natural’ is scenario?

• The basic scenario is quite unnatural
• We chose it because we could prove something about it! But:
  1. The priors are natural (take e.g. Rissanen’s universal prior)
  2. Clarke (2002) reports practical evidence that Bayes performs suboptimally with large yet misspecified models in a regression context
  3. Bayesian inference is consistent under very weak conditions. So even if unnatural, the result is still interesting!

Page 29

Issues/Remainder of Talk

1. How is “Bayes learning algorithm” defined?

2. What is the scenario?
   • what do the classifier set, the ‘true’ distribution D, and the prior P look like?

3. How dramatic is the result?
   • How large is K?
   • How strange are the choices made in the construction?

4. How bad can Bayes get?

5. Why is the result surprising?
   • can it be reconciled with Bayesian consistency results?

Page 30

Bayesian Consistency Results

• Doob (1949, special case):

  Suppose
  – the model is countable
  – the model contains the ‘true’ conditional distribution

  Then, with D-probability 1, the posterior predictive distribution converges to the true distribution, weakly/in Hellinger distance

Page 31

Bayesian Consistency Results

• If the posterior converges to the true conditional distribution…

…then the generalization error of the Bayes classifier must also converge to the minimum

• Our result says that this does not happen in our scenario. Hence the (countable!) model we constructed must be misspecified:

• The model is homoskedastic, the ‘true’ D heteroskedastic!

Page 32

Bayesian consistency under misspecification

• Suppose we use Bayesian inference based on a ‘model’ that is a set of distributions

• If the true D is not in the model, then under ‘mild’ generality conditions, Bayes still converges to the distribution in the model that is closest to D in KL divergence (relative entropy).

• The logistic transformation ensures that the minimum KL divergence is achieved by the c that also achieves the minimum generalization error

Page 33

Bayesian consistency under misspecification

• In our case, the Bayesian posterior does not converge to the distribution with smallest classification generalization error, so it also does not converge to the distribution closest to the true D in KL divergence

• Apparently, the ‘mild’ generality conditions for ‘Bayesian consistency under misspecification’ are violated

• Conditions for consistency under misspecification are much stronger than conditions for standard consistency!
  – the model must either be convex or “simple” (e.g. parametric)

Page 34

Is consistency achievable at all?

• Methods for avoiding overfitting proposed in the statistical and computational learning theory literature are consistent
  – Vapnik’s methods (based on VC dimension etc.)
  – McAllester’s PAC-Bayes methods

• These methods invariably punish ‘complex’ (low-prior) classifiers much more than ordinary Bayes
  – in the simplest version of PAC-Bayes, the penalty on a classifier c is of order √(ln(1/P(c))/m), rather than the ln(1/P(c))/m effectively imposed by Bayes
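Such a penalized selection rule can be sketched with the common Hoeffding-style Occam bound, penalty √((ln(1/P(c)) + ln(1/δ)) / 2m); the constants and names here are illustrative, not the talk's exact bound.

```python
import math

def occam_select(classifiers, prior, sample, delta=0.05):
    """Occam's-razor-style selection: minimize empirical error plus a
    penalty of order sqrt(ln(1/P(c)) / m). Contrast with Bayes/MDL,
    whose effective penalty is only of order ln(1/P(c)) / m."""
    m = len(sample)
    def penalized(i):
        c = classifiers[i]
        emp = sum(c(x) != y for x, y in sample) / m
        pen = math.sqrt((math.log(1 / prior[i]) + math.log(1 / delta)) / (2 * m))
        return emp + pen
    return min(range(len(classifiers)), key=penalized)

# A 'memorizing' rule (tiny prior) fits the sample perfectly, but the
# penalty makes the simple rule (large prior, 5% empirical error) win.
sample = [(i, +1) for i in range(95)] + [(i, -1) for i in range(95, 100)]
cs = [lambda x: +1,                          # simple: always +1
      lambda x: -1 if x >= 95 else +1]       # complex: memorizes the tail
print(occam_select(cs, [0.5, 1e-6], sample))  # 0
```

Without the penalty, the complex rule would win (zero empirical error); the √-penalty on its tiny prior mass reverses the choice.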

Page 35

Consistency and Data Compression - I

• Our inconsistency result also holds for (various incarnations of) the MDL learning algorithm

• MDL is a learning method based on data compression; in practice it closely resembles Bayesian inference with certain special priors

• …however…

Page 36

Consistency and Data Compression - II

• There already exist (in)famous inconsistency results for Bayesian inference, due to Diaconis and Freedman

• For some highly non-parametric models, even if the “true” D is in the model, Bayes may not converge to it

• These types of inconsistency results do not apply to MDL, since Diaconis and Freedman use priors that do not compress the data

• With MDL priors, if the true D is in the model, then consistency is guaranteed under no further conditions at all (Barron ’98)

Page 37

Issues/Remainder of Talk

1. How is “Bayes learning algorithm” defined?

2. What is the scenario?
   • what do the classifier set, the ‘true’ distribution D, and the prior P look like?

3. How dramatic is the result?
   • How large is K?
   • How strange are the choices made in the construction?

4. How bad can Bayes get? (& “what happens”)

5. Why is the result surprising?
   • can it be reconciled with Bayesian consistency results?

Page 38

Thm 2: full Bayes result is ‘tight’

[Figure: X-axis shows the scenario parameter; one curve gives the maximum suboptimality K for Bayes MAP/MDL, the other the maximum K for full Bayes (binary entropy); the maximum difference between the two is marked.]

Page 39

Theorem 2

Page 40

Proof Sketch

1. The log loss of Bayes upper-bounds the 0/1-loss:
   • for every sequence …

2. The log loss of Bayes is upper-bounded by the log loss of the 0/1-optimal classifier plus a logarithmic term

Page 41

Proof Sketch

Page 42

Proof Sketch

1. The log loss of Bayes upper-bounds the 0/1-loss:
   • for every sequence …

2. The log loss of Bayes is upper-bounded by the log loss of the 0/1-optimal classifier plus a logarithmic term

Page 43

Proof Sketch

1. The log loss of Bayes upper-bounds the 0/1-loss:
   • for every sequence …

2. The log loss of Bayes is upper-bounded by the log loss of the 0/1-optimal classifier plus a logarithmic term
   (law of large numbers/Hoeffding)

Page 44

Wait a minute…

• The accumulated log loss of sequential Bayesian predictions is always within a logarithmic term of the accumulated log loss of the optimal classifier

• So Bayes is ‘good’ with respect to log loss/KL divergence
• But Bayes is ‘bad’ with respect to 0/1-loss
• How is this possible?
• The Bayesian posterior effectively becomes a mixture of ‘bad’ distributions (a different mixture at each sample size m)
  – the mixture is closer to the true distribution D, in KL divergence/log-loss prediction, than the best single distribution in the model…
  – …but performs worse than it in terms of 0/1 error

Page 45

Bayes predicts too well

• Let the model be a set of distributions, and let the Bayesian predictive distribution be defined with respect to a prior that makes it a universal data-compressor with respect to that model

• One can show that the only true distributions D for which Bayes can ever become inconsistent in the KL-divergence sense…

• …are those under which the posterior predictive distribution becomes closer in KL divergence to D than the best single distribution in the model

Page 46

Conclusion

• Our result applies to hard classifiers and (equivalently) to probabilistic classifiers under slight misspecification

• Bayesian may argue that the Bayesian machinery was never intended for misspecified models

• Yet, computational resources and human imagination being limited, in practice Bayesian inference is applied to misspecified models all the time.

• In this case, Bayes may overfit even in the limit of an infinite amount of data

Page 47

Thank you for your attention!