
Reconciling Bayesian and Non-Bayesian Analysis

David H. Wolpert

SFI WORKING PAPER: 1993-09-054

SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant.

©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder.

www.santafe.edu

SANTA FE INSTITUTE


RECONCILING BAYESIAN AND NON-BAYESIAN ANALYSIS

by

David H. Wolpert

The Santa Fe Institute, 1660 Old Pecos Trail, Suite A, Santa Fe, NM, 87501, [email protected]

Abstract: This paper is an attempt to reconcile Bayesian and non-Bayesian approaches to statistical inference, by casting both in terms of a broader formalism. In particular, this paper is an attempt to show that when one extends conventional Bayesian analysis to distinguish the truth from one's guess for the truth, one gains a broader perspective which allows the inclusion of non-Bayesian formalisms. Amongst other things, this perspective shows how it is possible for non-Bayesian techniques to perform well, despite their handicaps.

INTRODUCTION

The purpose of this paper is to reconcile Bayesian and non-Bayesian approaches to statistical inference, by casting both in terms of a broader formalism. In particular, this paper is an attempt to show that when one extends conventional Bayesian analysis to distinguish the truth from one's guess for the truth, one gains a broader perspective which allows the inclusion of non-Bayesian formalisms.

A bit of what I say in this paper has been said before; I make no claims for blanket originality. Rather, what I wish to do in this paper is emphasize these points differently, to combine them with other points which are (to my knowledge) original, and to present the mix in a formal manner.

First of all, why should one want to effect a reconciliation between Bayesian and non-Bayesian analysis? Why be bothered with non-Bayesian techniques? After all, Bayesian analysis has many advantages over other formalisms. To list just a few: Bayesian analysis forces one to make one's assumptions explicit; it ensures self-consistency in one's use of probability; it is principled and programmatic, in the sense that it provides a single unified approach to all inference problems; it works well if used carefully; in fact, when one has reason to be very sure of one's priors (e.g., say they are given by quantum mechanics), Bayesian analysis is essentially impossible to beat; and, in some ways most important of all (from a sociological perspective), Bayesian analysis is in a certain sense more elegant and aesthetically pleasing than non-Bayesian analysis. I should add that I personally have often used Bayesian techniques in the past and will do so again in the future; I personally find them quite useful.

As compelling as these reasons are, however, none of them constitutes a formal mathematical proof that using Bayesian techniques in the real world, "carefully" or otherwise, is guaranteed to give better performance than using any other non-Bayesian technique. Indeed, there are many examples - constructed by self-professed Bayesians - which cast doubt on such a guarantee. For example, as is discussed separately at this conference (Strauss 1993, Wolpert 1993a), although the "evidence" procedure sometimes works well in practice (Gull 1989, MacKay 1992), it is, formally speaking, a non-Bayesian technique. As another example, MacKay has recently announced that he won a prediction competition using a technique based on an extension of cross-validation (MacKay 1993). Yet that technique is certainly not Bayesian in any (currently understood) respect. Moreover, in (MacKay 1992), in a section somewhat disingenuously entitled "Why Bayes can't systematically reject the truth", MacKay presents a theoretical argument which can be used to defend the practice of setting any parameter (not just a hyperparameter) via maximum likelihood.

A different kind of reason for not dismissing non-Bayesian techniques out of hand arises when one considers the problem of setting the joint probability distribution P(truth = t, data = d).¹ If - as is often the case in the real world - we already know the likelihood P(d | t), in what ways can we go about fixing the remaining degrees of freedom in the joint distribution, while ensuring consistency with the laws of probability theory? One way to do this, of course, is to provide the prior distribution P(t). Given P(t) in addition to P(d | t), the joint distribution is fully specified, in a manner consistent with probability theory. This is the basis for conventional Bayesian analysis.

But there are other ways to extend the likelihood to fix the joint distribution as well. For example, if there is a data set d = d' such that for no truth t does the likelihood of d' equal zero exactly, then we can do the following: Rather than augmenting the likelihood by providing the prior, we can instead provide all the values of the posterior P(t | d = d'). By doing this we will have fixed the entire joint distribution, for all possible data sets. (This follows from the equality

P(xᵢ, yⱼ) = [P(xᵢ | y₁) × P(yⱼ | xᵢ) / P(y₁ | xᵢ)] / Σᵢⱼ [P(xᵢ | y₁) × P(yⱼ | xᵢ) / P(y₁ | xᵢ)].)

In particular, we will have P(t | d) for any data set d ≠ d' with which we might wish to perform statistical inference.
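The inversion behind this claim is easy to check numerically. The sketch below (my own illustrative construction, with made-up distributions, not an example from the paper) builds a small discrete joint P(t, d), pretends we were handed only the likelihood and the posterior at a single d' whose likelihood is nowhere zero, and recovers the entire joint via P(t) ∝ P(t | d') / P(d' | t):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small discrete world: 4 possible truths t, 5 possible data sets d.
likelihood = rng.random((4, 5))                 # P(d | t), one row per t
likelihood /= likelihood.sum(axis=1, keepdims=True)

prior = rng.random(4)                           # some "true" P(t)
prior /= prior.sum()

joint = prior[:, None] * likelihood             # P(t, d) = P(t) P(d | t)

# Pretend we know only the likelihood and the posterior at one d'.
d_prime = 2                                     # requires P(d' | t) > 0 for all t
posterior_at_dprime = joint[:, d_prime] / joint[:, d_prime].sum()

# Invert: P(t) is proportional to P(t | d') / P(d' | t).
prior_rec = posterior_at_dprime / likelihood[:, d_prime]
prior_rec /= prior_rec.sum()
joint_rec = prior_rec[:, None] * likelihood     # the reconstructed P(t, d)

assert np.allclose(joint_rec, joint)            # the whole joint is recovered
```

In particular, the reconstructed joint yields P(t | d) for every other d as well, which is the point of the scheme in the text.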

So rather than trying to find an appropriate P(t) as in conventional Bayesian analysis, we could instead try to find an appropriate P(t | d) for one specific d, d'. Doing this, we would be assured of self-consistency, our assumptions would be explicit (they would be our choice of P(t | d) for that one specific d), etc., just as in conventional Bayesian analysis. So in a certain sense it is just as legitimate to have one's assumptions be directly for the posterior (for one fixed d) as to have them be for the prior.

Note, however, that with this alternative scheme, rather than using pseudo-intuitive arguments to set P(t), as in conventional Bayesian analysis, we would instead be using such arguments to set P(t | d) for one particular d. In particular, one could set P(t | d) using "pseudo-intuitive" bias+variance arguments, for example, or cross-validation type arguments. Indeed, one might even be able to formalize such arguments as the use of "desiderata" to set P(t | d) for one specific d. Alternatively, we could view such arguments as reflections of the fact that there's no reason that "prior knowledge" has to be expressed in a prior; it can get expressed directly in a posterior. So, for example, my "prior knowledge" might consist of knowing that cross-validation works well for a certain class of problems.

However they're viewed, once such arguments are used to justify a choice of P(t | d)-for-one-specific-d, for all other d we would be exploiting "non-(conventional) Bayesian" (i.e., non-prior-based) techniques, but in a manner possessing most of the strengths of conventional Bayesian analysis. Moreover, in a number of scenarios one might argue that a particular pseudo-intuitive justification of P(t | d)-for-one-specific-d is more compelling than the alternative pseudo-intuitive justification of P(t).² In such a case, it would be the conventional Bayesian who is hard pressed to formally justify his/her procedure.

More generally, one can argue that since the exact schemes (prior-based and posterior-for-one-d-based) have corresponding validities, the same might be true of a particular approximation to the scheme outlined above (for using non-Bayesian techniques and a single d to set P(t, d)) and a particular approximation to conventional Bayesian analysis. In particular, consider the approximation of simply using a non-Bayesian technique to set P(t | d) for all d (i.e., the "approximation", to the scheme outlined above, of just using the non-Bayesian technique straight, without any attempt to enforce consistency). It could be that this approximation is just as valid as making certain calculational approximations in conventional Bayesian analysis. In other words, it might be that non-Bayesian techniques are just as valid as conventional Bayesian techniques which involve calculational approximations. At a minimum, this is quite plausible when the calculational approximations in question are poor; the whole issue, which has never before been addressed, is how accurate the calculational approximations in a Bayesian technique must be to "beat" a particular non-Bayesian technique.

I should emphasize that I am not here advocating that one use non-Bayesian techniques in practice (I'm not taking sides in this paper). In particular, I am not advocating that one use such techniques via the scheme outlined above for inferring a full P(t, d) from P(t | d) for a single d. Rather, my point is simply that there do exist schemes based on non-Bayesian techniques which possess many of the strengths of conventional Bayesian techniques. So those strengths alone don't decide the issue; we must instead compare Bayesian and non-Bayesian techniques in terms of how well they work in practice. To do this, we need to make use of a new formalism.


1. A FORMALISM FOR RECONCILING BAYES AND NON-BAYES

In the (vast) majority of inference problems, there are four quantities of interest: the data, the truth, one's guess for the truth, and a real-world "error" or "cost" or "loss" accompanying a particular use of a particular inference technique.³ Since these four quantities are distinct objects, the probability distribution governing them - the distribution governing that which we are interested in, namely the inference process - is the joint distribution P(truth = t, guess = g, data = d, cost = c).

Conventional Bayesian analysis and decision theory fails to distinguish t from g in its event space. Therefore one must be careful in relating the distribution P(t, g, d, c) to the distributions used in conventional Bayesian analysis. In particular, it is important to understand that what is usually (loosely) called the "posterior" in conventional Bayesian analysis is P(t | d) and not P(g | d). This is the case because of the conventional Bayesian's use of Bayes' theorem, which gives the posterior in terms of the likelihood. Since the likelihood is P(data = d | truth = t), not P(data = d | guess = g), the "posterior" must be P(t | d).

P(g | d) is instead an entirely different object, which in general has no analogue in conventional Bayesian analysis. It is the probability of making a guess g given data d. In other words, it is one's algorithm for performing statistical inference. A priori, it need have nothing to do with Bayesian techniques. Indeed, despite folk wisdom to the contrary, an arbitrary P(g | d) need not be expressible in "Bayesian" terms. I.e., it need not be expressible in terms of a posterior [P(d | t) × P(t) / P(d)] evaluated at t = g for some P(t), or as an average over t of such a quantity, or some such. (For example, the evidence procedure's P(g | d) can not be expressed this way - see (Wolpert 1993a).) As such, P(g | d) is the object which allows one to expand the discussion to consider non-Bayesian techniques.

One nice feature of this "extended" Bayesian framework is that in it, the difference between conventional Bayesian analysis and (most forms of) non-Bayesian analysis is no longer some quasi-philosophical preference for different statistical dogmas. Instead that difference reduces to simply what conditional distribution the two formalisms choose to evaluate.

For example, consider the field of classification, in which (in its simplest form) t and g are both functions from an input space to a discrete output space. Vapnik's famous theorem in classification theory (Vapnik 1982) is concerned with conditional distributions of the form P(|c - s| > ε | t, m), where m is the size of the data set d, cost c is defined as the average number of times (across the input space) that t differs from g, s is the average number of times across the data set that g and t disagree (i.e., the classifier's empirical misclassification rate), and ε is an arbitrary value. Vapnik derives results of the form P(|c - s| > ε | t, m) < function(ε, m, t, "VC dimension of P(g | d)"). These results can (usually) be cast in such a way that they're independent of t, and therefore of P(t), the prior over truths; Vapnik's result is "non-Bayesian". (In essence, it's a likelihood-driven confidence-interval-type result.) However, if rather than P(|c - s| > ε | m) one had instead decided to analyze something like P(c > ε | d), then priors over t would matter, and we would be faced with a purely Bayesian problem.
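Vapnik's actual theorem is a uniform-convergence statement over a whole class of classifiers, with the VC dimension controlling the bound; but the flavor of such a t-conditioned, prior-free statement can be illustrated for a single fixed g with a Hoeffding-style simulation (my own sketch under those simplifying assumptions, not Vapnik's result):

```python
import numpy as np

rng = np.random.default_rng(3)

N = 10_000                         # size of the (finite) input space
t = rng.integers(0, 2, N)          # a fixed truth t (we condition on it)
g = rng.integers(0, 2, N)          # a fixed guess g, e.g. one classifier's outputs

c = np.mean(g != t)                # true misclassification rate ("cost" c)

m, eps, trials = 100, 0.1, 20_000
samples = rng.integers(0, N, size=(trials, m))     # uniform sampling distribution
s = (g[samples] != t[samples]).mean(axis=1)        # empirical rate on each data set

# Estimate P(|c - s| > eps | t, m) and compare to a Hoeffding-style bound
# that involves neither t's identity nor any prior P(t).
freq = float(np.mean(np.abs(c - s) > eps))
bound = 2 * np.exp(-2 * m * eps**2)
assert freq <= bound
```

The estimated deviation probability depends only on m and ε, not on which truth generated the data, which is exactly the sense in which such results are independent of P(t).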

In other words, the only fundamental difference between the Bayesian analysis and Vapnik's analysis is that they examine different conditional distributions. The math is just as valid in both cases. And both cases have a straightforward, somewhat illuminating frequentist interpretation.⁴ Moreover, since the only true distinguishing feature between the two kinds of analysis is their distribution of interest, we're immediately led to consider the applicability of that distribution to the particular real-world problems facing us. (See footnote 4.) As an aside, note that this consideration serves to point up another strength of Bayesian techniques as compared to non-Bayesian techniques; by putting what we know (and nothing else) on the right-hand side of the conditioning bar, Bayesian techniques are examining the "correct" distribution, as far as the right-hand side of the conditioning bar is concerned.

Not only is all this made clear in extended Bayesian analysis; in addition, one can not even express Vapnik's work using conventional Bayesian analysis.⁵ The extension of the event space to include g's which are distinct from t's is crucial if one wants to house both Bayesian and non-Bayesian techniques under the same roof.


Indeed, the extended Bayesian framework can be used to explore a subtle conflict in what's been said so far. The extended framework shows that Bayesian techniques examine the "correct" distribution, whereas non-Bayesian ones do not. However, the discussion in the introduction suggests that Bayesian techniques need not be optimal, even as measured by that "correct" distribution. How can this be?

The answer to this question is provided by employing a little algebra in conjunction with the fact that P(g | t, d) = P(g | d) (i.e., the fact that by definition the statistical algorithm won't change its guess unless the data on which it's run changes). It's straightforward to use this fact to prove the following (see (Wolpert 1992, 1993b)):

Theorem 1: P(c | d) = Σ_{g,t} [P(g | d) × P(t | d) × M_{c,d}(g, t)], for some matrix M parameterized by c and d.

(A similar result holds if g and t are not countable.)
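For the zero-one cost of classification, M_{c=1,d}(g, t) is simply the symmetric matrix with entries 1 where g ≠ t, and Theorem 1 can be checked numerically as an inner product between one's algorithm and the posterior. A toy sketch (distributions invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5                                    # t and g both range over {0, ..., 4}

p_t = rng.random(n); p_t /= p_t.sum()    # a posterior P(t | d), for some fixed d
p_g = rng.random(n); p_g /= p_g.sum()    # an inference algorithm P(g | d)

# Zero-one cost: c = 1 iff g != t, so M_{c=1,d}(g, t) = 1{g != t}, symmetric.
M = 1.0 - np.eye(n)

# Theorem 1 as a (non-Euclidean) inner product:
p_error = p_g @ M @ p_t

# Brute-force check against the defining double sum over (g, t).
brute = sum(p_g[g] * p_t[t] for g in range(n) for t in range(n) if g != t)
assert np.isclose(p_error, brute)
```

Holding M fixed, p_error depends only on how P(g | d) lines up against P(t | d), which is the "alignment" picture discussed next.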

In many situations (e.g., the aforementioned classification problem) M is a symmetric matrix. In such cases, Theorem 1 tells us that P(c | d) is given by a non-Euclidean inner product between the posterior and one's inference algorithm. In other words, intuitively speaking, how well your algorithm performs is determined by how "aligned" it is with the true posterior. Therefore, even if a particular P(g | d) is constructed in a "Bayesian" manner, unless one somehow knows that that P(g | d) is constructed appropriately for the actual P(t | d), one has no assurances that that P(g | d) is better "aligned" with P(t | d) than some other P(g | d) is, even if that other P(g | d) was not constructed in a Bayesian manner. It is only under the assumption that one's Bayesian algorithm was based on a correct guess for P(t | d) that the usual results of decision theory declaring the optimal nature of a Bayesian algorithm apply.

In plain English, Theorem 1 states that a Bayesian's assumption for the prior (and therefore for P(t | d)) might be wrong. (Subtleties concerning that word "wrong" are addressed below.) Hence, said Bayesian might construct a P(g | d) predicated on an incorrect P(t | d), and therefore perform poorly. Accordingly, there are two issues at hand: how accurate the Bayesian's assumptions for the prior are, and how the probability of cost varies with changes in that accuracy. Indeed, depending on these issues, the Bayesian might perform more poorly than a non-Bayesian, even though the non-Bayesian never stated their assumptions explicitly in the first place.

In fact, in the case of classification discussed above, if one's cost is determined by how well g matches t for input values not yet seen (i.e., for input values outside of the data set), one has the following theorems (Wolpert 1993b):

Theorem 2: If P(t) is uniform, then regardless of the noise and the sampling distribution over the input space, both P(c | m) and P(c | d) are independent of the inference algorithm.

In other words, unless one can rule out a uniform P(t), one can not rule out the possibility that all statistical algorithms will behave the same, whether they are constructed in a Bayesian manner or otherwise. This result can be extended as follows.

Theorem 3: For any two inference algorithms P1(g | d) and P2(g | d), independent of the noise and sampling distribution,

i) if there exists a t such that E(c | t, m) is lower for P1(g | d), then there exists a different t such that E(c | t, m) is lower for P2(g | d) (m being the size of the data set);

ii) if there exists a t and a d such that E(c | t, d) is lower for P1(g | d), then there exists a different t and d such that E(c | t, d) is lower for P2(g | d);

iii) if there exists a P(t) and a d such that E(c | d) is lower for P1(g | d), then there exists a different P(t) and d such that E(c | d) is lower for P2(g | d);

iv) if there exists a P(t) such that E(c | m) is lower for P1(g | d), then there exists a different P(t) such that E(c | m) is lower for P2(g | d).

Page 10: Reconciling Bayesian and Non- Bayesian Analysis · Conventional Bayesian analysis and decision theory fails to distinguish t from g in its event space. Thereforeonemust becarefulinrelating

9

Again, all of this holds whether or not the inference algorithms in question are constructed in a Bayesian manner. (Here, by a "Bayesian inference algorithm" I mean one which makes a guess for P(t), and then constructs a P(g | d) under the assumption that that guess is correct.)
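The uniform-prior claim of Theorem 2 can be verified exactly by enumeration in a toy setting. The sketch below is my own illustrative construction (six binary inputs, noise-free data on three of them, and two deliberately opposite deterministic algorithms); averaged over a uniform P(t), their off-training-set error is identical. The theorem's full claim covers the entire distribution of c, but the expected cost already makes the point:

```python
import itertools

train_x = [0, 1, 2]        # inputs whose labels appear in the data set d
test_x = [3, 4, 5]         # off-training-set inputs, where cost is measured

def expected_ots_error(algorithm):
    """Average off-training-set zero-one error over a uniform P(t),
    with noise-free data d = (t restricted to train_x)."""
    total = 0
    for t in itertools.product([0, 1], repeat=6):      # all 64 possible truths
        d = {x: t[x] for x in train_x}
        g = algorithm(d)                               # a deterministic P(g | d)
        total += sum(g[x] != t[x] for x in test_x)
    return total / (64 * len(test_x))

# Two very different algorithms: predict the majority training label
# everywhere, or predict its exact opposite.
majority = lambda d: {x: int(sum(d.values()) >= 2) for x in range(6)}
anti     = lambda d: {x: 1 - int(sum(d.values()) >= 2) for x in range(6)}

assert expected_ots_error(majority) == expected_ots_error(anti) == 0.5
```

Under a uniform P(t), every off-training-set label is a fair coin no matter what the data say, so any algorithm whatsoever scores 0.5; the data cannot favor one algorithm over another.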

It's worth emphasizing that these theorems don't just say that a non-Bayesian algorithm might beat a Bayesian one in one particular trial, by luck as it were. Rather, they say that a non-Bayesian algorithm might win on average, in a particular series of trials; it might give a lower expected cost. (This despite the fact that the lowest possible expected cost is given by a "Bayes-optimal" algorithm constructed with knowledge of the correct P(t).)

With the extended Bayesian formalism, what we have is a view of inductive inference as an exercise in game theory. Inductive inference is a 2-person game, and it can either be a single-move game or a multi-move game (as when one investigates "learning curves" - see (Tishby 1989)). One of the participants in the game is you, the statistician. The other participant is the data-generating mechanism, aka the universe. This opponent of yours draws truths t from a set of possible t, according to the distribution P(t). Given t, your opponent then produces a data set, according to the distribution P(d | t). This d is then shown to you. Based on d, you guess a g according to P(g | d). We then use some sort of error or cost function to determine how well your g matches your opponent's t. If you know your opponent's P(t) and P(d | t), then you can use that information to perform optimally. But if you have to guess P(t) ... you have no assurances.
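The game just described is easy to simulate. The following sketch (with distributions invented purely for illustration) pits a Bayesian MAP algorithm built from the universe's actual P(t) against the same algorithm built from a badly wrong guess for P(t); only the former enjoys the usual decision-theoretic optimality:

```python
import numpy as np

rng = np.random.default_rng(2)

# The universe's side of the game (unknown to the statistician):
p_t = np.array([0.7, 0.3])                 # true P(t), truths t in {0, 1}
lik = np.array([[0.9, 0.1],                # P(d | t): rows indexed by t,
                [0.1, 0.9]])               # columns by d in {0, 1}

def bayes_guess(d, assumed_prior):
    """MAP guess under an assumed prior: a 'Bayesian' P(g | d)."""
    return int(np.argmax(assumed_prior * lik[:, d]))

def play(algorithm, n_rounds=200_000):
    t = rng.choice(2, size=n_rounds, p=p_t)             # universe draws t
    d = (rng.random(n_rounds) < lik[t, 1]).astype(int)  # then d ~ P(d | t)
    guess_table = np.array([algorithm(0), algorithm(1)])
    g = guess_table[d]                                  # you guess g ~ P(g | d)
    return float(np.mean(g != t))                       # average zero-one cost

good = play(lambda d: bayes_guess(d, np.array([0.7, 0.3])))   # correct prior
bad  = play(lambda d: bayes_guess(d, np.array([0.02, 0.98]))) # wrong prior

assert good < bad
```

Both players are "Bayesian" in construction; the difference in average cost comes entirely from whether the assumed P(t) matches the opponent's, which is the point of the paragraph above.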

It should be emphasized that none of these theorems can be used to show that one should use a particular non-Bayesian technique rather than a Bayesian one. A careful Bayesian analysis is often the "best one can do", or at least the analysis one can best justify doing. The point made by these theorems is simply that there is no assurance that the "analysis one can best justify doing" will actually perform better than some other technique for which one only has weak justifications. In this, these theorems are descriptive rather than prescriptive. They aren't designed to provide advice on how to perform one's statistics. Rather, they describe the ramifications of various statistical scenarios.

Nonetheless, as a particular example, note that these "descriptions" support the notion that there is nothing inherently bad about using a non-Bayesian algorithm to choose between Bayesian and non-Bayesian techniques, for certain statistical scenarios. For example, if we don't have much information to go on in making an assumption/guess for P(t) (and especially in the limiting case of no knowledge - a case sometimes dealt with via an "uninformative prior"), then it makes perfect sense to be suspicious of any such guess for P(t), and of any P(g | d) constructed under the assumption of that P(t). In such a scenario, one need not be shy about using techniques like cross-validation, bootstrap, stacked generalization, etc. Not only can such techniques be used to choose between Bayesian and non-Bayesian techniques, they can also be used to choose between different Bayesian techniques, i.e., between Bayesian techniques based on different guesses for P(t). (MacKay exploits this idea in (MacKay 1993).) They can even be used successfully to combine techniques (Breiman 1992, MacKay 1993).

From this perspective, the conventional prior-based Bayesian approach is the approach of choice only if there is no alternative non-Bayesian approach which is sufficiently compelling in comparison. (The comparison being between how compelling a P(g | d) based on a guess for P(t) is vs. a P(g | d) based on other considerations.) Or, to put it another way, as far as P(c | d) is concerned, it's the approach of choice only if our "prior knowledge" about the prior distribution is more substantial than our prior knowledge about other distributions. This is implicitly acknowledged in (for example) MacKay's use of cross-validation. As other examples, empirical Bayes and ML-II (Berger 1985) explicitly acknowledge the utility of non-Bayesian approaches in parts of the analysis.

Baldly put, one can argue that those (not at all uncommon) scenarios in which the Bayesian approach has worked very well compared to non-Bayesian techniques do not reflect some inherent "a priori superiority" of prior-based Bayesian techniques. Rather, they simply reflect the fact that at least some aspects of the non-Bayesian techniques employed in those scenarios were not sufficiently powerful in comparison to the corresponding Bayesian techniques (as is often the case with cross-validation, to give one example).


2. INTERPRETATIONAL AND PHILOSOPHICAL ISSUES

It is conventional for some modern-day Bayesians to interpret "probability" as synonymous with "degree of (personal) belief". Under this interpretation, there are some odd aspects to the extended Bayesian analysis presented in the previous section. After all, if P(t) is my belief in proposition t, then don't I automatically know P(t) exactly, and therefore can't I always build my algorithm P(g | d) appropriately, and be assured of optimality? In other words, if I'm a Bayesian, aren't I assured of a relationship between P(g | d) and P(t | d) which results in minimal expected cost?

This line of reasoning is often invoked (implicitly or otherwise) as a justification for why one should use Bayesian techniques. There are a number of ways to address it. The first is to simply note that the analysis of the previous section, in which the statistician does not have direct access to P(t) (and therefore has no guarantee of optimality, no matter how purely a Bayesian (s)he is), is exactly correct if you're playing a real two-person game. In other words, if, in the time-honored tradition of probability theory, one is investigating gambling, then the analysis of the previous section, together with all of its conclusions, is tautologically correct. For such a scenario, being a Bayesian provides no assurances of optimality. Even if you arrive at your priors through sophisticated, almost "indisputable" group-theoretic/information-theoretic arguments concerning your prior knowledge / ignorance, if it turns out that your sophisticated priors disagree with those of the house, well, then you lose.

Given this, the question immediately arises of whether there is some never-before-elaborated fundamental distinction between gambling scenarios and the kinds of statistical problems we usually encounter in our professional lives. (What could that distinction be? What concrete ramifications could it have?) If there is no such distinction, then the simple fact that the conclusions of the previous section (about the lack of guarantees accompanying being a Bayesian) are appropriate for gambling implies that those conclusions are also appropriate for other, non-gambling uses of statistics, like the uses of statistics discussed at this conference.

Another response to the worry about what P(t) "is" is to say that if we had sufficient knowledge of the laws of physics (in particular, of the boundary conditions of the universe) and of the laws of human psychology, and if we were sufficiently competent to perform the appropriate quantum mechanical calculations, then we could calculate P(t) exactly. In other words, one possible interpretation is that P(t) is the "real" P(t), determined by the laws of quantum mechanics applied to the universe as a whole.

Indeed, note that even if someone doesn't believe in quantum mechanics, essentially anyone can imagine that it's true. In other words, we can self-consistently imagine that the universe evolves in accord with equations governing "absolute, objective" probabilities. This simple fact, that quantum mechanics as currently formulated is mathematically self-consistent, shows that there is no formal problem with its implicit notion of absolute, objective probabilities, which exist independently of any particular person's degree of belief. Simply put, there is nothing mathematically necessary about the degree-of-belief interpretation of probability theory. There is nothing mathematically necessary about the idea that a Bayesian automatically knows what P(t) is.

As it turns out, one can come to this conclusion without invoking quantum mechanics. To do so, simply note that there is nothing in Cox's axioms which forces a particular interpretation of what probability "is". In other words, those axioms only say that if you're going to use a calculus of uncertainty, and it's to be reasonable, it must obey the laws of probability theory. Those axioms do not tell us how to assign the probability values in the first place. Now on the one hand, this of course means that there is no a priori reason to interpret probabilities as frequencies. However, it also means that there is no a priori reason to interpret them as a "degree of belief". There is nothing in Cox's axioms which proves that probability is anything so under the control of the researcher as that researcher's beliefs. One could interpret probability as degree of belief. In such a case, Bayesian analysis becomes a set of rules for telling you what structure your beliefs must have to be self-consistent. But one is not forced to interpret probability this way. "Degree of belief" is an interpretation, which is imposed from outside on the mathematics, and as such possesses no mathematical necessity.⁶

Indeed, any conclusion we reach that depends on such a degree of belief interpretation is a conclusion which does not follow solely from Cox's axioms. More is being added. In other words, it is fine to impose a somewhat arbitrary intuitive interpretation (like degree of belief) onto a formalism, but only so long as it doesn't add anything of substance which is not rigorously justified by that formalism. The arbitrary interpretation can not have formal implications which the math does not support; if it does, then one is imposing one's subjective philosophy onto the mathematics rather than letting that mathematics speak for itself. In particular, the idea that the theorems of the previous section are not valid because a Bayesian "automatically knows" P(t) is such an interpretation-driven (rather than math-driven) conclusion.

In general we know what a "truth" t is, and we know what a guess g is, and these are different objects. Since they are different objects, a priori their distributions need not be intimately connected. This is reflected in the theorems of the previous section. Accordingly, if we make a formal statement connecting P(t | d) and P(g | d), this corresponds to an extra assumption concerning P(t, g, c, d), an assumption not demanded by the mathematics. Unless one can somehow rigorously prove such an assumption (say by performing the appropriate physics calculations), one can not prove that Bayesian techniques will always outperform non-Bayesian ones.

In a certain sense, none of this really disagrees with conventional Bayesianism as practiced. That's because everyone does, implicitly, adhere to a notion of a "correct" prior P(t) which has to be discovered and is not "automatically known". In other words, people's actions betray them; they act as though priors are more than just degrees of personal belief. (This is obviously true of empirical Bayesians - I'm here talking about pure hierarchical Bayesians.)

For example, there have been many times in the history of Bayesian statistics when an argument in favor of a new prior P(t) has come along, and everyone has adopted that new prior in their P(g | d). Indeed, some of the more prominent attendees at this conference have spent the better parts of their careers looking for arguments to establish, once and for all, what priors one should use (i.e., what priors are "correct") for certain scenarios. Moreover, they've changed their views on this issue several times. Each time they act as though they were using an "incorrect" prior P(t) before, despite the fact that that P(t) was properly reflected in their old degrees-of-belief, as expressed in


their old P(g | d). And each time they adopt a new prior, the advocates of the new prior tend to look askance at any laggards still using the old one. (They certainly are not content to simply say to such laggards "Pick whatever prior you truly believe; my job as a Bayesian is simply to tell you what that choice suggests for you to do.") It's hard to understand this kind of behavior if these people truly view probabilities P(t) as degrees of personal belief, without any notion of a "correct" P(t) - certainly the laggard is following along consistently with his/her "beliefs", and can not be criticized on that score.

Indeed, for a century Bayesianism was in disrepute, and the current consensus is that it was in disrepute because it was used with "bad choices of priors". Just translate "bad choice of" to "incorrect assumption for", and you have the theorems of the previous section.

Alternatively, note that transferring from an "incorrect" P(t) to a better one is really nothing more than the process accompanying the (in)famous "opportunity to learn" which one encounters when one's Bayesian analysis leads to poor results. Or to put it another way, having P(t | d) poorly reflected in P(g | d) is an opportunity to learn. If you assume these distributions are always "automatically" connected, you're assuming you never have an opportunity to learn. (As an aside, note that a mismatch in the distributions is an "opportunity to learn" whether or not P(g | d) is based on Bayesian analysis - Bayesians have no monopoly on the use of the concept "opportunity to learn" as a cover for poor performance.)

Another approach to this issue is to replace the phrase "correct P(t)" with "P(t) which best reflects prior knowledge" - "best" in the sense that as time goes on, as our understanding of statistics improves, as (for example) we come to a better understanding of what "uninformative" means, etc., our thinking on "what distribution best reflects our prior knowledge" changes. This "best" P(t) is the holy grail - it's what all those who make new arguments concerning priors are shooting for, but (presumably) don't yet have. The important point is to acknowledge that in general one's pick of a prior need not be the "best" one, and in fact the "best" one need not even be known to humanity, either now or in the future.

Note that I'm not arguing for any particular interpretation of probability. In particular, I'm not saying that there are such things as objective, "correct" probabilities (although quantum mechanics certainly suggests that there are). I'm simply pointing out that statisticians, even the most die-hard Bayesian statisticians, act as though there were such objective probabilities.

In addition to being used to cast light on the sociology of Bayesians, and on what they "really think", the theorems of the previous section can be used the same way with some non-Bayesians. For example, although couched purely in terms of conventional probability theory, the theorems of the previous section provide a suggestion of what those who don't like probability theory (and instead prefer Dempster-Shafer theory, non-monotonic logics, fuzzy logic, etc.) are "getting at".

If one knows P(t) exactly, then according to theorem one, there's no room for discussion - Bayesian techniques incorporating one's knowledge of P(t) into one's P(g | d) will always win, on average (assuming we also know the likelihood). However let's say we still have our Bayesian hats on, but now only have limited information concerning P(t). Since our knowledge of P(t) is not complete, we will inevitably be off a bit in the guess we make for P(t) and then incorporate into our P(g | d). This means we will be guessing sub-optimally.

Accordingly, as discussed above, when we are off in our guess for P(t), it's possible that a non-Bayesian might beat us, on average. Thus there is a correlation between how much we know about P(t) and how confident we are that a Bayesian technique using our guess for P(t) is necessarily superior to a particular non-Bayesian technique. This can be viewed as introducing a distinction between an assumption for a probability and one's "confidence" in that assumption. One's confidence in the superiority of one's statistical algorithm depends not only on probabilities but also on "confidence" in those probabilities. It is interesting to speculate that this might be what non-probabilists are getting at with notions like "plausibility vs. probability".7
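This trade-off is easy to reproduce in a toy simulation (my own illustration, not part of the paper's formal results; the conjugate normal model, sample size, and function names below are all assumptions made for the sketch). A truth θ is drawn from the "correct" P(t) = N(0, 1), and we observe n unit-variance samples of it. A Bayesian posterior mean built from the correct prior beats the non-Bayesian maximum-likelihood guess on average, while the same Bayesian machinery fed a badly mis-specified prior loses to it:

```python
import numpy as np

def risks(prior_mean, n=5, trials=20000, seed=0):
    """Monte Carlo risk (mean squared error) of the MLE and of a
    Bayesian posterior-mean guess, averaged over draws from the true P(t)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, trials)          # truths t drawn from the "correct" P(t) = N(0, 1)
    xbar = rng.normal(theta, 1.0 / np.sqrt(n))    # sample mean of n unit-variance observations
    mle = xbar                                    # non-Bayesian guess: maximum likelihood
    bayes = (n * xbar + prior_mean) / (n + 1)     # posterior mean under the prior N(prior_mean, 1)
    return np.mean((mle - theta) ** 2), np.mean((bayes - theta) ** 2)

mle_risk, bayes_good = risks(prior_mean=0.0)   # assumed prior matches the true P(t)
_, bayes_bad = risks(prior_mean=10.0)          # assumed prior badly mis-specified
```

Analytically (standard normal-normal conjugacy), the MLE's average squared error is 1/n and the correct-prior Bayes risk is the smaller 1/(n + 1), while the mis-specified prior pays an additional squared-bias term that swamps both; the simulation reproduces this ordering.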

Acknowledgments

This work was supported in part by NLM grant F37 LMOOO11-01. I would like to thank Wray Buntine, David Rosen and Cullen Schaffer for helpful conversations.

Footnotes

1. For the purposes of this paper there is no reason to specify whether the notation "P(·)" refers to a probability, a density function, or some other similar object. In general though, it usually refers to a probability.

2. Note that I am here comparing the reasonableness of a non-Bayesian argument for setting a posterior for one specific d, to the reasonableness of an argument for setting a prior. I am not comparing the reasonableness of that non-Bayesian argument to the (indisputable) "reasonableness" of Bayes' theorem. Although the non-Bayesian vs. Bayesian distinction is usually viewed in terms of whether or not to ignore Bayes' theorem, that is not the issue here.

3. Strictly speaking, there is almost always a fifth quantity as well, "prior knowledge". Prior knowledge is an extremely amorphous object, which is quite difficult to quantify in any broad and yet meaningful way. Because of this, and since it is peripheral to the main thrust of this paper, I will not directly concern myself with prior knowledge here; it is implicit rather than explicit in all that is said.

4. Vapnik's result means that if you generate many classification problems, each having a data set of size m, and compare the true cost of the classification one comes up with to the "empirical" cost on the data set, you find that his bound on the difference applies. The Bayesian result also concerns itself with this very situation, just modified slightly: of all those classification problems, only look at those in which the data set is d, the data set actually before us. Of course, this modification suffices to make the prior P(t) crucial, and therefore invalidates Vapnik's bounds. (As an aside, one


should note that a different modification, of only looking at those problems in which the empirical cost is c, also invalidates Vapnik's results. In other words, although often used as though it refers to the distribution P(|c - s| > ε | c, m), Vapnik's result does not.)

5. It's not only conventional Bayesian analysis which comes up short as far as understanding Vapnik's result is concerned. The field of "Vapnikology" (as J. Denker calls it) also comes up short, in that the (vast) majority of its practitioners do not realize that there is an issue concerning what conditional distribution one wishes to examine, and that this choice carries important implications. Indeed, Vapnik himself never explicitly specifies what goes on the right-hand side of the conditioning bar; one must infer what goes there from how he uses probability theory.

6. Note that assigning a "degree of belief" to a proposition and making an assumption for the probability of that proposition (in the gambling-against-the-house scenario) are very similar things. Both are subjective declarations concerning how reasonable the researcher thinks the proposition is. This similarity is, I suspect, the reason why people have confused the two concepts so easily. However one should note that there is an important distinction between the two: in saying what one's degree of belief is in a proposition, there exists an implicit notion that, so long as one is self-consistent and not irrational, one is tautologically correct. In essence, one is simply stating how one feels, and by definition, that's whatever you say it is. By definition though, there is no such notion of tautological correctness to making an assumption for something. Assumptions can be right and they can be wrong. It is the claim of this paper that P(t) is something one can assume - and in particular something one can assume incorrectly - as opposed to something one can simply declare.

7. It should be noted that I'm speaking somewhat loosely here. For example, I have been careful not to define "confidence" formally. In particular, I do NOT want to equate confidence in a probability with probability of a probability. For such a scenario, one can simply marginalize over one's "confidence" to get a new, exact P(t). All I'm trying to do here is provide an intuitive view of theorems 1 through 3 which may be viewed as what non-probabilists are "getting at".
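The marginalization in footnote 7 can be written out explicitly (the hyperparameter α is my notation; the paper does not introduce one): if "confidence" were modeled as a distribution P(α) over a family of candidate priors P(t | α), it would simply collapse into a single exact prior,

```latex
P(t) \;=\; \int P(t \mid \alpha)\, P(\alpha)\, d\alpha ,
```

so whatever the extra notion of "confidence" is meant to capture, it cannot merely be a probability of a probability.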

References

Berger, J. (1985). Statistical decision theory and Bayesian analysis. Springer-Verlag.

Breiman, L. (1992). Stacked regression. University of California, Berkeley, Dept. of Statistics TR-92-367.

Gull, S.F. (1989). Developments in maximum entropy data analysis. In "Maximum-entropy and Bayesian methods", J. Skilling (Ed.). Kluwer Academic Publishers.

MacKay, D.J.C. (1992). Bayesian interpolation; A practical framework for backpropagation networks. Neural Computation, 4, 415 and 448.

MacKay, D.J.C. (1993). Bayesian non-linear modeling for the energy prediction competition. In preparation.

Strauss, C.E.M., Wolpert, D.H., Wolf, D.R. (1993). Alpha, evidence, and the entropic prior. In "Maximum-entropy and Bayesian methods", A. Mohammed-Djafari (Ed.). Kluwer Academic Publishers. In press.

Tishby, N., Levin, E., and Solla, S. (1989). Consistent inference of probabilities in layered networks: predictions and generalization. Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. II 403-409, IEEE.

Vapnik, V. (1982). Estimation of dependences based on empirical data. Springer-Verlag.

Wolpert, D.H. (1992). On the connection between in-sample testing and generalization error. Complex Systems, 6, 47-94.

Wolpert, D.H., Strauss, C.E., Wolf, D.R. (1993a). What Bayes has to say about the evidence procedure. These proceedings.

Wolpert, D.H. (1993b). On overfitting avoidance as bias. SFI TR 92-03-016. Submitted.