
Statistical Science
2014, Vol. 29, No. 2, 227–239
DOI: 10.1214/13-STS457
© Institute of Mathematical Statistics, 2014

On the Birnbaum Argument for the Strong Likelihood Principle¹

Deborah G. Mayo

Abstract. An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes x∗ and y∗ from experiments E1 and E2 (both with unknown parameter θ) have different probability models f1(·), f2(·), then even though f1(x∗; θ) = cf2(y∗; θ) for all θ, outcomes x∗ and y∗ may have different implications for an inference about θ. Although such violations stem from considering outcomes other than the one observed, we argue this does not require us to consider experiments other than the one performed to produce the data. David Cox [Ann. Math. Statist. 29 (1958) 357–372] proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of Ei. The surprising upshot of Allan Birnbaum’s [J. Amer. Statist. Assoc. 57 (1962) 269–306] argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP) entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases also refute [WCP entails SLP].

Key words and phrases: Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality.

1. INTRODUCTION

It is easy to see why Birnbaum’s argument for the strong likelihood principle (SLP) has long been held as a significant, if controversial, result for the foundations of statistics. Not only do all of the familiar frequentist error-probability notions, p-values, significance levels and so on violate the SLP, but the Birnbaum argument purports to show that the SLP follows from principles that frequentist sampling theorists accept:

Deborah G. Mayo is Professor of Philosophy, Department of Philosophy, Virginia Tech, 235 Major Williams Hall, Blacksburg, Virginia 24061, USA (e-mail: [email protected]).

¹Discussed in 10.1214/14-STS470, 10.1214/14-STS471, 10.1214/14-STS472, 10.1214/14-STS473, 10.1214/14-STS474 and 10.1214/14-STS475; rejoinder at 10.1214/14-STS482.

The likelihood principle is incompatible with the main body of modern statistical theory and practice, notably the Neyman–Pearson theory of hypothesis testing and of confidence intervals, and incompatible in general even with such well-known concepts as standard error of an estimate and significance level. [Birnbaum (1968), page 300.]

The incompatibility, in a nutshell, is that on the SLP, once the data x are given, outcomes other than x are irrelevant to the evidential import of x. “[I]t is clear that reporting significance levels violates the LP [SLP], since significance levels involve averaging over sample points other than just the observed x.” [Berger and Wolpert (1988), page 105.]


1.1 The SLP and a Frequentist Principle of Evidence (FEV)

Birnbaum, while responsible for this famous argument, rejected the SLP because “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations” [Birnbaum (1969), page 128]. That is, he thought the SLP at odds with a fundamental frequentist principle of evidence.

Frequentist Principle of Evidence (general): Drawing inferences from data requires considering the relevant error probabilities associated with the underlying data generating process.

David Cox intended the central principle invoked in Birnbaum’s argument, the Weak Conditionality Principle (WCP), as one important way to justify restricting the space of repetitions that are relevant for informative inference. Implicit in this goal is that the role of the sampling distribution for informative inference is not merely to ensure low error rates in repeated applications of a method, but to avoid misleading inferences in the case at hand [Mayo (1996); Mayo and Spanos (2006, 2011); Mayo and Cox (2010)].

To refer to the most familiar example, the WCP says that if a parameter of interest θ could be measured by two instruments, one more precise than the other, and a randomizer that is utterly irrelevant to θ is used to decide which instrument to use, then, once it is known which experiment was run and its outcome given, the inference should be assessed using the behavior of the instrument actually used. The convex combination of the two instruments, linked via the randomizer, defines a mixture experiment, Emix. According to the WCP, one should condition on the known experiment, even if an unconditional assessment improves the long-run performance [Cox and Hinkley (1974), pages 96–97].

While conditioning on the instrument actually used seems obviously correct, nothing precludes the Neyman–Pearson theory from choosing the procedure “which is best on the average over both experiments” in Emix [Lehmann and Romano (2005), page 394]. They ask the following: “for a given test or confidence procedure, should probabilities such as level, power, and confidence coefficient be calculated conditionally, given the experiment that has been selected, or unconditionally?” They suggest that “[t]he answer cannot be found within the model but depends on the context” (ibid). The WCP gives a rationale for using the conditional appraisal in the context of informative parametric inference.

1.2 What Must Logically Be Shown

However, the upshot of the SLP is to claim that the sampling theorist must go all the way, as it were, given a parametric model. If she restricts attention to the experiment producing the data in the mixture experiment, then she is led to consider just the data and not the sample space, once the data are in hand. While the argument has been stated in various forms, the surprising upshot of all versions is that the SLP appears to follow from applying the WCP in the case of mixture experiments, and so uncontroversial a notion as sufficiency (SP). “Within the context of what can be called classical frequency-based statistical inference, Birnbaum (1962) argued that the conditionality and sufficiency principles imply the [strong] likelihood principle” [Evans, Fraser and Monette (1986), page 182].

Since the challenge is for a sampling theorist who holds the WCP, it is obligatory to consider whether and how such a sampling theorist can meet it. While the WCP is not itself a theorem in a formal system, Birnbaum’s argument purports that the following is a theorem:

[(WCP and SP) entails SLP].

If true, any data instantiating both WCP and SP could not also violate the SLP, on pain of logical contradiction. We will show how data may violate the SLP while still adhering to both the WCP and SP. Such cases also refute [WCP entails SLP], making our argument applicable to attempts to weaken or remove the SP. Violating SLP may be written as not-SLP.

We follow the formulations of the Birnbaum argument given in Berger and Wolpert (1988), Birnbaum (1962), Casella and Berger (2002) and Cox (1977). The current analysis clarifies and fills in important gaps of an earlier discussion in Mayo (2010), Mayo and Cox (2011), and lets us cut through a fascinating and complex literature. The puzzle is solved by adequately stating the WCP and keeping the meaning of terms consistent, as they must be in an argument built on a series of identities.

1.3 Does It Matter?

On the face of it, current day uses of sampling theory statistics do not seem in need of going back 50 years to tackle a foundational argument. This may be so, but only if it is correct to assume that the Birnbaum argument is flawed somewhere. Sampling theorists who feel unconvinced by some of the machinations of the argument must admit some discomfort at the lack of resolution of the paradox. If one cannot show the relevance of error probabilities and sampling distributions to inferences once the data are in hand, then the uses of frequentist sampling theory, and resampling methods, for inference purposes rest on shaky foundations.

The SLP is deemed of sufficient importance to be included in textbooks on statistics, along with a version of Birnbaum’s argument that we will consider:

It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. . . . The ‘dilemma’ argument is therefore an illusion. [Cox and Mayo (2010), page 298.]

If we are correct, this refutes a position that is generally presented as settled in current texts. But the illusion is not so easy to dispel, thus this paper.

Perhaps, too, our discussion will illuminate a point of agreement between sampling theorists and contemporary nonsubjective Bayesians who concede they “have to live with some violations of the likelihood and stopping rule principles” [Ghosh, Delampady and Samanta (2006), page 148], since their prior probability distributions are influenced by the sampling distribution. “This, of course, does not happen with subjective Bayesianism. . . . the objective Bayesian responds that objectivity can only be defined relative to a frame of reference, and this frame needs to include the goal of the analysis.” [Berger (2006), page 394.] By contrast, Savage stressed:

According to Bayes’s theorem, P(x|θ) . . . constitutes the entire evidence of the experiment . . . [I]f y is the datum of some other experiment, and if it happens that P(x|θ) and P(y|θ) are proportional functions of θ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the value of θ. [Savage (1962a), page 17, using θ for his λ and P for Pr.]

2. NOTATION AND SKETCH OF BIRNBAUM’S ARGUMENT

2.1 Points of Notation and Interpretation

Birnbaum focuses on informative inference about a parameter θ in a given model M, and we retain that context. The argument calls for a general term to abbreviate: the inference implication from experiment E and result z, where E is an experiment involving the observation of Z with a given distribution f(z; θ) and a model M. We use the following:

InfrE[z]: the parametric statistical inference from a given or known (E, z).

(We prefer “given” to “known” to avoid reference to psychology.) We assume relevant features of model M are embedded in the full statement of experiment E. An inference method indicates how to compute the informative parametric inference from (E, z). Let

(E, z) ⇒ InfrE[z]: an informative parametric inference about θ from given (E, z) is to be computed by means of InfrE[z].

The principles of interest turn on cases where (E, z) is given, and we reserve “⇒” for such cases. The abbreviation InfrE[z], first developed in Cox and Mayo (2010), could allude to any parametric inference account; we use it here to allow ready identification of the particular experiment E and its associated sampling distribution, whatever it happens to be. InfrEmix[z] is always understood as using the convex combination over the elements of the mixture.

Assertions about how inference “is to be computed given (E, z)” are intended to reflect the principles of evidence that arise in Birnbaum’s argument, whether mathematical or based on intuitive, philosophical considerations about evidence. This is important because Birnbaum emphasizes that the WCP is “not necessary on mathematical grounds alone, but it seems to be supported compellingly by considerations . . . concerning the nature of evidential meaning” of data when drawing parametric statistical inferences [Birnbaum (1962), page 280]. In using “=” we follow the common notation even though WCP is actually telling us when z1 and z2 should be deemed inferentially equivalent for the associated inference.

By noncontradiction, for any (E, z), InfrE[z] = InfrE[z]. So to apply a given inference implication means its inference directive is used and not some competing directive at the same time. Two outcomes z1 and z2 will be said to have the same inference implications in E, and so are inferentially equivalent within E, whenever InfrE[z1] = InfrE[z2].


2.2 The Strong Likelihood Principle: SLP

The principle under dispute, the SLP, asserts the inferential equivalence of outcomes from distinct experiments E1 and E2. It is a universal if-then claim:

SLP: For any two experiments E1 and E2 with different probability models f1(·), f2(·) but with the same unknown parameter θ, if outcomes x∗ and y∗ (from E1 and E2, resp.) give rise to proportional likelihood functions (f1(x∗; θ) = cf2(y∗; θ) for all θ, for c a positive constant), then x∗ and y∗ should be inferentially equivalent for any inference concerning parameter θ.

A shorthand for the entire antecedent is that (E1, x∗) is an SLP pair with (E2, y∗), or just x∗ and y∗ form an SLP pair (from {E1, E2}). Assuming all the SLP stipulations, for example, that θ is a shared parameter (about which inferences are to be concerned), we have the following:

SLP: If (E1, x∗) and (E2, y∗) form an SLP pair, then InfrE1[x∗] = InfrE2[y∗].

Experimental pairs E1 and E2 involve observing random variables X and Y, respectively. Thus, (E2, y∗) or just y∗ asserts “E2 is performed and y∗ observed,” so we may abbreviate InfrE2[(E2, y∗)] as InfrE2[y∗]. Likewise for x∗. A generic z is used when needed.

2.3 Sufficiency Principle (Weak Likelihood Principle)

For informative inference about θ in E, if TE is a (minimal) sufficient statistic for E, the Sufficiency Principle asserts the following:

SP: If TE(z1) = TE(z2), then InfrE[z1] = InfrE[z2].

That is, since inference within the model is to be computed using the value of TE(·) and its sampling distribution, identical values of TE have identical inference implications, within the stipulated model. Nothing in our argument will turn on the minimality requirement, although it is common.

2.3.1 Model checking. An essential part of the statements of the principles SP, WCP and SLP is that the validity of the model is granted as adequately representing the experimental conditions at hand [Birnbaum (1962), page 280]. Thus, accounts that adhere to the SLP are not thereby prevented from analyzing features of the data, such as residuals, in checking the validity of the statistical model itself. There is some ambiguity on this point in Casella and Berger (2002):

Most model checking is, necessarily, based on statistics other than a sufficient statistic. For example, it is common practice to examine residuals from a model. . . Such a practice immediately violates the Sufficiency Principle, since the residuals are not based on sufficient statistics. (Of course such a practice directly violates the [strong] LP also.) [Casella and Berger (2002), pages 295–296.]

They warn that before considering the SLP and WCP, “we must be comfortable with the model” [ibid, page 296]. It seems to us more accurate to regard the principles as inapplicable, rather than violated, when the adequacy of the relevant model is lacking. Applying a principle will always be relative to the associated experimental model.

2.3.2 Can two become one? The SP is sometimes called the weak likelihood principle, limited as it is to a single experiment E, with its sampling distribution. This suggests that if an arbitrary SLP pair, (E1, x∗) and (E2, y∗), could be viewed as resulting from a single experiment (e.g., by a mixture), then perhaps they could become inferentially equivalent using SP. This will be part of Birnbaum’s argument, but is neatly embedded in his larger gambit to which we now turn.

2.4 Birnbaumization: Key Gambit in Birnbaum’s Argument

The larger gambit of Birnbaum’s argument may be dubbed Birnbaumization. An experiment has been run, label it as E2, and y∗ observed. Suppose, for the parametric inference at hand, that y∗ has an SLP pair x∗ in a distinct experiment E1. Birnbaum’s task is to show the two are evidentially equivalent, as the SLP requires.

We are to imagine that performing E2 was the result of flipping a fair coin (or some other randomizer given as irrelevant to θ) to decide whether to run E1 or E2. Cox terms this the “enlarged experiment” [Cox (1978), page 54], EB. We are then to define a statistic TB that stipulates that if (E2, y∗) is observed, its SLP pair x∗ in the unperformed experiment is reported:

TB(Ei, Zi) = (E1, x∗), if (E1, x∗) or (E2, y∗);
TB(Ei, Zi) = (Ei, zi), otherwise.

Birnbaum’s argument focuses on the first case and ours will as well.
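To make the construction concrete, the following is a minimal sketch (not from the paper) of TB viewed as a mapping on labeled outcomes. The labels “E1”, “E2”, “x_star” and “y_star”, and the representation of outcomes as pairs, are illustrative assumptions of the sketch.

```python
# Minimal sketch of Birnbaum's statistic T_B (illustrative, not from the paper).
# Outcomes are (experiment_label, outcome) pairs; x_star and y_star stand for
# a fixed SLP pair from E1 and E2, respectively.

SLP_PAIR = {("E1", "x_star"), ("E2", "y_star")}

def T_B(experiment, outcome):
    """Report (E1, x_star) whenever either member of the SLP pair occurs;
    report the outcome unchanged otherwise."""
    if (experiment, outcome) in SLP_PAIR:
        return ("E1", "x_star")
    return (experiment, outcome)

# Both members of the SLP pair map to the same value of T_B, so within the
# constructed experiment E_B they receive the same inference implication
# (claim [B] below); T_B erases which experiment actually produced the data.
print(T_B("E2", "y_star"))  # ('E1', 'x_star')
print(T_B("E1", "x_star"))  # ('E1', 'x_star')
print(T_B("E2", "other"))   # ('E2', 'other')
```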


Following our simplifying notation, whenever E2 is performed and Y = y∗ observed, and y∗ is seen to admit an SLP pair, then label its particular SLP pair (E1, x∗). Any problems of nonuniqueness in identifying SLP pairs are put to one side, and Birnbaum does not consider them. Thus, when (E2, y∗) is observed, TB reports it as (E1, x∗). This yields the Birnbaum experiment, EB, with its statistic TB. We abbreviate the inference (about θ) in EB as

InfrEB[y∗].

The inference implication (about θ) in EB from y∗ under Birnbaumization is

(E2, y∗) ⇒ InfrEB[x∗],

where the computation in EB is always a convex combination over E1 and E2. But also,

(E1, x∗) ⇒ InfrEB[x∗].

It follows that, within EB, x∗ and y∗ are inferentially equivalent. Call this claim

[B]: InfrEB[x∗] = InfrEB[y∗].

The argument is to hold for any SLP pair. Now [B] does not yet reach the SLP, which requires

InfrE1[x∗] = InfrE2[y∗].

But Birnbaum does not stop there. Having constructed the hypothetical experiment EB, we are to use the WCP to condition back down to the known experiment E2. But this will not produce the SLP, as we now show.

2.5 Why Appeal to Hypothetical Mixtures?

Before turning to that, we address a possible query: why suppose the argument makes any appeal to a hypothetical mixture? (See also Section 5.1.) The reason is this: The SLP does not refer to mixtures. It is a universal generalization claiming to hold for an arbitrary SLP pair. But we have no objection to imagining [as Birnbaum does (1962), page 284] a universe of all of the possible SLP pairs, where each pair has resulted from a θ-irrelevant randomizer (for the given context). Then, when y∗ is observed, we pluck the relevant pair and construct TB. Our question is this: why should the inference implication from y∗ be obtained by reference to InfrEB[y∗], the convex combination? Birnbaum does not stop at [B], but appeals to the WCP. Note the WCP is based on the outcome y∗ being given.

3. SLP VIOLATION PAIRS

Birnbaum’s argument is of central interest when we have SLP violations. We may characterize an SLP violation as any inferential context where the antecedent of the SLP is true and the consequent is false:

SLP violation: (E1, x∗) and (E2, y∗) form an SLP pair, but InfrE1[x∗] ≠ InfrE2[y∗].

An SLP pair that violates the SLP will be called an SLP violation pair (from E1, E2, resp.).

It is not always emphasized that whether (and how) an inference method violates the SLP depends on the type of inference to be made, even within an account that allows SLP violations. One cannot just look at the data, but must also consider the inference. For example, there may be no SLP violation if the focus is on point against point hypotheses, whereas in computing a statistical significance probability under a null hypothesis there may be. “Significance testing of a hypothesis. . . is viewed by many as a crucial element of statistics, yet it provides a startling and practically serious example of conflict with the [SLP].” [Berger and Wolpert (1988), pages 104–105.] The following is a dramatic example that often arises in this context.

3.1 Fixed versus Sequential Sampling

Suppose X and Y are samples from distinct experiments E1 and E2, both distributed as N(θ, σ²), with σ² identical and known, and p-values are to be calculated for the null hypothesis H0: θ = 0 against H1: θ ≠ 0. In E2 the sampling rule is to continue sampling until ȳn > cα = 1.96σ/√n, where ȳn = (y1 + · · · + yn)/n. In E1, the sample size n is fixed and α = 0.05.

In order to arrive at the SLP pair, we have to consider the particular outcome observed. Suppose that E2 is run and is first able to stop with n = 169 trials. Denote this result as y∗. A choice for its SLP pair x∗ would be (E1, 1.96σ/√169), and the SLP violation is the fact that the p-values associated with x∗ and y∗ differ.
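To see the violation numerically: with the same observed mean ȳn = 1.96σ/√169 at n = 169, both likelihood functions are proportional as functions of θ (each is proportional to exp{−n(ȳn − θ)²/(2σ²)}), yet the sampling-theory assessments differ. The sketch below is illustrative and not from the paper; σ = 1, the cap at n = 169 and the number of replications are assumptions, and the simulated quantity, the probability under H0 that the stopping rule succeeds by n = 169, is just one simple assessment that respects the stopping rule.

```python
# Sketch: fixed-n p-value for x* versus a stopping-rule-respecting assessment
# for y* (illustrative assumptions: sigma = 1, cap at n = 169, 20,000 replications).
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 169
ybar = 1.96 * sigma / np.sqrt(n)          # observed mean shared by x* and y*

# E1 (fixed n = 169): two-sided p-value for H0: theta = 0.
p_fixed = 2 * (1 - norm.cdf(np.sqrt(n) * ybar / sigma))      # about 0.05

# E2 (optional stopping): under H0, how often does the rule
# "stop as soon as the running mean exceeds 1.96*sigma/sqrt(i)" succeed by n = 169?
rng = np.random.default_rng(0)
reps = 20_000
hits = 0
for _ in range(reps):
    s = 0.0
    for i in range(1, n + 1):
        s += rng.normal(0.0, sigma)
        if s / i > 1.96 * sigma / np.sqrt(i):
            hits += 1
            break
p_stop = hits / reps                       # well above 0.05

print(f"fixed-n p-value for x*:             {p_fixed:.3f}")
print(f"P(rule stops by n = 169 | theta=0): {p_stop:.3f}")
```

The nominal fixed-n p-value (about 0.05) and the stopping-rule-respecting assessment disagree even though the likelihood functions from x∗ and y∗ are proportional, which is exactly the SLP violation described above.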

3.2 Frequentist Evidence in the Case of Significance Tests

“[S]topping ‘when the data looks good’ can be a serious error when combined with frequentist measures of evidence. For instance, if one used the stopping rule [above]. . . but analyzed the data as if a fixed sample had been taken, one could guarantee arbitrarily strong frequentist ‘significance’ against H0 . . . .” [Berger and Wolpert (1988), page 77.]

From their perspective, the problem is with the use of frequentist significance. For a detailed discussion in favor of the irrelevance of this stopping rule, see Berger and Wolpert (1988), pages 74–88. For sampling theorists, by contrast, this example “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” [Cox (1978), page 54], since, with probability 1, it will stop with a ‘nominally’ significant result even though θ = 0. It contradicts what Cox and Hinkley call “the weak repeated sampling principle” [Cox and Hinkley (1974), page 51]. More generally, the frequentist principle of evidence (FEV) would regard small p-values as misleading if they result from a procedure that readily generates small p-values under H0.²

For the sampling theorist, to report a 1.96 standard deviation difference known to have come from optional stopping, just the same as if the sample size had been fixed, is to discard relevant information for inferring inconsistency with the null, while “according to any approach that is in accord with the strong likelihood principle, the fact that this particular stopping rule has been used is irrelevant.” [Cox and Hinkley (1974), page 51.]³ The actual p-value will depend of course on when it stops. We emphasize that our argument does not turn on accepting a frequentist principle of evidence (FEV), but these considerations are useful both to motivate and understand the core principle of Birnbaum’s argument, the WCP.

4. THE WEAK CONDITIONALITY PRINCIPLE (WCP)

From Section 2.4 we have [B] InfrEB[x∗] = InfrEB[y∗], since the inference implication is by the constructed TB. How might Birnbaum move from [B] to the SLP, for an arbitrary pair x∗ and y∗?

There are two possibilities. One would be to insist informative inference ignore or be insensitive to sampling distributions. But since we know that SLP violations result because of the difference in sampling distributions, to simply deny them would obviously render his argument circular (or else irrelevant for sampling theory). We assume Birnbaum does not intend his argument to be circular and Birnbaum relies on further steps to which we now turn.

²Mayo and Cox (2010), page 254: FEV: y is (strong) evidence against H0, if and only if, were H0 a correct description of the mechanism generating y, then, with high probability this would have resulted in a less discordant result than is exemplified by y.

³Analogous situations occur without optional stopping, as with selecting a data-dependent, maximally likely, alternative [Cox and Hinkley (1974), Example 2.4.1, page 51]. See also Mayo and Kruse (2001).

4.1 Mixture (Emix): Two Instruments of Different Precisions [Cox (1958)]

The crucial principle of inference on which Birnbaum’s argument rests is the weak conditionality principle (WCP), intended to indicate the relevant sampling distribution in the case of certain mixture experiments. The famous example to which we already alluded “is now usually called the ‘weighing machine example,’ which draws attention to the need for conditioning, at least in certain types of problems” [Reid (1992), page 582].

We flip a fair coin to decide which of two instruments, E1 or E2, to use in observing a Normally distributed random sample Z to make inferences about mean θ. E1 has a variance of 1, while that of E2 is 10⁶. We limit ourselves to mixtures of two experiments.

In testing a null hypothesis such as θ = 0, the same z measurement would correspond to a much smaller p-value were it to have come from E1 rather than from E2: denote them as p1(z) and p2(z), respectively. The overall (or unconditional) significance level of the mixture Emix is the convex combination of the p-values: [p1(z) + p2(z)]/2. This would give a misleading report of how precise or stringent the actual experimental measurement is [Cox and Mayo (2010), page 296]. [See Example 4.6, Cox and Hinkley (1974), pages 95–96; Birnbaum (1962), page 280.]

Suppose that we know we have observed a measurement from E2 with its much larger variance:

The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance]. [Cox (1958), page 361.]

The WCP says simply: once it is known which Ei has produced z, the p-value or other inferential assessment should be made with reference to the experiment actually run.
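The contrast between the conditional and unconditional assessments can be put in numbers. The following is a minimal sketch (not from the paper); the particular observed value z and the use of a two-sided test are illustrative assumptions.

```python
# Sketch of Cox's two-instrument example (illustrative values, not from the paper).
from scipy.stats import norm

sigma1, sigma2 = 1.0, 1000.0   # E1 precise; E2 has variance 10**6
z = 2.0                        # a measurement known to have come from E2

def p_value(z, sigma):
    """Two-sided p-value for H0: theta = 0 from one N(theta, sigma^2) observation."""
    return 2 * (1 - norm.cdf(abs(z) / sigma))

p1, p2 = p_value(z, sigma1), p_value(z, sigma2)
p_mix = (p1 + p2) / 2          # unconditional: convex combination over E1 and E2

print(f"p2, conditional on E2 (per WCP): {p2:.3f}")    # close to 1: unremarkable for E2
print(f"p1, had z come from E1:          {p1:.3f}")    # about 0.046
print(f"unconditional mixture p-value:   {p_mix:.3f}") # about 0.52
```

On the WCP, the report for a measurement known to come from E2 is p2(z), not the convex combination.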


4.2 Weak Conditionality Principle (WCP) in the Weighing Machine Example

We first state the WCP in relation to this example. We are given (Emix, zi), that is, (Ei, zi) results from mixture experiment Emix. WCP exhorts us to condition to be relevant to the experiment actually producing the outcome. This is an example of what Cox terms “conditioning for relevance.”

WCP: Given (Emix, zi), condition on the Ei producing the result:

(Emix, zi) ⇒ InfrEi[(Emix, zi)] = pi(z) = InfrEi[zi].

Do not use the unconditional formulation

(Emix, zi) ⇏ InfrEmix[(Emix, zi)] = [p1(z) + p2(z)]/2.

The concern is that

InfrEmix[(Emix, zi)] = [p1(z) + p2(z)]/2 ≠ pi(z).

There are three sampling distributions, and the WCP says the relevant one to use, whenever InfrEmix[zi] ≠ InfrEi[zi], is the one known to have generated the result [Birnbaum (1962), page 280]. In other cases the WCP would make no difference.

4.3 The WCP and Its Corollaries

We can give a general statement of the WCP as follows:

A mixture Emix selects between E1 and E2, using a θ-irrelevant process, and it is given that (Ei, zi) results, i = 1, 2. WCP directs the inference implication. Knowing we are mapping an outcome from a mixture, there is no need to repeat the first component of (Emix, zi), so it is dropped except when a reminder seems useful:

(i) Condition to obtain relevance:

(Emix, zi) ⇒ InfrEi[(Emix, zi)] = InfrEi[zi].

In words, zi arose from Emix but the inference implication is based on Ei.

(ii) Eschew unconditional formulations:

(Emix, zi) ⇏ InfrEmix[zi],

whenever the unconditional treatment yields a different inference implication, that is, whenever InfrEmix[zi] ≠ InfrEi[zi].

NOTE. InfrEmix[zi], which abbreviates InfrEmix[(Emix, zi)], asserts that the inference implication uses the convex combination of the relevant pair of experiments.

We now highlight some points for reference.

4.3.1 WCP makes a difference. The cases of interest here are where applying WCP would alter the unconditional implication. In these cases WCP makes a difference.

Note that (ii) blocks computing the inference implication from (Emix, zi) as InfrEmix[zi], whenever InfrEmix[zi] ≠ InfrEi[zi] for i = 1, 2. Here E1, E2 and Emix would correspond to three sampling distributions.

WCP requires the experiment and its outcome to be given or known: If it is given only that z came from E1 or E2, and not which, then WCP does not authorize (i). In fact, we would wish to block such an inference implication. For instance,

(E1 or E2, z) ⇏ InfrE1[z].

Point on notation: The use of “⇒” is for a given outcome. We may allow it to be used without ambiguity when only a disjunction is given, because while E1 entails (E1 or E2), the converse does not hold. So no erroneous substitution into an inference implication would follow.

4.3.2 Irrelevant augmentation: Keep irrelevant facts irrelevant (Irrel). Another way to view the WCP is to see it as exhorting us to keep what is irrelevant to the sampling behavior of the experiment performed irrelevant (to the inference implication). Consider Birnbaum’s (1969), page 119, idea that a “trivial” but harmless addition to any given experimental result z might be to toss a fair coin and augment z with a report of heads or tails (where this is irrelevant to the original model). Note the similarity to attempts to get an exact significance level in discrete tests, by allowing borderline outcomes to be declared significant or not (at the given level) according to the outcome of a coin toss. The WCP, of course, eschews this. But there is a crucial ambiguity to avoid. It is a harmless addition only if it remains harmless to the inference implication. If it is allowed to alter the test result, it is scarcely harmless.

A holder of the WCP may stipulate that a given zi can always be augmented with the result of a θ-irrelevant randomizer, provided that it remains irrelevant to the inference implication about θ in Ei. We can abbreviate this irrelevant augmentation of a given result zi as a conjunction, (Ei & Irrel):

(Irrel): InfrEi[(Ei & Irrel, zi)] = InfrEi[zi], i = 1, 2.

We illuminate this in the next subsection.

4.3.3 Is the WCP an equivalence? “It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1962 paper, to the monster of the likelihood axiom” [Birnbaum (1975), page 263]. He admits the contrast with “the one-sided form to which applications” had been restricted [Birnbaum (1969), page 139, note 11]. The question of whether the WCP is a proper equivalence relation, holding in both directions, is one of the most central issues in the argument. But what would be alleged to be equivalent?

Obviously not the unconditional and the conditional inference implications: the WCP makes a difference just when they are inequivalent, that is, when InfrEmix[zi] ≠ InfrEi[zi]. Our answer is that the WCP involves an inequivalence as well as an equivalence. The WCP prescribes conditioning on the experiment known to have produced the data, and not the other way around. It is their inequivalence that gives Cox’s WCP its normative proscriptive force. To assume the WCP identifies InfrEmix[zi] and InfrEi[zi] leads to trouble. (We return to this in Section 7.)

However, there is an equivalence in WCP (i). Further, once the outcome is given, the addition of θ-irrelevant features about the selection of the experiment performed is to remain irrelevant to the inference implication:

InfrEi[(Emix, zi)] = InfrEi[(Ei & Irrel, zi)].

Both are the same as InfrEi[zi]. While claiming that z came from a mixture, even knowing it came from a nonmixture, may seem unsettling, we grant it for purposes of making out Birnbaum’s argument. By (Irrel), it cannot alter the inference implication under Ei.

5. BIRNBAUM’S SLP ARGUMENT

5.1 Birnbaumization and the WCP

What does the WCP entail as regards Birnbaumization? Now WCP refers to mixtures, but is the Birnbaum experiment EB a mixture experiment? Not really. One cannot perform the following: Toss a fair coin (or other θ-irrelevant randomizer). If it lands heads, perform an experiment E2 that yields a member of an SLP pair y∗; if tails, observe an experiment that yields the other member of the SLP pair x∗. We do not know what outcome would have resulted from the unperformed experiment, much less that it would be an outcome with a proportional likelihood to the observed y∗. There is a single experiment, and it is stipulated we know which and what its outcome was. Some have described the Birnbaum experiment as unperformable, or at most a “mathematical mixture” rather than an “experimental mixture” [Kalbfleisch (1975), pages 252–253]. Birnbaum himself calls it a “hypothetical” mixture [Birnbaum (1962), page 284].

While a holder of the WCP may simply deny its general applicability in hypothetical experiments, given that Birnbaum’s argument has stood for over fifty years, we wish to give it maximal mileage. Birnbaumization may be “performed” in the sense that TB can be defined for any SLP pair x∗, y∗. Refer back to the hypothetical universe of SLP pairs, each imagined to have been generated from a θ-irrelevant mixture (Section 2.5). When we observe y∗ we pluck the x∗ companion needed for the argument. In short, we can Birnbaumize an experimental result: Constructing statistic TB with the derived experiment EB is the “performance.” But what cannot shift in the argument is the stipulation that Ei be given or known (as noted in Section 4.3.1), that i be fixed. Nor can the meaning of “given z∗” shift through the argument, if it is to be sound.

Given z∗, the WCP precludes Birnbaumizing. On the other hand, if the reported z∗ was the value of TB, then we are given only the disjunction, precluding the computation relevant for i fixed (Section 4.3.1). Let us consider the components of Birnbaum’s argument.

5.2 Birnbaum’s Argument

(E2, y∗) is given (and it has an SLP pair x∗). The question is as to its inferential import. Birnbaum will seek to show that

InfrE2[y∗] = InfrE1[x∗].

The value of TB is (E1, x∗). Birnbaumization maps outcomes into hypothetical mixtures EB:

(1) If the inference implication is by the stipulations of EB,

(E2, y∗) ⇒ InfrEB[x∗] = InfrEB[y∗].

Likewise for (E1, x∗). TB is a sufficient statistic for EB (the conditional distribution of Z given TB is independent of θ).


(2) If the inference implication is by WCP,

(E2, y∗) ⇏ InfrEB[y∗],

rather

(E2, y∗) ⇒ InfrE2[y∗]

and

(E1, x∗) ⇒ InfrE1[x∗].

Following the inference implication according to EB in (1) is at odds with what the WCP stipulates in (2). Given y∗, Birnbaumization directs using the convex combination over the components of TB; WCP eschews doing so. We will not get

InfrE1[x∗] = InfrE2[y∗].

The SLP only seems to follow by the erroneous identity:

InfrEB[zi∗] = InfrEi[zi∗] for i = 1, 2.

5.3 Refuting the Supposition that [(SP and WCP) entails SLP]

We can uphold both (1) and (2), while at the same time holding the following:

(3) InfrE1[x∗] ≠ InfrE2[y∗].

Specifically, any case where x∗ and y∗ are an SLP violation pair is a case where (3) is true. Since whenever (3) holds we have a counterexample to the SLP generalization, this demonstrates that SP and WCP and not-SLP are logically consistent. Thus, so are WCP and not-SLP. This refutes the supposition that [(SP and WCP) entails SLP] and also any purported derivation of SLP from WCP alone.⁴

SP is not blocked in (1). The SP is always relative to a model, here EB. We have the following:

x∗ and y∗ are SLP pairs in EB, and InfrEB[x∗] = InfrEB[y∗] (i.e., [B] holds).

One may allow different contexts to dictate whether or not to condition [i.e., whether to apply (1) or (2)], but we know of no inference account that permits, let alone requires, self-contradictions. By noncontradiction, for any (E, z), InfrE[z] = InfrE[z]. (“⇒” is a function from outcomes to inference implications, and z = z, for any z.)

⁴By allowing applications of Birnbaumization and appropriate choices of the irrelevant randomization probabilities, SP can be weakened to “mathematical equivalence,” or even (with compounded mixtures) omitted so that WCP would entail SLP. See Birnbaum (1972) and Evans, Fraser and Monette (1986).

Upholding and applying. This recalls our points in Section 2.1. Applying a rule means following its inference directive. We may uphold the if-then stipulations in (1) and (2), but to apply their competing implications in a single case is self-contradictory.

Arguing from a self-contradiction is unsound. The slogan that anything follows from a self-contradiction G and not-G is true, since for any claim C, the following is a logical truth: If G then (if not-G then C). Two applications of modus ponens yield C. One can also derive not-C! But since G and its denial cannot be simultaneously true, any such argument is unsound. (A sound argument must have true premises and be logically valid.) We know Birnbaum was not intending to argue from a self-contradiction, but this may inadvertently occur.

5.4 What if the SLP Pair Arose from an Actual Mixture?

What if the SLP pair x∗, y∗ arose from a genuine, and not a Birnbaumized, mixture? (Consider fixed versus sequential sampling, Section 3.1. Suppose E1 fixes n at 169, the coin flip says perform E2, and it happens to stop at n = 169.) We may allow that an unconditional formulation may be defined so that

InfrEmix[x∗] = InfrEmix[y∗].

But WCP eschews the unconditional formulation; it says condition on the experiment known to have produced zi:

(Emix, zi∗) ⇒ InfrEi[zi∗], i = 1, 2.

Any SLP violation pair x∗, y∗ remains one: InfrE1[x∗] ≠ InfrE2[y∗].

6. DISCUSSION

We think a fresh look at this venerable argument is warranted. Wearing a logician’s spectacles and entering the debate outside of the thorny issues from decades ago may be an advantage.

It must be remembered that the onus is not on someone who questions if the SLP follows from SP and WCP to provide suitable principles of evidence, however desirable it might be to have them. The onus is on Birnbaum to show that for any given y∗, a member of an SLP pair with x∗, with different probability models f1(·), f2(·), he will be able to derive from SP and WCP that x∗ and y∗ would have the identical inference implications concerning shared parameter θ. We have shown that SLP violations do not entail renouncing either the SP or the WCP.

It is no rescue of Birnbaum’s argument that a sampling theorist wants principles in addition to the WCP to direct the relevant sampling distribution for inference; indeed, Cox has given others. It was to make the application of the WCP in his argument as plausible as possible to sampling theorists that Birnbaum begins with the type of mixture in Cox’s (1958) famous example of instruments E1, E2 with different precisions.

We do not assume sampling theory, but employ a formulation that avoids ruling it out in advance. The failure of Birnbaum’s argument to reach the SLP relies only on a correct understanding of the WCP. We may grant that for any y∗ its SLP pair could occur in repetitions (and may even be out there, as in Section 2.5). However, the key point of the WCP is to deny that this fact should alter the inference implication from the known y∗. To insist it should is to deny the WCP. Granted, WCP sought to identify the relevant sampling distribution for inference from a specified type of mixture, and a known y∗, but it is Birnbaum who purports to give an argument that is relevant for a sampling theorist and for “approaches which are independent of this [Bayes’] principle” [Birnbaum (1962), page 283]. Its implications for sampling theory are why it was dubbed “a landmark in statistics” [Savage (1962b), page 307].

Let us look at the two statements about inference implications from a given (E2, y∗), applying (1) and (2) in Section 5.2:

(E2, y∗) ⇒ InfrEB[x∗],
(E2, y∗) ⇒ InfrE2[y∗].

Can both be applied in exactly the same model with the same given z? The answer is yes, so long as the WCP happens to make no difference:

InfrEB[zi∗] = InfrEi[zi∗], i = 1, 2.

Now the SLP must be applicable to an arbitrary SLP pair. However, to assume that (1) and (2) can be consistently applied for any x∗, y∗ pair would be to assume no SLP violations are possible, which really would render Birnbaum’s argument circular. So from Section 5.3, the choices are to regard Birnbaum’s argument as unsound (arguing from a contradiction) or circular (assuming what it purports to prove). Neither is satisfactory. We are left with competing inference implications and no way to get to the SLP. There is evidence Birnbaum saw the gap in his argument (Birnbaum, 1972), and in the end he held the SLP only restricted to (predesignated) point against point hypotheses.⁵

It is not SP and WCP that conflict; the conflict comes from WCP together with Birnbaumization—understood as both invoking the hypothetical mixture and erasing the information as to which experiment the data came from. If one Birnbaumizes, one cannot at the same time uphold the “keep irrelevants irrelevant” (Irrel) stipulation of the WCP. So for any given (E, z) one must choose, and the answer is straightforward for a holder of the WCP. To paraphrase Cox’s (1958), page 361, objection to unconditional tests:

Birnbaumization says that we can assign y∗ a different level of significance than we ordinarily do, because one may identify an SLP pair x∗ and construct statistic TB. But this fact seems irrelevant to the interpretation of an observation which we know came from E2. To conceal the index, and use the convex combination, would give a distorted assessment of statistical significance.

7. RELATION TO OTHER CRITICISMS OF BIRNBAUM

A number of critical discussions of the Birnbaum argument and the SLP exist. While space makes it impossible to discuss them here, we believe the current analysis cuts through this extremely complex literature. Take, for example, the most well-known criticisms by Durbin (1970) and Kalbfleisch (1975), discussed in the excellent paper by Evans, Fraser and Monette (1986). Allowing that any y∗ may be viewed as having arisen from Birnbaum’s mathematical mixture, they consider the proper order of application of the principles. If we condition on the given experiment first, Kalbfleisch’s revised sufficiency principle is inapplicable, so Birnbaum’s argument fails. On the other hand, Durbin argues, if we reduce to the minimal sufficient statistic first, then his revised principle of conditionality cannot be applied. Again Birnbaum’s argument fails. So either way it fails.

Unfortunately, the idea that one must revise the initial principles in order to block SLP allows downplaying or dismissing these objections as tantamount to denying SLP at any cost (please see the references⁶). We can achieve what they wish to show, without altering principles, and from WCP alone. Given y∗, WCP blocks Birnbaumization; given y∗ has been Birnbaumized, the WCP precludes conditioning.

⁵This alone would not oust all sampling distributions. Birnbaum’s argument, even were it able to get a foothold, would have to apply further rounds of conditioning to arrive at the data alone.

We agree with Evans, Fraser and Monette (1986), page 193, “that Birnbaum’s use of [the principles] . . . are contrary to the intentions of the principles, as judged by the relevant supporting and motivating examples. From this viewpoint we can state that the intentions of S and C do not imply L.” [Where S, C and L are our SP, WCP and SLP.] Like Durbin and Kalbfleisch, they offer a choice of modifications of the principles to block the SLP. These are highly insightful and interesting; we agree that they highlight a need to be clear on the experimental model at hand. Still, it is preferable to state the WCP so as to reflect these “intentions,” without which it is robbed of its function. The problem stems from mistaking WCP as the equivalence InfrEmix[z] = InfrEi[z] (whether the mixture is hypothetical or actual). This is at odds with the WCP. The puzzle is solved by adequately stating the WCP. Aside from that, we need only keep the meaning of terms consistent through the argument.

We emphasize that we are neither rejecting the SP nor claiming that it breaks down, even in the special case EB. The sufficiency of TB within EB, as a mathematical concept, holds: the value of TB “suffices” for InfrEB[y∗], the inference from the associated convex combination. Whether reference to hypothetical mixture EB is relevant for inference from given y∗ is a distinct question. For an alternative criticism see Evans (2013).

8. CONCLUDING REMARKS

An essential component of informative inference for sampling theorists is the relevant sampling distribution: it is not a separate assessment of performance, but part of the necessary ingredients of informative inference. It is this feature that enables sampling theory to have SLP violations (e.g., in significance testing contexts). Any such SLP violation, according to Birnbaum’s argument, prevents adhering to both SP and WCP. We have shown that SLP violations do not preclude WCP and SP.

⁶In addition to the authors cited in the manuscript, see especially comments by Savage, Cornfield, Bross, Pratt, Dempster et al. (1962) on Birnbaum. For later discussions, see Barndorff-Nielsen (1975), Berger (1986), Berger and Wolpert (1988), Birnbaum (1970a, 1970b), Dawid (1986), Savage (1970) and references therein.

The SLP does not refer to mixtures. But supposing that (E2, y∗) is given, Birnbaum asks us to consider that y∗ could also have resulted from a θ-irrelevant mixture that selects between E1, E2. The WCP says this piece of information should be irrelevant for computing the inference from (E2, y∗) once given. That is, InfrEi[(Emix, y∗)] = InfrEi[y∗], i = 1, 2. It follows that if InfrE1[x∗] ≠ InfrE2[y∗], the two remain unequal after the recognition that y∗ could have come from the mixture. What was an SLP violation remains one.

Given y∗, the WCP says do not Birnbaumize. One is free to do so, but not to simultaneously claim to hold the WCP in relation to the given y∗, on pain of logical contradiction. If one does choose to Birnbaumize, and to construct TB, admittedly the known outcome y∗ yields the same value of TB as would x∗. Using the sample space of EB yields [B]: InfrEB[x∗] = InfrEB[y∗]. This is based on the convex combination of the two experiments and differs from both InfrE1[x∗] and InfrE2[y∗]. So again, any SLP violation remains. Granted, if only the value of TB is given, using InfrEB may be appropriate. For then we are given only the disjunction: either (E1, x∗) or (E2, y∗). In that case, one is barred from using the implication from either individual Ei. A holder of WCP might put it this way: once (E, z) is given, whether E arose from a θ-irrelevant mixture or was fixed all along should not matter to the inference, but whether a result was Birnbaumized or not should, and does, matter.

There is no logical contradiction in holding that if data are analyzed one way (using the convex combination in EB), a given answer results, and if analyzed another way (via WCP), one gets quite a different result. One may consistently apply both the EB and the WCP directives to the same result, in the same experimental model, only in cases where WCP makes no difference. To claim for any x∗, y∗, the WCP never makes a difference, however, would assume that there can be no SLP violations, which would make the argument circular.⁷ Another possibility would be to hold, as Birnbaum ultimately did, that the SLP is “clearly plausible” [Birnbaum (1968), page 301] only in “the severely restricted case of a parameter space of just two points” where these are predesignated [Birnbaum (1969), page 128]. But that is to relinquish the general result.

⁷His argument would then follow the pattern: If there are SLP violations, then there are no SLP violations. Note that (V implies not-V) is not a logical contradiction. It is logically equivalent to not-V. Then, Birnbaum’s argument is equivalent to not-V: denying that x∗, y∗ can give rise to an SLP violation. That would render it circular.

ACKNOWLEDGMENTS

I am extremely grateful to David Cox and Aris Spanos for numerous discussions, corrections and joint work over many years on this and related foundational issues. I appreciate the careful queries and detailed suggested improvements on earlier drafts from anonymous referees, from Jean Miller and Larry Wasserman. My understanding of Birnbaum was greatly facilitated by philosopher of science, Ronald Giere, who worked with Birnbaum. I’m also grateful for his gift of some of Birnbaum’s original materials and notes.

REFERENCES

BARNDORFF-NIELSEN, O. (1975). Comments on paper by J. D. Kalbfleisch. Biometrika 62 261–262.

BERGER, J. O. (1986). Discussion on a paper by Evans et al. [On principles and arguments to likelihood]. Canad. J. Statist. 14 195–196.

BERGER, J. O. (2006). The case for objective Bayesian analysis. Bayesian Anal. 1 385–402. MR2221271

BERGER, J. O. and WOLPERT, R. L. (1988). The Likelihood Principle, 2nd ed. Lecture Notes—Monograph Series 6. IMS, Hayward, CA.

BIRNBAUM, A. (1962). On the foundations of statistical inference. J. Amer. Statist. Assoc. 57 269–306. Reprinted in Breakthroughs in Statistics 1 (S. Kotz and N. Johnson, eds.) 478–518. Springer, New York.

BIRNBAUM, A. (1968). Likelihood. In International Encyclopedia of the Social Sciences 9 299–301. Macmillan and the Free Press, New York.

BIRNBAUM, A. (1969). Concepts of statistical evidence. In Philosophy, Science, and Method: Essays in Honor of Ernest Nagel (S. Morgenbesser, P. Suppes and M. G. White, eds.) 112–143. St. Martin’s Press, New York.

BIRNBAUM, A. (1970a). Statistical methods in scientific inference. Nature 225 1033.

BIRNBAUM, A. (1970b). On Durbin’s modified principle of conditionality. J. Amer. Statist. Assoc. 65 402–403.

BIRNBAUM, A. (1972). More on concepts of statistical evidence. J. Amer. Statist. Assoc. 67 858–861. MR0365793

BIRNBAUM, A. (1975). Comments on paper by J. D. Kalbfleisch. Biometrika 62 262–264.

CASELLA, G. and BERGER, R. L. (2002). Statistical Inference, 2nd ed. Duxbury Press, Belmont, CA.

COX, D. R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29 357–372. MR0094890

COX, D. R. (1977). The role of significance tests. Scand. J. Stat. 4 49–70. MR0448666

COX, D. R. (1978). Foundations of statistical inference: The case for eclecticism. Aust. N. Z. J. Stat. 20 43–59. MR0501453

COX, D. R. and HINKLEY, D. V. (1974). Theoretical Statistics. Chapman & Hall, London. MR0370837

COX, D. R. and MAYO, D. G. (2010). Objectivity and conditionality in frequentist inference. In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (D. G. Mayo and A. Spanos, eds.) 276–304. Cambridge Univ. Press, Cambridge.

DAWID, A. P. (1986). Discussion on a paper by Evans et al. [Onprinciples and arguments to likelihood]. Canad. J. Statist. 14196–197.

DURBIN, J. (1970). On Birnbaum’s theorem on the relation be-tween sufficiency, conditionality and likelihood. J. Amer. Statist.Assoc. 65 395–398.

EVANS, M. (2013). What does the proof of Birnbaum’s theoremprove? Unpublished manuscript.

EVANS, M. J., FRASER, D. A. S. and MONETTE, G. (1986). Onprinciples and arguments to likelihood. Canad. J. Statist. 14181–199. MR0859631

GHOSH, J. K., DELAMPADY, M. and SAMANTA, T. (2006). An In-troduction to Bayesian Analysis. Theory and Methods. SpringerTexts in Statistics. Springer, New York. MR2247439

KALBFLEISCH, J. D. (1975). Sufficiency and conditionality.Biometrika 62 251–268. MR0386075

LEHMANN, E. L. and ROMANO, J. P. (2005). Testing StatisticalHypotheses, 3rd ed. Springer Texts in Statistics. Springer, NewYork. MR2135927

MAYO, D. G. (1996). Error and the Growth of ExperimentalKnowledge. Univ. Chicago Press, Chicago, IL.

MAYO, D. G. (2010). An error in the argument from conditionalityand sufficiency to the likelihood principle. In Error and Infer-ence: Recent Exchanges on Experimental Reasoning, Reliabil-ity, and the Objectivity and Rationality of Science (D. G. Mayoand A. Spanos, eds.) 305–314. Cambridge Univ. Press, Cam-bridge.

MAYO, D. G. and COX, D. R. (2010). Frequentist statistics as atheory of inductive inference. In Error and Inference: RecentExchanges on Experimental Reasoning, Reliability, and the Ob-jectivity and Rationality of Science (D. G. Mayo and A. Spanos,eds.) 247–274. Cambridge Univ. Press, Cambridge. First pub-lished in The Second Erich L. Lehmann Symposium: Optimality49 (2006) (J. Rojo, ed.) 77–97. Lecture Notes—Monograph Se-ries. IMS, Beachwood, OH.

MAYO, D. G. and COX, D. R. (2011). Statistical scientist meetsa philosopher of science: A conversation. In Rationality, Mar-kets and Morals: Studies at the Intersection of Philosophy andEconomics 2 (D. G. Mayo, A. Spanos and K. W. Staley, eds.)(Special Topic: Statistical Science and Philosophy of Science:Where do (should) They Meet in 2011 and Beyond?) (October18) 103–114. Frankfurt School, Frankfurt.

MAYO, D. G. and KRUSE, M. (2001). Principles of inference andtheir consequences. In Foundations of Bayesianism (D. Corfieldand J. Williamson, eds.) 24 381–403. Applied Logic. KluwerAcademic Publishers, Dordrecht.

MAYO, D. G. and SPANOS, A. (2006). Severe testing as a basicconcept in a Neyman–Pearson philosophy of induction. BritishJ. Philos. Sci. 57 323–357. MR2249183

MAYO, D. G. and SPANOS, A. (2011). Error statistics. In Philos-ophy of Statistics 7 (P. S. Bandyopadhyay and M. R. Forster,eds.) 152–198. Handbook of the Philosophy of Science. Else-vier, Amsterdam.


REID, N. (1992). Introduction to Fraser (1966) structural probability and a generalization. In Breakthroughs in Statistics (S. Kotz and N. L. Johnson, eds.) 579–586. Springer Series in Statistics. Springer, New York.
SAVAGE, L. J., ed. (1962a). The Foundations of Statistical Inference: A Discussion. Methuen, London.
SAVAGE, L. J. (1962b). Discussion on a paper by A. Birnbaum [On the foundations of statistical inference]. J. Amer. Statist. Assoc. 57 307–308.
SAVAGE, L. J. (1970). Comments on a weakened principle of conditionality. J. Amer. Statist. Assoc. 65 399–401.
SAVAGE, L. J., BARNARD, G., CORNFIELD, J., BROSS, I., BOX, G. E. P., GOOD, I. J., LINDLEY, D. V. et al. (1962). On the foundations of statistical inference: Discussion. J. Amer. Statist. Assoc. 57 307–326.


Statistical Science 2014, Vol. 29, No. 2, 240–241. DOI: 10.1214/14-STS470. Main article DOI: 10.1214/13-STS457. © Institute of Mathematical Statistics, 2014.

Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”
A. P. Dawid

Abstract. Deborah Mayo claims to have refuted Birnbaum’s argument that the Likelihood Principle is a logical consequence of the Sufficiency and Conditionality Principles. However, this claim fails because her interpretation of the Conditionality Principle is different from Birnbaum’s. Birnbaum’s proof cannot be so readily dismissed.

Key words and phrases: Conditionality principle, Birnbaum’s theorem, likelihood principle, sufficiency principle, weak conditionality principle.

Deborah Mayo (2014) is not the first devoutly to wish that the (strong) Likelihood Principle [principle L of Birnbaum (1962)] was not a logical consequence of the Sufficiency Principle (Birnbaum’s S) and the Conditionality Principle (Birnbaum’s C). This concern arises because much of frequentist inference is in clear violation of L, while at the same time purporting to abide by S and C. This constitutes a self-contradiction, which frequentists are, however, loth to admit. Birnbaum himself appears to have been quite distraught at his own finding, and in the half-century since publication of his argument there has been a constant trickle of attempts to come to terms with it, including one or two of my own (Dawid, 1977; Dawid, 1983; Dawid, 1987; Dawid, 2011); a detailed account that I consider displays the underlying logic clearly can be found in Chapter II, “Principles of Inference,” of Dawid (2013).

Those who feel disquiet at the destructive implications of Birnbaum’s theorem for their favored method of inference (be it frequentist or, for example, “objective Bayesian,” which also violates L) have a number of strategies to try and ease that disquiet. If they accept the validity of the theorem, they might argue [along with Fraser (1963); Durbin (1970); Kalbfleisch (1975)] that S or C should not be taken as universally applicable—thus evading the consequent of the theorem by denying its antecedents. This is at least a logically sound ploy, although it reeks of adhockery. Also, the ploy may not be totally successful, since some of the “undesirable” implications of the theorem may survive weakening of its hypotheses: Dawid (1987) suggests that the principle of the irrelevance of the stopping rule is one such survivor.

A. P. Dawid is Emeritus Professor of Statistics, Statistical Laboratory, Centre for Mathematical Sciences, Cambridge University, Wilberforce Road, Cambridge CB3 0WB, UK (e-mail: [email protected]).

A second possible strategy is to fully accept S and C and Birnbaum’s argument—and thereby come to accept L. This is the path of enlightenment followed by conversion.

The third strategy involves accepting S and C, but still rejecting L. If that is your motivation (and you care about self-consistency), you have no option but to try and find fault with the logic of Birnbaum’s theorem. This is Mayo’s strategy. The only problem is that Birnbaum’s theorem is indeed logically sound. That means that Mayo’s attempt to argue the contrary must itself be unsound. Although there are many points at which I am deeply critical of her argument, I will content myself with drawing attention to her principal misunderstanding, which vitiates her entire enterprise: she simply has not grasped Birnbaum’s conditionality principle C, conflating and confusing it with Cox’s WCP, which is quite different.

According to Mayo, WCP requires that “one should condition on the known experiment,” or (as she phrases it in Section 4.3) “eschew unconditional formulations.” But Birnbaum describes his principle C as the requirement that




the evidential meaning of any outcome of any mixture experiment is the same as that of the corresponding outcome of the corresponding component experiment, ignoring the over-all structure of the mixture experiment.

That is, Birnbaum’s principle C requires identity of the inferences to be drawn (from the same data) in different circumstances. This imposes an equivalence relationship across such circumstances. Principle C has nothing to say about the form or nature of the inferences, and—importantly—unlike WCP is entirely nondirectional. Mayo has misconstrued it as synonymous with WCP, which would require that we should discard whatever inference we might have been contemplating in the mixture experiment and replace it by our favored inference in the component experiment. However, an equally (in)valid reading of C would be the contrary: that we should discard a contemplated component-experiment inference in favor of an inference formed for the mixture experiment. In fact, neither of these interpretations has anything to do with principle C and, typically—as indeed follows from Birnbaum’s theorem and the fact that frequentist inference violates C—neither of them can be implemented consistently within a frequentist framework.

In her Section 4.3.3 Mayo does consider the relationship between WCP and equivalence principles, and quite correctly decides that WCP is not one of these. In Section 7 she opines, “The problem stems from mistaking WCP as the equivalence. . . .” So at least she realises that WCP and Birnbaum’s principle C are different. However, the “problem” is just the contrary: she has mistaken Birnbaum’s equivalence requirement C as the “nonequivalence” principle WCP.

Mayo has attempted to argue that L does not follow from S and WCP. Notwithstanding the shortfalls in her arguments, I agree with that conclusion. The trouble is, it has nothing to do with Birnbaum’s theorem. Mayo has been attacking a straw man, and Birnbaum’s result, S & C ⇒ L, remains entirely untouched by her criticisms.

REFERENCES

BIRNBAUM, A. (1962). On the foundations of statistical inference. J. Amer. Statist. Assoc. 57 269–326. MR0138176
DAWID, A. P. (1977). Conformity of inference patterns. In Recent Developments in Statistics (Proc. European Meeting Statisticians, Grenoble, 1976) 245–256. North-Holland, Amsterdam. MR0471123
DAWID, A. P. (1983). Statistical inference: I. In Encyclopedia of Statistical Sciences 4 (S. Kotz, N. L. Johnson and C. B. Read, eds.) 89–105. Wiley, New York.
DAWID, A. P. (1987). Invited discussion of “On principles and arguments to likelihood,” by M. Evans, D. A. S. Fraser and G. Monette. Canad. J. Statist. 14 196–197.
DAWID, A. P. (2011). Basu on ancillarity. In Selected Works of Debabrata Basu. Sel. Works Probab. Stat. 5–8. Springer, New York. MR2799327
DAWID, A. P. (2013). Principles of statistics. Online lecture notes at http://www.flooved.com/reader/3470.
DURBIN, J. (1970). On Birnbaum’s theorem on the relation between sufficiency, conditionality and likelihood. J. Amer. Statist. Assoc. 65 395–398.
FRASER, D. A. S. (1963). On the sufficiency and likelihood principles. J. Amer. Statist. Assoc. 58 641–647. MR0153078
KALBFLEISCH, J. D. (1975). Sufficiency and conditionality. Biometrika 62 251–268. MR0386075
MAYO, D. (2014). On the Birnbaum argument for the strong likelihood principle. Statist. Sci. 29 227–239.


Statistical Science 2014, Vol. 29, No. 2, 242–246. DOI: 10.1214/14-STS471. Main article DOI: 10.1214/13-STS457. © Institute of Mathematical Statistics, 2014.

Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”
Michael Evans

Abstract. We discuss Birnbaum’s result, its relevance to statistical reasoning, Mayo’s objections and the result in [Electron. J. Statist. 7 (2013) 2645–2655] that the proof of this result doesn’t establish what is commonly believed.

Key words and phrases: Sufficiency, conditionality, likelihood, statistical evidence.

1. INTRODUCTION

The result established in Birnbaum (1962), that if one accepts the frequentist principles of sufficiency (S) and conditionality (C), then one must accept the likelihood principle (L), has been an issue in the foundations of statistics for 50 years. Many statisticians and philosophers of science accept Birnbaum’s theorem as a logical fact because the proof is simple and, if they follow a pure likelihood or Bayesian prescription for inference, it doesn’t violate the way they think statistical analyses should be conducted. Many frequentist statisticians reject the result basically because they don’t like the consequence that frequentist evaluations of statistical methodologies are irrelevant.

In the end, an acceptable theory of inference has to be based on sound logic with no appeals to ex cathedra principles. Any principles used as part of forming such a theory have to have strong justifications and produce results that are free of paradoxes and contradictions. For example, the principle of conditional probability, which says we replace P(A) as the measure of belief that event A is true by P(A | C) after being told that event C has occurred, seems like a basic principle of inference that, with careful application, is sound.

Does the likelihood principle carry the same weight in a theory of inference as the principle of conditional probability? We don’t think so and we will later argue that a somewhat weakened version is really just a consequence of the principle of conditional probability. Given that such principles can have a significant influence on what we view as correct statistical reasoning, it is important to examine the justifications for the likelihood principle, and Birnbaum’s theorem is commonly cited as such, to see if these are correct.

Michael Evans is Professor, Department of Statistical Sciences, University of Toronto, 100 St. George St., Toronto, Ontario M5S 3G3, Canada (e-mail: [email protected]).

Another principle cited in Mayo’s paper is the principle of frequentism. So what is the justification for this principle? Generally, this seems to be based on the belief that it typically produces sensible statistical methods although, as we will subsequently discuss, the story seems incomplete and unclear. If the principle of frequentism is correct, we need to have a good argument for it and a much more complete development of the theory.

The relevance of frequentism to Mayo’s paper lies in the author’s position that Birnbaum’s argument is basically a violation of the principle. An argument is provided for why the joint application of S and C used in Birnbaum’s proof constitutes such a violation. I accept Mayo’s reasoning. In fact, I think it is somewhat similar to the argument put forward in Evans, Fraser and Monette (1986) that the applications of S and C in the proof are incorrect because S discards as irrelevant precisely the information used by C to form the conditional model. So the justifications for S and C contradict one another in the proof and this doesn’t seem right. This contradiction is avoided if one adopts the principle put forward in Durbin (1970), that we should restrict to ancillaries that are functions of a minimal sufficient statistic, and then Birnbaum’s proof fails.




As we will discuss in Section 3, however, the issue in Birnbaum’s argument is not really with S and C together, but rather with C itself and with what is actually proved. A very broad hint that this is the case is provided in Evans, Fraser and Monette (1986) where, using the same style of argument as Birnbaum, it is “proved” that accepting C alone is equivalent to accepting L. So Durbin’s point doesn’t save the day even if we accept it. Actually, I don’t think the arguments in Mayo’s paper, or in Evans, Fraser and Monette (1986), completely dispense with Birnbaum’s theorem either. They just reinforce the unsettling feeling that something is wrong somewhere.

Section 3 contains an outline of Evans (2013) that, for me at least, definitively settles the issue of what is wrong with Birnbaum’s result and does this mathematically. As will be apparent from Section 2, however, it is clear that I believe that a proper prior is a necessary part of formally correct statistical reasoning. So why would a Bayesian want to invalidate Birnbaum’s result? This is because the result, as usually stated, is not logically correct. Any valid theory has to have sound, logical foundations and so we don’t want any faulty reasoning being used to justify that theory. In fact, Birnbaum’s result even misleads, as we’ve heard it said that model checking and checking for prior-data conflict violate the likelihood principle and so should not be carried out. Both of these activities are a necessary part of a statistical analysis. For this is how we deal, at least in part, with the subjectivity inherent in a statistical analysis due to the choices made by a statistician. This point, at least with respect to model checking, is also made in Mayo’s paper and I think it is an excellent one.

For proper Bayesians, a form of the likelihood principle is a consequence of the principle of conditional probability, a far more important principle. Applying the principle of conditional probability to the joint probability model for the model parameter and data after observing the data, we have that probability statements about the model parameter depend on the sampling model and data only through the likelihood (note the emphasis). Of course, the likelihood map is minimal sufficient so there is nothing surprising in this.
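To spell the point out in symbols (a gloss in generic notation, not part of the original discussion): with a proper prior π(θ) and sampling density fθ(x), conditioning on the observed data x gives π(θ | x) = π(θ)fθ(x)/∫ π(ϑ)fϑ(x) dϑ, which is proportional to π(θ)L(θ | x). Hence two model–data pairs with proportional likelihood functions, L1(θ | x1) = cL2(θ | x2) for all θ, yield exactly the same posterior and therefore the same posterior probability statements about θ.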

2. BIRNBAUM AND EVIDENCE

There is an aspect of Birnbaum’s work in this area that is particularly noteworthy. This is his emphasis on trying to characterize statistical evidence concerning the true value of the model parameter as expressed by the function Ev. Consider the pairs (M, x), where M = {fθ : θ ∈ Θ} is a set of probability distributions indexed by parameter θ ∈ Θ and x is observed data coming from a distribution in M. Then Birnbaum (1962) writes Ev(M1, x1) = Ev(M2, x2) to mean that the evidence in (M1, x1) is the same as the evidence in (M2, x2) whenever certain conditions are satisfied. We require here that M1 and M2 have the same parameter space, but this can be weakened to include models with parameter spaces that are bijectively equivalent.

The principles S, C and L are considered as possible partial characterizations of statistical evidence. For example, if (M1, x1) and (M2, x2) are related via S, then Birnbaum says that, for frequentist statisticians, Ev(M1, x1) = Ev(M2, x2), and similarly for C. Birnbaum is careful to say that Ev does not characterize what statistical evidence is; it is a kind of “equivalence relation” (see Section 3).

In essence Birnbaum brings us to the heart of the matter in statistical inference. What is statistical evidence or, more appropriately, how do we measure it? It seems collectively we talk about it, but we rarely get down to details and really spell out how we are supposed to handle this concept. Perhaps the closest to doing this is the pure likelihood theory, as discussed, for example, in Royall (1997), but this is only a definition of relative evidence when comparing two values of the full model parameter. For marginal parameters, this approach uses the profile likelihood as the only general way to compare the evidence for different values and this is unsatisfactory from many points of view. For example, a profile likelihood function is not generally a likelihood function.

For a frequentist theory of statistical inference, as opposed to a theory of statistical decision, it seems essential that a general method for measuring statistical evidence be provided that can be applied in any particular problem. The p-value is often used as a frequentist measure of evidence against a hypothesis but, for a variety of reasons, it does not seem to be appropriate. For example, we need a measure that can also provide evidence for something being true and not just evidence against, given that we have assumed that the true distribution is in M.

If we add a proper prior to the ingredients, then it seems we can come up with sensible measures of evidence. For evidence, as expressed by observed data in statistical problems, is what causes beliefs to change and so we can measure evidence by measuring change in belief. For example, if we are interested in the truth of the event A, and this has prior probability P(A) > 0,



then after observing C, the principle of conditional probability leads to the posterior probability P(A | C) as the appropriate expression of beliefs about A. Accordingly, we measure evidence by the change in belief from P(A) to P(A | C). A simple principle of evidence says that we have evidence for the truth of A when P(A | C) > P(A), evidence against the truth of A when P(A | C) < P(A) and no evidence one way or the other when P(A | C) = P(A). This principle is common in discussions about evidence in the philosophy of science and it seems obviously correct.

Of course, we also want to know how much evidence we have and this has led to a variety of different measures based on (P(A), P(A | C)). The Bayes factor BF(A | C) = P(A | C)P(A^c)/[P(A)P(A^c | C)] is one such measure, as BF(A | C) > 1 if and only if P(A | C) > P(A) and bigger values mean more evidence in favor of A being true. A central question associated with this, and other measures of evidence, is how to calibrate its values, as in when is BF(A | C) big and when is it small. Actually, we prefer measuring evidence via the relative belief ratio RB(A | C) = P(A | C)/P(A), as the associated mathematics and the calibration of its values are both simpler. The generalization to continuous contexts is effected by taking limits and then both measures agree. A full theory of inference, both estimation and hypothesis assessment, can be built based on this measure of evidence together with a very natural calibration. This is discussed in Baskurt and Evans (2013). Of course, many will not like this because it involves proper priors, and so is subjective and supposedly not scientific. Alternatively, some may complain that priors are somehow hard to come up with.
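To make the two measures concrete, here is a minimal numerical sketch (our own illustration, not from the discussion): given a prior probability P(A) and a posterior probability P(A | C), it computes the Bayes factor and the relative belief ratio defined above and reports the direction of the evidence.

    def evidence_measures(prior_a, post_a):
        """Bayes factor BF(A|C) and relative belief ratio RB(A|C) for an event A
        with prior probability prior_a and posterior probability post_a."""
        prior_ac = 1.0 - prior_a
        post_ac = 1.0 - post_a
        bf = (post_a * prior_ac) / (prior_a * post_ac)   # posterior odds / prior odds
        rb = post_a / prior_a                            # change in belief for A
        direction = "for A" if rb > 1 else ("against A" if rb < 1 else "neutral")
        return bf, rb, direction

    # Example: P(A) = 0.2 a priori, P(A | C) = 0.5 after observing C
    print(evidence_measures(0.2, 0.5))   # (4.0, 2.5, 'for A')

Both quantities exceed 1 exactly when P(A | C) > P(A), in line with the simple principle of evidence stated above; they differ only in how the size of the change in belief is expressed.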

In reality, all of statistics, excepting the data when it is properly collected, is subjective. We choose models and we choose priors. What is important is that any choice we make, as part of a statistical analysis, be checkable against the objective data to ensure the choice at least makes sense. We check the model by asking whether or not the data is surprising for each distribution in the model, and there are many well-known procedures for doing this. Perhaps not so familiar is that we can also check a proper prior by asking whether or not the true value is in a region of relatively low prior probability. Procedures for doing this consistently are developed in Evans and Moshonov (2006) and Evans and Jang (2011a). In fact, there are even logical approaches to modifying priors when prior-data conflict is found, as discussed in Evans and Jang (2011b). Moreover, with a suitable definition of evidence, we can measure a priori whether or not a prior is inducing bias into a problem; see Baskurt and Evans (2013). So subjectivity is not really the issue. We do our best to assess and control its effects, and maybe that is part of the role of statistics in science, but in the end it is an unavoidable aspect of any statistical investigation.

It is undoubtedly true that it is possible to write down complicated models for which it is extremely difficult, if not impossible, to prescribe an elicitation procedure in an application that leads to a sensible choice of a prior. But what does this say about our choice of model? It seems that we do not understand the effects of parameters in the model on the measurements we are taking sufficiently well to develop such a procedure. That is certainly possible, and perhaps even common, but it doesn’t speak well for the modeling process and it shouldn’t be held up as a criticism of what should be the gold standard for inference. An analogous situation arises with data collection, where we know the gold standard is random sampling from the population(s) to which our inferences are to apply and, when we are interested in relationships among variables, controlled allocation of the values of predictors to sampled units. The fact that this is rarely, if ever, achieved doesn’t cause us to throw out the baby with the dirty bath water. Gold standards serve as guides that we strive to attain, and analyses that don’t attain them just need to be suitably qualified.

Our main point in this section is that the problem of measuring statistical evidence is the central issue in developing a theory of statistical inference. It seems that Birnbaum realized this and was searching for a way to accomplish this goal when he came upon what appeared to be a remarkable result.

3. WHAT’S WRONG WITH BIRNBAUM’S RESULT?

Perhaps everybody who has read the proof of Birnbaum’s theorem is surprised at its simplicity. In fact, this is one of the reasons it is so convincing, as there does not appear to be a logical flaw in the proof. As Mayo has noted, however, there are reasons to be doubtful of, if not even reject, the result as being valid within the domain of any sensible theory of statistical inference. Still suspicions linger, as the formulation seems so simple.

As we will now explain, the result proved is not really the result claimed. If we want to treat Birnbaum’s theorem and its proof as a piece of mathematics, then we have to be precise about the ingredients going into it. It is the imprecision in Birnbaum’s formulation that leads to a faulty impression of exactly what is proved. This is more carefully explained in Evans (2013), but we can give a broad outline here.

Suppose we have a set D. A relation R on D is any subset R ⊂ D × D. Meaningful relations express something, and (d1, d2) ∈ R means that d1 and d2 share some relevant property. Let I denote the set of all model–data pairs (M, x). So, for example, we can consider S as a relation on I by saying the pair ((M1, x1), (M2, x2)) ∈ S ⊂ I × I whenever (M1, x1) and (M2, x2) have equivalent minimal sufficient statistics. Similarly, C and L are relations on I.

An equivalence relation R on D is a relation that is reflexive: (d, d) ∈ R for all d ∈ D; symmetric: (d1, d2) ∈ R implies (d2, d1) ∈ R; and transitive: (d1, d2), (d2, d3) ∈ R implies (d1, d3) ∈ R. It is reasonable to say that, whatever property is characterized by relation R, when R is an equivalence relation, then (d1, d2) ∈ R means that d1 and d2 possess the property to the same degree. It is easy to prove that S and L are equivalence relations but C and S ∪ C are not equivalence relations; see Evans (2013).

Associated with an arbitrary relation R on D is the smallest equivalence relation on D containing R, which we will denote by R̄. Clearly, R̄ is the intersection of all equivalence relations containing R. But R̄ can also be characterized in another way that is key to Birnbaum’s proof.

LEMMA. If R is a reflexive relation on D, then R̄ = {(x, y) : ∃ n and x1, . . . , xn ∈ D with x = x1, y = xn and, for each i, (xi, xi+1) ∈ R or (xi+1, xi) ∈ R}.

Note that S and C are both reflexive and, thus, S ∪ C is reflexive.

In Birnbaum’s proof, he starts with ((M1, x1), (M2, x2)) ∈ L, namely, these pairs have proportional likelihoods. Birnbaum constructs the mixture model (Birnbaumization) M∗ and then argues that we have ((M1, x1), (M∗, (1, x1))) ∈ C, ((M∗, (1, x1)), (M∗, (2, x2))) ∈ S and ((M∗, (2, x2)), (M2, x2)) ∈ C. Since C ⊂ S ∪ C and S ⊂ S ∪ C, by the Lemma this proves that L is contained in the smallest equivalence relation containing S ∪ C, and this is all that Birnbaum’s argument establishes. Since S ∪ C ⊂ L and L is an equivalence relation, we also have that the smallest equivalence relation containing S ∪ C equals L. As shown in Evans (2013), it is also true that S ∪ C is properly contained in L, so there is some content to the proof. In prose, Birnbaum’s proof establishes the following: if we accept S, and we accept C, and we accept all the equivalences generated by these principles jointly, then we accept L. Certainly accepting S and C is not equivalent to accepting L, since S ∪ C is a proper subset of L. We need the additional hypothesis, and there doesn’t appear to be any good reason why we should accept this as part of a theory of statistical inference. It is easy to construct relations R for which R̄ is meaningless. So we have to justify the additional pairs we add to a relation when completing it to an equivalence relation.
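The Lemma’s chain characterization is straightforward to compute with. The following sketch (our own illustration on a made-up relation, not Birnbaum’s S and C) builds the smallest equivalence relation containing a reflexive relation R by joining everything connected by a chain of R-links taken in either direction, exactly as in the Lemma:

    def equivalence_closure(elements, R):
        """Smallest equivalence relation containing the reflexive relation R on `elements`.
        Pairs (x, y) are related in the closure iff a chain x = x1, ..., xn = y exists
        with each consecutive pair in R in one direction or the other (the Lemma)."""
        adj = {d: set() for d in elements}        # adjacency under the symmetric closure of R
        for (a, b) in R:
            adj[a].add(b)
            adj[b].add(a)
        closure = set()
        for start in elements:
            seen, frontier = {start}, [start]     # breadth-first search over chains from `start`
            while frontier:
                nxt = []
                for a in frontier:
                    for b in adj[a]:
                        if b not in seen:
                            seen.add(b)
                            nxt.append(b)
                frontier = nxt
            closure.update((start, y) for y in seen)
        return closure

    # Toy reflexive relation on {1, 2, 3, 4}: R relates 1-2 and 2-3 but says nothing about 4.
    D = [1, 2, 3, 4]
    R = {(d, d) for d in D} | {(1, 2), (2, 3)}
    print((1, 3) in equivalence_closure(D, R))    # True: added only by the closure, not in R itself

The point at issue is precisely that pairs such as (1, 3) here belong to the generated equivalence relation without belonging to R, so accepting R is not the same thing as accepting its closure.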

It is interesting to note that the argument supposedly establishing the equivalence of C and L in Evans, Fraser and Monette (1986) also proceeds in the same way, using the method of the Lemma. Since C is properly contained in L, this proof establishes that C̄ = L. So in fact, S is irrelevant in Birnbaum’s proof. The problem with the principles S and C, as partial characterizations of statistical evidence, lies with C and the fact that it is not an equivalence relation. That C is not an equivalence relation is another way of expressing the well-known fact that, in general, a unique maximal ancillary doesn’t exist.

The result C̄ = L does have some content. To be a valid characterization of evidence in the context of I, we will have to modify C so that it is an equivalence relation. The smallest equivalence relation containing C is L and this is unappealing, at least to frequentists, as it implies that repeated sampling properties are irrelevant for inference. Another natural candidate for a resolution is the largest equivalence relation contained in C that is compatible with all the equivalence relations based on maximal ancillaries. This is given by the equivalence relation based on the laminal ancillary. From Basu (1959), ancillary statistic a is a laminal ancillary if it is a function of every maximal ancillary and any other ancillary with this property is a function of a. The laminal ancillary is essentially unique. It is unclear how appealing this resolution would be to frequentists, but there don’t seem to be any other natural candidates.

Many authors, including Mayo, refer to the weak conditionality principle, which restricts attention to ancillaries that are physically part of the sampling. In such a case we would presumably write our models differently so as to reflect the fact that this sampling occurred in stages. In other words, the universe is different from I, the one Birnbaum considered. There doesn’t seem to be anything controversial about such a principle and it is well motivated by the two measuring instruments example and many others.

We don’t believe, however, that the weak conditionality principle resolves the problem with conditionality more generally. For example, how does weak conditionality deal with situations like Example 2-2 in Fraser (2004) and many others like it? Conditioning on an ancillary seems absolutely essential if we are to obtain sensible inferences in such examples, but there doesn’t appear to be any physical aspect of the sampling that corresponds to the relevant ancillary.

Many frequentist statisticians ignore conditionality but, as noted in Fraser (2004), this is not logical. The theme in conditional inference is to find the right hypothetical sequence of repeated samples to compare the observed sample to. This takes us back to our question concerning the principle of frequentism: why are we considering repeated samples anyway? A successful frequentist theory of inference requires at least a resolution of the problems with conditionality. The lack of such a resolution leads to doubts as to the validity of the basic idea that underlies frequentism.

Issues concerning ancillaries are not irrelevant to Bayesians, as they have uses in model checking and checking for prior-data conflict. Notice that the principle of conditional probability does not imply that these activities need refer to any kind of posterior probabilities, and it is perfectly logical for these to be based on prior probabilities. For example, model checking can be based on the distribution of an ancillary or the conditional distribution of the data given a minimal sufficient statistic. Of course, C is not relevant for proper Bayesian probability statements about θ, as the principle of conditional probability implies that we condition on all of the data.

We acknowledge that it is possible that the problems with C might be fixable or even eliminated through a better understanding of what we are trying to accomplish in statistical analyses—these aren’t just problems in mathematics. We can’t resist noting, however, that the simple addition of a proper prior to the ingredients does the job, at least for inference.

4. CONCLUSIONS

Mayo’s paper contains a number of insightful comments and, more generally, it helps to focus attention on what is the most important question in statistics, namely, what is the right way to formulate a statistical problem and carry out a statistical analysis? To a certain extent, Birnbaum’s result has been an impediment in moving forward toward developing a theory of inference that has a solid foundation. It is good to have such underbrush removed from the discussion. We have to give great credit to Birnbaum, however, for his focus on what is important in achieving this goal, namely, the measurement of statistical evidence. That his theorem has lasted for so long is a testament to the difficulties involved in this task.

In general, we need a strong foundation for a theory of statistical inference rather than principles, often not clearly stated, that have only some vague, intuitive appeal. The only way we can determine whether or not an instance of statistical reasoning is correct lies within the context of a sound theory. That two statistical analyses based on the same data and addressing the same question can be deemed to be correct and yet come to different conclusions is not a contradiction. Statistics tells us that we simply must collect more data to resolve such differences. In our view, the role of statistics in science is to explain how to reason correctly in statistical contexts. Without a strong theory we can’t do that.

REFERENCES

BASKURT, Z. and EVANS, M. (2013). Hypothesis assessment and inequalities for Bayes factors and relative belief ratios. Bayesian Anal. 8 569–590. MR3102226
BASU, D. (1959). The family of ancillary statistics. Sankhya 21 247–256. MR0110115
BIRNBAUM, A. (1962). On the foundations of statistical inference. J. Amer. Statist. Assoc. 57 269–326. MR0138176
DURBIN, J. (1970). On Birnbaum’s theorem on the relation between sufficiency, conditionality and likelihood. J. Amer. Statist. Assoc. 65 395–398.
EVANS, M. (2013). What does the proof of Birnbaum’s theorem prove? Electron. J. Stat. 7 2645–2655. MR3121626
EVANS, M. J., FRASER, D. A. S. and MONETTE, G. (1986). On principles and arguments to likelihood. Canad. J. Statist. 14 181–199. MR0859631
EVANS, M. and JANG, G. H. (2011a). A limit result for the prior predictive applied to checking for prior-data conflict. Statist. Probab. Lett. 81 1034–1038. MR2803740
EVANS, M. and JANG, G. H. (2011b). Weak informativity and the information in one prior relative to another. Statist. Sci. 26 423–439. MR2917964
EVANS, M. and MOSHONOV, H. (2006). Checking for prior-data conflict. Bayesian Anal. 1 893–914 (electronic). MR2282210
FRASER, D. A. S. (2004). Ancillaries and conditional inference. Statist. Sci. 19 333–369. MR2140544
ROYALL, R. M. (1997). Statistical Evidence. A Likelihood Paradigm. Monographs on Statistics and Applied Probability 71. Chapman & Hall, London. MR1629481


Statistical Science 2014, Vol. 29, No. 2, 247–251. DOI: 10.1214/14-STS472. Main article DOI: 10.1214/13-STS457. © Institute of Mathematical Statistics, 2014.

Discussion: Foundations of Statistical Inference, Revisited
Ryan Martin and Chuanhai Liu

Abstract. This is an invited contribution to the discussion on Professor Deborah Mayo’s paper, “On the Birnbaum argument for the strong likelihood principle,” to appear in Statistical Science. Mayo clearly demonstrates that statistical methods violating the likelihood principle need not violate either the sufficiency or conditionality principle, thus refuting Birnbaum’s claim. With the constraints of Birnbaum’s theorem lifted, we revisit the foundations of statistical inference, focusing on some new foundational principles, the inferential model framework, and connections with sufficiency and conditioning.

Key words and phrases: Birnbaum, conditioning, dimension reduction, inferential model, likelihood principle.

1. INTRODUCTION

Birnbaum’s theorem (Birnbaum, 1962) is arguably the most controversial result in statistics. The theorem’s conclusion is that a framework for statistical inference that satisfies two natural conditions, namely, the sufficiency principle (SP) and the conditionality principle (CP), must also satisfy an exclusive condition, the likelihood principle (LP). The controversy lies in the fact that LP excludes all those standard methods taught in Stat 101 courses. Professor Mayo successfully refutes Birnbaum’s claim, showing that violations of LP need not imply violations of SP or CP. The key to Mayo’s argument is a correct formulation of CP; see also Evans (2013). Her demonstration resolves the controversy around Birnbaum and LP, helping to put the statisticians’ house in order.

The controversy and confusion surrounding Birnbaum’s claim has perhaps discouraged researchers from considering questions about the foundations of statistics. We view Professor Mayo’s paper as an invitation for statisticians to revisit these fundamental questions, and we are grateful for the opportunity to contribute to this discussion.

Ryan Martin is Assistant Professor, Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, 851 S. Morgan St., Chicago, Illinois 60607, USA (e-mail: [email protected]). Chuanhai Liu is Professor, Department of Statistics, Purdue University, 250 North University St., West Lafayette, Indiana 47907-2067, USA (e-mail: [email protected]).

Though LP no longer constrains the frequentist approach, this does not mean that pure frequentism is necessarily correct. For example, reproducibility issues1 in large-scale studies are an indication that the frequentist techniques that have been successful in classical problems may not be appropriate for today’s high-dimensional problems. We contend that something more than the basic sampling model is required for valid statistical inference, and appropriate conditioning is one aspect of this. Here we consider what a new framework, called inferential models (IMs), has to say concerning the foundations of statistical inference, with a focus on something more fundamental than CP and SP. For this, we begin in Section 2 with a discussion of valid probabilistic inference as motivation for the IM framework. A general efficiency principle is presented in Section 3 and we give an IM-based dimension reduction strategy that accomplishes what the classical SP and CP set out to do. Section 4 gives some concluding remarks.

1 “Announcement: Reducing our irreproducibility,” Nature 496 (2013), DOI: 10.1038/496398a.




2. VALID PROBABILISTIC INFERENCE

2.1 A Validity Principle

The sampling model for observable data X, depending on an unknown parameter θ, is the familiar starting point. Our claim is that the sampling model alone is not sufficient for valid probabilistic inference. By “probabilistic inference” we mean a framework whereby any assertion/hypothesis about the unknown parameter has a (predictive) probability attached to it after data is observed. The sampling model facilitates a comparison of the chances of the event X = x on different probability spaces. For probabilistic inference, there must be a known distribution available, after data is observed. The random element that corresponds to this distribution shall be called a predictable quantity, and probabilistic inference is obtained by predicting this predictable quantity after seeing data. The probabilistic inference is “valid” if the predictive probabilities are suitably calibrated or, equivalently, have a fixed and known scale for meaningful interpretation. Without risk of confusion, we shall call the prediction of the predictable quantity valid if it admits valid probabilistic inference. We summarize this in the following validity principle (VP).

VALIDITY PRINCIPLE. Probabilistic inference requires associating an unobservable but predictable quantity with the observable data and unknown parameter. Probabilities to be used for inference are obtained by valid prediction of the predictable quantity.

The frequentist approach aims at developing procedures, such as confidence intervals and testing rules, having long-run frequency properties. Expressions like “95% confidence” have no predictive probability interpretation after data is observed, so frequentist methods are not probabilistic in our sense. Nevertheless, certain frequentist quantities, such as p-values, may be justifiable from a valid probabilistic inference point of view (Martin and Liu, 2014b).

When genuine prior information is available and can be summarized as a usual probability model, the corresponding Bayesian inference is both probabilistic and valid [see Martin and Liu (2014a), Remark 4]. When no genuine prior information is available, and a default prior distribution is used, the validity property is questionable. Probability matching priors, Bernstein–von Mises theorems, etc., are efforts to make posterior inference valid, in the sense above, at least approximately. The standard interpretation of these results is, for example, that Bayesian credible intervals have the nominal frequentist coverage probability asymptotically; see Fraser (2011). In that case, the remarks above concerning frequentist methods apply.

Fiducial inference was introduced by Fisher (1930) to avoid using artificial priors in scientific inference. Subsequent work includes structural inference (Fraser, 1968), the Dempster–Shafer theory of belief functions (Shafer, 1976; Dempster, 2008), generalized inference (Chiang, 2001; Weerahandi, 1993) and generalized fiducial inference (Hannig, 2009, 2013). Fiducial distributions are defined by expressing the parameter as a data-dependent function of a pivotal quantity. This results in a bona fide posterior distribution only in Fraser’s structural models and, in those cases, it corresponds to a Bayesian posterior (Lindley, 1958; Taraldsen and Lindqvist, 2013). Therefore, the fiducial distribution is meaningful when the corresponding Bayesian prior is meaningful in the sense above. More on fiducial from the IM perspective is given below.

2.2 IM Framework

The IM framework, proposed recently by Martin and Liu (2013), has its roots in fiducial and Dempster–Shafer theory; see also Martin, Zhang and Liu (2010). At a fundamental level, the IM approach is driven by VP. Here is a quick overview.

Write the sampling model/data-generating mechanism as

X = a(θ, U), U ∼ PU,   (1)

where X ∈ X is the observable data, θ ∈ Θ is the unknown parameter, and U ∈ U is an unobservable auxiliary variable with known distribution PU. Following VP, the goal is to use the data X and the distribution for U for meaningful probabilistic inference on θ without assuming a prior. The following three steps describe the IM construction.

A-STEP. Associate the observed data X = x, the parameter and the auxiliary variable via (1) and construct the set-valued mapping, given by

Θx(u) = {θ : x = a(θ, u)}, u ∈ U.

The fiducial approach considers the distribution of Θx(U) as a function of U ∼ PU. The IM framework, on the other hand, predicts the unobserved U using a random set.

P-STEP. Predict the unobservable U with a random set S. The distribution PS of S is required to be valid in the sense that

f(U) ≥st Unif(0, 1),   (2)



where f(u) = PS(S ∋ u) and ≥st means “stochastically no smaller than.”
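For instance (our own check, assuming the interval-type “default” predictive random set used in Martin and Liu (2013), S = {u : |u − 0.5| ≤ |U′ − 0.5|} with U′ ∼ Unif(0, 1)): here f(u) = PS(S ∋ u) = P{|U′ − 0.5| ≥ |u − 0.5|} = 1 − 2|u − 0.5|, and for U ∼ Unif(0, 1) the quantity f(U) = 1 − 2|U − 0.5| is itself Unif(0, 1), so condition (2) holds with equality in distribution.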

C-STEP. Combine Θx(·) and S as Θx(S) = ⋃u∈S Θx(u). For a given assertion A ⊆ Θ, evaluate the evidence in x for and not against the claim “θ ∈ A” via the belief and plausibility functions:

belx(A) = PS{Θx(S) ⊆ A},   plx(A) = PS{Θx(S) ∩ A ≠ ∅}.

In cases where Θx(S) is empty with positive PS-probability, some adjustments to the above formulas are needed; see Martin and Liu (2013).

The IM output is the pair of functions (belx, plx) and, when applied to an assertion A about the parameter θ of interest, these provide a (personal) probabilistic summary of the evidence in data X = x supporting the truthfulness of A. Property (2) guarantees that the IM output is valid. Our focus in the remainder of the discussion will be on the A-step and, in particular, auxiliary variable dimension reduction. Such concerns about dimensionality are essential for efficient inference on θ. These are also closely tied to the classical ideas of sufficiency and conditionality.
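To fix ideas, here is a minimal Monte Carlo sketch of the three steps for the textbook case of a single observation X = θ + Z with Z ∼ N(0, 1), written as X = θ + Φ⁻¹(U) with U ∼ Unif(0, 1), and the same default random set as above. The function name and the interval form of the assertion are our own choices for illustration, not part of the IM papers.

    import numpy as np
    from scipy.stats import norm

    def im_normal_mean(x, a, b, n_draws=100_000, seed=0):
        """Belief and plausibility for the assertion A = [a, b] about theta,
        from one observation X = theta + Phi^{-1}(U), U ~ Unif(0, 1)."""
        rng = np.random.default_rng(seed)
        u_prime = rng.uniform(size=n_draws)             # draws indexing the random set S
        d = np.abs(u_prime - 0.5)                       # S = {u : |u - 0.5| <= d}
        half_width = norm.ppf(0.5 + d)                  # Theta_x(S) = [x - hw, x + hw]
        lo, hi = x - half_width, x + half_width
        bel = np.mean((lo >= a) & (hi <= b))            # P_S{Theta_x(S) subset of A}
        pl = np.mean((hi >= a) & (lo <= b))             # P_S{Theta_x(S) intersects A}
        return bel, pl

    # Example: x = 1.5 observed, assertion A = [0, 2] about theta
    print(im_normal_mean(1.5, 0.0, 2.0))

For a point assertion A = {θ0}, the same construction gives plx({θ0}) = 2(1 − Φ(|x − θ0|)), the familiar two-sided p-value, which is the sort of connection drawn in Martin and Liu (2014b).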

We end this section with a few remarks on fiducial. If one takes the predictive random set S in the P-step as a singleton, that is, S = {U}, where U ∼ PU, then belx and plx are equal and equal to the fiducial distribution. In this sense, fiducial provides probabilistic inference. However, the singleton predictive random set is not valid in the sense of (2), so fiducial inference is generally not valid, violating the second part of VP. One can also reconstruct the fiducial distribution by choosing a valid predictive random set S so that belx(A) equals the fiducial probability of A for all suitable A ⊆ Θ. But this would generally require that S depend on the observed X = x, and the resulting inference suffers from a selection bias, or double-use of the data, resulting in unjustifiably large belief probabilities.

3. EFFICIENCY AND DIMENSION REDUCTION

3.1 An Efficiency Principle

It is natural to strive for efficient statistical inference. In the context of IMs, we want plX(A) to be as stochastically small as possible, as a function of X, when the assertion A about θ is false. To connect this to classical efficiency, plX(A) can be interpreted like the p-value for testing H0 : θ ∈ A, so stochastically small plausibility when A is false corresponds to high power of the test. We state the following efficiency principle (EP).

EFFICIENCY PRINCIPLE. Subject to the validity constraint, probabilistic inference should be made as efficient as possible.

EP is purposefully vague: it allows for a variety of techniques to be employed to increase efficiency. The next section discusses one important technique related to auxiliary variable dimension reduction.

3.2 Improved Efficiency via Dimension Reduction

In the classical setting, sufficiency reduces the data to a good summary statistic. In the IM context, however, the dimension of the auxiliary variable, not the data, is of primary concern. For example, in the case of iid sampling, the dimension of U is the same as that of X, which is usually greater than that of θ. In such cases, it is inefficient to predict a high-dimensional auxiliary variable for inference on a lower dimensional parameter. The idea is to reduce the dimension of U to that of θ. This auxiliary variable dimension reduction will indirectly result in some transformation of the data.

How to reduce the dimension of U? We seek a new auxiliary variable V, of the same dimension as θ, such that the baseline association (1) can be rewritten as

T(X) = b(θ, V), V ∼ PV,   (3)

for some functions T and b, and distribution PV. Here PV may actually depend on some features of the data X. Such a dimension reduction is general, but Martin and Liu (2014a) consider an important case, which we summarize here. Suppose we have two one-to-one mappings, x → (T(x), H(x)) and u → (τ(u), η(u)), with the requirement that η(U) = H(X). Since H(X) is observable, so too must be the feature η(U) of U. This point has two important consequences: first, a feature of U that is observed need not be predicted, hence a dimension reduction; second, the feature of U that is observed naturally provides some information about the part that remains unobserved, so conditioning should help improve prediction. By construction, the baseline association (1) is equivalent to

T(X) = b(θ, τ(U)) and H(X) = η(U),   (4)

and this suggests an association of the form (3), where V = τ(U) and PV is the conditional distribution of τ(U) given η(U) = H(X).

It remains to discuss how the dimension reduction strategy described above relates to EP. The following theorem gives one relatively simple illustration of the improved efficiency via dimension reduction.



THEOREM 1. Suppose that the baseline association (1) can be rewritten as (4) and that τ(U) and η(U) are independent. Then inference based on T(X) = b(θ, τ(U)) alone, by a valid prediction of τ(U), ignoring η(U), is at least as efficient as inference from (4) by a valid prediction of (τ(U), η(U)).

See the corollary to Proposition 1 in Martin and Liu (2014a, full-length version); see also Liu and Martin (2015). Therefore, reducing the dimension of the auxiliary variable cannot make inference less efficient. The point in that paper is that reducing the dimension actually improves efficiency, hence EP.

In the standard examples, for example, regular exponential families, our dimension reduction above corresponds to classical sufficiency; see Example 1. Outside the standard examples, the IM dimension reduction gives something different from sufficiency; in particular, the former often leads directly to further dimension reduction compared to the latter; see Example 2. That the IM dimension reduction naturally contains some form of conditioning is an advantage. The absence of conditioning in the standard definition of sufficiency is one possible reason why conditional inference has yet to become part of the mainstream. The IM framework also has advantages beyond dimension reduction and conditioning. In particular, the IM output gives valid prior-free probabilistic inference on θ.

3.3 “Dimension Reduction Entails SP and CP”

This section draws some connections between the IM dimension reduction above and SP and CP. First, it is clear that following the auxiliary variable dimension reduction strategy described above entails CP. In the Cox example, the randomization that determines which measurement instrument will be used corresponds to an auxiliary variable whose value is observed completely. So, our auxiliary variable dimension reduction strategy implies conditioning on the actual instrument used, hence CP. For SP, Theorem 1 gives some insight. That is, when a sufficient statistic has dimension the same as θ, one can take T(X) as that sufficient statistic and select independent τ(U) and η(U). In general, our dimension reduction and efficiency considerations are more meaningful than sufficiency and SP. The examples below illustrate this point further.

EXAMPLE 1. Suppose X1, X2 are independent N(θ, 1) samples, and write the association as Xi = θ + Ui, where U1, U2 are independent standard normal. In this case, there are lots of candidate mappings (τ, η) to rewrite the baseline association in the form (4). Two choices are {τ(u) = u1, η(u) = u2 − u1} and {τ(u) = ū, η(u) = (u1 − ū, u2 − ū)}, where ū = (u1 + u2)/2. At first look, the second choice, corresponding to sufficiency, seems better than the first. However, the dimension-reduced associations (3) based on these two choices are exactly the same. This means, first, there is nothing special about sufficiency in light of proper conditioning (Evans, Fraser and Monette, 1986; Fraser, 2004). Second, it suggests that, at least in simple problems, the dimension-reduced association (3) does not depend on the choice of (τ, η), that is, it only depends on the sufficient statistic, hence SP. The message here holds more generally, though a rigorous formulation remains to be worked out.
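A short check of the claimed equality, using only standard facts about conditioning for the bivariate normal (our own verification, not in the original): for the first choice, T(X) = X1 = θ + U1 and H(X) = X2 − X1 = η(U). Since Cov(U1, U2 − U1) = −1 and Var(U2 − U1) = 2, the conditional distribution is U1 | {U2 − U1 = h} ∼ N(−h/2, 1/2). Writing U1 = −h/2 + W with W ∼ N(0, 1/2) and h = x2 − x1 observed, the dimension-reduced association becomes X1 = θ − (x2 − x1)/2 + W, that is, θ = x̄ − W with W ∼ N(0, 1/2), where x̄ = (x1 + x2)/2. For the second choice, X̄ = θ + Ū with Ū ∼ N(0, 1/2) independent of (U1 − Ū, U2 − Ū), so conditioning on the observed contrasts changes nothing and again θ = x̄ − Ū with Ū ∼ N(0, 1/2). The two reduced associations coincide, as claimed.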

EXAMPLE 2. Consider independent exponential random variables X1, X2, the first with mean θ and the second with mean θ⁻¹. In this case, the minimal sufficient statistic, (X1, X2), is two-dimensional while the parameter is one-dimensional. Martin and Liu (2014a) take the baseline association as X1 = θU1 and X2 = θ⁻¹U2, where U1, U2 are independent standard exponential. They employ a novel partial differential equations technique to identify a function η of (U1, U2) whose value is fully observed, so that only a scalar auxiliary variable needs to be predicted. Their solution is equivalent to that based on the conditional distribution of the maximum likelihood estimate given an ancillary statistic (Fisher, 1973; Ghosh, Reid and Fraser, 2010). The message here is that the IM-based auxiliary variable dimension reduction strategy does something similar to the classical strategy of conditioning on ancillary statistics, but it does so in a mostly automatic way.

4. CONCLUDING REMARKS

Professor Mayo is to be congratulated for her contribution. Besides resolving the controversy surrounding Birnbaum’s theorem, her paper is an invitation for a fresh discussion on the foundations of statistical inference. Though LP no longer constrains the frequentist approach, we have argued here that something more than the basic sampling model is required for valid statistical inference. The IM framework features the prediction of unobserved auxiliary variables as this “something more,” and the idea of reducing the dimension of the auxiliary variable before prediction leads to improved efficiency, accomplishing what SP and CP are meant to do. We expect further developments for and from IMs in years to come.



ACKNOWLEDGMENTS

Supported in part by NSF Grants DMS-10-07678, DMS-12-08833 and DMS-12-08841.

REFERENCES

BIRNBAUM, A. (1962). On the foundations of statistical inference. J. Amer. Statist. Assoc. 57 269–326. MR0138176
CHIANG, A. K. L. (2001). A simple general method for constructing confidence intervals for functions of variance components. Technometrics 43 356–367. MR1943189
DEMPSTER, A. P. (2008). The Dempster–Shafer calculus for statisticians. Internat. J. Approx. Reason. 48 365–377. MR2419025
EVANS, M. (2013). What does the proof of Birnbaum’s theorem prove? Electron. J. Stat. 7 2645–2655. MR3121626
EVANS, M. J., FRASER, D. A. S. and MONETTE, G. (1986). On principles and arguments to likelihood. Canad. J. Statist. 14 181–199. MR0859631
FISHER, R. A. (1930). Inverse probability. Math. Proc. Cambridge Philos. Soc. 26 528–535.
FISHER, R. A. (1973). Statistical Methods and Scientific Inference, 3rd ed. Hafner Press, New York.
FRASER, D. A. S. (1968). The Structure of Inference. Wiley, New York. MR0235643
FRASER, D. A. S. (2004). Ancillaries and conditional inference. Statist. Sci. 19 333–369. MR2140544
FRASER, D. A. S. (2011). Is Bayes posterior just quick and dirty confidence? Statist. Sci. 26 299–316. MR2918001
GHOSH, M., REID, N. and FRASER, D. A. S. (2010). Ancillary statistics: A review. Statist. Sinica 20 1309–1332. MR2777327
HANNIG, J. (2009). On generalized fiducial inference. Statist. Sinica 19 491–544. MR2514173
HANNIG, J. (2013). Generalized fiducial inference via discretization. Statist. Sinica 23 489–514. MR3086644
LINDLEY, D. V. (1958). Fiducial distributions and Bayes’ theorem. J. R. Stat. Soc. Ser. B Stat. Methodol. 20 102–107. MR0095550
LIU, C. and MARTIN, R. (2015). Inferential Models: Reasoning with Uncertainty. Monographs in Statistics and Applied Probability Series. Chapman & Hall, London. To appear.
MARTIN, R. and LIU, C. (2013). Inferential models: A framework for prior-free posterior probabilistic inference. J. Amer. Statist. Assoc. 108 301–313. MR3174621
MARTIN, R. and LIU, C. (2014a). Conditional inferential models: Combining information for prior-free probabilistic inference. J. R. Stat. Soc. Ser. B. Stat. Methodol. To appear. Full-length preprint version at arXiv:1211.1530. DOI: 10.1111/rssb.12070
MARTIN, R. and LIU, C. (2014b). A note on p-values interpreted as plausibilities. Statist. Sinica. To appear. Available at arXiv:1211.1547. DOI: 10.5705/ss.2013.087
MARTIN, R., ZHANG, J. and LIU, C. (2010). Dempster–Shafer theory and statistical inference with weak beliefs. Statist. Sci. 25 72–87. MR2741815
SHAFER, G. (1976). A Mathematical Theory of Evidence. Princeton Univ. Press, Princeton, NJ. MR0464340
TARALDSEN, G. and LINDQVIST, B. H. (2013). Fiducial theory and optimal inference. Ann. Statist. 41 323–341. MR3059420
WEERAHANDI, S. (1993). Generalized confidence intervals. J. Amer. Statist. Assoc. 88 899–905. MR1242940


Statistical Science 2014, Vol. 29, No. 2, 252–253 DOI: 10.1214/14-STS473 Main article DOI: 10.1214/13-STS457 © Institute of Mathematical Statistics, 2014

Discussion: On Arguments Concerning Statistical Principles
D. A. S. Fraser

(i) Statistical inference after Neyman–Pearson. Statistical inference as an alternative to Neyman–Pearson decision theory has a long history in statistical thinking, with strong impetus from Fisher's research; see, for example, the overview in Fisher (1956). Some resulting concerns in inference theory then reached the mathematical statistics community rather forcefully with Cox (1958); this had focus on the two measuring-instruments example and on uses of conditioning that were compelling.

(ii) Birnbaum and logical analysis in statistical inference. Birnbaum (1962) introduced notation for the statistical inference available from an investigation with a model and data. This gave grounds to analyze how different methods or principles might influence the statistical inference. As part of this he discussed how sufficiency, likelihood and conditioning could differentially affect statistical inference. Much of his discussion centered on the argument from conditioning and sufficiency to likelihood, but a primary consequence was the attention attracted to conditioning and its role in inference. While this interest in conditioning was substantial for those concerned with the core of statistics, it has more recently been neglected or overlooked. Indeed, some recent texts, for example, Rice (2007), seem not to acknowledge conditioning in inference or even the measuring-instrument example.

(iii) Mayo and statistical principles. Mayo should be strongly commended for reminding us that the principles and arguments of statistical inference deserve very serious consideration and, we might add, could have very serious consequences (Fraser, 2014). Her primary focus is on the argument (Birnbaum, 1962) that the principles sufficiency and conditionality lead to the likelihood principle. This may not cover some recent aspects of conditioning (Fraser, Fraser and Staicu, 2010), but should strongly stimulate renewed interest in conditioning.

D. A. S. Fraser is Emeritus Professor, Department of Statistical Sciences, University of Toronto, 100 St. George St., Toronto, Ontario M5S 3G3, Canada (e-mail: [email protected]).

(iv) Contemporary inference theory. Many statistical models have continuity in how parameter change affects observable variables or, more specifically, how parameter change affects coordinate quantile functions, the inverses of the coordinate distribution functions. This continuity in its global effect is widely neglected in statistical inference. If this effect on quantile functions is accepted and used in the inference procedures, then in wide generality there is a well-determined conditioning (Fraser, Fraser and Staicu, 2010). And likelihood analysis then offers an exponential model approximation that is third-order equivalent to the given model, and this in turn provides third-order inference for any scalar component parameters of interest. Thus, the familiar conditioning conflicts are routinely avoided by acknowledging the important model continuity.

(v) What is available? The conditioning just described leads routinely to p-value functions p(ψ) for any scalar component parameter ψ = ψ(θ) of the statistical model. A wealth of statistical inference methodology then immediately becomes available from such p-value functions. For example, a test for a value ψ0 is given by the p-value p(ψ0), a confidence interval by the inverse (ψ_{β/2}, ψ_{1−β/2}) = p⁻¹(1 − β/2, β/2) of the p-value function, and a median estimate by the value p⁻¹(0.5). But quite generally the needed p-value functions are not available from a likelihood function alone!
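As a minimal numerical sketch of how a p-value function delivers these quantities (purely illustrative and not from Fraser's discussion; it assumes a simple normal model with known variance, and the values of xbar, sigma and n are made up):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setting: x_1, ..., x_n i.i.d. N(psi, sigma^2) with sigma known,
# so the p-value function for psi is p(psi) = Phi(sqrt(n) * (xbar - psi) / sigma).
def p_value_function(psi, xbar, sigma, n):
    return norm.cdf(np.sqrt(n) * (xbar - psi) / sigma)

def p_inverse(level, xbar, sigma, n):
    # Invert p(psi) = level analytically for this simple model.
    return xbar - norm.ppf(level) * sigma / np.sqrt(n)

xbar, sigma, n = 1.2, 1.0, 25

# Test of psi0 = 1: report the p-value p(psi0).
print("p(1.0) =", p_value_function(1.0, xbar, sigma, n))

# 95% confidence interval: invert the p-value function at 1 - beta/2 and beta/2.
beta = 0.05
ci = (p_inverse(1 - beta / 2, xbar, sigma, n), p_inverse(beta / 2, xbar, sigma, n))
print("95% CI:", ci)

# Median estimate: p^{-1}(0.5).
print("median estimate:", p_inverse(0.5, xbar, sigma, n))
```

The point of the sketch is only that a single function ψ ↦ p(ψ) carries tests, intervals and a point estimate at once, which a likelihood function by itself does not.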

(vi) What are the implications? If continuity is included as an ingredient of many model-data combinations, then, as we have indicated, likelihood analysis produces p-values and confidence intervals, and these are not available from the likelihood function alone. This thus demonstrates that with such continuity-based conditioning the likelihood principle is not a consequence of sufficiency and conditioning principles. But if we omit the continuity then we are directly faced with the issue addressed by Mayo.

ACKNOWLEDGMENTS

Supported in part by the Natural Sciences and Engineering Research Council of Canada and Senior Scholars Funding at York University.




REFERENCES

BIRNBAUM, A. (1962). On the foundations of statistical inference (with discussion). J. Amer. Statist. Assoc. 57 269–326. MR0138176

COX, D. R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29 357–372. MR0094890

FISHER, R. A. (1956). Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh.

FRASER, D. A. S. (2014). Why does statistics have two theories? In Past, Present and Future of Statistical Science (X. Lin, D. Banks, C. Genest, G. Molenberghs, D. Scott and J.-L. Wang, eds.) 237–252. CRC Press, Boca Raton, FL.

FRASER, A. M., FRASER, D. A. S. and STAICU, A.-M. (2010). Second order ancillary: A differential view from continuity. Bernoulli 16 1208–1223. MR2759176

RICE, J. (2007). Mathematical Statistics and Data Analysis, 3rd ed. Brooks/Cole, Belmont, CA.


Statistical Science 2014, Vol. 29, No. 2, 254–258 DOI: 10.1214/14-STS474 Main article DOI: 10.1214/13-STS457 © Institute of Mathematical Statistics, 2014

Discussion of "On the Birnbaum Argument for the Strong Likelihood Principle"
Jan Hannig

Abstract. In this discussion we demonstrate that fiducial distributions provide a natural example of an inference paradigm that does not obey the Strong Likelihood Principle while still satisfying the Weak Conditionality Principle.

Key words and phrases: Generalized fiducial inference, strong likelihood principle violation, weak conditionality principle.

Professor Mayo should be congratulated on bringing new light into the veritable arguments about statistical foundations. It is well documented that p-values, confidence intervals and hypothesis tests do not satisfy the Strong Likelihood Principle (SLP). In the next section we will demonstrate that fiducial distributions provide a natural example of an inference paradigm that breaks SLP while still satisfying the Weak Conditionality Principle (WCP).

1. HISTORY OF FIDUCIAL INFERENCE

The origin of Generalized Fiducial Inference can be traced back to R. A. Fisher (Fisher, 1930, 1933, 1935) who introduced the concept of a fiducial distribution for a parameter and proposed the use of this fiducial distribution in place of the Bayesian posterior distribution. In the case of a one-parameter family of distributions, Fisher gave the following definition for a fiducial density r(θ) of the parameter based on a single observation x for the case where the cdf F(x, θ) is a decreasing function of θ:

\[
r(\theta) = -\frac{\partial F(x,\theta)}{\partial\theta}.
\tag{1.1}
\]
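For instance (a standard illustration, not part of Hannig's text): with a single observation from N(θ, 1), F(x, θ) = Φ(x − θ) is decreasing in θ, and (1.1) gives

\[
r(\theta) = -\frac{\partial}{\partial\theta}\,\Phi(x-\theta) = \varphi(x-\theta),
\]

so the fiducial distribution of θ is N(x, 1), which in this special case coincides with the Bayesian posterior under a flat prior.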

For multiparameter families of distributions Fisher did not give a formal definition. Moreover, the fiducial approach led to confidence sets whose frequentist coverage probabilities were close to the claimed confidence levels but they were not exact in the frequentist sense. Fisher's proposal led to major discussions among the prominent statisticians of the 1930s, 40s and 50s (e.g., Dempster, 1966, 1968; Fraser, 1961a, 1961b, 1966, 1968; Jeffreys, 1940; Lindley, 1958; Stevens, 1950). Many of these discussions focused on the nonexactness of the confidence sets and also on the nonuniqueness of fiducial distributions. The latter part of the 20th century has seen only a handful of publications (Barnard, 1995; Dawid, Stone and Zidek, 1973; Salome, 1998; Dawid and Stone, 1982; Wilkinson, 1977) as the fiducial approach fell into disfavor and became a topic of historical interest only.

Jan Hannig is Professor, Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, 330 Hanes Hall, Chapel Hill, North Carolina 27599-3260, USA (e-mail: [email protected]).

Since the mid-2000s, there has been a true resurrection of interest in modern modifications of fiducial inference. These approaches have become known under the umbrella name of distributional inference. This increase of interest came both in the number of different approaches to the problem and the number of researchers working on these problems, and manifested itself in an increasing number of publications in premier journals. The common thread for these approaches is a definition of inferentially meaningful probability statements about subsets of the parameter space without the need for subjective prior information.

These modern approaches include the Dempster–Shafer theory (Dempster, 2008; Edlefsen, Liu and Dempster, 2009) and its recent extension called inferential models (Martin, Zhang and Liu, 2010; Zhang and Liu, 2011; Martin and Liu, 2013a, 2013b, 2013c, 2013d). A somewhat different approach termed confidence distributions looks at the problem of obtaining an inferentially meaningful distribution on the parameter space from a purely frequentist point of view (Xie and Singh, 2013). One of the main contributions of this approach is the ability to combine information from disparate sources with deep implications for meta analysis (Schweder and Hjort, 2002; Singh, Xie and Strawderman, 2005; Xie, Singh and Strawderman, 2011; Hannig and Xie, 2012; Xie et al., 2013). Another more mature approach is called objective Bayesian inference that aims at finding nonsubjective model-based priors. An example of a recent breakthrough in this area is the modern development of reference priors (Berger, 1992; Berger and Sun, 2008; Berger, Bernardo and Sun, 2009, 2012; Bayarri et al., 2012). Another related approach is based on higher order likelihood expansions and implied data dependent priors (Fraser, Fraser and Staicu, 2010; Fraser, 2004, 2011; Fraser and Naderi, 2008; Fraser et al., 2010; Fraser, Reid and Wong, 2005). There is also important initial work showing how some simple fiducial distributions that are not Bayesian posteriors naturally arise within the decision theoretical framework (Taraldsen and Lindqvist, 2013).

Arguably, Generalized Fiducial Inference has been on the forefront of the modern fiducial revival. Starting in the early 1990s, the work of Tsui and Weerahandi (1989, 1991) and Weerahandi (1993, 1994, 1995) on generalized confidence intervals and the work of Chiang (2001) on the surrogate variable method for obtaining confidence intervals for variance components led to the realization that there was a connection between these new procedures and fiducial inference. This realization evolved through a series of works in the early 2000s (Hannig, 2009; Hannig, Iyer and Patterson, 2006; Iyer, Wang and Mathew, 2004; Patterson, Hannig and Iyer, 2004). The strengths and limitations of the fiducial approach are starting to be better understood; see especially Hannig (2009, 2013). In particular, the asymptotic exactness of fiducial confidence sets, under fairly general conditions, was established in Hannig (2013); Hannig, Iyer and Patterson (2006); Sonderegger and Hannig (2014). Generalized fiducial inference has also been extended to prediction problems in Wang, Hannig and Iyer (2012). Computational issues were discussed in Cisewski and Hannig (2012), Hannig, Lai and Lee (2014), and model selection in the context of Generalized Fiducial Inference has been studied in Hannig and Lee (2009); Lai, Hannig and Lee (2013).

2. GENERALIZED FIDUCIAL DISTRIBUTION AND THE WEAK CONDITIONALITY PRINCIPLE

Most modern incarnations of fiducial inference begin with expressing the relationship between the data, X, and the parameters, ξ, as

\[
X = G(U,\xi),
\tag{2.1}
\]

where G(·, ·) is termed the data generating equation (also called the association equation or structural equation) and U is the random component of this data generating equation whose distribution is free of parameters and completely known.

After observing the data x, the next step is to use the known distribution of U and the inverse of the data generating equation (2.1) to define probabilities for subsets of the parameter space. In particular, Generalized Fiducial Inference defines a distribution on the parameter space as the weak limit as ε → 0 of the conditional distribution

\[
\arg\min_{\xi}\,\bigl\|x - G(U^{*},\xi)\bigr\|
\;\Big|\;
\Bigl\{\min_{\xi}\,\bigl\|x - G(U^{*},\xi)\bigr\| \le \varepsilon\Bigr\},
\tag{2.2}
\]

where U* has the same distribution as U. If there are multiple values minimizing the norm, the operator arg min_ξ selects one of them (possibly at random). We stress at this point that the Generalized Fiducial Distribution is not unique. For example, different data generating equations can give a somewhat different Generalized Fiducial Distribution. Notice also that if P(min_ξ ‖x − G(U*, ξ)‖ = 0) > 0, which is the case for discrete distributions, the limit in (2.2) is the conditional distribution evaluated at ε = 0.
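As a minimal illustration of (2.2) (a sketch under simple assumptions, not taken from the discussion): for the location model X = ξ + U with U ∼ N(0, 1) and observed x, exact inversion is always possible, so the ε-conditioning is vacuous and

\[
\arg\min_{\xi}\,\|x-(\xi+U^{*})\| = x - U^{*} \sim N(x,1),
\]

that is, the generalized fiducial distribution of ξ is simply N(x, 1) in this case; the ε-conditioning only matters when the data generating equation cannot be inverted exactly.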

The conditional form of (2.2) immediately implies the Weak Conditionality Principle for the limiting Generalized Fiducial Distribution. To demonstrate this, let us consider the two-instrument example of Cox (1958) (see also Section 4.1 of the discussed article). The data generating equation can be written in a hierarchical form:

\[
M = 1 + I_{(0,1/2)}(U),
\qquad
X = \theta + \sigma_M Z,
\]

where U ∼ U(0, 1) and Z ∼ N(0, 1) are independent and the precisions σ1 << σ2 are known. If both X = x, the measurement made, and M = m, the instrument used (m = 1, 2 for machine 1 and 2, respectively), are observed, the conditional distribution (2.2) is N(x, σ_m²), only taking into account the experiment actually performed. On the other hand, if only M is unobserved, then the conditional distribution (2.2) is the mixture 0.5 N(x, σ₁²) + 0.5 N(x, σ₂²). As claimed, the Generalized Fiducial Distribution follows WCP in this example.
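A small simulation sketch of this example may help fix ideas (illustrative only; the σ values and the observed measurement below are made up, and the code simply follows the hierarchical data generating equation above, with fiducial draws centered at the observed x):

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = {1: 0.1, 2: 10.0}   # hypothetical known sigmas, sigma_1 << sigma_2

def fiducial_theta_sample(x, m=None, size=100_000):
    """Sample the generalized fiducial distribution of theta given measurement x.

    If the instrument label m is observed, condition on it (the WCP behaviour);
    if m is None, the label is itself an unobserved auxiliary variable, which
    gives the 50/50 mixture over the two components.
    """
    m_star = np.full(size, m) if m is not None else rng.integers(1, 3, size=size)
    z_star = rng.standard_normal(size)
    sig = np.where(m_star == 1, SIGMA[1], SIGMA[2])
    return x - sig * z_star          # invert x = theta + sigma_M * z

x_obs = 2.3
print("sd, machine 1 known: ", fiducial_theta_sample(x_obs, m=1).std())
print("sd, machine unknown: ", fiducial_theta_sample(x_obs).std())
```

With the instrument observed, the spread reflects only that instrument; with it unobserved, the spread reflects the mixture, exactly the contrast WCP is about.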



3. GENERALIZED FIDUCIAL DISTRIBUTION AND THE STRONG LIKELIHOOD PRINCIPLE

In general, the Generalized Fiducial Distribution does not satisfy the Strong Likelihood Principle. We first demonstrate this on inference for the geometric distribution. To begin, we perform some preliminary calculations. Let X be a random variable with discrete distribution function F(x, ξ). Let us assume for simplicity of presentation that for each fixed x, F(x, ξ) is monotone in ξ and spans the whole [0, 1]. The inverse distribution function F⁻¹(u, ξ) = inf{x : F(x, ξ) ≥ u} forms a natural data generating equation

\[
X = F^{-1}(U,\xi),\qquad U\sim U(0,1).
\]

The minimizer in (2.2) is not unique, but any fiducial distribution will have a distribution function satisfying 1 − F(x, ξ) ≤ H(ξ) ≤ 1 − F(x−, ξ) if F(x, ξ) is decreasing in ξ, and F(x−, ξ) ≤ H(ξ) ≤ F(x, ξ) if F(x, ξ) is increasing. To resolve this nonuniqueness, Hannig (2009) and Efron (1998) recommend using the half correction, which is the mixture distribution with distribution function H(ξ) = 1 − (F(x, ξ) + F(x−, ξ))/2 if F(x, ξ) is decreasing in ξ, or H(ξ) = (F(x, ξ) + F(x−, ξ))/2 if F(x, ξ) is increasing.

Let us now consider observing a random variable N = n following the Geometric(p) distribution. SLP implies that the inference based on observing N = n should be the same as inference based on observing X = 1 where X is Binomial(n, p). However, the geometric based Generalized Fiducial Distribution has a distribution function between 1 − (1 − p)^(n−1) ≤ H_G(p) ≤ 1 − (1 − p)^n. The binomial based Generalized Fiducial Distribution uses bounds 1 − (1 − p)^n − np(1 − p)^(n−1) ≤ H_B(p) ≤ 1 − (1 − p)^n. Thus, the effect of the stopping rule demonstrates itself in the Generalized Fiducial Inference through the lower bound, which is much closer to the upper bound in the case of the geometric distribution. (We remark that one cannot ignore the lower bound, as the upper bound is used to form upper confidence intervals and the lower bound is used for lower confidence intervals on p.) To conclude, the fiducial distribution in this example depends on both the distribution function of x and also on the distribution function of x − 1.
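A short numerical sketch of this SLP violation (illustrative only, using the half-corrected distribution functions just described; n and the p grid are arbitrary choices):

```python
import numpy as np

def geometric_fiducial_cdf(p, n):
    """Half-corrected fiducial CDF for p after observing N = n (trials to first success).

    F_N(n, p) = 1 - (1 - p)^n is increasing in p, so H(p) = (F(n, p) + F(n - 1, p)) / 2.
    """
    return 0.5 * ((1 - (1 - p) ** n) + (1 - (1 - p) ** (n - 1)))

def binomial_fiducial_cdf(p, n):
    """Half-corrected fiducial CDF for p after observing X = 1 success in n Binomial trials.

    F_X(1, p) is decreasing in p, so H(p) = 1 - (F(1, p) + F(0, p)) / 2.
    """
    F1 = (1 - p) ** n + n * p * (1 - p) ** (n - 1)
    F0 = (1 - p) ** n
    return 1 - 0.5 * (F1 + F0)

p_grid = np.linspace(0.01, 0.99, 5)
n = 10
print(geometric_fiducial_cdf(p_grid, n))
print(binomial_fiducial_cdf(p_grid, n))
# The two fiducial CDFs differ even though the two likelihoods are proportional.
```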

Let us now turn our attention to continuous distributions. In particular, assume that the parameter ξ ∈ Ξ ⊂ R^p is p-dimensional and that the inverse to (2.1), G⁻¹(x, ξ) = u, exists. Then under some differentiability assumptions, Hannig (2013) shows that the generalized fiducial distribution is absolutely continuous with density

\[
r(\xi) = \frac{f(x,\xi)\,J(x,\xi)}{\int_{\Xi} f(x,\xi')\,J(x,\xi')\,d\xi'},
\tag{3.1}
\]

where f(x, ξ) is the likelihood and the function J(x, ξ) is

\[
J(x,\xi)
= \sum_{\substack{i=(i_1,\ldots,i_p)\\ 1\le i_1<\cdots<i_p\le n}}
\left|\det\Bigl(\tfrac{d}{d\xi}\,G(u,\xi)\Big|_{u=G^{-1}(x,\xi)}\Bigr)_{i}\right|,
\tag{3.2}
\]

where dG(u, ξ)/dξ is the n × p Jacobian matrix of partial derivatives computed with respect to the components of ξ. The sum in (3.2) spans over all p-tuples of indexes i = (1 ≤ i1 < ··· < ip ≤ n) ⊂ {1, …, n}. Additionally, for any n × p matrix J, the sub-matrix (J)_i is the p × p matrix containing the rows i = (i1, …, ip) of J. The form of (3.1) suggests that as long as the Jacobian J(x, ξ) does not separate into J(x, ξ) = f(x)g(ξ), in which case the Generalized Fiducial Distribution is the same as the Bayes posterior with g(ξ) used as a prior, the Generalized Fiducial Distribution does not satisfy SLP due to the dependence on dG(u, ξ)/dξ.
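As a quick check of (3.1)–(3.2) in the simplest case (a standard example stated here only for orientation, not drawn from the discussion): for the location model X_i = ξ + U_i with p = 1,

\[
G(u,\xi)=\xi\,\mathbf{1}_n+u,
\qquad
\frac{d}{d\xi}G(u,\xi)=\mathbf{1}_n,
\qquad
J(x,\xi)=\sum_{i=1}^{n}|1| = n,
\]

so the Jacobian is constant, the separation condition holds trivially, and (3.1) reduces to the Bayesian posterior under a flat prior; any SLP violation must come from models where dG(u, ξ)/dξ genuinely depends on ξ.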

4. GENERALIZED FIDUCIAL DISTRIBUTION AND THE SUFFICIENCY PRINCIPLE

Whether the Generalized Fiducial Distribution satisfies the sufficiency principle depends entirely on what data generating equation is chosen. For example, let us assume that Y = (S(X), A(X))′, where S is a p-dimensional sufficient statistic, A is ancillary and X satisfies (2.1). Because dA/dξ = 0, the sum in (3.2) contains only one nonzero term:

\[
J(x,\xi)=\left|\det\Bigl(\tfrac{d}{d\xi}\,S\bigl(G(u,\xi)\bigr)\Big|_{u=G^{-1}(x,\xi)}\Bigr)\right|.
\tag{4.1}
\]

Let s = S(x) and a = A(x) be the observed values of the sufficient and ancillary statistics, respectively. To interpret the Generalized Fiducial Distribution, assume that there is a unique ξ solving s = S(G(u, ξ)) for every u and denote this solution Q_s(u) = ξ. Also assume that the ancillary data generating equation A(G(u, ξ)) = A(u) is not a function of ξ. A straightforward calculation shows that the fiducial density (3.1) with (4.1) is the conditional distribution of Q_s(U*) | A(U*) = a. We conclude that this choice of data generating equation leads to inference based on sufficient statistics conditional on the ancillary. However, we still do not expect the SLP to hold in general even for this data generating equation.



Heuristically, this is because GFI is using not only the data observed, but also the data that, based on the data generating equation, could have been observed in the neighborhood of the observed data.

5. FINAL REMARKS

Let us close by discussing the example of Section 3.1. While the paper is not very clear on the exact specification of the events, it appears that for experiment 1 we observe the event

\[
O_1 = \bigl\{\,y_{169} > 1.96\,\sigma/\sqrt{169}\,\bigr\},
\]

while for experiment 2 we observe

\[
O_2 = \bigl\{\,y_k \le 1.96\,\sigma/\sqrt{k},\ k=1,\ldots,168,\ \ y_{169} > 1.96\,\sigma/\sqrt{169}\,\bigr\}.
\]

Since O₂ ⊂ O₁, we see that the likelihood

\[
P_\theta(O_2) = P_\theta(O_2\mid O_1)\,P_\theta(O_1).
\]

Consequently, we would have an SLP pair if and only if P_θ(O₂ | O₁) were a constant as a function of θ. However, this is not the case, as clearly P₀(O₂ | O₁) > 0 and lim_{θ→∞} P_θ(O₂ | O₁) = 0. Consequently, we do not have an SLP pair.
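A crude Monte Carlo sketch of this dependence on θ (illustrative only; it assumes the y_k are running sample means of N(θ, σ²) observations with σ = 1, which is my reading of the notation above):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n_max, reps = 1.0, 169, 20_000
thresholds = 1.96 * sigma / np.sqrt(np.arange(1, n_max + 1))

def prob_O2_given_O1(theta):
    # Running means of N(theta, sigma^2) observations, k = 1, ..., 169.
    x = theta + sigma * rng.standard_normal((reps, n_max))
    means = np.cumsum(x, axis=1) / np.arange(1, n_max + 1)
    O1 = means[:, -1] > thresholds[-1]
    O2 = O1 & np.all(means[:, :-1] <= thresholds[:-1], axis=1)
    return O2.sum() / O1.sum()

for theta in (0.0, 0.3, 1.0):
    print(theta, prob_O2_given_O1(theta))
# The estimated conditional probability changes with theta, consistent with the
# claim that P_theta(O2 | O1) is not constant, so O1 and O2 are not an SLP pair.
```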

ACKNOWLEDGMENTS

Supported in part by the National Science Foundation Grants 1007543 and 1016441.

REFERENCES

BARNARD, G. A. (1995). Pivotal models and the fiducial argument. Internat. Statist. Rev. 63 309–323.

BAYARRI, M. J., BERGER, J. O., FORTE, A. and GARCÍA-DONATO, G. (2012). Criteria for Bayesian model choice with application to variable selection. Ann. Statist. 40 1550–1577. MR3015035

BERGER, J. O. and BERNARDO, J. M. (1992). On the development of reference priors. In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 35–60. Oxford Univ. Press, New York. MR1380269

BERGER, J. O., BERNARDO, J. M. and SUN, D. (2009). The formal definition of reference priors. Ann. Statist. 37 905–938. MR2502655

BERGER, J. O., BERNARDO, J. M. and SUN, D. (2012). Objective priors for discrete parameter spaces. J. Amer. Statist. Assoc. 107 636–648. MR2980073

BERGER, J. O. and SUN, D. (2008). Objective priors for the bivariate normal model. Ann. Statist. 36 963–982. MR2396821

CHIANG, A. K. L. (2001). A simple general method for constructing confidence intervals for functions of variance components. Technometrics 43 356–367. MR1943189

CISEWSKI, J. and HANNIG, J. (2012). Generalized fiducial inference for normal linear mixed models. Ann. Statist. 40 2102–2127. MR3059078

COX, D. R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29 357–372. MR0094890

DAWID, A. P. and STONE, M. (1982). The functional-model basis of fiducial inference. Ann. Statist. 10 1054–1074. MR0673643

DAWID, A. P., STONE, M. and ZIDEK, J. V. (1973). Marginalization paradoxes in Bayesian and structural inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 35 189–233. MR0365805

DEMPSTER, A. P. (1966). New methods for reasoning towards posterior distributions based on sample data. Ann. Math. Statist. 37 355–374. MR0187357

DEMPSTER, A. P. (1968). A generalization of Bayesian inference (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 30 205–247. MR0238428

DEMPSTER, A. P. (2008). The Dempster–Shafer calculus for statisticians. Internat. J. Approx. Reason. 48 365–377. MR2419025

EDLEFSEN, P. T., LIU, C. and DEMPSTER, A. P. (2009). Estimating limits from Poisson counting data using Dempster–Shafer analysis. Ann. Appl. Stat. 3 764–790. MR2750681

EFRON, B. (1998). R. A. Fisher in the 21st century (invited paper presented at the 1996 R. A. Fisher Lecture). Statist. Sci. 13 95–122. MR1647499

FISHER, R. A. (1930). Inverse probability. Math. Proc. Cambridge Philos. Soc. XXVI 528–535.

FISHER, R. A. (1933). The concepts of inverse probability and fiducial probability referring to unknown parameters. Proc. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 139 343–348.

FISHER, R. A. (1935). The fiducial argument in statistical inference. Ann. Eugenics VI 91–98.

FRASER, D. A. S. (1961a). On fiducial inference. Ann. Math. Statist. 32 661–676. MR0130755

FRASER, D. A. S. (1961b). The fiducial method and invariance. Biometrika 48 261–280. MR0133910

FRASER, D. A. S. (1966). Structural probability and a generalization. Biometrika 53 1–9. MR0196840

FRASER, D. A. S. (1968). The Structure of Inference. Wiley, New York. MR0235643

FRASER, D. A. S. (2004). Ancillaries and conditional inference. Statist. Sci. 19 333–369. MR2140544

FRASER, D. A. S. (2011). Is Bayes posterior just quick and dirty confidence? Statist. Sci. 26 299–316. MR2918001

FRASER, A. M., FRASER, D. A. S. and STAICU, A.-M. (2010). Second order ancillary: A differential view from continuity. Bernoulli 16 1208–1223. MR2759176

FRASER, D. A. S. and NADERI, A. (2008). Exponential models: Approximations for probabilities. Biometrika 94 1–9.

FRASER, D., REID, N. and WONG, A. (2005). What a model with data says about theta. Internat. J. Statist. Sci. 3 163–178.

FRASER, D. A. S., REID, N., MARRAS, E. and YI, G. Y. (2010). Default priors for Bayesian and frequentist inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 72 631–654. MR2758239

HANNIG, J. (2009). On generalized fiducial inference. Statist. Sinica 19 491–544. MR2514173

HANNIG, J. (2013). Generalized fiducial inference via discretization. Statist. Sinica 23 489–514. MR3086644

HANNIG, J., IYER, H. and PATTERSON, P. (2006). Fiducial generalized confidence intervals. J. Amer. Statist. Assoc. 101 254–269. MR2268043



HANNIG, J., LAI, R. C. S. and LEE, T. C. M. (2014). Computational issues of generalized fiducial inference. Comput. Statist. Data Anal. 71 849–858. MR3132011

HANNIG, J. and LEE, T. C. M. (2009). Generalized fiducial inference for wavelet regression. Biometrika 96 847–860. MR2767274

HANNIG, J. and XIE, M.-G. (2012). A note on Dempster–Shafer recombination of confidence distributions. Electron. J. Stat. 6 1943–1966. MR2988470

IYER, H. K., WANG, C. M. J. and MATHEW, T. (2004). Models and confidence intervals for true values in interlaboratory trials. J. Amer. Statist. Assoc. 99 1060–1071. MR2109495

JEFFREYS, H. (1940). Note on the Behrens–Fisher formula. Ann. Eugenics 10 48–51. MR0002080

LAI, R. C. S., HANNIG, J. and LEE, T. C. M. (2013). Generalized fiducial inference for ultra high dimensional regression. Available at arXiv:1304.7847.

LINDLEY, D. V. (1958). Fiducial distributions and Bayes' theorem. J. R. Stat. Soc. Ser. B Stat. Methodol. 20 102–107. MR0095550

MARTIN, R. and LIU, C. (2013a). Conditional inferential models: Combining information for prior-free probabilistic inference. Preprint.

MARTIN, R. and LIU, C. (2013b). Inferential models: A framework for prior-free posterior probabilistic inference. J. Amer. Statist. Assoc. 108 301–313. MR3174621

MARTIN, R. and LIU, C. (2013c). Marginal inferential models: Prior-free probabilistic inference on interest parameters. Preprint.

MARTIN, R. and LIU, C. (2013d). On a 'plausible' interpretation of p-values. Preprint.

MARTIN, R., ZHANG, J. and LIU, C. (2010). Dempster–Shafer theory and statistical inference with weak beliefs. Statist. Sci. 25 72–87. MR2741815

PATTERSON, P., HANNIG, J. and IYER, H. K. (2004). Fiducial generalized confidence intervals for proportion of conformance. Technical Report 2004/11, Colorado State Univ., Fort Collins, CO.

SALOME, D. (1998). Statistical inference via fiducial methods. Ph.D. thesis, Univ. Groningen.

SCHWEDER, T. and HJORT, N. L. (2002). Confidence and likelihood. Scand. J. Stat. 29 309–332. Large structured models in applied sciences; challenges for statistics (Grimstad, 2000). MR1909788

SINGH, K., XIE, M. and STRAWDERMAN, W. E. (2005). Combining information from independent sources through confidence distributions. Ann. Statist. 33 159–183. MR2157800

SONDEREGGER, D. and HANNIG, J. (2014). Fiducial theory for free-knot splines. In Contemporary Developments in Statistical Theory, a Festschrift in Honor of Professor Hira L. Koul (T. N. Sriraus, ed.) 155–189. Springer, Berlin.

STEVENS, W. L. (1950). Fiducial limits of the parameter of a discontinuous distribution. Biometrika 37 117–129. MR0035955

TARALDSEN, G. and LINDQVIST, B. H. (2013). Fiducial theory and optimal inference. Ann. Statist. 41 323–341. MR3059420

TSUI, K.-W. and WEERAHANDI, S. (1989). Generalized p-values in significance testing of hypotheses in the presence of nuisance parameters. J. Amer. Statist. Assoc. 84 602–607. MR1010352

TSUI, K.-W. and WEERAHANDI, S. (1991). Corrections: "Generalized p-values in significance testing of hypotheses in the presence of nuisance parameters" [J. Amer. Statist. Assoc. 84 (1989) 602–607. MR1010352]. J. Amer. Statist. Assoc. 86 256. MR1137115

WANG, C. M., HANNIG, J. and IYER, H. K. (2012). Fiducial prediction intervals. J. Statist. Plann. Inference 142 1980–1990. MR2903406

WEERAHANDI, S. (1993). Generalized confidence intervals. J. Amer. Statist. Assoc. 88 899–905. MR1242940

WEERAHANDI, S. (1994). Correction: "Generalized confidence intervals" [J. Amer. Statist. Assoc. 88 (1993) 899–905. MR1242940]. J. Amer. Statist. Assoc. 89 726. MR1294096

WEERAHANDI, S. (1995). Exact Statistical Methods for Data Analysis. Springer Series in Statistics. Springer, New York. MR1316663

WILKINSON, G. N. (1977). On resolving the controversy in statistical inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 39 119–171. MR0652326

XIE, M.-G. and SINGH, K. (2013). Confidence distribution, the frequentist distribution estimator of a parameter: A review. Internat. Statist. Rev. 81 3–39. MR3047496

XIE, M., SINGH, K. and STRAWDERMAN, W. E. (2011). Confidence distributions and a unifying framework for meta-analysis. J. Amer. Statist. Assoc. 106 320–333. MR2816724

XIE, M., LIU, R. Y., DAMARAJU, C. V. and OLSON, W. H. (2013). Incorporating external information in analyses of clinical trials with binary outcomes. Ann. Appl. Stat. 7 342–368. MR3086422

ZHANG, J. and LIU, C. (2011). Dempster–Shafer inference with weak beliefs. Statist. Sinica 21 475–494. MR2829843


Statistical Science 2014, Vol. 29, No. 2, 259–260 DOI: 10.1214/14-STS475 Main article DOI: 10.1214/13-STS457 © Institute of Mathematical Statistics, 2014

Discussion of "On the Birnbaum Argument for the Strong Likelihood Principle"
Jan F. Bjørnstad

Abstract. The paper by Mayo claims to provide a new clarification and critique of Birnbaum's argument for showing that sufficiency and conditionality principles imply the likelihood principle. However, many of the arguments go back to arguments made thirty to forty years ago. Also, the main contention in the paper, that Birnbaum's arguments are not valid, seems to rest on a misunderstanding.

Key words and phrases: Likelihood, conditionality, sufficiency, Birnbaum's theorem.

The goal of this paper is to provide a new clarification and critique of Birnbaum's argument for showing that principles of sufficiency and conditionality entail the (strong) likelihood principle (LP).

I must admit I do not find that the paper provides such a new clarification of the criticism of Birnbaum's argument. Rather, much of the criticism in the paper goes back to arguments made in the 70s and 80s by several authors, for example, Durbin (1970), Kalbfleisch (1975), Cox (1978) and Evans, Fraser and Monette (1986). This critique has been discussed by several statisticians with an opposing view; see Berger and Wolpert (1988) and Bjørnstad (1991).

I will concentrate my discussion on what seems to be the most important contention in the paper, that the sufficient statistic in Birnbaum's proof erases the information as to which experiment the data came from and, hence, that the weak conditionality principle (WCP) cannot be applied; see, for example, Sections 5.2 and 7.

As I understand it, this is a misunderstanding of the proof. For one thing, it seems that only the observation x2 from the experiment E2 is considered in the mixture experiment instead of the correct (E2, x2). The observations in a mixture experiment are always of the form (Eh, xh), never only xh.

Jan F. Bjørnstad is Professor of Statistics, Department of Mathematics, University of Oslo and Head of Research, Division for Statistical Methods, Statistics Norway, P.O. Box 8131 Dep., N-0033 Oslo, Norway (e-mail: [email protected]).

Other arguments leading up to this contention seem to rest on a misunderstanding of the sufficiency considerations in the proof, that given an observation from a certain experiment the result in an unperformed experiment is to be reported; see, for example, Sections 2.4 and 5.1. I find that this is definitely not the case. To be specific, the author considers the following proof in the discrete case:

Let (j, xj) indicate that experiment Ej was performed with data xj, j = 1, 2. Assume the data values x₁⁰ and x₂⁰ have proportional likelihoods from experiments E1 and E2, respectively. Then the sufficient statistic in the mixture experiment used in Birnbaum's proof is given by

\[
T(j, x_j) = (j, x_j)\quad\text{if } (j, x_j)\neq (1, x_1^0), (2, x_2^0),
\qquad
T(1, x_1^0) = T(2, x_2^0) = (1, x_1^0).
\]

Mayo claims that T(1, x₁⁰) = T(2, x₂⁰) (= c) implies that the weak conditionality principle (WCP) is violated. Now, one should note that the proof works with any c ≠ (1, x₁⁰) and c ≠ (j, xj), j = 1, 2 and all xj. When E2 is performed and x₂⁰ is the result, then the evidence should only depend on E2 and x₂⁰, and not on a result of an unperformed experiment E1. This is, of course, correct, but it does not depend on a result in E1. (Actually E1 is not an unperformed experiment either. We comment on this issue later.) By letting c = (1, x₁⁰) it seems so, but we see by choosing a c ≠ (1, x₁⁰) it is not the case. So WCP is not violated. The sufficient statistic simply takes the same value for these two results of the mixture experiment. It has nothing to do with WCP.

So Birnbaum's proof does not require that the evidential support of a known result should depend on the result of an unperformed experiment. It follows that the main contention in the paper seems to rest on a lack of understanding of the basics of the proof of Birnbaum's theorem. In fact, it is possible to do the proof even more generally. One can show, for a given experiment [see, e.g., Cox and Hinkley (1974) and Bjørnstad (1996)], that if two likelihoods are proportional for two possible observations in the same experiment, there exists a minimal sufficient statistic with the same value for the two observations. This holds both for discrete and continuous models.

To make the sufficiency argument clearer, consider a mixture of a binomial experiment E1 and a negative binomial experiment E2 where the observations are x1 = number of successes in 12 trials and x2 = number of trials until 3 successes. If x₁⁰ = 3 and x₂⁰ = 12, then the likelihoods are proportional. A natural choice of the sufficient statistic T in the mixture experiment in Birnbaum's proof has

\[
T(1, x_1^0) = T(2, x_2^0) = 3/12,
\]

the proportion of successes in either case.

Clearly then, the value of T from experiment E2 does not depend on the result from E1.
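A quick numerical check of the proportionality claimed in this example (illustrative only; the helper names are mine):

```python
from math import comb

def binomial_lik(p, n=12, x=3):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def neg_binomial_lik(p, r=3, t=12):
    # P(exactly t trials are needed for the r-th success) = C(t-1, r-1) p^r (1-p)^(t-r)
    return comb(t - 1, r - 1) * p**r * (1 - p)**(t - r)

for p in (0.1, 0.3, 0.5, 0.8):
    print(p, binomial_lik(p) / neg_binomial_lik(p))
# The ratio is the constant C(12,3)/C(11,2) = 4 for every p, so the likelihoods are proportional.
```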

As already mentioned, the author claims that the sufficient statistic T in the proof of Birnbaum's result has the effect of erasing the index of the experiment. Moreover, it is claimed that inference based on T is to be computed over the performed and unperformed experiments E1 and E2. As we have shown, this is simply not the case. It should also be mentioned that statistically the proof simply considers two instances of performing the mixture experiment resulting in proportional likelihoods and really has nothing to do with considering unperformed experiments.

Let me also mention that the author's premise in Section 5 is not correct. The starting point is not that we have an arbitrary outcome of one single experiment, but rather that two experiments have been performed about the same parameter resulting in proportional likelihoods. So Birnbaum does not enlarge a known single experiment but constructs a mixture of the two performed experiments. There is really no unperformed experiment here. In a sense, one may regard the paper by Mayo as actually not discussing the LP at all.

It should be clear that I find that the main contention in the paper does not hold when maintaining the original meaning of the principles of sufficiency, conditionality and likelihood. Other comments made in this paper referring to various authors in the 70s and 80s are a different matter. However, I do not find any new clarification of Birnbaum's fundamental theorem in this paper. For example, regarding sufficiency, it is necessary to restrict the application of sufficiency to nonmixture experiments, as Kalbfleisch (1975) did, in order to invalidate Birnbaum's result. Berger and Wolpert (1988) argue, I think, convincingly against such a restriction. See also Bjørnstad (1991).

Let me end this discussion by making clear the following fact: It is obviously clear that frequentistic measures may, and typically do, violate LP. This is true as far as it comes to analysis of the actual data we observe. But a major point here is that the LP does not say that one should not be concerned with how the methods do when used repeatedly. LP is simply not about method evaluation. Evaluation of methods is still important. So LP says in essence that frequentistic considerations are not sufficient for evaluating the uncertainty and reliability in the statistical analysis of the actual data; see also Bjørnstad (1996) for a discussion on this issue.

REFERENCES

BERGER, J. O. and WOLPERT, R. L. (1988). The Likelihood Principle, 2nd ed. Lecture Notes—Monograph Series 6. IMS, Hayward, CA.

BJØRNSTAD, J. F. (1991). Introduction to Birnbaum (1962): On the foundations of statistical inference. In Breakthroughs in Statistics 1 (S. Kotz and J. Johnson, eds.) 461–477. Springer Series in Statistics. Springer, New York.

BJØRNSTAD, J. F. (1996). On the generalization of the likelihood function and the likelihood principle. J. Amer. Statist. Assoc. 91 791–806. MR1395746

COX, D. R. (1978). Foundations of statistical inference: The case for eclecticism. Aust. N. Z. J. Stat. 20 43–59. MR0501453

COX, D. R. and HINKLEY, D. V. (1974). Theoretical Statistics. Chapman & Hall, London. MR0370837

DURBIN, J. (1970). On Birnbaum's theorem in the relation between sufficiency, conditionality and likelihood. J. Amer. Statist. Assoc. 65 395–398.

EVANS, M. J., FRASER, D. A. S. and MONETTE, G. (1986). On principles and arguments to likelihood. Canad. J. Statist. 14 181–199. MR0859631

KALBFLEISCH, J. D. (1975). Sufficiency and conditionality. Biometrika 62 251–268. MR0386075


Statistical Science 2014, Vol. 29, No. 2, 261–266 DOI: 10.1214/14-STS482 Main article DOI: 10.1214/13-STS457 © Institute of Mathematical Statistics, 2014

Rejoinder: "On the Birnbaum Argument for the Strong Likelihood Principle"
Deborah G. Mayo

1. INTRODUCTION

I am honored and grateful to have so many interesting and challenging comments on my paper. I want to thank the discussants for their willingness to jump back into the thorny quagmire of Birnbaum's argument. To a question raised in the paper, "Does it matter?", these discussions show the answer is yes. The enlightening connections to contemporary projects are especially valuable in galvanizing future efforts to address foundational issues in statistics.

As long-standing as Birnbaum's result has been, Birnbaum himself went through dramatic shifts in a short period of time following his famous (1962) result. More than of historical interest, these shifts provide a unique perspective on the current problem. Already in the rejoinder to Birnbaum (1962), he is worried about criticisms (by Pratt, 1962) pertaining to applying WCP to his constructed mathematical mixtures (what I call Birnbaumization), and hints at replacing WCP with another principle (Irrelevant Censoring). Then there is a gap until around 1968 at which point Birnbaum declares the SLP plausible "only in the simplest case, where the parameter space has but two" predesignated points [Birnbaum (1968), page 301]. He tells us in Birnbaum (1970a, page 1033) that he has pursued the matter thoroughly, leading to "rejection of both the likelihood concept and various proposed formalizations of prior information." The basis for this shift is that the SLP permits interpretations that "can be seriously misleading with high probability" [Birnbaum (1968), page 301]. He puts forward the "confidence concept" (Conf) which takes from the Neyman–Pearson (N–P) approach "techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data" while supplying it an evidential interpretation [Birnbaum (1970a), page 1033].

Deborah G. Mayo is Professor of Philosophy, Department of Philosophy, Virginia Tech, 235 Major Williams Hall, Blacksburg, Virginia 24061, USA (e-mail: [email protected]).

Given the many different associations with "confidence," I use (Conf) in this Rejoinder to refer to Birnbaum's idea. Many of the ingenious examples of the incompatibilities of SLP and (Conf) are traceable back to Birnbaum, optional stopping being just one [see Birnbaum (1969)]. A bibliography of Birnbaum's work is Giere (1977). Before his untimely death (at 53), Birnbaum denies the SLP even counts as a principle of evidence (in Birnbaum, 1977). He thought it anomalous that (Conf) lacked an explicit evidential interpretation, even though, at an intuitive level, he saw it as the "one rock in a shifting scene" in statistical thinking and practice [Birnbaum (1970a), page 1033]. I return to this in Section 4 of this rejoinder.

2. BJØRNSTAD, DAWID AND EVANS

Let me begin by answering the central criticisms that, if correct, would be obstacles to what I purport to have shown in my paper. It is entirely understandable that leading voices in a long-lived controversy would assume that all of the twists and turns, avenues and roadways, have already been visited, and that no new flaw in the argument could enter to shake up the debate. I say to the reader that the surest sign that the issue is unsettled is that my critics disagree among themselves about the puzzle and even the key principles under discussion: the WCP, and in one case, the SLP itself. To remind us [Section 2.2]:

SLP: For any two experiments E1 and E2 with different probability models f1, f2 but with the same unknown parameter θ, if outcomes x∗ and y∗ (from E1 and E2, resp.) determine the same likelihood function [f1(x∗; θ) = cf2(y∗; θ) for all θ], then x∗ and y∗ should be inferentially equivalent for any inference concerning parameter θ.

A shorthand for the entire antecedent is that (E1, x∗) is an SLP pair with (E2, y∗), or just x∗ and y∗ form an SLP pair (from {E1, E2}). Assuming all the SLP stipulations, we have

SLP: If (E1, x∗) and (E2, y∗) form an SLP pair, then InfrE1[x∗] = InfrE2[y∗].




Bjørnstad

According to Bjørnstad, "The starting point is not that we have an arbitrary outcome of one single experiment, but rather that two experiments have been performed about the same parameter resulting in proportional likelihoods." I do not think Bjørnstad can actually mean to say the SLP cannot be applied until both members of the SLP pair are observed. So, for example, if in the sequential experiment one is able to stop (with a 0.05 p-value) at n = 169, resulting in y∗, one may not regard it as evidentially equivalent to x∗, the SLP pair with n fixed at 169, until and unless x∗ is actually generated? The universal generalization of the SLP asserts that for an arbitrary y∗, if it has an SLP pair x∗, then y∗ is equivalent in evidence to x∗. Bjørnstad's problematic reading results in his next remark: "So Birnbaum does not enlarge a known single experiment but constructs a mixture of the two performed experiments." What is constructed in Birnbaum's experiment EB is a hypothetical or mathematical mixture, based on having observed y∗ (from E2). This is part of the key gambit I call Birnbaumization (Section 2.4). We are to consider the possibility that performing E2 (which gave rise to y∗) was the result of a θ-irrelevant randomizer (deciding between E1 or E2). Now I grant Birnbaum that we may imagine all the SLP pairs are "out there," each pair assumed to have resulted from a θ-irrelevant randomizer, ripe for plucking whenever a member of an SLP pair is observed. (See Sections 2.5 and 5.1.) Yet even granting Birnbaum all of this, we still may not infer SLP (nor does it follow in the case where the mixture is actual).

Bjørnstad also criticizes me because he claims the SLP "is simply not about method evaluation." His position is that there is an evidential appraisal, and quite separately an assessment of long-run performance. For a frequentist, or one who holds Birnbaum's (Conf), evidential import is inseparable from an assessment of the relevant error probabilities. Not because we regard evidential import as all about long-runs, but because scrutinizing a given inference is bound up with a method's ability to have alerted us to misleading interpretations.

Bjørnstad does "not find any new clarification of Birnbaum's fundamental theorem in this paper" because he assumes I am channeling the attempts of Durbin (1970), Kalbfleisch (1975) and Evans, Fraser and Monette (1986), all of whom restrict or modify either SP or WCP to block the result. While I stand on the shoulders of these and other earlier treatments, a crucial difference is that, unlike them, I do not alter the principles involved. If one is out to demonstrate the logical flaw in an argument, as every good philosopher knows, one should scrupulously adhere to the premises and generously interpret the machinations of the arguer. This I do. Bjørnstad's opinion is that "one may regard the paper by Mayo as actually not discussing the LP at all." Or, alternatively, one may regard the position held by this critic to be mistaken about the SLP and Birnbaum's argument.

Dawid

Professors Dawid and Evans disagree about the key principle invoked in Birnbaum's argument, the WCP. Dawid views it as an equivalence relation, Evans says it is not. I follow Birnbaum in regarding the WCP as an equivalence, but, unlike both Dawid and Evans, I pin down what is to be meant in regarding WCP as an equivalence, or, for that matter, an inequivalence (see Section 4.3). First Dawid.

Dawid maintains that my WCP differs from the principle of conditionality Birnbaum uses in the SLP argument. Not so. I am working with the WCP stated in Birnbaum (1962, 1969), the very same one defined by Dawid:

The evidential meaning of any outcome of any mixture experiment is the same as that of the corresponding outcome of the corresponding component experiment, ignoring the over-all structure of the mixture experiment.

Dawid's definition is a portion of the one found in Birnbaum (1962), page 271. It assumes, of course, all of the other stipulations, for example, we are making "informative" inferences about θ, it is a θ-irrelevant mixture, the outcome is given, and all the rest. It is the definition used in countless variations of the SLP argument, and it is clearly captured in my Section 4.3. Perhaps I should have abbreviated it as CP; WCP comes from Berger and Wolpert's (1988) manifesto, The Likelihood Principle. My intention was to underscore Birnbaum's emphasis that the WCP concerns mixture experiments and is distinct from many other uses of "conditioning" in statistics [Birnbaum (1962), pages 282–283].

I wondered why Dawid thought I denied that WCP asserts an equivalence, until I noticed that Dawid lops off the end of my sentence from Section 7: I do not say "the problem stems from mistaking WCP as the equivalence" simpliciter, but rather it stems from the incorrect equivalence! The incorrect equivalence equates the inference from the given experiment with one that takes account of the (irrelevant) mixture structure. This is what Dawid is on about in describing invalid construals of WCP, so he can scarcely object. (See Section 5.2.)

As with any equivalence, there is an implicit inequivalence as a corollary. [See (i) and (ii) in Section 4.3.] Typically, in saying the evidential import of two outcomes is the same, one would not add "and be sure to ignore any features that would render them inequivalent." Birnbaum adds this warning because some treatments do not ignore the mixture structure. To put this another way, WCP includes the phrases "is the same as" as well as "ignoring." The problem is that Dawid is ignoring the word "ignoring" in the very definition he proffers. There is no difference between the phrases

• ignore the over-all structure of the mixture experiment

and

• eschew any construal that does not ignore the over-all structure of the mixture experiment.

I also refer to this as irrelevance (Irrel) (Section 4.3.2) because Birnbaum describes the WCP as asserting the "irrelevance of (component) experiments not actually performed" [Birnbaum (1962), page 271].

Dawid opines that I am using the WCP in David Cox's (1958) famous weighing example, which he does not define; I am guessing he means to suggest I must be limiting myself to actual mixtures. That is to miss the genius of Birnbaum's argument. Birnbaum, quite deliberately, intends to capitalize on the persuasiveness of conditioning in Cox's famous example, but his ploy is to extend the argument to mathematical or hypothetical mixtures. (I am not saying it is an innocuous move, but that is a separate matter.) Even if Dawid chooses to view Cox's WCP as a nonequivalence, it is irrelevant; I am following Birnbaum in construing it as an equivalence, permitting, for example, y∗, known to have come from a nonmixture, to be evidentially equivalent to the appropriate θ-irrelevant mixture as in Section 4.3. (Irrel) protects against illicit readings that Dawid warns against. SLP still will not follow.

So Dawid, Birnbaum and I are using the same definition of WCP. The onus is on Dawid to pinpoint where my characterization deviates from Birnbaum's. The only difference is that I have shown one cannot get to SLP, and Dawid gives no clue how to get around my criticism.

For Dawid to simply pronounce that "Birnbaum's theorem is indeed logically sound" and that therefore my argument "must itself be unsound" is question-begging, and will not do. Demonstrating unsoundness of my argument should be accomplished straightforwardly, as I have done regarding Birnbaum. That said, I fully agree with Dawid that one can view [(SP and WCP) entails SLP] as a theorem, but in order to detach the SLP, as is mandatory for Birnbaum, he is left with a "proof" that is either unsound or question-begging. Perhaps those who are long wedded to Birnbaum's argument are comfortable with merely assuming what was to have been shown. It is part of the mysterious "path of enlightenment followed by conversion" that Dawid mentions. That is no reason for others to allow "trust me, it is sound" to take the place of argument.

Evans

Given that Evans largely agrees with me, it may seem ungenerous to focus on apparent disagreements, but there is too great a danger in leaving some misimpressions regarding a problem already beset with decades of misunderstanding. Notably, it seems I have not convinced Evans of the logical error that Birnbaum makes. Instead Evans thinks the problem is with the conditionality principle WCP, and claims that frequentists need to fix it somehow. But it is not the principle, it is the "proof."

I have at least convinced Evans that there are cases where SP and WCP and not-SLP hold without logical contradiction (in Mayo, 2010). These cases may be called "counterexamples" to the argument whose conclusion is the SLP. They are also counterexamples to [WCP entails SLP], using the weaker notion of mathematical equivalence of Birnbaum (1972) that dispenses with (SP). Evans will take those counterexamples to show that WCP is not an equivalence relation, assuming a frequentist standpoint. Now it is true that any such counterexample may be seen to warn us against mistaking WCP as asserting the incorrect equivalence, noted in my rejoinder to Dawid. But that does not preclude WCP from asserting a correct equivalence. A more general issue I have with Evans' treatment is that it does not show where the source of the problem lies in arguments for SLP. Introducing his set-theoretic treatment into a simple argument, I am afraid, does not help to pinpoint where the argument goes wrong, but in fact leaves us with a very murky idea even as to his definition of WCP. The argument for the SLP begins with: We are given y∗ from E2, a member of an SLP pair. Will Evans block introduction of the mathematical mixture in Birnbaumization? This would seem to cut off Birnbaum's argument too quickly. Were that sufficient, the debate would have surely ended with Kalbfleisch (1975). Note too, unlike Evans, my argument in the paper under discussion does not rely on assuming a frequentist principle at all, though obviously I avoid a formulation that rules it out in advance. To sum up this section, Evans uses my counterexamples to show a restricted WCP may be applied, while blocking SLP. Left as it is, it opens him to the criticism (the one Dawid raised!) that he is altering Birnbaum's WCP and restricting it to actual mixtures.

What a surprise, then, to hear Evans allege that "many authors, including Mayo, refer to the [WCP] which restricts attention to ancillaries that are physically part of the sampling." I do not know on what grounds Evans wants to distinguish actual and mathematical mixtures, but Birnbaum's argument for the SLP concerns mathematical or hypothetical mixtures. Birnbaum calls an experiment a mixture "if it is mathematically equivalent" to a mixture [Birnbaum (1962), page 279]. Further, Birnbaum (1962) emphasizes that earlier proofs [that WCP and SP imply SLP] were restricted to actual mixtures. "But in the above proof" he is able to get a result relevant for all classes of experiments by using an ancillary "constructed with the hypothetical mixture" [EB] [Birnbaum (1962), page 286]. So, I am not sure what Evans is alleging. In one place, Evans worries whether the WCP "resolves the problem with conditionality more generally," but this is a separate issue from Birnbaum's argument. Here the focus is on WCP solely for purposes of arriving at the SLP.

Although there was not space to discuss this in my paper, it is worth noting why merely blocking the SLP with a modified WCP fails to make progress with a further goal required of an adequate treatment. Consider how, in discussing Durbin's modified principle of conditionality, Birnbaum notes that "Durbin's formulation (C'), although weaker than [WCP], is nevertheless too strong (implies too much of the content of [SLP]) to be compatible with standard (non-Bayesian) statistical concepts and techniques" [Birnbaum (1970b), page 402]. Birnbaum (1975, page 264) raises the same problem with Kalbfleisch's restriction to "minimal experiments," to which Evans' treatment is closely related. Evans does not show his modified conditionality principle avoids entailing "too much of the SLP." (This relates to Dawid's point about stopping rules in his comment.) For a frequentist account to satisfy Birnbaum's (Conf), all cases that allow misleading interpretations with high probability should still show up as SLP violations.

To this end, my argument shows that any violation of SLP in frequentist sampling theory necessarily results in an illicit substitution in the formulation of Birnbaum's argument. To put the problem in general terms, p = r does not follow from p = q and q = r, if q shifts to q′ within the argument, where q ≠ q′ (fallacy of 4 terms). For specifics see Section 5. Thus, ours is in no danger of implying "too much" of the SLP: what was an SLP violation remains one. Now Evans may not be concerned with retaining those frequentist SLP violations, given he makes it very clear he embraces Bayesianism, but that is irrelevant to what an adequate treatment of Birnbaum's argument demands. I have seen some statistics textbooks leave the details of the SLP proof to the reader; I think it is time to give full credit to students who found it impossible to make a valid substitution in general. I explained why.
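
In schematic form (the symbols p, q, q′, r serve here only as placeholders for the evidential-equivalence claims at issue, not as Birnbaum's own notation), the contrast is:

\begin{align*}
\text{valid substitution:} \quad & p = q,\ \ q = r \ \Longrightarrow\ p = r,\\
\text{fallacy of 4 terms:} \quad & p = q,\ \ q' = r,\ \ q \neq q' \ \not\Longrightarrow\ p = r.
\end{align*}

The conclusion p = r is licensed only so long as the middle term is one and the same claim in both premises; once it shifts from q to q′, the inference collapses.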

3. FRASER, HANNIG, MARTIN AND LIU

Let me turn to the second group of discussants. It is an honor to be "strongly commended" by Fraser for emphasizing the importance of "principles and arguments of statistical inference"; and I feel my efforts are worthwhile with Martin and Liu noting that my "demonstration resolves the controversy around Birnbaum and LP, helping to put the statisticians' house in order." I entirely agree with them that the "confusion surrounding Birnbaum's claim has perhaps discouraged researchers from considering questions about the foundations of statistics," at least from appealing to those foundations that reject the SLP. Let me underscore Fraser's point that the need for an inferential variation of (N–P) theory "reached the mathematical statistics community rather forcefully with Cox (1958); this had the focus on the two measuring-instruments example and on uses of conditioning that were compelling." Cox's (1958) example also appears in Hannig's discussion, and I will borrow his simple description of the case where the measurement but not the instrument M is observed. In that case, inference is based on the convex combination of the mixture components, consistent with WCP. This allows me to succinctly put an equivocation that I suspect may enter, in the case of SLP pairs, between the irrelevance of the mixture structure, given (Ei, zi), and the irrelevance of the index i, given just the measurement. This equivocation may be behind the Birnbaum puzzle.
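
As a minimal sketch of that convex combination (writing z for the measurement, and taking the fair coin flip between two normal instruments with known standard deviations σ1 ≪ σ2 purely as the illustrative setup of Cox's example), the sampling density relevant when the instrument is not observed is

\[
f_{\mathrm{mix}}(z;\theta) \;=\; \tfrac{1}{2}\, f_1(z;\theta) \;+\; \tfrac{1}{2}\, f_2(z;\theta),
\qquad
f_i(z;\theta) \;=\; \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left\{ -\frac{(z-\theta)^2}{2\sigma_i^2} \right\},
\]

whereas once it is known that Ei produced z, the WCP directs that error probabilities be computed from f_i(z; θ) alone.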


Fraser rightly reminds us that "statistical inference as an alternative to (N–P) decision theory has a long history in statistical thinking," with strong impetus from Fisher. Still, Birnbaum struggled to articulate an N–P theory as "an inference theory" (Birnbaum, 1977), and my view is that we had to solve "Birnbaum's problem" before doing so properly. Finding Birnbaum's argument unsound opens the door to foundations that are free from paying obeisance to the SLP. In this spirit Martin and Liu correctly view my paper as "an invitation for a fresh discussion on the foundations of statistical inference." Yet there is more than one way of proceeding. Tracing out the mathematical similarities and differences between the approaches of Fraser, Hannig, Martin and Liu is a task for which others are better equipped than I. All are said to violate the SLP.

It is interesting to note, as Hannig does, that "since the mid 2000s, there has been a true resurrection of interest in modern modifications of fiducial inference," which had long fallen into disrepute. Fraser has been one of the leading voices to persevere with innovative developments, and his own "confidence" idea is clearly in sync with Birnbaum. However, the differences that emerge in this group's discussions should not be downplayed. Hannig says that "the common thread for these approaches is a definition of inferentially meaningful probability statements about subsets of the parameter space without the need for subjective prior information," and Martin and Liu suggest that error probability accounts are appropriate only for decision procedures, as distinct from their "inferential models." Some might view these as attempts to build a concept of evidence as a kind of probabilism but without the priors. However, in the background of these contemporary developments lurks a suspicion that their SLP violations were picking up differences where no purely inferential difference was warranted. So long as Birnbaum's proof stood, this suspicion made sense.

Post-SLP, it is worth standing back and reflecting anew on these accounts. In this respect, this foundational project is just beginning because for 40 or 50 years, the questions of foundations were largely restricted to accounts that obeyed, or were close to obeying, the SLP. So, we have Birnbaum, alongside Fisher, being catapulted onto the contemporary foundational scene, squarely calling on us to address the still unresolved problem: how to obtain an account of statistical inference that also controls the probability of seriously misleading inferences. Better yet, the two goals should mesh into one.

4. POST-SLP FOUNDATIONS

Return to where we left off in the opening section of this rejoinder: Birnbaum (1969),

The problem-area of main concern here may be described as that of determining precise concepts of statistical evidence (systematically linked with mathematical models of experiments), concepts which are to be non-Bayesian, non-decision-theoretic, and significantly relevant to statistical practice. [Birnbaum (1969), page 113.]

Given Neyman's behavioral decision construal, Birnbaum claims that "when a confidence region estimate is interpreted as representing statistical evidence about a parameter" [Birnbaum (1969), page 122], an investigator has necessarily adjoined a concept of evidence, (Conf), that goes beyond the formal theory. What is this evidential concept? The furthest Birnbaum gets in defining (Conf) is in his posthumous article, Birnbaum (1977):

(Conf) A concept of statistical evidence is not plausible unless it finds 'strong evidence for H2 against H1' with small probability (α) when H1 is true, and with much larger probability (1 − β) when H2 is true. [Birnbaum (1977), page 24.]
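
One way to render (Conf) symbolically (the test procedure T and what counts as its report of 'strong evidence' are left abstract here; the rendering is mine rather than Birnbaum's):

\[
P\bigl(T \text{ reports strong evidence for } H_2 \text{ against } H_1 \,\big|\, H_1\bigr) \;\le\; \alpha \quad (\text{small}),
\]
\[
P\bigl(T \text{ reports strong evidence for } H_2 \text{ against } H_1 \,\big|\, H_2\bigr) \;=\; 1-\beta \quad (\text{much larger than } \alpha).
\]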

On the basis of (Conf), Birnbaum reinterprets statistical outputs from N–P theory as strong, weak, or worthless statistical evidence depending on the error probabilities of the test [Birnbaum (1977), pages 24–26]. While this sketchy idea requires extensions in many ways (e.g., beyond pre-data error probabilities and beyond the two-hypothesis setting), the spirit of (Conf), that error probabilities quantify properties of methods which in turn indicate the warrant to accord a given inference, is, I think, a valuable shift of perspective. This is not the place to elaborate, except to note that my own twist on Birnbaum's general idea is to appraise evidential warrant by considering the capabilities of tests to have detected erroneous interpretations, a concept I call severity. That Birnbaum preferred a propensity interpretation of error probabilities is not essential. What matters is their role in picking up how features of experimental design and modeling alter a method's capabilities to control "seriously misleading interpretations." Even those who embrace a version of probabilism may find a distinct role for a severity concept. Recall that Fisher always criticized the presupposition that a single use of mathematical probability must be competent for qualifying inference in all logical situations [Fisher (1956), page 47].

Birnbaum's philosophy evolved from seeking concepts of evidence in degree of support, belief or plausibility between statements of data and hypotheses to embracing (Conf) with the required control of misleading interpretations of data. The former view reflected the logical empiricist assumption that there exist context-free evidential relationships, a paradigm philosophers of statistics have been slow to throw off. The newer (post-positivist) movements in philosophy and history of science were just appearing in the 1970s. Birnbaum was ahead of his time in calling for a philosophy of science relevant to statistical practice; it is now long overdue!

"Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists..." [Birnbaum (1972), page 861].

REFERENCES

BERGER, J. O. and WOLPERT, R. L. (1988). The Likelihood Principle, 2nd ed. Lecture Notes–Monograph Series 6. IMS, Hayward, CA.

BIRNBAUM, A. (1962). On the foundations of statistical inference. J. Amer. Statist. Assoc. 57 269–306. Reprinted in Breakthroughs in Statistics 1 (S. Kotz and N. Johnson, eds.) 478–518. Springer, New York.

BIRNBAUM, A. (1968). Likelihood. In International Encyclopedia of the Social Sciences 9 299–301. Macmillan and the Free Press, New York.

BIRNBAUM, A. (1969). Concepts of statistical evidence. In Philosophy, Science, and Method: Essays in Honor of Ernest Nagel (S. Morgenbesser, P. Suppes and M. G. White, eds.) 112–143. St. Martin's Press, New York.

BIRNBAUM, A. (1970a). Statistical methods in scientific inference. Nature 225 1033.

BIRNBAUM, A. (1970b). On Durbin's modified principle of conditionality. J. Amer. Statist. Assoc. 65 402–403.

BIRNBAUM, A. (1972). More on concepts of statistical evidence. J. Amer. Statist. Assoc. 67 858–861. MR0365793

BIRNBAUM, A. (1975). Comments on "Sufficiency and conditionality" by J. D. Kalbfleisch. Biometrika 62 262–264.

BIRNBAUM, A. (1977). The Neyman–Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley–Savage argument for Bayesian theory. Synthese 36 19–49. MR0652320

COX, D. R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29 357–372. MR0094890

DURBIN, J. (1970). On Birnbaum's theorem on the relation between sufficiency, conditionality and likelihood. J. Amer. Statist. Assoc. 65 395–398.

EVANS, M. J., FRASER, D. A. S. and MONETTE, G. (1986). On principles and arguments to likelihood. Canad. J. Statist. 14 181–199. MR0859631

FISHER, R. A. (1956). Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh.

GIERE, R. N. (1977). Allan Birnbaum's conception of statistical evidence. Synthese 36 5–13. MR0494585

KALBFLEISCH, J. D. (1975). Sufficiency and conditionality. Biometrika 62 251–268. MR0386075

MAYO, D. G. (2010). An error in the argument from conditionality and sufficiency to the likelihood principle. In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (D. G. Mayo and A. Spanos, eds.) 305–314. Cambridge Univ. Press, Cambridge.

PRATT, J. W. (1962). On the foundations of statistical inference: Discussion. J. Amer. Statist. Assoc. 57 314–316.