Mayo: 2nd half “Frequentist Statistics as a Theory of Inductive Inference” (Selection Effects)


The idealized formulation in the initial definition of a significance test starts with a hypothesis and a test statistic, obtains data, then applies the test and looks at the outcome:

The hypothetical procedure involved in the definition of the test then matches reasonably closely what was done;

The possible outcomes are the different possible values of the specified test statistic.

This permits features of the distribution of the test statistic to be relevant for learning about aspects of the mechanism generating the data.

It often happens that either the null hypothesis or the test statistic is influenced by preliminary inspection of the data, so that the actual procedure generating the final test result is altered:*

This may alter the capability of the test to detect discrepancies from the null hypothesis reliably, calling for adjustments in its error probabilities.

This is required to ensure that the p-values serve their intended purpose for frequentist inference, whether in behavioral or evidential contexts.

* The objective of the test is to enable us to learn something about the underlying data-generating mechanism, and this learning is made possible by correctly assessing the actual error probabilities.

Ad hoc Hypotheses, Non-novel Data, Double-Counting, etc.

The general point involved has been discussed extensively in both the philosophical and statistical literatures:

In the former, under such headings as requiring novelty or avoiding ad hoc hypotheses (use-constructions, etc.);

In the latter, as rules against peeking at the data, shopping for significance, data mining, etc., for taking selection effects into account.

(This will come up again throughout the semester. Optional stopping is an example of a data-dependent strategy, as with "look elsewhere" effects in the Higgs research.)

These problems remain unresolved in general.

Error statistical considerations, coupled with a sound principle of inductive evidence, may allow going further by providing criteria for when various data-dependent selections matter and how to take account of their influence on error probabilities.

(Some related items: Mayo and Spanos, "How to discount double counting when it counts" and "Some surprising facts about surprising facts"; chapters 7, 8, 9 of EGEK, especially 8; Spanos, "hunting without a license".)

In particular, if the null hypothesis is chosen for testing just because the test statistic is large, the probability of finding some such discordance or other may be high even under the null.

Thus, following FEV, we would not have genuine evidence of inconsistency with the null, and unless the p-value is modified accordingly, the inference would be misleading.

Example 1: Hunting for Statistical Significance

Investigators have 20 independent sets of data, each reporting on different but closely related effects.

After doing all 20 tests, with 20 nulls H0i, i = 1, …, 20, they report only the smallest p-value, e.g., 0.05, and its corresponding null hypothesis, say H013.

E.g., there is no difference between some treatment (a childhood training regimen) and a factor, f13 (some personality characteristic later in life).

Passages from EGEK (Morrison and Henkel)

This "hunting" procedure should be compared with a case where H013 was preset as the single hypothesis to test, and the small p-value found.

In the hunting case, the possible results are the possible statistically significant factors that might be found to show a "calculated" statistically significant departure from the null. The relevant type 1 error probability is the probability of finding at least one such significant difference out of 20, even though the global null is true (i.e., all twenty observed differences are due to chance).

The probability that this procedure yields an erroneous rejection differs from, and will be much greater than, 0.05 (it is approximately 0.64).

There are different, and indeed many more, ways one can err in this example than when one null is preset, and this is reflected in the adjusted p-value.
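To make the contrast concrete, here is a minimal simulation sketch (my own illustration, not code from the slides; the 20-test setup and the 0.05 level come from the example, everything else is an assumption). It compares the preset single test with the hunting procedure that reports only the smallest of 20 p-values, all nulls being true:

```python
# A minimal sketch of Example 1: under the global null, a single preset test
# at the 0.05 level rejects about 5% of the time, while "hunting" through 20
# independent tests and reporting only the smallest p-value yields at least
# one nominal rejection about 64% of the time, matching 1 - 0.95**20.
import random
import statistics
from math import erfc, sqrt

def p_value(sample):
    """Two-sided z-test p-value for H0: mean = 0, with known sd = 1."""
    z = statistics.fmean(sample) * sqrt(len(sample))
    return erfc(abs(z) / sqrt(2))

random.seed(0)
n, trials, alpha = 25, 5_000, 0.05
preset_rejections = hunting_rejections = 0
for _ in range(trials):
    samples = [[random.gauss(0, 1) for _ in range(n)] for _ in range(20)]
    p_values = [p_value(s) for s in samples]      # all 20 nulls are true
    preset_rejections += p_values[12] <= alpha    # the one preset test (H0,13)
    hunting_rejections += min(p_values) <= alpha  # report only the smallest p
print("preset single test :", preset_rejections / trials)   # ~ 0.05
print("hunting over 20    :", hunting_rejections / trials)  # ~ 0.64
print("1 - 0.95**20       :", 1 - 0.95**20)                 # ~ 0.6415
```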

My blog (reblogged March 3, 2014):

Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and "big" data trolling offer rich hunting grounds out of which apparently impressive results may be "cherry-picked":

When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . .

Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be "significant at the 5 percent level." Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is no, because the difference tested has been selected from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]

Critics of the Morrison and Henkel ilk clearly report that ignoring a variety of "selection effects" results in a fallacious computation of the actual significance level associated with a given inference; the "computed" or "nominal" significance level differs from the actual or warranted significance level.

[1] Selvin calculates this approximately by considering the probability of finding at least one statistically significant difference at the .05 level when 20 independent samples are drawn from populations having true differences of zero: 1 – P(no such difference) = 1 – (.95)^20 ≈ 1 – .36 = .64. This assumes, unrealistically, independent samples, but without that assumption it may be unclear how even to approximately compute actual p-values.

This influence on long-run error is well known, but should it influence the interpretation of the result in a context of inductive inference?

According to frequentist or severity reasoning, it should. (It is not so easy to explain why.)

The concern is not the avoidance of often announcing genuine effects erroneously in a series; the concern is that this test performs poorly as a tool for discriminating genuine from chance effects in this case.

Because at least one such impressive departure, we know, is common even if all are due to chance, the test has scarcely reassured us that it has done a good job of avoiding such a mistake in this case.

Even if there are other grounds for believing in the genuineness of the one effect that is found, we deny that this test alone has supplied such evidence.

The "hunting procedure" does a very poor job of alerting us to, in effect, temper our enthusiasm, even where such tempering is warranted.

If the p-value is adjusted to reflect the actual error rate, the test again becomes a tool that serves this purpose.
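One way to make such an adjustment, offered here only as an illustration (the slides do not commit to a particular correction), is to replace the smallest nominal p-value by the probability of obtaining at least one result that small among the 20 tests, under the same independence assumption used in Selvin's calculation (a Šidák-style correction):

```latex
% Illustrative adjustment under independence; with p_min = 0.05 and
% k = 20 tests this recovers Selvin's 64 percent.
\[
  p_{\mathrm{adj}} \;=\; 1 - (1 - p_{\min})^{k},
  \qquad
  1 - (1 - 0.05)^{20} \,\approx\, 0.64 .
\]
```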

Example 2: Hunting for a Murderer

(hunting for the source of a known effect by eliminative induction)

Testing for a DNA match with a given specimen, known to be that of the murderer, a search through a database of possible matches is done one at a time.

We are told, in a fairly well-known presentation of this case, that:

P(DNA match; not murderer) = very small
P(DNA match; murderer) ~ 1

The first individual, if any, from the database for which a match is found is declared to truly match the criminal, i.e., to be the murderer.

(The null hypothesis, in effect, asserts that the person tested does NOT "match the criminal"; so the null is rejected iff there is an observed DNA match.)

Example 2 is superficially similar to Example 1, finding a DNA match being somewhat akin to finding a statistically significant departure from a null hypothesis: one searches through data and concentrates on the one case where a "match" with the criminal's DNA is found, ignoring the non-matches.

If one adjusts for "hunting" in Example 1, shouldn't one do so in broadly the same way in Example 2?

15

No!

(Although some have erroneously supposed frequentists say “yes”)

In Example 1 the concern is inferring a genuine, “reproducible"

effect, when in fact no such effect exists;

In Example 2, there is a known effect or specific event, the

criminal's DNA, and reliable procedures are used to track down the

specific cause or source (as conveyed by the low "erroneous-

match" rate.)

The probability is high that we would not obtain a match with person i, if i were not the criminal; so, by FEV, finding the match is excellent evidence that i is the criminal. Moreover, each non-match found, by the stipulations of the example, virtually excludes that person.

Note: the contrast in hunting for a DNA match is with finding a match in the first person tested, as opposed to hunting through a database.

The more negative results found, the more the inferred "match" is fortified; whereas in Example 1 this is not so.
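To see the contrast with Example 1 concretely, here is a minimal simulation sketch (my own illustration; the database size and false-match rate are illustrative assumptions, not values given in the text). Because P(match; not the source) is tiny and P(match; source) is close to 1, the first match found in a search almost always identifies the true source, and each non-match eliminates a candidate:

```python
# A minimal sketch of Example 2: search a database one person at a time and
# declare the first DNA match to be the source.
import random

def search_database(n_people=10_000, false_match_rate=1e-6, source_in_db=True):
    """Return 'correct', 'false match', or 'no match' for one simulated search."""
    source_index = random.randrange(n_people) if source_in_db else None
    for i in range(n_people):
        if i == source_index:
            return "correct"                      # P(match; source) ~ 1
        if random.random() < false_match_rate:    # P(match; not source) very small
            return "false match"
    return "no match"

random.seed(0)
trials = 10_000
results = [search_database() for _ in range(trials)]
for outcome in ("correct", "false match", "no match"):
    print(outcome, results.count(outcome) / trials)
# With a tiny false-match rate, nearly every declared match identifies the
# true source; each non-match also (virtually) excludes that person.
```

The feature doing the work is that the effect (the crime-scene DNA) is known to exist; the search hunts for its source, not for whether there is any effect at all.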

Data-dependent Specification of Distance or "Cut-offs" (Case 1)

An analogy, the Texas Sharpshooter: testing a sharpshooter's ability by having him shoot and then drawing a bull's-eye around his results so as to yield the highest number of bull's-eyes.

The skill that one is allegedly testing and making inferences about is his ability to shoot when the target is given and fixed, while that is not the skill actually responsible for the resulting score.
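For a statistical analogue of the sharpshooter (a sketch of my own, not an example from the slides), consider choosing the direction of a one-sided test only after seeing which way the data lean: the nominal 5% cut-off then corresponds to an actual type 1 error rate of about 10%, so the reported p-value overstates the evidence unless adjusted.

```python
# A minimal sketch of why a data-dependent choice of "cut-off" calls for an
# adjustment: the direction of a one-sided z-test is picked only after
# looking at the data, so the nominal 5% test rejects a true null ~10% of
# the time.
import random, statistics
from math import erf, sqrt

def one_sided_p(xbar, n):
    """One-sided p-value for H0: mu = 0 against the data-suggested direction."""
    z = abs(xbar) * sqrt(n)               # pick the tail the data point toward
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

random.seed(1)
n, trials, alpha = 30, 20_000, 0.05
rejections = 0
for _ in range(trials):
    x = [random.gauss(0, 1) for _ in range(n)]   # H0 true: mean really is 0
    if one_sided_p(statistics.fmean(x), n) <= alpha:
        rejections += 1
print("actual type 1 error rate:", rejections / trials)  # roughly 0.10, not 0.05
```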

Case 2:

By contrast, if the choice of specification is guided not by considerations of the statistical significance of departures from the original null hypothesis, but rather by the fact that one specification is an empirically adequate statistical model while the other violates its assumptions, no adjustment for selection is called for.

Indeed, using a statistically adequate specification gives reassurance that the calculated p-value is relevant for interpreting the evidence reliably.

Need for Adjustments for Data-Dependent Selections

How does our conception of the frequentist theory of induction help to guide the answers?

1. It must be considered whether the context is one where the key concern is the control of error rates in a series of applications (behavioristic goal), or whether it is a context of evaluating specific evidence (inferential goal). The relevant error probabilities may be altered for the former context and not for the latter.

2. To determine the relevant hypothetical series on which to base error frequencies, one must identify the particular obstacles that need to be avoided for a reliable inference in the particular case, and the capacity of the test, as a measuring instrument, to have revealed the presence of the obstacle.

Statistics in the Discovery of the Higgs

Everyone was excited by the announced evidence for the discovery of a standard model (SM) Higgs particle based on a "5 sigma observed effect" (July 2012).

But because this report is tied to the use of statistical significance tests, some Bayesians raised criticisms.

They want to ensure, before announcing the hypothesis H*: "an SM Higgs boson has been discovered" (with such and such properties), that H* has been given a severe run for its money:

That with extremely high probability we would have observed a smaller excess of signal-like events, were we in a universe where H0: μ = 0, the background-only hypothesis, holds.

So, very probably, H0 would have survived a cluster of tests T, fortified with much cross-checking, were μ = 0.

Note what's being given a high probability:

Pr(test T would produce less than 5 sigma; H0) > .9999997.

With probability .9999997, our methods would show that the bumps disappear (as so often occurred), under the assumption that the data are due to background (H0).
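For reference, the .9999997 is just one minus the one-sided standard normal tail area beyond 5 sigma. A quick check (my sketch; the actual Higgs analyses involve far more than a simple z-test):

```python
# One-sided 5-sigma tail probability under H0, and its complement.
from math import erfc, sqrt

p_5sigma = 0.5 * erfc(5 / sqrt(2))   # P(Z >= 5) under H0, ~ 2.87e-7
print(p_5sigma)                      # ~ 2.9e-07
print(1 - p_5sigma)                  # ~ 0.9999997 = Pr(less than 5 sigma; H0)
```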

To assume that what we want is a posterior probability in H* seems to be a slide from the value, for assessing the warrant for H*, of knowing that this probability is high.

Granted, this inference relies on an implicit severity principle of evidence.

Data provide good evidence for inferring H (just) to the extent that H passes severely with x0, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

They then quantify various properties of the particle discovered (inferring ranges of magnitudes).

The p-value police

Leading (subjective) Bayesian Dennis Lindley had a letter sent around (to ISBA members):

Why demand such strong evidence? (This could only be warranted if beliefs in the Higgs were extremely low or the costs of error exorbitant.)

Are they so wedded to frequentist methods? Lindley asks. "If so, has anyone tried to explain what bad science that is?"

Other critics rushed in to examine whether reports (by journalists and scientists) misinterpreted the sigma levels as posterior probability assignments to the models.

Many critics have claimed that the .99999 was fallaciously being assigned to H* itself, as a posterior probability in H*.

Surely there are misinterpretations, but many were not. What the critics are doing is interpreting a legitimate error probability as a posterior in H*: SM Higgs.

Physicists did not assign a high probability to H*: an SM Higgs exists (whatever that might mean).

Besides, many believe in beyond-the-standard-model physics.

One may say informally, "so probably we have experimentally demonstrated an SM-like Higgs".

When you hear "what they really want are posterior probabilities", ask: How are we to interpret prior probabilities? Posterior probabilities?

This is a great methodological controversy in practice that philosophers of science and evidence should be in on.

Our job is to clarify terms, is it not?