Page 1: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Statistical Flukes, the Higgs Discovery, and 5 Sigma

Deborah G. Mayo

Virginia Tech

(I) “5 sigma observed effect”.

One of the biggest science events of 2012-13 was the announcement on July 4, 2012 of evidence for the discovery of a Higgs particle based on a “5 sigma observed effect”. With the March 2013 data analysis, the 5 sigma difference grew to 7 sigmas.

Page 2: Statistical Flukes, the Higgs Discovery, and 5 Sigma


• Because the 5 sigma report refers to frequentist statistical tests, the discovery was immediately imbued with controversies from philosophy of statistics

• I’m an outsider to high energy physics (HEP), but (aside from finding it fascinating) any philosopher of statistics worth her salt should be able to illuminate some of the more public controversies, e.g., over P-values.

Not difficult to do, fortunately.

Page 3: Statistical Flukes, the Higgs Discovery, and 5 Sigma


(II) Bad Science? (O’Hagan, prompted by Lindley)

To the ISBA: “Dear Bayesians: We’ve heard a lot about the Higgs boson. ...Specifically, the news referred to a confidence interval with 5-sigma limits.… Five standard deviations, assuming normality, means a p-value of around 0.0000005… Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. … Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?”

Page 4: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Not bad science at all!

• HEP physicists are sophisticated with their statistical methodology: they’d seen too many bumps disappear.

• They want to ensure, before announcing the hypothesis H*: “a new particle has been discovered,” that H* has been given a severe run for its money.

Significance tests and cognate methods (confidence intervals) are the methods of choice here, for good reason.

Page 5: Statistical Flukes, the Higgs Discovery, and 5 Sigma


(III) Simple statistical significance test: ingredients

(i) Null or test hypothesis: stated in terms of an unknown parameter μ in a statistical model, an idealized representation of the underlying data generation (here, a model of the detector).

μ is the “global signal strength” parameter.

H0: μ = 0, i.e., zero signal (background-only hypothesis)

H0: μ = 0 vs. H1: μ > 0

μ = 1: Standard Model (SM) Higgs boson signal in addition to the background

Page 6: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Empirical data are modeled as observed values of a sample X (a random variable); here, numbers of events of a given type.

(ii) Test statistic or distance statistic d(X): the larger its value, the more inconsistent the data are with H0, in the direction of alternatives or discrepancies of interest.

d(X): how many excess events of a given type are observed (from trillions of collisions), in comparison to what would be expected from background alone (in the form of bumps).

d(X) has a known probability distribution under H0 (and under various alternatives).

Page 7: Statistical Flukes, the Higgs Discovery, and 5 Sigma


(iii) The P-value (or significance level) associated with d(x0) is the probability of a difference as large as or larger than d(x0), under the assumption that H0 is true:

P-value = Pr(d(X) > d(x0); H0)

If the P-value is sufficiently small (e.g., .05, .01, .001), d(x0) is said to be statistically significant (or significant at the level reached). d(X) can be given in terms of standard deviation units, or sigma units.

Page 8: Statistical Flukes, the Higgs Discovery, and 5 Sigma


The distribution of the statistic d(X) is its sampling distribution.

Pr(d(X) > 1; H0) = .16
Pr(d(X) > 2; H0) = .02
Pr(d(X) > 3; H0) = .001
Pr(d(X) > 4; H0) = .00003
Pr(d(X) > 5; H0) = .0000003

The probability of observing results as extreme as or more extreme than 5 sigma, under H0, is approximately 1 in 3,500,000.
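These tail areas are (rounded) one-sided standard-normal probabilities. A minimal sketch, assuming d(X) is standard normal under H0 as the table does, that reproduces them:

```python
from scipy.stats import norm

# One-sided tail probabilities Pr(d(X) > k; H0), assuming d(X) ~ N(0, 1) under H0.
for k in range(1, 6):
    print(f"Pr(d(X) > {k}; H0) = {norm.sf(k):.3g}")

# The 5 sigma tail area expressed as "1 in N":
p5 = norm.sf(5)
print(f"5 sigma: p = {p5:.1e}, i.e. roughly 1 in {1 / p5:,.0f}")
```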

Page 9: Statistical Flukes, the Higgs Discovery, and 5 Sigma


[Figure: Normal distribution]

Page 10: Statistical Flukes, the Higgs Discovery, and 5 Sigma


The actual computations are based on simulating what it would be like were H0: μ = 0 true (signal strength = 0), fortified with much cross-checking of results (a toy simulation is sketched below). So the significance test has:

1) Data x0 and hypotheses H0: μ = 0 vs. H1: μ > 0
2) A (distance) test statistic d(X)
3) The probability distribution of d(X) under the null and various alternatives
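As a hedged toy illustration of "simulating what it would be like were H0 true": the real analyses use profile-likelihood ratio statistics over many bins and channels, so this is only a sketch, and the background mean b and observed count n_obs below are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2012)

b = 100.0      # hypothetical expected background count (invented for illustration)
n_obs = 135    # hypothetical observed count in the signal region (invented)

# Simulate the experiment many times under H0: mu = 0 (background only).
n_sim = 1_000_000
background_only = rng.poisson(b, size=n_sim)

# P-value: fraction of background-only experiments giving an excess
# at least as large as the one actually observed.
p_value = np.mean(background_only >= n_obs)
print(f"simulated P-value = {p_value:.1e}")

# Express the result in sigma units, as the slides do.
print(f"about {norm.isf(p_value):.1f} sigma")
```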

Page 11: Statistical Flukes, the Higgs Discovery, and 5 Sigma


There’s generally a rule of interpretation:

• if d(X) > 5 sigma, infer discovery

• if d(X) > 2 sigma, get more data

We want methods with a high capability to detect discrepancies while avoiding mistaking spurious bumps for real effects.

 

Page 12: Statistical Flukes, the Higgs Discovery, and 5 Sigma


 

• First stage: test for a real effect (Cox’s taxonomy: searching for structure)

Not a point-against-point test! Cousins: H0 is the Standard Model (SM) missing a piece.

• Second stage: determine its properties, test SM vs “Beyond SM” (BSM)

(Cox: embedded)

Page 13: Statistical Flukes, the Higgs Discovery, and 5 Sigma


 (IV) The P-Value Police

When the July 2012 report came out, a number of people set out to grade the different interpretations of the P-value report:

Larry Wasserman (on his blog, Normal Deviate) called them the “P-Value Police.”

• Job: to examine if reports by journalists and scientists could by any stretch of the imagination be seen to have misinterpreted the sigma levels as posterior probability assignments to the various models and claims.

David Spiegelhalter: a well-known (Bayesian) statistician who works on risk communication.

Page 14: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Thumbs up or down?

Thumbs up to the ATLAS group report:

“A statistical combination of these channels and others puts the significance of the signal at 5 sigma, meaning that only one experiment in three million would see an apparent signal this strong in a universe without a Higgs.”

Thumbs down to reports such as:

“There is less than a one in 3.5 million chance that their results are a statistical fluke.”

Critics (Spiegelhalter) allege they are misinterpreting the P-value as a posterior probability on H0.

Page 15: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Not so. H0 does not say the observed results are due to background alone, or are flukes.

H0: μ = 0

Although if H0 were true, it follows that various results would occur with specified probabilities. (In particular, it entails that large bumps are improbable.)

Page 16: Statistical Flukes, the Higgs Discovery, and 5 Sigma


In fact it is an ordinary error probability. Since it’s not just a single result, but a dynamic test procedure, we can write it:

(1) Pr(Test T produces d(X) > 5; H0) ≤ .0000003

Note: (1) is not a conditional probability (which would involve a prior):

Pr(Test T produces d(X) > 5 and H0) / Pr(H0)

Page 17: Statistical Flukes, the Higgs Discovery, and 5 Sigma


(V) Detaching inference(s) from the evidence

True, the inference actually detached goes beyond a P-value report. Infer:

(2) There is strong evidence for

(first) a genuine discrepancy from H0

(later) H*: a Higgs (or a Higgs-like) particle.

Gradations: indication, evidence, discovery (up to July 4, 2012)

Inferring (2) relies on an implicit principle of evidence.

Page 18: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Test Principle #1 (statistical significance): Data provide evidence for a genuine discrepancy from H0 (just) to the extent that H0 would (very probably) have survived, were H0 a reasonably adequate description of the process generating the data.

(1)' Pr(Test T produces d(X) < 5; H0) > .9999997

• With probability .9999997, the bumps would be smaller, would behave like flukes, disappear with more data, not be produced at both CMS and ATLAS, in a world given by H0.

• They didn’t disappear; they grew.

(2) So, H*: a Higgs (or a Higgs-like) particle.

Page 19: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Following the rule “interpret 5 sigma bumps as a real effect (a discrepancy from 0),” you’d erroneously interpret data with probability less than .0000003.

An error probability

The warrant isn’t low long-run error (in a case like this) but detaching an inference based on “strong argument from coincidence”. Qualifying claims by how well they have been probed (precision, accuracy).

Page 20: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Second Stage

Once the null is rejected, the job shifts to testing whether various parameters agree with the SM predictions. The null hypothesis at the second stage is the SM Higgs boson:

H[2]0: SM Higgs boson: μ = 1

and discrepancies from it are probed and estimated with confidence intervals (Cousins).
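A hedged sketch of this second-stage estimation, assuming a Gaussian measurement of the signal strength; mu_hat and se below are invented for illustration, whereas the experiments’ actual intervals come from profile-likelihood fits:

```python
from scipy.stats import norm

mu_hat = 1.3   # hypothetical measured signal strength (invented for illustration)
se = 0.2       # hypothetical standard error (invented)

z = norm.ppf(0.975)                       # two-sided 95% confidence level
lo, hi = mu_hat - z * se, mu_hat + z * se
print(f"95% confidence interval for mu: ({lo:.2f}, {hi:.2f})")

# Does the interval rule out the SM value mu = 1?
print("SM value mu = 1 ruled out at this level?", not (lo <= 1.0 <= hi))
```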

Page 21: Statistical Flukes, the Higgs Discovery, and 5 Sigma


This takes us to the most important role served by statistical significance tests (requiring a 5 sigma excess for discovery):

It affords a standard for:

• (a) denying sufficient evidence of a new particle, inferring “not a genuine effect”, and

• (b) ruling out values of various parameters, e.g., mass ranges.

Page 22: Statistical Flukes, the Higgs Discovery, and 5 Sigma


(VI) Positive and negative test results of the analysis

Positive (very low P-value): infer genuine effects.

Negative (moderate P-value): deny real effects (infer flukes); deny that excesses indicate BSM.

• At both stages, they were engaged in exploration for BSM physics (beyond the Standard Model).

• It combined testing, estimating, and exploring.

Page 23: Statistical Flukes, the Higgs Discovery, and 5 Sigma


NYT: “Chasing the Higgs” [Dennis Overbye interviews spokespeople Gianotti (ATLAS) and Tonelli (CMS).]

• Once a month they got bumps that were random flukes. “So ‘we crosscheck everything’ and ‘try to kill’ any anomaly that might be merely random.” They were convinced they had found evidence of extra dimensions of space-time, “and then the signal faded like an old tired balloon.”

Page 24: Statistical Flukes, the Higgs Discovery, and 5 Sigma


• “We’ve made many discoveries,” Dr. Tonelli said, “most of them false.”

• “Ninety-nine percent of the time, that is just what happens.”

What’s the difference between HEP physics and social psychology (and other big data screening) where “most results in most fields are false”, or so we keep hearing? HEP physicists don’t publish on the basis of a single “nominal” (or “local”) P-value.

Page 25: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Look Elsewhere Effect (LEE)

A nominal (or local) P-value is the P-value at a particular, data-determined mass. But the probability of so impressive a difference arising anywhere in a mass range is greater than the local one. I take it that requiring a smaller P-value (i.e., a bigger difference), at least 5 sigma, is akin to adjusting for multiple trials, or the look elsewhere effect (LEE).
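A toy simulation of the look elsewhere effect; the number of independent mass bins is invented, and real analyses correct for the LEE with dedicated methods. It shows that the chance of a 3 sigma bump somewhere in a scanned range far exceeds the local P-value at one fixed mass:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

n_bins = 100            # hypothetical number of roughly independent mass bins
local_p = norm.sf(3)    # local P-value of a 3 sigma bump at one fixed mass

# Background-only experiments: each bin fluctuates like N(0, 1);
# record how often the largest fluctuation anywhere reaches 3 sigma.
n_sim = 100_000
max_bump = rng.standard_normal((n_sim, n_bins)).max(axis=1)
global_p = np.mean(max_bump >= 3)

print(f"local P-value at one mass: {local_p:.4f}")    # about 0.0013
print(f"global P-value anywhere:   {global_p:.3f}")   # about 0.13 for 100 bins
```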

Page 26: Statistical Flukes, the Higgs Discovery, and 5 Sigma


“Game of Bump-Hunting” (Overbye)

“One bump on physicists’ charts…was disappearing. But another was blooming like the shy girl at a dance. … nobody could remember exactly when she had come in. But she was the one who would marry the prince.”

“It continued to grow over the fall until it had reached the 3-sigma level — the chances of being a fluke [spurious significance] were less than 1 in 740, enough for physicists to admit it to the realm of ‘evidence’ of something, but not yet a discovery.”

Page 27: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Background knowledge of how flukes behave:

• “If they were flukes, more data would make them fade into the statistical background,

• If not, the bumps would grow in slow motion into a bona fide discovery.”

• They give the bump a hard time, look at multiple decay channels, and don’t tell the other team the details of where they found her.

• When two independent experiments find the same particle signal at the same mass, it overcomes the multiple testing and gives a strong argument.

 

Page 28: Statistical Flukes, the Higgs Discovery, and 5 Sigma


(VII) Possible Anomalies for the SM

They also follow up bumps indicating discrepancies with

H[2]0: SM Higgs boson: μ = 1

Hints of anomalies with the “plain vanilla” particle of the Standard Model (viewed as tests or corresponding interval estimates). Even a year later they examined these anomalies with more data.

Page 29: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Curb your enthusiasm

Matt Strassler: “The excess (in favor of BSM properties) became a bit smaller each time…. That’s an unfortunate sign, if one is hoping the excess isn’t just a statistical fluke.”

Or they’d see the bump at ATLAS… and not at CMS.

“Taking all of the data, and not cherry picking…there’s nothing here that you can call ‘evidence’ for the much sought BSM.” (Strassler)

Considering the frequent flukes, and the hot competition between ATLAS and CMS to be first, a tool for when to “curb their enthusiasm” seems exactly what was wanted.

Page 30: Statistical Flukes, the Higgs Discovery, and 5 Sigma


So, this “negative” portion involves:

(a) denying BSM anomalies are real

(b) setting upper bounds for these discrepancies with the SM Higgs

Each with its own test statistic and observed result g(x0)

H[2]0: SM Higgs boson: μ = 1

Failing to reject the null isn’t evidence for it, but they could set upper bounds.

Page 31: Statistical Flukes, the Higgs Discovery, and 5 Sigma


Test Principle #2 (for non-significance): Data provide evidence to rule out a discrepancy δ* to the extent that a larger g(x0) would very probably have resulted if δ were as great as δ*.

Detach: δ < δ* (this could equivalently be viewed as inferring a confidence interval estimate, δ < g(x0) + ε).
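A hedged numerical reading of Principle #2, assuming a Gaussian measurement of the discrepancy with known standard error; g_x0, se, and delta_star below are invented for illustration:

```python
from scipy.stats import norm

g_x0 = 0.05        # hypothetical observed discrepancy from mu = 1 (invented)
se = 0.10          # hypothetical standard error of g (invented)
delta_star = 0.4   # candidate discrepancy to be ruled out (invented)

# If the true discrepancy were delta_star, a larger g(x0) would very
# probably have been observed; that warrants detaching delta < delta_star.
p_larger = norm.sf((g_x0 - delta_star) / se)
print(f"Pr(g(X) > {g_x0}; delta = {delta_star}) = {p_larger:.4f}")

# Equivalently, a one-sided 95% upper confidence bound: delta < g(x0) + 1.645*se
upper = g_x0 + norm.ppf(0.95) * se
print(f"95% upper bound on delta: {upper:.3f}")
```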

So these tools seem just the thing for this research

Page 32: Statistical Flukes, the Higgs Discovery, and 5 Sigma


(VIII) Conclusion

O’Hagan published a digest of responses a few days later.

• “They surely would be willing to announce SM Higgs discovery if they were 99.99% certain of the existence of the SM Higgs” (and avoid the ad hoc 5 sigma)

Pr(SM Higgs) = .9999

• It would require assigning a prior probability to the “SM Higgs” claim, and prior distributions on the numerous “nuisance” parameters of the background and the signal.

• Multivariate priors, correlations between parameters, joint priors, and the catchall: P(data|not H*)

Page 33: Statistical Flukes, the Higgs Discovery, and 5 Sigma


• Even if all that were done and agreed upon, it would not have given the kind of tools needed to find things out

Worse: spiked priors, Pr(No SM Higgs) = Pr(SM Higgs) = .5

(not uninformative)

• Physicists believed in the SM Higgs before building the big collider, given the perfect predictive success of the SM and its simplicity; this is very different from having evidence for a discovery.

• Others may believe (and fervently wish) that it will break down somewhere.

Page 34: Statistical Flukes, the Higgs Discovery, and 5 Sigma


P-value police: those who think we want a posterior probability on H* might be sliding from what may be inferred from this legitimate high probability:

Pr(Test T would not reach 5 sigma; H0) > .9999997

With probability .9999997, our methods would show that the bumps disappear, under the assumption that the data are due to background alone (H0). They don’t disappear but grow. Infer H*, qualified by the test’s properties.

Page 35: Statistical Flukes, the Higgs Discovery, and 5 Sigma


What’s passed with high severity?

H*: a Higgs boson consistent with the SM (at the levels of precision and accuracy of these experiments)

An adequate account should also always report alternatives that have not been well ruled out:

• Measurements are not precise enough to rule out discrepancies from an SM Higgs as large as 10%, 20%, 50%.

• There are rivals to the SM that would not have been distinguishable with the given data (which went through a lot of filtering and triggering rules).

They will get more data in 2015, and there’s talk of a more precise detector being built.

Page 36: Statistical Flukes, the Higgs Discovery, and 5 Sigma


REFERENCES (online links)

• ATLAS report: http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

• ATLAS Higgs experiment, public results: https://twiki.cern.ch/twiki/bin/view/AtlasPublic/HiggsPublicResults

• CMS Higgs experiment, public results: https://twiki.cern.ch/twiki/bin/view/CMSPublic/PhysicsResultsHIG

• Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference,” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

Page 37: Statistical Flukes, the Higgs Discovery, and 5 Sigma



• Cousins, R. (2014). “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics,” http://arxiv.org/abs/1310.3791

• O’Hagan letter:

§ Original letter with responses: http://bayesian.org/forums/news/3648

§ 1st link in a group of discussions of the letter: http://errorstatistics.com/2012/07/11/is-particle-physics-bad-science/

• Overbye, D. (March 15, 2013) “Chasing the Higgs,” New York Times: http://www.nytimes.com/2013/03/05/science/chasing-the-higgs-boson-how-2-teams-of-rivals-at-CERN-searched-for-physics-most-elusive-particle.html?pagewanted=all&_r=0

Page 38: Statistical Flukes, the Higgs Discovery, and 5 Sigma


• Spiegelhalter, D. (August 7, 2012) blog, Understanding Uncertainty, “Explaining 5 sigma for the Higgs: how well did they do?” http://understandinguncertainty.org/explaining-5-sigma-higgs-how-well-did-they-do

• Strassler, M. (July 2, 2013) blog, Of Particular Significance, “A Second Higgs Particle”: http://profmattstrassler.com/2013/07/02/a-second-higgs-particle/

• Wasserman, L. (July 11, 2012) blog, Normal Deviate, “The Higgs Boson and the P-Value Police”: http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/

