How to decide? - LIMSI...Hypothesis Testing Metho des probabilistes pour le TAL Guillaume Wisniewski [email protected] novembre 2017 Universite Paris Sud & LIMSI 1 Framework

Hypothesis Testing

Méthodes probabilistes pour le TAL (Probabilistic Methods for NLP)

Guillaume Wisniewski

[email protected]

November 2017

Université Paris-Sud & LIMSI


Framework

Goal

Example

• Nadi’s claim/hypothesis: Nadi’s ‘genius’ is due to having eaten salted butter since childhood

• how to test this hypothesis / check

that experimental evidence

supports it

Hypothesis Testing

• procedure = logical sequence of

steps

• decide whether to accept or reject a hypothesis

(Re)formulating the hypothesis

• express the hypothesis in a form that can be tested

• we have to decide on an operational definition of intelligence ⇒ IQ test

• something we can measure

• convention between

experimenters

• research hypothesis: people eating salted butter are cleverer / have a higher IQ

How to decide ?

• Compare the distributions of

• IQ of people who eat salted butter (salted-IQ distribution)

• IQ of people who do not eat salted butter (usual-IQ distribution)

But...

Salted-IQ distribution

• something we simply cannot find

out

• we only have access to one score !

Usual-IQ distribution

• can we give everyone an IQ-test ?

• alternative: assume that IQ scores are normally distributed

• the creators of IQ tests deliberately constructed them so that the scores are distributed according to N(100, 15), i.e. mean µ = 100 and standard deviation σ = 15

Null Hypothesis

What we have...

• only one of the two distributions is known, so we cannot test our research hypothesis directly...

• but we can test the null hypothesis

H0 : usual-IQ and salted-IQ

distributions are the same

Null Hypothesis

• consider H0 “innocent until proven

guilty”

• assume H0 is true unless the data give strong evidence to the contrary

Testing the null hypothesis

Principle

• assume that salted-IQ and usual-IQ distributions are the same

• test whether Nadi’s IQ score comes from the usual-IQ distribution

In practice

• compute the z-score:

$$z = \frac{x - \mu}{\sigma} \qquad (1)$$

⇒ distance between the raw score and the population mean, in units of the standard deviation

• z table: area under the Gaussian curve to the right of z

Interpreting z-scores

[Figure: Gaussian curve, with z on the horizontal axis and frequency on the vertical axis. The value given by the z-table is the probability of observing a value larger than z.]

In our case

Nadi’s IQ = 120

• z = 1.33; according to the z-table, 9.18% of the usual-IQ distribution has an IQ score higher than Nadi’s

• not that impressive

Nadi’s IQ = 145

• z = 3 ⇒ 0.13% of scores are higher than Nadi’s

• more likely that this score belongs

to a different, higher, distribution

We rely on the assumption that the salted-IQ and usual-IQ distributions are the same.
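The two cases above can be checked numerically. The following is a minimal sketch, assuming the N(100, 15) usual-IQ distribution from the slides; the function names `z_score` and `upper_tail` are illustrative and not part of the original material. The z-table lookup is replaced by the complementary error function from the standard library.

```python
import math


def z_score(x, mu=100.0, sigma=15.0):
    """Distance between a raw score and the mean, in standard deviations."""
    return (x - mu) / sigma


def upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for iq in (120, 145):
    z = z_score(iq)
    print(f"IQ = {iq}: z = {z:.2f}, share of higher scores = {upper_tail(z):.2%}")
```

For IQ = 120 the unrounded z gives a tail of about 9.1%; rounding z to 1.33 first, as one does with a printed table, gives the 9.18% quoted above.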

General interpretation

General principle

Hypothesis testing is a gamble on the basis of probabilities. If the probability of Nadi’s score coming from a distribution that is the same as the usual-IQ distribution is very low, we reject the null hypothesis; if the probability is not very low, we accept it.

Significance Level

When should we switch from rejection to acceptance ?

Significance Level

• reject H0 at a significance level of 0.05

• i.e. reject when a score at least as extreme as the observed one would arise from the known distribution with a probability of less than 5%

⇒ decision criterion


Vocabulary

One- and two-tailed predictions

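1. The unknown distribution is the same as the known distribution (null hypothesis).

2. The unknown distribution is higher up the scale than the known distribution (one-tailed prediction).

3. The unknown distribution is lower down the scale than the known distribution (one-tailed prediction).

A two-tailed prediction only states that the two distributions differ, i.e. that either 2 or 3 holds.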
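How the kind of prediction changes the p-value can be sketched as follows. This is an illustration only, assuming a z statistic that follows a standard normal distribution under H0; the function name `p_value` and the labels `"higher"`, `"lower"` and `"two-tailed"` are hypothetical choices made for this example.

```python
import math


def upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def p_value(z, prediction="two-tailed"):
    """p-value of a z statistic for a one- or two-tailed prediction."""
    if prediction == "higher":      # unknown distribution higher up the scale
        return upper_tail(z)
    if prediction == "lower":       # unknown distribution lower down the scale
        return upper_tail(-z)
    if prediction == "two-tailed":  # distributions differ, in either direction
        return 2 * upper_tail(abs(z))
    raise ValueError(f"unknown prediction: {prediction}")


for kind in ("higher", "lower", "two-tailed"):
    print(f"z = 1.96, {kind}: p = {p_value(1.96, kind):.4f}")
```

For the same z, the two-tailed p-value is twice the one-tailed value in the predicted direction, so a two-tailed test is the more conservative choice.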


Example i

We are tossing a coin. Is it fair ?

Principle

• toss coin n times

• coin ‘suspicious’ if the number of heads is much smaller or much larger than n/2

Example ii

Hypothesis

• c: probability of observing heads

• H0: c = 0.5

• HA: c ≠ 0.5 (alternate/research hypothesis)

Statistical test

• ĉ: observed proportion of heads in n tosses

• under H0, the standard deviation of ĉ is $\sqrt{c (1 - c) / n}$

• test statistic:

$$z = \frac{\hat{c} - c}{\sqrt{\frac{1}{n} \cdot c \cdot (1 - c)}} \qquad (2)$$

Application

First

• n = 100, observed proportion of heads ĉ = 0.62

• we have: z = 2.4, value in z table: 0.82%

• we reject H0 at the 5% level

Second

• n = 100, observed proportion of heads ĉ = 0.47

• we have: z = −0.6, value in z table: 27.43%

• ĉ is not significantly different from 0.5 at the 5% level
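Both applications can be reproduced with a few lines of code. This is a minimal sketch of the test statistic of equation (2), with the z-table replaced by `math.erfc`; the names `coin_z` and `upper_tail` are illustrative.

```python
import math


def coin_z(c_hat, n, c0=0.5):
    """z statistic for an observed proportion of heads c_hat under H0: c = c0."""
    return (c_hat - c0) / math.sqrt(c0 * (1 - c0) / n)


def upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for c_hat in (0.62, 0.47):
    z = coin_z(c_hat, n=100)
    one_tail = upper_tail(abs(z))  # what the z-table gives
    print(f"c_hat = {c_hat}: z = {z:.2f}, "
          f"one tail = {one_tail:.2%}, two tails = {2 * one_tail:.2%}")
```

The z-table value is the area in one tail. Since HA: c ≠ 0.5 is a two-tailed prediction, the corresponding p-values are twice as large (about 1.64% and 54.86%), which leaves both conclusions unchanged at the 5% level.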


Generalization

What have we seen?

Simplest case

• one known distribution, assumed normal

• one sample

Generalization(s)

• hypotheses on the ‘usual’ distribution: shape, µ known, σ known, ...

• one sample / two samples

In practice

• the ‘spirit’ is the same

• (lots of) technical difficulties

• e.g.

• Student’s t-distribution instead of a normal distribution when the variance is not known

• non-parametric tests

Example : length of sentences

Data

• Mean sentence length in 50 novels from the 1950s: X1 = 19.3

• Mean sentence length in a second sample of 50 novels: X2 = 16.9

• X is normally distributed with variance σ² = 134.2


Test statistic (difference in estimated means)

$$Z = \frac{X_1 - X_2}{\sqrt{\frac{\sigma^2}{n_1 + n_2}}} = \frac{19.3 - 16.9}{\sqrt{\frac{134.2}{50 + 50}}} = 2.28 \qquad (3)$$

Conclusions

• p = 0.0226

• Reject H0 at α = 5% (but not at α = 1%)


What is wrong with significance testing?

History

• most of the concepts were developed by Sir Ronald Fisher in the 1920s

• “a genius who almost

single-handedly created the

foundations for modern statistical

science”

• strong opposition from the very

beginning

• at the core of most scientific

results, founding principles of the

design of experiments


It’s Not Easy Being Greene (ER Season 2, ep. 13)

• Benton: Are you serious?

• Vucelich: Simon did an analysis of our result. Our P-value was 0.60. We are one successful outcome away from statistical significance.

• Benton: We can publish?

• Vucelich: Soon. One more aneurysm and our numbers will blind the most dubious skeptics. After that, we head to D.C. to play dog-and-pony for the FDA. Now, Simon doesn’t fly, so he stays here which makes you the next choice for Clamp-and-Run Ambassador to Europe.

...

• Vucelich: You’ve gotta find another patient soon because the Norwegians are doing a similar study. And, Peter, we cannot let the Vikings pillage our thunder.

Definition of H0

• in the coin example: H0: c = 0.5

• in practice, c will never be exactly 0.5; what matters is that it is “close” to 0.5

• but a hypothesis of the form “c is close to 0.5” is ‘less tractable’

• so we know a priori that H0, taken literally, is false

What is significance ?

• textbook case : compare a new

drug to an old drug

• new drug works 0.4% (i.e. 0.004)

better than the old one

• is the new one “significantly” better?

• what if the new drug has much

worse side effects and costs a lot

more (a given, for a new drug).


Impact of sample size

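Recall that in the coin example:

$$z = \frac{\hat{c} - c}{\sqrt{\frac{1}{n} \cdot c \cdot (1 - c)}} \qquad (4)$$

• for any fixed difference ĉ − c ≠ 0, |z| grows like √n: to make the p-value arbitrarily small, just increase n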

• as the sample size increases, eventually everything becomes

“significant”

• in NLP, n is always large !

• every educated person should understand statistics and

hypothesis testing !
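The effect of n on equation (4) is easy to see numerically. The sketch below keeps the observed proportion fixed at a hypothetical ĉ = 0.503 (a value chosen for illustration, not taken from the slides) and only varies the number of tosses.

```python
import math


def coin_z(c_hat, n, c0=0.5):
    """z statistic for an observed proportion of heads c_hat under H0: c = c0."""
    return (c_hat - c0) / math.sqrt(c0 * (1 - c0) / n)


def two_tailed_p(z):
    """Two-tailed p-value of a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


c_hat = 0.503  # hypothetical: a coin that is only very slightly biased
for n in (100, 10_000, 1_000_000, 100_000_000):
    z = coin_z(c_hat, n)
    print(f"n = {n:>11,}: z = {z:6.2f}, p = {two_tailed_p(z):.3g}")
```

The same tiny deviation from 0.5 is far from significant with 100 tosses and overwhelmingly significant with a million, although the coin itself has not changed.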


What should we do instead ?

Possible solution

• form a confidence interval :

• if n is large enough : estimation reasonably accurate

• location of the interval = answer to the question

Example (coins)

• confidence interval : [0.502, 0.504]

• close enough to 0.5 (even if 0.5 is not in it) + very narrow

• but no ‘automatic’ decision
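One common way to form such an interval for a proportion is the normal-approximation (Wald) interval, sketched below. The slides do not say which construction they use, so this choice is an assumption, as are the function name `proportion_ci` and the values ĉ = 0.503 and n = 960,400, which were picked only because they give an interval close to the one in the example.

```python
import math


def proportion_ci(c_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion (z = 1.96 for 95%)."""
    half_width = z * math.sqrt(c_hat * (1 - c_hat) / n)
    return c_hat - half_width, c_hat + half_width


# Hypothetical data: 960,400 tosses with 50.3% heads.
low, high = proportion_ci(0.503, 960_400)
print(f"95% confidence interval: [{low:.4f}, {high:.4f}]")
```

The interval reports both the location of the estimate and its precision, and leaves the judgement of whether the coin is “close enough” to fair to the reader.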


On the importance of ‘automatic’ decision

I was in search of a one-armed economist, so that the

guy could never make a statement and then say : “on the

other hand”

(President Harry S. Truman)

• foolish to expect to prove anything in a mathematical sense

• statistics = one piece of evidence

• must be weighed and combined with other information ⇒ preponderance of all the evidence

• but : lots of discussions


Evaluating classifier performance in NLP

The task

Accuracies of two PoS taggers across 10 datasets

Context

• compare a new system A to a baseline system B: is A better than B on some large population of data?

• what can we conclude if A beats B

on one particular dataset ?

“by chance” victories

Main problem : (almost) impossible to draw

new test sets from the underlying

population


Difficulties

• effect size δ(x) = sA(x)− sB(x)

(difference of score on dataset x)

• δ(x) is not normally distributed

• δ(x) does not follow any

well-studied distribution

• many biases (e.g. sample size)

In practice : paired bootstrap

Show me the code

1. Draw b bootstrap samples x^(i) of size n by sampling with replacement from x

2. Initialize s = 0

3. For each x^(i), increment s if δ(x^(i)) > 2 · δ(x)

4. Estimate p ≈ s / b

Interpretation

• how often does A beat B by more than δ(x) on x^(i), beyond what is expected?

• factor 2 (mean correction):

• x^(i) is drawn from x

• so we expect A to beat B by δ(x) on at least half of the x^(i)

• the bootstrap values δ(x^(i)) are centred on δ(x) rather than on 0, so we test δ(x^(i)) − δ(x) > δ(x)
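The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference implementation: it assumes that the score of a system is the mean of per-example scores (e.g. 1 if a token is tagged correctly, 0 otherwise), so that δ is a difference of means. The function name `paired_bootstrap_p` and its parameters are hypothetical.

```python
import random


def paired_bootstrap_p(scores_a, scores_b, b=1000, seed=0):
    """Paired bootstrap estimate of the p-value for 'A is better than B'.

    scores_a[j] and scores_b[j] are the scores of both systems on the same
    test example j (assumption: the system score is their mean).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same examples")
    n = len(scores_a)
    diffs = [sa - sb for sa, sb in zip(scores_a, scores_b)]
    delta = sum(diffs) / n                    # delta(x) on the original test set
    rng = random.Random(seed)
    s = 0                                     # step 2
    for _ in range(b):                        # step 1: b bootstrap samples
        sample = rng.choices(diffs, k=n)      # resample examples with replacement
        if sum(sample) / n > 2 * delta:       # step 3: mean-corrected comparison
            s += 1
    return s / b                              # step 4


# Toy data: A is correct on 76 examples, B on 74, and they disagree on 50.
a = [1] * 26 + [0] * 24 + [1] * 50
b_scores = [0] * 26 + [1] * 24 + [1] * 50
print(paired_bootstrap_p(a, b_scores))
```

Resampling the per-example differences keeps the pairing between the two systems: each bootstrap sample contains the outputs of A and B on the same examples.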

Impact on test set size


Conclusion

What’s in a p-value in NLP?, A. Søgaard, A. Johannsen, B. Plank, D. Hovy and H. Martínez Alonso, Conference on Computational Language Learning, 2014

References

Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein, An empirical investigation of statistical significance in NLP, EMNLP (Stroudsburg, PA, USA), Association for Computational Linguistics, 2012, pp. 995–1005.

Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Héctor Martínez Alonso, What’s in a p-value in NLP?, CoNLL (Ann Arbor, Michigan), Association for Computational Linguistics, June 2014, pp. 1–10.
