Hypothesis Testing
Probabilistic Methods for NLP (Méthodes probabilistes pour le TAL)
Guillaume Wisniewski
November 2017
Université Paris-Sud & LIMSI
Framework
Goal
Example
• claim/hypothesis of Nadi : his ‘genius’ is due to having eaten salted butter since childhood
• how to test this hypothesis / check
that experimental evidence
supports it
Hypothesis Testing
• procedure = logical sequence of
steps
• decide whether to accept or reject a hypothesis
(Re)formulating the hypothesis
• express the hypothesis in a form that can be tested
• we have to decide on an operational definition of intelligence ⇒ IQ test
• something we can measure
• convention between
experimenters
• research hypothesis : people eating salted butter are cleverer / have a higher IQ
How to decide ?
• Compare the distributions of
• IQ of people that are eating salted butter (salted-IQ distribution)
• IQ of people that are not eating salted butter (usual-IQ distribution)
But...
Salted-IQ distribution
• something we simply cannot find
out
• we only have access to one score !
Usual-IQ distribution
• can we give everyone an IQ-test ?
• alternative : assume that the IQ
scores are normally distributed
• the creators of IQ tests deliberately constructed them so that the scores are distributed according to N(100, 15), i.e. mean µ = 100 and standard deviation σ = 15
Null Hypothesis
What we have...
• one distribution out of two is known, but we cannot test our research hypothesis directly...
• but we can test the null hypothesis
H0 : usual-IQ and salted-IQ
distributions are the same
Null Hypothesis
• consider H0 “innocent until proven
guilty”
• assume H0 is true unless the data give strong evidence to the contrary
Testing the null hypothesis
Principle
• assume that salted-IQ and usual-IQ distributions are the same
• test whether Nadi’s IQ score comes from the usual-IQ distribution
In practice
• compute the z-score :
$$z = \frac{x - \mu}{\sigma} \tag{1}$$
⇒ distance between the raw score and the population mean in units of the standard deviation
• z table : area under the Gaussian curve to the right of z
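As a minimal sketch of the two steps above, the z-score and the corresponding z-table entry (area to the right of z) can be computed with Python's standard library. The function names `z_score` and `upper_tail` and the raw score of 130 are illustrative assumptions, not taken from the slides.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Distance between the raw score x and the mean mu, in units of sigma."""
    return (x - mu) / sigma


def upper_tail(z):
    """Area under the standard Gaussian curve to the right of z (z-table value)."""
    return 1.0 - NormalDist().cdf(z)


# Hypothetical raw score of 130 on a scale with mean 100 and standard deviation 15.
z = z_score(130, mu=100, sigma=15)
print(f"z = {z:.2f}, area to the right of z = {upper_tail(z):.4f}")
```

A printed z table tabulates the same quantity for z rounded to two decimals, so small differences in the last digit are to be expected.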
Interpreting z-scores
[Figure : Gaussian curve, frequency as a function of z (z from −4 to 4). The value given by the z-table is the area to the right of z, i.e. the probability of observing a value larger than z.]
In our case
Nadi’s IQ = 120
• z = 1.33 ; according to the z-table, 9.18% of the usual-IQ distribution has an IQ score higher than Nadi’s
• not that impressive
Nadi’s IQ = 145
• z = 3 ⇒ 0.13% of scores are higher than Nadi’s
• it is more likely that this score belongs to a different, higher, distribution
we rely on the assumption that the salted-IQ and usual-IQ distributions are the same
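Both z-scores follow from equation (1) with µ = 100 and σ = 15, the parameters assumed for the usual-IQ distribution. As a worked check:

```latex
z_{120} = \frac{120 - 100}{15} \approx 1.33
\qquad
z_{145} = \frac{145 - 100}{15} = 3
```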
General interpretation
General principle
Hypothesis testing is a gamble on the basis of probabilities. If the probability of Nadi’s score coming from a distribution identical to the usual-IQ distribution is very low, we reject the null hypothesis ; if the probability is not very low, we accept it.
Significance Level
When should we switch from rejection to acceptance ?
Significance Level
• reject H0 at a significance level of 0.05
• a score from the unknown distribution would arise from the known distribution with a probability of less than 5%
⇒ decision criterion
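The decision criterion can be sketched as a small function. This is an illustration under stated assumptions: the usual-IQ distribution is taken as N(100, 15) as earlier in the slides, the test is one-tailed (higher scores only), and the name `reject_h0` is hypothetical.

```python
from statistics import NormalDist

ALPHA = 0.05  # significance level used on the slide


def reject_h0(x, mu=100.0, sigma=15.0, alpha=ALPHA):
    """One-tailed test: reject H0 if a score at least as high as x
    has probability below alpha under the known distribution."""
    z = (x - mu) / sigma
    p = 1.0 - NormalDist().cdf(z)
    return p < alpha


# Critical z above which H0 is rejected at the chosen level (upper tail).
z_crit = NormalDist().inv_cdf(1.0 - ALPHA)
print(f"critical z at alpha = {ALPHA}: {z_crit:.3f}")

for score in (120, 145):
    decision = "reject H0" if reject_h0(score) else "do not reject H0"
    print(score, decision)
```

With these assumptions, the first of Nadi's two scores does not lead to rejection and the second does, in line with the informal reading given earlier.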
Vocabulary
One- and two-tailed predictions
1. The unknown distribution is the same as the known
distribution.
2. The unknown distribution is higher up the scale than the
known distribution.
3. The unknown distribution is lower down the scale than the
known distribution.
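To make the vocabulary concrete: statement 1 is the null hypothesis, while statements 2 and 3 are one-tailed predictions, and predicting a difference in either direction is a two-tailed prediction. The sketch below assumes the usual textbook definitions of the corresponding p-values for a z statistic; the function name and dictionary keys are illustrative.

```python
from statistics import NormalDist


def p_values(z):
    """p-values of a z statistic under the three kinds of prediction."""
    std_normal = NormalDist()
    higher = 1.0 - std_normal.cdf(z)   # one-tailed: unknown distribution is higher
    lower = std_normal.cdf(z)          # one-tailed: unknown distribution is lower
    different = 2.0 * min(higher, lower)  # two-tailed: different in either direction
    return {"higher": higher, "lower": lower, "different": different}


for name, p in p_values(2.4).items():
    print(f"{name}: p = {p:.4f}")
```

For the same z, the two-tailed p-value is twice the smaller one-tailed value, so a two-tailed prediction needs stronger evidence to reach a given significance level.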
Example i
We are tossing a coin. Is it fair ?
Principle
• toss coin n times
• coin ‘suspicious’ if number of heads is much less or much more than n/2
Example ii
Hypothesis
• c : probability to observe heads
• H0 : c = 0.5
• HA : c ≠ 0.5 (alternate/research hypothesis)
Statistical test
• ĉ : observed proportion of heads in n tosses
• the standard deviation of ĉ is $\sqrt{\frac{c \times (1 - c)}{n}}$
• test statistic :
$$z = \frac{\hat{c} - c}{\sqrt{\frac{1}{n} \cdot c \cdot (1 - c)}} \tag{2}$$
Application
First
• n = 100, observed proportion of heads ĉ = 0.62
• we have : z = 2.4, value in z table : 0.82%
• we reject H0 at the 5% level
Second
• n = 100, observed proportion of heads ĉ = 0.47
• we have : z = −0.6, value in z table : 27.43%
• ĉ is not significantly different from 0.5 at the 5% level
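The two applications above can be reproduced with a few lines of Python. This is a sketch, assuming the z-table value is the area to the right of |z| under the standard normal curve; the names `coin_z` and `z_table` are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def coin_z(c_hat, n, c0=0.5):
    """Test statistic of equation (2): observed proportion c_hat against c0."""
    return (c_hat - c0) / sqrt(c0 * (1.0 - c0) / n)


def z_table(z):
    """Area under the standard normal curve to the right of |z|."""
    return 1.0 - NormalDist().cdf(abs(z))


for c_hat in (0.62, 0.47):
    z = coin_z(c_hat, n=100)
    print(f"c_hat = {c_hat}: z = {z:.1f}, z-table value = {100 * z_table(z):.2f}%")
```

Since HA is two-sided, a two-tailed p-value would double the table values (about 1.64% and 54.86%); the decisions at the 5% level are unchanged.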
Generalization
What have we seen ?
Simplest case
• one known distribution + normal distribution
• one sample
Generalization(s)
• hypotheses on the ‘usual’ distribution : shape, µ known, σ known, ...
• one sample / two samples
In practice
• the ‘spirit’ is the same
• (lots of) technical difficulties
• e.g.
• Student distribution instead of a
normal distribution when the
variance is not known
• non-parametric tests
Example : length of sentences
Data
• Mean sentence length in 50 novels from 1950s : X̄1 = 19.3
• Mean sentence length in 50 novels from 2000s : X̄2 = 16.4
• X is normally distributed with variance σ² = 134.2
Test statistic (difference in estimated mean)
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma^2}{n_1 + n_2}}} = \frac{19.3 - 16.9}{\sqrt{\frac{134.2}{50 + 50}}} = 2.28 \tag{3}$$
Conclusions
• p = 0.0226
• Reject H0 at α = 5% (but not at α = 1%)
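The reported p-value is consistent with a two-tailed reading of Z = 2.28 under the standard normal distribution. That the test is two-tailed is an inference from the numbers, not something the slide states; the function name is illustrative.

```python
from statistics import NormalDist


def two_tailed_p(z):
    """Probability of a statistic at least as extreme as z, in either direction."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


p = two_tailed_p(2.28)  # Z as reported in equation (3)
print(f"p = {p:.4f}")
for alpha in (0.05, 0.01):
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(f"alpha = {alpha}: {decision}")
```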
What is wrong with significance
testing ?
History
• most of the concepts were developed by Sir Ronald Fisher in the 1920s
• “a genius who almost
single-handedly created the
foundations for modern statistical
science”
• strong opposition from the very
beginning
• at the core of most scientific
results, founding principles of the
design of experiments
It’s Not Easy Being Greene (ER Season 2, ep. 13)
• Benton : Are you serious ?
• Vucelich : Simon did an analysis of our result. Our P-value was 0.60. We are one successful outcome away from statistical significance.
• Benton : We can publish ?
• Vucelich : Soon. One more aneurysm and our numbers will blind the most dubious skeptics. After that, we head to D.C. to play dog-and-pony for the FDA. Now, Simon doesn’t fly, so he stays here which makes you the next choice for Clamp-and-Run Ambassador to Europe.
...
• Vucelich : You’ve gotta find another patient soon because the Norwegians are doing a similar study. And, Peter, we cannot let the Vikings pillage our thunder.
Definition of H0
• in the coin example : H0 : c = 0.5
• in practice : c will never be exactly 0.5. What is important is that it must be “close” to 0.5
• but a hypothesis of the form “c is close to 0.5” is ‘less tractable’
• We know a priori that H0 is false
What is significance ?
• textbook case : compare a new
drug to an old drug
• new drug works 0.4% (i.e. 0.004)
better than the old one
• is the new one “significantly” better ?
• what if the new drug has much
worse side effects and costs a lot
more (a given, for a new drug).
Impact of sample size
Recall that in the coin example :
$$z = \frac{\hat{c} - c}{\sqrt{\frac{1}{n} \cdot c \cdot (1 - c)}} \tag{4}$$
• to make z arbitrarily large (and the p-value arbitrarily small), just increase n
• as the sample size increases, eventually everything becomes
“significant”
• in NLP, n is always large !
• every educated person should understand statistics and
hypothesis testing !
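The effect of n on the statistic of equation (4) can be checked numerically. In the sketch below the observed proportion is held at a hypothetical 0.503, a deviation from 0.5 that few would call important, and only the number of tosses changes; `coin_z` is an illustrative name.

```python
from math import sqrt


def coin_z(c_hat, n, c0=0.5):
    """Test statistic of equation (4) for an observed proportion c_hat."""
    return (c_hat - c0) / sqrt(c0 * (1.0 - c0) / n)


C_HAT = 0.503  # hypothetical observed proportion, very close to 0.5

for n in (100, 10_000, 1_000_000):
    z = coin_z(C_HAT, n)
    flag = "significant" if abs(z) > 1.96 else "not significant"
    print(f"n = {n:>9}: z = {z:.2f} ({flag} at the 5% level, two-tailed)")
```

The same tiny deviation goes from clearly non-significant to highly significant purely because n grows, which is the concern raised for large NLP datasets.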
What should we do instead ?
Possible solution
• form a confidence interval :
• if n is large enough : estimation reasonably accurate
• location of the interval = answer to the question
Example (coins)
• confidence interval : [0.502, 0.504]
• close enough to 0.5 (even if 0.5 not in it) + very narrow
• but no ‘automatic’ decision
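One way to form such an interval is the normal-approximation (Wald) interval for a proportion. The slides do not say which interval is meant, so this choice is an assumption, and the values of ĉ and n below are hypothetical, picked only so that the result resembles the interval in the example.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(c_hat, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    half_width = z * sqrt(c_hat * (1.0 - c_hat) / n)
    return c_hat - half_width, c_hat + half_width


# Hypothetical data: proportion of heads 0.503 over 960,400 tosses.
low, high = proportion_ci(0.503, 960_400)
print(f"95% confidence interval: [{low:.3f}, {high:.3f}]")
print("0.5 inside the interval:", low <= 0.5 <= high)
```

The interval excludes 0.5, so a significance test would reject H0, yet its location shows that the coin is, for most practical purposes, fair. The judgement on whether that is "close enough" remains with the reader.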
On the importance of ‘automatic’ decision
I was in search of a one-armed economist, so that the
guy could never make a statement and then say : “on the
other hand”
(President Harry S. Truman)
• foolish to expect to prove anything in a mathematical sense
• statistics = one piece of evidence
• must be weighed and combined with other information ⇒ preponderance of all the evidence
• but : lots of discussions
Evaluating classifiers performance in
NLP
The task
Accuracies of two PoS taggers across 10 datasets
Context
• compare a new system A to a baseline system B : is A better than B on some large population of data ?
• what can we conclude if A beats B
on one particular dataset ?
“by chance” victories
Main problem : (almost) impossible to draw
new test sets from the underlying
population
Difficulties
• effect size δ(x) = sA(x)− sB(x)
(difference of score on dataset x)
• δ(x) is not normally distributed
• δ(x) does not follow any
well-studied distribution
• many biases (e.g. sample size)
In practice : paired bootstrap
Show me the code
1. Draw b bootstrap samples x^(i) of size n by sampling with replacement from x
2. Initialize s = 0
3. For each x^(i), increment s if δ(x^(i)) > 2 · δ(x)
4. Estimate p ≈ s / b
Interpretation
• how often does A beat B by more than δ(x) on x^(i) ?
• factor 2 :
• x^(i) is drawn from x
• we expect A to beat B by δ(x) for at least half of the x^(i)
• mean correction
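The four steps above can be sketched in Python as follows. The sketch assumes that the score of a system is the mean of per-example scores (for instance 1 if a tag is correct, 0 otherwise), so that resampling the paired per-example differences is equivalent to resampling the examples; the function name, the seed and the toy data are illustrative.

```python
import random


def paired_bootstrap(scores_a, scores_b, b=2000, seed=0):
    """Estimate p with the paired bootstrap.

    scores_a[i] and scores_b[i] are the scores of systems A and B
    on the same example i of the test set x.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [sa - sb for sa, sb in zip(scores_a, scores_b)]
    delta_x = sum(diffs) / n                # observed effect size delta(x)
    s = 0
    for _ in range(b):
        sample = rng.choices(diffs, k=n)    # x^(i): n pairs drawn with replacement
        if sum(sample) / n > 2 * delta_x:   # delta(x^(i)) > 2 * delta(x)
            s += 1
    return s / b


# Toy data: A is correct on 900 of 1000 examples, B on 800 of the same examples.
a = [1] * 900 + [0] * 100
b_scores = [1] * 800 + [0] * 200
print("estimated p:", paired_bootstrap(a, b_scores))
```

Keeping the examples paired matters: both systems are always evaluated on the same resampled items, so the variation that is shared between them cancels out in δ.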
Impact on test set size
Conclusion
What’s in a p-value in NLP ?, A. Søgaard, A. Johannsen, B. Plank, D. Hovy and H. Martínez Alonso, Conference on Computational Natural Language Learning, 2014
References
Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein, An empirical investigation of statistical significance in NLP, EMNLP (Stroudsburg, PA, USA), Association for Computational Linguistics, 2012, pp. 995–1005.
Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Héctor Martínez Alonso, What’s in a p-value in NLP ?, CoNLL (Ann Arbor, Michigan), Association for Computational Linguistics, June 2014, pp. 1–10.