Previous research has suggested the permutation test as the theoretically optimal statistical significance test for IR evaluation, and advocated for the discontinuation of the Wilcoxon and sign tests. We present a large-scale study comprising nearly 60 million system comparisons showing that in practice the bootstrap, t-test and Wilcoxon test outperform the permutation test under different optimality criteria. We also show that actual error rates seem to be lower than the theoretically expected 5%, further confirming that we may actually be underestimating significance.
A Comparison of the Optimality
of Statistical Significance Tests
for Information Retrieval Evaluation
Julián Urbano, Mónica Marrero and Diego Martín
Department of Computer Science · University Carlos III of Madrid
The problem: is system A more effective than system B?
The drill: evaluate with a test collection and run a statistical significance test
The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation?
The reason: test assumptions are violated, so which one is optimal in practice?
Three criteria: power (maximize # of significants), safety (minimize # of errors), exactness (keep errors at α)
Data and Methods
• TREC Robust 2004: 100 topics from Ad Hoc 7 and 8
o 110 runs, 5995 pairs of systems
• Randomly split the topics into T1 and T2, as if they were two collections
o Evaluate all runs and compute p-values
o Compare p-values from T1 with p-values from T2
o 1000 trials, 12M p-values per test, 60M in total
• Interpret pairs of p-values for different α levels
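The split-half procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy scores are synthetic (not the TREC runs), the two halves are assumed to be of equal size, only one of the five tests is shown (a Monte Carlo paired permutation test based on sign flips), and the names `permutation_p_value` and `split_half_trial` are hypothetical.

```python
import random
from itertools import combinations


def permutation_p_value(diffs, rng, n_resamples=1000):
    """Two-sided paired permutation (sign-flip) test on per-topic differences."""
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def split_half_trial(scores, rng, n_resamples=1000):
    """One trial: split topics into T1 and T2 and test every pair of runs on each.

    `scores` maps a run name to its per-topic scores (same topic order for all).
    Returns {(run_a, run_b): ((mean_diff_T1, p_T1), (mean_diff_T2, p_T2))}.
    """
    n_topics = len(next(iter(scores.values())))
    topics = list(range(n_topics))
    rng.shuffle(topics)
    half = n_topics // 2  # assumption: two halves of equal size
    t1, t2 = topics[:half], topics[half:]

    results = {}
    for a, b in combinations(sorted(scores), 2):
        per_half = []
        for subset in (t1, t2):
            diffs = [scores[a][t] - scores[b][t] for t in subset]
            mean_diff = sum(diffs) / len(diffs)
            p_value = permutation_p_value(diffs, rng, n_resamples)
            per_half.append((mean_diff, p_value))
        results[(a, b)] = tuple(per_half)
    return results


# Hypothetical toy data: 4 runs over 20 topics, clipped to [0, 1].
rng = random.Random(2013)
toy_scores = {
    f"run{i}": [min(1.0, max(0.0, rng.gauss(0.25 + 0.03 * i, 0.1)))
                for _ in range(20)]
    for i in range(4)
}
trial = split_half_trial(toy_scores, rng, n_resamples=200)
```

With 110 runs there are 5995 pairs, so each trial yields 2 × 5995 p-values per test; over 1000 trials that is roughly the 12M p-values per test reported above.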
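Outcome of each pair of p-values, by result in T1 (rows) and in T2 (columns):

T1 \ T2    A ≻ B            A ≺ B          A ≻≻ B     A ≺≺ B
A ≻ B      Non-significance (regardless of the T2 result)
A ≻≻ B     Lack of power    Minor error    Success    Major error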
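The interpretation of a pair of p-values can be written as a small function. This is a sketch under assumptions that the poster does not spell out: ≻≻ is read as a significant difference and ≻ as a non-significant one, "significant" is taken to mean p ≤ α, and a zero difference in T2 is not counted as a reversal. The function name `classify` is hypothetical.

```python
def classify(diff_t1, p_t1, diff_t2, p_t2, alpha=0.05):
    """Label one system pair given its (mean difference, p-value) in T1 and T2."""
    if p_t1 > alpha:
        # Nothing is claimed from T1, whatever T2 says.
        return "non-significance"

    # T1 is significant: check whether T2 agrees on the direction.
    reversed_in_t2 = diff_t1 * diff_t2 < 0  # assumption: ties are not reversals
    significant_in_t2 = p_t2 <= alpha

    if not reversed_in_t2:
        return "success" if significant_in_t2 else "lack of power"
    return "major error" if significant_in_t2 else "minor error"


# Hypothetical values, one call per outcome in the table.
examples = [
    classify(0.05, 0.20, 0.04, 0.01),   # T1 not significant
    classify(0.05, 0.01, 0.02, 0.30),   # same direction, T2 not significant
    classify(0.05, 0.01, -0.02, 0.30),  # reversed, T2 not significant
    classify(0.05, 0.01, 0.06, 0.02),   # same direction, T2 significant
    classify(0.05, 0.01, -0.06, 0.02),  # reversed, T2 significant
]
```

Applying the function at several α levels to the same p-values gives one set of outcome counts per level, which is what the rates are computed from.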
Dublin, Ireland · 30th July 2013 Supported by ACM SIGIR Student Travel Grant
[Figure: six panels, each plotting a rate against the significance level α (.001 to .1; .0001 to .5 in the last panel)]
• Non-significance rate: non-significants / total
• Success rate: successes / total significants
• Lack of power rate: lacks of power / total significants
• Minor error rate: minor errors / total significants
• Major error rate: major errors / total significants
• Global error rate: minor and major errors / total significants
Legend: t-test, permutation, bootstrap, Wilcoxon, sign; y = x
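The axis labels of the plots define each rate as a ratio of outcome counts. A minimal sketch of that arithmetic follows, assuming the outcomes are available as a list of labels for one test at one α level; the function name `outcome_rates` and the example counts are hypothetical, not results from the study.

```python
from collections import Counter


def outcome_rates(outcomes):
    """Turn a list of outcome labels into the six plotted rates."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    significants = total - counts["non-significance"]

    def per_significant(n):
        # All rates except the first are relative to the significant results in T1.
        return n / significants if significants else float("nan")

    return {
        "non-significance": counts["non-significance"] / total,
        "success": per_significant(counts["success"]),
        "lack of power": per_significant(counts["lack of power"]),
        "minor error": per_significant(counts["minor error"]),
        "major error": per_significant(counts["major error"]),
        "global error": per_significant(
            counts["minor error"] + counts["major error"]
        ),
    }


# Hypothetical counts for illustration only.
outcomes = (
    ["non-significance"] * 50
    + ["success"] * 40
    + ["lack of power"] * 8
    + ["minor error"] * 2
)
rates = outcome_rates(outcomes)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Under the exactness criterion, the global error rate would be compared with the nominal α, which is presumably what the y = x entry in the legend is for.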
Take-Home Messages
• Power: bootstrap test gives more significant results
• Safety: t-test gives fewer errors
• Exactness: Wilcoxon test best tracks the nominal level
• The permutation test is not optimal in practice
• Error rates seem lower than expected; focus on power
Previous Work
Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06
• Wilcoxon more powerful than t-test, but more errors
Smucker et al. ’07, ’09
• bootstrap test overly powerful, though similar to t-test and permutation
• Wilcoxon and sign unreliable, should use permutation