
A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation


Abstract

Previous research has suggested the permutation test as the theoretically optimal statistical significance test for IR evaluation, and advocated for the discontinuation of the Wilcoxon and sign tests. We present a large-scale study comprising nearly 60 million system comparisons showing that in practice the bootstrap, t-test and Wilcoxon test outperform the permutation test under different optimality criteria. We also show that actual error rates seem to be lower than the theoretically expected 5%, further confirming that we may actually be underestimating significance.


Julián Urbano, Mónica Marrero and Diego Martín

Department of Computer Science · University Carlos III of Madrid

The problem: is system A more effective than system B?

The drill: evaluate with a test collection and run a statistical significance test

The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation?

The reason: test assumptions are violated, so which one is optimal in practice?

Three criteria: power (maximize the number of significant results), safety (minimize the number of errors), exactness (keep the error rate at the nominal α)
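
A minimal sketch (assuming Python with NumPy and SciPy; not the authors' code) of how the five p-values could be obtained for one pair of systems from paired per-topic scores such as average precision. The helper name p_values, the random sign-flip statistic for the permutation test, and the shift-to-null bootstrap are illustrative assumptions:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_values(a, b, n_resamples=10000):
    """p-values of the five tests for paired per-topic scores of systems A and B."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = a - b                                    # per-topic score differences
    p = {}
    p["t-test"] = stats.ttest_rel(a, b).pvalue   # paired Student's t-test
    p["Wilcoxon"] = stats.wilcoxon(a, b).pvalue  # Wilcoxon signed-rank test
    nz = d[d != 0]                               # the sign test drops tied topics
    p["sign"] = stats.binomtest(int((nz > 0).sum()), nz.size, 0.5).pvalue
    # Permutation (randomization) test: random sign flips of the paired scores
    p["permutation"] = stats.permutation_test(
        (a, b), lambda x, y: np.mean(x - y),
        permutation_type="samples", n_resamples=n_resamples).pvalue
    # Bootstrap test: resample the differences shifted to the null (mean zero)
    d0 = d - d.mean()
    boots = np.array([rng.choice(d0, d0.size, replace=True).mean()
                      for _ in range(n_resamples)])
    p["bootstrap"] = np.mean(np.abs(boots) >= abs(d.mean()))
    return p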

Data and Methods

• TREC Robust 2004: 100 topics from Ad Hoc 7 and 8

o 110 runs, 5995 pairs of systems

• Randomly split the topics into T1 and T2, as if they were two collections

o Evaluate all runs and compute p-values

o Compare p-values from T1 with p-values from T2

o 1000 trials, 12M p-values per test, 60M in total

• Interpret pairs of p-values for different α levels (see the table and the sketch below)

              T2: A ≻ B        T2: A ≺ B      T2: A ≻≻ B     T2: A ≺≺ B
T1: A ≻ B     Non-significance
T1: A ≻≻ B    Lack of power    Minor error    Success        Major error
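
A minimal sketch of one trial of this protocol, reusing the p_values helper from the sketch above. The data layout (per-topic scores as NumPy arrays keyed by run name), the function names, and the use of the sign of the mean difference to determine direction are assumptions for illustration, not the authors' actual code:

from itertools import combinations
import numpy as np

def classify(p1, d1, p2, d2, alpha):
    """Interpret a (T1, T2) pair of p-values as in the table above;
    d1 and d2 are the mean score differences A - B on each topic half."""
    if p1 >= alpha:
        return "non-significance"                # not significant on T1
    same_direction = (d1 > 0) == (d2 > 0)
    if p2 >= alpha:
        return "lack of power" if same_direction else "minor error"
    return "success" if same_direction else "major error"

def one_trial(scores, n_topics=100, alpha=0.05, seed=0):
    """Split the topics into halves T1 and T2, compare every pair of runs on
    both halves with every test, and classify each outcome."""
    rng = np.random.default_rng(seed)
    topics = np.arange(n_topics)
    t1 = rng.choice(topics, size=n_topics // 2, replace=False)
    t2 = np.setdiff1d(topics, t1)                # the other half, as a second collection
    outcomes = []
    for a, b in combinations(scores, 2):         # all 5995 pairs of the 110 runs
        p1 = p_values(scores[a][t1], scores[b][t1])
        p2 = p_values(scores[a][t2], scores[b][t2])
        d1 = scores[a][t1].mean() - scores[b][t1].mean()
        d2 = scores[a][t2].mean() - scores[b][t2].mean()
        for test in p1:
            outcomes.append((test, classify(p1[test], d1, p2[test], d2, alpha)))
    return outcomes

Aggregating the outcomes over 1000 such trials, and dividing by the total number of significant results, would give rates like those plotted below.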

Dublin, Ireland · 30th July 2013 · Supported by ACM SIGIR Student Travel Grant

[Figure: six plots of outcome rates against the significance level α (.001 to .1, up to .5 for the global error rate), with one curve per test (t-test, permutation, bootstrap, Wilcoxon, sign):
• Non-significance rate: non-significants / total comparisons
• Success rate: successes / total significants
• Lack of power rate: lacks of power / total significants
• Minor error rate: minor errors / total significants, with a y = x reference line
• Major error rate: major errors / total significants
• Global error rate: minor and major errors / total significants]

Take-Home Messages

• Power: the bootstrap test yields the largest number of significant results

• Safety: the t-test makes the fewest errors

• Exactness: the Wilcoxon test best tracks the nominal α level

• The permutation test is not optimal in practice

• Error rates seem lower than theoretically expected, so the focus should be on power

Previous Work

Zobel ’98, Sanderson & Zobel ’05, Cormack & Lynam ’06

• The Wilcoxon test is more powerful than the t-test, but makes more errors

Smucker et al. ’07, ’09

• The bootstrap test is overly powerful, though similar to the t-test and permutation test

• The Wilcoxon and sign tests are unreliable; the permutation test should be used instead