
A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation


Abstract

Previous research has suggested the permutation test as the theoretically optimal statistical significance test for IR evaluation, and advocated for the discontinuation of the Wilcoxon and sign tests. We present a large-scale study comprising nearly 60 million system comparisons showing that in practice the bootstrap, t-test and Wilcoxon test outperform the permutation test under different optimality criteria. We also show that actual error rates seem to be lower than the theoretically expected 5%, further confirming that we may actually be underestimating significance.


Julián Urbano, Mónica Marrero and Diego Martín

Department of Computer Science · University Carlos III of Madrid

The problem: is system A more effective than system B?

The drill: evaluate with a test collection and run a statistical significance test

The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation?

The reason: test assumptions are violated, so which one is optimal in practice?

Three criteria: power (maximize the number of significant results), safety (minimize the number of errors), exactness (keep the error rate at the nominal α)
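
A minimal sketch (assuming Python with NumPy and SciPy; not the authors' code) of how the five p-values could be obtained for one pair of systems from paired per-topic scores such as average precision. The helper name p_values, the random sign-flip statistic for the permutation test, and the shift-to-null bootstrap are illustrative assumptions:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_values(a, b, n_resamples=10000):
    """p-values of the five tests for paired per-topic scores of systems A and B."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = a - b                                    # per-topic score differences
    p = {}
    p["t-test"] = stats.ttest_rel(a, b).pvalue   # paired Student's t-test
    p["Wilcoxon"] = stats.wilcoxon(a, b).pvalue  # Wilcoxon signed-rank test
    nz = d[d != 0]                               # the sign test drops tied topics
    p["sign"] = stats.binomtest(int((nz > 0).sum()), nz.size, 0.5).pvalue
    # Permutation (randomization) test: random sign flips of the paired scores
    p["permutation"] = stats.permutation_test(
        (a, b), lambda x, y: np.mean(x - y),
        permutation_type="samples", n_resamples=n_resamples).pvalue
    # Bootstrap test: resample the differences shifted to the null (mean zero)
    d0 = d - d.mean()
    boots = np.array([rng.choice(d0, d0.size, replace=True).mean()
                      for _ in range(n_resamples)])
    p["bootstrap"] = np.mean(np.abs(boots) >= abs(d.mean()))
    return p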

Data and Methods

• TREC Robust 2004: 100 topics from Ad Hoc 7 and 8

o 110 runs, 5995 pairs of systems

• Randomly split the topics into T1 and T2, as if they were two collections

o Evaluate all runs and compute p-values

o Compare p-values from T1 with p-values from T2

o 1000 trials, 12M p-values per test, 60M in total

• Interpret pairs of p-values for different α levels (see the table and the sketch below)

              T2: A ≻ B        T2: A ≺ B      T2: A ≻≻ B     T2: A ≺≺ B
T1: A ≻ B     Non-significance
T1: A ≻≻ B    Lack of power    Minor error    Success        Major error
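
A minimal sketch of one trial of this protocol, reusing the p_values helper from the sketch above. The data layout (per-topic scores as NumPy arrays keyed by run name), the function names, and the use of the sign of the mean difference to determine direction are assumptions for illustration, not the authors' actual code:

from itertools import combinations
import numpy as np

def classify(p1, d1, p2, d2, alpha):
    """Interpret a (T1, T2) pair of p-values as in the table above;
    d1 and d2 are the mean score differences A - B on each topic half."""
    if p1 >= alpha:
        return "non-significance"                # not significant on T1
    same_direction = (d1 > 0) == (d2 > 0)
    if p2 >= alpha:
        return "lack of power" if same_direction else "minor error"
    return "success" if same_direction else "major error"

def one_trial(scores, n_topics=100, alpha=0.05, seed=0):
    """Split the topics into halves T1 and T2, compare every pair of runs on
    both halves with every test, and classify each outcome."""
    rng = np.random.default_rng(seed)
    topics = np.arange(n_topics)
    t1 = rng.choice(topics, size=n_topics // 2, replace=False)
    t2 = np.setdiff1d(topics, t1)                # the other half, as a second collection
    outcomes = []
    for a, b in combinations(scores, 2):         # all 5995 pairs of the 110 runs
        p1 = p_values(scores[a][t1], scores[b][t1])
        p2 = p_values(scores[a][t2], scores[b][t2])
        d1 = scores[a][t1].mean() - scores[b][t1].mean()
        d2 = scores[a][t2].mean() - scores[b][t2].mean()
        for test in p1:
            outcomes.append((test, classify(p1[test], d1, p2[test], d2, alpha)))
    return outcomes

Aggregating the outcomes over 1000 such trials, and dividing by the total number of significant results, would give rates like those plotted below.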

Dublin, Ireland · 30th July 2013 · Supported by ACM SIGIR Student Travel Grant

[Figure: six plots of outcome rates against the significance level α (.001 to .1, up to .5 for the global error rate), with one curve per test (t-test, permutation, bootstrap, Wilcoxon, sign):
• Non-significance rate: non-significants / total comparisons
• Success rate: successes / total significants
• Lack of power rate: lacks of power / total significants
• Minor error rate: minor errors / total significants, with a y = x reference line
• Major error rate: major errors / total significants
• Global error rate: minor and major errors / total significants]

Take-Home Messages

• Power: the bootstrap test yields the largest number of significant results

• Safety: the t-test makes the fewest errors

• Exactness: the Wilcoxon test best tracks the nominal α level

• The permutation test is not optimal in practice

• Error rates seem lower than theoretically expected, so the focus should be on power

Previous Work

Zobel ’98, Sanderson & Zobel ’05, Cormack & Lynam ’06

• The Wilcoxon test is more powerful than the t-test, but makes more errors

Smucker et al. ’07, ’09

• The bootstrap test is overly powerful, though similar to the t-test and permutation test

• The Wilcoxon and sign tests are unreliable; the permutation test should be used instead