A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval...


DESCRIPTION

Previous research has suggested the permutation test as the theoretically optimal statistical significance test for IR evaluation, and advocated for the discontinuation of the Wilcoxon and sign tests. We present a large-scale study comprising nearly 60 million system comparisons showing that in practice the bootstrap, t-test and Wilcoxon test outperform the permutation test under different optimality criteria. We also show that actual error rates seem to be lower than the theoretically expected 5%, further confirming that we may actually be underestimating significance.


A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation

Julián Urbano, Mónica Marrero and Diego Martín

Department of Computer Science · University Carlos III of Madrid

The problem: is system A more effective than system B?

The drill: evaluate with a test collection and run a statistical significance test

The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation?

The reason: test assumptions are violated, so which one is optimal in practice?

Three criteria: power (maximize the number of significant results), safety (minimize the number of errors), exactness (keep the error rate at α)
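Of the five candidate tests, the two resampling-based ones can be sketched with the Python standard library alone. The sketch below is illustrative and is not the authors' implementation: the function names, the number of resamples, the "+1" smoothing of the p-value and the choice of the shift variant of the bootstrap are all assumptions, and the score differences are made up.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p(diffs, rng, samples=2000):
    """Two-sided paired randomization test.

    Under the null hypothesis the sign of each per-topic difference is
    arbitrary, so we flip signs at random and count how often the mean
    is at least as extreme as the observed one.
    """
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(samples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # +1 smoothing (an assumption) keeps the estimate away from exactly 0.
    return (hits + 1) / (samples + 1)


def bootstrap_p(diffs, rng, samples=2000):
    """Two-sided paired bootstrap, shift variant (an assumed choice).

    Resample the differences with replacement, centre the resampled
    means on zero by subtracting the observed mean, and count how often
    the centred mean is at least as extreme as the observed one.
    """
    observed = mean(diffs)
    hits = 0
    for _ in range(samples):
        resample = [rng.choice(diffs) for _ in diffs]
        if abs(mean(resample) - observed) >= abs(observed):
            hits += 1
    return (hits + 1) / (samples + 1)


rng = random.Random(42)
# Hypothetical per-topic score differences between systems A and B.
diffs = [0.05, 0.10, 0.02, 0.08, 0.12, 0.03, 0.07, 0.09, 0.04, 0.11]
print(permutation_p(diffs, rng), bootstrap_p(diffs, rng))
```

Both functions take the per-topic differences A minus B, so the same input can be passed to any other paired test for comparison.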

Data and Methods

• TREC Robust 2004: 100 topics from Ad Hoc 7 and 8

o 110 runs, 5995 pairs of systems

• Randomly split topics into T1 and T2, as if they were two collections

o Evaluate all runs and compute p-values

o Compare p-values from T1 with p-values from T2

o 1000 trials, 12M p-values per test, 60M in total

• Interpret pairs of p-values for different α levels
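One trial of the split-half procedure listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the topics are split into two equal halves, the sign test stands in for whichever test is being studied, the names `sign_test_p` and `split_half_trial` are hypothetical, and the scores are random numbers rather than TREC results.

```python
import random
from itertools import combinations
from math import comb


def sign_test_p(diffs):
    """Two-sided exact sign test on paired differences (ties are dropped)."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(pos, neg) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def split_half_trial(scores, rng, test=sign_test_p):
    """Split topics at random into T1 and T2 and test every pair of runs on each.

    scores maps a run name to its per-topic scores (same topic order for
    all runs). Returns {(a, b): ((mean_diff_T1, p_T1), (mean_diff_T2, p_T2))}.
    """
    n_topics = len(next(iter(scores.values())))
    topics = list(range(n_topics))
    rng.shuffle(topics)
    t1, t2 = topics[: n_topics // 2], topics[n_topics // 2:]
    results = {}
    for a, b in combinations(sorted(scores), 2):
        halves = []
        for half in (t1, t2):
            diffs = [scores[a][t] - scores[b][t] for t in half]
            halves.append((sum(diffs) / len(diffs), test(diffs)))
        results[(a, b)] = tuple(halves)
    return results


# Hypothetical data: 5 runs with 100 random per-topic scores each.
rng = random.Random(0)
scores = {f"run{i}": [rng.random() for _ in range(100)] for i in range(5)}
trial = split_half_trial(scores, rng)
print(len(trial))    # 10 pairs for 5 runs
print(comb(110, 2))  # 5995 pairs for the 110 runs on the poster
```

The counts on the poster follow from the same arithmetic: 5995 pairs, two halves and 1000 trials give roughly 12 million p-values per test, and five tests give roughly 60 million in total.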

                 T2: A ≻ B        T2: A ≺ B      T2: A ≻≻ B   T2: A ≺≺ B

T1: A ≻ B        Non-significance (whatever the T2 result)

T1: A ≻≻ B       Lack of power    Minor error    Success      Major error
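The interpretation of a pair of results can be written as a small decision function. This sketch assumes that ≻≻ and ≺≺ denote statistically significant differences, that "significant" means p ≤ α, and that the direction is given by the sign of the mean score difference; the function name and signature are hypothetical.

```python
def classify(d1, p1, d2, p2, alpha=0.05):
    """Classify one system pair from its T1 and T2 results.

    d1, d2: mean score difference (A minus B) on T1 and T2.
    p1, p2: p-values of the significance test on T1 and T2.
    """
    if p1 > alpha:
        # Nothing was claimed on T1, so T2 cannot confirm or refute it.
        return "non-significance"
    # A zero difference on either half counts as disagreement (an assumption).
    same_direction = d1 * d2 > 0
    significant_on_t2 = p2 <= alpha
    if same_direction:
        return "success" if significant_on_t2 else "lack of power"
    return "major error" if significant_on_t2 else "minor error"


# Hypothetical values: significant on T1, same direction but not significant on T2.
print(classify(0.10, 0.01, 0.05, 0.20))  # lack of power
```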

Dublin, Ireland · 30th July 2013 · Supported by ACM SIGIR Student Travel Grant

[Figures: six plots of rate against significance level α, one curve per test (t-test, permutation, bootstrap, Wilcoxon, sign); the legend also shows a y=x reference line]

• Non-significance rate: Non-significants / Total

• Success rate: Successes / Total significants

• Lack of power rate: Lacks of power / Total significants

• Minor error rate: Minor errors / Total significants

• Major error rate: Major errors / Total significants

• Global error rate: Minor and Major errors / Total significants
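The six plotted rates differ in their denominators: the non-significance rate is taken over all comparisons, while the others are taken over the significant ones only. A minimal sketch of that bookkeeping, assuming each comparison has already been labelled with one of the five outcome names (the function name and the sample labels are hypothetical):

```python
from collections import Counter


def outcome_rates(outcomes):
    """Turn a list of outcome labels into the six rates for one test and one α."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes to summarize")
    significants = total - counts["non-significance"]

    def per_significant(n):
        # Undefined when no comparison was significant on T1.
        return n / significants if significants else float("nan")

    errors = counts["minor error"] + counts["major error"]
    return {
        "non-significance": counts["non-significance"] / total,
        "success": per_significant(counts["success"]),
        "lack of power": per_significant(counts["lack of power"]),
        "minor error": per_significant(counts["minor error"]),
        "major error": per_significant(counts["major error"]),
        "global error": per_significant(errors),
    }


# Hypothetical labels for ten comparisons (not results from the study).
outcomes = (["non-significance"] * 4 + ["success"] * 4
            + ["lack of power"] + ["minor error"])
print(outcome_rates(outcomes))
```

Read against the three criteria, power corresponds to a low non-significance rate, safety to a low global error rate, and exactness to a global error rate close to the nominal α.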

Take-Home Messages

• Power: bootstrap test gives more significant results

• Safety: t-test gives fewer errors

• Exactness: Wilcoxon test best tracks the nominal level

• The permutation test is not optimal in practice

• Error rates seem lower than expected; focus on power

Previous Work

Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06

• Wilcoxon more powerful than t-test, but more errors

Smucker et al. ’07, ’09

• bootstrap test overly powerful, though similar to t-test and permutation

• Wilcoxon and sign unreliable, should use permutation
