A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation

Julián Urbano, Mónica Marrero and Diego Martín

Department of Computer Science · University Carlos III of Madrid

The problem: is system A more effective than system B?

The drill: evaluate with a test collection and run a statistical significance test

The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation?

The reason: test assumptions are violated, so which one is optimal in practice?

Three criteria: power (maximize the number of significant results), safety (minimize the number of errors), exactness (keep the error rate at the nominal level α)
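To make the dilemma concrete, here is a minimal sketch of two of the five tests, the sign test and the permutation (randomization) test, applied to paired per-topic scores of systems A and B. It is an illustration only, not the implementation used for the poster: the function names, the number of random trials and the handling of ties are assumptions.

```python
import math
import random


def sign_test(a, b):
    """Two-sided exact sign test on paired scores (assumption: ties are dropped)."""
    pos = sum(x > y for x, y in zip(a, b))
    neg = sum(x < y for x, y in zip(a, b))
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of k or fewer wins for the weaker side under Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def permutation_test(a, b, trials=10000, seed=0):
    """Two-sided paired randomization test, approximated by random sampling.

    Under the null hypothesis the labels A and B are exchangeable within each
    topic, so the sign of every per-topic difference is flipped at random.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    return hits / trials
```

Both functions take two equal-length lists of per-topic scores (for example, average precision per topic) and return a p-value to be compared against α.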

Data and Methods

• TREC Robust 2004: 100 topics from Ad Hoc 7 and 8

o 110 runs, 5995 pairs of systems

• Randomly split the topics into two subsets, T1 and T2, as if they were two collections

o Evaluate all runs and compute p-values

o Compare p-values from T1 with p-values from T2

o 1000 trials, 12M p-values per test, 60M in total

• Interpret pairs of p-values for different α levels

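Notation: A ≻ B means A scores better than B without statistical significance; A ≻≻ B means A is significantly better than B. Each result on T1 is interpreted against the result on T2:

o T1: A ≻ B (any result on T2): Non-significance

o T1: A ≻≻ B, T2: A ≻ B: Lack of power

o T1: A ≻≻ B, T2: A ≺ B: Minor error

o T1: A ≻≻ B, T2: A ≻≻ B: Success

o T1: A ≻≻ B, T2: A ≺≺ B: Major error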
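The split-and-interpret procedure above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the two subsets are assumed to be equal-sized, T1 is assumed to be the reference result, and p ≤ α is assumed to count as significant.

```python
import random


def split_topics(topics, rng):
    """Randomly split topic ids into two disjoint subsets (assumed equal-sized)."""
    shuffled = list(topics)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def classify(d1, p1, d2, p2, alpha):
    """Interpret one pair of results for systems A and B.

    d1, d2: mean score difference (A minus B) on T1 and on T2.
    p1, p2: two-sided p-values on T1 and on T2.
    Assumption: a zero difference on T2 is treated as a change of sign.
    """
    if p1 > alpha:
        return "non-significance"
    same_sign = (d1 > 0) == (d2 > 0)
    if p2 <= alpha:
        return "success" if same_sign else "major error"
    return "lack of power" if same_sign else "minor error"
```

Counting the labels returned by `classify` over all system pairs and trials, separately for each α, would give the per-outcome rates for a given test.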

Dublin, Ireland · 30th July 2013 · Supported by ACM SIGIR Student Travel Grant

[Figure: six panels comparing the t-test, permutation, bootstrap, Wilcoxon and sign tests. Panels: non-significance rate, success rate, lack of power rate, minor error rate, major error rate and global error rate. Horizontal axis: significance level α (.001 to .1; .0001 to .5 for the global error rate).]

Take-Home Messages

• Power: bootstrap test gives more significant results

• Safety: t-test gives fewer errors

• Exactness: Wilcoxon test best tracks the nominal level

• The permutation test is not optimal in practice

• Error rates seem lower than expected; focus on power

Previous Work

Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06

• Wilcoxon more powerful than t-test, but more errors

Smucker et al. ’07, ’09

• bootstrap test overly powerful, though similar to t-test and permutation

• Wilcoxon and sign unreliable, should use permutation
