Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Making these annotations is quite complex and tedious for many Music Information Retrieval tasks, so performing such evaluations requires too much effort. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select what documents to judge so that we can compute estimates of the effectiveness differences between systems with a certain degree of confidence. In this paper we show a first approach towards its application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% to obtain results with 95% confidence.
AdMIRe 2012 · Lyon, France · April 17th (title picture by ERdi43, Wikipedia)
Towards Minimal Test Collections for Evaluation of
Audio Music Similarity and Retrieval
@julian_urbano University Carlos III of Madrid
@m_schedl Johannes Kepler University
Problem
evaluation of IR systems is costly
Annotations
time consuming expensive
boring
(Bad) Consequence
small and biased test collections, unlikely to change from year to year
Solution
apply low-cost evaluation methodologies
[Timeline, 1960 to 2011: evaluation campaigns in IR]
Text IR: SMART (1961-1995), Cranfield 2 (1962-1966), MEDLARS (1966-1967), TREC (1992-today), NTCIR (1999-today), CLEF (2000-today)
Music IR: ISMIR (2000-today), MIREX (2005-today)
nearly 2 decades of meta-evaluation in Text IR
a lot of things have happened here!
some good practices inherited from there
Minimal Test Collections (MTC) [Carterette et al.]
estimate the ranking of systems with very few judgments (high incompleteness)
Application in Audio Music Similarity (AMS)
dozens of volunteers required by MIREX every year to make thousands of judgments
Year   Teams   Systems   Queries   Results   Judgments   Overlap
2006       5         6        60     1,800       1,629       10%
2007       8        12       100     6,000       4,832       19%
2009       9        15       100     7,500       6,732       10%
2010       5         8       100     4,000       2,737       32%
2011      10        18       100     9,000       6,322       30%
evaluation with
incomplete judgments
Basic Idea
treat similarity scores as random variables, which can be estimated with some uncertainty
gain of an arbitrary document: $G_i \sim$ multinomial

$E[G_i] = \sum_{l \in \mathcal{L}} P(G_i = l) \cdot l$

$\mathcal{L}_{BROAD} = \{0, 1, 2\}$   $\mathcal{L}_{FINE} = \{0, 1, \dots, 100\}$
whenever document i is judged with label l:
$E[G_i] = l$, $Var[G_i] = 0$
*all variance formulas in the paper
AG@k is also treated as a random variable
$E[AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot I(A_i \le k)$

iterate all documents in $\mathcal{D}$ (in practice, only the top k retrieved); $A_i$ is the rank at which document i was retrieved
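A minimal sketch of this estimator in Python (the data layout, with expected gains and ranks keyed by document id, is ours, purely for illustration):

```python
# Sketch of E[AG@k] for one system and one query. `expected_gain` maps
# document id -> E[Gi]; `rank` maps document id -> the rank A_i at which
# the system retrieved it. Both structures are illustrative assumptions.
def expected_ag_at_k(expected_gain, rank, k):
    return sum(g for doc, g in expected_gain.items()
               if rank.get(doc, k + 1) <= k) / k
```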
Ultimate Goal
compute a good estimate with the least effort
Comparing Two Systems
$E[\Delta AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot \left( I(A_i \le k) - I(B_i \le k) \right)$
what really matters is the sign of the difference
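By linearity of expectation, the estimate for the difference is simply the difference of the two per-system estimates; continuing the sketch above:

```python
# E[ΔAG@k] = E[AG@k of A] - E[AG@k of B], by linearity of expectation.
def expected_delta_ag_at_k(expected_gain, rank_a, rank_b, k):
    return (expected_ag_at_k(expected_gain, rank_a, k)
            - expected_ag_at_k(expected_gain, rank_b, k))
```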
Evaluating Several Queries
$E[\Delta AG@k] = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} E[\Delta AG@k_q]$

iterate all queries
The Rationale
if $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$ then judge another document, else stop judging
Distribution of AG@k
$P(AG@k = z) := \sum_{\gamma_k \in \Gamma_k} P(AG@k = z \mid \gamma_k) \cdot P(\gamma_k)$

what are the possible assignments of similarity? iterate all possible similarity assignments $\gamma_k$ to the top k documents; $P(\gamma_k)$ ultimately depends on the distribution of $G_i$
Plain English
the proportion of similarity assignments such that AG@k = z
For Complex Measures or Large Similarity Scales
run a Monte Carlo simulation
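For instance, a minimal Monte Carlo sketch (assuming, purely for illustration, independent gains with known label probabilities):

```python
import random

# Monte Carlo approximation of the distribution of AG@k for one query.
# `label_probs[i]` is a dict {label: probability} for the i-th of the
# top-k documents; the uniform probabilities below are an assumption.
def simulate_ag_at_k(label_probs, k, n_sims=10_000):
    samples = []
    for _ in range(n_sims):
        gains = [random.choices(list(p.keys()), weights=list(p.values()))[0]
                 for p in label_probs]
        samples.append(sum(gains) / k)
    return samples

probs = [{0: 1/3, 1: 1/3, 2: 1/3}] * 5           # BROAD scale, uniform for now
sample = simulate_ag_at_k(probs, k=5)
p = sum(s <= 1.0 for s in sample) / len(sample)  # estimate of P(AG@5 <= 1)
```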
Actually, AG@k is a Special Case
let G be the similarity of the top k for all queries
1. take a sample of k documents. Mean = X1
2. take a sample of k documents. Mean = X2
...
Q. take a sample of k documents. Mean = XQ
Mean of sample means = $\bar{X}$
Central Limit Theorem
each $X_q$ is the AG@k of a single query; $\bar{X}$ is the mean AG@k over all queries
as Q→∞, $\bar{X}$ approaches a normal distribution, regardless of the distribution of G
AG@k is Normally Distributed
use the standard normal cumulative distribution function Φ
$P(\Delta AG@k \le 0) = \Phi\!\left( \frac{-E[\Delta AG@k]}{\sqrt{Var[\Delta AG@k]}} \right)$
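A direct sketch of this confidence computation, using only the Python standard library:

```python
from math import sqrt
from statistics import NormalDist

# Confidence that A outperforms B, given the estimated mean and variance
# of ΔAG@k; relies on the CLT normal approximation described above.
def confidence(e_delta, var_delta):
    p_le_zero = NormalDist().cdf(-e_delta / sqrt(var_delta))
    return 1 - p_le_zero  # P(ΔAG@k > 0)
```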
[Figure: density of AG@5 for a query, under the BROAD scale (x axis 0 to 2) and the FINE scale (x axis 0 to 100)]
Confidence as a Function of # Judgments
[Figure: confidence in the ranking of systems (y axis, 50 to 100%) as a function of the percent of judgments made (x axis, 0 to 100%)]
what documents should we judge? those that maximize the confidence
beyond that point, we either keep judging just to be really confident we can stop, or waste our time
The Trick
documents retrieved by both systems in the top k are useless: there is no need to judge them
whatever Gi is, it is added and then subtracted, so it cancels out
Comparing Several Systems
compute a weight wi for each query-document pair; judge the document with the largest effect
wi in the Original MTC
wi = largest weight across system pairs, which reduces to the number of system pairs affected by query-document i
wi Dependent on Confidence
if we are highly confident about a pair of systems, we do not need to judge another of their documents, even if it has the largest weight

$w_i = \sum_{(A,B) \in \mathcal{S} - \mathcal{R}} (1 - C_{A,B}) \cdot \left( I(A_i \le k) - I(B_i \le k) \right)^2$

iterate only system pairs with low confidence ($\mathcal{R}$ being the pairs we are already confident about); each weight is inversely proportional to confidence
better results than traditional weights
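A sketch of these weights (the data layout, a top-k membership table and a confidence map, is an illustrative assumption):

```python
# Confidence-dependent weight of one query-document. `in_top_k[S][doc]`
# says whether system S retrieved `doc` in its top k; `conf[(A, B)]` is
# the current confidence C_{A,B}; `pairs_left` are the pairs not yet
# confidently ranked. All names are illustrative.
def weight(doc, pairs_left, in_top_k, conf):
    w = 0.0
    for A, B in pairs_left:
        effect = int(in_top_k[A][doc]) - int(in_top_k[B][doc])
        w += (1 - conf[(A, B)]) * effect ** 2  # shared documents contribute 0
    return w
```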
MTC for AMS
with AG@k
MTC for ΔAG@k
while $\frac{1}{|\mathcal{S}|} \sum_{(A,B) \in \mathcal{S}} C_{A,B} \le 1 - \alpha$ do    (average confidence on the ranking)
    $i^* \leftarrow \operatorname{argmax}_i w_i$ from all unjudged query-documents    (select the best document)
    judge query-document $i^*$ (obtain the true $gain_{i^*}$)
    $E[G_{i^*}] \leftarrow gain_{i^*}$    $Var[G_{i^*}] \leftarrow 0$    (update: confidence increases)
end while
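The same loop as a Python sketch; `confidence_of` and `judge` stand in for the estimation machinery above and for a human annotator, and are assumptions of ours:

```python
# High-level sketch of the MTC judging loop for AG@k, reusing the
# `weight` function sketched earlier. All names are illustrative.
def mtc_loop(unjudged, pairs, in_top_k, alpha, confidence_of, judge, record):
    while unjudged:
        conf = {p: confidence_of(p) for p in pairs}
        if sum(conf.values()) / len(conf) > 1 - alpha:
            break  # average confidence on the ranking is high enough
        pairs_left = [p for p in pairs if conf[p] <= 1 - alpha]
        doc = max(unjudged, key=lambda d: weight(d, pairs_left, in_top_k, conf))
        unjudged.remove(doc)
        record(doc, judge(doc))  # sets E[G] to the true gain and Var[G] to 0
```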
MTC in MIREX AMS 2011
Why MIREX 2011
largest edition so far: 18 systems (153 pairwise comparisons)
100 queries and 6,322 judgments
Distribution of Gi
let us work with a uniform distribution for now
Confidence as Judgments are Made
correct bins: the estimated sign is correct, or the difference is not significant anyway
high confidence with considerably less effort
Accuracy as Judgments are Made
estimated bins always better than expected
estimated signs highly correlated with confidence
rankings with tau = 0.9 traditionally considered equivalent (same as 95% accuracy)
high confidence and high accuracy with considerably less effort
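As an aside, the rank correlation between estimated and true rankings can be checked directly; a sketch with scipy (the scores below are made up for illustration):

```python
from scipy.stats import kendalltau

# Kendall tau between the true ranking of systems and the ranking
# estimated from incomplete judgments. Scores are illustrative only.
true_scores      = [0.61, 0.58, 0.55, 0.49, 0.42]
estimated_scores = [0.60, 0.59, 0.53, 0.50, 0.41]
tau, _ = kendalltau(true_scores, estimated_scores)  # 1.0 = identical rankings
```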
Statistical Significance
MTC allows us to accurately estimate the ranking, but only for the current set of queries
can we generalize to queries in general?
Not Trivial
we have the variance of the estimates but not the sample variance
Work with Upper and Lower Bounds of ΔAG@k
Upper bound: best case for A. Lower bound: best case for B

$\overline{\Delta AG@k} = \frac{1}{k} \sum_{i \in \pi} G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) + \frac{1}{k} \sum_{i \in \pi^+} l^+ \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \pi^-} l^- \cdot I(B_i \le k \wedge A_i > k)$

$\pi$: query-documents with known judgments
$\pi^+$: unjudged documents retrieved by A, assigned the best similarity score $l^+$
$\pi^-$: unjudged documents retrieved by B but not by A, assigned the worst similarity score $l^-$
*same for the lower bound, swapping the roles of A and B
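A sketch of the upper bound for one query (assuming, as in The Trick, that documents in both top-k lists cancel out; the data layout and names are ours):

```python
# Upper bound on ΔAG@k: best case for A. `judged` maps doc -> known gain;
# `rank_a`/`rank_b` give A_i and B_i; l_plus/l_minus are the best/worst
# labels of the scale (e.g. 2 and 0 on BROAD). Illustrative sketch.
def upper_bound_delta_ag(judged, rank_a, rank_b, k, l_plus, l_minus):
    total = 0.0
    for d in set(rank_a) | set(rank_b):
        in_a = rank_a.get(d, k + 1) <= k
        in_b = rank_b.get(d, k + 1) <= k
        if d in judged:
            total += judged[d] * (int(in_a) - int(in_b))  # known judgments
        elif in_a and not in_b:
            total += l_plus    # unjudged, retrieved by A: assume the best
        elif in_b and not in_a:
            total -= l_minus   # unjudged, retrieved only by B: assume the worst
    return total / k
```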
3 Rules
1. Assume the best case for A (upper bound): if even then A <<< B, conclude A <<< B
2. Assume the best case for B (lower bound): if even then B <<< A, conclude B <<< A
3. If in the best case for A we do not have A >>> B, and in the best case for B we do not have B >>> A, conclude they are not significantly different
Problem
the upper and lower bounds are very unrealistic
Incorporate a Heuristic
4. If the estimated difference is larger than t, naively conclude significance
Choose t Based on Power Analysis
t = the effect size detectable by a t-test with
• sample variance σ² = 0.0615 (from previous MIREX editions)
• sample size n = 100
• Type I error rate α = 0.05 (typical value)
• Type II error rate β = 0.15 (typical value)
t ≈ 0.067
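A back-of-the-envelope check of this number, using the normal approximation to the t-test power calculation and assuming a one-sided test (both simplifications are ours):

```python
from math import sqrt
from statistics import NormalDist

# Smallest detectable difference: t = (z_{1-alpha} + z_{1-beta}) * sigma / sqrt(n)
z = NormalDist().inv_cdf
sigma, n, alpha, beta = sqrt(0.0615), 100, 0.05, 0.15
t = (z(1 - alpha) + z(1 - beta)) * sigma / sqrt(n)
print(round(t, 3))  # ~0.066, close to the exact t-test value of 0.067
```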
Accuracy of the Significance Estimates
pretty good around 95% confidence
rule 4 (the heuristic) ends up overestimating significance
rules 1 to 3 begin to apply and correct the overestimations
closer to expected: never under 90%
significance can be estimated fairly well too
what we did
Introduce MTC to the MIR folks
Work out the Math for MTC with AG@k
See How Well it Would Have Done in AMS 2011: quite well, actually!
what now
Learn the True Distribution of Similarity Judgments
it is clearly not uniform; a better model would give more accurate estimates with less effort
use previous AMS data, or fit a model as we judge

Significance Testing with Incomplete Judgments
the best-case scenarios are very unrealistic

Study Low-Cost Methodologies for Other MIR Tasks
what for
MTC Greatly Reduces the Effort for AMS (and SMS)
have MIREX volunteers incrementally create brand new test collections for other tasks
Better Yet
study low-cost methodologies for the other tasks
Not Only for MIREX
private collections for in-house evaluations, with no possibility of gathering large pools of annotators
low-cost becomes paramount
the MIR community needs a paradigm shift
from a priori to a posteriori evaluation methods
to reduce cost and gain reliability