Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Making these annotations is quite complex and tedious for many Music Information Retrieval tasks, so performing such evaluations requires too much effort. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select what documents to judge so that we can compute estimates of the effectiveness differences between systems with a certain degree of confidence. In this paper we show a first approach towards its application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% to obtain results with 95% confidence.
AdMIRe 2012 · Lyon, France · April 17th (title picture by ERdi43, Wikipedia)
Towards Minimal Test Collections for Evaluation of
Audio Music Similarity and Retrieval
@julian_urbano University Carlos III of Madrid
@m_schedl Johannes Kepler University
Problem
evaluation of IR systems is costly
Annotations
time consuming expensive
boring
(Bad) Consequence
small and biased test collections, unlikely to change from year to year
Solution
apply low-cost evaluation methodologies
[Timeline, 1960 to 2011: evaluation campaigns in IR]
Text IR: SMART (1961-1995), Cranfield 2 (1962-1966), MEDLARS (1966-1967), TREC (1992-today), NTCIR (1999-today), CLEF (2000-today)
Music IR: ISMIR (2000-today), MIREX (2005-today)
nearly 2 decades of meta-evaluation in Text IR
a lot of things have happened here!
some good practices inherited from there
Minimal Test Collections (MTC) [Carterette et al.]
estimate the ranking of systems with very few judgments (high incompleteness)
Application in Audio Music Similarity (AMS)
dozens of volunteers required by MIREX every year to make thousands of judgments
Year   Teams   Systems   Queries   Results   Judgments   Overlap
2006       5         6        60     1,800       1,629       10%
2007       8        12       100     6,000       4,832       19%
2009       9        15       100     7,500       6,732       10%
2010       5         8       100     4,000       2,737       32%
2011      10        18       100     9,000       6,322       30%
evaluation with
incomplete judgments
Basic Idea
treat similarity scores as random variables, which can be estimated with some uncertainty
gain of an arbitrary document: $G_i \sim$ multinomial

$E[G_i] = \sum_{l \in \mathcal{L}} P(G_i = l) \cdot l$

$\mathcal{L}_{BROAD} = \{0, 1, 2\}$   $\mathcal{L}_{FINE} = \{0, 1, \dots, 100\}$
whenever document i is judged with label l:
$E[G_i] = l$, $Var[G_i] = 0$
*all variance formulas in the paper
AG@k is also treated as a random variable
$E[AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot I(A_i \le k)$

iterate all documents in $\mathcal{D}$ (in practice, only the top k retrieved); $A_i$ is the rank at which document i was retrieved
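A minimal sketch of this estimator in Python (the data layout, with expected gains and ranks keyed by document id, is ours, purely for illustration):

```python
# Sketch of E[AG@k] for one system and one query. `expected_gain` maps
# document id -> E[Gi]; `rank` maps document id -> the rank A_i at which
# the system retrieved it. Both structures are illustrative assumptions.
def expected_ag_at_k(expected_gain, rank, k):
    return sum(g for doc, g in expected_gain.items()
               if rank.get(doc, k + 1) <= k) / k
```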
Ultimate Goal
compute a good estimate with the least effort
Comparing Two Systems
$E[\Delta AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot \left( I(A_i \le k) - I(B_i \le k) \right)$
what really matters is the sign of the difference
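By linearity of expectation, the estimate for the difference is simply the difference of the two per-system estimates; continuing the sketch above:

```python
# E[ΔAG@k] = E[AG@k of A] - E[AG@k of B], by linearity of expectation.
def expected_delta_ag_at_k(expected_gain, rank_a, rank_b, k):
    return (expected_ag_at_k(expected_gain, rank_a, k)
            - expected_ag_at_k(expected_gain, rank_b, k))
```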
Evaluating Several Queries
$E[\Delta AG@k] = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} E[\Delta AG@k_q]$

iterate all queries
The Rationale
if $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$ then judge another document, else stop judging
Distribution of AG@k
$P(AG@k = z) := \sum_{\gamma_k \in \Gamma_k} P(AG@k = z \mid \gamma_k) \cdot P(\gamma_k)$

what are the possible assignments of similarity? iterate all possible similarity assignments $\gamma_k$ to the top k documents; $P(\gamma_k)$ ultimately depends on the distribution of $G_i$
Plain English
the proportion of similarity assignments such that AG@k = z
For Complex Measures or Large Similarity Scales
run a Monte Carlo simulation
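For instance, a minimal Monte Carlo sketch (assuming, purely for illustration, independent gains with known label probabilities):

```python
import random

# Monte Carlo approximation of the distribution of AG@k for one query.
# `label_probs[i]` is a dict {label: probability} for the i-th of the
# top-k documents; the uniform probabilities below are an assumption.
def simulate_ag_at_k(label_probs, k, n_sims=10_000):
    samples = []
    for _ in range(n_sims):
        gains = [random.choices(list(p.keys()), weights=list(p.values()))[0]
                 for p in label_probs]
        samples.append(sum(gains) / k)
    return samples

probs = [{0: 1/3, 1: 1/3, 2: 1/3}] * 5           # BROAD scale, uniform for now
sample = simulate_ag_at_k(probs, k=5)
p = sum(s <= 1.0 for s in sample) / len(sample)  # estimate of P(AG@5 <= 1)
```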
Actually, AG@k is a Special Case
let G be the similarity of the top k for all queries
1. take a sample of k documents. Mean = X1
2. take a sample of k documents. Mean = X2
...
Q. take a sample of k documents. Mean = XQ
Mean of sample means = $\bar{X}$
Central Limit Theorem
each $X_q$ is the AG@k of a single query; $\bar{X}$ is the mean AG@k over all queries
as Q→∞, $\bar{X}$ approaches a normal distribution, regardless of the distribution of G
AG@k is Normally Distributed
use the standard normal cumulative distribution function Φ
$P(\Delta AG@k \le 0) = \Phi\!\left( \frac{-E[\Delta AG@k]}{\sqrt{Var[\Delta AG@k]}} \right)$
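A direct sketch of this confidence computation, using only the Python standard library:

```python
from math import sqrt
from statistics import NormalDist

# Confidence that A outperforms B, given the estimated mean and variance
# of ΔAG@k; relies on the CLT normal approximation described above.
def confidence(e_delta, var_delta):
    p_le_zero = NormalDist().cdf(-e_delta / sqrt(var_delta))
    return 1 - p_le_zero  # P(ΔAG@k > 0)
```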
[Figure: density of AG@5 for a query, under the BROAD scale (x axis 0 to 2) and the FINE scale (x axis 0 to 100)]
Confidence as a Function of # Judgments
[Figure: confidence in the ranking of systems (y axis, 50 to 100%) as a function of the percent of judgments made (x axis, 0 to 100%)]
what documents should we judge? those that maximize the confidence
beyond that point, we either keep judging just to be really confident we can stop, or waste our time
The Trick
documents retrieved by both systems in the top k are useless: there is no need to judge them
whatever Gi is, it is added and then subtracted, so it cancels out
Comparing Several Systems
compute a weight wi for each query-document pair; judge the document with the largest effect
wi in the Original MTC
wi = largest weight across system pairs, which reduces to the number of system pairs affected by query-document i
wi Dependent on Confidence
if we are highly confident about a pair of systems, we do not need to judge another of their documents, even if it has the largest weight

$w_i = \sum_{(A,B) \in \mathcal{S} - \mathcal{R}} (1 - C_{A,B}) \cdot \left( I(A_i \le k) - I(B_i \le k) \right)^2$

iterate only system pairs with low confidence ($\mathcal{R}$ being the pairs we are already confident about); each weight is inversely proportional to confidence
better results than traditional weights
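A sketch of these weights (the data layout, a top-k membership table and a confidence map, is an illustrative assumption):

```python
# Confidence-dependent weight of one query-document. `in_top_k[S][doc]`
# says whether system S retrieved `doc` in its top k; `conf[(A, B)]` is
# the current confidence C_{A,B}; `pairs_left` are the pairs not yet
# confidently ranked. All names are illustrative.
def weight(doc, pairs_left, in_top_k, conf):
    w = 0.0
    for A, B in pairs_left:
        effect = int(in_top_k[A][doc]) - int(in_top_k[B][doc])
        w += (1 - conf[(A, B)]) * effect ** 2  # shared documents contribute 0
    return w
```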
MTC for AMS
with AG@k
MTC for ΔAG@k
while $\frac{1}{|\mathcal{S}|} \sum_{(A,B) \in \mathcal{S}} C_{A,B} \le 1 - \alpha$ do    (average confidence on the ranking)
    $i^* \leftarrow \operatorname{argmax}_i w_i$ from all unjudged query-documents    (select the best document)
    judge query-document $i^*$ (obtain the true $gain_{i^*}$)
    $E[G_{i^*}] \leftarrow gain_{i^*}$    $Var[G_{i^*}] \leftarrow 0$    (update: confidence increases)
end while
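The same loop as a Python sketch; `confidence_of` and `judge` stand in for the estimation machinery above and for a human annotator, and are assumptions of ours:

```python
# High-level sketch of the MTC judging loop for AG@k, reusing the
# `weight` function sketched earlier. All names are illustrative.
def mtc_loop(unjudged, pairs, in_top_k, alpha, confidence_of, judge, record):
    while unjudged:
        conf = {p: confidence_of(p) for p in pairs}
        if sum(conf.values()) / len(conf) > 1 - alpha:
            break  # average confidence on the ranking is high enough
        pairs_left = [p for p in pairs if conf[p] <= 1 - alpha]
        doc = max(unjudged, key=lambda d: weight(d, pairs_left, in_top_k, conf))
        unjudged.remove(doc)
        record(doc, judge(doc))  # sets E[G] to the true gain and Var[G] to 0
```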
MTC in MIREX AMS 2011
Why MIREX 2011
largest edition so far: 18 systems (153 pairwise comparisons)
100 queries and 6,322 judgments
Distribution of Gi
let us work with a uniform distribution for now
Confidence as Judgments are Made
correct bins: the estimated sign is correct, or the difference is not significant anyway
high confidence with considerably less effort
Accuracy as Judgments are Made
estimated bins always better than expected
estimated signs highly correlated with confidence
rankings with tau = 0.9 traditionally considered equivalent (same as 95% accuracy)
high confidence and high accuracy with considerably less effort
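As an aside, the rank correlation between estimated and true rankings can be checked directly; a sketch with scipy (the scores below are made up for illustration):

```python
from scipy.stats import kendalltau

# Kendall tau between the true ranking of systems and the ranking
# estimated from incomplete judgments. Scores are illustrative only.
true_scores      = [0.61, 0.58, 0.55, 0.49, 0.42]
estimated_scores = [0.60, 0.59, 0.53, 0.50, 0.41]
tau, _ = kendalltau(true_scores, estimated_scores)  # 1.0 = identical rankings
```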
Statistical Significance
MTC allows us to accurately estimate the ranking, but only for the current set of queries
can we generalize to queries in general?
Not Trivial
we have the variance of the estimates but not the sample variance
Work with Upper and Lower Bounds of ΔAG@k
Upper bound: best case for A. Lower bound: best case for B

$\overline{\Delta AG@k} = \frac{1}{k} \sum_{i \in \pi} G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) + \frac{1}{k} \sum_{i \in \pi^+} l^+ \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \pi^-} l^- \cdot I(B_i \le k \wedge A_i > k)$

$\pi$: query-documents with known judgments
$\pi^+$: unjudged documents retrieved by A, assigned the best similarity score $l^+$
$\pi^-$: unjudged documents retrieved by B but not by A, assigned the worst similarity score $l^-$
*same for the lower bound, swapping the roles of A and B
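A sketch of the upper bound for one query (assuming, as in The Trick, that documents in both top-k lists cancel out; the data layout and names are ours):

```python
# Upper bound on ΔAG@k: best case for A. `judged` maps doc -> known gain;
# `rank_a`/`rank_b` give A_i and B_i; l_plus/l_minus are the best/worst
# labels of the scale (e.g. 2 and 0 on BROAD). Illustrative sketch.
def upper_bound_delta_ag(judged, rank_a, rank_b, k, l_plus, l_minus):
    total = 0.0
    for d in set(rank_a) | set(rank_b):
        in_a = rank_a.get(d, k + 1) <= k
        in_b = rank_b.get(d, k + 1) <= k
        if d in judged:
            total += judged[d] * (int(in_a) - int(in_b))  # known judgments
        elif in_a and not in_b:
            total += l_plus    # unjudged, retrieved by A: assume the best
        elif in_b and not in_a:
            total -= l_minus   # unjudged, retrieved only by B: assume the worst
    return total / k
```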
3 Rules
1. Assume the best case for A (upper bound): if even then A <<< B, conclude A <<< B
2. Assume the best case for B (lower bound): if even then B <<< A, conclude B <<< A
3. If in the best case for A we do not have A >>> B, and in the best case for B we do not have B >>> A, conclude they are not significantly different
Problem
the upper and lower bounds are very unrealistic
Incorporate a Heuristic
4. If the estimated difference is larger than t, naively conclude significance
Choose t Based on Power Analysis
t = the effect size detectable by a t-test with
• sample variance σ² = 0.0615 (from previous MIREX editions)
• sample size n = 100
• Type I error rate α = 0.05 (typical value)
• Type II error rate β = 0.15 (typical value)
t ≈ 0.067
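A back-of-the-envelope check of this number, using the normal approximation to the t-test power calculation and assuming a one-sided test (both simplifications are ours):

```python
from math import sqrt
from statistics import NormalDist

# Smallest detectable difference: t = (z_{1-alpha} + z_{1-beta}) * sigma / sqrt(n)
z = NormalDist().inv_cdf
sigma, n, alpha, beta = sqrt(0.0615), 100, 0.05, 0.15
t = (z(1 - alpha) + z(1 - beta)) * sigma / sqrt(n)
print(round(t, 3))  # ~0.066, close to the exact t-test value of 0.067
```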
Accuracy of the Significance Estimates
pretty good around 95% confidence
rule 4 (the heuristic) ends up overestimating significance
rules 1 to 3 begin to apply and correct the overestimations
closer to expected: never under 90%
significance can be estimated fairly well too
what we did
Introduce MTC to the MIR folks
Work out the Math for MTC with AG@k
See How Well it Would Have Done in AMS 2011: quite well, actually!
what now
Learn the True Distribution of Similarity Judgments
it is clearly not uniform; a better model would give more accurate estimates with less effort
use previous AMS data, or fit a model as we judge

Significance Testing with Incomplete Judgments
the best-case scenarios are very unrealistic

Study Low-Cost Methodologies for Other MIR Tasks
what for
MTC Greatly Reduces the Effort for AMS (and SMS)
have MIREX volunteers incrementally create brand new test collections for other tasks
Better Yet
study low-cost methodologies for the other tasks
Not Only for MIREX
private collections for in-house evaluations, with no possibility of gathering large pools of annotators
low-cost becomes paramount
the MIR community needs a paradigm shift
from a priori to a posteriori evaluation methods
to reduce cost and gain reliability