COSMIR and the OpenMIC Challenge
(the talk previously known as "A Plan for Sustainable Music IR Evaluation")
Brian McFee* Eric Humphrey Julián Urbano**
solving problems with science
Hypothesis (model) ⇄ Experiment (evaluation)
Progress depends on reproducibility
We’ve known this for a while
● Many years of MIREX!
● Lots of participation
● It’s been great for the community
MIREX (cartoon form)
Scientists → Code → MIREX machines (and task captains) + Data (private) → Results
Evaluating the evaluation model
We would not be where we are today without MIREX. But this paradigm faces an uphill battle :’o(
Costs of doing business
● Computer time
● Human labor
● Data collection
Annual sunk costs (proportional to participants)
Best ! for $
*arrows are probably not to scale
The worst thing that could happen is growth!
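The scaling problem can be seen in a toy cost model, a hedged sketch: `annual_cost` and all of its numbers are invented for illustration, not actual MIREX figures.

```python
# Toy cost model (illustrative numbers, not actual MIREX figures):
# every extra submission consumes compute time and task-captain labor,
# so the annual sunk cost grows linearly with participation.

def annual_cost(n_submissions, fixed=10_000.0,
                compute_per_run=50.0, labor_per_run=200.0):
    """Yearly cost given per-submission compute and human-labor costs."""
    return fixed + n_submissions * (compute_per_run + labor_per_run)

# Doubling participation (40 -> 80 -> 160 submissions) keeps adding cost:
for n in (40, 80, 160):
    print(n, annual_cost(n))
```

Under this model the only way to hold cost flat while participation grows is to change the per-submission terms, which is exactly what the distributed proposal below does.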
Limited feedback in the lifecycle
Hypothesis (model) ⇄ Experiment (evaluation)
● Performance metrics (always)
● System outputs (sometimes)
● Input data (almost never)
Stale data implies bias
https://frinkiac.com/caption/S07E24/252468
https://frinkiac.com/caption/S07E24/288671
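The bias from stale data can be simulated directly. This is a sketch with synthetic data: when a test set is never refreshed, the best reported score drifts upward as more systems are submitted, even if every system is guessing at random.

```python
import random

# Sketch of why stale evaluation data biases results: with a fixed,
# never-refreshed test set, even skill-free systems drift upward in
# reported score as the community submits more of them.
# (All data here is synthetic; the effect, not the numbers, is the point.)

random.seed(0)
truth = [random.randint(0, 1) for _ in range(200)]  # frozen "test set"

def accuracy(pred):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def best_of(n_systems, rng):
    """Best reported score among n random-guessing systems."""
    return max(
        accuracy([rng.randint(0, 1) for _ in truth])
        for _ in range(n_systems)
    )

# The leaderboard's top score climbs with submissions, with zero real skill:
for n in (1, 10, 100, 1000):
    print(n, best_of(n, random.Random(42)))
```

The community collectively overfits the unchanging data; retiring and refreshing test data breaks this loop.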
The current model is unsustainable
● Inefficient distribution of labor
● Limited feedback
● Inherent and unchecked bias
What is a sustainable model?
● Kaggle is a data science evaluation community (sound familiar?)
● How it works:
○ Download data
○ Upload predictions
○ Observe results
● The user base is huge:
○ 536,000 registered users
○ 4,000 forum posts per month
○ 3,500 competition submissions per day (!!!)
Distributed computation.
Academic Instances
Related / Prior work in Multimedia
• DCASE – Detection and Classification of Acoustic Scenes and Events
• NIST – Iris Challenge Evaluation, SRE i-Vector Challenge, Face Recognition Grand Challenge, TRECVid, ...
• MediaEval – Emotion in Music, Querying Musical Scores, Soundtrack Selection
Open content
● Participants need unfettered access to audio content
● Without input data, error analysis is impossible
● Creative Commons-licensed music is plentiful on the internet!
○ Jamendo: 500K tracks
○ FMA: 90K tracks
The distributed model is sustainable
● Distributed computation
● Open data means clear feedback
● Efficient allocation of human effort
But what about annotation?
Incremental evaluation
● Performance estimates +/- degree of uncertainty
● Annotate the most informative examples first
○ Beats: [Holzapfel et al., TASLP 2012]
○ Similarity: [Urbano and Schedl, IJMIR 2013]
○ Chords: [Humphrey & Bello, ISMIR 2015]
○ Structure: [Nieto, PhD thesis 2015]
● Which tracks do we annotate for evaluation?
○ None, at first!
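One common way to pick "the most informative examples" is inter-system disagreement. A minimal sketch, assuming single-label toy outputs per system (track names, labels, and values are all invented stand-ins for real submissions):

```python
from collections import Counter

# Sketch of "annotate the most informative examples first" via
# inter-system disagreement. Tracks, labels, and the single-label
# outputs are all toy stand-ins for real system submissions.
outputs = {
    "track_a": ["guitar", "guitar", "guitar", "guitar"],
    "track_b": ["piano", "violin", "piano", "cello"],
    "track_c": ["drums", "drums", "voice", "drums"],
}

def disagreement(labels):
    """1 - (majority share): 0.0 when every system agrees."""
    majority = Counter(labels).most_common(1)[0][1]
    return 1 - majority / len(labels)

# Spend the annotation budget where systems disagree most:
ranked = sorted(outputs, key=lambda t: disagreement(outputs[t]), reverse=True)
print(ranked)  # ['track_b', 'track_c', 'track_a']
```

Tracks where every system agrees contribute little to separating the systems, so they can wait; the cited methods above refine this idea with proper statistical estimates of uncertainty.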
This is already common practice in MIR.
Let’s standardize it!
The evaluation loop
1. Collect CC-licensed music
2. Define tasks
3. ($) Release annotated development set
4. Collect system outputs
5. ($) Annotate informative examples
6. Report scores
7. Retire and release old data
Human costs ($) directly produce data
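The seven steps can be condensed into pseudocode (all names are placeholders, not a real API):

```
loop every evaluation cycle:
    corpus   ← collect CC-licensed music              # 1
    tasks    ← define tasks over corpus               # 2
    dev_set  ← annotate a small seed sample      ($)  # 3
    outputs  ← collect system outputs on corpus       # 4
    eval_set ← annotate most informative tracks  ($)  # 5
    scores   ← score outputs against eval_set         # 6
    release dev_set ∪ eval_set openly; retire it      # 7
```

Note that both paid steps (3 and 5) leave behind reusable open annotations, which is what makes the loop sustainable.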
What are the drawbacks here?
● Loss of algorithmic transparency
● Potential for cheating?
● CC/PD music isn’t “real” enough
● Linking to source makes results verifiable and replicable!
● What’s the incentive for cheating?
● Can we make it impractical?
● Even if people do cheat, we still get the annotations.
● For which tasks?
Okay ... so now what?
The Open-MIC Challenge 2017
Open Music Instrument Classification Challenge
● Task 1 (classification): what instruments are played in this audio file?
● Task 2 (retrieval): in which audio files (from this corpus) is this instrument played?
● Complements what is currently covered in MIREX
● Objective and simple (?) task for annotators
● A large, well-annotated data set would be valuable for the community
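The two tasks are complementary views of one multi-label matrix. A toy sketch (file names, instruments, and values are illustrative, not the actual OpenMIC taxonomy or data):

```python
# Toy multi-label matrix for the two OpenMIC task views: rows are audio
# files, columns are instruments (all names here are illustrative).
files = ["clip1", "clip2", "clip3"]
instruments = ["guitar", "piano", "drums"]
Y = [
    [1, 0, 1],  # clip1: guitar + drums
    [0, 1, 0],  # clip2: piano
    [1, 1, 1],  # clip3: all three
]

# Task 1 (classification) reads a ROW: which instruments are in this file?
clip1_instruments = [i for i, v in zip(instruments, Y[0]) if v]

# Task 2 (retrieval) reads a COLUMN: which files contain this instrument?
piano_files = [f for f, row in zip(files, Y) if row[1]]

print(clip1_instruments)  # ['guitar', 'drums']
print(piano_files)        # ['clip2', 'clip3']
```

A single well-annotated matrix therefore supports both tasks at once, which is part of why the data set would be valuable beyond the challenge itself.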
The Open-MIC Challenge 2017
https://cosmir.github.io/open-mic/
● 20+ collaborators from various groups across academia and industry
Content Management System Backend
● Manages audio content and user annotations
● Flask web application will hide endpoints behind user authentication
● Working with industry to secure support for Google Cloud Platform (App Engine, Datastore)
● Designed to scale with participation / users
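"Hide endpoints behind user authentication" boils down to a gatekeeping pattern. A framework-agnostic sketch (the real backend uses Flask; the token store, `AuthError`, and `get_audio` below are invented purely for illustration):

```python
import functools

# Framework-agnostic sketch of "endpoints hidden behind user
# authentication". The token store, AuthError, and get_audio are
# invented for illustration; the real backend does this in Flask.
VALID_TOKENS = {"secret-token-123"}

class AuthError(Exception):
    """Raised when a request carries no valid credential."""

def require_auth(handler):
    @functools.wraps(handler)
    def wrapper(token, *args, **kwargs):
        if token not in VALID_TOKENS:
            raise AuthError("unauthorized")
        return handler(*args, **kwargs)
    return wrapper

@require_auth
def get_audio(track_id):
    # Stand-in for an endpoint that serves protected audio content.
    return f"audio bytes for {track_id}"

print(get_audio("secret-token-123", "track_42"))
```

In Flask the same idea is typically expressed as a decorator on route handlers, so every content and annotation endpoint is checked before it runs.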
Web Annotator
● Requests audio and instrument taxonomy from CMS backend
● Provides audio playback, instrument tag selection, annotation submission
● Authenticates users for annotator attribution & data quality
● Provides statistics and feedback on individual / overall annotation effort
Web Annotator – SONYC
Web Annotator – Gracenote
Data and Task Definition
● Audio content: green light from Jamendo!
● Instrument taxonomy: converging to ≈25 classes for now
● Currently iterating on task definition, measures, annotation task design, etc.
Roadmap (Tentative)
[Gantt chart, Sept 2016 – Oct 2017: workstreams for Backend, Web Annotator, Audio Ingest, Annotation, Task Definition, Instrument Taxonomy, Evaluation Machinery, and build/release of the development set; milestones include ISMIR and Challenge Open]
● https://cosmir.github.io … https://cosmir.github.io/open-mic/contribute.html
● Watch the GitHub repo, read / comment on issues, submit PRs: https://github.com/cosmir/open-mic
● Ask questions on [email protected]
● Participate in the Open-MIC challenge (coming Summer 2017)
● Talk to us / me! { [email protected] / [email protected] / [email protected] }
Want to get involved?
Thanks!