1. Motivation: Oodles of Music
2. Eigenrhythms & Eigenmelodies
3. Melodic-Harmonic Fragments
4. Other Projects
Mining for the Meaning of Music
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/
LabROSA Overview
[Diagram: signal processing, machine learning, and information extraction applied to speech, music, and environmental audio; tasks include recognition, retrieval, and separation]
1. Motivation: Oodles of Music
• The impact of the iPod
  creates new research questions (music IR)
  but also: provides new tools for old questions
• What can you do with 100k+ tracks?
  around 9 months of listening...
  unsupervised data
"The Meaning of Music"
Two kinds of "meaning":
• What does music evoke in a listener's mind?
  i.e. "what does it all mean?" (metaphysics?)
  study with subjective experiments
  (then build detectors for specific responses...?)
• What phenomena are denoted by "music"?
  i.e. delineate the "set of all music"
  (the ultimate music/nonmusic classifier?)
  ... this talk's topic
Re-used Musical Elements
• "What are the most popular chord progressions?"
  a well-formed question...
  music occupies a small subset of some space
  look at a massive audio archive?
• How can we distill a large collection of music audio into a compact description of what "music" means?
  or at least a vocabulary...
[Figure: scatter of PCA dimensions 3:6 of 12x16 beat-chroma patches]
Potential Applications
• Given a description of the musically valid subspace...
  compression: represent a given piece by its indices/parameters in the subspace
  classification: subspace representation reveals 'essence'; neighbors are interesting
  manipulation: modifications within the space remain musically valid
2. Eigenrhythms: Drum Track Structure
• To first order, all pop music has the same drum track:
• Can we capture this from examples?
  ... including the variations
• Can we exploit it?
  ... for synthesis
  ... for classification
  ... for insight
Ellis & Arroyo, ISMIR '04
Basis Sets
• Dataset reduced to linear combinations of a few basic patterns
  bases H: subspace that spans the data
  weights W: dimension-reduced projection of data
• X = W × H (data = weights × bases)
[Figure: the data matrix X factored as weight matrix W times basis matrix H; values range roughly -1 to 1]
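As a rough illustration (not the original code), here is how a matrix of flattened drum patterns could be factored into weights and bases with off-the-shelf tools; the data here is a random stand-in:

```python
# Sketch: X ~= W @ H for a hypothetical matrix of 100 drum patterns,
# each flattened to 1200 envelope samples (random stand-in data).
import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(0)
X = rng.random((100, 1200))

# PCA: H = principal components ("eigenrhythms"), W = per-pattern weights
pca = PCA(n_components=20)
W = pca.fit_transform(X)              # (100, 20) weights
H = pca.components_                   # (20, 1200) bases
X_hat = W @ H + pca.mean_             # low-rank reconstruction
print("PCA reconstruction MSE:", np.mean((X - X_hat) ** 2))

# NMF: nonnegative W and H ("posirhythms" -- bases can only add)
nmf = NMF(n_components=10, max_iter=500)
W_n = nmf.fit_transform(X)            # (100, 10)
H_n = nmf.components_                 # (10, 1200)
```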
Different basis projections
• Principal Component Analysis (PCA)
  optimizes MSE of low-D reconstruction
• Independent Component Analysis (ICA)
  projections are independent (cf. decorrelated)
• Linear Discriminant Analysis (LDA)
  given class labels for data, find dimensions to separate them
• Nonnegative Matrix Factorization (NMF)
  each basis function only adds energy in (no cancellation)
[Figure: example 2-D basis vectors found by ICA, PCA, LDA, and NMF]
Data
[Figure: pseudo-envelope drum tracks for BD, SN, HH over 100-300 samples]
• Drum tracks extracted from MIDI
  100 examples (10 × 10 genre classes)
  fixed mapping to 3 instruments: bass drum (BD), snare (SN), hi-hat (HH)
  temporary proxy for audio transcription...
• Pseudo-envelope representation
  40 ms half-Gauss window sampled at 200 Hz
• Extract just one pattern from each MIDI file
  looking for variety, quantity not a problem
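A minimal sketch of the pseudo-envelope idea, assuming each drum hit becomes a decaying half-Gaussian bump on a 200 Hz grid (function and parameter names are hypothetical):

```python
# Sketch (assumed details): each onset contributes a half-Gaussian
# (~40 ms wide) on a 200 Hz time grid, giving a smooth stand-in for
# a transcribed onset train.
import numpy as np

def pseudo_envelope(onset_times, dur_s, sr=200, width_s=0.040):
    """Sum of half-Gaussian windows placed at each onset time."""
    t = np.arange(int(dur_s * sr)) / sr
    env = np.zeros_like(t)
    for on in onset_times:
        tail = t[t >= on] - on                 # only the decaying half
        env[t >= on] += np.exp(-0.5 * (tail / width_s) ** 2)
    return np.minimum(env, 1.0)                # clip overlapping bumps

# one envelope per instrument: bass drum, snare, hi-hat
bd = pseudo_envelope([0.0, 0.5, 1.0, 1.5], dur_s=2.0)
sn = pseudo_envelope([0.25, 0.75, 1.25, 1.75], dur_s=2.0)
print(bd.shape)                                # (400,) samples at 200 Hz
```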
Aligning Data: Tempo
• Need to align patterns prior to PCA/...
• First, normalize tempo
  autocorrelation gives BPM candidates
  keep them all for now...
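One plausible reading of the autocorrelation step, keeping several BPM candidates rather than committing to one (the peak-picking details are assumptions):

```python
# Sketch: read BPM candidates off the autocorrelation of an onset
# envelope sampled at 200 Hz.
import numpy as np
from scipy.signal import find_peaks

def bpm_candidates(env, sr=200, bpm_lo=60, bpm_hi=240, n=3):
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]  # lags >= 0
    lo, hi = int(sr * 60 / bpm_hi), int(sr * 60 / bpm_lo)    # valid lag range
    peaks, _ = find_peaks(ac[lo:hi])
    top = peaks[np.argsort(ac[lo:hi][peaks])[::-1][:n]] + lo # n biggest peaks
    return 60.0 * sr / top                                   # lags -> BPM
```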
Aligning Data: Downbeat
• Downbeat from best match of tempo-normalized pattern to mean template
  try every tempo hypothesis, choose best match
[Figure: original pattern and tempo-scaled, downbeat-aligned version]
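A sketch of the alignment search, assuming circular shifts of a tempo-normalized two-bar pattern scored against a mean template; in the full method this would be repeated for every tempo hypothesis and the best overall match kept:

```python
# Sketch: pick the downbeat as the circular shift of a pattern that
# best correlates with a mean template.
import numpy as np

def align_downbeat(pattern, template):
    """pattern, template: (n_instruments, n_samples) over two bars.
    Returns the best-aligned pattern and the shift used."""
    n = pattern.shape[1]
    scores = [np.sum(np.roll(pattern, -s, axis=1) * template)
              for s in range(n)]
    best = int(np.argmax(scores))
    return np.roll(pattern, -best, axis=1), best
```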
Aligned Data
• Tempo normalization + downbeat alignment → 100 excerpts (2 bars each)
• Can now extract basis projection(s)
Eigenrhythms (PCA)
• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)
• Eigenrhythms both add and subtract
Posirhythms (NMF)
• Nonnegative: only adds beat-weight
• Capturing some structure
[Figure: posirhythms 1-6, each showing BD/SN/HH weight vs. time (samples @ 200 Hz; beats 1-4 @ 120 BPM)]
Eigenrhythms for Classification
• Projections in eigenspace / LDA space
• 10-way genre classification (nearest neighbor):
  PCA3: 20% correct
  LDA4: 36% correct
[Figure: PCA(1,2) projection (16% correct) and LDA(1,2) projection (33% correct), colored by genre: blues, country, disco, hiphop, house, newwave, rock, pop, punk, rnb]
Eigenrhythm BeatBox
• Resynthesize rhythms from eigen-space
Eigenmelodies?
• Can we do a similar thing with melodies?
• Cluster 'fragments' that recur in melodies
  ... across a large music database
  ... one way to get fragment alignment?
  ... trade data for model sophistication
• Data sources
  pitch tracker, or MIDI training data
• Melody fragment representation
  DCT(1:20) - removes average, smoothes detail
[Pipeline: training data → melody extraction → 5-second fragments → VQ clustering → top clusters]
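A sketch of the fragment representation and VQ step described above, using DCT coefficients 1-20 (dropping coefficient 0 removes the average, i.e. transposition) and k-means as the quantizer; the contours here are synthetic:

```python
# Sketch: DCT(1:20) features for pitch-contour fragments, then k-means
# vector quantization; cluster occupancy identifies the "top clusters".
import numpy as np
from scipy.fftpack import dct
from sklearn.cluster import KMeans

def fragment_features(contours):
    """contours: (n_fragments, n_frames) of pitch values (e.g. 5-s windows)."""
    coeffs = dct(contours, type=2, norm="ortho", axis=1)
    return coeffs[:, 1:21]          # drop DC (average); keep smooth detail

rng = np.random.default_rng(1)
contours = np.cumsum(rng.standard_normal((500, 100)), axis=1)  # fake melodies
feats = fragment_features(contours)
km = KMeans(n_clusters=32, n_init=10).fit(feats)
print(np.bincount(km.labels_)[:5])  # per-cluster counts
```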
Melody Clustering
• Clusters match underlying contour:
• Some interesting matches:
  e.g. Pink + Nsync
3. Melodic-Harmonic Fragments
• Can we use the subspace and clustering ideas with our oodles of music?
  use lots of real music audio
  capture both melodic and harmonic context...
• Goal: dictionary of common motifs (clichés)
  build up into longer sequences
  reveal quotes & inspirations, genre/style idioms
• Questions
  what representation and similarity measure?
  what clustering scheme?
  tractability: how large can we go?
Finding Common Fragments
• Chop up music into short descriptions of musical content
  24-beat beat-chroma matrices
• Choose a few at "starts" (landmarks)
• Put into LSH table
  similar items fall in same bin
• Find the bins with most entries
  = most commonly reused motifs
[Pipeline: music audio → beat tracking → chroma features → key normalization → landmark identification → locality-sensitive hash table]
Beat Tracking
• Goal: per-'beat' (tatum) feature vector
  for tempo normalization, efficiency
• "Onset Strength Envelope"
  O(t) = Σf max(0, Δt log |X(t, f)|)
• Autocorrelation + window → global tempo estimate
[Figure: mel spectrogram (0-15 s), onset strength envelope, and windowed autocorrelation over 0-900 lags (4 ms samples), peak at 168.5 BPM]
Ellis '06, '07
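A minimal sketch of the onset strength envelope; the published system uses a mel spectrogram, whereas plain STFT bins are used here for brevity:

```python
# Sketch of O(t) = sum_f max(0, delta_t log|X(t,f)|): half-wave rectified
# first difference of a log-magnitude spectrogram, summed over frequency.
import numpy as np
from scipy.signal import stft

def onset_strength(x, sr, n_fft=1024, hop=256):
    _, _, X = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    logmag = np.log(np.abs(X) + 1e-8)
    diff = np.diff(logmag, axis=1)             # delta along time
    return np.maximum(0.0, diff).sum(axis=0)   # rectify, sum over frequency
```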
Beat Tracking (2)
• Dynamic programming finds beat times {ti}
  optimizes Σi O(ti) + α Σi F(ti+1 − ti, τp)
  where O(t) is the onset strength envelope (local score)
  F(Δt, τp) is a log-Gaussian window (transition cost)
  τp is the default beat period per measured tempo
  incrementally find best predecessor at every time
  backtrace from largest final score to get beats
  C*(t) = O(t) + max_τ {α F(t − τ, τp) + C*(τ)}
  P(t) = argmax_τ {α F(t − τ, τp) + C*(τ)}
[Figure: O(t) and C*(t) over time, with best predecessor τ for each t]
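A sketch of the recurrence above; the search window, α, and the exact log-Gaussian form are illustrative choices, not the published parameters:

```python
# Sketch: DP beat tracking via
# C*(t) = O(t) + max_tau{alpha*F(t - tau, tau_p) + C*(tau)}.
import numpy as np

def track_beats(O, period, alpha=100.0):
    """O: onset envelope (one value per frame); period: beat period in frames."""
    n = len(O)
    C = np.copy(O)                        # best cumulative score ending at t
    P = np.full(n, -1)                    # best predecessor frame
    for t in range(1, n):
        lo, hi = max(0, t - 2 * period), max(0, t - period // 2)
        taus = np.arange(lo, hi)
        if len(taus) == 0:
            continue
        F = -np.log((t - taus) / period) ** 2     # log-Gaussian window
        scores = alpha * F + C[taus]
        best = int(np.argmax(scores))
        C[t] = O[t] + scores[best]
        P[t] = taus[best]
    beats, t = [], int(np.argmax(C))      # backtrace from the largest score
    while t >= 0:
        beats.append(t)
        t = P[t]
    return beats[::-1]                    # beat frames in time order
```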
Beat Tracking (3)
• DP will bridge gaps (non-causal)
  there is always a best path...
• 2nd place in MIREX 2006 Beat Tracking
  compared to McKinney & Moelants human data
[Figure: "Alanis Morissette - All I Want - gap + beats" spectrogram (freq / Bark band, 182-192 s) and "test 2 (Bragg) - McKinney + Moelants subject data" beat times per subject (0-15 s)]
Chroma Features
• Chroma features convert spectral energy into musical weights in a canonical octave
  i.e. 12 semitone bins
• Can resynthesize as "Shepard tones"
  all octaves at once
[Figure: piano chromatic scale spectrogram (0-4 kHz), IF chroma (bins A-G vs. time/frames), 12 Shepard tone spectra (level / dB vs. 0-2500 Hz), and Shepard tone resynthesis spectrogram]
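A simplified sketch of chroma computation, mapping each STFT bin to its nearest semitone and folding octaves onto 12 bins (the real features use instantaneous-frequency tracking, so this nearest-bin mapping is an assumption):

```python
# Sketch: fold STFT spectral energy onto 12 semitone (chroma) bins.
import numpy as np
from scipy.signal import stft

def chromagram(x, sr, n_fft=4096, hop=1024, fmin=55.0):
    f, _, X = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(X)
    use = (f >= fmin)                             # skip DC / very low bins
    semis = 12.0 * np.log2(f[use] / 440.0)        # semitones relative to A4
    bins = np.round(semis).astype(int) % 12       # fold all octaves into 12
    C = np.zeros((12, mag.shape[1]))
    for b in range(12):
        C[b] = mag[use][bins == b].sum(axis=0)
    return C / (C.max() + 1e-8)                   # normalized chromagram
```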
Beat-Chroma Features
• Beat + chroma features / 30 ms frames → average chroma within each beat
  compact; sufficient?
! "# "! $# $! %# %!
$
&
'
(
"#
"$
"#
$#
%#
&#
$
&
'
(
"#
"$
# ! "# "!)*+,-.-/,0
)*+,-.-1,2)/
34,5-.-6,7
89/,)-/)4,9:);
0;48+2-1*9/
0;48+2-1*9/
#
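The beat-averaging step itself is simple; a sketch, assuming frame times and beat times from the earlier stages:

```python
# Sketch: collapse a frame-rate chromagram to one 12-dim column per beat
# by averaging the frames between consecutive beat times.
import numpy as np

def beat_chroma(C, frame_times, beat_times):
    """C: (12, n_frames); returns (12, n_beats-1) beat-synchronous chroma."""
    edges = np.searchsorted(frame_times, beat_times)
    return np.stack([C[:, a:b].mean(axis=1)
                     for a, b in zip(edges[:-1], edges[1:]) if b > a], axis=1)
```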
Key Estimation
• Covariance of chroma reflects key
• Normalize by transposing for best fit
  single Gaussian model of one piece
  find ML rotation of other pieces
  model all transposed pieces
  iterate until convergence
[Figure: aligned global chroma-covariance model, and per-song aligned chroma covariances (bins A-G) for Taxman, Eleanor Rigby, I'm Only Sleeping, She Said She Said, Good Day Sunshine, And Your Bird Can Sing, Love You To, Yellow Submarine]
Ellis ICASSP ’07
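A sketch of one pass of this normalization: fit a Gaussian to a reference piece's beat-chroma frames, then transpose each other piece by its maximum-likelihood semitone rotation (the full method refits the model over all transposed pieces and iterates):

```python
# Sketch: key normalization by ML chroma rotation against a single
# Gaussian model of a reference piece.
import numpy as np
from scipy.stats import multivariate_normal

def best_rotation(frames, mean, cov):
    """frames: (n_beats, 12). Rotation in 0..11 maximizing likelihood."""
    scores = [multivariate_normal.logpdf(np.roll(frames, r, axis=1),
                                         mean, cov).sum() for r in range(12)]
    return int(np.argmax(scores))

def normalize_keys(pieces):
    ref = pieces[0]                               # arbitrary reference piece
    mean = ref.mean(axis=0)
    cov = np.cov(ref.T) + 1e-3 * np.eye(12)       # ridge for stability
    return [np.roll(p, best_rotation(p, mean, cov), axis=1) for p in pieces]
```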
Landmark Location
• Looking for "beginnings" of phrases
  e.g. abrupt change in harmony, instruments, etc.
  use a likelihood ratio test: how unlikely is the data following a candidate boundary under a model of the data up to the boundary?
• Choose top 10 locally-normalized peaks
  ... to control data size? include ±2 beats to catch errors
[Figure: "Come Together" - spectrogram, beat-sync chromagram, and top 10 segment points over 0-250 s]
Locality Sensitive Hash
• Goal: quantize high-dimensional data so 'similar' items fall into the same bin
  ... for fast and scalable nearest-neighbor search
• Idea: multiple random scalar projections
  each one will tend to keep neighbors nearby
  items close together in all projections are probably neighbors
[Figure: random-projection hashing illustration, from Slaney & Casey '08]
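A toy random-projection LSH in the spirit of the description above (projection count and bin width are arbitrary choices here, not the values used in the experiments):

```python
# Sketch: each of k random directions is quantized to a coarse integer;
# the tuple of integers is the bucket key. Near neighbors tend to share
# a bucket, and the fullest buckets ~ the most commonly reused patches.
import numpy as np
from collections import defaultdict

class RandomProjectionLSH:
    def __init__(self, dim, n_proj=8, bin_width=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.R = rng.standard_normal((n_proj, dim))   # random directions
        self.w = bin_width
        self.table = defaultdict(list)

    def key(self, x):
        return tuple(np.floor(self.R @ x / self.w).astype(int))

    def insert(self, item_id, x):
        self.table[self.key(x)].append(item_id)

    def candidates(self, x):
        return self.table[self.key(x)]                # same-bucket items only
```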
Experiments
• Data
  "artist20" - 20 artists × 6 albums = 1413 tracks
  (up to) 10 landmarks/track = 14,078 patches
  each patch = 12 chroma bins × 24 beats (288 dims)
• Performance
  feature calculation: ~60 min
  LSH 14k NNs: ~30 sec
  51 patches have >10 NNs within r = 2.0
[Figure: histogram "# neighbors within 2.0 - 14078 a20 patches": count (0-100) vs. number of near neighbors (0-40)]
Results - artist20
• Mainly sustained notes
[Figure: four matched beat-chroma patches (chroma bin vs. beat) - "radiohead 01-You 192.7-198.5s", "green day 02-Blood Sex And Booze 122.3-129.8s", "radiohead 07-Ripcord 177.4-182.5s", "radiohead 11-Genchildren hidden 30.9-35.0s"]
Results - Beatles
• Only the 86 Beatles tracks
• All beat offsets = 41,705 patches
  LSH takes 300 sec - approx N log N in patches?
• High-pass along time to avoid sustained notes
• Song filter: remove hits in same track
[Figure: four matched beat-chroma patches (chroma bin vs. beat) - "02-I Should Have Known Better 92.4-97.7s", "05-Here There And Everywhere 12.1-20.5s", "09-Martha My Dear 90.9-98.6s", "12-Piggies 22.0-29.6s"]
Results - Chroma Peaks
• Beat-chroma too diverse
  reduce variation by keeping only 4 chroma/frame
• Landmarks off-by-1 → use tr −2 ... tr +2
  70,606 fragments (all beats would be 1.3M fragments)
[Figure: eight frequent fragment clusters (a20-top10x5-cp4-4p0), chroma bins A-G vs. time/beats: #32273 (13 instances), #51917 (13), #65512 (10), #9667 (9), #10929 (7), #61202 (6), #55881 (5), #68445 (5)]
Results - Detail
• Interesting fragment cluster...
• Not that interesting...
  further simplification of fragments?
  larger dataset?
[Figure: one fragment cluster (chroma bins A-G vs. time/beats) spanning "depeche mode 13-Ice Machine 199.5-204.6s", "roxette 03-Fireworks 107.9-114.8s", "roxette 04-Waiting For The Rain 80.1-93.5s", "tori amos 11-Playboy Mommy 157.8-171.3s"]
4. Other Projects: Music Similarity
• The most central problem...
  motivates extracting musical information
  supports real applications (playlists, discovery)
• But do we need content-based similarity?
  compete with collaborative filtering
  compete with fingerprinting + metadata
• Maybe... for the Future of Music
  connect listeners directly to musicians
Discriminative Classification
• Classification as a proxy for similarity
• Distribution models: per-artist GMMs trained on MFCCs, a test song matched by KL divergence, minimum-distance artist wins
• vs. SVM: song-level features classified by a DAG SVM
[Figure: two system diagrams - GMM training and KL-divergence matching of a test song against Artist 1 / Artist 2 models, and DAG-SVM classification of song features]
Mandel & Ellis '05
Segment-Level Features
• Statistics of spectra and envelope define a point in feature space
  for SVM classification, or Euclidean similarity...
Mandel & Ellis '07
[Poster: "LabROSA's audio music similarity and classification systems", Michael I. Mandel and Daniel P. W. Ellis, LabROSA, Dept. of Electrical Engineering, Columbia University, New York, {mim,dpwe}@ee.columbia.edu]

1. Feature Extraction
• Spectral features are the same as Mandel and Ellis (2005): mean and unwrapped covariance of MFCCs
• Temporal features are similar to those in Rauber et al. (2002):
  - break input into overlapping 10-second clips
  - analyze the mel spectrum of each clip, averaging clips' features
  - combine mel frequencies into magnitude bands for low, low-mid, high-mid, and high frequencies
  - FFT in time gives modulation frequency; keep the magnitude of the lowest 20% of frequencies
  - DCT in modulation frequency gives the envelope cepstrum
  - stack the four bands' envelope cepstra into one feature vector
• Each feature is then normalized across all of the songs to be zero-mean, unit-variance

2. Similarity and Classification
• We use a DAG-SVM for n-way classification of songs (Platt et al., 2000)
• The distance between songs is the Euclidean distance between their feature vectors
• During testing, some songs were a top similarity match for many songs; re-normalizing each song's feature vector avoided this problem

References
M. Mandel and D. Ellis. Song-level features and support vector machines for music classification. In J. Reiss and G. Wiggins, editors, Proc. ISMIR, pages 594-599, 2005.
J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. Mueller, editors, Advances in Neural Information Processing Systems 12, pages 547-553, 2000.
A. Rauber, E. Pampalk, and D. Merkl. Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by sound similarity. In M. Fingerhut, editor, Proc. ISMIR, pages 71-80, 2002.
3. Results
[Figure: MIREX 2007 bar charts. Audio Music Similarity (scores 0-0.8 on Greater0, Psum, Fine, WCsum, SDsum, Greater1) for systems PS, GT, LB, CB1, TL1, ME, TL2, CB2, CB3, BK1, PC, BK2. Audio Classification (accuracy 0-80%) on hierarchical genre ID, raw genre ID, mood ID, composer ID, and artist ID for systems IMsvm, MEspec, ME, TL, GT, IMknn, KL, CL, GH.]
Similarity systems: PS = Pohle, Schnitzer; GT = George Tzanetakis; LB = Barrington, Turnbull, Torres, Lanckriet; CB = Christoph Bastuck; TL = Lidy, Rauber, Pertusa, Inesta; ME = Mandel, Ellis; BK = Bosteels, Kerre; PC = Paradzinets, Chen
Classification systems: IM = IMIRSEL M2K; ME = Mandel, Ellis; TL = Lidy, Rauber, Pertusa, Inesta; GT = George Tzanetakis; KL = Kyogu Lee; CL = Laurier, Herrera; GH = Guaus, Herrera
MIREX 2007, 3rd Music Information Retrieval Evaluation eXchange, 26 September 2007, Vienna, Austria
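A sketch of the temporal-feature recipe from section 1 of the poster; the band split, cepstrum length, and the log taken before the DCT are assumptions, not the published settings:

```python
# Sketch: band envelopes -> FFT along time -> keep the lowest 20% of
# modulation frequencies -> DCT -> stack the four bands' cepstra.
import numpy as np
from scipy.fftpack import dct

def temporal_features(mel_spec, n_bands=4, n_cep=8):
    """mel_spec: (n_mels, n_frames) for one ~10-second clip."""
    bands = np.array_split(mel_spec, n_bands, axis=0)   # low..high groups
    feats = []
    for band in bands:
        env = band.sum(axis=0)                          # band magnitude envelope
        mod = np.abs(np.fft.rfft(env))                  # modulation spectrum
        mod = mod[: max(n_cep, int(0.2 * len(mod)))]    # lowest ~20%
        feats.append(dct(np.log(mod + 1e-8), norm="ortho")[:n_cep])
    return np.concatenate(feats)                        # one vector per clip
```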
MIREX'07 Results
• One system for both similarity and classification (the poster above)
Cover Song Detection
• "Cover songs" = reinterpretation of a piece
  different instrumentation, character
  no match with "timbral" features
• Need a different representation!
  beat-synchronous chroma features
[Figure: "Let It Be" verse 1 by The Beatles and by Nick Cave - spectrograms (0-4 kHz, 2-10 s) and beat-sync chroma features (bins A-G vs. beats 5-25)]
Ellis & Poliner ‘07
Matching: Global Correlation
• Cross-correlate entire beat-chroma matrices
  ... at all possible transpositions
  implicit combination of match quality and duration
• One good matching fragment is sufficient...?
[Figure: beat-chroma matrices for "Elliott Smith - Between the Bars" and "Glen Phillips - Between the Bars" (chroma bins A-G vs. 100-500 beats @ 281 BPM), and their cross-correlation vs. skew in beats (-500 to +400) and semitones (-5 to +5)]
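A sketch of the global-correlation matcher: cross-correlate the two beat-chroma matrices over all 12 transpositions and all beat skews, scoring the pair by the best normalized peak (the normalization is an illustrative choice):

```python
# Sketch: cover-song score from cross-correlation of beat-chroma
# matrices over every semitone transposition and beat skew.
import numpy as np
from scipy.signal import fftconvolve

def cover_score(A, B):
    """A, B: (12, n_beats) beat-synchronous chroma matrices."""
    best = 0.0
    for rot in range(12):                     # every transposition
        Br = np.roll(B, rot, axis=0)
        # correlate along the beat axis, summed over chroma rows
        xc = sum(fftconvolve(A[c], Br[c][::-1], mode="full")
                 for c in range(12))
        best = max(best, xc.max() / (np.linalg.norm(A) * np.linalg.norm(Br)))
    return best
```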
"Semantic Bases": MajorMiner
• Describe segments in human-relevant terms
  e.g. anchor space, but more so
• Need ground truth...
  what words do people use?
• MajorMiner game:
  400 users
  7,500 unique tags
  70,000 taggings
  2,200 10-sec clips used
• Train classifiers...
Mandel & Ellis ’07,‘08
MajorMiner Autotagging Results
• Tags with enough verified clips → train SVM
• Some good results
  test has 50% baseline; 7% better is significant
  50-300 training patterns
• Next step: propagate labels
  semi-supervised, "multi-instance" learning
[Figure: per-tag classification accuracy (0.4-1.0) for tags: drums, voice, bass, vocal, keyboard, instrumental, 80s, solo, male, british, slow, piano, soft, pop, drum machine, female, guitar, distortion, synth, strings, country, rock, fast, punk, dance, electronic, electronica, beat, ambient, jazz, r&b, techno, house, saxophone, hip hop, rap]
Transcription as Classification
• Exchange signal models for data
  transcription as a pure classification problem:
• Training data and features:
  MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 mins of training audio)
  normalized magnitude STFT feature vectors
• Classification:
  N binary SVMs (one for each note)
  independent frame-level classification on a 10 ms grid
  distance to class boundary as posterior
• Temporal smoothing:
  two-state (on/off) independent HMM for each note; parameters learned from training data
  find the Viterbi sequence for each note
Poliner & Ellis '05, '06, '07
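A sketch of the two-state smoothing for a single note, with illustrative (not trained) transition probabilities:

```python
# Sketch: Viterbi decoding of a 2-state on/off HMM over frame-level
# classifier posteriors for one note.
import numpy as np

def smooth_note(p_on, p_stay=0.95):
    """p_on: (T,) posterior that the note is sounding at each frame."""
    emit = np.log(np.stack([1.0 - p_on, p_on]) + 1e-10)   # (2, T): off/on
    trans = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]))
    T = emit.shape[1]
    score = emit[:, 0].copy()
    back = np.zeros((2, T), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans             # cand[from, to]
        back[:, t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + emit[:, t]
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):                 # backtrace
        states[t - 1] = back[states[t], t]
    return states                                 # 1 where the note is "on"
```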
Singing Voice Modeling & Alignment
• How do singers sing?
  e.g. "vowel modification" in classical voice
  tuning variation...
• Collect the data
  ... by aligning libretto to recordings
  e.g. align Karaoke MIDI files to original recordings
  look at the detail of the alignments
• Lyric transcription?
Christine Smit, Johanna Devaney
[Figure: aligned pitch track, frequency (0-1000 Hz) vs. time (0-45 s)]
Conclusions
• Lots of data + noisy transcription + weak clustering ⇒ musical insights?
[Diagram: music audio → tempo and beat → low-level features, melody and notes, key and chords → classification and similarity, music structure discovery → browsing, discovery, production, modeling, generation, curiosity]