
Page 1:

Data-Driven Music Understanding - Dan Ellis 2013-05-15

1. Motivation: What is Music?
2. Eigenrhythms
3. Melodic-Harmonic Fragments
4. Example Applications

Data Driven Music Understanding

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/

Page 2:

LabROSA : Machine Listening


• Extracting useful information from sound... like (we) animals do

[Figure: "Sound Intelligence" grid of Task (Detect / Classify / Describe) vs. Domain (Environmental Sound / Speech / Music); example cells include VAD, Speech/Music discrimination, Emotion, ASR, Music Transcription, Music Recommendation, Environment Awareness, and Automatic Narration]

Page 3:

1. Motivation: What is music?

• What does music evoke in a listener’s mind?

• Which are the things that we call “music”?


Page 4:

Oodles of Music

• What can you do with a million tracks?

Bertin-Mahieux et al. ’09

Page 5:

Re-use in Music

• What are the most popular chord progressions in pop music?


points towards a great degree of conventionalism in the creation and production of this type of music. Yet, we find three important trends in the evolution of musical discourse: the restriction of pitch sequences (with metrics showing less variety in pitch progressions), the homogenization of the timbral palette (with frequent timbres becoming more frequent), and growing average loudness levels (threatening a dynamic richness that has been conserved until today). This suggests that our perception of the new would be essentially rooted on identifying simpler pitch sequences, fashionable timbral mixtures, and louder volumes. Hence, an old tune with slightly simpler chord progressions, new instrument sonorities that were in agreement with current tendencies, and recorded with modern techniques that allowed for increased loudness levels could be easily perceived as novel, fashionable, and groundbreaking.

Results

To identify structural patterns of musical discourse we first need to build a ‘vocabulary’ of musical elements (Fig. 1). To do so, we encode the dataset descriptions by a discretization of their values, yielding what we call music codewords [20] (see Supplementary Information, SI). In the case of pitch, the descriptions of each song are additionally transposed to an equivalent main tonality, such that all of them are automatically considered within the same tonal context or key. Next,

to quantify long-term variations of a vocabulary, we need to obtain samples of it at different periods of time. For that we perform a Monte Carlo sampling in a moving window fashion. In particular, for each year, we sample one million beat-consecutive codewords, considering entire tracks and using a window length of 5 years (the window is centered at the corresponding year such that, for instance, for 1994 we sample one million consecutive beats by choosing full tracks whose year annotation is between 1992 and 1996, both included). This procedure, which is repeated 10 times, guarantees a representative sample with a smooth evolution over the years.

We first count the frequency of usage of pitch codewords (i.e. the number of times each codeword type appears in a sample). We observe that most used pitch codewords generally correspond to well-known harmonic items [21], while unused codewords correspond to strange/dissonant pitch combinations (Fig. 2a). Sorting the frequency counts in decreasing order provides a very clear pattern behind the data: a power law [17] of the form z ∝ r^−α, where z corresponds to the frequency count of a codeword, r denotes its rank (i.e. r = 1 for the most used codeword and so forth), and α is the power law exponent. Specifically, we find that the distribution of codeword frequencies for a given year nicely fits P(z) ∝ (c + z)^−β for z > z_min, where we take z as the random variable [22], β = 1 + 1/α as the exponent, and c as a constant (Fig. 2b). A power law indicates that a few codewords are very frequent while the majority are highly

Figure 1 | Method schematic summary with pitch data. The dataset contains the beat-based music descriptions of the audio rendition of a musical piece or score (G, Em, and D7 on the top of the staff denote chords). For pitch, these descriptions reflect the harmonic content of the piece [15], and encapsulate all sounding notes of a given time interval into a compact representation [11,12], independently of their articulation (they consist of the 12 pitch class relative energies, where a pitch class is the set of all pitches that are a whole number of octaves apart, e.g. notes C1, C2, and C3 all collapse to pitch class C). All descriptions are encoded into music codewords, using a binary discretization in the case of pitch. Codewords are then used to perform frequency counts, and as nodes of a complex network whose links reflect transitions between subsequent codewords.

www.nature.com/scientificreports

SCIENTIFIC REPORTS | 2 : 521 | DOI: 10.1038/srep00521


Serrà et al. 2012
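The rank-frequency power law z ∝ r^−α described in the excerpt can be illustrated numerically. A minimal sketch, assuming synthetic Zipf-distributed codeword counts and a plain log-log least-squares fit (the paper uses a more careful maximum-likelihood style fit over z > z_min):

```python
import numpy as np

def powerlaw_exponent(counts):
    """Estimate alpha in z ~ r^-alpha from codeword frequency counts,
    by least squares on the log-log rank-frequency curve."""
    z = np.sort(np.asarray(counts, dtype=float))[::-1]  # sort descending
    z = z[z > 0]
    r = np.arange(1, len(z) + 1)                        # ranks: r = 1 is most used
    slope, _ = np.polyfit(np.log(r), np.log(z), 1)      # slope of log z vs log r is -alpha
    return -slope

# synthetic Zipf-like codeword counts with known exponent alpha = 1.5
r = np.arange(1, 1001)
counts = 1e6 * r ** -1.5
alpha = powerlaw_exponent(counts)   # recovers ~1.5
```

On real codeword histograms the fit would be noisier at the high-rank tail, which is why truncating at z_min matters.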

Page 6:

Potential Applications

• Compression

• Classification

• Manipulation


Page 7:

2. Eigenrhythms: Drum Track Structure

• To first order, all pop music has the same beat:

• Can we learn this from examples?

Ellis & Arroyo ’04


Page 8:

Basis Sets

• Combine a few basic patterns to make a larger dataset

X = W × H  (data = weights × patterns)
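The X = W × H factorization above can be sketched with standard multiplicative-update NMF (Lee-Seung style updates for squared error). This is an illustration on synthetic nonnegative data, not the exact procedure of Ellis & Arroyo:

```python
import numpy as np

def nmf(X, k, iters=500, seed=0):
    """Factor nonnegative X (m x n) into W (m x k) @ H (k x n)
    using Lee-Seung multiplicative updates for squared error."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    eps = 1e-9
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update patterns, stays nonnegative
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update weights, stays nonnegative
    return W, H

# toy "data": 30 examples, each a nonnegative mixture of two 16-sample patterns
rng = np.random.default_rng(1)
patterns = rng.random((2, 16))
weights = rng.random((30, 2))
X = weights @ patterns            # data = weights x patterns, as on the slide
W, H = nmf(X, k=2)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Because every factor stays nonnegative, the learned rows of H behave like additive basis patterns, which is the property exploited by the "posirhythms" on the later slide.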

Page 9:

Drum Pattern Data

• Tempo normalization + downbeat alignment


Page 10:

NMF Eigenrhythms

• Nonnegative: only add beat-weight

[Figure: six NMF basis patterns, Posirhythm 1-6, on BD/SN/HH drum tracks; x-axis: samples @ 200 Hz, beats 1-4 @ 120 BPM]

Page 11:

Eigenrhythm BeatBox

• Resynthesize rhythms from eigen-space

Page 12:

3. Melodic-Harmonic Fragments

• How similar are two pieces?

• Can we find all the pop-music clichés?

Page 13:

MFCC Features

• Used in speech recognition

[Figure: ‘Let It Be’ (LIB-1) log-frequency spectrogram; MFCCs; noise-excited MFCC resynthesis (LIB-2)]
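A minimal MFCC computation for one frame, as a sketch of the feature: triangular mel filterbank on the power spectrum, log, then DCT-II to decorrelate. The filterbank shape, sample rate, and sizes here are illustrative choices, not LabROSA's exact settings:

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame_power, sr, n_mels=20, n_ceps=13):
    """MFCCs of one power-spectrum frame: mel filterbank -> log -> DCT-II."""
    n_bins = len(frame_power)
    freqs = np.linspace(0, sr / 2, n_bins)
    # triangular filters with centers equally spaced on the mel scale
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    fbank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, cen, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        up = (freqs - lo) / (cen - lo)
        down = (hi - freqs) / (hi - cen)
        fbank[i] = np.maximum(0.0, np.minimum(up, down))
    logmel = np.log(fbank @ frame_power + 1e-10)
    # DCT-II matrix: keeps the first n_ceps smooth spectral-shape coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return dct @ logmel

sr = 8000
t = np.arange(1024) / sr
frame = np.hanning(1024) * np.sin(2 * np.pi * 440 * t)
spectrum = np.abs(np.fft.rfft(frame)) ** 2
coeffs = mfcc(spectrum, sr)
```

Because the DCT keeps only low-order coefficients, MFCCs capture broad spectral shape (timbre) and discard fine harmonic detail, which is why the noise-excited resynthesis (LIB-2) sounds whispery but recognizable.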

Page 14:

Chroma Features

• Idea: project onto 12 semitones regardless of octave

Fujishima 1999

[Figure: spectrogram (freq / kHz vs. time) projected onto chroma bins A-G]

M × X = C  (mapping × signal = chroma)
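The projection C = M × X can be sketched by building a mapping matrix that assigns each FFT bin to the pitch class nearest its center frequency. The reference tuning (A = 440 Hz), frequency range, and hard bin assignment below are simplifying assumptions (Fujishima-style systems often use softer weighting):

```python
import numpy as np

def chroma_map(sr, n_fft, fmin=55.0, fmax=2000.0):
    """Mapping M (12 x n_bins): each FFT bin votes for the pitch class
    (semitone mod 12, with class 0 = A) nearest its center frequency."""
    n_bins = n_fft // 2 + 1
    freqs = np.arange(n_bins) * sr / n_fft
    M = np.zeros((12, n_bins))
    valid = (freqs >= fmin) & (freqs <= fmax)
    # semitones above A440, folded into one octave
    semis = np.round(12 * np.log2(freqs[valid] / 440.0)).astype(int) % 12
    M[semis, np.where(valid)[0]] = 1.0
    return M

sr, n_fft = 8000, 4096
M = chroma_map(sr, n_fft)
# a 440 Hz tone should put all chroma energy into pitch class A (index 0 here)
x = np.sin(2 * np.pi * 440 * np.arange(n_fft) / sr) * np.hanning(n_fft)
X = np.abs(np.fft.rfft(x)) ** 2
C = M @ X          # C = M x X, as on the slide
```

Applying M column-by-column to a whole spectrogram gives the chromagram shown in the figure.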

Page 15:

Chroma Features

• To capture “musical” content

[Figure: ‘Let It Be’ log-freq specgram (LIB-1); chroma features; Shepard tone resynthesis of chroma (LIB-3); MFCC-filtered Shepard tones (LIB-4)]

Page 16:

Beat-Synchronous Chroma

• Compact representation of harmonies

[Figure: ‘Let It Be’ log-freq specgram (LIB-1); onset envelope + beat times; beat-synchronous chroma; beat-synchronous chroma + Shepard resynthesis (LIB-6)]
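Beat-synchronous chroma is essentially per-beat averaging of chroma frames. A sketch, assuming beat times are already available as frame indices from a beat tracker:

```python
import numpy as np

def beat_sync(chroma, beat_frames):
    """Average chroma columns between consecutive beat boundaries,
    yielding one 12-d column per beat interval."""
    bounds = np.concatenate(([0], beat_frames, [chroma.shape[1]]))
    return np.stack([chroma[:, a:b].mean(axis=1)
                     for a, b in zip(bounds[:-1], bounds[1:]) if b > a],
                    axis=1)

# toy chromagram: 12 bins x 20 frames, all energy in pitch class 0,
# with beats every 5 frames -> 4 beat intervals
chroma = np.tile(np.eye(12)[:, :1], (1, 20))
sync = beat_sync(chroma, np.array([5, 10, 15]))
```

Averaging within beats discards tempo variation and frame-level noise, which is what makes the representation compact enough for chord recognition and cover-song matching below.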

Page 17:

Chord Recognition

• Beat-synchronous chroma look like chords: can we transcribe them?

• Two approaches: manual templates (prior knowledge) and learned models (from training data)

[Figure: beat-synchronous chroma (bins A-G vs. time) labeled with chord templates C-E-G, B-D-G, A-C-E, A-C-D-F ...]

Page 18:

Chord Recognition System

• Analogous to speech recognition: Gaussian models of features for each chord; Hidden Markov Models for chord transitions

[System diagram: Audio → Beat track → Resample → Chroma (100-1600 Hz BPF for test; 25-400 Hz BPF for training) → Root normalize → beat-synchronous chroma features; training: Labels → Count transitions → 24×24 transition matrix, plus 24 Gaussian models; test: HMM Viterbi → Un-normalize → chord labels (24 chords, C maj ... c min: C D E F G A B / c d e f g a b)]

Sheh & Ellis 2003; Ellis & Weller 2010
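The HMM decoding step can be sketched with a generic Viterbi decoder over per-beat chord log-likelihoods. The tiny 2-state "sticky" transition matrix and observation likelihoods below are toy illustrations, not the 24×24 matrix counted from the labeled training data:

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely state sequence. log_obs: (T, N) per-beat log-likelihoods,
    log_trans: (N, N) with [from, to] indexing, log_init: (N,)."""
    T, N = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # scores[from, to]
        back[t] = np.argmax(scores, axis=0)          # best predecessor per state
        delta = scores[back[t], np.arange(N)] + log_obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 2, -1, -1):                   # trace back best path
        path[t] = back[t + 1, path[t + 1]]
    return path

# toy 2-chord example: sticky transitions overrule one weak middle frame
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_init = np.log(np.array([0.5, 0.5]))
log_obs = np.log(np.array([[0.8, 0.2],
                           [0.4, 0.6],   # frame-wise evidence favors chord 1 here...
                           [0.8, 0.2]])) # ...but the smooth path stays on chord 0
path = viterbi(log_obs, log_trans, log_init)
```

In the full system the observation log-likelihoods come from the 24 per-chord Gaussians evaluated on root-normalized beat-chroma, and the transition matrix from counted label transitions.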

Page 19:

Chord Recognition

• Often works:

• But only 60-80% of the time

[Figure: ‘Let It Be’ audio spectrogram and beat-synchronous chroma, with recognized chords (C G A:min A:min/b7 F:maj7 F:maj6 C G F C C G A:min A:min/b7 F:maj7) against ground-truth chords (C G a F C G F C G a)]

Page 20:

What did the models learn?

• Chord model centers (means) indicate chord ‘templates’:

[Figure: PCP_ROT family model means (train18) for C-root chords (DIM, DOM7, MAJ, MIN, MIN7) over chroma bins C D E F G A B C]
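The template view suggests the simple prior-knowledge recognizer mentioned two slides earlier: cosine-match each chroma column against binary triad templates. A sketch for the 24 major/minor chords (in the learned system, Gaussian means play the role of these hand-built templates):

```python
import numpy as np

NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def chord_templates():
    """Unit-norm binary chroma templates for 12 major + 12 minor triads
    (minor chords named in lower case)."""
    major, minor = [0, 4, 7], [0, 3, 7]
    names, temps = [], []
    for root in range(12):
        for name, ivs in ((NOTES[root], major), (NOTES[root].lower(), minor)):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in ivs]] = 1.0
            names.append(name)
            temps.append(t / np.linalg.norm(t))
    return names, np.array(temps)

def label_chord(chroma_col):
    """Cosine-match one 12-d chroma column against all 24 templates."""
    names, temps = chord_templates()
    v = chroma_col / (np.linalg.norm(chroma_col) + 1e-10)
    return names[int(np.argmax(temps @ v))]

# a chroma column with energy on C, E, G should label as C major
c_major = np.zeros(12)
c_major[[0, 4, 7]] = 1.0
label = label_chord(c_major)
```

Real beat-chroma columns are much less clean than this toy input, which is part of why the slide reports only 60-80% accuracy.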

Page 21:

Finding ‘Cover Songs’

• Little similarity in surface audio...

• ... but it appears in beat-chroma

[Figure: ‘Let It Be’ / verse 1 by The Beatles and by Nick Cave: spectrograms (freq / kHz vs. time) and beat-sync chroma features (bins A-G, 25 beats)]

Ellis & Poliner ’07; Ravuri & Ellis ’10

Page 22:

Large-Scale Cover Recognition

• 2D Fourier Transform Magnitude (2DFTM): a fixed-size feature to capture the “essence” of a chromagram

• First results on finding covers in 1M songs

                 Average rank   meanAP
random                500,000    0.000
jumpcodes 2           308,369    0.002
2DFTM (50 PC)         137,117    0.020

Bertin-Mahieux & Ellis ’12
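A sketch of the 2DFTM idea: take the magnitude of the 2D FFT of a beat-chroma patch. Discarding phase makes the feature invariant to circular shifts along both axes, i.e. to transposition (key) and beat alignment, which is exactly what cover matching needs. Note the paper reduces dimensionality to 50 principal components; the plain truncation below (keep=50) is only a stand-in for that PCA step:

```python
import numpy as np

def two_dftm(chroma_patch, keep=50):
    """Fixed-size descriptor: magnitude of the 2D FFT of a beat-chroma
    patch, flattened and truncated. Dropping phase makes it invariant
    to circular shifts in both chroma (key) and beat (time) axes."""
    F = np.abs(np.fft.fft2(chroma_patch))
    return F.flatten()[:keep]

rng = np.random.default_rng(0)
patch = rng.random((12, 75))                              # 12 chroma bins x 75 beats
shifted = np.roll(np.roll(patch, 3, axis=0), 10, axis=1)  # transposed + beat-rotated
d = np.linalg.norm(two_dftm(patch) - two_dftm(shifted))   # essentially zero
```

Because every track maps to the same-size vector, a million-song collection can be searched with simple nearest-neighbor distance, which is what makes the table's large-scale evaluation feasible.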

Page 23:

Finding Common Fragments

• Cluster beat-synchronous chroma patches


[Figure: clustered beat-chroma patches (bins A-G vs. time / beats): cluster #32273 (13 instances), #51917 (13), #65512 (10), #9667 (9), #10929 (7), #61202 (6), #55881 (5), #68445 (5)]

Page 24:

Clustered Fragments

• ... for a dictionary of common themes?


[Figure: matched fragments: Depeche Mode ‘13-Ice Machine’ 199.5-204.6 s; Roxette ‘03-Fireworks’ 107.9-114.8 s; Roxette ‘04-Waiting For The Rain’ 80.1-93.5 s; Tori Amos ‘11-Playboy Mommy’ 157.8-171.3 s]

Page 25:

4. Example Applications: Music Discovery

• Connecting listeners to musicians


Berenzweig & Ellis ‘03

Page 26:

Playlist Generation

• Incremental learning of listeners’ preferences


Mandel, Poliner, Ellis ‘06

Page 27:

MajorMiner: Music Tagging

• Describe music using words

Mandel & Ellis ’07,‘08

Page 28:

Classification Results

• Classifiers trained from top 50 tags

[Figure: per-tag classifier outputs for ‘01 Soul Eyes’ over the top 50 tags (drum, guitar, male, synth, rock, electronic, pop, vocal, bass, female, dance, techno, piano, jazz, hip_hop, rap, slow, beat, voice, 80s, electronica, instrumental, fast, saxophone, keyboard, country, drum_machine, distortion, british, ambient, soft, funk, r_b, alternative, house, indie, strings, solo, noise, quiet, silence, samples, punk, horns, singing, drum_bass, end, trance, club, 90s), with the track’s spectrogram]

Page 29:

Music Transcription (Poliner & Ellis ‘05, ’06, ’07)

Classification:
• N binary SVMs (one for each note).
• Independent frame-level classification on a 10 ms grid.
• Distance to class boundary as posterior.

Temporal smoothing:
• Two-state (on/off) independent HMM for each note; parameters learned from training data.
• Find the Viterbi sequence for each note.

Training data and features:
• MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 mins of training audio).
• Normalized magnitude STFT.
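The two-state temporal smoothing above can be sketched as a per-note Viterbi pass over frame-wise note-on posteriors. The sticky transition probability p_stay below is an assumed stand-in for the parameters learned from training data:

```python
import numpy as np

def smooth_note(p_on, p_stay=0.9):
    """Viterbi smoothing of frame-wise note-on posteriors with a sticky
    two-state (0 = off, 1 = on) HMM."""
    trans = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]))
    obs = np.log(np.stack([1 - p_on, p_on], axis=1) + 1e-10)
    T = len(p_on)
    delta = np.log(np.array([0.5, 0.5])) + obs[0]
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans              # scores[from, to]
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], [0, 1]] + obs[t]
    states = np.zeros(T, dtype=int)
    states[-1] = np.argmax(delta)
    for t in range(T - 2, -1, -1):
        states[t] = back[t + 1, states[t + 1]]
    return states

# an isolated high posterior at frame 3 is smoothed away,
# while the sustained run at frames 5-8 survives
p_on = np.array([0.1, 0.1, 0.1, 0.7, 0.1, 0.9, 0.9, 0.9, 0.9, 0.05])
states = smooth_note(p_on)
```

Running one such smoother independently per note, then reading off the on-segments, yields the piano-roll output of the transcription system.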

Page 30:

MEAPsoft

• Music Engineering Art Projects: a collaboration between EE and the Computer Music Center

with Douglas Repetto, Ron Weiss, and the rest of the MEAP team

Page 31:

Conclusions

• Lots of data + noisy transcription + weak clustering ⇒ musical insights?

[Diagram: Music audio → tempo and beat; low-level features → classification and similarity; music structure discovery; melody and notes; key and chords → browsing / discovery / production; modeling / generation / curiosity]