Data-Driven Music Understanding
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/

1. Motivation: What is Music?
2. Eigenrhythms
3. Melodic-Harmonic Fragments
4. Example Applications
LabROSA: Machine Listening
• Extracting useful information from sound ... like (we) animals do
[Figure: grid of machine-listening problems, tasks (Detect, Classify, Describe) by domains (Environmental Sound, Speech, Music): VAD, speech/music discrimination, ASR, emotion, music transcription, music recommendation, environment awareness, automatic narration; together, "Sound Intelligence".]
1. Motivation: What is music?
• What does music evoke in a listener’s mind?
• Which things do we call “music”?
Oodles of Music
• What can you do with a million tracks?
Bertin-Mahieux et al. ’09
Re-use in Music
• What are the most popular chord progressions in pop music?
From Serrà et al. (2012):

... points towards a great degree of conventionalism in the creation and production of this type of music. Yet, we find three important trends in the evolution of musical discourse: the restriction of pitch sequences (with metrics showing less variety in pitch progressions), the homogenization of the timbral palette (with frequent timbres becoming more frequent), and growing average loudness levels (threatening a dynamic richness that has been conserved until today). This suggests that our perception of the new would be essentially rooted on identifying simpler pitch sequences, fashionable timbral mixtures, and louder volumes. Hence, an old tune with slightly simpler chord progressions, new instrument sonorities that were in agreement with current tendencies, and recorded with modern techniques that allowed for increased loudness levels could be easily perceived as novel, fashionable, and groundbreaking.

Results. To identify structural patterns of musical discourse we first need to build a ‘vocabulary’ of musical elements (Fig. 1). To do so, we encode the dataset descriptions by a discretization of their values, yielding what we call music codewords (see Supplementary Information). In the case of pitch, the descriptions of each song are additionally transposed to an equivalent main tonality, such that all of them are automatically considered within the same tonal context or key. Next, to quantify long-term variations of a vocabulary, we need to obtain samples of it at different periods of time. For that we perform a Monte Carlo sampling in a moving-window fashion. In particular, for each year, we sample one million beat-consecutive codewords, considering entire tracks and using a window length of 5 years (the window is centered at the corresponding year such that, for instance, for 1994 we sample one million consecutive beats by choosing full tracks whose year annotation is between 1992 and 1996, both included). This procedure, which is repeated 10 times, guarantees a representative sample with a smooth evolution over the years.

We first count the frequency of usage of pitch codewords (i.e. the number of times each codeword type appears in a sample). We observe that the most used pitch codewords generally correspond to well-known harmonic items, while unused codewords correspond to strange/dissonant pitch combinations (Fig. 2a). Sorting the frequency counts in decreasing order reveals a very clear pattern behind the data: a power law of the form z ∝ r^(-α), where z corresponds to the frequency count of a codeword, r denotes its rank (i.e. r = 1 for the most used codeword and so forth), and α is the power-law exponent. Specifically, we find that the distribution of codeword frequencies for a given year nicely fits P(z) ∝ (c + z)^(-β) for z > z_min, where we take z as the random variable, β = 1 + 1/α as the exponent, and c as a constant (Fig. 2b). A power law indicates that a few codewords are very frequent while the majority are highly infrequent.

Figure 1 | Method schematic summary with pitch data. The dataset contains the beat-based music descriptions of the audio rendition of a musical piece or score (G, Em, and D7 on the top of the staff denote chords). For pitch, these descriptions reflect the harmonic content of the piece, and encapsulate all sounding notes of a given time interval into a compact representation, independently of their articulation (they consist of the 12 pitch-class relative energies, where a pitch class is the set of all pitches that are a whole number of octaves apart, e.g. notes C1, C2, and C3 all collapse to pitch class C). All descriptions are encoded into music codewords, using a binary discretization in the case of pitch. Codewords are then used to perform frequency counts, and as nodes of a complex network whose links reflect transitions between subsequent codewords.
Serrà et al. 2012, Scientific Reports 2:521, DOI: 10.1038/srep00521
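As a rough illustration of the rank-frequency analysis quoted above, here is a minimal sketch (not the authors' code): it counts codeword usage, sorts the counts by rank, and estimates the power-law exponent with a crude log-log fit. The Zipf-distributed toy data stands in for real beat-level pitch codewords.

```python
import numpy as np
from collections import Counter

def rank_frequency(codewords):
    """Sort codeword usage counts in decreasing order and pair them with ranks."""
    counts = np.array(sorted(Counter(codewords).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1)
    return ranks, counts

def estimate_exponent(ranks, counts):
    """Crude estimate of alpha in count ~ rank**(-alpha) via a log-log least-squares fit."""
    slope, _intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope

# Toy data: Zipf-distributed integers stand in for beat-synchronous pitch codewords.
rng = np.random.default_rng(0)
codewords = rng.zipf(2.0, size=100_000)

ranks, counts = rank_frequency(codewords)
alpha = estimate_exponent(ranks, counts)
beta = 1 + 1 / alpha   # exponent of the frequency distribution P(z) ~ (c + z)**(-beta)
print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}")
```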
Potential Applications
• Compression
• Classification
• Manipulation
2. Eigenrhythms: Drum Track Structure
• To first order, all pop music has the same beat:
• Can we learn this from examples?
Ellis & Arroyo ’04
Basis Sets
• Combine a few basic patterns to make a larger dataset:
  data = weights × patterns:  X = W × H
[Figure: schematic of the factorization, with pattern values on a -1 to 1 color scale.]
Drum Pattern Data
• Tempo normalization + downbeat alignment
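The bullet above compresses two preprocessing steps; the sketch below illustrates the idea (my own simplification, not the original code): given an onset-strength envelope and estimated downbeat times, each bar is resampled to a fixed number of samples so that patterns from songs at different tempos become directly comparable. The 200 Hz frame rate and 400 samples per bar follow the axes on the next slide.

```python
import numpy as np

def normalize_bars(onset_env, frame_rate, downbeat_times, samples_per_bar=400):
    """Resample the onset envelope of each bar to a fixed length (tempo normalization)."""
    bars = []
    for start, end in zip(downbeat_times[:-1], downbeat_times[1:]):
        # original frame indices spanning this bar
        src = np.arange(int(start * frame_rate), int(end * frame_rate))
        if len(src) < 2:
            continue
        # fixed-length grid for the normalized bar
        dst = np.linspace(src[0], src[-1], samples_per_bar)
        bars.append(np.interp(dst, src, onset_env[src]))
    return np.vstack(bars)   # one row per bar, all the same length

# Toy usage: a synthetic envelope with downbeats every 2 seconds at a 200 Hz frame rate.
frame_rate = 200
onset_env = np.abs(np.random.randn(int(10 * frame_rate)))
downbeats = np.arange(0.0, 10.0, 2.0)
patterns = normalize_bars(onset_env, frame_rate, downbeats, samples_per_bar=400)
print(patterns.shape)   # (4, 400)
```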
NMF Eigenrhythms
• Nonnegative: only add beat-weight
[Figure: six NMF basis patterns, "Posirhythm 1" through "Posirhythm 6"; each panel shows bass drum (BD), snare (SN), and hi-hat (HH) rows over 0-400 samples at 200 Hz, i.e. beats 1-4 at 120 BPM; weights roughly -0.1 to 0.1.]
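A minimal sketch of the idea using scikit-learn's NMF (the original work used its own nonnegative factorization): stack tempo-normalized drum patterns as rows of X and factor X ≈ W·H, so the rows of H are the basis patterns ("posirhythms") and W holds the per-song weights. Array shapes and parameter choices here are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

# X: one row per song, columns are (drum channel x time) of the
# tempo-normalized, downbeat-aligned pattern (nonnegative onset strengths).
rng = np.random.default_rng(0)
X = rng.random((100, 3 * 400))          # toy stand-in: 100 songs, 3 drums x 400 samples

model = NMF(n_components=6, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)              # per-song weights, shape (100, 6)
H = model.components_                   # 6 basis patterns ("posirhythms"), shape (6, 1200)

posirhythms = H.reshape(6, 3, 400)      # back to (pattern, drum channel, time)
reconstruction = W @ H                  # approximate original patterns
print(posirhythms.shape, np.linalg.norm(X - reconstruction))
```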
Eigenrhythm BeatBox
• Resynthesize rhythms from the eigen-space
3. Melodic-Harmonic Fragments
• How similar are two pieces?
• Can we find all the pop-music clichés?
MFCC Features
• Used in speech recognition
[Figure: "Let It Be" log-frequency spectrogram (LIB-1), its MFCCs, and a noise-excited MFCC resynthesis (LIB-2); frequency in Hz vs. time over ~25 s.]
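For concreteness, a minimal sketch of computing these representations with librosa (the file path is a placeholder; the slide's figures were made with other tools):

```python
import librosa

# Load ~25 s of audio (path is just a placeholder).
y, sr = librosa.load("let_it_be.wav", duration=25.0)

# Log-frequency (constant-Q) spectrogram and MFCCs, as in the figure's panels.
C = librosa.cqt(y, sr=sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(C.shape, mfcc.shape)   # (n_bins, n_frames), (13, n_frames)
```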
Chroma Features
• Idea: project onto 12 semitones regardless of octave
Fujishima 1999
[Figure: spectrogram (FFT bins vs. time) projected onto chroma (pitch classes A-G); as a matrix product, chroma = mapping × signal: C = M × X.]
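The figure's equation (chroma = mapping × signal) can be written directly; a minimal librosa sketch, where the mapping matrix M collapses FFT bins onto 12 pitch classes (librosa.feature.chroma_stft implements essentially this; the path is a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("let_it_be.wav", duration=25.0)   # placeholder path

n_fft = 2048
X = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2           # power spectrogram, (1 + n_fft/2, frames)
M = librosa.filters.chroma(sr=sr, n_fft=n_fft)          # mapping matrix, (12, 1 + n_fft/2)
C = M.dot(X)                                            # chroma = mapping x signal
C = librosa.util.normalize(C, axis=0)                   # normalize each frame

print(M.shape, C.shape)
```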
Chroma Features
• To capture “musical” content
[Figure: "Let It Be" log-frequency spectrogram (LIB-1), chroma features (C-B), Shepard-tone resynthesis of the chroma (LIB-3), and MFCC-filtered Shepard tones (LIB-4); frequency ~300-6000 Hz, time 0-25 s.]
Beat-Synchronous Chroma
• Compact representation of harmonies
[Figure: "Let It Be" log-frequency spectrogram (LIB-1), onset envelope + beat times, beat-synchronous chroma, and beat-synchronous chroma + Shepard resynthesis (LIB-6); time 0-25 s.]
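A minimal librosa sketch of the pipeline in this figure: compute an onset envelope, track beats, then aggregate the chroma frames within each beat into one compact chroma vector per beat (path and parameters are placeholders):

```python
import numpy as np
import librosa

y, sr = librosa.load("let_it_be.wav", duration=25.0)     # placeholder path

# Onset envelope -> beat times, then chroma aggregated over each inter-beat interval.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
beat_chroma = librosa.util.sync(chroma, beat_frames, aggregate=np.median)

print(tempo, beat_chroma.shape)   # tempo in BPM, (12, n_beats + 1)
```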
Chord Recognition
• Beat-synchronous chroma look like chords: can we transcribe them?
• Two approaches:
  - manual templates (prior knowledge)
  - learned models (from training data)
[Figure: beat-synchronous chroma over ~20 s, with chord templates such as C-E-G, B-D-G, A-C-E, A-C-D-F, ...]
Chord Recognition System
• Analogous to speech recognition:
  - Gaussian models of features for each chord
  - Hidden Markov Models for chord transitions (a decoding sketch follows below)
[Figure: system block diagram - audio is beat-tracked and reduced to beat-synchronous chroma (band-pass filtered, 100-1600 Hz or 25-400 Hz) with root normalization; training labels yield 24 Gaussian chord models and a 24x24 transition matrix; at test time, HMM Viterbi decoding and un-normalization produce chord labels. Example model matrices are shown for C major / c minor over roots C-B.]
Sheh & Ellis 2003; Ellis & Weller 2010
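A compact sketch of the decoding step described above, under simplifying assumptions (24 chord classes, shared spherical Gaussian variance, uniform stand-in priors and transitions); the chord means and the transition matrix would come from labeled training data as in the diagram.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely chord sequence. log_obs: (T, N) per-beat log-likelihoods,
    log_trans: (N, N) log transition matrix, log_init: (N,) log priors."""
    T, N = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (from state, to state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy setup: 24 chord classes, Gaussian means over 12 chroma bins (stand-ins for learned templates).
rng = np.random.default_rng(0)
n_chords, n_chroma, n_beats = 24, 12, 50
means = rng.random((n_chords, n_chroma))               # stand-in for learned chord means
var = 0.1                                              # shared spherical variance (simplification)
trans = np.full((n_chords, n_chords), 1.0 / n_chords)  # stand-in for counted transitions
init = np.full(n_chords, 1.0 / n_chords)

beat_chroma = rng.random((n_beats, n_chroma))          # stand-in for beat-synchronous chroma
# Log-likelihood of each beat's chroma under each chord's Gaussian.
diff = beat_chroma[:, None, :] - means[None, :, :]
log_obs = -0.5 * np.sum(diff ** 2, axis=2) / var

chords = viterbi(log_obs, np.log(trans), np.log(init))
print(chords[:10])
```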
Chord Recognition
• Often works ...
• ... but only 60-80% of the time
[Figure: "Let It Be" - audio spectrogram, beat-synchronous chroma, recognized chords (C G a F C G F C G a), and ground-truth chords (C, G, A:min, A:min/b7, F:maj7, F:maj6, F, C) over 0-18 s.]
What did the models learn?
• Chord model centers (means) indicate chord ‘templates’:
[Figure: "PCP_ROT family model means (train18)" - mean chroma templates per chord family (DIM, DOM7, MAJ, MIN, MIN7) over pitch classes C D E F G A B C, for C-root chords.]
Finding ‘Cover Songs’
• Little similarity in surface audio ...
• ... but it appears in beat-chroma
[Figure: "Let It Be", verse 1, by The Beatles and by Nick Cave - log-frequency spectrograms (0-4 kHz, ~10 s) and the corresponding beat-synchronous chroma features (~25 beats): the chroma patterns match closely while the spectrograms do not.]
Ellis & Poliner '07; Ravuri & Ellis '10
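A rough sketch of the matching idea behind the cited cover-song work (simplified; the published system adds filtering of the cross-correlation and other details): cross-correlate two beat-chroma matrices over all relative beat lags and all 12 chroma rotations (transpositions), and score the best peak.

```python
import numpy as np

def cover_score(chroma_a, chroma_b):
    """Peak cross-correlation of two beat-synchronous chroma matrices (12 x n_beats),
    maximized over all 12 transpositions and all relative beat offsets."""
    best = 0.0
    for shift in range(12):                       # try every chroma rotation (transposition)
        b = np.roll(chroma_b, shift, axis=0)
        # correlate along the beat axis, one chroma row at a time, then sum the rows
        corr = sum(np.correlate(chroma_a[r], b[r], mode="full") for r in range(12))
        best = max(best, corr.max())
    # normalize so scores are comparable across tracks of different lengths/energies
    return best / (np.linalg.norm(chroma_a) * np.linalg.norm(chroma_b) + 1e-9)

# Toy check: a transposed, time-rotated copy scores higher than unrelated chroma.
rng = np.random.default_rng(0)
a = rng.random((12, 120))
cover = np.roll(np.roll(a, 3, axis=0), 7, axis=1)     # transposed and shifted "cover"
other = rng.random((12, 120))
print(cover_score(a, cover), cover_score(a, other))
```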
Large-Scale Cover Recognition
• 2D Fourier Transform Magnitude (2DFTM): a fixed-size feature to capture the “essence” of a chromagram (see the sketch below)
• First results on finding covers in 1M songs:

                 Average rank   meanAP
  random              500,000    0.000
  jumpcodes            308,369    0.002
  2DFTM (50 PC)        137,117    0.020
Bertin-Mahieux & Ellis ’12
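A minimal sketch of the 2DFTM feature (the patch length and pooling are illustrative parameters, not necessarily those of the published system): take the magnitude of the 2D FFT of fixed-size beat-chroma patches, which discards transposition (chroma rotation) and beat alignment, then average over patches to get one fixed-size vector per track. The table's "(50 PC)" presumably refers to reducing this vector to 50 principal components.

```python
import numpy as np

def twodftm(beat_chroma, patch_beats=75):
    """Fixed-size track feature: mean of |2D FFT| over consecutive beat-chroma patches.
    The magnitude makes each patch invariant to chroma rotation and circular beat shift."""
    n_beats = beat_chroma.shape[1]
    patches = []
    for start in range(0, n_beats - patch_beats + 1, patch_beats):
        patch = beat_chroma[:, start:start + patch_beats]
        patches.append(np.abs(np.fft.fft2(patch)))
    return np.mean(patches, axis=0).ravel()       # 12 * patch_beats values

# Toy checks: |FFT2| of a patch is unchanged by transposition and circular beat rotation,
# and a 300-beat track yields a fixed-size (900-dimensional) vector.
rng = np.random.default_rng(0)
patch = rng.random((12, 75))
f1 = np.abs(np.fft.fft2(patch))
f2 = np.abs(np.fft.fft2(np.roll(np.roll(patch, 5, axis=0), 10, axis=1)))
print(np.allclose(f1, f2))                        # True
print(twodftm(rng.random((12, 300))).shape)       # (900,)
```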
Finding Common Fragments
• Cluster beat-synchronous chroma patches
[Figure: eight clusters of beat-synchronous chroma patches (chroma bins A-G vs. time in beats), e.g. #32273 (13 instances), #51917 (13), #65512 (10), #9667 (9), #10929 (7), #61202 (6), #55881 (5), #68445 (5).]
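A minimal sketch of the clustering step (the actual system's patch size, key normalization, and clustering details may differ): cut beat-chroma into fixed-length patches, flatten them, and cluster with k-means; the most heavily populated clusters are candidate "common fragments".

```python
import numpy as np
from sklearn.cluster import KMeans

def chroma_patches(beat_chroma, patch_beats=8, hop_beats=4):
    """Slice a (12 x n_beats) beat-chroma matrix into overlapping fixed-size patches."""
    n_beats = beat_chroma.shape[1]
    return np.array([beat_chroma[:, s:s + patch_beats].ravel()
                     for s in range(0, n_beats - patch_beats + 1, hop_beats)])

# Toy corpus: patches pooled from many tracks (random stand-ins here).
rng = np.random.default_rng(0)
corpus = np.vstack([chroma_patches(rng.random((12, 200))) for _ in range(20)])

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(corpus)
labels = kmeans.labels_

# Clusters with the most members are the most "reused" fragments.
counts = np.bincount(labels, minlength=100)
top = np.argsort(counts)[::-1][:5]
print([(int(c), int(counts[c])) for c in top])
```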
Clustered Fragments
• ... for a dictionary of common themes?
[Figure: example clustered fragments (chroma bins vs. time in beats): Depeche Mode "Ice Machine" 199.5-204.6 s, Roxette "Fireworks" 107.9-114.8 s, Roxette "Waiting For The Rain" 80.1-93.5 s, Tori Amos "Playboy Mommy" 157.8-171.3 s.]
4. Example Applications: Music Discovery
• Connecting listeners to musicians
Berenzweig & Ellis ‘03
Playlist Generation
• Incremental learning of listeners’ preferences
Mandel, Poliner, Ellis ‘06
MajorMiner: Music Tagging
• Describe music using words
Mandel & Ellis ’07,‘08
Classification Results
• Classifiers trained from the top 50 tags
[Figure: spectrogram of "01 Soul Eyes" (~320 s) with per-tag classifier outputs over time for the top 50 tags: drum, guitar, male, synth, rock, electronic, pop, vocal, bass, female, dance, techno, piano, jazz, hip_hop, rap, slow, beat, voice, 80s, electronica, instrumental, fast, saxophone, keyboard, country, drum_machine, distortion, british, ambient, soft, funk, r_b, alternative, house, indie, strings, solo, noise, quiet, silence, samples, punk, horns, singing, drum_bass, end, trance, club, 90s.]
Music Transcription (Poliner & Ellis '05, '06, '07)

Classification:
• N binary SVMs (one for each note).
• Independent frame-level classification on a 10 ms grid.
• Distance to class boundary as posterior.

Temporal smoothing:
• Two-state (on/off) independent HMM for each note; parameters learned from training data.
• Find the Viterbi sequence for each note.

Training data and features:
• MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 mins of training audio).
• Normalized magnitude STFT.

[Figure: feature vectors → classification posteriors → HMM-smoothed transcription.]
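A minimal sketch of the frame-level classification stage described above, using scikit-learn with toy data; the STFT feature extraction is omitted, and a median filter stands in for the per-note two-state HMM / Viterbi smoothing of the slide.

```python
import numpy as np
from sklearn.svm import LinearSVC
from scipy.ndimage import median_filter

N_NOTES = 88          # one binary classifier per piano note
N_FEATS = 256         # e.g. normalized magnitude-STFT bins per 10 ms frame

# Toy training data: frame features X and per-note on/off labels Y.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, N_FEATS))
Y = (rng.random((2000, N_NOTES)) < 0.05).astype(int)

# Train one independent binary SVM per note (skip notes never active in the toy data).
clfs = {}
for note in range(N_NOTES):
    if Y[:, note].any():
        clfs[note] = LinearSVC(C=1.0).fit(X, Y[:, note])

# Frame-level prediction on new frames, then temporal smoothing per note.
X_test = rng.standard_normal((500, N_FEATS))
piano_roll = np.zeros((500, N_NOTES), dtype=int)
for note, clf in clfs.items():
    piano_roll[:, note] = median_filter(clf.predict(X_test), size=5)

print(piano_roll.sum(), "note-frames active")
```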
MEAPsoft
• Music Engineering Art Projects: a collaboration between EE and the Computer Music Center
with Douglas Repetto, Ron Weiss, and the rest of the MEAP team
Conclusions
• Lots of data + noisy transcription + weak clustering ⇒ musical insights?
[Figure: processing flow - music audio → tempo and beat, low-level features, melody and notes, key and chords → classification and similarity, music structure discovery → browsing, discovery, production; modeling, generation, curiosity.]