JHU CLSP Summer School
Pattern Recognition Applied to Music Signals
1. Music Content Analysis
2. Classification and Features
3. Statistical Pattern Recognition
4. Gaussian Mixtures and Neural Nets
5. Singing Detection
Dan Ellis <[email protected]>
http://www.ee.columbia.edu/~dpwe/muscontent/
Laboratory for Recognition and Organization of Speech and Audio, Columbia University, New York
July 1st, 2003
Music Content Analysis
• Music contains information at many levels
- what is it?
• We’d like to get this information out automatically
- fine-level transcription of events
- broad-level classification of pieces
• Information extraction can be framed as pattern classification / recognition, or machine learning
- build systems based on (labeled) training data
Music analysis
• What information can we get from music?
• Score recovery
- extract the ‘performance’
• Instrument identification
• Ensemble performance
- ‘gestalts’: chords, tone colors
• Broader timescales
- phrasing & musical structure
- artist / genre clustering and classification
[Figure: spectrogram of a music excerpt, frequency (0-4000 Hz) vs. time (0-5 s)]
Outline
1. Music Content Analysis
2. Classification and Features
   - classification
   - spectrograms
   - cepstra
3. Statistical Pattern Recognition
4. Gaussian Mixtures and Neural Nets
5. Singing Detection
Classification and Features
• Classification means: finding categorical (discrete) labels for real-world (continuous) observations
• Problems
- parameter tuning
- feature overlap
[Figure: spectrogram (f / Hz vs. time / s) of speech, and vowel tokens 'ay', 'ao' scattered in F1/F2 space]
Classification system parts
• Right features are critical
- place an upper bound on classifier performance
- should make important aspects visible
- invariance under irrelevant modifications
[Diagram: sensor → signal → pre-processing/segmentation (e.g. STFT, locate vowels) → segment → feature extraction (e.g. formant extraction) → feature vector → classification → class → post-processing (context constraints, costs/risk)]
The Spectrogram
• Short-time Fourier transform:
$$X[k, m] = \sum_{n=0}^{N-1} x[n] \cdot w[n - mL] \cdot \exp\!\left(-j\,\frac{2\pi k (n - mL)}{N}\right)$$
• Plot |X[k, m]| as a grayscale image:
[Figure: waveform and spectrogram of an excerpt; freq / Hz (0-4000) vs. time / s (0-2.5), intensity / dB (-50 to 10), with a zoomed detail over 2.35-2.6 s]
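As a concrete sketch of the equation above, a minimal numpy STFT; the window length N, hop L, and Hann window are illustrative choices, not from the slides:

```python
import numpy as np

def stft(x, N=1024, L=256):
    """X[k, m] as defined above: windowed DFT of frames hopped by L samples.
    N (DFT/window length), L (hop), and the Hann window are assumptions."""
    w = np.hanning(N)                        # window w[n]
    n_frames = 1 + (len(x) - N) // L
    X = np.empty((N // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        X[:, m] = np.fft.rfft(x[m * L : m * L + N] * w)  # sum over n = 0..N-1
    return X

# grayscale display: imshow(20 * np.log10(np.abs(X) + 1e-10))
```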
Cepstra
• Spectrograms are good for visualization; the cepstrum is preferred for classification
- DCT of the log STFT:
$$c_k = \mathrm{idft}\!\left(\log X[k, m]\right)$$
• Cepstra capture coarse information in fewer dimensions with less correlation:
[Figure: auditory spectrum and cepstral coefficients over ~150 frames; feature covariance matrix; example joint distribution of coefficients (10, 15)]
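A minimal sketch of the cepstrum computation above, assuming X comes from the stft() sketch earlier; keeping only the first few coefficients retains the coarse spectral shape:

```python
import numpy as np

def cepstra(X, n_ceps=13):
    """c = idft(log |X[k, m]|) per frame; keep the low-order (coarse,
    least-correlated) coefficients. n_ceps=13 is an assumed choice."""
    logmag = np.log(np.abs(X) + 1e-10)       # log spectrum; eps avoids log(0)
    c = np.fft.irfft(logmag, axis=0)         # inverse DFT along frequency
    return c[:n_ceps, :]
```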
Outline
1. Music Content Analysis
2. Classification and Features
3. Statistical Pattern Recognition
   - Priors and posteriors
   - Bayesian classifier
4. Gaussian Mixtures and Neural Nets
5. Singing Detection
Statistical Pattern Recognition
• Observations are random variables whose distribution depends on the class:
• Source distributions p(x | ωi)
- reflect variability in the feature
- reflect noise in observation
- generally have to be estimated from data (rather than known in advance)
[Diagram: hidden discrete class ωi generates a continuous observation x via p(x | ωi); inference recovers Pr(ωi | x). Example: overlapping class-conditional densities p(x | ωi) for ω1..ω4 along x]
Priors and posteriors
• Bayesian inference can be interpreted as updating prior beliefs with new information, x:

Bayes' Rule:
$$\Pr(\omega_i \mid x) = \frac{p(x \mid \omega_i) \cdot \Pr(\omega_i)}{\sum_j p(x \mid \omega_j) \cdot \Pr(\omega_j)}$$
(posterior probability = likelihood × prior probability / 'evidence', where evidence = p(x))

• Posterior is the prior scaled by the likelihood & normalized by the evidence (so Σ(posteriors) = 1)
• Objection: priors are often unknown
- but omitting them amounts to assuming they are all equal
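A worked instance of Bayes' rule, with made-up numbers for three classes:

```python
import numpy as np

priors = np.array([0.5, 0.3, 0.2])            # Pr(w_i), illustrative values
likelihoods = np.array([0.10, 0.40, 0.05])    # p(x | w_i) at one observation x

evidence = np.sum(likelihoods * priors)       # p(x)
posteriors = likelihoods * priors / evidence  # Pr(w_i | x)
print(posteriors)       # [0.278 0.667 0.056] -- sums to 1, as noted above
```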
Bayesian (MAP) classifier
• Optimal classifier is
$$\hat{\omega} = \arg\max_{\omega_i} \Pr(\omega_i \mid x)$$
but we don't know Pr(ωi | x)
• Can model conditional distributions p(x | ωi), then use Bayes' rule to find the MAP class
• Or, can model Pr(ωi | x) directly, e.g. train a neural net to map from inputs x to a set of outputs Pr(ωi)
- discriminative model
[Diagram: labeled training examples {xn, ωxn} → sort according to class → estimate conditional pdf for each class, e.g. p(x | ω1)]
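The diagrammed training route, sketched with a single Gaussian per class as the conditional-pdf model (one simple choice; richer models follow in the next section):

```python
import numpy as np

def fit_class_gaussians(X, y):
    """Sort labeled examples {x_n, w_n} by class and estimate p(x | w_i)
    as one Gaussian per class (mean vector and covariance matrix)."""
    models = {}
    for cls in np.unique(y):
        Xc = X[y == cls]                     # examples of this class only
        models[cls] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return models
```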
Outline
1. Music Content Analysis
2. Classification and Features
3. Statistical Pattern Recognition
4. Gaussian Mixtures and Neural Nets
   - Gaussians
   - Gaussian mixtures
   - Multi-layer perceptrons (MLPs)
   - Training and test data
5. Singing Detection
Gaussian Mixtures and Neural Nets
• Gaussians as parametric distribution models:
$$p(x \mid \omega_i) = \frac{1}{\sqrt{(2\pi)^d \,\lvert \Sigma_i \rvert}} \exp\!\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$$
• Described by a d-dimensional mean vector µi and a d × d covariance matrix Σi
• Classify by maximizing the log likelihood, i.e.
$$\hat{\omega} = \arg\max_{\omega_i} \left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log \lvert \Sigma_i \rvert + \log \Pr(\omega_i)\right]$$
[Figure: 2-D Gaussian pdf surface and its elliptical contours]
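The decision rule above in numpy, assuming per-class (µi, Σi) models such as those from the fit_class_gaussians() sketch earlier:

```python
import numpy as np

def map_classify(x, models, priors):
    """argmax_i of: -1/2 (x-mu_i)' Sigma_i^-1 (x-mu_i)
                    - 1/2 log|Sigma_i| + log Pr(w_i)"""
    best_cls, best_score = None, -np.inf
    for cls, (mu, Sigma) in models.items():
        d = x - mu
        score = (-0.5 * d @ np.linalg.solve(Sigma, d)   # quadratic term
                 - 0.5 * np.linalg.slogdet(Sigma)[1]    # -1/2 log|Sigma|
                 + np.log(priors[cls]))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```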
Gaussian Mixture models (GMMs)
• Weighted sum of Gaussians can fit any PDF:
$$p(x) \approx \sum_k c_k \, p(x \mid m_k)$$
(weights ck; component Gaussians p(x | mk))
- i.e. each observation comes from a randomly chosen single Gaussian?
• Find the ck and mk parameters via EM
- easy if we knew which mk generated each x
[Figure: 1-D data, weighted Gaussian components, and the resulting mixture surface]
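A minimal 1-D EM sketch for the mixture above (no covariance floor, single random init; a library routine such as sklearn's GaussianMixture is the practical choice):

```python
import numpy as np

def gmm_em_1d(x, K=3, n_iter=50):
    """EM for p(x) ~ sum_k c_k N(x; mu_k, var_k), on 1-D data x."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, K)                    # init means from the data
    var = np.full(K, x.var())
    c = np.full(K, 1.0 / K)                  # mixture weights c_k
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point
        p = c * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
              / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates ("easy if we knew which m_k...")
        Nk = r.sum(axis=0)
        c, mu = Nk / len(x), (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return c, mu, var
```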
GMM examples
• Vowel data fit with different mixture counts:
[Figure: vowel data fit with 1, 2, 3, and 4 Gaussians; total log p(x) = -1911, -1864, -1849, -1840 respectively]
Neural networks
• Don't model distributions p(x | ωi); instead, model posteriors Pr(ωi | x)
• Sums over nonlinear functions of sums → large range of decision surfaces
• e.g. multi-layer perceptron (MLP) with 1 hidden layer:
$$y_k = F\!\left[\sum_j w_{jk} \cdot F\!\left[\sum_i w_{ij} \, x_i\right]\right]$$
• Train the weights wij with back-propagation
[Diagram: MLP with input layer (x1..x3), hidden layer (h1, h2) via weights wij, output layer (y1, y2) via weights wjk; each unit applies F[·] to a weighted sum]
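The forward pass of the equation above, as a sketch (biases omitted to match the formula; real nets include them):

```python
import numpy as np

def F(z):
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid, defined on the next slide

def mlp_forward(x, W1, W2):
    """y_k = F[ sum_j w_jk * F[ sum_i w_ij * x_i ] ]"""
    h = F(W1 @ x)                            # hidden layer h_j
    return F(W2 @ h)                         # output layer y_k

# e.g. a 2:5:3 net as on the next slide (random weights, for illustration)
rng = np.random.default_rng(0)
y = mlp_forward(rng.standard_normal(2),
                rng.standard_normal((5, 2)),
                rng.standard_normal((3, 5)))
```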
Neural net example
• 2 input units (normalized F1, F2)
• 5 hidden units, 3 output units (“U”, “O”, “A”)
• Sigmoid nonlinearity:
$$F[x] = \frac{1}{1 + e^{-x}} \quad\Rightarrow\quad \frac{dF}{dx} = F(1 - F)$$
[Figure: decision regions for "U", "O", "A" in the F1/F2 plane]
[Figure: sigm(x) and its derivative (d/dx) sigm(x) over x in -5..5]
Neural net training
• Example: mean squared error and decision contours over training epochs:
[Figure: 2:5:3 net, mean squared error by training epoch (0-100); decision contours @ 10 iterations and @ 100 iterations]
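A minimal batch back-propagation loop for such a net, minimizing mean squared error as in the plot; it uses dF/dx = F(1 - F) from the previous slide (a sketch: no biases, momentum, or early stopping):

```python
import numpy as np

def F(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, T, n_hidden=5, lr=0.5, n_epochs=100):
    """X: (n, d) inputs; T: (n, K) targets in [0, 1]. Returns weights."""
    rng = np.random.default_rng(0)
    W1 = 0.1 * rng.standard_normal((n_hidden, X.shape[1]))
    W2 = 0.1 * rng.standard_normal((T.shape[1], n_hidden))
    for epoch in range(n_epochs):
        H = F(X @ W1.T)                      # forward: hidden activations
        Y = F(H @ W2.T)                      # forward: outputs
        d_out = (Y - T) * Y * (1 - Y)        # output deltas, via F(1 - F)
        d_hid = (d_out @ W2) * H * (1 - H)   # back-propagated hidden deltas
        W2 -= lr * d_out.T @ H / len(X)      # gradient descent steps
        W1 -= lr * d_hid.T @ X / len(X)
    return W1, W2
```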
Aside: Training and test data
• A rich model can learn every training example (overtraining)
• But the goal is to classify new, unseen data, i.e. generalization
- sometimes use a 'cross validation' set to decide when to stop training (see the sketch after the figure below)
• For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly...)
[Figure: error rate vs. training time or parameters; training-data error keeps falling while test-data error turns back up at the overfitting point]
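A sketch of cross-validation early stopping as mentioned above; init_params(), train_one_epoch(), and error_rate() are hypothetical stand-ins for actual training code:

```python
params = init_params()                        # hypothetical initializer
best_err, best_params, since_best = float("inf"), None, 0
for epoch in range(200):
    params = train_one_epoch(params, train_data)
    val_err = error_rate(params, val_data)    # held-out set, never the test set
    if val_err < best_err:
        best_err, best_params, since_best = val_err, params, 0
    else:
        since_best += 1
        if since_best >= 10:                  # validation error stopped improving
            break
```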
Outline
1. Music Content Analysis
2. Classification and Features
3. Statistical Pattern Recognition
4. Gaussian Mixtures and Neural Nets
5. Singing Detection
   - Motivation
   - Features
   - Classifiers
Singing Detection (Berenzweig et al. '01)
• Can we automatically detect when singing is present?
- for further processing (lyrics recognition?)
- as a song signature?
- as a basis for classification?
[Figure: spectrogram of aimee.wav, 0-7 kHz over 0:02-0:28, with hand-marked 'mus' and 'vox' segments]
Singing Detection: Requirements
• Labeled training examples
- 60 x 15 sec. radio excerpts
- hand-mark sung phrases
• Labeled test data
- several complete tracks from CDs, hand-labelled
• Feature choice
- Mel-frequency Cepstral Coefficients (MFCCs): popular for speech; maybe for the sung voice too? (see the sketch after the figure below)
- separation of voices? temporal dimension?
• Classifier choice
- MLP neural net
- GMMs for singing / music
- SVM?
[Figure: spectrograms (freq / kHz vs. time) of training excerpts trn/mus/3, trn/mus/8, trn/mus/19, trn/mus/28, with hand-labeled vox segments]
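One convenient modern route to the MFCC features (librosa post-dates this 2003 talk; the filename and settings here are illustrative only):

```python
import librosa

y, sr = librosa.load("excerpt.wav", sr=16000)       # hypothetical excerpt
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # C0..C12, shape (13, n_frames)
delta = librosa.feature.delta(mfcc)                 # +delta gives 26-d frames
```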
GMM System
• Separate models for p(x | sing) and p(x | no sing)
- combined via a likelihood ratio test
• How many Gaussians for each?
- say 20; depends on data & complexity
• What kind of covariance?
- diagonal (spherical?)
[Diagram: music → MFCC calculation (C0..C12) → GMM1 gives p(X | "singing"), GMM2 gives p(X | "no singing") → log-likelihood ratio test, log p(X | "singing") / p(X | "not") → singing?]
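A sketch of the diagrammed system with sklearn GMMs; the random arrays stand in for (n_frames, 26) MFCC+delta features from the hand-labeled training frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data: stand-ins for labeled singing / no-singing feature frames
rng = np.random.default_rng(0)
X_sing = rng.standard_normal((500, 26))
X_nosing = rng.standard_normal((400, 26)) + 1.0

gmm_sing = GaussianMixture(n_components=20, covariance_type="diag").fit(X_sing)
gmm_not = GaussianMixture(n_components=20, covariance_type="diag").fit(X_nosing)

def llr(X):
    """Per-frame log-likelihood ratio: log p(X|"singing") - log p(X|"not")."""
    return gmm_sing.score_samples(X) - gmm_not.score_samples(X)

is_singing = llr(X_nosing) > 0                # threshold at 0; thr is tunable
```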
GMM Results
• Raw and smoothed results (Best FA=84.9%):
• MLP has the advantage of discriminant training
• Each GMM trains on only its own data subset → faster to train? (2 x 10 min vs. 20 min)
[Figure: Aimee Mann, 'Build That Wall' + hand-labeled vox; spectrogram and frame log-likelihood ratios. 26-d, 20-mixture GMM on 16 ms frames: FA = 65.8% @ thr = 0; smoothed by a 61-point (1 sec) Hann window: FA = 84.9% @ thr = -0.8]
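The smoothing step from the figure, sketched as a 61-point Hann average over the frame-level ratios (61 frames of 16 ms is about 1 sec):

```python
import numpy as np

def smooth(llr_frames, width=61):
    """Normalized Hann-window moving average of frame log-likelihood ratios."""
    w = np.hanning(width)
    return np.convolve(llr_frames, w / w.sum(), mode="same")
```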
MLP Neural Net
• Directly estimate p(singing | x)
- net has 26 inputs (+∆), 15 hidden units, 2 outputs (26:15:2)
• How many hidden units?
- depends on data amount, boundary complexity
• Feature context window?
- useful in speech
• Delta features?
- useful in speech
• Training parameters...
[Diagram: music → MFCC calculation (C0..C12) → MLP → "singing" / "not singing"]
MLP Results
• Raw net outputs on a CD track (FA 74.1%):
• Smoothed for continuity: best FA = 90.5%
[Figure: Aimee Mann, 'Build That Wall' + hand-labeled vox; spectrogram and raw p(singing) from a 26:15:1 netlab MLP on 16 ms frames (FA = 74.1% @ thr = 0.5), with the smoothed p(singing) below]
Artist Classification (Berenzweig et al. 2002)
• Artist label as available stand-in for genre
• Train MLP to classify frames among 21 artists
• Using only "voice" segments: song-level accuracy improves 56.7% → 64.9%
[Figure: frame-level artist posteriors over time for 21 artists (Boards of Canada, Sugarplastic, Belle & Sebastian, Mercury Rev, Cornelius, Richard Davies, DJ Shadow, Mouse on Mars, The Flaming Lips, Aimee Mann, Wilco, XTC, Beck, Built to Spill, Jason Falkner, Oval, Arto Lindsay, Eric Matthews, The Moles, The Roots, Michael Penn), with true voice segments marked; Track 4 - Arto Lindsay (dynvox=Arto, unseg=Oval) and Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee)]
Summary
• Music content analysis: pattern classification
• Basic machine learning methods: neural nets, GMMs
• Singing detection: a classic application
- but... the time dimension?
References
A.L. Berenzweig and D.P.W. Ellis (2001), "Locating Singing Voice Segments within Music Signals", Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio, Mohonk NY, October 2001. http://www.ee.columbia.edu/~dpwe/pubs/waspaa01-singing.pdf

R.O. Duda, P.E. Hart and D.G. Stork (2001), Pattern Classification, 2nd Ed., Wiley, 2001.

E. Scheirer and M. Slaney (1997), "Construction and evaluation of a robust multifeature speech/music discriminator", Proc. IEEE ICASSP, Munich, April 1997. http://www.ee.columbia.edu/~dpwe/e6820/papers/ScheiS97-mussp.pdf