
Page 1: Pattern Recognition Applied to Music Signals

Dan Ellis Pattern Recognition 2003-07-01 - 1

JHU CLSP Summer School

Pattern Recognition Applied to Music Signals

Music Content Analysis

Classification and Features

Statistical Pattern Recognition

Gaussian Mixtures and Neural Nets

Singing Detection

Dan Ellis <[email protected]>
http://www.ee.columbia.edu/~dpwe/muscontent/

Laboratory for Recognition and Organization of Speech and Audio
Columbia University, New York

July 1st, 2003


Page 2: Pattern Recognition Applied to Music Signals


Music Content Analysis

• Music contains information at many levels
  - what is it?

• We'd like to get this information out automatically
  - fine-level transcription of events
  - broad-level classification of pieces

• Information extraction can be framed as pattern classification / recognition, or machine learning
  - build systems based on (labeled) training data

Page 3: Pattern Recognition Applied to Music Signals


Music analysis

• What information can we get from music?

• Score recovery

- extract the ‘performance’

• Instrument identification

• Ensemble performance

- ‘gestalts’: chords, tone colors

• Broader timescales

- phrasing & musical structure
- artist / genre clustering and classification

[Figure: spectrogram, frequency 0-4000 Hz vs. time 0-5 s]

Page 4: Pattern Recognition Applied to Music Signals


Outline

Music Content Analysis

Classification and Features

- classification
- spectrograms
- cepstra

Statistical Pattern Recognition

Gaussian Mixtures and Neural Nets

Singing Detection


Page 5: Pattern Recognition Applied to Music Signals


Classification and Features

• Classification means finding categorical (discrete) labels for real-world (continuous) observations

• Problems
  - parameter tuning
  - feature overlap

[Figures: spectrogram (f/Hz vs. time/s, 0-4000 Hz over 0.2-1.0 s); scatter of vowel tokens ('ay', 'ao') in the (F1/Hz, F2/Hz) plane]

Page 6: Pattern Recognition Applied to Music Signals


Classification system parts

• Right features are critical

- places an upper bound on classifier performance
- should make important aspects visible
- invariance under irrelevant modifications

[Diagram: sensor → signal → pre-processing/segmentation (e.g. STFT, locate vowels) → segment → feature extraction (e.g. formant extraction) → feature vector → classification → class → post-processing (e.g. context constraints, costs/risk)]

Page 7: Pattern Recognition Applied to Music Signals


The Spectrogram

• Short-time Fourier transform:

  X[k, m] = Σ_{n=0}^{N-1} x[n] · w[n - mL] · exp( -j 2πk (n - mL) / N )

• Plot the STFT magnitude |X[k, m]| as a grayscale image:

[Figure: waveform and spectrogram of a short excerpt; freq / Hz (0-4000) vs. time / s, intensity / dB (-50 to +10), with a zoom on 2.35-2.6 s]
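As a concrete sketch, the STFT above maps directly onto NumPy. The window type (Hann), length N = 256 and hop L = 128 are illustrative choices, not values from the slides:

```python
import numpy as np

def stft(x, N=256, L=128):
    """X[k, m] = sum_n x[n] w[n - mL] exp(-j 2pi k (n - mL) / N)."""
    w = np.hanning(N)                        # analysis window w[n]
    n_frames = 1 + (len(x) - N) // L         # number of hops m that fit
    X = np.empty((N // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        X[:, m] = np.fft.rfft(x[m * L : m * L + N] * w)
    return X

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)             # 1 kHz test tone
X = stft(x)
# The grayscale image is then 20*log10(|X|); the tone shows up as one bright row.
peak_bin = np.argmax(np.abs(X[:, 10]))       # strongest bin in one frame
print(peak_bin * sr / 256)                   # 1000.0 (the test-tone frequency)
```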

Page 8: Pattern Recognition Applied to Music Signals


Cepstra

• Spectrograms are good for visualization; cepstra are preferred for classification
  - DCT of the log-magnitude STFT:

    c_k = idft( log |X[k, m]| )

• Cepstra capture coarse spectral information in fewer dimensions, with less correlation between coefficients:
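A single cepstral frame can be sketched the same way; the 512-sample frame and the 13 retained coefficients are illustrative choices:

```python
import numpy as np

def cepstrum(frame, n_coef=13):
    """Real cepstrum of one frame: c = idft( log |X[k]| ), keeping low coefficients."""
    spec = np.abs(np.fft.rfft(frame))
    c = np.fft.irfft(np.log(spec + 1e-10))   # inverse DFT of the log magnitude
    return c[:n_coef]                        # low quefrencies = coarse spectral shape

frame = np.random.randn(512)                 # any 512-sample frame
c = cepstrum(frame)
print(c.shape)                               # (13,)
```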

[Figure: 'Features' panels showing an auditory spectrum (20 bands) and cepstral coefficients (20) over ~150 frames, the covariance matrix of each feature set, and an example joint distribution of coefficients (10, 15)]

Page 9: Pattern Recognition Applied to Music Signals


Outline

Music Content Analysis

Classification and Features

Statistical Pattern Recognition

- Priors and posteriors
- Bayesian classifier

Gaussian Mixtures and Neural Nets

Singing Detection


Page 10: Pattern Recognition Applied to Music Signals


Statistical Pattern Recognition

• Observations are random variables whose distribution depends on the class:

• Source distributions p(x|ωi)
  - reflect variability in the feature
  - reflect noise in the observation
  - generally have to be estimated from data (rather than known in advance)

[Diagram: a hidden discrete class ωi generates a continuous observation x; p(x|ωi) maps class to observation and Pr(ωi|x) inverts it; example class-conditional densities p(x|ωi) for classes ω1-ω4 along x]

Page 11: Pattern Recognition Applied to Music Signals


Priors and posteriors

• Bayesian inference can be interpreted as updating prior beliefs with new information, x:

• The posterior is the prior scaled by the likelihood and normalized by the evidence (so Σ(posteriors) = 1)

• Objection: priors are often unknown

- but omitting them amounts to assuming they are all equal

Bayes' Rule:

  Pr(ωi|x) = ( p(x|ωi) · Pr(ωi) ) / ( Σ_j p(x|ωj) · Pr(ωj) )

where Pr(ωi|x) is the posterior probability, p(x|ωi) the likelihood, Pr(ωi) the prior probability, and the denominator the 'evidence' p(x).
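The update can be checked numerically; the priors and likelihoods below are made-up values for three classes:

```python
import numpy as np

priors = np.array([0.6, 0.3, 0.1])          # Pr(w_i), hypothetical
likelihoods = np.array([0.2, 0.5, 0.9])     # p(x|w_i) for one observation x

evidence = np.sum(likelihoods * priors)     # p(x) = sum_j p(x|w_j) Pr(w_j)
posteriors = likelihoods * priors / evidence
# The observation shifts belief toward the classes with high likelihood,
# and the three posteriors sum to 1.
print(posteriors, posteriors.sum())
```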

Page 12: Pattern Recognition Applied to Music Signals


Bayesian (MAP) classifier

• The optimal classifier is

  ω̂ = argmax_{ωi} Pr(ωi|x)

  but we don't know Pr(ωi|x)

• Can model the conditional distributions p(x|ωi), then use Bayes' rule to find the MAP class

• Or, can model Pr(ωi|x) directly, e.g. train a neural net to map from inputs x to a set of outputs Pr(ωi|x)
  - a discriminative model

[Diagram: labeled training examples {xn, ωxn} are sorted according to class, then a conditional pdf p(x|ω1) is estimated for each class]

Page 13: Pattern Recognition Applied to Music Signals


Outline

Music Content Analysis

Classification and Features

Statistical Pattern Recognition

Gaussian Mixtures and Neural Nets
- Gaussians
- Gaussian mixtures
- Multi-layer perceptrons (MLPs)
- Training and test data

Singing Detection


Page 14: Pattern Recognition Applied to Music Signals


Gaussian Mixtures and Neural Nets

• Gaussians as parametric distribution models:

• Described by a d-dimensional mean vector µi and a d × d covariance matrix Σi

  p(x|ωi) = 1 / ( (2π)^d |Σi| )^{1/2} · exp( -(1/2) (x - µi)^T Σi^{-1} (x - µi) )

[Figure: a 2-D Gaussian pdf surface and its elliptical contours]

• Classify by maximizing the log likelihood, i.e.

  argmax_{ωi} [ -(1/2) (x - µi)^T Σi^{-1} (x - µi) - (1/2) log |Σi| + log Pr(ωi) ]
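A minimal sketch of this decision rule, with two hypothetical 2-D Gaussian classes and equal priors:

```python
import numpy as np

def gaussian_loglik(x, mu, Sigma):
    """log p(x|w_i) up to the class-independent constant -(d/2) log(2 pi)."""
    d = x - mu
    return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * np.log(np.linalg.det(Sigma))

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # hypothetical class means
Sigmas = [np.eye(2), np.eye(2)]                      # unit covariances
log_priors = np.log([0.5, 0.5])                      # equal priors

x = np.array([2.8, 3.1])                             # test point near class 1
scores = [gaussian_loglik(x, m, S) + lp
          for m, S, lp in zip(mus, Sigmas, log_priors)]
print(int(np.argmax(scores)))                        # 1
```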

Page 15: Pattern Recognition Applied to Music Signals


Gaussian Mixture models (GMMs)

• A weighted sum of Gaussians can fit any pdf:

  p(x) ≈ Σ_k c_k · p(x|m_k)    (mixture weights c_k, component Gaussians p(x|m_k))

  i.e. each observation comes from one randomly chosen Gaussian?

• Find the c_k and m_k parameters via EM
  - easy if we knew which m_k generated each x
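A sketch of the EM fit using scikit-learn's GaussianMixture (which runs EM internally); the two synthetic 1-D clumps stand in for real feature data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data from two clumps (a stand-in for the vowel data)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])[:, None]

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)   # EM under the hood
means = np.sort(gmm.means_.ravel())
print(means)                      # component means near 0 and 5
print(gmm.weights_.sum())         # mixture weights c_k sum to 1
```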

[Figure: 1-D GMM fit showing the original data, the weighted Gaussian components, and the resulting pdf surface]

Page 16: Pattern Recognition Applied to Music Signals


GMM examples

• Vowel data fit with different mixture counts:

[Figure: vowel data (F1 vs. F2, Hz) fit with 1, 2, 3 and 4 Gaussians; log p(x) = -1911, -1864, -1849 and -1840 respectively]

Page 17: Pattern Recognition Applied to Music Signals


Neural networks

• Don't model the distributions p(x|ωi); instead, model the posteriors Pr(ωi|x)

• Sums over nonlinear functions of sums → a large range of decision surfaces

• e.g. a multi-layer perceptron (MLP) with one hidden layer:

  y_k = F[ Σ_j w_jk · F[ Σ_i w_ij · x_i ] ]

• Train the weights wij with back-propagation
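The forward pass takes only a few lines of NumPy. This sketch uses a 2-input, 5-hidden, 3-output net with random (untrained) weights; biases are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # F[x]; note dF/dx = F (1 - F)

def mlp_forward(x, W_ih, W_ho):
    """y_k = F[ sum_j w_jk F[ sum_i w_ij x_i ] ], one hidden layer, no biases."""
    h = sigmoid(W_ih @ x)                 # hidden-unit activations
    return sigmoid(W_ho @ h)              # outputs, one per class

rng = np.random.default_rng(1)
W_ih = rng.normal(size=(5, 2))            # 2 inputs -> 5 hidden units
W_ho = rng.normal(size=(3, 5))            # 5 hidden -> 3 outputs
y = mlp_forward(np.array([0.3, -0.8]), W_ih, W_ho)
print(y.shape, y.min() > 0, y.max() < 1)  # sigmoid outputs stay in (0, 1)
```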

[Diagram: MLP with input layer (x1, x2, x3), hidden layer (h1, h2) and output layer (y1, y2); weights wij between input and hidden units and wjk between hidden and output units, each unit applying F[·] to a weighted sum]

Page 18: Pattern Recognition Applied to Music Signals


Neural net example

• 2 input units (normalized F1, F2)

• 5 hidden units, 3 output units (“U”, “O”, “A”)

• Sigmoid nonlinearity:

  F[x] = 1 / (1 + e^{-x})   ⇒   dF/dx = F · (1 - F)

[Figures: decision regions for "U", "O", "A" in the (F1, F2) plane; sigm(x) and its derivative d/dx sigm(x) over x ∈ [-5, 5]]

Page 19: Pattern Recognition Applied to Music Signals


Neural net training

• Example:

[Figure: 2:5:3 net, mean squared error by training epoch (0-100); decision contours after 10 and after 100 training iterations]

Page 20: Pattern Recognition Applied to Music Signals


Aside: Training and test data

• A rich model can learn every training example (overtraining)

• But the goal is to classify new, unseen data, i.e. generalization
  - sometimes a 'cross-validation' set is used to decide when to stop training

• For evaluation results to be meaningful:
  - don't test with training data!
  - don't train on test data (even indirectly...)

[Figure: error rate vs. training time or number of parameters; training-data error keeps falling while test-data error eventually rises again (overfitting)]
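The effect can be sketched with a polynomial fit on synthetic data: a richer model always fits the training split at least as well, which is exactly why training error is a misleading evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)       # noisy target

train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)   # disjoint train / test split

def errors(deg):
    coef = np.polyfit(x[train], y[train], deg)           # fit on training data only
    mse = lambda idx: np.mean((np.polyval(coef, x[idx]) - y[idx]) ** 2)
    return mse(train), mse(test)

tr3, te3 = errors(3)
tr9, te9 = errors(9)
# Training error can only fall as the model gets richer; test error need not.
print(tr3, te3)
print(tr9, te9)
```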

Page 21: Pattern Recognition Applied to Music Signals


Outline

Music Content Analysis

Classification and Features

Statistical Pattern Recognition

Gaussian Mixtures and Neural Nets

Singing Detection
- Motivation
- Features
- Classifiers


Page 22: Pattern Recognition Applied to Music Signals


Singing Detection (Berenzweig et al. '01)

• Can we automatically detect when singing is present?

- for further processing (lyrics recognition?)
- as a song signature?
- as a basis for classification?

[Figure: spectrogram of aimee.wav (0-7 kHz, 0:02-0:28) hand-annotated with alternating 'mus' and 'vox' segments]

Page 23: Pattern Recognition Applied to Music Signals


Singing Detection: Requirements

• Labeled training examples
  - 60 x 15 sec. radio excerpts
  - hand-mark sung phrases

• Labeled test data
  - several complete tracks from CDs, hand-labelled

• Feature choice
  - Mel-frequency cepstral coefficients (MFCCs): popular for speech; maybe sung voice too?
  - separation of voices? temporal dimension?

• Classifier choice
  - MLP neural net
  - GMMs for singing / music
  - SVM?
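One common MFCC recipe (power spectrum → triangular mel filterbank → log → DCT) can be sketched as follows; the filterbank parameters are typical values, not necessarily those used in these experiments:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame, sr=16000, n_mels=26, n_ceps=13):
    """One frame of MFCCs: power spectrum -> mel filterbank -> log -> DCT."""
    N = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(N))) ** 2
    mel = lambda f: 2595 * np.log10(1 + f / 700)          # Hz -> mel
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)         # mel -> Hz
    edges = imel(np.linspace(0, mel(sr / 2), n_mels + 2)) # triangle edges in Hz
    freqs = np.fft.rfftfreq(N, 1 / sr)
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                   (hi - freqs) / (hi - c)), 0, None)
    logmel = np.log(fb @ spec + 1e-10)                    # log mel-band energies
    return dct(logmel, norm='ortho')[:n_ceps]             # keep the low coefficients

c = mfcc_frame(np.random.randn(512))
print(c.shape)                                            # (13,)
```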

[Figure: spectrograms (0-10 kHz, ~12 s) of training excerpts trn/mus/3, trn/mus/8, trn/mus/19 and trn/mus/28, with hand-labeled vox segments marked]

Page 24: Pattern Recognition Applied to Music Signals


GMM System

• Separate models for p(x|singing) and p(x|no singing)
  - combined via a likelihood ratio test

• How many Gaussians for each?
  - say 20; depends on data & complexity

• What kind of covariance?
  - diagonal (spherical?)

[Diagram: music → MFCC calculation (C0, C1, ..., C12) → GMM1 giving p(X|"singing") and GMM2 giving p(X|"no singing") → log-likelihood ratio test log( p(X|"singing") / p(X|"not") ) → singing?]

Page 25: Pattern Recognition Applied to Music Signals


GMM Results

• Raw and smoothed results (Best FA=84.9%):

• MLP has advantage of discriminant training

• Each GMM trains only on data subset → faster to train? (2 x 10 min vs. 20 min)
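The smoothing step quoted in the caption (a 61-point Hann window is about 1 s at 16 ms frames) might look like:

```python
import numpy as np

def smooth(scores, n=61):
    """Smooth frame scores with an n-point Hann window (61 pt = 1 s at 16 ms frames)."""
    w = np.hanning(n)
    return np.convolve(scores, w / w.sum(), mode='same')

frames = np.sign(np.random.default_rng(0).normal(size=500))   # noisy +/-1 decisions
s = smooth(frames)
print(s.shape, np.abs(s).max() <= 1.0)    # same length, values stay in range
```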

[Figure: Aimee Mann, "Build That Wall", with hand-labeled vox: spectrogram plus frame-level log-likelihood ratios from a 26-d, 20-mixture GMM on 16 ms frames (FA=65.8% @ thr=0), and the same smoothed by a 61-point (1 sec) Hann window (FA=84.9% @ thr=-0.8)]

Page 26: Pattern Recognition Applied to Music Signals


MLP Neural Net

• Directly estimate p(singing | x)
  - net has 26 inputs (+∆), 15 HUs, 2 o/ps (26:15:2)

• How many hidden units?
  - depends on data amount, boundary complexity

• Feature context window?
  - useful in speech

• Delta features?
  - useful in speech

• Training parameters...
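A discriminative counterpart can be sketched with scikit-learn's MLPClassifier; the 26-d synthetic features and (15,) hidden layer mirror the 26:15:2 shape above, but the data and all other settings are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic 26-d feature frames standing in for MFCCs (+ deltas)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.5, 1, (200, 26)), rng.normal(-0.5, 1, (200, 26))])
y = np.array([1] * 200 + [0] * 200)        # 1 = singing, 0 = not singing

net = MLPClassifier(hidden_layer_sizes=(15,), max_iter=500,
                    random_state=0).fit(X, y)
p = net.predict_proba(X[:1])               # [[Pr(not|x), Pr(singing|x)]]
print(p.shape)                             # (1, 2); each row sums to 1
```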

[Diagram: music → MFCC calculation (C0, C1, ..., C12) → MLP → "singing" / "not singing"]

Page 27: Pattern Recognition Applied to Music Signals


MLP Results

• Raw net outputs on a CD track (FA 74.1%):

• Smoothed for continuity: best FA = 90.5%

[Figure: Aimee Mann, "Build That Wall", with hand-labeled vox: spectrogram and raw 26:15:1 netlab outputs p(singing) on 16 ms frames (FA=74.1% @ thr=0.5), plus the smoothed outputs]

Page 28: Pattern Recognition Applied to Music Signals


Artist Classification (Berenzweig et al. 2002)

• Artist label as available stand-in for genre

• Train MLP to classify frames among 21 artists

• Using only "voice" segments: song-level accuracy improves 56.7% → 64.9%

[Figure: frame-level classifier outputs across the 21 artists (Boards of Canada, Sugarplastic, Belle & Sebastian, Mercury Rev, Cornelius, Richard Davies, Dj Shadow, Mouse on Mars, The Flaming Lips, Aimee Mann, Wilco, XTC, Beck, Built to Spill, Jason Falkner, Oval, Arto Lindsay, Eric Matthews, The Moles, The Roots, Michael Penn) with true voice segments, for Track 4 - Arto Lindsay (dynvox=Arto, unseg=Oval) and Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee)]

Page 29: Pattern Recognition Applied to Music Signals


Summary

• Music content analysis: pattern classification

• Basic machine learning methods: neural nets, GMMs

• Singing detection: a classic application
  - but... the time dimension?

Page 30: Pattern Recognition Applied to Music Signals


References

A.L. Berenzweig and D.P.W. Ellis (2001), "Locating Singing Voice Segments within Music Signals", Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio, Mohonk NY, October 2001. http://www.ee.columbia.edu/~dpwe/pubs/waspaa01-singing.pdf

R.O. Duda, P.E. Hart and D.G. Stork (2001), Pattern Classification, 2nd Ed., Wiley, 2001.

E. Scheirer and M. Slaney (1997), "Construction and evaluation of a robust multifeature speech/music discriminator", Proc. IEEE ICASSP, Munich, April 1997. http://www.ee.columbia.edu/~dpwe/e6820/papers/ScheiS97-mussp.pdf