JHU CLSP Summer School
Pattern Recognition Applied to Music Signals
1. Music Content Analysis
2. Classification and Features
3. Statistical Pattern Recognition
4. Gaussian Mixtures and Neural Nets
5. Singing Detection
Dan Ellis <[email protected]>
http://www.ee.columbia.edu/~dpwe/muscontent/
Laboratory for Recognition and Organization of Speech and Audio, Columbia University, New York
July 1st, 2003
Music Content Analysis
• Music contains information at many levels
- what is it?
• We’d like to get this information out automatically
- fine-level transcription of events
- broad-level classification of pieces
• Information extraction can be framed as pattern classification / recognition, or machine learning
- build systems based on (labeled) training data
Music analysis
• What information can we get from music?
• Score recovery
- extract the ‘performance’
• Instrument identification
• Ensemble performance
- ‘gestalts’: chords, tone colors
• Broader timescales
- phrasing & musical structure
- artist / genre clustering and classification
[Figure: spectrogram of a music excerpt, frequency (0-4000 Hz) vs. time (0-5 s)]
Outline
1. Music Content Analysis
2. Classification and Features
   - classification
   - spectrograms
   - cepstra
3. Statistical Pattern Recognition
4. Gaussian Mixtures and Neural Nets
5. Singing Detection
Classification and Features
• Classification means: finding categorical (discrete) labels for real-world (continuous) observations
• Problems
- parameter tuning
- feature overlap
[Figure: spectrogram (f / Hz vs. time / s) of speech, and vowel tokens 'ay', 'ao' scattered in F1/F2 space]
Classification system parts
• Right features are critical
- place an upper bound on classifier performance
- should make important aspects visible
- invariance under irrelevant modifications
[Diagram: sensor → signal → pre-processing/segmentation (e.g. STFT, locate vowels) → segment → feature extraction (e.g. formant extraction) → feature vector → classification → class → post-processing (context constraints, costs/risk)]
The Spectrogram
• Short-time Fourier transform:
$$X[k, m] = \sum_{n=0}^{N-1} x[n] \cdot w[n - mL] \cdot \exp\!\left(-j\,\frac{2\pi k (n - mL)}{N}\right)$$
• Plot |X[k, m]| as a grayscale image:
[Figure: waveform and spectrogram of an excerpt; freq / Hz (0-4000) vs. time / s (0-2.5), intensity / dB (-50 to 10), with a zoomed detail over 2.35-2.6 s]
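As a concrete sketch of the equation above, a minimal numpy STFT; the window length N, hop L, and Hann window are illustrative choices, not from the slides:

```python
import numpy as np

def stft(x, N=1024, L=256):
    """X[k, m] as defined above: windowed DFT of frames hopped by L samples.
    N (DFT/window length), L (hop), and the Hann window are assumptions."""
    w = np.hanning(N)                        # window w[n]
    n_frames = 1 + (len(x) - N) // L
    X = np.empty((N // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        X[:, m] = np.fft.rfft(x[m * L : m * L + N] * w)  # sum over n = 0..N-1
    return X

# grayscale display: imshow(20 * np.log10(np.abs(X) + 1e-10))
```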
Cepstra
• Spectrograms are good for visualization; the cepstrum is preferred for classification
- DCT of the log STFT:
$$c_k = \mathrm{idft}\!\left(\log X[k, m]\right)$$
• Cepstra capture coarse information in fewer dimensions with less correlation:
[Figure: auditory spectrum and cepstral coefficients over ~150 frames; feature covariance matrix; example joint distribution of coefficients (10, 15)]
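A minimal sketch of the cepstrum computation above, assuming X comes from the stft() sketch earlier; keeping only the first few coefficients retains the coarse spectral shape:

```python
import numpy as np

def cepstra(X, n_ceps=13):
    """c = idft(log |X[k, m]|) per frame; keep the low-order (coarse,
    least-correlated) coefficients. n_ceps=13 is an assumed choice."""
    logmag = np.log(np.abs(X) + 1e-10)       # log spectrum; eps avoids log(0)
    c = np.fft.irfft(logmag, axis=0)         # inverse DFT along frequency
    return c[:n_ceps, :]
```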
Outline
1. Music Content Analysis
2. Classification and Features
3. Statistical Pattern Recognition
   - Priors and posteriors
   - Bayesian classifier
4. Gaussian Mixtures and Neural Nets
5. Singing Detection
Statistical Pattern Recognition
• Observations are random variables whose distribution depends on the class:
• Source distributions p(x | ωi)
- reflect variability in the feature
- reflect noise in observation
- generally have to be estimated from data (rather than known in advance)
[Diagram: hidden discrete class ωi generates a continuous observation x via p(x | ωi); inference recovers Pr(ωi | x). Example: overlapping class-conditional densities p(x | ωi) for ω1..ω4 along x]
Priors and posteriors
• Bayesian inference can be interpreted as updating prior beliefs with new information, x:

Bayes' Rule:
$$\Pr(\omega_i \mid x) = \frac{p(x \mid \omega_i) \cdot \Pr(\omega_i)}{\sum_j p(x \mid \omega_j) \cdot \Pr(\omega_j)}$$
(posterior probability = likelihood × prior probability / 'evidence', where evidence = p(x))

• Posterior is the prior scaled by the likelihood & normalized by the evidence (so Σ(posteriors) = 1)
• Objection: priors are often unknown
- but omitting them amounts to assuming they are all equal
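A worked instance of Bayes' rule, with made-up numbers for three classes:

```python
import numpy as np

priors = np.array([0.5, 0.3, 0.2])            # Pr(w_i), illustrative values
likelihoods = np.array([0.10, 0.40, 0.05])    # p(x | w_i) at one observation x

evidence = np.sum(likelihoods * priors)       # p(x)
posteriors = likelihoods * priors / evidence  # Pr(w_i | x)
print(posteriors)       # [0.278 0.667 0.056] -- sums to 1, as noted above
```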
Bayesian (MAP) classifier
• Optimal classifier is
$$\hat{\omega} = \arg\max_{\omega_i} \Pr(\omega_i \mid x)$$
but we don't know Pr(ωi | x)
• Can model conditional distributions p(x | ωi), then use Bayes' rule to find the MAP class
• Or, can model Pr(ωi | x) directly, e.g. train a neural net to map from inputs x to a set of outputs Pr(ωi)
- discriminative model
[Diagram: labeled training examples {xn, ωxn} → sort according to class → estimate conditional pdf for each class, e.g. p(x | ω1)]
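The diagrammed training route, sketched with a single Gaussian per class as the conditional-pdf model (one simple choice; richer models follow in the next section):

```python
import numpy as np

def fit_class_gaussians(X, y):
    """Sort labeled examples {x_n, w_n} by class and estimate p(x | w_i)
    as one Gaussian per class (mean vector and covariance matrix)."""
    models = {}
    for cls in np.unique(y):
        Xc = X[y == cls]                     # examples of this class only
        models[cls] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return models
```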
Outline
1. Music Content Analysis
2. Classification and Features
3. Statistical Pattern Recognition
4. Gaussian Mixtures and Neural Nets
   - Gaussians
   - Gaussian mixtures
   - Multi-layer perceptrons (MLPs)
   - Training and test data
5. Singing Detection
Gaussian Mixtures and Neural Nets
• Gaussians as parametric distribution models:
$$p(x \mid \omega_i) = \frac{1}{\sqrt{(2\pi)^d \,\lvert \Sigma_i \rvert}} \exp\!\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$$
• Described by a d-dimensional mean vector µi and a d × d covariance matrix Σi
• Classify by maximizing the log likelihood, i.e.
$$\hat{\omega} = \arg\max_{\omega_i} \left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log \lvert \Sigma_i \rvert + \log \Pr(\omega_i)\right]$$
[Figure: 2-D Gaussian pdf surface and its elliptical contours]
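The decision rule above in numpy, assuming per-class (µi, Σi) models such as those from the fit_class_gaussians() sketch earlier:

```python
import numpy as np

def map_classify(x, models, priors):
    """argmax_i of: -1/2 (x-mu_i)' Sigma_i^-1 (x-mu_i)
                    - 1/2 log|Sigma_i| + log Pr(w_i)"""
    best_cls, best_score = None, -np.inf
    for cls, (mu, Sigma) in models.items():
        d = x - mu
        score = (-0.5 * d @ np.linalg.solve(Sigma, d)   # quadratic term
                 - 0.5 * np.linalg.slogdet(Sigma)[1]    # -1/2 log|Sigma|
                 + np.log(priors[cls]))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```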
Gaussian Mixture models (GMMs)
• Weighted sum of Gaussians can fit any PDF:
$$p(x) \approx \sum_k c_k \, p(x \mid m_k)$$
(weights ck; component Gaussians p(x | mk))
- i.e. each observation comes from a randomly chosen single Gaussian?
• Find the ck and mk parameters via EM
- easy if we knew which mk generated each x
[Figure: 1-D data, weighted Gaussian components, and the resulting mixture surface]
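A minimal 1-D EM sketch for the mixture above (no covariance floor, single random init; a library routine such as sklearn's GaussianMixture is the practical choice):

```python
import numpy as np

def gmm_em_1d(x, K=3, n_iter=50):
    """EM for p(x) ~ sum_k c_k N(x; mu_k, var_k), on 1-D data x."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, K)                    # init means from the data
    var = np.full(K, x.var())
    c = np.full(K, 1.0 / K)                  # mixture weights c_k
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point
        p = c * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
              / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates ("easy if we knew which m_k...")
        Nk = r.sum(axis=0)
        c, mu = Nk / len(x), (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return c, mu, var
```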
GMM examples
• Vowel data fit with different mixture counts:
[Figure: vowel data fit with 1, 2, 3, and 4 Gaussians; total log p(x) = -1911, -1864, -1849, -1840 respectively]
Neural networks
• Don't model distributions p(x | ωi); instead, model posteriors Pr(ωi | x)
• Sums over nonlinear functions of sums → large range of decision surfaces
• e.g. multi-layer perceptron (MLP) with 1 hidden layer:
$$y_k = F\!\left[\sum_j w_{jk} \cdot F\!\left[\sum_i w_{ij} \, x_i\right]\right]$$
• Train the weights wij with back-propagation
[Diagram: MLP with input layer (x1..x3), hidden layer (h1, h2) via weights wij, output layer (y1, y2) via weights wjk; each unit applies F[·] to a weighted sum]
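The forward pass of the equation above, as a sketch (biases omitted to match the formula; real nets include them):

```python
import numpy as np

def F(z):
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid, defined on the next slide

def mlp_forward(x, W1, W2):
    """y_k = F[ sum_j w_jk * F[ sum_i w_ij * x_i ] ]"""
    h = F(W1 @ x)                            # hidden layer h_j
    return F(W2 @ h)                         # output layer y_k

# e.g. a 2:5:3 net as on the next slide (random weights, for illustration)
rng = np.random.default_rng(0)
y = mlp_forward(rng.standard_normal(2),
                rng.standard_normal((5, 2)),
                rng.standard_normal((3, 5)))
```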
Neural net example
• 2 input units (normalized F1, F2)
• 5 hidden units, 3 output units (“U”, “O”, “A”)
• Sigmoid nonlinearity:
$$F[x] = \frac{1}{1 + e^{-x}} \quad\Rightarrow\quad \frac{dF}{dx} = F(1 - F)$$
[Figure: decision regions for "U", "O", "A" in the F1/F2 plane]
[Figure: sigm(x) and its derivative (d/dx) sigm(x) over x in -5..5]
Neural net training
• Example: mean squared error and decision contours over training epochs:
[Figure: 2:5:3 net, mean squared error by training epoch (0-100); decision contours @ 10 iterations and @ 100 iterations]
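A minimal batch back-propagation loop for such a net, minimizing mean squared error as in the plot; it uses dF/dx = F(1 - F) from the previous slide (a sketch: no biases, momentum, or early stopping):

```python
import numpy as np

def F(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, T, n_hidden=5, lr=0.5, n_epochs=100):
    """X: (n, d) inputs; T: (n, K) targets in [0, 1]. Returns weights."""
    rng = np.random.default_rng(0)
    W1 = 0.1 * rng.standard_normal((n_hidden, X.shape[1]))
    W2 = 0.1 * rng.standard_normal((T.shape[1], n_hidden))
    for epoch in range(n_epochs):
        H = F(X @ W1.T)                      # forward: hidden activations
        Y = F(H @ W2.T)                      # forward: outputs
        d_out = (Y - T) * Y * (1 - Y)        # output deltas, via F(1 - F)
        d_hid = (d_out @ W2) * H * (1 - H)   # back-propagated hidden deltas
        W2 -= lr * d_out.T @ H / len(X)      # gradient descent steps
        W1 -= lr * d_hid.T @ X / len(X)
    return W1, W2
```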
Aside: Training and test data
• A rich model can learn every training example (overtraining)
• But the goal is to classify new, unseen data, i.e. generalization
- sometimes use a 'cross validation' set to decide when to stop training (see the sketch after the figure below)
• For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly...)
[Figure: error rate vs. training time or parameters; training-data error keeps falling while test-data error turns back up at the overfitting point]
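A sketch of cross-validation early stopping as mentioned above; init_params(), train_one_epoch(), and error_rate() are hypothetical stand-ins for actual training code:

```python
params = init_params()                        # hypothetical initializer
best_err, best_params, since_best = float("inf"), None, 0
for epoch in range(200):
    params = train_one_epoch(params, train_data)
    val_err = error_rate(params, val_data)    # held-out set, never the test set
    if val_err < best_err:
        best_err, best_params, since_best = val_err, params, 0
    else:
        since_best += 1
        if since_best >= 10:                  # validation error stopped improving
            break
```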
Outline
1. Music Content Analysis
2. Classification and Features
3. Statistical Pattern Recognition
4. Gaussian Mixtures and Neural Nets
5. Singing Detection
   - Motivation
   - Features
   - Classifiers
Singing Detection (Berenzweig et al. '01)
• Can we automatically detect when singing is present?
- for further processing (lyrics recognition?)
- as a song signature?
- as a basis for classification?
[Figure: spectrogram of aimee.wav, 0-7 kHz over 0:02-0:28, with hand-marked 'mus' and 'vox' segments]
Singing Detection: Requirements
• Labeled training examples
- 60 x 15 sec. radio excerpts
- hand-mark sung phrases
• Labeled test data
- several complete tracks from CDs, hand-labelled
• Feature choice
- Mel-frequency Cepstral Coefficients (MFCCs): popular for speech; maybe for the sung voice too? (see the sketch after the figure below)
- separation of voices? temporal dimension?
• Classifier choice
- MLP neural net
- GMMs for singing / music
- SVM?
[Figure: spectrograms (freq / kHz vs. time) of training excerpts trn/mus/3, trn/mus/8, trn/mus/19, trn/mus/28, with hand-labeled vox segments]
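One convenient modern route to the MFCC features (librosa post-dates this 2003 talk; the filename and settings here are illustrative only):

```python
import librosa

y, sr = librosa.load("excerpt.wav", sr=16000)       # hypothetical excerpt
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # C0..C12, shape (13, n_frames)
delta = librosa.feature.delta(mfcc)                 # +delta gives 26-d frames
```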
GMM System
• Separate models for p(x | sing) and p(x | no sing)
- combined via a likelihood ratio test
• How many Gaussians for each?
- say 20; depends on data & complexity
• What kind of covariance?
- diagonal (spherical?)
[Diagram: music → MFCC calculation (C0..C12) → GMM1 gives p(X | "singing"), GMM2 gives p(X | "no singing") → log-likelihood ratio test, log p(X | "singing") / p(X | "not") → singing?]
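A sketch of the diagrammed system with sklearn GMMs; the random arrays stand in for (n_frames, 26) MFCC+delta features from the hand-labeled training frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data: stand-ins for labeled singing / no-singing feature frames
rng = np.random.default_rng(0)
X_sing = rng.standard_normal((500, 26))
X_nosing = rng.standard_normal((400, 26)) + 1.0

gmm_sing = GaussianMixture(n_components=20, covariance_type="diag").fit(X_sing)
gmm_not = GaussianMixture(n_components=20, covariance_type="diag").fit(X_nosing)

def llr(X):
    """Per-frame log-likelihood ratio: log p(X|"singing") - log p(X|"not")."""
    return gmm_sing.score_samples(X) - gmm_not.score_samples(X)

is_singing = llr(X_nosing) > 0                # threshold at 0; thr is tunable
```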
GMM Results
• Raw and smoothed results (Best FA=84.9%):
• MLP has the advantage of discriminant training
• Each GMM trains on only its own data subset → faster to train? (2 x 10 min vs. 20 min)
[Figure: Aimee Mann, 'Build That Wall' + hand-labeled vox; spectrogram and frame log-likelihood ratios. 26-d, 20-mixture GMM on 16 ms frames: FA = 65.8% @ thr = 0; smoothed by a 61-point (1 sec) Hann window: FA = 84.9% @ thr = -0.8]
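The smoothing step from the figure, sketched as a 61-point Hann average over the frame-level ratios (61 frames of 16 ms is about 1 sec):

```python
import numpy as np

def smooth(llr_frames, width=61):
    """Normalized Hann-window moving average of frame log-likelihood ratios."""
    w = np.hanning(width)
    return np.convolve(llr_frames, w / w.sum(), mode="same")
```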
MLP Neural Net
• Directly estimate p(singing | x)
- net has 26 inputs (+∆), 15 hidden units, 2 outputs (26:15:2)
• How many hidden units?
- depends on data amount, boundary complexity
• Feature context window?
- useful in speech
• Delta features?
- useful in speech
• Training parameters...
[Diagram: music → MFCC calculation (C0..C12) → MLP → "singing" / "not singing"]
MLP Results
• Raw net outputs on a CD track (FA 74.1%):
• Smoothed for continuity: best FA = 90.5%
[Figure: Aimee Mann, 'Build That Wall' + hand-labeled vox; spectrogram and raw p(singing) from a 26:15:1 netlab MLP on 16 ms frames (FA = 74.1% @ thr = 0.5), with the smoothed p(singing) below]
Artist Classification (Berenzweig et al. 2002)
• Artist label as available stand-in for genre
• Train MLP to classify frames among 21 artists
• Using only "voice" segments: song-level accuracy improves 56.7% → 64.9%
[Figure: frame-level artist posteriors over time for 21 artists (Boards of Canada, Sugarplastic, Belle & Sebastian, Mercury Rev, Cornelius, Richard Davies, DJ Shadow, Mouse on Mars, The Flaming Lips, Aimee Mann, Wilco, XTC, Beck, Built to Spill, Jason Falkner, Oval, Arto Lindsay, Eric Matthews, The Moles, The Roots, Michael Penn), with true voice segments marked; Track 4 - Arto Lindsay (dynvox=Arto, unseg=Oval) and Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee)]
Summary
• Music content analysis: pattern classification
• Basic machine learning methods: neural nets, GMMs
• Singing detection: a classic application
- but... the time dimension?
References
A.L. Berenzweig and D.P.W. Ellis (2001), "Locating Singing Voice Segments within Music Signals", Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio, Mohonk NY, October 2001. http://www.ee.columbia.edu/~dpwe/pubs/waspaa01-singing.pdf

R.O. Duda, P.E. Hart and D.G. Stork (2001), Pattern Classification, 2nd Ed., Wiley, 2001.

E. Scheirer and M. Slaney (1997), "Construction and evaluation of a robust multifeature speech/music discriminator", Proc. IEEE ICASSP, Munich, April 1997. http://www.ee.columbia.edu/~dpwe/e6820/papers/ScheiS97-mussp.pdf