Research in Sound Analysisdpwe/talks/rosaview-2008-09.pdf · Sound Analysis - Ellis 2008-09-05 p. /16 Audio-Only Results • Wide range of results: audio (music, ski) vs. non-audio

Sound Analysis - Ellis 2008-09-05 p. /16

1. Personal and Consumer Audio2. Musical Cover Song Detection3. Binaural Source Separation

Research in Sound Analysis Dan Ellis

Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA

[email protected]

1

mailto:[email protected]:[email protected]


LabROSA Overview

2

InformationExtraction

MachineLearning

SignalProcessing

Speech

Music EnvironmentRecognition

Retrieval

Separation


1. Personal Audio Archives

• Easy to record everything you hear


Browsing Interface• Browsing / Diary interface

links to other information (diary, email, photos)synchronize with note taking? (Stifelman & Arons)audio thumbnails

• Release Tools + “how to” for capture

4

!"#!!

!"#$!

!%#!!

!%#$!

&!#!!

&!#$!

&!!

&$!

&'#!!

&'#$!

&$#!!

&$#$!

&(#!!

&(#$!

&)#!!

&)#$!

&*#!!

&*#$!

&+#!!

&+#$!

&"#!!

&"#$!

&%#!!

&%#$!

'!#!!

,-./01223

045.

,-./01223

045.3.067-.

25580.

25580.276922-

:-27,

34;

045.


Consumer Video

• Short video clips as the evolution of snapshots10-60 sec, one location, no editingbrowsing?

• More information for indexing...video + audioforeground + background

5


MFCC Covariance Representation• Each clip/segment → fixed-size statistics

similar to speaker ID and music genre classification

• Full Covariance matrix of MFCCs maps the kinds of spectral shapes present

• Clip-to-clip distances for SVM classifierby KL or 2nd Gaussian model

6

VTS_04_0001 - Spectrogram

freq / k

Hz

1 2 3 4 5 6 7 8 90

1

2

3

4

5

6

7

8

-20

-10

0

10

20

30

time / sec

time / sec

level / dB

value

MF

CC

bin

1 2 3 4 5 6 7 8 9

2468

101214161820

-20-15-10-505101520

MFCC dimension

MF

CC

dim

ensio

n

MFCC covariance

5 10 15 20

2

4

6

8

10

12

14

16

18

20

-50

0

50

Video Soundtrack

MFCCfeatures

MFCCCovariance

Matrix


Audio-Only Results

• Wide range of results:

audio (music, ski) vs. non-audio (group, night)large AP uncertainty on infrequent classes

7

Anim

alBa

byBe

ach

Birth

day

Boat

Crow

d

Grou

p of 3

+

Grou

p of 2

Muse

umNi

ght

One p

erso

nPa

rkPic

nic

Playg

roun

dSh

owSp

orts

Suns

et

Wed

ding

Danc

ing

Para

de

Singin

gCh

eer

Music

Concepts

Ski0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

AP

1G + KL1G + MahGMM Hist. + pLSAGuess


How does it ‘feel’?

• Browser impressions: How wrong is wrong?

8

Top 8 hitsfor “Baby”


2. Cover Song Detection: Chroma

• Chroma features map spectral energy into one canonical octavei.e. 12 semitone bins

• Can resynthesize as “Shepard Tones”all octaves at once

9

Piano scale

2 4 6 8 10 100 200 300 400 500 600 700time / sec

freq / k

Hz

0

1

2

3

4

time / frames

chro

ma

A

C

D

F

G

Piano chromatic scale IF chroma

0 500 1000 1500 2000 2500 freq / Hz-60-50-40-30-20-10

0

2 4 6 8 10 time / sec

freq

/ kH

z

leve

l / d

B

0

1

2

3

4Shepard tone resynth12 Shepard tone spectra


Beat-Synchronous Chroma Features

• Beat + chroma features / 30ms frames→ average chroma within each beatcompact; sufficient?

! "# "! $# $! %# %!

$

&

'

(

"#

"$

"#

$#

%#

$

&

'

(

"#

"$

# ! "# "!)*+,-.-/,0

)*+,-.-1,2)/

34,5-.-6,7

89/,)-/)4,9:);

0;48+2-1*9/

0;48+2-1*9/

#


Cross-Correlation Matching

• Look inside global cross-correlation to find matching fragments...

xcorr = t f (C1(t, f)⋅C2(t, f)) - view along time

11

Let It Be / Beatles (beats 11-441)

50 100 150 200 250 300 350 400

0 50 100 150 200 250 300 350 400

-0.2

0

0.2

0.4

Let It Be / Nick Cave (beats 13-443)

50 100 150 200 250 300 350 400 time / beats

time / beats

time / beats

ch

rom

a

A

C

D

F

G

chro

ma

A

C

D

F

G


“The Meaning of Music”The ultimate goal of this research...

• What does music evoke in a listener’s mind?

i.e. “what does it all mean?” (metaphysics?)study with subjective experiments(then build detectors for specific responses ...?)

• What phenomena are denoted by “music”?

i.e. delineate the “set of all music”(the ultimate music/nonmusic classifier?)

12

?


3. Binaural Source Separation• 2 or 3 sources in reverberation

assume just 2 ‘ears’

• Tasks:identify positions of sources (and number?)recover source signals

13


Spatial Estimation in Reverb

• Model interaural spectrum of each sourceas stationary level and time differences:

converge via EM to a(), τ for each sourcemask is Pr(X(t,ω) dominated by source i)

14

Mandel & Ellis ’07

LeftLeft ILDILD

IPDIPD

8.77dB8.77dB

Mask 1Mask 1

Mask 2Mask 2

RightRight


Spatial Estimation Results

• Modeling uncertainty improves resultstradeoff between constraints & noisiness

15

2.45 dB2.45 dB

9.12 dB9.12 dB

0.22 dB0.22 dB

-2.72 dB-2.72 dB

12.35 dB12.35 dB8.77 dB8.77 dB


Conclusions

• LabROSAinformation from sound ...... via signal processing and machine learning

• Environmental Sound• Music Audio• Source Separation• Speech, models, dolphins...

16

Documents

Research in Sound Analysisdpwe/talks/rosaview-2008-09.pdf · Sound Analysis - Ellis 2008-09-05 p. /16 Audio-Only Results • Wide range of results: audio (music, ski) vs. non-audio