26
Audio/Music @ LabROSA - Dan Ellis 2004-08-24 1. Eigenrhythms: Representing drum tracks 2. Frequency-Domain Linear Prediction 3. Segmenting meeting turns 4. Analyzing ‘personal audio’ recordings Audio & Music Research at LabROSA Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA [email protected] http://labrosa.ee.columbia.edu/

Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

1. Eigenrhythms: Representing drum tracks2. Frequency-Domain Linear Prediction3. Segmenting meeting turns4. Analyzing ‘personal audio’ recordings

Audio & Music Researchat LabROSA

Dan EllisLaboratory for Recognition and Organization of Speech and Audio

Dept. Electrical Eng., Columbia Univ., NY USA

[email protected] http://labrosa.ee.columbia.edu/

Page 2: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

LabROSA Projects Overview

InformationExtraction

MachineLearning

SignalProcessing

Speech

Music Environment

FDLPMeetingturns

Personalaudio

Eigenrhythms

Page 3: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

1. Eigenrhythms: Drum Pattern Space

• Pop songs built on repeating “drum loop”bass drum, snare, hi-hatsmall variations on a few basic patterns

• Eigen-analysis (PCA) to capture variations?by analyzing lots of (MIDI) data

• Applicationsmusic categorization“beat box” synthesis

with John Arroyo

Page 4: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Aligning the Data• Need to align patterns prior to PCA...

tempo (stretch): by inferring BPM &

normalizing

downbeat (shift): correlate against ‘mean’ template

Page 5: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Eigenrhythms

• Need 20+ Eigenvectors for good coverage of 100 training patterns (1200 dims)

• Top patterns:

Page 6: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Eigenrhythms for Classification

• Clusters in Eigenspace:

• Genre classification? (10 way)nearest neighbor in 4D eigenspace: 21% correct

-6 -4 -2 0 2 4 6-6

-4

-2

0

2

4

6

bl:boomboom

bl:chicken

bl:crosfire

bl:hideaway

bl:meanwoma

bl:onebeer

bl:thrill

bl:blues2gm

bl:dimplesco:goodlook

co:walkline

co:ringfire

co:SArose

co:byYrMan

co:alabama

co:tennesse

co:aftermid

co:texas

co:walkmid

di:boogiewl

di:booty

di:carwash

di:danqueen

di:discoinf

di:dontstop

di:funkytwn

di:lafreak

di:satnight

di:boogient

hh:1mChance

hh:bigpimpn

hh:bigPoppa

hh:gThang

hh:nEpisode

hh:rufryder

hh:slmshady

hh:superstr

hh:jacksonhh:stan

ho:badtouch

ho:bemylove

ho:inside

ho:onemore

ho:dpworld

ho:pvandyk

ho:modjo

nw:amadeus

nw:banvenus

nw:bmonday

nw:dbdance

nw:deserve

nw:dontyounw:evcountsnw:psboysin

nw:whipit

nw:pure

rc:blackdog

rc:californ

rc:whteroom

rc:rolstone

rc:harddayrc:jump

rc:layla

rc:money

rc:zztop

rc:tuesdays

pp:distance

pp:dllal

pp:downundr

pp:fly pp:lkvirgin

pp:loveshck

pp:lvprayer

pp:mjBeatit

pp:onemore

pp:bholly

pu:aWalk

pu:beatbratpu:blitzkrg

pu:bombshel

pu:bSedated

pu:happyguy

pu:rubysohopu:waitinRm

pu:anarchy

rb:chgWorld

rb:downlow

rb:heyLover

rb:honey

rb:lsaround

rb:mgirlsat

rb:bismine

rb:volove

All tracks projected onto 1st two eigenrhythms

Eigenrhythm 1

Eig

en

rhyth

m 2

Page 7: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Eigenrhythm BeatBox

• Resynthesize rhythms from eigen-space

Page 8: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

2. Frequency-Domain Lin. Pred.

• (Time-domain) Linear Predictionthe well-known spectral estimator

• Apply to a ‘frequency domain’ signaldual: estimates temporal envelope

4/16Athineos & Ellis - Music processing with FDLP 2004-05-25

• Time domain– The well-known spectral estimator

• Frequency domain

– Frequency is time and vice versa

Linear PredictionLinear Prediction

FDLPDCT

TDLP

y[n] = aiy[n ! i] + e[n]

i=1.. p

"

Y[k] = biY[k ! i] + E[k]i=1.. p

"

4/16Athineos & Ellis - Music processing with FDLP 2004-05-25

• Time domain– The well-known spectral estimator

• Frequency domain

– Frequency is time and vice versa

Linear PredictionLinear Prediction

FDLPDCT

TDLP

y[n] = aiy[n ! i] + e[n]

i=1.. p

"

Y[k] = biY[k ! i] + E[k]i=1.. p

"

with Marios Athineos

Page 9: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Aside: Spectrogram of the DCT

• DCT gives a pure-real signal:Can we treat it like a waveform?

3/16Athineos & Ellis - Music processing with FDLP 2004-05-25

• Looks like a mirror image over time = freq axis

DCT spectrogramDCT spectrogram

Page 10: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

FDLP and TDLP Duality

!"#$%#&'()*+#

!,-. ),-.

Page 11: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Subband FDLP

• Temporal envelopes without 25 ms windows

12/16Athineos & Ellis - Music processing with FDLP 2004-05-25

Time-frequency slicingTime-frequency slicing

Subband FDLP(per frequency subband)

Auditory STFT(10-25ms + Bark bin)

TDLP(per time frame)

Page 12: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

FDLP Applications

• Time-scale modification

• Modulation-domain “temporal equalization”

• Perceptual audio features... 8/16Athineos & Ellis - Music processing with FDLP 2004-05-25

Cascade Time-Frequency LPCascade Time-Frequency LP

• Analysis

• Synthesis

13/16Athineos & Ellis - Music processing with FDLP 2004-05-25

Temporal equalizationTemporal equalization

• Filtering in frequency

DCT

1 sec up to whole sample

Residual

in freq.

Flat

Temporal

EnvelopesOverlap

= Filtering by inverse FDLP

(temporal equalization)

OLA

& iDCT

Page 13: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

PLP-squared

• FDLP fits temporal envelope with LPPerceptual Linear Prediction (PLP) smooths across frequencycan we do both... iteratively?

• Speech features without ST windows

t / sec

Ba

rk b

an

d

0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22

5

10

15

Marios AthineosHynek Hermansky

Page 14: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

• Multi-mic recordings for speaker turnsevery voice reaches every mic... (?)... but with differing coupling filters (delays, gains)

• Find turns with minimal assumptionse.g. ad-hoc sensor setups (multiple PDAs)differences to remove effect of source signal- no spectral models, < 1xRT

3. Meeting Turnswith Jerry Liu and ICSI

Page 15: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Between-channel cues: Timing (ITD) & Level

125 130 135 140 145 150 155 160 165 170 1750

0.5

1norm xcorr pk val

125 130 135 140 145 150 155 160 165 170 175-50

0

50xocrr peak lags (5pt med filt)

skew

/sam

p

125 130 135 140 145 150 155 160 165 170 175-80

-70

-60

-50

-40per-chan E

dB

125 130 135 140 145 150 155 160 165 170 175-10

-5

0

5

10chan E diffs

dB

time/s

Speaker activity

Speakerground-truth

Timing diffs (ITD)(2 mic pairs, 250ms win)

Peak correlationcoefficient r

Per-channelenergy

Between-channelenergy differences

Page 16: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

• Inverse-filter by 12-pole LPC models (32 ms windows) to remove local resonances

• Filter out noise < 500 Hz, > 6 kHz• Then cross-correlate...

Pre-whitening for ITD

Short-time xcorr: whitened+filtered signals

1220 1225 1230 1235 1240-100

-50

0

50

100

lag

/ s

am

ps

Short-time xcorr: raw signals

1220 1225 1230 1235 1240-100

-50

0

50

100

Speaker ground truth

sp

kr

ID

time / sec1220 1225 1230 1235 1240

2

4

6

Speaker ground truth

time / sec1220 1225 1230 1235 1240

2

4

6

Page 17: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

• Correlation coef. r ~ channel similarity:

• Select frames with r in top 50% in both pairs

•• Cleaner basis for models

Choosing “Good” Frames

ri j[!] =!nmi[n] ·mj[n+ !]√

!m2i !m2

j

-150 -100 -50 0

-100

-50

0

Skew12 / samples

Skew

34 / s

am

ple

s

ITD - all points

-150 -100 -50 0

-100

-50

0

ITD - high-correlation points (435/1201)

Skew12 / samples

Skew

34 / s

am

ple

sabout 35% of points

Page 18: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

• Eigenvectors of “affinity matrix” A to pick out similar points:

•• Ad-hoc mapping to clusters

Number of clusters K from eigenvalues ≈ points

Spectral clustering

amn = exp{−‖x[m]−x[n]‖2/2!2}

Affinity matrix A

point index

100 200 300 400

50

100

150

200

250

300

350

400

0 100 200 300 400-12

-10

-8

-6

-4

-2

0

2first 12 eigenvectors (normalized)

point index

Page 19: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

• Actual clusters depend on and K heuristic• Fit Gaussians to each cluster,

assign that class to all frames within radiusor: consider dimensions independently, choose best

Speaker Models & Classification!

-100 -50 0-100

-80

-60

-40

-20

0ICSI0: good points

-100 -50 0-100

-80

-60

-40

-20

0All pts: nearest class

-100 -50 0-100

-80

-60

-40

-20

0All pts: closest dimension

Page 20: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Performance Analysis

• Compare reference & system activity maps:

system misses quiet speakers 2,3,4 (deletions)system splits speaker 6 (deletions+insertions)many short gaps (deletions)

• ~52% avg. error on NIST 2004 dev setspeaker-characteristic-based systems ~25%

Page 21: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

4. Segmenting Personal Audio

• Easy to record everything you hear~100GB / year @ 64 kbps

• Very hard to find anythinghow to scan?how to visualize?how to index?

• Starting point: Collect data~ 60 hours (8 days, ~7.5 hr/day)hand-mark 139 segments (26 min/seg avg.)assign to 16 classes (8 have multiple instances)

with Kean sub Lee

Page 22: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Features for Long Recordings

• Feature frames = 1 min (not 25 ms!)• Characterize variation within each frame...

•and structure within coarse auditory bands

60dB

80

100

120

fre

q /

ba

rk

Average Linear Energy

5

10

15

20

dB

20

40

60

fre

q /

ba

rk

Normalized Energy Deviation

5

10

15

20

bits

0.1

0.2

0.3

0.4

0.5

time / min

fre

q /

ba

rk

Spectral Entropy Deviation

50 100 150 200 250 300 350 400 450

5

10

15

20

bits

0.5

0.6

0.7

0.8

0.9

fre

q /

ba

rk

Average Spectral Entropy

5

10

15

20

dB

5

10

15

fre

q /

ba

rk

Log Energy Deviation

5

10

15

20

dB60

80

100

120

fre

q /

ba

rk

Average Log Energy

5

10

15

20

Page 23: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

BIC Segmentation• Untrained segmentation technique

statistical test indicates good change points:

• Evaluate: 60hr hand-marked boundariesdifferent features & combinationsCorrect Accept % @ False Accept = 2%:

log L(X1;M1)L(X2;M2)L(X;M0)

≷ λ2 log(N)∆#(M)

µdB 80.8%µH 81.1%

σH/µH 81.6%µdB + σH/µH 84.0%

µdB + σH/µH + µH 83.6%0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04

0.2

0.3

0.4

0.5

0.6

0.7

0.8

1 - Specificity

Sensitiv

ity

µdB

µH

!H/µH

µdB + !H/µH

µdB + µH + !H/µH

Page 24: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Segment clustering

• Daily activity has lots of repetition:Automatically cluster similar segments

• Spectral clustering achieves ~70% correct16-way ground truth labelsKL distance, smoothed covariance estimates

campuslibrary

restaurantstreet

bowlinghomecar/taxilecture1breakbilliardlecture2barberkaraokemeetingsupermkt

lib rst str ...

0.5

1

0cmp

Page 25: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

Future Work

• Visualization / browsing / diary inferencelink to other information sources

• Privacy protectionspeaker/speech “search and destroy”

Page 26: Audio & Music Research at LabROSA - Columbia Universitydpwe/talks/msr-2004-08.pdf · 2004-08-24 · Audio/Music @ LabROSA - Dan Ellis 2004-08-24 Features for Long Recordings • Feature

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

LabROSA Summary

• LabROSAsignal processing + machine learning + information extraction

• ApplicationsEigenrhythms: drum pattern modelsFDLP temporal envelopesMeeting recordingsPersonal audio analysis

• Also...music similarity, signal separation, ...