Audio & Music Research at LabROSA - Columbia University · dpwe/talks/msr-2004-08.pdf · 2004-08-24

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

1. Eigenrhythms: Representing drum tracks
2. Frequency-Domain Linear Prediction
3. Segmenting meeting turns
4. Analyzing ‘personal audio’ recordings

Audio & Music Research at LabROSA

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio

Dept. Electrical Eng., Columbia Univ., NY USA

http://labrosa.ee.columbia.edu/


LabROSA Projects Overview

[Diagram labels: Information Extraction, Machine Learning, Signal Processing; Speech, Music, Environment; projects: FDLP, Meeting turns, Personal audio, Eigenrhythms]


1. Eigenrhythms: Drum Pattern Space

• Pop songs built on repeating “drum loop”
– bass drum, snare, hi-hat
– small variations on a few basic patterns

• Eigen-analysis (PCA) to capture variations?
– by analyzing lots of (MIDI) data

• Applications
– music categorization
– “beat box” synthesis

with John Arroyo


Aligning the Data

• Need to align patterns prior to PCA...
– tempo (stretch): by inferring BPM & normalizing
– downbeat (shift): correlate against ‘mean’ template
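
The two alignment steps can be sketched roughly as follows. This is a minimal illustration, not the method used in the talk: it assumes a single instrument, a known BPM, and a 16-step bar grid, and the function names `normalize_tempo` and `align_downbeat` are hypothetical.

```python
import numpy as np

def normalize_tempo(onset_times, bpm, beats_per_bar=4, steps_per_beat=4):
    """Map onset times (seconds) of one bar onto a tempo-independent step grid.

    Assumption: BPM is already known; the slide infers it from the data.
    """
    sec_per_beat = 60.0 / bpm
    n_steps = beats_per_bar * steps_per_beat
    grid = np.zeros(n_steps)
    for t in onset_times:
        step = int(round(t / sec_per_beat * steps_per_beat)) % n_steps
        grid[step] = 1.0
    return grid

def align_downbeat(pattern, template):
    """Circularly shift `pattern` to maximize its correlation with `template`."""
    scores = [np.dot(np.roll(pattern, s), template) for s in range(len(pattern))]
    best = int(np.argmax(scores))
    return np.roll(pattern, best), best
```

With this sketch, the same rhythm played at 60 BPM and at 120 BPM lands on the same grid, and a pattern that starts three steps late is rotated back onto the template.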


Eigenrhythms

• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)

• Top patterns: [figure not recovered]
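
As a hedged sketch of the eigen-analysis, PCA on a matrix of aligned patterns (one row per song, 1200 columns) can be computed with an SVD of the mean-removed data. The names `eigenrhythms` and `project` are illustrative only, and the talk may have computed the decomposition differently.

```python
import numpy as np

def eigenrhythms(patterns, n_components):
    """PCA of aligned patterns (rows = songs, columns = flattened pattern dims)."""
    mean = patterns.mean(axis=0)
    centered = patterns - mean
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return mean, vt[:n_components], explained[:n_components]

def project(pattern, mean, components):
    """Coordinates of one pattern in the eigenrhythm space."""
    return components @ (pattern - mean)
```

Summing the returned `explained` fractions gives one way to judge how many eigenvectors are needed for a given coverage.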


Eigenrhythms for Classification

• Clusters in eigenspace:

[Figure: All tracks projected onto 1st two eigenrhythms; axes Eigenrhythm 1 and Eigenrhythm 2 (both about -6 to 6); each point labeled genre:track, with genre prefixes bl, co, di, hh, ho, nw, rc, pp, pu, rb]

• Genre classification? (10-way)
– nearest neighbor in 4D eigenspace: 21% correct
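
A minimal sketch of nearest-neighbor genre classification on eigenspace coordinates follows. The leave-one-out evaluation is an assumption made for illustration; the slides do not say how the 21% figure was measured, and the function names are hypothetical.

```python
import numpy as np

def nearest_neighbor_label(query, train_coords, train_labels):
    """Label of the training point closest (Euclidean) to `query`."""
    dists = np.linalg.norm(train_coords - query, axis=1)
    return train_labels[int(np.argmin(dists))]

def leave_one_out_accuracy(coords, labels):
    """Classify each track from all the others; return the fraction correct.

    Assumption: leave-one-out is used here only as a convenient protocol.
    """
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(coords)):
        mask = np.arange(len(coords)) != i
        pred = nearest_neighbor_label(coords[i], coords[mask], labels[mask])
        correct += int(pred == labels[i])
    return correct / len(coords)
```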


Eigenrhythm BeatBox

• Resynthesize rhythms from eigen-space
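
Resynthesis can be sketched as the inverse of the PCA projection: add a weighted sum of eigenrhythms back to the mean pattern, then decide where the drum hits are. The fixed `threshold` rule below is an assumption for illustration, not something stated in the talk.

```python
import numpy as np

def resynthesize(coeffs, mean, components, threshold=0.5):
    """Map eigenspace coordinates back to a pattern and a binary hit grid.

    components: (n_components, n_dims) array of eigenrhythms.
    threshold:  hypothetical cutoff for turning activations into hits.
    """
    pattern = mean + coeffs @ components
    hits = (pattern >= threshold).astype(int)
    return pattern, hits
```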


2. Frequency-Domain Lin. Pred.

• (Time-domain) Linear Prediction
– the well-known spectral estimator

• Apply to a ‘frequency domain’ signal
– dual: estimates temporal envelope

[Embedded slide 4/16: Athineos & Ellis - Music processing with FDLP, 2004-05-25]

Linear Prediction

• Time domain (TDLP)
– The well-known spectral estimator

$$ y[n] = \sum_{i=1}^{p} a_i\, y[n-i] + e[n] $$

• Frequency domain (FDLP, applied to the DCT of the signal)
– Frequency is time and vice versa

$$ Y[k] = \sum_{i=1}^{p} b_i\, Y[k-i] + E[k] $$



with Marios Athineos
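
The duality can be illustrated with a short sketch: take the DCT of a signal, fit an all-pole model to the DCT coefficients by the autocorrelation method, and read the model's "spectrum" as a temporal envelope. This is a simplified illustration under stated assumptions (unnormalized DCT-II, no subbands, plain Yule-Walker solve), not the authors' implementation; all function names are hypothetical.

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II via an explicit cosine matrix (O(N^2), fine for a sketch)."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def lpc_autocorr(x, order):
    """Autocorrelation-method LPC: returns [1, -a_1, ..., -a_p] and residual energy."""
    r = np.array([np.dot(x[:len(x) - lag], x[lag:]) for lag in range(order + 1)])
    toeplitz = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(toeplitz, r[1:])
    err = r[0] - np.dot(a, r[1:])
    return np.concatenate(([1.0], -a)), err

def fdlp_envelope(x, order, n_points=None):
    """All-pole estimate of the temporal envelope of x over its full duration."""
    n_points = n_points or len(x)
    poly, err = lpc_autocorr(dct_ii(x), order)
    # Sample m of the signal corresponds to "frequency" pi * (m + 0.5) / N
    # along the DCT coefficient sequence.
    omega = np.pi * (np.arange(n_points) + 0.5) / n_points
    resp = np.exp(-1j * np.outer(omega, np.arange(order + 1))) @ poly
    return err / np.abs(resp) ** 2
```

For a signal that is a single click, the DCT is a sampled cosine, so a low-order model puts a sharp peak in the envelope at the click position.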


Aside: Spectrogram of the DCT

• DCT gives a pure-real signal:
– Can we treat it like a waveform?

• Looks like a mirror image over time = freq axis

[Figure: DCT spectrogram]


FDLP and TDLP Duality

[Figure: labels not recoverable from the extracted text]


Subband FDLP

• Temporal envelopes without 25 ms windows


Time-frequency slicing

• Subband FDLP (per frequency subband)
• Auditory STFT (10-25 ms + Bark bin)
• TDLP (per time frame)


FDLP Applications

• Time-scale modification

• Modulation-domain “temporal equalization”

• Perceptual audio features...

[Embedded slide 8/16: Athineos & Ellis - Music processing with FDLP, 2004-05-25]

Cascade Time-Frequency LP

• Analysis

• Synthesis


Temporal equalization

• Filtering in frequency = filtering by inverse FDLP (temporal equalization)

[Diagram labels: DCT (1 sec up to whole sample); residual in freq.; flat temporal envelopes; overlap-add (OLA) & iDCT]


PLP-squared

• FDLP fits temporal envelope with LP
– Perceptual Linear Prediction (PLP) smooths across frequency
– can we do both... iteratively?

• Speech features without ST windows

[Figure: Bark band (5-15) vs. t / sec (0.02-0.22)]

with Marios Athineos, Hynek Hermansky


3. Meeting Turns

with Jerry Liu and ICSI

• Multi-mic recordings for speaker turns
– every voice reaches every mic... (?)
– ... but with differing coupling filters (delays, gains)

• Find turns with minimal assumptions
– e.g. ad-hoc sensor setups (multiple PDAs)
– differences to remove effect of source signal
– no spectral models, < 1xRT


Between-channel cues: Timing (ITD) & Level

[Figure: five panels vs. time/s (125-175): speaker activity (speaker ground truth); timing diffs (ITD) as xcorr peak lags, skew/samples, 5-pt median filtered (2 mic pairs, 250 ms win); peak correlation coefficient r (norm xcorr pk val, 0-1); per-channel energy (dB); between-channel energy differences (dB)]
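
The two between-channel cues can be sketched for one analysis window as below: the lag of the normalized cross-correlation peak (the timing cue) and the energy ratio in dB (the level cue). The function names and the `max_lag` search range are assumptions for illustration; both windows are assumed to have equal length.

```python
import numpy as np

def xcorr_peak(m_i, m_j, max_lag):
    """Return (lag, r) maximizing sum_n m_i[n] * m_j[n + lag], normalized by energy."""
    norm = np.sqrt(np.sum(m_i**2) * np.sum(m_j**2))
    best_lag, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            c = np.dot(m_i[:len(m_i) - lag], m_j[lag:])
        else:
            c = np.dot(m_i[-lag:], m_j[:len(m_j) + lag])
        r = c / norm
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

def level_difference_db(m_i, m_j, eps=1e-12):
    """Between-channel energy difference in dB for one window."""
    return 10 * np.log10((np.sum(m_i**2) + eps) / (np.sum(m_j**2) + eps))
```

Running `xcorr_peak` over successive windows and median-filtering the lags would give a track comparable in spirit to the ITD panel.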


Pre-whitening for ITD

• Inverse-filter by 12-pole LPC models (32 ms windows) to remove local resonances

• Filter out noise < 500 Hz, > 6 kHz

• Then cross-correlate...

[Figure: short-time xcorr, lag / samples (-100 to 100) vs. time / sec (1220-1240), for raw signals and for whitened+filtered signals, each shown with a speaker ground truth panel (spkr ID)]
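
A rough sketch of the pre-whitening chain for one window: fit a 12-pole LPC model, inverse-filter the window with it, and band-limit to 500 Hz - 6 kHz. The FFT-masking band-pass is a stand-in chosen for brevity (the slides do not specify the filter), and the function names are hypothetical. An all-zero window would make the LPC solve singular and is not handled here.

```python
import numpy as np

def lpc_inverse_filter(frame, order=12):
    """Return A(z) = [1, -a_1, ..., -a_p] from the autocorrelation method."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    big_r = r[idx] + 1e-9 * r[0] * np.eye(order)  # tiny ridge for numerical safety
    a = np.linalg.solve(big_r, r[1:])
    return np.concatenate(([1.0], -a))

def whiten(frame, order=12):
    """Inverse-filter one window by its own LPC model to flatten local resonances."""
    return np.convolve(frame, lpc_inverse_filter(frame, order))[:len(frame)]

def bandpass_fft(x, sr, lo=500.0, hi=6000.0):
    """Zero all FFT bins outside [lo, hi] Hz (a crude brick-wall band-pass)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spec[(freqs < lo) | (freqs > hi)] = 0
    return np.fft.irfft(spec, n=len(x))
```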


Choosing “Good” Frames

• Correlation coef. r ~ channel similarity:

$$ r_{ij}[\ell] = \frac{\sum_n m_i[n]\, m_j[n+\ell]}{\sqrt{\sum m_i^2 \, \sum m_j^2}} $$

• Select frames with r in top 50% in both pairs
– about 35% of points

• Cleaner basis for models

[Figure: scatter of Skew34 / samples vs. Skew12 / samples in two panels: “ITD - all points” and “ITD - high-correlation points (435/1201)”]
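
Frame selection can be sketched as a pair of quantile thresholds, one per mic pair, with a frame kept only if it passes both. The function name and the use of `np.quantile` are assumptions for illustration.

```python
import numpy as np

def select_good_frames(r12, r34, top_fraction=0.5):
    """Boolean mask of frames whose peak correlation is in the top fraction for BOTH pairs."""
    t12 = np.quantile(r12, 1 - top_fraction)
    t34 = np.quantile(r34, 1 - top_fraction)
    return (np.asarray(r12) >= t12) & (np.asarray(r34) >= t34)
```

If the two pairs' correlations were independent, roughly 25% of frames would pass both tests; a higher pass rate suggests that frames which are clean in one pair tend to be clean in the other.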


Spectral clustering

• Eigenvectors of “affinity matrix” A to pick out similar points:

$$ a_{mn} = \exp\{ -\| x[m] - x[n] \|^2 / 2\sigma^2 \} $$

• Ad-hoc mapping to clusters
– Number of clusters K from eigenvalues ≈ points

[Figure: affinity matrix A (point index, about 400 points) and first 12 eigenvectors (normalized) vs. point index]
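
The affinity matrix and its leading eigenvectors can be sketched as follows. The mapping from eigenvectors to cluster labels is left out, since the slides describe it only as ad hoc; the function names are hypothetical.

```python
import numpy as np

def affinity_matrix(x, sigma):
    """a_mn = exp(-||x[m] - x[n]||^2 / (2 sigma^2)) for points in the rows of x."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def leading_eigenvectors(a, k):
    """Top-k eigenvalues and eigenvectors of the symmetric affinity matrix."""
    vals, vecs = np.linalg.eigh(a)
    order = np.argsort(vals)[::-1][:k]
    return vals[order], vecs[:, order]
```

For well-separated, tight clusters the matrix is close to block-diagonal, so each leading eigenvector is nonzero on one cluster only and its eigenvalue is close to that cluster's point count.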


Speaker Models & Classification

• Actual clusters depend on σ and K heuristic

• Fit Gaussians to each cluster, assign that class to all frames within radius
– or: consider dimensions independently, choose best

[Figure: three scatter panels (both axes about -100 to 0): “ICSI0: good points”, “All pts: nearest class”, “All pts: closest dimension”]
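
A minimal sketch of the "fit Gaussians, assign within radius" step is given below. Interpreting the radius as a Mahalanobis distance, and the default value of 3, are assumptions made for illustration; frames outside every cluster's radius are left unassigned (`None`).

```python
import numpy as np

def fit_gaussians(points, labels):
    """Per-cluster mean and inverse covariance from the clustered 'good' frames."""
    models = {}
    for k in np.unique(labels):
        pts = points[labels == k]
        cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(points.shape[1])
        models[int(k)] = (pts.mean(axis=0), np.linalg.inv(cov))
    return models

def classify(frame, models, radius=3.0):
    """Nearest cluster by Mahalanobis distance, or None if none is within `radius`."""
    best, best_d = None, radius
    for k, (mean, inv_cov) in models.items():
        diff = frame - mean
        d = float(np.sqrt(diff @ inv_cov @ diff))
        if d <= best_d:
            best, best_d = k, d
    return best
```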


Performance Analysis

• Compare reference & system activity maps:

– system misses quiet speakers 2,3,4 (deletions)
– system splits speaker 6 (deletions+insertions)
– many short gaps (deletions)

• ~52% avg. error on NIST 2004 dev set
– speaker-characteristic-based systems ~25%


4. Segmenting Personal Audio

• Easy to record everything you hear
– ~100GB / year @ 64 kbps

• Very hard to find anything
– how to scan?
– how to visualize?
– how to index?

• Starting point: Collect data
– ~60 hours (8 days, ~7.5 hr/day)
– hand-mark 139 segments (26 min/seg avg.)
– assign to 16 classes (8 have multiple instances)

with Keansub Lee


Features for Long Recordings

• Feature frames = 1 min (not 25 ms!)

• Characterize variation within each frame...

• and structure within coarse auditory bands

[Figure: six feature panels, freq / bark (bands 5-20) vs. time / min (about 50-450), with color scales in dB or bits: Average Linear Energy, Normalized Energy Deviation, Average Log Energy, Log Energy Deviation, Average Spectral Entropy, Spectral Entropy Deviation]
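
The six per-band statistics named in the figure could be computed along these lines for one 1-minute frame, starting from a short-time power spectrogram. This is a sketch under assumptions: the input layout (`power_spec` as short-time frames by FFT bins), the `band_edges` bin ranges standing in for Bark bands, and the dictionary keys are all hypothetical.

```python
import numpy as np

def long_frame_features(power_spec, band_edges, eps=1e-12):
    """Per-band statistics over the short-time frames within one long frame.

    power_spec: (n_short_frames, n_bins) power spectrogram for one minute.
    band_edges: list of (lo_bin, hi_bin) pairs, one per coarse auditory band.
    """
    keys = ["mu_lin_db", "sigma_over_mu", "mu_db", "sigma_db", "mu_H", "sigma_H_over_mu_H"]
    feats = {k: [] for k in keys}
    for lo, hi in band_edges:
        band = power_spec[:, lo:hi] + eps
        energy = band.sum(axis=1)               # band energy per short frame
        p = band / energy[:, None]              # spectrum within band as a distribution
        entropy = -np.sum(p * np.log2(p), axis=1)   # spectral entropy in bits
        log_e = 10 * np.log10(energy)
        feats["mu_lin_db"].append(10 * np.log10(energy.mean()))
        feats["sigma_over_mu"].append(energy.std() / energy.mean())
        feats["mu_db"].append(log_e.mean())
        feats["sigma_db"].append(log_e.std())
        feats["mu_H"].append(entropy.mean())
        feats["sigma_H_over_mu_H"].append(entropy.std() / (entropy.mean() + eps))
    return {k: np.array(v) for k, v in feats.items()}
```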


BIC Segmentation

• Untrained segmentation technique
– statistical test indicates good change points:

$$ \log \frac{L(X_1; M_1)\, L(X_2; M_2)}{L(X; M_0)} \gtrless \frac{\lambda}{2} \log(N)\, \Delta\#(M) $$

• Evaluate: 60 hr hand-marked boundaries
– different features & combinations
– Correct Accept % @ False Accept = 2%:

Features              Correct Accept
µdB                   80.8%
µH                    81.1%
σH/µH                 81.6%
µdB + σH/µH           84.0%
µdB + σH/µH + µH      83.6%

[Figure: Sensitivity (0.2-0.8) vs. 1 - Specificity (0-0.04) curves for µdB, µH, σH/µH, µdB + σH/µH, and µdB + µH + σH/µH]
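
The BIC test at a candidate boundary can be sketched with single full-covariance Gaussians as the models M0, M1, M2. That model choice, the small covariance regularizer, and the function names are assumptions for illustration; a positive return value means the two-segment description wins.

```python
import numpy as np

def gauss_loglik(x):
    """Approximate maximized log-likelihood of the rows of x under one Gaussian."""
    n, d = x.shape
    # ML covariance plus a small regularizer (so the value is slightly approximate).
    cov = np.cov(x, rowvar=False, bias=True).reshape(d, d) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def delta_bic(x, t, lam=1.0):
    """Log-likelihood gain of splitting x at row t, minus the BIC penalty."""
    n, d = x.shape
    gain = gauss_loglik(x[:t]) + gauss_loglik(x[t:]) - gauss_loglik(x)
    extra_params = d + d * (d + 1) / 2   # one additional mean and covariance
    return gain - 0.5 * lam * extra_params * np.log(n)
```

Sweeping `t` over a window of feature frames and keeping positive peaks of `delta_bic` is one common way to turn the test into a segmenter; `lam` trades false accepts against misses.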


Segment clustering

• Daily activity has lots of repetition:
– Automatically cluster similar segments

• Spectral clustering achieves ~70% correct
– 16-way ground truth labels
– KL distance, smoothed covariance estimates

[Figure: matrix plot (scale 0 to 1) with axes labeled by class (cmp, lib, rst, str, ...): campus, library, restaurant, street, bowling, home, car/taxi, lecture1, break, billiard, lecture2, barber, karaoke, meeting, supermkt]
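
A segment-to-segment distance of the kind named above can be sketched as a symmetrized KL divergence between Gaussians fitted to each segment's features. The particular smoothing used here (shrinking the off-diagonal covariance terms by a factor `alpha`) is an assumption; the slides say only that the covariance estimates were smoothed.

```python
import numpy as np

def smoothed_cov(x, alpha=0.1):
    """Sample covariance with off-diagonal terms shrunk toward zero (assumed scheme)."""
    cov = np.cov(x, rowvar=False)
    return (1 - alpha) * cov + alpha * np.diag(np.diag(cov))

def kl_gauss(m0, c0, m1, c1):
    """KL divergence KL(N(m0, c0) || N(m1, c1)) between two Gaussians."""
    d = len(m0)
    inv1 = np.linalg.inv(c1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(c0)
    _, logdet1 = np.linalg.slogdet(c1)
    return 0.5 * (np.trace(inv1 @ c0) + diff @ inv1 @ diff - d + logdet1 - logdet0)

def symmetric_kl(m0, c0, m1, c1):
    """Symmetrized KL distance, usable as input to an affinity matrix."""
    return kl_gauss(m0, c0, m1, c1) + kl_gauss(m1, c1, m0, c0)
```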


Future Work

• Visualization / browsing / diary inference
– link to other information sources

• Privacy protection
– speaker/speech “search and destroy”


LabROSA Summary

• LabROSA
– signal processing + machine learning + information extraction

• Applications
– Eigenrhythms: drum pattern models
– FDLP temporal envelopes
– Meeting recordings
– Personal audio analysis

• Also...
– music similarity, signal separation, ...
