Audio/Music @ LabROSA - Dan Ellis 2004-08-24
1. Eigenrhythms: Representing drum tracks
2. Frequency-Domain Linear Prediction
3. Segmenting meeting turns
4. Analyzing ‘personal audio’ recordings
Audio & Music Research at LabROSA
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected] http://labrosa.ee.columbia.edu/
LabROSA Projects Overview
[Diagram: the projects FDLP, Meeting turns, Personal audio, and Eigenrhythms placed among three methods (Information Extraction, Machine Learning, Signal Processing) and three domains (Speech, Music, Environment)]
1. Eigenrhythms: Drum Pattern Space
• Pop songs built on repeating “drum loop”
  bass drum, snare, hi-hat
  small variations on a few basic patterns
• Eigen-analysis (PCA) to capture variations?
  by analyzing lots of (MIDI) data
• Applications
  music categorization
  “beat box” synthesis
with John Arroyo
Aligning the Data
• Need to align patterns prior to PCA...
  tempo (stretch): by inferring BPM & normalizing
  downbeat (shift): correlate against ‘mean’ template
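The two alignment steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a single instrument, a known BPM, and a 16-step grid per pattern. The function names, `beats_per_pattern`, and `steps_per_beat` are hypothetical.

```python
import numpy as np

def normalize_tempo(onset_times_s, bpm, beats_per_pattern=4, steps_per_beat=4):
    """Quantize onset times (seconds) onto a fixed-length grid, removing tempo."""
    beat_len = 60.0 / bpm                      # seconds per beat
    n_steps = beats_per_pattern * steps_per_beat
    grid = np.zeros(n_steps)
    for t in onset_times_s:
        step = int(round(t / beat_len * steps_per_beat)) % n_steps
        grid[step] = 1.0
    return grid

def align_downbeat(pattern, template):
    """Circularly shift `pattern` to maximize its correlation with `template`."""
    scores = [np.dot(np.roll(pattern, s), template) for s in range(len(pattern))]
    best = int(np.argmax(scores))
    return np.roll(pattern, best), best

# Usage: a pattern that starts 3 steps late is rotated back onto the template.
template = np.array([1, 0, 0, 0, 0, 0, 1, 0], dtype=float)
aligned, shift = align_downbeat(np.roll(template, 3), template)
```

In practice the template would be the mean of the patterns aligned so far, so alignment and template estimation may need to be iterated.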
Eigenrhythms
• Need 20+ Eigenvectors for good coverage of 100 training patterns (1200 dims)
• Top patterns:
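How many eigenvectors give "good coverage" can be checked from the cumulative variance of a PCA. The sketch below computes PCA through an SVD of the centered pattern matrix; the 90% coverage target and the function names are assumptions for illustration, not values from the slides.

```python
import numpy as np

def eigenrhythms(patterns):
    """PCA via SVD. `patterns` is (n_patterns, n_dims).

    Returns the mean pattern, the principal components (one per row),
    and the fraction of variance explained by each component.
    """
    mean = patterns.mean(axis=0)
    _, s, vt = np.linalg.svd(patterns - mean, full_matrices=False)
    var = s ** 2
    return mean, vt, var / var.sum()

def n_components_for(var_frac, coverage=0.9):
    """Smallest number of leading components whose variance reaches `coverage`."""
    return int(np.searchsorted(np.cumsum(var_frac), coverage) + 1)

# Usage with synthetic stand-in data shaped like the slide (100 patterns, 1200 dims).
rng = np.random.default_rng(0)
patterns = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 1200))
mean, components, var_frac = eigenrhythms(patterns)
k = n_components_for(var_frac, coverage=0.99)
```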
Eigenrhythms for Classification
• Clusters in Eigenspace:
• Genre classification? (10 way)
  nearest neighbor in 4D eigenspace: 21% correct
[Figure: All tracks projected onto 1st two eigenrhythms. Axes: Eigenrhythm 1 vs. Eigenrhythm 2, each from -6 to 6. Points are labeled genre:track, with genre prefixes bl, co, di, hh, ho, nw, rc, pp, pu, rb.]
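The nearest-neighbor genre test can be sketched as a projection onto the leading eigenrhythms followed by a Euclidean nearest-neighbor lookup. Function names and the toy data are hypothetical. For scale, uniform guessing over 10 equally sized classes would be right 10% of the time.

```python
import numpy as np

def project(patterns, mean, components, k=4):
    """Coordinates of each pattern in the space of the first k components."""
    return (patterns - mean) @ components[:k].T

def nearest_neighbor_label(query, train_points, train_labels):
    """Label of the training point closest (Euclidean) to `query`."""
    dists = np.linalg.norm(train_points - query, axis=1)
    return train_labels[int(np.argmin(dists))]

# Usage on two toy training points in a 4-D eigenspace.
train_points = np.array([[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
train_labels = ["bl", "di"]
label = nearest_neighbor_label(np.array([4.0, 4.0, 4.0, 6.0]), train_points, train_labels)
```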
Eigenrhythm BeatBox
• Resynthesize rhythms from eigen-space
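Resynthesis from eigen-space amounts to adding a weighted sum of eigenrhythms back onto the mean pattern. The sketch below is an assumption about how that could look; in particular, thresholding the result to decide where drum hits fall is a hypothetical choice, not something the slides specify.

```python
import numpy as np

def resynthesize(coords, mean, components, threshold=0.5):
    """Rebuild a pattern from eigenspace coordinates.

    Returns the continuous activation and a boolean hit mask
    (the threshold is an illustrative assumption).
    """
    coords = np.asarray(coords, dtype=float)
    activation = mean + coords @ components[: len(coords)]
    return activation, activation > threshold

# Usage with a trivial 4-step "pattern space".
activation, hits = resynthesize([1.0, 0.2], np.zeros(4), np.eye(4))
```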
2. Frequency-Domain Lin. Pred.
• (Time-domain) Linear Prediction
  the well-known spectral estimator
• Apply to a ‘frequency domain’ signal
  dual: estimates temporal envelope
[Embedded slide 4/16, Athineos & Ellis - Music processing with FDLP, 2004-05-25: Linear Prediction]
• Time domain
  – The well-known spectral estimator
• Frequency domain
  – Frequency is time and vice versa
TDLP: $y[n] = \sum_{i=1}^{p} a_i\, y[n-i] + e[n]$
FDLP (on the DCT): $Y[k] = \sum_{i=1}^{p} b_i\, Y[k-i] + E[k]$
with Marios Athineos
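The duality can be tried numerically: take the DCT of a signal, fit an all-pole model to the DCT sequence, and read the model's "spectrum" along the time axis as a temporal envelope. The sketch below is a minimal version under stated assumptions: an unnormalized DCT-II built as an explicit matrix (fine for short signals), autocorrelation-method LPC, and a small diagonal loading added purely for numerical stability. None of these choices are taken from the slides.

```python
import numpy as np

def lpc(seq, order):
    """Autocorrelation-method linear prediction: returns (coefficients, error gain)."""
    seq = np.asarray(seq, dtype=float)
    n = len(seq)
    r = np.correlate(seq, seq, mode="full")[n - 1 : n + order]
    r[0] *= 1.0 + 1e-6                       # assumed regularization, not from the slides
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], r[1 : order + 1])   # Toeplitz normal equations
    return a, r[0] - a @ r[1 : order + 1]

def fdlp_envelope(x, order=8):
    """Temporal envelope estimate from linear prediction on the DCT of x."""
    n = len(x)
    t = np.arange(n)
    dct = np.cos(np.pi * np.outer(t, t + 0.5) / n) @ x   # DCT-II, row k = coefficient k
    b, gain = lpc(dct, order)
    omega = np.pi * (t + 0.5) / n            # time sample n maps to this "frequency"
    poly = 1.0 - np.exp(-1j * np.outer(omega, np.arange(1, order + 1))) @ b
    return gain / np.abs(poly) ** 2

# Usage: a tone burst centered at sample 300 of 512.
t = np.arange(512)
x = np.exp(-0.5 * ((t - 300) / 10.0) ** 2) * np.cos(0.3 * t)
env = fdlp_envelope(x)
```

The envelope should peak near the burst; a higher `order` follows finer temporal detail at the cost of a less smooth estimate.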
Aside: Spectrogram of the DCT
• DCT gives a pure-real signal:
  Can we treat it like a waveform?
[Embedded slide 3/16, Athineos & Ellis - Music processing with FDLP, 2004-05-25: DCT spectrogram]
• Looks like a mirror image over time = freq axis
FDLP and TDLP Duality
[Figure: FDLP and TDLP duality; the figure labels are not recoverable from the extracted text]
Subband FDLP
• Temporal envelopes without 25 ms windows
[Embedded slide 12/16, Athineos & Ellis - Music processing with FDLP, 2004-05-25: Time-frequency slicing]
  Subband FDLP (per frequency subband)
  Auditory STFT (10-25 ms + Bark bin)
  TDLP (per time frame)
FDLP Applications
• Time-scale modification
• Modulation-domain “temporal equalization”
• Perceptual audio features...
[Embedded slide 8/16, Athineos & Ellis - Music processing with FDLP, 2004-05-25: Cascade Time-Frequency LP]
• Analysis
• Synthesis
[Embedded slide 13/16, Athineos & Ellis - Music processing with FDLP, 2004-05-25: Temporal equalization]
• Filtering in frequency
  = Filtering by inverse FDLP (temporal equalization)
[Diagram labels: DCT (1 sec up to whole sample); residual in freq.; flat temporal envelopes; overlap; OLA & iDCT]
PLP-squared
• FDLP fits temporal envelope with LP
  Perceptual Linear Prediction (PLP) smooths across frequency
  can we do both... iteratively?
• Speech features without ST windows
[Figure: time-frequency plot, t / sec (0.02 to 0.22) vs. Bark band (5 to 15)]
Marios Athineos, Hynek Hermansky
3. Meeting Turns
with Jerry Liu and ICSI
• Multi-mic recordings for speaker turns
  every voice reaches every mic... (?)
  ... but with differing coupling filters (delays, gains)
• Find turns with minimal assumptions
  e.g. ad-hoc sensor setups (multiple PDAs)
  differences to remove effect of source signal
  - no spectral models, < 1xRT
Between-channel cues: Timing (ITD) & Level
[Figure: stacked panels over time / s (125 to 175). Panels:
  Speaker activity: speaker ground-truth
  Timing diffs (ITD) (2 mic pairs, 250 ms win): xcorr peak lags (5pt med filt), skew / samp (-50 to 50)
  Peak correlation coefficient r: norm xcorr pk val (0 to 1)
  Per-channel energy: per-chan E, dB (-80 to -40)
  Between-channel energy differences: chan E diffs, dB (-10 to 10)]
Pre-whitening for ITD
• Inverse-filter by 12-pole LPC models (32 ms windows) to remove local resonances
• Filter out noise < 500 Hz, > 6 kHz
• Then cross-correlate...
[Figure: two panels of short-time cross-correlation, lag / samps (-100 to 100) vs. time / sec (1220 to 1240): "Short-time xcorr: raw signals" and "Short-time xcorr: whitened+filtered signals". Below each: "Speaker ground truth", spkr ID (2 to 6) vs. time / sec.]
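The three steps on this slide (LPC inverse filtering, band-limiting, cross-correlation) can be sketched per frame as below. Assumptions that are not from the slides: a 16 kHz sample rate (so a 32 ms window is 512 samples), a brick-wall FFT mask standing in for whatever band-pass filter was actually used, and a small diagonal loading in the LPC solve. All function names are hypothetical.

```python
import numpy as np

def lpc_coeffs(frame, order=12):
    """Autocorrelation-method LPC coefficients for one frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    r[0] *= 1.0 + 1e-6                      # assumed regularization
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[idx], r[1 : order + 1])

def whiten(frame, order=12):
    """Inverse-filter the frame by its own LPC model (prediction residual)."""
    inv_filter = np.concatenate(([1.0], -lpc_coeffs(frame, order)))
    return np.convolve(frame, inv_filter)[: len(frame)]

def bandlimit(frame, sr, lo=500.0, hi=6000.0):
    """Zero FFT bins outside [lo, hi] Hz (a crude stand-in for a band-pass filter)."""
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(frame))

def peak_lag(x, y, max_lag):
    """Lag in samples where y best matches x; positive means y is delayed.

    Assumes x and y have the same length.
    """
    full = np.correlate(y, x, mode="full")  # lag 0 sits at index len(x) - 1
    center = len(x) - 1
    window = full[center - max_lag : center + max_lag + 1]
    return int(np.argmax(window)) - max_lag

# Usage: two 512-sample views of the same noise source, offset by 7 samples.
rng = np.random.default_rng(0)
s = rng.standard_normal(2000)
x, y = s[50:562], s[43:555]
lag = peak_lag(whiten(x), whiten(y), max_lag=100)
```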
Choosing “Good” Frames
• Correlation coef. r ~ channel similarity:
  $r_{ij}[\ell] = \dfrac{\sum_n m_i[n]\, m_j[n+\ell]}{\sqrt{\sum m_i^2 \, \sum m_j^2}}$
• Select frames with r in top 50% in both pairs
  about 35% of points (435/1201)
• Cleaner basis for models
[Figure: two scatter plots of Skew12 / samples (-150 to 0) vs. Skew34 / samples (-100 to 0): "ITD - all points" and "ITD - high-correlation points (435/1201)"]
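A sketch of the frame selection, assuming equal-length, non-silent frames and using the per-pair median as the "top 50%" cutoff (the function names are hypothetical). Requiring both pairs to pass keeps between 25% of frames (if the two r values were independent) and 50% (if they always agreed), which is consistent with the roughly 35% reported.

```python
import numpy as np

def norm_xcorr_peak(mi, mj, max_lag):
    """Peak of r_ij[l] = sum_n mi[n] mj[n+l] / sqrt(sum mi^2 * sum mj^2).

    Returns (peak value, lag in samples). Frames must not be all zero.
    """
    denom = np.sqrt(np.sum(mi ** 2) * np.sum(mj ** 2))
    full = np.correlate(mj, mi, mode="full") / denom   # lag 0 at index len(mi) - 1
    center = len(mi) - 1
    window = full[center - max_lag : center + max_lag + 1]
    k = int(np.argmax(window))
    return float(window[k]), k - max_lag

def select_good_frames(r12, r34):
    """Indices of frames whose peak r is at or above the median for both mic pairs."""
    r12, r34 = np.asarray(r12), np.asarray(r34)
    return np.flatnonzero((r12 >= np.median(r12)) & (r34 >= np.median(r34)))

# Usage: channel j is channel i delayed by 5 samples.
rng = np.random.default_rng(1)
mi = rng.standard_normal(256)
r, lag = norm_xcorr_peak(mi, np.roll(mi, 5), max_lag=20)
good = select_good_frames([0.9, 0.2, 0.8, 0.1], [0.7, 0.9, 0.6, 0.1])
```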
Spectral clustering
• Eigenvectors of “affinity matrix” A to pick out similar points:
  $a_{mn} = \exp\{-\|x[m]-x[n]\|^2 / 2\sigma^2\}$
• Ad-hoc mapping to clusters
  Number of clusters K from eigenvalues ≈ points
[Figure: "Affinity matrix A", point index (up to 400) on both axes; and "first 12 eigenvectors (normalized)" plotted against point index (0 to 400)]
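The affinity matrix and its leading eigenvectors can be sketched as below. The mapping from eigenvectors to clusters is described on the slide only as ad hoc, so the rule used here (assign each point to the eigenvector with the largest magnitude at that point) is an assumption, as are the function names. In the idealized case where A is block diagonal with entries near 1, each leading eigenvalue is roughly the number of points in its block.

```python
import numpy as np

def affinity_matrix(x, sigma):
    """a_mn = exp(-||x[m] - x[n]||^2 / (2 sigma^2)) for the rows of x."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def leading_eigenvectors(a, k):
    """Top-k eigenvalues and eigenvectors of a symmetric matrix, largest first."""
    vals, vecs = np.linalg.eigh(a)
    order = np.argsort(vals)[::-1][:k]
    return vals[order], vecs[:, order]

def assign_clusters(vecs):
    """Assumed ad-hoc rule: pick the eigenvector with the largest magnitude per point."""
    return np.argmax(np.abs(vecs), axis=1)

# Usage: two well-separated groups of 30 and 20 points in 2-D.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(10.0, 0.1, (20, 2))])
vals, vecs = leading_eigenvectors(affinity_matrix(x, sigma=1.0), k=2)
labels = assign_clusters(vecs)
```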
Speaker Models & Classification
• Actual clusters depend on σ and K heuristic
• Fit Gaussians to each cluster,
  assign that class to all frames within radius
  or: consider dimensions independently, choose best
[Figure: three scatter panels, both axes from -100 to 0: "ICSI0: good points", "All pts: nearest class", "All pts: closest dimension"]
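Fitting one Gaussian per cluster and labeling frames that fall within a radius might look like the sketch below. Measuring the radius as a Mahalanobis distance, the default of 3.0, and the small covariance regularizer are all assumptions; the slide says only "within radius". Each cluster needs at least two points for the covariance estimate.

```python
import numpy as np

def fit_gaussians(points, labels):
    """Per-cluster (mean, inverse covariance) from labeled points."""
    models = {}
    for c in np.unique(labels):
        pts = points[labels == c]
        cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(points.shape[1])
        models[int(c)] = (pts.mean(axis=0), np.linalg.inv(cov))
    return models

def classify(frame, models, radius=3.0):
    """Closest class by Mahalanobis distance, or None if no class is within `radius`."""
    best, best_d = None, radius
    for c, (mean, inv_cov) in models.items():
        diff = frame - mean
        d = float(np.sqrt(diff @ inv_cov @ diff))
        if d <= best_d:
            best, best_d = c, d
    return best

# Usage: two synthetic "speakers" in a 2-D skew space.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal([-80.0, -40.0], 2.0, (50, 2)),
                    rng.normal([-20.0, -10.0], 2.0, (50, 2))])
labels = np.repeat([0, 1], 50)
models = fit_gaussians(points, labels)
```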
Performance Analysis
• Compare reference & system activity maps:
  system misses quiet speakers 2,3,4 (deletions)
  system splits speaker 6 (deletions+insertions)
  many short gaps (deletions)
• ~52% avg. error on NIST 2004 dev set
  speaker-characteristic-based systems ~25%
4. Segmenting Personal Audio
• Easy to record everything you hear
  ~100GB / year @ 64 kbps
• Very hard to find anything
  how to scan?
  how to visualize?
  how to index?
• Starting point: Collect data
  ~ 60 hours (8 days, ~7.5 hr/day)
  hand-mark 139 segments (26 min/seg avg.)
  assign to 16 classes (8 have multiple instances)
with Keansub Lee
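The storage figure depends on how many hours per day are recorded, which the slide does not state. The arithmetic below treats the hours per day as an assumed parameter and uses decimal units (1 kbps = 1000 bit/s, 1 GB = 10^9 bytes).

```python
def gigabytes_per_year(kbps=64, hours_per_day=24.0):
    """Storage in GB for one year of audio at a constant bit rate."""
    bytes_per_second = kbps * 1000 / 8
    return bytes_per_second * hours_per_day * 3600 * 365 / 1e9

# Round-the-clock recording vs. about 9.5 hours per day.
full_time = gigabytes_per_year(64, 24.0)    # about 252 GB
waking = gigabytes_per_year(64, 9.5)        # about 100 GB
```

Under these assumptions, roughly 100 GB per year corresponds to about 9.5 hours of recording per day.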
Features for Long Recordings
• Feature frames = 1 min (not 25 ms!)
• Characterize variation within each frame...
• and structure within coarse auditory bands
[Figure: six panels, each freq / bark (5 to 20) vs. time / min (50 to 450): Average Linear Energy (dB), Normalized Energy Deviation (dB), Average Log Energy (dB), Log Energy Deviation (dB), Average Spectral Entropy (bits), Spectral Entropy Deviation (bits)]
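The per-minute statistics named in the figure can be computed from short-time band energies roughly as follows. This is a sketch under assumptions: the input is a matrix of linear energies per short frame and auditory band, "normalized deviation" is taken to mean standard deviation divided by mean, and entropy is computed over the FFT bins inside one band. The function names and the small constants guarding against division by zero are hypothetical.

```python
import numpy as np

def minute_features(band_energy, frames_per_minute):
    """Per-minute statistics of (n_short_frames, n_bands) linear band energies."""
    n_min = band_energy.shape[0] // frames_per_minute
    e = band_energy[: n_min * frames_per_minute].reshape(n_min, frames_per_minute, -1)
    log_e = 10.0 * np.log10(e + 1e-12)
    mean_lin = e.mean(axis=1)
    return {
        "avg_linear_energy": mean_lin,
        "norm_energy_dev": e.std(axis=1) / (mean_lin + 1e-12),
        "avg_log_energy_db": log_e.mean(axis=1),
        "log_energy_dev_db": log_e.std(axis=1),
    }

def spectral_entropy(bin_energy):
    """Entropy in bits of the energy distribution over the bins of one band (last axis)."""
    p = bin_energy / (bin_energy.sum(axis=-1, keepdims=True) + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12), axis=-1)

# Usage: 2 minutes of constant energy in 3 bands, 60 short frames per minute.
feats = minute_features(np.ones((120, 3)) * np.array([1.0, 10.0, 100.0]), 60)
flat = spectral_entropy(np.ones(8))                      # flat band: 3 bits
peaky = spectral_entropy(np.array([1.0, 0.0, 0.0, 0.0])) # single bin: 0 bits
```

The average and deviation of `spectral_entropy` over each minute would then give the two entropy features.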
BIC Segmentation
• Untrained segmentation technique
  statistical test indicates good change points:
  $\log \dfrac{L(X_1;M_1)\,L(X_2;M_2)}{L(X;M_0)} \gtrless \dfrac{\lambda}{2}\,\log(N)\,\Delta\#(M)$
• Evaluate: 60hr hand-marked boundaries
  different features & combinations
  Correct Accept % @ False Accept = 2%:

  Features               Correct Accept
  µdB                    80.8%
  µH                     81.1%
  σH/µH                  81.6%
  µdB + σH/µH            84.0%
  µdB + σH/µH + µH       83.6%

[Figure: ROC curves, Sensitivity (0.2 to 0.8) vs. 1 - Specificity (0 to 0.04), for µdB, µH, σH/µH, µdB + σH/µH, and µdB + µH + σH/µH]
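The BIC test can be sketched for the common case where each model M is a single full-covariance Gaussian; the slides do not say which model family was used, so that is an assumption, as are the function names and the `margin` parameter. With d-dimensional features, splitting one Gaussian into two adds d + d(d+1)/2 parameters, which is the Δ#(M) term.

```python
import numpy as np

def gauss_loglik(x):
    """Maximized log-likelihood of a full-covariance Gaussian fit to the rows of x.

    The tiny regularizer makes the closed form slightly approximate.
    """
    n, d = x.shape
    cov = np.cov(x, rowvar=False, bias=True).reshape(d, d) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def bic_score(x, t, lam=1.0):
    """Log-likelihood gain of splitting at t, minus the BIC penalty. Positive favors a change."""
    n, d = x.shape
    gain = gauss_loglik(x[:t]) + gauss_loglik(x[t:]) - gauss_loglik(x)
    extra_params = d + d * (d + 1) / 2
    return gain - 0.5 * lam * np.log(n) * extra_params

def best_change_point(x, margin=5, lam=1.0):
    """Best split index and its score; the index is None if no split passes the test."""
    cands = range(margin, len(x) - margin)
    scores = [bic_score(x, t, lam) for t in cands]
    k = int(np.argmax(scores))
    return (cands[k], scores[k]) if scores[k] > 0 else (None, scores[k])

# Usage: 60 feature frames from one condition followed by 60 from another.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(5.0, 1.0, (60, 2))])
t, score = best_change_point(x)
```

Raising `lam` makes the test more conservative, which is one way to trade false accepts against correct accepts.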
Segment clustering
• Daily activity has lots of repetition:
  Automatically cluster similar segments
• Spectral clustering achieves ~70% correct
  16-way ground truth labels
  KL distance, smoothed covariance estimates
[Figure: matrix over segment classes, values 0 to 1. Row labels: campus, library, restaurant, street, bowling, home, car/taxi, lecture1, break, billiard, lecture2, barber, karaoke, meeting, supermkt. Column labels: cmp, lib, rst, str, ...]
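A KL-based distance between segments can be sketched by modeling each segment's feature frames as one Gaussian and summing the KL divergence in both directions. Two assumptions are made here that the slides do not confirm: the distance is the symmetrized KL, and "smoothed" covariance means shrinking the off-diagonal terms toward zero by a factor `alpha`.

```python
import numpy as np

def smoothed_gaussian(frames, alpha=0.1):
    """Mean and covariance of a segment; covariance shrunk toward its diagonal."""
    mean = frames.mean(axis=0)
    cov = np.atleast_2d(np.cov(frames, rowvar=False))
    return mean, (1 - alpha) * cov + alpha * np.diag(np.diag(cov))

def kl_gauss(m0, c0, m1, c1):
    """KL divergence KL(N(m0, c0) || N(m1, c1)) between two Gaussians."""
    d = len(m0)
    inv1 = np.linalg.inv(c1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(c0)
    _, logdet1 = np.linalg.slogdet(c1)
    return 0.5 * (np.trace(inv1 @ c0) + diff @ inv1 @ diff - d + logdet1 - logdet0)

def symmetric_kl(g0, g1):
    """Symmetrized KL between two (mean, covariance) pairs."""
    return kl_gauss(*g0, *g1) + kl_gauss(*g1, *g0)

# Usage: unit-covariance Gaussians one unit apart in each of two dimensions.
g_a = (np.zeros(2), np.eye(2))
g_b = (np.ones(2), np.eye(2))
dist = symmetric_kl(g_a, g_b)
```

A matrix of such distances between all segment pairs could then be turned into affinities for spectral clustering.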
Future Work
• Visualization / browsing / diary inference
  link to other information sources
•
• Privacy protection
  speaker/speech “search and destroy”
LabROSA Summary
• LabROSA
  signal processing + machine learning + information extraction
• Applications
  Eigenrhythms: drum pattern models
  FDLP temporal envelopes
  Meeting recordings
  Personal audio analysis
• Also...
  music similarity, signal separation, ...