Research in Sound Analysis

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected]

Sound Analysis - Ellis 2008-09-05

1. Personal and Consumer Audio
2. Musical Cover Song Detection
3. Binaural Source Separation

dpwe/talks/rosaview-2008-09.pdf


  • LabROSA Overview

    [Diagram labels: Information Extraction; Machine Learning; Signal Processing; Speech; Music; Environment; Recognition; Retrieval; Separation]

  • 1. Personal Audio Archives

    • Easy to record everything you hear

  • Browsing Interface

    • Browsing / Diary interface
        links to other information (diary, email, photos)
        synchronize with note taking? (Stifelman & Arons)
        audio thumbnails

    • Release Tools + “how to” for capture


  • Consumer Video

    • Short video clips as the evolution of snapshots
        10-60 sec, one location, no editing
        browsing?

    • More information for indexing...
        video + audio
        foreground + background

  • MFCC Covariance Representation

    • Each clip/segment → fixed-size statistics
        similar to speaker ID and music genre classification

    • Full covariance matrix of MFCCs maps the kinds of spectral shapes present

    • Clip-to-clip distances for SVM classifier
        by KL or 2nd Gaussian model

    [Figure: Video Soundtrack (VTS_04_0001 spectrogram, freq / kHz vs. time / sec, level / dB) → MFCC features (MFCC bin vs. time / sec) → MFCC Covariance Matrix (MFCC dimension vs. MFCC dimension)]
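The clip-level representation above can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: it assumes each clip is already available as a frames-by-dimensions MFCC matrix (random data stands in for real MFCCs here), fits a single full-covariance Gaussian per clip, and uses a symmetrized KL divergence as the clip-to-clip distance. The function names, the ridge term, and the `gamma` value in the kernel are all hypothetical choices.

```python
import numpy as np

def gaussian_stats(feats, ridge=1e-6):
    """Summarize a (n_frames, n_dims) feature matrix by its mean and full covariance."""
    mu = feats.mean(axis=0)
    # The small ridge is an assumption, added only to keep the covariance invertible.
    cov = np.cov(feats, rowvar=False) + ridge * np.eye(feats.shape[1])
    return mu, cov

def kl_gauss(mu0, cov0, mu1, cov1):
    """KL divergence KL(N0 || N1) between two full-covariance Gaussians."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d + logdet1 - logdet0)

def symmetric_kl(stats_a, stats_b):
    """Symmetrized KL, usable as a clip-to-clip distance."""
    return kl_gauss(*stats_a, *stats_b) + kl_gauss(*stats_b, *stats_a)

# Stand-in data: random matrices play the role of per-clip MFCC frames.
rng = np.random.default_rng(0)
clip_a = gaussian_stats(rng.standard_normal((500, 20)))
clip_b = gaussian_stats(rng.standard_normal((500, 20)) * 2.0 + 1.0)
dist = symmetric_kl(clip_a, clip_b)

# One common way to feed a distance to an SVM is an exponential kernel;
# gamma = 0.01 is a hypothetical value, not taken from the slides.
kernel = np.exp(-0.01 * dist)
```

Every clip reduces to the same fixed-size pair (mean, covariance) regardless of its duration, which is what makes pairwise distances between clips of different lengths straightforward.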

  • Audio-Only Results

    • Wide range of results:
        audio (music, ski) vs. non-audio (group, night)
        large AP uncertainty on infrequent classes

    [Figure: AP (0 to 0.8) per concept for 1G + KL, 1G + Mah, GMM Hist. + pLSA, and Guess. Concepts: Animal, Baby, Beach, Birthday, Boat, Crowd, Group of 3+, Group of 2, Museum, Night, One person, Park, Picnic, Playground, Show, Sports, Sunset, Wedding, Dancing, Parade, Singing, Cheer, Music, Ski]
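For readers unfamiliar with the AP axis, the sketch below computes non-interpolated average precision from a ranked list of classifier scores. The slides do not say which AP variant was used, so treat this as one plausible definition; the function name is hypothetical.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@k taken at the rank of each positive."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(labels, dtype=bool)[order]
    if not hits.any():
        return 0.0  # no positives: AP is undefined, reported as 0 here by convention
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

# With a single positive clip, moving it from rank 1 to rank 2 halves the AP.
ap_rank1 = average_precision([0.9, 0.8, 0.7], [1, 0, 0])
ap_rank2 = average_precision([0.9, 0.8, 0.7], [0, 1, 0])
```

The two-line example at the end illustrates why AP is noisy for infrequent classes: when a concept has very few positive clips, a one-position change in the ranking shifts the score substantially.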

  • How does it ‘feel’?

    • Browser impressions: How wrong is wrong?

    [Figure: Top 8 hits for “Baby”]

  • 2. Cover Song Detection: Chroma

    • Chroma features map spectral energy into one canonical octave
        i.e. 12 semitone bins

    • Can resynthesize as “Shepard Tones”
        all octaves at once

    [Figure: Piano scale spectrogram (freq / kHz vs. time / sec); Piano chromatic scale IF chroma (chroma A to G vs. time / frames); 12 Shepard tone spectra (level / dB vs. freq / Hz); Shepard tone resynth spectrogram (freq / kHz vs. time / sec)]
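The folding of spectral energy into 12 semitone bins can be sketched as follows. Note the swap: the figure refers to instantaneous-frequency (IF) chroma, whereas this simplified version just assigns each FFT bin to its nearest semitone. The function name, the `fmin` cutoff, and the A4 = 440 Hz reference are assumptions for illustration.

```python
import numpy as np

def chroma_from_frame(frame, sr, fmin=55.0):
    """Fold the power spectrum of one audio frame into 12 pitch classes (0 = C, 9 = A)."""
    window = np.hanning(len(frame))
    power = np.abs(np.fft.rfft(frame * window)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    keep = freqs >= fmin  # skip DC and very low bins, where semitones are poorly resolved
    # MIDI-style note number, assuming A4 = 440 Hz maps to note 69.
    midi = 69.0 + 12.0 * np.log2(freqs[keep] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12
    # Sum power over all octaves of each pitch class.
    return np.bincount(pitch_class, weights=power[keep], minlength=12)

# Pure tones one octave apart should land in the same chroma bin.
sr = 8000
t = np.arange(4096) / sr
chroma_a4 = chroma_from_frame(np.sin(2 * np.pi * 440.0 * t), sr)
chroma_a5 = chroma_from_frame(np.sin(2 * np.pi * 880.0 * t), sr)
```

Because octave information is discarded, an A at 440 Hz and an A at 880 Hz produce the same dominant bin, which is the property that makes chroma useful for comparing performances in different registers.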

  • Beat-Synchronous Chroma Features

    • Beat + chroma features / 30ms frames
        → average chroma within each beat
        compact; sufficient?

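The beat-averaging step can be sketched as below, assuming a beat tracker has already produced the frame index at which each beat starts. Beat tracking itself is not shown, and the function name is hypothetical.

```python
import numpy as np

def beat_sync_chroma(chroma, beat_frames):
    """Average a (12, n_frames) chroma matrix within each beat interval.

    beat_frames holds the starting frame of each beat, in increasing order.
    Frames before the first beat are dropped; the last beat runs to the end.
    """
    bounds = np.append(np.asarray(beat_frames, dtype=int), chroma.shape[1])
    segments = [chroma[:, start:end].mean(axis=1)
                for start, end in zip(bounds[:-1], bounds[1:]) if end > start]
    return np.stack(segments, axis=1)

# Toy input: six frames whose value equals the frame index, beats every two frames.
frames = np.tile(np.arange(6.0), (12, 1))
synced = beat_sync_chroma(frames, [0, 2, 4])
```

The result has one column per beat rather than one per 30 ms frame, so two performances at different tempos end up on a comparable time base.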

  • Cross-Correlation Matching

    • Look inside global cross-correlation to find matching fragments...

    xcorr = Σ_t Σ_f (C1(t, f)⋅C2(t, f)) - view along time

    [Figure: chroma (A to G) vs. time / beats for Let It Be / Beatles (beats 11-441) and Let It Be / Nick Cave (beats 13-443), plus a panel with values from -0.2 to 0.4 along time / beats]
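A minimal sketch of cross-correlating two beat-synchronous chroma matrices over time lags is given below. It sums the per-chroma-row correlations at each lag and covers only time shifts; any normalization, filtering, or key transposition used in the actual system is not reproduced. The function name and the random test data are assumptions.

```python
import numpy as np

def chroma_xcorr(c1, c2):
    """Cross-correlate two (12, n_beats) chroma matrices along time.

    Returns (lags, xc), where xc[i] sums C1(t + lag, f) * C2(t, f) over t and f
    for lag = lags[i].
    """
    n1, n2 = c1.shape[1], c2.shape[1]
    lags = np.arange(-(n2 - 1), n1)
    xc = np.zeros(len(lags))
    for f in range(c1.shape[0]):
        # mode="full" covers every relative offset between the two rows.
        xc += np.correlate(c1[f], c2[f], mode="full")
    return lags, xc

# Stand-in data: a fragment cut from a longer random "song" at beat 5.
rng = np.random.default_rng(0)
song = rng.standard_normal((12, 40))
fragment = song[:, 5:15]
lags, xc = chroma_xcorr(song, fragment)
best_lag = int(lags[np.argmax(xc)])
```

To "view along time" at the best lag, one would keep the per-beat products at that offset instead of summing them, which shows where within the song the match is strong.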

  • “The Meaning of Music”

    The ultimate goal of this research...

    • What does music evoke in a listener’s mind?
        i.e. “what does it all mean?” (metaphysics?)
        study with subjective experiments
        (then build detectors for specific responses ...?)

    • What phenomena are denoted by “music”?
        i.e. delineate the “set of all music”
        (the ultimate music/nonmusic classifier?)

  • 3. Binaural Source Separation

    • 2 or 3 sources in reverberation
        assume just 2 ‘ears’

    • Tasks:
        identify positions of sources (and number?)
        recover source signals

  • Spatial Estimation in Reverb

    • Model interaural spectrum of each source as stationary level and time differences:
        converge via EM to a(), τ for each source
        mask is Pr(X(t,ω) dominated by source i)

    Mandel & Ellis ’07

    [Figure panels: Left, Right, ILD, IPD, Mask 1, Mask 2; 8.77 dB]
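The ILD and IPD panels correspond to the two interaural cues that the model works from. The sketch below computes them per time-frequency cell from left and right complex spectra using the standard definitions; it does not implement the EM fitting or the masks. The left-over-right sign convention, the function name, and the `eps` guard are assumptions.

```python
import numpy as np

def interaural_cues(left_spec, right_spec, eps=1e-12):
    """Interaural level difference (dB) and phase difference (radians) per cell.

    left_spec and right_spec are complex STFT values of the same shape.
    """
    ratio = (left_spec + eps) / (right_spec + eps)  # eps guards against division by zero
    ild_db = 20.0 * np.log10(np.abs(ratio))
    ipd = np.angle(ratio)  # wrapped to (-pi, pi]
    return ild_db, ipd

# Synthetic check: the right channel is the left at half amplitude, lagging by 0.3 rad.
rng = np.random.default_rng(0)
left = (1.0 + rng.random(8)) * np.exp(1j * rng.uniform(-np.pi, np.pi, 8))
right = 0.5 * left * np.exp(-0.3j)
ild_db, ipd = interaural_cues(left, right)
```

Because the phase difference is only known modulo 2π, a given IPD is consistent with several time delays at high frequencies, which is one reason a probabilistic treatment of the delay is attractive.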

  • Spatial Estimation Results

    • Modeling uncertainty improves results
        tradeoff between constraints & noisiness

    [Figure: panels labeled 2.45 dB, 9.12 dB, 0.22 dB, -2.72 dB, 12.35 dB, 8.77 dB]

  • Conclusions

    • LabROSA
        information from sound ...
        ... via signal processing and machine learning

    • Environmental Sound
    • Music Audio
    • Source Separation
    • Speech, models, dolphins...