23
Review: Learning Bimodal Structures in Audio-Visual Data CSE 704 : Readings in Joint Visual, Lingual and Physical Models and Inference Algorithms Suren Kumar Vision and Perceptual Machines Lab 106 Davis Hall UB North Campus [email protected] January 28, 2014 Suren Kumar January 28, 2014 1/19

Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · [email protected] January 28, 2014 Suren Kumar January 28, 2014 1/19

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Review: Learning Bimodal Structuresin Audio-Visual DataCSE 704 : Readings in Joint Visual, Lingual andPhysical Models and Inference Algorithms

Suren Kumar

Vision and Perceptual Machines Lab106 Davis HallUB North Campus

[email protected]

January 28, 2014

Suren Kumar January 28, 2014 1/19

Page 2: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Summary

What? Learn bimodally informative structures from audio-visualsignals

Why? Understand complex relationship between the inputs todifferent sensory modalities

How? Represent audio-video signals as sparse sum of kernelsconsisting of audio-waveform and a spatio-temporal visualbasis.

Motivation: Biological Evidence, Evolution

Suren Kumar January 28, 2014 Introduction 2/19

Page 3: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Literature

I Detect synchronous co-occurrences of transient structures indifferent modalities.

1. Extract fixed and predefined unimodal features in audio andvideo stream separately

2. Analyze correlation between the resulting featurerepresentation

What is the problem with it?

Suren Kumar January 28, 2014 Introduction 3/19

Page 4: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Literature

I Detect synchronous co-occurrences of transient structures indifferent modalities.

1. Extract fixed and predefined unimodal features in audio andvideo stream separately

2. Analyze correlation between the resulting featurerepresentation

What is the problem with it?

Suren Kumar January 28, 2014 Introduction 3/19

Page 5: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Basic Steps

I Capture bimodal signal structure by shift-invariance sparsegenerative model.

I Unsupervised learning for forming an overcomplete dictionary

Suren Kumar January 28, 2014 Introduction 4/19

Page 6: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

pth norm of a vector x in Euclidean space is defined as

||x ||p = (∑n

i=1 |xi |p)1p

Suren Kumar January 28, 2014 Background 5/19

Page 7: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Matching Pursuit

Consider a dictionary D = (g)g∈D in a Hilbert space H.The dictionary is countable, normalized (||g || = 1), complete, butnot orthogonal and possibly redundant.Sparse Approximation Problem: Given N > 0 and f ∈ H,construct an N-term combination fN =

∑k=1,2,..N ckgk with

gk ∈ D which approximates f “at best”, and study how fast fNconverges to f .Sparse Recovery Problem: if f has an unknown representationf =

∑g cgg with (cg )g∈D a (possibly) sparse sequence, recover

this sequence exactly or approximately from the data of f .Building an optimal sparse representation of arbitrary signals isNP-hard problem.

Suren Kumar January 28, 2014 Background:Matching Pursuit 6/19

Page 8: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Matching Pursuit

Matching pursuit is a greedy approximation to the problem.Initialization f0 = 0Projection Step: At step k − 1, projection step is theapproximation of ffk−1 = Span{g1, ..., gk−1}Selection Step: Choice of next element based on residualrk−1 = f − fk−1gk = arg maxg∈D | < rk−1, g > |

Suren Kumar January 28, 2014 Background:Matching Pursuit 7/19

Page 9: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Convolutional Generative Model

Audio visual data s = (a, v) , a(t), v(x , y , t)

Dictionary {φk}, φk = (φ(a)k (t), φ

(v)k (x , y , t))

Each atom can be translated to any point in space and time usingoperator T(p,q,r)

T(p,q,r) = (φ(a)k (t − r), φ

(v)k (x − p, y − q, t − r))

Thus an audio-visual signal can be represented as

s ≈∑K

k=1

∑nki=1 ckiT(p,q,r)ki

φk , cki = (c(a)ki, c

(v)ki

)

Coding Find the coefficients (sparse)

Learning Learning the dictionary

Suren Kumar January 28, 2014 Audio-Visual Representation 8/19

Page 10: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Audio-Visual Matching Pursuit

Transient substructures that co-occur simultaneously are indicativeof common underlying physical cause.Discrete audio-visual translation τ

Signal Approximation. Start with R0s = s

Suren Kumar January 28, 2014 Audio-Visual Representation 9/19

Page 11: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Audio-Visual Matching Pursuit

The function φ0 and its spatio-temporal translation are chosenmaximizing similarity measures C (R0s, φ)When to Stop?

Number of iterations N or the maximum value of C betweenresidual and dictionary elements falls below a threshold.

I Similarity measure should reflect properties of humanperception

I Unaffected by small relative time-shifts

Suren Kumar January 28, 2014 Audio-Visual Representation 10/19

Page 12: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Audio-Visual Matching Pursuit

The function φ0 and its spatio-temporal translation are chosenmaximizing similarity measures C (R0s, φ)When to Stop?Number of iterations N or the maximum value of C betweenresidual and dictionary elements falls below a threshold.

I Similarity measure should reflect properties of humanperception

I Unaffected by small relative time-shifts

Suren Kumar January 28, 2014 Audio-Visual Representation 10/19

Page 13: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Choosing the norm

Suren Kumar January 28, 2014 Audio-Visual Representation 11/19

Page 14: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Learning

Find the kernel functions from a given set of audio-visual data.

Assuming the noise to be gaussian

Suren Kumar January 28, 2014 Learning:Gradient Ascent Learning 12/19

Page 15: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Learning

Find the kernel functions from a given set of audio-visual data.

Assuming the noise to be gaussian

Suren Kumar January 28, 2014 Learning:Gradient Ascent Learning 12/19

Page 16: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Learning

Update:φk [j ] = φk [j − 1] + ηδφk

Suren Kumar January 28, 2014 Learning:Gradient Ascent Learning 13/19

Page 17: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Learning

Key Idea : Update only one atom φk

Update each function with principal component of the residualerrors. In general has faster convergence compared to GA.

Suren Kumar January 28, 2014 Learning:K-SVD algorithm 14/19

Page 18: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Synthetic Data

Audio - 3 sine waves and video has four black shapes

Suren Kumar January 28, 2014 Experiments 15/19

Page 19: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Audio-Visual Speech

Speaker uttering the digits from zero to nine in englishStudy convergence for 1 and 2 norm with GA and K-SVD

I C1 encodes joint audio-visual structures better.

I K-SVD is faster compared to GA.

Suren Kumar January 28, 2014 Experiments 16/19

Page 20: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Suren Kumar January 28, 2014 Experiments 17/19

Page 21: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Sound Source Localization

CUAVE - Visual Distractor and Acoustin Distractor

I Filter Audio

I Corresponding audio filter audio function and store themaximum projection

I Filter with corresponding video function and store maximumprojection.

I Cluster video position

Suren Kumar January 28, 2014 Experiments 18/19

Page 22: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Summary

I Audio - Visual Matching Pursuit

I Similarity measures and learning algorithms

I Testing for speaker localization with distractors.

Suren Kumar January 28, 2014 Experiments 19/19

Page 23: Review: Learning Bimodal Structures in Audio-Visual Datajcorso/t/2014S_SEM/files/suren_MonaciTNN2009.pdf · surenkum@buffalo.edu January 28, 2014 Suren Kumar January 28, 2014 1/19

Reference

I http://www.math.tamu.edu/ popov/Learning/Cohen.pdf

Suren Kumar January 28, 2014 Experiments 20/19