
Page 1

Yaniv Romano, The Electrical Engineering Department, Technion – Israel Institute of Technology

Joint work with

Jeremias Sulam Vardan Papyan Prof. Michael Elad

1

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013), ERC Grant Agreement no. 320649 (ERC-SPARSE).

A Quest for a Universal Model for Signals:

From Sparsity to ConvNets

Page 2

Local Sparsity

Signal = Dictionary (learned) × Sparse vector

Page 3

Local Sparsity

Independent patch-processing


Page 4

Local Sparsity

Independent patch-processing


Page 5


Convolutional Sparse Coding

Independent patch-processing

Local Sparsity

Page 6

Multi-Layer Convolutional Sparse Coding

Convolutional Sparse Coding

Independent patch-processing

Local Sparsity

Page 7

Convolutional Sparse Coding

Independent patch-processing

Local Sparsity

Convolutional Neural Networks

Multi-Layer Convolutional Sparse Coding

Page 8

Local Sparsity → Independent patch-processing → Convolutional Sparse Coding → Multi-Layer Convolutional Sparse Coding → Convolutional Neural Networks

Extension of the classical sparse theory to a multi-layer setting: the forward pass is a sparse-coding algorithm, serving our model!

Page 9

Our Story Begins with…

9

Page 10

Image Denoising

Many (thousands of) image denoising algorithms can be cast as the minimization of an energy function of the form

10

Noisy Image Y = Original Image X + White Gaussian Noise E

\min_X \frac{1}{2}\|X - Y\|_2^2 + G(X)

(The first term is the relation to the measurements; G(X) is the prior, or regularization.)

(Behind the "thousands": a literature search for Topic = image AND noise AND (removal OR denoising).)

Page 11

Leading Image Denoising Methods…

are built upon powerful patch-based local models. Popular local models: GMM, sparse representations, example-based, low-rank, Field-of-Experts, and neural networks.

11

Working locally allows us to learn the model, e.g., via dictionary learning.

Page 12

Why is it so Popular?

12

• It does come up in many applications
• Other inverse problems can be recast as iterative denoising [Zoran & Weiss '11] [Venkatakrishnan, Bouman & Wohlberg '13] [Brifman, Romano & Elad '16]:

\min_X \frac{1}{2}\|HX - Y\|_2^2 + G(X)

(relation to the measurements + prior)

• In a recent work we show that a denoiser f(X) can form a regularizer: Regularization by Denoising (RED) [Romano, Elad & Milanfar '16]:

G(X) = \frac{\rho}{2}\, X^T (X - f(X)), \qquad \nabla_X G(X) = X - f(X) \quad \text{under simple conditions on } f(X)

• It is the simplest inverse problem revealing the limitations of the model…

Page 13

• Assumes that every patch is a linear combination of a few columns, called atoms, taken from a matrix called a dictionary

• The operator 𝐑i extracts the i-th 𝑛-dimensional patch from 𝐗 ∈ ℝ𝑁

• Sparse coding:

R_i X = \Omega \gamma_i

(R_i X is an n-dimensional patch; \Omega is an n \times m local dictionary with m > n; \gamma_i is the sparse vector.)

The Sparse-Land Model

13

(P_0):\ \min_{\gamma_i} \|\gamma_i\|_0 \ \ \text{s.t.}\ \ R_i X = \Omega\gamma_i

Page 14

Patch Denoising

14

Given a noisy patch R_i Y, solve

(P_0^\epsilon):\ \hat\gamma_i = \arg\min_{\gamma_i} \|\gamma_i\|_0 \ \ \text{s.t.}\ \ \|R_i Y - \Omega\gamma_i\|_2 \le \epsilon

and take \Omega\hat\gamma_i as the clean patch. (P_0) and (P_0^\epsilon) are hard to solve; in practice one uses greedy methods such as Orthogonal Matching Pursuit (OMP) or Thresholding, or convex relaxations such as Basis Pursuit (BP):

(P_1^\epsilon):\ \min_{\gamma_i} \|\gamma_i\|_1 + \xi\|R_i Y - \Omega\gamma_i\|_2^2
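To make the greedy option concrete, here is a minimal NumPy sketch of OMP for a single patch (the names and the stopping rule are our illustration, not code from the talk):

```python
import numpy as np

def omp(Omega, y, eps):
    """Greedy OMP: add the atom most correlated with the residual,
    re-fit the active coefficients by least squares, and stop when
    ||y - Omega @ gamma||_2 <= eps."""
    support = []
    coeffs = np.zeros(0)
    residual = y.astype(float).copy()
    gamma = np.zeros(Omega.shape[1])
    while np.linalg.norm(residual) > eps and len(support) < Omega.shape[0]:
        support.append(int(np.argmax(np.abs(Omega.T @ residual))))
        sub = Omega[:, support]
        coeffs, *_ = np.linalg.lstsq(sub, y, rcond=None)
        residual = y - sub @ coeffs
    gamma[support] = coeffs
    return gamma  # the clean patch estimate is Omega @ gamma
```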

Page 15

Recall K-SVD Denoising [Elad & Aharon, ‘06]

15

Pipeline: Noisy Image → denoise each patch using OMP (starting from an initial dictionary) → update the dictionary using K-SVD → Reconstructed Image.

• Despite its simplicity, this is a very well-performing algorithm
• We refer to this framework as "patch averaging"
• A modification of this method leads to state-of-the-art results [Mairal, Bach, Ponce, Sapiro, Zisserman '09]
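A compact sketch of the resulting patch-averaging loop (it reuses the omp helper from the previous sketch; the K-SVD dictionary update itself is omitted):

```python
import numpy as np  # reuses omp() from the previous sketch

def patch_average_denoise(Y, Omega, p, eps):
    """K-SVD-style denoising: sparse-code every overlapping p-by-p patch
    of the noisy image Y over Omega, then average the cleaned patches."""
    H, W = Y.shape
    acc = np.zeros_like(Y, dtype=float)
    cnt = np.zeros_like(Y, dtype=float)
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            y = Y[i:i + p, j:j + p].ravel()
            x = Omega @ omp(Omega, y, eps)      # clean patch Omega @ gamma_i
            acc[i:i + p, j:j + p] += x.reshape(p, p)
            cnt[i:i + p, j:j + p] += 1.0
    return acc / cnt  # averaging the overlaps ("patch averaging")
```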

Page 16

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of…

16

What is Missing?

The Global-Local Gap: efficient independent local patch processing vs. the global need to model the entire image.

Page 17


• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

17

(Figure: a noisy patch R_i Y and its estimate \Omega\gamma_i.)

What is Missing?

Page 18


• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

18

(The residual R_i Y - \Omega\gamma_i is orthogonal to the estimate when using OMP.)

What is Missing?

Page 19


• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

19

(The residual R_i Y - \Omega\gamma_i is orthogonal to the estimate when using OMP, but this orthogonality is lost due to patch-averaging.)

What is Missing?

Page 20


• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

20

(The residual R_i Y - \Omega\gamma_i is orthogonal to the estimate when using OMP.)

What is Missing?

Page 21

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

21

Denoise

What is Missing?

Page 22

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

22

Denoise

Previous Result

Strengthen

What is Missing?

Page 23

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

23

Denoise

Previous Result

Strengthen – Operate

What is Missing?

Page 24

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

24

Denoise

Previous Result

Strengthen – Operate – Subtract

What is Missing?

Page 25

25

• Relation to Game Theory: encourage overlapping patches to reach a consensus
• Relation to Graph Theory: minimizing a Laplacian regularization functional:

X^{k+1} = f(Y + \rho X^k) - \rho X^k

X^* = \arg\min_X \|X - f(Y)\|_2^2 + \rho X^T (X - f(X))

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Why "boosting"? Because it is guaranteed to be able to improve any denoiser.

What is Missing?

Page 26

26

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

What is Missing?

Page 27

27

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]

What is Missing?

Page 28

28

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]

Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]

• Con-Patch

Self-Similarity feature

What is Missing?

Page 29

29

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]

Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]

• Con-Patch

• RAISR (Upscaling)

Edge Features: Direction, Strength, Consistency

What is Missing?

Page 30

30

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]

Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]

• Con-Patch

• RAISR (Upscaling)

Edge Features: Direction, Strength, Consistency

What is Missing?

Page 31

31

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]

Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]

Enforcing the local model on the final patches (EPLL)

• Sparse EPLL [Sulam & Elad ‘15]

(Figure panels: Noisy, K-SVD, EPLL.)

What is Missing?

Page 32

32

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]

Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]

Enforcing the local model on the final patches (EPLL)

• Sparse EPLL [Sulam & Elad ‘15]

• Image Synthesis [Ren, Romano & Elad ‘17]

What is Missing?

Page 33

33

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]

Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]

Enforcing the local model on the final patches (EPLL)

• Sparse EPLL [Sulam & Elad ‘15]

• Image Synthesis [Ren, Romano & Elad ‘17]

What is Missing?

Page 34

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]

Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]

Enforcing the local model on the final patches (EPLL)

• Sparse EPLL [Sulam & Elad ‘15]

• Image Synthesis [Ren, Romano & Elad ‘17]

What is Missing?

34

Page 35

• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking

• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]

SOS-Boosting [Romano & Elad ’15]

Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]

A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]

Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]

Enforcing the local model on the final patches (EPLL)

• Sparse EPLL [Sulam & Elad ‘15]

• Image Synthesis [Ren, Romano & Elad ‘17]

What is Missing?

35

Missing a theoretical backbone!

Why can we use a local prior to solve a global problem?

Page 36

Now, Our Story Takes a Surprising Turn

36

Page 37

37

Convolutional Sparse Coding

[LeCun, Bottou, Bengio and Haffner ‘98] [Lewicki & Sejnowski ‘99] [Hashimoto & Kurata, ‘00] [Mørup, Schmidt & Hansen ’08]

[Zeiler, Krishnan, Taylor & Fergus ‘10] [Jarrett, Kavukcuoglu, Gregor, LeCun ‘11] [Heide, Heidrich & Wetzstein ‘15] [Gu, Zuo, Xie, Meng, Feng & Zhang ‘15] …

Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding

Vardan Papyan, Jeremias Sulam and Michael Elad

Convolutional Dictionary Learning via Local Processing Vardan Papyan, Yaniv Romano, Jeremias Sulam, and Michael Elad

Page 38

Intuitively…

38

𝐗

(Figure: X written as a sum of shifted copies of the first filter plus shifted copies of the second filter.)

Page 39

Convolutional Sparse Coding (CSC)

39

(Figure: the signal equals a sum of filters, each convolved with its own sparse feature map.)

Page 40

Convolutional Sparse Representation

40

The local model R_i X = \Omega\gamma_i becomes the global model X = D\Gamma.

(Figure: the i-th patch R_i X, of length n, is generated by the stripe vector \gamma_i of length (2n-1)m taken from \Gamma, multiplying the stripe-dictionary; D_L denotes the local dictionary.)

Adjacent representations \gamma_i, \gamma_{i+1} overlap, as they skip by m items as we sweep through the patches of X.
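For concreteness, a small sketch of how such a global convolutional dictionary could be assembled from m local filters (our illustration, assuming 1-D signals and circular boundaries):

```python
import numpy as np

def conv_dictionary(filters, N):
    """Build the global convolutional dictionary D (N x N*m): column
    (i*m + f) holds filter f circularly shifted to position i."""
    n, m = filters.shape
    D = np.zeros((N, N * m))
    for i in range(N):
        for f in range(m):
            idx = (np.arange(n) + i) % N      # circular support of the atom
            D[idx, i * m + f] = filters[:, f]
    return D
```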

Page 41

CSC Relation to Our Story

• A clear global model: every patch has a sparse representation w.r.t. the same local dictionary \Omega, just as we have assumed
• No notion of disagreement on the patch overlaps
• Related to the current common practice of patch averaging (R_i^T puts the patch \Omega\gamma_i back in the i-th location of the global vector):

X = D\Gamma = \frac{1}{n}\sum_i R_i^T \Omega\gamma_i

• What about the pursuit? "Patch averaging" performs independent sparse coding for each patch; CSC should seek all the representations together
• Is there a bridge between the two? We'll come back to this later…
• What about the theory? Until recently, little was known regarding the theoretical aspects of CSC

41

Page 42

Classical Sparse Theory (Noiseless)

42

(P_0):\ \min_\Gamma \|\Gamma\|_0 \ \ \text{s.t.}\ \ X = D\Gamma

Definition (Mutual Coherence) [Donoho & Elad '03]:

\mu(D) = \max_{i \ne j} |d_i^T d_j|

Theorem [Donoho & Elad '03]: For a signal X = D\Gamma, if

\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right),

then this solution is necessarily the sparsest.

Theorem [Tropp '04], [Donoho & Elad '03]: OMP and BP are guaranteed to recover the true sparse code assuming that

\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right).
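The quantities above are easy to compute; a short NumPy sketch (ours, with a random dictionary purely for illustration):

```python
import numpy as np

def mutual_coherence(D):
    """mu(D) = max_{i != j} |d_i^T d_j| over l2-normalized columns."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return G.max()

D = np.random.randn(64, 128)   # a random dictionary, for illustration only
mu = mutual_coherence(D)
print("mu =", mu, "-> allowed non-zeros <", 0.5 * (1 + 1 / mu))
```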

Page 43

CSC: The Need for a Theoretical Study

43

• Assuming m = 2 filters and patches of dimension n = 64, the Welch bound gives \mu(D) \ge 0.063 [Welch '74]
• As a result, uniqueness and success of pursuits are guaranteed only as long as

\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right) \le \frac{1}{2}\left(1 + \frac{1}{0.063}\right) \approx 8

• Fewer than 8 non-zeros GLOBALLY are allowed!!! This is a very pessimistic result!
• Repeating the above for the noisy case leads to even worse performance predictions
• Bottom line: classic SparseLand theory cannot provide good explanations for the CSC model

Page 44

Moving to Local Sparsity: Stripes

44

\|\Gamma\|_{0,\infty}^s = \max_i \|\gamma_i\|_0

(P_{0,\infty}):\ \min_\Gamma \|\Gamma\|_{0,\infty}^s \ \ \text{s.t.}\ \ X = D\Gamma

A low \|\Gamma\|_{0,\infty}^s means all stripes \gamma_i are sparse, i.e., every patch has a sparse representation over \Omega.

• If \Gamma is locally sparse [Papyan, Sulam & Elad '16]:
  – The solution of (P_{0,\infty}) is necessarily unique
  – The global OMP and BP are guaranteed to recover it

This result poses a local constraint for a global guarantee, and as such the guarantees scale linearly with the dimension of the signal.

(Figure: the stripes \gamma_i for m = 2, sweeping through \Gamma.)
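A sketch (ours) of the ℓ0,∞ stripe norm, matching the circulant construction sketched earlier:

```python
import numpy as np

def stripe_norm(Gamma, n, m):
    """||Gamma||_{0,inf}^s = max_i ||gamma_i||_0, where gamma_i is the
    (2n-1)*m-long stripe of coefficients whose atoms can touch the
    i-th patch (circular indexing, matching conv_dictionary above)."""
    N = len(Gamma) // m
    G = Gamma.reshape(N, m)        # row i = coefficients at shift i
    best = 0
    for i in range(N):
        rows = np.arange(i - n + 1, i + n) % N
        best = max(best, int(np.count_nonzero(G[rows])))
    return best
```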

Page 45

From Ideal to Noisy Signals

45

• In practice, Y = D\Gamma + E, where E is due to noise or model deviations

(P_{0,\infty}^\epsilon):\ \min_\Gamma \|\Gamma\|_{0,\infty}^s \ \ \text{s.t.}\ \ \|Y - D\Gamma\|_2 \le \epsilon

• If \Gamma is locally sparse and the noise is bounded [Papyan, Sulam & Elad '16]:
  – The solution of (P_{0,\infty}^\epsilon) is stable: how close is \hat\Gamma to \Gamma? The true and estimated representations are provably close
  – The solution obtained via global OMP/BP is stable

Page 46

Global Pursuit via Local Processing

46

(P_1^\epsilon):\ \Gamma_{BP} = \arg\min_\Gamma \frac{1}{2}\|Y - D\Gamma\|_2^2 + \xi\|\Gamma\|_1

X = D\Gamma = \sum_i R_i^T s_i = \sum_i R_i^T D_L \alpha_i

where D_L is the n \times m local dictionary, and the s_i are slices, not patches.

Page 47

Global Pursuit via Local Processing (2)

47

Using the variable splitting s_i = D_L\alpha_i, (P_1^\epsilon) becomes

\min_{\{s_i\},\{\alpha_i\}} \frac{1}{2}\left\|Y - \sum_i R_i^T s_i\right\|_2^2 + \xi\sum_i \|\alpha_i\|_1 \ \ \text{s.t.}\ \ s_i = D_L\alpha_i

• The two problems are equivalent, and convex w.r.t. their variables
• The new formulation targets the local slices and their sparse representations
• It can be solved via ADMM, replacing the constraint with a penalty

Page 48

Slice-Based Pursuit

48

Local Sparse Coding: \alpha_i = \arg\min_{\alpha_i} \frac{\rho}{2}\|s_i - D_L\alpha_i + u_i\|_2^2 + \xi\|\alpha_i\|_1

Slice Reconstruction: p_i = \frac{1}{\rho} R_i Y + D_L\alpha_i - u_i

Slice Aggregation: X = \sum_j R_j^T p_j

Local Laplacian: s_i = p_i - \frac{1}{\rho + n} R_i X

Dual Variable Update: u_i = u_i + s_i - D_L\alpha_i

Comment: one iteration of this procedure amounts to… the very same patch-averaging algorithm we started with.
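One possible rendering of this loop in NumPy, for 1-D signals (our sketch; the local sparse-coding step is approximated by a single soft-thresholding pass rather than a full BP solve):

```python
import numpy as np

def soft(v, t):
    """Elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def slice_pursuit_1d(Y, DL, xi=0.1, rho=1.0, iters=20):
    """ADMM sketch of the slice-based pursuit for a 1-D signal Y.

    DL is the n-by-m local dictionary; position i owns a slice s_i and a
    representation alpha_i, and the estimate is X = sum_i R_i^T s_i."""
    N = len(Y)
    n, m = DL.shape
    S = np.zeros((N, n))                      # slices s_i
    U = np.zeros((N, n))                      # scaled duals u_i
    RY = np.array([np.pad(Y[i:i + n], (0, max(0, i + n - N)))[:n]
                   for i in range(N)])        # patches R_i Y (zero-padded)
    for _ in range(iters):
        A = soft((S + U) @ DL, xi / rho)      # local sparse coding (approx.)
        P = RY / rho + A @ DL.T - U           # slice reconstruction p_i
        X = np.zeros(N + n)                   # slice aggregation: sum_j R_j^T p_j
        for i in range(N):
            X[i:i + n] += P[i]
        S = P - np.array([X[i:i + n] for i in range(N)]) / (rho + n)  # local Laplacian
        U = U + S - A @ DL.T                  # dual update
    return X[:N], A
```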

Page 49

49

Two Comments About this Scheme


The Proposed Scheme can be used for Dictionary (𝐃L) Learning

We work with Slices and not Patches

A slice-based DL algorithm using standard patch-based tools, leading to a faster and simpler method compared to existing methods.

(Figure: patches extracted from natural images and their corresponding slices; observe how the slices are far simpler and contained within their corresponding patches. Learned dictionaries: [Wohlberg, 2016] vs. ours.)

Page 50

50

(Same slide content as above, adding a comparison to Heide et al. and Wohlberg.)

Page 51

51

Going Deeper: Convolutional Neural Networks Analyzed via Convolutional Sparse Coding

Joint work with Vardan Papyan and Michael Elad

Page 52

CSC and CNN

• There is an analogy between CSC and CNN:

Convolutional structure, data driven models, ReLU is a sparsifying operator, and more…

• We propose a principled way to analyze CNN

• But first, a short review of CNN…

52

CNN (Convolutional Neural Networks) ↔ SparseLand (Sparse Representation Theory)

The underlying idea: modeling data sources enables a theoretical analysis of algorithms' performance.

*Our analysis holds true for fully connected networks as well.

Page 53

CNN

53

(Figure: a two-layer CNN acting on Y \in \mathbb{R}^N; the first layer applies m_1 filters of size n_0 \times n_0 (weights W_1), the second applies m_2 filters of size n_1 \times n_1 \times m_1 (weights W_2).)

\text{ReLU}(z) = \max(0, z)

[LeCun, Bottou, Bengio and Haffner ‘98] [Krizhevsky, Sutskever & Hinton ‘12] [Simonyan & Zisserman ‘14] [He, Zhang, Ren & Sun ‘15]

Page 54

Mathematically...

54

f(Y, \{W_i\}, \{b_i\}) = \text{ReLU}\left(b_2 + W_2^T\, \text{ReLU}\left(b_1 + W_1^T Y\right)\right)

with Y \in \mathbb{R}^N, W_1^T \in \mathbb{R}^{Nm_1 \times N}, b_1 \in \mathbb{R}^{Nm_1}, W_2^T \in \mathbb{R}^{Nm_2 \times Nm_1}, b_2 \in \mathbb{R}^{Nm_2}, and output Z_2 \in \mathbb{R}^{Nm_2}.
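In code, the two-layer forward pass is just (a sketch in the slide's matrix convention):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward_pass(Y, W1, b1, W2, b2):
    """Two-layer forward pass: f(Y) = ReLU(b2 + W2^T ReLU(b1 + W1^T Y))."""
    Z1 = relu(b1 + W1.T @ Y)    # Z1 in R^{N m1}
    Z2 = relu(b2 + W2.T @ Z1)   # Z2 in R^{N m2}
    return Z2
```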

Page 55

Training Stage of CNN

55

• Consider the task of classification: given a set of signals \{Y_j\}_j and their corresponding labels \{h(Y_j)\}_j, the CNN learns an end-to-end mapping:

\min_{\{W_i\},\{b_i\},U} \sum_j \ell\left(h(Y_j), U, f(Y_j, \{W_i\}, \{b_i\})\right)

(true label; classifier U; output of the last layer)

Page 56

Back to CSC

56

X \in \mathbb{R}^N,\quad D_1 \in \mathbb{R}^{N \times Nm_1},\quad \Gamma_1 \in \mathbb{R}^{Nm_1},\quad D_2 \in \mathbb{R}^{Nm_1 \times Nm_2},\quad \Gamma_2 \in \mathbb{R}^{Nm_2}

Convolutional sparsity (CSC) assumes an inherent structure is present in natural signals. We propose to impose the same structure on the representations themselves: the Multi-Layer CSC (ML-CSC).

Page 57

Intuition: From Atoms to Molecules

57

X \in \mathbb{R}^N,\quad D_1 \in \mathbb{R}^{N \times Nm_1},\quad \Gamma_1 \in \mathbb{R}^{Nm_1},\quad D_2 \in \mathbb{R}^{Nm_1 \times Nm_2},\quad \Gamma_2 \in \mathbb{R}^{Nm_2}

• Columns in D_1 are convolutional atoms
• Columns in D_2 combine the atoms of D_1, creating more complex structures: each atom of the effective dictionary D_1 D_2 is a superposition of the atoms of D_1
• The size of the effective atoms increases throughout the layers (the receptive field)

Page 58

Intuition: From Atoms to Molecules

58

• One could chain the multiplication of all the dictionaries into one effective dictionary, D_{\text{eff}} = D_1 D_2 D_3 \cdots D_K, and then X = D_{\text{eff}}\Gamma_K as in SparseLand
• However, a key property in this model is the sparsity of each intermediate representation (the feature maps)
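A toy sketch (ours) of sampling from the ML-CSC model by propagating a sparse deepest code through the dictionaries:

```python
import numpy as np

def ml_csc_sample(dicts, sparsity, seed=0):
    """Draw a signal from the ML-CSC model: choose a sparse deepest code
    Gamma_K, then propagate Gamma_{i-1} = D_i Gamma_i down to X = D_1 Gamma_1.
    For the model to hold, every intermediate Gamma_i must itself be sparse."""
    rng = np.random.default_rng(seed)
    GK = np.zeros(dicts[-1].shape[1])
    idx = rng.choice(GK.size, size=sparsity, replace=False)
    GK[idx] = rng.standard_normal(sparsity)
    reps = [GK]
    for D in reversed(dicts):                 # Gamma_{i-1} = D_i Gamma_i
        reps.insert(0, D @ reps[0])
    return reps[0], reps[1:]                  # X, [Gamma_1, ..., Gamma_K]
```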

Page 59

A Small Taste: Model Training (MNIST)

59

MNIST dictionary (effective atoms shown at sizes D_1: 7×7, D_1D_2: 15×15, D_1D_2D_3: 28×28):
• D_1: 32 filters of size 7×7, with stride of 2 (dense)
• D_2: 128 filters of size 5×5×32, with stride of 1 (99.09% sparse)
• D_3: 1024 filters of size 7×7×128 (99.89% sparse)

Page 60

A Small Taste: Pursuit

60

(Figure: a test digit Y and its pursuit results across the layers Γ_0, Γ_1, Γ_2, Γ_3; the recovered representations are 99.51% sparse (5 nnz), 99.52% sparse (30 nnz), and 94.51% sparse (213 nnz).)

Page 61

A Small Taste: Pursuit

61

(Figure: another digit and its pursuit results; representations 99.41% sparse (6 nnz), 99.25% sparse (47 nnz), and 92.20% sparse (302 nnz).)

Page 62

A Small Taste: Model Training (CIFAR)

62

CIFAR dictionary (effective atoms shown at sizes D_1: 5×5×3, D_1D_2: 13×13, D_1D_2D_3: 32×32):
• D_1: 64 filters of size 5×5×3, with stride of 2 (dense)
• D_2: 256 filters of size 5×5×64, with stride of 2 (82.99% sparse)
• D_3: 1024 filters of size 5×5×256 (90.66% sparse)

Page 63

Deep Coding Problem (DCP)

63

Noiseless pursuit:

(DCP_\lambda):\ \text{Find } \{\Gamma_j\}_{j=1}^K \ \text{s.t.}\ X = D_1\Gamma_1,\ \|\Gamma_1\|_{0,\infty}^s \le \lambda_1;\ \ \Gamma_1 = D_2\Gamma_2,\ \|\Gamma_2\|_{0,\infty}^s \le \lambda_2;\ \ \ldots;\ \ \Gamma_{K-1} = D_K\Gamma_K,\ \|\Gamma_K\|_{0,\infty}^s \le \lambda_K

Noisy pursuit:

(DCP_\lambda^{\mathcal{E}}):\ \text{Find } \{\Gamma_j\}_{j=1}^K \ \text{s.t.}\ \|Y - D_1\Gamma_1\|_2 \le \mathcal{E}_0,\ \|\Gamma_1\|_{0,\infty}^s \le \lambda_1;\ \ \|\Gamma_1 - D_2\Gamma_2\|_2 \le \mathcal{E}_1,\ \|\Gamma_2\|_{0,\infty}^s \le \lambda_2;\ \ \ldots;\ \ \|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le \mathcal{E}_{K-1},\ \|\Gamma_K\|_{0,\infty}^s \le \lambda_K

Page 64

Deep Learning Problem (DLP)

64

Supervised dictionary learning (task-driven dictionary learning [Mairal, Bach, Sapiro & Ponce '12]):

\min_{\{D_i\}_{i=1}^K,\, U} \sum_j \ell\left(h(Y_j), U, DCP^\star(Y_j, \{D_i\})\right)

where DCP^\star is the deepest representation obtained from the DCP, U is the classifier, and h(Y_j) is the true label.

Unsupervised dictionary learning:

\text{Find } \{D_i\}_{i=1}^K \ \text{s.t. for } j = 1,\ldots,J:\ \|Y_j - D_1\Gamma_1^j\|_2 \le \mathcal{E}_0,\ \|\Gamma_1^j\|_{0,\infty}^s \le \lambda_1;\ \ \|\Gamma_1^j - D_2\Gamma_2^j\|_2 \le \mathcal{E}_1,\ \|\Gamma_2^j\|_{0,\infty}^s \le \lambda_2;\ \ \ldots;\ \ \|\Gamma_{K-1}^j - D_K\Gamma_K^j\|_2 \le \mathcal{E}_{K-1},\ \|\Gamma_K^j\|_{0,\infty}^s \le \lambda_K

Page 65

ML-CSC: The Simplest Pursuit

65

• The simplest pursuit algorithm (single-layer case) is the THR algorithm, which operates on a given input signal Y = D\Gamma + E (with \Gamma sparse) by:

\hat\Gamma = \mathcal{P}_\beta(D^T Y)

• Restricting the coefficients to be nonnegative does not restrict the expressiveness of the model: ReLU = soft nonnegative thresholding.
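The identity is a one-liner to verify (our sketch):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def soft_nonneg_thr(z, beta):
    """P_beta^+: zero every entry below beta, shrink the rest by beta."""
    return np.maximum(z - beta, 0.0)

# With the threshold folded into the bias, b = -beta, ReLU computes P_beta^+:
z, beta = np.random.randn(8), 0.3
assert np.allclose(relu(z + (-beta)), soft_nonneg_thr(z, beta))
```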

Page 66

Consider this for Solving the DCP

66

(DCP_\lambda^{\mathcal{E}}):\ \text{Find } \{\Gamma_j\}_{j=1}^K \ \text{s.t.}\ \|Y - D_1\Gamma_1\|_2 \le \mathcal{E}_0,\ \|\Gamma_1\|_{0,\infty}^s \le \lambda_1;\ \ldots;\ \|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le \mathcal{E}_{K-1},\ \|\Gamma_K\|_{0,\infty}^s \le \lambda_K

• Layered thresholding (LT): estimate \Gamma_1 via the THR algorithm, then estimate \Gamma_2 via the THR algorithm:

\hat\Gamma_2 = \mathcal{P}_{\beta_2}\left(D_2^T\, \mathcal{P}_{\beta_1}(D_1^T Y)\right)

• Forward pass of CNN:

f(Y) = \text{ReLU}\left(b_2 + W_2^T\, \text{ReLU}(b_1 + W_1^T Y)\right)

Language translation: ReLU ↔ soft nonnegative THR; bias b ↔ thresholds \beta; dictionary D ↔ weights W; forward pass f(\cdot) ↔ layered soft NN THR (DCP^\star).

Page 67

67

The layered (soft nonnegative) thresholding and the forward pass algorithm are the very same thing!
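In code, the layered soft nonnegative thresholding below is, line for line, the forward pass sketched earlier (our sketch):

```python
import numpy as np

def layered_soft_nonneg_thr(Y, dicts, betas):
    """Layered pursuit: Gamma_i = P_{beta_i}^+(D_i^T Gamma_{i-1}), Gamma_0 = Y.
    With W_i = D_i and b_i = -beta_i, this is literally the forward pass."""
    G = Y
    for D, beta in zip(dicts, betas):
        G = np.maximum(D.T @ G - beta, 0.0)
    return G
```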

Page 68

Consider this for Solving the DLP

68

• DLP (supervised), in SparseLand language (estimate via the layered THR algorithm; the thresholds for the DCP should also be learned):

\min_{\{D_i\}_{i=1}^K,\, U} \sum_j \ell\left(h(Y_j), U, DCP^\star(Y_j, \{D_i\})\right)

• CNN training, in CNN language:

\min_{\{W_i\},\{b_i\},U} \sum_j \ell\left(h(Y_j), U, f(Y_j, \{W_i\}, \{b_i\})\right)

(Forward pass f(\cdot) ↔ layered soft NN THR, DCP^\star.)

Page 69

69

The problem solved by the training stage of CNN and the DLP are equivalent as well, assuming that the DCP is approximated via the layered thresholding algorithm.

*Recall that for the ML-CSC there exists an unsupervised avenue for training the dictionaries that has no simple parallel in CNN.

Page 70

Theoretical Questions

70

(Diagram: a generative model M produces X with representations \{\Gamma_i\}_{i=1}^K satisfying X = D_1\Gamma_1, \Gamma_1 = D_2\Gamma_2, …, \Gamma_{K-1} = D_K\Gamma_K, each \Gamma_i being \ell_{0,\infty}-sparse; from the measurement Y we estimate the representations via the DCP_\lambda^{\mathcal{E}}, layered thresholding (the forward pass), or other algorithms.)

Page 71

Theoretical Path: Possible Questions

• Having established the importance of the ML-CSC model and its associated pursuit, the DCP problem, we now turn to its analysis

• The main questions we aim to address:

I. Uniqueness of the solution (set of representations) to the 𝐃𝐂𝐏λ ?

II. Stability of the solution to the 𝐃𝐂𝐏λℰ problem ?

III. Stability of the solution obtained via the hard and soft layered THR algorithms (forward pass) ?

IV. Limitations of this (very simple) algorithm and alternative pursuit?

V. Algorithms for training the dictionaries 𝐃i i=1K vs. CNN ?

VI. New insights on how to operate on signals via CNN ?

71

Page 72

Uniqueness of DCP_\lambda

72

(DCP_\lambda): find a set of representations satisfying X = D_1\Gamma_1, \|\Gamma_1\|_{0,\infty}^s \le \lambda_1; \ldots; \Gamma_{K-1} = D_K\Gamma_K, \|\Gamma_K\|_{0,\infty}^s \le \lambda_K. Is this set unique?

Mutual coherence: \mu(D) = \max_{i \ne j} |d_i^T d_j|

Theorem [Donoho & Elad '03]: If a set of solutions \{\Gamma_i\}_{i=1}^K is found for (DCP_\lambda) such that

\|\Gamma_i\|_{0,\infty}^s = \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right),

then these are necessarily the unique solution to this problem.

⇒ The feature maps CNN aims to recover are unique.

Page 73

Stability of DCP_\lambda^{\mathcal{E}}

73

• Suppose that we manage to solve the DCP_\lambda^{\mathcal{E}} and find a feasible set of representations \{\hat\Gamma_i\}_{i=1}^K satisfying all the conditions
• The question we pose: how close is \hat\Gamma_i to \Gamma_i? Is this set stable?

(DCP_\lambda^{\mathcal{E}}): find a set of representations satisfying \|Y - D_1\Gamma_1\|_2 \le \mathcal{E}_0, \|\Gamma_1\|_{0,\infty}^s \le \lambda_1; \ldots; \|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le \mathcal{E}_{K-1}, \|\Gamma_K\|_{0,\infty}^s \le \lambda_K

Page 74

Stability of DCP_\lambda^{\mathcal{E}}

74

Theorem: If the true representations \{\Gamma_i\}_{i=1}^K satisfy

\|\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right),

then the set of solutions \{\hat\Gamma_i\}_{i=1}^K obtained by solving this problem (somehow) with \mathcal{E}_0 = \|E\|_2^2 and \mathcal{E}_i = 0 for i \ge 1 must obey

\|\hat\Gamma_i - \Gamma_i\|_2^2 \le 4\|E\|_2^2 \prod_{j=1}^{i} \frac{1}{1 - (2\lambda_j - 1)\,\mu(D_j)}

Observe the annoying effect of error magnification as we dive into the model. ⇒ The problem CNN aims to solve is stable under certain conditions.

Page 75

Local Noise Assumption

• Our analysis relied on the local sparsity of the underlying solution 𝚪, which was enforced through the ℓ0,∞ norm

• In what follows, we present stability guarantees that will also depend on the local energy in the noise vector 𝐄

• This will be enforced via the ℓ2,∞ norm, defined as:

75

\|E\|_{2,\infty}^p = \max_i \|R_i E\|_2

Page 76

Stability of Layered-THR

76

Theorem: If

\|\Gamma_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)} \cdot \frac{|\Gamma_i^{\min}|}{|\Gamma_i^{\max}|}\right) - \frac{1}{\mu(D_i)} \cdot \frac{\varepsilon_L^{i-1}}{|\Gamma_i^{\max}|},

then the layered hard THR (with the proper thresholds) will find the correct supports and

\|\hat\Gamma_i^{LT} - \Gamma_i\|_{2,\infty}^p \le \varepsilon_L^i,

where we have defined \varepsilon_L^0 = \|E\|_{2,\infty}^p and

\varepsilon_L^i = \sqrt{\|\Gamma_i\|_{0,\infty}^p}\left(\varepsilon_L^{i-1} + \mu(D_i)\left(\|\Gamma_i\|_{0,\infty}^s - 1\right)|\Gamma_i^{\max}|\right).

(* A Least-Squares update of the non-zeros?)

⇒ The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded.

Page 77

Limitations of the Forward Pass

• The stability analysis reveals several inherent limitations of the forward pass (a.k.a. layered THR) algorithm:
  – Even in the noiseless case, the forward pass is incapable of recovering the perfect solution of the DCP problem
  – Its success depends on the ratio |\Gamma_i^{\min}| / |\Gamma_i^{\max}|; this is a direct consequence of relying on a simple thresholding operator
  – The distance between the true sparse vector and the estimated one increases exponentially as a function of the layer depth
• Next, we propose a new algorithm that attempts to solve some of these problems

77

Page 78

Special Case – Sparse Dictionaries

• Throughout the theoretical study we assumed that the representations in the different layers are 𝐋0,∞-sparse

• Do we know of a simple example of a set of dictionaries 𝐃i i=1K

and their corresponding signals 𝐗 that will obey this property?

• Assuming the dictionaries are sparse:

• In the context of CNN, the above happens if a sparsity promoting regularization, such as the 𝐋1, is employed on the filters

78

\|\Gamma_j\|_{0,\infty}^s \le \|\Gamma_K\|_{0,\infty}^s \prod_{i=j+1}^{K} \|D_i\|_0

where \|D_i\|_0 is the maximal number of non-zeros in an atom of D_i.

Page 79

Better Pursuit ?

• So far we proposed the Layered THR:

• The motivation is clear – getting close to what CNN use

• However, this is the simplest and weakest pursuit known in the field of sparsity – Can we offer something better?

79

\hat\Gamma_K = \mathcal{P}_{\beta_K}\left(D_K^T \cdots \mathcal{P}_{\beta_2}\left(D_2^T\, \mathcal{P}_{\beta_1}(D_1^T X)\right)\right)

(DCP_\lambda): find a set of representations satisfying X = D_1\Gamma_1, \|\Gamma_1\|_{0,\infty}^s \le \lambda_1; \ldots; \Gamma_{K-1} = D_K\Gamma_K, \|\Gamma_K\|_{0,\infty}^s \le \lambda_K

Page 80

Layered Basis Pursuit (Noiseless)

80

• We can propose a Layered Basis Pursuit algorithm (closely related to deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus '10]):

\Gamma_1^{LBP} = \arg\min_{\Gamma_1} \|\Gamma_1\|_1 \ \ \text{s.t.}\ \ X = D_1\Gamma_1

\Gamma_2^{LBP} = \arg\min_{\Gamma_2} \|\Gamma_2\|_1 \ \ \text{s.t.}\ \ \Gamma_1^{LBP} = D_2\Gamma_2

(DCP_\lambda): find a set of representations satisfying X = D_1\Gamma_1, \|\Gamma_1\|_{0,\infty}^s \le \lambda_1; \ldots; \Gamma_{K-1} = D_K\Gamma_K, \|\Gamma_K\|_{0,\infty}^s \le \lambda_K

Page 81

Guarantee for Success of Layered BP

81

Theorem: If a set of representations \{\Gamma_i\}_{i=1}^K of the Multi-Layered CSC model satisfies

\|\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right),

then the Layered BP is guaranteed to find them.

• As opposed to prior work on CNN, we can do far more than just propose an algorithm: we can analyze its conditions for success
• Consequences:
  – The Layered BP can retrieve the underlying representations in the noiseless case, a task in which the forward pass fails
  – The Layered BP's success does not depend on the ratio |\Gamma_i^{\min}| / |\Gamma_i^{\max}|

Page 82

Layered Basis Pursuit (Noisy)

82

Similarly to the noiseless case, the (DCP_\lambda^{\mathcal{E}}) can also be solved via the Layered Basis Pursuit algorithm, but in Lagrangian form:

\Gamma_1^{LBP} = \arg\min_{\Gamma_1} \frac{1}{2}\|Y - D_1\Gamma_1\|_2^2 + \xi_1\|\Gamma_1\|_1

\Gamma_2^{LBP} = \arg\min_{\Gamma_2} \frac{1}{2}\|\Gamma_1^{LBP} - D_2\Gamma_2\|_2^2 + \xi_2\|\Gamma_2\|_1

(DCP_\lambda^{\mathcal{E}}): find a set of representations satisfying \|Y - D_1\Gamma_1\|_2 \le \mathcal{E}_0, \|\Gamma_1\|_{0,\infty}^s \le \lambda_1; \ldots; \|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le \mathcal{E}_{K-1}, \|\Gamma_K\|_{0,\infty}^s \le \lambda_K

Page 83

Stability of Layered BP

83

Theorem: Assuming that

\|\Gamma_i\|_{0,\infty}^s \le \frac{1}{3}\left(1 + \frac{1}{\mu(D_i)}\right),

then for correctly chosen \{\lambda_i\}_{i=1}^K we are guaranteed that:

1. The support of \hat\Gamma_i^{LBP} is contained in that of \Gamma_i
2. The error is bounded: \|\hat\Gamma_i^{LBP} - \Gamma_i\|_{2,\infty}^p \le \varepsilon_L^i, where \varepsilon_L^i = 7.5^i \|E\|_{2,\infty}^p \prod_{j=1}^{i} \sqrt{\|\Gamma_j\|_{0,\infty}^p}
3. Every entry in \Gamma_i greater than \varepsilon_L^i / \sqrt{\|\Gamma_i\|_{0,\infty}^p} will be found

Page 84

Layered Iterative Thresholding

84

Layered BP: \Gamma_j^{LBP} = \arg\min_{\Gamma_j} \frac{1}{2}\|\hat\Gamma_{j-1}^{LBP} - D_j\Gamma_j\|_2^2 + \xi_j\|\Gamma_j\|_1

Layered Iterative Soft-Thresholding (for each layer j, iterating over t):

\Gamma_j^t = \mathcal{S}_{\xi_j/c_j}\left(\Gamma_j^{t-1} + \frac{1}{c_j} D_j^T\left(\hat\Gamma_{j-1} - D_j\Gamma_j^{t-1}\right)\right)

(* with c_j > 0.5\,\lambda_{\max}(D_j^T D_j))

This can be seen as a recurrent neural network [Gregor & LeCun '10]; note that our suggestion implies that groups of layers share the same dictionaries.
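A sketch (ours) of the layered ISTA loop, with the step-size rule from the slide:

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def layered_ista(Y, dicts, xis, iters=50):
    """Layered Iterative Soft-Thresholding: each layer solves its BP
    problem by ISTA, taking the previous layer's estimate as its signal.
    Step size 1/c_j with c_j > 0.5 * lambda_max(D_j^T D_j) per the slide;
    c_j = 1.01 * lambda_max is used here, which is safely large."""
    signal, estimates = Y, []
    for D, xi in zip(dicts, xis):
        c = 1.01 * np.linalg.norm(D, 2) ** 2  # spectral norm^2 = lambda_max(D^T D)
        G = np.zeros(D.shape[1])
        for _ in range(iters):
            G = soft(G + (D.T @ (signal - D @ G)) / c, xi / c)
        estimates.append(G)
        signal = G                            # feeds the next layer
    return estimates
```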

Page 85

This Talk

85

Independent patch-processing

Local Sparsity

We described the limitations of patch-based processing and ways to overcome some of these

Page 86

This Talk

86

Independent patch-processing

Local Sparsity

Convolutional Sparse Coding

We presented a theoretical study of the CSC and a practical algorithm that works locally

Page 87

This Talk

87

Independent patch-processing

Local Sparsity

Convolutional Sparse Coding

Convolutional Neural Networks

We mentioned several interesting connections between CSC and CNN and this led us to …

Page 88

This Talk

88

Multi-Layer Convolutional Sparse Coding

Independent patch-processing

Local Sparsity

Convolutional Sparse Coding

Convolutional Neural Networks

… propose and analyze a multi-layer extension of CSC, shown to be tightly connected to CNN

Page 89

This Talk

89

Extension of the classical sparse theory to a multi-layer setting

Multi-Layer Convolutional Sparse Coding

Independent patch-processing

Local Sparsity

Convolutional Sparse Coding

Convolutional Neural Networks

A novel interpretation and theoretical understanding of CNN

The ML-CSC was shown to enable a theoretical study of CNN, along with new insights

Page 90

This Talk

90

Extension of the classical sparse theory to a multi-layer setting

Multi-Layer Convolutional Sparse Coding

Independent patch-processing

Local Sparsity

Convolutional Sparse Coding

Convolutional Neural Networks

A novel interpretation and theoretical understanding of CNN

The underlying idea: modeling the data source in order to be able to theoretically analyze algorithms' performance.

Page 91

Questions?

91