Yaniv Romano, The Electrical Engineering Department, Technion – Israel Institute of Technology
Joint work with
Jeremias Sulam Vardan Papyan Prof. Michael Elad
1
The research leading to these results has received funding from the European Union's Seventh Framework Program (FP/2007-2013), ERC Grant Agreement ERC-SPARSE-320649
A Quest for a Universal Model for Signals:
From Sparsity to ConvNets
Local Sparsity
[Figure: Signal = (learned) Dictionary × sparse vector]

Talk roadmap, built up over the coming slides:
• Local Sparsity – independent patch-processing
• Convolutional Sparse Coding
• Multi-Layer Convolutional Sparse Coding – an extension of the classical sparse theory to a multi-layer setting
• Convolutional Neural Networks – the forward pass is a sparse-coding algorithm, serving our model
Our Story Begins with…
9
Image Denoising
Many (thousands of) image denoising algorithms can be cast as the minimization of an energy function of the form
10
Noisy Image 𝐘 = Original Image 𝐗 + White Gaussian Noise 𝐄
$$\min_{\mathbf{X}} \ \tfrac{1}{2}\|\mathbf{X} - \mathbf{Y}\|_2^2 + G(\mathbf{X})$$
(first term: relation to the measurements; second term: prior or regularization)
Leading Image Denoising Methods…
are built upon powerful patch-based local models. Popular local models: GMM, sparse representation, example-based, low-rank, Field-of-Experts, and neural networks.
11
Working locally allows us to learn the model,
e.g. dictionary learning
• It does come up in many applications
• Other inverse problems can be recast as iterative denoising [Zoran & Weiss ‘11] [Venkatakrishnan, Bouman & Wohlberg ‘13] [Brifman, Romano & Elad ’16]
• In a recent work we show that a denoiser 𝑓(𝐗) can form a regularizer: Regularization by Denoising (RED) [Romano, Elad & Milanfar ‘16]
• It is the simplest inverse problem, revealing the limitations of the model…
12
$$\min_{\mathbf{X}} \ \tfrac{1}{2}\|\mathbf{H}\mathbf{X} - \mathbf{Y}\|_2^2 + G(\mathbf{X})$$
(relation to measurements + prior)

RED chooses the regularizer explicitly:
$$\min_{\mathbf{X}} \ \tfrac{1}{2}\|\mathbf{H}\mathbf{X} - \mathbf{Y}\|_2^2 + \underbrace{\tfrac{\rho}{2}\,\mathbf{X}^{T}\left(\mathbf{X} - f(\mathbf{X})\right)}_{G(\mathbf{X})\ \text{regularizer}}$$
Under simple conditions on $f(\mathbf{X})$, $\ \nabla_{\mathbf{X}}\left[\tfrac{1}{2}\mathbf{X}^{T}\left(\mathbf{X} - f(\mathbf{X})\right)\right] = \mathbf{X} - f(\mathbf{X})$.
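A minimal sketch of using this property (illustrative only, not the authors' code): `denoise` stands for any black-box denoiser 𝑓, `H`/`Ht` are the degradation operator and its adjoint passed as callables, and `eta` is an assumed step size. Because the regularizer's gradient is simply X − f(X), each gradient step needs just one call to the denoiser.

```python
import numpy as np

def red_gradient_descent(y, H, Ht, denoise, rho=0.1, eta=0.5, n_iters=100):
    """Gradient descent on 0.5*||H X - Y||^2 + (rho/2) * X^T (X - f(X))."""
    x = Ht(y)  # simple initialization: back-projected measurements
    for _ in range(n_iters):
        data_grad = Ht(H(x) - y)              # gradient of the fidelity term
        prior_grad = rho * (x - denoise(x))   # gradient of the RED regularizer
        x = x - eta * (data_grad + prior_grad)
    return x
```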
Why is it so Popular?
• Assumes that every patch is a linear combination of a few columns, called atoms, taken from a matrix called a dictionary
• The operator 𝐑i extracts the i-th 𝑛-dimensional patch from 𝐗 ∈ ℝ𝑁
• Sparse coding:
[Figure: 𝐑i𝐗 = 𝛀𝛄i, where 𝛀 is an 𝑛 × 𝑚 dictionary with 𝑚 > 𝑛 and 𝛄i is a sparse vector]
The Sparse-Land Model
13
$$(\mathbf{P}_0):\quad \min_{\boldsymbol{\gamma}_i} \ \|\boldsymbol{\gamma}_i\|_0 \ \ \text{s.t.} \ \ \mathbf{R}_i\mathbf{X} = \mathbf{\Omega}\boldsymbol{\gamma}_i$$
Patch Denoising Given a noisy patch 𝐑i𝐘, solve
14
$$(\mathbf{P}_0^{\epsilon}):\quad \hat{\boldsymbol{\gamma}}_i = \arg\min_{\boldsymbol{\gamma}_i} \ \|\boldsymbol{\gamma}_i\|_0 \ \ \text{s.t.} \ \ \|\mathbf{R}_i\mathbf{Y} - \mathbf{\Omega}\boldsymbol{\gamma}_i\|_2 \leq \epsilon$$
Clean patch estimate: $\mathbf{\Omega}\hat{\boldsymbol{\gamma}}_i$. Both $(\mathbf{P}_0)$ and $(\mathbf{P}_0^{\epsilon})$ are hard to solve; in practice we use:
• Greedy methods such as Orthogonal Matching Pursuit (OMP) or Thresholding
• Convex relaxations such as Basis Pursuit (BP):
$$(\mathbf{P}_1^{\epsilon}):\quad \min_{\boldsymbol{\gamma}_i} \ \|\boldsymbol{\gamma}_i\|_1 + \xi\,\|\mathbf{R}_i\mathbf{Y} - \mathbf{\Omega}\boldsymbol{\gamma}_i\|_2^2$$
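A minimal OMP sketch for the patch-wise $(\mathbf{P}_0^{\epsilon})$ (illustrative, not the authors' implementation; it assumes the columns of `Omega` are ℓ2-normalized):

```python
import numpy as np

def omp(y, Omega, eps):
    """Greedy OMP: add the most correlated atom until the residual is below eps."""
    residual = y.copy()
    support = []
    gamma = np.zeros(Omega.shape[1])
    while np.linalg.norm(residual) > eps and len(support) < Omega.shape[0]:
        k = int(np.argmax(np.abs(Omega.T @ residual)))   # best-matching atom
        support.append(k)
        coef, *_ = np.linalg.lstsq(Omega[:, support], y, rcond=None)  # LS on support
        residual = y - Omega[:, support] @ coef
        gamma[:] = 0.0
        gamma[support] = coef
    return gamma  # the clean patch estimate is Omega @ gamma
```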
Recall K-SVD Denoising [Elad & Aharon, ‘06]
15
Noisy Image → denoise each patch using OMP (starting from an initial dictionary) → update the dictionary using K-SVD → iterate → Reconstructed Image
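A minimal sketch of the patch-averaging part of this loop (illustrative; the dictionary-update/K-SVD step is omitted, `omp` is the routine sketched above, and the patch size and names are assumptions):

```python
import numpy as np

def denoise_patch_averaging(Y, Omega, eps, patch=8):
    """Denoise every overlapping patch independently, then average the overlaps."""
    H, W = Y.shape
    out = np.zeros_like(Y)
    weight = np.zeros_like(Y)
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            p = Y[i:i+patch, j:j+patch].reshape(-1)        # extract patch R_i Y
            gamma = omp(p, Omega, eps)                      # sparse-code it
            clean = (Omega @ gamma).reshape(patch, patch)   # clean patch Omega*gamma
            out[i:i+patch, j:j+patch] += clean              # put it back...
            weight[i:i+patch, j:j+patch] += 1.0             # ...and average overlaps
    return out / np.maximum(weight, 1.0)
```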
• Despite its simplicity, this is a very well-performing algorithm
• We refer to this framework as “patch averaging”
• A modification of this method leads to state-of-the-art results [Mairal, Bach, Ponce, Sapiro & Zisserman ‘09]
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of…
16
What is ?
The Global-Local Gap:
Efficient independent local patch processing vs. the global need to model the entire image 𝐘
Whitening the residual image [Romano & Elad ‘13]: the residual between 𝐑i𝐘 and the estimate 𝛀𝛄i is orthogonal to the estimate (when using the OMP), but this orthogonality is lost due to patch-averaging.
SOS-Boosting [Romano & Elad ’15]: Strengthen – Operate – Subtract. Strengthen the previous result by adding it to the noisy image, Operate (denoise), and Subtract the previous result from the outcome.
Relation to Game Theory: encourage overlapping patches to reach a consensus.
Relation to Graph Theory: minimizing a Laplacian regularization functional:
$$\mathbf{X}^{k+1} = f\left(\mathbf{Y} + \rho\mathbf{X}^{k}\right) - \rho\mathbf{X}^{k}$$
$$\mathbf{X}^{*} = \arg\min_{\mathbf{X}} \ \|\mathbf{X} - f(\mathbf{Y})\|_2^2 + \rho\,\mathbf{X}^{T}\left(\mathbf{X} - f(\mathbf{X})\right)$$
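A minimal sketch of the SOS iteration above (illustrative; `denoise` stands for any black-box denoiser 𝑓 and `rho` is an assumed parameter):

```python
import numpy as np

def sos_boosting(Y, denoise, rho=1.0, n_iters=10):
    """SOS: X_{k+1} = f(Y + rho * X_k) - rho * X_k."""
    X = denoise(Y)                           # initial denoised estimate
    for _ in range(n_iters):
        X = denoise(Y + rho * X) - rho * X   # strengthen - operate - subtract
    return X
```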
Why boosting? Because it is guaranteed to improve any denoiser.
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]
• Con-Patch: a self-similarity feature
• RAISR (upscaling): edge features – direction, strength, consistency
Enforcing the local model on the final patches (EPLL)
• Sparse EPLL [Sulam & Elad ‘15] (comparison: Noisy vs. K-SVD vs. EPLL)
• Image Synthesis [Ren, Romano & Elad ‘17]
…
Missing a theoretical backbone!
Why can we use a local prior to solve a global problem?
Now, Our Story Takes a Surprising Turn
36
37
Convolutional Sparse Coding
[LeCun, Bottou, Bengio and Haffner ‘98] [Lewicki & Sejnowski ‘99] [Hashimoto & Kurata, ‘00] [Mørup, Schmidt & Hansen ’08]
[Zeiler, Krishnan, Taylor & Fergus ‘10] [Jarrett, Kavukcuoglu, Gregor, LeCun ‘11] [Heide, Heidrich & Wetzstein ‘15] [Gu, Zuo, Xie, Meng, Feng & Zhang ‘15] …
Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding
Vardan Papyan, Jeremias Sulam and Michael Elad
Convolutional Dictionary Learning via Local Processing Vardan Papyan, Yaniv Romano, Jeremias Sulam, and Michael Elad
Intuitively…
38
[Figure: 𝐗 is decomposed as a sum of shifted versions of the first filter plus shifted versions of the second filter]
Convolutional Sparse Coding (CSC)
39
[Figure: each filter is convolved with its sparse feature map and the results are summed to form the signal]
Locally: 𝐑i𝐗 = 𝛀𝛄i;  globally: 𝐗 = 𝐃𝚪
Convolutional Sparse Representation
40
An 𝑛-dimensional patch 𝐑i𝐗 is given by a stripe dictionary 𝛀 (of width (2𝑛 − 1)𝑚, built from shifted copies of the local dictionary 𝐃L) multiplying a stripe vector 𝛄i of length (2𝑛 − 1)𝑚. Adjacent representations 𝛄i and 𝛄i+1 (for 𝐑i𝐗 and 𝐑i+1𝐗) overlap, as they skip by 𝑚 items as we sweep through the patches of 𝐗.
CSC Relation to Our Story
• A clear global model: every patch has a sparse representation w.r.t. the same local dictionary 𝛀, just as we have assumed
• No notion of disagreement on the patch overlaps
• Related to the current common practice of patch averaging ($\mathbf{R}_i^{T}$ puts the patch $\mathbf{\Omega}\boldsymbol{\gamma}_i$ back in the $i$-th location of the global vector):
$$\mathbf{X} = \mathbf{D}\mathbf{\Gamma} = \frac{1}{n}\sum_i \mathbf{R}_i^{T}\mathbf{\Omega}\boldsymbol{\gamma}_i$$
• What about the Pursuit? “Patch averaging”: independent sparse coding for each patch
CSC: should seek all the representations together
• Is there a bridge between the two? We’ll come back to this later …
• What about the theory? Until recently little was known regarding the theoretical aspects of CSC
41
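A small sketch (illustrative; it assumes circular shifts, unit stride, and hypothetical names) of constructing the global convolutional dictionary 𝐃 from 𝑚 local filters, so that 𝐗 = 𝐃𝚪 is just a sum of shifted filters:

```python
import numpy as np

def global_conv_dictionary(filters, N):
    """Build D in R^{N x N*m}: every column is one filter shifted to one position
    (circular shifts for simplicity)."""
    n, m = filters.shape          # n: filter length, m: number of filters
    D = np.zeros((N, N * m))
    for i in range(N):            # shift index
        for j in range(m):        # filter index
            col = np.zeros(N)
            col[(np.arange(n) + i) % N] = filters[:, j]
            D[:, i * m + j] = col
    return D

# Example: a global sparse vector Gamma synthesizes the signal X = D @ Gamma
rng = np.random.default_rng(0)
filters = rng.standard_normal((7, 2))      # n = 7, m = 2 local filters
D = global_conv_dictionary(filters, N=64)
Gamma = np.zeros(64 * 2)
Gamma[[10, 75]] = 1.0                      # a few active (shift, filter) pairs
X = D @ Gamma
```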
Classical Sparse Theory (Noiseless)
42
$$(\mathbf{P}_0):\quad \min_{\mathbf{\Gamma}} \ \|\mathbf{\Gamma}\|_0 \ \ \text{s.t.} \ \ \mathbf{X} = \mathbf{D}\mathbf{\Gamma}$$

Definition (Mutual Coherence): $\mu(\mathbf{D}) = \max_{i \neq j} |\mathbf{d}_i^{T}\mathbf{d}_j|$  [Donoho & Elad ‘03]

Theorem: For a signal $\mathbf{X} = \mathbf{D}\mathbf{\Gamma}$, if $\|\mathbf{\Gamma}\|_0 < \tfrac{1}{2}\left(1 + \tfrac{1}{\mu(\mathbf{D})}\right)$, then this solution is necessarily the sparsest.  [Donoho & Elad ‘03]

Theorem: OMP and BP are guaranteed to recover the true sparse code assuming that $\|\mathbf{\Gamma}\|_0 < \tfrac{1}{2}\left(1 + \tfrac{1}{\mu(\mathbf{D})}\right)$.  [Tropp ‘04], [Donoho & Elad ‘03]
• Assuming that 𝑚 = 2 and 𝑛 = 64 we have that [Welch ’74] $\mu(\mathbf{D}) \geq 0.063$
• As a result, uniqueness and success of pursuits is guaranteed as long as
$$\|\mathbf{\Gamma}\|_0 < \tfrac{1}{2}\left(1 + \tfrac{1}{\mu(\mathbf{D})}\right) \leq \tfrac{1}{2}\left(1 + \tfrac{1}{0.063}\right) \approx 8$$
• Less than 8 non-zeros GLOBALLY are allowed!!! This is a very pessimistic result!
• Repeating the above for the noisy case leads to even worse performance predictions
• Bottom line: classic SparseLand theory cannot provide good explanations for the CSC model
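For concreteness, a tiny sketch (reusing the illustrative `global_conv_dictionary` example above) of evaluating μ(𝐃) and the resulting global sparsity bound:

```python
import numpy as np

def mutual_coherence(D):
    """mu(D) = max_{i != j} |d_i^T d_j| after l2-normalizing the columns."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return G.max()

mu = mutual_coherence(D)              # D from the earlier sketch
bound = 0.5 * (1.0 + 1.0 / mu)        # classical uniqueness/recovery bound
print(f"mu(D) = {mu:.3f} -> fewer than {bound:.1f} non-zeros allowed globally")
```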
CSC: The Need for a Theoretical Study
43
• If 𝚪 is locally sparse [Papyan, Sulam & Elad ‘16]:
  – The solution of $(\mathbf{P}_{0,\infty})$ is necessarily unique
  – The global OMP and BP are guaranteed to recover it
[Figure: the global vector 𝚪 divided into overlapping stripes 𝛄i, here with 𝑚 = 2]
44
This result poses a local constraint for a global guarantee, and as such, the guarantees scale linearly with the dimension of the signal.

Moving to Local Sparsity: Stripes
$$\|\mathbf{\Gamma}\|_{0,\infty}^{s} = \max_i \|\boldsymbol{\gamma}_i\|_0$$
$\|\mathbf{\Gamma}\|_{0,\infty}^{s}$ being low means all the stripes $\boldsymbol{\gamma}_i$ are sparse, i.e. every patch has a sparse representation over $\mathbf{\Omega}$.
$$(\mathbf{P}_{0,\infty}):\quad \min_{\mathbf{\Gamma}} \ \|\mathbf{\Gamma}\|_{0,\infty}^{s} \ \ \text{s.t.} \ \ \mathbf{X} = \mathbf{D}\mathbf{\Gamma}$$
• In practice, 𝐘 = 𝐃𝚪 + 𝐄, where 𝐄 is due to noise or model deviations
• If 𝚪 is locally sparse and the noise is bounded [Papyan, Sulam & Elad ‘16]
The solution of the 𝐏0,∞ϵ is stable
The solution obtained via global OMP/BP is stable
45
From Ideal to Noisy Signals
[Figure: 𝐘 = 𝐃𝚪 + 𝐄]
$$(\mathbf{P}_{0,\infty}^{\epsilon}):\quad \min_{\mathbf{\Gamma}} \ \|\mathbf{\Gamma}\|_{0,\infty}^{s} \ \ \text{s.t.} \ \ \|\mathbf{Y} - \mathbf{D}\mathbf{\Gamma}\|_2 \leq \epsilon$$
How close is the estimate $\hat{\mathbf{\Gamma}}$ to the true $\mathbf{\Gamma}$? The true and estimated representations are close.
Global Pursuit via Local Processing
46
$$(\mathbf{P}_1^{\epsilon}):\quad \hat{\mathbf{\Gamma}}_{BP} = \arg\min_{\mathbf{\Gamma}} \ \tfrac{1}{2}\|\mathbf{Y} - \mathbf{D}\mathbf{\Gamma}\|_2^2 + \xi\|\mathbf{\Gamma}\|_1$$

Decompose the global signal into slices ($\mathbf{D}_L$ is the $n \times m$ local dictionary):
$$\mathbf{X} = \mathbf{D}\mathbf{\Gamma} = \sum_i \mathbf{R}_i^{T}\mathbf{s}_i = \sum_i \mathbf{R}_i^{T}\mathbf{D}_L\boldsymbol{\alpha}_i$$
The $\mathbf{s}_i$ are slices, not patches.

$$\min_{\{\mathbf{s}_i,\boldsymbol{\alpha}_i\}} \ \tfrac{1}{2}\Big\|\mathbf{Y} - \sum_i \mathbf{R}_i^{T}\mathbf{s}_i\Big\|_2^2 + \xi\sum_i\|\boldsymbol{\alpha}_i\|_1 \ \ \text{s.t.} \ \ \mathbf{s}_i = \mathbf{D}_L\boldsymbol{\alpha}_i$$
47
Using variable splitting si = 𝐃Lαi
• These two problems are equivalent, and convex w.r.t. their variables
• The new formulation targets the local slices and their sparse representations
• It can be solved via ADMM, replacing the constraint with a penalty
Global Pursuit via Local Processing (2)
$$(\mathbf{P}_1^{\epsilon}):\quad \hat{\mathbf{\Gamma}}_{BP} = \arg\min_{\mathbf{\Gamma}} \ \tfrac{1}{2}\|\mathbf{Y} - \mathbf{D}\mathbf{\Gamma}\|_2^2 + \xi\|\mathbf{\Gamma}\|_1$$
Slice-Based Pursuit
48
Local Sparse Coding: $\boldsymbol{\alpha}_i = \arg\min_{\boldsymbol{\alpha}_i} \ \tfrac{\rho}{2}\|\mathbf{s}_i - \mathbf{D}_L\boldsymbol{\alpha}_i + \mathbf{u}_i\|_2^2 + \xi\|\boldsymbol{\alpha}_i\|_1$
Slice Reconstruction: $\mathbf{p}_i = \tfrac{1}{\rho}\mathbf{R}_i\mathbf{Y} + \mathbf{D}_L\boldsymbol{\alpha}_i - \mathbf{u}_i$
Slice Aggregation: $\hat{\mathbf{X}} = \sum_j \mathbf{R}_j^{T}\mathbf{p}_j$
Local Laplacian: $\mathbf{s}_i = \mathbf{p}_i - \tfrac{1}{\rho + n}\mathbf{R}_i\hat{\mathbf{X}}$
Dual Variable Update: $\mathbf{u}_i = \mathbf{u}_i + \mathbf{s}_i - \mathbf{D}_L\boldsymbol{\alpha}_i$
Comment: One iteration of this
procedure amounts to … the very same
patch-averaging algorithm we started with
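A rough 1-D sketch of one ADMM round of the slice-based pursuit above (illustrative only: it assumes circular slice extraction, approximates the slice-wise LASSO with a few ISTA steps, and uses hypothetical helper names):

```python
import numpy as np

def extract(Y, i, n):          # R_i Y with circular boundaries
    return Y[(np.arange(n) + i) % len(Y)]

def put_back(vec, i, n, N):    # R_i^T vec
    out = np.zeros(N)
    out[(np.arange(n) + i) % N] = vec
    return out

def slice_pursuit_step(Y, DL, S, A, U, rho=1.0, xi=0.1, ista_iters=20):
    """One ADMM round; S, A, U are length-N lists of slices, codes, and duals."""
    N, (n, m) = len(Y), DL.shape
    L = np.linalg.norm(DL, 2) ** 2                     # Lipschitz constant for ISTA
    for i in range(N):
        target = S[i] + U[i]                           # local sparse coding
        a = A[i]
        for _ in range(ista_iters):
            a = a - (1.0 / L) * DL.T @ (DL @ a - target)
            a = np.sign(a) * np.maximum(np.abs(a) - xi / (rho * L), 0.0)
        A[i] = a
    P = [extract(Y, i, n) / rho + DL @ A[i] - U[i] for i in range(N)]   # slice reconstruction
    Xhat = sum(put_back(P[i], i, n, N) for i in range(N))               # slice aggregation
    for i in range(N):
        S[i] = P[i] - extract(Xhat, i, n) / (rho + n)                   # local Laplacian
        U[i] = U[i] + S[i] - DL @ A[i]                                  # dual update
    return Xhat, S, A, U
```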
49
Two Comments About this Scheme
1. We work with Slices and not Patches: patches extracted from natural images and their corresponding slices show that the slices are far simpler, and contained in their corresponding patches.
2. The proposed scheme can be used for Dictionary (𝐃L) Learning: a slice-based DL algorithm using standard patch-based tools, leading to a faster and simpler method compared to existing ones (e.g. [Heide, Heidrich & Wetzstein ‘15], [Wohlberg ‘16]).
51
Going Deeper Convolutional Neural Networks
Analyzed via Convolutional Sparse Coding
Joint work with Vardan Papyan and Michael Elad
CSC and CNN
• There is an analogy between CSC and CNN:
Convolutional structure, data driven models, ReLU is a sparsifying operator, and more…
• We propose a principled way to analyze CNN
• But first, a short review of CNN…
52
CNN: Convolutional Neural Networks
SparseLand: Sparse Representation Theory
*Our analysis holds true for fully connected networks as well
The Underlying Idea
Modeling data sources enables a theoretical analysis of algorithms’ performance
*
CNN
53
[Figure: a two-layer CNN. The input 𝐘 of size 𝑁 is filtered by 𝑚1 kernels of size 𝑛0 (weights 𝐖1) followed by a ReLU, and the result is filtered by 𝑚2 kernels of spatial size 𝑛1 (weights 𝐖2) followed by a ReLU.]
ReLU(z) = max(0, z)
[LeCun, Bottou, Bengio and Haffner ‘98] [Krizhevsky, Sutskever & Hinton ‘12] [Simonyan & Zisserman ‘14] [He, Zhang, Ren & Sun ‘15]
$$f\left(\mathbf{Y}, \{\mathbf{W}_i\}, \{\mathbf{b}_i\}\right) = \mathrm{ReLU}\left(\mathbf{b}_2 + \mathbf{W}_2^{T}\,\mathrm{ReLU}\left(\mathbf{b}_1 + \mathbf{W}_1^{T}\mathbf{Y}\right)\right)$$
Mathematically...
54
Dimensions: $\mathbf{Y} \in \mathbb{R}^{N}$; $\mathbf{W}_1^{T} \in \mathbb{R}^{Nm_1 \times N}$ (filters of size $n_0$), $\mathbf{b}_1 \in \mathbb{R}^{Nm_1}$; $\mathbf{W}_2^{T} \in \mathbb{R}^{Nm_2 \times Nm_1}$ (filters of size $n_1 m_1$), $\mathbf{b}_2 \in \mathbb{R}^{Nm_2}$; output $\mathbf{Z}_2 \in \mathbb{R}^{Nm_2}$, with a ReLU applied after each layer.
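A minimal numpy sketch of the two-layer forward pass written exactly as above (illustrative; dense matrices stand in for the convolutional operators, and all names are assumptions):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward_pass(Y, W1, b1, W2, b2):
    """f(Y) = ReLU(b2 + W2^T ReLU(b1 + W1^T Y))."""
    Z1 = relu(b1 + W1.T @ Y)   # first feature map, in R^{N m1}
    Z2 = relu(b2 + W2.T @ Z1)  # second feature map, in R^{N m2}
    return Z2
```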
$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \ \sum_j \ell\Big(h(\mathbf{Y}_j),\, \mathbf{U},\, f\big(\mathbf{Y}_j, \{\mathbf{W}_i\}, \{\mathbf{b}_i\}\big)\Big)$$
• Consider the task of classification
• Given a set of signals $\{\mathbf{Y}_j\}_j$ and their corresponding labels $\{h(\mathbf{Y}_j)\}_j$, the CNN learns an end-to-end mapping
Training Stage of CNN
55
($f$: output of the last layer; $\mathbf{U}$: classifier; $h(\mathbf{Y}_j)$: true label)
Back to CSC
56
$\mathbf{X} \in \mathbb{R}^{N}$; $\mathbf{D}_1 \in \mathbb{R}^{N \times Nm_1}$ (atoms of size $n_0$), $\mathbf{\Gamma}_1 \in \mathbb{R}^{Nm_1}$; $\mathbf{D}_2 \in \mathbb{R}^{Nm_1 \times Nm_2}$ (atoms of size $n_1 m_1$), $\mathbf{\Gamma}_2 \in \mathbb{R}^{Nm_2}$
Convolutional sparsity (CSC) assumes an inherent structure is present in natural signals. We propose to impose the same structure on the representations themselves: Multi-Layer CSC (ML-CSC).
Intuition: From Atoms to Molecules
57
• Columns in 𝐃1 are convolutional atoms
• Columns in 𝐃2 combine the atoms in 𝐃𝟏, creating more complex structures
The atoms of the effective dictionary 𝐃1𝐃2 are superpositions of the atoms of 𝐃1
• The size of the effective atoms is increased throughout the layers (receptive field)
Intuition: From Atoms to Molecules
58
• One could chain the multiplication
of all the dictionaries into one effective dictionary 𝐃eff = 𝐃1𝐃2𝐃3 ∙∙∙ 𝐃K
and then 𝐗 = 𝐃eff 𝚪K as in SparseLand
• However, a key property in this model is the sparsity of each representation (feature-maps)
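A toy sketch of generating an ML-CSC signal, where every intermediate representation is itself sparse (illustrative only; random dense matrices stand in for the convolutional dictionaries):

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_vector(dim, nnz):
    """A random vector with exactly nnz non-zeros."""
    v = np.zeros(dim)
    v[rng.choice(dim, size=nnz, replace=False)] = rng.standard_normal(nnz)
    return v

N, m1, m2 = 64, 2, 4
D1 = rng.standard_normal((N, N * m1))                       # stands in for a convolutional D1
# D2 with sparse atoms, so that Gamma1 = D2 @ Gamma2 stays sparse
D2 = np.stack([sparse_vector(N * m1, nnz=3) for _ in range(N * m2)], axis=1)

Gamma2 = sparse_vector(N * m2, nnz=4)   # deepest, sparsest representation
Gamma1 = D2 @ Gamma2                    # intermediate representation (also sparse)
X = D1 @ Gamma1                         # the signal itself

print("nnz(Gamma2) =", np.count_nonzero(Gamma2),
      " nnz(Gamma1) =", np.count_nonzero(Gamma1))
```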
A Small Taste: Model Training (MNIST)
59
𝐃1𝐃2𝐃3 (28×28)
MNIST Dictionary:
• D1: 32 filters of size 7×7, with stride of 2 (dense)
• D2: 128 filters of size 5×5×32, with stride of 1 – 99.09% sparse
• D3: 1024 filters of size 7×7×128 – 99.89% sparse
𝐃1𝐃2 (15×15)
𝐃1 (7×7)
A Small Taste: Pursuit
60
[Figure: pursuit example on a digit – the input 𝐘, the signal Γ0, and the representations Γ1, Γ2, Γ3; sparsity levels: 99.51% sparse (5 nnz), 99.52% sparse (30 nnz), 94.51% sparse (213 nnz)]
A Small Taste: Pursuit
61
[Figure: a second pursuit example – the input 𝐘, the signal Γ0, and the representations Γ1, Γ2, Γ3; sparsity levels: 99.41% sparse (6 nnz), 99.25% sparse (47 nnz), 92.20% sparse (302 nnz)]
A Small Taste: Model Training (CIFAR)
62
𝐃1𝐃2𝐃3 (32×32)
CIFAR Dictionary:
• D1: 64 filters of size 5×5×3, stride of 2 (dense)
• D2: 256 filters of size 5×5×64, stride of 2 – 82.99% sparse
• D3: 1024 filters of size 5×5×256 – 90.66% sparse
𝐃1𝐃2 (13×13) 𝐃1 (5×5×3)
Deep Coding Problem (𝐃𝐂𝐏)
63
Noiseless Pursuit:
$$\text{Find } \{\mathbf{\Gamma}_j\}_{j=1}^{K}\ \text{s.t.}\quad \mathbf{X} = \mathbf{D}_1\mathbf{\Gamma}_1,\ \|\mathbf{\Gamma}_1\|_{0,\infty}^{s} \leq \lambda_1;\ \ \mathbf{\Gamma}_1 = \mathbf{D}_2\mathbf{\Gamma}_2,\ \|\mathbf{\Gamma}_2\|_{0,\infty}^{s} \leq \lambda_2;\ \ \cdots\ \ \mathbf{\Gamma}_{K-1} = \mathbf{D}_K\mathbf{\Gamma}_K,\ \|\mathbf{\Gamma}_K\|_{0,\infty}^{s} \leq \lambda_K$$

Noisy Pursuit:
$$\text{Find } \{\mathbf{\Gamma}_j\}_{j=1}^{K}\ \text{s.t.}\quad \|\mathbf{Y} - \mathbf{D}_1\mathbf{\Gamma}_1\|_2 \leq \mathcal{E}_0,\ \|\mathbf{\Gamma}_1\|_{0,\infty}^{s} \leq \lambda_1;\ \ \|\mathbf{\Gamma}_1 - \mathbf{D}_2\mathbf{\Gamma}_2\|_2 \leq \mathcal{E}_1,\ \|\mathbf{\Gamma}_2\|_{0,\infty}^{s} \leq \lambda_2;\ \ \cdots\ \ \|\mathbf{\Gamma}_{K-1} - \mathbf{D}_K\mathbf{\Gamma}_K\|_2 \leq \mathcal{E}_{K-1},\ \|\mathbf{\Gamma}_K\|_{0,\infty}^{s} \leq \lambda_K$$
Deep Learning Problem 𝐃𝐋𝐏
64
Supervised Dictionary Learning (task-driven dictionary learning [Mairal, Bach, Sapiro & Ponce ’12]):
$$\min_{\{\mathbf{D}_i\}_{i=1}^{K},\,\mathbf{U}} \ \sum_j \ell\Big(h(\mathbf{Y}_j),\, \mathbf{U},\, \mathbf{DCP}^{\star}\big(\mathbf{Y}_j, \{\mathbf{D}_i\}\big)\Big)$$
($\mathbf{DCP}^{\star}$: the deepest representation obtained from the DCP; $\mathbf{U}$: classifier; $h(\mathbf{Y}_j)$: true label)

Unsupervised Dictionary Learning:
$$\text{Find } \{\mathbf{D}_i\}_{i=1}^{K}\ \text{s.t., for every } j = 1,\dots,J:\quad \|\mathbf{Y}_j - \mathbf{D}_1\mathbf{\Gamma}_1^{j}\|_2 \leq \mathcal{E}_0,\ \|\mathbf{\Gamma}_1^{j}\|_{0,\infty}^{s} \leq \lambda_1;\ \ \|\mathbf{\Gamma}_1^{j} - \mathbf{D}_2\mathbf{\Gamma}_2^{j}\|_2 \leq \mathcal{E}_1,\ \|\mathbf{\Gamma}_2^{j}\|_{0,\infty}^{s} \leq \lambda_2;\ \ \cdots\ \ \|\mathbf{\Gamma}_{K-1}^{j} - \mathbf{D}_K\mathbf{\Gamma}_K^{j}\|_2 \leq \mathcal{E}_{K-1},\ \|\mathbf{\Gamma}_K^{j}\|_{0,\infty}^{s} \leq \lambda_K$$
• The simplest pursuit algorithm (single-layer case) is the THR algorithm, which operates on a given input signal 𝐘 by:
• Restricting the coefficients to be nonnegative does not restrict the expressiveness of the model
65
ReLU = Soft Nonnegative Thresholding
ML-CSC: The Simplest Pursuit
$\hat{\mathbf{\Gamma}} = \mathcal{P}_{\beta}\left(\mathbf{D}^{T}\mathbf{Y}\right)$, where $\mathbf{Y} = \mathbf{D}\mathbf{\Gamma} + \mathbf{E}$ and $\mathbf{\Gamma}$ is sparse
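A tiny sketch of the claim that soft nonnegative thresholding is just a shifted ReLU (illustrative names):

```python
import numpy as np

def soft_nonneg_threshold(z, beta):
    """P_beta(z): soft thresholding restricted to nonnegative outputs."""
    return np.maximum(z - beta, 0.0)

def relu(z):
    return np.maximum(z, 0.0)

z = np.linspace(-2, 2, 9)
beta = 0.5
# Soft nonnegative thresholding with threshold beta == ReLU with bias -beta
assert np.allclose(soft_nonneg_threshold(z, beta), relu(z - beta))
```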
Consider this for Solving the DCP
66
$$\mathbf{DCP}_{\lambda}^{\mathcal{E}}:\ \text{Find } \{\mathbf{\Gamma}_j\}_{j=1}^{K}\ \text{s.t.}\ \ \|\mathbf{Y} - \mathbf{D}_1\mathbf{\Gamma}_1\|_2 \leq \mathcal{E}_0,\ \|\mathbf{\Gamma}_1\|_{0,\infty}^{s} \leq \lambda_1;\ \ \|\mathbf{\Gamma}_1 - \mathbf{D}_2\mathbf{\Gamma}_2\|_2 \leq \mathcal{E}_1,\ \|\mathbf{\Gamma}_2\|_{0,\infty}^{s} \leq \lambda_2;\ \ \cdots\ \ \|\mathbf{\Gamma}_{K-1} - \mathbf{D}_K\mathbf{\Gamma}_K\|_2 \leq \mathcal{E}_{K-1},\ \|\mathbf{\Gamma}_K\|_{0,\infty}^{s} \leq \lambda_K$$
• Layered thresholding (LT): estimate 𝚪1 via the THR algorithm, then estimate 𝚪2 via the THR algorithm:
$$\hat{\mathbf{\Gamma}}_2 = \mathcal{P}_{\beta_2}\left(\mathbf{D}_2^{T}\,\mathcal{P}_{\beta_1}\left(\mathbf{D}_1^{T}\mathbf{Y}\right)\right)$$
• Forward pass of CNN:
$$f(\mathbf{Y}) = \mathrm{ReLU}\left(\mathbf{b}_2 + \mathbf{W}_2^{T}\,\mathrm{ReLU}\left(\mathbf{b}_1 + \mathbf{W}_1^{T}\mathbf{Y}\right)\right)$$
Translation: ReLU ↔ Soft Nonnegative THR; Bias 𝐛 ↔ Thresholds 𝛽; Weights 𝐖 ↔ Dictionary 𝐃; Forward pass 𝑓(⋅) ↔ Layered Soft NN THR ($\mathbf{DCP}^{\star}$)
67
The layered (soft nonnegative) thresholding and the forward pass algorithm are the very same things!
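A minimal sketch making the equivalence concrete (illustrative; dense matrices stand in for convolutional dictionaries): running layered soft nonnegative thresholding with 𝐃i = 𝐖i and 𝛽i = −𝐛i reproduces the forward pass written above.

```python
import numpy as np

def soft_nonneg_threshold(z, beta):
    return np.maximum(z - beta, 0.0)

def layered_thresholding(Y, dictionaries, thresholds):
    """Layered soft nonnegative THR: Gamma_i = P_beta_i(D_i^T Gamma_{i-1})."""
    gamma = Y
    for D, beta in zip(dictionaries, thresholds):
        gamma = soft_nonneg_threshold(D.T @ gamma, beta)
    return gamma

# With W_i = D_i and b_i = -beta_i this computes
# ReLU(b2 + W2^T ReLU(b1 + W1^T Y)), i.e. the CNN forward pass.
```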
Consider this for Solving the DLP
68
• DLP (supervised):
$$\min_{\{\mathbf{D}_i\}_{i=1}^{K},\,\mathbf{U}} \ \sum_j \ell\Big(h(\mathbf{Y}_j),\, \mathbf{U},\, \mathbf{DCP}^{\star}\big(\mathbf{Y}_j, \{\mathbf{D}_i\}\big)\Big)$$
where $\mathbf{DCP}^{\star}$ is estimated via the layered THR algorithm; the thresholds for the DCP should also be learned.
• CNN training:
$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \ \sum_j \ell\Big(h(\mathbf{Y}_j),\, \mathbf{U},\, f\big(\mathbf{Y}_j, \{\mathbf{W}_i\}, \{\mathbf{b}_i\}\big)\Big)$$
(SparseLand language: layered THR pursuit, $\mathbf{DCP}^{\star}$ ↔ CNN language: forward pass $f(\cdot)$)
69
The problem solved by the training stage of CNN and the DLP are equivalent as well, assuming that the DCP is approximated via the layered thresholding algorithm.
* Recall that for the ML-CSC, there exists an unsupervised avenue for training the dictionaries that has no simple parallel in CNN
Theoretical Questions
70
[Diagram: a signal 𝐗 generated by the ML-CSC model ($\mathbf{X} = \mathbf{D}_1\mathbf{\Gamma}_1$, $\mathbf{\Gamma}_1 = \mathbf{D}_2\mathbf{\Gamma}_2$, …, $\mathbf{\Gamma}_{K-1} = \mathbf{D}_K\mathbf{\Gamma}_K$, with every $\mathbf{\Gamma}_i$ being $\ell_{0,\infty}$-sparse) enters a pursuit $\mathbf{DCP}_{\lambda}^{\mathcal{E}}$ that estimates $\{\hat{\mathbf{\Gamma}}_i\}_{i=1}^{K}$: layered thresholding (the forward pass), or other?]
Theoretical Path: Possible Questions
• Having established the importance of the ML-CSC model and its associated pursuit, the DCP problem, we now turn to its analysis
• The main questions we aim to address:
I. Uniqueness of the solution (set of representations) to the 𝐃𝐂𝐏λ ?
II. Stability of the solution to the 𝐃𝐂𝐏λℰ problem ?
III. Stability of the solution obtained via the hard and soft layered THR algorithms (forward pass) ?
IV. Limitations of this (very simple) algorithm and alternative pursuit?
V. Algorithms for training the dictionaries 𝐃i i=1K vs. CNN ?
VI. New insights on how to operate on signals via CNN ?
71
Uniqueness of 𝐃𝐂𝐏λ
72
$$\mathbf{DCP}_{\lambda}:\ \text{Find a set of representations satisfying}\ \ \mathbf{X} = \mathbf{D}_1\mathbf{\Gamma}_1,\ \|\mathbf{\Gamma}_1\|_{0,\infty}^{s} \leq \lambda_1;\ \ \mathbf{\Gamma}_1 = \mathbf{D}_2\mathbf{\Gamma}_2,\ \|\mathbf{\Gamma}_2\|_{0,\infty}^{s} \leq \lambda_2;\ \ \cdots\ \ \mathbf{\Gamma}_{K-1} = \mathbf{D}_K\mathbf{\Gamma}_K,\ \|\mathbf{\Gamma}_K\|_{0,\infty}^{s} \leq \lambda_K$$
Is this set unique? (Mutual coherence: $\mu(\mathbf{D}) = \max_{i \neq j}|\mathbf{d}_i^{T}\mathbf{d}_j|$ [Donoho & Elad ‘03])

Theorem: If a set of solutions $\{\mathbf{\Gamma}_i\}_{i=1}^{K}$ is found for $(\mathbf{DCP}_{\lambda})$ such that
$$\|\mathbf{\Gamma}_i\|_{0,\infty}^{s} = \lambda_i < \tfrac{1}{2}\left(1 + \tfrac{1}{\mu(\mathbf{D}_i)}\right)$$
then these are necessarily the unique solution to this problem.

⇒ The feature maps CNN aims to recover are unique.
• Suppose that we manage to solve the $\mathbf{DCP}_{\lambda}^{\mathcal{E}}$ and find a feasible set of representations satisfying all the conditions
• The question we pose is: how close is $\hat{\mathbf{\Gamma}}_i$ to $\mathbf{\Gamma}_i$?
Stability of 𝐃𝐂𝐏λℰ
73
$$\mathbf{DCP}_{\lambda}^{\mathcal{E}}:\ \text{Find a set of representations satisfying}\ \ \|\mathbf{Y} - \mathbf{D}_1\mathbf{\Gamma}_1\|_2 \leq \mathcal{E}_0,\ \|\mathbf{\Gamma}_1\|_{0,\infty}^{s} \leq \lambda_1;\ \ \|\mathbf{\Gamma}_1 - \mathbf{D}_2\mathbf{\Gamma}_2\|_2 \leq \mathcal{E}_1,\ \|\mathbf{\Gamma}_2\|_{0,\infty}^{s} \leq \lambda_2;\ \ \cdots\ \ \|\mathbf{\Gamma}_{K-1} - \mathbf{D}_K\mathbf{\Gamma}_K\|_2 \leq \mathcal{E}_{K-1},\ \|\mathbf{\Gamma}_K\|_{0,\infty}^{s} \leq \lambda_K$$
Is the set $\{\hat{\mathbf{\Gamma}}_i\}_{i=1}^{K}$ stable?
Stability of 𝐃𝐂𝐏λℰ
74
Theorem: If the true representations $\{\mathbf{\Gamma}_i\}_{i=1}^{K}$ satisfy
$$\|\mathbf{\Gamma}_i\|_{0,\infty}^{s} \leq \lambda_i < \tfrac{1}{2}\left(1 + \tfrac{1}{\mu(\mathbf{D}_i)}\right)$$
then the set of solutions $\{\hat{\mathbf{\Gamma}}_i\}_{i=1}^{K}$ obtained by solving this problem (somehow) with $\mathcal{E}_0^2 = \|\mathbf{E}\|_2^2$ and $\mathcal{E}_i = 0$ ($i \geq 1$) must obey
$$\|\hat{\mathbf{\Gamma}}_i - \mathbf{\Gamma}_i\|_2^2 \ \leq\ 4\,\|\mathbf{E}\|_2^2 \prod_{j=1}^{i} \frac{1}{1 - \left(2\lambda_j - 1\right)\mu(\mathbf{D}_j)}$$
Observe this annoying effect of error magnification as we dive into the model.
The problem CNN aims to solve is stable under certain conditions
Local Noise Assumption
• Our analysis relied on the local sparsity of the underlying solution 𝚪, which was enforced through the ℓ0,∞ norm
• In what follows, we present stability guarantees that will also depend on the local energy in the noise vector 𝐄
• This will be enforced via the ℓ2,∞ norm, defined as:
75
$$\|\mathbf{E}\|_{2,\infty}^{p} = \max_i \|\mathbf{R}_i\mathbf{E}\|_2$$
Stability of Layered-THR
76
Theorem: If
$$\|\mathbf{\Gamma}_i\|_{0,\infty}^{s} < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{|\mathbf{\Gamma}_i^{\min}|}{|\mathbf{\Gamma}_i^{\max}|}\right) - \frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{\varepsilon_L^{i-1}}{|\mathbf{\Gamma}_i^{\max}|}$$
then the layered hard THR (with the proper thresholds) will find the correct supports and
$$\|\hat{\mathbf{\Gamma}}_i^{LT} - \mathbf{\Gamma}_i\|_{2,\infty}^{p} \leq \varepsilon_L^{i}$$
where we have defined $\varepsilon_L^{0} = \|\mathbf{E}\|_{2,\infty}^{p}$ and
$$\varepsilon_L^{i} = \|\mathbf{\Gamma}_i\|_{0,\infty}^{p}\left(\varepsilon_L^{i-1} + \mu(\mathbf{D}_i)\left(\|\mathbf{\Gamma}_i\|_{0,\infty}^{s} - 1\right)|\mathbf{\Gamma}_i^{\max}|\right)$$
* Least-Squares update of the non-zeros?
The stability of the forward pass is guaranteed if the underlying representations are locally sparse and
the noise is locally bounded
Limitations of the Forward Pass
• The stability analysis reveals several inherent limitations of the forward pass (a.k.a. Layered THR) algorithm:
• Even in the noiseless case, the forward pass is incapable of recovering the perfect solution of the DCP problem
• Its success depends on the ratio $|\mathbf{\Gamma}_i^{\min}|/|\mathbf{\Gamma}_i^{\max}|$; this is a direct consequence of relying on a simple thresholding operator
• The distance between the true sparse vector and the estimated one increases exponentially as a function of the layer depth
• Next, we propose a new algorithm that attempts to solve some of these problems
77
Special Case – Sparse Dictionaries
• Throughout the theoretical study we assumed that the representations in the different layers are 𝐋0,∞-sparse
• Do we know of a simple example of a set of dictionaries 𝐃i i=1K
and their corresponding signals 𝐗 that will obey this property?
• Assuming the dictionaries are sparse:
• In the context of CNN, the above happens if a sparsity promoting regularization, such as the 𝐋1, is employed on the filters
78
$$\|\mathbf{\Gamma}_j\|_{0,\infty}^{s} \ \leq\ \|\mathbf{\Gamma}_K\|_{0,\infty}^{s}\prod_{i=j+1}^{K}\|\mathbf{D}_i\|_0$$
where $\|\mathbf{D}_i\|_0$ denotes the maximal number of non-zeros in an atom of $\mathbf{D}_i$.
Better Pursuit ?
• So far we proposed the Layered THR:
• The motivation is clear – getting close to what CNN use
• However, this is the simplest and weakest pursuit known in the field of sparsity – Can we offer something better?
79
$$\hat{\mathbf{\Gamma}}_K = \mathcal{P}_{\beta_K}\left(\mathbf{D}_K^{T}\cdots\mathcal{P}_{\beta_2}\left(\mathbf{D}_2^{T}\,\mathcal{P}_{\beta_1}\left(\mathbf{D}_1^{T}\mathbf{X}\right)\right)\right)$$
$$\mathbf{DCP}_{\lambda}:\ \text{Find } \{\mathbf{\Gamma}_j\}\ \text{s.t.}\ \ \mathbf{X} = \mathbf{D}_1\mathbf{\Gamma}_1,\ \|\mathbf{\Gamma}_1\|_{0,\infty}^{s} \leq \lambda_1;\ \ \mathbf{\Gamma}_1 = \mathbf{D}_2\mathbf{\Gamma}_2,\ \|\mathbf{\Gamma}_2\|_{0,\infty}^{s} \leq \lambda_2;\ \ \cdots\ \ \mathbf{\Gamma}_{K-1} = \mathbf{D}_K\mathbf{\Gamma}_K,\ \|\mathbf{\Gamma}_K\|_{0,\infty}^{s} \leq \lambda_K$$
Layered Basis Pursuit (Noiseless)
80
• We can propose a Layered Basis Pursuit Algorithm:
$$\hat{\mathbf{\Gamma}}_1^{LBP} = \arg\min_{\mathbf{\Gamma}_1} \ \|\mathbf{\Gamma}_1\|_1 \ \ \text{s.t.}\ \ \mathbf{X} = \mathbf{D}_1\mathbf{\Gamma}_1$$
$$\hat{\mathbf{\Gamma}}_2^{LBP} = \arg\min_{\mathbf{\Gamma}_2} \ \|\mathbf{\Gamma}_2\|_1 \ \ \text{s.t.}\ \ \hat{\mathbf{\Gamma}}_1^{LBP} = \mathbf{D}_2\mathbf{\Gamma}_2$$
(Related to deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus ‘10])
$$\mathbf{DCP}_{\lambda}:\ \text{Find } \{\mathbf{\Gamma}_j\}\ \text{s.t.}\ \ \mathbf{X} = \mathbf{D}_1\mathbf{\Gamma}_1,\ \|\mathbf{\Gamma}_1\|_{0,\infty}^{s} \leq \lambda_1;\ \ \mathbf{\Gamma}_1 = \mathbf{D}_2\mathbf{\Gamma}_2,\ \|\mathbf{\Gamma}_2\|_{0,\infty}^{s} \leq \lambda_2;\ \ \cdots\ \ \mathbf{\Gamma}_{K-1} = \mathbf{D}_K\mathbf{\Gamma}_K,\ \|\mathbf{\Gamma}_K\|_{0,\infty}^{s} \leq \lambda_K$$
• As opposed to prior work in CNN, we can do far more than just proposing an algorithm – we can analyze its conditions for success:
• Consequences:
  – The layered BP can retrieve the underlying representations in the noiseless case, a task in which the forward pass fails
  – The Layered-BP’s success does not depend on the ratio $|\mathbf{\Gamma}_i^{\min}|/|\mathbf{\Gamma}_i^{\max}|$
Guarantee for Success of Layered BP
81
Theorem: If a set of representations $\{\mathbf{\Gamma}_i\}_{i=1}^{K}$ of the Multi-Layer CSC model satisfies
$$\|\mathbf{\Gamma}_i\|_{0,\infty}^{s} \leq \lambda_i < \tfrac{1}{2}\left(1 + \tfrac{1}{\mu(\mathbf{D}_i)}\right)$$
then the Layered BP is guaranteed to find them.
Layered Basis Pursuit (Noisy)
82
$$\hat{\mathbf{\Gamma}}_1^{LBP} = \arg\min_{\mathbf{\Gamma}_1} \ \tfrac{1}{2}\|\mathbf{Y} - \mathbf{D}_1\mathbf{\Gamma}_1\|_2^2 + \xi_1\|\mathbf{\Gamma}_1\|_1$$
$$\hat{\mathbf{\Gamma}}_2^{LBP} = \arg\min_{\mathbf{\Gamma}_2} \ \tfrac{1}{2}\|\hat{\mathbf{\Gamma}}_1^{LBP} - \mathbf{D}_2\mathbf{\Gamma}_2\|_2^2 + \xi_2\|\mathbf{\Gamma}_2\|_1$$
$$\mathbf{DCP}_{\lambda}^{\mathcal{E}}:\ \text{Find } \{\mathbf{\Gamma}_j\}\ \text{s.t.}\ \ \|\mathbf{Y} - \mathbf{D}_1\mathbf{\Gamma}_1\|_2 \leq \mathcal{E}_0,\ \|\mathbf{\Gamma}_1\|_{0,\infty}^{s} \leq \lambda_1;\ \ \|\mathbf{\Gamma}_1 - \mathbf{D}_2\mathbf{\Gamma}_2\|_2 \leq \mathcal{E}_1,\ \|\mathbf{\Gamma}_2\|_{0,\infty}^{s} \leq \lambda_2;\ \ \cdots\ \ \|\mathbf{\Gamma}_{K-1} - \mathbf{D}_K\mathbf{\Gamma}_K\|_2 \leq \mathcal{E}_{K-1},\ \|\mathbf{\Gamma}_K\|_{0,\infty}^{s} \leq \lambda_K$$
Similarly to the noiseless case, this can be also solved via the Layered Basis Pursuit Algorithm… But in a
Lagrangian form
Stability of Layered BP
83
Theorem: Assuming that
$$\|\mathbf{\Gamma}_i\|_{0,\infty}^{s} \leq \tfrac{1}{3}\left(1 + \tfrac{1}{\mu(\mathbf{D}_i)}\right)$$
then for correctly chosen $\{\lambda_i\}_{i=1}^{K}$ we are guaranteed that:
1. The support of $\hat{\mathbf{\Gamma}}_i^{LBP}$ is contained in that of $\mathbf{\Gamma}_i$
2. The error is bounded: $\|\hat{\mathbf{\Gamma}}_i^{LBP} - \mathbf{\Gamma}_i\|_{2,\infty}^{p} \leq \varepsilon_L^{i}$, where $\varepsilon_L^{i} = 7.5^{i}\,\|\mathbf{E}\|_{2,\infty}^{p}\prod_{j=1}^{i}\|\mathbf{\Gamma}_j\|_{0,\infty}^{p}$
3. Every entry in $\mathbf{\Gamma}_i$ greater than $\varepsilon_L^{i}/\|\mathbf{\Gamma}_i\|_{0,\infty}^{p}$ will be found
Layered Iterative Thresholding
84
Layered BP: $\hat{\mathbf{\Gamma}}_j^{LBP} = \arg\min_{\mathbf{\Gamma}_j} \ \tfrac{1}{2}\|\hat{\mathbf{\Gamma}}_{j-1}^{LBP} - \mathbf{D}_j\mathbf{\Gamma}_j\|_2^2 + \xi_j\|\mathbf{\Gamma}_j\|_1$
Layered Iterative Soft-Thresholding (for each layer $j$, iterate over $t$):
$$\mathbf{\Gamma}_j^{t} = \mathcal{S}_{\xi_j/c_j}\left(\mathbf{\Gamma}_j^{t-1} + \tfrac{1}{c_j}\mathbf{D}_j^{T}\left(\hat{\mathbf{\Gamma}}_{j-1} - \mathbf{D}_j\mathbf{\Gamma}_j^{t-1}\right)\right)$$
* with $c_i > 0.5\,\lambda_{\max}(\mathbf{D}_i^{T}\mathbf{D}_i)$
Note that our suggestion implies that groups of layers share the same dictionaries.
Can be seen as a recurrent neural network [Gregor & LeCun ‘10]
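A minimal sketch of the layered ISTA above (illustrative; `soft_threshold` is the standard soft-thresholding operator, and dense matrices stand in for the convolutional dictionaries):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def layered_ista(Y, dictionaries, xis, n_iters=100):
    """Layered BP via ISTA: solve a lasso per layer, feeding each estimate
    to the next layer (Gamma_0 = Y)."""
    gammas, prev = [], Y
    for D, xi in zip(dictionaries, xis):
        c = 1.01 * np.linalg.norm(D, 2) ** 2        # c_j > lambda_max(D^T D)
        g = np.zeros(D.shape[1])
        for _ in range(n_iters):
            g = soft_threshold(g + D.T @ (prev - D @ g) / c, xi / c)
        gammas.append(g)
        prev = g
    return gammas
```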
This Talk
85
• Local Sparsity and independent patch-processing: we described the limitations of patch-based processing and ways to overcome some of them
• Convolutional Sparse Coding: we presented a theoretical study of the CSC and a practical algorithm that works locally
• Convolutional Neural Networks: we mentioned several interesting connections between CSC and CNN, and this led us to…
• Multi-Layer Convolutional Sparse Coding: …propose and analyze a multi-layer extension of CSC, shown to be tightly connected to CNN
• Extension of the classical sparse theory to a multi-layer setting: a novel interpretation and theoretical understanding of CNN; the ML-CSC was shown to enable a theoretical study of CNN, along with new insights
The underlying idea: modeling the data source in order to be able to theoretically analyze algorithms’ performance
Questions?
91