Compressed Sensing using Generative Models
Ashish Bora Ajil Jalal Eric Price Alex Dimakis
UT Austin
Ashish Bora, Ajil Jalal, Eric Price, Alex Dimakis (UT Austin) Compressed Sensing using Generative Models 1 / 20
Compressed Sensing
Want to recover a signal (e.g. an image) from noisy measurements.
[figure: application examples — Medical Imaging, Astronomy, Single-Pixel Camera, Oil Exploration]

Linear measurements: Choose A ∈ R^{m×n}, see y = Ax.
How many measurements m to learn the signal?
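The measurement setup above can be sketched in a few lines of numpy (a minimal sketch; the sizes n and m are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 20

x = rng.standard_normal(n)           # unknown signal (e.g. a flattened image)
A = rng.standard_normal((m, n))      # measurement matrix A in R^{m x n}
y = A @ x                            # the m observed linear measurements

# With m < n the system is underdetermined: adding any null-space vector
# of A to x leaves the measurements unchanged.
w = rng.standard_normal(n)
v = w - A.T @ np.linalg.solve(A @ A.T, A @ w)   # project w onto null(A)
assert np.allclose(A @ (x + v), y)
```

The last three lines make the next slide's point concrete: without further assumptions on x, infinitely many signals explain the same y.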
Compressed Sensing

Given linear measurements y = Ax, for A ∈ R^{m×n}.

How many measurements m to learn the signal x?
- Naively: m ≥ n, or else the system is underdetermined: multiple x possible.
- But most x aren't plausible. [figure: two images, 5 MB vs. 36 MB]
- This is why compression is possible.

Ideal answer: (information in image) / (new information per measurement)
- Image "compressible" ⟹ information in the image is small.
- Measurements "incoherent" ⟹ most information is new.
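The "information in image" numerator can be made concrete with a back-of-envelope counting argument for sparse signals (the numbers n and k here are illustrative):

```python
import math

# The support of a k-sparse signal in R^n is one of C(n, k) patterns,
# so the support alone carries log2 C(n, k) bits -- roughly k*log2(n/k).
n, k = 1000, 10
support_bits = math.log2(math.comb(n, k))

# Standard bounds (n/k)^k <= C(n, k) <= (e*n/k)^k sandwich that count:
lower = k * math.log2(n / k)
upper = k * math.log2(math.e * n / k)
assert lower <= support_bits <= upper
```

This k log(n/k) figure reappears later in the talk as the optimal measurement count for sparse recovery.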
Compressed Sensing

Want to estimate x ∈ R^n from m ≪ n linear measurements.

Suggestion: Find the "most compressible" image that fits the measurements.

Three questions:
- How should we formalize that an image is "compressible"?
- What algorithm for recovery?
- How to choose the measurement matrix?
Compressed Sensing
Standard compressed sensing: sparsity in some basis
- Sparsity + other constraints ("structured sparsity")

This talk: a new method.
Standard Compressed Sensing Formalism: "Compressible" = "sparse"

Want to estimate x from y = Ax + η, for A ∈ R^{m×n}.
- For this talk: ignore η, so y = Ax.

Algorithm for recovery: LASSO

    min_x̂ ‖Ax̂ − y‖₂² + λ‖x̂‖₁

Goal: x̂ with

    ‖x − x̂‖₂ ≤ O(1) · min_{k-sparse x′} ‖x − x′‖₂    (1)

- Reconstruction accuracy is proportional to model accuracy.

Theorem [Candès-Romberg-Tao 2006]
- A satisfies the REC if, for all k-sparse x, ‖Ax‖₂ ≥ γ‖x‖₂.
- If A satisfies the REC, then LASSO recovers an x̂ satisfying (1).
- m = O(k log(n/k)) suffices for a random A to satisfy the REC.

Theorem [Do Ba-Indyk-Price-Woodruff 2010]
- m = Θ(k log(n/k)) is optimal.
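As a sanity check of the LASSO guarantee, here is a small noiseless recovery experiment (a sketch using scikit-learn, whose `Lasso` minimizes (1/(2m))‖Ax̂ − y‖₂² + α‖x̂‖₁, i.e. the slide's objective up to a rescaling of λ; the sizes and α = 1e-4 are illustrative choices, not from the talk):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 200, 60, 5                 # m is a few times k*log(n/k)

# k-sparse ground truth
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian A
y = A @ x                                      # noiseless measurements

# sklearn's Lasso objective matches the slide's up to rescaling lambda.
lasso = Lasso(alpha=1e-4, fit_intercept=False, max_iter=100_000, tol=1e-10)
lasso.fit(A, y)
x_hat = lasso.coef_

err = np.linalg.norm(x - x_hat)      # small: near-exact recovery
```

With m = 60 well above k log(n/k) ≈ 18, the ℓ1 relaxation recovers the sparse signal up to the small bias induced by α.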
Alternatives to sparsity?

- The basis needs to be handcrafted.
- Sparsity does not capture the set of plausible vectors tightly: too simplistic.
- It ignores a lot of domain-dependent structure.

Billions of natural images and millions of MRIs are collected each year:
- an opportunity to improve structural understanding;
- better structural understanding should give fewer measurements.

Best way to model images in 2017?
- Deep convolutional neural networks.
- In particular: generative models.
Generative Models

Want to model a distribution D of images.
- A function G : R^k → R^n.
- When z ∼ N(0, I_k), ideally G(z) ∼ D.

Active area of machine learning research in the last few years.
- Generative Adversarial Networks (GANs) [Goodfellow et al. 2014]: a competition between a generator and a discriminator. Variants: W-GAN, BEGAN, InfoGAN, DCGAN, ...
- Variational Auto-Encoders (VAEs) [Kingma & Welling 2013]: blurrier, but maybe better coverage of the space.

Suggestion for compressed sensing: replace "x is k-sparse" by "x is in the range of G : R^k → R^n".
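The interface G : R^k → R^n with z ∼ N(0, I_k) can be illustrated with a toy random-weight network (a minimal sketch; a real DCGAN or VAE decoder has learned convolutional weights, and the sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 8, 64        # latent and "image" dimensions (illustrative)

# A toy two-layer generator G : R^k -> R^n with fixed random weights.
W1 = rng.standard_normal((32, k))
W2 = rng.standard_normal((n, 32))

def G(z):
    h = np.maximum(W1 @ z, 0.0)     # ReLU hidden layer
    return np.tanh(W2 @ h)          # outputs in [-1, 1], like pixel values

z = rng.standard_normal(k)          # z ~ N(0, I_k)
sample = G(z)                       # a "generated image" in R^n
assert sample.shape == (n,)
```

The only property the rest of the talk needs is this mapping from a low-dimensional latent z to a high-dimensional signal.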
Our Results
“Compressible” = “near range(G)”

Want to estimate x from y = Ax, for A ∈ R^{m×n}.
We are given the generative model G : R^k → R^n.
Algorithm for recovery:
  - Compare the Lasso: min_x ‖Ax − y‖₂² + λ‖x‖₁.
  - Instead solve min_z ‖AG(z) − y‖₂² + λ‖z‖₂².
  - Backprop to get gradients with respect to z.
  - Optimize with gradient descent.
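The recovery procedure above can be sketched end to end in plain numpy. Here G is a toy one-layer ReLU generator with random weights, a stand-in for a trained network (purely illustrative), and the gradient of ‖AG(z) − y‖₂² + λ‖z‖₂² is written out by hand instead of using a framework's backprop:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m, lam = 5, 100, 40, 0.01

# Toy generator G(z) = ReLU(Wz): a stand-in for a trained network.
W = rng.normal(size=(n, k)) / np.sqrt(k)
G = lambda z: np.maximum(W @ z, 0)

A = rng.normal(size=(m, n)) / np.sqrt(m)   # Gaussian measurement matrix
z_true = rng.normal(size=k)
y = A @ G(z_true)                          # noiseless measurements of x = G(z_true)

def loss(z):
    r = A @ G(z) - y
    return r @ r + lam * (z @ z)

z = rng.normal(size=k)                     # random initialization
loss0 = loss(z)
lr = 1e-3
for _ in range(3000):
    h = W @ z
    r = A @ np.maximum(h, 0) - y
    # d/dz ||A ReLU(Wz) - y||^2 = 2 W^T ((h > 0) * (A^T r)), plus the regularizer.
    grad = 2 * W.T @ ((h > 0) * (A.T @ r)) + 2 * lam * z
    z -= lr * grad

x_hat = G(z)                               # reconstructed signal
```

As on the slide, there is no convergence proof for this nonconvex objective, but gradient descent drives the measurement error down in practice.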
Our Results
“Compressible” = “near range(G)”

Goal: x̂ with

    ‖x̂ − x‖₂ ≤ O(1) · min_{x′ = G(z′), ‖z′‖₂ ≤ r} ‖x − x′‖₂ + δ        (2)

  - Reconstruction accuracy is proportional to model accuracy.
Main Theorem I: m = O(kd log n) suffices for (2).
  - G is a d-layer ReLU-based neural network.
  - A is a random Gaussian matrix.
Main Theorem II: for any L-Lipschitz G, m = O(k log(rL/δ)) suffices.
  - Morally the same O(kd log n) bound: L, r, δ⁻¹ ∼ n^{O(d)}.
Convergence:
  - Just like training, there is no proof that gradient descent converges.
  - An approximate solution approximately gives (2).
  - Can check that ‖x̂ − x‖₂ is small.
  - In practice, the optimization error is negligible.
Experimental Results

Faces: n = 64 × 64 × 3 = 12288, m = 500.
[Figure: reconstructions; rows labeled Original, Lasso (DCT), Lasso (Wavelet), DCGAN]
MNIST: n = 28 × 28 = 784, m = 100.
Experimental Results

[Figure: reconstruction error (per pixel) vs. number of measurements.
MNIST (m = 10 to 750): Lasso, VAE, VAE+Reg.
Faces (m = 20 to 10000): Lasso (DCT), Lasso (Wavelet), DCGAN, DCGAN+Reg.]

For a fixed G, k is fixed, so the error stops improving beyond some point.
Larger m should use a higher-capacity G, making min_z ‖x − G(z)‖ smaller.
Proof Outline (ReLU-based networks)

Show range(G) lies within a union of n^{dk} k-dimensional hyperplanes.
  - Then analogous to the proof for sparsity: (n choose k) ≤ 2^{k log(n/k)} hyperplanes.
  - So dk log n Gaussian measurements suffice.
ReLU-based network:
  - Each layer is z → ReLU(A_i z).
  - ReLU(y)_i = y_i if y_i ≥ 0, and 0 otherwise.
Input to layer 1: a single k-dimensional hyperplane.

Lemma. Layer 1’s output lies within a union of n^k k-dimensional hyperplanes.

Induction: the final output lies within n^{dk} k-dimensional hyperplanes.
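The structural fact behind this outline, that a ReLU network is exactly linear on each region where the sign pattern of its pre-activations is fixed, can be checked numerically. A minimal sketch with random weights (illustrative, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
k, n1, n = 4, 30, 50                     # input dim, hidden width, output dim

A1 = rng.normal(size=(n1, k))
A2 = rng.normal(size=(n, n1))
G = lambda z: np.maximum(A2 @ np.maximum(A1 @ z, 0), 0)  # 2-layer ReLU net

z0 = rng.normal(size=k)
D1 = np.diag((A1 @ z0 > 0).astype(float))                      # layer-1 sign pattern at z0
D2 = np.diag((A2 @ np.maximum(A1 @ z0, 0) > 0).astype(float))  # layer-2 sign pattern at z0

# Within the region where both sign patterns stay fixed, G equals the linear map M.
# M has rank at most k, so this region maps into a k-dimensional piece of R^n.
M = D2 @ A2 @ D1 @ A1

z_eps = z0 + 1e-7 * rng.normal(size=k)   # small enough to keep the sign patterns
```

Since there are at most n^k sign patterns per layer, the range is covered by at most n^{dk} such k-dimensional pieces.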
Proof of Lemma
Layer 1’s output lies within a union of n^k k-dimensional hyperplanes.

A₁z is a k-dimensional hyperplane in R^n.
ReLU(A₁z) is linear within any region where sign(A₁z) is constant.
How many different patterns can sign(A₁z) take?
k = 2 version: how many regions can n lines partition the plane into?
  - 1 + (1 + 2 + · · · + n) = (n² + n + 2)/2.
  - In general, n half-spaces divide R^k into fewer than n^k regions.
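The k = 2 count can be verified exactly in code: place n random lines in the plane and count the distinct sign patterns. Sampling just off every vertex covers all bounded regions, and sampling far away in each gap between line directions covers the unbounded ones (a purely illustrative check):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 5                                  # lines a_i . x = b_i in general position
A = rng.normal(size=(n, 2))
b = rng.normal(size=n)

def pattern(p):
    return tuple(A @ p - b > 0)        # which side of each line p lies on

pts = []
# Every bounded region touches a vertex; the four sector bisectors at each
# vertex give one sample point inside each region adjacent to it.
for i, j in combinations(range(n), 2):
    v = np.linalg.solve(A[[i, j]], b[[i, j]])
    ti = np.array([-A[i, 1], A[i, 0]]); ti /= np.linalg.norm(ti)
    tj = np.array([-A[j, 1], A[j, 0]]); tj /= np.linalg.norm(tj)
    for d in (ti + tj, ti - tj, -ti - tj, -ti + tj):
        pts.append(v + 1e-8 * d / np.linalg.norm(d))
# Every unbounded region spans an arc of directions between two consecutive
# line directions; sample one far-away point in the middle of each arc.
th = np.arctan2(A[:, 0], -A[:, 1]) % np.pi       # angles of the line tangents
ang = np.sort(np.concatenate([th, th + np.pi]))
mids = (ang + np.roll(ang, -1)) / 2
mids[-1] += np.pi                                # wrap-around arc
pts += [1e6 * np.array([np.cos(t), np.sin(t)]) for t in mids]

regions = len({pattern(p) for p in pts})         # 1 + n(n+1)/2 for n lines
```

For n = 5 this recovers 1 + 5·6/2 = 16 regions, matching the formula on the slide.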
Proof outline (Lipschitz networks)

Need: if x₁, x₂ ∈ range(G) are very different, then ‖Ax₁ − Ax₂‖ is large.
  - Hence we can distinguish them even with noise.
True for any fixed x₁, x₂ with probability 1 − e^{−Ω(m)}.
Apply to a δ-cover of range(G).
  - Comes from a δ/L-cover of the domain of G.
  - Size (rL/δ)^k: the union bound works for m = O(k log(rL/δ)).
x₁, x₂ lie close to the cover ⟹ additive δ‖A‖ loss:

    ‖Ax₁ − Ax₂‖ ≥ 0.9 ‖x₁ − x₂‖ − O(δn)

Hence

    ‖x̂ − x‖₂ ≤ C · min_{x′ ∈ range(G)} ‖x′ − x‖₂ + δn
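The first step, that a random Gaussian A approximately preserves the distance between any two fixed points with overwhelming probability, is easy to observe numerically (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 1000
A = rng.normal(size=(m, n)) / np.sqrt(m)   # scaled so E||Av||^2 = ||v||^2

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# ||A(x1 - x2)||^2 / ||x1 - x2||^2 is a chi-square(m)/m variable:
# concentrated around 1 with fluctuations of order sqrt(2/m).
ratio = np.linalg.norm(A @ (x1 - x2)) / np.linalg.norm(x1 - x2)
```

The union bound then extends this pairwise guarantee to all (rL/δ)^k points of the cover at the cost of the O(k log(rL/δ)) factor in m.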
Non-gaussian measurements
A = subset of Fourier matrix: MRI
A = Gaussian blur: superresolution
A = diagonal with zeros: inpainting
Algorithm can be applied, unifying problems.
I Guarantee only holds if G and A are “incoherent”.
Ashish Bora, Ajil Jalal, Eric Price, Alex Dimakis (UT Austin) Compressed Sensing using Generative Models 16 / 20
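All three measurement models above are instances of y = Ax for a suitable A. A minimal sketch (illustrative, not from the talk) constructing each operator explicitly for a length-8 signal:

```python
import numpy as np

n = 8
x = np.arange(1.0, n + 1)

# Inpainting: A is diagonal with zeros at the masked pixels.
mask = np.array([1, 1, 0, 1, 0, 0, 1, 1], dtype=float)
A_inpaint = np.diag(mask)

# Superresolution (1-D sketch): each measurement is a local blur,
# here a simple 2x downsampling average.
A_blur = np.zeros((n // 2, n))
for i in range(n // 2):
    A_blur[i, 2 * i] = A_blur[i, 2 * i + 1] = 0.5

# Subsampled Fourier (MRI): keep a subset of rows of the DFT matrix.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
rows = [0, 2, 5]
A_mri = F[rows]

y_inpaint = A_inpaint @ x   # observed pixels, zeros elsewhere
y_blur = A_blur @ x         # low-resolution signal
y_mri = A_mri @ x           # subsampled Fourier coefficients
```

The same reconstruction objective, minimizing ‖AG(z) − y‖ over z, applies unchanged to each choice of A; only the incoherence between A and range(G) changes, which is why the Gaussian guarantee does not automatically carry over.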
Inpainting
Ashish Bora, Ajil Jalal, Eric Price, Alex Dimakis (UT Austin) Compressed Sensing using Generative Models 17 / 20
Super-resolution
Ashish Bora, Ajil Jalal, Eric Price, Alex Dimakis (UT Austin) Compressed Sensing using Generative Models 18 / 20
Summary
Number of measurements required to estimate an image= (information content of image) / (new information per measurement)
Generative models can bound information content as O(kd log n).
I Open: do real networks require m linear in d?
Generative models differentiable =⇒ can optimize in practice.
Gaussian measurements ensure independent information.
Ashish Bora, Ajil Jalal, Eric Price, Alex Dimakis (UT Austin) Compressed Sensing using Generative Models 19 / 20
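The last two summary points can be made concrete. A toy sketch (my illustration, not the authors' code): take a small random one-layer ReLU "generator" G standing in for a trained model, Gaussian measurements, and recover the latent z by gradient descent on ‖AG(z) − y‖², with random restarts since the loss is nonconvex.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, m = 5, 50, 25  # latent dim, signal dim, measurements

# Toy "generator": a fixed random one-layer ReLU network; its range
# is a k-dimensional piecewise-linear subset of R^n.
W = rng.normal(size=(n, k)) / np.sqrt(n)
G = lambda z: np.maximum(W @ z, 0.0)

A = rng.normal(size=(m, n)) / np.sqrt(m)
z_true = rng.normal(size=k)
y = A @ G(z_true)                        # noiseless measurements

best_err = np.inf
for _ in range(10):                      # random restarts (nonconvex loss)
    z = rng.normal(size=k)
    for _ in range(1500):
        pre = W @ z
        r = A @ np.maximum(pre, 0.0) - y          # residual A G(z) - y
        grad = W.T @ ((pre > 0) * (A.T @ r))      # grad of 0.5*||r||^2
        z -= 0.1 * grad
    err = np.linalg.norm(G(z) - G(z_true)) / np.linalg.norm(G(z_true))
    best_err = min(best_err, err)

print(f"best relative reconstruction error: {best_err:.4f}")
```

In practice one would differentiate a real GAN or VAE decoder with autodiff instead of the hand-written gradient above; the guarantee is that any near-minimizer of this loss is close to the best approximation of x in range(G).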
Future work

More uses of differentiable compression?
More applications of distance to range of G?
I Metric of how good a generative model is.
I Objective to train new generative model.
F Bojanowski et al. 2017: impressive results with this.
Computable proxy for Lipschitz parameter?
I Real networks may be much better than worst-case n^d.
Competitive guarantee for non-Gaussian A?
I Even nonlinear A?

Thank You

Ashish Bora, Ajil Jalal, Eric Price, Alex Dimakis (UT Austin) Compressed Sensing using Generative Models 20 / 20