Low Complexity Regularization of Inverse Problems · 2014-09-11 · FIGURE 2.2: Phase transitions for linear inverse problems. [left] Recovery of sparse vectors. The empirical probability

Gabriel Peyré

www.numerical-tours.com

Low Complexity Regularization of Inverse Problems

Samuel Vaiter Jalal FadiliJoint work with:

VISI N

http://www.ceremade.dauphine.fr/~peyre/wavelet-tour/

http://www.ceremade.dauphine.fr/~peyre/wavelet-tour/

y = �x0 + w 2 RP

Inverse Problems

Recovering x0 � RN from noisy observations

y = �x0 + w 2 RP

�

Examples: Inpainting, super-resolution, . . .

Inverse Problems

Recovering x0 � RN from noisy observations

x0

�

�x = (p✓k)16k6K

Inverse Problems in Medical Imaging

Magnetic resonance imaging (MRI):

�x = (p✓k)16k6K

�x = (f̂(!))!2⌦


x̂

Magnetic resonance imaging (MRI):

Other examples: MEG, EEG, . . .

�x = (p✓k)16k6K

�x = (f̂(!))!2⌦


x̂

x̃0

Compressed Sensing[Rice Univ.]

P measures � N micro-mirrors

x̃0

y[i] = hx0, 'ii


P/N = 0.16 P/N = 0.02P/N = 1

P measures � N micro-mirrors

x̃0

y[i] = hx0, 'ii


Inverse Problem Regularization

observations y

parameter �Estimator: x(y) depends only on

Observations: y = �x0 + w 2 RP .


observations y

parameter �Example: variational methods

Estimator: x(y) depends only on

x(y) 2 argminx2RN

1

2||y � �x||2 + � J(x)

Data fidelity Regularity


J(x0)Regularity of x0


observations y



x(y) 2 argminx2RN

1

2||y � �x||2 + � J(x)



Choice of �: tradeo� ||w||Noise level


x(y) 2 argmin�x=y

J(x)


observations y



x(y) 2 argminx2RN

1

2||y � �x||2 + � J(x)



No noise: � � 0+, minimize



x(y) 2 argmin�x=y

J(x)


observations y



x(y) 2 argminx2RN

1

2||y � �x||2 + � J(x)



No noise: � � 0+, minimize


�! Criteria on (x0, ||w||,�) to ensure

||x(y)� x0|| = O(||w||)model stability

Performance analysis:

Overview

• Low-complexity Convex Regularization

• Performance Guarantees: L2 Error

• Performance Guarantees: Model Consistency

�

Coe�cients x Image x

M

Union of Models for Data Processing

Synthesissparsity:

Union of models: M ⇢ RNsubspaces or manifolds.

�


M


Synthesissparsity:

Structuredsparsity:


�


M


D�

Image x

Gradient D⇤x

Synthesissparsity:

Structuredsparsity:

Analysissparsity:


�


Multi-spectral imaging:xi,· =

Prj=1 Ai,jSj,·

M


D�

Image x

Gradient D⇤x

Synthesissparsity:

Structuredsparsity:

Analysissparsity:

Low-rank:

S1,·

S2,·

S3,·Figure 3. The concept of hyperspectral imagery. Image measurements are made atmany narrow contiguous wavelength bands, resulting in a complete spectrum for eachpixel.

Hyperspectral Data

Most multispectral imagers (e.g., Landsat, SPOT, AVHRR) measure radiation reflectedfrom a surface at a few wide, separated wavelength bands (Fig. 4). Most hyperspectralimagers (Table 1), on the other hand, measure reflected radiation at a series of narrowand contiguous wavelength bands. When we look at a spectrum for one pixel in ahyperspectral image, it looks very much like a spectrum that would be measured in aspectroscopy laboratory (Fig. 5). This type of detailed pixel spectrum can provide muchmore information about the surface than a multispectral pixel spectrum.

x


(iii) @J is continuous on Mx

around x.

[Lew

is2003]

(i) J is C

2along M

x

around x ;

(ii) 8h 2 Tx

(Mx

)

?, t 7! J(x+ th) non-smooth at t = 0.

Tx

(Mx

)

Partly Smooth Functions

J : RN ! R is partly smooth at x for a manifold Mx

x

Mx

J(x) = max(0, ||x||� 1)

x

Mx

Mx

={z ; supp(z) ⇢ supp(x)}

Examples of Partly-smooth Regularizers

J(x) = ||x||1

x

0

Mx

0

`

1 sparsity: J(x) = ||x||1

x

Mx

same Mx

Mx



J(x) = ||x||1

x

0

Mx

J(x)= |x1|+||x2,3||x

0

x

Mx

0Mx

0

`


Structured sparsity: J(x) =P

b ||xb||

x

Mx

Mx

= {x ; rank(z) = rank(x)}

same Mx

Mx



J(x) = ||x||1

x

0

Mx

J(x) = ||x||⇤

x

Mx

J(x)= |x1|+||x2,3||x

0

x

Mx

0Mx

0

`



b ||xb||

Nuclear norm: J(x) = ||x||⇤

x

Mx

Mx

= {x ; rank(z) = rank(x)}

same Mx

I = {i ; |xi| = ||x||1}

Mx


Mx

= {z ; zI

/ x

I

}


J(x) = ||x||1

x

0

Mx

0

Mx

J(x) = ||x||1

x

x

0

Mx

J(x) = ||x||⇤

x

Mx

J(x)= |x1|+||x2,3||x

0

x

Mx

0Mx

0

Anti-sparsity: J(x) = ||x||1

`



b ||xb||

Nuclear norm: J(x) = ||x||⇤

Overview




�x = �

x0

Dual CertificatesNoiseless recovery:

min�x=�x0

J(x) (P0)

x0

Dual certificates:

�x = �

x0Proposition:

D(x0) = Im(�⇤) \ @J(x0)

x0 solution of (P0) () 9 ⌘ 2 D(x0)

@J(x) = {⌘ ; 8 y, J(y) > J(x) + h⌘, y � xi}

Dual Certificates

⌘@J(x0)

Noiseless recovery:

min�x=�x0

J(x) (P0)

x0

Dual certificates:

�x = �

x0Proposition:

D(x0) = Im(�⇤) \ @J(x0)

x0 solution of (P0) () 9 ⌘ 2 D(x0)

Example: J(x) = ||x||1 �x = x ? '

D(x0) = {⌘ = x ? ' ; ⌘i = sign(x0,i), ||⌘||1 6 1}

@J(x) = {⌘ ; 8 y, J(y) > J(x) + h⌘, y � xi}

Dual Certificates

⌘@J(x0)

Noiseless recovery:

min�x=�x0

J(x) (P0)

x0

⌘⌘

= interior for the topology of a↵(E)

ri(E) = relative interior of E

Non degenerate dual certificate:

Dual Certificates and L2 Stability

�x = �

x0

⌘@J(x0)

x

?D̄(x0) = Im(�⇤) \ ri(@J(x0))





Theorem:

[Fadili et al. 2013]

If 9 ⌘ 2 ¯D(x0), for � ⇠ ||w|| one has ||x? � x0|| = O(||w||)

�x = �

x0

⌘@J(x0)

x

?D̄(x0) = Im(�⇤) \ ri(@J(x0))





[Grassmair 2012]: J(x? � x0) = O(||w||).[Grassmair, Haltmeier, Scherzer 2010]: J = || · ||1.

Theorem:



�x = �

x0

⌘@J(x0)

x

?D̄(x0) = Im(�⇤) \ ri(@J(x0))





[Grassmair 2012]: J(x? � x0) = O(||w||).[Grassmair, Haltmeier, Scherzer 2010]: J = || · ||1.

�! The constants depend on N . . .

Theorem:



�x = �

x0

⌘@J(x0)

x

?D̄(x0) = Im(�⇤) \ ri(@J(x0))

Random matrix: �i,j ⇠ N (0, 1), i.i.d.� 2 RP⇥N ,

Compressed Sensing Setting

Theorem:

Random matrix: �i,j ⇠ N (0, 1), i.i.d.

Sparse vectors: J = || · ||1.

Let s = ||x0||0. If

Then 9⌘ 2 ¯D(x0) with high probability on �.

� 2 RP⇥N ,

[Chandrasekaran et al. 2011]P > 2s log (N/s)

[Rudelson, Vershynin 2006]


Theorem:


Theorem:



Low-rank matrices: J = || · ||⇤.

Let s = ||x0||0. If

Let r = rank(x0). If


� 2 RP⇥N ,


P > 3r(N1 +N2 � r) x0 2 RN1⇥N2



[Chandrasekaran et al. 2011]

Theorem:


Theorem:



Low-rank matrices: J = || · ||⇤.

Let s = ||x0||0. If

Let r = rank(x0). If


� 2 RP⇥N ,

�! Similar results for || · ||1,2, || · ||1.


P > 3r(N1 +N2 � r) x0 2 RN1⇥N2



[Chandrasekaran et al. 2011]

From [Amelunxen et al. 2013]

Phase TransitionsP/N

0

1

s/N r/pN

J = || · ||1 J = || · ||⇤THE GEOMETRY OF PHASE TRANSITIONS IN CONVEX OPTIMIZATION 7

0 25 50 75 1000

25

50

75

100

0 10 20 300

300

600

900

FIGURE 2.2: Phase transitions for linear inverse problems. [left] Recovery of sparse vectors. The empiricalprobability that the `1 minimization problem (2.6) identifies a sparse vector x0 2 R100 given random linearmeasurements z0 = Ax0. [right] Recovery of low-rank matrices. The empirical probability that the S1minimization problem (2.7) identifies a low-rank matrix X0 2 R30£30 given random linear measurementsz0 =A (X0). In each panel, the colormap indicates the empirical probability of success (black = 0%; white =100%). The yellow curve marks the theoretical prediction of the phase transition from Theorem II; the red curvetraces the empirical phase transition.

fixed dimensions. This calculation gives the exact (asymptotic) location of the phase transition for the S1

minimization problem (2.7) with random measurements.To underscore these achievements, we have performed some computer experiments to compare the

theoretical and empirical phase transitions. Figure 2.2[left] shows the performance of (2.6) for identifyinga sparse vector in R100; Figure 2.2[right] shows the performance of (2.7) for identifying a low-rank matrixin R30£30. In each case, the colormap indicates the empirical probability of success over the randomness inthe measurement operator. The empirical 5%, 50%, and 95% success isoclines are determined from thedata. We also draft the theoretical phase transition curve, promised by Theorem II, where the number m ofmeasurements equals the statistical dimension of the appropriate descent cone, which we compute using theformulas from Sections 4.5 and 4.6. See Appendix A for the experimental protocol.

In both examples, the theoretical prediction of Theorem II coincides almost perfectly with the 50% successisocline. Furthermore, the phase transition takes place over a range of O(

pd) values of m, as promised.

Although Theorem II does not explain why the transition region tapers at the bottom-left and top-rightcorners of each plot, we have established a more detailed version of Theorem I that allows us to predict thisphenomenon as well. See the discussion after Theorem 7.1 for more information.

2.4. Demixing problems. In a demixing problem [MT12], we observe a superposition of two structuredvectors, and we aim to extract the two constituents from the mixture. More precisely, suppose that we measurea vector z0 2Rd of the form

z0 = x0 +U y0 (2.8)where x0, y0 2 Rd are unknown and U 2 Rd£d is a known orthogonal matrix. If we wish to identify the pair(x0, y0), we must assume that each component is structured to reduce the number of degrees of freedom.In addition, if the two types of structure are coherent (i.e., aligned with each other), it may be impossibleto disentangle them, so it is expedient to include the matrix U to model the relative orientation of the twoconstituent signals.

To solve the demixing problem (2.8), we describe a convex programming technique proposed in [MT12].Suppose that f and g are proper convex functions on Rd that promote the structures we expect to find in x0

THE GEOMETRY OF PHASE TRANSITIONS IN CONVEX OPTIMIZATION 7

0 25 50 75 1000

25

50

75

100

0 10 20 300

300

600

900

FIGURE 2.2: Phase transitions for linear inverse problems. [left] Recovery of sparse vectors. The empiricalprobability that the `1 minimization problem (2.6) identifies a sparse vector x0 2 R100 given random linearmeasurements z0 = Ax0. [right] Recovery of low-rank matrices. The empirical probability that the S1minimization problem (2.7) identifies a low-rank matrix X0 2 R30£30 given random linear measurementsz0 =A (X0). In each panel, the colormap indicates the empirical probability of success (black = 0%; white =100%). The yellow curve marks the theoretical prediction of the phase transition from Theorem II; the red curvetraces the empirical phase transition.

fixed dimensions. This calculation gives the exact (asymptotic) location of the phase transition for the S1

minimization problem (2.7) with random measurements.To underscore these achievements, we have performed some computer experiments to compare the

theoretical and empirical phase transitions. Figure 2.2[left] shows the performance of (2.6) for identifyinga sparse vector in R100; Figure 2.2[right] shows the performance of (2.7) for identifying a low-rank matrixin R30£30. In each case, the colormap indicates the empirical probability of success over the randomness inthe measurement operator. The empirical 5%, 50%, and 95% success isoclines are determined from thedata. We also draft the theoretical phase transition curve, promised by Theorem II, where the number m ofmeasurements equals the statistical dimension of the appropriate descent cone, which we compute using theformulas from Sections 4.5 and 4.6. See Appendix A for the experimental protocol.

In both examples, the theoretical prediction of Theorem II coincides almost perfectly with the 50% successisocline. Furthermore, the phase transition takes place over a range of O(

pd) values of m, as promised.

Although Theorem II does not explain why the transition region tapers at the bottom-left and top-rightcorners of each plot, we have established a more detailed version of Theorem I that allows us to predict thisphenomenon as well. See the discussion after Theorem 7.1 for more information.

2.4. Demixing problems. In a demixing problem [MT12], we observe a superposition of two structuredvectors, and we aim to extract the two constituents from the mixture. More precisely, suppose that we measurea vector z0 2Rd of the form

z0 = x0 +U y0 (2.8)where x0, y0 2 Rd are unknown and U 2 Rd£d is a known orthogonal matrix. If we wish to identify the pair(x0, y0), we must assume that each component is structured to reduce the number of degrees of freedom.In addition, if the two types of structure are coherent (i.e., aligned with each other), it may be impossibleto disentangle them, so it is expedient to include the matrix U to model the relative orientation of the twoconstituent signals.

To solve the demixing problem (2.8), we describe a convex programming technique proposed in [MT12].Suppose that f and g are proper convex functions on Rd that promote the structures we expect to find in x0

1

0 11

P/N

Overview




Minimal Norm Certificate

⌘0 = argmin⌘=�⇤

q2@J(x0)||q||

Minimal-norm certificate:

@J(x0) ⇢ A(x0) = A↵Hull(@J(x0))



q2@J(x0)||q||


Case J = || · ||1

A(x0)

x0

@J(x

0 )

T =Tx0(Mx0)

@J(x0) ⇢ A(x0) = A↵Hull(@J(x0))



q2@J(x0)||q||


⌘F

= argmin⌘=�⇤

q2A(x0)||q||

Linearized pre-certificate:

Case J = || · ||1

A(x0)

x0

@J(x

0 )

T =Tx0(Mx0)

�! ⌘F is computed by solving a linear system.

�! One does not always have ⌘F 2 D(x0) !

@J(x0) ⇢ A(x0) = A↵Hull(@J(x0))



q2@J(x0)||q||


⌘F

= argmin⌘=�⇤

q2A(x0)||q||


Case J = || · ||1

A(x0)

x0

@J(x

0 )

T =Tx0(Mx0)

�! ⌘F is computed by solving a linear system.

�! One does not always have ⌘F 2 D(x0) !

@J(x0) ⇢ A(x0) = A↵Hull(@J(x0))



q2@J(x0)||q||


Theorem: If ker(�) \ T = {0},⇢

⌘F 2 D̄(x0) =) ⌘F = ⌘0,

⌘0 2 D̄(x0) =) ⌘F = ⌘0.

⌘F

= argmin⌘=�⇤

q2A(x0)||q||


Case J = || · ||1

A(x0)

x0

@J(x

0 )

T =Tx0(Mx0)

[Vaiter et al. 2014]

Model Stability

Theorem:

the unique solution x

?of P�(y) for y = �x0 + w satisfies

If ⌘F 2 ¯D(x0), there exists C such that if

max (�, ||w||/�) 6 C

x? 2 Mx0 and ||x? � x0|| = O(||w||,�)

[Fuchs 2004]: J = || · ||1.

[Bach 2008]: J = || · ||1,2 and J = || · ||⇤.

Previous works:

[Vaiter et al. 2014]

Model Stability

Theorem:

the unique solution x

?of P�(y) for y = �x0 + w satisfies

If ⌘F 2 ¯D(x0), there exists C such that if

max (�, ||w||/�) 6 C

[Vaiter et al. 2011]: J = ||D⇤ · ||1.

x? 2 Mx0 and ||x? � x0|| = O(||w||,�)


Theorem:



Let s = ||x0||0. If

� 2 RP⇥N ,

Then ⌘0 2 ¯D(x0) with high probability on �.

[Dossal et al. 2011]

[Wainwright 2009]

P > 2s log(N)

P ⇠ 2s log(N)P ⇠ 2s log(N/s)

L2 stability Model stability

Phase

transitions:

vs.


Theorem:



Let s = ||x0||0. If

� 2 RP⇥N ,



[Wainwright 2009]

P > 2s log(N)

�! Similar results for || · ||1,2, || · ||⇤, || · ||1.



Phase

transitions:

vs.


Theorem:



Let s = ||x0||0. If

� 2 RP⇥N ,



[Wainwright 2009]

P > 2s log(N)

�! Similar results for || · ||1,2, || · ||⇤, || · ||1.

�! Not using RIP technics (non-uniform result on x0).



Phase

transitions:

vs.


Theorem:



Let s = ||x0||0. If

� 2 RP⇥N ,



[Wainwright 2009]

P > 2s log(N)

⇥x =�

i

xi�(·� �i)

Increasing �:� reduces correlation.� reduces resolution.

J(x) = ||x||1

1-D Sparse Spikes Deconvolution

�x0

x0�

⇥x =�

i

xi�(·� �i)

Increasing �:� reduces correlation.� reduces resolution.

�0 10

2

support recovery.

J(x) = ||x||1

()

I = {j \ x0(j) 6= 0}||⌘F,Ic ||1||⌘F,Ic ||1 < 1

⌘0 = ⌘F 2 D̄(x0)

1-D Sparse Spikes Deconvolution

�x0

x0�

201

()

ConclusionPartial smoothness: encodes models using singularities.

Performance measures

L2error

model

di↵erent CS guarantees



L2error

model


Specific certificate:⌘0, ⌘F , . . .



L2error

model


Specific certificate:

– CS performance with arbitrary gauges.

– Infinite dimensional regularizations (BV, . . . )

– Convergence discrete ! continuous.

⌘0, ⌘F , . . .

Conclusion

Open problems:

Partial smoothness: encodes models using singularities.

Documents

Low Complexity Regularization of Inverse Problems · 2014-09-11 · FIGURE 2.2: Phase transitions for linear inverse problems. [left] Recovery of sparse vectors. The empirical probability