
Using Deep Learning to Accelerate Sparse Recovery

Wotao Yin†

Joint work with Xiaohan Chen‡, Jialin Liu†, Zhangyang Wang‡

†UCLA Math  ‡Texas A&M CSE

Texas A&M U — February 20, 2019

1 / 32

This talk is based on the following papers:

• X. Chen, J. Liu, Z. Wang, and W. Yin, Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds, Advances in Neural Information Processing Systems (NeurIPS), 2018.

• J. Liu, X. Chen, Z. Wang, and W. Yin, ALISTA: Analytic weights are as good as learned weights in LISTA, International Conference on Learning Representations (ICLR), 2019.

X. Chen and J. Liu are equal first authors in both papers.

2 / 32

Overview

Recover a sparse x∗

b := Ax∗ + white noise

where A ∈ Rm×n and b ∈ Rm are given.

Known as compressed sensing, feature selection, or LASSO. A fundamental problem with numerous applications in signal processing, inverse problems, and statistical/machine learning.

3 / 32

Application: Examples

MRI Reconstruction

Radar Sensing

4 / 32

Our methods improve upon classical analytical sparse recovery algorithms by

• recovering a signal closer to x∗ (higher quality)

• reducing the total number of iterations to just 15–20 (fast recovery)

Our methods improve upon existing deep learning-based recovery algorithms, e.g., LISTA (Gregor & LeCun'10), by

• learning (much) fewer parameters (faster training)

• adding support detection (faster recovery)

• proving linear convergence and robustness (theoretical guarantee!)

5 / 32

This talk is based on two recent papers:

• Xiaohan Chen*, Jialin Liu*, Zhangyang Wang, Wotao Yin. “Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds.” NIPS’18.

• Jialin Liu*, Xiaohan Chen*, Zhangyang Wang, Wotao Yin. “ALISTA: Analytic weights are as good as learned weights in LISTA.” To appear in ICLR’19.

* denotes equal-contribution first authors.

6 / 32

Outline

• Review LASSO model and ISTA method

• LISTA: the classic network, then a series of parameter eliminations

• Theoretical results

• How to make it robust

7 / 32

LASSO and ISTA

LASSO model:

x_lasso ← minimize_x  (1/2)‖b − Ax‖₂² + λ‖x‖₁

where λ is a model parameter, tuned by hand or cross validation.

Forward-backward splitting gives ISTA:

x^(k+1) = η_{λ/L}( x^(k) + (1/L) Aᵀ(b − Ax^(k)) )

It converges sublinearly to x_lasso with an eventual linear speed, not to x∗.

FPC (fixed-point continuation): faster by using large λ and scheduling its reduction. Proves finite support detection and eventual linear convergence.
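For concreteness, here is a minimal numpy sketch of the ISTA iteration above; the function names, the choice L = ‖A‖₂², and the iteration count are illustrative, not the talk's settings.

```python
import numpy as np

def soft_threshold(x, theta):
    """Elementwise soft-thresholding: eta_theta(x) = sign(x) * max(|x| - theta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def ista(A, b, lam, num_iters=500):
    """Plain ISTA for the LASSO: minimize (1/2)*||b - A x||_2^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the smooth part: ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = soft_threshold(x + A.T @ (b - A @ x) / L, lam / L)
    return x
```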

8 / 32

Relax ISTA

Rewrite ISTA as

x^(k+1) = η_θ( W₁ b + W₂ x^(k) ),

where W₁ = (1/L) Aᵀ, W₂ = I_n − (1/L) AᵀA, and θ = λ/L.
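The same iteration in its W₁/W₂/θ form, as a sketch reusing `soft_threshold` and the numpy import from the ISTA sketch above:

```python
def ista_relaxed(A, b, lam, num_iters=500):
    """ISTA rewritten as x <- eta_theta(W1 b + W2 x), with W1, W2, theta fixed by A and lam."""
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2
    W1 = A.T / L                        # W1 = (1/L) A^T
    W2 = np.eye(n) - (A.T @ A) / L      # W2 = I_n - (1/L) A^T A
    theta = lam / L                     # theta = lambda / L
    x = np.zeros(n)
    for _ in range(num_iters):
        x = soft_threshold(W1 @ b + W2 @ x, theta)
    return x
```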

9 / 32

Gregor & LeCun’10: Learned ISTA (LISTA)

Unfold K iterations of ISTA.

Free W₁^k, W₂^k, and θ^k, k = 0, . . . , K − 1, as parameters.

Learn them from a training set D = {(b_i, x_i^∗)}:

minimize over {W₁^k, W₂^k, θ^k}_k :  Σ_{(b, x∗) ∈ D} ‖x^K(b) − x∗‖₂².
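A sketch of the resulting unrolled forward pass, reusing `soft_threshold` from the ISTA sketch; the per-layer parameters are passed in explicitly, and the training step (minimizing the objective above by back-propagation in a deep learning framework) is omitted.

```python
def lista_forward(b, W1_list, W2_list, thetas):
    """K-layer LISTA forward pass: x^(k+1) = eta_{theta^k}(W1^k b + W2^k x^(k))."""
    x = np.zeros(W2_list[0].shape[0])
    for W1, W2, theta in zip(W1_list, W2_list, thetas):
        x = soft_threshold(W1 @ b + W2 @ x, theta)
    return x
```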

10 / 32

Just generate synthetic sparse signals, train it like a neural network.

Training is very slow. But K = 16 is enough, and the denoising quality is better.

Figure: NMSE (dB) vs. iteration for ISTA (λ = 0.1, 0.05, 0.025) and LISTA.

11 / 32

However, it does not scale

To run K iterations, the total number of parameters is

O(n²K + mnK).

Too many parameters and too many hours to learn!

12 / 32

Coupling W₁, W₂

If we need x^K → x∗ uniformly for all sparse signals and no measurement noise, then we must have:

• W₂^k + W₁^k A → I,
• θ^k → 0.

True under learning:

Figure: learned parameters across layers k = 1, …, 16, showing the two trends above.

13 / 32

Therefore, we enforce the following coupling in all layers:

W₂^k = I_n − W₁^k A,

yielding the iteration

x^(k+1) = η_{θ^k}( x^(k) + W₁^k (b − Ax^(k)) ).

Parameter reduction:

O(n²K + mnK) → O(mnK),

significant especially if m ≪ n. It also helps to stabilize training.
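With the coupling enforced, each layer keeps only W₁^k and θ^k. A sketch of the coupled forward pass, again reusing `soft_threshold` from the ISTA sketch:

```python
def lista_cp_forward(A, b, W1_list, thetas):
    """LISTA-CP forward pass: x^(k+1) = eta_{theta^k}(x^(k) + W1^k (b - A x^(k)))."""
    x = np.zeros(A.shape[1])
    for W1, theta in zip(W1_list, thetas):
        x = soft_threshold(x + W1 @ (b - A @ x), theta)
    return x
```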

14 / 32

Support selection

Inspired by FPC (Hale, Yin, Zhang '08) and Linearized Bregman (Osher et al. '10).

Idea: at each iteration, let the largest components bypass soft-thresholding.

The largest entries are selected by a fraction that is hand-tuned.

We obtained both empirical and theoretical improvements.
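A sketch of the thresholding-with-support-selection step, reusing `soft_threshold` from the ISTA sketch. In the papers the selected fraction grows with the layer index up to a cap; using a single hand-set `frac` here is a simplification.

```python
def soft_threshold_ss(v, theta, frac):
    """Soft-thresholding with support selection: the top `frac` fraction of entries
    (by magnitude) bypass the shrinkage; the rest are soft-thresholded as usual."""
    out = soft_threshold(v, theta)
    k = int(np.ceil(frac * v.size))
    if k > 0:
        top = np.argsort(np.abs(v))[-k:]   # indices of the k largest-magnitude entries
        out[top] = v[top]                  # let them bypass the soft-thresholding
    return out
```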

15 / 32

Empirical results

• We compare results using normalized MSE (NMSE) in dB.

NMSE(x, x∗) = 20 log10 (‖x− x∗‖2/‖x∗‖2)

• Notation:
  • Original LISTA: LISTA;
  • LISTA with weight coupling: LISTA-CP;
  • LISTA with support selection: LISTA-SS;
  • LISTA with both structures: LISTA-CPSS.

• Setting (a data-generation sketch follows this list):
  • m = 250, n = 500, sparsity s ≈ 50.
  • A_ij ∼ N(0, 1/√m), iid; A is column-normalized.
  • Magnitudes are sampled from a standard Gaussian.
  • Measurement noise levels are measured by the signal-to-noise ratio (SNR).
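A sketch of this synthetic setup and of the NMSE metric; the exact sampling and noise conventions below are assumptions and may differ in detail from the talk's experiments.

```python
import numpy as np

def make_problem(m=250, n=500, s=50, snr_db=None, seed=0):
    """Sample a column-normalized A, an s-sparse x*, and b = A x* (+ noise at the given SNR)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    A /= np.linalg.norm(A, axis=0, keepdims=True)        # column-normalize A
    x_true = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)
    x_true[support] = rng.standard_normal(s)             # Gaussian magnitudes on the support
    b = A @ x_true
    if snr_db is not None:
        noise = rng.standard_normal(m)
        noise *= np.linalg.norm(b) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
        b = b + noise
    return A, b, x_true

def nmse_db(x, x_true):
    """NMSE(x, x*) = 20 log10(||x - x*||_2 / ||x*||_2), in dB."""
    return 20 * np.log10(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

For example, `A, b, x_true = make_problem(snr_db=30)` followed by `nmse_db(ista(A, b, 0.1), x_true)` gives one point of the NMSE comparisons plotted on the following slides (with `ista` as sketched earlier; the λ value is arbitrary).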

16 / 32

Weight coupling

Figure: NMSE (dB) vs. layer/iteration k = 0, …, 16 for ISTA, FISTA, AMP, LISTA, and LISTA-CP.

Weight coupling stabilizes intermediate results. No change in final recovery quality.

17 / 32

(Adding) support selection

Noiseless case (SNR=∞)

Figure: NMSE (dB) vs. layer in the noiseless case for ISTA, FISTA, AMP, LISTA, LAMP, LISTA-CP, LISTA-SS, and LISTA-CPSS; the LISTA-SS and LISTA-CPSS curves are annotated separately from LISTA and LISTA-CP.

18 / 32

(Adding) support selection

Noisy case (SNR=30)

Figure: NMSE (dB) vs. layer in the noisy case (SNR = 30 dB) for ISTA, FISTA, AMP, LISTA, LAMP, LISTA-CP, LISTA-SS, and LISTA-CPSS.

19 / 32

Natural image compressive sensing reconstruction

Figure: (a) ground truth; (b)–(f) results at 20%, 30%, 40%, 50%, and 60% sample rates.

20 / 32

Theory: convergence analysis

Theorem (Convergence of LISTA-CP)
Suppose K = ∞ and let {x^(k)}_{k=1}^∞ be generated by LISTA-CP. There exists a sequence of parameters Θ^(k) = {W₁^i, θ^i}_{i=0}^{k−1} such that

‖x^(k)(Θ^(k), b, x^(0)) − x∗‖₂ ≤ C₁ exp(−ck) + C₂σ,  ∀k = 1, 2, · · · ,

holds for all (x∗, ε) that are sparse and bounded, where c, C₁, C₂ > 0 are constants that depend only on A and the distribution of x∗, and σ is the noise level.

The error bound consists of two parts:

• the error that linearly converges to zero;
• the irreducible error caused by the measurement noise.

21 / 32

Theory: convergence analysis

Theorem (Convergence of LISTA-CPSS)
Suppose K = ∞ and let {x^(k)}_{k=1}^∞ be generated by LISTA-CPSS. There exists a sequence of parameters Θ^(k) = {W₁^i, θ^i}_{i=0}^{k−1} such that

‖x^(k)(Θ^(k), b, x^(0)) − x∗‖₂ ≤ C₁ exp( −Σ_{t=0}^{k−1} c_ss^t ) + C_ss σ,  ∀k = 1, 2, · · ·

holds for all (x∗, ε) satisfying some assumptions, where c_ss^k ≥ c for all k, c_ss^k > c for large enough k, and C_ss < C₂.

The convergence rate is better: c_ss^k > c for large enough k, and the acceleration is more significant in deeper layers.
The recovery error is better: C_ss < C₂.

22 / 32

Tie W1 across the iterations

In the proofs, the choice of W₁^k can be made independent of the layer k.

So, we use just one W for all iterations:

O(mnK) → O(mn),

yielding tied LISTA (TiLISTA):

x^(k+1) = η_{θ^k}( x^(k) + γ^k Wᵀ(b − Ax^(k)) ).

We learn the step sizes {γ^k}_k, the thresholds {θ^k}_k, and just one matrix W. Tied LISTA works as well as LISTA.
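A sketch of the TiLISTA forward pass with the shared W, reusing `soft_threshold` from the ISTA sketch:

```python
def tilista_forward(A, b, W, gammas, thetas):
    """TiLISTA forward pass: x^(k+1) = eta_{theta^k}(x^(k) + gamma^k * W^T (b - A x^(k)))."""
    x = np.zeros(A.shape[1])
    for gamma, theta in zip(gammas, thetas):
        x = soft_threshold(x + gamma * (W.T @ (b - A @ x)), theta)
    return x
```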

23 / 32

Analytic LISTA (ALISTA)

The proofs also reveal that W needs to have small mutual coherence with A. So, we tried to solve for W directly, independent of the training data.

Two steps:

1. Pre-compute W:

W ∈ argmin_{W ∈ R^{m×n}} ‖WᵀA‖_F²,  s.t. (W_{:,j})ᵀ A_{:,j} = 1, ∀j = 1, 2, · · · , n,

which is a standard convex quadratic program and easy to solve.

2. With W fixed, learn {γ^k, θ^k}_k from data (back-propagation).
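One way to carry out step 1, as a sketch: the program decouples over the columns of W, and, assuming AAᵀ is invertible, each column has the closed form w_j = (AAᵀ)⁻¹ a_j / (a_jᵀ(AAᵀ)⁻¹ a_j). This closed form is a standard derivation for the equality-constrained quadratic program stated above, not something taken from the slides; a generic QP solver would work just as well.

```python
import numpy as np

def analytic_W(A):
    """One way to solve  min ||W^T A||_F^2  s.t.  (W[:, j])^T A[:, j] = 1 for all j,
    assuming A A^T is invertible (the problem decouples over the columns of W)."""
    G = A @ A.T                             # m x m Gram matrix A A^T
    Z = np.linalg.solve(G, A)               # Z[:, j] = (A A^T)^{-1} a_j
    scales = np.sum(A * Z, axis=0)          # a_j^T (A A^T)^{-1} a_j, one value per column
    return Z / scales                       # rescale columns so that diag(W^T A) = 1
```

The resulting W can then be plugged into the tied iteration above, leaving only {γ^k, θ^k}_k to learn.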

24 / 32

Analytic LISTA (ALISTA)

For the resulting ALISTA network:

1. The layer-wise weight matrix W depends on the model A, but not on the training data.
2. The step sizes γ^k and thresholds θ^k are learned from data, but they are only a small number of scalars.

25 / 32

Numerical evaluation

Noiseless case (SNR = ∞):

Figure: NMSE (dB) vs. layer for ISTA, FISTA, LISTA, LISTA-CPSS, TiLISTA, and ALISTA.

Noisy case (SNR = 30 dB):

Figure: NMSE (dB) vs. layer for ISTA, FISTA, LISTA, LISTA-CPSS, TiLISTA, and ALISTA.

26 / 32

Numbers of parameters to train

K: number of layers. A has M rows and N columns.

Original LISTA:  O(KM² + K + MN)
LISTA-CPSS:      O(KNM + K)
TiLISTA:         O(NM + K)
ALISTA:          O(K)

A 16-layer ALISTA network takes only around 0.1 hours (6 minutes) of training to achieve performance comparable to LISTA-CPSS, which takes around 1.5 hours to train.

27 / 32

Extension to convolutional A

Our main results can be directly extended to very large convolutions (circulant matrices), so they can handle large images.

Problem: forming a full matrix W is impossible, even for 100 × 100 imaging problems.

Approach: use a convolutional W, find a nearly optimal one, and minimize the coherence via FFTs.

Theoretical guarantee: the approximation is accurate when the image (we consider 2D convolutions) is large enough.

28 / 32

An end-to-end robust model

Mutual coherence minimization can also be solved by unrolling an algorithm!

The coherence minimization can be relaxed as

argmin_{W ∈ R^{m×n}} ‖Q ⊙ (AᵀW − I_n)‖_F²,

where ⊙ is the Hadamard (elementwise) product and Q is a weight matrix that puts more penalty on the diagonal errors. It can be solved by gradient descent:

W^(k+1) = W^(k) − γ^(k) A( Q² ⊙ (AᵀW^(k) − I_n) ).

Figure: One Layer of the Encoder.
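A sketch of this gradient descent; the initialization and the conservative step-size choice below are assumptions, not the talk's settings.

```python
import numpy as np

def coherence_descent(A, Q, num_steps=200, step=None):
    """Gradient descent on  ||Q * (A^T W - I)||_F^2  over W (* = elementwise product).
    The gradient is  2 A (Q*Q * (A^T W - I));  the constant 2 is absorbed into `step`."""
    m, n = A.shape
    if step is None:
        step = 1.0 / (np.max(Q) ** 2 * np.linalg.norm(A, 2) ** 2)   # conservative step size
    W = A / np.sum(A * A, axis=0, keepdims=True)                    # init so that diag(A^T W) = 1
    I = np.eye(n)
    Q2 = Q * Q
    for _ in range(num_steps):
        W = W - step * (A @ (Q2 * (A.T @ W - I)))
    return W
```

Unrolling a fixed number of such steps, with the step sizes treated as learnable, gives the encoder layers described above.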

29 / 32

Robust ALISTA: An end-to-end robust model

We feed the encoder perturbed models Ã = A + ε_A so that W is robust to model perturbations to some extent.
The encoder takes the perturbed A and returns W. It is obtained by unrolling the gradient descent from the previous slide.
The decoder takes W, A, and b and returns x. It is the ALISTA model.

Figure: Robust ALISTA: cascaded Encoder-Decoder Structure.

30 / 32

Numerical results

We perturb A₀ element-wise with Gaussian noise, σ ≤ 0.03. The perturbed A is then column-normalized. The testing model is A, perturbed from A₀.

The W matrices in the non-robust LISTA methods are obtained using A₀.
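A sketch of this perturbation step; the normalization convention is assumed from the description above.

```python
import numpy as np

def perturb_model(A0, sigma, seed=0):
    """Element-wise Gaussian perturbation of A0, followed by column normalization."""
    rng = np.random.default_rng(seed)
    A = A0 + sigma * rng.standard_normal(A0.shape)
    return A / np.linalg.norm(A, axis=0, keepdims=True)   # re-normalize each column
```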

Figure: NMSE (dB) of the compared methods vs. model perturbation level (0 to about 0.03).

31 / 32

Summary

There is huge room for speed improvement by adapting an algorithm to a subset of optimization problems.

We can integrate data-driven (slow, adaptive) and analytic (fast, universal)approaches to obtain fast and adaptive algorithms.

While optimization helps deep learning, deep learning (ideas) can also help optimization.
It is part of the bigger picture called “differentiable programming”, a hot, rising field in deep learning theory.

Thank you!

32 / 32
