
Lecture: Algorithms for Compressed Sensing

Zaiwen Wen

Beijing International Center For Mathematical Research, Peking University

http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html
[email protected]

Acknowledgement: part of these slides is based on Prof. Lieven Vandenberghe's and Prof. Wotao Yin's lecture notes.

Outline

Proximal gradient method

Complexity analysis of Proximal gradient method

Alternating direction method of multipliers (ADMM)

Linearized alternating direction method of multipliers

Check lecture notes at http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html

$\ell_1$-regularized least squares problem

Consider
$$\min\ \psi_\mu(x) := \mu\|x\|_1 + \frac{1}{2}\|Ax - b\|_2^2$$

Approaches:
- Interior point method: l1_ls
- Spectral gradient method: GPSR
- Fixed-point continuation method: FPC
- Active set method: FPC_AS
- Alternating direction augmented Lagrangian method
- Nesterov's optimal first-order method
- many others

Subgradient

Recall the basic inequality for convex differentiable $f$:
$$f(y) \ge f(x) + \nabla f(x)^\top(y - x).$$

$g$ is a subgradient of a convex function $f$ at $x \in \operatorname{dom} f$ if
$$f(y) \ge f(x) + g^\top(y - x), \quad \forall y \in \operatorname{dom} f.$$

(Figure: $g_2$, $g_3$ are subgradients at $x_2$; $g_1$ is a subgradient at $x_1$.)

Optimality conditions — unconstrained

$x^*$ minimizes $f(x)$ if and only if
$$0 \in \partial f(x^*)$$

Proof: by definition,
$$f(y) \ge f(x^*) + 0^\top(y - x^*) \ \text{for all } y \iff 0 \in \partial f(x^*).$$

Optimality conditions — constrained

$$\min\ f_0(x) \quad \text{s.t. } f_i(x) \le 0,\ i = 1, \dots, m.$$

From Lagrange duality: if strong duality holds, then $x^*, \lambda^*$ are primal, dual optimal if and only if
- $x^*$ is primal feasible
- $\lambda^* \ge 0$
- complementarity: $\lambda_i^* f_i(x^*) = 0$ for $i = 1, \dots, m$
- $x^*$ is a minimizer of $L(x, \lambda^*) = f_0(x) + \sum_i \lambda_i^* f_i(x)$, i.e.,
$$0 \in \partial_x L(x^*, \lambda^*) = \partial f_0(x^*) + \sum_i \lambda_i^* \partial f_i(x^*)$$

Proximal Gradient Method

Let $f(x) = \frac{1}{2}\|Ax - b\|_2^2$. The gradient is $\nabla f(x) = A^\top(Ax - b)$. Consider
$$\min\ \psi_\mu(x) := \mu\|x\|_1 + f(x).$$

First-order approximation + proximal term:
$$\begin{aligned} x^{k+1} &:= \arg\min_{x\in\mathbb{R}^n}\ \mu\|x\|_1 + (\nabla f(x^k))^\top(x - x^k) + \frac{1}{2\tau}\|x - x^k\|_2^2 \\ &= \arg\min_{x\in\mathbb{R}^n}\ \mu\|x\|_1 + \frac{1}{2\tau}\|x - (x^k - \tau\nabla f(x^k))\|_2^2 \\ &= \mathrm{shrink}(x^k - \tau\nabla f(x^k), \mu\tau) \end{aligned}$$

- gradient step: bring in candidates for nonzero components
- shrinkage step: eliminate some of them by "soft" thresholding

Shrinkage (soft thresholding)

$$\begin{aligned} \mathrm{shrink}(y, \nu) &:= \arg\min_x\ \nu\|x\|_1 + \frac{1}{2}\|x - y\|_2^2 \\ &= \operatorname{sgn}(y)\max(|y| - \nu, 0) \\ &= \begin{cases} y - \nu\operatorname{sgn}(y), & \text{if } |y| > \nu \\ 0, & \text{otherwise} \end{cases} \end{aligned}$$

- Chambolle, DeVore, Lee and Lucier
- Figueiredo, Nowak and Wright
- Elad, Matalon and Zibulevsky
- Hale, Yin and Zhang
- Darbon, Osher
- many others
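A minimal numerical sketch of the shrink operator above (NumPy; the function name and vectorized form are illustrative choices, not from the slides):

```python
import numpy as np

def shrink(y, nu):
    # soft thresholding: argmin_x nu*||x||_1 + 0.5*||x - y||_2^2, applied componentwise
    return np.sign(y) * np.maximum(np.abs(y) - nu, 0.0)
```

For example, `shrink(np.array([3.0, -0.5, 1.0]), 1.0)` shrinks the first entry to 2 and zeroes the other two.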


Proximal gradient method For General Problems

Consider the model
$$\min\ F(x) := f(x) + r(x)$$
- $f(x)$ is convex, differentiable
- $r(x)$ is convex but may be nondifferentiable

General scheme: linearize $f(x)$ and add a proximal term:
$$\begin{aligned} x^{k+1} &:= \arg\min_{x\in\mathbb{R}^n}\ r(x) + (\nabla f(x^k))^\top(x - x^k) + \frac{1}{2\tau}\|x - x^k\|_2^2 \\ &= \arg\min_{x\in\mathbb{R}^n}\ \tau r(x) + \frac{1}{2}\|x - (x^k - \tau\nabla f(x^k))\|_2^2 \\ &= \operatorname{prox}_{\tau r}(x^k - \tau\nabla f(x^k)) \end{aligned}$$

Proximal operator:
$$\operatorname{prox}_r(y) := \arg\min_x\ r(x) + \frac{1}{2}\|x - y\|_2^2.$$
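As a rough illustration of this general scheme on the $\ell_1$-regularized least squares model, the sketch below iterates $x^{k+1} = \operatorname{prox}_{\tau r}(x^k - \tau\nabla f(x^k))$; the fixed step $\tau = 1/\|A\|_2^2$ and the iteration count are illustrative assumptions, not prescriptions from the slides:

```python
import numpy as np

def prox_l1(y, t):
    # prox of t*||.||_1, i.e. soft thresholding
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def proximal_gradient_l1(A, b, mu, n_iter=500):
    """Proximal gradient sketch for min mu*||x||_1 + 0.5*||Ax - b||_2^2."""
    tau = 1.0 / np.linalg.norm(A, 2) ** 2       # 1 / Lipschitz constant of A^T(Ax - b)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                # gradient step on the smooth part
        x = prox_l1(x - tau * grad, mu * tau)   # shrinkage (proximal) step
    return x
```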


Proximal mapping

If $r$ is convex and closed (has a closed epigraph), then
$$\operatorname{prox}_r(y) := \arg\min_x\ r(x) + \frac{1}{2}\|x - y\|_2^2$$
exists and is unique for all $y$. From the optimality conditions of the minimization in the definition:
$$x = \operatorname{prox}_r(y) \iff y - x \in \partial r(x) \iff r(z) \ge r(x) + (y - x)^\top(z - x), \ \forall z$$

The subdifferential of a convex function is a monotone operator:
$$(u - v)^\top(x - y) \ge 0 \quad \forall x, y,\ u \in \partial f(x),\ v \in \partial f(y)$$

Proof: combine the following two inequalities:
$$f(y) \ge f(x) + u^\top(y - x), \qquad f(x) \ge f(y) + v^\top(x - y)$$

Nonexpansiveness

If $u = \operatorname{prox}_r(x)$, $v = \operatorname{prox}_r(y)$, then
$$(u - v)^\top(x - y) \ge \|u - v\|_2^2$$

$\operatorname{prox}_r$ is firmly nonexpansive, or co-coercive with constant 1. This follows from the last page and monotonicity of subgradients:
$$x - u \in \partial r(u),\ y - v \in \partial r(v) \implies (x - u - y + v)^\top(u - v) \ge 0$$

This implies (from the Cauchy-Schwarz inequality)
$$\|\operatorname{prox}_r(x) - \operatorname{prox}_r(y)\|_2 \le \|x - y\|_2$$

Global convergence

- The proximal mapping is nonexpansive.
- Under certain assumptions, $h(x) = x - \tau\nabla f(x)$ is nonexpansive:
$$\|h(x) - h(x')\| \le \|x - x'\|,$$
and $r(x) = r(x')$ whenever equality holds.
- $x$ is a fixed point if
$$\|\operatorname{prox}_{\tau r}(h(x)) - \operatorname{prox}_{\tau r}(h(x^*))\| \equiv \|\operatorname{prox}_{\tau r}(h(x)) - x^*\| = \|x - x^*\|$$

Proximal gradient method for $\ell_1$-minimization

Advantages
- Low computational cost (first-order method)
- Finite convergence of the nondegenerate support
- Global q-linear convergence

Disadvantages
- Takes a lot of steps when the support is relatively large
- Slow convergence: degenerate support identification or magnitude recovery?

Strategies for improving shrinkage
- Adjust $\lambda$ dynamically (use $\lambda_k$)
- Line search: $x^{k+1} = x^k + \alpha_k(\mathrm{shrink}(x^k - \lambda_k\nabla f^k, \mu\lambda_k) - x^k)$
- Continuation: approximately solve for $\mu_0 > \mu_1 > \cdots > \mu_p = \mu$ (FPC)
- Debiasing or subspace optimization

Line search approach

Let $x^k(\tau^k) = \mathrm{shrink}(x^k - \tau^k\nabla f^k, \mu\tau^k)$. Then set
$$x^{k+1} = x^k + \alpha_k(x^k(\tau^k) - x^k) = x^k + \alpha_k d^k$$

Choosing $\tau^k$: Barzilai-Borwein method. With $\nabla f^k = A^\top(Ax^k - b)$, $s^{k-1} = x^k - x^{k-1}$ and $y^{k-1} = \nabla f^k - \nabla f^{k-1}$,
$$\tau^{k,\mathrm{BB1}} = \frac{(s^{k-1})^\top s^{k-1}}{(s^{k-1})^\top y^{k-1}} \quad\text{or}\quad \tau^{k,\mathrm{BB2}} = \frac{(s^{k-1})^\top y^{k-1}}{(y^{k-1})^\top y^{k-1}}.$$

Choosing $\alpha_k$: Armijo-like line search
$$\psi_\mu(x^k + \alpha_k d^k) \le C_k + \sigma\alpha_k\Delta_k$$
- FPC: $\Delta_k := (\nabla f^k)^\top d^k$
- FPC_AS: $\Delta_k := (\nabla f^k)^\top d^k + \mu\|x^k(\tau^k)\|_1 - \mu\|x^k\|_1$
- $C_k = (\eta Q_{k-1}C_{k-1} + \psi_\mu(x^k))/Q_k$, $Q_k = \eta Q_{k-1} + 1$, $C_0 = \psi_\mu(x^0)$ and $Q_0 = 1$ (Zhang and Hager)
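A small sketch of the two BB step-size formulas above; the safeguards `tau_min`/`tau_max` and the nonpositive-curvature fallback are illustrative additions, not part of the slides:

```python
import numpy as np

def bb_step(s, y, variant="BB1", tau_min=1e-12, tau_max=1e12):
    # s = x^k - x^{k-1}, y = grad f^k - grad f^{k-1}
    sy = float(s @ y)
    if sy <= 0.0:                 # fallback when curvature information is unusable
        return tau_max
    tau = float(s @ s) / sy if variant == "BB1" else sy / float(y @ y)
    return float(np.clip(tau, tau_min, tau_max))
```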


Outline: Complexity Analysis

- Amir Beck and Marc Teboulle, A fast iterative shrinkage thresholding algorithm for linear inverse problems
- Paul Tseng, On accelerated proximal gradient methods for convex-concave optimization
- Paul Tseng, Approximation accuracy, gradient methods and error bound for structured convex optimization
- N. S. Aybat and G. Iyengar, A first-order augmented Lagrangian method for compressed sensing

Proximal Gradient/ISTA/FPC

The following terminologies refer to the same algorithm:
- proximal gradient method
- ISTA: iterative shrinkage thresholding algorithm
- FPC: fixed-point continuation method

Consider
$$\min\ F(x) := \mu\|x\|_1 + \frac{1}{2}\|Ax - b\|_2^2$$

ISTA/FPC computes
$$\begin{aligned} x^{k+1} &:= \arg\min\ \mu\|x\|_1 + (\nabla f(x^k))^\top(x - x^k) + \frac{1}{2\tau}\|x - x^k\|_2^2 \\ &= \mathrm{shrink}(x^k - \tau\nabla f(x^k), \mu\tau) \end{aligned}$$

Complexity? How can we speed it up?

Generalization of ISTA

Consider a solvable model
$$\min\ F(x) := f(x) + r(x)$$
- $r(x)$ continuous convex, may be nonsmooth
- $f(x)$ continuously differentiable with Lipschitz continuous gradient:
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L_f\|x - y\|_2$$

The cost of solving the following problem is tractable:
$$\begin{aligned} p_L(y) &:= \arg\min_x\ Q_L(x, y) := f(y) + \langle\nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|_2^2 + r(x) \\ &= \arg\min_x\ r(x) + \frac{L}{2}\left\|x - \left(y - \frac{1}{L}\nabla f(y)\right)\right\|_2^2 \\ &= \operatorname{prox}_{r/L}\!\left(y - \frac{1}{L}\nabla f(y)\right) \end{aligned}$$

ISTA basic iteration: $x^k = p_L(x^{k-1})$

Quadratic upper bound

Suppose $\nabla f$ is Lipschitz continuous with parameter $L$ and $\operatorname{dom} f$ is convex. Then
$$f(y) \le f(x) + \nabla f(x)^\top(y - x) + \frac{L}{2}\|y - x\|_2^2, \quad \forall x, y.$$

Proof: Let $v = y - x$. Then
$$\begin{aligned} f(y) &= f(x) + \nabla f(x)^\top v + \int_0^1 (\nabla f(x + tv) - \nabla f(x))^\top v\, dt \\ &\le f(x) + \nabla f(x)^\top v + \int_0^1 \|\nabla f(x + tv) - \nabla f(x)\|_2\|v\|_2\, dt \\ &\le f(x) + \nabla f(x)^\top v + \int_0^1 Lt\|v\|_2^2\, dt \\ &= f(x) + \nabla f(x)^\top v + \frac{L}{2}\|y - x\|_2^2 \end{aligned}$$

The above inequality also implies that $F(p_L(y)) \le Q_L(p_L(y), y)$.

Lemma: Assume $F(p_L(y)) \le Q_L(p_L(y), y)$. Then for any $x \in \mathbb{R}^n$,
$$F(x) - F(p_L(y)) \ge \frac{L}{2}\|p_L(y) - y\|^2 + L\langle y - x, p_L(y) - y\rangle = \frac{L}{2}\left(\|p_L(y) - x\|_2^2 - \|x - y\|_2^2\right)$$

Proof: $\nabla f(y) + L(p_L(y) - y) + z = 0$ for some $z \in \partial r(p_L(y))$.

The convexity of $f$ and $r$ gives
$$f(x) \ge f(y) + \nabla f(y)^\top(x - y), \qquad r(x) \ge r(p_L(y)) + z^\top(x - p_L(y)).$$

Using the definition of $Q_L(p_L(y), y)$ and $z$,
$$F(x) - F(p_L(y)) \ge F(x) - Q_L(p_L(y), y) \ge -\frac{L}{2}\|p_L(y) - y\|^2 + (x - p_L(y))^\top(\nabla f(y) + z)$$

Complexity analysis of ISTA

Complexity result:
$$F(x^k) - F(x^*) \le \frac{L_f\|x^0 - x^*\|_2^2}{2k}$$

Let $x = x^*$, $y = x^j$ in the lemma; then
$$\frac{2}{L}\left(F(x^*) - F(x^{j+1})\right) \ge \|x^* - x^{j+1}\|_2^2 - \|x^* - x^j\|_2^2$$

Summing over $j = 0, \dots, k-1$ gives
$$\frac{2}{L}\left(kF(x^*) - \sum_{j=0}^{k-1}F(x^{j+1})\right) \ge \|x^* - x^k\|_2^2 - \|x^* - x^0\|_2^2$$

Letting $x = y = x^j$ yields $\frac{2}{L}\left(F(x^j) - F(x^{j+1})\right) \ge \|x^j - x^{j+1}\|_2^2$. Multiplying the last inequality by $j$ and summing over $j = 0, \dots, k-1$ gives
$$\frac{2}{L}\left(-kF(x^k) + \sum_{j=0}^{k-1}F(x^{j+1})\right) \ge \sum_{j=0}^{k-1} j\|x^j - x^{j+1}\|_2^2$$

Therefore
$$\frac{2k}{L}\left(F(x^*) - F(x^k)\right) \ge \|x^* - x^k\|_2^2 + \sum_{j=0}^{k-1} j\|x^j - x^{j+1}\|_2^2 - \|x^* - x^0\|_2^2$$

FISTA: Accelerated proximal gradient

Given $y^1 = x^0$ and $t_1 = 1$, compute:
$$\begin{aligned} x^k &= p_L(y^k) \\ t_{k+1} &= \frac{1 + \sqrt{1 + 4t_k^2}}{2} \\ y^{k+1} &= x^k + \frac{t_k - 1}{t_{k+1}}(x^k - x^{k-1}) \end{aligned}$$

Complexity result:
$$F(x^k) - F(x^*) \le \frac{2L_f\|x^0 - x^*\|_2^2}{(k+1)^2}$$

Lipschitz constant $L_f$ is unknown? Choose $L$ by backtracking so that $F(p_L(y^k)) \le Q_L(p_L(y^k), y^k)$.
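A minimal FISTA sketch for the $\ell_1$-regularized least squares problem, following the iteration above; taking $L = \|A\|_2^2$ and a fixed iteration count are illustrative choices (the backtracking variant just mentioned is omitted):

```python
import numpy as np

def shrink(y, nu):
    # soft thresholding, as defined earlier in the slides
    return np.sign(y) * np.maximum(np.abs(y) - nu, 0.0)

def fista_l1(A, b, mu, n_iter=500):
    """FISTA sketch for min mu*||x||_1 + 0.5*||Ax - b||_2^2."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient A^T(Ax - b)
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)
        x = shrink(y - grad / L, mu / L)            # x^k = p_L(y^k)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + (t - 1.0) / t_next * (x - x_prev)   # extrapolation step
        x_prev, t = x, t_next
    return x_prev
```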


Complexity analysis of FISTA

Let $v_k := F(x^k) - F(x^*)$ and $u^k := t_k x^k - (t_k - 1)x^{k-1} - x^*$. Then
$$\frac{2}{L}\left(t_k^2 v_k - t_{k+1}^2 v_{k+1}\right) \ge \|u^{k+1}\|_2^2 - \|u^k\|_2^2$$

- If $a_k - a_{k+1} \ge b_{k+1} - b_k$ with $a_1 + b_1 \le c$, then $a_k \le c$.
- $t_k \ge (k+1)/2$
- Let $a_k = \frac{2}{L}t_k^2 v_k$, $b_k = \|u^k\|_2^2$, and $c = \|x^0 - x^*\|_2^2$. Check $a_1 + b_1 \le c$:
$$\begin{aligned} F(x^*) - F(x^1) &= F(x^*) - F(p_L(y^1)) \\ &\ge \frac{L}{2}\|p_L(y^1) - y^1\|_2^2 + L\langle y^1 - x^*, p_L(y^1) - y^1\rangle \\ &= \frac{L}{2}\left(\|x^1 - x^*\|_2^2 - \|y^1 - x^*\|_2^2\right) \end{aligned}$$

Complexity analysis of FISTA

Apply the lemma with $(x = x^k, y = y^{k+1})$ and $(x = x^*, y = y^{k+1})$, and note $x^{k+1} = p_{L_{k+1}}(y^{k+1})$:
$$2L^{-1}(v_k - v_{k+1}) \ge \|x^{k+1} - y^{k+1}\|_2^2 + 2\langle x^{k+1} - y^{k+1}, y^{k+1} - x^k\rangle \qquad (1)$$
$$-2L^{-1}v_{k+1} \ge \|x^{k+1} - y^{k+1}\|_2^2 + 2\langle x^{k+1} - y^{k+1}, y^{k+1} - x^*\rangle \qquad (2)$$

Form $\left((1)\cdot(t_{k+1} - 1) + (2)\right)\cdot t_{k+1}$ and use $t_k^2 = t_{k+1}^2 - t_{k+1}$:
$$\frac{2t_{k+1}}{L}\left((t_{k+1} - 1)v_k - t_{k+1}v_{k+1}\right) = \frac{2}{L}\left(t_k^2 v_k - t_{k+1}^2 v_{k+1}\right)$$
$$\ge \|t_{k+1}(x^{k+1} - y^{k+1})\|_2^2 + 2t_{k+1}\langle x^{k+1} - y^{k+1},\ t_{k+1}y^{k+1} - (t_{k+1} - 1)x^k - x^*\rangle$$

Using the identity $\|b - a\|_2^2 + 2\langle b - a, a - c\rangle = \|b - c\|_2^2 - \|a - c\|_2^2$ and $y^{k+1} = x^k + \frac{t_k - 1}{t_{k+1}}(x^k - x^{k-1})$:
$$\begin{aligned} \frac{2}{L}\left(t_k^2 v_k - t_{k+1}^2 v_{k+1}\right) &\ge \|t_{k+1}x^{k+1} - (t_{k+1} - 1)x^k - x^*\|_2^2 - \|t_{k+1}y^{k+1} - (t_{k+1} - 1)x^k - x^*\|_2^2 \\ &= \|u^{k+1}\|_2^2 - \|u^k\|_2^2 \end{aligned}$$

Proximal gradient method

Consider the model
$$\min\ F(x) := f(x) + r(x)$$

Linearize $f(x)$:
$$\ell_f(x, y) := f(y) + \langle\nabla f(y), x - y\rangle + r(x)$$

Given a strictly convex function $h(x)$, the Bregman distance is
$$D(x, y) := h(x) - h(y) - \langle\nabla h(y), x - y\rangle.$$
For example, take $h(x) = \|x\|_2^2$. Then $D(x, y) = \|x - y\|_2^2$.

The proximal gradient method can also be written as
$$x^{k+1} := \arg\min_x\ \ell_f(x, x^k) + \frac{L}{2}D(x, x^k)$$

APG Variant 1

Accelerated proximal gradient (APG): set $x^{-1} = x^0$ and $\theta_{-1} = \theta_0 = 1$:
$$\begin{aligned} y^k &= x^k + \theta_k(\theta_{k-1}^{-1} - 1)(x^k - x^{k-1}) \\ x^{k+1} &= \arg\min_x\ \ell_f(x, y^k) + \frac{L}{2}\|x - y^k\|_2^2 \\ \theta_{k+1} &= \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2} \end{aligned}$$

Question: what is the difference between $\theta_k$ and $t_k$?

APG Variant 2

Replace $\frac{1}{2}\|x - y^k\|_2^2$ by the Bregman distance $D(x, y^k)$. Set $x^0$, $z^0$ and $\theta_0 = 1$:
$$\begin{aligned} y^k &= (1 - \theta_k)x^k + \theta_k z^k \\ z^{k+1} &= \arg\min_x\ \ell_f(x, y^k) + \theta_k L\, D(x, z^k) \\ x^{k+1} &= (1 - \theta_k)x^k + \theta_k z^{k+1} \\ \theta_{k+1} &= \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2} \end{aligned}$$

APG Variant 3

Set $x^0$, $z^0 := \arg\min h(x)$ and $\theta_0 = 1$:
$$\begin{aligned} y^k &= (1 - \theta_k)x^k + \theta_k z^k \\ z^{k+1} &= \arg\min_x\ \sum_{i=0}^{k}\frac{\ell_f(x, y^i)}{\theta_i} + L\, h(x) \\ x^{k+1} &= (1 - \theta_k)x^k + \theta_k z^{k+1} \\ \theta_{k+1} &= \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2} \end{aligned}$$

Complexity analysis

Proximal gradient method:
$$F(x^k) \le F(x) + \frac{1}{k}L\, D(x, x^0)$$
APG1:
$$F(x^k) \le F(x) + \frac{4}{(k+1)^2}L\, D(x, x^0)$$
APG2:
$$F(x^k) \le F(x) + \frac{4}{(k+1)^2}L\, D(x, z^0)$$
APG3:
$$F(x^k) \le F(x) + \frac{4}{(k+1)^2}L\,(h(x) - h(x^0))$$

Complexity analysis

Let $f(x)$ be convex and assume $\nabla f(x)$ is Lipschitz continuous. Then
$$F(x) \ge \ell_f(x, y) \ge F(x) - \frac{L}{2}\|x - y\|_2^2.$$

For any proper lsc convex function $\psi(x)$, if $z^+ = \arg\min_x\ \psi(x) + D(x, z)$ and $h(x)$ is differentiable at $z^+$, then
$$\psi(x) + D(x, z) \ge \psi(z^+) + D(x, z^+) + D(z^+, z).$$

Proof: The optimality at $z^+$ and the definition of the subgradient give
$$\psi(x) + \nabla_x D(z^+, z)^\top(x - z^+) \ge \psi(z^+).$$
Note $\nabla_x D(z^+, z) = \nabla h(z^+) - \nabla h(z)$. Rearranging terms yields
$$\psi(x) - \nabla h(z)^\top(x - z) \ge \psi(z^+) - \nabla h(z)^\top(z^+ - z) - \nabla h(z^+)^\top(x - z^+).$$
Adding $h(x) - h(z)$ to both sides gives the claim.

Complexity analysis of APG2

$$\begin{aligned} F(x^{k+1}) &\le \ell_f(x^{k+1}, y^k) + \frac{L}{2}\|x^{k+1} - y^k\|_2^2 \\ &= \ell_f((1 - \theta_k)x^k + \theta_k z^{k+1}, y^k) + \frac{L\theta_k^2}{2}\|z^{k+1} - z^k\|_2^2 \\ &\le (1 - \theta_k)\ell_f(x^k, y^k) + \theta_k\ell_f(z^{k+1}, y^k) + \theta_k^2 L\, D(z^{k+1}, z^k) \\ &\le (1 - \theta_k)\ell_f(x^k, y^k) + \theta_k\left(\ell_f(x, y^k) + \theta_k L\, D(x, z^k) - \theta_k L\, D(x, z^{k+1})\right) \\ &\le (1 - \theta_k)F(x^k) + \theta_k F(x) + \theta_k^2 L\, D(x, z^k) - \theta_k^2 L\, D(x, z^{k+1}) \end{aligned}$$

The third inequality applies the previous lemma to the subproblem $z^{k+1} = \arg\min_x\ \ell_f(x, y^k) + \theta_k L\, D(x, z^k)$.

Let $e_k = F(x^k) - F(x)$ and $\Delta_k = L\, D(x, z^k)$. Then
$$e_{k+1} \le (1 - \theta_k)e_k + \theta_k^2\Delta_k - \theta_k^2\Delta_{k+1}$$

Complexity analysis of APG2

Divide both sides by $\theta_k^2$ and use $\frac{1}{\theta_k^2} = \frac{1 - \theta_{k+1}}{\theta_{k+1}^2}$:
$$\frac{1 - \theta_{k+1}}{\theta_{k+1}^2}e_{k+1} + \Delta_{k+1} \le \frac{1 - \theta_k}{\theta_k^2}e_k + \Delta_k$$

Hence
$$\frac{1 - \theta_{k+1}}{\theta_{k+1}^2}e_{k+1} + \Delta_{k+1} \le \frac{1 - \theta_0}{\theta_0^2}e_0 + \Delta_0$$

Using $\frac{1}{\theta_k^2} = \frac{1 - \theta_{k+1}}{\theta_{k+1}^2}$ and $\theta_0 = 1$:
$$\frac{1}{\theta_k^2}e_{k+1} \le \Delta_0 - \Delta_{k+1} \le \Delta_0 = L\, D(x, z^0)$$

An application to Basis Pursuit problem

Consider
$$\min\ \|x\|_1 \quad \text{s.t. } Ax = b$$

Augmented Lagrangian (Bregman) framework ($b^1 = b$, $k = 1$):
$$x_k^* := \arg\min_x\ F_k(x) := \mu_k\|x\|_1 + \frac{1}{2}\|Ax - b^k\|_2^2$$
$$b^{k+1} := b + \frac{\mu_{k+1}}{\mu_k}(b^k - Ax_k^*)$$

- Obtaining $x_{k+1}^*$ exactly is difficult.
- Inexact solver: how do we control the accuracy?
- Analysis of the Bregman approach (see Wotao Yin's work), but no complexity result.
- Solve each $x_{k+1}^*$ subproblem by the APG algorithms?

Outline: ADMM

- Alternating direction augmented Lagrangian methods
- Variable splitting method
- Convergence for problems with two blocks of variables

References

- Wotao Yin, Stanley Osher, Donald Goldfarb, Jerome Darbon, Bregman Iterative Algorithms for l1-Minimization with Applications to Compressed Sensing
- Junfeng Yang, Yin Zhang, Alternating Direction Algorithms for l1-Problems in Compressed Sensing
- Tom Goldstein, Stanley Osher, The Split Bregman Method for L1-Regularized Problems
- B.S. He, H. Yang, S.L. Wang, Alternating Direction Method with Self-Adaptive Penalty Parameters for Monotone Variational Inequalities

Basis pursuit problem

Primal: $\min\ \|x\|_1$, s.t. $Ax = b$

Dual: $\max\ b^\top\lambda$, s.t. $\|A^\top\lambda\|_\infty \le 1$

The dual problem is equivalent to
$$\max\ b^\top\lambda, \quad \text{s.t. } A^\top\lambda = s,\ \|s\|_\infty \le 1.$$

Augmented Lagrangian (Bregman) framework

Augmented Lagrangian function:
$$L(\lambda, s, x) := -b^\top\lambda + x^\top(A^\top\lambda - s) + \frac{1}{2\mu}\|A^\top\lambda - s\|^2$$

Algorithmic framework:
- Compute $\lambda^{k+1}$ and $s^{k+1}$ at the $k$-th iteration:
$$(\mathrm{DL})\quad \min_{\lambda, s}\ L(\lambda, s, x^k), \quad \text{s.t. } \|s\|_\infty \le 1$$
- Update the Lagrange multiplier:
$$x^{k+1} = x^k + \frac{A^\top\lambda^{k+1} - s^{k+1}}{\mu}$$

Pros and cons:
- Pros: rich theory, well understood, and many algorithms
- Cons: $L(\lambda, s, x^k)$ is not separable in $\lambda$ and $s$, and the subproblem (DL) is difficult to minimize

An alternating direction minimization scheme

- Divide variables into different blocks according to their roles.
- Minimize the augmented Lagrangian function with respect to one block at a time while all other blocks are fixed.

ADMM:
$$\begin{aligned} \lambda^{k+1} &= \arg\min_\lambda\ L(\lambda, s^k, x^k) \\ s^{k+1} &= \arg\min_s\ L(\lambda^{k+1}, s, x^k), \quad \text{s.t. } \|s\|_\infty \le 1 \\ x^{k+1} &= x^k + \frac{A^\top\lambda^{k+1} - s^{k+1}}{\mu} \end{aligned}$$

An alternating direction minimization scheme

Explicit solutions:
$$\begin{aligned} \lambda^{k+1} &= (AA^\top)^{-1}\left(\mu(Ax^k - b) + As^k\right) \\ s^{k+1} &= \arg\min_s\ \|s - A^\top\lambda^{k+1} - \mu x^k\|^2, \quad \text{s.t. } \|s\|_\infty \le 1 \\ &= \mathcal{P}_{[-1,1]}(A^\top\lambda^{k+1} + \mu x^k) \\ x^{k+1} &= x^k + \frac{A^\top\lambda^{k+1} - s^{k+1}}{\mu} \end{aligned}$$
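A minimal sketch of these explicit updates (it assumes $A$ has full row rank so that $AA^\top$ is invertible; the fixed $\mu$ and the iteration count are illustrative):

```python
import numpy as np

def admm_basis_pursuit(A, b, mu=1.0, n_iter=500):
    """Dual ADMM sketch for min ||x||_1 s.t. Ax = b, following the updates above."""
    m, n = A.shape
    AAt = A @ A.T                                   # could be factorized once in practice
    x = np.zeros(n)
    s = np.zeros(n)
    for _ in range(n_iter):
        lam = np.linalg.solve(AAt, mu * (A @ x - b) + A @ s)   # lambda-update
        s = np.clip(A.T @ lam + mu * x, -1.0, 1.0)             # projection onto [-1, 1]^n
        x = x + (A.T @ lam - s) / mu                            # multiplier update
    return x
```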


ADMM for BP-denoising

Primal:
$$\min\ \|x\|_1, \quad \text{s.t. } \|Ax - b\|_2 \le \sigma$$
which is equivalent to
$$\min\ \|x\|_1, \quad \text{s.t. } Ax - b + r = 0,\ \|r\|_2 \le \sigma$$

Lagrangian function:
$$\begin{aligned} L(x, r, \lambda) &:= \|x\|_1 - \lambda^\top(Ax - b + r) + \pi(\|r\|_2 - \sigma) \\ &= \|x\|_1 - (A^\top\lambda)^\top x + \pi\|r\|_2 - \lambda^\top r + b^\top\lambda - \pi\sigma \end{aligned}$$

Hence, the dual problem is:
$$\max\ b^\top\lambda - \pi\sigma, \quad \text{s.t. } \|A^\top\lambda\|_\infty \le 1,\ \|\lambda\|_2 \le \pi$$

ADMM for BP-denoising

The dual problem is equivalent to:
$$\max\ b^\top\lambda - \pi\sigma, \quad \text{s.t. } A^\top\lambda = s,\ \|s\|_\infty \le 1,\ \|\lambda\|_2 \le \pi$$

The augmented Lagrangian function is:
$$L(\lambda, s, x) := -b^\top\lambda + \pi\sigma + x^\top(A^\top\lambda - s) + \frac{1}{2\mu}\|A^\top\lambda - s\|^2$$

ADMM scheme:
$$\begin{aligned} \lambda^{k+1} &= \arg\min_\lambda\ \frac{1}{2\mu}\|A^\top\lambda - s^k\|^2 + (Ax^k - b)^\top\lambda, \quad \text{s.t. } \|\lambda\|_2 \le \pi^k \\ s^{k+1} &= \arg\min_s\ \|s - A^\top\lambda^{k+1} - \mu x^k\|^2, \quad \text{s.t. } \|s\|_\infty \le 1 \\ &= \mathcal{P}_{[-1,1]}(A^\top\lambda^{k+1} + \mu x^k) \\ \pi^{k+1} &= \|\lambda^{k+1}\|_2 \\ x^{k+1} &= x^k + \frac{A^\top\lambda^{k+1} - s^{k+1}}{\mu} \end{aligned}$$

ADMM for $\ell_1$-regularized problem

Primal:
$$\min\ \mu\|x\|_1 + \frac{1}{2}\|Ax - b\|_2^2$$
which is equivalent to
$$\min\ \mu\|x\|_1 + \frac{1}{2}\|r\|_2^2, \quad \text{s.t. } Ax - b = r.$$

Lagrangian function:
$$\begin{aligned} L(x, r, \lambda) &:= \mu\|x\|_1 + \frac{1}{2}\|r\|_2^2 - \lambda^\top(Ax - b - r) \\ &= \mu\|x\|_1 - (A^\top\lambda)^\top x + \frac{1}{2}\|r\|_2^2 + \lambda^\top r + b^\top\lambda \end{aligned}$$

Hence, the dual problem is:
$$\max\ b^\top\lambda - \frac{1}{2}\|\lambda\|^2, \quad \text{s.t. } \|A^\top\lambda\|_\infty \le \mu$$

ADMM for $\ell_1$-regularized problem

The dual problem is equivalent to
$$\max\ b^\top\lambda - \frac{1}{2}\|\lambda\|^2, \quad \text{s.t. } A^\top\lambda = s,\ \|s\|_\infty \le \mu.$$

The augmented Lagrangian function is:
$$L(\lambda, s, x) := -b^\top\lambda + \frac{1}{2}\|\lambda\|^2 + x^\top(A^\top\lambda - s) + \frac{1}{2\mu}\|A^\top\lambda - s\|^2$$

ADMM scheme:
$$\begin{aligned} \lambda^{k+1} &= (AA^\top + \mu I)^{-1}\left(\mu(Ax^k - b) + As^k\right) \\ s^{k+1} &= \arg\min_s\ \|s - A^\top\lambda^{k+1} - \mu x^k\|^2, \quad \text{s.t. } \|s\|_\infty \le \mu \\ &= \mathcal{P}_{[-\mu,\mu]}(A^\top\lambda^{k+1} + \mu x^k) \\ x^{k+1} &= x^k + \frac{A^\top\lambda^{k+1} - s^{k+1}}{\mu} \end{aligned}$$
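The same pattern gives a sketch for this scheme; compared with the basis-pursuit case, only the $\mu I$ term in the linear system and the projection interval $[-\mu, \mu]$ change (parameters are again illustrative):

```python
import numpy as np

def admm_l1_ls_dual(A, b, mu, n_iter=500):
    """Dual ADMM sketch for min mu*||x||_1 + 0.5*||Ax - b||_2^2, following the updates above."""
    m, n = A.shape
    M = A @ A.T + mu * np.eye(m)
    x = np.zeros(n)
    s = np.zeros(n)
    for _ in range(n_iter):
        lam = np.linalg.solve(M, mu * (A @ x - b) + A @ s)   # lambda-update
        s = np.clip(A.T @ lam + mu * x, -mu, mu)             # projection onto [-mu, mu]^n
        x = x + (A.T @ lam - s) / mu                          # multiplier update
    return x
```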


YALL1

Derive ADMM for the following problems:
- BP: $\min_{x\in\mathbb{C}^n}\ \|Wx\|_{w,1}$, s.t. $Ax = b$
- L1/L1: $\min_{x\in\mathbb{C}^n}\ \|Wx\|_{w,1} + \frac{1}{\nu}\|Ax - b\|_1$
- L1/L2: $\min_{x\in\mathbb{C}^n}\ \|Wx\|_{w,1} + \frac{1}{2\rho}\|Ax - b\|_2^2$
- BP+: $\min_{x\in\mathbb{R}^n}\ \|x\|_{w,1}$, s.t. $Ax = b$, $x \ge 0$
- L1/L1+: $\min_{x\in\mathbb{R}^n}\ \|x\|_{w,1} + \frac{1}{\nu}\|Ax - b\|_1$, s.t. $x \ge 0$
- L1/L2+: $\min_{x\in\mathbb{R}^n}\ \|x\|_{w,1} + \frac{1}{2\rho}\|Ax - b\|_2^2$, s.t. $x \ge 0$

Here $\nu, \rho \ge 0$, $A \in \mathbb{C}^{m\times n}$, $b \in \mathbb{C}^m$, $x \in \mathbb{C}^n$ for the first three and $x \in \mathbb{R}^n$ for the last three, $W \in \mathbb{C}^{n\times n}$ is a unitary matrix serving as a sparsifying basis, and $\|x\|_{w,1} := \sum_{i=1}^n w_i|x_i|$.

Variable splitting

Given $A \in \mathbb{R}^{m\times n}$, consider $\min\ f(x) + g(Ax)$, which is
$$\min\ f(x) + g(y), \quad \text{s.t. } Ax = y$$

Augmented Lagrangian function:
$$L(x, y, \lambda) = f(x) + g(y) - \lambda^\top(Ax - y) + \frac{1}{2\mu}\|Ax - y\|_2^2$$

ADMM:
$$\begin{aligned} (P_x):\quad x^{k+1} &:= \arg\min_{x\in X}\ L(x, y^k, \lambda^k), \\ (P_y):\quad y^{k+1} &:= \arg\min_{y\in Y}\ L(x^{k+1}, y, \lambda^k), \\ (P_\lambda):\quad \lambda^{k+1} &:= \lambda^k - \gamma\,\frac{Ax^{k+1} - y^{k+1}}{\mu} \end{aligned}$$

Variable splitting

Split Bregman (Goldstein and Osher) for anisotropic TV:
$$\min\ \alpha\|Du\|_1 + \beta\|\Psi u\|_1 + \frac{1}{2}\|Au - f\|_2^2$$

Introduce $y = Du$ and $w = \Psi u$ to obtain
$$\min\ \alpha\|y\|_1 + \beta\|w\|_1 + \frac{1}{2}\|Au - f\|_2^2, \quad \text{s.t. } y = Du,\ w = \Psi u$$

Augmented Lagrangian function:
$$\begin{aligned} L := \alpha\|y\|_1 + \beta\|w\|_1 + \frac{1}{2}\|Au - f\|_2^2 &- p^\top(Du - y) + \frac{1}{2\mu}\|Du - y\|_2^2 \\ &- q^\top(\Psi u - w) + \frac{1}{2\mu}\|\Psi u - w\|_2^2 \end{aligned}$$

Variable splitting

The variable $u$ can be obtained by solving
$$\left(A^\top A + \frac{1}{\mu}(D^\top D + I)\right)u = A^\top f + \frac{1}{\mu}(D^\top y + \Psi^\top w) + D^\top p + \Psi^\top q$$

- If $A$ and $D$ are diagonalizable by FFT, the computational cost is very cheap; for example, $A = RF$, where both $R$ and $D$ are circulant matrices.
- Variables $y$ and $w$:
$$y := \mathrm{shrink}(Du - \mu p, \alpha\mu), \qquad w := \mathrm{shrink}(\Psi u - \mu q, \beta\mu)$$
- Apply a few iterations before updating the Lagrange multipliers $p$ and $q$.

Exercise: isotropic TV
$$\min\ \alpha\|Du\|_2 + \beta\|\Psi u\|_1 + \frac{1}{2}\|Au - f\|_2^2$$

FTVd: Fast TV deconvolution

Wang-Yang-Yin-Zhang consider:
$$\min_u\ \sum_i\|D_i u\|_2 + \frac{1}{2\mu}\|Ku - f\|_2^2$$

Introducing $w$ and a quadratic penalty:
$$\min_{u,w}\ \sum_i\left(\|w_i\|_2 + \frac{1}{2\beta}\|w_i - D_i u\|_2^2\right) + \frac{1}{2\mu}\|Ku - f\|_2^2$$

Alternating minimization:
- For fixed $u$, $\{w_i\}$ can be solved by shrinkage at cost $O(N)$
- For fixed $\{w_i\}$, $u$ can be solved by FFT at cost $O(N\log N)$

Outline: Linearized ADMM

- Linearized Bregman and Bregmanized operator splitting
- ADMM + proximal point method
- Xiaoqun Zhang, Martin Burger, Stanley Osher, A unified primal-dual algorithm framework based on Bregman iteration

Review of Bregman method

Consider BP:
$$\min\ \|x\|_1, \quad \text{s.t. } Ax = b$$

Bregman method:
$$\begin{aligned} D_J^{p^k}(x, x^k) &:= \|x\|_1 - \|x^k\|_1 - \langle p^k, x - x^k\rangle \\ x^{k+1} &:= \arg\min_x\ \mu D_J^{p^k}(x, x^k) + \frac{1}{2}\|Ax - b\|_2^2 \\ p^{k+1} &= p^k + \frac{1}{\mu}A^\top(b - Ax^{k+1}) \end{aligned}$$

Augmented Lagrangian (updating the multiplier or $b$):
$$x^{k+1} := \arg\min_x\ \mu\|x\|_1 + \frac{1}{2}\|Ax - b^k\|_2^2, \qquad b^{k+1} = b + (b^k - Ax^{k+1})$$

They are equivalent; see Yin-Osher-Goldfarb-Darbon.

Linearized approaches

Linearized Bregman method:
$$\begin{aligned} x^{k+1} &:= \arg\min\ \mu D_J^{p^k}(x, x^k) + (A^\top(Ax^k - b))^\top(x - x^k) + \frac{1}{2\delta}\|x - x^k\|_2^2, \\ p^{k+1} &:= p^k - \frac{1}{\mu\delta}(x^{k+1} - x^k) - \frac{1}{\mu}A^\top(Ax^k - b), \end{aligned}$$
which is equivalent to
$$x^{k+1} := \arg\min\ \mu\|x\|_1 + \frac{1}{2\delta}\|x - v^k\|_2^2, \qquad v^{k+1} := v^k - \delta A^\top(Ax^{k+1} - b)$$

Bregmanized operator splitting:
$$x^{k+1} := \arg\min\ \mu\|x\|_1 + (A^\top(Ax^k - b^k))^\top(x - x^k) + \frac{1}{2\delta}\|x - x^k\|_2^2, \qquad b^{k+1} = b + (b^k - Ax^{k+1})$$

Are they equivalent?

Linearized approaches

Linearized Bregman method:
$$\begin{aligned} x^{k+1} &:= \arg\min\ \mu D_J^{p^k}(x, x^k) + (A^\top(Ax^k - b))^\top(x - x^k) + \frac{1}{2\delta}\|x - x^k\|_2^2, \\ p^{k+1} &:= p^k - \frac{1}{\mu\delta}(x^{k+1} - x^k) - \frac{1}{\mu}A^\top(Ax^k - b), \end{aligned}$$
which is equivalent to
$$x^{k+1} := \mathrm{shrink}(v^k, \mu\delta), \qquad v^{k+1} := v^k - \delta A^\top(Ax^{k+1} - b)$$
or
$$x^{k+1} := \mathrm{shrink}(\delta A^\top b^k, \mu\delta), \qquad b^{k+1} := b + (b^k - Ax^{k+1})$$

Bregmanized operator splitting:
$$x^{k+1} := \mathrm{shrink}(x^k - \delta A^\top(Ax^k - b^k), \mu\delta) = \mathrm{shrink}(\delta A^\top b^k + x^k - \delta A^\top Ax^k, \mu\delta), \qquad b^{k+1} = b + (b^k - Ax^{k+1})$$
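A minimal sketch of the linearized Bregman iteration in its $x$/$v$ form above; the step size $\delta$ and the iteration count are illustrative (convergence requires a suitable bound on $\delta$ related to $\|A^\top A\|$):

```python
import numpy as np

def shrink(y, nu):
    # soft thresholding, as defined earlier in the slides
    return np.sign(y) * np.maximum(np.abs(y) - nu, 0.0)

def linearized_bregman(A, b, mu, delta, n_iter=1000):
    """Linearized Bregman sketch: x^{k+1} = shrink(v^k, mu*delta),
    v^{k+1} = v^k - delta * A^T (A x^{k+1} - b)."""
    v = np.zeros(A.shape[1])
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = shrink(v, mu * delta)
        v = v - delta * (A.T @ (A @ x - b))
    return x
```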


Linearized approaches

Linearized Bregman:
- If the sequence $x^k$ converges and $p^k$ is bounded, then the limit of $x^k$ is the unique solution of
$$\min\ \mu\|x\|_1 + \frac{1}{2\delta}\|x\|_2^2 \quad \text{s.t. } Ax = b.$$
- For $\mu$ large enough, the limit solution solves BP.
- Exact regularization if $\delta > \bar\delta$.

What about Bregmanized operator splitting?

Primal ADMM for $\ell_1$-regularized problem

Primal: $\min\ \mu\|x\|_1 + \frac{1}{2}\|Ax - b\|_2^2$, which is equivalent to
$$\min\ \mu\|x\|_1 + \frac{1}{2}\|r\|_2^2, \quad \text{s.t. } Ax - b = r.$$

Augmented Lagrangian function:
$$L(x, r, \lambda) = \mu\|x\|_1 + \frac{1}{2}\|r\|_2^2 - \lambda^\top(Ax - b - r) + \frac{1}{2\delta}\|Ax - b - r\|_2^2$$

ADMM scheme:
$$\begin{aligned} x^{k+1} &= \arg\min_x\ \mu\|x\|_1 + \frac{1}{2\delta}\|Ax - b - r^k - \delta\lambda^k\|_2^2 \quad \text{(a problem of the same form as the original)} \\ r^{k+1} &= \arg\min_r\ \frac{1}{2}\|r\|_2^2 + \frac{1}{2\delta}\|Ax^{k+1} - b - r - \delta\lambda^k\|_2^2 \\ \lambda^{k+1} &= \lambda^k - \frac{Ax^{k+1} - b - r^{k+1}}{\delta} \end{aligned}$$

Primal ADMM for $\ell_1$-regularized problem

Primal: $\min\ \mu\|x\|_1 + \frac{1}{2}\|Ax - b\|_2^2$, which is equivalent to
$$\min\ \mu\|x\|_1 + \frac{1}{2}\|r\|_2^2, \quad \text{s.t. } Ax - b = r.$$

Augmented Lagrangian function:
$$L(x, r, \lambda) = \mu\|x\|_1 + \frac{1}{2}\|r\|_2^2 - \lambda^\top(Ax - b - r) + \frac{1}{2\delta}\|Ax - b - r\|_2^2$$

ADMM scheme with a linearized $x$-subproblem:
$$\begin{aligned} x^{k+1} &= \arg\min_x\ \mu\|x\|_1 + (g^k)^\top(x - x^k) + \frac{1}{2\tau}\|x - x^k\|_2^2 \\ r^{k+1} &= \arg\min_r\ \frac{1}{2}\|r\|_2^2 + \frac{1}{2\delta}\|Ax^{k+1} - b - r - \delta\lambda^k\|_2^2 \\ \lambda^{k+1} &= \lambda^k - \frac{Ax^{k+1} - b - r^{k+1}}{\delta} \end{aligned}$$

Convergence of the linearized scheme?
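A minimal sketch of this linearized scheme, under the assumption that $g^k$ (left unspecified on the slide) is the gradient at $x^k$ of the augmented quadratic term, i.e. $g^k = \frac{1}{\delta}A^\top(Ax^k - b - r^k - \delta\lambda^k)$; the parameters $\tau$, $\delta$ and the iteration count are illustrative:

```python
import numpy as np

def shrink(y, nu):
    # soft thresholding, as defined earlier in the slides
    return np.sign(y) * np.maximum(np.abs(y) - nu, 0.0)

def linearized_admm_l1(A, b, mu, delta, tau, n_iter=500):
    """Linearized primal ADMM sketch for min mu*||x||_1 + 0.5*||Ax - b||_2^2."""
    m, n = A.shape
    x = np.zeros(n)
    r = np.zeros(m)
    lam = np.zeros(m)
    for _ in range(n_iter):
        g = A.T @ (A @ x - b - r - delta * lam) / delta   # assumed choice of g^k
        x = shrink(x - tau * g, mu * tau)                 # linearized x-update
        r = (A @ x - b - delta * lam) / (1.0 + delta)     # closed-form r-update
        lam = lam - (A @ x - b - r) / delta               # multiplier update
    return x
```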


Generalized algorithm

Consider
$$\min\ f(x), \quad \text{s.t. } Ax = b$$

Proximal-point method:
$$\begin{aligned} x^{k+1} &:= \arg\min_x\ f(x) - \langle\lambda^k, Ax - b\rangle + \frac{1}{2\delta}\|Ax - b\|^2 + \frac{1}{2}\|x - x^k\|_Q^2 \\ C\lambda^{k+1} &:= C\lambda^k + (b - Ax^{k+1}) \end{aligned}$$

- Augmented Lagrangian or Bregman method if $Q = 0$ and $C = \delta$
- Proximal point method by Rockafellar if $Q = I$ and $C = \gamma$
- Bregmanized operator splitting if $Q = \frac{1}{\delta}(I - A^\top A)$

Theoretical results: $\lim_k\|Ax^k - b\|_2 = 0$, $\lim_k f(x^k) = f(\bar x)$, and all limit points of $(x^k, \lambda^k)$ are saddle points.

Convergence proof

From the $x$-subproblem:
$$\partial f(x^{k+1}) \ni s^{k+1} := A^\top\lambda^k - \frac{1}{\delta}A^\top(Ax^{k+1} - b) - Q(x^{k+1} - x^k)$$

Assume $(\bar x, \bar\lambda)$ is a saddle point; then $A\bar x = b$ and $\bar s - A^\top\bar\lambda = 0$. Let $s_e^{k+1} = s^{k+1} - \bar s$, $x_e^{k+1} = x^{k+1} - \bar x$ and $\lambda_e^{k+1} = \lambda^{k+1} - \bar\lambda$:
$$s_e^{k+1} + \frac{1}{\delta}A^\top Ax_e^{k+1} + Qx_e^{k+1} = Qx_e^k + A^\top\lambda_e^k, \qquad C\lambda_e^{k+1} = C\lambda_e^k - Ax_e^{k+1}$$

Taking the inner product with $x_e^{k+1}$ on both sides of the first equality:
$$\frac{1}{2}\left(\|x_e^{k+1}\|_Q^2 + \|x^{k+1} - x^k\|_Q^2 - \|x_e^k\|_Q^2\right) = -\langle s_e^{k+1}, x_e^{k+1}\rangle - \frac{1}{\delta}\|Ax_e^{k+1}\|_2^2 + \langle\lambda_e^k, Ax_e^{k+1}\rangle$$

Taking the inner product with $\lambda_e^{k+1}$ on both sides of the second equality:
$$\frac{1}{2}\left(\|\lambda_e^{k+1}\|_C^2 - \|\lambda_e^k\|_C^2 - \|Ax_e^{k+1}\|_{C^{-1}}^2\right) = -\langle\lambda_e^k, Ax_e^{k+1}\rangle$$

Adding (twice) the two identities above (where $w^k = (x^k, \lambda^k)$):
$$\|w_e^{k+1}\|_Q^2 + \|x^{k+1} - x^k\|_Q^2 + 2\langle s_e^{k+1}, x_e^{k+1}\rangle + \frac{2}{\delta}\|Ax_e^{k+1}\|_2^2 - \|Ax_e^{k+1}\|_{C^{-1}}^2 = \|w_e^k\|_Q^2$$

Convergence proof

The convexity of $f(x)$ gives $\langle s_e^{k+1}, x_e^{k+1}\rangle \ge 0$. If $\frac{2}{\delta} > \frac{1}{\lambda_{\min}(C)}$ (with $\lambda_{\min}(C)$ the smallest eigenvalue of $C$), then
$$2\langle s_e^{k+1}, x_e^{k+1}\rangle + \frac{2}{\delta}\|Ax_e^{k+1}\|_2^2 - \|Ax_e^{k+1}\|_{C^{-1}}^2 \ge 0$$

Hence $\|w_e^{k+1}\|_Q^2 \le \|w_e^k\|_Q^2$, which implies the boundedness of $w^k = (x^k, \lambda^k)$. Furthermore,
$$\sum_k\left\{2\langle s_e^{k+1}, x_e^{k+1}\rangle + \|x^{k+1} - x^k\|_Q^2 + \left(\frac{2}{\delta}\|Ax_e^{k+1}\|_2^2 - \|Ax_e^{k+1}\|_{C^{-1}}^2\right)\right\} \le \|w_e^0\|_Q^2 < \infty$$

Hence $0 = \lim_k\|Ax_e^{k+1}\|_2^2 = \lim_k\|Ax^k - b\|$, and $\lim_k s^k - A^\top\lambda^k = 0$ follows from
$$s^{k+1} := A^\top\lambda^k - \frac{1}{\delta}A^\top(Ax^{k+1} - b) - Q(x^{k+1} - x^k)$$

Generalized algorithm

Consider
$$\min\ f(x) + g(z), \quad \text{s.t. } Bx = z$$

Proximal-point method:
$$\begin{aligned} x^{k+1} &:= \arg\min_x\ f(x) - \langle\lambda^k, Bx\rangle + \frac{1}{2\delta}\|Bx - z^k\|^2 + \frac{1}{2}\|x - x^k\|_Q^2 \\ z^{k+1} &:= \arg\min_z\ g(z) - \langle\lambda^k, z\rangle + \frac{1}{2\delta}\|Bx^{k+1} - z\|^2 + \frac{1}{2}\|z - z^k\|_M^2 \\ C\lambda^{k+1} &:= C\lambda^k + (Bx^{k+1} - z^{k+1}) \end{aligned}$$

- ADMM if $Q = M = 0$ and $C = \delta$
- Proximal point method by Rockafellar if $Q = I$ and $C = \gamma$
- Bregmanized operator splitting if $Q = \frac{1}{\delta}(I - A^\top A)$

Theoretical results: $\lim_k\|Bx^k - z^k\|_2 = 0$, $\lim_k f(x^k) = f(\bar x)$, $\lim_k g(z^k) = g(\bar z)$, and all limit points of $(x^k, z^k, \lambda^k)$ are saddle points.