L. Vandenberghe EE236C (Spring 2016)

13. Douglas-Rachford method and ADMM

• Douglas-Rachford splitting method

• examples

• alternating direction method of multipliers

• image deblurring example

• convergence

13-1


Douglas-Rachford splitting algorithm

minimize f(x) + g(x)

f and g are closed convex functions

Douglas-Rachford iteration: start at any y^{(0)} and repeat for k = 1, 2, . . .:

x^{(k)} = prox_f(y^{(k-1)})

y^{(k)} = y^{(k-1)} + prox_g(2x^{(k)} − y^{(k-1)}) − x^{(k)}

• useful when f and g have inexpensive prox-operators

• x^{(k)} converges to a solution of 0 ∈ ∂f(x) + ∂g(x) (if a solution exists)

• not symmetric in f and g
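As a concrete illustration, here is a minimal NumPy sketch of the iteration for a toy problem, minimize ‖x‖_1 + (1/2)‖x − b‖_2^2, chosen only because both prox-operators have closed forms (the problem instance and helper names are illustrative, not from the slides):

    import numpy as np

    def prox_l1(y, t):
        # prox of t*||.||_1: entrywise soft-thresholding
        return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

    def prox_quad(y, t, b):
        # prox of t*(1/2)*||. - b||_2^2: closed form (y + t*b)/(1 + t)
        return (y + t * b) / (1.0 + t)

    # Douglas-Rachford for minimize f(x) + g(x),
    # with f = ||.||_1 and g = (1/2)||. - b||_2^2
    b = np.array([3.0, -0.5, 0.2])
    t = 1.0
    y = np.zeros_like(b)
    for k in range(100):
        x = prox_l1(y, t)                       # x-update
        y = y + prox_quad(2 * x - y, t, b) - x  # y-update
    print(x)  # approaches soft-thresholding of b at level 1: [2, 0, 0]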

Douglas-Rachford method and ADMM 13-2


Douglas-Rachford iteration as fixed-point iteration

• iteration on page 13-2 can be written as fixed-point iteration

y^{(k)} = F(y^{(k-1)})

where F(y) = y + prox_g(2 prox_f(y) − y) − prox_f(y)

• y is a fixed point of F if and only if x = prox_f(y) satisfies 0 ∈ ∂f(x) + ∂g(x):

y = F(y)

⇕

0 ∈ ∂f(prox_f(y)) + ∂g(prox_f(y))

(proof on next page)

Douglas-Rachford method and ADMM 13-3


Proof.

x = prox_f(y),  y = F(y)

⇕

x = prox_f(y),  x = prox_g(2x − y)

⇕

y − x ∈ ∂f(x),  x − y ∈ ∂g(x)

• therefore, if y = F(y), then x = prox_f(y) satisfies

0 = (y − x) + (x − y) ∈ ∂f(x) + ∂g(x)

• conversely, if −z ∈ ∂f(x) and z ∈ ∂g(x), then y = x − z is a fixed point of F

Douglas-Rachford method and ADMM 13-4


Equivalent form of DR algorithm

• start iteration on page 13-2 at y-update and renumber iterates

y^{(k)} = y^{(k-1)} + prox_g(2x^{(k-1)} − y^{(k-1)}) − x^{(k-1)}

x^{(k)} = prox_f(y^{(k)})

• switch y- and x-updates

u^{(k)} = prox_g(2x^{(k-1)} − y^{(k-1)})

x^{(k)} = prox_f(y^{(k-1)} + u^{(k)} − x^{(k-1)})

y^{(k)} = y^{(k-1)} + u^{(k)} − x^{(k-1)}

• make the change of variables w^{(k)} = x^{(k)} − y^{(k)}

u^{(k)} = prox_g(x^{(k-1)} + w^{(k-1)})

x^{(k)} = prox_f(u^{(k)} − w^{(k-1)})

w^{(k)} = w^{(k-1)} + x^{(k)} − u^{(k)}
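A one-step sketch of this last (u, x, w) form, with prox_f and prox_g passed in as callables (hypothetical helper name, assuming each callable evaluates the indicated prox-operator):

    def dr_step(prox_f, prox_g, x, w):
        # one Douglas-Rachford step in the (u, x, w) variables
        u = prox_g(x + w)
        x_new = prox_f(u - w)
        w_new = w + x_new - u
        return x_new, w_new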

Douglas-Rachford method and ADMM 13-5


Scaling

algorithm applied to cost function scaled by t > 0

minimize tf(x) + tg(x)

• algorithm of page 13-2

x^{(k)} = prox_{tf}(y^{(k-1)})

y^{(k)} = y^{(k-1)} + prox_{tg}(2x^{(k)} − y^{(k-1)}) − x^{(k)}

• algorithm of page 13-5

u^{(k)} = prox_{tg}(x^{(k-1)} + w^{(k-1)})

x^{(k)} = prox_{tf}(u^{(k)} − w^{(k-1)})

w^{(k)} = w^{(k-1)} + x^{(k)} − u^{(k)}

• the algorithm is not invariant with respect to scaling

• in theory, t can be any positive constant; several heuristics exist for adapting t

Douglas-Rachford method and ADMM 13-6


Douglas-Rachford iteration with relaxation

• fixed-point iteration with relaxation

y^{(k)} = y^{(k-1)} + ρ_k (F(y^{(k-1)}) − y^{(k-1)})

1 < ρ_k < 2 is overrelaxation; 0 < ρ_k < 1 is underrelaxation

• algorithm of page 13-2 with relaxation

x^{(k)} = prox_f(y^{(k-1)})

y^{(k)} = y^{(k-1)} + ρ_k (prox_g(2x^{(k)} − y^{(k-1)}) − x^{(k)})

• algorithm of page 13-5

u^+ = prox_g(x + w)

x^+ = prox_f(x + ρ(u^+ − x) − w)

w^+ = w + x^+ − x + ρ(x − u^+)
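A one-step sketch of the relaxed iteration in the (u, x, w) variables (hypothetical helper name; ρ = 1 recovers the unrelaxed step of page 13-5):

    def dr_step_relaxed(prox_f, prox_g, x, w, rho):
        # one relaxed Douglas-Rachford step; 0 < rho < 2
        u = prox_g(x + w)
        x_new = prox_f(x + rho * (u - x) - w)
        w_new = w + x_new - x + rho * (x - u)
        return x_new, w_new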

Douglas-Rachford method and ADMM 13-7


Primal-dual formulation

primal: minimize f(x) + g(x)

dual: maximize −g*(z) − f*(−z)

• use Moreau decomposition to simplify step 2 of DR iteration (page 13-2):

x^{(k)} = prox_f(y^{(k-1)})

y^{(k)} = x^{(k)} − prox_{g*}(2x^{(k)} − y^{(k-1)})

• make the change of variables z^{(k)} = x^{(k)} − y^{(k)}:

x^{(k)} = prox_f(x^{(k-1)} − z^{(k-1)})

z^{(k)} = prox_{g*}(z^{(k-1)} + 2x^{(k)} − x^{(k-1)})
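In an implementation, prox_{g*} need not be coded separately: by the extended Moreau decomposition, prox_{tg*}(y) = y − t · prox_{t^{-1}g}(y/t). A sketch, assuming a callable prox_g(v, s) that evaluates prox_{sg}(v) (hypothetical helper name):

    def prox_conj(prox_g, y, t=1.0):
        # extended Moreau decomposition: prox_{t g*}(y) = y - t * prox_{g/t}(y/t)
        return y - t * prox_g(y / t, 1.0 / t)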

Douglas-Rachford method and ADMM 13-8


Outline

• Douglas-Rachford splitting method

• examples

• alternating direction method of multipliers

• image deblurring example

• convergence


Sparse inverse covariance selection

minimize tr(CX) − log det X + γ Σ_{i>j} |X_{ij}|

variable is X ∈ S^n; parameters C ∈ S^n_{++} and γ > 0 are given

Douglas-Rachford splitting

f(X) = tr(CX) − log det X,    g(X) = γ Σ_{i>j} |X_{ij}|

• X = prox_{tf}(X̂) is the positive definite solution of C − X^{−1} + (1/t)(X − X̂) = 0,
easily solved via an eigenvalue decomposition of X̂ − tC (see homework, and the sketch below)

• X = prox_{tg}(X̂) is soft-thresholding
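Spelling out the eigenvalue computation: multiplying the optimality condition by t gives X − tX^{−1} = X̂ − tC, so if X̂ − tC = Q diag(d) Q^T, then X = Q diag(x) Q^T where each x_i is the positive root of x² − d_i x − t = 0. A NumPy sketch (function name hypothetical):

    import numpy as np

    def prox_logdet(Xhat, C, t):
        # prox of t*(tr(CX) - log det X): solve X - t*inv(X) = Xhat - t*C
        d, Q = np.linalg.eigh(Xhat - t * C)
        x = (d + np.sqrt(d**2 + 4 * t)) / 2   # positive roots of x**2 - d*x - t = 0
        return (Q * x) @ Q.T                  # Q @ diag(x) @ Q.T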

Douglas-Rachford method and ADMM 13-9


Spingarn’s method of partial inverses

Equality constrained convex problem (f closed and convex; V a subspace)

minimize f(x)
subject to x ∈ V

Spingarn’s method: Douglas-Rachford splitting with g = δ_V (indicator of V)

x^{(k)} = prox_{tf}(y^{(k-1)})

y^{(k)} = y^{(k-1)} + P_V(2x^{(k)} − y^{(k-1)}) − x^{(k)}

Primal-dual form (algorithm of page 13-8):

x^{(k)} = prox_{tf}(x^{(k-1)} − z^{(k-1)})

z^{(k)} = P_{V⊥}(z^{(k-1)} + 2x^{(k)} − x^{(k-1)})
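A sketch of this primal-dual form, with prox_{tf} and the projection P_V supplied as callables (hypothetical names), using P_{V⊥}(v) = v − P_V(v):

    def spingarn_step(prox_tf, P_V, x, z):
        # one step of Spingarn's method in primal-dual form
        x_new = prox_tf(x - z)
        v = z + 2 * x_new - x
        z_new = v - P_V(v)   # projection onto the orthogonal complement of V
        return x_new, z_new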

Douglas-Rachford method and ADMM 13-10


Application to composite optimization problem

minimize f_1(x) + f_2(Ax)

f_1 and f_2 have simple prox-operators

• the problem is equivalent to minimizing f(x_1, x_2) over the subspace V, where

f(x_1, x_2) = f_1(x_1) + f_2(x_2),    V = {(x_1, x_2) | x_2 = Ax_1}

• prox_{tf} is separable:

prox_{tf}(x_1, x_2) = (prox_{tf_1}(x_1), prox_{tf_2}(x_2))

• projection of (x_1, x_2) on V reduces to a linear equation (stacked blocks written [·; ·]):

P_V(x_1, x_2) = [I; A] (I + A^T A)^{−1} (x_1 + A^T x_2)

             = [x_1; x_2] + [A^T; −I] (I + AA^T)^{−1} (x_2 − Ax_1)
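A sketch of the first expression for a dense NumPy matrix A (for wide A, the second expression with the smaller Gram matrix I + AA^T would be preferred):

    import numpy as np

    def project_V(x1, x2, A):
        # projection of (x1, x2) onto V = {(x1, x2) : x2 = A @ x1}
        n = A.shape[1]
        u = np.linalg.solve(np.eye(n) + A.T @ A, x1 + A.T @ x2)
        return u, A @ u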

Douglas-Rachford method and ADMM 13-11


Decomposition of separable problems

minimize Σ_{j=1}^n f_j(x_j) + Σ_{i=1}^m g_i(A_{i1}x_1 + · · · + A_{in}x_n)

• same problem as page 12-17, but without strong convexity assumption

• we assume the functions fj and gi have inexpensive prox-operators

Equivalent formulation

minimize Σ_{j=1}^n f_j(x_j) + Σ_{i=1}^m g_i(y_{i1} + · · · + y_{in})
subject to y_{ij} = A_{ij}x_j,  i = 1, . . . , m,  j = 1, . . . , n

• prox-operator of the first term requires evaluations of prox_{tf_j} for j = 1, . . . , n

• prox-operator of the second term requires prox_{ntg_i} for i = 1, . . . , m (see page 8-8)

• projection on constraint set reduces to n independent linear equations

Douglas-Rachford method and ADMM 13-12


Decomposition of separable problems

Second equivalent formulation: introduce extra splitting variables xij

minimize Σ_{j=1}^n f_j(x_j) + Σ_{i=1}^m g_i(y_{i1} + · · · + y_{in})
subject to x_{ij} = x_j,   i = 1, . . . , m,  j = 1, . . . , n
           y_{ij} = A_{ij}x_{ij},  i = 1, . . . , m,  j = 1, . . . , n

• make the first set of constraints part of the domain of f̃_j:

f̃_j(x_j, x_{1j}, . . . , x_{mj}) = f_j(x_j) if x_{ij} = x_j for i = 1, . . . , m, and +∞ otherwise

the prox-operator of f̃_j reduces to the prox-operator of f_j

• projection on the other constraints involves mn independent linear equations

Douglas-Rachford method and ADMM 13-13


Outline

• Douglas-Rachford splitting method

• examples

• alternating direction method of multipliers

• image deblurring example

• convergence


Dual application of Douglas-Rachford method

Separable convex problem

minimize f_1(x_1) + f_2(x_2)
subject to A_1x_1 + A_2x_2 = b

Dual problem

maximize −b^T z − f_1^*(−A_1^T z) − f_2^*(−A_2^T z)

we apply the Douglas-Rachford method (page 13-5) to minimize

g(z) + f(z),    where g(z) = b^T z + f_1^*(−A_1^T z) and f(z) = f_2^*(−A_2^T z)

Douglas-Rachford method and ADMM 13-14


Douglas-Rachford applied to the dual

u^+ = prox_{tg}(z + w),    z^+ = prox_{tf}(u^+ − w),    w^+ = w + z^+ − u^+

First line: use the result on page 10-7 to compute u^+ = prox_{tg}(z + w)

x_1 = argmin_{x_1} ( f_1(x_1) + z^T(A_1x_1 − b) + (t/2)‖A_1x_1 − b + w/t‖_2^2 )

u^+ = z + w + t(A_1x_1 − b)

Second line: similarly, compute z^+ = prox_{tf}(z + t(A_1x_1 − b))

x_2 = argmin_{x_2} ( f_2(x_2) + z^T A_2x_2 + (t/2)‖A_1x_1 + A_2x_2 − b‖_2^2 )

z^+ = z + t(A_1x_1 + A_2x_2 − b)

Third line reduces to w^+ = tA_2x_2

Douglas-Rachford method and ADMM 13-15


Alternating direction method of multipliers (ADMM)

1. minimize the augmented Lagrangian over x_1

x_1^{(k)} = argmin_{x_1} ( f_1(x_1) + (z^{(k-1)})^T A_1x_1 + (t/2)‖A_1x_1 + A_2x_2^{(k-1)} − b‖_2^2 )

2. minimize the augmented Lagrangian over x_2

x_2^{(k)} = argmin_{x_2} ( f_2(x_2) + (z^{(k-1)})^T A_2x_2 + (t/2)‖A_1x_1^{(k)} + A_2x_2 − b‖_2^2 )

3. dual update

z^{(k)} = z^{(k-1)} + t(A_1x_1^{(k)} + A_2x_2^{(k)} − b)

this is the alternating direction method of multipliers, also known as the split Bregman method
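As a concrete instance, the following is a minimal NumPy sketch of the three steps for a lasso-type problem, minimize (1/2)‖Ax_1 − c‖_2^2 + γ‖x_2‖_1 subject to x_1 − x_2 = 0, so that A_1 = I, A_2 = −I, b = 0 (the instance and names are illustrative, not from the slides):

    import numpy as np

    def admm_lasso(A, c, gamma, t=1.0, iters=300):
        # minimize (1/2)||A x1 - c||^2 + gamma*||x2||_1  s.t.  x1 - x2 = 0
        n = A.shape[1]
        x2 = np.zeros(n)
        z = np.zeros(n)
        M = A.T @ A + t * np.eye(n)   # factor once in a serious implementation
        for _ in range(iters):
            # 1. minimize the augmented Lagrangian over x1 (a linear solve)
            x1 = np.linalg.solve(M, A.T @ c - z + t * x2)
            # 2. minimize over x2: soft-thresholding at level gamma/t
            v = x1 + z / t
            x2 = np.sign(v) * np.maximum(np.abs(v) - gamma / t, 0.0)
            # 3. dual update
            z = z + t * (x1 - x2)
        return x2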

Douglas-Rachford method and ADMM 13-16


Comparison with other multiplier methods

Alternating minimization method (page 12-22) with g(y) = δ_{{b}}(y)

• same dual update, same update for x_2

• the x_1-update in the alternating minimization method is simpler:

x_1^{(k)} = argmin_{x_1} ( f_1(x_1) + (z^{(k-1)})^T A_1x_1 )

• ADMM does not require strong convexity of f_1

• in theory, parameter t in ADMM can be any positive constant

Augmented Lagrangian method (page 12-23) with g(y) = δ_{{b}}(y)

• same dual update

• AL method requires joint minimization of the augmented Lagrangian

f_1(x_1) + f_2(x_2) + (z^{(k-1)})^T(A_1x_1 + A_2x_2) + (t/2)‖A_1x_1 + A_2x_2 − b‖_2^2

Douglas-Rachford method and ADMM 13-17


Application to composite optimization (method 1)

minimize f_1(x) + f_2(Ax)

• apply ADMM to

minimize f_1(x_1) + f_2(x_2)
subject to Ax_1 = x_2

• augmented Lagrangian is

f_1(x_1) + f_2(x_2) + (t/2)‖Ax_1 − x_2 + z/t‖_2^2

• x_1-update requires (possibly nontrivial) minimization of

f_1(x_1) + (t/2)‖Ax_1 − x_2 + z/t‖_2^2

• x_2-update is an evaluation of prox_{t^{-1}f_2}

Douglas-Rachford method and ADMM 13-18


Application to composite optimization (method 2)

introduce an extra ‘splitting’ variable x3

minimize f_1(x_3) + f_2(x_2)
subject to [A; I] x_1 = [x_2; x_3]

• alternate minimization of the augmented Lagrangian over x_1 and (x_2, x_3):

f_1(x_3) + f_2(x_2) + (t/2)( ‖Ax_1 − x_2 + z_1/t‖_2^2 + ‖x_1 − x_3 + z_2/t‖_2^2 )

• x_1-update: linear equation with coefficient I + A^T A

• (x_2, x_3)-update: decoupled evaluations of prox_{t^{-1}f_1} and prox_{t^{-1}f_2}

Douglas-Rachford method and ADMM 13-19


Outline

• Douglas-Rachford splitting method

• examples

• alternating direction method of multipliers

• image deblurring example

• convergence


Image blurring model

b = Kx_t + w

• x_t is the unknown image

• b is the observed (blurred and noisy) image; w is noise

• N × N images are stored in column-major order as vectors of length N²

Blurring matrix K

• represents 2D convolution with space-invariant point spread function

• with periodic boundary conditions, block-circulant with circulant blocks

• can be diagonalized by multiplication with the unitary 2D DFT matrix W:

K = W^H diag(λ) W

equations with coefficient I + K^T K can be solved in O(N² log N) time
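A sketch of such a solve, assuming the array lam of N² eigenvalues has been precomputed (e.g., as the 2D DFT of the periodically extended point spread function; names are illustrative):

    import numpy as np

    def solve_I_plus_KTK(rhs, lam):
        # solve (I + K^T K) x = rhs, where K = W^H diag(lam) W and K is real,
        # so that I + K^T K = W^H diag(1 + |lam|^2) W
        N = lam.shape[0]
        R = np.fft.fft2(rhs.reshape(N, N))
        X = R / (1.0 + np.abs(lam) ** 2)
        return np.real(np.fft.ifft2(X)).ravel()

The diagonal scaling is unaffected by the FFT normalization, since fft2 and ifft2 are mutual inverses in NumPy.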

Douglas-Rachford method and ADMM 13-20


Total variation deblurring with 1-norm

minimize ‖Kx − b‖_1 + γ‖Dx‖_tv
subject to 0 ≤ x ≤ 1

the second term in the objective is a total variation penalty

• Dx is the discretized first derivative in the vertical and horizontal directions

D = [ I ⊗ D_1
      D_1 ⊗ I ],

D_1 = [ −1  0  0 · · ·  0  0  1
         1 −1  0 · · ·  0  0  0
         0  1 −1 · · ·  0  0  0
         ⋮   ⋮   ⋮  ⋱   ⋮   ⋮   ⋮
         0  0  0 · · · −1  0  0
         0  0  0 · · ·  1 −1  0
         0  0  0 · · ·  0  1 −1 ]

• ‖·‖_tv is a sum of Euclidean norms: ‖(u, v)‖_tv = Σ_{i=1}^n √(u_i² + v_i²)
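The prox-operator of t·γ‖·‖_tv is blockwise soft-thresholding, shrinking each pair (u_i, v_i) toward zero; a minimal sketch (function name hypothetical):

    import numpy as np

    def prox_tv(u, v, t, gamma):
        # prox of t*gamma*||(u, v)||_tv: shrink each pair (u_i, v_i) toward zero
        r = np.sqrt(u**2 + v**2)
        scale = np.maximum(1.0 - t * gamma / np.maximum(r, 1e-12), 0.0)
        return scale * u, scale * v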

Douglas-Rachford method and ADMM 13-21


Solution via Douglas-Rachford method

an example of a composite optimization problem

minimize f_1(x) + f_2(Ax)

with f_1 the indicator of [0, 1]^n and

A = [ K
      D ],    f_2(u, v) = ‖u‖_1 + γ‖v‖_tv

Primal DR method (page 13-11) and ADMM (page 13-19) require:

• decoupled prox-evaluations of ‖u‖_1 and γ‖v‖_tv, and projections on C = [0, 1]^n

• solution of linear equations with coefficient matrix

I + K^T K + D^T D

solvable in O(N² log N) time

Douglas-Rachford method and ADMM 13-22


Example

• 1024 × 1024 image, periodic boundary conditions

• Gaussian blur

• salt-and-pepper noise (50% of pixels randomly changed to 0 or 1)

[figure: original, noisy/blurred, and restored images]

Douglas-Rachford method and ADMM 13-23


Convergence

[figure: relative primal suboptimality versus iteration (0 to 500; y-axis 10^{−6} to 10^0) for ADMM and primal DR]

cost per iteration is dominated by 2D FFTs

Douglas-Rachford method and ADMM 13-24


Outline

• Douglas-Rachford splitting method

• examples

• alternating direction method of multipliers

• image deblurring example

• convergence


Douglas-Rachford iteration mappings

define iteration map F and negative step G (in notation of page 13-7)

F(y) = y + prox_g(2 prox_f(y) − y) − prox_f(y)

G(y) = y − F(y) = prox_f(y) − prox_g(2 prox_f(y) − y)

• F is firmly nonexpansive (co-coercive with parameter 1)

(F(y) − F(ŷ))^T (y − ŷ) ≥ ‖F(y) − F(ŷ)‖_2^2    for all y, ŷ

• this implies that G is firmly nonexpansive:

(G(y) − G(ŷ))^T (y − ŷ)
  = ‖G(y) − G(ŷ)‖_2^2 + (F(y) − F(ŷ))^T (y − ŷ) − ‖F(y) − F(ŷ)‖_2^2
  ≥ ‖G(y) − G(ŷ)‖_2^2

Douglas-Rachford method and ADMM 13-25


Proof (of firm nonexpansiveness of F).

• define x = prox_f(y), x̂ = prox_f(ŷ), and

v = prox_g(2x − y),    v̂ = prox_g(2x̂ − ŷ)

• substitute the expressions F(y) = y + v − x and F(ŷ) = ŷ + v̂ − x̂:

(F(y) − F(ŷ))^T (y − ŷ)
  ≥ (y + v − x − ŷ − v̂ + x̂)^T (y − ŷ) − (x − x̂)^T (y − ŷ) + ‖x − x̂‖_2^2
  = (v − v̂)^T (y − ŷ) + ‖y − x − ŷ + x̂‖_2^2
  = (v − v̂)^T (2x − y − 2x̂ + ŷ) − ‖v − v̂‖_2^2 + ‖F(y) − F(ŷ)‖_2^2
  ≥ ‖F(y) − F(ŷ)‖_2^2

the inequalities use firm nonexpansiveness of prox_f and prox_g (page 6-9):

(x − x̂)^T (y − ŷ) ≥ ‖x − x̂‖_2^2,    (2x − y − 2x̂ + ŷ)^T (v − v̂) ≥ ‖v − v̂‖_2^2

Douglas-Rachford method and ADMM 13-26


Convergence result

y^{(k)} = (1 − ρ_k) y^{(k-1)} + ρ_k F(y^{(k-1)}) = y^{(k-1)} − ρ_k G(y^{(k-1)})

Assumptions

• F has fixed points (equivalently, 0 ∈ ∂f(x) + ∂g(x) has a solution)

• ρ_k ∈ [ρ_min, ρ_max] with 0 < ρ_min < ρ_max < 2

Result

• y^{(k)} converges to a fixed point y* of F

• x^{(k)} = prox_f(y^{(k-1)}) converges to a solution x* = prox_f(y*)
(this follows from continuity of prox_f)

Douglas-Rachford method and ADMM 13-27


Proof: let y* be any fixed point of F (equivalently, a zero of G)

consider iteration k (with y = y^{(k-1)}, ρ = ρ_k, y^+ = y^{(k)}):

‖y^+ − y*‖_2^2 − ‖y − y*‖_2^2
  = 2(y^+ − y)^T (y − y*) + ‖y^+ − y‖_2^2
  = −2ρ G(y)^T (y − y*) + ρ^2 ‖G(y)‖_2^2
  ≤ −ρ(2 − ρ) ‖G(y)‖_2^2
  ≤ −M ‖G(y)‖_2^2        (1)

where M = ρ_min(2 − ρ_max) (on line 3 we use firm nonexpansiveness of G)

• (1) implies that

M Σ_{k=0}^∞ ‖G(y^{(k)})‖_2^2 ≤ ‖y^{(0)} − y*‖_2^2,    so ‖G(y^{(k)})‖_2 → 0

• (1) implies that ‖y^{(k)} − y*‖_2 is nonincreasing; hence y^{(k)} is bounded

• since ‖y^{(k)} − y*‖_2 is nonincreasing, the limit lim_{k→∞} ‖y^{(k)} − y*‖_2 exists

Douglas-Rachford method and ADMM 13-28


Proof (continued)

• since the sequence y(k) is bounded, it has a convergent subsequence

• let (y^{(k_i)}) be a convergent subsequence with limit ȳ; by continuity of G,

0 = lim_{i→∞} G(y^{(k_i)}) = G(ȳ)

hence ȳ is a zero of G, and the limit lim_{k→∞} ‖y^{(k)} − ȳ‖_2 exists

• let ȳ_1 and ȳ_2 be two limit points; the limits

lim_{k→∞} ‖y^{(k)} − ȳ_1‖_2,    lim_{k→∞} ‖y^{(k)} − ȳ_2‖_2

exist, and subsequences of y^{(k)} converge to ȳ_1 and ȳ_2, respectively; therefore

‖ȳ_2 − ȳ_1‖_2 = lim_{k→∞} ‖y^{(k)} − ȳ_1‖_2 = lim_{k→∞} ‖y^{(k)} − ȳ_2‖_2 = 0

Douglas-Rachford method and ADMM 13-29


References

• P. L. Lions and B. Mercier, Splitting algorithms for the sum of two nonlinear operators, SIAM Journal on Numerical Analysis (1979).

• D. Gabay, Applications of the method of multipliers to variational inequalities, in: Studies in Mathematics and Its Applications (1983).

• J. E. Spingarn, Applications of the method of partial inverses to convex programming: decomposition, Mathematical Programming (1985).

• J. Eckstein and D. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming (1992).

• P. L. Combettes and J.-C. Pesquet, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal of Selected Topics in Signal Processing (2007).

• S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning (2010).

• D. O’Connor and L. Vandenberghe, Primal-dual decomposition by operator splitting and applications to image deblurring, SIAM Journal on Imaging Sciences (2014).

The image deblurring example is taken from this paper.

Douglas-Rachford method and ADMM 13-30