First-order methods for structured nonsmooth optimization

Sangwoon Yun

Department of Mathematics Education, Sungkyunkwan University

Oct 19, 2016, Center for Mathematical Analysis & Computation, Yonsei University

Source: cmac.yonsei.ac.kr/lecture/SangwoonYun.pdf

Outline

I. Coordinate (Gradient) Descent Method

II. Incremental Gradient Method

III. (Linearized) Alternating Direction Method of Multipliers

I. Coordinate (Gradient) Descent Method

Structured Nonsmooth Optimization

$$\min_x \; F(x) = f(x) + P(x),$$

f: real-valued, (convex) smooth on dom f

P: proper, convex, lsc

In particular, P: separable, i.e., $P(x) = \sum_{j=1}^{n} P_j(x_j)$

Bound constraint:

$$P(x) = \begin{cases} 0 & \text{if } l \le x \le u; \\ \infty & \text{else,} \end{cases}$$

where $l \le u$ (possibly with $-\infty$ or $\infty$ components).

$\ell_1$-norm: $P(x) = \lambda\|x\|_1$ with $\lambda > 0$

or indicator function of linear constraints ($Ax = b$).

Bound-constrained Optimization

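$$\min_{l \le x \le u} \; f(x),$$

where $f : \mathbb{R}^N \to \mathbb{R}$ is smooth and $l \le u$ (possibly with $-\infty$ or $\infty$ components).

Can be reformulated as the following unconstrained optimization problem:

$$\min_x \; f(x) + P(x), \quad \text{where } P(x) = \begin{cases} 0 & \text{if } l \le x \le u \\ \infty & \text{else.} \end{cases}$$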

$\ell_1$-regularized Convex Minimization

1. $\ell_1$-regularized linear least squares problem

Find x so that $Ax - b \approx 0$ and x has "few" nonzeros. Formulate this as an unconstrained convex optimization problem:

$$\min_{x \in \mathbb{R}^n} \; \|Ax - b\|_2^2 + \lambda\|x\|_1$$

2. $\ell_1$-regularized logistic regression problem

$$\min_{w \in \mathbb{R}^{n-1},\, v \in \mathbb{R}} \; \frac{1}{m}\sum_{i=1}^{m} \log\bigl(1 + \exp(-(w^T a_i + v b_i))\bigr) + \lambda\|w\|_1,$$

where $a_i = b_i z_i$ and $(z_i, b_i) \in \mathbb{R}^{n-1} \times \{-1, 1\}$, $i = 1, \dots, m$, are a given set of (observed or training) examples.

Support Vector Machines


Support Vector Classification

Training points: $z_i \in \mathbb{R}^p$, $i = 1, \dots, n$.

Consider a simple case with two classes (linearly separable case). Define a vector $a$:

$$a_i = \begin{cases} 1 & \text{if } z_i \text{ in class 1} \\ -1 & \text{if } z_i \text{ in class 2} \end{cases}$$

A hyperplane ($0 = w^T z - b$) separates the data with the maximal margin. The margin is the distance from the hyperplane to the nearest of the positive and negative points. The nearest points lie on the planes $\pm 1 = w^T z - b$.


Convex Quadratic Programming Problem: Support Vector Classification

The (original) Optimization Problem

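With labels $a_i \in \{-1, 1\}$, data points $z_i$, and hyperplane $w^T z - b = 0$ as on the previous slide:

$$\min_{w,b} \; \frac{1}{2}\|w\|_2^2 \quad \text{subject to } a_i\,(w^T z_i - b) \ge 1, \quad i = 1, \dots, n.$$

The Modified Optimization Problem (allows, but penalizes, the failure of a point to reach the correct margin)

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n} \xi_i \quad \text{subject to } a_i\,(w^T z_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n.$$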


SVM (Dual) Optimization Problem (Convex Quadratic Program)

$$\min_x \; \frac{1}{2} x^T Q x - e^T x \quad \text{subject to } 0 \le x_i \le C,\ i = 1, \dots, n, \quad a^T x = 0,$$

where $a \in \{-1, 1\}^n$, $0 < C \le \infty$, $e = [1, \dots, 1]^T$, $Q \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite with $Q_{ij} = a_i a_j K(z_i, z_j)$, $K : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ ("kernel function"), and $z_i \in \mathbb{R}^p$ ("$i$th data point"), $i = 1, \dots, n$.

Popular Choices of K:

linear kernel $K(z_i, z_j) = z_i^T z_j$

radial basis function kernel $K(z_i, z_j) = \exp(-\gamma\|z_i - z_j\|_2^2)$

sigmoid kernel $K(z_i, z_j) = \tanh(\gamma z_i^T z_j)$

where $\gamma$ is a constant. Q is a fully dense $n \times n$ matrix and may even be indefinite ($n \ge 5000$).
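To make the definition of Q concrete, the following is a minimal Python sketch that builds $Q_{ij} = a_i a_j K(z_i, z_j)$ for the three kernels listed on this slide. The function names, the toy data `Z`, the labels `a`, and the value `gamma=0.5` are illustrative assumptions, not taken from the slides.

```python
import math

def linear_kernel(zi, zj):
    # K(zi, zj) = zi^T zj
    return sum(p * q for p, q in zip(zi, zj))

def rbf_kernel(zi, zj, gamma=0.5):
    # K(zi, zj) = exp(-gamma * ||zi - zj||_2^2)
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(zi, zj)))

def sigmoid_kernel(zi, zj, gamma=0.5):
    # K(zi, zj) = tanh(gamma * zi^T zj); this one can make Q indefinite
    return math.tanh(gamma * linear_kernel(zi, zj))

def dual_matrix(Z, a, kernel):
    # Q_ij = a_i * a_j * K(z_i, z_j); dense n-by-n, hence O(n^2) storage
    n = len(Z)
    return [[a[i] * a[j] * kernel(Z[i], Z[j]) for j in range(n)]
            for i in range(n)]

# Toy data (assumed): three points in R^2 with labels +1, -1, +1.
Z = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
a = [1, -1, 1]
Q = dual_matrix(Z, a, rbf_kernel)
```

Because every entry of Q is generally nonzero, storing it for n in the thousands is already costly, which is one motivation for methods that touch only a few coordinates (and hence a few columns of Q) per iteration.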

Rank Minimization

Now, imagine that we only observe a few entries of a data matrix. Is it possible to accurately guess the entries that we have not seen?

Netflix problem: Given a sparse matrix where $M_{ij}$ is the rating given by user $i$ on movie $j$, predict the rating a user would assign to a movie they have not seen, i.e., we would like to infer users' preferences for unrated movies. (Impossible in general!)

The problem is ill-posed. Intuitively, users' preferences depend only on a few factors, i.e., rank(M) is small.

Thus it can be formulated as the low-rank matrix completion problem (affine rank minimization):

$$\min_{X \in \mathbb{R}^{m \times n}} \bigl\{ \operatorname{rank}(X) \;\big|\; X_{ij} = M_{ij},\ (i,j) \in \Omega \bigr\}, \quad \text{(NP-hard!)}$$

where $\Omega$ is an index set of p observed entries.

Rank Minimization

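Nuclear norm minimization:

$$\min_{X \in \mathbb{R}^{m \times n}} \Bigl\{ \|X\|_* := \sum_{i=1}^{m} \sigma_i(X) \;\Big|\; X_{ij} = M_{ij},\ (i,j) \in \Omega \Bigr\},$$

where the $\sigma_i(X)$ are the singular values of X.

A more general nuclear norm minimization problem:

$$\min_{X \in \mathbb{R}^{m \times n}} \bigl\{ \|X\|_* : \mathcal{A}(X) = b \bigr\}.$$

When the matrix variable is restricted to be diagonal, the above problem reduces to the following $\ell_1$-minimization problem:

$$\min_{x \in \mathbb{R}^n} \bigl\{ \|x\|_1 : Ax = b \bigr\}.$$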

Nuclear Norm Regularized Least Squares Problem

If the observation b is contaminated with noise, consider the following nuclear norm regularized least squares problem:

$$\min_{X \in \mathbb{R}^{m \times n}} \; \frac{1}{2}\|\mathcal{A}(X) - b\|_2^2 + \mu\|X\|_*,$$

where $\mu > 0$ is a given parameter.

This problem appears in many applications in engineering and science, including

collaborative filtering

global positioning

system identification

remote sensing

computer vision

Sparse Portfolio Selection

How to allocate an investor's available capital among a prefixed set of assets with the aims of maximizing the expected return and minimizing the investment risk.

Traditional Markowitz portfolio selection model:

$$\min_x \; x^T Q x \quad \text{subject to } \mu^T x = \beta,\ e^T x = 1,\ x \ge 0,$$

where $\beta$ is the desired expected return of the portfolio.

Modified models:

$$\min_x \; x^T Q x \quad \text{subject to } \mu^T x = \beta,\ e^T x = 1,\ x \ge 0,\ \|x\|_0 \le K.$$

$$\min_x \; \|x\|_0 \quad \text{subject to } \mu^T x = \beta,\ x^T Q x \le \alpha,\ e^T x = 1,\ x \ge 0.$$

Sparse Covariance Selection

Undirected graphical models offer a way to describe and explain relationships among a set of variables, a central element of multivariate data analysis.

From a sample covariance matrix, we wish to estimate the true covariance matrix, in which some entries of its inverse are zero, by maximizing the $\ell_1$-regularized log-likelihood:

$$\max_{X \in S^n} \; \log\det X - \langle \hat{\Sigma}, X \rangle - \sum_{(i,j) \notin V} \rho_{ij}|X_{ij}| \quad \text{subject to } X_{ij} = 0,\ \forall (i,j) \in V,$$

where

$\hat{\Sigma} \in S^n_+$ is an empirical covariance matrix, which is singular or nearly so.

We may want to impose structural conditions on $\Sigma^{-1}$ (the inverse of the true covariance), such as conditional independence, which is reflected as zero entries in $\Sigma^{-1}$.

V is a collection of all pairs of conditionally independent nodes.

$\rho_{ij} > 0$: parameter controlling the trade-off between the goodness-of-fit and the sparsity of X.

Sparse Covariance Selection

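$-\log\det X + \langle \hat{\Sigma}, X \rangle$ is strictly convex and continuously differentiable on its domain $S^n_{++}$, and takes $O(n^3)$ operations to evaluate. In applications, n can exceed 5000.

The dual problem can be expressed as:

$$\min_{X \in S^n} \; -\log\det X - n \quad \text{subject to } |(X - \hat{\Sigma})_{ij}| \le \upsilon_{ij},\ i, j = 1, \dots, n,$$

where $\upsilon_{ij} = \rho_{ij}$ for all $(i,j) \notin V$ and $\upsilon_{ij} = \infty$ for all $(i,j) \in V$.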

Coordinate Descent Method

When $P \equiv 0$: given $x \in \mathbb{R}^n$, choose $i \in \mathcal{N} = \{1, \dots, n\}$. Update

$$x^{\text{new}} = \arg\min_{u \,\mid\, u_j = x_j\ \forall j \ne i} f(u).$$

Repeat until convergence.

Gauss-Seidel rule: Choose i cyclically, 1, 2, ..., n, 1, 2, ...

Gauss-Southwell rule: Choose i with $\left|\frac{\partial f}{\partial x_i}(x)\right|$ maximum.
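The update and the two selection rules can be sketched in a few lines for a strictly convex quadratic $f(x) = \frac12 x^T A x - b^T x$, where exact minimization along coordinate i has the closed form $x_i \leftarrow x_i - \nabla f(x)_i / A_{ii}$. The quadratic test problem, the function name, and the fixed iteration count are assumptions made for illustration only.

```python
def coordinate_descent(A, b, x, rule="gauss-seidel", iters=200):
    """Minimize f(x) = 0.5*x'Ax - b'x (A symmetric positive definite)
    by exact minimization along one coordinate per iteration."""
    n = len(b)
    x = list(x)
    for k in range(iters):
        # Gradient of f at the current point: g = A x - b.
        g = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        if rule == "gauss-seidel":
            i = k % n                                    # cyclic choice
        else:
            i = max(range(n), key=lambda j: abs(g[j]))   # largest |df/dx_i|
        # Exact line minimization along coordinate i of a quadratic.
        x[i] -= g[i] / A[i][i]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_gs = coordinate_descent(A, b, [0.0, 0.0], "gauss-seidel")
x_sw = coordinate_descent(A, b, [0.0, 0.0], "gauss-southwell")
```

Both runs approach the solution of $Ax = b$, here $(1/11, 7/11)$. The sketch recomputes the full gradient each iteration for clarity; a practical implementation would update it incrementally.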

Coordinate Descent Method

Properties:

If f is convex, then every cluster point of the x-sequence is a minimizer.

If f is nonconvex, then G-Seidel can cycle [1], but G-Southwell still converges.

Convergence is possible when $P \not\equiv 0$ [2].

[1] M. J. D. Powell, On search directions for minimization algorithms, Math. Program. 4 (1973), 193–201

[2] P. Tseng, Convergence of block coordinate descent method for nondifferentiable minimization, J. Optim. Theory Appl. 109 (2001), 473–492

Coord. Gradient Descent Method

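Descent direction. For $x \in \operatorname{dom} P$, choose $\mathcal{J}\ (\ne \emptyset) \subseteq \mathcal{N}$ and $H \succ 0_n$. Then solve the direction subproblem

$$\min_{d \,\mid\, d_j = 0\ \forall j \notin \mathcal{J}} \Bigl\{ \nabla f(x)^T d + \frac{1}{2} d^T H d + P(x + d) - P(x) \Bigr\}.$$

Let $d_H(x; \mathcal{J})$ and $q_H(x; \mathcal{J})$ be the optimal solution and objective value of the direction subproblem.

Facts:

$d_H(x; \mathcal{N}) = 0 \iff F'(x; d) \ge 0\ \ \forall d \in \mathbb{R}^n$.

H is diagonal $\Rightarrow$ $d_H(x; \mathcal{J}) = \sum_{j \in \mathcal{J}} d_H(x; j)$, $\ q_H(x; \mathcal{J}) = \sum_{j \in \mathcal{J}} q_H(x; j)$.

$q_H(x; \mathcal{J}) \le -\frac{1}{2} d^T H d$ where $d = d_H(x; \mathcal{J})$.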


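This coord. grad. descent approach may be viewed as a hybrid of gradient-projection and coordinate descent. In particular,

if $\mathcal{J} = \mathcal{N}$ and P is the indicator function of the bounds ($P(x) = 0$ if $l \le x \le u$, $\infty$ else), then $d_H(x; \mathcal{N})$ is a scaled gradient-projection direction for bound-constrained optimization.

if f is quadratic and we choose $H = \nabla^2 f(x)$, then $d_H(x; \mathcal{J})$ is a (block) coordinate descent direction.

If H is diagonal, then subproblems can be solved in parallel.

If $P \equiv 0$, then $d_H(x)_j = -\nabla f(x)_j / H_{jj}$.

If P is the indicator function of the bounds $l \le x \le u$, then

$$d_H(x)_j = \operatorname{median}\{\, l_j - x_j,\ -\nabla f(x)_j / H_{jj},\ u_j - x_j \,\}.$$

If P is the 1-norm, $P(x) = \lambda\|x\|_1$, then

$$d_H(x)_j = -\operatorname{median}\{\, (\nabla f(x)_j - \lambda)/H_{jj},\ x_j,\ (\nabla f(x)_j + \lambda)/H_{jj} \,\}.$$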
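The three closed-form directions above are simple enough to write down directly. The sketch below treats a single coordinate j with diagonal H, where `g` stands for $\nabla f(x)_j$ and `h` for $H_{jj} > 0$; the function names and sample numbers are illustrative assumptions.

```python
def median3(a, b, c):
    # Middle value of three numbers.
    return sorted([a, b, c])[1]

def direction_unconstrained(g, h):
    # P = 0: Newton-like scaled gradient step in coordinate j.
    return -g / h

def direction_box(x, g, h, lo, hi):
    # P = indicator of lo <= x <= hi: clip the step so x + d stays in the box.
    return median3(lo - x, -g / h, hi - x)

def direction_l1(x, g, h, lam):
    # P = lam * |x|: soft-thresholding written as a median of three values.
    return -median3((g - lam) / h, x, (g + lam) / h)

d_zero = direction_l1(0.0, 0.5, 1.0, 1.0)      # |g| <= lam at x = 0: stay at 0
d_to_zero = direction_l1(1.0, 3.0, 2.0, 1.0)   # step lands exactly on x + d = 0
d_smooth = direction_l1(2.0, 1.0, 1.0, 0.5)    # x + d stays positive
d_box = direction_box(0.5, -2.0, 1.0, 0.0, 1.0)  # step is cut at the upper bound
```

In the last $\ell_1$ example, $x + d = 0.5 > 0$ and the optimality condition $g + h\,d + \lambda = 1 - 1.5 + 0.5 = 0$ holds, which is a quick sanity check of the median formula.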


Stepsize: Armijo rule

Choose $\alpha$ to be the largest element of $\{\beta^k\}_{k=0,1,\dots}$ satisfying

$$F(x + \alpha d) - F(x) \le \sigma \alpha\, q_H(x; \mathcal{J}) \qquad (0 < \beta < 1,\ 0 < \sigma < 1).$$

For the $\ell_1$-regularized linear least squares problem, the minimization rule

$$\alpha \in \arg\min \{\, F(x + td) \mid t \ge 0 \,\}$$

or the limited minimization rule

$$\alpha \in \arg\min \{\, F(x + td) \mid 0 \le t \le s \,\},$$

where $0 < s < \infty$, can also be used.
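A backtracking loop is one straightforward way to realize the Armijo rule above. The sketch below is generic in F; the one-dimensional example ($F(x) = (x-3)^2 + 2|x|$ with $H = 1$, $\beta = 0.5$, $\sigma = 0.1$) and the cap `max_backtracks` are assumptions chosen so the arithmetic can be checked by hand.

```python
def armijo_stepsize(F, x, d, q, beta=0.5, sigma=0.1, max_backtracks=50):
    """Largest alpha in {beta**k : k = 0, 1, ...} with
    F(x + alpha*d) - F(x) <= sigma * alpha * q, where q = q_H(x; J) <= 0."""
    Fx = F(x)
    alpha = 1.0
    for _ in range(max_backtracks):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if F(trial) - Fx <= sigma * alpha * q:
            return alpha
        alpha *= beta          # shrink and try the next element beta**k
    return alpha

# Example: f(x) = (x - 3)^2, P(x) = 2|x|, current point x = 0, H = 1.
F = lambda x: (x[0] - 3.0) ** 2 + 2.0 * abs(x[0])
g, h, lam = -6.0, 1.0, 2.0
d = -sorted([(g - lam) / h, 0.0, (g + lam) / h])[1]     # median formula: d = 4
q = g * d + 0.5 * h * d * d + lam * abs(0.0 + d) - 0.0  # q_H = -24 + 8 + 8 = -8
alpha = armijo_stepsize(F, [0.0], [d], q)
```

Here the full step fails ($F(4) - F(0) = 0$, which is not below $-0.8$) because $H = 1$ underestimates the curvature of f, and the first backtrack $\alpha = 0.5$ is accepted ($F(2) - F(0) = -4 \le -0.4$).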


Choose J :

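Gauss-Seidel rule: $\mathcal{J}$ cycles through 1, 2, ..., n.

Gauss-Southwell-r rule:

$$\|d_D(x; \mathcal{J})\|_\infty \ge \upsilon\, \|d_D(x; \mathcal{N})\|_\infty,$$

where $0 < \upsilon \le 1$ and $D \succ 0_n$ is diagonal (e.g., $D = \operatorname{diag}(H)$).

Gauss-Southwell-q rule:

$$q_D(x; \mathcal{J}) \le \upsilon\, q_D(x; \mathcal{N}),$$

where $0 < \upsilon \le 1$ and $D \succ 0_n$ is diagonal (e.g., $D = \operatorname{diag}(H)$).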
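Since D is diagonal, $q_D(x; \mathcal{N})$ is a sum of per-coordinate values $q_D(x; j) \le 0$, so one simple way to satisfy the Gauss-Southwell-q rule is to add coordinates in order of most negative $q_D(x; j)$ until the inequality holds. This greedy construction is an assumption made for illustration (the rule itself only requires the inequality), as are the function names and the choice $P(x) = \lambda\|x\|_1$.

```python
def q_l1(xj, gj, hj, lam):
    """d_D(x; j) and q_D(x; j) for P = lam*|.| and diagonal entry hj > 0."""
    d = -sorted([(gj - lam) / hj, xj, (gj + lam) / hj])[1]
    q = gj * d + 0.5 * hj * d * d + lam * abs(xj + d) - lam * abs(xj)
    return d, q

def gauss_southwell_q(x, g, h, lam, upsilon=0.5):
    """Return an index set J with q_D(x; J) <= upsilon * q_D(x; N)."""
    qs = [q_l1(xj, gj, hj, lam)[1] for xj, gj, hj in zip(x, g, h)]
    total = sum(qs)                       # q_D(x; N), separable for diagonal D
    order = sorted(range(len(x)), key=lambda j: qs[j])   # most negative first
    J, acc = [], 0.0
    for j in order:
        J.append(j)
        acc += qs[j]
        if acc <= upsilon * total:
            break
    return J, acc, total

x = [0.0, 0.0, 0.0]
g = [-6.0, 0.5, 3.0]      # gradient of f at x (toy values)
h = [1.0, 1.0, 1.0]       # diagonal of D
J_half, acc_half, total = gauss_southwell_q(x, g, h, lam=1.0, upsilon=0.5)
J_full, acc_full, _ = gauss_southwell_q(x, g, h, lam=1.0, upsilon=1.0)
```

With these numbers the per-coordinate values are $-12.5$, $0$, and $-2$, so a single coordinate already achieves half of the total predicted decrease, while $\upsilon = 1$ needs the two coordinates with nonzero directions.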


Advantage of CGD

The CGD method is simple, highly parallelizable, and suited for solving large-scale problems.

CGD not only has cheaper iterations than exact coordinate descent, it also has stronger global convergence properties.

Convergence Results

Global convergence: If

$0 \prec \underline{\lambda} I \preceq D, H \preceq \bar{\lambda} I$,

$\mathcal{J}$ is chosen by the G-Seidel, G-Southwell-r, or G-Southwell-q rule,

$\alpha$ is chosen by the Armijo rule,

then every cluster point of the x-sequence generated by the CGD method is a stationary point of F.

Convergence Results

Local convergence rate: If

$0 \prec \underline{\lambda} I \preceq D, H \preceq \bar{\lambda} I$,

$\mathcal{J}$ is chosen by the G-Seidel or Gauss-Southwell-q rule,

$\alpha$ is chosen by the Armijo rule,

and, in addition, P and f satisfy any of the following assumptions, then the x-sequence generated by the CGD method converges at an R-linear rate.

C1 f is strongly convex, $\nabla f$ is Lipschitz cont. on dom P.

C2 f is (nonconvex) quadratic. P is polyhedral.

C3 $f(x) = g(Ex) + q^T x$, where $E \in \mathbb{R}^{m \times N}$, $q \in \mathbb{R}^N$, g is strongly convex, $\nabla g$ is Lipschitz cont. on $\mathbb{R}^m$. P is polyhedral.

C4 $f(x) = \max_{y \in Y} \{ (Ex)^T y - g(y) \} + q^T x$, where $Y \subseteq \mathbb{R}^m$ is polyhedral, $E \in \mathbb{R}^{m \times N}$, $q \in \mathbb{R}^N$, g is strongly convex, $\nabla g$ is Lipschitz cont. on $\mathbb{R}^m$. P is polyhedral.

Convergence Results

Complexity Bound

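If f is convex with Lipschitz cont. grad., then the number of iterations for achieving $\varepsilon$-optimality is

Gauss-Seidel rule:

$$O\!\left( \frac{n^2 L r^0}{\varepsilon} \right),$$

where L is a Lipschitz constant and $r^0 = \max\{\, \operatorname{dist}(x, X^*)^2 \mid F(x) \le F(x^0) \,\}$.

Gauss-Southwell-q rule:

$$O\!\left( \frac{L r^0}{\upsilon \varepsilon} + \max\Bigl\{ 0,\ \frac{L}{\upsilon} \ln\Bigl( \frac{e^0}{r^0} \Bigr) \Bigr\} \right),$$

where $e^0 = F(x^0) - \min_{x \in X} F(x)$.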

II. Incremental Gradient Method

Sum of several functions

$$\min_x \; F_c(x) := f(x) + cP(x),$$

where $c > 0$, $P : \mathbb{R}^n \to (-\infty, \infty]$ is a proper, convex, lower semicontinuous (lsc) function, and

$$f(x) := \sum_{i=1}^{m} f_i(x),$$

where each function $f_i$ is real-valued and smooth (i.e., continuously differentiable) on $\mathbb{R}^n$.


In applications, m is often large (exceeding $10^4$).

In this case, traditional gradient methods would be inefficient since they require evaluating $\nabla f_i(x)$ for all i before updating x.

Incremental gradient methods (IGM), in contrast, update x after evaluating $\nabla f_i(x)$ for only one or a few i.

In the unconstrained case $P \equiv 0$, this method has the basic form

$$x^{k+1} = x^k - \alpha_k \nabla f_{i_k}(x^k), \quad k = 0, 1, \dots,$$

where $i_k$ is chosen to cycle through $1, \dots, m$ (i.e., $i_0 = 1, i_1 = 2, \dots, i_{m-1} = m, i_m = 1, \dots$) and $\alpha_k > 0$.
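The basic cyclic iteration, with a step along the negative component gradient, can be sketched as follows. The test problem ($f_i(x) = \frac12\|x - c_i\|^2$, whose sum is minimized at the mean of the $c_i$), the stepsize $\alpha_k = 1/(k+1)$, and all names are assumptions chosen for illustration.

```python
def incremental_gradient(grads, x0, passes=100, alpha=lambda k: 1.0 / (k + 1)):
    """Cyclic incremental gradient method for f = sum_i f_i with P = 0.
    grads[i](x) returns the gradient of f_i at x as a list."""
    x = list(x0)
    m = len(grads)
    for k in range(passes * m):
        i = k % m                  # cyclic order (0-based version of i_k)
        g = grads[i](x)            # only ONE component gradient per update
        a = alpha(k)               # diminishing stepsize alpha_k
        x = [xj - a * gj for xj, gj in zip(x, g)]
    return x

# f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i.
centers = [[1.0, 0.0], [2.0, 4.0], [6.0, 2.0]]
grads = [lambda x, c=c: [xj - cj for xj, cj in zip(x, c)] for c in centers]
x_igm = incremental_gradient(grads, [10.0, -10.0])
```

For this particular problem and stepsize, each iterate is the running average of the centers visited so far, so after any whole number of passes the iterate equals the mean $(3, 2)$ up to rounding. With a constant stepsize the same loop would keep oscillating around the minimizer.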


For global convergence of IGMs, the stepsize $\alpha_k$ (also called "learning rate") needs to diminish to zero, which can lead to slow convergence [3].

If a constant stepsize is used, only convergence to an approximate solution can be shown [4].

Methods [5] to overcome this difficulty were proposed. However, these methods need additional assumptions, such as $\nabla f_i(x) = 0$ for all i at a stationary point x, to achieve global convergence without the stepsize tending to zero.

Moreover, their extension to $P \not\equiv 0$ is problematic.

[3] D. P. Bertsekas, A new class of incremental gradient methods for least squares problems, SIAM J. Optim., 7 (1997), 913–926

[4] M. V. Solodov, Incremental gradient algorithms with stepsizes bounded away from zero, Comput. Optim. Appl., 11 (1998), 23–35

[5] P. Tseng, An incremental gradient(-projection) method with momentum term and adaptive stepsize rule, SIAM J. Optim., 8 (1998), 506–531


For the case of $P \equiv 0$, Blatt, Hero and Gauchman [6] proposed a method:

$$g^k = g^{k-1} + \nabla f_{i_k}(x^k) - \nabla f_{i_k}(x^{k-m}),$$

$$x^{k+1} = x^k - \alpha g^k,$$

with $\alpha > 0$, $g^{-1} = \sum_{i=1}^{m} \nabla f_i(x^{i-m-1})$, and $x^0, x^{-1}, \dots, x^{-m} \in \mathbb{R}^n$ given.

Computes the gradient of a single component function at each iteration.

But instead of updating x using this gradient, it uses the sum of the m most recently computed gradients.

This method requires more storage ($O(mn)$ instead of $O(n)$) and slightly more communication/computation per iteration.
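The aggregated-gradient recursion above can be sketched by keeping a table with the most recent gradient of each component, which is where the $O(mn)$ storage comes from. The sketch assumes all initial points coincide ($x^0 = x^{-1} = \dots = x^{-m}$), reuses the toy problem $f_i(x) = \frac12\|x - c_i\|^2$, and picks $\alpha = 0.1$; these choices and all names are illustrative assumptions.

```python
def iag(grads, x0, alpha, iters):
    """Incremental aggregated gradient iteration with constant stepsize.
    table[i] holds the last computed gradient of f_i (O(m*n) storage)."""
    m = len(grads)
    x = list(x0)
    table = [g(x) for g in grads]               # gradients at the initial point
    agg = [sum(col) for col in zip(*table)]     # g^{-1}: sum of stored gradients
    for k in range(iters):
        i = k % m                               # cyclic component choice
        new = grads[i](x)                       # one new gradient per iteration
        # g^k = g^{k-1} + (new gradient of f_i) - (old stored gradient of f_i)
        agg = [a + n - o for a, n, o in zip(agg, new, table[i])]
        table[i] = new
        x = [xj - alpha * aj for xj, aj in zip(x, agg)]
    return x

centers = [[1.0, 0.0], [2.0, 4.0], [6.0, 2.0]]
grads = [lambda x, c=c: [xj - cj for xj, cj in zip(x, c)] for c in centers]
x_iag = iag(grads, [10.0, -10.0], alpha=0.1, iters=500)
```

Unlike the basic incremental method, the stepsize here stays constant and the iterates still approach the exact minimizer $(3, 2)$ of the sum on this example.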

[6] D. Blatt, A. O. Hero, and H. Gauchman, A convergent incremental gradient method with a constant step size, SIAM J. Optim., 18 (2007), pp. 29–51

IGM: Constant Stepsize

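0. Choose $x^0, x^{-1}, \dots \in \operatorname{dom} P$ and $\alpha \in (0, 1]$. Initialize k = 0. Go to 1.

1. Choose $H^k \succ 0$ and $0 \le \tau_i^k \le k$ for $i = 1, \dots, m$, and compute $g^k$, $d^k$, and $x^{k+1}$ by

$$g^k = \sum_{i=1}^{m} \nabla f_i(x^{\tau_i^k}),$$

$$d^k = \arg\min_{d \in \mathbb{R}^n} \Bigl\{ \langle g^k, d \rangle + \frac{1}{2}\langle d, H^k d \rangle + cP(x^k + d) \Bigr\},$$

$$x^{k+1} = x^k + \alpha d^k.$$

Increment k by 1 and return to 1.

The method of Blatt et al. corresponds to the special case of $P \equiv 0$, $H^k = I$, $K = m - 1$, and

$$\tau_i^k = \begin{cases} k & \text{if } i = (k \bmod m) + 1; \\ \tau_i^{k-1} & \text{otherwise,} \end{cases} \qquad 1 \le i \le m,\ k \ge m.$$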


Assumption 1

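(a) $\tau_i^k \ge k - K$ for all i and k, where $K \ge 0$ is an integer.

(b) $\underline{\lambda} I \preceq H^k \preceq \bar{\lambda} I$ for all k, where $0 < \underline{\lambda} \le \bar{\lambda}$.

Assumption 2

$$\|\nabla f_i(y) - \nabla f_i(z)\| \le L_i \|y - z\| \quad \forall y, z \in \operatorname{dom} P,$$

for some $L_i \ge 0$, $i = 1, \dots, m$. Let $L = \sum_{i=1}^{m} L_i$.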


Global Convergence

Let $\{x^k\}$, $\{d^k\}$, $\{H^k\}$ be sequences generated by Algorithm 1 (the constant-stepsize method) under Assumptions 1 and 2, and with $\alpha < 2\underline{\lambda} / (L(2K + 1))$. Then $\{d^k\} \to 0$ and every cluster point of $\{x^k\}$ is a stationary point.

IGM: Adaptive Stepsize

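0. Choose $x^0, x^{-1}, \dots \in \operatorname{dom} P$, $\beta \in (0, 1)$, $\sigma > \frac{1}{2}$, and $\underline{\alpha} \in (0, 1]$. Initialize k = 0. Go to 1.

1. Choose $H^k \succ 0$ and $0 \le \tau_i^k \le k$ for $i = 1, \dots, m$, and compute $g^k$, $d^k$ by

$$g^k = \sum_{i=1}^{m} \nabla f_i(x^{\tau_i^k}),$$

$$d^k = \arg\min_{d \in \mathbb{R}^n} \Bigl\{ \langle g^k, d \rangle + \frac{1}{2}\langle d, H^k d \rangle + cP(x^k + d) \Bigr\}.$$

Choose $\alpha_k^{\text{init}} \in [\underline{\alpha}, 1]$ and let $\alpha_k$ be the largest element of $\{\alpha_k^{\text{init}} \beta^j\}_{j=0,1,\dots}$ satisfying the descent-like condition

$$F_c(x^k + \alpha_k d^k) - F_c(x^k) \le -\sigma K L \|\alpha_k d^k\|^2 + \frac{L}{2} \sum_{j=(k-K)_+}^{k-1} \|\alpha_j d^j\|^2$$

and set $x^{k+1} = x^k + \alpha_k d^k$. Increment k by 1 and return to 1.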

IGM: Adaptive Stepsize

Global Convergence

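Let $\{x^k\}$, $\{d^k\}$, $\{H^k\}$, $\{\alpha_k\}$ be sequences generated by Algorithm 2 (the adaptive-stepsize method) under Assumptions 1 and 2. Then the following results hold.

(a) For each $k \ge 0$, the descent-like condition holds whenever $\alpha_k \le \bar{\alpha}$, where $\bar{\alpha} = \dfrac{\underline{\lambda}}{L(\sigma K + K/2 + 1/2)}$.

(b) We have $\alpha_k \ge \min\{\underline{\alpha}, \beta\bar{\alpha}\}$ for all k, where $\underline{\alpha}$ is the lower bound on the initial trial stepsize chosen in step 0.

(c) $\{d^k\} \to 0$ and every cluster point of $\{x^k\}$ is a stationary point.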

IGM: 1-memory with Adaptive Stepsize

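0. Choose $x^0 \in \operatorname{dom} P$. Initialize k = 0 and $g^{-1} = 0$. Go to 1.

1. Choose $H^k \succ 0$ and compute $g^k$, $d^k$, and $x^{k+1}$ by

$$g^k = \frac{k}{k+1}\, g^{k-1} + \frac{m}{k+1}\, \nabla f_{i_k}(x^k) \quad \text{with } i_k = (k \bmod m) + 1,$$

$$d^k = \arg\min_{d \in \mathbb{R}^n} \Bigl\{ \langle g^k, d \rangle + \frac{1}{2}\langle d, H^k d \rangle + cP(x^k + d) \Bigr\},$$

$$x^{k+1} = x^k + \alpha_k d^k,$$

with $\alpha_k \in (0, 1]$. Increment k by 1 and return to 1.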
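As a side remark that is not on the slides, unrolling the recursion for $g^k$ with $g^{-1} = 0$ shows why a single stored vector suffices: multiplying by $k+1$ turns the update into a running sum, so $g^k$ is m times the average of all component gradients evaluated so far.

```latex
\begin{aligned}
(k+1)\, g^k &= k\, g^{k-1} + m\, \nabla f_{i_k}(x^k) \\
            &= (k-1)\, g^{k-2} + m\, \nabla f_{i_{k-1}}(x^{k-1}) + m\, \nabla f_{i_k}(x^k) \\
            &= \cdots = m \sum_{j=0}^{k} \nabla f_{i_j}(x^j),
\qquad\text{hence}\qquad
g^k = \frac{m}{k+1} \sum_{j=0}^{k} \nabla f_{i_j}(x^j).
\end{aligned}
```

If the iterates settle down, this average over cyclically chosen components approximates $\sum_{i=1}^m \nabla f_i(x^k) = \nabla f(x^k)$, which is consistent with the statement $\|\nabla f(x^k) - g^k\| \to 0$ in the convergence result for this method.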


Assumption 3

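(a) $\displaystyle \sum_{k=0}^{\infty} \alpha_k = \infty$.

(b) $\displaystyle \lim_{\ell \to \infty} \sum_{j=0}^{\ell} \frac{j+1}{\ell+1}\, \delta_j = 0$, where $\displaystyle \delta_j := \max_{i=0,1,\dots,m} \|x^{k+i} - x^{k+m}\| \,\Big|_{k=jm-1}$ (with $x^{-1} = x^0$).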


Global Convergence

Let $\{x^k\}$, $\{d^k\}$, $\{H^k\}$, $\{\alpha_k\}$ be sequences generated by Algorithm 3 (the 1-memory method) under Assumptions 1(b), 2, and 3. Then the following results hold.

(a) $\|x^{k+1} - x^k\| \to 0$ and $\|\nabla f(x^k) - g^k\| \to 0$.

(b) $\liminf_{k \to \infty} \|d^k\| = 0$.

(c) If $\{x^k\}$ is bounded, then there exists a cluster point of $\{x^k\}$ that is a stationary point.

III. (Linearized) Alternating Direction Method of Multipliers

Total Variation Regularized Linear Least Squares Problem

Image restoration tasks such as image deconvolution, image inpainting and image denoising are often formulated as an inverse problem

$$b = Au + \eta,$$

unknown true image $u \in \mathbb{R}^n$.

observed image (or measurements) $b \in \mathbb{R}^\ell$.

$\eta$ is Gaussian noise.

$A \in \mathbb{R}^{\ell \times n}$ is a linear operator, typically a convolution operator in deconvolution, a projection in inpainting and the identity in denoising.

The unknown image can be recovered by solving the TV regularized linear least squares problem (Primal):

$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2}\|Ax - b\|_2^2 + \mu\|\nabla x\|,$$

where $\mu > 0$ and $\|\nabla x\| = \sum_{i=1}^{n} \|(\nabla x)_i\|_2$ with $(\nabla x)_i \in \mathbb{R}^2$.

Gaussian Noise

Figure: denoising

(a) original: 512 × 512 (b) noised image (c) recovered image

Gaussian Noise

Figure: deblurring

(a) original: 512 × 512 (b) motion blurred image (c) recovered image

Gaussian Noise

Figure: inpainting

(a) original: 512 × 512 (b) scratched image (c) recovered image

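Algorithms: Saddle Point Formulation and Dual Problem

1. Dual Norm:

$$\min_u \max_{\|p\|_* \le 1} \; \langle \nabla u, p \rangle + \frac{1}{2}\|Au - b\|_2^2.$$

2. Convex Conjugate (Legendre-Fenchel transform) of $J(u) = \|\nabla u\|$:

$$\min_u \sup_p \; \langle p, \nabla u \rangle - J^*(p) + \frac{1}{2}\|Au - b\|_2^2,$$

where $J^*(p) = \sup_z \langle p, z \rangle - J(z)$.

3. Lagrangian Function:

$$\max_p \inf_{u,z} \; L_1(u, p, z),$$

where

$$L_1(u, p, z) := \frac{1}{2}\|Au - b\|_2^2 + \|z\| + \langle p, \nabla u - z \rangle.$$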

Algorithms

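4. Dual Problem:

$$\max_p \Bigl\{ \inf_u \; \langle p, \nabla u \rangle - J^*(p) + \frac{1}{2}\|Au - b\|_2^2 \;=\; -J^*(p) - H^*(\operatorname{div} p) \Bigr\},$$

where $H(u) = \frac{1}{2}\|Au - b\|_2^2$.

5. Lagrangian Function with $y = \operatorname{div} p$:

$$\max_u \inf_{p,y} \; L_2(u, p, y),$$

where

$$L_2(u, p, y) := J^*(p) + H^*(y) + \langle u, \operatorname{div} p - y \rangle.$$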

ADMM

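Alternating Direction Method of Multipliers [7]

Augmented Lagrangian Function:

$$L_\alpha(u, p, z) := \frac{1}{2}\|Au - b\|_2^2 + \|z\| + \langle p, \nabla u - z \rangle + \frac{\alpha}{2}\|\nabla u - z\|_2^2$$

$$\begin{aligned}
u^{k+1} &= \arg\min_u \; \frac{1}{2}\|Au - b\|_2^2 + \langle p^k, \nabla u - z^k \rangle + \frac{\alpha}{2}\|\nabla u - z^k\|_2^2 \\
z^{k+1} &= \arg\min_z \; \|z\| + \langle p^k, \nabla u^{k+1} - z \rangle + \frac{\alpha}{2}\|\nabla u^{k+1} - z\|_2^2 \\
p^{k+1} &= p^k + \alpha(\nabla u^{k+1} - z^{k+1})
\end{aligned}$$

The multiplier step is written as an ascent step on $L_\alpha$ in p, i.e., along $\nabla u^{k+1} - z^{k+1}$, so that it matches the sign of the term $\langle p, \nabla u - z \rangle$.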
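For a one-dimensional signal with $A = I$ (denoising), the three ADMM steps become a tridiagonal solve, a soft-thresholding, and a multiplier update. The sketch below is written under several assumptions not taken from the slides: the regularization weight $\mu$ multiplies $\|z\|$ (as in the primal problem), $\nabla$ is the forward difference `D`, the multiplier is updated along $\nabla u - z$ (the ascent direction for the term $\langle p, \nabla u - z\rangle$), and all function names and parameter values are illustrative.

```python
def soft(t, tau):
    # Soft-thresholding: minimizer of tau*|z| + 0.5*(z - t)^2.
    return max(abs(t) - tau, 0.0) * (1.0 if t > 0 else -1.0)

def solve_tridiag(lower, diag, upper, rhs):
    # Thomas algorithm; lower[i] multiplies x[i-1], upper[i] multiplies x[i+1].
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        den = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / den if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / den
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def tv_denoise_admm(b, mu, alpha=1.0, iters=300):
    """ADMM sketch for min_u 0.5*||u - b||^2 + mu*sum_i |u[i+1] - u[i]|."""
    n = len(b)                                   # requires n >= 2

    def D(u):                                    # forward difference, length n-1
        return [u[i + 1] - u[i] for i in range(n - 1)]

    def Dt(v):                                   # adjoint of D, length n
        return [(v[i - 1] if i > 0 else 0.0) - (v[i] if i < n - 1 else 0.0)
                for i in range(n)]

    # I + alpha * D^T D is tridiagonal (path-graph Laplacian plus identity).
    diag = [1.0 + alpha * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    off = [-alpha] * n
    u, z, p = list(b), [0.0] * (n - 1), [0.0] * (n - 1)
    for _ in range(iters):
        # u-step: (I + alpha D^T D) u = b + D^T (alpha z - p)
        w = Dt([alpha * zi - pi for zi, pi in zip(z, p)])
        u = solve_tridiag(off, diag, off, [bi + wi for bi, wi in zip(b, w)])
        Du = D(u)
        # z-step: shrink D u + p/alpha with threshold mu/alpha
        z = [soft(dui + pi / alpha, mu / alpha) for dui, pi in zip(Du, p)]
        # multiplier step along (D u - z)
        p = [pi + alpha * (dui - zi) for pi, dui, zi in zip(p, Du, z)]
    return u

u_jump = tv_denoise_admm([0.0, 4.0], mu=1.0)   # large jump: shrinks to (1, 3)
u_flat = tv_denoise_admm([0.0, 1.0], mu=1.0)   # small jump: flattened to (0.5, 0.5)
```

The two-pixel examples can be verified by hand from the optimality conditions, which makes them a convenient sanity check. For images, the u-step involves $A^T A + \alpha \nabla^T \nabla$ (a Laplacian-type operator), which is the inversion that the linearized variant later in this section is designed to avoid.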

ADMM on primal (or 3) is equivalent to Douglas-Rachford [8] on dual (4). ADMM on dual (or 5) is equivalent to Douglas-Rachford on primal.

[7] E. Esser, X. Zhang, and T. Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science, SIAM J. Imaging Sci., 2010

[8] J. Douglas and H. H. Rachford, On the numerical solution of heat conduction problems in two or three space variables, Trans. Amer. Math. Soc., 1956

ADMM: Alternating Minimization Algorithm & Forward-Backward Splitting Method

AMA [9] on primal (or 3):

$$\begin{aligned}
u^{k+1} &= \arg\min_u \; \frac{1}{2}\|Au - b\|_2^2 + \langle p^k, \nabla u - z^k \rangle \\
z^{k+1} &= \arg\min_z \; \|z\| + \langle p^k, \nabla u^{k+1} - z \rangle + \frac{\alpha}{2}\|\nabla u^{k+1} - z\|_2^2 \\
p^{k+1} &= p^k + \alpha(\nabla u^{k+1} - z^{k+1})
\end{aligned}$$

FBS [10] on dual (4):

$$\begin{aligned}
u^{k+1} &= \arg\min_u \; \frac{1}{2}\|Au - b\|_2^2 + \langle p^k, \nabla u \rangle \\
p^{k+1} &= \arg\min_p \; J^*(p) - \langle p, \nabla u^{k+1} \rangle + \frac{1}{2\alpha}\|p - p^k\|_2^2
\end{aligned}$$

FBS on dual is equivalent to AMA on primal (or 3).

[9] P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM J. Control and Optimization, 1991

[10] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul., 2005

Proximal Splitting Method

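Linearly Constrained Separable Convex Programming Problem:

$$\min_{x,y} \; \{\, f(x) + g(y) \mid Qx = y \,\}.$$

Augmented Lagrangian Function:

$$L_\alpha(x, y, z) := f(x) + g(y) + \langle z, y - Qx \rangle + \frac{\alpha}{2}\|Qx - y\|_2^2.$$

$$\begin{aligned}
x^{k+1} &= \arg\min_x \; L_\alpha(x, y^k, z^k) \\
y^{k+1} &= \arg\min_y \; L_\alpha(x^{k+1}, y, z^k) \\
z^{k+1} &= z^k + \alpha(y^{k+1} - Qx^{k+1})
\end{aligned}$$

The multiplier step follows the sign of the term $\langle z, y - Qx \rangle$ in $L_\alpha$.

If $f(x) = \frac{1}{2}\|Ax - b\|_2^2$:

$$x^{k+1} = (A^T A + \alpha Q^T Q)^{-1}(A^T b + Q^T z^k + \alpha Q^T y^k).$$

If $f(x) = \langle 1, Ax \rangle - \langle b, \log(Ax) \rangle$: need an inner solver or an inversion involving Q and/or A.

If $f(x) = \langle x + b e^{-x}, 1 \rangle$: need an inner solver.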
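For the least-squares case in the list above, the closed-form x-update follows from setting the gradient of $L_\alpha(\cdot, y^k, z^k)$ to zero. The short derivation below is added as a check and uses only the augmented Lagrangian as defined on this slide, with the term $\langle z, y - Qx \rangle$.

```latex
\begin{aligned}
0 &= \nabla_x L_\alpha(x, y^k, z^k)
   = A^T(Ax - b) - Q^T z^k + \alpha\, Q^T (Qx - y^k) \\
\Longleftrightarrow\quad
(A^T A + \alpha\, Q^T Q)\, x &= A^T b + Q^T z^k + \alpha\, Q^T y^k \\
\Longleftrightarrow\quad
x^{k+1} &= (A^T A + \alpha\, Q^T Q)^{-1} \bigl( A^T b + Q^T z^k + \alpha\, Q^T y^k \bigr).
\end{aligned}
```

The matrix $A^T A + \alpha Q^T Q$ must be invertible for this step to be well defined. For the other two choices of f in the list, the stationarity condition is nonlinear in x, which is why an inner solver is needed there.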

Poisson Noise

Figure: deblurring

(a) original (b) blurred image (c) recovered image

Multiplicative Noise


Copyright : Sandia Nat. Lab. (http://www.sandia.gov/RADAR/sar.html)


4M ($2^{22}$) pixels @ 20 sec.





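Goal: avoid the inner iterations or inversions involving the Laplacian operator that algorithms based on the augmented Lagrangian require.

Alternating minimization algorithm (in the notation of the linearly constrained problem $\min \{ f(x) + g(y) \mid Qx = y \}$, with $L_0$ denoting $L_\alpha$ for $\alpha = 0$):

$$\begin{aligned}
x^{k+1} &= \arg\min_x \; L_0(x, y^k, z^k) \\
y^{k+1} &= \arg\min_y \; L_\alpha(x^{k+1}, y, z^k) \\
z^{k+1} &= z^k + \alpha(y^{k+1} - Qx^{k+1})
\end{aligned}$$

Linearized augmented Lagrangian with proximal function:

$$\begin{aligned}
x^{k+1} &= \arg\min_x \; LL_\alpha(x, x^k, y^k, z^k) \\
y^{k+1} &= \arg\min_y \; L_\alpha(x^{k+1}, y, z^k) \\
z^{k+1} &= z^k + \alpha(y^{k+1} - Qx^{k+1})
\end{aligned}$$

$$LL_\alpha(x, x^k, y^k, z^k) := \langle \nabla_x f(x^k), x - x^k \rangle + g(y^k) + \langle z^k, y^k - Qx \rangle + \langle \alpha Q^T (Qx^k - y^k), x - x^k \rangle + \frac{1}{2\delta}\|x - x^k\|_2^2$$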
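To see why the linearized model removes the inner solve, one can minimize $LL_\alpha$ over x in closed form. The derivation below is a sketch based only on the definition of $LL_\alpha$ above, assuming $\delta > 0$ and the Lagrangian term $\langle z, y - Qx \rangle$.

```latex
\begin{aligned}
0 &= \nabla_x LL_\alpha(x, x^k, y^k, z^k)
   = \nabla f(x^k) - Q^T z^k + \alpha\, Q^T (Q x^k - y^k) + \frac{1}{\delta}\,(x - x^k) \\
\Longrightarrow\quad
x^{k+1} &= x^k - \delta \Bigl( \nabla f(x^k) - Q^T z^k + \alpha\, Q^T (Q x^k - y^k) \Bigr).
\end{aligned}
```

The x-step is therefore an explicit gradient-type step with stepsize $\delta$: it needs only products with $Q$, $Q^T$ and one evaluation of $\nabla f$, and no linear system involving $Q^T Q$ (the Laplacian when $Q = \nabla$) has to be solved.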

Thank you!
