Algorithms for Nonsmooth Optimization
Frank E. Curtis, Lehigh University
presented at
Center for Optimization and Statistical Learning,
Northwestern University
2 March 2018

Outline

Motivating Examples
Subdifferential Theory
Fundamental Algorithms
Nonconvex Nonsmooth Functions
General Framework

Motivating Examples

Nonsmooth optimization

In mathematical optimization, one wants to
- minimize an objective
- subject to constraints,

i.e., min_{x ∈ X} f(x).

Why nonsmooth optimization? Nonsmoothness can arise for different reasons:
- physical (phenomena can be nonsmooth): phase changes in materials
- technological (constraints impose nonsmoothness): obstacles in shape design
- methodological (nonsmoothness introduced by the solution method): decompositions; penalty formulations
- numerical (analytically smooth, but practically nonsmooth): "stiff" problems

(Bagirov, Karmitsa, Mäkelä (2014))

Data fitting

  min_{x ∈ R^n} θ(x) + ψ(x), where, e.g., θ(x) = ‖Ax − b‖_2^2 and ψ(x) = ∑_{i=1}^n φ(x_i), with

  φ_1(t) = α|t| / (1 + α|t|),
  φ_2(t) = log(α|t| + 1),
  φ_3(t) = |t|^q, or
  φ_4(t) = α − (α − |t|)_+^2 / α.

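A minimal Python sketch of this composite objective follows; the values α = 2 and q = 0.5 are illustrative choices, not ones prescribed in the talk, and φ_4 follows the reconstruction above.

    import numpy as np

    def theta(x, A, b):
        r = A @ x - b
        return r @ r  # ||Ax - b||_2^2

    def psi(x, phi):
        return sum(phi(t) for t in x)  # separable regularizer

    phi1 = lambda t, a=2.0: a * abs(t) / (1.0 + a * abs(t))
    phi2 = lambda t, a=2.0: np.log(a * abs(t) + 1.0)
    phi3 = lambda t, q=0.5: abs(t) ** q
    phi4 = lambda t, a=2.0: a - max(a - abs(t), 0.0) ** 2 / a  # as reconstructed above

    A = np.array([[1.0, 0.0], [1.0, 1.0]])
    b = np.array([1.0, 2.0])
    x = np.array([0.5, -0.5])
    print([theta(x, A, b) + psi(x, phi) for phi in (phi1, phi2, phi3, phi4)])
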
Clusterwise linear regression (CLR)

Given a dataset of pairs A := {(a_i, b_i)}_{i=1}^l, the goal of CLR is to simultaneously
- partition the dataset into k disjoint clusters, and
- find regression coefficients {(x_j, y_j)}_{j=1}^k for each cluster
in order to minimize the overall error in the fit; e.g.,

  min_{{(x_j, y_j)}} f_k({x_j, y_j}), where f_k({x_j, y_j}) = ∑_{i=1}^l min_{j ∈ {1,...,k}} |x_j^T a_i − y_j − b_i|^p.

This objective is nonconvex (though it is a difference of convex functions).

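A small sketch of evaluating f_k in Python; the synthetic data and the choice p = 1 are illustrative assumptions.

    import numpy as np

    def clr_objective(X, y, a, b, p=1):
        """X: (k, n) slopes, y: (k,) intercepts, a: (l, n) inputs, b: (l,) targets."""
        residuals = np.abs(a @ X.T - y[None, :] - b[:, None]) ** p  # (l, k)
        return residuals.min(axis=1).sum()  # inner min assigns each point a cluster

    rng = np.random.default_rng(0)
    a = rng.standard_normal((20, 3))
    b = a @ np.array([1.0, -2.0, 0.5]) + 1.0
    X, y = rng.standard_normal((2, 3)), np.zeros(2)
    print(clr_objective(X, y, a, b, p=1))
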
Decomposition

Various types of decomposition strategies introduce nonsmoothness.

- Primal decomposition can be used for

    min_{(x_1, x_2, y)} f_1(x_1, y) + f_2(x_2, y),

  where y is the complicating/linking variable; this is equivalent to

    min_y φ_1(y) + φ_2(y), where φ_1(y) := min_{x_1} f_1(x_1, y) and φ_2(y) := min_{x_2} f_2(x_2, y).

  This master problem may be nonsmooth in y.

- Dual decomposition can be used for the same problem, reformulated as

    min_{(x_1, x_2, y_1, y_2)} f_1(x_1, y_1) + f_2(x_2, y_2) s.t. y_1 = y_2.

  The Lagrangian is separable, meaning the dual function decomposes:

    g_1(λ) = inf_{(x_1, y_1)} (f_1(x_1, y_1) + λ^T y_1)
    g_2(λ) = inf_{(x_2, y_2)} (f_2(x_2, y_2) − λ^T y_2).

  The dual problem, to maximize g(λ) = g_1(λ) + g_2(λ), may be nonsmooth in λ.

Dual decomposition with constraints

Consider the nearly separable problem

  min_{(x_1,...,x_m)} ∑_{i=1}^m f_i(x_i)
  s.t. x_i ∈ X_i for all i ∈ {1,...,m}
       ∑_{i=1}^m A_i x_i ≤ b (e.g., a shared resource constraint)

where the last are complicating/linking constraints; "dualizing" leads to

  g(λ) := min_{(x_1,...,x_m)} ∑_{i=1}^m f_i(x_i) + λ^T (∑_{i=1}^m A_i x_i − b)
          s.t. x_i ∈ X_i for all i ∈ {1,...,m}.

Given λ ∈ R^m, the value g(λ) comes from solving separable problems; the dual

  max_{λ ≥ 0} g(λ)

is typically nonsmooth (and people often use poor algorithms!).

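A hedged sketch of one dual-function evaluation: to make the block problems solvable in closed form, we assume linear f_i(x_i) = c_i^T x_i and box sets X_i = [0,1]^n (these assumptions are ours, not the talk's); then ∑_i A_i x_i(λ) − b is a supergradient of the concave function g at λ, which a projected ascent method can follow.

    import numpy as np

    def dual_function(lam, cs, As, b):
        """Return g(lambda) and a supergradient, assuming f_i(x) = c_i^T x
        and X_i = [0,1]^n so each block minimization is closed form."""
        g = -lam @ b
        sg = -b.copy()
        for c, A in zip(cs, As):
            coeff = c + A.T @ lam          # block objective is coeff^T x_i
            x = (coeff < 0).astype(float)  # minimizer over the box [0,1]^n
            g += coeff @ x
            sg += A @ x                    # accumulates sum_i A_i x_i(lam) - b
        return g, sg

    # One projected (super)gradient ascent step on max_{lambda >= 0} g(lambda):
    cs = [np.array([-1.0, -2.0]), np.array([-3.0, -1.0])]
    As = [np.eye(2), np.eye(2)]
    b = np.array([1.0, 1.0])
    lam = np.zeros(2)
    g, sg = dual_function(lam, cs, As, b)
    lam = np.maximum(lam + 0.5 * sg, 0.0)
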
Control of dynamical systems

Consider the discrete-time linear dynamical system

  y_{k+1} = A y_k + B u_k   (state equation)
  z_k = C y_k               (observation equation)

Supposing we want to "design" a control such that

  u_k = X C y_k   (where X is our variable),

consider the "closed-loop system" given by

  y_{k+1} = A y_k + B u_k = A y_k + B X C y_k = (A + B X C) y_k.

A common objective is to minimize a stability measure ρ(A + BXC), which is often a function of the eigenvalues of A + BXC.

Eigenvalue optimization

[Figure: plots of ordered eigenvalues as the matrix is perturbed along a given direction.]

Other sources of nonsmooth optimization problems

- Lagrangian relaxation
- Composite optimization (e.g., penalty methods for "soft constraints")
- Parametric optimization (e.g., for model predictive control)
- Multilevel optimization

Subdifferential Theory

Derivatives

When I teach an optimization class, I always start with the same question:

  What is a derivative? (f : R → R)

The answer I get: "the slope of the tangent line."

[Figure: graph of f with its tangent line at x; slope = f′(x).]

Gradients

Then I ask:

  What is a gradient? (f : R^n → R)

The answer I get: "the direction along which the function increases at the fastest rate."

Derivative vs. gradient

So if a derivative is a magnitude (here, a slope), then why does it generalize in multiple dimensions to something that is a direction?

  (n = 1)  f′(x) = df/dx(x) = ∂f/∂x(x)

  (n ≥ 1)  ∇f(x) = [∂f/∂x_1(x), ..., ∂f/∂x_n(x)]^T

What's important: the magnitude or the direction?

Answer: The gradient is a vector in R^n, which
- has a magnitude (e.g., its 2-norm),
- can be viewed as a direction,
- and gives us a way to compute directional derivatives.

Differentiable f

How should we think about the gradient?

If f is continuously differentiable (i.e., f ∈ C^1), then ∇f(x̄) is the unique vector in the linear (Taylor) approximation of f at x̄:

  x ↦ f(x̄) + ∇f(x̄)^T (x − x̄).

[Figure: graphs of f and of its linearization at x̄. Both are graphs of functions of x!]

Differentiable and convex f

If f ∈ C^1 is convex, then

  f(x) ≥ f(x̄) + ∇f(x̄)^T (x − x̄) for all (x, x̄) ∈ R^n × R^n.

[Figure: the graph of f lies above its linearization at x̄.]

Graphs and epigraphs

There is another interpretation of a gradient that is also useful. First...

What is a graph? A set of points in R^{n+1}, namely, {(x, z) : f(x) = z}.

A related quantity, another set, is the epigraph: {(x, z) : f(x) ≤ z}.

[Figure: the graph {(x, f(x))} and, shaded above it, the epigraph.]

Differentiable and convex f

If f ∈ C^1 is convex, then, for all (x, x̄) ∈ R^n × R^n,

      f(x) ≥ f(x̄) + ∇f(x̄)^T (x − x̄)
  ⇔  f(x) − ∇f(x̄)^T x ≥ f(x̄) − ∇f(x̄)^T x̄
  ⇔  [−∇f(x̄); 1]^T [x; f(x)] ≥ [−∇f(x̄); 1]^T [x̄; f(x̄)].

Note: Given x̄, the vector [−∇f(x̄); 1] is fixed, so the inequality above involves a linear function over R^{n+1} and says that

  the value at any point [x; f(x)] in the graph is at least the value at [x̄; f(x̄)].

Linearization and supporting hyperplane for the epigraph

[Figure: the linearization x ↦ f(x̄) + ∇f(x̄)^T (x − x̄) supports the epigraph of f at the point [x̄; f(x̄)], with normal direction [−∇f(x̄); 1].]

Subgradients (convex f)

Why was that useful? We can generalize this idea when the function is not differentiable somewhere.

A vector g ∈ R^n is a subgradient of a convex f : R^n → R at x̄ ∈ R^n if

      f(x) ≥ f(x̄) + g^T (x − x̄) for all x ∈ R^n
  ⇔  [−g; 1]^T [x; f(x)] ≥ [−g; 1]^T [x̄; f(x̄)].

[Figure: at a kink point [x̄; f(x̄)], several supporting hyperplanes exist, each with normal [−g; 1].]

Subdifferentials

Theorem
If f is convex and differentiable at x̄, then ∇f(x̄) is its unique subgradient at x̄.

In general, the set of all subgradients for a convex f at x̄ is the subdifferential of f at x̄:

  ∂f(x̄) := {g ∈ R^n : g is a subgradient of f at x̄}.

From the definition, it is easily seen that

  x* is a minimizer of f if and only if 0 ∈ ∂f(x*).

What about nonconvex, nonsmooth?

We need to generalize these ideas further:
- directional derivatives,
- subgradients, and
- subdifferentials.

Let's return to this after we discuss some algorithms...

Fundamental Algorithms

A fundamental iteration

Thinking of −∇f(x_k), we have a vector that
- directs us in a direction of descent, and
- vanishes as we approach a minimizer.

Algorithm: Gradient Descent
1: Choose an initial point x_0 ∈ R^n and a stepsize α ∈ (0, 1/L].
2: for k = 0, 1, 2, ... do
3:   if ‖∇f(x_k)‖ ≈ 0, then return x_k
4:   else set x_{k+1} ← x_k − α∇f(x_k)

I call this a fundamental iteration.

Here, we suppose ∇f is Lipschitz continuous, i.e., there exists L ≥ 0 such that

  ‖∇f(x) − ∇f(x̄)‖_2 ≤ L‖x − x̄‖_2 for all (x, x̄) ∈ R^n × R^n
  ⟹ f(x) ≤ f(x̄) + ∇f(x̄)^T (x − x̄) + (L/2)‖x − x̄‖_2^2 for all (x, x̄) ∈ R^n × R^n.

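A minimal Python sketch of this fundamental iteration, assuming ∇f is L-Lipschitz; the quadratic test function, tolerance, and iteration cap are illustrative choices, not from the talk.

    import numpy as np

    def gradient_descent(grad_f, x0, L, tol=1e-8, max_iter=10_000):
        x, alpha = np.asarray(x0, dtype=float), 1.0 / L  # alpha in (0, 1/L]
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) <= tol:   # ||grad f(x_k)|| ~ 0: return x_k
                break
            x = x - alpha * g              # x_{k+1} <- x_k - alpha grad f(x_k)
        return x

    # Example: f(x) = 0.5 x^T Q x with Q = diag(1, 10), so L = 10.
    Q = np.diag([1.0, 10.0])
    x_star = gradient_descent(lambda x: Q @ x, x0=[1.0, 1.0], L=10.0)
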
Convergence of gradient descent

[Figure: at x_k, f lies below the quadratic upper bound f(x_k) + ∇f(x_k)^T (x − x_k) + (L/2)‖x − x_k‖_2^2; minimizing this bound over x gives the gradient descent step.]

Gradient descent for f

Theorem
If ∇f is Lipschitz continuous with constant L > 0 and α ∈ (0, 1/L], then

  ∑_{j=0}^∞ ‖∇f(x_j)‖_2^2 < ∞, which implies {∇f(x_j)} → 0.

Proof.
Let k ∈ N and recall that x_{k+1} − x_k = −α∇f(x_k). Then, since α ∈ (0, 1/L],

  f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (L/2)‖x_{k+1} − x_k‖_2^2
            = f(x_k) − α‖∇f(x_k)‖_2^2 + (α^2 L/2)‖∇f(x_k)‖_2^2
            = f(x_k) − α(1 − αL/2)‖∇f(x_k)‖_2^2
            ≤ f(x_k) − (α/2)‖∇f(x_k)‖_2^2.

Thus, summing over j ∈ {0, ..., k}, one finds

  ∞ > f(x_0) − f_inf ≥ f(x_0) − f(x_{k+1}) ≥ (α/2) ∑_{j=0}^k ‖∇f(x_j)‖_2^2.

Strong convexity

Now suppose that f is c-strongly convex, which means that

  f(x) ≥ f(x̄) + ∇f(x̄)^T (x − x̄) + (c/2)‖x − x̄‖_2^2 for all (x, x̄) ∈ R^n × R^n.

Important consequences of this are that
- f has a unique global minimizer, call it x* with f* := f(x*), and
- the gradient norm grows with the optimality error, in that

    2c(f(x) − f*) ≤ ‖∇f(x)‖_2^2 for all x ∈ R^n.

Strong convexity, lower bound

[Figure: at x_k, f is sandwiched between the quadratic lower bound f(x_k) + ∇f(x_k)^T (x − x_k) + (c/2)‖x − x_k‖_2^2 and the quadratic upper bound with c replaced by L.]

Gradient descent for strongly convex f

Theorem
If ∇f is Lipschitz with constant L > 0, f is c-strongly convex, and α ∈ (0, 1/L], then

  f(x_{j+1}) − f* ≤ (1 − αc)^{j+1} (f(x_0) − f*) for all j ∈ N.

Proof.
Let k ∈ N. Following the previous proof, one finds

  f(x_{k+1}) ≤ f(x_k) − (α/2)‖∇f(x_k)‖_2^2 ≤ f(x_k) − αc(f(x_k) − f*).

Subtracting f* from both sides, one finds

  f(x_{k+1}) − f* ≤ (1 − αc)(f(x_k) − f*).

Applying the result repeatedly over j ∈ {0, ..., k} yields the result.

A fundamental iteration when f is nonsmooth?

What is a fundamental iteration for nonsmooth optimization? Steepest descent!

For convex f, the directional derivative of f at x along s is

  f′(x; s) = max_{g ∈ ∂f(x)} g^T s.

Along which direction is f decreasing at the fastest rate? The solution of an optimization problem!

  min_{‖s‖_2 ≤ 1} f′(x; s) = min_{‖s‖_2 ≤ 1} max_{g ∈ ∂f(x)} g^T s
                           = max_{g ∈ ∂f(x)} min_{‖s‖_2 ≤ 1} g^T s   (von Neumann minimax theorem)
                           = max_{g ∈ ∂f(x)} (−‖g‖_2)
                           = − min_{g ∈ ∂f(x)} ‖g‖_2   ⟹ we need the minimum-norm subgradient.

Main challenge

But, typically, we can only access some g ∈ ∂f(x), not all of ∂f(x).

I would argue: there is no practical fundamental iteration for general nonsmooth optimization (no computable descent direction that vanishes near a minimizer).

What are our options? There are a few ways to design a convergent algorithm:
- algorithmically (e.g., subgradient method)
- iteratively (e.g., cutting plane / bundle methods)
- randomly (e.g., gradient sampling)

Subgradient method

Algorithm: Subgradient method (not descent)
1: Choose an initial point x_0 ∈ R^n.
2: for k = 0, 1, 2, ... do
3:   if a termination condition is satisfied, then return x_k
4:   else compute g_k ∈ ∂f(x_k), choose α_k ∈ R_{>0}, and set x_{k+1} ← x_k − α_k g_k

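A minimal sketch of the method with a classical diminishing stepsize α_k = α/k; since this is not a descent method, the best iterate seen so far is tracked. The test function f(x) = ‖x‖_1 and its subgradient selection sign(x) are illustrative assumptions.

    import numpy as np

    def subgradient_method(f, subgrad_f, x0, alpha=1.0, max_iter=1000):
        x = np.asarray(x0, dtype=float)
        x_best, f_best = x.copy(), f(x)
        for k in range(1, max_iter + 1):
            x = x - (alpha / k) * subgrad_f(x)   # x_{k+1} <- x_k - alpha_k g_k
            if f(x) < f_best:                    # not a descent method, so
                x_best, f_best = x.copy(), f(x)  # track the best iterate
        return x_best, f_best

    # Example: f(x) = ||x||_1, with sign(x) in the subdifferential at any x.
    x_best, f_best = subgradient_method(
        lambda x: np.abs(x).sum(), np.sign, x0=[1.0, -0.5], alpha=0.5)
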
Why not "subgradient descent"?

Consider

  min_{x ∈ R^2} f(x), where f(x_1, x_2) := x_1 + x_2 + max{0, x_1^2 + x_2^2 − 4}.

At x̄ = (0, −2), we have

  ∂f(x̄) = conv{[1; 1], [1; −3]},

but −[1; 1] and −[1; −3] are both directions of ascent for f from x̄! (A numerical check follows.)

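The claim above can be checked numerically; this quick sketch estimates the two directional derivatives by one-sided finite differences (the step 1e-6 is an arbitrary choice).

    import numpy as np

    def f(x):
        return x[0] + x[1] + max(0.0, x[0]**2 + x[1]**2 - 4.0)

    x = np.array([0.0, -2.0])
    t = 1e-6
    for g in (np.array([1.0, 1.0]), np.array([1.0, -3.0])):
        d = -g / np.linalg.norm(g)
        rate = (f(x + t * d) - f(x)) / t
        print(d, rate)  # both estimated rates are positive: ascent directions
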
Decreasing the distance to a solution

The objective f is not the only measure of progress.
- Given an arbitrary subgradient g_k for f at x_k, we have

    f(x) ≥ f(x_k) + g_k^T (x − x_k) for all x ∈ R^n,   (1)

  which means that all points with an objective value lower than f(x_k) lie in the halfspace

    H_k := {x ∈ R^n : g_k^T (x − x_k) ≤ 0}.

- Thus, a small step along −g_k should decrease the distance to a solution.
- (Convexity is crucial for this idea.)

"Algorithmic convergence"

Theorem
If f has a minimizer, ‖g_k‖_2 ≤ G ∈ R_{>0} for all k ∈ N, and the stepsizes satisfy

  ∑_{k=1}^∞ α_k = ∞ and ∑_{k=1}^∞ α_k^2 < ∞,   (2)

then

  lim_{k→∞} { min_{j ∈ {0,...,k}} f_j } = f*.

- An example sequence satisfying (2) is α_k = α/k for k = 1, 2, ....

Proof, lim_{k→∞} { min_{j ∈ {0,...,k}} f_j } = f*, part 1.

Let k ∈ N. By (1), the iterates satisfy

  ‖x_{k+1} − x*‖_2^2 = ‖x_k − α_k g_k − x*‖_2^2
                     = ‖x_k − x*‖_2^2 − 2α_k g_k^T (x_k − x*) + α_k^2 ‖g_k‖_2^2
                     ≤ ‖x_k − x*‖_2^2 − 2α_k (f_k − f*) + α_k^2 ‖g_k‖_2^2.

Applying this inequality recursively, we have

  0 ≤ ‖x_{k+1} − x*‖_2^2 ≤ ‖x_0 − x*‖_2^2 − 2 ∑_{j=0}^k α_j (f_j − f*) + ∑_{j=0}^k α_j^2 ‖g_j‖_2^2,

which implies that

  2 ∑_{j=0}^k α_j (f_j − f*) ≤ ‖x_0 − x*‖_2^2 + ∑_{j=0}^k α_j^2 ‖g_j‖_2^2

  ⟹ min_{j ∈ {0,...,k}} f_j − f* ≤ (‖x_0 − x*‖_2^2 + G^2 ∑_{j=0}^k α_j^2) / (2 ∑_{j=0}^k α_j).   (3)

Proof, lim_{k→∞} { min_{j ∈ {0,...,k}} f_j } = f*, part 2.

Now consider an arbitrary scalar ε > 0. By (2), there exists a nonnegative integer K such that, for all k > K,

  α_k ≤ ε/G^2 and ∑_{j=0}^k α_j ≥ (1/ε)(‖x_0 − x*‖_2^2 + G^2 ∑_{j=0}^K α_j^2).

Then, by (3), it follows that for all k > K we have

  min_{j ∈ {0,...,k}} f_j − f*
    ≤ (‖x_0 − x*‖_2^2 + G^2 ∑_{j=0}^K α_j^2) / (2 ∑_{j=0}^k α_j)
        + (G^2 ∑_{j=K+1}^k α_j^2) / (2 ∑_{j=0}^K α_j + 2 ∑_{j=K+1}^k α_j)
    ≤ (‖x_0 − x*‖_2^2 + G^2 ∑_{j=0}^K α_j^2) / ((2/ε)(‖x_0 − x*‖_2^2 + G^2 ∑_{j=0}^K α_j^2))
        + (G^2 ∑_{j=K+1}^k (ε/G^2) α_j) / (2 ∑_{j=K+1}^k α_j)
    = ε/2 + ε/2 = ε.

The result follows since ε > 0 was chosen arbitrarily.

Cutting plane method

Subgradient methods lose previously computed information in every iteration.
- Suppose, after a sequence of iterates, we have the affine underestimators

    f̂_i(x) = f(x_i) + g_i^T (x − x_i) for all i ∈ {0, ..., k}.

  [Figure: f with the cuts f(x_0) + g_0^T (x − x_0) and f(x_1) + g_1^T (x − x_1) at iterates x_0, x_1, x_2.]

- At iteration k, we can compute the next iterate by solving the master problem (see the sketch below)

    x_{k+1} ← arg min_{x ∈ X} f̂_k(x), where f̂_k(x) := max_{i ∈ {0,...,k}} (f(x_i) + g_i^T (x − x_i)).

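A hedged sketch of the method on a box X = [lo, hi]^n, solving the master problem as a linear program in (x, v) with scipy.optimize.linprog; the box X and the test problem are illustrative assumptions. For the piecewise linear f(x) = ‖x‖_1, it terminates finitely, as the next slide notes.

    import numpy as np
    from scipy.optimize import linprog

    def cutting_plane(f, subgrad_f, x0, lo, hi, tol=1e-8, max_iter=100):
        x = np.asarray(x0, dtype=float)
        n = x.size
        cuts = []
        for _ in range(max_iter):
            cuts.append((x.copy(), f(x), subgrad_f(x)))  # add the new cut
            # minimize v s.t. f(x_i) + g_i^T (x - x_i) <= v for all cuts
            c = np.r_[np.zeros(n), 1.0]
            A_ub = np.array([np.r_[g, -1.0] for (xi, fi, g) in cuts])
            b_ub = np.array([g @ xi - fi for (xi, fi, g) in cuts])
            res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                          bounds=[(lo, hi)] * n + [(None, None)])
            x, v = res.x[:n], res.x[n]       # x_{k+1} and lower bound v_{k+1}
            if f(x) - v <= tol:              # lower bound attained: optimal
                return x
        return x

    # Example: f(x) = ||x||_1 on [-2, 2]^2 with subgradient sign(x).
    x_opt = cutting_plane(lambda x: np.abs(x).sum(), np.sign,
                          x0=[1.5, -1.0], lo=-2.0, hi=2.0)
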
Cutting plane method convergence

The iterates of the cutting plane method yield lower bounds on the optimal value:

  v_{k+1} := min_{x ∈ X} f̂_k(x) ≤ min_{x ∈ X} f(x) =: f*.

Therefore, if v_{k+1} = f(x_{k+1}), then we terminate, since f(x_{k+1}) = f*.
- If f is piecewise linear, then convergence occurs in finitely many iterations!

However, in general, we have the following theorem.

Theorem
The cutting plane method yields {x_k} satisfying {f(x_k)} → f*.

Bundle method

A bundle method attempts to combine the practical advantages of a cutting plane method with the theoretical strengths of a proximal point method.
- Given x_k, consider the regularized master problem

    min_{x ∈ R^n} ( f̂_k(x) + (γ/2)‖x − x_k‖_2^2 ), where f̂_k(x) := max_{i ∈ I_k} (f(x_i) + g_i^T (x − x_i)).

  Here, I_k ⊆ {1, ..., k − 1} indicates a subset of previous iterations.
- This problem is equivalent to the quadratic optimization problem (see the sketch after this list)

    min_{(x,v) ∈ R^n × R} v + (γ/2)‖x − x_k‖_2^2
    s.t. f(x_i) + g_i^T (x − x_i) ≤ v for all i ∈ I_k.

- Only move to a "new" point when a sufficient decrease is obtained.

Convergence rate analyses are limited; O((1/ε) log(1/ε)) for strongly convex f.

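A hedged sketch of one regularized master-problem solve, using scipy.optimize.minimize with SLSQP as a stand-in QP solver; γ and the bundle contents are assumptions for illustration.

    import numpy as np
    from scipy.optimize import minimize

    def solve_master(xk, bundle, gamma):
        """bundle: list of (x_i, f_i, g_i); returns the trial point and v."""
        n = xk.size

        def obj(z):                        # z = (x, v)
            return z[n] + 0.5 * gamma * np.sum((z[:n] - xk) ** 2)

        cons = [{"type": "ineq",           # v - (f_i + g_i^T (x - x_i)) >= 0
                 "fun": lambda z, fi=fi, gi=gi, xi=xi:
                     z[n] - (fi + gi @ (z[:n] - xi))}
                for (xi, fi, gi) in bundle]
        z0 = np.r_[xk, max(fi for (_, fi, _) in bundle)]
        res = minimize(obj, z0, constraints=cons, method="SLSQP")
        return res.x[:n], res.x[n]

A full bundle method would accept the trial point only when the actual decrease f(x_k) − f(x_trial) is a sufficient fraction of the decrease predicted by the model (a serious step), and otherwise add the new cut and remain at x_k (a null step).
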
Bundle method convergence

The analysis makes use of the Moreau-Yosida regularization function

  f_γ(x) = min_{x̄ ∈ R^n} ( f(x̄) + (1/(2γ))‖x̄ − x‖_2^2 ).

Theorem
If x_k is not a minimizer, then f_γ(x_k) < f(x_k).

Theorem
For all (k, j) ∈ N × N in a bundle method,

  v_{k,j} + (1/(2γ))‖x_{k,j} − x_k‖_2^2 ≤ f_γ(x_k) < f(x_k).

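For a concrete instance of the first theorem: with f(x) = |x| on R, the minimization defining f_γ has the closed-form soft-thresholding solution, so the envelope can be evaluated directly. This worked example is ours, not from the slides.

    import numpy as np

    def moreau_envelope_abs(x, gamma):
        """f_gamma for f = |.|; the argmin is the soft-thresholding point."""
        prox = np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)
        return np.abs(prox) + (x - prox) ** 2 / (2.0 * gamma)

    # f_gamma(x) < f(x) = |x| at every x != 0 (the minimizer), e.g.:
    print(moreau_envelope_abs(1.0, gamma=0.5), abs(1.0))  # 0.75 < 1.0
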
Nonconvex Nonsmooth Functions

Clarke subdifferential

What if f is nonconvex and nonsmooth? What are subgradients?

We still need some structure; we assume that
- f is locally Lipschitz, and
- f is differentiable on a full-measure set D.

The Clarke subdifferential of f at x is

  ∂f(x) = conv { lim_{j→∞} ∇f(x_j) : x_j → x and x_j ∈ D },

i.e., the convex hull of limits of gradients of f at points in D converging to x.

Theorem
If f is continuously differentiable at x, then ∂f(x) = {∇f(x)}.

Differentiable, but nonsmooth

Theorem
If f is differentiable at x, then {∇f(x)} ⊆ ∂f(x) (not necessarily equal).

Considering

  f(x) = x^2 cos(1/x) if x ≠ 0, and f(0) = 0,

one finds that f′(0) = 0, yet [−1, 1] ⊆ ∂f(0).

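A quick numerical illustration of this example: along the sequences x_j → 0 chosen below, the derivative values f′(x_j) = 2 x_j cos(1/x_j) + sin(1/x_j) approach sin(t) for any fixed phase t, so the limits of gradients fill [−1, 1].

    import numpy as np

    def fprime(x):
        return 2.0 * x * np.cos(1.0 / x) + np.sin(1.0 / x)  # valid for x != 0

    # Along x_j = 1/(2 pi j + t), sin(1/x_j) equals sin(t) exactly:
    for t in (np.pi / 2, -np.pi / 2, 0.0):
        xj = 1.0 / (2.0 * np.pi * np.arange(1, 5) + t)
        print(np.round(fprime(xj), 3))  # values approach sin(t) = 1, -1, 0
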
Clarke ε-subdifferential and gradient sampling

As before, we typically cannot compute ∂f(x). It is approximated by the Clarke ε-subdifferential, namely,

  ∂_ε f(x) = conv{∂f(B(x, ε))},

which in turn can be approximated as in

  ∂_ε f(x) ≈ conv{∇f(x_k), ∇f(x_{k,1}), ..., ∇f(x_{k,m})}, where {x_{k,1}, ..., x_{k,m}} ⊂ B(x_k, ε).

In gradient sampling, we compute the minimum-norm element in

  conv{∇f(x_k), ∇f(x_{k,1}), ..., ∇f(x_{k,m})},

which is equivalent to solving

  min_{(x,v) ∈ R^n × R} v + ‖x − x_k‖_2^2
  s.t. f(x_k) + ∇f(x_{k,i})^T (x − x_k) ≤ v for all i ∈ {1, ..., m}.

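A hedged sketch of the sampling step: gradients are collected at x_k and at m nearby points, and the minimum-norm element of their convex hull is found by minimizing ‖G^T w‖^2 over the simplex. SLSQP is used as a stand-in QP solver, and the points are sampled in a box of radius ε rather than the ball, for simplicity; both choices are ours.

    import numpy as np
    from scipy.optimize import minimize

    def min_norm_subgradient(grad_f, xk, eps, m, rng=np.random.default_rng(0)):
        # Randomly sampled points are differentiable points of a locally
        # Lipschitz f with probability one.
        pts = xk + eps * rng.uniform(-1.0, 1.0, size=(m, xk.size))
        G = np.vstack([grad_f(xk)] + [grad_f(p) for p in pts])  # rows: gradients
        k = G.shape[0]

        def sqnorm(w):                      # ||sum_i w_i grad_i||^2
            v = G.T @ w
            return v @ v

        res = minimize(sqnorm, np.full(k, 1.0 / k),
                       bounds=[(0.0, None)] * k,
                       constraints=[{"type": "eq",
                                     "fun": lambda w: w.sum() - 1.0}],
                       method="SLSQP")
        return G.T @ res.x  # approx. min-norm element of conv{gradients}
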
General Framework

Popular and effective method

Despite all I've talked about, a very effective method: BFGS.

Approximate second-order information with gradient displacements:

[Figure: the displacement from x_k to x_{k+1}.]

Secant equation H_k y_k = s_k to match the gradient of f at x_k, where

  s_k := x_{k+1} − x_k and y_k := ∇f(x_{k+1}) − ∇f(x_k).

BFGS-type updates

Inverse Hessian and Hessian approximation updating formulas (requiring s_k^T v_k > 0):

  W_{k+1} ← (I − (v_k s_k^T)/(s_k^T v_k))^T W_k (I − (v_k s_k^T)/(s_k^T v_k)) + (s_k s_k^T)/(s_k^T v_k)

  H_{k+1} ← (I − (s_k s_k^T H_k)/(s_k^T H_k s_k))^T H_k (I − (s_k s_k^T H_k)/(s_k^T H_k s_k)) + (v_k v_k^T)/(s_k^T v_k)

With an appropriate technique for choosing v_k, we attain
- self-correcting properties for {H_k} and {W_k}, and
- (inverse) Hessian approximations that can be used in other algorithms.

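A direct transcription of the W_{k+1} update in Python; choosing v_k = y_k recovers the standard BFGS inverse update, while the talk's self-correcting choice of v_k is omitted here.

    import numpy as np

    def bfgs_inverse_update(W, s, v):
        """W_{k+1} from W_k, step s_k, and v_k, assuming s^T v > 0."""
        rho = 1.0 / (s @ v)
        I = np.eye(s.size)
        V = I - rho * np.outer(v, s)        # I - v s^T / (s^T v)
        return V.T @ W @ V + rho * np.outer(s, s)

    # Sanity check: the update enforces W_{k+1} v_k = s_k.
    rng = np.random.default_rng(0)
    W = np.eye(3)
    s, v = rng.standard_normal(3), rng.standard_normal(3)
    if s @ v > 0:
        print(np.allclose(bfgs_inverse_update(W, s, v) @ v, s))  # True
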
Subproblems in nonsmooth optimization algorithms

With sets of points, scalars, and (sub)gradients

  {x_{k,j}}_{j=1}^m, {f_{k,j}}_{j=1}^m, {g_{k,j}}_{j=1}^m,

nonsmooth optimization methods involve the primal subproblem

  min_{x ∈ R^n} ( max_{j ∈ {1,...,m}} {f_{k,j} + g_{k,j}^T (x − x_{k,j})} + (1/2)(x − x_k)^T H_k (x − x_k) )
  s.t. ‖x − x_k‖ ≤ δ_k,   (P)

but, with G_k ← [g_{k,1} ··· g_{k,m}], it is typically more efficient to solve the dual

  sup_{(ω,γ) ∈ R^m_+ × R^n} −(1/2)(G_k ω + γ)^T W_k (G_k ω + γ) + b_k^T ω − δ_k‖γ‖_*
  s.t. 1_m^T ω = 1.   (D)

The primal solution can then be recovered by

  x_k* ← x_k − W_k g_k, where g_k := G_k ω_k + γ_k.

Algorithm: Self-Correcting Variable-Metric Algorithm for Nonsmooth Optimization
1: Choose x_1 ∈ R^n.
2: Choose a symmetric positive definite W_1 ∈ R^{n×n}.
3: Choose α ∈ (0, 1).
4: for k = 1, 2, ... do
5:   Solve (P)-(D) such that setting
       G_k ← [g_{k,1} ··· g_{k,m}],
       s_k ← −W_k(G_k ω_k + γ_k), and
       x_{k+1} ← x_k + s_k
6:   yields
       f(x_{k+1}) ≤ f(x_k) − (α/2)(G_k ω_k + γ_k)^T W_k (G_k ω_k + γ_k).
7:   Choose v_k (details omitted, but very simple).
8:   Set
       W_{k+1} ← (I − (v_k s_k^T)/(s_k^T v_k))^T W_k (I − (v_k s_k^T)/(s_k^T v_k)) + (s_k s_k^T)/(s_k^T v_k).

Instances of the framework

Cutting plane / bundle methods
- Points added incrementally until sufficient decrease is obtained
- Finite number of additions until an accepted step

Gradient sampling methods
- Points added randomly / incrementally until sufficient decrease is obtained
- Sufficient number of iterations with "good" steps

In any case: convergence guarantees require {W_k} to be uniformly positive definite and bounded on a sufficient number of accepted steps.

C++ implementation: NonOpt

BFGS with weak Wolfe line search:

Name                Exit        ε_end      f(x_end)   #iter  #func  #grad  #subs
maxq                Stationary  +9.77e-05  +2.26e-07    450   1017    452    451
mxhilb              Stepsize    +3.13e-03  +9.26e-02    101   1886    113    102
chained lq          Stepsize    +5.00e-02  -6.93e+01    205   4754    207    206
chained cb3 1       Stepsize    +1.00e-01  +9.80e+01    347   7469    348    348
chained cb3 2       Stepsize    +1.00e-01  +9.80e+01     64   1496     69     65
active faces        Stepsize    +2.50e-02  +2.22e-16     24    672     27     25
brown function 2    Stepsize    +1.00e-01  +2.04e-05    395  17259    396    396
chained mifflin 2   Stepsize    +5.00e-02  -3.47e+01    476  10808    508    477
chained crescent 1  Stepsize    +1.00e-01  +2.18e-01     74   2278     91     75
chained crescent 2  Stepsize    +1.00e-01  +5.86e-02    313   7585    334    314

Bundle method with self-correcting properties:

Name                Exit        ε_end      f(x_end)   #iter  #func  #grad  #subs
maxq                Stationary  +9.77e-05  +1.04e-06    193    441    635    440
mxhilb              Stationary  +9.77e-05  +2.25e-05     39    338    351    137
chained lq          Stationary  +9.77e-05  -6.93e+01     29    374    398    366
chained cb3 1       Stationary  +9.77e-05  +9.80e+01     50   1038   1069   1017
chained cb3 2       Stationary  +9.77e-05  +9.80e+01     29    174    204    173
active faces        Stationary  +9.77e-05  +2.09e-02     17    387    165     32
brown function 2    Stationary  +9.77e-05  +2.49e-03    232  10094   9674   9438
chained mifflin 2   Stationary  +9.77e-05  -3.48e+01    393  24410  19493  18924
chained crescent 1  Stationary  +9.77e-05  +2.73e-04     30     66     92     59
chained crescent 2  Stationary  +9.77e-05  +4.36e-05    137   6679   6140   5997

Thanks!

NonOpt coming soon...
- Andreas could finish in a day...
- ...what has taken me 6 months on sabbatical, so
- it'll be done when he has a free day ;-)

Thanks for listening!