A short first seminar on subgradient and bundle methods for nonsmooth optimization.

Subgradient and Bundle Methods
for optimization of convex non-smooth functions

April 1, 2009

Motivation

Many naturally occurring problems are nonsmooth:
- Hinge loss
- Feasible region of a convex minimization problem
- Piecewise linear functions

If a function approximates a non-smooth function, it may be analytically smooth but "numerically nonsmooth".

Methods for nonsmooth optimization

- Approximate by a series of smooth functions
- Reformulate the problem, adding more constraints such that the objective is smooth
- Subgradient Methods
- Cutting Plane Methods
- Moreau-Yosida Regularization
- Bundle Methods
- UV decomposition

Definition: An extension of gradients

For a convex differentiable function f(x), for all x, y:

f(y) ≥ f(x) + ∇f(x)^T (y − x)    (1)

So a subgradient of f at x is defined as any g ∈ R^n such that, for all y,

f(y) ≥ f(x) + g^T (y − x)    (2)

The set of all subgradients of f at x is denoted ∂f(x).
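
As a quick illustration (not on the original slides), here is a minimal Python check of inequality (2) for the hinge loss; the value −0.5 used at the kink is just one arbitrary valid choice in [−1, 0].

```python
import numpy as np

def hinge(x):
    """Hinge loss f(x) = max(0, 1 - x); convex and nonsmooth at x = 1."""
    return max(0.0, 1.0 - x)

def hinge_subgradient(x):
    """Return one subgradient g in the subdifferential of the hinge loss at x."""
    if x < 1.0:
        return -1.0
    if x > 1.0:
        return 0.0
    return -0.5          # at the kink, any value in [-1, 0] is a valid subgradient

# Numerically verify the subgradient inequality f(y) >= f(x) + g*(y - x)
rng = np.random.default_rng(0)
for x in rng.uniform(-3, 3, 5):
    g = hinge_subgradient(x)
    assert all(hinge(y) >= hinge(x) + g * (y - x) - 1e-12
               for y in rng.uniform(-3, 3, 100))
print("subgradient inequality holds at all sampled points")
```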

Some Facts: From Convex Analysis

- A convex function is always subdifferentiable, i.e. the subgradient of a convex function exists at every point. Directional derivatives also exist at every point.
- If a convex function f is differentiable at x, its subgradient is the gradient at that point, i.e. ∂f(x) = {∇f(x)}.
- Subgradients give lower bounds on directional derivatives: f′(x; d) = sup_{g ∈ ∂f(x)} ⟨g, d⟩.
- Further, d is a descent direction iff g^T d < 0 for all g ∈ ∂f(x).
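
For concreteness, a small worked example of the last two facts (my addition, written in LaTeX):

```latex
% f(x) = |x| at x = 0:
\partial f(0) = [-1, 1], \qquad
f'(0; d) = \sup_{g \in \partial f(0)} \langle g, d \rangle = |d| .
% No d \neq 0 is a descent direction: taking g = \operatorname{sign}(d) \in \partial f(0)
% gives g^{T} d = |d| > 0, consistent with x = 0 being the minimizer.
```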

Properties: Without Proof

∂(f1 + f2)(x) = ∂f1(x) + ∂f2(x)

∂(αf)(x) = α ∂f(x)  (for α ≥ 0)

g(x) = f(Ax + b) ⇒ ∂g(x) = A^T ∂f(Ax + b)

Local minimum ⇒ 0 ∈ ∂f(x). However, for f(x) = |x| the oracle returns the subgradient 0 only at x = 0, so checking the returned subgradient is not a good way to find minima.
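
A small numerical sanity check of the affine-composition rule above (a sketch I added; the matrix A, vector b, and the choice f = ‖·‖₁ are arbitrary illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)

g = lambda x: np.abs(A @ x + b).sum()        # g(x) = f(Ax + b) with f = l1-norm

def subgrad_g(x):
    """One element of A^T * df(Ax + b), using sign(.) as a subgradient of the l1-norm."""
    return A.T @ np.sign(A @ x + b)

# Verify the subgradient inequality g(y) >= g(x) + s^T (y - x) at random points.
x = rng.normal(size=3)
s = subgrad_g(x)
ok = all(g(y) >= g(x) + s @ (y - x) - 1e-10
         for y in rng.normal(size=(200, 3)))
print("chain-rule subgradient is valid:", ok)
```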

Subgradient Method: Algorithm

The Subgradient Method is NOT a descent method!

  x^(k+1) = x^(k) − α_k g^(k),   where α_k ≥ 0 and g^(k) ∈ ∂f(x^(k))
  f_best^(k) = min{ f_best^(k−1), f(x^(k)) }

No line search is performed; the step lengths α_k are usually fixed ahead of time.
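
A minimal Python sketch of this iteration; the test function ‖x − c‖₁, the step schedule, and the iteration count are illustrative choices, not from the slides.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps, n_iter=500):
    """x_{k+1} = x_k - alpha_k * g_k, keeping track of the best value seen."""
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()
    for k in range(n_iter):
        g = subgrad(x)
        x = x - steps(k) * g
        if f(x) < f_best:              # not a descent method: track the best iterate
            f_best, x_best = f(x), x.copy()
    return x_best, f_best

# Example: f(x) = ||x - c||_1, minimized at c.
c = np.array([1.0, -2.0, 3.0])
f = lambda x: np.abs(x - c).sum()
subgrad = lambda x: np.sign(x - c)
x_best, f_best = subgradient_method(f, subgrad, np.zeros(3),
                                    steps=lambda k: 1.0 / np.sqrt(k + 1))
print(x_best, f_best)
```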

Step Lengths

Commonly used step lengths:
- Constant step size: α_k = α.
- Constant step length: α_k = γ/‖g^(k)‖₂ (so that ‖x^(k+1) − x^(k)‖₂ = γ).
- Square summable but not summable step size: α_k ≥ 0, Σ_{k=1}^∞ α_k² < ∞, Σ_{k=1}^∞ α_k = ∞.
- Nonsummable diminishing: α_k ≥ 0, lim_{k→∞} α_k = 0, Σ_{k=1}^∞ α_k = ∞.
- Nonsummable diminishing step lengths: α_k = γ_k/‖g^(k)‖₂ with γ_k ≥ 0, lim_{k→∞} γ_k = 0, Σ_{k=1}^∞ γ_k = ∞.
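
The same schedules written as small Python step-size functions that could be plugged into the loop above (the constants 0.1 and 1.0 are placeholder values for α and γ):

```python
import numpy as np

# k is the (0-based) iteration counter, g is the current subgradient.
constant_step_size   = lambda k, g: 0.1                          # alpha_k = alpha
constant_step_length = lambda k, g: 0.1 / np.linalg.norm(g)      # ||alpha_k g|| = gamma
square_summable      = lambda k, g: 1.0 / (k + 1)                # sum alpha_k^2 < inf, sum alpha_k = inf
nonsummable_dimin    = lambda k, g: 1.0 / np.sqrt(k + 1)         # alpha_k -> 0, sum alpha_k = inf
dimin_step_length    = lambda k, g: (1.0 / np.sqrt(k + 1)) / np.linalg.norm(g)
```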

Convergence Result

Assume there exists G such that the norms of the subgradients are bounded, i.e. ‖g^(k)‖₂ ≤ G (for example, if f is Lipschitz continuous with constant G).

Result:

  f_best^(k) − f* ≤ ( dist(x^(1), X*)² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

The proof works by bounding how ‖x^(k) − x*‖² decreases from one iteration to the next.

Convergence for Commonly used Step lengths

- Constant step size: f_best^(k) converges to within G²h/2 of optimal.
- Constant step length: f_best^(k) converges to within Gh of optimal.
- Square summable but not summable step size: f_best^(k) → f*.
- Nonsummable diminishing: f_best^(k) → f*.
- Nonsummable diminishing step lengths: f_best^(k) → f*.

With R = dist(x^(1), X*):

  f_best^(k) − f* ≤ ( R² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

So the optimal step sizes are α_i = R/(G√k), and the method converges to accuracy ε in (RG/ε)² steps.
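
To make the last step explicit (a standard calculation the slide leaves implicit): with a constant step α_i = α the bound becomes (R² + G²kα²)/(2kα); minimizing over α and then asking for accuracy ε gives

```latex
\min_{\alpha>0}\;\frac{R^{2}+G^{2}k\alpha^{2}}{2k\alpha}
  \;=\;\frac{RG}{\sqrt{k}}
  \quad\text{at}\quad \alpha=\frac{R}{G\sqrt{k}},
\qquad
\frac{RG}{\sqrt{k}}\le\varepsilon
  \;\Longleftrightarrow\;
  k\ge\Bigl(\frac{RG}{\varepsilon}\Bigr)^{2}.
```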

Variations

- If the optimal value f* is known (e.g. it is known to be 0, but the minimizer is not), use Polyak's step size:

    α_k = ( f(x^(k)) − f* ) / ‖g^(k)‖₂²

- Projected Subgradient: to minimize f(x) s.t. x ∈ C, iterate x^(k+1) = P(x^(k) − α_k g^(k)), where P is the projection onto C.
- Alternating projections: find a point in the intersection of 2 convex sets.
- Heavy Ball method: x^(k+1) = x^(k) − α_k g^(k) + β_k (x^(k) − x^(k−1)).
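
A sketch of the projected subgradient variant; the feasible set (a Euclidean ball), the projection, the test problem, and the step schedule are my own illustrative choices.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def projected_subgradient(f, subgrad, project, x0, n_iter=1000):
    """x_{k+1} = P(x_k - alpha_k g_k) with a diminishing step size."""
    x = project(np.asarray(x0, dtype=float))
    x_best, f_best = x.copy(), f(x)
    for k in range(n_iter):
        x = project(x - (1.0 / np.sqrt(k + 1)) * subgrad(x))
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# minimize ||x - c||_1 subject to ||x||_2 <= 1 (the unconstrained minimizer c is infeasible)
c = np.array([2.0, 0.0])
x_best, f_best = projected_subgradient(lambda x: np.abs(x - c).sum(),
                                       lambda x: np.sign(x - c),
                                       project_ball, np.zeros(2))
print(x_best, f_best)   # should approach x = (1, 0)
```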

Pros
- Can be applied immediately to a wide variety of problems, especially when the required accuracy is not very high.
- Low memory usage.
- Often possible to design distributed methods if the objective is decomposable.

Cons
- Slower than second-order methods.

Cutting Plane Method

- Again, consider the problem: minimize f(x) subject to x ∈ C.
- Construct an approximate model:

    f̂(x) = max_{i∈I} { f(x_i) + g_i^T (x − x_i) }

- Minimize the model over x, then evaluate f and a subgradient g at the new point.
- Update the model and repeat until the desired accuracy is reached.
- Numerically unstable: successive iterates can jump around wildly.
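
A minimal sketch of this loop, restricted to a box so that minimizing the piecewise-linear model is a bounded LP solved with scipy.optimize.linprog; the random piecewise-linear test function, the box, and the tolerance are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Nonsmooth convex test function: f(x) = max_j (a_j . x + b_j), subgradient = a_{argmax}.
rng = np.random.default_rng(0)
A_pieces = rng.normal(size=(8, 2))
b_pieces = rng.normal(size=8)
f = lambda x: np.max(A_pieces @ x + b_pieces)
subgrad = lambda x: A_pieces[np.argmax(A_pieces @ x + b_pieces)]

def cutting_plane(x0, box=(-5.0, 5.0), tol=1e-6, max_iter=100):
    n = len(x0)
    cuts = []                                   # each cut: f(x_i) + g_i^T (x - x_i) <= t
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        cuts.append((f(x), subgrad(x), x.copy()))
        # minimize t over (x, t):  g_i^T x - t <= g_i^T x_i - f(x_i)  for every cut
        A_ub = np.array([np.append(g, -1.0) for _, g, _ in cuts])
        b_ub = np.array([g @ xi - fi for fi, g, xi in cuts])
        res = linprog(c=np.append(np.zeros(n), 1.0), A_ub=A_ub, b_ub=b_ub,
                      bounds=[box] * n + [(None, None)], method="highs")
        x, model_val = res.x[:n], res.x[n]
        if f(x) - model_val < tol:              # model value is a lower bound on min f
            break
    return x, f(x)

print(cutting_plane(np.zeros(2)))
```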

Moreau-Yosida Regularization

- Idea: solve a series of smooth convex problems to minimize f(x).

    F(x) = min_{y∈R^n} { f(y) + (λ/2) ‖y − x‖² }
    p(x) = argmin_{y∈R^n} { f(y) + (λ/2) ‖y − x‖² }

- F(x) is differentiable! ∇F(x) = λ(x − p(x))
- The minimization is done using the dual.
- Cutting Plane Method + Moreau-Yosida Regularization = Bundle Methods.
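
A concrete 1-D illustration (my own example, not from the slides): for f(x) = |x| the prox point p(x) is soft-thresholding, and the gradient formula ∇F(x) = λ(x − p(x)) can be checked against finite differences.

```python
import numpy as np

lam = 2.0   # the lambda of the slide; an arbitrary illustrative value

def prox_abs(x, lam):
    """p(x) = argmin_y |y| + (lam/2)(y - x)^2  (soft-thresholding with threshold 1/lam)."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / lam, 0.0)

def moreau_envelope_abs(x, lam):
    """F(x) = min_y |y| + (lam/2)(y - x)^2, evaluated at the prox point."""
    p = prox_abs(x, lam)
    return np.abs(p) + 0.5 * lam * (p - x) ** 2

# Check grad F(x) = lam * (x - p(x)) against central finite differences.
xs = np.linspace(-2.0, 2.0, 9)
grad = lam * (xs - prox_abs(xs, lam))
h = 1e-6
fd = (moreau_envelope_abs(xs + h, lam) - moreau_envelope_abs(xs - h, lam)) / (2 * h)
print(np.max(np.abs(grad - fd)))   # small: F is differentiable even though |x| is not
```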

Elementary Bundle Method

- As before, f is assumed to be Lipschitz continuous.
- At a generic iteration we maintain a "bundle" of tuples ⟨y_i, f(y_i), s_i, α_i⟩: trial points, their function values, subgradients s_i ∈ ∂f(y_i), and linearization errors α_i.

Follow the Cutting Plane Method, but use M-Y regularization when minimizing the model:

  y_{k+1} = argmin_{y∈R^n} f̂_k(y) + (µ_k/2) ‖y − x̂_k‖²
  δ_k = f(x̂_k) − [ f̂_k(y_{k+1}) + (µ_k/2) ‖y_{k+1} − x̂_k‖² ] ≥ 0

- If δ_k < δ, stop.
- If f(x̂_k) − f(y_{k+1}) ≥ m δ_k: Serious Step, x̂_{k+1} = y_{k+1}; else: Null Step, x̂_{k+1} = x̂_k.
- Update the model: f̂_{k+1}(y) = max{ f̂_k(y), f(y_{k+1}) + ⟨s_{k+1}, y − y_{k+1}⟩ }.
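
A compact 1-D sketch of the serious/null-step loop above. It is not the presentation's exact method: the prox subproblem is solved numerically with scipy.optimize.minimize_scalar, µ_k is held fixed, and the test function and parameters (µ, m, δ) are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: abs(x - 1.0) + 0.5 * abs(x + 2.0)                 # nonsmooth convex test function
subgrad = lambda x: np.sign(x - 1.0) + 0.5 * np.sign(x + 2.0)   # one subgradient (sign(0) = 0)

def bundle_method(x0, mu=1.0, m=0.3, delta_stop=1e-6, max_iter=100):
    xhat = x0
    bundle = [(x0, f(x0), subgrad(x0))]                   # cuts: (y_i, f(y_i), s_i)
    model = lambda y: max(fy + s * (y - yi) for yi, fy, s in bundle)
    for k in range(max_iter):
        # prox (Moreau-Yosida) step on the cutting-plane model
        res = minimize_scalar(lambda y: model(y) + 0.5 * mu * (y - xhat) ** 2,
                              bounds=(xhat - 10.0, xhat + 10.0), method='bounded')
        y_new = res.x
        delta = f(xhat) - (model(y_new) + 0.5 * mu * (y_new - xhat) ** 2)
        if delta < delta_stop:
            break
        if f(xhat) - f(y_new) >= m * delta:               # serious step
            xhat = y_new
        # (else: null step, xhat unchanged)
        bundle.append((y_new, f(y_new), subgrad(y_new)))  # add the new cut in both cases
    return xhat

print(bundle_method(5.0))   # minimizer of f is x = 1 (value 1.5)
```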

Convergence

- Either the algorithm makes a finite number of Serious Steps and then only Null Steps. Then, if k₀ is the last Serious Step and µ_k is nondecreasing, δ_k → 0.
- Or it makes an infinite number of Serious Steps. Then

    Σ_{k∈K_s} δ_k ≤ ( f(x̂_0) − f* ) / m,

  so δ_k → 0 along the Serious Steps K_s.

Variations

- Replace ‖y − x‖² by (y − x)^T M_k (y − x): still differentiable.
- Conjugate gradient methods are obtained as a slight modification of the algorithm (see [5]).
- Variable metric methods [10]; M_k = u_k I for diagonal variable metric methods.
- Bundle-Newton methods.

Summary

- Nonsmooth convex optimization has been explored since the 1960s. The original subgradient methods were introduced by Naum Shor; bundle methods have been developed more recently.
- Subgradient methods are simple but slow unless distributed; distributed optimization is their predominant current application.
- Bundle methods solve a bounded QP per iteration, which is slow, but they need fewer iterations. They are preferred for applications where the oracle cost is high.

For Further Reading I

Naum Z. Shor. Minimization Methods for Non-Differentiable Functions. Springer-Verlag, 1985.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press.

A. Ruszczynski. Nonlinear Optimization. Princeton University Press.

Wikipedia: en.wikipedia.org/wiki/Subgradient_method

For Further Reading II

Marko Makela. Survey of Bundle Methods, 2009. http://www.informaworld.com/smpp/content~db=all~content=a713741700

Alexandre Belloni. An Introduction to Bundle Methods. http://web.mit.edu/belloni/www/LecturesIntroBundle.pdf

John E. Mitchell. Cutting Plane and Subgradient Methods, 2005. http://www.optimization-online.org/DB_HTML/2009/05/2298.html

For Further Reading III

Stephen Boyd. Lecture Notes on Subgradient Methods. http://www.stanford.edu/class/ee392o/subgrad_method.pdf

Alexander J. Smola, S. V. N. Vishwanathan, Quoc V. Le. Bundle Methods for Machine Learning, 2007. http://books.nips.cc/papers/files/nips20/NIPS2007_0470.pdf

C. Lemarechal. Variable Metric Bundle Methods, 1997. http://www.springerlink.com/index/3515WK428153171N.pdf

Quoc Le, Alexander Smola. Direct Optimization of Ranking Measures, 2007. http://arxiv.org/abs/0704.3359

For Further Reading IV

S. V. N. Vishwanathan, A. Smola. Quasi-Newton Methods for Efficient Large-Scale Machine Learning. http://portal.acm.org/ft_gateway.cfm?id=1390309&type=pdf and www.stat.purdue.edu/~vishy/talks/LBFGS.pdf
