A short first seminar on subgradient and bundle methods for nonsmooth optimization.

Subgradient and Bundle Methods
for optimization of convex non-smooth functions

April 1, 2009

Motivation

Many naturally occurring problems are nonsmooth:
- Hinge loss
- Feasible region of a convex minimization problem
- Piecewise linear functions

If a function approximates a non-smooth function, it may be analytically smooth but "numerically nonsmooth".

Methods for nonsmooth optimization

- Approximate by a series of smooth functions
- Reformulate the problem, adding more constraints such that the objective is smooth
- Subgradient Methods
- Cutting Plane Methods
- Moreau-Yosida Regularization
- Bundle Methods
- UV decomposition

Definition: An extension of gradients

For a convex differentiable function f(x), for all x, y:

f(y) ≥ f(x) + ∇f(x)^T (y − x)    (1)

So a subgradient of f at x is defined as any g ∈ R^n such that, for all y,

f(y) ≥ f(x) + g^T (y − x)    (2)

The set of all subgradients of f at x is denoted ∂f(x).
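
As a quick illustration (not on the original slides), here is a minimal Python check of inequality (2) for the hinge loss; the value −0.5 used at the kink is just one arbitrary valid choice in [−1, 0].

```python
import numpy as np

def hinge(x):
    """Hinge loss f(x) = max(0, 1 - x); convex and nonsmooth at x = 1."""
    return max(0.0, 1.0 - x)

def hinge_subgradient(x):
    """Return one subgradient g in the subdifferential of the hinge loss at x."""
    if x < 1.0:
        return -1.0
    if x > 1.0:
        return 0.0
    return -0.5          # at the kink, any value in [-1, 0] is a valid subgradient

# Numerically verify the subgradient inequality f(y) >= f(x) + g*(y - x)
rng = np.random.default_rng(0)
for x in rng.uniform(-3, 3, 5):
    g = hinge_subgradient(x)
    assert all(hinge(y) >= hinge(x) + g * (y - x) - 1e-12
               for y in rng.uniform(-3, 3, 100))
print("subgradient inequality holds at all sampled points")
```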

Some Facts: From Convex Analysis

- A convex function is always subdifferentiable, i.e. the subgradient of a convex function exists at every point. Directional derivatives also exist at every point.
- If a convex function f is differentiable at x, its subgradient is the gradient at that point, i.e. ∂f(x) = {∇f(x)}.
- Subgradients give lower bounds on directional derivatives: f′(x; d) = sup_{g ∈ ∂f(x)} ⟨g, d⟩.
- Further, d is a descent direction iff g^T d < 0 for all g ∈ ∂f(x).
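
For concreteness, a small worked example of the last two facts (my addition, written in LaTeX):

```latex
% f(x) = |x| at x = 0:
\partial f(0) = [-1, 1], \qquad
f'(0; d) = \sup_{g \in \partial f(0)} \langle g, d \rangle = |d| .
% No d \neq 0 is a descent direction: taking g = \operatorname{sign}(d) \in \partial f(0)
% gives g^{T} d = |d| > 0, consistent with x = 0 being the minimizer.
```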

Properties: Without Proof

∂(f1 + f2)(x) = ∂f1(x) + ∂f2(x)

∂(αf)(x) = α ∂f(x)  (for α ≥ 0)

g(x) = f(Ax + b) ⇒ ∂g(x) = A^T ∂f(Ax + b)

Local minimum ⇒ 0 ∈ ∂f(x). However, for f(x) = |x| the oracle returns the subgradient 0 only at x = 0, so checking the returned subgradient is not a good way to find minima.
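
A small numerical sanity check of the affine-composition rule above (a sketch I added; the matrix A, vector b, and the choice f = ‖·‖₁ are arbitrary illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)

g = lambda x: np.abs(A @ x + b).sum()        # g(x) = f(Ax + b) with f = l1-norm

def subgrad_g(x):
    """One element of A^T * df(Ax + b), using sign(.) as a subgradient of the l1-norm."""
    return A.T @ np.sign(A @ x + b)

# Verify the subgradient inequality g(y) >= g(x) + s^T (y - x) at random points.
x = rng.normal(size=3)
s = subgrad_g(x)
ok = all(g(y) >= g(x) + s @ (y - x) - 1e-10
         for y in rng.normal(size=(200, 3)))
print("chain-rule subgradient is valid:", ok)
```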

Subgradient Method: Algorithm

The Subgradient Method is NOT a descent method!

  x^(k+1) = x^(k) − α_k g^(k),   where α_k ≥ 0 and g^(k) ∈ ∂f(x^(k))
  f_best^(k) = min{ f_best^(k−1), f(x^(k)) }

No line search is performed; the step lengths α_k are usually fixed ahead of time.
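
A minimal Python sketch of this iteration; the test function ‖x − c‖₁, the step schedule, and the iteration count are illustrative choices, not from the slides.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps, n_iter=500):
    """x_{k+1} = x_k - alpha_k * g_k, keeping track of the best value seen."""
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()
    for k in range(n_iter):
        g = subgrad(x)
        x = x - steps(k) * g
        if f(x) < f_best:              # not a descent method: track the best iterate
            f_best, x_best = f(x), x.copy()
    return x_best, f_best

# Example: f(x) = ||x - c||_1, minimized at c.
c = np.array([1.0, -2.0, 3.0])
f = lambda x: np.abs(x - c).sum()
subgrad = lambda x: np.sign(x - c)
x_best, f_best = subgradient_method(f, subgrad, np.zeros(3),
                                    steps=lambda k: 1.0 / np.sqrt(k + 1))
print(x_best, f_best)
```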

Step Lengths

Commonly used step lengths:
- Constant step size: α_k = α.
- Constant step length: α_k = γ/‖g^(k)‖₂ (so that ‖x^(k+1) − x^(k)‖₂ = γ).
- Square summable but not summable step size: α_k ≥ 0, Σ_{k=1}^∞ α_k² < ∞, Σ_{k=1}^∞ α_k = ∞.
- Nonsummable diminishing: α_k ≥ 0, lim_{k→∞} α_k = 0, Σ_{k=1}^∞ α_k = ∞.
- Nonsummable diminishing step lengths: α_k = γ_k/‖g^(k)‖₂ with γ_k ≥ 0, lim_{k→∞} γ_k = 0, Σ_{k=1}^∞ γ_k = ∞.
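
The same schedules written as small Python step-size functions that could be plugged into the loop above (the constants 0.1 and 1.0 are placeholder values for α and γ):

```python
import numpy as np

# k is the (0-based) iteration counter, g is the current subgradient.
constant_step_size   = lambda k, g: 0.1                          # alpha_k = alpha
constant_step_length = lambda k, g: 0.1 / np.linalg.norm(g)      # ||alpha_k g|| = gamma
square_summable      = lambda k, g: 1.0 / (k + 1)                # sum alpha_k^2 < inf, sum alpha_k = inf
nonsummable_dimin    = lambda k, g: 1.0 / np.sqrt(k + 1)         # alpha_k -> 0, sum alpha_k = inf
dimin_step_length    = lambda k, g: (1.0 / np.sqrt(k + 1)) / np.linalg.norm(g)
```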

Convergence Result

Assume there exists G such that the norms of the subgradients are bounded, i.e. ‖g^(k)‖₂ ≤ G (for example, if f is Lipschitz continuous with constant G).

Result:

  f_best^(k) − f* ≤ ( dist(x^(1), X*)² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

The proof works by bounding how ‖x^(k) − x*‖² decreases from one iteration to the next.

Convergence for Commonly used Step lengths

- Constant step size: f_best^(k) converges to within G²h/2 of optimal.
- Constant step length: f_best^(k) converges to within Gh of optimal.
- Square summable but not summable step size: f_best^(k) → f*.
- Nonsummable diminishing: f_best^(k) → f*.
- Nonsummable diminishing step lengths: f_best^(k) → f*.

With R = dist(x^(1), X*):

  f_best^(k) − f* ≤ ( R² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

So the optimal step sizes are α_i = R/(G√k), and the method converges to accuracy ε in (RG/ε)² steps.
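
To make the last step explicit (a standard calculation the slide leaves implicit): with a constant step α_i = α the bound becomes (R² + G²kα²)/(2kα); minimizing over α and then asking for accuracy ε gives

```latex
\min_{\alpha>0}\;\frac{R^{2}+G^{2}k\alpha^{2}}{2k\alpha}
  \;=\;\frac{RG}{\sqrt{k}}
  \quad\text{at}\quad \alpha=\frac{R}{G\sqrt{k}},
\qquad
\frac{RG}{\sqrt{k}}\le\varepsilon
  \;\Longleftrightarrow\;
  k\ge\Bigl(\frac{RG}{\varepsilon}\Bigr)^{2}.
```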

Variations

- If the optimal value f* is known (e.g. it is known to be 0, but the minimizer is not), use Polyak's step size:

    α_k = ( f(x^(k)) − f* ) / ‖g^(k)‖₂²

- Projected Subgradient: to minimize f(x) s.t. x ∈ C, iterate x^(k+1) = P(x^(k) − α_k g^(k)), where P is the projection onto C.
- Alternating projections: find a point in the intersection of 2 convex sets.
- Heavy Ball method: x^(k+1) = x^(k) − α_k g^(k) + β_k (x^(k) − x^(k−1)).
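
A sketch of the projected subgradient variant; the feasible set (a Euclidean ball), the projection, the test problem, and the step schedule are my own illustrative choices.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def projected_subgradient(f, subgrad, project, x0, n_iter=1000):
    """x_{k+1} = P(x_k - alpha_k g_k) with a diminishing step size."""
    x = project(np.asarray(x0, dtype=float))
    x_best, f_best = x.copy(), f(x)
    for k in range(n_iter):
        x = project(x - (1.0 / np.sqrt(k + 1)) * subgrad(x))
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# minimize ||x - c||_1 subject to ||x||_2 <= 1 (the unconstrained minimizer c is infeasible)
c = np.array([2.0, 0.0])
x_best, f_best = projected_subgradient(lambda x: np.abs(x - c).sum(),
                                       lambda x: np.sign(x - c),
                                       project_ball, np.zeros(2))
print(x_best, f_best)   # should approach x = (1, 0)
```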

Pros
- Can be applied immediately to a wide variety of problems, especially when the required accuracy is not very high.
- Low memory usage.
- Often possible to design distributed methods if the objective is decomposable.

Cons
- Slower than second-order methods.

Cutting Plane Method

- Again, consider the problem: minimize f(x) subject to x ∈ C.
- Construct an approximate model:

    f̂(x) = max_{i∈I} { f(x_i) + g_i^T (x − x_i) }

- Minimize the model over x, then evaluate f and a subgradient g at the new point.
- Update the model and repeat until the desired accuracy is reached.
- Numerically unstable: successive iterates can jump around wildly.
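
A minimal sketch of this loop, restricted to a box so that minimizing the piecewise-linear model is a bounded LP solved with scipy.optimize.linprog; the random piecewise-linear test function, the box, and the tolerance are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Nonsmooth convex test function: f(x) = max_j (a_j . x + b_j), subgradient = a_{argmax}.
rng = np.random.default_rng(0)
A_pieces = rng.normal(size=(8, 2))
b_pieces = rng.normal(size=8)
f = lambda x: np.max(A_pieces @ x + b_pieces)
subgrad = lambda x: A_pieces[np.argmax(A_pieces @ x + b_pieces)]

def cutting_plane(x0, box=(-5.0, 5.0), tol=1e-6, max_iter=100):
    n = len(x0)
    cuts = []                                   # each cut: f(x_i) + g_i^T (x - x_i) <= t
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        cuts.append((f(x), subgrad(x), x.copy()))
        # minimize t over (x, t):  g_i^T x - t <= g_i^T x_i - f(x_i)  for every cut
        A_ub = np.array([np.append(g, -1.0) for _, g, _ in cuts])
        b_ub = np.array([g @ xi - fi for fi, g, xi in cuts])
        res = linprog(c=np.append(np.zeros(n), 1.0), A_ub=A_ub, b_ub=b_ub,
                      bounds=[box] * n + [(None, None)], method="highs")
        x, model_val = res.x[:n], res.x[n]
        if f(x) - model_val < tol:              # model value is a lower bound on min f
            break
    return x, f(x)

print(cutting_plane(np.zeros(2)))
```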

Moreau-Yosida Regularization

- Idea: solve a series of smooth convex problems to minimize f(x).

    F(x) = min_{y∈R^n} { f(y) + (λ/2) ‖y − x‖² }
    p(x) = argmin_{y∈R^n} { f(y) + (λ/2) ‖y − x‖² }

- F(x) is differentiable! ∇F(x) = λ(x − p(x))
- The minimization is done using the dual.
- Cutting Plane Method + Moreau-Yosida Regularization = Bundle Methods.
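
A concrete 1-D illustration (my own example, not from the slides): for f(x) = |x| the prox point p(x) is soft-thresholding, and the gradient formula ∇F(x) = λ(x − p(x)) can be checked against finite differences.

```python
import numpy as np

lam = 2.0   # the lambda of the slide; an arbitrary illustrative value

def prox_abs(x, lam):
    """p(x) = argmin_y |y| + (lam/2)(y - x)^2  (soft-thresholding with threshold 1/lam)."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / lam, 0.0)

def moreau_envelope_abs(x, lam):
    """F(x) = min_y |y| + (lam/2)(y - x)^2, evaluated at the prox point."""
    p = prox_abs(x, lam)
    return np.abs(p) + 0.5 * lam * (p - x) ** 2

# Check grad F(x) = lam * (x - p(x)) against central finite differences.
xs = np.linspace(-2.0, 2.0, 9)
grad = lam * (xs - prox_abs(xs, lam))
h = 1e-6
fd = (moreau_envelope_abs(xs + h, lam) - moreau_envelope_abs(xs - h, lam)) / (2 * h)
print(np.max(np.abs(grad - fd)))   # small: F is differentiable even though |x| is not
```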

Elementary Bundle Method

- As before, f is assumed to be Lipschitz continuous.
- At a generic iteration we maintain a "bundle" of tuples ⟨y_i, f(y_i), s_i, α_i⟩: trial points, their function values, subgradients s_i ∈ ∂f(y_i), and linearization errors α_i.

Follow the Cutting Plane Method, but use M-Y regularization when minimizing the model:

  y_{k+1} = argmin_{y∈R^n} f̂_k(y) + (µ_k/2) ‖y − x̂_k‖²
  δ_k = f(x̂_k) − [ f̂_k(y_{k+1}) + (µ_k/2) ‖y_{k+1} − x̂_k‖² ] ≥ 0

- If δ_k < δ, stop.
- If f(x̂_k) − f(y_{k+1}) ≥ m δ_k: Serious Step, x̂_{k+1} = y_{k+1}; else: Null Step, x̂_{k+1} = x̂_k.
- Update the model: f̂_{k+1}(y) = max{ f̂_k(y), f(y_{k+1}) + ⟨s_{k+1}, y − y_{k+1}⟩ }.
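
A compact 1-D sketch of the serious/null-step loop above. It is not the presentation's exact method: the prox subproblem is solved numerically with scipy.optimize.minimize_scalar, µ_k is held fixed, and the test function and parameters (µ, m, δ) are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: abs(x - 1.0) + 0.5 * abs(x + 2.0)                 # nonsmooth convex test function
subgrad = lambda x: np.sign(x - 1.0) + 0.5 * np.sign(x + 2.0)   # one subgradient (sign(0) = 0)

def bundle_method(x0, mu=1.0, m=0.3, delta_stop=1e-6, max_iter=100):
    xhat = x0
    bundle = [(x0, f(x0), subgrad(x0))]                   # cuts: (y_i, f(y_i), s_i)
    model = lambda y: max(fy + s * (y - yi) for yi, fy, s in bundle)
    for k in range(max_iter):
        # prox (Moreau-Yosida) step on the cutting-plane model
        res = minimize_scalar(lambda y: model(y) + 0.5 * mu * (y - xhat) ** 2,
                              bounds=(xhat - 10.0, xhat + 10.0), method='bounded')
        y_new = res.x
        delta = f(xhat) - (model(y_new) + 0.5 * mu * (y_new - xhat) ** 2)
        if delta < delta_stop:
            break
        if f(xhat) - f(y_new) >= m * delta:               # serious step
            xhat = y_new
        # (else: null step, xhat unchanged)
        bundle.append((y_new, f(y_new), subgrad(y_new)))  # add the new cut in both cases
    return xhat

print(bundle_method(5.0))   # minimizer of f is x = 1 (value 1.5)
```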

Convergence

- Either the algorithm makes a finite number of Serious Steps and then only Null Steps. Then, if k₀ is the last Serious Step and µ_k is nondecreasing, δ_k → 0.
- Or it makes an infinite number of Serious Steps. Then

    Σ_{k∈K_s} δ_k ≤ ( f(x̂_0) − f* ) / m,

  so δ_k → 0 along the Serious Steps K_s.

Variations

- Replace ‖y − x‖² by (y − x)^T M_k (y − x): still differentiable.
- Conjugate gradient methods are obtained as a slight modification of the algorithm (see [5]).
- Variable metric methods [10]; M_k = u_k I for diagonal variable metric methods.
- Bundle-Newton methods.

Summary

- Nonsmooth convex optimization has been explored since the 1960s. The original subgradient methods were introduced by Naum Shor; bundle methods have been developed more recently.
- Subgradient methods are simple but slow unless distributed; distributed optimization is their predominant current application.
- Bundle methods solve a bounded QP per iteration, which is slow, but they need fewer iterations. They are preferred for applications where the oracle cost is high.

For Further Reading I

Naum Z. Shor. Minimization Methods for Non-Differentiable Functions. Springer-Verlag, 1985.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press.

A. Ruszczynski. Nonlinear Optimization. Princeton University Press.

Wikipedia: en.wikipedia.org/wiki/Subgradient_method

For Further Reading II

Marko Makela. Survey of Bundle Methods, 2009. http://www.informaworld.com/smpp/content~db=all~content=a713741700

Alexandre Belloni. An Introduction to Bundle Methods. http://web.mit.edu/belloni/www/LecturesIntroBundle.pdf

John E. Mitchell. Cutting Plane and Subgradient Methods, 2005. http://www.optimization-online.org/DB_HTML/2009/05/2298.html

For Further Reading III

Stephen Boyd. Lecture Notes on Subgradient Methods. http://www.stanford.edu/class/ee392o/subgrad_method.pdf

Alexander J. Smola, S. V. N. Vishwanathan, Quoc V. Le. Bundle Methods for Machine Learning, 2007. http://books.nips.cc/papers/files/nips20/NIPS2007_0470.pdf

C. Lemarechal. Variable Metric Bundle Methods, 1997. http://www.springerlink.com/index/3515WK428153171N.pdf

Quoc Le, Alexander Smola. Direct Optimization of Ranking Measures, 2007. http://arxiv.org/abs/0704.3359

For Further Reading IV

S. V. N. Vishwanathan, A. Smola. Quasi-Newton Methods for Efficient Large-Scale Machine Learning. http://portal.acm.org/ft_gateway.cfm?id=1390309&type=pdf and www.stat.purdue.edu/~vishy/talks/LBFGS.pdf
