A short first seminar on subgradient and bundle methods for nonsmooth optimization.
Subgradient and Bundle Methods for Optimization of Convex Non-smooth Functions
April 1, 2009
Motivation

Many naturally occurring problems are nonsmooth:
- Hinge loss
- Feasible region of a convex minimization problem
- Piecewise linear functions

A function that approximates a nonsmooth function may be analytically smooth, yet "numerically nonsmooth".
Methods for nonsmooth optimization

- Approximate by a series of smooth functions
- Reformulate the problem, adding more constraints, so that the objective is smooth
- Subgradient Methods
- Cutting Plane Methods
- Moreau-Yosida Regularization
- Bundle Methods
- UV-decomposition
Definition
An extension of gradients

For a convex differentiable function f(x), for all x, y:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)   (1)

So a subgradient of f at x is defined as any g ∈ Rⁿ such that, for all y,

f(y) ≥ f(x) + gᵀ(y − x)   (2)

The set of all subgradients of f at x is denoted ∂f(x).
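A standard example: for f(x) = |x| on R, the function is differentiable away from the origin, so ∂f(x) = {sign(x)} for x ≠ 0, while at the kink ∂f(0) = [−1, 1], since every line y ↦ g·y with g ∈ [−1, 1] lies below |y|.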
Some Facts
From convex analysis

- A convex function is always subdifferentiable, i.e. a subgradient exists at every point.
- Directional derivatives also exist at every point.
- If a convex function f is differentiable at x, its only subgradient there is the gradient: ∂f(x) = {∇f(x)}.
- Subgradients are lower bounds for directional derivatives; in fact f′(x; d) = sup_{g∈∂f(x)} ⟨g, d⟩.
- Further, d is a descent direction iff gᵀd < 0 for all g ∈ ∂f(x).
Properties
Stated without proof

- ∂(f₁ + f₂)(x) = ∂f₁(x) + ∂f₂(x)
- ∂(αf)(x) = α ∂f(x) (for α ≥ 0)
- g(x) = f(Ax + b) ⇒ ∂g(x) = Aᵀ ∂f(Ax + b)
- Local minimum ⇒ 0 ∈ ∂f(x) (and for convex f this is also sufficient). However, for f(x) = |x| the oracle returns the subgradient 0 only at x = 0 itself, so querying the oracle for a zero subgradient is not a good way to find minima.
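To illustrate these rules concretely, here is a minimal Python sketch (our own illustration; the helper name is hypothetical, not from the slides) that returns one valid subgradient of the hinge loss mentioned in the motivation, L(w) = max(0, 1 − y⟨w, x⟩), combining the affine chain rule with a choice at the kink:

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """Return one subgradient of L(w) = max(0, 1 - y * <w, x>).

    At the kink (margin exactly 1) any convex combination of the two
    one-sided gradients is valid; we return -y * x, one admissible choice.
    """
    margin = y * np.dot(w, x)
    if margin > 1.0:
        return np.zeros_like(w)   # loss is locally 0, gradient 0
    return -y * x                 # active branch: d/dw (1 - y<w,x>) = -y x

# Example: one subgradient step on a single training pair
w = np.zeros(3)
x, y = np.array([1.0, -2.0, 0.5]), 1.0
w -= 0.1 * hinge_subgradient(w, x, y)
```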
Subgradient Method
Algorithm

The subgradient method is NOT a descent method!

x^(k+1) = x^(k) − α_k g^(k),  with α_k ≥ 0 and g^(k) ∈ ∂f(x^(k))

Since iterates need not decrease f, we keep track of the best value so far:

f_best^(k) = min{ f_best^(k−1), f(x^(k)) }

No line search is performed; the step lengths α_k are usually fixed ahead of time.
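A minimal sketch of the iteration in Python (our own illustration, assuming the user supplies the objective f and a subgradient oracle subgrad):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps, n_iter=1000):
    """Minimize a convex f given a subgradient oracle.

    steps: callable k -> alpha_k. The method is not a descent
    method, so we track the best iterate seen so far.
    """
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, n_iter + 1):
        g = subgrad(x)
        x = x - steps(k) * g          # x^(k+1) = x^(k) - alpha_k g^(k)
        fx = f(x)
        if fx < f_best:               # f_best^(k) = min{f_best^(k-1), f(x^(k))}
            x_best, f_best = x.copy(), fx
    return x_best, f_best

# Example: f(x) = ||x||_1, subgradient sign(x), diminishing steps
x_best, f_best = subgradient_method(
    f=lambda x: np.abs(x).sum(),
    subgrad=np.sign,
    x0=np.array([3.0, -2.0]),
    steps=lambda k: 0.5 / np.sqrt(k),
)
```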
Step Lengths
Commonly used step lengths:

- Constant step size: α_k = α.
- Constant step length: α_k = γ/‖g^(k)‖₂ (so that ‖x^(k+1) − x^(k)‖₂ = γ).
- Square summable but not summable step sizes: α_k ≥ 0, ∑_{k=1}^∞ α_k² < ∞, ∑_{k=1}^∞ α_k = ∞.
- Nonsummable diminishing: α_k ≥ 0, lim_{k→∞} α_k = 0, ∑_{k=1}^∞ α_k = ∞.
- Nonsummable diminishing step lengths: α_k = γ_k/‖g^(k)‖₂ with γ_k ≥ 0, lim_{k→∞} γ_k = 0, ∑_{k=1}^∞ γ_k = ∞.
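Each schedule is a one-line rule; a sketch of representative choices (the constants are arbitrary illustrations, not from the slides):

```python
import numpy as np

# One representative of each family, as functions of the iteration k >= 1
# (g is the current subgradient, needed for the step-*length* rules).
constant_size   = lambda k, g: 0.1                      # alpha_k = alpha
constant_length = lambda k, g: 0.1 / np.linalg.norm(g)  # ||x_{k+1}-x_k|| = gamma
square_summable = lambda k, g: 1.0 / k                  # sum a_k^2 < inf, sum a_k = inf
diminishing     = lambda k, g: 0.1 / np.sqrt(k)         # a_k -> 0, sum a_k = inf
dim_length      = lambda k, g: (0.1 / np.sqrt(k)) / np.linalg.norm(g)
```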
Convergence Result
Assume there exists G such that the norms of the subgradients are bounded, i.e. ‖g^(k)‖₂ ≤ G (this holds, for example, if f is Lipschitz continuous with constant G).

Result:

f_best^(k) − f* ≤ ( dist(x^(1), X*)² + G² ∑_{i=1}^k α_i² ) / ( 2 ∑_{i=1}^k α_i )

The proof works by showing that ‖x − x*‖ decreases: expanding the update gives

‖x^(k+1) − x*‖₂² ≤ ‖x^(k) − x*‖₂² − 2α_k (f(x^(k)) − f*) + α_k² ‖g^(k)‖₂²,

and summing this inequality over the first k iterations yields the bound.
Convergence for Commonly used Step lengths
- Constant step size: f_best^(k) converges to within G²h/2 of optimal.
- Constant step length: f_best^(k) converges to within Gh of optimal.
- Square summable but not summable step sizes: f_best^(k) → f*.
- Nonsummable diminishing: f_best^(k) → f*.
- Nonsummable diminishing step lengths: f_best^(k) → f*.

All follow from the bound (writing R = dist(x^(1), X*)):

f_best^(k) − f* ≤ ( R² + G² ∑_{i=1}^k α_i² ) / ( 2 ∑_{i=1}^k α_i )

So the optimal α_i are R/(G√k), and the method converges in (RG/ε)² steps.
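To see where these constants come from: with a constant α over k steps the bound becomes (R² + G²kα²)/(2kα), which is minimized at α = R/(G√k) and then equals RG/√k; requiring RG/√k ≤ ε gives k ≥ (RG/ε)².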
Variations
- Polyak's step, when the optimal value f* is known (e.g. known to be 0, even though the minimizer is not):
  α_k = (f(x^(k)) − f*) / ‖g^(k)‖₂²
- Projected subgradient, for minimizing f(x) s.t. x ∈ C:
  x^(k+1) = P(x^(k) − α_k g^(k)), where P is projection onto C
- Alternating projections: find a point in the intersection of two convex sets.
- Heavy ball method: x^(k+1) = x^(k) − α_k g^(k) + β_k (x^(k) − x^(k−1))
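A sketch of the projected variant (our own illustration; the Euclidean-ball constraint set and the test problem are illustrative choices):

```python
import numpy as np

def project_ball(x, r=1.0):
    """Euclidean projection onto C = {x : ||x||_2 <= r}."""
    n = np.linalg.norm(x)
    return x if n <= r else (r / n) * x

def projected_subgradient(f, subgrad, project, x0, n_iter=500):
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, n_iter + 1):
        alpha = 1.0 / np.sqrt(k)                 # nonsummable diminishing
        x = project(x - alpha * subgrad(x))      # x^(k+1) = P(x^(k) - a_k g^(k))
        fx = f(x)
        if fx < f_best:
            x_best, f_best = x.copy(), fx
    return x_best, f_best

# Minimize ||x - c||_1 over the unit ball
c = np.array([2.0, 0.5])
x_best, _ = projected_subgradient(
    f=lambda x: np.abs(x - c).sum(),
    subgrad=lambda x: np.sign(x - c),
    project=project_ball,
    x0=np.zeros(2),
)
```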
Pros
- Can immediately be applied to a wide variety of problems, especially when the required accuracy is not very high.
- Low memory usage.
- Often possible to design distributed methods if the objective is decomposable.

Cons
- Slower than second-order methods.
Cutting Plane Method
- Again, consider the problem: minimize f(x) subject to x ∈ C.
- Construct an approximate model from the points visited so far:
  f̂(x) = max_{i∈I} ( f(x_i) + g_iᵀ(x − x_i) )
- Minimize the model over x, then evaluate f(x) and a subgradient g at the minimizer.
- Update the model and repeat until the desired accuracy (a sketch follows below).
- Numerically unstable.
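A toy sketch of the loop (our own illustration; it takes C to be a box so the piecewise-linear model can be minimized with scipy's LP solver):

```python
import numpy as np
from scipy.optimize import linprog

def cutting_plane(f, subgrad, x0, box=(-10.0, 10.0), n_iter=30):
    """Kelley's cutting-plane method on C = [lo, hi]^n (a toy sketch).

    LP variables are (x, t): minimize t subject to
    t >= f(x_i) + g_i^T (x - x_i) for every stored cut.
    """
    x = np.asarray(x0, dtype=float)
    n = x.size
    cuts = []                                   # list of (x_i, f(x_i), g_i)
    for _ in range(n_iter):
        cuts.append((x.copy(), f(x), subgrad(x)))
        # Each cut rewritten as: g_i^T x - t <= g_i^T x_i - f(x_i)
        A = np.array([np.append(g, -1.0) for (xi, fi, g) in cuts])
        b = np.array([g @ xi - fi for (xi, fi, g) in cuts])
        c = np.append(np.zeros(n), 1.0)         # objective: minimize t
        bounds = [box] * n + [(None, None)]     # x in the box, t free
        res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)
        x = res.x[:n]                           # next query point
    return x

# Example: minimize f(x) = ||x||_1 over the box
x = cutting_plane(lambda x: np.abs(x).sum(), np.sign, np.array([5.0, -3.0]))
```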
Moreau-Yosida Regularization
Idea: solve a series of smooth convex problems to minimize f(x).

F(x) = min_{y∈Rⁿ} { f(y) + (λ/2)‖y − x‖² }

p(x) = argmin_{y∈Rⁿ} { f(y) + (λ/2)‖y − x‖² }

F(x) is differentiable! ∇F(x) = λ(x − p(x))

The minimization is done using the dual.

Cutting Plane Method + Moreau-Yosida Regularization = Bundle Methods.
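A classical concrete case (a standard fact about the l1 norm, not taken from the slides): for f(y) = |y| the proximal point p(x) is soft-thresholding, and F is a Huber-like smoothing of |x|. A quick numerical check of the gradient identity:

```python
import numpy as np

lam = 2.0   # the regularization parameter lambda

def prox_abs(x, lam):
    """p(x) = argmin_y |y| + (lam/2)(y - x)^2  (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / lam, 0.0)

def moreau_envelope(x, lam):
    """F(x) = min_y |y| + (lam/2)(y - x)^2, evaluated at the prox point."""
    p = prox_abs(x, lam)
    return np.abs(p) + 0.5 * lam * (p - x) ** 2

# Gradient identity: grad F(x) = lam * (x - p(x)); check by finite differences
x, h = 1.3, 1e-6
grad = lam * (x - prox_abs(x, lam))
fd = (moreau_envelope(x + h, lam) - moreau_envelope(x - h, lam)) / (2 * h)
assert abs(grad - fd) < 1e-4
```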
Elementary Bundle Method
As before, f is assumed to be Lipschitz continuous. At a generic iteration we maintain a "bundle" of tuples ⟨y_i, f(y_i), s_i, α_i⟩: the points visited, their function values, subgradients s_i ∈ ∂f(y_i), and (in the standard bundle notation) linearization errors α_i.
Follow the Cutting Plane Method, but use M-Y regularization when minimizing the model:

y_{k+1} = argmin_{y∈Rⁿ} f̂_k(y) + (µ_k/2)‖y − x̂_k‖²

δ_k = f(x̂_k) − [ f̂_k(y_{k+1}) + (µ_k/2)‖y_{k+1} − x̂_k‖² ] ≥ 0

If δ_k < δ, stop.
If f(x̂_k) − f(y_{k+1}) ≥ m δ_k: Serious Step, x̂_{k+1} = y_{k+1};
else: Null Step, x̂_{k+1} = x̂_k.

In either case, add the new cut to the model:

f̂_{k+1}(y) = max{ f̂_k(y), f(y_{k+1}) + ⟨s_{k+1}, y − y_{k+1}⟩ }
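A compact sketch of one plausible implementation (our own, with cvxpy assumed available to solve the proximal QP subproblem; the values of m, mu, and the stopping tolerance are illustrative choices):

```python
import numpy as np
import cvxpy as cp

def bundle_method(f, subgrad, x0, mu=1.0, m=0.5, tol=1e-6, n_iter=100):
    """Elementary proximal bundle method (illustrative sketch)."""
    x_hat = np.asarray(x0, dtype=float)
    n = x_hat.size
    cuts = [(x_hat.copy(), f(x_hat), subgrad(x_hat))]   # (y_i, f(y_i), s_i)
    for _ in range(n_iter):
        # Proximal QP: min_y max_i [f_i + s_i^T (y - y_i)] + (mu/2)||y - x_hat||^2
        y, v = cp.Variable(n), cp.Variable()
        cons = [v >= fi + si @ (y - yi) for (yi, fi, si) in cuts]
        obj = cp.Minimize(v + (mu / 2) * cp.sum_squares(y - x_hat))
        cp.Problem(obj, cons).solve()
        y_new = y.value
        model_val = v.value + (mu / 2) * np.sum((y_new - x_hat) ** 2)
        delta = f(x_hat) - model_val            # predicted decrease, >= 0
        if delta < tol:
            return x_hat
        if f(x_hat) - f(y_new) >= m * delta:    # serious step: move the center
            x_hat = y_new
        # on a null step x_hat stays; either way, add the new cut
        cuts.append((y_new.copy(), f(y_new), subgrad(y_new)))
    return x_hat

x_star = bundle_method(lambda x: np.abs(x).sum(), np.sign, np.array([4.0, -1.0]))
```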
Convergence
Either the algorithm makes a finite number of Serious Steps and then only Null Steps: if k₀ is the last Serious Step and µ_k is nondecreasing, then δ_k → 0.

Or it makes an infinite number of Serious Steps: then

∑_{k∈K_s} δ_k ≤ (f(x̂_0) − f*) / m,

where K_s is the set of Serious-Step indices, so again δ_k → 0.
Variations
- Replace ‖y − x‖² by (y − x)ᵀ M_k (y − x): the regularized function is still differentiable.
- Conjugate gradient methods are obtained as a slight modification of the algorithm (see [5]).
- Variable metric methods [10].
- M_k = u_k I for diagonal variable metric methods.
- Bundle-Newton methods.
Summary
- Nonsmooth convex optimization has been explored since the 1960s. The original subgradient methods were introduced by Naum Shor; bundle methods have been developed more recently.
- Subgradient methods are simple but slow, unless distributed, which is the predominant current application.
- Bundle methods solve a bounded QP at each iteration, which is slow, but they need fewer iterations; they are preferred for applications where the oracle cost is high.
For Further Reading I
[1] Naum Z. Shor. Minimization Methods for Non-differentiable Functions. Springer-Verlag, 1985.

[2] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press.

[3] A. Ruszczyński. Nonlinear Optimization. Princeton University Press.

[4] Wikipedia. en.wikipedia.org/wiki/Subgradient_method
For Further Reading II
[5] Marko Mäkelä. Survey of Bundle Methods, 2009. http://www.informaworld.com/smpp/content~db=all~content=a713741700

[6] Alexandre Belloni. An Introduction to Bundle Methods. http://web.mit.edu/belloni/www/LecturesIntroBundle.pdf

[7] John E. Mitchell. Cutting Plane and Subgradient Methods, 2005. http://www.optimization-online.org/DB_HTML/2009/05/2298.html
For Further Reading III
[8] Stephen Boyd. Lecture Notes on Subgradient Methods. http://www.stanford.edu/class/ee392o/subgrad_method.pdf

[9] Alexander J. Smola, S. V. N. Vishwanathan, and Quoc V. Le. Bundle Methods for Machine Learning, 2007. http://books.nips.cc/papers/files/nips20/NIPS2007_0470.pdf

[10] C. Lemaréchal. Variable Metric Bundle Methods, 1997. http://www.springerlink.com/index/3515WK428153171N.pdf

[11] Quoc Le and Alexander Smola. Direct Optimization of Ranking Measures, 2007. http://arxiv.org/abs/0704.3359
For Further Reading IV
[12] S. V. N. Vishwanathan and A. Smola. Quasi-Newton Methods for Efficient Large-Scale Machine Learning. http://portal.acm.org/ft_gateway.cfm?id=1390309&type=pdf and www.stat.purdue.edu/~vishy/talks/LBFGS.pdf