
Lecture 13

Gradient Methods for Constrained Optimization

October 16, 2008


Outline

• Gradient Projection Algorithm

• Convergence Rate


Constrained Minimization

minimize f(x)

subject to x ∈ X

• Assumption 1:

• The function f is convex and continuously differentiable over Rn

• The set X is closed and convex

• The optimal value f∗ = inf_{x∈X} f(x) is finite

• Gradient projection algorithm

xk+1 = PX[xk − αk∇f(xk)]

starting with x0 ∈ X.
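
A minimal sketch of this iteration (an illustration, not part of the lecture notes): grad_f and project_X are hypothetical user-supplied callables for ∇f and PX, and the unit-ball example is an arbitrary choice.

```python
import numpy as np

def projected_gradient(grad_f, project_X, x0, stepsize, num_iters=1000):
    """Gradient projection: x_{k+1} = P_X[x_k - alpha_k * grad f(x_k)]."""
    x = project_X(np.asarray(x0, dtype=float))  # start with x0 in X
    for k in range(num_iters):
        x = project_X(x - stepsize(k) * grad_f(x))
    return x

# Illustration: minimize ||x - c||^2 over the unit ball X = {x : ||x|| <= 1}.
c = np.array([2.0, 0.0])
grad_f = lambda x: 2.0 * (x - c)                       # gradient of ||x - c||^2
project_X = lambda x: x / max(1.0, np.linalg.norm(x))  # projection onto the unit ball
x_hat = projected_gradient(grad_f, project_X, np.zeros(2), stepsize=lambda k: 0.25)
```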


Bounded Gradients

Theorem 1 Let Assumption 1 hold, and suppose that the gradients are uniformly bounded over the set X, say ‖∇f(x)‖ ≤ L for all x ∈ X. Then, the gradient projection method generates a sequence {xk} ⊂ X such that

• When the constant stepsize αk ≡ α is used, we have

lim inf_{k→∞} f(xk) ≤ f∗ + αL²/2

• When a diminishing stepsize is used with ∑_k αk = +∞, we have

lim inf_{k→∞} f(xk) = f∗.

Proof: We use projection properties and a line of analysis similar to that of the unconstrained method (HWK 6).
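
For instance, a diminishing stepsize with ∑_k αk = +∞ can be passed to the hypothetical projected_gradient sketch above:

```python
# alpha_k = 1/(k+1): alpha_k -> 0 and sum_k alpha_k = +infinity, so by Theorem 1
# lim inf_k f(x_k) = f*; a constant stepsize alpha only guarantees f* + alpha*L^2/2.
x = projected_gradient(grad_f, project_X, np.zeros(2),
                       stepsize=lambda k: 1.0 / (k + 1), num_iters=5000)
```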


Lipschitz Gradients

• Lipschitz Gradient Lemma For a differentiable convex function f with Lipschitz gradients, we have for all x, y ∈ Rn,

(1/L)‖∇f(x) − ∇f(y)‖² ≤ (∇f(x) − ∇f(y))ᵀ(x − y),

where L is a Lipschitz constant of ∇f (a numerical check follows Theorem 2 below).

• Theorem 2 Let Assumption 1 hold, and assume that the gradients of f are Lipschitz continuous over X with constant L. Suppose that the optimal solution set X∗ is not empty. Then, for a constant stepsize αk ≡ α with 0 < α < 2/L, the sequence {xk} converges to an optimal point, i.e.,

lim_{k→∞} ‖xk − x∗‖ = 0 for some x∗ ∈ X∗.
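
The numerical check of the Lipschitz Gradient Lemma promised above: for the convex quadratic f(x) = ½xᵀQx the gradient ∇f(x) = Qx is Lipschitz with constant L = λmax(Q), so the inequality can be tested at random point pairs. The matrix and sample sizes are illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
Q = B.T @ B                                  # positive semidefinite Hessian
L = np.linalg.eigvalsh(Q)[-1]                # Lipschitz constant of grad f

grad = lambda x: Q @ x
for _ in range(1000):                        # Lipschitz Gradient Lemma at random pairs
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    d = grad(x) - grad(y)
    assert d @ d / L <= d @ (x - y) + 1e-9
```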


Proof: Fact 1: If z = PX[z − v] for some v ∈ Rn, then z = PX[z − τv] for any τ > 0.

Fact 2: z ∈ X∗ if and only if z = PX[z − ∇f(z)].

These facts imply that z ∈ X∗ if and only if z = PX[z − τ∇f(z)] for any τ > 0.

By using the definition of the method and the preceding relation with τ = α, we obtain for any z ∈ X∗,

‖xk+1 − z‖² = ‖PX[xk − α∇f(xk)] − PX[z − α∇f(z)]‖².

By the non-expansiveness of the projection, it follows that

‖xk+1 − z‖² ≤ ‖xk − z − α(∇f(xk) − ∇f(z))‖²

= ‖xk − z‖² − 2α(xk − z)ᵀ(∇f(xk) − ∇f(z)) + α²‖∇f(xk) − ∇f(z)‖²


Using the Lipschitz Gradient Lemma, we obtain for any z ∈ X∗,

‖xk+1 − z‖² ≤ ‖xk − z‖² − (α/L)(2 − αL)‖∇f(xk) − ∇f(z)‖².  (1)

Hence, for all k,

(α/L)(2 − αL)‖∇f(xk) − ∇f(z)‖² ≤ ‖xk − z‖² − ‖xk+1 − z‖².

By summing the preceding relations from arbitrary K to N, with K < N, we obtain

(α/L)(2 − αL) ∑_{k=K}^{N} ‖∇f(xk) − ∇f(z)‖² ≤ ‖xK − z‖² − ‖xN+1 − z‖² ≤ ‖xK − z‖².


In particular, setting K = 0 and letting N → ∞, we see that

(α/L)(2 − αL) ∑_{k=0}^{∞} ‖∇f(xk) − ∇f(z)‖² ≤ ‖x0 − z‖² < ∞.  (2)

As a consequence, we also have

lim_{k→∞} ∇f(xk) = ∇f(z).  (3)

By discarding the non-positive term on the right-hand side of Eq. (1), we have for any z ∈ X∗ and all k,

‖xk+1 − z‖² ≤ ‖xk − z‖² + (2 − αL)‖∇f(xk) − ∇f(z)‖².

By summing these relations over k = K, . . . , N for arbitrary K and N with

K < N, we obtain


‖xN+1 − z‖² ≤ ‖xK − z‖² + (2 − αL) ∑_{k=K}^{N} ‖∇f(xk) − ∇f(z)‖².

Taking limsup as N → ∞, we obtain

lim sup_{N→∞} ‖xN+1 − z‖² ≤ ‖xK − z‖² + (2 − αL) ∑_{k=K}^{∞} ‖∇f(xk) − ∇f(z)‖².

Now, taking liminf as K → ∞ yields

lim sup_{N→∞} ‖xN+1 − z‖² ≤ lim inf_{K→∞} ‖xK − z‖² + (2 − αL) lim_{K→∞} ∑_{k=K}^{∞} ‖∇f(xk) − ∇f(z)‖²

= lim inf_{K→∞} ‖xK − z‖²,


where the equality follows in view of the relation in (2). Thus, we have that

the sequence {‖xk − z‖} is convergent for every z ∈ X∗.

By the inequality in Eq. (1), we have that

‖xk − z‖ ≤ ‖x0 − z‖ for all k.

Hence, the sequence {xk} is bounded, and it has an accumulation point.

Since the scalar sequence {‖xk − z‖} is convergent for every z ∈ X∗, the

sequence {xk} must be convergent.

Suppose now that xk → x̄. By considering the definition of the iterate xk+1,

we have

xk+1 = PX[xk − α∇f(xk)].

Letting k → ∞ and using xk → x̄, together with the continuity of the gradient ∇f(x) and of the projection,

we obtain

x̄ = PX[x̄− α∇f(x̄)].

In view of Facts 1 and 2, the preceding relation is equivalent to x̄ ∈ X∗. □


Modes of Convexity: Strict and Strong

• Def. f is strictly convex if for all x ≠ y and α ∈ (0,1) we have

f(αx + (1− α)y) < αf(x) + (1− α)f(y)

• Def. f is strongly convex if there exists a scalar ν > 0 such that

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (ν/2)α(1 − α)‖x − y‖²

for all x, y ∈ Rn and any α ∈ [0,1].

The scalar ν is referred to as the strong convexity constant. The function is said to be strongly convex with constant ν.
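
For a concrete example (added here for illustration, not from the slides): a quadratic f(x) = ½xᵀQx + bᵀx with Q ≻ 0 is strongly convex with constant ν = λmin(Q), since a direct computation gives

```latex
f(\alpha x + (1-\alpha)y)
  = \alpha f(x) + (1-\alpha) f(y) - \frac{\alpha(1-\alpha)}{2}\,(x-y)^{T} Q\,(x-y)
  \le \alpha f(x) + (1-\alpha) f(y) - \frac{\lambda_{\min}(Q)}{2}\,\alpha(1-\alpha)\,\|x-y\|^{2}.
```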


Modes of Convexity: Differentiable Function

• Let f : Rn → R be continuously differentiable.

• Modes of convexity can be equivalently characterized in terms of the

linearization properties of the function f.

• We have

• f is convex if and only if

f(x) + ∇f(x)ᵀ(y − x) ≤ f(y) for all x, y ∈ Rn

• f is strictly convex if and only if

f(x) + ∇f(x)ᵀ(y − x) < f(y) for all x ≠ y

• f is strongly convex with constant ν if and only if

f(x) + ∇f(x)ᵀ(y − x) + (ν/2)‖y − x‖² ≤ f(y) for all x, y ∈ Rn


Modes of Convexity: Gradient Mapping

• Let f : Rn → R be continuously differentiable.

• Modes of convexity can be equivalently characterized in terms of the

monotonicity properties of the gradient mapping ∇f : Rn → Rn.

• We have

• f is convex if and only if

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0 for all x, y ∈ Rn

• f is strictly convex if and only if

(∇f(x) − ∇f(y))ᵀ(x − y) > 0 for all x ≠ y

• f is strongly convex with constant ν if and only if

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ ν‖x − y‖² for all x, y ∈ Rn


Modes of Convexity: Twice Differentiable Function

• Let f : Rn → R be twice continuously differentiable.

• Modes of convexity can be equivalently characterized in terms of the

definiteness of the Hessians ∇2f(x) for x ∈ Rn.

• We have

• f is convex if and only if

∇2f(x) ≥ 0 for all x ∈ Rn

• f is strictly convex if

∇2f(x) > 0 for all x ∈ Rn

(the converse fails: in one dimension, f(x) = x⁴ is strictly convex although f''(0) = 0)

• f is strongly convex with constant ν if and only if

∇2f(x) ≥ νI for all x ∈ Rn
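
In computations these tests reduce to eigenvalue checks of the Hessian. A minimal sketch (hypothetical names; sampling finitely many points is only a necessary check, not a proof over all of Rn):

```python
import numpy as np

def strong_convexity_constant_on_sample(hess, points):
    """Smallest Hessian eigenvalue over the sampled points; a positive value is the
    largest strong convexity constant certified by that sample."""
    return min(np.linalg.eigvalsh(hess(x)).min() for x in points)

# Illustration: f(x) = 0.5 x'Qx has Hessian Q everywhere, so nu = lambda_min(Q).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
nu = strong_convexity_constant_on_sample(lambda x: Q, [np.zeros(2)])
```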


Strong Convexity: Implications

Let f be continuously differentiable and strongly convex∗ over Rn with

constant m

• Implications:

• Lower Bound on f over Rn: for all x, y ∈ Rn,

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖x − y‖²  (4)

– Minimizing the right-hand side with respect to y gives

f(y) ≥ f(x) − (1/(2m))‖∇f(x)‖²

– Taking the minimum over y ∈ Rn:

f(x) − f∗ ≤ (1/(2m))‖∇f(x)‖²

• Useful as a stopping criterion (if you know m)

∗ Strong convexity over Rn can be replaced by strong convexity over a set X; then all the relations remain valid over that set.
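
A minimal sketch of using this bound as a stopping test, assuming the strong convexity constant m is known; grad_f, x0 and the stepsize are placeholders.

```python
import numpy as np

def gradient_descent_until_certified(grad_f, x0, m, alpha, tol=1e-8, max_iters=100_000):
    """Stop once ||grad f(x)||^2 / (2m) <= tol, which certifies f(x) - f* <= tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if g @ g / (2.0 * m) <= tol:   # suboptimality gap certified below tol
            break
        x = x - alpha * g
    return x
```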


• Relation (4) with x = x0 and f(y) ≤ f(x0) implies that the level

set Lf(f(x0)) is bounded

• Relation (4) also yields for an optimal x∗ and any x ∈ Rn,

(m/2)‖x − x∗‖² ≤ f(x) − f(x∗)

• Last two bullets HWK6 assignment.


Convergence Rate: Once Differentiable

Theorem 3 Let Assumption 1 hold, and assume that the gradients of f

are Lipschitz continuous over X with constant L > 0. Suppose that the function is strongly convex with constant m > 0. Then:

• A solution x∗ exists and it is unique.

• The iterates generated by the gradient projection method with αk ≡ α and 0 < α < 2/L converge to x∗ at a geometric rate, i.e.,

‖xk+1 − x∗‖² ≤ q ‖xk − x∗‖² for all k

with q ∈ (0,1) depending on m and L.

Proof: HWK 6.
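
A small numerical illustration of the geometric rate (not the homework proof): projected gradient on an arbitrary strongly convex quadratic over the box [0,1]^n, printing the observed contraction of the distance to a reference solution. All data and names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A.T @ A + np.eye(n)                  # Hessian of a strongly convex quadratic
b = rng.standard_normal(n)
m, L = np.linalg.eigvalsh(Q)[[0, -1]]    # strong convexity and Lipschitz constants

grad = lambda x: Q @ x - b               # f(x) = 0.5 x'Qx - b'x
proj = lambda x: np.clip(x, 0.0, 1.0)    # projection onto the box [0, 1]^n
alpha = 1.0 / L                          # any 0 < alpha < 2/L

x_star = proj(np.zeros(n))               # reference solution from a long run
for _ in range(20_000):
    x_star = proj(x_star - alpha * grad(x_star))

x = proj(0.5 * np.ones(n))
for k in range(10):                      # observed per-step ratios stay below 1
    x_next = proj(x - alpha * grad(x))
    print(k, np.linalg.norm(x_next - x_star) / np.linalg.norm(x - x_star))
    x = x_next
```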


Convergence Rate: Twice Differentiable

Theorem 4 Let Assumption 1 hold. Assume that the function is twice continuously differentiable and strongly convex with constant m > 0.

Assume also that ∇2f(x) ≤ LI for all x ∈ X. Then:

• A solution x∗ exists and it is unique.

• The iterates generated by the gradient projection method with αk ≡ α and 0 < α < 2/L converge to x∗ at a geometric rate, i.e.,

‖xk+1 − x∗‖ ≤ q ‖xk − x∗‖ for all k

with q = max{|1 − αm|, |1 − αL|}.


Proof: The q here is different from the one in the preceding theorem. Since

∇2f(x) ≤ LI for all x ∈ X, it follows that the gradients are Lipschitz

continuous over X with constant L. By the definition of the method and

the non-expansive property of the projection, we have for z = x∗ and any

k,

‖xk+1 − x∗‖² = ‖PX[xk − α∇f(xk)] − PX[x∗ − α∇f(x∗)]‖²

≤ ‖xk − x∗ − α(∇f(xk) − ∇f(x∗))‖².  (5)

Mean Value Theorem for vector functions When g : Rn → Rn is continuously differentiable on the segment [x, y], we have

g(y) = g(x) + (∫₀¹ ∇g(x + τ(y − x)) dτ)(y − x),

where ∇g denotes the Jacobian of g.


Applying this theorem with g = ∇f, y = xk and x = x∗, we obtain

∇f(xk) = ∇f(x∗) + (∫₀¹ ∇2f(x∗ + τ(xk − x∗)) dτ)(xk − x∗).

Hence,

∇f(xk) − ∇f(x∗) = (∫₀¹ ∇2f(x∗ + τ(xk − x∗)) dτ)(xk − x∗).  (6)

By introducing the matrix Ak with Ak(xk − x∗) = ∇f(xk) − ∇f(x∗) and using this in relation (5), we obtain

‖xk+1 − x∗‖ ≤ ‖(I − αAk)(xk − x∗)‖ ≤ ‖I − αAk‖ ‖xk − x∗‖


The matrix Ak is symmetric, and hence ‖I − αAk‖ is equal to the maximum absolute eigenvalue of I − αAk, i.e.,

‖I − αAk‖ = max{|1 − αλmax(Ak)|, |1 − αλmin(Ak)|}.

In view of Eq. (6), we have Ak = ∫₀¹ ∇2f(x∗ + τ(xk − x∗)) dτ. By the strong convexity of f, we have ∇2f(x) ≥ mI for all x, while by the given condition, we have ∇2f(x) ≤ LI. Therefore,

λmax(Ak) ≤ L, λmin(Ak) ≥ m,

implying that

‖I − αAk‖ ≤ max{|1 − αm|, |1 − αL|}.


• The parameter q is minimized when α∗ = 2/(m + L), in which case

q∗ = (L − m)/(L + m) ⇐⇒ q∗ = (cond(f) − 1)/(cond(f) + 1),

with cond(f) = L/m.
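
For illustration (numbers chosen arbitrarily), with m = 1 and L = 100:

```python
m, L = 1.0, 100.0              # illustrative strong convexity / Lipschitz constants
alpha_star = 2.0 / (m + L)     # stepsize minimizing q: about 0.0198
q_star = (L - m) / (L + m)     # = (cond(f) - 1)/(cond(f) + 1), about 0.9802
```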


Upper Bound on Hessian and f over the Level Set

For a twice differentiable strongly convex f :

• The level set L0 = {x | f(x) ≤ f(x0)} is bounded

• The maximum eigenvalue of the Hessian ∇2f(x) is a continuous function of x over L0

• Hence, the maximum eigenvalue of the Hessian is bounded over L0: there is a constant M such that ∇2f(x) ≤ MI for all x ∈ L0

• Upper Bound on f over L0:

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (M/2)‖y − x‖² for all x, y ∈ L0

• Minimizing over y ∈ L0 on both sides:

f∗ ≤ f(x) − (1/(2M))‖∇f(x)‖² for all x ∈ L0
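
Combining this upper bound with the earlier lower bound f(x) − f∗ ≤ (1/(2m))‖∇f(x)‖² brackets the suboptimality gap by the gradient norm (a short consequence stated here for completeness):

```latex
\frac{1}{2M}\,\|\nabla f(x)\|^{2} \;\le\; f(x) - f^{*} \;\le\; \frac{1}{2m}\,\|\nabla f(x)\|^{2}
\qquad \text{for all } x \in L_{0}.
```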


Condition Number of a Matrix

For a twice differentiable strongly convex f: mI ≤ ∇2f(x) ≤ MI for all x ∈ L0

• The condition number cond(A) of a positive definite matrix A:

cond(A) = (largest eigenvalue of A) / (smallest eigenvalue of A)

• The ratio M/m is an upper bound on the condition number of ∇2f(x) for every x ∈ L0
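
A small sketch computing the condition number of a symmetric positive definite matrix with numpy (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
eigs = np.linalg.eigvalsh(A)             # eigenvalues in ascending order
cond_A = eigs[-1] / eigs[0]              # largest / smallest eigenvalue
```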


Strong Convexity and Condition Number of Level Sets

Assume a minimizer x∗ of f over Rn exists and f is strongly convex.

Consider the level set L0 = {x | f(x) ≤ f(x0)}.

• We have seen that mI ≤ ∇2f(x) ≤ MI for all x ∈ L0

• Also, we have

f∗ + (m/2)‖x − x∗‖² ≤ f(x) ≤ f∗ + (M/2)‖x − x∗‖²

• Hence: Binner ⊆ L0 ⊆ Bouter, where

Binner = {x | ‖x − x∗‖ ≤ √(2(f(x0) − f∗)/M)}

Bouter = {x | ‖x − x∗‖ ≤ √(2(f(x0) − f∗)/m)}

• Therefore, we have a bound on cond(L0):

cond(L0) ≤ M/m

• The condition number of level sets affects the efficiency of the

algorithms
